Tuesday, August 30, 2016

Data Analysis with R

Take-home thought

I spent only about 12 hours learning R and finished half of this Udacity course, which was enough to get a quick feel for what R really is.
To me, R is very similar to Matlab, but more focused and highly specialized for data visualization. The console is very powerful: you can install packages from it, much like in a terminal.
R has many built-in functions specialized for statistics, so it is very handy for getting values like the median, mean, correlation, and standard deviation.
RStudio is a very nice IDE. It supports R Markdown (Rmd) files, which are similar to IPython notebooks.
However, R's plotting syntax is highly specialized, and ggplot2 adds some syntax of its own. I could pick up these details quickly later if I have to.
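
As a quick illustration of the built-in statistics functions mentioned above, here is a minimal sketch using mtcars, a dataset that ships with R. The choice of columns (mpg and wt) is mine, just for the example.

```r
# mtcars is bundled with R (datasets package), so nothing needs to be downloaded.
data(mtcars)

mpg_mean   <- mean(mtcars$mpg)            # arithmetic mean of miles per gallon
mpg_median <- median(mtcars$mpg)          # middle value
mpg_sd     <- sd(mtcars$mpg)              # sample standard deviation
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)  # Pearson correlation (the default method)

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd, cor = mpg_wt_cor))
```

Each of these is a single function call on a column, which is what makes R so handy for this kind of quick summary.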

basics

constructs

concepts difficult to define and measure:
  • Memory
  • Happiness
  • Guilt
  • Love

operational definition

anger: number of profanities uttered per min
happiness: ratio of minutes spent smiling to minutes not smiling
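
To make the two operational definitions concrete, here is a small sketch of the arithmetic. All the numbers are hypothetical observations invented for the example.

```r
# Hypothetical observations of one person during a 60-minute session.
minutes_observed <- 60
profanities      <- 9
minutes_smiling  <- 15

# anger: number of profanities uttered per minute
anger_score <- profanities / minutes_observed

# happiness: minutes spent smiling divided by minutes not smiling
happiness_score <- minutes_smiling / (minutes_observed - minutes_smiling)

print(c(anger = anger_score, happiness = happiness_score))
```

Once a construct is reduced to a number like this, it can be compared across people or over time.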

what’s EDA

Initial data analysis
check the assumptions required for model fitting and hypothesis testing, handle missing values, and transform variables as needed.
Exploratory data analysis
summarize the data's main characteristics, generate better hypotheses, determine which variables have the most predictive power, and select appropriate statistical tools.
Develop a curious and skeptical mindset.
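
Here is a rough sketch of what those two phases can look like in practice. It uses airquality, another dataset bundled with R. The dataset and the specific checks are my own picks for illustration, not steps from the course.

```r
# airquality is bundled with R and contains missing values.
data(airquality)

# Initial data analysis: count the missing values in each column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Initial data analysis: transform a right-skewed variable (NA stays NA under log()).
airquality$log_ozone <- log(airquality$Ozone)

# Exploratory data analysis: summarize a variable and check one relationship,
# using only the rows where both values are present.
print(summary(airquality$Ozone))
ozone_temp_cor <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
print(ozone_temp_cor)
```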
install RStudio: https://www.rstudio.com/
Swirl (statistics with interactive R learning). Swirl is a software package for the R statistical programming language. Its purpose is to teach statistics and R commands interactively.
Type the following commands in the Console, pressing Enter or Return after each line:
install.packages("swirl")
library(swirl)
swirl()

package

textcat
ggplot2

learning sources

basic command

Ctrl+L: clear the console
students <- c("John", "Kate") # assignment; vector; 1-based indexing; type is chr (character), not string
numbers <- c(1:10)
numberOfChar <- nchar(students)
data(mtcars) # load built-in data mtcars
names(mtcars)
str(mtcars)
dim(mtcars)
getwd()
setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson2")
statesInfo <- read.csv("stateData.csv")
stateSubset <- subset(statesInfo, state.region == 1)
stateSubset <- statesInfo[statesInfo$state.region == 1, ] # equivalent method
head(stateSubset,2)
dim(stateSubset)
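
One caveat on the two subsetting methods: they match when the filter column has no missing values, but they can differ when it does. The sketch below uses a small hypothetical data frame (the names df, state, and region are made up for the example) to show the difference.

```r
# Hypothetical data frame with one missing region value.
df <- data.frame(
  state  = c("A", "B", "C", "D"),
  region = c(1, 2, NA, 1)
)

by_subset  <- subset(df, region == 1)       # rows where the condition is NA are dropped
by_bracket <- df[df$region == 1, ]          # an NA condition yields an extra all-NA row
by_which   <- df[which(df$region == 1), ]   # which() ignores NA, so this matches subset()

print(nrow(by_subset))
print(nrow(by_bracket))
print(nrow(by_which))
```

If the column may contain NA, subset() or the which() form is the safer choice.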

ggplot2

install.packages('ggplot2', dependencies = TRUE)
Sean Taylor
Good data science comes from good questions, not from fancy techniques, or having the right data. It comes from motivating your research with an idea that you care about, and that you think other people will care about.
Gene Wilder
“Success is a terrible thing and a wonderful thing… Just do what you love.”
John Tukey
An approximate answer to the right problem is worth a good deal more than the exact answer to an approximate problem.

data wrangling

tidyr
dplyr

pseudo_facebook

setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson3")

pf <- read.csv("pseudo_facebook.tsv", sep = '\t')
names(pf)
library(ggplot2)
qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31)

ggplot(data = pf, aes(x = dob_day)) + 
  geom_histogram(binwidth = 1) + 
  scale_x_continuous(breaks = 1:31) + 
  facet_wrap(~dob_month)

qplot(x = friend_count, data = pf, xlim = c(0, 1000))

qplot(x = friend_count, data = pf, binwidth = 25) + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))

ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) + 
  geom_histogram() + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
  facet_wrap(~gender)

table(pf$gender)
by(pf$friend_count, pf$gender, summary)

qplot(x = tenure / 365, data = pf, binwidth = .25,
      color = I("black"), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')

qplot(x = age, data = pf, binwidth = 1,
      color = I("black"), fill = I('#F79420'))

summary(pf$age)

# lesson 4
qplot(age, friend_count, data = pf)

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)

install.packages('dplyr')
library(dplyr)
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
head(pf.fc_by_age)
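
For comparison, the same kind of per-age summary can be sketched in base R without dplyr. The data frame pf_demo below is a tiny hypothetical stand-in for pf, with invented values, so the snippet runs on its own.

```r
# Hypothetical stand-in for pf with just the two columns used here.
pf_demo <- data.frame(
  age          = c(13, 13, 14, 14, 14, 15),
  friend_count = c(10, 30, 5, 15, 40, 100)
)

# tapply() applies a function to friend_count within each age group;
# groups come back in sorted age order, matching sort(unique(age)).
fc_by_age <- data.frame(
  age                 = sort(unique(pf_demo$age)),
  friend_count_mean   = as.vector(tapply(pf_demo$friend_count, pf_demo$age, mean)),
  friend_count_median = as.vector(tapply(pf_demo$friend_count, pf_demo$age, median)),
  n                   = as.vector(table(pf_demo$age))
)

print(fc_by_age)
```

The dplyr version reads more naturally once there are several summary columns, which is presumably why the course uses it.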

# Note: ggplot2 3.3.0 and later use `fun` in place of the deprecated `fun.y`.
ggplot(aes(x = age, y = friend_count), data = pf) +
  xlim(13, 90) +
  geom_point(alpha = 0.05,
             position = position_jitter(height = 0),
             color = 'orange') +
  coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun = mean) +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.1), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.5), linetype = 2, color = 'red') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.9), linetype = 2, color = 'black')
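
The three dashed lines in the plot above are the 10th, 50th, and 90th percentiles of friend_count at each age. As a small sketch of what quantile() returns for a single group, here it is applied to a hypothetical vector of friend counts (the values are invented).

```r
# Hypothetical friend counts for users of one age: 0, 10, 20, ..., 100.
friend_counts <- seq(0, 100, by = 10)

# One call can return several quantiles at once; unname() drops the "10%" style labels.
q <- unname(quantile(friend_counts, probs = c(0.1, 0.5, 0.9)))
print(q)
```

In the plot, each geom_line() computes one such value per age and connects them.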

cor.test(pf$age, pf$friend_count, method = "pearson")

with(subset(pf, age <= 70), cor.test(age, friend_count, method = "pearson"))

with(subset(pf, age <= 70), cor.test(age, friend_count, method = "spearman"))
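
To see why it can be worth trying both methods, here is a minimal sketch on synthetic data (not the Facebook data). Pearson measures linear association, while Spearman works on ranks and so measures monotonic association.

```r
# Synthetic data: y always increases with x, but not in a straight line.
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")   # penalized by the curvature
spearman <- cor(x, y, method = "spearman")  # rank-based, so a perfect monotonic trend scores 1

print(c(pearson = pearson, spearman = spearman))
```

A large gap between the two values is a hint that the relationship is monotonic but not linear.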
