Take-home thought
I spent only about 12 hours learning R and finished half of this Udacity course, which was enough to get a quick feel for what R really is.
To me, R is very similar to Matlab, but more focused and highly specialized for data visualization. The console is very powerful: you can install packages directly from it, much like from a terminal.
R has many built-in functions for statistics, so it is very handy for getting values like the median, mean, correlation, and standard deviation.
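As a quick illustration of those built-in statistics functions, here is a minimal sketch. It uses the `mtcars` data set that ships with R; the choice of the `mpg` and `wt` columns is mine, not something from the course.

```r
# mtcars ships with base R (datasets package), so nothing needs installing.
data(mtcars)

mpg_mean   <- mean(mtcars$mpg)            # arithmetic mean
mpg_median <- median(mtcars$mpg)          # middle value
mpg_sd     <- sd(mtcars$mpg)              # sample standard deviation
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)  # Pearson correlation by default

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd, cor = mpg_wt_cor))
```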
RStudio is a very nice IDE. It supports R Markdown (Rmd) files, which are similar to IPython notebooks.
However, R's syntax is highly specialized for certain kinds of plots, and ggplot2 brings some syntax changes of its own. I could pick up these details quickly later if I have to.
basics
constructs
concepts difficult to define and measure:
- Memory
- Happiness
- Guilt
- Love
operational definition
anger: number of profanities uttered per min
happiness: ratio of minutes spent smiling to minutes not smiling
what’s EDA
Initial data analysis
Check the assumptions required for model fitting and hypothesis testing, handle missing values, and transform variables where needed.
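A minimal sketch of what handling missing values and transforming a variable can look like in R. It assumes the built-in `airquality` data set as a stand-in, and the log transform of `Ozone` is only an example choice, not a step from the course.

```r
# airquality ships with base R and contains missing values (NA).
data(airquality)

# Count missing values in each column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# One simple way to handle them: keep only the complete rows.
aq_complete <- na.omit(airquality)

# Example transformation: take the log of Ozone (all values are positive).
aq_complete$log_ozone <- log(aq_complete$Ozone)
summary(aq_complete$log_ozone)
```

Dropping incomplete rows is the bluntest option; whether it is appropriate depends on why the values are missing.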
Exploratory data analysis
Summarize the main characteristics of the data, generate better hypotheses, determine which variables have the most predictive power, and select appropriate statistical tools.
Develop a curious and skeptical mindset.
install R: https://cran.rstudio.com/
install RStudio: https://www.rstudio.com/
Swirl (statistics with interactive R learning). Swirl is a software package for the R statistical programming language. Its purpose is to teach statistics and R commands interactively.
Type the following commands in the Console, pressing Enter or Return after each line:
install.packages("swirl")
library(swirl)
swirl()
package
textcat
ggplot2
learning sources
basic command
Ctrl+L: clear the console
students <- c("John", "Kate") # assignment; a character vector (R says chr, not string); indexing is 1-based
numbers <- c(1:10)
numberOfChar <- nchar(students)
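A small sketch to check the points in the comment above (character type, 1-based indexing, vectorized `nchar`). The variable names `first_student` and `number_of_char` are my own.

```r
students <- c("John", "Kate")

# R reports the type as "character" (shown as chr by str()).
student_class <- class(students)

# Indexing starts at 1, so this is the first element.
first_student <- students[1]

# nchar() is vectorized: it returns one count per element.
number_of_char <- nchar(students)

# 1:10 already builds an integer vector; wrapping it in c() is optional.
numbers <- 1:10

print(student_class)
print(first_student)
print(number_of_char)
print(length(numbers))
```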
data(mtcars) # load built-in data mtcars
names(mtcars)
str(mtcars)
dim(mtcars)
getwd()
setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson2")
statesInfo <- read.csv("stateData.csv")
stateSubset <- subset(statesInfo, state.region == 1)
stateSubset <- statesInfo[statesInfo$state.region == 1, ] # equivalent method
head(stateSubset,2)
dim(stateSubset)
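The two subsetting styles can be compared without the course file. This sketch uses `mtcars` instead of `stateData.csv`, plus a small made-up data frame `toy`. The one difference I am aware of is in how rows with `NA` in the condition are treated.

```r
data(mtcars)

# Same filter written both ways.
four_cyl_a <- subset(mtcars, cyl == 4)
four_cyl_b <- mtcars[mtcars$cyl == 4, ]
same_result <- isTRUE(all.equal(four_cyl_a, four_cyl_b))
print(same_result)

# With an NA in the condition the two differ:
# subset() drops the row, while [ ] keeps it as a row of NAs.
toy <- data.frame(x = c(1, NA, 1), y = c("a", "b", "c"))
rows_subset  <- nrow(subset(toy, x == 1))
rows_bracket <- nrow(toy[toy$x == 1, ])
print(c(subset = rows_subset, bracket = rows_bracket))
```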
ggplot2
install.packages('ggplot2', dependencies = TRUE)
Sean Taylor
Good data science comes from good questions, not from fancy techniques, or having the right data. It comes from motivating your research with an idea that you care about, and that you think other people will care about.
Gene Wilder
“Success is a terrible thing and a wonderful thing… Just do what you love.”
John Tukey
An ==approximate answer to the right problem== is worth a good deal more than the exact answer to an approximate problem
data wrangling
tidyr
dplyr
pseudo_facebook
setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson3")
pf <- read.csv("pseudo_facebook.tsv", sep = '\t')
names(pf)
library(ggplot2)
qplot(x =dob_day, data = pf) +
scale_x_continuous(breaks=1:31)
ggplot(data = pf, aes(x = dob_day)) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month)
qplot(x=friend_count,data=pf, xlim=c(0,1000))
qplot(x = friend_count, data = pf, binwidth = 25) +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
table(pf$gender)
by(pf$friend_count,pf$gender,summary)
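For trying `table()` and `by()` without the Facebook file, here is a sketch on `mtcars`, with `cyl` playing the role of `gender` and `mpg` the role of `friend_count`. `tapply()` is an alternative I added for when only one statistic per group is needed.

```r
data(mtcars)

# Number of cars in each cylinder group.
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Full five-number summary (plus mean) of mpg within each group.
print(by(mtcars$mpg, mtcars$cyl, summary))

# A single statistic per group, returned as a named vector.
mpg_median_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
print(mpg_median_by_cyl)
```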
qplot(x=tenure/365, data = pf, binwidth = .25,
color = I("black"),fill = I('#F79420'))+
scale_x_continuous(breaks=seq(1,7,1),limits=c(0,7))+
xlab('Number of years using Facebook') +
ylab('Number of users in sample')
qplot(x=age, data = pf, binwidth = 1,
color = I("black"),fill = I('#F79420'))
summary(pf$age)
# lesson 4
qplot(age, friend_count,data=pf)
ggplot(aes(x=age, y=friend_count),data=pf)+
geom_jitter(alpha=1/20)+
xlim(13,90)
install.packages('dplyr')
library(dplyr)
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
head(pf.fc_by_age)
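The same group-then-summarise pattern can also be chained with the `%>%` pipe that dplyr provides. This sketch assumes dplyr is installed and uses `mtcars` in place of `pf`; the column names `mpg_mean` and `mpg_median` are made up for the example.

```r
library(dplyr)

data(mtcars)

# Group by cylinder count, summarise each group, then sort by the group key.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg_mean   = mean(mpg),
            mpg_median = median(mpg),
            n          = n()) %>%
  arrange(cyl)

print(mpg_by_cyl)
```

The pipe avoids the intermediate `age_groups`-style variable; the result should be the same either way.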
ggplot(aes(x=age, y=friend_count),data=pf)+
xlim(13,90)+
geom_point(alpha=0.05,
position= position_jitter(h=0),
color= 'orange')+
coord_trans(y='sqrt')+
geom_line(stat = 'summary', fun = mean) + # ggplot2 >= 3.3.0 uses fun; fun.y is deprecated
geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.1), linetype = 2, color = 'blue') +
geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.5), linetype = 2, color = 'red') +
geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.9), linetype = 2, color = 'black')
cor.test(pf$age,pf$friend_count,method="pearson")
with(subset(pf,age<=70),cor.test(age,friend_count,method="pearson"))
with(subset(pf,age<=70),cor.test(age,friend_count,method="spearman"))
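To see why the `method` argument matters, here is a sketch on made-up data (not the Facebook sample). The relationship is perfectly monotonic but not linear, so Spearman's rank correlation should be 1 while Pearson's stays below 1.

```r
# Synthetic data: y always increases with x, but along a curve.
x <- 1:20
y <- x^3

# Pearson measures linear association.
pearson_r <- unname(cor.test(x, y, method = "pearson")$estimate)

# Spearman measures monotonic association, based on ranks.
spearman_rho <- unname(cor.test(x, y, method = "spearman")$estimate)

print(c(pearson = pearson_r, spearman = spearman_rho))
```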