I have only spent about 12 hours learning R and have finished half of this Udacity course, but that was enough to get a quick feel for what R really is.
To me, R is very similar to Matlab, but more focused and highly specialized for data visualization. The console is very powerful: you can install packages from it, much like from a terminal.
R has many built-in functions specialized for statistics, so it is very handy for getting values like the median, mean, correlation, and standard deviation.
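As a small illustration of those built-in functions, here is a sketch using the `mtcars` data set that ships with R. The choice of the `mpg` and `wt` columns is my own assumption for the example, not something taken from the course.

```r
# mtcars ships with base R (datasets package), so nothing needs installing.
data(mtcars)

mpg_mean   <- mean(mtcars$mpg)            # arithmetic mean
mpg_median <- median(mtcars$mpg)          # middle value
mpg_sd     <- sd(mtcars$mpg)              # sample standard deviation
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)  # Pearson correlation by default

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd, cor = mpg_wt_cor))
```

Each call takes a plain numeric vector, which is why the `$` column accessor is enough here.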
RStudio is a very nice IDE. It supports R Markdown (Rmd) files, which are similar to IPython notebooks.
However, R's syntax is highly specialized for certain kinds of plots, and there are some further syntax changes for ggplot2. I could pick up these details quickly later if I have to.
Concepts that are difficult to define and measure:
anger: number of profanities uttered per minute
happiness: ratio of minutes spent smiling to minutes not smiling
Initial data analysis
Check the assumptions required for model fitting and hypothesis testing, handle missing values, and transform variables where needed.
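A minimal sketch of two of those steps, handling missing values and transforming a variable, might look like the following. It assumes the built-in `airquality` data set purely for illustration; dropping incomplete rows and taking a log are only one possible choice each, not the course's prescribed method.

```r
# airquality ships with base R and contains missing values (NA).
data(airquality)

# Count missing values per column to see where the gaps are.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# One simple way to handle them: keep only the rows with no NA at all.
aq_complete <- airquality[complete.cases(airquality), ]

# Ozone has a long right tail; a log transform is a common first thing to try.
aq_complete$log_ozone <- log(aq_complete$Ozone)

summary(aq_complete$log_ozone)
```

Whether dropping rows is acceptable depends on why the values are missing, so it is worth looking at `na_counts` before deciding.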
Exploratory data analysis
Summarize the data's main characteristics, generate better hypotheses, determine which variables have the most predictive power, and select appropriate statistical tools.
Develop a curious and skeptical mindset.
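As a rough sketch of what "summarize" and "find the most predictive variables" can mean in practice, the example below uses the built-in `mtcars` data and treats `mpg` as the outcome. Both choices are assumptions for illustration, and absolute correlation is only a crude first proxy for predictive power.

```r
data(mtcars)

# Main characteristics: min, quartiles, mean, max for a few columns.
print(summary(mtcars[, c("mpg", "wt", "hp")]))

# Correlation of every other variable with mpg.
cors <- cor(mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]

# Rank by absolute correlation, strongest first.
ranked <- sort(abs(cors), decreasing = TRUE)
print(head(ranked, 3))
```

A ranking like this is a starting point for hypotheses, not a conclusion; plots of the top candidates are the natural next step.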
Install R: https://cran.rstudio.com/
Install RStudio: https://www.rstudio.com/
Swirl (statistics with interactive R learning). Swirl is a software package for the R statistical programming language. Its purpose is to teach statistics and R commands interactively.
Type the following commands in the Console, pressing Enter or Return after each line:
Ctrl+L: clear the console
students <- c("John", "Kate") # assignment; c() builds a vector; indexing is 1-based; the type is chr (character), not "string"
numbers <- 1:10 # integer sequence from 1 to 10
numberOfChar <- nchar(students) # number of characters in each element
data(mtcars) # load built-in data mtcars
statesInfo <- read.csv("stateData.csv")
stateSubset <- subset(statesInfo, state.region == 1)
stateSubset <- statesInfo[statesInfo$state.region == 1, ] # equivalent method
install.packages('ggplot2', dependencies = TRUE)
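The commands above depend on `stateData.csv`, so here is a self-contained sketch of the same ideas (1-based vectors, `nchar`, and the two subsetting styles) that swaps in the built-in `mtcars` data. The `cyl == 4` filter is just an example condition of my choosing.

```r
students <- c("John", "Kate")
first_student <- students[1]   # "John": indexing starts at 1, not 0
student_type  <- class(students)  # "character"
name_lengths  <- nchar(students)  # 4 4

numbers <- 1:10
big_numbers <- numbers[numbers > 7]  # logical indexing keeps 8, 9, 10

data(mtcars)
# Two ways to keep only the four-cylinder cars.
four_cyl_a <- subset(mtcars, cyl == 4)
four_cyl_b <- mtcars[mtcars$cyl == 4, ]

print(all.equal(four_cyl_a, four_cyl_b))
```

The trailing comma in `mtcars[mtcars$cyl == 4, ]` matters: the part before it selects rows, and leaving the part after it empty keeps all columns.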
Good data science comes from good questions, not from fancy techniques, or having the right data. It comes from motivating your research with an idea that you care about, and that you think other people will care about.
“Success is a terrible thing and a wonderful thing… Just do what you love.”
An ==approximate answer to the right problem== is worth a good deal more than an exact answer to an approximate problem.
# lesson 3
setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson3")
pf <- read.csv("pseudo_facebook.tsv", sep = '\t')
names(pf)
library(ggplot2)

# histogram of birthdays by day of month, then one panel per month
qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31)
ggplot(data = pf, aes(x = dob_day)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month)

# friend counts, limited to 0..1000
qplot(x = friend_count, data = pf, xlim = c(0, 1000))
qplot(x = friend_count, data = pf, binwidth = 25) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))

# friend counts by gender, dropping rows where gender is NA
ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
table(pf$gender)
by(pf$friend_count, pf$gender, summary)

# tenure in years, then age
qplot(x = tenure / 365, data = pf, binwidth = .25,
      color = I("black"), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')
qplot(x = age, data = pf, binwidth = 1,
      color = I("black"), fill = I('#F79420'))
summary(pf$age)

# lesson 4
qplot(age, friend_count, data = pf)
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)

install.packages('dplyr')
library(dplyr)

# mean, median, and count of friend_count for each age
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
head(pf.fc_by_age)

# scatter plot overlaid with the mean and the 10%, 50%, 90% quantiles
# (fun.y is named fun in ggplot2 3.3.0 and later)
ggplot(aes(x = age, y = friend_count), data = pf) +
  xlim(13, 90) +
  geom_point(alpha = 0.05, position = position_jitter(h = 0), color = 'orange') +
  coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.5), linetype = 2, color = 'red') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), linetype = 2, color = 'black')

# correlation between age and friend count
cor.test(pf$age, pf$friend_count, method = "pearson")
with(subset(pf, age <= 70), cor.test(age, friend_count, method = "pearson"))
with(subset(pf, age <= 70), cor.test(age, friend_count, method = "spearman"))
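The last two calls differ only in `method`. As a rough sketch of why that matters, using made-up data rather than the Facebook data set: Pearson measures linear association, while Spearman works on ranks, so it reaches 1 for any strictly increasing relationship even when that relationship is curved. The vectors `x` and `y` below are hypothetical.

```r
# Hypothetical data: y grows monotonically but not linearly with x.
x <- 1:20
y <- exp(x / 4)

pearson  <- cor(x, y, method = "pearson")   # penalized by the curvature
spearman <- cor(x, y, method = "spearman")  # depends only on the ordering

print(c(pearson = pearson, spearman = spearman))
```

`cor.test` accepts the same `method` argument and additionally reports a p-value.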