Take-home thought

I have spent only about 12 hours learning R and have finished half of this Udacity course, so this is just a quick impression of what R is.

To me, R feels very similar to MATLAB, but more focused and highly specialized for data visualization. The console is very powerful: you can install packages directly from it, much like a terminal.

R has many built-in functions for statistics, so it is very handy for getting values such as the median, mean, correlation, and standard deviation.
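As a quick illustration of these built-ins, here is a minimal sketch using the mtcars data set that ships with R. The choice of columns (mpg and wt) is mine, purely for demonstration.

```r
# Summary statistics on the built-in mtcars data set (columns chosen for illustration)
data(mtcars)

mpg_mean   <- mean(mtcars$mpg)             # arithmetic mean
mpg_median <- median(mtcars$mpg)           # middle value
mpg_sd     <- sd(mtcars$mpg)               # sample standard deviation
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)   # Pearson correlation by default

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd, cor = mpg_wt_cor))
```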

RStudio is a very nice IDE. It supports R Markdown (Rmd) files, which are similar to IPython notebooks.

However, R's syntax is highly specialized for certain kinds of plotting, and ggplot2 brings some syntax changes of its own. I can pick up these details quickly later if I need to.

basics

constructs

concepts difficult to define and measure:

Memory
Happiness
Guilt
Love

operational definition

depression: http://www.hr.ucdavis.edu/asap/pdf_files/Beck_Depression_Inventory.pdf

anger: number of profanities uttered per minute

happiness: ratio of minutes spent smiling to minutes not smiling
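To make the last two definitions concrete, here is a tiny worked example. All the numbers are made up for illustration, and the variable names are my own.

```r
# Hypothetical observations for one person (made-up numbers, for illustration only)
profanities     <- 12   # profanities uttered during the session
session_minutes <- 30   # length of the observation session, in minutes
minutes_smiling <- 10   # minutes of the session spent smiling

# anger: profanities per minute
anger_score <- profanities / session_minutes

# happiness: minutes smiling divided by minutes not smiling
happiness_score <- minutes_smiling / (session_minutes - minutes_smiling)

print(c(anger = anger_score, happiness = happiness_score))
```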

what’s EDA

Initial data analysis

Check the assumptions required for model fitting and hypothesis testing, handle missing values, and transform variables as needed.
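A minimal sketch of what these initial checks can look like, assuming the built-in airquality data set as a stand-in (it is not part of the course material). The log transform of Ozone is just one example of a variable transformation.

```r
# Initial checks on the built-in airquality data set (chosen only for illustration)
data(airquality)

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# One simple way to handle missing values: keep only complete rows
complete_rows <- na.omit(airquality)

# Example transformation: take the log of a positive, skewed variable
complete_rows$log_ozone <- log(complete_rows$Ozone)
summary(complete_rows$log_ozone)
```

Dropping incomplete rows is the bluntest option. Whether it is appropriate depends on how much data is lost and why the values are missing.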

Exploratory data analysis

Summarize the main characteristics of the data, generate better hypotheses, determine which variables have the most predictive power, and select appropriate statistical tools.
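A first EDA pass might look like the sketch below. It uses the built-in mtcars data, and it treats absolute correlation with mpg as a rough stand-in for "predictive power". That is my simplification, not a definition from the course.

```r
# Rough first look at a data set (built-in mtcars, chosen for illustration)
data(mtcars)

# Main characteristics of one variable
summary(mtcars$mpg)

# Correlation of every other variable with mpg, strongest first
cors   <- cor(mtcars)[, "mpg"]
cors   <- cors[names(cors) != "mpg"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```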

Develop a curious and skeptical mindset.

install R: https://cran.rstudio.com/

install RStudio: https://www.rstudio.com/

Swirl (statistics with interactive R learning). Swirl is a software package for the R statistical programming language. Its purpose is to teach statistics and R commands interactively.

Type the following commands in the Console, pressing Enter or Return after each line:
install.packages("swirl")
library(swirl)
swirl()

package

textcat

ggplot2

learning sources

http://www.statmethods.net/

https://www.r-bloggers.com/

http://stackoverflow.com/tags/r/info

https://google.github.io/styleguide/Rguide.xml

basic command

Ctrl+L: clear the console

students <- c("John", "Kate") # assignment; creates a 1-based character (chr) vector, R's string type

numbers <- c(1:10)

numberOfChar <- nchar(students) # number of characters in each element

data(mtcars) # load built-in data mtcars

names(mtcars)

str(mtcars)

dim(mtcars)

getwd()

setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson2")

statesInfo <- read.csv("stateData.csv")

stateSubset <- subset(statesInfo, state.region == 1)

stateSubset <- statesInfo[statesInfo$state.region == 1, ] # equivalent method

head(stateSubset,2)

dim(stateSubset)
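One caveat on the two subsetting forms above: they give the same rows only when the test column has no missing values. Here is a small sketch with a made-up data frame, where the `region` column stands in for something like state.region.

```r
# Small made-up data frame; `region` stands in for a column like state.region
df <- data.frame(name   = c("a", "b", "c", "d"),
                 region = c(1, 2, NA, 1))

by_subset  <- subset(df, region == 1)   # rows where the test is NA are dropped
by_bracket <- df[df$region == 1, ]      # an NA test produces an all-NA row

nrow(by_subset)
nrow(by_bracket)

# Wrapping the test in which() makes the bracket form drop NA rows as well
by_which <- df[which(df$region == 1), ]
nrow(by_which)
```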

ggplot2

install.packages("ggplot2", dependencies = TRUE)

Sean Taylor

Good data science comes from good questions, not from fancy techniques, or having the right data. It comes from motivating your research with an idea that you care about, and that you think other people will care about.

Gene Wilder

“Success is a terrible thing and a wonderful thing… Just do what you love.”

John Tukey

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.

data wrangling

tidyr

dplyr

https://s3.amazonaws.com/udacity-hosted-downloads/ud651/DataWranglingWithR.pdf

pseudo_facebook

setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson3")

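# Read the tab-separated pseudo-Facebook data
pf <- read.csv("pseudo_facebook.tsv", sep = '\t')
names(pf)
library(ggplot2)
# Histogram of birth days (note: qplot() is deprecated as of ggplot2 3.4.0)
qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31)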

ggplot(data = pf, aes(x = dob_day)) + 
  geom_histogram(binwidth = 1) + 
  scale_x_continuous(breaks = 1:31) + 
  facet_wrap(~dob_month)

qplot(x = friend_count, data = pf, xlim = c(0, 1000))

qplot(x = friend_count, data = pf, binwidth = 25) + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))

ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) + 
  geom_histogram() + 
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + 
  facet_wrap(~gender)

table(pf$gender)                          # number of users per gender
by(pf$friend_count, pf$gender, summary)   # summary of friend_count within each gender

qplot(x = tenure / 365, data = pf, binwidth = .25,
      color = I("black"), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')

qplot(x = age, data = pf, binwidth = 1,
      color = I("black"), fill = I('#F79420'))

summary(pf$age)
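The table() and by() calls earlier in this lesson both skip missing genders silently, which is also why one of the histograms filters with !is.na(gender). Here is a small sketch of that behavior on a made-up data frame (toy is hypothetical, not the course data):

```r
# Made-up stand-in for pf with one missing gender
toy <- data.frame(gender       = c("female", "male", "female", NA, "male"),
                  friend_count = c(120, 40, 300, 15, 80))

counts_default <- table(toy$gender)                   # NA is not counted by default
counts_with_na <- table(toy$gender, useNA = "ifany")  # adds an <NA> entry

print(counts_default)
print(counts_with_na)

# Per-group summaries; the row with a missing gender is left out
by(toy$friend_count, toy$gender, summary)
```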

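# lesson 4
# Scatter plot of friend count against age
qplot(age, friend_count, data = pf)

# Jitter and transparency reduce overplotting; ages are limited to 13-90
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)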

install.packages('dplyr')
library(dplyr)
# Mean, median, and count of friend_count for each age
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
head(pf.fc_by_age)
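To see what the grouped summary computes without needing dplyr or the course data, here is a base-R sketch of the same idea on a made-up data frame. The toy data and the tapply() approach are my own illustration, not the course's method.

```r
# Made-up stand-in for pf, with just the two columns used in the summary
toy <- data.frame(age          = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 5, 7, 100))

# Base-R version of: group by age, then take the mean, median, and count
fc_by_age <- data.frame(
  age                 = sort(unique(toy$age)),
  friend_count_mean   = as.vector(tapply(toy$friend_count, toy$age, mean)),
  friend_count_median = as.vector(tapply(toy$friend_count, toy$age, median)),
  n                   = as.vector(table(toy$age))
)
print(fc_by_age)
```

The age-14 group shows why both statistics are worth keeping: a single large friend count pulls the mean well above the median.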

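# Overlay the mean (solid) and the 10%, 50%, and 90% quantiles (dashed) on the scatter plot.
# ggplot2 3.3.0 deprecated fun.y in favor of fun, so fun is used here.
ggplot(aes(x = age, y = friend_count), data = pf) +
  xlim(13, 90) +
  geom_point(alpha = 0.05,
             position = position_jitter(height = 0),
             color = 'orange') +
  coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun = mean) +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.1), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.5), linetype = 2, color = 'red') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.9), linetype = 2, color = 'black')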

cor.test(pf$age, pf$friend_count, method = "pearson")

with(subset(pf, age <= 70), cor.test(age, friend_count, method = "pearson"))

with(subset(pf, age <= 70), cor.test(age, friend_count, method = "spearman"))
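To see why it can be worth running both methods, here is a small sketch on synthetic data (x and y are made up, not taken from pf). Pearson measures linear association, while Spearman works on ranks and so picks up any monotonic relationship.

```r
# Synthetic data: y increases with x, but not linearly
x <- 1:20
y <- x^3

pearson  <- unname(cor.test(x, y, method = "pearson")$estimate)
spearman <- unname(cor.test(x, y, method = "spearman")$estimate)

# Pearson is high but below 1; Spearman is 1 because the ranks agree perfectly
print(c(pearson = pearson, spearman = spearman))
```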

Yuchao's blogspot

Tuesday, August 30, 2016

Data Analysis with R
