# Take-home thought

I have only spent about 12 hours learning R and finished half of this Udacity course, but I already have a quick feel for what R really is.

For me, R is very similar to Matlab, but more focused and highly specialized for data visualization. The console is very powerful: you can install packages from it, much like a terminal.

R has many built-in functions specialized for statistics, so it is very handy for getting values like the median, mean, correlation, and standard deviation.
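As a quick illustration of those built-ins, here is a minimal sketch using the `mtcars` data set that ships with R. The choice of columns (`mpg`, `wt`) is mine, not something from the course.

```r
# Summary statistics with base R only; mtcars ships with R.
data(mtcars)

mpg_mean   <- mean(mtcars$mpg)             # arithmetic mean
mpg_median <- median(mtcars$mpg)           # middle value
mpg_sd     <- sd(mtcars$mpg)               # sample standard deviation
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)   # Pearson correlation by default

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd, cor = mpg_wt_cor))
```

Each call is a single function applied to a column, with no imports needed.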

RStudio is a very nice IDE. It supports Rmd (R Markdown) files, which are similar to an IPython notebook.

However, R's plotting syntax is highly specialized, and ggplot2 has gone through some syntax changes. I could pick up these details quickly later if I have to.

# basics

## constructs

Concepts that are difficult to define and measure:

- Memory
- Happiness
- Guilt
- Love

## operational definition

anger: number of profanities uttered per minute

happiness: ratio of minutes spent smiling to minutes not smiling
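To make the two definitions concrete, here is a small worked sketch. All the numbers are hypothetical observations I made up for illustration; only the two formulas come from the notes above.

```r
# Hypothetical observations for one person over a 60-minute session.
profanities     <- 12   # count of profanities uttered
minutes_total   <- 60   # length of the observation window
minutes_smiling <- 15   # minutes spent smiling

# anger: profanities per minute
anger <- profanities / minutes_total

# happiness: minutes smiling divided by minutes not smiling
happiness <- minutes_smiling / (minutes_total - minutes_smiling)

print(c(anger = anger, happiness = happiness))
```

The point of an operational definition is that both constructs become numbers that two observers could compute the same way.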

## what’s EDA

Initial data analysis

Check the assumptions required for model fitting and hypothesis testing, handle missing values, and transform variables where needed.
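A minimal sketch of what the missing-value and transformation steps can look like in base R. It assumes the built-in `airquality` data set as a stand-in (it is not part of the course material), and dropping incomplete rows plus a log transform are just one possible choice.

```r
# airquality ships with R and contains missing values.
data(airquality)

# Count missing values per column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# One simple way to handle them: keep only complete rows.
aq_complete <- na.omit(airquality)

# Transform a skewed variable; log() is one common choice.
aq_complete$log_ozone <- log(aq_complete$Ozone)
print(summary(aq_complete$log_ozone))
```

Whether to drop, impute, or keep rows with missing values depends on the question being asked; this sketch only shows the mechanics.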

Exploratory data analysis

Summarize the main characteristics of the data, generate better hypotheses, determine which variables have the most predictive power, and select appropriate statistical tools.

Develop a curious and skeptical mindset.

install R: https://cran.rstudio.com/

install RStudio: https://www.rstudio.com/

**Swirl** (Statistics with Interactive R Learning) is a software package for the R statistical programming language. Its purpose is to teach statistics and R commands interactively.

Type the following commands in the **Console**, pressing Enter or Return after each line:

`install.packages("swirl")`

`library(swirl)`

`swirl()`

## package

textcat

ggplot2

## learning sources

## basic command

Ctrl+L: clear the console

students <- c("John", "Kate") # assignment; character vector (chr, not "string"); indexing is 1-based

numbers <- c(1:10)

numberOfChar <- nchar(students)
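The vector notes above can be checked with a short sketch. The variable names `first_student` and `name_lengths` are mine, added only for illustration.

```r
students <- c("John", "Kate")     # character vector ("chr" in str())
numbers  <- 1:10                  # integer sequence 1..10

first_student <- students[1]      # indexing starts at 1, not 0
name_lengths  <- nchar(students)  # vectorized: one length per element

print(first_student)
print(name_lengths)
print(class(students))
```

Note that `nchar()` works on the whole vector at once, so no loop is needed.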

data(mtcars) # load built-in data mtcars

names(mtcars)

str(mtcars)

dim(mtcars)

getwd()

setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson2")

statesInfo <- read.csv("stateData.csv")

stateSubset <- subset(statesInfo, state.region == 1)

stateSubset <- statesInfo[statesInfo$state.region == 1, ] # equivalent method

head(stateSubset,2)

dim(stateSubset)
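The two subsetting forms can be compared without the course CSV. This sketch assumes the built-in `mtcars` data as a stand-in for `stateData.csv`, and the `cyl == 4` condition is only an example.

```r
data(mtcars)

# Two ways to keep only the 4-cylinder cars.
four_cyl_a <- subset(mtcars, cyl == 4)
four_cyl_b <- mtcars[mtcars$cyl == 4, ]

# Both give the same rows here because cyl has no missing values.
# If the condition evaluates to NA, subset() drops that row,
# while the bracket form returns a row filled with NA.
same_result <- isTRUE(all.equal(four_cyl_a, four_cyl_b))
print(same_result)

print(head(four_cyl_a, 2))
print(dim(four_cyl_a))
```

So the two methods are interchangeable only when the filter column has no missing values.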

## ggplot2

install.packages('ggplot2', dependencies = TRUE)

**Sean Taylor**

Good data science comes from good questions, not from fancy techniques, or having the right data. It comes from motivating your research with an idea that you care about, and that you think other people will care about.

**Gene Wilder**

“Success is a terrible thing and a wonderful thing… Just do what you love.”

**John Tukey**

An ==approximate answer to the right problem== is worth a good deal more than the exact answer to an approximate problem.

## data wrangling

tidyr

dplyr

# pseudo_facebook

```
# Load the data; the file is tab-separated, hence sep = '\t'
setwd("/Users/yuchaojiang/Downloads/EDA_Course_Materials/lesson3")
pf <- read.csv("pseudo_facebook.tsv", sep = '\t')
names(pf)

library(ggplot2)
# Note: qplot() is deprecated as of ggplot2 3.4.0; ggplot() is the long-term form.

# Histogram of birth day, one tick per day of the month
qplot(x = dob_day, data = pf) +
  scale_x_continuous(breaks = 1:31)

# Same histogram, faceted by birth month
ggplot(data = pf, aes(x = dob_day)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month)

# Friend counts, limited to 0-1000
qplot(x = friend_count, data = pf, xlim = c(0, 1000))

qplot(x = friend_count, data = pf, binwidth = 25) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))

# Friend counts by gender, dropping rows with missing gender
ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)

table(pf$gender)
by(pf$friend_count, pf$gender, summary)

# Tenure, converted from days to years
qplot(x = tenure / 365, data = pf, binwidth = .25,
      color = I("black"), fill = I('#F79420')) +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')

# Age distribution
qplot(x = age, data = pf, binwidth = 1,
      color = I("black"), fill = I('#F79420'))
summary(pf$age)
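# lesson 4

# Scatter plot of friend count against age
qplot(age, friend_count, data = pf)

# Jitter and transparency to reduce overplotting
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)

# install.packages('dplyr')  # run once if dplyr is not installed
library(dplyr)

# Mean, median, and count of friend_count for each age
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())
head(pf.fc_by_age)

# Overlay the mean and the 10th, 50th, and 90th percentiles on the scatter plot.
# ggplot2 >= 3.3.0 uses `fun`; older versions used `fun.y`.
ggplot(aes(x = age, y = friend_count), data = pf) +
  xlim(13, 90) +
  geom_point(alpha = 0.05,
             position = position_jitter(height = 0),
             color = 'orange') +
  coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun = mean) +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.1), linetype = 2, color = 'blue') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.5), linetype = 2, color = 'red') +
  geom_line(stat = 'summary', fun = quantile, fun.args = list(probs = 0.9), linetype = 2, color = 'black')

# Correlation between age and friend count: all users, then users aged 70 or under
cor.test(pf$age, pf$friend_count, method = "pearson")
with(subset(pf, age <= 70), cor.test(age, friend_count, method = "pearson"))
with(subset(pf, age <= 70), cor.test(age, friend_count, method = "spearman"))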
```
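
The last two `cor.test` calls differ only in `method`. As a rough sketch of why that matters, here is a synthetic example (made-up data, not from `pseudo_facebook.tsv`) where `y` always rises with `x` but not in a straight line. Pearson measures linear association, while Spearman works on ranks and so measures monotonic association.

```r
# Synthetic data: y increases with x, but exponentially rather than linearly.
x <- 1:50
y <- exp(x / 5)

pearson  <- cor(x, y, method = "pearson")   # linear association
spearman <- cor(x, y, method = "spearman")  # rank-based, monotonic association

print(c(pearson = pearson, spearman = spearman))
```

Spearman comes out at 1 because the ranks of `x` and `y` agree perfectly, while Pearson is lower because the relationship is curved. Comparing the two on real data is one way to spot a relationship that is monotonic but not linear.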