Thursday, February 23, 2017

Data Analyst ND 1, Statistics

7-day free trial: 2017-02-22 to 2017-03-01
Program director: Mat Leonard
Content developer: Caroline Buckey

Update on 2017-2-27

I soon found there is so much overlap with my existing knowledge that I could finish all the projects within the 7-day trial! But I changed my goal from earning a Nanodegree to filling my knowledge gaps. I am close to getting my first data scientist job, so a Nanodegree is not that important to me now.
Projects recap:
  1. Bike share. The main takeaways are how to slice the dataset down to a smaller size and how to use the datetime module to parse timestamps.
  2. Stroop effect. Hypothesis testing, especially the t-test.
  3. Titanic analysis. Gets you familiar with numpy and pandas. This is the same as Stage 5 (choose your path) of the Intro to Programming ND.
  4. OpenStreetMap. Focuses on data wrangling skills: parsing various data formats (CSV, Excel, JSON, XML, HTML) and extracting data from databases, SQL or NoSQL.
  5. (Data Set Options) Practice R.
  6. Enron email. This is the same as the free course “Intro to Machine Learning”.
  7. (Data Set Options) The course content covers JavaScript plotting APIs: D3 and dimple. And the fancy way to draw a world map!
  8. Free trial screener for Udacity paid courses. This is the same as the free course “A/B Testing”.
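For project 1, the timestamp parsing mentioned above looks roughly like this (the column value and format string are assumptions; check the actual CSV first):

```python
from datetime import datetime

# Parse a raw timestamp string like those in the bike share trip data.
# The '%m/%d/%Y %H:%M' format is an assumption about the CSV layout.
raw = '8/29/2013 14:13'
trip_start = datetime.strptime(raw, '%m/%d/%Y %H:%M')

print(trip_start.year)             # 2013
print(trip_start.strftime('%A'))   # day-of-week name, useful for grouping
```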

P0 Bay Area Bike Share Analysis

Two of the major parts of the data analysis process are data wrangling and exploratory data analysis.
Before you even start looking at the data, think about some questions you might want to answer about the bike share data.
After all, your best analysis is only as good as your ability to communicate it.
When dealing with a lot of data, it can be useful to start by working with only a sample.
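A minimal pandas sketch of working with a sample first (the toy DataFrame stands in for the real trip data):

```python
import pandas as pd

# Toy trip table; in the real project this would come from the CSV.
trips = pd.DataFrame({'duration': range(1000)})

# Work with a small slice first: either the first rows...
first_rows = trips.head(100)
# ...or a random sample (random_state makes it reproducible).
subset = trips.sample(n=100, random_state=0)

print(len(subset))
```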

P1: Statistics

The course material in this section has 11 lessons plus a placement advisor to help you locate your knowledge gaps in case you already know some statistics.
Actually, Udacity provides 2 free courses:

Statistics Placement Advisor

If you are comfortable:
  • Performing a hypothesis test using a two-sample t-test
  • Calculating a p-value and a confidence interval
  • Deciding whether to reject the null based on the result of the above
continue straight to Project 1.
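The three checklist items above can be sketched in a few lines with scipy (the sample data here is synthetic, not from any Udacity project):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, size=50)   # e.g. one condition
b = rng.normal(6.0, 1.0, size=50)   # e.g. the other condition

# Two-sample t-test (Welch's, not assuming equal variances).
t, p = stats.ttest_ind(a, b, equal_var=False)

# Decide: reject the null at alpha = 0.05 if p < alpha.
reject = p < 0.05
print(t, p, reject)
```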

constructs and their operational definition

Constructs are concepts that are difficult to define and measure directly. Scientists try to quantify them by devising operational definitions. Units are at the heart of measurement.
  • Memory:
  • Guilt
  • Love
  • Stress: levels of cortisol (the stress hormone)
  • depression: Beck’s Depression Inventory: 21 questions
  • anger: number of profanities uttered per min
  • happiness: ratio of minutes spent smiling to minutes not smiling

interpreting scatter plots

We can infer a trend, but it is not necessarily true.
Correlation does not imply causation. Golden Arches Theory of Conflict Prevention: no two countries that both have a McDonald’s have ever gone to war since opening their McDonald’s.
  • show relationships: observational study, surveys
  • show causation: controlled experiment. Use double blind to avoid placebo effect (unconsciously or consciously alter the measurement).
Most research studies only use a sample because collecting data about an entire population is way too expensive. As a result, we expect our estimates will not be exactly accurate when we do this.
A fixed number is called constant, a changeable number is called variable.
\bar{x} is for sample mean, \mu is for population mean.
We can make predictions from either correlation or causation.

standard normal distribution

Any normal distribution can be standardized by the z-score: z=\frac{x-\mu}{\sigma}
The z-table shows the probability that something is less than a given z-score.
Standard error (SE) is the standard deviation of the distribution of sample means: SE = \frac{\sigma}{\sqrt{n}}
This is called the central limit theorem.
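The central limit theorem and the SE formula can be checked with a quick simulation (all the numbers here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 25

# Draw 10,000 samples of size n and take each sample's mean.
sample_means = rng.normal(0.0, sigma, size=(10000, n)).mean(axis=1)

# The spread of the sample means should be close to sigma / sqrt(n) = 0.4.
print(sample_means.std())
```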
margin of error: 95% of sample means fall within \frac{2\sigma}{\sqrt{n}}
critical value: 98% of sample means fall within \frac{2.33\sigma}{\sqrt{n}}
This level of unlikeliness is called the alpha level: 5%, 1%, or 0.1%. For a one-tailed critical region, these correspond to z-values of about 1.645, 2.326, and 3.09.
e.g. if z= 1.82, we say \bar x is significant at p <0.05. This is interesting. ==The outlier is statistically significant.==
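Instead of a paper z-table, scipy can do both lookups; a minimal sketch (norm.cdf gives P(Z < z) and norm.ppf is its inverse):

```python
from scipy.stats import norm

# P(Z < z): the z-table lookup.
print(norm.cdf(1.82))   # ~0.9656, so P(Z > 1.82) ~ 0.034 < 0.05

# Inverse lookup: one-tailed critical z-values for alpha = 5%, 1%, 0.1%.
for alpha in (0.05, 0.01, 0.001):
    print(norm.ppf(1 - alpha))   # ~1.645, ~2.326, ~3.090
```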
We can also have a two-tailed critical region, similar to the one-tailed case but with the alpha level split between the two tails (e.g. z = \pm 1.96 for alpha = 5%).

H0 (null hypothesis): the sample mean of the intervention falls outside the critical region.
Ha (alternative hypothesis): the sample mean of the intervention falls inside the critical region.
We can’t prove that the null hypothesis is true; we can only obtain evidence to reject it.
e.g. Most dogs have 4 legs. (significance level = 50%)

t distribution

The z-test works when we know \mu and \sigma; in practice we usually don’t, so we use the t-test instead.
Degrees of freedom are the number of pieces of information that can be freely varied without violating any given restriction. They are the independent pieces of information available to estimate another piece of information.
sample standard deviation = \sqrt{\frac{\sum_i (x_i-\bar x)^2}{n-1}}, where n-1 is the effective sample size.
The t-distribution is a flattened form of the normal distribution. As the degrees of freedom tend to infinity, the two overlap.
The t-value can be obtained by checking a t-table.
From a sample, calculate the t-value:

import numpy as np

def t_value(nums, mean):
    # One-sample t statistic: distance of the sample mean from the
    # hypothesized mean, in units of standard error.
    length = len(nums)
    x_bar = np.mean(nums)
    sample_sd = np.std(nums, ddof=1)   # Bessel-corrected: divide by n-1
    t = (x_bar - mean) / sample_sd * np.sqrt(length)
    print('x_bar={0:.2f}\n sample_sd={1:.3f}\n t={2:.3f}'.format(x_bar, sample_sd, t))
    return t

def t_value_from_stats(x_bar, ssd, mean, num):
    # Same t statistic, computed from summary statistics instead of raw data.
    return (x_bar - mean) / ssd * np.sqrt(num)
From the t-value, get the two-tailed p-value: Link to GraphPad
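Instead of the GraphPad lookup, scipy returns the two-tailed p-value directly (the sample values below are made up):

```python
import numpy as np
from scipy import stats

nums = [5.2, 4.8, 6.1, 5.5, 4.9, 5.7]   # made-up sample
mu = 5.0                                 # hypothesized population mean

# One-sample t-test; p is already the two-tailed p-value.
t, p = stats.ttest_1samp(nums, popmean=mu)
print(t, p)
```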
One-sample test, dependent samples, repeated measures:
  • two conditions,
  • longitudinal(t1,t2),
  • pretest & posttest
This approach is cheap, but the downside is carry-over effects: the second measurement can be affected by the first treatment, and the order may influence results.
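A dependent-samples (paired) t-test in scipy, with hypothetical pretest and posttest scores for the same subjects:

```python
from scipy import stats

# Hypothetical pretest & posttest scores for the same 5 subjects.
pre  = [12, 15, 11, 14, 13]
post = [14, 17, 12, 15, 15]

# Paired t-test: tests whether the mean difference post - pre is zero.
t, p = stats.ttest_rel(post, pre)
print(t, p)
```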

types of effect size measures

difference: mean, deviation
Statistical significance doesn’t mean important, large, sizeable, or meaningful. It only means we rejected the null, i.e. the results are not likely due to chance (sampling error).
Cohen’s d measures the standardized mean difference: d = \frac{\bar x - \mu}{s}.
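A minimal sketch of Cohen’s d for a one-sample comparison (the numbers are arbitrary):

```python
import numpy as np

def cohens_d(sample, mu):
    # Standardized mean difference: (x_bar - mu) / sample standard deviation.
    s = np.std(sample, ddof=1)   # Bessel-corrected
    return (np.mean(sample) - mu) / s

print(cohens_d([2, 4, 6], 3))   # (4 - 3) / 2 = 0.5
```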

independent samples

variances add for independent samples: s^2 = s_1^2 + s_2^2
standard error = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
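The SE formula above can be checked against scipy's Welch (unequal-variances) t-test, which uses the same standard error (the samples are made up):

```python
import numpy as np
from scipy import stats

# Two hypothetical independent samples.
a = np.array([3.1, 2.9, 3.4, 3.2])
b = np.array([2.5, 2.7, 2.4, 2.8, 2.6])

# SE from the formula, then the t statistic by hand.
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

# scipy's Welch t-test computes the same statistic.
t_scipy, p = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p)
```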