Friday, May 5, 2017

Business Analyst ND 2, AB test, time series, cluster

6 A/B test

Experimental design

treatment group vs control group

target vs control variable: causal relation.

Lurking (confounding) variable: e.g., temperature in the apparent relation between death rate and ice-cream sales

experimental, predictor variables

Two kinds of designs:

randomized design: use when data is low cost and there is no bias
matched pair design: use when data is high cost

Use the spreadsheet built-in function T.TEST(data1, data2, tails, type) (available in Excel) to get the p-value. If the p-value is smaller than 0.05, the observed difference is unlikely to be due to chance alone, so the result is called statistically significant. Statisticians would more accurately call such a result a correlation rather than a causal relation. Because we are looking for a real, useful relation, the lower the p-value, the better.
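The same check can be sketched in plain Python. This is a minimal sketch, assuming a paired design (which corresponds to T.TEST with tails=2, type=1). The sales figures and the helper names t_pdf, two_tailed_p and paired_t_test are hypothetical, and the p-value is obtained by numerically integrating the Student's t density rather than by calling a statistics library.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value, using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * central_half)


def paired_t_test(before, after):
    """Return (t statistic, two-tailed p-value) for paired samples."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, two_tailed_p(t_stat, n - 1)


# Hypothetical weekly sales for the same 8 stores, before and after a change.
before = [100, 102, 98, 105, 110, 95, 101, 99]
after = [108, 109, 104, 111, 118, 101, 110, 105]

t_stat, p_value = paired_t_test(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

A paired test fits the matched pair design above; for two independent groups a two-sample test would be the closer analogue.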

project

Compare the sales before and after a new menu. To avoid seasonal factors, compare the same dates last year vs. this year.

How to parse a specific date to the number of weeks in the year?

Commercial software like Alteryx has a tool specially designed to address this problem, and it takes only one click!

Anyway, I can do it in Python, as I posted on Stack Overflow.
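One way to do this with the standard library is sketched below. It assumes ISO 8601 week numbering (weeks start on Monday) and a "YYYY-MM-DD" input format. The function name week_of_year is hypothetical, and this is not necessarily the approach from the Stack Overflow post.

```python
from datetime import datetime


def week_of_year(text, fmt="%Y-%m-%d"):
    """Parse a date string and return its ISO 8601 week number (1 to 53)."""
    day = datetime.strptime(text, fmt).date()
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    return day.isocalendar()[1]


# The same calendar date in two consecutive years.
this_year = week_of_year("2017-05-05")
last_year = week_of_year("2016-05-05")
print(this_year, last_year)
```

Matching on week number rather than on exact date keeps the weekday mix comparable between the two years.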

8 Time series forecasting

Two models:

8.1 ETS: Exponential smoothing.

Decomposition plot (seasonal, trend, error).

Seasonal (none, constant, increasing)
Trend (none, linear, exponential)

Simple Exponential smoothing

This method is suitable for forecasting data with no trend or seasonal pattern.

$p = \alpha\sum_{i=0}^{\infty}\beta^i p_i$

where $\beta = 1-\alpha$

$p$ is the forecast price for tomorrow, $p_0$ is the price for today, $p_1$ is the price for yesterday, and so on. We can easily prove that the weights sum to 1.
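The claim about the weights follows from the geometric series, assuming $0 < \alpha \le 1$ so that $0 \le \beta < 1$:

$$\alpha\sum_{i=0}^{\infty}\beta^i = \frac{\alpha}{1-\beta} = \frac{\alpha}{\alpha} = 1$$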

An example is:
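As a minimal sketch of the weighted sum above, assuming a short, hypothetical price history and $\alpha = 0.5$: with a finite history the infinite sum is truncated, so the leftover weight $\beta^{n-1}$ is placed on the oldest price. The function names ses_weights and ses_forecast are illustrative only.

```python
def ses_weights(alpha, n):
    """Weights for n prices ordered newest first; they sum to 1."""
    beta = 1 - alpha
    weights = [alpha * beta ** i for i in range(n - 1)]
    weights.append(beta ** (n - 1))  # leftover weight goes to the oldest price
    return weights


def ses_forecast(prices, alpha):
    """One-step-ahead forecast; prices are ordered oldest to newest."""
    level = prices[0]
    for price in prices[1:]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * price + (1 - alpha) * level
    return level


prices = [10.0, 12.0, 11.0, 13.0]  # hypothetical prices, oldest first
alpha = 0.5

weights = ses_weights(alpha, len(prices))
explicit = sum(w * p for w, p in zip(weights, reversed(prices)))
recursive = ses_forecast(prices, alpha)
print(weights, explicit, recursive)
```

The recursive update and the explicit weighted sum give the same forecast; the recursive form is the one usually implemented because it needs only the previous level.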

If there’s a trend, a more complicated method is used, such as Holt’s linear trend method: https://www.otexts.org/fpp/7/2

Anyway, Alteryx has everything built-in for you.

8.2 ARIMA: Autoregressive integrated moving average

AR: a linear regression on the previous p points
I: difference the series d times
MA: a linear regression on the error components of the previous q points
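The I and AR parts can be sketched in a few lines of plain Python. This is only an illustration, assuming p = 1 and an ordinary least-squares fit. The MA part is not fitted here, the data are made up, and the names difference and fit_ar1 are hypothetical, not Alteryx functions.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1] (the 'AR' part, p = 1)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


# A quadratic trend becomes constant after differencing twice (d = 2).
squares = [1, 4, 9, 16, 25, 36]
print(difference(squares, d=2))

# A made-up series generated exactly by y[t] = 2 + 0.5 * y[t-1].
series = [0.0, 2.0, 3.0, 3.5, 3.75, 3.875]
c, phi = fit_ar1(series)
next_value = c + phi * series[-1]
print(c, phi, next_value)
```

In a full ARIMA fit, the residuals left over from the AR regression are what the MA terms would model.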

Overall, this is single-variable prediction: the only input is the series' own history over time. It assumes all the other variables are condensed into or represented by this one variable, which is often not possible, because many variables vary independently.

9 Segmentation and clustering

Basically, it’s unsupervised learning.

Standardizing vs. localizing is a tradeoff between a one-size-fits-all solution and a customized solution.

The business perspective says segments; the statistics perspective says clusters.

Definition: A mathematical procedure for multidimensional analysis. Given the characteristics of a set of objects, this procedure groups similar objects into clusters.
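One common procedure of this kind is k-means, shown here as a minimal sketch. The notes do not name a specific algorithm, so treating k-means as the procedure is an assumption, as are the customer data, the naive initialization and the function name kmeans.

```python
def kmeans(points, k, iterations=20):
    """Naive k-means on tuples of numbers; returns (centers, clusters)."""
    centers = [points[i] for i in range(k)]  # assumption: first k points as seeds
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customers described by two characteristics each.
customers = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, clusters = kmeans(customers, k=2)
print(centers)
print(clusters)
```

In practice the variables are usually standardized first so that no single characteristic dominates the distance.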

Reduce the variables to artificial (component) variables.

PCA (principal component analysis) builds components that capture as much of the variance of all variables as possible.
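The idea can be sketched in plain Python by finding only the first principal component. This sketch swaps in power iteration on the covariance matrix, whereas real tools typically use an eigendecomposition or SVD. The data and the function name first_principal_component are hypothetical, and it assumes the variables are not all constant.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) of the first component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the m variables.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]
    v = [1.0] * m  # starting guess for the direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v, divided by the total variance of all variables.
    along = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m)) for i in range(m))
    total = sum(cov[i][i] for i in range(m))
    return v, along / total


# Two perfectly correlated variables: one component carries all the variance.
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, explained = first_principal_component(rows)
print(direction, explained)
```

Here a single component variable replaces two original variables without losing any variance; with real data the share is below 1 and a few components are kept.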

Because everything is packaged in Alteryx, this is not meant to be an in-depth course. What interests me most is how the questions are presented in a business context and how the solutions are used to help make business decisions.

This ends my 7-day free trial BAND journey. Other unfinished project materials can be found at https://github.com/jychstar/NanoDegreeProject/tree/master/BAND

Yuchao's blogspot