# Update on 2017-2-9

Four months after the initial post, I have almost finished the Machine Learning Engineer Nanodegree (MLND), so it is a good time to compare the two courses.
The pros of the “Intro” course are that it is free, that Sebastian’s self-driving car is fun to watch, and that the videos are produced entirely by him and his co-instructor Katie. The cons are:
1. No project feedback.
2. The course materials are not well organized. The teaching sequence is Naive Bayes, SVM, decision trees, regression, clustering, feature scaling and selection, PCA, evaluation metrics. Are you kidding me? I think the correct order is exactly the reverse! We should get the big picture first and then narrow down to specific topics, so we don’t become narrow-minded. The catch is that a specific topic is easier and more fun to teach!
3. For Naive Bayes, Sebastian explained that Bayes was a Christian who tried to use evidence to infer the existence of God, and that “Naive” means the algorithm doesn’t consider the order of words. In my understanding, it can be stated more explicitly: “Naive” comes from the assumption that the individual pieces of evidence are independent. And the NB algorithm is “probability-based”.
4. The code is dated: it was written for sklearn 0.15 (2014-11-11) and last updated in 2015-10.
5. The biggest problem is that the course does too much of the data preprocessing and visualization for you. The student writes a few lines of “magic code” and gets a “good feeling” about himself. But he doesn’t actually learn as much, because the feeling is fake and he loses the big picture.
6. A beginner course should use simpler, more transparent datasets. The Enron email dataset is too overwhelming for a beginner. See my analysis at https://github.com/jychstar/datasets
The MLND course is slightly better structured thanks to its greater length. However, much of the material is still jumbled together. For example, in supervised learning the SVM section comes after neural networks, and it begins with the “GaTech” version followed by Sebastian’s intro version. What a shame!
Anyway, if you are interested in the MLND, its sequence is:
• Model Evaluation and Validation (project: predicting Boston Housing prices)
• Supervised Learning (project: Finding Donors for Charity ML)
• Unsupervised Learning (project: Creating Customer Segments)
• Reinforcement Learning (project: Train a Smartcab to Drive)
• Deep Learning (project: build a digit recognition program)
• Capstone Project

# Supervised learning

## naive bayes

A probabilistic classifier that applies Bayes’ theorem with a strong independence assumption between features.
`from sklearn.naive_bayes import GaussianNB`
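
To make the independence assumption concrete, here is a minimal pure-Python sketch (not `GaussianNB` itself). The two authors, their word probabilities, and the priors are all hypothetical numbers made up for illustration. The "naive" step is the plain product of per-word likelihoods.

```python
# Hypothetical per-author word probabilities and priors, chosen only for illustration.
word_probs = {
    "author_a": {"love": 0.1, "deal": 0.8, "life": 0.1},
    "author_b": {"love": 0.5, "deal": 0.2, "life": 0.3},
}
priors = {"author_a": 0.5, "author_b": 0.5}


def posterior(words):
    """Return P(author | words) for each author under the naive assumption."""
    joint = {}
    for author, prior in priors.items():
        score = prior
        # Naive step: multiply per-word likelihoods as if words were independent.
        for word in words:
            score *= word_probs[author][word]
        joint[author] = score
    total = sum(joint.values())  # normalize so the posteriors sum to 1
    return {author: score / total for author, score in joint.items()}


result = posterior(["life", "deal"])
print(result)
# Word order does not matter, because the likelihoods are simply multiplied.
print(posterior(["deal", "life"]))
```

Because the score is a product, swapping the word order gives the same posterior, which is the "doesn't consider the order of words" point from the course.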

### accuracy

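```python
from sklearn.metrics import accuracy_score, fbeta_score

# clf is a fitted classifier; labels_test holds the true labels for features_test
y_true = labels_test
y_pred = clf.predict(features_test)
print(accuracy_score(y_true, y_pred))
# beta=0.5 is only an example value; beta < 1 weights precision more than recall
print(fbeta_score(y_true, y_pred, beta=0.5))
```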

## support vector machine

A non-probabilistic binary linear classifier that separates the categories by a gap (margin) that is as wide as possible.
parameters: kernel (linear, rbf), C, gamma
`from sklearn.svm import SVC`

## decision tree

`from sklearn.tree import DecisionTreeClassifier`
algorithm: choose the split that maximizes information gain, where information gain = entropy(parent) - weighted average of entropy(children)
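
As a rough sketch of the information gain formula above, the snippet below computes entropy and the gain of one candidate split in plain Python. The labels and the split are a hypothetical toy example, and the helper names are mine, not part of sklearn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy split: four samples divided into a group of three and a group of one.
parent = ["slow", "slow", "fast", "fast"]
children = [["slow", "slow", "fast"], ["fast"]]
gain = information_gain(parent, children)
print(round(entropy(parent), 4), round(gain, 4))
```

A tree builder would evaluate this gain for every candidate split and keep the largest one.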

## k nearest neighbors

An object is classified by a majority vote of its k nearest neighbors. This is instance-based learning, also called lazy learning.
`from sklearn.neighbors import KNeighborsClassifier`
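
The majority vote can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how `KNeighborsClassifier` is implemented, and the points and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    # Lazy learning: nothing is fitted in advance; all work happens at query time.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D toy data: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["slow", "slow", "slow", "fast", "fast", "fast"]

print(knn_predict(points, labels, (0.15, 0.15), k=3))
print(knn_predict(points, labels, (0.95, 1.0), k=3))
```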

## adaptive boosting

Adaptive Boosting (AdaBoost) combines individual weak learners (decision trees), with each new learner focusing on the examples that are harder to classify. It is sensitive to noisy data and outliers.
`from sklearn.ensemble import AdaBoostClassifier`

## random forest

Constructs a multitude of decision trees and combines them to correct a single decision tree’s tendency to overfit.
`from sklearn.ensemble import RandomForestClassifier`

### comparison

training data: makeTerrainData()

| algorithm | parameter | No. of training points | time | accuracy |
|---|---|---|---|---|
| k nearest neighbors | default=5 | 750 | 0.389 | 0.92 |
| k nearest neighbors | neighbors=15 | 750 | 0.545 | 0.928 |
| k nearest neighbors | neighbors=3 | 750 | 0.267 | 0.936 |
| AdaBoost | est. 100 | 750 | 0.686 | 0.924 |
| AdaBoost | est. 10 | 750 | 0.379 | 0.916 |
| random forest | | 750 | 0.389 | 0.924 |

## Accuracy vs Training set size

More Data > Fine-tuned Algorithm

## regression

```python
from sklearn import linear_model

# ages_* and net_worths_* are 2-D arrays of shape (n_samples, 1)
reg = linear_model.LinearRegression()
reg.fit(ages_train, net_worths_train)
pre = reg.predict([[27]])[0][0]  # predicted net worth at age 27
print("slope:", reg.coef_)
print("intercept:", reg.intercept_)
test_score = reg.score(ages_test, net_worths_test)
training_score = reg.score(ages_train, net_worths_train)
print("r-squared score:", "test", test_score, "training", training_score)
```
classification vs regression

| property | supervised classification | regression |
|---|---|---|
| output type | discrete (class labels) | continuous (ordered numbers) |
| goal | decision boundary | best fit line |
| evaluation | accuracy | r^2 |
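
The r^2 score used for regression can be computed by hand as 1 - SS_res / SS_tot. The sketch below fits a one-feature least-squares line and scores it in plain Python. The ages and net worths are hypothetical numbers, and `fit_line` and `r_squared` are illustrative helpers of my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical ages and net worths, made up for illustration.
ages = [20, 30, 40, 50, 60]
net_worths = [110, 190, 260, 310, 380]

slope, intercept = fit_line(ages, net_worths)
predictions = [slope * age + intercept for age in ages]
score = r_squared(net_worths, predictions)
print(slope, intercept, score)
```

A score close to 1 means the line explains most of the variation in the labels; a perfect fit gives exactly 1.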

### remove outliers

data is a list of tuples; the element at index 2 (the third element) is used as the sort key.
```python
data.sort(key=lambda tup: tup[2])
```
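
A minimal sketch of how that sort can be used to remove outliers, assuming each tuple is `(x, y, error)` and that the 10% of points with the largest squared residual are dropped. Both the tuple layout and the 10% threshold are my assumptions here, and the data is made up.

```python
def remove_outliers(predictions, xs, ys, drop_fraction=0.1):
    """Drop the points with the largest squared residuals.

    Returns a list of (x, y, error) tuples, sorted by error.
    """
    data = [(x, y, (y - p) ** 2) for p, x, y in zip(predictions, xs, ys)]
    data.sort(key=lambda tup: tup[2])  # index 2 is the error
    keep = len(data) - int(len(data) * drop_fraction)
    return data[:keep]


# Hypothetical values: the last point is an obvious outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 18, 100]
predictions = [2 * x for x in xs]  # stand-in for the output of reg.predict(...)

cleaned = remove_outliers(predictions, xs, ys)
print(len(cleaned), cleaned[-1])
```

After cleaning, the regression would be refit on the remaining points.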

# unsupervised learning

## k-means
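
A minimal sketch of the k-means loop (Lloyd's algorithm) in plain Python, alternating an assignment step and an update step. The points and initial centroids are hypothetical, and in practice `sklearn.cluster.KMeans` would be used instead of this toy version.

```python
import math


def k_means(points, centroids, n_iterations=10):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    for _ in range(n_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroid  # an empty cluster keeps its old centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]

centroids, clusters = k_means(points, initial)
print(centroids)
```

The result depends on the initial centroids, which is why real implementations usually try several initializations.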

## feature scaling

```python
import numpy
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data = numpy.array([[1.], [2.], [3.]])  # one feature, three samples
re_data = scaler.fit_transform(data)  # rescaled to [[0.], [0.5], [1.]]
```
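
The formula behind min-max scaling is x' = (x - min) / (max - min). Here is a small plain-Python sketch of it; the function name is hypothetical, and raising an error on a constant feature is my own choice, not necessarily what `MinMaxScaler` does.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # Design choice for this sketch: refuse to divide by zero.
        raise ValueError("cannot rescale a constant feature")
    return [(value - low) / (high - low) for value in values]


print(min_max_scale([1.0, 2.0, 3.0]))
print(min_max_scale([115.0, 140.0, 175.0]))
```

Since the result depends only on the minimum and maximum, a single extreme outlier can squeeze all other values into a narrow range.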

## Text learning

### countVectorizer

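```python
from sklearn.feature_extraction.text import CountVectorizer

# email_list is a list of strings, one per email
vectorizer = CountVectorizer()
vectorizer.fit(email_list)  # learn the vocabulary
bag_of_words = vectorizer.transform(email_list)  # sparse matrix of word counts
print(vectorizer.vocabulary_.get('great'))  # column index of 'great', or None if absent
```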
Not all words are equal. Some words contain more information than others.

#### stopwords

Low-information, highly frequent words:
and, the, I, you, have, be, in, will
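
A rough plain-Python sketch of a bag of words with stopword removal. It only illustrates the idea and is not how `CountVectorizer` works internally; the stopword set is just the handful of words listed above, and the sentence is made up.

```python
import re
from collections import Counter

# A tiny stopword set taken from the examples above; real lists are much longer.
STOPWORDS = {"and", "the", "i", "you", "have", "be", "in", "will"}


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring word order and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS)


# Hypothetical email text, made up for illustration.
counts = bag_of_words("You will have a great day, and the day will be great in the end")
print(counts)
```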

### NLTK (Natural Language Toolkit)

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stopword corpus
sw = stopwords.words("english")
```
Stemmer
```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("responsivity"))  # reduces the word to its stem
```
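TF-IDF
term frequency * inverse document frequency
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# word_data is a list of strings, one per document
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(word_data)
print(vectorizer.get_feature_names_out())  # get_feature_names() on sklearn < 1.0
```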
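
To see what "term frequency * inverse document frequency" means numerically, here is a plain-Python sketch using the textbook form tf * log(N / df). This is an assumption on my part: `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers will differ. The three documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append(
            {
                word: (count / len(words)) * math.log(n_docs / doc_freq[word])
                for word, count in counts.items()
            }
        )
    return weights


# Hypothetical three-document corpus, made up for illustration.
docs = ["the meeting is today", "the report is late", "the enron report"]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document gets weight 0, while a word unique to one document gets the highest weight there.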

## feature selection

high bias:
1. pays little attention to data
2. oversimplified
3. high error on training set
4. few features used
high variance:
1. pays too much attention to data
2. overfit
3. higher error on the test set than on the training set
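
The two failure modes can be seen in a toy experiment. In this sketch, which is only an illustration under my own assumptions, the ground truth is a hypothetical noisy line y = 2x. A constant predictor stands in for high bias and a nearest-point memorizer stands in for high variance.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def noisy_line(xs):
    """Hypothetical ground truth y = 2x plus Gaussian noise."""
    return [2 * x + random.gauss(0, 1) for x in xs]


train_x = [float(i) for i in range(20)]
test_x = [i + 0.5 for i in range(20)]
train_y, test_y = noisy_line(train_x), noisy_line(test_x)


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    """High bias: ignores the input and always predicts the training mean."""
    return mean_y


def memorizing_model(x):
    """High variance: returns the label of the single nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("constant", constant_model), ("memorizing", memorizing_model)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The constant model has a large error even on the training set, while the memorizing model has zero training error but a nonzero test error.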

## PCA

principal component analysis
```python
from sklearn.decomposition import PCA

# data is an array of shape (n_samples, n_features)
pca = PCA(n_components=2)
pca.fit(data)
print(pca.explained_variance_ratio_)  # fraction of the variance captured by each component
first_pc = pca.components_[0]
second_pc = pca.components_[1]
```

# Validation and Evaluation Metrics

## cross validation

Evaluating a model on the same data it was trained on hides overfitting, so the data set is randomly split into training and test sets.
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split  # sklearn >= 0.18
# from sklearn.cross_validation import train_test_split  # sklearn < 0.18

iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
```

## GridSearchCV

example: face recognition
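
A minimal sketch of `GridSearchCV`, assuming the iris dataset as a stand-in (not the faces data) and a small hypothetical grid over the SVM parameters kernel, C and gamma. Every combination in the grid is scored with cross validation and the best one is kept.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()  # stand-in dataset, not the faces data

# Hypothetical grid; every combination is scored with 5-fold cross validation.
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is a classifier refit on the whole data set with the best parameters, and it can be used like any other sklearn classifier.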

## Evaluation Metrics

precision score: true positives / all predicted positives
recall score: true positives / all actual positives
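
Both definitions, plus the F-beta score that combines them, can be sketched in plain Python. The labels below are hypothetical, and `precision_recall_fbeta` is an illustrative helper, not a sklearn function.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0, positive=1):
    """Compute precision, recall and F-beta for one positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_pos = sum(1 for _, p in pairs if p == positive)
    actual_pos = sum(1 for t, _ in pairs if t == positive)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical labels, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]
precision, recall, f_half = precision_recall_fbeta(y_true, y_pred, beta=0.5)
print(precision, recall, f_half)
```

With beta below 1 the combined score leans toward precision; with beta above 1 it leans toward recall.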