Friday, October 7, 2016

Intro to machine learning

Update on 2017-2-9

Four months after this initial post, I have almost finished the Machine Learning Engineer NanoDegree (MLND). It is a perfect time for me to compare the differences between the two.
The pros of the “Intro” course are that it’s free, it’s fun to watch Sebastian’s self-driving car, and the videos are produced entirely by him and his friendly co-instructor Katie. The cons are:
  1. no project feedback
  2. The course materials are not well organized. The teaching sequence is Naive Bayes, SVM, Decision Trees, Regression, Clustering, Feature Scaling and Selection, PCA, and Evaluation Metrics. Are you kidding me? I think the correct approach would be almost exactly the reverse order: we should get the big picture first, then narrow down to a specific topic, so we don’t get trapped in narrow-mindedness. The trick here is that a more specific topic is easier and more fun to teach!
  3. For Naive Bayes, Sebastian explained that Bayes was a Christian trying to use evidence to infer the existence of God, and that “Naive” means it doesn’t consider the order of words. In my understanding, it would be more explicit to say that “Naive” comes from the assumption that the individual pieces of evidence are independent, and that the NB algorithm is “probability-based”.
  4. The code is dated: it targets sklearn 0.15 (2014-11-11) and was last updated in 2015-10.
  5. The biggest problem is that the course does too much of the data preprocessing and visualization for the student. The student writes a few lines of “magic code” to get a “good feeling” about himself, but he doesn’t actually learn much, because the feeling is fake and he loses the big picture.
  6. For a beginner course, it is better to use a simpler, more transparent dataset. The Enron email dataset is too overwhelming for a beginner. See my analysis at https://github.com/jychstar/datasets
The MLND course has a slightly better structure, thanks to its longer length. However, many materials are still scrambled together. For example, in supervised learning the SVM section is placed after Neural Networks, and the SVM section begins with the “GaTech” version, followed by Sebastian’s intro version. What a shame!
Anyway, if you are interested in the MLND, its sequence is:
  • Model Evaluation and Validation (project: predicting Boston Housing prices)
  • Supervised Learning (project: Finding Donors for Charity ML)
  • Unsupervised Learning (project: Creating Customer Segments)
  • Reinforcement Learning (project: Train a Smartcab to Drive)
  • Deep Learning (project: Build a Digit Recognition Program)
  • Capstone Project

Supervised learning

naive bayes

a probabilistic classifier that applies Bayes’ theorem with a strong independence assumption between the features
from sklearn.naive_bayes import GaussianNB
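A minimal sketch of the fit/predict pattern shared by all the sklearn classifiers below, assuming features_train, labels_train, and features_test are already prepared:
clf = GaussianNB()
clf.fit(features_train, labels_train)  # learn a per-class Gaussian for each feature
pred = clf.predict(features_test)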

accuracy

from sklearn.metrics import accuracy_score, fbeta_score
y_pred = clf.predict(features_test)
print accuracy_score(y_true, y_pred)  # y_true holds the ground-truth labels
print fbeta_score(y_true, y_pred, beta)  # beta < 1 favors precision, beta > 1 favors recall

support vector machine

a non-probabilistic binary linear classifier that separates categories by a gap that is as wide as possible
parameters: kernel (linear, rbf), C, gamma
from sklearn.svm import SVC
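A minimal sketch; the parameter values here are illustrative, not prescribed:
clf = SVC(kernel="rbf", C=10000., gamma=1.0)  # larger C penalizes training mistakes more, giving a wigglier boundary
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)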

decision tree

from sklearn.tree import DecisionTreeClassifier
algorithm: maximize information gain = entropy(parent) - weighted average of entropy(children)
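A small self-contained sketch of this formula; the function and variable names are my own:
import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = float(len(labels))
    return -sum(labels.count(c) / total * math.log(labels.count(c) / total, 2)
                for c in set(labels))

def information_gain(parent, children):
    # entropy(parent) minus the size-weighted average entropy of the children
    n = float(len(parent))
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# a perfect split of a 50/50 parent gives the maximum gain of 1.0
print information_gain(["slow", "slow", "fast", "fast"],
                       [["slow", "slow"], ["fast", "fast"]])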

k nearest neighbors

An object is classified by a majority vote of its k nearest neighbors; this is instance-based, “lazy” learning.
from sklearn.neighbors import KNeighborsClassifier
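A minimal sketch; n_neighbors=3 matches the best-scoring row in the comparison table below:
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(features_train, labels_train)  # lazy: essentially just stores the training points
pred = clf.predict(features_test)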

adaboost

Adaptive Boosting: improves individual weak learners (e.g. decision trees) by re-weighting the harder-to-classify examples. Sensitive to noisy data and outliers.
from sklearn.ensemble import AdaBoostClassifier
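A minimal sketch; n_estimators=100 mirrors the comparison table below, and the base learner defaults to a shallow decision tree:
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)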

random forest

constructs a multitude of decision trees and averages their votes to correct a single decision tree’s habit of overfitting.
from sklearn.ensemble import RandomForestClassifier
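A minimal sketch; the number of trees is illustrative:
clf = RandomForestClassifier(n_estimators=10)  # each tree sees a bootstrap sample and random feature subsets
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)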

comparison

training data: makeTerrainData()
algorithm | parameter | No. of training points | training time | accuracy
k nearest neighbors | n_neighbors=5 (default) | 750 | 0.389 | 0.92
k nearest neighbors | n_neighbors=15 | 750 | 0.545 | 0.928
k nearest neighbors | n_neighbors=3 | 750 | 0.267 | 0.936
AdaBoost | n_estimators=100 | 750 | 0.686 | 0.924
AdaBoost | n_estimators=10 | 750 | 0.379 | 0.916
random forest | default | 750 | 0.389 | 0.924

Accuracy vs Training set size

More Data > Fine-tuned Algorithm

regression

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(features_train, labels_train)
pred = reg.predict([[27]])[0][0]  # prediction for a single input, e.g. age 27
print "slope:", reg.coef_
print "intercept:", reg.intercept_
print "r-squared score on test:", reg.score(ages_test, net_worths_test)
print "r-squared score on train:", reg.score(ages_train, net_worths_train)
classification vs regression

property | supervised classification | regression
output type | discrete (class labels) | continuous (ordered numbers)
goal | decision boundary | best-fit line
evaluation | accuracy | r^2

remove outliers

data is a list of tuples; the element at index 2 (the third) is used as the sort key.
data.sort(key=lambda tup: tup[2])
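A sketch of how this enables outlier removal; the (age, net_worth, error) tuple layout and the 10% cutoff are my assumptions:
data.sort(key=lambda tup: tup[2])  # sort by prediction error, smallest first
cleaned_data = data[:int(len(data) * 0.9)]  # drop the 10% of points with the largest errors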

unsupervised learning

clustering

k-means
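A minimal sklearn sketch, assuming data is a 2-D array of features; n_clusters=2 is illustrative:
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=2)
pred = clf.fit_predict(data)  # assign each point to the nearest of the 2 learned centroids
print clf.cluster_centers_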

feature scaling

import numpy
from sklearn.preprocessing import MinMaxScaler
data = numpy.array([[1.], [2.], [3.]])
scaler = MinMaxScaler()
rescaled_data = scaler.fit_transform(data)  # rescale each feature column to [0, 1]
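For reference, MinMaxScaler implements x' = (x - x_min) / (x_max - x_min), so rescaled_data above maps 1., 2., 3. to 0., 0.5, 1.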

Text learning

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
bag_of_words = clf.fit_transform(email_list)  # learn the vocabulary, then count each word
print clf.vocabulary_.get('great')  # column index assigned to the word 'great'
Not all words are equal. Some words contain more information than others.

stopwords

(low-information, highly frequent words):
and, the, I, you, have, be, in, will

NLTK (Natural Language Toolkit)

import nltk
nltk.download()  # one-time download of corpora, including the stopwords list
from nltk.corpus import stopwords
sw = stopwords.words("english")
Stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
print stemmer.stem("responsivity")  # -> respons
TF-IDF
term frequency * inverse document frequency: words frequent within a document get up-weighted, while words common across all documents get down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer
clf = TfidfVectorizer(stop_words="english")
tfidf = clf.fit_transform(word_data)
print clf.get_feature_names()

feature selection

high bias:
  1. pays little attention to data
  2. oversimplified
  3. high error on training set
  4. few features used
high variance:
  1. pays too much attention to data
  2. overfit
  3. much higher error on the test set than on the training set
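For the actual feature selection step, one standard sklearn tool is SelectPercentile; a minimal sketch, where the 10-percentile cutoff is illustrative:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=10)  # keep the top 10% of features by ANOVA F-score
features_train_selected = selector.fit_transform(features_train, labels_train)
features_test_selected = selector.transform(features_test)  # apply the same selection to the test set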

PCA

principal component analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
print pca.explained_variance_ratio_  # fraction of the total variance captured by each component
first_pc = pca.components_[0]   # direction of the largest variation
second_pc = pca.components_[1]

Validation and Evaluation Metrics

cross validation

To avoid the overfitting that comes from training and testing on the same data, the data set is randomly split into training and test sets.
# from sklearn.model_selection import train_test_split  # available since 0.18
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

GridSearchCV
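GridSearchCV tries every combination in a parameter grid with cross validation and keeps the best. A minimal sketch; the grid values are illustrative:
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+
from sklearn.svm import SVC
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(SVC(), parameters)
clf.fit(features_train, labels_train)
print clf.best_params_  # the winning parameter combination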

faces recognition

Evaluation Metrics

precision score: true positives / all predicted positives
recall score: true positives / all actual positives
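The corresponding sklearn calls, reusing y_true and y_pred from the accuracy section above:
from sklearn.metrics import precision_score, recall_score
print precision_score(y_true, y_pred)  # TP / (TP + FP)
print recall_score(y_true, y_pred)     # TP / (TP + FN)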