Friday, October 7, 2016

Intro to machine learning

Update on 2017-2-9

Four months after this initial post, I have almost finished the Machine Learning Engineer NanoDegree (MLND). It is a perfect time for me to compare the differences between the two.
The pros of the “Intro” course are that it’s free, it’s fun to watch Sebastian’s self-driving car, and the videos are produced entirely by him and his friendly co-instructor Katie. The cons are:
  1. no project feedback
  2. The course materials are not well organized. The teaching sequence is Naive Bayes, SVM, Decision Trees, Regression, Clustering, Feature Scaling and Selection, PCA, and Evaluation Metrics. Are you kidding me? I think the correct approach would be almost exactly the reverse order: we should get the big picture first, then narrow down to a specific topic, so we don’t get trapped in narrow-mindedness. The trick here is that a more specific topic is easier and more fun to teach!
  3. For Naive Bayes, Sebastian explained that Bayes was a Christian trying to use evidence to infer the existence of God, and that “Naive” means it doesn’t consider the order of words. In my understanding, it would be more explicit to say that “Naive” comes from the assumption that the individual pieces of evidence are independent, and that the NB algorithm is “probability-based”.
  4. The code is dated: it targets sklearn 0.15 (2014-11-11) and was last updated in 2015-10.
  5. The biggest problem is that the course does too much of the data preprocessing and visualization for the student. The student writes a few lines of “magic code” to get a “good feeling” about himself, but he doesn’t actually learn much, because the feeling is fake and he loses the big picture.
  6. For a beginner course, it is better to use a simpler, more transparent dataset. The Enron email dataset is too overwhelming for a beginner. See my analysis at https://github.com/jychstar/datasets
The MLND course has a slightly better structure, thanks to its longer length. However, many materials are still scrambled together. For example, in supervised learning the SVM section is placed after Neural Networks, and the SVM section begins with the “GaTech” version, followed by Sebastian’s intro version. What a shame!
Anyway, if you are interested in the MLND, its sequence is:
  • Model Evaluation and Validation (project: predicting Boston Housing prices)
  • Supervised Learning (project: Finding Donors for Charity ML)
  • Unsupervised Learning (project: Creating Customer Segments)
  • Reinforcement Learning (project: Train a Smartcab to Drive)
  • Deep Learning (project: Build a Digit Recognition Program)
  • Capstone Project

Supervised learning

naive bayes

a probabilistic classifier that applies Bayes’ theorem with a strong independence assumption between the features
from sklearn.naive_bayes import GaussianNB
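A minimal sketch of the fit/predict pattern shared by all the sklearn classifiers below, assuming features_train, labels_train, and features_test are already prepared:
clf = GaussianNB()
clf.fit(features_train, labels_train)  # learn a per-class Gaussian for each feature
pred = clf.predict(features_test)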

accuracy

from sklearn.metrics import accuracy_score, fbeta_score
y_pred = clf.predict(features_test)
print accuracy_score(y_true, y_pred)  # y_true holds the ground-truth labels
print fbeta_score(y_true, y_pred, beta)  # beta < 1 favors precision, beta > 1 favors recall

support vector machine

a non-probabilistic binary linear classifier that separates categories by a gap that is as wide as possible
parameters: kernel (linear, rbf), C, gamma
from sklearn.svm import SVC
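A minimal sketch; the parameter values here are illustrative, not prescribed:
clf = SVC(kernel="rbf", C=10000., gamma=1.0)  # larger C penalizes training mistakes more, giving a wigglier boundary
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)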

decision tree

from sklearn.tree import DecisionTreeClassifier
algorithm: maximize information gain = entropy(parent) - weighted average of entropy(children)
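A small self-contained sketch of this formula; the function and variable names are my own:
import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = float(len(labels))
    return -sum(labels.count(c) / total * math.log(labels.count(c) / total, 2)
                for c in set(labels))

def information_gain(parent, children):
    # entropy(parent) minus the size-weighted average entropy of the children
    n = float(len(parent))
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# a perfect split of a 50/50 parent gives the maximum gain of 1.0
print information_gain(["slow", "slow", "fast", "fast"],
                       [["slow", "slow"], ["fast", "fast"]])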

k nearest neighbors

An object is classified by a majority vote of its k nearest neighbors; this is instance-based, “lazy” learning.
from sklearn.neighbors import KNeighborsClassifier
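A minimal sketch; n_neighbors=3 matches the best-scoring row in the comparison table below:
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(features_train, labels_train)  # lazy: essentially just stores the training points
pred = clf.predict(features_test)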

adaboost

Adaptive Boosting: improves individual weak learners (e.g. decision trees) by re-weighting the harder-to-classify examples. Sensitive to noisy data and outliers.
from sklearn.ensemble import AdaBoostClassifier
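A minimal sketch; n_estimators=100 mirrors the comparison table below, and the base learner defaults to a shallow decision tree:
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)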

random forest

constructs a multitude of decision trees and averages their votes to correct a single decision tree’s habit of overfitting.
from sklearn.ensemble import RandomForestClassifier
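A minimal sketch; the number of trees is illustrative:
clf = RandomForestClassifier(n_estimators=10)  # each tree sees a bootstrap sample and random feature subsets
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)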

comparison

training data: makeTerrainData()
algorithm | parameter | No. of training points | training time | accuracy
k nearest neighbors | n_neighbors=5 (default) | 750 | 0.389 | 0.92
k nearest neighbors | n_neighbors=15 | 750 | 0.545 | 0.928
k nearest neighbors | n_neighbors=3 | 750 | 0.267 | 0.936
AdaBoost | n_estimators=100 | 750 | 0.686 | 0.924
AdaBoost | n_estimators=10 | 750 | 0.379 | 0.916
random forest | default | 750 | 0.389 | 0.924

Accuracy vs Training set size

More Data > Fine-tuned Algorithm

regression

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(features_train, labels_train)
pred = reg.predict([[27]])[0][0]  # prediction for a single input, e.g. age 27
print "slope:", reg.coef_
print "intercept:", reg.intercept_
print "r-squared score on test:", reg.score(ages_test, net_worths_test)
print "r-squared score on train:", reg.score(ages_train, net_worths_train)
classification vs regression

property | supervised classification | regression
output type | discrete (class labels) | continuous (ordered numbers)
goal | decision boundary | best-fit line
evaluation | accuracy | r^2

remove outliers

data is a list of tuples; the element at index 2 (the third) is used as the sort key.
data.sort(key=lambda tup: tup[2])
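A sketch of how this enables outlier removal; the (age, net_worth, error) tuple layout and the 10% cutoff are my assumptions:
data.sort(key=lambda tup: tup[2])  # sort by prediction error, smallest first
cleaned_data = data[:int(len(data) * 0.9)]  # drop the 10% of points with the largest errors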

unsupervised learning

clustering

k-means
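A minimal sklearn sketch, assuming data is a 2-D array of features; n_clusters=2 is illustrative:
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=2)
pred = clf.fit_predict(data)  # assign each point to the nearest of the 2 learned centroids
print clf.cluster_centers_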

feature scaling

import numpy
from sklearn.preprocessing import MinMaxScaler
data = numpy.array([[1.], [2.], [3.]])
scaler = MinMaxScaler()
rescaled_data = scaler.fit_transform(data)  # rescale each feature column to [0, 1]
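For reference, MinMaxScaler implements x' = (x - x_min) / (x_max - x_min), so rescaled_data above maps 1., 2., 3. to 0., 0.5, 1.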

Text learning

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
bag_of_words = clf.fit_transform(email_list)  # learn the vocabulary, then count each word
print clf.vocabulary_.get('great')  # column index assigned to the word 'great'
Not all words are equal. Some words contain more information than others.

stopwords

(low-information, highly frequent words):
and, the, I, you, have, be, in, will

NLTK (Natural Language Toolkit)

import nltk
nltk.download()  # one-time download of corpora, including the stopwords list
from nltk.corpus import stopwords
sw = stopwords.words("english")
Stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
print stemmer.stem("responsivity")  # -> respons
TF-IDF
term frequency * inverse document frequency: words frequent within a document get up-weighted, while words common across all documents get down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer
clf = TfidfVectorizer(stop_words="english")
tfidf = clf.fit_transform(word_data)
print clf.get_feature_names()

feature selection

high bias:
  1. pays little attention to data
  2. oversimplified
  3. high error on training set
  4. few features used
high variance:
  1. pays too much attention to data
  2. overfit
  3. much higher error on the test set than on the training set
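For the actual feature selection step, one standard sklearn tool is SelectPercentile; a minimal sketch, where the 10-percentile cutoff is illustrative:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=10)  # keep the top 10% of features by ANOVA F-score
features_train_selected = selector.fit_transform(features_train, labels_train)
features_test_selected = selector.transform(features_test)  # apply the same selection to the test set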

PCA

principal component analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
print pca.explained_variance_ratio_  # fraction of the total variance captured by each component
first_pc = pca.components_[0]   # direction of the largest variation
second_pc = pca.components_[1]

Validation and Evaluation Metrics

cross validation

To avoid the overfitting that comes from training and testing on the same data, the data set is randomly split into training and test sets.
# from sklearn.model_selection import train_test_split  # available since 0.18
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

GridSearchCV
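GridSearchCV tries every combination in a parameter grid with cross validation and keeps the best. A minimal sketch; the grid values are illustrative:
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+
from sklearn.svm import SVC
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(SVC(), parameters)
clf.fit(features_train, labels_train)
print clf.best_params_  # the winning parameter combination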

faces recognition

Evaluation Metrics

precision score: true positives / all predicted positives
recall score: true positives / all actual positives
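The corresponding sklearn calls, reusing y_true and y_pred from the accuracy section above:
from sklearn.metrics import precision_score, recall_score
print precision_score(y_true, y_pred)  # TP / (TP + FP)
print recall_score(y_true, y_pred)     # TP / (TP + FN)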