Update on 2017-2-9
Four months after this initial post, I have almost finished the Machine Learning Engineer Nanodegree (MLND). It is a good time to compare the two courses.
The pros of the “Intro” course are that it’s free, it’s fun to watch Sebastian’s self-driving car, and the videos are produced entirely by him and his co-instructor Katie. The cons are:
- no project feedback
- The course materials do not seem well organized. The teaching sequence is Naive Bayes, SVM, decision trees, regression, clustering, feature scaling and selection, PCA, evaluation metrics. Are you kidding me? I think the correct approach is exactly the reverse order! We should get the big picture first, then narrow down to specific topics, so we don’t become narrow-minded. The catch is that a more specific topic is easier and more fun to teach!
- For Naive Bayes, Sebastian explained that Bayes was a Christian trying to use evidence to infer the existence of God, and that “naive” means the algorithm doesn’t consider the order of words. In my understanding, it can be stated more explicitly: “naive” refers to the assumption that the individual pieces of evidence are independent. And the NB algorithm is “probability-based”.
- The code was written for sklearn 0.15 (2014-11-11), then updated in 2015-10.
- The biggest problem is that the course does too much of the data preprocessing and visualization for the student, who writes a few lines of “magic code” and gets a “good feeling” about himself. But he doesn’t actually learn as much, because it’s a false feeling and he loses the big picture.
- For a beginner course, it is better to use a simpler, more transparent dataset. The Enron email dataset is too overwhelming for a beginner. See my analysis at https://github.com/jychstar/datasets
The MLND course has a slightly better structure due to its available length. However, many materials are still scrambled together. For example, in supervised learning, the SVM section is placed after Neural Networks. And the SVM section begins with the “GaTech” version, then Sebastian’s intro version. What a shame!
Anyway, if you are interested in the MLND, its sequence is:
- Model Evaluation and Validation (project: predicting Boston Housing prices)
- Supervised Learning (project: Finding Donors for Charity ML)
- Unsupervised Learning (project: Creating Customer Segments)
- Reinforcement Learning (project: Train a Smartcab to Drive)
- Deep Learning (project: build a digit recognition program)
- Capstone Project
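Naive Bayes: probabilistic classifier that applies Bayes’ theorem with a strong independence assumption between features
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, fbeta_score
clf = GaussianNB()
clf.fit(features_train, labels_train)  # train before predicting
y_pred = clf.predict(features_test)
# y_true: the true labels of the test set
print(accuracy_score(y_true, y_pred))
# beta weights recall relative to precision
print(fbeta_score(y_true, y_pred, beta=beta))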
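To make the independence assumption concrete, here is a minimal pure-Python sketch (it is not how sklearn's GaussianNB is implemented). The two classes, their priors, the per-word probabilities, and the function name `naive_bayes_posterior` are all made up for illustration.

```python
def naive_bayes_posterior(priors, word_probs, words):
    """Return P(class | words), assuming words are independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # Independence assumption: just multiply the per-word likelihoods,
            # which is also why the order of the words does not matter.
            score *= word_probs[label][word]
        scores[label] = score
    total = sum(scores.values())
    # Normalize so the posteriors sum to 1.
    return {label: score / total for label, score in scores.items()}


# Hypothetical numbers, for illustration only.
priors = {"A": 0.5, "B": 0.5}
word_probs = {
    "A": {"love": 0.1, "deal": 0.8, "life": 0.1},
    "B": {"love": 0.5, "deal": 0.2, "life": 0.3},
}
posterior = naive_bayes_posterior(priors, word_probs, ["life", "deal"])
print(posterior)
```

Swapping the word order to `["deal", "life"]` gives the same posterior, which is the “doesn’t consider the order of words” point from the lecture.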
SVM: non-probabilistic binary linear classifier that separates categories by a gap that is as wide as possible
parameters: kernel (linear, rbf), C, gamma
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
algorithm: a decision tree picks the split that maximizes information gain = entropy(parent) - weighted average of entropy(children)
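The information gain formula can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not sklearn's tree-building code; the labels and the candidate split below are made-up toy data, and `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """entropy(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy example: four samples, split into a group of three and a group of one.
parent = ["slow", "slow", "fast", "fast"]
split = [["slow", "slow", "fast"], ["fast"]]
gain = information_gain(parent, split)
print(round(gain, 4))
```

A split that separates the classes perfectly would have children with zero entropy, so its gain would equal the parent's entropy.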
k-nearest neighbors: an object is classified by a majority vote of its k nearest neighbors. Instance-based learning, lazy learning.
from sklearn.neighbors import KNeighborsClassifier
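The majority vote is simple enough to write by hand. Below is a minimal sketch, assuming squared Euclidean distance and a brute-force search (sklearn's KNeighborsClassifier offers faster search structures); the points, labels, and the name `knn_predict` are invented for the example.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: nothing is fit in advance, all work happens at query time.
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: three "slow" points near the origin, two "fast" points near (1, 1).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["slow", "slow", "slow", "fast", "fast"]
prediction = knn_predict(points, labels, (0.15, 0.15), k=3)
print(prediction)
```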
AdaBoost (Adaptive Boosting): combines individual weak learners (decision trees), with each new learner concentrating on the harder-to-classify examples. Sensitive to noisy data and outliers.
from sklearn.ensemble import AdaBoostClassifier
Random forest: constructs a multitude of decision trees and corrects a single decision tree’s tendency to overfit.
from sklearn.ensemble import RandomForestClassifier
training data: makeTerrainData()
| algorithm | parameter | No. of training points | time | accuracy |
|---|---|---|---|---|
| k nearest neighbors | n_neighbors=5 (default) | 750 | 0.389 | 0.92 |
| k nearest neighbors | n_neighbors=15 | 750 | 0.545 | 0.928 |
| k nearest neighbors | n_neighbors=3 | 750 | 0.267 | 0.936 |
| AdaBoost | n_estimators=100 | 750 | 0.686 | 0.924 |
| AdaBoost | n_estimators=10 | 750 | 0.379 | 0.916 |
Accuracy vs Training set size
More Data > Fine-tuned Algorithm
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train, net_worths_train)
pred = reg.predict(ages_test)  # input must be 2-D: one row per sample
print("slope:", reg.coef_)
print("intercept:", reg.intercept_)
test_score = reg.score(ages_test, net_worths_test)
training_score = reg.score(ages_train, net_worths_train)
print("r-squared score:", training_score, test_score)
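What slope, intercept, and r-squared mean can be checked by hand for the one-feature case. The following is a minimal least-squares sketch in plain Python, not sklearn's implementation; the ages and net worths are made-up numbers that lie exactly on a line, and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data that follows net_worth = 2 * age + 5 exactly.
ages = [20, 30, 40, 50]
net_worths = [45, 65, 85, 105]
slope, intercept = fit_line(ages, net_worths)
score = r_squared(ages, net_worths, slope, intercept)
print(slope, intercept, score)
```

With noisy data the r-squared score drops below 1, and comparing it on the training and test sets is what the two `reg.score` calls above do.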
classification vs regression
| | classification | regression |
|---|---|---|
| output type | discrete (class labels) | continuous (ordered numbers) |
| goal | decision boundary | best-fit line |
data is a list of tuples, in which the 2nd element is used for sorting.
data.sort(key=lambda tup: tup[1])
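A quick runnable check of sorting by the 2nd tuple element, with made-up data. `operator.itemgetter(1)` from the standard library is an equivalent key function.

```python
from operator import itemgetter

# Made-up (name, value) pairs.
data = [("a", 3), ("b", 1), ("c", 2)]

# In-place sort on the 2nd element (index 1) of each tuple.
data.sort(key=lambda tup: tup[1])
print(data)

# sorted() returns a new list instead; reverse=True gives descending order.
descending = sorted(data, key=itemgetter(1), reverse=True)
print(descending)
```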
import numpy
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = numpy.array([[1.], [2.], [3.]])  # one feature, three samples
re_data = scaler.fit_transform(data)  # rescaled to the range [0, 1]
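Under the hood, min-max scaling is just (x - min) / (max - min). Here is a plain-Python sketch of that formula for a single feature; the function name `min_max_scale` is hypothetical, and mapping a constant feature to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([1.0, 2.0, 3.0])
print(scaled)
```

Because the result depends on the minimum and maximum, a single outlier can squeeze all the other values into a narrow band.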
from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
clf.fit(email_list)  # learn the vocabulary
bag_of_words = clf.transform(email_list)  # sparse matrix of word counts
print(clf.vocabulary_.get('great'))  # column index of the word 'great'
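The bag-of-words idea itself fits in a few lines of standard-library Python. This sketch splits on whitespace only, so it will not match CountVectorizer exactly (sklearn's default tokenizer, for instance, drops one-letter words); the two emails and the helper name `bag_of_words_counts` are invented.

```python
from collections import Counter


def bag_of_words_counts(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


# Made-up emails.
emails = ["great great day", "have a great weekend"]

# Shared vocabulary across all emails, in alphabetical order.
vocabulary = sorted({word for email in emails for word in bag_of_words_counts(email)})

# One count vector per email; word order inside the email is discarded.
vectors = [[bag_of_words_counts(email)[word] for word in vocabulary] for email in emails]
print(vocabulary)
print(vectors)
```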
Not all words are equal. Some words contain more information than others.
Stop words (low-information, highly frequent words):
and, the, I, you, have, be, in, will
NLTK (Natural Language Toolkit)
import nltk
nltk.download()  # opens the downloader; fetch the "stopwords" corpus once
from nltk.corpus import stopwords
sw = stopwords.words("english")

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsivity")
TF-IDF: term frequency * inverse document frequency
from sklearn.feature_extraction.text import TfidfVectorizer
clf = TfidfVectorizer(stop_words="english")
tfidf = clf.fit_transform(word_data)
# newer sklearn versions (>= 1.0) provide get_feature_names_out() instead
print(clf.get_feature_names())
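To see why a word that appears everywhere gets a low weight, here is a sketch of the textbook formula tf * log(N / df) in plain Python. sklearn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ; the two tokenized documents and the function name `tf_idf` are made up.

```python
import math


def tf_idf(documents):
    """Return one {word: tf * idf} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for doc in documents:
        scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = tf * idf
        weights.append(scores)
    return weights


# Made-up tokenized documents.
docs = [["the", "meeting", "is", "today"], ["the", "deal", "is", "done"]]
weights = tf_idf(docs)
print(weights[0])
```

Words shared by every document (“the”, “is”) get weight 0, while words unique to one document keep a positive weight.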
High bias (underfitting):
- pays little attention to data
- high error on training set
- few features used
High variance (overfitting):
- pays too much attention to data
- higher error on test set than on training set
principal component analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
print(pca.explained_variance_ratio_)  # see which component has the largest variance
first_pc = pca.components_[0]
second_pc = pca.components_[1]
Validation and Evaluation Metrics
Training and testing on the whole data set would hide overfitting, so the data is randomly split into training and test sets.
from sklearn import datasets
from sklearn.model_selection import train_test_split  # sklearn >= 0.18
# from sklearn.cross_validation import train_test_split  # sklearn < 0.18
iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
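The split itself is nothing more than a seeded shuffle followed by a cut. A minimal standard-library sketch follows; it will not reproduce sklearn's exact shuffle for the same seed, and the function name `split_train_test` and the toy data are assumptions for illustration.

```python
import random


def split_train_test(features, labels, test_size=0.4, seed=0):
    """Shuffle indices with a fixed seed, then cut off `test_size` of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Toy data: ten samples with one feature each.
features = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
f_train, f_test, l_train, l_test = split_train_test(features, labels)
print(len(f_train), len(f_test))
```

Fixing the seed plays the same role as `random_state=0` above: the split is random but reproducible.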
precision score: true positives / all predicted positives
recall score: true positives / all actual positives
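Both definitions, plus the F-beta score that combines them, can be computed directly from the counts. This is a plain-Python sketch for the binary case, not sklearn's implementation; the labels below are made up, the function name `precision_recall_fbeta` is hypothetical, and returning 0.0 when a denominator is zero is an assumption.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0, positive=1):
    """Return (precision, recall, F-beta) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: report 0.0 instead of dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Made-up labels: 4 real positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]
precision, recall, f_half = precision_recall_fbeta(y_true, y_pred, beta=0.5)
print(precision, recall, f_half)
```

A beta below 1 weights precision more heavily, and a beta above 1 weights recall more heavily.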