Update on 2017-2-9
Four months after this initial post, I have almost finished the Machine Learning Engineer NanoDegree. It is a perfect time for me to compare the two.
The pros of the “Intro” course are that it’s free, it’s fun to watch Sebastian’s self-driving car, and the videos are produced entirely by him and the friendly co-instructor Katie. The cons are:
- no project feedback
- course materials seem poorly organized. The teaching sequence is Naive Bayes, SVM, Decision Trees, Regression, Clustering, Feature Scaling and Selection, PCA, Evaluation Metrics. Are you kidding me? I think the correct approach is exactly the reverse order! We should get the big picture first, then narrow down to specific topics, so we don’t get trained into narrow-mindedness. The trick here is that a more specific topic is easier and more fun to teach!
- For Naive Bayes, Sebastian explained that Bayes was a Christian trying to use evidence to infer the existence of God, and that “naive” means the order of words is not considered. In my understanding, it can be put more explicitly: “naive” refers to the assumption that the individual pieces of evidence are independent, and the NB algorithm is “probability-based”.
- the course code targets sklearn 0.15 (2014-11-11) and was last updated in 2015-10, so it lags behind current releases
- the biggest problem is that the course does too much of the data preprocessing and visualization for the student, who writes a few lines of “magic code” and gets a “good feeling” about himself. But he doesn’t actually earn that much, because the feeling is fake and he loses the big picture.
- For a beginner course, it is better to use a simpler, more transparent dataset. The Enron email dataset is too overwhelming for a beginner; see my analysis at https://github.com/jychstar/datasets
The MLND course has a slightly better structure thanks to its greater length. However, many materials are still scrambled together. For example, in Supervised Learning, the SVM section is placed after Neural Networks, and the SVM section begins with the “GaTech” version, followed by Sebastian’s intro version. What a shame!
Anyway, if you are interested in the MLND, its sequence is:
- Model Evaluation and Validation (project: predicting Boston Housing prices)
- Supervised Learning (project: Finding Donors for Charity ML)
- Unsupervised Learning (project: Creating Customer Segments)
- Reinforcement Learning (project: Train a Smartcab to Drive)
- Deep Learning (project: build a digit recognition program)
- Capstone Project
Supervised learning
naive bayes
a probabilistic classifier that applies Bayes’ theorem with a strong (naive) independence assumption between features
from sklearn.naive_bayes import GaussianNB
accuracy
from sklearn.metrics import accuracy_score,fbeta_score
y_pred = clf.predict(features_test)
print accuracy_score(labels_test, y_pred)        # fraction of predictions that are correct
print fbeta_score(labels_test, y_pred, beta=0.5) # weighted harmonic mean of precision and recall
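Putting the two snippets above together, a minimal end-to-end sketch (the toy data and values are mine, not the course’s):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, fbeta_score

# toy training/test split, assumed for illustration
features_train = [[1, 2], [2, 1], [8, 9], [9, 8]]
labels_train = [0, 0, 1, 1]
features_test = [[1, 1], [9, 9]]
labels_test = [0, 1]

clf = GaussianNB()
clf.fit(features_train, labels_train)   # fit one Gaussian per class and feature
y_pred = clf.predict(features_test)
print accuracy_score(labels_test, y_pred)        # 1.0 on this toy split
print fbeta_score(labels_test, y_pred, beta=0.5)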
support vector machine
a non-probabilistic binary linear classifier that separates categories by a gap as wide as possible
parameters: kernel (linear, rbf), C, gamma
from sklearn.svm import SVC
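A hedged sketch of how these parameters are passed (the values are illustrative; it reuses the toy split from the Naive Bayes sketch above):

from sklearn.svm import SVC

# larger C penalizes training errors more (more complex boundary);
# larger gamma gives each training point a shorter-range influence
clf = SVC(kernel="rbf", C=10.0, gamma=0.1)
clf.fit(features_train, labels_train)
print clf.predict(features_test)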
decision tree
from sklearn.tree import DecisionTreeClassifier
algorithm: choose the split that maximizes information gain = entropy(parent) - weighted average of entropy(children)
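A tiny worked example of that formula (the toy split is mine): a parent with labels [0, 0, 1, 1] split into pure children [0, 0] and [1, 1] gains a full bit of information.

import math

def entropy(labels):
    # Shannon entropy in bits
    total = float(len(labels))
    return -sum(labels.count(c) / total * math.log(labels.count(c) / total, 2)
                for c in set(labels))

parent = [0, 0, 1, 1]
children = [[0, 0], [1, 1]]
weighted = sum(len(c) / float(len(parent)) * entropy(c) for c in children)
print entropy(parent) - weighted  # 1.0: the split separates the classes perfectly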
k nearest neighbors
An object is classified by a majority vote of its k nearest neighbors; instance-based, lazy learning.
from sklearn.neighbors import KNeighborsClassifier
adaboost
Adaptive Boosting: combines weak learners (e.g. shallow decision trees) by reweighting toward harder-to-classify examples. Sensitive to noisy data and outliers.
from sklearn.ensemble import AdaBoostClassifier
random forest
constructs a multitude of decision trees and corrects a single tree’s tendency to overfit.
from sklearn.ensemble import RandomForestClassifier
comparison
training data: makeTerrainData()
algorithm | parameter | No. of training points | time | accuracy |
---|---|---|---|---|
k nearest neighbors | n_neighbors=5 (default) | 750 | 0.389 | 0.92 |
k nearest neighbors | n_neighbors=15 | 750 | 0.545 | 0.928 |
k nearest neighbors | n_neighbors=3 | 750 | 0.267 | 0.936 |
AdaBoost | n_estimators=100 | 750 | 0.686 | 0.924 |
AdaBoost | n_estimators=10 | 750 | 0.379 | 0.916 |
random forest | | 750 | 0.389 | 0.924 |
Accuracy vs. training set size: more data > fine-tuned algorithm.
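makeTerrainData() is the course’s own helper; as a stand-in, here is a minimal sketch of how such a timing/accuracy comparison can be reproduced on a synthetic dataset (dataset choice and parameter values are mine):

from time import time
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in >= 0.18
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for clf in [KNeighborsClassifier(n_neighbors=3),
            AdaBoostClassifier(n_estimators=10),
            RandomForestClassifier()]:
    t0 = time()
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print "%s: %.3f s, accuracy %.3f" % (clf.__class__.__name__, time() - t0, acc)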
regression
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train, net_worths_train)
pred = reg.predict([[27]])[0][0]  # inputs are 2D, so unwrap the single prediction
print "slope:", reg.coef_
print "intercept:", reg.intercept_
print "r-squared score (test):", reg.score(ages_test, net_worths_test)
print "r-squared score (train):", reg.score(ages_train, net_worths_train)
classification vs regression
property | supervised classification | regression |
---|---|---|
output type | discrete(class labels) | continuous (ordered numbers) |
goal | decision boundary | best fit line |
evaluation | accuracy | r^2 |
remove outliers
data is a list of tuples; the element at index 2 (the error) is used as the sort key.
data.sort(key=lambda tup: tup[2])
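A sketch of the full cleaning step this implies, reusing the regression fit above (the 10% cut-off is my assumption, roughly following the course’s outliers mini-project):

predictions = reg.predict(ages_train)
errors = (predictions - net_worths_train) ** 2          # squared residual per point
data = list(zip(ages_train, net_worths_train, errors))  # (age, net_worth, error) tuples
data.sort(key=lambda tup: tup[2])                       # sort by error, the element at index 2
cleaned_data = data[:int(len(data) * 0.9)]              # keep the 90% with the smallest errors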
unsupervised learning
clustering
k-means
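A minimal sklearn sketch on toy data (the data is my example, not the course’s):

import numpy as np
from sklearn.cluster import KMeans

# two obvious groups, around 1.0 and 8.0
data = np.array([[1.0], [1.2], [0.8], [8.0], [8.2], [7.8]])
kmeans = KMeans(n_clusters=2, random_state=0)  # the number of clusters must be chosen up front
kmeans.fit(data)
print kmeans.labels_           # cluster index assigned to each point
print kmeans.cluster_centers_  # approximately [[1.0], [8.0]]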
feature scaling
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
import numpy
data=numpy.array([[1.],[2.],[3.]])
re_data=scaler.fit_transform(data)
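Here re_data comes out as [[0.], [0.5], [1.]]: MinMaxScaler rescales each feature to [0, 1] via x' = (x - x_min) / (x_max - x_min), so with min 1 and max 3 the middle value 2 maps to 0.5.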
Text learning
CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(email_list)                       # learn the vocabulary
bag_of_words = vectorizer.transform(email_list)  # sparse matrix of word counts per document
print vectorizer.vocabulary_.get('great')        # the feature index of 'great', not its count
Not all words are equal. Some words contain more information than others.
stopwords
(low-information, highly frequent words):
and, the, I, you, have, be, in, will
NLTK (Natural Language Toolkit)
import nltk
nltk.download()
from nltk.corpus import stopwords
sw = stopwords.words("english")
Stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsivity")  # => "respons"
TF-IDF
term frequency * inverse document frequency: roughly tf-idf(t, d) = tf(t, d) * log(N / df(t)), so a word common in one document but rare across the corpus gets the highest weight (sklearn’s version adds smoothing by default)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(word_data)  # sparse matrix: documents x tf-idf weights
print vectorizer.get_feature_names()
feature selection
high bias:
- pays little attention to data
- oversimplified
- high error on training set
- few features used
high variance:
- pays too much attention to data
- overfit
- higher error on test set
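Feature selection directly trades variance for bias by throwing features away. A minimal sklearn sketch (SelectPercentile and the synthetic data are my choices, not from the course):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=3, random_state=0)
selector = SelectPercentile(f_classif, percentile=10)  # keep the top 10% by ANOVA F-score
X_reduced = selector.fit_transform(X, y)
print X_reduced.shape  # (200, 2): 2 of the 20 features survive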
PCA
principal component analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
print pca.explained_variance_ratio_  # fraction of variance captured by each component
first_pc = pca.components_[0]        # components are sorted by explained variance
second_pc = pca.components_[1]
transformed = pca.transform(data)    # project data onto the principal components
Validation and Evaluation Metrics
cross validation
To avoid overfitting by training and testing on the same data, the data set is randomly split into training and test sets.
#from sklearn.model_selection import train_test_split # available after 0.18
from sklearn.cross_validation import train_test_split
features_train, features_test,labels_train,labels_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
GridSearchCV
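A minimal sketch in the spirit of the sklearn docs, tuning an SVC over a small parameter grid:

from sklearn import datasets, svm
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in >= 0.18

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)  # exhaustive search, cross-validated
clf.fit(iris.data, iris.target)
print clf.best_params_  # the best (kernel, C) combination found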
faces recognition (sklearn’s eigenfaces example: PCA + SVM, tuned with GridSearchCV)
Evaluation Metrics
precision score: true positives / all predicted positives
recall score: true positives / all actual positives
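These combine into the F-beta score used earlier: F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). A quick check with sklearn (the toy labels are mine):

from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # 2 true positives, 1 false positive, 1 false negative
print precision_score(y_true, y_pred)      # 2/3
print recall_score(y_true, y_pred)         # 2/3
print fbeta_score(y_true, y_pred, beta=1)  # beta=1 reduces to the F1 score: 2/3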