Thursday, December 8, 2016

Machine Learning ND 0, Model Evaluation, Kaggle, Machine Intelligence 3.0

Udacity also provides a free “Intro to Machine Learning” course; see my previous post.
A quick overview of the machine learning algorithms covered:
  • Decision Trees
  • Naive Bayes
  • Gradient Descent
  • Linear Regression
  • Support Vector Machines
  • Neural Networks
  • k-Means Clustering
  • Hierarchical Clustering
Tips:
  1. Stick to your schedule and work regularly.
  2. Be relentless in searching for an answer on your own. The struggle and search are where you learn the most. If you come across a term you don’t understand, spend time reading up on it.
  3. Be an active member of your community.

Model Evaluation and Validation

(some course materials are from “Intro to Machine Learning”)

Statistical Analysis

Interquartile range: IQR = Q3 − Q1
A value x is an outlier if x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR.
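
A minimal sketch of this outlier rule in Python (NumPy assumed; the sample data is made up for illustration):

import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 30])  # hypothetical sample

# Quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [30]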

Evaluation Metrics

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
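
A quick sketch of how these metrics are called (the labels and predictions below are invented for illustration):

from sklearn.metrics import mean_absolute_error, mean_squared_error, f1_score

# Regression metrics compare predicted values against ground truth
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_squared_error(y_true, y_pred))   # 0.375

# F1 is a classification metric balancing precision and recall
labels = [0, 1, 1, 1, 0, 1]
preds  = [0, 1, 0, 1, 0, 1]
print(f1_score(labels, preds))  # 0.857...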
In many cases, such as disease diagnosis, we care less about true negatives because they are too common; we care more about true positives. Performance on positives can be further divided into positive predictive value (precision) and sensitivity (recall).
E.g., when a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. In this case, precision measures how useful the results are, and recall measures how complete they are.
F_\beta = \frac{(1+\beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}
F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
The F1 score is sometimes criticized because recall and precision are weighted evenly. F0.5 weights precision more heavily, while F2 weights recall more heavily.
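
A small sketch with sklearn’s fbeta_score showing how beta shifts the balance (toy labels invented for illustration):

from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 1, 0]  # a model with precision 2/3 and recall 1/2

print(precision_score(y_true, y_pred))        # 0.667
print(recall_score(y_true, y_pred))           # 0.5
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.625, leans toward precision
print(fbeta_score(y_true, y_pred, beta=2))    # 0.526, leans toward recall

Since precision is higher than recall here, F0.5 scores this model above F2.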

Causes of Error

Bias-variance dilemma:

high bias

a model being unable to represent the complexity of the underlying data
pays too little attention to the data; oversimplified; high error on the training set

high variance

a model being overly sensitive to the limited data it has been trained on
pays too much attention to the data; overfits; higher error on the test set than on the training set
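
A minimal sketch of the dilemma, fitting an underfit and an overfit polynomial to noisy data (the degrees and the synthetic data are assumptions chosen for illustration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)  # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # high bias: both errors high; high variance: test error >> train error
    print(degree, round(train_err, 3), round(test_err, 3))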

Kaggle

Ben Hamner:
  • I learn best through creative projects, not lectures.
  • I learn R/Python analytics APIs best through well-crafted examples, not docs.
  • Joining competitions can be quite addictive. When you make your first submission, you see everyone above you, which makes you ask: what are these people above me doing? How can I do better than them? That question keeps driving you to do better and better on the problem, and it really forces you to explore the scope of supervised machine learning and the different methodologies and approaches you can use. That means you are not trying one or two approaches; you are trying a thousand to figure out which make a really big difference in model performance.
  • There are two categories of people approaching these problems. Some spend two months trying to develop one great idea, implement it in code, and see how it performs. It usually doesn’t work out.
  • The correct mindset is: I have a lot of different ideas that I think might work out; I want to experiment with them and explore how they work. I want to get through as many of these ideas as possible to find the couple that really matter and make a big difference. One common pattern among winners is that they make 100 to 1,000 submissions: they get through the iterative loop of producing a new result and learning from how it performed very quickly. They optimize their environments and workflows to get through that loop as fast as possible.
The gaming world offers a perfect place to start machine intelligence work (e.g., constrained environments, explicit rewards, easy-to-compare results, looks impressive), especially for reinforcement learning.
Moral of the story: use a simple model you can understand. Only then move on to something more complex, and only if you need to.
An applied science lab is a big commitment. Data can often be quite threatening to people who prefer to trust their instincts. R&D has a high risk of failure, and unusually high levels of perseverance are table stakes. Do some soul-searching: will your company really accept this culture?
Data is not a commodity; it needs to be transformed into a product before it’s valuable. Many respondents told me of projects that started without any idea of who their customer is or how they are going to use this “valuable data”. The answer came too late: “nobody” and “they aren’t”.
