Friday, January 6, 2017

Fundamentals of Machine Learning for Predictive Data Analytics

by John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy, 2015
The best machine learning book I have read so far! It covers very practical perspectives on machine learning problems.

1 Introduction

Because of noise and finite sampling, machine learning is an ill-posed problem: the data alone do not determine a unique solution.
In fact, searching for predictive models that are merely consistent with the dataset is equivalent to just memorizing the dataset. As a result, no learning takes place, and it tells us nothing about the underlying relationship between the descriptive and target features. If a predictive model does capture this underlying relationship, it is said to generalize well. The goal of machine learning is to find the model that generalizes best.
Machine learning is sometimes called inductive learning, because it learns a general rule from a finite set of examples. Every machine learning algorithm has an inductive bias, which is a set of assumptions. Two types of inductive bias are restriction bias and preference bias. An inductive bias is a necessary prerequisite for learning to occur; without it, a machine learning algorithm cannot learn anything beyond what is in the data.
No particular inductive bias is, on average, the best one to use (the No Free Lunch Theorem). A core skill for a data analyst is selecting the appropriate model. An inappropriate inductive bias can lead to underfitting (the model is too simple) or overfitting (the model is too complex). The goal is to strike a good balance.
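The under/overfitting trade-off can be seen in a toy sketch: a model that memorizes the training set scores perfectly on it but worse on new data, while a model that ignores the input underfits both. Everything here (the data, the names `underfit`, `overfit`, `line`) is illustrative, not from the book.

```python
import random

random.seed(0)

# Toy data: y = 2x + noise, separate train and test draws.
train = [(x, 2 * x + random.gauss(0, 0.5)) for x in range(10)]
test  = [(x, 2 * x + random.gauss(0, 0.5)) for x in range(10)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: predict the training mean, ignoring x entirely.
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit: memorize the training set (lookup table of seen x values).
table = dict(train)
overfit = lambda x: table.get(x, mean_y)

# A reasonable inductive bias: assume a line y = w*x, fit w by least squares.
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
line = lambda x: w * x

for name, m in [("underfit", underfit), ("overfit", overfit), ("line", line)]:
    print(name, round(mse(m, train), 2), round(mse(m, test), 2))
```

The memorizing model achieves zero training error, yet the simple line generalizes far better than the mean predictor and matches the underlying relationship.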
CRISP-DM, the Cross Industry Standard Process for Data Mining:
  • Business understanding. The goal of predictive data analytics projects is not building a prediction model, but things like gaining new customers, selling more products, or adding efficiencies to a process. So, during the first phase in any analytics project, the primary goal of the data analyst is to fully understand the business or organizational problem that is being addressed, and then to design a data analytics solution for it.
  • Data Understanding
  • Data preparation. Convert the required data sources into a well-formed analytics base table (ABT).
  • Modeling
  • Evaluation
  • Deployment
predictive data analytics tools
  • application-based solution: IBM SPSS, SAS
  • programming. This offers more flexibility and the newest analytics techniques, but the learning curve is steeper, and it puts an extra burden on developers to implement infrastructural support such as data management.

2 Data to insights to decisions

Albert Einstein
We cannot solve our problems with the same thinking we used when we created them
Organizations don’t exist to do predictive data analytics. They exist to do things like make more money, gain new customers, sell more products, or reduce losses. Predictions don’t solve business problems by themselves; they provide insight that helps the organization make better decisions to solve its business problems.
converting a business problem into an analytics solution:
  1. What is the business problem? What are the goals the business wants to achieve? Most of the time, organizations begin analytics projects because they have a clear issue they want to address; but sometimes it is simply because somebody in the organization feels that this is an important new technique that they should use. Unless a project is focused on clearly stated goals, it is unlikely to be successful.
  2. How does the business currently work? It is not feasible for analytics practitioners to learn everything about a business, because they move quickly between different areas. But they must possess situational fluency, so that they can use the correct terminology to build an analytics solution for that domain.

data availability

A lack of appropriate data will simply rule out proposed analytics solutions to a business problem. The easy availability of data for some solutions might favor them over others.

4 Information-based learning

decision tree

Model Ensembles

Rather than creating a single model, ensemble methods generate a set of models and then make predictions by aggregating (e.g., by voting) the outputs of these models. A prediction model composed of a set of models is called a model ensemble.
Two standard approaches:
  • boosting. Works by iteratively creating models, each biased to pay more attention to the instances misclassified by the previous model.
  • bagging (or bootstrap aggregating). Each model is trained on a bootstrap sample of the training set, drawn by random sampling with replacement. Bagging of decision trees combined with subspace sampling is called a random forest.
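Bagging can be sketched in a few lines: each base model (here a single-threshold decision stump) is trained on a bootstrap sample, and the ensemble predicts by majority vote. The dataset, the stump learner, and all names are illustrative assumptions, not the book's code.

```python
import random

random.seed(1)

# Toy 1-D dataset: label is 1 when x > 5, with ~10% of labels flipped as noise.
xs = [random.uniform(0, 10) for _ in range(100)]
data = [(x, int(x > 5)) for x in xs]
data = [(x, 1 - y) if random.random() < 0.1 else (x, y) for x, y in data]

def fit_stump(sample):
    """Pick the threshold (among sample x values) with the best accuracy
    for the rule 'predict 1 when x > threshold'."""
    best = None
    for t in (x for x, _ in sample):
        acc = sum((x > t) == bool(y) for x, y in sample) / len(sample)
        if best is None or acc > best[0]:
            best = (acc, t)
    return best[1]

def bagging(data, n_models=25):
    thresholds = []
    for _ in range(n_models):
        # Bootstrap: sample with replacement, same size as the original set.
        sample = [random.choice(data) for _ in data]
        thresholds.append(fit_stump(sample))
    # Ensemble prediction: majority vote over the stumps.
    return lambda x: int(sum(x > t for t in thresholds) > n_models / 2)

model = bagging(data)
print(model(2.0), model(8.0))
```

Because each stump sees a slightly different resample, their thresholds vary, and the vote averages out the label noise that would mislead any single stump.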

5 Similarity-based learning

The nearest neighbor algorithm is a lazy learner: it delays abstracting from the data until it is asked to make a prediction. It is relatively slow because it needs to store, and search through, a large number of instances, and it is sensitive to redundant and irrelevant descriptive features. Its advantage is robustness to concept drift, the phenomenon where the relationship between features and target changes over time.
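A minimal k-nearest-neighbor sketch makes the laziness concrete: nothing is computed at training time, and all the work happens per query. The dataset and k value are illustrative; feature normalization is omitted.

```python
import math
from collections import Counter

# Stored instances: ((feature_1, feature_2), label). No model is built here.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"),
         ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

def knn_predict(query, train, k=3):
    # Lazy learning: sort stored instances by Euclidean distance to the query.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the k nearest neighbors.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((1.1, 0.9), train))  # → "a" (near the "a" cluster)
print(knn_predict((5.1, 5.0), train))  # → "b" (near the "b" cluster)
```

The cost of every prediction is a scan over all stored instances, which is why the algorithm is slow at query time and sensitive to irrelevant features inflating the distances.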

6 Probability-based learning

Bayes’ Theorem
P(X|Y)= \frac{P(Y|X)P(X)}{P(Y)}
In the 1700s, Reverend Thomas Bayes wrote an essay that described how to update beliefs as new information arises. The modern mathematical form was developed by Laplace.
If X is the target and Y the features, Bayes’ Theorem can be used for prediction. For a categorical target X, the value X_i that maximizes P(Y|X_i)P(X_i) is chosen as the prediction.
Prediction is a kind of inverse reasoning (from evidence to event), which is often much more difficult than forward reasoning (from event to evidence); hindsight is 20/20.
The naive Bayes model is “naive” because it simply assumes the features are conditionally independent given the target. This greatly reduces the difficulty of computation. The argmax mechanism makes it robust to noise.
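The independence assumption shows up directly in a sketch: the score for each target level is the prior times a product of per-feature conditional probabilities, and the argmax over levels is returned. The toy weather data and all names are illustrative; Laplace smoothing is omitted, so an unseen feature value zeroes out a score.

```python
from collections import defaultdict

# Toy data: ((outlook, windy), play?).
data = [
    (("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
    (("rain",  "no"), "yes"), (("rain",  "yes"), "no"),
    (("sunny", "no"), "yes"), (("rain",  "no"), "yes"),
]

def fit(data):
    # Count target frequencies and (target, feature index, value) frequencies.
    prior, cond = defaultdict(int), defaultdict(int)
    for features, target in data:
        prior[target] += 1
        for i, v in enumerate(features):
            cond[(target, i, v)] += 1
    return prior, cond

def predict(features, prior, cond, total):
    best, best_score = None, -1.0
    for t, count in prior.items():
        # Naive assumption: P(features | t) factors into per-feature terms.
        score = count / total
        for i, v in enumerate(features):
            score *= cond[(t, i, v)] / count
        if score > best_score:
            best, best_score = t, score  # argmax over target levels
    return best

prior, cond = fit(data)
print(predict(("sunny", "yes"), prior, cond, len(data)))  # → "no"
```

Instead of estimating one joint distribution over all feature combinations, only a small table of per-feature counts is needed, which is the computational saving the notes mention.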

7 Error-based learning

The values chosen for the learning rate and the initial weights can have a significant impact on how the (batch) gradient descent algorithm proceeds. However, choosing them is more an art gathered through experience than a well-defined science.
The gradient descent algorithm requires the model output to be differentiable. Simple linear classification with the sign function fails this requirement, so logistic regression with the logistic function steps in to do the job:
logistic(x) = \frac{1}{1+e^{-x}}, where x is the weighted sum.
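A sketch of batch gradient descent training a 1-D logistic regression ties these two points together: the smooth logistic output gives a usable gradient, and the learning rate and iteration count below are hand-picked, illustrating the "more art than science" remark. The toy data and all names are assumptions.

```python
import math
import random

random.seed(2)

def logistic(z):
    # Differentiable squashing function, unlike the hard sign/threshold.
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: label is 1 when x > 0.
data = [(x, int(x > 0)) for x in [random.uniform(-3, 3) for _ in range(200)]]

w, b = 0.0, 0.0   # initial weights (their choice also matters, per the text)
lr = 0.5          # learning rate, chosen by trial
for _ in range(500):
    # Batch gradients of the log loss, averaged over the whole training set.
    gw = sum((logistic(w * x + b) - y) * x for x, y in data) / len(data)
    gb = sum((logistic(w * x + b) - y) for x, y in data) / len(data)
    w -= lr * gw
    b -= lr * gb

print(logistic(w * 2.0 + b) > 0.5, logistic(w * -2.0 + b) > 0.5)
```

After training, points well to the right of zero get probabilities above 0.5 and points to the left fall below it, so the learned boundary sits near x = 0.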
The support vector machine takes another approach, in which the margins are defined by the support vectors. The negative target level is set to -1 and the positive target level to +1. The kernel trick is applied to the descriptive features to move the data into a higher-dimensional space.

11 The art of ML for predictive data analytics

Sherlock Holmes
It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
Predictive data analytics projects use machine learning to build models that capture the relationships in large datasets between descriptive features and a target feature. Machine learning is a type of inductive learning, so they share some properties:
  1. the general rule induced from a sample may not be true for all instances in a population
  2. learning cannot occur unless the learning process is biased in some way, which means we need to tell the learning process what types of patterns to look for in the data. This bias is referred to as inductive bias.
  3. the outcome is also intentionally biased to suit our needs.
An analytics project is often iterative, with later stages feeding back into earlier ones in the next cycle. It is also important to remember that the purpose of an analytics project is to solve a real-world problem and to keep focused on this, rather than being distracted by the admittedly sometimes fascinating technical challenges of model building. The best way to keep an analytics project focused, and to improve the likelihood of a successful conclusion, is to adopt a structured project lifecycle such as CRISP-DM.

choosing a machine learning approach

No free lunch theorem.
A simple example, shown in figure 11.2 of the book, reveals that each machine learning algorithm has its edge: the decision boundaries learned by each algorithm are characteristic of that algorithm.
For small datasets, generative models are preferred over discriminative models, because prior structural information is encoded into generative models, which can also be used to generate data.