Monday, March 13, 2017

2nd Xap I built, AWS S3

Last mile vs first mile

The previous post showed how I built my 1st Xap. I deliberately avoided direct file import. My logical thinking, reconstructed, was as follows:
  1. One line of d3 code, d3.csv(file, callback), already integrates 3 steps: open the local or remote file by its string name or URL, parse the file into JSON-like rows, and call back a function to process the data (see the sketch after this list).
  2. Although the first 2 steps have corresponding Exaptive components, building a pure step-3 component could be more difficult than building a one-stop-shopping component.
  3. At that time, I hadn’t figured out exactly what the data format is as it flows between components and is sent to the dimple function.
  4. To get d3.csv to work, I came up with a nerdy approach: host the dataset in my own GitHub.
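Here is a minimal sketch of that one-liner, in the d3 v3/v4 callback style; "data.csv" and draw() are placeholders rather than code from the Xap:
d3.csv("data.csv", function(error, data) {
    if (error) throw error;   // step 1 failed: the file could not be opened
    console.log(data[0]);     // step 2: each row has been parsed into an object
    draw(data);               // step 3: the callback hands the parsed rows to a processing function
});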
So my 1st Xap focused on data visualization, the last mile for a data scientist. “Begin with the end in mind” is a hard lesson I learned over the past years. After that, I turned my attention to the first mile: I want to be very clear about the data format at each step, and I would like to peek into every black box.

Gear up: WebStorm, AWS S3

Then I looked for a JavaScript IDE, something like the Jupyter notebook in Python. WebStorm caught my attention. It offers a 30-day free trial and is free for 2 years for students and teachers, training, and open-source projects. Even for a standard individual customer, the annual fee of $59 is affordable.
After playing with it for a while, I realized WebStorm is good for developing complex functions over simple inputs, that is, pure JavaScript coding. But for a web-based application, you always have to deal with file input and script input. If you don’t have an HTML file, powerful libraries like jQuery and D3 sit on the bench, because they were born to manipulate the DOM.
So I returned to my previous tool set: start up a local server with Python (e.g., python -m SimpleHTTPServer), write HTML/JavaScript in Atom, and check the visual result in Chrome.
Once I handcraft my HTML pages, I would like a remote host to display them. Years ago Google Drive provided such a service for free; now the hosting business has been taken over by AWS. I am surprised to see AWS provides up to 17 categories of services, such as compute, storage, database, developer tools, management tools, analytics, AI, and Internet of Things. S3 (Simple Storage Service) is only a tiny piece of that business. By the way, the AWS page looks ugly. I guess it is like a warehouse shopping center: the target customers care more about price and stability than a sexy face.
Here are my examples:
Note: For security reasons, the browser blocks the JavaScript code. To unblock it in Chrome, there is a shield icon; click on it and choose “Load unsafe scripts”. In Firefox, there is an “i” icon to fix it. In Safari, sorry, I don’t know how to solve it.

Papa.parse

Papa Parse seems to be one of the most popular JavaScript libraries for this parsing job: about 4,000 stars on GitHub, with major development around 2014. By the way, that was the time when the author, Matt Holt, was working at SmartyStreets (a company providing address info) and was an undergraduate student at Brigham Young University.
I love the style of its official website, http://papaparse.com/. It uses friendly dialogues to provide case-by-case solutions, so users can quickly pin down what they want, whether that is parsing a CSV-format string, a local file, or a remote file, or even converting JSON back to CSV. It is actually smart enough to find the right delimiter by scanning the first few rows:
var results = Papa.parse(csvString);
console.log(results.meta.delimiter);
However, there is a pitfall when parsing a local file:
Papa.parse(file, {
    complete: function(results) {
        console.log(results);
    }
});
According to the documentation, file is a File object obtained from the DOM. You can’t just pass “data.csv” and hope it does the magic: Papa will treat “data.csv” as a plain string and tell you it can’t find a delimiter. In this sense, d3 is much smarter, in that d3.csv("data.csv", callback) works with a file path.
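As a hedged aside (not used in this Xap): Papa Parse can also take a URL string if you set the download option, which tells it to fetch the file first instead of parsing the string itself.
Papa.parse("data.csv", {
    download: true,   // treat the first argument as a URL to download, not as raw CSV text
    header: true,
    complete: function(results) {
        console.log(results.data.length + " rows parsed");
    }
});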
Anyway, the lower-level nature of Papa.parse means it has more flexibility to manipulate the data stream. In a GitHub issue discussion, Holt provides some code that uses a jQuery AJAX call to fetch the file as text:
$.get("/basic_charts/train.csv", function(text){
    var data = Papa.parse(text);
    console.log(data);
});
However, the problem with this AJAX request is that the whole file is loaded into memory; if the file is too large, the browser crashes. Alternatively, Papa uses HTML5’s FileReader API to “stream” in the file if an <input type="file"> element is used in the HTML page. Thanks to Raffael Vogler for providing a pain-free tutorial:
<script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/4.1.2/papaparse.js"></script>
<script src="https://code.jquery.com/jquery-3.1.1.min.js"></script>
<script>
function handleFileSelect(evt) {
    var file = evt.target.files[0];
    Papa.parse(file, {
        header: true,
        dynamicTyping: true,
        complete: function(results) {
            console.log(JSON.stringify(results, null, 2));
        }
    });
}
$(document).ready(function(){
    $("#csv-file").change(handleFileSelect);
});
</script>
<!--To get started we need a button to open a file: -->
<input type="file" id="csv-file" name="files"/>
console.log() alone is only able to print simple data such as strings, numbers, or arrays. JSON.stringify is a very powerful debugging tool with which you can print the original structure of an object. The only thing it can’t print is a function.
Debug findings:
  1. jQuery’s change() is triggered by any change made to <input>, <textarea>, and <select> elements.
  2. evt is a jQuery event object with 9 top-level keys; one of them is target. Under target is the file input element, whose files list contains the File object passed to Papa.
  3. results has 3 top-level keys: data, errors, and meta. The JSON array is under the data key (see the sketch after this list).
  4. The third parameter of JSON.stringify adds indentation for pretty printing.
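A hedged sketch of what results looks like for a header: true, dynamicTyping: true parse of the Titanic CSV (the two rows shown are from the public training set, trimmed to a few columns):
var results = {
    data: [
        { PassengerId: 1, Survived: 0, Pclass: 3, Sex: "male", Age: 22 },
        { PassengerId: 2, Survived: 1, Pclass: 1, Sex: "female", Age: 38 }
    ],
    errors: [],
    meta: { delimiter: ",", aborted: false, fields: ["PassengerId", "Survived", "Pclass", "Sex", "Age"] }
};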
If the file is written in JSON format or as a JSON array, a plain XMLHttpRequest plus JSON.parse does the job:
var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function() {
    if (this.readyState == 4 && this.status == 200) {
        var myObj = JSON.parse(this.responseText); // parse the raw text into an object
        console.log(JSON.stringify(myObj));
    }
}; // define action
xmlhttp.open("GET", "data.txt", true);
xmlhttp.send();
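As a hedged aside, the same request can be written more compactly with the Fetch API, which current Chrome and Firefox support:
fetch("data.txt")
    .then(function(response) { return response.json(); })            // parse the response body as JSON
    .then(function(myObj) { console.log(JSON.stringify(myObj)); });  // same output as above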

TitanicXap2

The ready-to-use Xap is here, and you can download the CSV file from here. Drop the file onto it to see the magic.
The block diagram is:
FileDropTarget:
    CSVParser:
        dimple_barplot
        Titanic_python: pie_plotly
I am going to document my source code.
For the whole Xap, the page DOM is configured in Edit-style-HTML. We can see there are three visible objects:
<div data-node="FileDropTarget_0"></div>
<div data-node="dimple_barplot_0" id="barplot"></div>
<div data-node="pie_plotly_0" id = "pie"></div>

dimple_barplot

Input: data: multiset
spec: dependencies: jQuery, d3, dimple
script: the code below goes inside export default { data() { ... } };
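For orientation, a hedged sketch of that wrapper (the pie_plotly component later in this post shows the same pattern in full):
export default {
    data() {
        // the dimple_barplot code listed below goes here
    }
};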
let d3 = this.api.imports.d3;
let dimple = this.api.imports.dimple;
let data = this.api.inputState.export().data; // JSON Array
//this.api.log("I am Cool: ", JSON.stringify(data)); // test the raw data
function draw(data) {
  "use strict";
  var svg = dimple.newSvg("#barplot", 600, 400);  // svg inside id barplot with size (600,400)
  svg.append("text").
  attr("x", 300).attr("y", 20).attr("text-anchor", "middle").style("font-size", "20px").style("font-weight", "bold").text("Titanic Survivor");
  var myChart = new dimple.chart(svg, data);
  var x = myChart.addCategoryAxis("x", "Pclass");
  var y = myChart.addPctAxis("y", "Survived");
  var s = myChart.addSeries("Sex", dimple.plot.bar);

  s.afterDraw = function (shape, data) {
    var s = d3.select(shape),
      rect = {
        x: parseFloat(s.attr("x")),
        y: parseFloat(s.attr("y")),
        width: parseFloat(s.attr("width")),
        height: parseFloat(s.attr("height"))
      };
    if (rect.height >= 8) {
      svg.append("text")
        .attr("x", rect.x + rect.width / 2)
        .attr("y", rect.y + rect.height / 2 + 3.5)
        .style("text-anchor", "middle")
        .style("font-size", "10px")
        .style("font-family", "sans-serif")
        .style("opacity", 0.6)
        .style("pointer-events", "none")
        .text(data.yValue);
    }
  }; // end s.afterDraw
  myChart.addLegend(150, 10, 380, 20, "right");
  myChart.draw();
}// end function draw
draw(data); // call function and feed data with JSON Array format
The majority of the code is from the dimple GitHub examples.

Titanic_python

Input: data: dynamic
output: feature_weight: entity
Spec. Tricks:
  1. sklearn is based on scipy;
  2. json can’t be installed by pip (it is part of the Python standard library);
  3. pandas is 0.19.2, numpy is 1.12, sklearn is 0.18
"dependencies":{
      "apt": [],
    "pip": [
      {"path": "numpy"},
      {"path": "pandas"},
      {"path": "scikit-learn"},
      {"path": "scipy"}
    ],
    "file":[]
},
script:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,fbeta_score
import numpy as np
import sklearn
import pandas as pd

def data(self):
    data = self.api.inputstate.export()['data'] 
    # self.api.log("I am Cool: ", data) # test data
    df = pd.DataFrame(data)  # convert JSON array to dataframe
    df['Age'] = df['Age'].apply(pd.to_numeric, args=('coerce',)) # convert to number
    train_data = df

    # Data preprocessing
    # use median age to fill missing value
    def get_median_ages(df):
        median_ages = np.zeros((2,3))
        for j in range(0, 3):
            median_ages[0,j] = df[(df['Sex'] == 'female') & \
                                  (df['Pclass'] == j+1)]['Age'].dropna().median()
            median_ages[1,j] = df[(df['Sex'] == 'male') & \
                                  (df['Pclass'] == j+1)]['Age'].dropna().median()
        return median_ages
    def data_clean(df, median_ages):
        df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
        # use median age to fill the missing data
        for i in range(0, 2):
            for j in range(0, 3):
                  df.loc[ (df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1),\
                        'Age'] = median_ages[i,j]

        droplist = ['Name','Ticket','Cabin','Embarked','Sex'] # reserve ID for check
        features = df.drop(droplist, axis = 1)
        return features

    median_ages = get_median_ages(train_data)
    train_cleaned = data_clean(train_data,median_ages)
    features = train_cleaned.drop(['PassengerId','Survived'],axis=1)
    labels = train_data ['Survived']

    # machine learning to predict a target feature: Survived
    # use train-test split to generate 2 sets of data
    X_train, X_test,y_train,y_test = train_test_split(features, labels, test_size=0.3, random_state=0)
    clf=DecisionTreeClassifier()
    clf.fit(X_train,y_train)
    pred=clf.predict(X_test)
    self.api.log("test score:",accuracy_score(y_test, pred))  # 0.81, seems good

    feature_weight={}
    for i,key in enumerate(X_train.columns.values):
        feature_weight[key] = clf.feature_importances_[i]
    self.api.output("feature_weight", feature_weight)
Note that df = pd.DataFrame(data) converts the JSON array into a DataFrame. This is the most important piece of glue code connecting JavaScript to Python.
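For reference, a hedged sketch (written as a JavaScript object, with invented numbers) of the feature_weight entity this component emits; the six keys are the features left after cleaning, and the real values come from clf.feature_importances_:
var feature_weight = {
    Pclass: 0.13,
    Age: 0.27,
    SibSp: 0.05,
    Parch: 0.04,
    Fare: 0.30,
    Gender: 0.21
};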

pie_plotly

Initially I tried dimple for an hour or so, but got a lot of bugs. Then I switched to plotly, and it was awesome!
input: data: entity
spec:
"dependencies": {
    "file": [
        {
            "type": "js",
            "path": "https://cdn.plot.ly/plotly-latest.min.js",
            "name": "Plotly"
        },
        {
            "type": "js",
            "path": "https://d3js.org/d3.v4.min.js",
            "name": "d3"
        },
        {
            "type": "js",
            "path": "https://cdnjs.cloudflare.com/ajax/libs/numeric/1.2.6/numeric.min.js",
            "name": "numeric"
        }
    ]
},
script:
export default {
    data() {
    let data = this.api.inputState.export().data;
    let d3 = this.api.imports.d3;
    let Plotly = this.api.imports.Plotly; 

    var labels = [];
    var values = [];
    var row;
    for (row in data) {        
        labels.push(String(row));
        values.push(data[row]);
    }

    var d = {}; // dictionary to store labels, values, type
    d.labels = labels;
    d.values = values;
    d.type  = "pie";

    var layout = {
      width: 500,
      height: 400,
      title : "Feature importance",
      titlefont : 20
    };

    var li = [d];
    Plotly.newPlot('pie', li, layout);
    this.api.log(labels[0]);
    }
};