Last mile vs first mile
The previous post showed how I built my 1st Xap. I deliberately avoided importing a file directly. My reasoning, as far as I can reconstruct it, was as follows:
- One line of d3 code, d3.csv(file, function), already integrates 3 steps: open the local or remote file by its name or URL, parse the file contents into a JSON array, and call a function back to process the data.
- Although the first 2 steps have corresponding Exaptive components, building a pure step-3 component could be more difficult than building a one-stop-shopping component.
- At that time, I hadn’t figured out exactly what the data format is while it flows between components and is sent to the dimple function.
- To get d3.csv to work, I came up with a nerdy approach: host the dataset on my own GitHub.
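The three steps bundled into d3.csv can be sketched in plain JavaScript. This is only an illustration, not d3's implementation: parseCsvText and loadCsvText are hypothetical names, the text is passed in directly instead of being fetched, and the naive split on commas does not handle quoted fields (such as the passenger names in the Titanic file).

```javascript
// Step 2 (hypothetical helper): turn CSV text into an array of row objects.
// Assumes a header row and no quoted fields or embedded commas.
function parseCsvText(text) {
  var lines = text.split(/\r?\n/).filter(function (line) {
    return line.trim().length > 0;
  });
  if (lines.length === 0) {
    return [];
  }
  var header = lines[0].split(",");
  return lines.slice(1).map(function (line) {
    var cells = line.split(",");
    var row = {};
    header.forEach(function (key, i) {
      row[key] = cells[i]; // values stay strings, as with d3.csv
    });
    return row;
  });
}

// Steps 1 and 3 (hypothetical wrapper): step 1 would normally fetch the text
// by file name or URL; here the text is already in hand, so we only parse it
// and hand the rows to the callback.
function loadCsvText(text, callback) {
  callback(parseCsvText(text));
}

var sample = "Pclass,Sex,Survived\n3,male,0\n1,female,1";
var rows = [];
loadCsvText(sample, function (data) {
  rows = data;
  console.log(JSON.stringify(data));
});
```

The point of the sketch is the shape of the result: an array with one object per row, keyed by the header names. That is the format a step-3 component would have to accept.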
So my 1st Xap focused on data visualization, the last mile for a data scientist. “Begin with the End in Mind” is a hard lesson I learned over the past years. After that, I turned my attention to the first mile. I want to be very clear about the data format at each step, and I would like to peek into every black box.
Gear up: WebStorm, AWS S3
Then I looked for a JavaScript IDE, something like the Jupyter notebook for Python. WebStorm caught my attention. It offers a 30-day free trial and is free for 2 years for students and teachers, training and open source projects. Even for a standard individual customer, the annual fee of $59 is affordable.
After playing with it for a while, I realized WebStorm is good for developing a complex function with a simple input, that is, for pure JavaScript coding. But for a web-based application, you always have to deal with file input and script input. If you don’t have an html file, powerful libraries like jQuery and D3 sit on the bench, because they are born to manipulate the DOM.
So I returned to my previous tool set: start up a local host with Python, use Atom to write the html/JavaScript code and check the visual effect in Chrome.
Once I have handcrafted my html pages, I would like a remote host to display them. Years ago Google Drive provided such a service for free. Now the hosting business is taken by AWS. I was surprised to see that AWS provides up to 17 categories of services, such as compute, storage, database, developer tools, management tools, analytics, AI and Internet of Things. S3 (Simple Storage Service), which can also host static websites, is only a tiny part of that business. By the way, the AWS page looks ugly. I guess that is because it is like a warehouse shopping center: the target customers care more about price and stability than about a sexy face.
The entrance is: https://console.aws.amazon.com/s3
My endpoint is: http://jychstar.s3-website-us-west-2.amazonaws.com/
Here are my examples:
Note: For security reasons, browsers block the JavaScript code. To unblock it in Chrome, click the shield logo and choose “load unsafe scripts”. In Firefox, there is an “i” sign that offers the fix. In Safari, sorry, I don’t know how to solve it.
Papa.parse
Papa Parse seems to be one of the most popular JavaScript libraries for the parsing job, with 4000 stars on GitHub. Its major development happened around 2014. By the way, this implies it was written at the time when the author, Matt Holt, was working at SmartyStreets (a company providing address info) while an undergraduate student at Brigham Young University.
I love the style of its official website: http://papaparse.com/ It uses friendly dialogues to provide case-by-case solutions, so users can quickly pin down what they want, whether it is parsing a csv-format string, a local file or a remote file, or even converting JSON back to csv. It is even smart enough to find the right delimiter by scanning the first few rows.
var results = Papa.parse(csvString);
console.log(results.meta.delimiter);
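How this kind of delimiter sniffing might work can be sketched as follows. This is a hypothetical heuristic, not Papa's actual implementation: guessDelimiter, its candidate list and the row limit are made up for illustration. It picks the candidate that splits the first few rows into the same number of fields, preferring the one that yields the most fields.

```javascript
// Hypothetical heuristic: pick the delimiter that splits the first rows
// into a consistent number of fields (more than one).
function guessDelimiter(text, candidates, maxRows) {
  candidates = candidates || [",", "\t", "|", ";"];
  maxRows = maxRows || 10;
  var rows = text.split(/\r?\n/).filter(function (line) {
    return line.length > 0;
  }).slice(0, maxRows);

  var best = null;
  var bestFields = 1; // a delimiter must produce at least 2 fields to count
  candidates.forEach(function (delim) {
    var counts = rows.map(function (line) {
      return line.split(delim).length;
    });
    var consistent = counts.every(function (n) {
      return n === counts[0];
    });
    if (consistent && counts[0] > bestFields) {
      best = delim;
      bestFields = counts[0];
    }
  });
  return best; // null when no candidate splits the rows consistently
}

console.log(guessDelimiter("Pclass;Sex;Age\n3;male;22\n1;female;38"));
console.log(guessDelimiter("data.csv"));
```

The second call shows why a bare file name fails under such a scheme: the string "data.csv" contains none of the candidate delimiters, so there is nothing to detect.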
However, there is a pitfall when parsing a local file:
Papa.parse(file, {
    complete: function(results) {
        console.log(results);
    }
});
According to the documentation, file is a File object obtained from the DOM. You can’t just pass “data.csv” and hope it does the magic: Papa will treat “data.csv” as a plain string to parse and tell you it can’t find a delimiter. In this sense, d3 is smarter, in that d3.csv("data.csv", callback) works with just the file name when the page is served by a local host.
Anyway, the lower-level nature of Papa.parse means it has more flexibility to manipulate the data stream. In a GitHub issue discussion, Holt provides some code that uses a jQuery AJAX call to fetch the file and pass its text to Papa:
$.get("/basic_charts/train.csv", function(text){
    var data = Papa.parse(text);
    console.log(data);
});
However, the problem with this AJAX request is that the whole file is loaded into memory. If the file is too large, the browser may crash. Alternatively, Papa uses HTML5’s FileReader API to read the file (and can “stream” it in pieces when a step or chunk callback is configured) if an <input type="file"> element is used in the HTML file. Thanks to Raffael Vogler for providing a pain-free tutorial:
<script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/4.1.2/papaparse.js"></script>
<script src="https://code.jquery.com/jquery-3.1.1.min.js"></script>
<script>
function handleFileSelect(evt) {
    var file = evt.target.files[0]; // File object picked by the user
    Papa.parse(file, {
        header: true,
        dynamicTyping: true,
        complete: function(results) {
            console.log(JSON.stringify(results, null, 2));
        }
    });
}
$(document).ready(function(){
    $("#csv-file").change(handleFileSelect);
});
</script>
<!--To get started we need a button to open a file: -->
<input type="file" id="csv-file" name="files"/>
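console.log() prints simple values such as strings, numbers and arrays readably, but a nested object often shows up as a collapsed entry, or as [object Object] when it is concatenated with a string. JSON.stringify is a very powerful debugging tool because it prints the full structure of an object as text. The only things it can’t print are functions, which are silently dropped.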
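A small sketch of that behaviour, using a made-up record object (the field names are illustrative only, not taken from the Xap):

```javascript
// Made-up object for illustration: two plain fields, an array and a function.
var record = {
  name: "Braund",
  age: 22,
  tags: ["a", "b"],
  greet: function () { return "hi"; }
};

// String concatenation loses the structure entirely.
var concatenated = "record: " + record;
console.log(concatenated); // record: [object Object]

// JSON.stringify keeps the data but drops the function.
var compact = JSON.stringify(record);
console.log(compact); // {"name":"Braund","age":22,"tags":["a","b"]}

// The third argument sets the indentation used for pretty printing.
var pretty = JSON.stringify(record, null, 2);
console.log(pretty);
```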
Debug findings:
- jQuery change() is triggered by any change made to <input>, <textarea> and <select> elements.
- evt is a jQuery event object with 9 top-level keys, one of which is target. Under target sits the files list that holds the selected File object.
- results has 3 top-level keys: data, errors, meta. The JSON array is under the data key.
- Add an indentation value as the third stringify parameter for pretty printing.
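If the file is already written in JSON format (an object or a JSON array), no CSV parser is needed; a plain XMLHttpRequest plus JSON.parse is enough:
var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function() {
    // readyState 4 = request finished, status 200 = OK
    if (this.readyState === 4 && this.status === 200) {
        var myObj = JSON.parse(this.responseText); // parse the raw text
        console.log(JSON.stringify(myObj));
    }
}; // define the action before sending
xmlhttp.open("GET", "data.txt", true); // true = asynchronous
xmlhttp.send();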
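JSON.parse throws when the text is not valid JSON, for example when a csv file is requested by mistake. A minimal sketch of a guard, assuming a hypothetical helper name parseJsonText and made-up sample strings:

```javascript
// Hypothetical helper: wrap JSON.parse so that bad input yields a result
// object instead of an uncaught exception.
function parseJsonText(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

// A JSON array parses into a JavaScript array of row objects.
var good = parseJsonText('[{"Pclass": 3, "Sex": "male"}]');
console.log(good.ok, Array.isArray(good.value));

// CSV text is not JSON, so the parse fails and ok is false.
var bad = parseJsonText("Pclass,Sex\n3,male");
console.log(bad.ok);
```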
TitanicXap2
The ready-to-use Xap is here, and you can download the csv file from here. Drop the file to see the magic.
The block diagram is:
FileDropTarget:
CSVParser:
dimple_barplot
Titanic_python: pie_plotly
I am going to document my source code.
For the whole Xap, the page DOM is configured in Edit-style-HTML. We can see there are three visible objects.
<div data-node="FileDropTarget_0"></div>
<div data-node="dimple_barplot_0" id="barplot"></div>
<div data-node="pie_plotly_0" id = "pie"></div>
dimple_barplot
Input: data: multiset
spec: dependencies: jQuery, d3, dimple
script: the code below goes inside the data() handler of
export default { data() { /* script goes here */ } };
let d3 = this.api.imports.d3;
let dimple = this.api.imports.dimple;
let data = this.api.inputState.export().data; // JSON Array
//this.api.log("I am Cool: ", JSON.stringify(data)); // test the raw data
function draw(data) {
    "use strict";
    var svg = dimple.newSvg("#barplot", 600, 400); // svg inside id barplot with size (600,400)
    // chart title, centered at the top
    svg.append("text")
        .attr("x", 300)
        .attr("y", 20)
        .attr("text-anchor", "middle")
        .style("font-size", "20px")
        .style("font-weight", "bold")
        .text("Titanic Survivor");
    var myChart = new dimple.chart(svg, data);
    var x = myChart.addCategoryAxis("x", "Pclass");
    var y = myChart.addPctAxis("y", "Survived");
    var s = myChart.addSeries("Sex", dimple.plot.bar);
    // after each bar is drawn, write its value in the middle of the bar
    s.afterDraw = function (shape, data) {
        var s = d3.select(shape),
            rect = {
                x: parseFloat(s.attr("x")),
                y: parseFloat(s.attr("y")),
                width: parseFloat(s.attr("width")),
                height: parseFloat(s.attr("height"))
            };
        // only label bars that are tall enough to hold the text
        if (rect.height >= 8) {
            svg.append("text")
                .attr("x", rect.x + rect.width / 2)
                .attr("y", rect.y + rect.height / 2 + 3.5)
                .style("text-anchor", "middle")
                .style("font-size", "10px")
                .style("font-family", "sans-serif")
                .style("opacity", 0.6)
                .style("pointer-events", "none")
                .text(data.yValue);
        }
    }; // end s.afterDraw
    myChart.addLegend(150, 10, 380, 20, "right");
    myChart.draw();
} // end function draw
draw(data); // call function and feed data with JSON Array format
Most of the code comes from the dimple GitHub repository.
Titanic_python
Input: data: dynamic
output: feature_weight: entity
SPEC. Tricks:
- sklearn is built on scipy, so scipy is listed as a dependency too;
- json can’t be installed by pip (it is part of the Python standard library);
- pandas is 0.19.2, numpy is 1.12, sklearn is 0.18
"dependencies":{
"apt": [],
"pip": [
{"path": "numpy"},
{"path": "pandas"},
{"path": "scikit-learn"},
{"path": "scipy"}
],
"file":[]
},
script
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,fbeta_score
import numpy as np
import sklearn
import pandas as pd
def data(self):
    data = self.api.inputstate.export()['data']
    # self.api.log("I am Cool: ", data) # test data
    df = pd.DataFrame(data) # convert JSON array to dataframe
    df['Age'] = df['Age'].apply(pd.to_numeric, args=('coerce',)) # convert to number
    train_data = df
    # Data preprocessing
    # use median age to fill missing value
    def get_median_ages(df):
        median_ages = np.zeros((2,3))
        for j in range(0, 3):
            median_ages[0,j] = df[(df['Sex'] == 'female') & \
                (df['Pclass'] == j+1)]['Age'].dropna().median()
            median_ages[1,j] = df[(df['Sex'] == 'male') & \
                (df['Pclass'] == j+1)]['Age'].dropna().median()
        return median_ages
    def data_clean(df, median_ages):
        df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
        # use median age to fill the missing data
        for i in range(0, 2):
            for j in range(0, 3):
                df.loc[ (df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1),\
                    'Age'] = median_ages[i,j]
        droplist = ['Name','Ticket','Cabin','Embarked','Sex'] # reserve ID for check
        features = df.drop(droplist, axis = 1)
        return features
    median_ages = get_median_ages(train_data)
    train_cleaned = data_clean(train_data,median_ages)
    features = train_cleaned.drop(['PassengerId','Survived'],axis=1)
    labels = train_data['Survived']
    # machine learning to predict a target feature: Survived
    # use train-test split to generate 2 sets of data
    X_train, X_test,y_train,y_test = train_test_split(features, labels, test_size=0.3, random_state=0)
    clf=DecisionTreeClassifier()
    clf.fit(X_train,y_train)
    pred=clf.predict(X_test)
    self.api.log("test score:",accuracy_score(y_test, pred)) # 0.81, seems good
    feature_weight={}
    for i,key in enumerate(X_train.columns.values):
        feature_weight[key] = clf.feature_importances_[i]
    self.api.output("feature_weight", feature_weight)
Note the line df = pd.DataFrame(data), which converts the JSON array to a DataFrame. This is the most important glue code that connects JavaScript to Python.
pie_plotly
plotly documentation: https://plot.ly/javascript/reference/
Initially I tried dimple for an hour or so, but ran into a lot of bugs. Then I switched to plotly and it was awesome!
input: data: entity
spec:
"dependencies": {
"file": [
{
"type": "js",
"path": "https://cdn.plot.ly/plotly-latest.min.js",
"name": "Plotly"
},
{
"type": "js",
"path": "https://d3js.org/d3.v4.min.js",
"name": "d3"
},
{
"type": "js",
"path": "https://cdnjs.cloudflare.com/ajax/libs/numeric/1.2.6/numeric.min.js",
"name": "numeric"
}
]
},
script:
export default {
    data() {
        let data = this.api.inputState.export().data; // entity: {feature: weight}
        let d3 = this.api.imports.d3;
        let Plotly = this.api.imports.Plotly;
        var labels = [];
        var values = [];
        var row;
        // split the entity into parallel arrays of keys and weights
        for (row in data) {
            labels.push(String(row));
            values.push(data[row]);
        }
        var d = {}; // dictionary to store labels, values, type
        d.labels = labels;
        d.values = values;
        d.type = "pie";
        var layout = {
            width: 500,
            height: 400,
            title : "Feature importance",
            titlefont : { size: 20 } // titlefont expects a font object, not a bare number
        };
        var li = [d];
        Plotly.newPlot('pie', li, layout); // draw into the div with id "pie"
        this.api.log(labels[0]);
    }
};
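The conversion from the feature_weight entity to a pie trace can be isolated and checked without Plotly or the Xap. A minimal sketch, assuming a hypothetical helper name toPieTrace and made-up weights that are not the output of the real model:

```javascript
// Hypothetical helper: convert a {feature: weight} object into the
// labels/values/type object that a Plotly pie trace expects.
function toPieTrace(featureWeight) {
  var labels = Object.keys(featureWeight); // own keys only, in insertion order
  var values = labels.map(function (key) {
    return featureWeight[key];
  });
  return { labels: labels, values: values, type: "pie" };
}

// Made-up weights for illustration only.
var exampleWeights = { Age: 0.25, Fare: 0.25, Gender: 0.5 };
var trace = toPieTrace(exampleWeights);
console.log(JSON.stringify(trace));
```

Object.keys is used here instead of for...in because it skips inherited properties, which keeps stray keys out of the chart if the input object has a prototype.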