Thursday, March 9, 2017

1st Xap I built

Doing the right thing is more important than doing the thing right

Before implementing the technical details, let me exercise my critical thinking and keep the big picture in mind.
In my understanding, Exaptive Studio is trying to encapsulate and modularize every step of a data pipeline. Basically, the pipeline can be decomposed into 3 steps: data wrangling, data mining, and data visualization. But there is no clear boundary between these steps because they are closely related. For example, in data mining, we use a lot of statistical learning tools and plotting libraries to extract the useful information; it is like turning over 100 rocks to find 2 interesting nuggets. In data visualization, we present only the few interesting things and deliberately polish them to catch viewers’ eyes. We don’t display the 98% of boring things or the dots we fail to connect.
These two purposes of visualization at different stages reflect the exploration-exploitation dilemma, which is a fundamental tradeoff in Reinforcement Learning. The term came to mind because I remember the GaTech professors joking about the similar spelling in the Udacity video.
To show how difficult this would be to implement in Exaptive, here is the Python code I most commonly use for data exploration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv(csvfile)
data.info()
data.head()
data.describe()
data.corr()
data.plot()
data.groupby(key)  # key: name of the column to group by (groupby() needs an argument)
I usually do this in a Jupyter notebook because I get immediate feedback from each result and can decide which direction to explore next. I can imagine how inefficient it would be if we had to pull out the log, then build or search for a suitable component. Within IPython, it is only one line of code!
So Exaptive Studio's best chance to survive and flourish is in the data presentation stage, which leverages the power of web-based visualization technologies such as D3.js. There are already strong players like Datameer eating into the same market. We have to act very fast and adjust our strategy to capture more end users.
Specifically, one of our target markets is data journalism. We should be able to help our users achieve something like The Facebook IPO, which is data-intensive and highly interactive. In a post-truth era, such reports will stand out and become the new norm, because a false claim is cheap to fabricate but big data is not. Truth and insights that are hidden in the data will eventually prevail. People are so hungry for data-backed news. So don’t let Trump tweet. Let Data Speak. This is a huge business!

Building a JavaScript component for data visualization
Steps:
  1. Read the relevant documentation and decompose an existing plot component. Put it in a pure HTML/JavaScript environment and study how it works.
  2. Search for the right JavaScript libraries and functions to use as building blocks. Play with them to understand how they work and serve them from localhost. Try Amazon Simple Storage Service (S3) to host these visually appealing pages.
  3. Modify this working JavaScript code to make it work in a Xap.

Learn from Exaptive documentation on JavaScript

D3.js in Exaptive

In the studio page, create a new JavaScript Component. Rename it, modify the spec in the edit page, and save.
"dependencies": {
    "file": [
        {
            "type": "js",
            "path": "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.0.0-beta1/jquery.min.js",
            "name": "$"
        },
        {
            "type": "js",
            "path": "https://d3js.org/d3.v4.min.js",
            "name": "d3"
        }
    ]
}
The first dependency is jQuery version 3, which is included by default. The second one is d3 version 4.
In the “script” tab, there are 3 default functions: _init(), _close(), and doSomething(). Add this code to the _init method and save.
let d3 = this.api.imports.d3; // declare namespace
d3.select("body").style("background-color", "deeppink")
    .selectAll("p")
    .data([1, 2, 3, 4, 5])
    .enter().append("p")
        .text(function(d) {
            return `Hello, I'm number ${ d }!`;
        }).style("color", "white");
In the “inner html” tab, delete everything or add whatever you want. Its content effectively sits inside the body block. Save it.
In the “studio” page, create an Xap. Add this component to the dataflow. Go to preview, hooray!

JavaScript in Exaptive

This page explains the “spec” in detail, especially the url and asset dependencies.
This page explains the JavaScript API available in the script:
this.api.inputState.export()
this.api.layoutElement
this.api.imports  
this.api.log( msg, value ) 
this.api.warning( msg, value )
this.api.error( msg, value ) 
this.api.output( name, value )
this.api.value
this.api.dataflow.setLayout( layoutSpec )
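
To get a feel for how these calls fit together, here is a minimal sketch that can run outside Exaptive. Everything in makeFakeApi is a hypothetical stand-in I wrote for illustration: it stubs only three of the members listed above (inputState.export, log, output), and the real objects injected by Exaptive may behave differently. The input name "numbers" and the output name "sum" are also made up.

```javascript
// Hypothetical stand-in for the object Exaptive injects as `this.api`.
// Only members named in the list above are stubbed; real behavior may differ.
function makeFakeApi(inputs) {
  const outputs = [];
  const logs = [];
  return {
    outputs,
    logs,
    inputState: { export: () => inputs },
    log: (msg, value) => logs.push([msg, value]),
    output: (name, value) => outputs.push({ name, value }),
  };
}

// A component-shaped object: _init() runs once, doSomething() reacts to input.
const component = {
  _init() {
    this.api.log("init", null);
  },
  doSomething() {
    const state = this.api.inputState.export(); // snapshot of all inputs
    const total = state.numbers.reduce((a, b) => a + b, 0);
    this.api.output("sum", total); // send a value downstream
  },
};

const api = makeFakeApi({ numbers: [1, 2, 3, 4, 5] });
component.api = api;
component._init();
component.doSomething();
console.log(api.outputs);
```

The point of the stub is only to show the direction of the data: values come in through inputState.export() and leave through output(name, value).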

Decompose existing components

FileDropTarget

100 lines, mostly dealing with binary strings.
I would like to use filedrop.js to encapsulate the code.

CSVParser

Reuses Papa Parse.

Scatterplot

Based on D3.js, with 2 helper scripts: tinycolor and visUtil (hosted on AWS).
The basic architecture:
export default {
    _init(){}, // 400 lines
    resize(){},
    data(){},
    brushExtents(){},
    options(){},
    select(){}
}
I traced the data flow by the keyword “export”:
data = this.api.inputState.export().data; // line 349
this.updateOptions(this.inputState.export()); // line 420
var brushExtents = this.api.inputState.get('brushExtents').export(); //line 434
this.updateOptions(this.inputState.export()); // line 438
select(){ this.onSelect( this.api.inputState.get("select").export() ); } //446
and by the keyword “output”:
_this.api.output("brushExtents", {
  x: empty ? [] : x,
  y: empty ? [] : y
}); // line 96, within function brushend()
nodes.on("mouseover", function(d) {api.output("mouseover", [d.projectedId]);d3.select(this).classed("highlight", true);})
.on("mouseout", function(d){api.output("mouseout", [d.projectedId]);
  d3.select(this).classed("highlight", false);})
.on("click", function(d){api.output("click", [d.projectedId]); //line 266
this.api.output("selected", selected[0].map(function(d){ return d.__data__.projectedId;})); // line 344
I still couldn’t fully understand the code. It relies heavily on D3 to manipulate and map the data. There is also some confusion about the use of this.api, api, this, and _this; is it due to the need to keep a copy of the reference?
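
One likely explanation for the mix of this, _this, and api: D3 calls event listeners registered with .on() with this set to the DOM element, so inside a function(d){...} callback this.api no longer points at the component. The sketch below imitates that with a tiny fakeNode helper (hypothetical, written only for this illustration; it is not D3 and not the Scatterplot source) so it can run without a browser.

```javascript
// Minimal stand-in for a D3-like node: like D3's selection.on(), it calls
// the handler with `this` set to the "element", not to the component.
function fakeNode(name) {
  return {
    name,
    on(handler) { this.handler = handler; return this; },
    fire(d) { return this.handler.call(this, d); },
  };
}

const component = {
  api: { sent: [], output(name, value) { this.sent.push([name, value]); } },

  wire(node) {
    const _this = this;   // copy of the component reference
    const api = this.api; // or capture just the api object
    node.on(function (d) {
      // Here `this` is the node (D3 would pass the DOM element), so
      // `this.api` is undefined; the captured names still work.
      const seenThisApi = typeof this.api;
      _this.api.output("viaUnderscoreThis", d.projectedId);
      api.output("viaApi", d.projectedId);
      return { seenThisApi, thisName: this.name };
    });
  },
};

const node = fakeNode("circle");
component.wire(node);
const result = node.fire({ projectedId: 7 });
console.log(result, component.api.sent);
```

So _this and api would be two ways of carrying the component into a callback, while a bare this inside the callback is the element being hovered or clicked, which is what d3.select(this) relies on.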
Another thing is that this component uses the vis-util library, with the classes visUtil.Axes and visUtil.Brush for drawing. However, this library does not seem very popular: there are very few resources about it, and no official website!
The tinycolor library seems popular, with 1536 stars on GitHub. But the weird thing is that I didn’t see the related scripts.
By the way, Stacked/Grouped Barchart uses the d3, velocity, visUtil, and vis libraries.

My Xap

Dimple_xap

The initial difficulty I encountered was how to feed data into code built on high-level libraries, such as pandas.read_table(file) or d3.tsv(file, function). The FileDropTarget component already “half-processes” the file, and CSVParser turns it into a JSON-type data stream. I just don’t want to convert that stream back into a file.
After some exploration, I figured out that I can host the data file on my GitHub, get a raw file link, and read directly from the URL. For Python, there are 2 ways:
# recommended way (Python 3; on Python 2 use "import StringIO as io")
import io
import requests
import pandas as pd
url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s = requests.get(url).content  # raw bytes of the response body
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

# not recommended: write the data to a file stored somewhere, then read it back
from urllib.request import urlopen  # Python 3; on Python 2: from urllib2 import urlopen
response = urlopen(url)
data = response.read()  # bytes
filename = "t.txt"
with open(filename, 'wb') as f:  # binary mode, because data is bytes
    f.write(data)  # Write data to file
df = pd.read_table(filename, sep=",")  # this file is comma-separated
For JavaScript:
d3.tsv("world_cup.tsv", draw);  // file in local disk
d3.tsv("https://raw.githubusercontent.com/jychstar/datasets/master/titanic/world_cup.tsv", draw); // file hosted at github
draw(JSON_dataset);  // direct read JSON dataset/list of dictionaries
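
To make the third option concrete: as far as I understand, d3.tsv hands draw an array of objects, one per row, keyed by the header names, with every value still a string. The parseTsv helper below is a hypothetical, simplified stand-in (no quoting or escaping) that produces the same shape without D3, and the sample rows are made-up numbers, not the real World Cup data.

```javascript
// Simplified TSV parser: the first line is the header, every value stays a string.
function parseTsv(text) {
  const lines = text.trim().split("\n");
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row = {};
    header.forEach((key, i) => { row[key] = cells[i]; });
    return row;
  });
}

// Made-up sample using the two column names that the chart code reads.
const sample = "year\tattendance\n1930\t4444\n1934\t5555\n";
const rows = parseTsv(sample);
console.log(rows);
```

An array like rows is what can be passed straight to draw(), which is why a JSON stream from an upstream component could in principle replace the file read.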
After trial and error, I finally made my first Xap work. It includes only a JavaScript Component; the script inside export default {}; is:
_init() {
  let d3 = this.api.imports.d3;
  let dimple = this.api.imports.dimple;
  function draw(data) {
    "use strict";
    var svg = d3.select("body").append("svg")
        .attr("width", 1400).attr("height", 600);
    svg.append("text")
        .attr("x", 700).attr("y", 30)
        .attr("text-anchor", "middle")
        .style("font-size", "30px").style("font-weight", "bold")
        .text("World Cup Attendance vs. Year");
    var myChart = new dimple.chart(svg, data);
    var x = myChart.addTimeAxis("x", "year");
    var y = myChart.addMeasureAxis("y", "attendance");
    x.dateParseFormat = "%Y";
    x.tickFormat = "%Y";
    x.timeInterval = 4;
    x.fontSize = 20;
    y.fontSize = 20;
    myChart.addSeries(null, dimple.plot.line);
    myChart.addSeries(null, dimple.plot.scatter);
    myChart.addSeries(null, dimple.plot.bar);
    myChart.draw();
  }
  var url = "https://raw.githubusercontent.com/jychstar/datasets/master/titanic/world_cup.tsv";
  d3.tsv(url, draw);
},
The dependencies are:
"dependencies": {
    "file": [
        {
            "type": "js",
            "path": "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.0.0-beta1/jquery.min.js",
            "name": "$"
        },
        {
            "type": "js",
            "path": "https://d3js.org/d3.v4.min.js",
            "name": "d3"
        },
        {
            "type": "js",
            "path": "https://cdnjs.cloudflare.com/ajax/libs/dimple/2.3.0/dimple.latest.min.js",
            "name": "dimple"
        }
    ]
},

Titanic_xap, to be continued

My original plan was to implement what I can do in a Jupyter notebook: https://github.com/jychstar/datasets/blob/master/titanic/Titanic%2C%20from%20Kaggle.ipynb
It is cool and has the mouse-over feature. However, the bar plot is not exactly what I want. I realized that seaborn.factorplot and dimple.plot.bar have their own concepts of even seemingly similar things. There is a huge gap between different language communities.
