Saturday, March 4, 2017

First experience with Exaptive platform

Why Exaptive?

I always feel excited to explore new things. As a data scientist, I am now learning a web-based tool to increase my productivity, even though the learning itself takes time.
Exaptive is developing a platform that enables data scientists to build customized tools for themselves. Why build such a platform? Because a data science project usually includes data wrangling, data mining and data visualization. You need different APIs at different stages, and you write your own scripts here and there. For the last step, data visualization, you probably need to display scalable, interactive graphs on the web to reach a wider audience. This means switching from Python/R to JavaScript with D3. So, for a seamless workflow, wouldn’t it be nice to have all the pieces in one place?
There is no magic wand that can turn a dataset of arbitrary format and unknown quality into a pleasing insight. Humans invent various tools to make the process easier. Existing tools like Microsoft Excel and Tableau are great, but they offer only limited customization options. So why not decompose them a little and remix the components as you wish? This is exactly what Exaptive is trying to accomplish. Let’s dive in.

First look at Exaptive

Open https://exaptive.city/. There are 4 options at the top:
  • Home. Empty unless you have created an Xap.
  • Studio. Empty at first. When you click the + button, three things pop up: Component, Xap, Asset. A Component is the basic building block. An Xap is the product you get when you piece the required components together. An Asset is a dataset, code snippet or picture that is used to build a component or feed data. The latest version is 4.0.24, released on 2016.8.29.
  • Explore. A collection of pre-built Xaps, components and assets. You can add them to your own Studio. Currently, there are 100 public modules, among which 16 are Xaps.
  • Learn. Tutorials, Documentation, Discussion (226 bug reports, 107 feature requests, 17 general discussions. This suggests an early stage of development).
I first read some of the documentation to pick up basic concepts such as the Entity-Attribute-Value data model, duffles (data packed to travel from one component to another), primitive data types and containers, and the special treatment for Python (only v2.7, dependencies, etc.).
Then the best way to learn is to work through some working examples instead of building from scratch, because these components and interfaces are designed with “biases” in mind. Learning should be like in our childhood, when curiosity drove us to break everything down to see what’s inside the black box, even if our parents nagged about the mess we made.
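To make the Entity-Attribute-Value idea concrete for myself, here is a minimal sketch in plain Python. It is only an illustration of the general data model, not how Exaptive stores duffles internally; the helper names to_eav and from_eav and the sample listings are made up.

```python
# Hypothetical helpers: these are not part of the Exaptive API.

def to_eav(entities):
    """Flatten a list of dicts into (entity_id, attribute, value) triples."""
    triples = []
    for entity_id, entity in enumerate(entities):
        for attribute, value in sorted(entity.items()):
            triples.append((entity_id, attribute, value))
    return triples


def from_eav(triples):
    """Rebuild the list of dicts from (entity_id, attribute, value) triples."""
    entities = {}
    for entity_id, attribute, value in triples:
        entities.setdefault(entity_id, {})[attribute] = value
    return [entities[key] for key in sorted(entities)]


# Made-up sample data: entities do not need to share the same attributes.
listings = [
    {"room_type": "Private room", "price": 85},
    {"room_type": "Entire home/apt", "price": 150, "accommodates": 4},
]

triples = to_eav(listings)
for triple in triples:
    print(triple)
```

The point of the triple form is that a missing attribute simply produces no row, which suits datasets with unknown quality and missing values.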

Demystify: Pub Med Result List Example

Go to the “Explore” page, where the “Pub Med Result List Example” Xap appears first. Let’s try it and add it to Studio.
Go to the Studio page, and we see that “Pub Med” is there. We have 4 choices: info (click the name string), run (click the “play” icon), edit (click the “hammer” icon), delete (click the “trash can” icon).
The “info” page shows the components that are used: Text Box, Button, PubMed Search, Result List. We can simply infer their functions from the names. “layout” shows some HTML code, which will be inserted into
<body data-gr-c-s-loaded='true'>
  <div class = "exaptive-doc">
    <div class = "exaptive-doc-main">
      <div data-node>
        <!-- layout codes -->
      </div>
    </div>
  </div>
</body>
if you inspect the page while the Xap is running. The “layout” code constructs the front-end interface on the webpage and links to the back-end modules.
On the run page, you see a search input and a button. Evidently, this is a customized search engine based on PubMed.
The edit page has a dataflow tab which, as the name suggests, is where the data flows. These boxes and wires immediately remind me of software with a similar graphical interface: LabVIEW. During my doctoral research, one of my proudest pieces of work was a LabVIEW program that improved the data collection efficiency by more than 10 times. It not only reduced the tedious work, but also helped me discover new phenomena by refining the resolution.
LabVIEW targets electrical signals that are parsed by National Instruments hardware. It is a powerful tool for electrical engineers. I can imagine how empowered data engineers would feel with a “data” version of LabVIEW in their hands. I am super excited about the opportunity to contribute to its early development. We can borrow some concepts from LabVIEW and implement them exaptively!
Back to the dataflow tab. We can intuitively see the input nodes and output nodes of each component. When you hover over a component, it shows the name of each node and a row of shortcut buttons: info, suggestions, edit, setting and remove. You can click on a node or double-click the component to see the data type, attributes and values for each node, and the available version numbers.
The secret sauce is revealed on the edit page of the component. Take the “PubMed Search_0” component as an example. We see:
  • description (basic usage, output fields, output example in JSON format, developers: Matt & Frank)
  • inputs
  • outputs
  • script (Python code that uses the xmltodict module to parse data)
  • spec (main, domain, dependency, input, output, etc.)
Note that the output fields in the description are different from the outputs. At this moment, I don’t know where exactly this Python code is executed. I guess it is wrapped in exaptive.js or somewhere else. There must be some sort of glue code to connect the inputs, the Python code and the outputs.
The script of the “ResultList_0” component is written in JavaScript. It is strange to me that the TextBox and Button components are empty in the edit tab, and sometimes mistakenly display another component’s information (e.g. that of PubMed Search).
To sum up, this is what a customized search engine is composed of. If you want to modify it for other domains such as astrophysics or politics, you will need to revise the Python script, especially the XML sources and tabs.

Tutorial: build your first Xap

Steps to follow:
  1. Download a csv file. It has 3818 instances, 24 columns (11 numerical) and some missing values.
  2. In the Explore page, add some pre-built components to Studio: File Drop Target, CSV Parser, Button, Scatterplot, Modal (look for the one that's black and white), Tooltip, Table.
  3. In the Studio page, create an Xap, open it and rename it.
  4. Go to the DataFlow tab, drop in the above components and try the following 2 wirings.
  5. FileDropTarget.FileData -> CSVParser.data --result-> Button.value --click-> Table.data. Save, run and load the csv file. After you see how the table works, delete the Table component.
  6. FileDropTarget -> CSVParser -> Button -> Scatterplot. Click the button in the “preview” tab and wait until you see something in the plot (which means the data is loaded).
  7. If the data is correctly parsed, it shows “20/3181 entities| 20/24 attributes” after the “data” input node. The 20 means that 20 samples are shown underneath. Click the “data” node to expand it, and set x = accommodates, y = price, color = room_type. Go to the “preview” tab to see the magic.
  8. In the style tab, add <br> style ="height:75%;"</br>before button tag. Save and see the magic in the ‘preview’ tab.
  9. Set data.mouseover = street. Add the tooltip and data gates to the Dataflow. Add 2 wires: scatterplot.mouseover -> dataMergeGate (expression: x0[0]) -> tooltip.html; scatterplot.mouseoutput -> tooltip.hide. Go to the ‘preview’ tab, pretty cool!
  10. In scatterplot.options, set "brush": true. Add the Table component back. Wire: scatterplot.selected -> table.data. Go to ‘preview’, hold and drag to select multiple points, and see the magic.
  11. Visualize the price distribution across areas. Set data.x = longitude, data.y = latitude, axes.x.hidden: true, axes.y.hidden: true. Add a new scatterplot and wire: scatterplot_0.selected -> scatterplot_1.data, so that the selected data will be plotted. In scatterplot_1, set data.x = accommodate, data.y = price, data.color = room_type, axes.x.text: "Accommodate", axes.y.text: "Price($)". Play around in the “preview” tab.
  12. Use the Modal component to place the 2 plots in 2 different layers. This is a very powerful way to manage visual complexity. Add a Modal component, set options.fullScreen: true, options.hideHeader: true. Add a Button component, set options.text: "Add or Send Data". Add 2 wires: button_1.click -> Modal.open, button_2.click -> Modal.close. The second wire closes the modal after the data is sent, so as to prevent the port from saving the data.
  13. In the “style” page, change the HTML code snippet to:
    <button data-node="Button_1"></button>
    <div data-node="Modal_0">    
       <div data-node="FileDropTarget_0"></div>    
       <button data-node="Button_0"></button>
    </div>
    
    <svg data-node="Scatterplot_0" style="height: 50%;"></svg>
    <div data-node="Tooltip_0"></div>
    
    From the code sequence, you can see that Button_1 is executed first to open Modal_0, within which FileDropTarget and Button_0 function. This ensures the dataflow sequence.
  14. Add another Modal component. Wiring: Scatterplot_0.selected -> Modal_1.open. This will trigger Modal_1 to open when new data is received. Set options.title: “price vs accommodation”. Add this HTML code to the “style” page.
    <div data-node="Modal_1" style="height: 45%; width: 75%;">    
       <div data-node="Scatterplot_1" style="height: 90%; width: 100%;"></div>    
    </div>
    
    Note that the tutorial has a typo that misspells “45%, 75%” as “450px,750px”.
  15. Add text to the front-end HTML. In the “style” page, add the following to the HTML:
    <h3 class = "text">
     A Map by AirBNB
    </h3>
    <p class= "text">
     click and drag across the visualization to see more information
    </p>
    
    And add .text{text-align: center;} to CSS.
  16. Save & Publish.
  17. In the “scatterplot” component, at data.points, click the funnel icon to add a filter, or the key icon to add an id.
    Entity Filter: price < 150 && room_type == "Entire home/apt", Value Selector: room_type == "Private bedroom" ? "star" : "triangle"
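The two expressions in step 17 look like JavaScript-style boolean and ternary expressions evaluated once per entity. As a sketch of what I understand them to do, here is the same logic in plain Python on a few made-up rows; the function names entity_filter and value_selector are mine, not Exaptive's.

```python
# Made-up sample rows; the column names follow the csv file used in the tutorial.
listings = [
    {"price": 120, "room_type": "Entire home/apt"},
    {"price": 300, "room_type": "Entire home/apt"},
    {"price": 60, "room_type": "Private bedroom"},
]


def entity_filter(row):
    # Mirrors: price < 150 && room_type == "Entire home/apt"
    return row["price"] < 150 and row["room_type"] == "Entire home/apt"


def value_selector(row):
    # Mirrors: room_type == "Private bedroom" ? "star" : "triangle"
    return "star" if row["room_type"] == "Private bedroom" else "triangle"


# The filter decides which entities are kept.
kept = [row for row in listings if entity_filter(row)]
# The selector maps each entity to a value, here a point shape.
shapes = [value_selector(row) for row in listings]

print(kept)
print(shapes)
```

So the filter drops entities, while the selector computes a per-entity value.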
Note on the essential “Scatterplot” component: the visualization is based on D3.js, with 2 helper scripts: tinycolor and visUtil (created by Exaptive).

Tutorial: So You’re Comfortable with the Exaptive Fundamentals

After watching a few video clips, I realized this tutorial is outdated. Some components are no longer there, such as Quandl Stock Prices, Bar Chart and Duffle Join. Some videos don’t even play.
The interesting part is the TwitterSearchAPI -> wordFrequency -> word cloud flow, which shows words at a size proportional to their frequency.
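I could not inspect the wordFrequency component itself, so the following is only my own sketch of the idea in plain Python: count the words, then give each word a font size proportional to its count. The function names, the max_size parameter and the sample tweets are all assumptions.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lower-cased words across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def font_sizes(counts, max_size=40):
    """Size each word in proportion to its count; the top word gets max_size."""
    top = max(counts.values())
    return {word: max_size * count / top for word, count in counts.items()}


# Made-up stand-ins for the tweets a search would return.
tweets = ["data science is fun", "data tools for data people"]

counts = word_frequencies(tweets)
sizes = font_sizes(counts)
print(counts.most_common(3))
print(sizes["data"], sizes["fun"])
```

A real word cloud would also drop stop words such as "is" and "for" before sizing.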

Tutorial: build a Python Component

  1. In the Studio page, add a Python component, rename it and add a description.
  2. In edit page -> spec, install Python modules in Docker by adding the following to the spec:
    "dependencies": {
        "apt": [{"path": "libffi-dev"},
                {"path": "libssl-dev"}],
        "pip": [{"path": "numpy"},
                {"path": "quandl"}],
        "file": []
    },
    
    When you click save, the environment will be built, which may take a while. Note: in the latest Python domain, ‘gfortran’ is pre-installed.
  3. In edit page -> inputs, change the default input “count” to “call”:
    Name: call
    value type: entity<list tickers, string startDate, string endDate, string APIKey, string frequency>
    default: {"tickers":["AAPL","MSFT"], "startDate": "2011-12-12", "endDate":"2016-05-05", "APIKey": "tE4dug_G3e-gcf72vq7g", "frequency":"monthly"}
    
    If granular inputs is checked, the input ports will explicitly display these attributes.
  4. In edit page -> script, replace the existing code with:
    import urllib2  # Python 2.7 standard library
    import json
    
    def call(self):
        # Read the current value of the "call" input port.
        call = self.api.inputstate.export()['call']
        tickers = call['tickers']
        start_date = call['startDate']
        end_date = call['endDate']
        api_key = call['APIKey']
        frequency = call['frequency']
    
        base = "http://www.quandl.com/api/v3/datasets/WIKI/"
    
        # Fetch one JSON dataset per ticker.
        data = []
        for ticker in tickers:
            url = (base + ticker + '.json?' + 'start_date=' + start_date +
                   '&end_date=' + end_date + '&collapse=' + frequency +
                   '&api_key=' + api_key)
            response = urllib2.urlopen(url)
            data.append(json.loads(response.read()))
    
        # Turn every data row into a dictionary keyed by column name.
        arrEnt = []
        for ticker in data:
            ticker_data = ticker['dataset']['data']
            ticker_columns = ticker['dataset']['column_names']
            ticker_symbol = ticker['dataset']['dataset_code']
    
            for line in ticker_data:
                info = {'Stock': ticker_symbol}
                info.update(zip(ticker_columns, line))
                arrEnt.append(info)
    
        # Wrap the list of dictionaries in a duffle and send it to the "data" output port.
        duffle = self.api.value.multiset(arrEnt)
        self.api.output("data", duffle)
    
    What it does is read a dictionary containing the request parameters, use urllib2.urlopen to fetch the data from the web and use json.loads to parse the response. Each ticker corresponds to a dataset, which is a dictionary with 21 keys. Among the 21 keys, the column names and the data are the most important, and each is a list of 105 items. This information is rearranged to produce a list, “arrEnt”. Each element of arrEnt is a dictionary, “info”. “arrEnt” is then wrapped into a duffle and output.
  5. Build an Xap with this Python component and a LineChart component.
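The URL building and the row reshaping from step 4 can be tried outside Exaptive without any network call. The sketch below is my own and is written for Python 3 (the component itself runs on 2.7); build_url and rows_to_entities are hypothetical helper names, "YOUR_KEY" is a placeholder, and the sample payload contains invented numbers that only imitate the dataset/column_names/data layout described above.

```python
from urllib.parse import urlencode

BASE = "http://www.quandl.com/api/v3/datasets/WIKI/"


def build_url(ticker, start_date, end_date, frequency, api_key):
    """Assemble the request URL; urlencode adds the '=' and '&' separators."""
    query = urlencode([
        ("start_date", start_date),
        ("end_date", end_date),
        ("collapse", frequency),
        ("api_key", api_key),
    ])
    return BASE + ticker + ".json?" + query


def rows_to_entities(dataset):
    """Turn each data row into a dict keyed by column name, plus the ticker."""
    entities = []
    for line in dataset["data"]:
        info = {"Stock": dataset["dataset_code"]}
        info.update(zip(dataset["column_names"], line))
        entities.append(info)
    return entities


# Invented payload fragment, for illustration only.
sample = {
    "dataset_code": "AAPL",
    "column_names": ["Date", "Close"],
    "data": [["2016-04-30", 1.0], ["2016-03-31", 2.0]],
}

url = build_url("AAPL", "2011-12-12", "2016-05-05", "monthly", "YOUR_KEY")
entities = rows_to_entities(sample)
print(url)
print(entities)
```

Letting urlencode build the query string avoids hand-typing every separator, which is easy to get wrong in a long string concatenation.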

The Exaptive Python Data API

When an input is activated by data being sent into that port, the component will attempt to call a method defined in the script by the same name. That is, if your input port is named my_data, then the component will attempt to call a method named my_data.
Usually the first line of code is: state = self.api.inputstate.export(). This gets the input port state. self here corresponds to the main argument to your method.
The last line of code is: self.api.output("output_name", my_output). This sends my_output to the output port named “output_name”.
Some other commonly seen API calls:
self.api.value(my_variable) # Cast data into the Exaptive data model
self.api.imports['datafile']  # datafile is an asset that you pre-define in "dependencies" -> "file"
self.api.log("this is a log", variable) # Output to the log messages tab, which is located at the bottom right of the dataflow page. It's a little icon that looks like a logbook.
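I do not know how Exaptive implements this dispatch, so the following is only a toy model that helped me picture it. Every class here (MockInputState, MockApi, MockComponent, Doubler) is hypothetical, as is the receive method. In a real component the function is written at the top level of the script, whereas this mock uses a class method for simplicity.

```python
class MockInputState(object):
    """Stand-in for self.api.inputstate: holds the values on the input ports."""

    def __init__(self, values):
        self._values = values

    def export(self):
        return dict(self._values)


class MockApi(object):
    """Stand-in for self.api: records outputs and log messages."""

    def __init__(self, inputs):
        self.inputstate = MockInputState(inputs)
        self.outputs = {}
        self.logs = []

    def output(self, name, value):
        self.outputs[name] = value

    def log(self, *args):
        self.logs.append(args)


class MockComponent(object):
    def __init__(self, inputs):
        self.api = MockApi(inputs)

    def receive(self, port_name):
        # Look up a method with the same name as the activated input port.
        handler = getattr(self, port_name, None)
        if callable(handler):
            handler()


class Doubler(MockComponent):
    def my_data(self):
        state = self.api.inputstate.export()
        doubled = [2 * x for x in state["my_data"]]
        self.api.log("doubled", doubled)
        self.api.output("result", doubled)


component = Doubler({"my_data": [1, 2, 3]})
component.receive("my_data")
print(component.api.outputs)
```

Seen this way, self is just the component object that carries the api handle, which is why every handler starts by reading from self.api.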

Tutorial: Data Analysis in Python

  1. Create a Python component in Studio, name it “chopstick” and modify “dependencies” in the spec:
    "dependencies": {
        "apt": [{"path": "python-numpy"},
                {"path": "python-scipy"}],
        "pip": [{"path": "scikit-learn"}],
        "file": []
    },
    
  2. Rename the input from “count” to “data”. Because the Python function is triggered by the input port with the same name, we need to rename the function and tinker with it in the script:
    from sklearn import cluster
    import numpy as np
    def data(self):
        data = self.api.inputstate.export()["data"]
        duffle = []
        self.api.output("data", duffle)
    
    self here may seem weird to a Python beginner. What is the top-level class, and how is the class initialized? My guess is that when you create a Python component, the backend of Exaptive Studio actually calls a class constructor to initialize it. Also rename the output to “duffle”.
  3. Create a new Xap in Studio, then go to DataFlow in the Edit page. Add components and wire them as: AssetLoader.data -> CSVParser.data --result-> chopstick.data
  4. Double-click AssetLoader and write f89fd760-8cab-11e6-a897-2dbff63efc07 into the “uuid” input port. Click the arrow button nearby (this logo is somewhat confusing to me; my first reaction was that it would lead me to a new page, which is the norm on the internet). What it actually does is feed the source and trigger the flow. You will see several ports light up blue, which means data is flowing through these ports. When you click a single port, you can see the value in it. In this case, we can see that CSVParser.result has 186 entities and 3 attributes, among which 20 entities are displayed when you click further.
  5. Hover over the “chopstick” component, open the edit page and edit the script.
    To better understand the essential code for data analytics, I practised it in a Python notebook, link
  6. Add a “table” component… It seems this tutorial is not finished.
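Since the tutorial stops before the analysis, here is a small sketch of the kind of clustering I would expect the “chopstick” script to do. The tutorial imports sklearn's cluster module; to keep this runnable without dependencies I swapped in a hand-written one-dimensional k-means (Lloyd's algorithm) instead, so this is not sklearn's implementation. The attribute name "length" and the sample rows are made up.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on a list of numbers (requires k >= 2)."""
    lo, hi = min(values), max(values)
    # Deterministic start: k centroids evenly spaced between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Made-up rows shaped like the list of dictionaries a CSV parser would emit.
rows = [{"length": 180}, {"length": 185}, {"length": 240},
        {"length": 250}, {"length": 175}]

labels, centroids = kmeans_1d([row["length"] for row in rows], k=2)
# Attach the cluster label to each row, ready to be sent to an output port.
clustered = [dict(row, cluster=label) for row, label in zip(rows, labels)]
print(centroids)
print(clustered)
```

Inside a real component, the rows would come from self.api.inputstate.export()["data"] and the result would be passed to self.api.output.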

Tutorial: Using a Visualization to Trigger Events

This tutorial actually doesn’t cover much about visualization. For visualization, I would suggest following the first tutorial, “build your first Xap”. What interests me is how the “Iris Sample Dataset” component is built. In this case, the input port is named “trigger”. In the function of the same name, the data is loaded directly from sklearn.datasets and then reorganized into a list of dictionaries.

Some handy tricks

  1. In the “dataflow” page, there are “+O-“ buttons at the bottom left, so you can resize the boxes. If they are off-center, you can click anywhere in the blank space, hold and drag the whole thing to the center.