Friday, April 28, 2017

Google Prediction API

Google Cloud Prediction API provides a RESTful API to build Machine Learning models.
  1. create a cloud platform project: predictionapi0
  2. enable billing
  3. enable API
  4. download training data (txt file)
  5. create bucket: jychstar-bucket, upload txt file to bucket
  6. project: predicitonapi0
    request body: {
    “id”: “language-identifier”,
    “storageDataLocation”: “jychstar_bucket/language_id.txt”
It turns out this language-identifier API is only a toy, with 403 input instances and 3-class labels(English, French, Spanish). It is a blackbox that are written for a specific purpose.
The business model for prediction API is $0.50/ 1000 prediction after 10 k free prediction. And they charged the training as well. I think such API is application specific. As a black box, it should generalize well enough to be useful in the changing world.
Some mature APIs are:
  • Natural language analysis: syntax, entity, sentiment
  • speech to text
  • translation
  • image analysis
  • video analysis

Amazon Redshift

Amazon Redshift was launched in 2012.11. It is targeted at big data which is at the level of petabytes.
An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.Reserving compute nodes offers significant savings compared to the hourly rates that you pay when you provision compute nodes on demand.
With Amazon Redshift, you can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional solutions.
When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. The dense storage (DS) node types are storage optimized. The dense compute (DC) node types are compute optimized. more details here.

getting started

there are 7 steps to follow:

1, sql client and driver

SQL Workbench/J is a free, DBMS-independent, cross-platform SQL query tool. It is written in Java and should run on any operating system that provides a Java Runtime Environment.
You can use a JDBC connection to connect to your Amazon Redshift cluster from many third-party SQL client tools. To do this, you need to download a JDBC driver.

2, create an IAM role

roles -> create new role -> AWS service Role: Amazon Redshift -> Attache policy: AmazoneS3ReadOnlyAccess -> role Name: myRedshiftRole
Role ARN: arn:aws:iam::992413356070:role/myRedshitRole

3, launch a redshift cluster

There will be a charge for $0.25/hour, so delete the cluster after tutorial.
launch cluster -> cluster identifier: examplecluster, Master User name: masteruser, password: Ai8920113

4, Authorize Access to the Cluster

Redshift -> Clusters -> examplecluster -> configuration -> cluster properties -> VPC security Groups -> inbound -> Edit -> Type: Custom TCP rule, protocol: TC: , Port Range: 5439

5, Connect to the Sample Cluster

Redshift -> Clusters -> examplecluster -> configuration -> cluster databse properties, JDBC URL: jdbc:redshift://
software SQL workbench/J -> file -> connect window -> new profile -> manage Drivers -> new driver, load the jdbc driver, ok -> continue fill profile, driver, url, user name, password, autocommit, ok

6, Load Sample Data from Amazon S3

When lauch cluster, there is a default database “dev”. The database is still empty. So the first thing is to create some tables such as “users”, by writing queries in “statement” of SQL WorkbenchJ:
create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);
Then we load sample data to the tables by “copy” from Amazon S3:
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';
Note that the credential string in <iam-role-arn> is from step 2. Unfortunately , I got S3ServiceException: Access Denied due to my setup in cluster launch.
Now you are ready to write queries like select * from users.
Check your query history at redshift -> example cluster -> Queries tab

7 try something interesting or reset environment

If you feel ambitious, try Tutorial: Loading Data from Amazon S3
Otherwise, revoke access from the VPC seucrity Group. redshift-> clusters -> example cluster -> configuration -> cluster properties -> inbound -> edit, delete custom TCP rule, save.
redshift -> cluster -> examplecluster -> configuration ->cluster -> delete. create snapshot: no, delete.

Thursday, April 27, 2017

Intro to Hadoop and MapReduce

course speaker:
  • Sarah Sproehnle, vice president of Cloudera.
  • Ian Wrigley, senior curriculum manager at Cloudera
Cloudera was founded in 2008 by 3 engineers from Google, Yahoo and Facebook. It develops Apache Hadoop and provided related service. As early as 2003, Doug Cutting was inspired by Google labs’ papers on their distributed file system (GFS) and their processing framework, MapReduce. He wrote the initial Hadoop software with partner Mike Cafarella. They were invested by Yahoo.
Since 2012, many companies such as Oracle, Dell, Intel, SAS, Microsoft and Internet of things start-ups announce the partnership with Cloudera. In March 2017, Cloudera filed fro an IPO.

1 Intro

IBM: 90% of world’s data was created in the last 2 years alone.
challenges with big data:
  1. data is created fast
  2. data from different sources in various formats
  1. volume,
  2. variety (unstructured, raw format instead of SQL)
  3. velocity
All data are worth storing: transactions, logs, business, user, sensor, medical, social.
The key is what data interests you most: science, e-commerce, financial, medical , sports, social, utilities.
Core Hadoop is storage in HDFS and process in MapReduce. But now Hadoop has grown into an ecosystem. Helper tools such as Hive and Pig could turn SQL into MapReduce code and run in the cluster. But they are on top of MapReduce and hence slow. Impala can directly access data in HDFS. Other ecosystem projects include Sqoop, Flume, HBase, Hue, oozie, Mahout(machine learning). Making them talking to one another and work well can be tricky. CDH (Cloudera distribution of Hadoop) packages all these things together, which makes life much easier.

2 HDFS and MapReduce

Backup file to avoid accidental lose in the cluster: data redundancy(2 more copies) and NameNode standby (1 more copy).
Hadoop’s block size is set to 64MB by default, when most filesystems have block sizes of 16KB or less.


Instructions on how to download and run the virtual machines here.
Information on how to transfer files back and forth to the virtual machine can be found here.
After downloading the zip file which includes a 4.2 G .vmdk file and then put into the virtual box. In setting: Network -> attached to: Bridged Adapter
Once you press ‘start’, you will enter Hadoop system. However, with ifconfig, I couldn’t get ‘net addr’ for eth1. So I can’t start a ssh connection and use scp to transfer data between vm and host.
typical hadoop command:
hadoop fs -ls   # check hadoop file system
hadoop fs -put purchases.txt   # put file in cluster
hadoop fs -tail purchases.txt  # display end
hadoop fs -mv purchases.txt newname.txt # rename
hadoop fs -rm newname.txt   # delete file
head  -50 ../data/purchases.txt > testfile  # first 50 lines
cat testfile | ./ | sort | ./
hs myinput output2


Dictionary approach will take a long time and may run out of memory.
  • Mapper: divide the whole chunk to multiple key-value pairs
  • Reducer: each has partial keys
  • Task trackers
  • Job tracker
def mapper():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, store, item, cost, payment = data
        print "{0}\t{1}".format(store, cost)
def reducer():
    salesTotal, oldKey = 0, None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
        thisKey, thisSale = data
        if oldKey and oldKey!= thisKey:
            print "{0}\t{1}".format(oldKey, salesTotal)
            salesTotal = 0
        oldKey = thisKey
        salesTotal += float(thisSale)
    if oldKey:
        print "{0}\t{1}".format(oldKey, salesTotal)

MapReduce Design Patterns

book: MapReduce Design Patterns by Donald Miner in 2012, $30
  • filter patterns: bloom filter, sampling filter
  • summarization pattern: counting, statistics
  • structural pattern: combining data sets
I will save lesson 7 and lesson 8 for future practice. To be good at Big Data, this intro little course is never enough. There will be a lot of difficult concepts and hard work.

Real-time analytics with Apache Storm

Types of analytics:
  • cube analytics: business intelligence
  • predictive analytics: statistics and machine learning
  • realtime: streaming or interactive
  • batch:
Hadoop: big batch processing
Storm: fast, reactive, real-time processing
Apache Storm Site with Documentation:


Step 1) (Apple OSX) Install VirtualBox for your operating system:
Step 2) (Apple OSX) Install Vagrant for your operating system:
git clone
cd ud381
vagrant up   # 1st download 2G vmdk file to VB folder
vagrant ssh
storm version  # 0.9.2-incubting
cd ..
cd ..
cd vagrant  # this is a shared folder 
vagrant ssh
cd /vagrant
cd lesson1/stage1
mvn clean
mvn package
storm jar target/udacity-storm-lesson1_stage1-0.0.1-SNAPSHOT-jar-with-dependencies.jar udacity.storm.ExclamationTopology
There are a lot of Java implementation. I am not particularly interested in collecting Tweets.

Data visualization with tableau

book: the visual display of quantitative information.
visual encoding:
  1. lines for trends, if no trend, use bars to compare group
  2. histogram is a bar plot where a variable is binned into ranges
  3. violin plot = box plot + kernel density estimation
char suggestion:


Tableau is an interactive data visualization software for business intelligence. It is founded in 2003, as a result, to commercialize research at Stanford University’s CS department.
subscription price ranges from $35 to 70 per month.
It supports excel, text, JSON and statistical files.

Union vs Join

Union Join
operation drag right below 1st table drag elsewhere
column change combine without merge merge common fields
more verbose more concise
default join is inner join, which only combines data with a common value. Left join will have all the original data.

Dimensions vs Measures

Dimensions are more discrete values, like category, city, date, region.
Measures are more continuous values, like profit, quantity, height, age.
There are some overlaps.

3 working modes

  1. sheet. drag x value to “Columns”, y value to “Rows”, label value to “Marks” or “Filters”
  2. dashboard: drop multiple sheets together to address something
  3. story: ppt-like experience to tell a story.


This is an elegant and powerful software. It seems to have a web-based functionality in mind. In my public version, I can’t output the graph but can only store everything in its cloud. Maybe it is because the mouse-over feature is data encoded so the stand-alone graph will lose its spotlight.

Wednesday, April 26, 2017

Data Wrangling with MongoDB

The previous post focuses on data wrangling by parsing csv/excel/json/xml/html formatted files. And SQL queries, especially in SQLite.
This post now turns attention to NoSQL, especially in MongoDB.

course syllabus and setup:
course video has 11 sections, syllabus only list 6 lessons, the other 5 are problem sets. These videos are actually the major part of Data Analyst nanodegree in the data wrangling section.


MongoDB was first released in 2009.
Why MongoDB?
  • flexible schema
  • oriented toward Programmers
  • flexible deployment
  • designed for big data
  • aggregation framework
document database vs relational database.
here, document means associative array:
  • JSON object
  • PHP array
  • python dict
  • Ruby Hash
MongoDB store these types of hierarchical data structures directly in a database as individual documents. JSON-like syntax.

install and setup

brew install mongodb
brew services start mongodb
mkdir -p /data/db    //create default directory
sudo chown -R `id -u` /data/db    // give permission

practice by pymongo

start the server first by mongod
learning points:
  1. create client
  2. create database
  3. create table by insert one or multiple dictionaries
  4. query and projection by dictionary
from pymongo import MongoClient
import pprint
client = MongoClient('localhost:27017')
db = client.examples    # creat database "examples" if not exist
db.cities.insert_one({"name" : "Chicago"})  # insert a dict to table "cities"
print db.cities.find_one() # note each dict has an id No.
tesla_s = {
    "manufacturer": "Tesla Motors",
    "class": "full-size",
}  # insert a dict to table "autos"
for a in
autos ={"manufacturer":"Tesla Motors"})  # use dict for query
for a in autos:
query = {"manufacturer":"Toyota", "class": "mid-size car"}
projection = {"id":0, "name": 1}  // 0: hide; 1; show
autos =, projection) # multiple field queires and projection


Documentation for mongoimport can be found here.
The command used in this video:
mongoimport --help
mongoimport -d dbname -c collectionname --file input-file.json
mongoimport -d examples -c myautos2 --file autos.json
mongo  // start mongo shell locally
If no hostname and credentials are supplied, mongoimport will try to connect to the default localhost:27017


operator is in subqueries with leading dollar sign
query = {"name": {"$gte":"X","$lt":"Y"}}
query = {"foundingDate":{"$gt":datetime(1837,1,1),
query = {"country":{"$ne":"United States"}}
query = {"govermentType":{"$exists":1}}
query = {"motto":{"$regex":"[Ff]riendship|[Pp]ride"}}
query = {"modelYears":{"$in":[1965,1966,1967]}}
query = {"modelYears":{"$all":[1965,1966,1967]}}
query = {"dimensions.weight":{"$gt":10000}}
To my surprise, 21st century begins from 2001.1.1 Perl Compatible Regular Expressions
Live RegEx tester at
MongoDB $regex Manual.
Official Python Regular Expression HOWTO.
Another good Python Regular Expressions page.

update/drop   // replace or add a dict
db.cities.update(query,{"$set"}:dict,multi= True)
db.cities.update(query,{"$unset"}: {"key":""})

aggregation framework

# 3, group & sort operator
pipeline = [
## 8,match & project operator
pipeline =[
  {"$match": query},
  {"$project": {"ratio":
               {"$divide":["$user.followers_count",                                 "$user.friends_count"]},
## 10, quize
pipeline =[
  {"$match": {"user.time_zone":"Brasilia",
  {"$project": {"followers":"$user.followers_count",
# 11, unwind operator: expand single to multiple instances
pipeline = [
pipeline = [
   {"$match": {"country":"India"}},
    {"$group":{"_id": "$isPartOf",
other operator
{"$addToSet":"$text"}  # unique
{"$push":"$text"}  # everything

10 problem set

pipeline = [
    {"$match": {"name":{"$ne":None}}},
    {"$group":{"_id": "$name",

project: OpenStreetMap

SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM).
parse XML -> audit dataset -> CSV/JSON-> SQL/MongoDB -> explore
Open Street Map has customized its own version: OSM XML. Major tags
  • way: street
  • node
search tag name use element.find(s), search attribute name use element.get(s)

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while checks for a match anywhere in the string.
>>> re.match("c", "abcdef")    # No match
>>>"c", "abcdef")   # Match
<_sre.SRE_Match object at ...>
I didn’t find OpenStreetMap particularly interesting so I paused here.