Friday, April 28, 2017

Google Prediction API

Google Cloud Prediction API provides a RESTful API to build Machine Learning models.

https://cloud.google.com/prediction/docs/

follow https://cloud.google.com/prediction/docs/quickstart

steps:

create a cloud platform project: predictionapi0
enable billing
enable API
download training data (txt file)
create bucket: jychstar-bucket, upload txt file to bucket
train model in https://developers.google.com/apis-explorer/#p/prediction/v1.6/prediction.trainedmodels.insert

project: predicitonapi0

request body: {
“id”: “language-identifier”,
“storageDataLocation”: “jychstar_bucket/language_id.txt”
}
check training status: running or done in https://developers.google.com/apis-explorer/#p/prediction/v1.6/prediction.trainedmodels.get
input unseen data in trained model: https://www.googleapis.com/prediction/v1.6/projects/prediction-docs/trainedmodels/language-identifier/predict
delete model https://developers.google.com/apis-explorer/#p/prediction/v1.6/prediction.trainedmodels.delete
delete bucket https://console.cloud.google.com/storage/
delete project https://console.cloud.google.com/iam-admin/projects

It turns out this language-identifier API is only a toy, with 403 input instances and 3-class labels(English, French, Spanish). It is a blackbox that are written for a specific purpose.

The business model for prediction API is $0.50/ 1000 prediction after 10 k free prediction. And they charged the training as well. I think such API is application specific. As a black box, it should generalize well enough to be useful in the changing world.

https://cloud.google.com/products/

Some mature APIs are:

Natural language analysis: syntax, entity, sentiment
speech to text
translation
image analysis
video analysis

Amazon Redshift

Amazon Redshift was launched in 2012.11. It is targeted at big data which is at the level of petabytes.

official guide: https://aws.amazon.com/redshift/getting-started/

An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.Reserving compute nodes offers significant savings compared to the hourly rates that you pay when you provision compute nodes on demand.

With Amazon Redshift, you can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional solutions.

When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. The dense storage (DS) node types are storage optimized. The dense compute (DC) node types are compute optimized. more details here.

getting started

http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html

there are 7 steps to follow:

1, sql client and driver

http://www.sql-workbench.net/

SQL Workbench/J is a free, DBMS-independent, cross-platform SQL query tool. It is written in Java and should run on any operating system that provides a Java Runtime Environment.

You can use a JDBC connection to connect to your Amazon Redshift cluster from many third-party SQL client tools. To do this, you need to download a JDBC driver.

JDBC 4.2–compatible driver: https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC42-1.2.1.1001.jar.

2, create an IAM role

IAM console at https://console.aws.amazon.com/iam/.

roles -> create new role -> AWS service Role: Amazon Redshift -> Attache policy: AmazoneS3ReadOnlyAccess -> role Name: myRedshiftRole

Role ARN: arn:aws:iam::992413356070:role/myRedshitRole

3, launch a redshift cluster

There will be a charge for $0.25/hour, so delete the cluster after tutorial.

launch cluster -> cluster identifier: examplecluster, Master User name: masteruser, password: Ai8920113

4, Authorize Access to the Cluster

Redshift -> Clusters -> examplecluster -> configuration -> cluster properties -> VPC security Groups -> inbound -> Edit -> Type: Custom TCP rule, protocol: TC: , Port Range: 5439

5, Connect to the Sample Cluster

Redshift -> Clusters -> examplecluster -> configuration -> cluster databse properties, JDBC URL: jdbc:redshift://examplecluster.cl0oz8dhlrae.us-east-1.redshift.amazonaws.com:5439/dev

software SQL workbench/J -> file -> connect window -> new profile -> manage Drivers -> new driver, load the jdbc driver, ok -> continue fill profile, driver, url, user name, password, autocommit, ok

6, Load Sample Data from Amazon S3

When lauch cluster, there is a default database “dev”. The database is still empty. So the first thing is to create some tables such as “users”, by writing queries in “statement” of SQL WorkbenchJ:

create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);

Then we load sample data to the tables by “copy” from Amazon S3:

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

Note that the credential string in <iam-role-arn> is from step 2. Unfortunately , I got S3ServiceException: Access Denied due to my setup in cluster launch.

Now you are ready to write queries like select * from users.

Check your query history at redshift -> example cluster -> Queries tab

7 try something interesting or reset environment

If you feel ambitious, try Tutorial: Loading Data from Amazon S3

Otherwise, revoke access from the VPC seucrity Group. redshift-> clusters -> example cluster -> configuration -> cluster properties -> inbound -> edit, delete custom TCP rule, save.

redshift -> cluster -> examplecluster -> configuration ->cluster -> delete. create snapshot: no, delete.

Thursday, April 27, 2017

Intro to Hadoop and MapReduce

https://classroom.udacity.com/courses/ud617

course speaker:

Sarah Sproehnle, vice president of Cloudera.
Ian Wrigley, senior curriculum manager at Cloudera

Cloudera was founded in 2008 by 3 engineers from Google, Yahoo and Facebook. It develops Apache Hadoop and provided related service. As early as 2003, Doug Cutting was inspired by Google labs’ papers on their distributed file system (GFS) and their processing framework, MapReduce. He wrote the initial Hadoop software with partner Mike Cafarella. They were invested by Yahoo.

Since 2012, many companies such as Oracle, Dell, Intel, SAS, Microsoft and Internet of things start-ups announce the partnership with Cloudera. In March 2017, Cloudera filed fro an IPO.

https://www.cloudera.com/

http://hadoop.apache.org/

1 Intro

IBM: 90% of world’s data was created in the last 2 years alone.

challenges with big data:

data is created fast
data from different sources in various formats

3Vs:

volume,
variety (unstructured, raw format instead of SQL)
velocity

All data are worth storing: transactions, logs, business, user, sensor, medical, social.

The key is what data interests you most: science, e-commerce, financial, medical , sports, social, utilities.

Core Hadoop is storage in HDFS and process in MapReduce. But now Hadoop has grown into an ecosystem. Helper tools such as Hive and Pig could turn SQL into MapReduce code and run in the cluster. But they are on top of MapReduce and hence slow. Impala can directly access data in HDFS. Other ecosystem projects include Sqoop, Flume, HBase, Hue, oozie, Mahout(machine learning). Making them talking to one another and work well can be tricky. CDH (Cloudera distribution of Hadoop) packages all these things together, which makes life much easier.

2 HDFS and MapReduce

Backup file to avoid accidental lose in the cluster: data redundancy(2 more copies) and NameNode standby (1 more copy).

Hadoop’s block size is set to 64MB by default, when most filesystems have block sizes of 16KB or less.

setup

Instructions on how to download and run the virtual machines here.

Information on how to transfer files back and forth to the virtual machine can be found here.

After downloading the zip file which includes a 4.2 G .vmdk file and then put into the virtual box. In setting: Network -> attached to: Bridged Adapter

Once you press ‘start’, you will enter Hadoop system. However, with ifconfig, I couldn’t get ‘net addr’ for eth1. So I can’t start a ssh connection and use scp to transfer data between vm and host.

typical hadoop command:

hadoop fs -ls   # check hadoop file system
hadoop fs -put purchases.txt   # put file in cluster
hadoop fs -tail purchases.txt  # display end
hadoop fs -mv purchases.txt newname.txt # rename
hadoop fs -rm newname.txt   # delete file
head  -50 ../data/purchases.txt > testfile  # first 50 lines
cat testfile | ./mapper.py | sort | ./reducer.py
hs mapper.py reducer.py myinput output2

MapReduce

Dictionary approach will take a long time and may run out of memory.

MapReduce:

Mapper: divide the whole chunk to multiple key-value pairs
Reducer: each has partial keys
Task trackers
Job tracker

def mapper():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, store, item, cost, payment = data
        print "{0}\t{1}".format(store, cost)
def reducer():
    salesTotal, oldKey = 0, None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        thisKey, thisSale = data
        if oldKey and oldKey!= thisKey:
            print "{0}\t{1}".format(oldKey, salesTotal)
            salesTotal = 0
        oldKey = thisKey
        salesTotal += float(thisSale)
    if oldKey:
        print "{0}\t{1}".format(oldKey, salesTotal)

MapReduce Design Patterns

book: MapReduce Design Patterns by Donald Miner in 2012, $30

filter patterns: bloom filter, sampling filter
summarization pattern: counting, statistics
structural pattern: combining data sets

I will save lesson 7 and lesson 8 for future practice. To be good at Big Data, this intro little course is never enough. There will be a lot of difficult concepts and hard work.

Real-time analytics with Apache Storm

course link: https://classroom.udacity.com/courses/ud381

Types of analytics:

cube analytics: business intelligence
predictive analytics: statistics and machine learning
realtime: streaming or interactive
batch:

Hadoop: big batch processing

Storm: fast, reactive, real-time processing

Apache Storm Site with Documentation: https://storm.apache.org/

setup

Step 1) (Apple OSX) Install VirtualBox for your operating system: https://www.virtualbox.org/wiki/Downloads

Step 2) (Apple OSX) Install Vagrant for your operating system: https://www.vagrantup.com/

git clone https://github.com/Udacity/ud381
cd ud381
vagrant up   # 1st download 2G vmdk file to VB folder
vagrant ssh
storm version  # 0.9.2-incubting
cd ..
cd ..
cd vagrant  # this is a shared folder 
logout
vagrant ssh
cd /vagrant
cd lesson1/stage1
mvn clean
mvn package
tree
storm jar target/udacity-storm-lesson1_stage1-0.0.1-SNAPSHOT-jar-with-dependencies.jar udacity.storm.ExclamationTopology

There are a lot of Java implementation. I am not particularly interested in collecting Tweets.

Data visualization with tableau

course link: https://www.udacity.com/course/data-visualization-in-tableau--ud1006

book: the visual display of quantitative information.

visual encoding:

lines for trends, if no trend, use bars to compare group
histogram is a bar plot where a variable is binned into ranges
violin plot = box plot + kernel density estimation

char suggestion:

Tableau

Tableau is an interactive data visualization software for business intelligence. It is founded in 2003, as a result, to commercialize research at Stanford University’s CS department.

free version: https://public.tableau.com/s/

official training: https://www.tableau.com/learn/training

subscription price ranges from $35 to 70 per month.

It supports excel, text, JSON and statistical files.

Union vs Join

https://onlinehelp.tableau.com/current/pro/desktop/en-us/joining_tables.html

	Union	Join
operation	drag right below 1st table	drag elsewhere
column change	combine without merge	merge common fields
	more verbose	more concise

default join is inner join, which only combines data with a common value. Left join will have all the original data.

Dimensions vs Measures

Dimensions are more discrete values, like category, city, date, region.

Measures are more continuous values, like profit, quantity, height, age.

There are some overlaps.

3 working modes

sheet. drag x value to “Columns”, y value to “Rows”, label value to “Marks” or “Filters”
dashboard: drop multiple sheets together to address something
story: ppt-like experience to tell a story.

Remark

This is an elegant and powerful software. It seems to have a web-based functionality in mind. In my public version, I can’t output the graph but can only store everything in its cloud. Maybe it is because the mouse-over feature is data encoded so the stand-alone graph will lose its spotlight.

Wednesday, April 26, 2017

Data Wrangling with MongoDB

Recap: Data Analyst ND 2, data wrangling, SQL

The previous post focuses on data wrangling by parsing csv/excel/json/xml/html formatted files. And SQL queries, especially in SQLite.

This post now turns attention to NoSQL, especially in MongoDB.

course link: https://classroom.udacity.com/courses/ud032

course syllabus and setup: https://www.udacity.com/wiki/ud032

course material: https://github.com/udacity/ud032.git

course video has 11 sections, syllabus only list 6 lessons, the other 5 are problem sets. These videos are actually the major part of Data Analyst nanodegree in the data wrangling section.

NoSQL

MongoDB was first released in 2009.

Why MongoDB?

flexible schema
oriented toward Programmers
flexible deployment
designed for big data
aggregation framework

document database vs relational database.

here, document means associative array:

JSON object
PHP array
python dict
Ruby Hash

MongoDB store these types of hierarchical data structures directly in a database as individual documents. JSON-like syntax.

install and setup

https://www.udacity.com/wiki/ud032#!#installing-mongodb

brew install mongodb
brew services start mongodb
mkdir -p /data/db    //create default directory
sudo chown -R `id -u` /data/db    // give permission
mongod

practice by pymongo

start the server first by mongod

learning points:

create client
create database
create table by insert one or multiple dictionaries
query and projection by dictionary

from pymongo import MongoClient
import pprint
client = MongoClient('localhost:27017')
db = client.examples    # creat database "examples" if not exist
db.cities.insert_one({"name" : "Chicago"})  # insert a dict to table "cities"
print db.cities.find_one() # note each dict has an id No.
tesla_s = {
    "manufacturer": "Tesla Motors",
    "class": "full-size",
    "production":[2012,2013]
}
db.autos.insert_one(tesla_s)  # insert a dict to table "autos"
for a in db.autos.find():
    pprint.pprint(a)
autos = db.autos.find({"manufacturer":"Tesla Motors"})  # use dict for query
for a in autos:
    pprint.pprint(a)
query = {"manufacturer":"Toyota", "class": "mid-size car"}
projection = {"id":0, "name": 1}  // 0: hide; 1; show
autos = db.autos.find(query, projection) # multiple field queires and projection

mongoimport

Documentation for mongoimport can be found here.

The command used in this video:

mongoimport --help
mongoimport -d dbname -c collectionname --file input-file.json
mongoimport -d examples -c myautos2 --file autos.json
mongo  // start mongo shell locally

If no hostname and credentials are supplied, mongoimport will try to connect to the default localhost:27017

Queries

operator is in subqueries with leading dollar sign

query = {"name": {"$gte":"X","$lt":"Y"}}
query = {"foundingDate":{"$gt":datetime(1837,1,1),
                        "$lte":datetime(1837,12,31)}}
query = {"country":{"$ne":"United States"}}
query = {"govermentType":{"$exists":1}}
db.autos.find(query).count()
db.autos.find(query).pretty()
query = {"motto":{"$regex":"[Ff]riendship|[Pp]ride"}}
query = {"modelYears":{"$in":[1965,1966,1967]}}
query = {"modelYears":{"$all":[1965,1966,1967]}}
query = {"dimensions.weight":{"$gt":10000}}

To my surprise, 21st century begins from 2001.1.1

http://www.pcre.org/ Perl Compatible Regular Expressions

Live RegEx tester at regexpal.com.

MongoDB $regex Manual.

Official Python Regular Expression HOWTO.

Another good Python Regular Expressions page.

update/drop

db.cities.save(city)   // replace or add a dict
db.cities.update(query,{"$set"}:dict,multi= True)
db.cities.update(query,{"$unset"}: {"key":""})
db.ciites.drop()
db.cities.remove(query)

aggregation framework

$group aggregation operator documentation.

db.tweets.aggregate(pipeline)
# 3, group & sort operator
pipeline = [
  {"$group":{"_id":"$user.screen_name",
            "count":{"$sum":1}}},
    {"$sort":{"count":-1}}
]
## 8,match & project operator
pipeline =[
  {"$match": query},
  {"$project": {"ratio":
               {"$divide":["$user.followers_count",                                 "$user.friends_count"]},
              "screen_name":"$user.screen_name"}},
  {"$sort":{"ratio":-1}},
    {"$limit":1}
]
## 10, quize
pipeline =[
  {"$match": {"user.time_zone":"Brasilia",
              "user.statuses_count":{"$gte":100}}},
  {"$project": {"followers":"$user.followers_count",
                "tweets":"$user.statuses_count",
              "screen_name":"$user.screen_name"}},
  {"$sort":{"followers":-1}},
    {"$limit":1}
]
# 11, unwind operator: expand single to multiple instances
pipeline = [
    {"$unwind":"$entities.user_mentions"},
  {"$group":{"_id":"$user.screen_name",
            "count":{"$sum":1}}},
    {"$sort":{"count":-1}}
]
pipeline = [
   {"$match": {"country":"India"}},
    {"$unwind":"$isPartOf"},
    {"$group":{"_id": "$isPartOf",
            "count":{"$sum":1}}
    },
    {"$sort":{"count":-1}},
    {"$limit":1}
    ]

other operator

{"$addToSet":"$text"}  # unique
{"$push":"$text"}  # everything

10 problem set

pipeline = [
    {"$match": {"name":{"$ne":None}}},
    {"$group":{"_id": "$name",
        "count":{"$sum":1}}},
    {"$sort":{"count":-1}},
    {"$limit":1}
    ]

project: OpenStreetMap

SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM).

parse XML -> audit dataset -> CSV/JSON-> SQL/MongoDB -> explore

http://www.openstreetmap.org/

OpenStreetMap Documentation

OSM XML Documentation

Open Street Map has customized its own version: OSM XML. Major tags

way: street
node

search tag name use element.find(s), search attribute name use element.get(s)

search() vs. match()

https://docs.python.org/2/library/re.html#search-vs-match

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<_sre.SRE_Match object at ...>

I didn’t find OpenStreetMap particularly interesting so I paused here.