Thursday, April 27, 2017

Intro to Hadoop and MapReduce


course speakers:
  • Sarah Sproehnle, vice president at Cloudera
  • Ian Wrigley, senior curriculum manager at Cloudera
Cloudera was founded in 2008 by three engineers from Google, Yahoo and Facebook. It develops Apache Hadoop and provides related services. As early as 2003, Doug Cutting was inspired by Google's papers on its distributed file system (GFS) and its processing framework, MapReduce. He wrote the initial Hadoop software with partner Mike Cafarella, and Yahoo later backed the project.
Since 2012, many companies such as Oracle, Dell, Intel, SAS, Microsoft and Internet-of-Things start-ups have announced partnerships with Cloudera. In March 2017, Cloudera filed for an IPO.

1 Intro

IBM: 90% of the world's data was created in the last 2 years alone.
challenges with big data:
  1. data is created fast
  2. data from different sources in various formats
3Vs:
  1. volume,
  • variety (unstructured, raw formats instead of tidy SQL tables)
  3. velocity
All kinds of data are worth storing: transactions, logs, business, user, sensor, medical, social.
The key is which data interests you most: science, e-commerce, financial, medical, sports, social, utilities.
Core Hadoop is storage in HDFS plus processing with MapReduce, but Hadoop has since grown into an ecosystem. Helper tools such as Hive and Pig turn SQL-like queries into MapReduce code and run it on the cluster; because they sit on top of MapReduce, they are slow. Impala queries data in HDFS directly. Other ecosystem projects include Sqoop, Flume, HBase, Hue, Oozie and Mahout (machine learning). Making them talk to one another and work well together can be tricky; CDH (Cloudera's distribution of Hadoop) packages all of these things together, which makes life much easier.

2 HDFS and MapReduce

To avoid accidental data loss in the cluster, files are protected two ways: data redundancy (2 extra copies of each block, 3 in total) and a standby NameNode (1 extra copy of the metadata).
Hadoop's block size is set to 64 MB by default, whereas most filesystems have block sizes of 16 KB or less.
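As a back-of-the-envelope sketch of what these defaults mean for storage (the 1 GB file is my own made-up example; the 64 MB blocks and 3-way replication are the defaults above):

import math

file_size_mb = 1024    # hypothetical 1 GB file (my example, not from the course)
block_size_mb = 64     # HDFS default block size
replication = 3        # one original + two redundant copies

num_blocks = math.ceil(file_size_mb / float(block_size_mb))  # 16 blocks
total_stored_mb = file_size_mb * replication                 # 3072 MB across the cluster
print("{0:.0f} blocks, {1} MB stored in total".format(num_blocks, total_stored_mb))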

setup

Instructions on how to download and run the virtual machines here.
Information on how to transfer files back and forth to the virtual machine can be found here.
Download the zip file, which contains a 4.2 GB .vmdk file, and import it into VirtualBox. In Settings: Network -> Attached to: Bridged Adapter.
Once you press 'Start', you enter the Hadoop system. However, with ifconfig I couldn't get an 'inet addr' for eth1, so I couldn't open an SSH connection or use scp to transfer data between the VM and the host.
typical Hadoop commands:
hadoop fs -ls                            # list files in HDFS
hadoop fs -put purchases.txt             # copy a local file into the cluster
hadoop fs -tail purchases.txt            # display the end of the file
hadoop fs -mv purchases.txt newname.txt  # rename
hadoop fs -rm newname.txt                # delete a file
head -50 ../data/purchases.txt > testfile         # grab the first 50 lines for local testing
cat testfile | ./mapper.py | sort | ./reducer.py  # simulate the full pipeline locally
hs mapper.py reducer.py myinput output2           # course VM alias that submits the job via Hadoop Streaming

MapReduce

A plain dictionary approach on a single machine will take a long time and may run out of memory.
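For concreteness, here is that naive single-machine version (my own sketch, using the purchases.txt layout from the commands above):

# Naive single-machine tally: the whole dict of store totals must fit in memory.
totals = {}
for line in open("purchases.txt"):
    data = line.strip().split("\t")
    if len(data) == 6:
        date, time, store, item, cost, payment = data
        totals[store] = totals.get(store, 0.0) + float(cost)
for store, total in totals.items():
    print("{0}\t{1}".format(store, total))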
MapReduce:
  • Mappers: each reads a chunk of the input and emits intermediate key-value pairs
  • Reducers: each receives all the values for its subset of keys, already grouped and sorted, and aggregates them
  • Task trackers: run the individual map and reduce tasks on each node
  • Job tracker: splits the job into tasks and schedules them across the cluster
# mapper.py and reducer.py on the course VM; both need import sys
import sys

def mapper():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, store, item, cost, payment = data
            # emit only for well-formed lines (the print must sit inside the if)
            print("{0}\t{1}".format(store, cost))

def reducer():
    salesTotal, oldKey = 0, None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        thisKey, thisSale = data
        # input arrives sorted by key, so a key change means the old key is done
        if oldKey and oldKey != thisKey:
            print("{0}\t{1}".format(oldKey, salesTotal))
            salesTotal = 0
        oldKey = thisKey
        salesTotal += float(thisSale)
    if oldKey:
        print("{0}\t{1}".format(oldKey, salesTotal))
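To see why the reducer can assume sorted input, here is a minimal local driver (my own sketch, not course code) that mimics the map -> sort -> reduce pipeline, equivalent to the cat | mapper | sort | reducer test above; it assumes mapper.py and reducer.py are executable scripts:

import subprocess

# Map phase: stream the test file through the mapper.
with open("testfile") as f:
    mapped = subprocess.check_output(["./mapper.py"], stdin=f)

# "Shuffle and sort": Hadoop guarantees each reducer sees its keys in sorted,
# contiguous runs -- exactly what the oldKey logic above relies on.
shuffled = b"".join(sorted(mapped.splitlines(True)))

# Reduce phase: feed the sorted pairs to the reducer.
p = subprocess.Popen(["./reducer.py"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(shuffled)
print(out.decode())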

MapReduce Design Patterns

book: MapReduce Design Patterns by Donald Miner and Adam Shook (O'Reilly, 2012), about $30
  • filter patterns: bloom filter, sampling filter (sampling is sketched after this list)
  • summarization pattern: counting, statistics
  • structural pattern: combining data sets
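As one concrete instance, the sampling filter is just a map-only job; a minimal sketch (my own, with an arbitrary 1% rate):

#!/usr/bin/env python
# Sampling-filter pattern: a map-only job that keeps roughly 1% of input lines.
# The rate is my arbitrary choice; a pure filter needs no reducer.
import random
import sys

SAMPLE_RATE = 0.01
for line in sys.stdin:
    if random.random() < SAMPLE_RATE:
        sys.stdout.write(line)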
I will save lessons 7 and 8 for future practice. To get good at big data, this little intro course is nowhere near enough; there will be a lot of difficult concepts and hard work ahead.

Real-time analytics with Apache Storm

Types of analytics:
  • cube analytics: business intelligence
  • predictive analytics: statistics and machine learning
  • realtime: streaming or interactive
  • batch: offline processing of data accumulated over time
Hadoop: big batch processing
Storm: fast, reactive, real-time processing
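A toy contrast (my own sketch, plain Python rather than Storm): batch computes once after all the data is in, streaming keeps a running answer per event:

from collections import Counter

events = ["apple", "banana", "apple"]   # stand-in for an endless event stream

# Batch (Hadoop style): wait until all data is in, then compute once.
print(Counter(events))

# Streaming (Storm style): update state per event; an answer is always available.
counts = Counter()
for word in events:
    counts[word] += 1
    print("{0} -> {1}".format(word, counts[word]))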
Apache Storm Site with Documentation: https://storm.apache.org/

setup

Step 1) (Apple OSX) Install VirtualBox for your operating system: https://www.virtualbox.org/wiki/Downloads
Step 2) (Apple OSX) Install Vagrant for your operating system: https://www.vagrantup.com/
git clone https://github.com/Udacity/ud381
cd ud381
vagrant up   # the first run downloads a ~2 GB .vmdk file into the VirtualBox folder
vagrant ssh
storm version  # 0.9.2-incubating
cd ..
cd ..
cd vagrant  # this is a shared folder 
logout
vagrant ssh
cd /vagrant
cd lesson1/stage1
mvn clean
mvn package
tree
storm jar target/udacity-storm-lesson1_stage1-0.0.1-SNAPSHOT-jar-with-dependencies.jar udacity.storm.ExclamationTopology
There are a lot of Java implementations in the later lessons, and I am not particularly interested in collecting tweets.

Data visualization with tableau

book: The Visual Display of Quantitative Information by Edward Tufte.
visual encoding:
  1. lines for trends; if there is no trend, use bars to compare groups
  2. histogram is a bar plot where a variable is binned into ranges
  3. violin plot = box plot + kernel density estimation
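A quick matplotlib sketch of points 2 and 3 (the height data is made up):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=170, scale=10, size=500)   # made-up height data

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=20)            # histogram: the variable binned into ranges
ax1.set_title("histogram")
ax2.violinplot(data)               # kernel density outline...
ax2.boxplot(data, widths=0.1)      # ...with a box plot overlaid
ax2.set_title("violin + box")
plt.show()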
chart suggestion: (the course's chart-chooser diagram is not reproduced here)

Tableau

Tableau is interactive data-visualization software for business intelligence. It was founded in 2003 to commercialize research from Stanford University's computer science department.
Subscription prices range from $35 to $70 per month.
It supports Excel, text, JSON and statistical files.

Union vs Join

            Union                            Join
operation   drag right below the 1st table   drag elsewhere
columns     combined without merging         common fields merged
result      more verbose                     more concise
The default join is an inner join, which keeps only rows that share a common value; a left join keeps all rows from the original (left) table.
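The same semantics are easy to check in pandas (my analogy, not from the course): concat is Tableau's union, merge is its join, and how='inner' vs how='left' matches the note above:

import pandas as pd

jan = pd.DataFrame({"store": ["A", "B"], "sales": [100, 200]})
feb = pd.DataFrame({"store": ["A", "C"], "sales": [150, 50]})
regions = pd.DataFrame({"store": ["A"], "region": ["east"]})

union = pd.concat([jan, feb])                        # union: rows stacked, columns untouched
inner = jan.merge(regions, on="store", how="inner")  # inner join: only store A survives
left = jan.merge(regions, on="store", how="left")    # left join: B kept, its region is NaN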

Dimensions vs Measures

Dimensions are typically discrete values, like category, city, date, region.
Measures are typically continuous values, like profit, quantity, height, age.
There is some overlap, and a field can be moved from one role to the other.
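A rough pandas illustration of that split (my own heuristic, not Tableau's actual rule): non-numeric columns tend to be dimensions, numeric ones measures:

import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF"],                  # dimension: discrete
    "date": ["2017-04-01", "2017-04-02"],  # dimension
    "profit": [120.5, 98.0],               # measure: continuous
    "quantity": [3, 7],                    # measure
})

dimensions = df.select_dtypes(exclude="number").columns.tolist()
measures = df.select_dtypes(include="number").columns.tolist()
print(dimensions, measures)   # ['city', 'date'] ['profit', 'quantity']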

3 working modes

  1. sheet: drag the x value to “Columns”, the y value to “Rows”, and label values to “Marks” or “Filters”
  2. dashboard: drop multiple sheets together to address one question
  3. story: a slide-deck-like experience for telling a story.

Remark

This is an elegant and powerful piece of software, seemingly designed with web-based use in mind. In my public version I can't export graphs; everything has to be stored in Tableau's cloud. Maybe that is because the mouse-over details are encoded in the data, so a stand-alone image would lose its interactivity.