Thursday, April 27, 2017

Intro to Hadoop and MapReduce


course speakers:
  • Sarah Sproehnle, vice president at Cloudera
  • Ian Wrigley, senior curriculum manager at Cloudera
Cloudera was founded in 2008 by three engineers from Google, Yahoo and Facebook. It develops Apache Hadoop and provides related services. As early as 2003, Doug Cutting was inspired by Google's papers on its distributed file system (GFS) and its processing framework, MapReduce. He wrote the initial Hadoop software with partner Mike Cafarella, and Yahoo later backed the project.
Since 2012, many companies such as Oracle, Dell, Intel, SAS, Microsoft and Internet-of-Things start-ups have announced partnerships with Cloudera. In March 2017, Cloudera filed for an IPO.

1 Intro

IBM: 90% of the world's data was created in the last 2 years alone.
challenges with big data:
  1. data is created fast
  2. data from different sources in various formats
3Vs:
  1. volume,
  • variety (unstructured, raw formats instead of tidy SQL tables)
  3. velocity
All kinds of data are worth storing: transactions, logs, business, user, sensor, medical, social.
The key is which data interests you most: science, e-commerce, financial, medical, sports, social, utilities.
Core Hadoop is storage in HDFS plus processing with MapReduce, but Hadoop has since grown into an ecosystem. Helper tools such as Hive and Pig turn SQL-like queries into MapReduce code and run it on the cluster; because they sit on top of MapReduce, they are slow. Impala queries data in HDFS directly. Other ecosystem projects include Sqoop, Flume, HBase, Hue, Oozie and Mahout (machine learning). Making them talk to one another and work well together can be tricky; CDH (Cloudera's distribution of Hadoop) packages all of these things together, which makes life much easier.

2 HDFS and MapReduce

To avoid accidental data loss in the cluster, files are protected two ways: data redundancy (2 extra copies of each block, 3 in total) and a standby NameNode (1 extra copy of the metadata).
Hadoop's block size is set to 64 MB by default, whereas most filesystems have block sizes of 16 KB or less.
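As a back-of-the-envelope sketch of what these defaults mean for storage (the 1 GB file is my own made-up example; the 64 MB blocks and 3-way replication are the defaults above):

import math

file_size_mb = 1024    # hypothetical 1 GB file (my example, not from the course)
block_size_mb = 64     # HDFS default block size
replication = 3        # one original + two redundant copies

num_blocks = math.ceil(file_size_mb / float(block_size_mb))  # 16 blocks
total_stored_mb = file_size_mb * replication                 # 3072 MB across the cluster
print("{0:.0f} blocks, {1} MB stored in total".format(num_blocks, total_stored_mb))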

setup

Instructions on how to download and run the virtual machines here.
Information on how to transfer files back and forth to the virtual machine can be found here.
Download the zip file, which contains a 4.2 GB .vmdk file, and import it into VirtualBox. In Settings: Network -> Attached to: Bridged Adapter.
Once you press 'Start', you enter the Hadoop system. However, with ifconfig I couldn't get an 'inet addr' for eth1, so I couldn't open an SSH connection or use scp to transfer data between the VM and the host.
typical Hadoop commands:
hadoop fs -ls                            # list files in HDFS
hadoop fs -put purchases.txt             # copy a local file into the cluster
hadoop fs -tail purchases.txt            # display the end of the file
hadoop fs -mv purchases.txt newname.txt  # rename
hadoop fs -rm newname.txt                # delete a file
head -50 ../data/purchases.txt > testfile         # grab the first 50 lines for local testing
cat testfile | ./mapper.py | sort | ./reducer.py  # simulate the full pipeline locally
hs mapper.py reducer.py myinput output2           # course VM alias that submits the job via Hadoop Streaming

MapReduce

A plain dictionary approach on a single machine will take a long time and may run out of memory.
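For concreteness, here is that naive single-machine version (my own sketch, using the purchases.txt layout from the commands above):

# Naive single-machine tally: the whole dict of store totals must fit in memory.
totals = {}
for line in open("purchases.txt"):
    data = line.strip().split("\t")
    if len(data) == 6:
        date, time, store, item, cost, payment = data
        totals[store] = totals.get(store, 0.0) + float(cost)
for store, total in totals.items():
    print("{0}\t{1}".format(store, total))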
MapReduce:
  • Mappers: each reads a chunk of the input and emits intermediate key-value pairs
  • Reducers: each receives all the values for its subset of keys, already grouped and sorted, and aggregates them
  • Task trackers: run the individual map and reduce tasks on each node
  • Job tracker: splits the job into tasks and schedules them across the cluster
# mapper.py and reducer.py on the course VM; both need import sys
import sys

def mapper():
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, store, item, cost, payment = data
            # emit only for well-formed lines (the print must sit inside the if)
            print("{0}\t{1}".format(store, cost))

def reducer():
    salesTotal, oldKey = 0, None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        thisKey, thisSale = data
        # input arrives sorted by key, so a key change means the old key is done
        if oldKey and oldKey != thisKey:
            print("{0}\t{1}".format(oldKey, salesTotal))
            salesTotal = 0
        oldKey = thisKey
        salesTotal += float(thisSale)
    if oldKey:
        print("{0}\t{1}".format(oldKey, salesTotal))
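To see why the reducer can assume sorted input, here is a minimal local driver (my own sketch, not course code) that mimics the map -> sort -> reduce pipeline, equivalent to the cat | mapper | sort | reducer test above; it assumes mapper.py and reducer.py are executable scripts:

import subprocess

# Map phase: stream the test file through the mapper.
with open("testfile") as f:
    mapped = subprocess.check_output(["./mapper.py"], stdin=f)

# "Shuffle and sort": Hadoop guarantees each reducer sees its keys in sorted,
# contiguous runs -- exactly what the oldKey logic above relies on.
shuffled = b"".join(sorted(mapped.splitlines(True)))

# Reduce phase: feed the sorted pairs to the reducer.
p = subprocess.Popen(["./reducer.py"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(shuffled)
print(out.decode())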

MapReduce Design Patterns

book: MapReduce Design Patterns by Donald Miner and Adam Shook (O'Reilly, 2012), about $30
  • filter patterns: bloom filter, sampling filter (sampling is sketched after this list)
  • summarization pattern: counting, statistics
  • structural pattern: combining data sets
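As one concrete instance, the sampling filter is just a map-only job; a minimal sketch (my own, with an arbitrary 1% rate):

#!/usr/bin/env python
# Sampling-filter pattern: a map-only job that keeps roughly 1% of input lines.
# The rate is my arbitrary choice; a pure filter needs no reducer.
import random
import sys

SAMPLE_RATE = 0.01
for line in sys.stdin:
    if random.random() < SAMPLE_RATE:
        sys.stdout.write(line)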
I will save lessons 7 and 8 for future practice. To get good at big data, this little intro course is nowhere near enough; there will be a lot of difficult concepts and hard work ahead.

Real-time analytics with Apache Storm

Types of analytics:
  • cube analytics: business intelligence
  • predictive analytics: statistics and machine learning
  • realtime: streaming or interactive
  • batch: offline processing of data accumulated over time
Hadoop: big batch processing
Storm: fast, reactive, real-time processing
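A toy contrast (my own sketch, plain Python rather than Storm): batch computes once after all the data is in, streaming keeps a running answer per event:

from collections import Counter

events = ["apple", "banana", "apple"]   # stand-in for an endless event stream

# Batch (Hadoop style): wait until all data is in, then compute once.
print(Counter(events))

# Streaming (Storm style): update state per event; an answer is always available.
counts = Counter()
for word in events:
    counts[word] += 1
    print("{0} -> {1}".format(word, counts[word]))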
Apache Storm Site with Documentation: https://storm.apache.org/

setup

Step 1) (Apple OSX) Install VirtualBox for your operating system: https://www.virtualbox.org/wiki/Downloads
Step 2) (Apple OSX) Install Vagrant for your operating system: https://www.vagrantup.com/
git clone https://github.com/Udacity/ud381
cd ud381
vagrant up   # the first run downloads a ~2 GB .vmdk file into the VirtualBox folder
vagrant ssh
storm version  # 0.9.2-incubating
cd ..
cd ..
cd vagrant  # this is a shared folder 
logout
vagrant ssh
cd /vagrant
cd lesson1/stage1
mvn clean
mvn package
tree
storm jar target/udacity-storm-lesson1_stage1-0.0.1-SNAPSHOT-jar-with-dependencies.jar udacity.storm.ExclamationTopology
There are a lot of Java implementations in the later lessons, and I am not particularly interested in collecting tweets.

Data visualization with tableau

book: The Visual Display of Quantitative Information by Edward Tufte.
visual encoding:
  1. lines for trends; if there is no trend, use bars to compare groups
  2. histogram is a bar plot where a variable is binned into ranges
  3. violin plot = box plot + kernel density estimation
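A quick matplotlib sketch of points 2 and 3 (the height data is made up):

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=170, scale=10, size=500)   # made-up height data

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=20)            # histogram: the variable binned into ranges
ax1.set_title("histogram")
ax2.violinplot(data)               # kernel density outline...
ax2.boxplot(data, widths=0.1)      # ...with a box plot overlaid
ax2.set_title("violin + box")
plt.show()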
chart suggestion: (the course's chart-chooser diagram is not reproduced here)

Tableau

Tableau is interactive data-visualization software for business intelligence. It was founded in 2003 to commercialize research from Stanford University's computer science department.
Subscription prices range from $35 to $70 per month.
It supports Excel, text, JSON and statistical files.

Union vs Join

            Union                            Join
operation   drag right below the 1st table   drag elsewhere
columns     combined without merging         common fields merged
result      more verbose                     more concise
The default join is an inner join, which keeps only rows that share a common value; a left join keeps all rows from the original (left) table.
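The same semantics are easy to check in pandas (my analogy, not from the course): concat is Tableau's union, merge is its join, and how='inner' vs how='left' matches the note above:

import pandas as pd

jan = pd.DataFrame({"store": ["A", "B"], "sales": [100, 200]})
feb = pd.DataFrame({"store": ["A", "C"], "sales": [150, 50]})
regions = pd.DataFrame({"store": ["A"], "region": ["east"]})

union = pd.concat([jan, feb])                        # union: rows stacked, columns untouched
inner = jan.merge(regions, on="store", how="inner")  # inner join: only store A survives
left = jan.merge(regions, on="store", how="left")    # left join: B kept, its region is NaN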

Dimensions vs Measures

Dimensions are typically discrete values, like category, city, date, region.
Measures are typically continuous values, like profit, quantity, height, age.
There is some overlap, and a field can be moved from one role to the other.
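A rough pandas illustration of that split (my own heuristic, not Tableau's actual rule): non-numeric columns tend to be dimensions, numeric ones measures:

import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF"],                  # dimension: discrete
    "date": ["2017-04-01", "2017-04-02"],  # dimension
    "profit": [120.5, 98.0],               # measure: continuous
    "quantity": [3, 7],                    # measure
})

dimensions = df.select_dtypes(exclude="number").columns.tolist()
measures = df.select_dtypes(include="number").columns.tolist()
print(dimensions, measures)   # ['city', 'date'] ['profit', 'quantity']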

3 working modes

  1. sheet: drag the x value to “Columns”, the y value to “Rows”, and label values to “Marks” or “Filters”
  2. dashboard: drop multiple sheets together to address one question
  3. story: a slide-deck-like experience for telling a story.

Remark

This is an elegant and powerful piece of software, seemingly designed with web-based use in mind. In my public version I can't export graphs; everything has to be stored in Tableau's cloud. Maybe that is because the mouse-over details are encoded in the data, so a stand-alone image would lose its interactivity.