Sunday, May 14, 2017

Hadoop Hive setup with the Cloudera QuickStart VM


For beginners of Hadoop and Hive, a good starting point is the Cloudera QuickStart VM, because a manual installation's tricky configuration and overwhelming warnings can scare beginners off. Steps:
  1. Download VirtualBox.
  2. Download the Cloudera QuickStart VM at https://www.cloudera.com/downloads/
  3. Set up the VM by importing the .ovf file. (I wasted time attempting a manual setup.)
  4. Start the VM.
  5. In the Firefox browser that pops up, go through the quickstart.cloudera tutorial to familiarize yourself with popular Hadoop frameworks/tools/concepts: Hue, Hive, the file browser, Sqoop, Impala, Parquet.
The following are my learning notes for the tutorial.

1. Ingest and Query Relational Data

Use Apache Sqoop to load relational data from MySQL into HDFS. With a few additional parameters, the relational data is ready to be queried by Impala, stored in Apache Parquet, a Hadoop-optimized file format.
sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import
This launches MapReduce jobs to pull the data from our MySQL database and write it to HDFS, distributed across the cluster in Apache Parquet format. Parquet is a format designed for analytical applications on Hadoop: instead of grouping your data into rows like typical data formats, it groups your data into columns. This is ideal for many analytical queries, which read a few columns across many records rather than every field of a few specific records.
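For instance, an aggregate that touches only two of order_items' six columns (a hypothetical query against the imported retail_db tables, not taken verbatim from the tutorial) only has to read those two column chunks from the Parquet files rather than reconstructing whole rows:

-- reads just 2 of the 6 columns in order_items
select order_item_product_id,
       sum(order_item_subtotal) as revenue
from order_items
group by order_item_product_id
order by revenue desc
limit 10;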
Hue provides a web-based interface for many of the tools in CDH at quickstart.cloudera:8888. In the QuickStart VM, the Hue administrator username and password are both 'cloudera'.
Note that we told Sqoop to import the data into Hive but used Impala to query it. This works because Hive and Impala share both data files and table metadata. Hive compiles SQL queries into MapReduce jobs, which makes it very flexible, whereas Impala executes queries itself and is built from the ground up to be as fast as possible, which makes it better for interactive analysis. We'll use Hive later for an ETL (extract-transform-load) workload.
Simply put, Hive aims for compatibility and Impala aims for speed.
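For example, right after the Sqoop import you can open the Impala query editor in Hue and run a quick check. A minimal sketch (invalidate metadata tells Impala to reload metadata for tables created outside of it, such as the ones Sqoop just registered in the metastore):

invalidate metadata;
show tables;

-- the freshly imported tables are immediately queryable
select * from categories limit 10;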

2. Correlate Structured Data with Unstructured Data

Use Hive to parse the unstructured log data and use Impala to query it. First, create a table over the raw logs by parsing each line with a regular expression:
CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
Next, create a second table, tokenized_access_logs, to hold the parsed fields.
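A sketch of the DDL, assuming a plain comma-delimited layout with the same columns as the intermediate table (the tutorial's exact statement may differ):

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

Then copy the parsed records over: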
INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
This runs as a MapReduce job and yields a table of about 160k records. Once it is ready, we can use Impala for the query:
select count(*), url
from tokenized_access_logs
where url like '%\/product\/%'
group by url
order by count(*) desc;
Then analyze why some of the most-viewed products don't sell well.

3. Relationship strength analytics using Spark

The tool in CDH best suited for quick analytics on object relationships is Apache Spark.

4. Explore Log Events Interactively

Learned how to use Cloudera Search to explore data in real time, using Flume, Solr, and morphlines.

5. Hue Dashboard for data visualization

The plots are relatively simple, perhaps because this is the free version.

short history of Hadoop

  • In March 2014, Intel invested $740 million in Cloudera for an 18% stake. Intel shared its roadmap with Cloudera so Cloudera could develop software that maximizes its chips' performance. With cash piled up from selling equity to Intel and other investors, Cloudera is able to acquire other companies when necessary.
  • Be super evangelical and run technology-education projects in order to win early customers.
  • Nobody buys a database because it is easy to manage; people buy applications. They have important business, mission, or operational problems to solve, so they buy the software that lets a non-programmer do the job. That requires applications, tools, and systems integrated together, stacked on top of the database.
  • That dynamic has been very much alive in this ecosystem since the mid-1980s. People bring existing skills to the new platform: BI, data-analytics reporting, and machine learning you could never do before but can do now.
  • Banks, hospitals, and retail stores take security very seriously. If you break into a Yahoo cluster and steal a news story, who cares? Break into a bank's big-data platform and steal some transaction data, and that's a big issue.
  • The question is ill-formed: open source is a distribution model, a license model, and a development model, but not a business model. You can't build a long-term, sustainable, pure open-source business; open-source projects often get acquired by big companies that have other revenue streams.
  • My competitors will tell you that I'm a nefarious guy trying to lock our customers in because of my evil designs on their wallets. I suspect so; I've heard rumors. I am trying to lock IBM out. We will always have proprietary IP at Cloudera.
  • Who can afford to hire the thousands of smart people around the world working on this thing (open source)? No single company can compete with that, and we get the benefit.
  • Plan to IPO: Cloudera began trading on the NYSE as CLDR on April 28, 2017, with its IPO priced at $15 a share (the stock closed near $18 on day one).

A New Generation Of Data Scientists

  • Most of the time I don't do the fancy data visualization you see in sci-fi; I do data cleansing and dataset preparation.
  • Big-data economics: no individual record is particularly valuable, but having every record is incredibly valuable.
  • Block sizes: an ordinary file system uses about 4 kB per block, while GFS uses 64 MB chunks and HDFS defaults to 64/128 MB per block.
  • "Here's a bunch of data, find me some insights" is the worst thing that can happen to me. I would say to the business person: tell me the problem you have; in the process of solving it, we will discover the insights. Insights don't come from a vacuum; they come from interesting, meaningful problems.
  • Have a data science team. Nobody is good at all the skills; you have to know every part and talk to everybody on the team.
  • We need the skills not only to analyze data but also to deploy models in production systems; being a data scientist involves some DevOps skills.
  • Don't solve a problem just once: either solve it zero times or solve it repeatedly with multiple models. So choose a good problem first.
  • We are never optimizing a single thing in a data science problem. In an ad-prediction model, for example, the goal is not only to use a machine-learning model to predict clicks, but to increase revenue.
  • Be yourself. Iterate until awesome.
  • Cloudera's training: Hadoop developer training, Hive and Pig training, and an intro to data science (end-to-end problem solving). Recommendation systems show up in every field.

some terminology

ETL (Extract, Transform, Load): a repeatable, programmed data movement.
Extract: get data from the source; usually the most resource-intensive step.
Transform: filter/map/enrich/combine/validate/sort; usually the most difficult step.
Load: store the data in a data warehouse or data mart.
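As a toy end-to-end illustration of these three steps in HiveQL (hypothetical table and column names, not from the tutorial above):

-- Extract: expose raw CSV files already sitting in HDFS as a table
CREATE EXTERNAL TABLE raw_sales (
    sale_id INT,
    sale_date STRING,
    amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/sales';

-- Transform + Load: filter and clean the records, then store them
-- as Parquet in a warehouse table
CREATE TABLE sales_warehouse STORED AS PARQUET AS
SELECT sale_id,
       to_date(sale_date) AS sale_day,
       amount
FROM raw_sales
WHERE amount > 0;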
Apache Spark is a cluster-computing framework (version 1.0 was released on May 30, 2014) that addresses a limitation of the MapReduce paradigm: its rigid, linear data-flow structure. Spark provides a data structure called the resilient distributed dataset (RDD), which facilitates iterative algorithms for data access and analysis.
Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.