Sunday, May 21, 2017

Hive JDBC in Cloudera Hadoop


I am going to stick with the Cloudera Quickstart VM, which saves me a lot of time on buggy, messy configuration. Now I will try to bridge the gap between Hive and unstructured data via JDBC. Forget Python for a moment: Java is the native language of Hadoop.

1. install the Hive JDBC driver

$ sudo yum install hive-jdbc  # Red Hat / CentOS
Then add /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to the classpath.

2. write the Java code

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class etl {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException ex) {
            System.out.println("Error: unable to load driver class!");
            System.exit(1);
        }
        // get connection; user and password are ignored in non-secured mode
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "cloudera", "cloudera");
        Statement stmt = con.createStatement();
        // execute statement
        ResultSet res = stmt.executeQuery("SELECT * FROM employee");
        System.out.println("Result:");
        System.out.println(" ID \t Name \t Salary \t Designation  ");
        while (res.next()) {
            System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " + res.getString(4));
        }
        con.close();
    }
}
notes:
  1. driverName is for HiveServer2; the older HiveServer version uses a longer driver name (org.apache.hadoop.hive.jdbc.HiveDriver).
  2. The table employee is prepared as in my previous blog post or https://www.tutorialspoint.com/hive/
  3. The connection URL is "jdbc:hive2://localhost:10000/default"; in non-secured mode the username and password can be empty strings.
  4. The Java ResultSet class is quite different from a Python cursor. It provides typed getter methods such as getInt, getString and getDouble, each taking a column index numbered from 1, which reflects Java's strict type enforcement. The next() method moves the cursor to the next row, so a while loop can iterate through the result set (see the short Python sketch below for contrast).
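For contrast, here is roughly what the same loop looks like through a Python DB-API style cursor. This is only a sketch, borrowing the pyhs2 connection settings that appear later in this post, so treat the names and parameters as assumptions rather than a tested recipe:

import pyhs2  # the Python client that worked for me later in this post

# Sketch only: the same query as the Java example, via a cursor.
conn = pyhs2.connect(host='localhost', port=10000, authMechanism="PLAIN",
                     user='cloudera', password='cloudera', database='default')
cur = conn.cursor()
cur.execute("select * from employee")
for row in cur.fetch():                    # each row comes back as a plain list
    print row[0], row[1], row[2], row[3]   # 0-based indexing, no typed getters
conn.close()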

3. compile and run

This is the most buggy part. On StackOverflow, someone suggested running it as:
javac -cp . etl.java
java -cp . etl
But no matter what I tried (changing the configuration here and there, tinkering with the code here and there), I always got a java.lang.ClassNotFoundException. I guess the reason is that the Java compiler or JVM doesn't pick up the classpath the way I assumed.
Fortunately, I have a friend doing a PhD on Hadoop. He simply uses an IDE to add the classpath instead of writing it in .bashrc. To be more specific:
  1. In Eclipse: New -> Java Project -> Java Build Path -> Add External JARs
  2. Browse to /usr/lib/hive/lib/ and add all the JARs
  3. Run As -> Java Application

MySQL

This is another topic, but I will keep it here in case I need it in the future.
brew install mysql
We've installed your MySQL database without a root password. To secure it run:
    mysql_secure_installation

MySQL is configured to only allow connections from localhost by default

To connect run:
    mysql -uroot

To have launchd start mysql now and restart at login:
  brew services start mysql
Or, if you don't want/need a background service you can just run:
  mysql.server start
==> Summary
🍺  /usr/local/Cellar/mysql/5.7.18_1: 321 files, 234.5MB

Hive Python API installation in RedHat

The Cloudera Quickstart VM is based on CentOS, a free rebuild of the Red Hat distribution. What you get from a free meal is basic stuff for a quick demo. The syntax of Hive is pretty similar to normal SQL, but the problem is how to efficiently transform real-world data into an organized structure so you can feed it into the Hadoop world. You will have to use general-purpose languages such as Python and Java to cleanse the unstructured data.

install sublime

The default text editor vi is ugly and difficult to use. Let's go with Sublime.
Download it at www.sublimetext.com/3 and go for the tarball next to Ubuntu 64 bit. It's only 9 MB.
cd Downloads
tar -vxjf sublime_text_xxx.tar.bz2
nano .bash_profile   # add next line, save and exit
alias subl="~/Downloads/sublime_text_3/sublime_text"
source .bash_profile # load bash script
subl  # enjoy!
Note that the CentOS terminal doesn't automatically source .bash_profile when it opens. Go to Edit -> Profile Preferences -> Title and Command and check "Run command as a login shell".

show module path in 3 ways

python -c 'import sys; print "\n".join(sys.path)'  # show the module search path
pip show numpy  # show the install path for a specific module
import numpy
help(numpy)
I quickly found out how Anaconda lays out each module, e.g.:
anaconda/envs/py3/lib/python3.6/site-packages/matplotlib/pyplot.py

python API: PyHive

I first tried to upgrade the license from "express" to "enterprise" and add the Anaconda parcel in Cloudera Manager, but my computer became extremely slow and the conda command still didn't work.
My second attempt was sudo yum install -y python27, because the default Python is the outdated 2.6, which pandas no longer supports. But I couldn't figure out where python2.7 was installed. Then I downloaded the Python 2.7 source tgz from the official site: https://www.python.org/downloads/release/python-2713/
sudo mv ~/Downloads/Python-2.7.13.tgz /usr/src
cd /usr/src
sudo tar xzf Python-2.7.13.tgz
cd Python-2.7.13
sudo ./configure
sudo make altinstall  # does not replace the default /usr/bin/python
which python2.7
But I realized I have to use Anaconda to manage the modules, because I couldn't get anything useful done without it. So comes my third solution, Anaconda: https://docs.continuum.io/anaconda/install-linux
Note: download the Python 2.7 version because:
  1. PyHive requires Python 2.7
  2. virtualenv doesn't work well on CentOS
cd Downloads
bash Anaconda2-*.sh
... will be installed at /home/cloudera/Anaconda2
Prepending PATH=/home/cloudera/Downloads/enter/bin to PATH in /home/cloudera/.bashrc
A backup will be made to: /home/cloudera/.bashrc-anaconda2.bak
# exit shell and reopen
which python
conda install pyhive thrift
conda install -c blaze sasl=0.2.1
conda install -c conda-forge thrift_sasl=0.2.1
Sadly, the hidden caveat is that something called "GLIBC 2.14" is missing. What's more, PyHive seemed to be built for HiveServer rather than HiveServer2.
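For the record, here is roughly the call pattern I was aiming for, a minimal sketch based on PyHive's documented DB-API interface (untested on this VM, since the install never got past the GLIBC issue):

from pyhive import hive  # PyHive's HiveServer2 client

# Sketch only: connect to HiveServer2 on the default port and run one query.
conn = hive.connect(host='localhost', port=10000,
                    username='cloudera', database='default')
cur = conn.cursor()
cur.execute('SELECT * FROM employee')
for row in cur.fetchall():
    print row  # Python 2 print, matching the rest of this post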

python API: pyhs2

Then I planned to try https://github.com/BradRuderman/pyhs2. Although the author stopped maintenance 3 years ago, this module works amazingly well.
It still needs the Anaconda Python 2.7 environment:
sudo yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
sudo pip install pyhs2
minimum viable code:
import pyhs2
conn = pyhs2.connect(host='localhost',port=10000,
               authMechanism="PLAIN",user='cloudera',
               password='cloudera',database='default')
cur = conn.cursor()
print cur.getDatabases() # Show databases

cur.execute("select * from employee") # Execute query
print cur.getSchema() # Return column info from query

#Fetch table results
for i in cur.fetch():
    print i
Note that the table employee is prepared as described at https://www.tutorialspoint.com. You can do it in the Hive command line or in CDH Hue.
create table
create table if not exists employee (eid int, name String, salary string, destination string)
comment 'Employee details'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
The table is stored in HDFS at /user/hive/warehouse.
Prepare the data in a sample.txt file (tab-separated):
1201    Gopal   45000   Technical manager
1202    Manisha 45000   Proof reader
1203    Masthanvali     40000   Technical writer
1204    Kiran   40000   Hr Admin
1205    Kranthi 30000   Op Admin
load data
load data local inpath
'/home/cloudera/Downloads/sample.txt' overwrite into table employee;
table metadata operations
show tables;
show tables '.*s';  -- tables ending with 's' (Java regex)
describe employee;  -- show the list of columns
alter table employee rename to personnel;
alter table employee add columns (age int);

set hive.cli.print.header=true;
select * from employee limit 10;

python API: impyla

install
sudo pip install impyla
sudo pip install thrift==0.9.3
try
from impala.dbapi import connect
conn = connect(host='localhost', port=21050)
cursor = conn.cursor()
print conn
cursor.execute('show tables;')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()
The code runs without error, but it doesn't connect to the right database. Maybe there is some database connection catch somewhere.
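One thing worth trying next (untested, so just a sketch): port 21050 above is the Impala daemon, not HiveServer2, and impyla's connect() also accepts a database argument and can talk to HiveServer2 directly. Pointing it at port 10000 with PLAIN authentication, the same settings that worked for pyhs2, might land it in the right place:

from impala.dbapi import connect

# Untested sketch: target HiveServer2 (port 10000) instead of the Impala
# daemon (port 21050), reusing the credentials that worked with pyhs2.
conn = connect(host='localhost', port=10000,
               auth_mechanism='PLAIN', user='cloudera',
               password='cloudera', database='default')
cursor = conn.cursor()
cursor.execute('show tables')
print cursor.fetchall()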

Last catch

I realize the Python APIs don't perform well (execution speed, documentation, community support, etc.) because the native language of Hadoop is Java. So my next stop is JDBC.