Sunday, May 21, 2017

Hive Python API installation in RedHat

The Cloudera QuickStart VM is based on CentOS, a free rebuild of the Red Hat distribution. What you get from a free meal is some basic stuff for a quick demo. Hive's syntax is pretty similar to ordinary SQL, but the real problem is how to efficiently transform real-world data into an organized structure so you can feed it into the Hadoop world. You will have to use general-purpose languages such as Python and Java to cleanse the unstructured data.

install sublime

The default text editor vi is ugly and difficult to use. Let's go Sublime.
Download it at www.sublimetext.com/3 and grab the tarball next to the Ubuntu 64 bit entry. It's only about 9 MB.
cd Downloads
tar -vxjf sublime_text_xxx.tar.bz2
nano .bash_profile   # add next line, save and exit
alias subl="~/Downloads/sublime_text_3/sublime_text"
source .bash_profile # load bash script
subl  # enjoy!
Note that the CentOS terminal doesn't automatically source .bash_profile when it opens. Go to Edit -> Profile Preferences -> Title and Command -> check "Run command as a login shell".

show module path in 3 ways

python -c 'import sys; print "\n".join(sys.path)'  # show the module search path
pip show numpy  # show the install location of a specific module
import numpy
help(numpy)
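# the FILE entry near the end of the help page shows where the module lives on disk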
This quickly shows where Anaconda keeps each module, e.g.:
anaconda/envs/py3/lib/python3.6/site-packages/matplotlib/pyplot.py

python API: PyHive

I first tried to upgrade the license from "express" to "enterprise" and add the Anaconda parcel in Cloudera Manager. But my computer became extremely slow and the conda command still didn't work.
My 2nd attempt was sudo yum install -y python27, because the default Python is the outdated 2.6 version, which has been abandoned by pandas. But I couldn't figure out where Python 2.7 got installed. Then I downloaded the Python 2.7 source tgz from the official site: https://www.python.org/downloads/release/python-2713/
sudo mv ~/Downloads/Python-2.7.13.tgz /usr/src
cd /usr/src
sudo tar xzf Python-2.7.13.tgz
cd Python-2.7.13   # the extracted source directory, not the .tgz
sudo ./configure
sudo make altinstall # not replace default /usr/bin/python
which python2.7
But I realized I have to use Anaconda to manage the modules, because I couldn't get anything useful done without it. So comes my 3rd solution, Anaconda: https://docs.continuum.io/anaconda/install-linux
Note: download the Python 2.7 version because:
  1. PyHive requires Python 2.7
  2. virtualenv doesn't work well in CentOS
cd Downloads
bash Anaconda*.sh   # the downloaded installer script
... will be installed at /home/cloudera/Anaconda2
Prepending PATH=/home/cloudera/Downloads/enter/bin to PATH in /home/cloudera/.bashrc
A backup will be made to: /home/cloudera/.bashrc-anaconda2.bak
# exit shell and reopen
which python
conda install pyhive thrift
conda install -c blaze sasl=0.2.1
conda install -c conda-forge thrift_sasl=0.2.1
Sadly, the hidden caveat is that something called "GLIBC 2.14" is missing. What's more, PyHive seems to be built for HiveServer but not HiveServer2.
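For the record, you can check which glibc the VM actually ships straight from Python (platform.libc_ver is in the standard library), and the connection code I was aiming for looks roughly like the sketch below, based on PyHive's README. It is untested here because of the GLIBC problem, and the host/port/user values are just the QuickStart defaults I'm assuming.
import platform
print platform.libc_ver()  # e.g. ('glibc', '2.12') on CentOS 6, older than the required 2.14

# untested PyHive sketch, assuming HiveServer2 on the default port 10000
from pyhive import hive
conn = hive.connect(host='localhost', port=10000,
                    username='cloudera', database='default')
cur = conn.cursor()
cur.execute('select * from employee limit 5')
print cur.fetchall()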

python API: pyhs2

Then I planned to try https://github.com/BradRuderman/pyhs2. Although the author stopped maintenance 3 years ago, this module works amazingly well.
You still need the Anaconda Python 2.7.
sudo yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
sudo pip install pyhs2
minimum viable code:
import pyhs2
conn = pyhs2.connect(host='localhost',port=10000,
               authMechanism="PLAIN",user='cloudera',
               password='cloudera',database='default')
cur = conn.cursor()
print cur.getDatabases() # Show databases

cur.execute("select * from employee") # Execute query
print cur.getSchema() # Return column info from query

#Fetch table results
for i in cur.fetch():
    print i
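One thing the snippet above skips is cleanup. pyhs2's README also shows a with-statement style that closes the cursor and connection automatically; a minimal variant with the same connection parameters:
with pyhs2.connect(host='localhost', port=10000,
               authMechanism="PLAIN", user='cloudera',
               password='cloudera', database='default') as conn:
    with conn.cursor() as cur:
        cur.execute("select * from employee")
        for row in cur.fetch():
            print row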
Note that the table employee is prepared as in the tutorial at https://www.tutorialspoint.com
You can do it in the Hive command line or in CDH Hue.
create table
create table if not exists employee (eid int, name String, salary string, destination string)
comment 'Employee details'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
The table is stored in HDFS at /user/hive/warehouse.
prepare data in sample.txt file:
1201    Gopal   45000   Technical manager
1202    Manisha 45000   Proof reader
1203    Masthanvali     40000   Technical writer
1204    Kiran   40000   Hr Admin
1205    Kranthi 30000   Op Admin
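Since the table is declared with fields terminated by '\t', the file must really be tab-delimited; copying the rows from a web page often leaves you with plain spaces. A throwaway script like the one below writes the file with real tabs (the path matches the load statement that follows):
# write the sample rows with real tab separators
rows = [
    (1201, 'Gopal', 45000, 'Technical manager'),
    (1202, 'Manisha', 45000, 'Proof reader'),
    (1203, 'Masthanvali', 40000, 'Technical writer'),
    (1204, 'Kiran', 40000, 'Hr Admin'),
    (1205, 'Kranthi', 30000, 'Op Admin'),
]
with open('/home/cloudera/Downloads/sample.txt', 'w') as f:
    for row in rows:
        f.write('\t'.join(str(x) for x in row) + '\n')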
load data
load data local inpath
'/home/cloudera/Downloads/sample.txt' overwrite into table employee;
table metadata operation
show tables;
show tables '.*s';  -- tables ending with 's', Java regex
describe employee; -- show list of columns
alter table employee rename to personnel;
alter table employee add columns (age int);

set hive.cli.print.header=true;
select * from employee limit 10;

python API: impyla

install
sudo pip install impyla
sudo pip install thrift==0.9.3
try
from impala.dbapi import connect
conn = connect(host='localhost', port=21050)
cursor = conn.cursor()
print conn
cursor.execute('show tables;')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()
print results
The code runs without error, but it doesn't connect to the right database. Maybe there is some database connection catch somewhere.
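My guess is that the catch is the port: 21050 is the Impala daemon, while HiveServer2 listens on 10000. If that is it, pointing impyla at HiveServer2 with PLAIN auth and the same credentials as the pyhs2 example should work; this is untested here, and the parameters are taken from impyla's documented connect() options.
from impala.dbapi import connect

# untested: connect to HiveServer2 instead of the Impala daemon
conn = connect(host='localhost', port=10000,
               auth_mechanism='PLAIN',
               user='cloudera', password='cloudera')
cursor = conn.cursor()
cursor.execute('show tables')
print cursor.fetchall()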

Last catch

I realize python APIs don’t perform well (execute speed, documentation, community support, etc) because the native language of Hadoop is Java. So my next stop is JDBC.
