The Cloudera QuickStart VM is based on CentOS, a free version of the Red Hat distribution. What you get from a free meal is some basic stuff for a quick demo. The syntax of Hive is pretty similar to normal SQL, but the real problem is how to efficiently transform real-world data into an organized structure so you can feed it into the Hadoop world. You will have to use general-purpose languages such as Python and Java to cleanse the unstructured data.
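For example, a minimal cleansing sketch in Python (the file names and the colon-delimited raw format here are made up for illustration), turning messy lines into the tab-delimited rows Hive likes:

# hypothetical raw.txt line: "1201 : Gopal : 45000 : Technical manager"
with open('raw.txt') as src, open('clean.txt', 'w') as dst:
    for line in src:
        fields = [f.strip() for f in line.split(':')]
        if len(fields) == 4:                     # drop malformed rows
            dst.write('\t'.join(fields) + '\n')  # tab-delimited for Hive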
The default text editor vi is so ugly and difficult to use. Let's go Sublime.
Download it at www.sublimetext.com/3 and go for the tarball beside the Ubuntu 64-bit option. It's only 9 MB.
cd Downloads
tar -vxjf sublime_text_xxx.tar.bz2
nano .bash_profile
# add the next line, save and exit
alias subl="~/Downloads/sublime_text_3/sublime_text"
source .bash_profile # load the bash script
subl # enjoy!
Note that the CentOS shell doesn't automatically source .bash_profile when it opens. Go to Edit -> Profile Preferences -> Title and Command -> check "Run command as a login shell".
show module path in 3 ways
python -c 'import sys; print "\n".join(sys.path)'  # show class path
pip show numpy  # show path for a specific class/module
# or inside the python shell:
import numpy
help(numpy)
With these, I quickly found out where Anaconda puts each module, e.g.:
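A quick check from inside Python (assuming numpy is installed in the Anaconda environment; the path in the comment is just what I'd expect, not verified output):

import numpy
print numpy.__file__  # e.g. /home/cloudera/anaconda2/lib/python2.7/site-packages/numpy/__init__.pyc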
python API: PyHive
I first tried to upgrade the license from "express" to "enterprise" and add the Anaconda parcel in Cloudera Manager. But my computer became extremely slow, and the conda command still didn't work.
My 2nd attempt was sudo yum install -y python27, because the default Python is the outdated 2.6 version, which has been abandoned by pandas. But I couldn't figure out where Python 2.7 was installed. Then I downloaded the Python 2.7 source tgz from the official site: https://www.python.org/downloads/release/python-2713/
sudo mv ~/Downloads/Python-2.7.13.tgz /usr/src
cd /usr/src
sudo tar xzf Python-2.7.13.tgz
cd Python-2.7.13
sudo ./configure
sudo make altinstall # does not replace the default /usr/bin/python
which python2.7
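To confirm which interpreter is actually picked up, a quick sanity check from inside Python (nothing VM-specific assumed):

import sys
print sys.version     # should report 2.7.13 when launched as python2.7
print sys.executable  # where the running binary actually lives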
But I realized I had to use Anaconda to manage the modules, because I couldn't get anything useful done this way. So comes my 3rd solution, Anaconda: https://docs.continuum.io/anaconda/install-linux
Note: download the Python 2.7 version because:
- pyhive requires Python 2.7
- virtualenv doesn't work well on CentOS
cd Downloads
bash Anaconda**.sh
# installer output:
# ... will be installed at /home/cloudera/Anaconda2
# Prepending PATH=/home/cloudera/Downloads/enter/bin to PATH in /home/cloudera/.bashrc
# A backup will be made to: /home/cloudera/.bashrc-anaconda2.bak
# exit the shell and reopen
which python
conda install pyhive thrift
conda install -c blaze sasl=0.2.1
conda install -c conda-forge thrift_sasl=0.2.1
Sadly, the hidden caveat is that something called "GLIBC 2.14" is missing. What's more, PyHive seems to be built for HiveServer but not HiveServer2.
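For reference, a sketch of the PyHive connection I was attempting (it never got past the GLIBC error on this VM, so treat it as untested):

from pyhive import hive

# assumption: default port 10000 and the QuickStart VM credentials
conn = hive.Connection(host='localhost', port=10000,
                       username='cloudera', database='default')
cur = conn.cursor()
cur.execute('show tables')
print cur.fetchall()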
python API: pyhs2
Then I planned to try https://github.com/BradRuderman/pyhs2. Although the author stopped maintenance 3 years ago, this module works amazingly.
still needs Anaconda Python 2.7
sudo yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
sudo pip install pyhs2
minimum viable code:
import pyhs2

conn = pyhs2.connect(host='localhost', port=10000,
                     authMechanism="PLAIN", user='cloudera',
                     password='cloudera', database='default')
cur = conn.cursor()
print cur.getDatabases()               # show databases
cur.execute("select * from employee")  # execute query
print cur.getSchema()                  # return column info from query
# fetch table results
for i in cur.fetch():
    print i
Note that the table employee is prepared as in the Hive tutorial at https://www.tutorialspoint.com. You can do it in the Hive command line or in CDH Hue.
create table if not exists employee (
    eid int,
    name string,
    salary string,
    destination string)
comment 'Employee details'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
The table is stored in HDFS, by default under /user/hive/warehouse.
prepare the data in /home/cloudera/Downloads/sample.txt (tab-separated):
1201	Gopal	45000	Technical manager
1202	Manisha	45000	Proof reader
1203	Masthanvali	40000	Technical writer
1204	Kiran	40000	Hr Admin
1205	Kranthi	30000	Op Admin
load data local inpath '/home/cloudera/Downloads/sample.txt' overwrite into table employee;
table metadata operations
show tables;
show tables '.*s';  -- tables ending with 's', java regex
describe employee;  -- show the list of columns
alter table employee rename to personnel;
alter table employee add columns (age int);
set hive.cli.print.header=true;
select * from employee limit 10;
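These statements can also go through the same pyhs2 cursor from Python, a sketch reusing the connection from the minimum viable code above:

# assumption: conn/cur are the pyhs2 connection and cursor created earlier
cur.execute("describe employee")
for row in cur.fetch():
    print row  # column name, type, comment
cur.execute("alter table employee add columns (age int)")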
python API: impyla
sudo pip install impyla
sudo pip install thrift==0.9.3
from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cursor = conn.cursor()
print conn
cursor.execute('show tables;')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()
The code runs without error, but it doesn't connect to the right database. Maybe there is some database connection catch somewhere.
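One thing worth trying (an assumption I haven't verified on this VM): port 21050 is the Impala daemon, while HiveServer2 listens on 10000, so pointing impyla there with plain authentication might land in the expected database:

from impala.dbapi import connect

# assumption: HiveServer2 on port 10000 with the QuickStart credentials
conn = connect(host='localhost', port=10000,
               auth_mechanism='PLAIN',
               user='cloudera', password='cloudera',
               database='default')
cursor = conn.cursor()
cursor.execute('show tables')
print cursor.fetchall()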
I realized Python APIs don't perform well (execution speed, documentation, community support, etc.) because the native language of Hadoop is Java. So my next stop is JDBC.