Sunday, May 21, 2017

Hive JDBC in Cloudera Hadoop


I am going to stick with the Cloudera QuickStart VM, which saves me a lot of time on messy, buggy configuration. Now I will try to bridge the gap between Hive and unstructured data via JDBC. Forget Python; Java is the native language of Hadoop.

1. install the Hive JDBC driver

$ sudo yum install hive-jdbc  # Red Hat / CentOS
Add /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to the classpath.
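If the classpath gives you trouble later, a quick way to check what the JVM actually sees is to print the java.class.path system property. This is a throwaway sketch (the class name ClasspathCheck is mine):

```java
public class ClasspathCheck {
    public static void main(String[] args) {
        // Print each entry of the effective classpath, one per line
        String cp = System.getProperty("java.class.path");
        for (String entry : cp.split(java.io.File.pathSeparator)) {
            System.out.println(entry);
        }
    }
}
```

If the Hive and Hadoop jars do not show up in the output, the JVM was not started with the classpath you think it was.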

2. write the Java code

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class etl {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        // load the HiveServer2 JDBC driver
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException ex) {
            System.out.println("Error: unable to load driver class!");
            System.exit(1);
        }
        // get connection; user and password are ignored in non-secured mode
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "cloudera", "cloudera");
        Statement stmt = con.createStatement();
        // execute statement
        ResultSet res = stmt.executeQuery("SELECT * FROM employee");
        System.out.println("Result:");
        System.out.println(" ID \t Name \t Salary \t Designation ");
        while (res.next()) {
            System.out.println(res.getInt(1) + " " + res.getString(2) + " "
                    + res.getDouble(3) + " " + res.getString(4));
        }
        // release resources in reverse order of creation
        res.close();
        stmt.close();
        con.close();
    }
}
notes:
  1. driverName is for HiveServer2; the older HiveServer1 driver has a longer name (org.apache.hadoop.hive.jdbc.HiveDriver).
  2. table employee is prepared as in my previous blog or https://www.tutorialspoint.com/hive/
  3. the connection URL is "jdbc:hive2://localhost:10000/default"; the username and password can be empty strings in non-secured mode.
  4. Java's ResultSet class is quite different from a Python cursor. It provides typed getter methods such as getInt, getString, and getDouble, each taking a column index numbered from 1, which reflects Java's strict data type enforcement. The next method moves the cursor to the next row, so a while loop can iterate through the result set.
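As a companion to note 4, a ResultSet can also be iterated without hard-coding the schema by asking its ResultSetMetaData how many columns there are. This is a sketch of that pattern (the class and method names ResultSetUtil and rowsAsStrings are mine; error handling is left to the caller):

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ResultSetUtil {
    // Collect every row as one tab-separated string, whatever the column types.
    // Column indexes start at 1, and next() advances the cursor one row at a time.
    public static List<String> rowsAsStrings(ResultSet res) throws SQLException {
        ResultSetMetaData meta = res.getMetaData();
        int cols = meta.getColumnCount();
        List<String> rows = new ArrayList<>();
        while (res.next()) {
            StringBuilder row = new StringBuilder();
            for (int i = 1; i <= cols; i++) {
                if (i > 1) row.append('\t');
                row.append(res.getString(i)); // getString converts most SQL types
            }
            rows.add(row.toString());
        }
        return rows;
    }
}
```

This avoids the problem of the hard-coded getInt/getDouble calls breaking when the table schema changes.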

3. compile and run

This is the most buggy part. I saw on StackOverflow that someone suggested running it as:
javac -cp . etl.class
java -cp . etl
But no matter what I tried (changing configuration here and there, tinkering with the code here and there), I always got a java.lang.ClassNotFoundException. I guess the reason is that the Java compiler or JVM doesn't read the classpath as expected.
Fortunately, I have a friend doing a PhD in Hadoop. He simply uses an IDE to add the classpath instead of writing it in .bashrc. To be more specific:
  1. Eclipse -> new Java project -> Java build path, add external JARs
  2. locate /usr/lib/hive/lib/*.jar, add all the JARs
  3. run as Java application
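For the record, a command-line alternative (which I have not verified on my VM) is to export a classpath with wildcards covering the Hive and Hadoop jar directories, then compile the .java source and run the resulting class; the directories below are the Cloudera defaults and may differ on other layouts:

```shell
# Quote the wildcards so the JVM, not the shell, expands them
export CLASSPATH="/usr/lib/hive/lib/*:/usr/lib/hadoop/*:."

javac etl.java    # compile the source file (not the .class file)
java etl          # run the compiled class by name
```

Note that javac takes the .java file while java takes the bare class name; mixing these up is a common source of ClassNotFoundException.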

MySQL

This is another topic, but I will keep it here in case I need it in the future.
brew install mysql
We've installed your MySQL database without a root password. To secure it run:
    mysql_secure_installation

MySQL is configured to only allow connections from localhost by default

To connect run:
    mysql -uroot

To have launchd start mysql now and restart at login:
  brew services start mysql
Or, if you don't want/need a background service you can just run:
  mysql.server start
==> Summary
🍺  /usr/local/Cellar/mysql/5.7.18_1: 321 files, 234.5MB

Comments:

  1. Nice summary. To compile the program I think you meant
    javac etl.java
    Also you can specify the classpath with
    export CLASSPATH=/usr/lib/hive/lib/hive-jdbc-standalone.jar:.:
    then run the program as
    java etl
