I am going to stick with the Cloudera Quickstart VM, which saves me a lot of time on messy, buggy configuration. Now I will try to bridge the gap between Hive and unstructured data via JDBC. Forget Python for now; Java is the native language of Hadoop.
1. install hive jdbc driver
$ sudo yum install hive-jdbc # Red Hat-based systems
Then add
/usr/lib/hive/lib/*.jar
and
/usr/lib/hadoop/*.jar
to the classpath.
2. write java code
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class etl {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException ex) {
            System.out.println("Error: unable to load driver class!");
            System.exit(1);
        }
        // get connection; user and password are ignored in non-secured mode
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "cloudera", "cloudera");
        Statement stmt = con.createStatement();
        // execute statement
        ResultSet res = stmt.executeQuery("SELECT * FROM employee");
        System.out.println("Result:");
        System.out.println(" ID \t Name \t Salary \t Designation ");
        while (res.next()) {
            System.out.println(res.getInt(1) + " " + res.getString(2) + " "
                    + res.getDouble(3) + " " + res.getString(4));
        }
        // release resources in reverse order of creation
        res.close();
        stmt.close();
        con.close();
    }
}
notes:
- driverName is for HiveServer2; the old HiveServer1 driver has the longer name org.apache.hadoop.hive.jdbc.HiveDriver.
- the table employee is prepared as in my previous blog, or as in https://www.tutorialspoint.com/hive/
- the connection URL is "jdbc:hive2://localhost:10000/default"; the username and password can be empty strings in non-secured mode.
- the Java class ResultSet is quite different from a Python cursor. It provides typed getter methods such as getInt, getString, and getDouble, which take a column index numbered from 1 (or a column name, e.g. getString("name")); this means strict data-type enforcement. The next() method moves the cursor to the next row, so a while loop can iterate through the result set.
3. compile and run
This is the most buggy part. Someone on StackOverflow suggested running it as:
javac -cp . etl.java
java -cp . etl
But no matter how I tried (changing the configuration here and there, tinkering with the code), I always got java.lang.ClassNotFoundException. I now suspect the reason: passing -cp on the command line makes javac and java ignore the CLASSPATH environment variable, and -cp . contains none of the Hive or Hadoop jars.
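For the record, a CLASSPATH export in .bashrc can be sketched like this (the paths are the Quickstart VM defaults; adjust for other installs):

```shell
# Since Java 6, a classpath entry ending in /* expands to every .jar in
# that directory, so these two folders cover the Hive JDBC driver and
# its Hadoop dependencies without listing the jars one by one.
export CLASSPATH="$CLASSPATH:/usr/lib/hive/lib/*:/usr/lib/hadoop/*"
echo "$CLASSPATH"
```

Keep in mind that an explicit -cp flag makes javac and java ignore this variable entirely.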
Fortunately, I have a friend doing a PhD on Hadoop. He simply lets the IDE manage the classpath instead of writing it into .bashrc. To be more specific:
- Eclipse -> New Java Project -> Java Build Path -> Add External JARs
- navigate to /usr/lib/hive/lib/ and add all the JARs
- Run As -> Java Application
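If you would rather stay on the command line than in Eclipse, the jar directories can be passed straight to -cp, e.g. javac -cp "/usr/lib/hive/lib/*:/usr/lib/hadoop/*" etl.java and then java -cp ".:/usr/lib/hive/lib/*:/usr/lib/hadoop/*" etl (an untested sketch assuming the Quickstart VM paths). The subtle part is quoting: the wildcard must reach the JVM as a literal *, so that the JVM expands it into jar entries rather than the shell expanding it first. A small demo with a hypothetical directory standing in for /usr/lib/hive/lib:

```shell
# Hypothetical stand-in directory with two fake jars
mkdir -p /tmp/jars
touch /tmp/jars/a.jar /tmp/jars/b.jar

# Unquoted: the shell expands the glob into separate arguments before
# java ever sees them (java would misread the extras as class names)
printf '%s\n' /tmp/jars/*

# Quoted: the literal string survives, and the JVM itself expands
# dir/* into every .jar in that directory (supported since Java 6)
printf '%s\n' "/tmp/jars/*"
```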
MySQL
This is another topic, but I will keep the notes here in case I need them in the future.
$ brew install mysql
We've installed your MySQL database without a root password. To secure it run:
mysql_secure_installation
MySQL is configured to only allow connections from localhost by default
To connect run:
mysql -uroot
To have launch start mysql now and restart at login:
brew services start mysql
Or, if you don't want/need a background service you can just run:
mysql.server start
==> Summary
🍺 /usr/local/Cellar/mysql/5.7.18_1: 321 files, 234.5MB