
Tuesday, 8 December 2015

TRANSPOSE/PIVOT a Table in Hive

Transposing/pivoting a table means converting the values of one column into a set of new columns, with another column supplying the corresponding values for those new columns.
For example, if the original table, "test_sample", is:

id      code    key     value
------------------------------
1       A       p       e
2       B       q       f
3       B       p       f
3       B       q       h
3       B       r       j
3       C       t       k

the transpose/pivot of the table should be:

id      code    p       q       r       t
------------------------------------------
1       A       e
2       B               f
3       B       f       h       j
3       C                               k


Hive Implementation

1) When the value is of string type
If "test_sample" is a Hive table with the following table definition:
create table test_sample(id string, code string, key string, value string) row format delimited fields terminated by ',' lines terminated by '\n';
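To populate it with the sample rows shown above, one option (a minimal sketch, assuming the data sits in a local comma-separated file named test_sample.csv) is:

hive> load data local inpath 'test_sample.csv' into table test_sample;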

hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id      code    key     value
1       A       p       e
2       B       q       f
3       B       p       f
3       B       q       h
3       B       r       j
3       C       t       k

the query to create its transpose is:

select b.id, b.code,
       concat_ws('', b.p) as p,
       concat_ws('', b.q) as q,
       concat_ws('', b.r) as r,
       concat_ws('', b.t) as t
from (
  select id, code,
         collect_list(a.group_map['p']) as p,
         collect_list(a.group_map['q']) as q,
         collect_list(a.group_map['r']) as r,
         collect_list(a.group_map['t']) as t
  from (
    select id, code,
           map(key, value) as group_map
    from test_sample
  ) a
  group by a.id, a.code
) b;

On execution of this query, the output will be:
id      code    p       q       r       t
------------------------------------------
1       A       e
2       B               f
3       B       f       h       j
3       C                               k

which is the expected output.

An important point to note is that this does not use any custom UDFs/UDAFs. It uses only built-in Hive functions, which saves us a lot of hassle.
"concat_ws" and "map" are Hive UDFs and "collect_list" is a Hive UDAF.

Working
  • "map" function creates map of values of two columns as key value pairs.
  • the in outer query, we did group by on dimension columns (id and code) and aggregate all the values of a particular key using "collect_list"
id      code    p       q       r       t
1       A       ["e"]   []      []      []
2       B       []      ["f"]   []      []
3       B       ["f"]   ["h"]   ["j"]   []
3       C       []      []      []      ["k"]

  • Finally, in the outermost query, we use the "concat_ws" function to collapse each array into a single string value.
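For reference, here is what the innermost subquery (aliased a) would produce on the sample data above (a sketch of the intermediate rows, not actual CLI output):

id      code    group_map
1       A       {"p":"e"}
2       B       {"q":"f"}
3       B       {"p":"f"}
3       B       {"q":"h"}
3       B       {"r":"j"}
3       C       {"t":"k"}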

2) When the value is of numeric (int/float/double) type
If "test_sample" is a Hive table with the following table definition:
create table test_sample(id string, code string, key string, value int) row format delimited fields terminated by ',' lines terminated by '\n';

hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id      code    key     value
1       A       p       5
2       B       q       6
3       B       p       6
3       B       q       8
3       B       r       10
3       C       t       11

the query to create its transpose is:

select id, code,
       sum(a.group_map['p']) as p,
       sum(a.group_map['q']) as q,
       sum(a.group_map['r']) as r,
       sum(a.group_map['t']) as t
from (
  select id, code,
         map(key, value) as group_map
  from test_sample
) a
group by a.id, a.code;


On execution of this query, the output will be:
id      code    p       q       r       t
------------------------------------------
1       A       5       NULL    NULL    NULL
2       B       NULL    6       NULL    NULL
3       B       6       8       10      NULL
3       C       NULL    NULL    NULL    11

which is the expected output.
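If each (id, code, key) combination has at most one row, as in this sample, any aggregate that ignores NULLs would do. For example, a variant of the same query using max instead of sum (a sketch, not from the original post) also works when the value column is a string, avoiding the collect_list/concat_ws approach:

select id, code,
       max(a.group_map['p']) as p,
       max(a.group_map['q']) as q,
       max(a.group_map['r']) as r,
       max(a.group_map['t']) as t
from (
  select id, code,
         map(key, value) as group_map
  from test_sample
) a
group by a.id, a.code;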

Thursday, 23 July 2015

Hive 0.13: Got Exception while executing a query with UDF due to Kryo Serialization of MapWork

I wrote a custom UDF which internally maintained a Map of Lists as part of its logic. But when I tried to execute it as part of a query:
hive> ADD JAR customUDF.jar;
hive> CREATE TEMPORARY FUNCTION customUDF AS 'org.custom.MyUDF';
hive> select x, customUDF(y) from A;

I got the following error:
----------------------------------------
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1436789290291_17819, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@6be236d0)
---------------------------------------------------
----------------------------------------------------
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.hive.ql.log.PerfLogger: <PERFLOG method=deserializePlan from=org.apache.hadoop.hive.ql.exec.Utilities>
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.hive.ql.exec.Utilities: Deserializing MapWork via kryo
xxxx-xx-xx xx:xx:xx,xxx FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IllegalAccessError: tried to access class sun.nio.cs.UTF_8 from class sun.nio.cs.UTF_8ConstructorAccess
 at sun.nio.cs.UTF_8ConstructorAccess.newInstance(Unknown Source)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo$1.newInstance(Kryo.java:1062)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:526)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:502)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
 at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
 at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
 at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:918)
 at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:826)
 at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:840)
 at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:333)
 at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:275)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:254)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:437)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:430)
 at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:587)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:172)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:414)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Solution:
set `hive.plan.serialization.format=javaXML`
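A minimal sketch of the session with the workaround applied (the JAR, class and table names are the hypothetical ones from above):

hive> set hive.plan.serialization.format=javaXML;
hive> ADD JAR customUDF.jar;
hive> CREATE TEMPORARY FUNCTION customUDF AS 'org.custom.MyUDF';
hive> select x, customUDF(y) from A;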

Reason:
The exception stack trace comes from an error while deserializing an object that was serialized with the Kryo library. Kryo is a fast Java serialization library; the trace it produces is very generic and gives no hint about which object is failing to deserialize.
The hint lies in Hive internals and in the log line just before the error:
xxxx-xx-xx xx:xx:xx,xxx INFO [main] org.apache.hadoop.hive.ql.exec.Utilities: Deserializing MapWork via kryo

Internally, when we submit a Hive query, the Hive execution engine builds an execution plan detailing the work to be done in the mapper and reducer phases of each MapReduce job it is going to launch. It uses Kryo to serialize the map and reduce portions of that plan. The mappers and reducers pick up their corresponding serialized plan, deserialize it, and thereby learn what to execute.
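The plan being serialized here is the same one you can inspect with EXPLAIN (a quick sketch reusing the hypothetical query from above); the MapWork that fails to deserialize corresponds to the map-stage portion of this plan:

hive> explain select x, customUDF(y) from A;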

Due to some open issues (https://issues.apache.org/jira/browse/HIVE-7711), Kryo might not be able to serialize the custom UDF used in the query. Thus, we have to switch to the old and trusted Java XML serialization. Obviously, it is not as optimized as Kryo, but it is still worth using until Kryo resolves these issues.

Tuesday, 9 June 2015

Passing variables from shell script to hive script


I have a Hive script which might take some variables, like a table name, a count, etc., to make it more generic and automated.

$ cat hive-test.hql
select count(*) from demodb.demo_table limit ${hiveconf:num};

Create a shell script that executes the above Hive script, passing the required variables to it.

$ cat script-test.sh
#!/bin/bash
num=5
hive -hiveconf num="$num" -f hive-test.hql
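Since table names were also mentioned as candidates for substitution, a similar sketch using --hivevar (the script name, variable names and table here are made up for illustration) would be:

$ cat hive-test2.hql
select count(*) from ${hivevar:db}.${hivevar:tbl};

$ hive --hivevar db=demodb --hivevar tbl=demo_table -f hive-test2.hql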





Running Hive queries/Hadoop commands from a non-Hadoop environment using SSH

Sometimes, we need to run a set of Hive queries (SELECT, etc.) to analyse data in Hive, or hadoop fs commands to list files in HDFS, from non-Hadoop machines. One such use case is scripts that push data to HDFS on a remote cluster. SSH comes as a quick resolution in such a scenario, when we don't need to maintain a session or do transaction management.

But there are some catches in this obvious solution which need to be kept in mind. I will walk through the complete solution.

1) Login shell vs. non-login shell while executing Hadoop commands over SSH

$ ssh user@remote-host 'hadoop fs -ls /'
user@remote-host's password:
+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| Hadoop requires Java 1.6 or later.                                   |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+

The reason for this error is the non-login bash shell. When we run a command over SSH on a remote machine, it defaults to a non-login shell, which reads only ~/.bashrc. Environment variables like PATH, JAVA_HOME, etc. are set in ~/.bash_profile for each user, and since ~/.bash_profile sources ~/.bashrc and not vice versa, those variables never get loaded.
The solution is a login bash shell, the same kind you get when you log in from PuTTY or any other SSH client. A login bash shell sources ~/.bash_profile, which sets all the environment variables for the user on the remote host.
The syntax to run a command in a login bash shell is:
bash -l -c '<command>'
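A quick way to see the difference (a sketch, using the same placeholder host as the rest of this post) is to compare JAVA_HOME under both shell types:

$ ssh user@remote-host 'echo $JAVA_HOME'                  # non-login shell: typically empty
$ ssh user@remote-host 'bash -l -c "echo \$JAVA_HOME"'    # login shell: sources ~/.bash_profile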


$ ssh user@remote-host 'bash -l -c "hadoop fs -ls /"'
user@remote-host's password:
Found 5 items
drwxr-xr-x   - 49483 mapr          9 2015-05-28 14:24 /abc
drwxr-xr-x   - 49483 mapr          0 2013-12-10 11:45 /hbase
drwxrwxrwx   - 49483 mapr         38 2015-06-08 20:03 /tmp
drwxr-xr-x   - 49483 mapr          3 2015-05-27 16:53 /user
drwxr-xr-x   - 49483 mapr          1 2013-12-10 11:45 /var

Voila!!!

2) '(single quote) vs. "(double quote) when passing variables to commands executed over SSH

As the end goal of every process is automation, we now want to read the target directory path from an environment variable.

$ DIR_PATH="/"
$ echo $DIR_PATH
/
$ ssh user@remote-host 'bash -l -c "hadoop fs -ls $DIR_PATH"'
user@remote-host's password:
ls: Cannot access .: No such file or directory.

It now fails with a rather confusing error. The target directory path is $DIR_PATH, whose value is "/", yet the error trace shows the remote command never received that value.
The explanation lies in the difference between '(single quotes) and "(double quotes) in shell expressions. Everything inside single quotes is literal, even variables. Variables are evaluated and replaced with their values only when the expression is wrapped in double quotes.
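A tiny local demonstration of that rule, using the same DIR_PATH variable:

$ echo '$DIR_PATH'
$DIR_PATH
$ echo "$DIR_PATH"
/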

In the case above, the argument to ssh is wrapped in single quotes, so it is passed to the remote host as-is, without $DIR_PATH being evaluated. On the remote host, the hadoop command inside double quotes does get evaluated, but since $DIR_PATH is not set there, the error appears.
So, rearrange the quotes so that the expression is evaluated locally before being sent over SSH to the remote host.


$ ssh user@remote-host "bash -l -c 'hadoop fs -ls '$DIR_PATH"
user@remote-host's password:
Found 5 items
drwxr-xr-x   - 49483 mapr          9 2015-05-28 14:24 /abc
drwxr-xr-x   - 49483 mapr          0 2013-12-10 11:45 /hbase
drwxrwxrwx   - 49483 mapr         38 2015-06-08 20:03 /tmp
drwxr-xr-x   - 49483 mapr          3 2015-05-27 16:53 /user
drwxr-xr-x   - 49483 mapr          1 2013-12-10 11:45 /var


Similarly, we can now run Hive or any other Hadoop ecosystem command on the remote host from a non-Hadoop machine over SSH, as in the sketch below.
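For example, a sketch of running a Hive query remotely (the database and table are the hypothetical ones from the earlier post):

$ ssh user@remote-host "bash -l -c 'hive -e \"select count(*) from demodb.demo_table\"'"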

HAPPY HACKING !!!