Tuesday, 8 December 2015

TRANSPOSE/PIVOT a Table in Hive

Transposing/pivoting a table means to convert values of one of the column as set of new columns and another column as corresponding values to those new set of columns.
For example, if original table, "temp_sample" is:

id code   key    value
-------------------------------
1  A       p       e
2       B       q       f
3       B       p       f
3       B       q       h
3       B       r        j
3       C       t        k

the transpose/pivot of the table should be:

id      code    p       q       r       t
--------------------------------------
1       A       e
2       B                 f
3       B       f         h        j
3       C                                   k


Hive Implementation

1) When value is of string type
If "test_sample" is hive table with following table definiton:
create table test_sample(id string, code string, key string, value string) row format delimited fields terminated by ',' lines terminated by '\n';

hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id code   key    value
1  A       p       e
2       B       q       f
3       B       p       f
3       B       q       h
3       B       r        j
3       C       t        k

the query to create transpose of it is:

select b.id, b.code, concat_ws('',b.p) as p, concat_ws('',b.q) as q, concat_ws('',b.r) as r, concat_ws('',b.t) as t from
(select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from ( select
id, code,
map(key,value) as group_map
from test_sample
) a group by a.id, a.code) b;

On execution of this query, the output will be:
 id      code    p       q       r       t
--------------------------------------
1       A       e
2       B                 f
3       B       f         h        j
3       C                                   k

which is the expected output.

Important, point to note is that it is not using any custom UDF/UDAFs. It is only using in-built hive functions which save us lot of hassles.
"concat_ws" and "map" are hive udf and "collect_list" is a hive udaf.

Working
  • "map" function creates map of values of two columns as key value pairs.
  • the in outer query, we did group by on dimension columns (id and code) and aggregate all the values of a particular key using "collect_list"
id      code    p       q          r       t
1       A       ["e"]   []         []      []
2       B       []        ["f"]     []      []
3       B       ["f"]   ["h"]    ["j"]   []
3       C       []        []          []      ["k"]

  • Then in last outer query, we use "concat_ws" function to get a single string value out of array.

2) When value is of numeric (int/float/double) type
If "test_sample" is hive table with following table definiton:
create table test_sample(id string, code string, key string, value int) row format delimited fields terminated by ',' lines terminated by '\n';

hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id code   key    value
1       A       p       5
2       B       q       6
3       B       p       6
3       B       q       8
3       B       r       10
3       C       t       11

the query to create transpose of it is:

select id, code, sum(a.group_map['p']) as p, sum(a.group_map['q']) as q, sum(a.group_map['r']) as r, sum(a.group_map['t']) as t from
( select id, code,  
map(proc1,proc2) as group_map 
from test_sample
) a group by a.id, a.code;


On execution of this query, the output will be:
 id      code    p            q             r             t
-----------------------------------------------------------
1       A         5            NULL    NULL    NULL
2       B         NULL    6            NULL    NULL
3       B         6             8            10          NULL
3       C         NULL    NULL    NULL    11

which is the expected output.

17 comments:

  1. THis is very useful. But what if i dont want to hard code the column names (in this case there are only 3 columns, but i have more than 70 columns).

    ReplyDelete
    Replies
    1. Did you get a solution for this

      Delete
    2. Did you get a solution for this

      Delete
    3. I'm also looking for an elegant way to pivot dynamically within hive. However, it's quite straightforward to pivot dynamically using spark.

      # Read flat file with narrow data
      # Columns: customer_id,category,weekofyear,totalspend
      transactions = sqlContext.read \
      .options(header=True, inferSchema=True) \
      .csv("transaction_data.csv")

      # Attempt 1
      tx_pivot = transactions \
      .withColumn("weekofyear", sf.concat(sf.lit("T"), sf.col("weekofyear"))) \
      .groupBy("customer_id") \
      .pivot("weekofyear") \
      .sum("totalspend")

      Delete
    4. Thank you! you saved my day!

      Delete
  2. is there a way to do the oppsit .?
    thanks in advance

    ReplyDelete
  3. how to pivot row to column dynamically in hive without hard code the column names

    ReplyDelete
  4. check this link ... just 5 lines query ,, can create pivot table.
    using map dat type to convert columns name are keys and values are Value in map ( k,v) then later view explode ( map ) to convert into (k,v) as rows.. what sample.

    ReplyDelete
  5. https://stackoverflow.com/questions/37436710/is-there-a-way-to-transpose-data-in-hive

    ReplyDelete
  6. any options for dynamic pivoting for n number columns?

    ReplyDelete
  7. Really Good blog post.provided a helpful information.I hope that you will post more updates like this Big data hadoop online training Bangalore

    ReplyDelete
  8. original post

    https://stackoverflow.com/questions/23025380/how-to-transpose-pivot-data-in-hive

    ReplyDelete
  9. (proc1,proc2), where is it from ?

    ReplyDelete
  10. hi how can we pivot the data in hive without hard coding the column names

    ReplyDelete
  11. this we can easily done with spark data frames
    val newdf=df.groupBy("code").pivot("key").sum("value")

    ReplyDelete