Transposing/pivoting a table means converting the values of one column into a set of new columns, with another column supplying the corresponding values for those new columns.
For example, if the original table, "test_sample", is:
id   code   key   value
------------------------
1    A      p     e
2    B      q     f
3    B      p     f
3    B      q     h
3    B      r     j
3    C      t     k
the transpose/pivot of the table should be:
id   code   p   q   r   t
--------------------------
1    A      e
2    B          f
3    B      f   h   j
3    C                  k
Hive Implementation
1) When the value column is of string type
If "test_sample" is hive table with following table definiton:
create table test_sample(id string, code string, key string, value string)
row format delimited fields terminated by ','
lines terminated by '\n';
hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id   code   key   value
1    A      p     e
2    B      q     f
3    B      p     f
3    B      q     h
3    B      r     j
3    C      t     k
the query to create the transpose is:
select b.id, b.code,
       concat_ws('', b.p) as p,
       concat_ws('', b.q) as q,
       concat_ws('', b.r) as r,
       concat_ws('', b.t) as t
from (
  select id, code,
         collect_list(a.group_map['p']) as p,
         collect_list(a.group_map['q']) as q,
         collect_list(a.group_map['r']) as r,
         collect_list(a.group_map['t']) as t
  from (
    select id, code, map(key, value) as group_map
    from test_sample
  ) a
  group by a.id, a.code
) b;
On execution of this query, the output will be:
id   code   p   q   r   t
--------------------------
1    A      e
2    B          f
3    B      f   h   j
3    C                  k
which is the expected output.
An important point to note is that this does not use any custom UDFs/UDAFs; it relies only on built-in Hive functions, which saves a lot of hassle. "concat_ws" and "map" are Hive UDFs, and "collect_list" is a Hive UDAF.
How it works
- "map" function creates map of values of two columns as key value pairs.
- the in outer query, we did group by on dimension columns (id and code) and aggregate all the values of a particular key using "collect_list"
id code p q r t
1 A ["e"] [] [] []
2 B [] ["f"] [] []
3 B ["f"] ["h"] ["j"] []
3 C [] [] [] ["k"]
- Then in last outer query, we use "concat_ws" function to get a single string value out of array.
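To make the intermediate shape concrete, the innermost subquery can be run on its own. A minimal sketch, with the rows it should produce for the string-valued test_sample shown as comments:
-- innermost query: one single-entry map per input row
select id, code, map(key, value) as group_map
from test_sample;
-- 1    A    {"p":"e"}
-- 2    B    {"q":"f"}
-- 3    B    {"p":"f"}
-- 3    B    {"q":"h"}
-- 3    B    {"r":"j"}
-- 3    C    {"t":"k"}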
2) When the value column is of numeric (int/float/double) type
If "test_sample" is hive table with following table definiton:
create table test_sample(id string, code string, key string, value int) row format delimited fields terminated by ',' lines terminated by '\n';
hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id code key value
1 A p 5
2 B q 6
3 B p 6
3 B q 8
3 B r 10
3 C t 11
the query to create transpose of it is:
On execution of this query, the output will be:
id code p q r t
-----------------------------------------------------------
1 A 5 NULL NULL NULL
2 B NULL 6 NULL NULL
3 B 6 8 10 NULL
3 C NULL NULL NULL 11
If "test_sample" is hive table with following table definiton:
create table test_sample(id string, code string, key string, value int)
row format delimited fields terminated by ','
lines terminated by '\n';
hive> set hive.cli.print.header=true;
hive> select * from test_sample;
id   code   key   value
1    A      p     5
2    B      q     6
3    B      p     6
3    B      q     8
3    B      r     10
3    C      t     11
the query to create the transpose is:
select id, code,
       sum(a.group_map['p']) as p,
       sum(a.group_map['q']) as q,
       sum(a.group_map['r']) as r,
       sum(a.group_map['t']) as t
from (
  select id, code, map(key, value) as group_map
  from test_sample
) a
group by a.id, a.code;
On execution of this query, the output will be:
id   code   p      q      r      t
-----------------------------------
1    A      5      NULL   NULL   NULL
2    B      NULL   6      NULL   NULL
3    B      6      8      10     NULL
3    C      NULL   NULL   NULL   11
which is the expected output. Here "sum" does the work of both "collect_list" and "concat_ws": group_map['p'] is NULL on every row whose key is not 'p', sum ignores NULLs, and a key with no matching rows at all yields NULL.
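If NULLs are undesirable in the pivoted output, each sum can be wrapped in coalesce to substitute a default. A minimal sketch; the use of coalesce here is an assumption about the desired output, not part of the original query:
select id, code,
       coalesce(sum(a.group_map['p']), 0) as p,  -- 0 stands in for "no row with this key"
       coalesce(sum(a.group_map['q']), 0) as q,
       coalesce(sum(a.group_map['r']), 0) as r,
       coalesce(sum(a.group_map['t']), 0) as t
from (
  select id, code, map(key, value) as group_map
  from test_sample
) a
group by a.id, a.code;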
Comments

- This is very useful. But what if I don't want to hard-code the column names (in this case there are only 3 columns, but I have more than 70 columns)?
  - Did you get a solution for this?
  - I'm also looking for an elegant way to pivot dynamically within Hive. However, it's quite straightforward to pivot dynamically using Spark:
# Read flat file with narrow data
# Columns: customer_id,category,weekofyear,totalspend
# (assumes a pyspark shell where sqlContext is predefined)
from pyspark.sql import functions as sf

transactions = sqlContext.read \
    .options(header=True, inferSchema=True) \
    .csv("transaction_data.csv")

# Pivot: one output column per weekofyear value ("T1", "T2", ...),
# summing totalspend per customer
tx_pivot = transactions \
    .withColumn("weekofyear", sf.concat(sf.lit("T"), sf.col("weekofyear"))) \
    .groupBy("customer_id") \
    .pivot("weekofyear") \
    .sum("totalspend")
  - Thank you! You saved my day!
- Is there a way to do the opposite? Thanks in advance.
- How can I pivot rows to columns dynamically in Hive, without hard-coding the column names?
- Thanks much, this is working.
- Nice approach, thanks.
ReplyDeletecheck this link ... just 5 lines query ,, can create pivot table.
ReplyDeleteusing map dat type to convert columns name are keys and values are Value in map ( k,v) then later view explode ( map ) to convert into (k,v) as rows.. what sample.
https://stackoverflow.com/questions/37436710/is-there-a-way-to-transpose-data-in-hive
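A minimal sketch of that map-plus-lateral-view-explode idea, turning a wide layout back into (key, value) rows; the table name wide_sample and its columns are assumptions for illustration:
-- unpivot: one output row per non-empty cell
select id, code, k as key, v as value
from wide_sample
lateral view explode(map('p', p, 'q', q, 'r', r, 't', t)) kv as k, v
where v is not null;  -- drop cells that had no value (assumed to be NULL)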
- Any options for dynamic pivoting for n number of columns?
ReplyDeleteoriginal post
ReplyDeletehttps://stackoverflow.com/questions/23025380/how-to-transpose-pivot-data-in-hive
- Hi, how can we pivot the data in Hive without hard-coding the column names?
ReplyDeletethis we can easily done with spark data frames
ReplyDeleteval newdf=df.groupBy("code").pivot("key").sum("value")