Tuesday, 9 June 2015

Passing variables from shell script to hive script


I have a hive script which might have some variables like table-name, count, etc to make it more generic and automated.

$ cat hive-test.hql
select count(*) from demodb.demo_table limit ${hiveconf:num}

Creating a shell script to execute above hive script and passing required variables to it.

$ cat script-test.sh
# /bin/bash
count=5
hive -hiveconf num="$cnt" -f hive-test.hql





Running Hive queries/Hadoop Command from non-hadoop environment using ssh

Sometimes, we need to run set of hive queries like SELECT, etc to analyse the data in Hive or hadoop fs commands to list the files in HDFS  from non-hadoop machines. One of the  usecase could be scripts that are pushing data to HDFS on remote cluster. SSH comes as a quick resolution in such scenario, when we don't need to maintain session or do transaction management.

But there are some catches here in this obvious solution which needs to be kept in mind. I will give complete flow of solution.

1)  Login interactive Shell Vs Non-Login interactive Shell while executing hadoop commands from SSH

$ ssh user@remote-host 'hadoop fs -ls /'
user@remote-host's password:
+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| Hadoop requires Java 1.6 or later.                                   |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+

The reason of this error is non-login bash shell. When we ssh into remote machine, it defaults to non-login shell which only reads from ~/.bashrc. Environment variables like PATH, JAVA_HOME, etc are sourced from ~/.bash_profile for each user. As ~/.bash_profile source from ~/.bashrc and not vice-versa, lies the reason for our problem.
The solution lies in interactive login bash shell which is same as when we login from putty or any other ssh client. The interactive login bash shell source from ~/.bash_profile which reads all environment variables for the user on remote-host.
The syntax to interactive login bash shell is
bash -l -c '<command>'


$ ssh user@remote-host 'bash -l -c "hadoop fs -ls /"'
user@remote-host's password:
Found 5 items
drwxr-xr-x   - 49483 mapr          9 2015-05-28 14:24 /abc
drwxr-xr-x   - 49483 mapr          0 2013-12-10 11:45 /hbase
drwxrwxrwx   - 49483 mapr         38 2015-06-08 20:03 /tmp
drwxr-xr-x   - 49483 mapr          3 2015-05-27 16:53 /user
drwxr-xr-x   - 49483 mapr          1 2013-12-10 11:45 /var

Voila!!!

2) '(Single Quote) vs "(Double Quote) in passing variables to commands being executed by SSH

As end of every process is automation, we want to read target directory path from environment variable.

$ DIR_PATH="/"
$ echo $DIR_PATH
/
$ ssh user@remote-host 'bash -l -c "hadoop fs -ls $DIR_PATH"'
user@remote-host's password:
ls: Cannot access .: No such file or directory.

It is failing now with such a confusing error trace. The target dir path is $DIR_PATH whose value is "/" but error trace is not even able to read correct value.
The solution lies in meaning of '(single-quote) vs "(double-quote) in shell expressions. Everything inside '(single-quote)s is literal, even the variables. But variables are evaluated and replaced with their value during evaluation if same expression is quoted by "(double-quote)s.

In above case, as argument to ssh is quoted by '(single-quote), it is passed to remote-host as it is without evaluating $DIR_PATH. On remote host, hadoop command under "(double-quote)s is getting evaluated but as variable $DIR_PATH is not set in remote-host, error is coming.  
So, re-arrange the quotes to make it evaluate expression before ssh'ing it to remote-host.


$ ssh user@remote-host "bash -l -c 'hadoop fs -ls '$DIR_PATH"
user@remote-host's password:
Found 5 items
drwxr-xr-x   - 49483 mapr          9 2015-05-28 14:24 /abc
drwxr-xr-x   - 49483 mapr          0 2013-12-10 11:45 /hbase
drwxrwxrwx   - 49483 mapr         38 2015-06-08 20:03 /tmp
drwxr-xr-x   - 49483 mapr          3 2015-05-27 16:53 /user
drwxr-xr-x   - 49483 mapr          1 2013-12-10 11:45 /var


Now similarly, we can run hive or any other hadoop ecosystem commands from non-hadoop remote host using ssh.

HAPPY HACKING !!!