Wednesday, 6 January 2016

Configuring the Spark history server for applications running on YARN in CDH

In Spark-on-YARN mode, each running Spark application launches its own web UI, which can be reached from the YARN Resource Manager UI through the "tracking url" link. This web UI shows all the data on the running application: event timeline, jobs, stages, tasks, task metrics, etc.
With the default configuration, this web UI is only available while the application is running. To see the same information for completed applications, the Spark history server has to be started and configured.
The Spark history server reads the event logs written by Spark applications and visualizes them after the applications have finished running on YARN.

I tested this on CDH 5.3 and 5.5 which have spark version 1.3 and 1.5 respectively.

1) Test if spark-history-server is running or not
$ /etc/init.d/spark-history-server status
If it is not running, start it using
$ /etc/init.d/spark-history-server start

2) Configuring spark-history-server
We need to know two configuration properties of spark-history-server:
- spark.history.fs.logDirectory : the directory where the history server expects the application event logs
- spark.history.ui.port : the port on which the history server web UI runs

These properties are configured in the file "/etc/default/spark", through the variables SPARK_HISTORY_SERVER_WEBUI_PORT and SPARK_HISTORY_SERVER_LOG_DIR.

On my installation, SPARK_HISTORY_SERVER_WEBUI_PORT is 18088 and SPARK_HISTORY_SERVER_LOG_DIR is hdfs:///user/spark/applicationHistory
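
To read these two settings without opening the file, a small shell helper can filter them out. This is only a sketch: the function name `show_history_settings` is made up for illustration, and it assumes the file uses shell-style NAME=value lines, optionally prefixed with `export`.

```shell
# Print the history-server settings found in an environment file.
# show_history_settings is a hypothetical helper, not part of CDH.
# Assumes shell-style NAME=value lines, optionally prefixed with "export".
show_history_settings() {
    conf_file="${1:-/etc/default/spark}"
    grep -E '^[[:space:]]*(export[[:space:]]+)?SPARK_HISTORY_SERVER_(WEBUI_PORT|LOG_DIR)=' "$conf_file"
}
```

Called with no argument, `show_history_settings` reads "/etc/default/spark"; pass another path to inspect a different file.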

If we want to change either of these properties, we can edit this file and restart spark-history-server.

PS: There is a quicker way to find the values of these properties, by looking at the arguments of the running history server process:
$ ps -ef | grep HistoryServer
mapred    2595     1  0 01:54 ?        00:01:54 /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_historyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=hadoop.log -Dhadoop.root.logger=INFO,console -Dhadoop.id.str=mapred -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=mapred-mapred-historyserver-quickstart.cloudera.log -Dhadoop.root.logger=INFO,console -Dmapred.jobsummary.logger=INFO,JSA -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
spark     4174     1  0 01:55 ?        00:03:02 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer
cloudera 18659  6018  0 09:06 pts/0    00:00:00 grep HistoryServer 
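
The output above is long, and the grep also matches the MapReduce JobHistoryServer. A filter can keep only the Spark history server settings. This is a minimal sketch: `extract_history_flags` is a hypothetical helper name, and the only assumption is that the settings appear as `-Dspark.history.*` arguments, as they do in the output above.

```shell
# Read process listings on stdin and print only the spark.history.* settings.
# extract_history_flags is a hypothetical helper name.
extract_history_flags() {
    # Put every argument on its own line, keep the -Dspark.history.* ones,
    # then strip the leading -D.
    tr -s ' ' '\n' | grep '^-Dspark\.history\.' | sed 's/^-D//'
}
```

For example, `ps -ef | grep HistoryServer | extract_history_flags` reduces the output above to the two lines `spark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory` and `spark.history.ui.port=18088`.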

3) Configuring spark
Add the event-log properties (among them "spark.eventLog.dir" and "spark.yarn.historyServer.address") to "/etc/spark/conf/spark-defaults.conf". In this file, keys and values are separated by whitespace characters.

Note that the port in "spark.yarn.historyServer.address" must be the same as the "SPARK_HISTORY_SERVER_WEBUI_PORT" set for the history server. Similarly, the value of "spark.eventLog.dir" must be the same as the "SPARK_HISTORY_SERVER_LOG_DIR" set for the history server.
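
Putting the values from step 2 together, the entries could look like the ones printed by the sketch below. The function name `print_history_props` and its host argument are placeholders introduced for illustration. `spark.eventLog.enabled` is the standard Spark switch that turns event logging on, and the directory and port defaults are the values found in step 2.

```shell
# Print spark-defaults.conf entries that match the history-server settings.
# print_history_props is a hypothetical helper; $1 is the history-server host.
# The defaults for directory and port are the values found in step 2.
print_history_props() {
    history_host="$1"
    log_dir="${2:-hdfs:///user/spark/applicationHistory}"
    ui_port="${3:-18088}"
    cat <<EOF
spark.eventLog.enabled            true
spark.eventLog.dir                ${log_dir}
spark.yarn.historyServer.address  ${history_host}:${ui_port}
EOF
}
```

For example, `print_history_props quickstart.cloudera` prints three lines, the last one ending in `quickstart.cloudera:18088`. The host name here is only an example; use the host where your history server runs, then review the lines and add them to spark-defaults.conf by hand.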

4) Now run a spark application.
After the application has completed, go to the YARN UI for that application and click on the "History" link. It will take you to the Spark web UI for that application.
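
For a quick test, the SparkPi example that ships with Spark is enough. The sketch below wraps the submission in a hypothetical helper, `run_sparkpi`. The path to the examples jar depends on the installation, so it is passed as an argument rather than hard-coded, and the `SPARK_SUBMIT` variable is only my own hook for pointing at a different launcher.

```shell
# Submit the bundled SparkPi example to YARN in cluster mode.
# run_sparkpi and SPARK_SUBMIT are names made up for this sketch.
# $1 is the path to the Spark examples jar on your installation.
run_sparkpi() {
    examples_jar="$1"
    "${SPARK_SUBMIT:-spark-submit}" \
        --master yarn-cluster \
        --class org.apache.spark.examples.SparkPi \
        "$examples_jar" 10
}
```

Once the application finishes, its event log should show up under the directory configured in "spark.eventLog.dir", and the application should be listed in the history server.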
