Hadoop Mania: Configuring spark history server for running on Yarn in CDH

In Spark-on-YARN mode, each running Spark application launches its own web UI, which can be reached from the YARN ResourceManager UI through the "tracking url" link. This web UI shows all the data about the running application: event timeline, jobs, stages, tasks, task metrics, etc.

With the default configuration, this web UI is available only for running applications. To see the same information for completed applications, the Spark history server has to be started and configured.

The Spark history server reads the event logs of Spark applications and visualizes them after the applications have finished running on YARN.

I tested this on CDH 5.3 and 5.5, which have Spark versions 1.3 and 1.5 respectively.

1) Check whether spark-history-server is running

$ /etc/init.d/spark-history-server status

If it is not running, start it using

$ /etc/init.d/spark-history-server start
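
The two commands above can be combined into one small helper. This is only a sketch: `ensure_running` is a hypothetical function name, and it assumes the init script follows the usual convention of returning a non-zero exit code from `status` when the service is not running.

```shell
# Start a service only if its init script reports that it is not running.
# $1 is the init script (or any command that accepts "status" and "start").
ensure_running() {
    init_script="$1"
    if "$init_script" status >/dev/null 2>&1; then
        echo "already running"
    else
        "$init_script" start
    fi
}
```

On a CDH node it would be called as `ensure_running /etc/init.d/spark-history-server`, most likely as root.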

2) Configuring spark-history-server

We need to know two configuration properties of spark-history-server:

- spark.history.fs.logDirectory : this is the directory where history-server expects the application event logs

- spark.history.ui.port : port on which it runs

These properties are configured in the file "/etc/default/spark":

export SPARK_HISTORY_SERVER_LOG_DIR=hdfs:///user/spark/applicationHistory
export SPARK_HISTORY_SERVER_WEBUI_PORT=18088
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=${SPARK_HISTORY_SERVER_LOG_DIR}"
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=${SPARK_HISTORY_SERVER_WEBUI_PORT}"
export SPARK_CONF_DIR=/etc/spark/conf

So we know that SPARK_HISTORY_SERVER_WEBUI_PORT is 18088 and SPARK_HISTORY_SERVER_LOG_DIR is hdfs:///user/spark/applicationHistory

If we want to change either of these properties, we can edit this file and restart spark-history-server.
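
To print the two values without reading the file by eye, we can source it in a subshell. This is a sketch that assumes the file defines the two variables shown above; `show_history_settings` is a hypothetical helper name. Note that sourcing executes everything in the file, so it should only be used on a file we trust.

```shell
# Print the history server log directory and UI port defined in a defaults file.
# The file is sourced inside a subshell so its exports do not leak into our shell.
show_history_settings() {
    (
        . "$1" &&
        printf 'log dir: %s\nport: %s\n' \
            "$SPARK_HISTORY_SERVER_LOG_DIR" "$SPARK_HISTORY_SERVER_WEBUI_PORT"
    )
}
```

Usage: `show_history_settings /etc/default/spark`.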

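PS: There is a quicker way to find the values of these properties: look at the command line of the running history server. The grep below also matches the MapReduce JobHistoryServer; the line we care about is the one for org.apache.spark.deploy.history.HistoryServer, which carries the two -Dspark.history.* options.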
$ ps -ef | grep HistoryServer
mapred 2595 1 0 01:54 ? 00:01:54 /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_historyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=hadoop.log -Dhadoop.root.logger=INFO,console -Dhadoop.id.str=mapred -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-mapreduce -Dhadoop.log.file=mapred-mapred-historyserver-quickstart.cloudera.log -Dhadoop.root.logger=INFO,console -Dmapred.jobsummary.logger=INFO,JSA -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
spark 4174 1 0 01:55 ? 00:03:02 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer
cloudera 18659 6018 0 09:06 pts/0 00:00:00 grep HistoryServer
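
The ps output above is long, so it can help to filter it down to just the history server options. A minimal sketch, where `history_opts` is a hypothetical helper that reads process lines on standard input:

```shell
# Keep only the -Dspark.history.* options from process listing lines on stdin.
history_opts() {
    # Put each whitespace-separated token on its own line,
    # keep the spark.history options, and drop the leading -D.
    tr ' ' '\n' | grep '^-Dspark\.history\.' | sed 's/^-D//'
}
```

Usage: `ps -ef | grep HistoryServer | history_opts`. The MapReduce JobHistoryServer line has no spark.history options, so it contributes nothing to the result.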

3) Configuring spark
Add the following properties to "/etc/spark/conf/spark-defaults.conf" (in this file, keys and values are separated by whitespace characters).

spark.eventLog.enabled                  true
spark.yarn.historyServer.address        ${hadoopconf-yarn.resourcemanager.hostname}:18088
spark.eventLog.dir                      hdfs:///user/spark/applicationHistory

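Note that the port in "spark.yarn.historyServer.address" must match "SPARK_HISTORY_SERVER_WEBUI_PORT" set for the history server in step 2. Similarly, the value of "spark.eventLog.dir" must match "SPARK_HISTORY_SERVER_LOG_DIR". If they differ, the applications write their event logs to a directory the history server does not read, or the "History" link points at the wrong port.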
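
This matching rule is easy to check with a script. The sketch below assumes the whitespace-separated key/value layout shown above; `conf_value` and `check_history_conf` are hypothetical helper names, and the expected directory and port are passed in by hand.

```shell
# Print the value of key $2 from a whitespace-separated properties file $1.
conf_value() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Compare a Spark conf file ($1) against the history server's
# log directory ($2) and UI port ($3).
check_history_conf() {
    conf_file="$1"; expected_dir="$2"; expected_port="$3"
    dir=$(conf_value "$conf_file" spark.eventLog.dir)
    addr=$(conf_value "$conf_file" spark.yarn.historyServer.address)
    port="${addr##*:}"   # everything after the last colon
    [ "$dir" = "$expected_dir" ] || { echo "spark.eventLog.dir mismatch: $dir"; return 1; }
    [ "$port" = "$expected_port" ] || { echo "port mismatch: $port"; return 1; }
    echo "ok"
}
```

Usage: `check_history_conf /etc/spark/conf/spark-defaults.conf hdfs:///user/spark/applicationHistory 18088`.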

4) Now run a Spark application.
After the application has completed, go to the YARN UI for that application and click the "History" link. It will take you to the Spark web UI for that application.

Wednesday, 6 January 2016