Submit MapReduce Jobs to Yarn RM using REST API(s) - Oozie Workflow

Recently I was exploring how to integrate my MapReduce (MR) application with the rest of our micro-services and their infrastructure. The available command-line mechanism (e.g. running the hadoop or yarn command from a prompt) is not a recommended option here; a REST API based solution is needed to make the jobs consumable across applications. While exploring this I realized we can't use the YARN Resource Manager APIs to execute (submit) map-reduce applications. As documented in the YARN (RM) REST API reference, the mechanism involves first requesting an application ID and then submitting the application. That works fine for Spark jobs but not really for MapReduce jobs.
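For reference, that two-step RM submission flow looks roughly like the following. The hostname and port are the HDP sandbox defaults, and the application-submission-context body (here a hypothetical submission-context.json file) is something you would have to build yourself:

curl -X POST http://sandbox-hdp.hortonworks.com:8088/ws/v1/cluster/apps/new-application

curl -X POST -H "Content-Type: application/json" -d @submission-context.json http://sandbox-hdp.hortonworks.com:8088/ws/v1/cluster/apps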

If we execute a map-reduce application using these APIs, then on completion of the sub-process the parent application fails with "Application application_xx_00xx failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_xx_00xx_000001 exited with exitCode: 0". As you can see in the image, it launches two tasks, a parent and a sub-task, and even after successful completion of the sub-task the parent task reports a failure.

How to Submit an MR Job to RM (YARN) Using APIs

The recommended mechanism for submitting and managing map-reduce or Hadoop jobs is an Oozie workflow. Oozie is integrated with the rest of the Hadoop stack and supports Hadoop MapReduce jobs out of the box. We can use the Oozie REST endpoint to submit (execute) a MapReduce job. Of course, we can still use the YARN (RM) APIs to manage the submitted application from any other channel and perform activities such as checking its status or killing it.
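For example, once a job is running you can inspect or kill the underlying YARN application through the RM REST API (the application ID below is a placeholder):

curl http://sandbox-hdp.hortonworks.com:8088/ws/v1/cluster/apps/application_xx_00xx/state

curl -X PUT -H "Content-Type: application/json" -d '{"state":"KILLED"}' http://sandbox-hdp.hortonworks.com:8088/ws/v1/cluster/apps/application_xx_00xx/state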
Submitting MapReduce (MR) jobs with Oozie
To illustrate, let's walk through the wordcount example from the Hadoop examples jar.

Step 1: Start by logging in to the Hadoop cluster and creating the required working directory structure in HDFS, for example as shown below. You can also use the Ambari UI to perform these tasks.
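A minimal layout matching the paths used later in this post would look like this (adjust the user and paths for your cluster; note that the output directory should not be created up front, since the MapReduce job will fail if it already exists):

hdfs dfs -mkdir -p /user/root/examples/apps/mapreduce/lib

hdfs dfs -mkdir -p /user/root/examples/input-data/mapreduce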
Step 2: Create a configuration file, say oozieconfig.xml, containing the following properties:
<configuration>
  <property>
    <name>user.name</name>
    <value>root</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>sandbox-hdp.hortonworks.com:8032</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>/user/root/examples/apps/mapreduce</value>
  </property>
  <property>
    <name>queueName</name>
    <value>default</value>
  </property>
  <property>
    <name>nameNode</name>
    <value>hdfs://sandbox-hdp.hortonworks.com:8020</value>
  </property>
  <property>
    <name>applicationName</name>
    <value>testoozie</value>
  </property>
</configuration>
NOTE: The <job-tracker> element in Oozie is used here to pass the ResourceManager address; it does not really represent the old JobTracker. The specification still calls it jobTracker, though it can point to either the JT or the RM depending on the Hadoop version you are using.

The value provided for jobTracker is the ResourceManager address; I am using the hostname from the HDP sandbox here. Similarly, the nameNode value is the NameNode URL, again taken from the HDP sandbox for illustration.
Step 3: Create the workflow definition file, workflow.xml, describing the map-reduce action:
<workflow-app xmlns="uri:oozie:workflow:0.5" name="map-reduce-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.name</name>
          <value>map-reduce-wf</value>
        </property>
        <property>
          <name>mapred.mapper.new-api</name>
          <value>true</value>
        </property>
        <property>
          <name>mapred.reducer.new-api</name>
          <value>true</value>
        </property>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>mapreduce.map.class</name>
          <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
        </property>
        <property>
          <name>mapreduce.reduce.class</name>
          <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
        </property>
        <property>
          <name>mapreduce.combine.class</name>
          <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
        </property>
        <property>
          <name>mapred.output.key.class</name>
          <value>org.apache.hadoop.io.Text</value>
        </property>
        <property>
          <name>mapred.output.value.class</name>
          <value>org.apache.hadoop.io.IntWritable</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/root/examples/input-data/mapreduce</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/root/examples/output-data/mapreduce</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
Apart from the input and output directories, you also need to provide the mapper, reducer and combiner classes and any other job configuration relevant to your flow.
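For comparison, the workflow above corresponds roughly to running the same example directly from the command line, which is the mechanism we are trying to avoid here:

hadoop jar hadoop-mapreduce-examples.jar wordcount /user/root/examples/input-data/mapreduce /user/root/examples/output-data/mapreduce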
Step 4: Create sampledata.txt with some sample input:
test
here
insight
failure
nomore
enough
Step 5: Copy the required files to the designated directory structure within HDFS using the following commands. Change the paths as appropriate for your application.
Copy the application jar file, in this case the Hadoop MapReduce examples jar:
hdfs dfs -put hadoop-mapreduce-examples.jar /user/root/examples/apps/mapreduce/lib

Copy the workflow.xml file created in Step 3:
hdfs dfs -put workflow.xml /user/root/examples/apps/mapreduce/

Copy the sample data file for input:
hdfs dfs -put sampledata.txt /user/root/examples/input-data/mapreduce
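To confirm everything landed where the workflow expects it, you can list the directory tree:
hdfs dfs -ls -R /user/root/examples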
Step 6: Run
Here is how you can submit the job via curl; of course, you can make the same call from within your program, in whatever language you are using.
curl -i -s -X POST -H "Content-Type: application/xml" -T oozieconfig.xml http://sandbox-hdp.hortonworks.com:11000/oozie/v1/jobs?action=start
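If the call fails, it can help to first confirm that the Oozie server itself is up and reachable; a healthy server responds with {"systemMode":"NORMAL"}:
curl http://sandbox-hdp.hortonworks.com:11000/oozie/v1/admin/status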
Step 7: Response
The command returns a JSON response similar to the following:
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Content-Length: 45
Date: Mon, 21 Jan 2019 07:19:03 GMT

{"id":"0000008-190119071046177-oozie-oozi-W"}

Be sure to record the job ID value, i.e. 0000008-190119071046177-oozie-oozi-W.
Step 8: Status
Here is an example of using curl to retrieve the status of the workflow:
curl -i -s -X GET http://sandbox-hdp.hortonworks.com:11000/oozie/v1/job/0000008-190119071046177-oozie-oozi-W?show=info
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Content-Length: 8114
Date: Mon, 21 Jan 2019 17:09:56 GMT
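The JSON body (not shown here) contains the overall workflow status along with per-action details. The same endpoint can also return the workflow log by changing the show parameter:
curl http://sandbox-hdp.hortonworks.com:11000/oozie/v1/job/0000008-190119071046177-oozie-oozi-W?show=log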
You can also check the job status from the Ambari console: select YARN and click Quick Links > Resource Manager UI, then select the job that matches the ID from the previous step and view the job details.

For additional details you can refer to the Oozie workflow specification.

-Ritesh
Disclaimer: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”