Introduction to Apache Oozie Workflow Management in Hadoop

Apache Oozie is a scalable, reliable, and extensible workflow scheduler system designed to manage Apache Hadoop jobs. It facilitates the coordination and execution of complex workflows by chaining actions together, running jobs on a schedule, handling pre- and post-processing tasks, and retrying failures. Oozie enables users to organize and monitor their Hadoop jobs efficiently, ensuring correct job execution order based on dependencies. It provides a common framework for communication and lets the workflow couple resources together without a custom code base.





Presentation Transcript


  1. Workflow Management CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

  2. APACHE OOZIE

  3. Problem! "Okay, Hadoop is great, but how do people actually do this?" (a real person). Package jobs? Chain actions together? Run them on a schedule? Handle pre- and post-processing? Retry failures?

  4. Apache Oozie: Workflow Scheduler for Hadoop. A scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs. Workflow jobs are DAGs of actions; coordinator jobs are recurrent workflow jobs triggered by time and data availability. Supports several types of jobs: Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, DistCp, Java programs, and shell scripts.

  5. Why should I care? Retry jobs in the event of a failure; execute jobs at a specific time or when data is available; correctly order job execution based on dependencies; provide a common framework for communication; use the workflow to couple resources instead of some home-grown code base.

  6. Layers of Oozie (outermost to innermost): Bundles, Coordinators, Workflows, Actions.

  7. Actions Each action has a type, and each type has a defined set of configuration variables. Every action must specify what to do on success and on failure.
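As a sketch of what that looks like in practice (this shell action and its node names are illustrative, not from the slides), each action pairs its typed payload with explicit success and failure transitions:

```xml
<!-- Hypothetical action node: "cleanup", "end", and "fail" are placeholder names -->
<action name="cleanup">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>cleanup.sh</exec>
    <file>scripts/cleanup.sh</file>
  </shell>
  <ok to="end"/>     <!-- transition taken when the action succeeds -->
  <error to="fail"/> <!-- transition taken when the action fails -->
</action>
```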

  8. Workflow DAGs [Diagram: an example workflow DAG running from start through a fork/join of parallel M/R, streaming, and Java main actions, then a decision node that either loops back (MORE) through a Pig job or proceeds (ENOUGH) through Java main and FS actions to end.]

  9. Workflow Language

  Flow-control nodes:
  - Decision: expresses switch-case logic
  - Fork: splits one path of execution into multiple concurrent paths
  - Join: waits until every concurrent execution path of a previous fork node arrives at it
  - Kill: forces a workflow job to abort execution

  Action nodes:
  - java: invokes the main() method of the specified Java class
  - fs: manipulates files and directories in HDFS; supports the commands move, delete, and mkdir
  - map-reduce: starts a Hadoop map/reduce job; can be a Java MR job, a streaming job, or a pipes job
  - pig: runs a Pig job
  - sub-workflow: runs a child workflow job
  - hive: runs a Hive job
  - shell: runs a shell command
  - ssh: starts a shell command on a remote machine over a secure shell
  - sqoop: runs a Sqoop job
  - email: sends email from an Oozie workflow application
  - distcp: runs a Hadoop DistCp MapReduce job
  - custom: does what you program it to do
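As a concrete sketch of the switch-case flow control above (the node names and the size threshold are illustrative; `fs:fileSize` is one of Oozie's workflow EL functions):

```xml
<!-- Hypothetical decision node routing on input size -->
<decision name="size-check">
  <switch>
    <!-- take the big-input path when the input exceeds 1 GB -->
    <case to="big-input-path">${fs:fileSize(inputDir) gt 1073741824}</case>
    <default to="small-input-path"/>
  </switch>
</decision>
```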

  10. Oozie Workflow Application An HDFS directory containing: a definition file (workflow.xml), a configuration file (config-default.xml), and app files (a lib/ directory with JARs and other dependencies).

  11. WordCount Workflow

  <workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
    <start to="wordcount"/>
    <action name="wordcount">
      <map-reduce>
        <job-tracker>foo.com:9001</job-tracker>
        <name-node>hdfs://bar.com:9000</name-node>
        <configuration>
          <property>
            <name>mapred.input.dir</name>
            <value>${inputDir}</value>
          </property>
          <property>
            <name>mapred.output.dir</name>
            <value>${outputDir}</value>
          </property>
        </configuration>
      </map-reduce>
      <ok to="end"/>
      <error to="kill"/>
    </action>
    <kill name="kill">
      <message>WordCount failed</message>
    </kill>
    <end name="end"/>
  </workflow-app>

  [Diagram: start, then the wordcount M-R action; OK leads to end, error leads to kill.]

  12. Coordinators Oozie executes workflows based on time dependency and data dependency. [Diagram: an Oozie client submits a coordinator through the Oozie WS API (Tomcat); the coordinator checks data availability and triggers an Oozie workflow, which runs on Hadoop.]
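A coordinator that combines both triggers might be sketched like this (the dataset name, paths, and dates are illustrative; the `_SUCCESS` done-flag marks a day's input as complete):

```xml
<!-- Hypothetical daily coordinator gated on both time and data availability -->
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- one directory of logs per day -->
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2016-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://bar:9000/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- the materialized action waits for the current day's instance -->
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/user/hadoop/oozie/app/daily_wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```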

  13. Bundle Bundles are higher-level abstractions that batch a set of coordinators together. There are no explicit dependencies among them, but they can be used to define a pipeline.
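A minimal bundle sketch (the coordinator names and paths are placeholders) simply lists the coordinators it batches together:

```xml
<!-- Hypothetical bundle grouping two coordinators into one pipeline -->
<bundle-app name="pipeline-bundle" xmlns="uri:oozie:bundle:0.2">
  <coordinator name="ingest-coord">
    <app-path>hdfs://bar:9000/user/hadoop/oozie/coord/ingest</app-path>
  </coordinator>
  <coordinator name="report-coord">
    <app-path>hdfs://bar:9000/user/hadoop/oozie/coord/report</app-path>
  </coordinator>
</bundle-app>
```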

  14. Interacting with Oozie Read-only web console, CLI, Java client, web service endpoints, or directly with the Oozie DB using SQL.

  15. What do I need to deploy a workflow? coordinator.xml, workflow.xml, libraries, and a properties file, which contains things like the NameNode and ResourceManager addresses and other job-specific properties.
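A minimal job.properties sketch (the hosts and paths are placeholders; `oozie.wf.application.path` tells Oozie where the workflow app lives in HDFS):

```properties
# Hypothetical job.properties -- all hosts and paths are placeholders
nameNode=hdfs://bar:9000
jobTracker=foo:9001
oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app/my_job
inputDir=${nameNode}/user/hadoop/input
outputDir=${nameNode}/user/hadoop/output
```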

  16. Okay, I've built those Now you can put it in HDFS and run it:

  hdfs dfs -put my_job oozie/app
  oozie job -run -config job.properties
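Once submitted, you can watch the job from the same CLI. A sketch, assuming a local Oozie server; the URL and job id are placeholders, and these commands need a live server to run (`-oozie` can be omitted if the `OOZIE_URL` environment variable is set):

```shell
# Show status and action-level details for a workflow job
oozie job -oozie http://localhost:11000/oozie -info 0000001-160101000000000-oozie-W
# Fetch the job's log
oozie job -oozie http://localhost:11000/oozie -log 0000001-160101000000000-oozie-W
```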

  17. Java Action A Java action will execute the main method of the specified Java class. Java classes should be packaged in a JAR and placed in the workflow application's lib directory:

  wf-app-dir/workflow.xml
  wf-app-dir/lib
  wf-app-dir/lib/myJavaClasses.JAR

  18. Java Action The XML below is equivalent to running:

  $ java -Xms512m a.b.c.MyJavaMain arg1 arg2

  <action name="java1">
    <java>
      ...
      <main-class>a.b.c.MyJavaMain</main-class>
      <java-opts>-Xms512m</java-opts>
      <arg>arg1</arg>
      <arg>arg2</arg>
      ...
    </java>
  </action>

  19. Java Action Execution Executed as an MR job with a single task, so you need the MR information:

  <action name="java1">
    <java>
      <job-tracker>foo.bar:8021</job-tracker>
      <name-node>foo1.bar:8020</name-node>
      ...
      <configuration>
        <property>
          <name>abc</name>
          <value>def</value>
        </property>
      </configuration>
    </java>
  </action>

  20. A Use Case: Hourly Jobs Replace a CRON job that runs a bash script once a day:
  1. A Java main class pulls data from a file stream and dumps it to HDFS.
  2. A MapReduce job runs on the files.
  3. An email is sent to a person when finished.
  4. Start within X amount of time.
  5. Complete within Y amount of time.
  6. Retry Z times on failure.

  21. The workflow (steps 1-3 of the use case are marked with comments):

  <workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
    <start to="java-node"/>
    <!-- step 1: pull data from the file stream into HDFS -->
    <action name="java-node">
      <java>
        <job-tracker>foo:9001</job-tracker>
        <name-node>bar:9000</name-node>
        <main-class>org.foo.bar.PullFileStream</main-class>
      </java>
      <ok to="mr-node"/>
      <error to="fail"/>
    </action>
    <!-- step 2: run a MapReduce job on the files -->
    <action name="mr-node">
      <map-reduce>
        <job-tracker>foo:9001</job-tracker>
        <name-node>bar:9000</name-node>
        <configuration>
          ...
        </configuration>
      </map-reduce>
      <ok to="email-node"/>
      <error to="fail"/>
    </action>
    ...
    <!-- step 3: email a person when finished -->
    <action name="email-node">
      <email xmlns="uri:oozie:email-action:0.1">
        <to>customer@foo.bar</to>
        <cc>employee@foo.bar</cc>
        <subject>Email notification</subject>
        <body>The wf completed</body>
      </email>
      <ok to="myotherjob"/>
      <error to="errorcleanup"/>
    </action>
    <end name="end"/>
    <kill name="fail"/>
  </workflow-app>

  22. The coordinator (the SLA entries cover steps 4 and 5; daily scheduling covers step 6's retry window):

  <?xml version="1.0"?>
  <coordinator-app name="daily_job_coord" frequency="${coord:days(1)}"
                   start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                   xmlns="uri:oozie:coordinator:0.1"
                   xmlns:sla="uri:oozie:sla:0.1">
    <action>
      <workflow>
        <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
      </workflow>
      <!-- steps 4 and 5: start within X minutes, complete within Y minutes -->
      <sla:info>
        <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
        <sla:should-start>${X * MINUTES}</sla:should-start>
        <sla:should-end>${Y * MINUTES}</sla:should-end>
        <sla:alert-contact>foo@bar.com</sla:alert-contact>
      </sla:info>
    </action>
  </coordinator-app>

  23. Review Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff. Advanced control flow and action extensibility let Oozie do whatever you need it to do at any point in the workflow. XML is gross.

  24. References
  http://oozie.apache.org
  https://cwiki.apache.org/confluence/display/OOZIE/Index
  http://www.slideshare.net/mattgoeke/oozie-riot-games
  http://www.slideshare.net/mislam77/oozie-sweet-13451212
  http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie
