Airflow
The Complete Hands-On Course
강석우 pko89403@gmail.com
Why
AIRFLOW
What is Airflow?
• Programmatically author, schedule, and monitor data pipelines
• Components: Web Server, Scheduler, Executor, Worker, Metadata Database
• Key concepts: DAG, Operator, Task, TaskInstance, Workflow
Airflow Architecture
• Airflow Webserver: serves the UI dashboard over HTTP
• Airflow Scheduler: a daemon that triggers task instances once their dependencies are met
• Airflow Worker: a process that executes the tasks assigned to it
• Metadata Database: stores information about the state of tasks
• Executor: a message-queuing process that is bound to the scheduler and determines which worker processes execute the scheduled tasks
[Diagram: Airflow Webserver, Airflow Scheduler, Workers (x3), Meta DB, Logs, DAGs]
How does Airflow work?
• 1. The scheduler reads the DAG folder.
• 2. A process parses your DAG and creates a DagRun based on the DAG's scheduling parameters.
• 3. A TaskInstance is instantiated for each Task that needs to be executed and is flagged as "Scheduled" in the metadata database.
• 4. The scheduler gets all TaskInstances flagged "Scheduled" from the metadata database, changes their state to "Queued", and sends them to the executor to be executed.
• 5. The executor pulls Tasks from the queue (depending on your execution setup) and changes their state from "Queued" to "Running"; workers start executing the TaskInstances.
• 6. When a Task finishes, the executor changes its state to the final state (success, failed, etc.) in the database, and the scheduler updates the DagRun with the state "Success" or "Failed". The web server periodically fetches data from the metadata database to update the UI.
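The state changes in steps 3 to 6 can be sketched as a small state machine. This is a toy model in plain Python, not Airflow code: `ToyTaskInstance`, `TRANSITIONS`, and `dag_run_state` are hypothetical names, and the rule used to derive the run state is a simplification.

```python
from enum import Enum


class State(Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


# Allowed transitions, following steps 3 to 6 above (simplified assumption).
TRANSITIONS = {
    State.SCHEDULED: {State.QUEUED},
    State.QUEUED: {State.RUNNING},
    State.RUNNING: {State.SUCCESS, State.FAILED},
    State.SUCCESS: set(),
    State.FAILED: set(),
}


class ToyTaskInstance:
    """Hypothetical stand-in for a TaskInstance row in the metadata database."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.state = State.SCHEDULED  # step 3
        self.history = [self.state]

    def move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.name} -> {new_state.name} is not allowed")
        self.state = new_state
        self.history.append(new_state)


def dag_run_state(task_instances):
    """Step 6, simplified: failed if any task failed, success if all succeeded."""
    states = {ti.state for ti in task_instances}
    if State.FAILED in states:
        return "failed"
    if states == {State.SUCCESS}:
        return "success"
    return "running"


extract = ToyTaskInstance("extract")
load = ToyTaskInstance("load")
for ti in (extract, load):
    ti.move_to(State.QUEUED)    # step 4: the scheduler queues the task
    ti.move_to(State.RUNNING)   # step 5: a worker picks it up
extract.move_to(State.SUCCESS)  # step 6: the final state is recorded
load.move_to(State.FAILED)
print(dag_run_state([extract, load]))
```

Keeping the allowed transitions in one table makes an out-of-order update (for example "Success" back to "Running") fail loudly instead of silently corrupting the recorded state.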
Start Airflow (from install to UI)
• AIRFLOW_GPL_UNICODE=yes pip install "apache-airflow[celery, crypto, postgres, hive, rabbitmq, redis]"
• airflow initdb
• airflow upgradedb
• ls
• cd airflow
• grep dags_folder airflow.cfg
• mkdir -p /home/airflow/airflow/dags
• ls
• vim airflow.cfg (configuration file)
• load_examples = False
• airflow resetdb
• airflow scheduler
• airflow webserver
Quick Tour of Airflow
• airflow list_dags
• airflow list_tasks {dag name} --tree
• airflow test {dag name} python_task {execution date}
• airflow -h
What is a DAG?
• A finite directed graph with no directed cycles
• A DAG represents a collection of tasks to run, organized in a way that represents their dependencies and relations
• Each node is a Task
• Each edge is a Dependency
• How should the workflow be executed?
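To make the "no directed cycles" property concrete, here is a minimal sketch using the standard-library `graphlib` module (Python 3.9+), not Airflow itself. The task names are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# The task names are illustrative only.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A valid execution order exists because the graph is acyclic.
order = list(TopologicalSorter(dependencies).static_order())
print(order)


def is_acyclic(graph):
    """Return True when the graph has no directed cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# Making "extract" depend on "load" closes a cycle, so no order exists.
cyclic = {**dependencies, "extract": {"load"}}
print(is_acyclic(dependencies), is_acyclic(cyclic))
```

Because every node is a task and every edge a dependency, an acyclic graph always admits at least one order in which each task runs after everything it depends on.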
DAG’s important properties
• Defined in Python files placed in Airflow's DAG_FOLDER (usually ~/airflow/dags)
• dag_id
• description
• start_date
• schedule_interval
• depends_on_past: a task instance runs only if the previous run of the same task completed successfully
• default_args: keyword arguments passed to the constructor of every operator in the DAG
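The role of default_args can be illustrated with a plain dictionary merge. This is only a sketch of the precedence rule (explicit operator arguments win over DAG-level defaults); `resolve_operator_kwargs` is a hypothetical helper and the values are made up.

```python
from datetime import datetime, timedelta

# Hypothetical values; the keys mirror common operator keyword arguments.
default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}


def resolve_operator_kwargs(defaults, **explicit):
    """Explicit keyword arguments take precedence over the DAG-level defaults."""
    return {**defaults, **explicit}


# The first task uses the defaults; the second overrides `retries`.
task_a = resolve_operator_kwargs(default_args, task_id="task_a")
task_b = resolve_operator_kwargs(default_args, task_id="task_b", retries=3)
print(task_a["retries"], task_b["retries"])
```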
What is Operator?
• Determines what actually gets done.
• Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators.
• The definition of a single task
• Should be idempotent (always produces the same result)
• A task is created by instantiating an Operator class
• An operator defines the nature of the task and how it should be executed
• Once an operator is instantiated, the task becomes a node in your DAG.
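Idempotency is easiest to see by running the same task twice. The sketch below uses a plain dictionary as a stand-in for a target table; the function names and the date-keyed partition layout are assumptions for illustration.

```python
def load_append(table, execution_date, rows):
    """Not idempotent: every rerun adds the same rows again."""
    table.setdefault(execution_date, []).extend(rows)


def load_overwrite(table, execution_date, rows):
    """Idempotent: a rerun replaces the partition for that execution date."""
    table[execution_date] = list(rows)


rows = [{"id": 1}, {"id": 2}]
appended, overwritten = {}, {}
for _ in range(2):  # simulate running the same task twice
    load_append(appended, "2020-01-01", rows)
    load_overwrite(overwritten, "2020-01-01", rows)

print(len(appended["2020-01-01"]), len(overwritten["2020-01-01"]))
```

The overwrite variant leaves the same two rows no matter how many times the task is retried or backfilled, which is the behaviour an idempotent task should have.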
Many Operators
• BashOperator
• PythonOperator
• EmailOperator (sends an email)
• SqlOperator (executes a SQL command)
• All operators inherit from BaseOperator
• 3 types of operators
  • Action operators, which perform an action (BashOperator, PythonOperator, EmailOperator, ...)
  • Transfer operators, which move data from one system to another (sqlOperator, sftpOperator)
  • Sensor operators, which wait for data to arrive at a defined location
Operator ++
• Transfer Operators
  • Move data from one system to another
  • Data is pulled out of the source, staged on the machine where the executor is running, and then transferred to the target system.
  • Don't use them if you are dealing with a large amount of data
• Sensor Operators
  • Inherit from BaseSensorOperator
  • Useful for monitoring external processes, such as waiting for files to be uploaded to HDFS or for a partition to appear in Hive
  • Basically a long-running task
  • A sensor operator has a poke method that is called repeatedly until it returns True (this method is used to monitor the external process)
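The poke loop described above can be sketched in a few lines of plain Python. `run_sensor` is a hypothetical helper, not the BaseSensorOperator implementation; the `poke_interval` and `timeout` names are borrowed from the sensor's parameters, and the values are shortened so the example finishes quickly.

```python
import time


def run_sensor(poke, poke_interval=0.01, timeout=1.0):
    """Call `poke` repeatedly until it returns True or `timeout` seconds pass.

    Returns the number of pokes made; raises TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout
    pokes = 0
    while True:
        pokes += 1
        if poke():
            return pokes
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {pokes} pokes")
        time.sleep(poke_interval)


# Hypothetical external condition: a file that "arrives" on the third check.
checks = iter([False, False, True])
pokes_needed = run_sensor(lambda: next(checks))
print(pokes_needed)
```

Because the loop sleeps between checks, a sensor holds its slot for as long as it waits, which is why it behaves like a long-running task.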
Make Dependencies in python
• set_upstream()
• set_downstream()
• << ( = set_upstream )
• >> ( = set_downstream )
[Diagram: graph with nodes A, B, C, D]
• B depends on A
• C depends on A
• D depends on B and C
( Example )
A.set_downstream(B)
A >> B
A >> [B, C] >> D
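How the bitshift operators can map onto set_upstream() and set_downstream() is shown in the toy class below. `ToyTask` is a hypothetical stand-in that only records dependencies; it is not Airflow's operator class, and it assumes groups of tasks are passed as Python lists.

```python
class ToyTask:
    """Hypothetical stand-in for an operator; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    @staticmethod
    def _as_list(tasks):
        return tasks if isinstance(tasks, list) else [tasks]

    def set_downstream(self, others):
        for other in self._as_list(others):
            self.downstream.add(other.task_id)
            other.upstream.add(self.task_id)

    def set_upstream(self, others):
        for other in self._as_list(others):
            other.set_downstream(self)

    def __rshift__(self, others):   # self >> others
        self.set_downstream(others)
        return others               # returning the right side allows chaining

    def __lshift__(self, others):   # self << others
        self.set_upstream(others)
        return others

    def __rrshift__(self, others):  # [a, b] >> self
        self.set_upstream(others)
        return self


A, B, C, D = (ToyTask(name) for name in "ABCD")
A >> [B, C] >> D
print(sorted(D.upstream), sorted(A.downstream))
```

The chain works because `A >> [B, C]` returns the list, and a plain list has no `>>` of its own, so Python falls back to `D.__rrshift__`.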
How the Scheduler Works
• DagRun
  • A DAG consists of Tasks and needs those tasks to run
  • When the scheduler parses a DAG, it automatically creates a DagRun, which is an instantiation of the DAG in time according to its start_date and schedule
• Backfill and Catchup
• Schedule Interval
  • None
  • @once
  • @hourly
  • @daily
  • @weekly
  • @monthly
  • @yearly
  • A cron time string can also be used: * * * * * = Minute (0-59), Hour (0-23), Day of the month (1-31), Month (1-12), Day of the week (0-7)
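The relationship between the presets and cron strings can be sketched as a lookup table. The cron equivalents below follow the Airflow documentation; `describe` and `FIELDS` are hypothetical names, and the helper only splits the five fields without validating their ranges. None and @once have no cron equivalent, so they are left out.

```python
# Cron equivalents of the presets, as listed in the Airflow documentation.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}

FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe(schedule_interval):
    """Map a preset or a five-field cron string to its named fields."""
    expression = PRESETS.get(schedule_interval, schedule_interval)
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


print(describe("@daily"))
print(describe("30 6 * * 1"))
```

Here `"30 6 * * 1"` reads as minute 30, hour 6, any day of the month, any month, day of the week 1 (Monday).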
Concurrency vs Parallelism
• Concurrent: the system can support two or more actions in progress at the same time
• Parallel: the system can support two or more actions executing simultaneously
• In concurrent systems, multiple actions can be in progress at the same time, although they may not all be executing at any given instant
• In parallel systems, multiple actions are executed simultaneously
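A small sketch can show concurrency without parallelism: two actions are both in progress, but only one step executes at any moment. The generator-based round-robin below is an illustrative assumption and has nothing Airflow-specific in it.

```python
def action(name, steps):
    """A generator that yields after each step so another action can run."""
    for step in range(1, steps + 1):
        yield f"{name}{step}"


def run_concurrently(*actions):
    """Round-robin over the actions on a single thread of execution."""
    trace, pending = [], list(actions)
    while pending:
        current = pending.pop(0)
        try:
            trace.append(next(current))
            pending.append(current)  # not finished: back of the line
        except StopIteration:
            pass  # this action has completed
    return trace


trace = run_concurrently(action("A", 2), action("B", 2))
print(trace)
```

The interleaved trace shows that B starts before A has finished, yet no two steps ever run at the same instant. A parallel system would need more than one worker executing steps simultaneously.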
Database and Executor
• Sequential Executor (default executor, SQLite)
  • The default executor you get when you run Apache Airflow
  • Runs only one task at a time (sequential); useful for debugging
  • It is the only executor that can be used with SQLite, since SQLite doesn't support multiple writers
• Local Executor (PostgreSQL)
  • Can run multiple tasks at a time
  • Uses the Python multiprocessing library and queues to parallelize the execution of tasks
  • Runs tasks by spawning processes in a controlled fashion, in different modes, on the same machine
  • The number of processes to spawn can be tuned with the parallelism parameter
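The effect of the parallelism setting can be sketched with a bounded worker pool. Note the swap: this sketch uses a standard-library thread pool as a stand-in for the processes the Local Executor spawns, and `run_tasks` is a hypothetical helper. With `parallelism=1` it behaves like the sequential case.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, parallelism):
    """Run callables with at most `parallelism` of them active at once.

    Returns (results, peak), where peak is the highest number of tasks
    observed running at the same moment.
    """
    lock = threading.Lock()
    active = peak = 0

    def tracked(task):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return task()
        finally:
            with lock:
                active -= 1

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(tracked, tasks))
    return results, peak


tasks = [lambda i=i: i * i for i in range(6)]
sequential_results, sequential_peak = run_tasks(tasks, parallelism=1)
local_results, local_peak = run_tasks(tasks, parallelism=3)
print(sequential_results == local_results, sequential_peak, local_peak <= 3)
```

Both runs produce the same results; only the number of tasks allowed to run at once differs.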
Database and Executor
• Celery Executor
  • Celery is a Python task-queue system
  • A task-queue system handles the distribution of tasks to workers across threads or network nodes
  • Tasks need to be pushed to a broker (RabbitMQ)
  • Celery workers pop them and schedule task executions
  • Recommended for production use of Airflow
  • Allows distributing the execution of task instances to multiple worker nodes (computers)
• Other executors: Dask, Mesos, Kubernetes, etc.
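The push and pop pattern behind a task queue can be sketched with the standard library alone. This is not Celery or RabbitMQ: an in-process `queue.Queue` stands in for the broker, threads stand in for worker nodes, and the task names and the "work" done are made up.

```python
import queue
import threading

broker = queue.Queue()      # stands in for the message broker
results = {}
results_lock = threading.Lock()
STOP = None                 # sentinel telling a worker to shut down


def worker(name):
    """Pop task messages from the broker until the STOP sentinel arrives."""
    while True:
        message = broker.get()
        if message is STOP:
            break
        task_id, payload = message
        with results_lock:
            results[task_id] = (name, payload * 2)  # pretend work


workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for thread in workers:
    thread.start()

# The scheduler side pushes task messages; any free worker may pick them up.
for task_number in range(5):
    broker.put((f"task_{task_number}", task_number))
for _ in workers:
    broker.put(STOP)        # one sentinel per worker
for thread in workers:
    thread.join()

print(sorted(results))
```

Which worker handles which task is not deterministic, but every task is processed exactly once. Adding capacity means starting more workers against the same broker.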
Celery Executor, PostgreSQL and RabbitMQ Structure
Executor Architecture
[Diagram, Local Executor (single machine): Meta DB, Web Server, Scheduler + Worker]
[Diagram, Celery Executor: Meta DB, Web Server, Scheduler + Worker, Worker, Worker, Celery]
Advanced Concept
• SubDAG
  • Minimises repetitive patterns
  • The main DAG manages all the subDAGs as normal tasks
  • SubDAGs must be scheduled the same as their parent DAG
• Hooks
  • Interfaces for interacting with external sources such as PostgreSQL, Spark, SFTP, ...
XCOM
• Lets tasks communicate (cross-communication: allows multiple tasks to exchange messages)
• Principally defined by a key, a value, and a timestamp
• XCom data can be "pushed" or "pulled"
• xcom_push()
  • If a task returns a value, an XCom containing that value is automatically pushed
• xcom_pull()
  • The task gets the message based on parameters such as "key", "task_ids", and "dag_id"
• Keys that are automatically given to XCOMs when they are pushed by being returned from
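The push and pull mechanics can be sketched with an in-memory store. `ToyXComStore` and `run_task` are hypothetical names, not Airflow's API; the one detail taken from Airflow is that a value returned by a task is stored under the key `return_value`.

```python
from datetime import datetime, timezone

RETURN_KEY = "return_value"  # key Airflow uses for values returned by a task


class ToyXComStore:
    """Hypothetical in-memory stand-in for the XCom table."""

    def __init__(self):
        self._rows = {}

    def push(self, dag_id, task_id, key, value):
        # Each entry carries a key, a value, and a timestamp.
        self._rows[(dag_id, task_id, key)] = (value, datetime.now(timezone.utc))

    def pull(self, dag_id, task_id, key=RETURN_KEY):
        row = self._rows.get((dag_id, task_id, key))
        return None if row is None else row[0]


def run_task(store, dag_id, task_id, python_callable):
    """Run a callable and push its return value, if any, under RETURN_KEY."""
    value = python_callable()
    if value is not None:
        store.push(dag_id, task_id, RETURN_KEY, value)
    return value


store = ToyXComStore()
run_task(store, "example_dag", "count_rows", lambda: 42)   # implicit push
store.push("example_dag", "count_rows", "source", "orders")  # explicit push
print(store.pull("example_dag", "count_rows"))
print(store.pull("example_dag", "count_rows", key="source"))
```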
Branching
• Allows a DAG to choose between different paths according to the result of a specific task
• Use BranchPythonOperator
• When branching, do not use the depends_on_past property
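The branching decision can be sketched as a function that marks one path to run and the rest as skipped. `resolve_branch` is a hypothetical helper and the task names are made up; the only behaviour borrowed from BranchPythonOperator is that the callable returns the task_id of the path to follow.

```python
def resolve_branch(choose, candidates):
    """Return {task_id: state}: the chosen branch runs, the others are skipped.

    `choose` plays the role of the callable given to a branch operator: it
    returns the task_id of the path to follow.
    """
    chosen = choose()
    if chosen not in candidates:
        raise ValueError(f"unknown task_id: {chosen!r}")
    return {t: ("run" if t == chosen else "skipped") for t in candidates}


row_count = 0  # hypothetical result of an upstream task
plan = resolve_branch(
    lambda: "load_rows" if row_count > 0 else "notify_empty",
    ["load_rows", "notify_empty"],
)
print(plan)
```

Because the paths that are not chosen end up skipped rather than successful, a rule that requires the previous run of a task to have succeeded can block later runs, which is one way to read the advice against combining branching with depends_on_past.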
Service Level Agreement ( SLAs )
• An SLA is a contract between a service provider and the end user that defines the level of service expected from the service provider
• Defines what the end user will receive (and must receive)
• Expressed as a time relative to the task's execution_date, not its start time (for example, more than 30 minutes after the execution_date)
• Different from the execution_timeout parameter, which stops the task and marks it as failed
Ad

More Related Content

What's hot (20)

Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
Sumit Maheshwari
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
Yohei Onishi
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
Varya Karpenko
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
Gerard Toonstra
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
Liangjun Jiang
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
Knoldus Inc.
 
Apache airflow
Apache airflowApache airflow
Apache airflow
Pavel Alexeev
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
Chandler Huang
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Anant Corporation
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
BagustTriCahyo1
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
Robert Sanders
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
Tao Feng
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
Derrick Qin
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
Sadayuki Furuhashi
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
Yohei Onishi
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
Gerard Toonstra
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
Liangjun Jiang
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
Robert Sanders
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
Tao Feng
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
Derrick Qin
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
Sadayuki Furuhashi
 

Similar to Airflow tutorials hands_on (20)

Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
SNEHAL MASNE
 
GoDocker presentation
GoDocker presentationGoDocker presentation
GoDocker presentation
Olivier Sallou
 
Building Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdfBuilding Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdf
abhaykm804
 
Hadoop institutes in Bangalore
Hadoop institutes in BangaloreHadoop institutes in Bangalore
Hadoop institutes in Bangalore
srikanthhadoop
 
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiaoadaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
lyvanlinh519
 
airflowpresentation1-180717183432.pptx
airflowpresentation1-180717183432.pptxairflowpresentation1-180717183432.pptx
airflowpresentation1-180717183432.pptx
VIJAYAPRABAP
 
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
InfluxData
 
Introduce Airflow.ppsx
Introduce Airflow.ppsxIntroduce Airflow.ppsx
Introduce Airflow.ppsx
ManKD
 
hadoop.ppt
hadoop.ppthadoop.ppt
hadoop.ppt
AnushkaChauhan68
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Michael Renner
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
EasyMedico.com
 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
shams03159691010
 
Nov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars georgeNov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars george
O'Reilly Media
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
InfluxData
 
Intro to Reactive Thinking and RxJava 2
Intro to Reactive Thinking and RxJava 2Intro to Reactive Thinking and RxJava 2
Intro to Reactive Thinking and RxJava 2
JollyRogers5
 
G pars
G parsG pars
G pars
NexThoughts Technologies
 
airflow web UI and CLI.pptx
airflow web UI and CLI.pptxairflow web UI and CLI.pptx
airflow web UI and CLI.pptx
VIJAYAPRABAP
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flink
Renato Guimaraes
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
SNEHAL MASNE
 
Building Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdfBuilding Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdf
abhaykm804
 
Hadoop institutes in Bangalore
Hadoop institutes in BangaloreHadoop institutes in Bangalore
Hadoop institutes in Bangalore
srikanthhadoop
 
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiaoadaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
lyvanlinh519
 
airflowpresentation1-180717183432.pptx
airflowpresentation1-180717183432.pptxairflowpresentation1-180717183432.pptx
airflowpresentation1-180717183432.pptx
VIJAYAPRABAP
 
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
InfluxData
 
Introduce Airflow.ppsx
Introduce Airflow.ppsxIntroduce Airflow.ppsx
Introduce Airflow.ppsx
ManKD
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Michael Renner
 
Nov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars georgeNov. 4, 2011 o reilly webcast-hbase- lars george
Nov. 4, 2011 o reilly webcast-hbase- lars george
O'Reilly Media
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
InfluxData
 
Intro to Reactive Thinking and RxJava 2
Intro to Reactive Thinking and RxJava 2Intro to Reactive Thinking and RxJava 2
Intro to Reactive Thinking and RxJava 2
JollyRogers5
 
airflow web UI and CLI.pptx
airflow web UI and CLI.pptxairflow web UI and CLI.pptx
airflow web UI and CLI.pptx
VIJAYAPRABAP
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flink
Renato Guimaraes
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
Ad

More from pko89403 (11)

Wide&Deep Recommendation Model
Wide&Deep Recommendation ModelWide&Deep Recommendation Model
Wide&Deep Recommendation Model
pko89403
 
DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks
DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks
DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks
pko89403
 
Item2Vec
Item2VecItem2Vec
Item2Vec
pko89403
 
Improving Language Understanding by Generative Pre-Training
Improving Language Understanding by Generative Pre-TrainingImproving Language Understanding by Generative Pre-Training
Improving Language Understanding by Generative Pre-Training
pko89403
 
CNN Introduction
CNN IntroductionCNN Introduction
CNN Introduction
pko89403
 
AutoEncoder&GAN Introduction
AutoEncoder&GAN IntroductionAutoEncoder&GAN Introduction
AutoEncoder&GAN Introduction
pko89403
 
Accelerating the machine learning lifecycle with m lflow
Accelerating the machine learning lifecycle with m lflowAccelerating the machine learning lifecycle with m lflow
Accelerating the machine learning lifecycle with m lflow
pko89403
 
Auto rec autoencoders meets collaborative filtering
Auto rec autoencoders meets collaborative filteringAuto rec autoencoders meets collaborative filtering
Auto rec autoencoders meets collaborative filtering
pko89403
 
Graph convolutional matrix completion
Graph convolutional  matrix completionGraph convolutional  matrix completion
Graph convolutional matrix completion
pko89403
 
Efficient thompson sampling for online matrix factorization recommendation
Efficient thompson sampling for online matrix factorization recommendationEfficient thompson sampling for online matrix factorization recommendation
Efficient thompson sampling for online matrix factorization recommendation
pko89403
 
Session based rcommendations with recurrent neural networks
Session based rcommendations with recurrent neural networksSession based rcommendations with recurrent neural networks
Session based rcommendations with recurrent neural networks
pko89403
 
Wide&Deep Recommendation Model
Wide&Deep Recommendation ModelWide&Deep Recommendation Model
Wide&Deep Recommendation Model
pko89403
 
DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks
DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks
DeepAR:Probabilistic Forecasting with Autogressive Recurrent Networks
pko89403
 
Improving Language Understanding by Generative Pre-Training
Improving Language Understanding by Generative Pre-TrainingImproving Language Understanding by Generative Pre-Training
Improving Language Understanding by Generative Pre-Training
pko89403
 
CNN Introduction
CNN IntroductionCNN Introduction
CNN Introduction
pko89403
 
AutoEncoder&GAN Introduction
AutoEncoder&GAN IntroductionAutoEncoder&GAN Introduction
AutoEncoder&GAN Introduction
pko89403
 
Accelerating the machine learning lifecycle with m lflow
Accelerating the machine learning lifecycle with m lflowAccelerating the machine learning lifecycle with m lflow
Accelerating the machine learning lifecycle with m lflow
pko89403
 
Auto rec autoencoders meets collaborative filtering
Auto rec autoencoders meets collaborative filteringAuto rec autoencoders meets collaborative filtering
Auto rec autoencoders meets collaborative filtering
pko89403
 
Graph convolutional matrix completion
Graph convolutional  matrix completionGraph convolutional  matrix completion
Graph convolutional matrix completion
pko89403
 
Efficient thompson sampling for online matrix factorization recommendation
Efficient thompson sampling for online matrix factorization recommendationEfficient thompson sampling for online matrix factorization recommendation
Efficient thompson sampling for online matrix factorization recommendation
pko89403
 
Session based rcommendations with recurrent neural networks
Session based rcommendations with recurrent neural networksSession based rcommendations with recurrent neural networks
Session based rcommendations with recurrent neural networks
pko89403
 
Ad

Recently uploaded (20)

AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
How to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process miningHow to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process mining
Process mining Evangelist
 
Voice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjgVoice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjg
4mg22ec401
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
Automation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success storyAutomation Platforms and Process Mining - success story
Automation Platforms and Process Mining - success story
Process mining Evangelist
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
Agricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptxAgricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptx
mostafaahammed38
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
How to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process miningHow to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process mining
Process mining Evangelist
 
Voice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjgVoice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjg
4mg22ec401
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 

Airflow tutorials hands_on

  • 1. Airflow: The Complete Hands-On Course. 강석우 (pko89403@gmail.com)
  • 2. Why
  • 4. What is Airflow?
      • Programmatically author, schedule, and monitor data pipelines
      • Components: Web Server, Scheduler, Executor, Worker, Metadata Database
      • Key concepts: DAG, Operator, Task, TaskInstance, Workflow
  • 5. Airflow Architecture
      • Airflow Webserver: serves the UI dashboard over HTTP
      • Airflow Scheduler: a daemon
      • Airflow Worker: working wrapper
      • Metadata Database: stores information regarding the state of tasks
      • Executor: message queuing process that is bound to the scheduler and determines the worker processes that execute scheduled tasks
      • (Diagram: Airflow Webserver, Airflow Scheduler, three Workers, Meta DB, Logs, DAGs)
  • 6. How Airflow works
      1. The scheduler reads the DAG folder.
      2. Your DAG is parsed by a process to create a DagRun based on the scheduling parameters of your DAG.
      3. A TaskInstance is instantiated for each Task that needs to be executed and flagged as "Scheduled" in the metadata database.
      4. The scheduler gets all TaskInstances flagged "Scheduled" from the metadata database, changes their state to "Queued" and sends them to the executors to be executed.
      5. Executors pull Tasks from the queue (depending on your execution setup) and change the state from "Queued" to "Running"; workers start executing the TaskInstances.
      6. When a Task is finished, the executor changes the state of that task to its final state (success, failed, etc.) in the database, and the scheduler updates the DagRun with the state "Success" or "Failed". The web server periodically fetches data from the metadata database to update the UI.
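The state changes in steps 3 to 6 can be pictured as a small state machine. The sketch below is a toy model in plain Python and not Airflow's own code: `ToyTaskInstance`, `TRANSITIONS` and `run` are hypothetical names, and only the states named on the slide are modelled.

```python
from enum import Enum


class State(Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


# Allowed transitions, following steps 3-6 (toy model, not Airflow's full state set).
TRANSITIONS = {
    State.SCHEDULED: {State.QUEUED},
    State.QUEUED: {State.RUNNING},
    State.RUNNING: {State.SUCCESS, State.FAILED},
}


class ToyTaskInstance:
    """Hypothetical stand-in for a TaskInstance; starts out as SCHEDULED."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.state = State.SCHEDULED
        self.history = [State.SCHEDULED]

    def move_to(self, new_state):
        # Reject any transition the lifecycle does not allow.
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        self.history.append(new_state)


def run(ti, work):
    """Drive one task instance through queued -> running -> final state."""
    ti.move_to(State.QUEUED)
    ti.move_to(State.RUNNING)
    try:
        work()
    except Exception:
        ti.move_to(State.FAILED)
    else:
        ti.move_to(State.SUCCESS)
    return ti.state
```

Calling `run(ToyTaskInstance("t1"), some_callable)` leaves the visited states in `history`, which mirrors what the metadata database records for each TaskInstance.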
  • 7. Start Airflow (from install to UI)
      • AIRFLOW_GPL_UNIDECODE=yes pip install "apache-airflow[celery,crypto,postgres,hive,rabbitmq,redis]"
      • airflow initdb
      • airflow upgradedb
      • ls
      • cd airflow
      • grep dags_folder airflow.cfg
      • mkdir -p /home/airflow/airflow/dags
      • ls
      • vim airflow.cfg (configuration file)
      • load_examples = False
      • airflow resetdb
      • airflow scheduler
      • airflow webserver
  • 8. Quick Tour of Airflow
      • airflow list_dags
      • airflow list_tasks {dag name} --tree
      • airflow test {dag name} python_task {execution date}
      • airflow -h
  • 9. What is a DAG?
      • A finite directed graph with no directed cycles
      • A DAG represents a collection of tasks to run, organized in a way that represents their dependencies and relations
      • Each node is a Task
      • Each edge is a dependency
      • How should the workflow be executed?
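To make the "no directed cycles" property concrete, here is a minimal sketch using the standard-library `graphlib` module (Python 3.9+). It is not how Airflow stores DAGs; the task names and the `execution_order` helper are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(deps):
    """Return one valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"not a DAG, cycle found: {err.args[1]}") from err
```

A valid order always places a task after everything it depends on; adding an edge that closes a loop (for example making `extract` depend on `load`) makes `execution_order` raise.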
  • 10. DAG's important properties
      • Defined in Python files placed into Airflow's DAG_FOLDER (usually ~/airflow/dags)
      • dag_id
      • description
      • start_date
      • schedule_interval
      • depends_on_past: a task runs in the next DagRun only if it completed successfully in the previous one
      • default_args: keyword parameters passed to the constructor when initializing operators
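The idea behind default_args, shared keyword arguments that every operator receives unless it overrides them, can be sketched without Airflow. `ToyDag` and `ToyOperator` below are hypothetical classes, and `retries` and `owner` are just example parameters.

```python
class ToyOperator:
    """Toy stand-in for an operator; retries and owner are illustrative parameters."""

    def __init__(self, task_id, retries=0, owner="nobody"):
        self.task_id = task_id
        self.retries = retries
        self.owner = owner


class ToyDag:
    """Toy stand-in for a DAG that applies default_args to its operators."""

    def __init__(self, dag_id, default_args=None):
        self.dag_id = dag_id
        self.default_args = dict(default_args or {})
        self.tasks = {}

    def add(self, operator_cls, task_id, **kwargs):
        # default_args fill in any keyword the caller did not pass explicitly;
        # explicit keywords win because they are merged last.
        merged = {**self.default_args, **kwargs}
        task = operator_cls(task_id=task_id, **merged)
        self.tasks[task_id] = task
        return task
```

With `ToyDag("demo", default_args={"retries": 2})`, a task added without `retries` gets 2, while `dag.add(ToyOperator, "b", retries=5)` keeps its own value.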
  • 11. What is an Operator?
      • Determines what actually gets done
      • Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators
      • The definition of a single task
      • Should be idempotent (always produces the same result)
      • A Task is created by instantiating an Operator class
      • An operator defines the nature of the task and how it should be executed
      • When an operator is instantiated, the task becomes a node in your DAG
  • 12. Many Operators
      • BashOperator
      • PythonOperator
      • EmailOperator (sends an email)
      • SqlOperator (executes a SQL command)
      • All operators inherit from BaseOperator
      • 3 types of operators
          • Action operators that perform an action (BashOperator, PythonOperator, EmailOperator, ...)
          • Transfer operators that move data from one system to another (sqlOperator, sftpOperator)
          • Sensor operators that wait for data to arrive at a defined location
  • 13. Operator ++
      • Transfer operators
          • Move data from one system to another
          • Data is pulled out from the source, staged on the machine where the executor is running, and then transferred to the target system
          • Don't use them if you are dealing with a large amount of data
      • Sensor operators
          • Inherit from BaseSensorOperator
          • Useful for monitoring external processes, like waiting for files to be uploaded to HDFS or a partition to appear in Hive
          • Basically a long-running task
          • A sensor operator has a poke method that is called repeatedly until it returns True (the method used for monitoring the external process)
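The poke loop described for sensors can be sketched in a few lines. This is a simplified illustration rather than BaseSensorOperator itself: `ToySensor` and `CountdownSensor` are hypothetical classes, and the `poke_interval` and `timeout` values are chosen only to keep the example fast.

```python
import time


class ToySensor:
    """Calls poke() every poke_interval seconds until it returns True or timeout passes."""

    def __init__(self, poke_interval=0.01, timeout=1.0):
        self.poke_interval = poke_interval
        self.timeout = timeout

    def poke(self):
        raise NotImplementedError

    def execute(self):
        started = time.monotonic()
        while not self.poke():
            if time.monotonic() - started > self.timeout:
                raise TimeoutError("sensor timed out")
            time.sleep(self.poke_interval)
        return True


class CountdownSensor(ToySensor):
    """Hypothetical sensor whose condition becomes true on the n-th poke."""

    def __init__(self, pokes_needed, **kwargs):
        super().__init__(**kwargs)
        self.pokes_needed = pokes_needed
        self.pokes = 0

    def poke(self):
        self.pokes += 1
        return self.pokes >= self.pokes_needed
```

A real sensor would check an external condition in `poke` (a file, a partition); because the task sleeps between pokes, it occupies a worker for as long as it waits, which is why sensors are described as long-running tasks.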
  • 14. Make Dependencies in Python
      • set_upstream()
      • set_downstream()
      • << ( = set_upstream )
      • >> ( = set_downstream )
      • (Diagram: tasks A, B, C, D)
          • B depends on A
          • C depends on A
          • D depends on B and C
      • Example
          • A.set_downstream(B)
          • A >> B
          • A >> [B, C] >> D
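How the bitshift operators can map onto set_upstream and set_downstream is sketched below with Python operator overloading. `ToyTask` is a hypothetical class written for this illustration; it imitates the behaviour described on the slide and is not Airflow's BaseOperator.

```python
def _as_list(item_or_items):
    """Accept a single task or a list/tuple of tasks."""
    if isinstance(item_or_items, (list, tuple)):
        return list(item_or_items)
    return [item_or_items]


class ToyTask:
    def __init__(self, name):
        self.name = name
        self.upstream = set()    # names of tasks this task depends on
        self.downstream = set()  # names of tasks that depend on this task

    def set_downstream(self, others):
        for other in _as_list(others):
            self.downstream.add(other.name)
            other.upstream.add(self.name)

    def set_upstream(self, others):
        for other in _as_list(others):
            other.set_downstream(self)

    def __rshift__(self, others):   # self >> others
        self.set_downstream(others)
        return others

    def __lshift__(self, others):   # self << others
        self.set_upstream(others)
        return others

    def __rrshift__(self, others):  # [a, b] >> self
        self.set_upstream(others)
        return self

    def __rlshift__(self, others):  # [a, b] << self
        self.set_downstream(others)
        return self


A, B, C, D = (ToyTask(name) for name in "ABCD")
A >> [B, C] >> D  # B and C depend on A; D depends on B and C
```

Returning the right-hand operand from `__rshift__` is what lets the chain continue; the list on the left of the second `>>` has no `__rshift__` of its own, so Python falls back to `D.__rrshift__`.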
  • 15. How the Scheduler Works
      • DagRun
          • A DAG consists of Tasks and needs those tasks to run
          • When the scheduler parses a DAG, it automatically creates a DagRun, which is an instantiation of the DAG in time according to start_date and schedule_interval
      • Backfill and catchup
      • schedule_interval
          • None
          • @once
          • @hourly
          • @daily
          • @weekly
          • @monthly
          • @yearly
      • A cron time string can also be used: * * * * * = Minute (0-59), Hour (0-23), Day of the month (1-31), Month (1-12), Day of the week (0-7)
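As a rough illustration of how DagRuns follow from start_date and the schedule, the sketch below lists the execution dates that are due at a given moment. It assumes a fixed `timedelta` interval instead of a cron expression and assumes a run is created only once its interval has fully elapsed; `due_execution_dates` is a hypothetical helper, not an Airflow function. The `PRESETS` table lists the cron strings these presets are commonly documented as equivalent to.

```python
from datetime import datetime, timedelta

# Cron equivalents of the schedule presets.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def due_execution_dates(start_date, interval, now, catchup=True):
    """List execution dates whose interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    # Without catchup, only the most recent completed interval is run.
    return dates if catchup else dates[-1:]
```

With a start_date of 2024-01-01 and a daily interval, asking at noon on 2024-01-04 yields runs for 1, 2 and 3 January; with `catchup=False` only the 3 January run remains.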
  • 16. Concurrency vs Parallelism
      • Concurrent: a system that can support two or more actions in progress at the same time
      • Parallel: a system that can support two or more actions executing simultaneously
      • In concurrent systems, multiple actions can be in progress (but not necessarily executing) at the same time
      • In parallel systems, multiple actions are executed simultaneously
  • 17. Database and Executor
      • Sequential Executor (default executor, SQLite)
          • The default executor you get when you run Apache Airflow
          • Runs only one task at a time (sequential); useful for debugging
          • It is the only executor that can be used with SQLite, since SQLite doesn't support multiple writers
      • Local Executor (PostgreSQL)
          • Can run multiple tasks at a time
          • Uses the Python multiprocessing library and queues to parallelize the execution of tasks
          • Runs tasks by spawning processes in a controlled fashion, in different modes, on the same machine
          • The number of processes to spawn can be tuned with the parallelism parameter
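The difference between running one task at a time and running several under a parallelism cap can be sketched with the standard library. This toy uses threads for brevity, whereas the slide describes the Local Executor as using multiprocessing; `run_tasks` is a hypothetical helper, not part of Airflow.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_tasks(task_ids, parallelism):
    """Run toy tasks with at most `parallelism` in flight; return (results, peak)."""
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def task(task_id):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.02)  # stand-in for real work
        with lock:
            in_flight -= 1
        return f"{task_id}: done"

    # max_workers plays the role of the parallelism setting.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(task, task_ids))
    return results, peak
```

With `parallelism=1` the tasks run strictly one after another, as with the Sequential Executor; with a higher value, `peak` reports how many tasks were actually in progress at once and never exceeds the cap.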
  • 18. Database and Executor
      • Celery Executor
          • Celery is a Python task-queue system
          • A task-queue system handles the distribution of tasks to workers across threads or network nodes
          • Tasks need to be pushed into a broker (RabbitMQ)
          • Celery workers pop them and schedule the task executions
          • Recommended for production use of Airflow
          • Allows distributing the execution of task instances to multiple worker nodes (computers)
      • Other executors: Dask, Mesos, Kubernetes, etc.
  • 19. Celery Executor, PostgreSQL and RabbitMQ Structure
  • 20. Executor Architecture
      • (Diagram) Local Executor (single machine): Meta DB, Web Server, Scheduler + Worker
      • (Diagram) Celery Executor: Meta DB, Web Server, Scheduler + Worker, Worker, Worker, Celery
  • 21. Advanced Concepts
      • SubDAG
          • Minimises repetitive patterns
          • The main DAG manages all the SubDAGs as normal tasks
          • SubDAGs must be scheduled the same as their parent DAG
      • Hooks
          • Interfaces to interact with your external sources (PostgreSQL, Spark, SFTP, ...)
  • 22. XCom
      • Lets tasks communicate (cross-communication; allows multiple tasks to exchange messages)
      • Principally defined by a key, a value and a timestamp
      • XCom data can be "pushed" or "pulled"
      • xcom_push()
          • If a task returns a value, an XCom containing that value is automatically pushed
      • xcom_pull()
          • The task gets the message based on parameters such as "key", "task_ids" and "dag_id"
      • Keys that are automatically given to XComs when they are pushed by being returned from
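The push and pull mechanics can be modelled with a small in-memory store. This is an illustrative sketch only: `ToyXCom` and `run_task` are hypothetical, the real XCom table lives in the metadata database, and the `"return_value"` key used here for returned values is an assumption of this sketch.

```python
from datetime import datetime, timezone

RETURN_KEY = "return_value"  # key this sketch uses for values a task returns


class ToyXCom:
    """In-memory stand-in for XCom rows: (dag_id, task_id, key) -> (value, timestamp)."""

    def __init__(self):
        self._rows = {}

    def push(self, dag_id, task_id, key, value):
        self._rows[(dag_id, task_id, key)] = (value, datetime.now(timezone.utc))

    def pull(self, dag_id, task_ids, key=RETURN_KEY):
        row = self._rows.get((dag_id, task_ids, key))
        return None if row is None else row[0]


def run_task(xcom, dag_id, task_id, func):
    """Run a task callable and push its return value automatically."""
    result = func()
    if result is not None:
        xcom.push(dag_id, task_id, RETURN_KEY, result)
    return result
```

A downstream task would then call `xcom.pull("my_dag", task_ids="extract")` to read what the upstream task returned, or pass an explicit `key` to read a value that was pushed by hand.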
  • 23. Branching
      • Allows a DAG to choose between different paths according to the result of a specific task
      • Use BranchPythonOperator
      • When using branching, do not use the depends_on_past property
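The branching idea, where a callable names the path to follow and the other paths are skipped, can be sketched as follows. `run_branch` is a hypothetical helper that imitates the behaviour described above; it is not the BranchPythonOperator API.

```python
def run_branch(choose, branches):
    """Run only the branch named by choose(); mark the other branches as skipped."""
    chosen = choose()
    if chosen not in branches:
        raise KeyError(f"unknown branch: {chosen}")
    states = {}
    for task_id, func in branches.items():
        if task_id == chosen:
            func()
            states[task_id] = "success"
        else:
            states[task_id] = "skipped"
    return states
```

The skipped state also suggests why depends_on_past is a poor fit here: a task skipped in one run did not succeed, so a later run that waits on its success could be held back.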
  • 24. Service Level Agreements (SLAs)
      • An SLA is a contract between a service provider and the end user that defines the level of service expected from the service provider
      • It defines what the end user will receive (what must be received)
      • The SLA is a time relative to the execution_date of the task, not its start time (for example, more than 30 minutes from the execution_date)
      • Different from the execution_timeout parameter, which stops the task and marks it as failed