🗞️ Pipelines

Taking the "DAG" onwards to Products.

In Orchestra, a pipeline is a directed acyclic graph ("DAG" for short). A DAG is the core concept of most workflow orchestration tools: it collects Tasks together and organises them with dependencies and relationships that determine how they should run.

Basic example DAG

The example above defines four Tasks - A, B, C and D - and dictates the order in which they have to run and which Tasks depend on which others. By default, the Pipeline has settings that Tasks defer to when running: in the example above, Task D will not run (it is moved to a "SKIPPED" state) if any of Task A, Task B or Task C fails.
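As a rough illustration only (not Orchestra's API), the sketch below models this example in Python. The exact edges of the figure are an assumption here - Tasks B and C are taken to depend on A, and Task D on B and C - and it shows how a failure upstream moves Task D to a "SKIPPED" state.

```python
# Minimal sketch of the example DAG above; each Task lists the upstream
# Tasks it depends on. Edges are assumed: B and C depend on A, D on B and C.
dag = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
}

def run_task(name: str) -> str:
    """Stand-in for real work; Task B is made to fail to show skip propagation."""
    return "FAILED" if name == "B" else "SUCCEEDED"

def execute(dag: dict) -> dict:
    statuses = {}
    # The dict above is already listed in dependency order, so a single pass works.
    for task, upstream in dag.items():
        if any(statuses[u] != "SUCCEEDED" for u in upstream):
            # Default behaviour described above: a downstream Task is skipped
            # when any of its upstream Tasks did not succeed.
            statuses[task] = "SKIPPED"
        else:
            statuses[task] = run_task(task)
    return statuses

print(execute(dag))
# {'A': 'SUCCEEDED', 'B': 'FAILED', 'C': 'SUCCEEDED', 'D': 'SKIPPED'}
```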

Settings

Settings are default values that can be overridden by the user to configure how a pipeline and the Tasks within it should run:

  • Timeouts: not currently enforced

  • Retries: Tasks retry three times by default

  • Name: the name of the pipeline, as set by the user

  • Schedule: the cron schedule, used if the trigger type for the pipeline is "Cron"
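Purely as a sketch (the field names below are hypothetical and not Orchestra's configuration schema), these settings can be pictured as a small settings object:

```python
# Hypothetical pipeline settings; field names are illustrative only and
# do not reflect Orchestra's actual schema.
pipeline_settings = {
    "name": "daily_sales_refresh",   # Name: set by the user
    "trigger": "Cron",               # Schedule below only applies when the trigger type is "Cron"
    "schedule": "0 6 * * *",         # cron schedule: 06:00 every day
    "retries": 3,                    # Retries: Tasks retry three times by default
    "timeout_seconds": None,         # Timeouts: not currently enforced
}
```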

Properties

Properties are values associated with a pipeline that are available to the user and to our back-end systems for monitoring and user-experience purposes. They are not set by the user:

  • Created time: the time the pipeline was created (immutable)

  • Updated time: the time the pipeline was last changed (mutable)

  • Number of Tasks: the number of tasks in the pipeline (mutable)

  • Status: the status of the last associated pipeline run (mutable)
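Purely to illustrate the shape of this metadata (this is not Orchestra's data model), the properties could be modelled like this:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative model of the back-end-maintained properties above;
# not Orchestra's actual data model.
@dataclass
class PipelineProperties:
    created_time: datetime   # immutable: set once when the pipeline is created
    updated_time: datetime   # mutable: refreshed whenever the pipeline changes
    number_of_tasks: int     # mutable: recalculated as Tasks are added or removed
    status: str              # mutable: status of the last associated pipeline run

props = PipelineProperties(
    created_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    updated_time=datetime.now(timezone.utc),
    number_of_tasks=4,
    status="SUCCEEDED",
)
```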

Usage advice

As Data Engineers and Analytics Engineers, we discovered one problem when scaling operations: running data pipelines on different schedules was complicated. This was important for two reasons:

  1. Running tasks in batch processes as infrequently as possible is desirable for reducing cloud warehouse costs

  2. We required different SLAs for internal "Data Products", which necessitated running pipelines on different schedules

In Orchestra, we advise running a single pipeline for a single Data Product or set of Data Products. We would typically expect each Pipeline to run on a different schedule, so a good place to start is with three Data Products, one for each schedule:

  • Hourly

  • Daily

  • Weekly

For most data teams starting out, a single Hourly or Daily pipeline should suffice. Any Pipeline can always be run manually, so for teams wanting to keep operations lean and cloud costs low, we'd recommend a daily schedule that excludes weekends, supplemented by manual runs where needed.
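To make those cadences concrete, here are standard five-field cron expressions for each of them, plus a weekday-only variant for a daily schedule excluding weekends; the times of day are illustrative, not a recommendation:

```python
# Standard five-field cron expressions for the cadences discussed above.
# The times of day are arbitrary examples.
HOURLY        = "0 * * * *"    # at minute 0 of every hour
DAILY         = "0 6 * * *"    # at 06:00 every day
WEEKLY        = "0 6 * * 1"    # at 06:00 every Monday
WEEKDAYS_ONLY = "0 6 * * 1-5"  # at 06:00 Monday to Friday (daily, excluding weekends)
```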
