🏛️ Artifacts

The key to maintainability and end-to-end observability

Introduction

In Orchestra, an artifact is any piece of data that results from a given operation or trigger (or both).

Artifacts are specific to integrations and specific to operations. In an ideal world, we would keep a directory or registry of the schema of every artifact, but this would be an enormous operational burden, and one we do not currently have the bandwidth to take on.

A simple example of an artifact relates to an HTTP call, in this case made using an HTTP integration.

A successful call may result in a response object. This has a number of properties, such as response.text, response.status, and of course the viewable, human-readable response.json, which might look like:

{"request_id" : "some_guid"}

This would be an artifact. Orchestra may store all or part of such artifacts, and then merges them into our data model, which is required to serve the UI in the portal.
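To make this concrete, here is a minimal sketch of how such an artifact could arise in Python, using the requests library (whose equivalents of the properties above are status_code, text, and json()). The endpoint and payload are hypothetical.

```python
import requests

# Hypothetical endpoint and payload; any HTTP operation would do.
response = requests.post(
    "https://api.example.com/v1/jobs",
    json={"job": "nightly_load"},
)

# Operational properties of the response object.
print(response.status_code)  # e.g. 200
print(response.elapsed)      # how long the call took

# The human-readable JSON body is the artifact itself; Orchestra would
# store all or part of it and merge it into the data model.
artifact = response.json()   # e.g. {"request_id": "some_guid"}
```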

How are artifacts used?

Orchestra does not always have this information available, however. Sometimes responses are limited by the connection methodology (for example, when using drivers to connect with databases). Other times, such as when Orchestra is triggered via webhook, the artifact might be user-uploaded and may therefore be of an arbitrary format.

Generally, there are two forms of information contained within an artifact:

| Information type | Description |
| --- | --- |
| Operational | Relates to the status of the operation in question. It is limited to items such as duration, status, error descriptions, execution times, and so on. |
| Informative | Relates to context around an operation that is not directly tied to its execution. This includes the operation's cost, the dependencies it has, the number of rows amended, potentially the names of underlying data assets, and so on. |
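As an illustration, a single artifact might carry both kinds of information. The field names below are hypothetical and do not reflect Orchestra's actual schema:

```python
# A hypothetical artifact for a single operation; field names are
# illustrative only, not Orchestra's actual schema.
artifact = {
    # Operational: the status of the operation itself.
    "status": "SUCCEEDED",
    "duration_seconds": 42.7,
    "started_at": "2024-01-01T02:00:00Z",
    "error": None,
    # Informative: context about the operation, not its execution.
    "cost_usd": 0.0031,
    "rows_amended": 15204,
    "depends_on": ["stg_orders", "stg_customers"],
    "data_assets": ["analytics.fct_orders"],
}
```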

As mentioned before, in Orchestra we use artifacts to a) inform users of what happened in their data pipelines, and b) construct a data model that provides additional layers of insight and analytical capability, such as end-to-end lineage.

How artifacts can be fetched

Artifacts are generally decided upon by Orchestra: because most data teams use the same sets of tools and custom infrastructure, we generally know which artifacts to fetch and how to merge them into our data model.

For example, run_results.json is important when we interact with dbt, as it contains a log of every model and every test that ran. Rather than requiring dbt users to explicitly request this file, Orchestra will fetch it automatically.
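For instance, a few of the datapoints available in run_results.json can be pulled out with a handful of lines of Python. The keys below (results, unique_id, status, execution_time) are part of dbt's documented schema for this file; the parsing itself is just a sketch.

```python
import json

# dbt writes run_results.json to the target/ directory after each invocation.
with open("target/run_results.json") as f:
    run_results = json.load(f)

# Each entry logs one model or test that ran.
for result in run_results["results"]:
    print(result["unique_id"], result["status"], result["execution_time"])
```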

Other times, users may wish to upload an artifact somewhere for Orchestra to access. This requires Orchestra both to know the type of artifact and to have a connection with which to access it.

Finally, users may wish to upload arbitrary artifacts to be rendered as JSON. In this case, Orchestra does not need to know the type of artifact (or rather, it assumes the artifact is a "custom" type) but still requires a connection to access it.
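As a sketch of the user side of this flow: an externally-run job could write a custom JSON artifact to object storage, which Orchestra would then read over a user-defined connection. The bucket, key, and payload below are hypothetical, and boto3/S3 is just one possible storage client.

```python
import json

import boto3

# Hypothetical custom artifact; since it is a "custom" type, the payload
# can be of any shape, as long as it is valid JSON.
payload = {"pipeline": "nightly_load", "rows_loaded": 15204, "status": "SUCCEEDED"}

# Hypothetical bucket and key; Orchestra would fetch this object using a
# user-defined connection with read access to the bucket.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-company-artifacts",
    Key="orchestra/nightly_load/artifact.json",
    Body=json.dumps(payload).encode("utf-8"),
    ContentType="application/json",
)
```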

Additionally, we are experimenting with removing the connection option and allowing users to push an artifact directly to Orchestra's storage. This is both cheaper and architecturally more elegant than having users upload artifacts to cloud storage, grant Orchestra access to that storage, and then have Orchestra pull the artifacts (leaving data sitting around unnecessarily).

| Artifact fetching | Description |
| --- | --- |
| Automatic | For certain operations, Orchestra will automatically try to fetch an artifact after the operation completes. This is a non-blocking operation for the data pipeline, but may fail. |
| User-opt-in | For additional metadata, users must opt in to Orchestra fetching the data. This is also non-blocking, but may fail. User-opt-in fetching is more likely to incur costs and/or count towards API request limits; these potential costs should still be negligible, and are flagged here purely for transparency. |
| User-defined, externally-hosted | For certain Operation and Trigger types, users can specify an artifact, a location, and a connection. Orchestra will try to fetch and validate the artifact from the user-defined, externally-hosted location using the user-defined connection. |
| TBD: User-pushed | For certain Operation and Trigger types, Orchestra will look for an artifact in the relevant storage location, but will assume it has been uploaded there by the user. |

Which operations cause artifacts to be fetched?

Generally, the completion of any Task will cause some artifacts to be fetched automatically. This is unavoidable, and is required for Orchestra to deliver basic datapoints such as the status of the task.

For most Tasks, there will also be User-opt-in artifacts that can be fetched, such as those required to fetch additional cost metadata.

Orchestra can also fetch User-defined, externally-hosted Artifacts for Triggers. This allows users to upload an artifact to object storage at the end of an externally-run pipeline, and Orchestra to glean this information and display it in the UI, giving users end-to-end lineage and visibility.

This is summarised in the table below:

| Action | Artifact |
| --- | --- |
| Trigger | Artifacts may be fetched from externally-hosted sources using a user-defined connection, if the user requires it. |
| Task | Artifacts are generally collected automatically. For additional datapoints, users may "opt in" to having additional metadata/artifacts fetched to enrich the metadata available to the Orchestra UI. |
