Running dbt Core on ECS

Details how to configure ECS jobs to collect dbt Core operational metadata.

Aim

Collect dbt Core operational metadata from dbt jobs that run as part of an ECS task in AWS. Information about the ECS Run Task integration can be found here.

Prerequisites

The following assumptions are made about your setup:

  • You are running dbt Core jobs as standalone ECS tasks

  • You can store the dbt Core output artifacts in S3

  • When running your dbt Core job, you wrap the call to dbt Core in a wrapper that catches any dbt command failures (a minimal sketch follows this list). Potential options:

    • A wrapper script (written in Python, Bash, etc.)

    • A dbt post-hook (see the dbt documentation here)
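
As a minimal sketch of the wrapper script option (the Example section below shows a fuller version), the key point is to capture the dbt exit code rather than letting a failure end the script before the artifacts are uploaded:

#!/bin/bash

# Minimal wrapper sketch: capture the dbt exit code instead of letting a
# failure end the script before the artifacts can be uploaded.
dbt run
DBT_EXIT_CODE=$?

# ... upload manifest.json and run_results.json to S3 here ...

# Propagate the exit code so a dbt failure fails the ECS task.
exit $DBT_EXIT_CODE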

Setup

To build dbt Operations from tasks run in ECS, Orchestra must have the manifest.json and run_results.json files for every dbt command executed as part of your ECS task.

Orchestra collects these files from S3. Therefore your ECS task should have the following features:

  • The Dockerfile entrypoint should be a script that will run the dbt commands and upload the artifacts to S3.

  • The script should execute the dbt command in a wrapper to catch any errors.

  • On success or failure of the dbt command, the wrapper script must store the manifest.json and run_results.json in an S3 bucket in your AWS account.

  • If the dbt command fails, it is recommended that the wrapper script exits with a non-zero exit code. This way the ECS task will enter a failed state in the Orchestra engine and no subsequent Task Runs will run.

  • If you are running multiple dbt commands within the same script, ensure you upload the run_results.json files with a numeric suffix indicating the order in which they ran, i.e. the results of the first dbt command should be uploaded to S3 as run_results_1.json, the second as run_results_2.json, and so on. See the example below for more details.

  • When uploading to S3, the key must be of the format <YOUR_S3_PREFIX>/$ORCHESTRA_TASK_RUN_ID/*. The ORCHESTRA_TASK_RUN_ID value is an environment variable that Orchestra injects into every container that runs as part of an ECS task. Including it in the S3 key allows Orchestra to fetch the correct artifacts from S3 for that specific Task Run. An illustrative upload is shown below.
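
For illustration, assuming a bucket named my-dbt-artifacts and a prefix of dbt (both hypothetical names), an upload that respects this key format might look like:

# Hypothetical upload honouring the required key format:
# <YOUR_S3_PREFIX>/$ORCHESTRA_TASK_RUN_ID/*
aws s3 cp target/run_results.json \
  "s3://my-dbt-artifacts/dbt/$ORCHESTRA_TASK_RUN_ID/run_results_1.json"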

The AWS User you create to run the ECS task must also have permissions to download objects from this S3 bucket, in addition to the permissions already required to run the ECS task itself (details of which can be found here). The required permissions are as follows:

{
    "Sid": "VisualEditor3",
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:ListBucket"
    ],
    "Resource": [
        "<BUCKET_ARN>/*",
        "<BUCKET_ARN>"
    ]
}
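
As a sketch, assuming the statement above has been wrapped in a full policy document saved as orchestra-s3-read.json and that the user is named orchestra-runner (both hypothetical names), the policy could be attached with the AWS CLI:

# Hypothetical: attach the S3 read permissions (wrapped in a full policy
# document) to the AWS User that runs the ECS task.
aws iam put-user-policy \
  --user-name orchestra-runner \
  --policy-name orchestra-dbt-artifact-access \
  --policy-document file://orchestra-s3-read.json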

Example

There are many ways to implement the above requirements. For example, you could write a Python script that executes the dbt commands and uses boto3 to upload the artifacts to S3.

For this example, we have used a bash script that executes a dbt run command and then runs some tests. After each command, it uploads the artifacts to the correct path in S3. Notice how it stores the artifacts for each dbt command in a separate folder to prevent dbt from overwriting the files on each command execution.

#!/bin/bash

if [ -z "$OUTPUT_BUCKET_NAME" ]; then
  echo "OUTPUT_BUCKET_NAME is not set"
  exit 1
fi
if [ -z "$S3_STORAGE_PREFIX" ]; then
  S3_STORAGE_PREFIX="dbt"
fi

# Setup and install packages
cd dbt_files || exit 1  # fail fast if the project directory is missing
mkdir -p target/run && mkdir -p target/test
dbt deps

# Run dbt
dbt run --select tag:working --target-path target/run
RUN_EXIT_CODE=$?

# Copy artifacts to S3. Run results = run_results_1.json
aws s3 cp target/run/manifest.json "s3://$OUTPUT_BUCKET_NAME/$S3_STORAGE_PREFIX/$ORCHESTRA_TASK_RUN_ID/run/manifest.json"
aws s3 cp target/run/run_results.json "s3://$OUTPUT_BUCKET_NAME/$S3_STORAGE_PREFIX/$ORCHESTRA_TASK_RUN_ID/run/run_results_1.json"

if [ $RUN_EXIT_CODE -ne 0 ]; then
  # If run command failed. Exit early
  echo "dbt run failed"
  exit 1
fi

# Run tests
dbt test --select tag:broken_tests --target-path target/test
TEST_EXIT_CODE=$?

# Copy artifacts to S3. Run results = run_results_2.json
aws s3 cp target/test/manifest.json "s3://$OUTPUT_BUCKET_NAME/$S3_STORAGE_PREFIX/$ORCHESTRA_TASK_RUN_ID/test/manifest.json"
aws s3 cp target/test/run_results.json "s3://$OUTPUT_BUCKET_NAME/$S3_STORAGE_PREFIX/$ORCHESTRA_TASK_RUN_ID/test/run_results_2.json"

# Exit with the test exit code
echo "dbt run complete"
exit $TEST_EXIT_CODE
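
To sanity-check the script, you can list the artifacts uploaded for a given Task Run, for example from inside the container (where the environment variables above are set) or with the Task Run ID substituted manually:

# List the artifacts uploaded for this Task Run to confirm the run/ and
# test/ folders contain the expected files.
aws s3 ls "s3://$OUTPUT_BUCKET_NAME/$S3_STORAGE_PREFIX/$ORCHESTRA_TASK_RUN_ID/" --recursive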
