Customers are adopting event-driven architectures to improve the agility and resiliency of their applications. As a result, data engineers are increasingly looking for simple-to-use yet powerful and feature-rich data processing tools to build pipelines that enrich data, move data in and out of their data lake and data warehouse, and analyze data. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.
Data integration jobs have varying degrees of priority and time sensitivity. For example, you can use batch processing to process weekly sales data, but in some cases data needs to be processed immediately. Fraud detection applications, for example, require near-real-time processing of security logs. Similarly, if a partner uploads product information to your Amazon Simple Storage Service (Amazon S3) bucket, it needs to be processed right away to ensure that your website has the latest product information.
This post discusses how to configure AWS Glue workflows to run based on real-time events. You no longer need to set schedules or build complex solutions to trigger jobs based on events; AWS Glue event-driven workflows manage it all for you.
Get started with AWS Glue event-driven workflows
As a business requirement, most companies need to hydrate their data lake and data warehouse with data in near-real time. They run their pipelines on a schedule (hourly, daily, or even weekly) or trigger the pipeline through an external system. It’s difficult to predict the frequency at which upstream systems generate data, which makes it difficult to plan and schedule ETL pipelines to run efficiently. Scheduling ETL pipelines to run too frequently can be expensive, whereas scheduling pipelines to run infrequently can lead to making decisions based on stale data. Similarly, triggering pipelines from an external process can increase complexity, cost, and job startup time.
AWS Glue now supports event-driven workflows, a capability that lets developers start AWS Glue workflows based on events delivered by Amazon EventBridge. With this new feature, you can trigger a data integration workflow from events generated by AWS services, software as a service (SaaS) providers, and custom applications. For example, you can react to S3 events generated when new buckets are created or when new files are uploaded to a specific S3 location. In addition, if your environment generates many events, AWS Glue allows you to batch them either by time duration or by the number of events.
To get started, you create a new AWS Glue trigger of type EVENT and place it as the first trigger in your workflow. You can optionally specify a batching condition. Without event batching, the AWS Glue workflow is triggered every time an EventBridge rule matches, which may result in multiple concurrent workflow runs. In some environments, starting many concurrent workflow runs could lead to throttling, reaching service quota limits, and potential cost overruns. It can also cause workflow run failures if the concurrency limits specified on the workflow and on the jobs within the workflow do not match. Event batching allows you to configure the number of events to buffer or the maximum elapsed time before firing the trigger. After the batching condition is met, a workflow run is started. For example, you can trigger your workflow when 100 files are uploaded to Amazon S3 or 5 minutes after the first upload. We recommend configuring event batching to avoid too many concurrent workflow runs and to optimize resource usage and cost.
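As a minimal sketch of the trigger described above, the following Python builds the arguments for the AWS Glue `CreateTrigger` API with an event batching condition of 100 events or 300 seconds, matching the example in the previous paragraph. The trigger, workflow, and job names passed in are hypothetical placeholders, and the sketch assumes `boto3` is installed and credentials are configured before `create_event_trigger` is called.

```python
from typing import Any, Dict, List


def build_event_trigger_request(
    trigger_name: str,
    workflow_name: str,
    job_names: List[str],
    batch_size: int = 100,
    batch_window_seconds: int = 300,
) -> Dict[str, Any]:
    """Build keyword arguments for glue.create_trigger (names are placeholders)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_window_seconds < 1:
        raise ValueError("batch_window_seconds must be at least 1")
    return {
        "Name": trigger_name,
        "WorkflowName": workflow_name,
        # EVENT marks this as the EventBridge-driven starting trigger.
        "Type": "EVENT",
        # Jobs started when the trigger fires.
        "Actions": [{"JobName": name} for name in job_names],
        # Fire after batch_size events, or batch_window_seconds after the
        # first buffered event, whichever comes first.
        "EventBatchingCondition": {
            "BatchSize": batch_size,
            "BatchWindow": batch_window_seconds,
        },
    }


def create_event_trigger(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request to AWS Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    return glue.create_trigger(**request)
```

Keeping the request builder separate from the API call makes it possible to review or unit test the trigger definition without touching an AWS account.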
Overview of the solution
In this post, we walk through a solution to set up an AWS Glue workflow that listens to S3 PutObject data events captured by AWS CloudTrail. This workflow is configured to run when five new files are added or when the batching window of 900 seconds expires after the first file is added. The following diagram illustrates the architecture.
The steps in this solution are as follows:
- Create an AWS Glue workflow with a starting trigger of EVENT type and configure the batch size on the trigger to be five and the batch window to be 900 seconds.
- Configure Amazon S3 to log data events, such as PutObject API calls, to CloudTrail.
- Create a rule in EventBridge to forward the PutObject API events to AWS Glue when they are emitted by CloudTrail.
- Add an AWS Glue event-driven workflow as a target to the EventBridge rule.
- To start the workflow, upload files to the S3 bucket. Remember you need to have at least five files before the workflow is triggered.
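To make the batching behavior in these steps concrete, here is a small, self-contained Python model of the condition "five events, or 900 seconds after the first event". It is only an illustration of the semantics described in this post, not AWS Glue's actual implementation; in particular, how an event that arrives exactly when the window expires is handled is an assumption of this sketch.

```python
from typing import List


def simulate_batching(
    event_times: List[float],
    batch_size: int = 5,
    batch_window: float = 900.0,
) -> List[float]:
    """Return the times (in seconds) at which a workflow run would start."""
    fire_times: List[float] = []
    first_event_time = None  # arrival time of the first buffered event
    buffered = 0             # number of events currently buffered

    for t in sorted(event_times):
        # If the window expired before this event arrived, the buffered
        # events start a run at the moment the window closed.
        if first_event_time is not None and t >= first_event_time + batch_window:
            fire_times.append(first_event_time + batch_window)
            first_event_time, buffered = None, 0

        if first_event_time is None:
            first_event_time = t
        buffered += 1

        # Reaching the batch size starts a run immediately.
        if buffered >= batch_size:
            fire_times.append(t)
            first_event_time, buffered = None, 0

    # Events still buffered at the end start a run when their window expires.
    if first_event_time is not None:
        fire_times.append(first_event_time + batch_window)
    return fire_times
```

In this model, five uploads within the window start one run at the time of the fifth upload, whereas four uploads start one run 900 seconds after the first of them.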
Deploy the solution with AWS CloudFormation
For a quick start of this solution, you can deploy the provided AWS CloudFormation stack. This creates all the required resources in your account.
The CloudFormation template generates the following resources:
- S3 bucket – This is used to store data, CloudTrail logs, job scripts, and any temporary files generated during the AWS Glue ETL job run.
- CloudTrail trail with S3 data events enabled – This enables EventBridge to receive PutObject API call data on a specific bucket.
- AWS Glue workflow – A data processing pipeline that is comprised of a crawler, jobs, and triggers. This workflow converts uploaded data files into Apache Parquet format.
- AWS Glue database – The AWS Glue Data Catalog database that is used to hold the tables created in this walkthrough.
- AWS Glue table – The Data Catalog table representing the Parquet files being converted by the workflow.
- AWS Lambda function – This is used as an AWS CloudFormation custom resource to copy job scripts from an AWS Glue-managed GitHub repository and an AWS Big Data blog S3 bucket to your S3 bucket.
- IAM roles and policies – We use the following AWS Identity and Access Management (IAM) roles:
  - LambdaExecutionRole – Runs the Lambda function that has permission to upload the job scripts to the S3 bucket.
  - GlueServiceRole – Runs the AWS Glue job that has permission to download the script, read data from the source, and write data to the destination after conversion.
  - EventBridgeGlueExecutionRole – Has permissions to invoke the NotifyEvent API for an AWS Glue workflow.
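For illustration, the two policy documents that a role like EventBridgeGlueExecutionRole needs can be sketched as follows: a trust policy that lets the `events.amazonaws.com` service principal assume the role, and a permissions policy that allows `glue:NotifyEvent` on the workflow. The Region, account ID, and workflow name are placeholders supplied by the caller, and the exact policies in the CloudFormation template may differ from this sketch.

```python
import json
from typing import Any, Dict


def eventbridge_trust_policy() -> Dict[str, Any]:
    """Trust policy that lets EventBridge assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "events.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def notify_event_policy(region: str, account_id: str, workflow_name: str) -> Dict[str, Any]:
    """Permissions policy that allows glue:NotifyEvent on one workflow."""
    workflow_arn = f"arn:aws:glue:{region}:{account_id}:workflow/{workflow_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "glue:NotifyEvent",
                "Resource": workflow_arn,
            }
        ],
    }


def to_json(policy: Dict[str, Any]) -> str:
    """Serialize a policy document, for example to pass to the IAM API."""
    return json.dumps(policy, indent=2)
```

Scoping the permissions policy to a single workflow ARN keeps the role limited to the one action EventBridge needs to perform.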
To launch the CloudFormation stack, complete the following steps:
- Sign in to the AWS CloudFormation console.
- Choose Launch Stack:
- Choose Next.
- For S3BucketName, enter the unique name of your new S3 bucket.
- For WorkflowName, DatabaseName, and TableName, leave as the default.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
It takes a few minutes for the stack creation to complete; you can follow the progress on the Events tab.
By default, the workflow runs whenever a single file is uploaded to the S3 bucket, resulting in a PutObject API call. In the next section, we configure the event batching to change this behavior.
Review the AWS Glue trigger and add event batching conditions
The CloudFormation template provisioned an AWS Glue workflow including a crawler, jobs, and triggers. The first trigger in the workflow is configured as an event-based trigger. Next, we update this trigger to batch five events or wait for 900 seconds after the first event before it starts the workflow.
Before we make any changes, let’s review the trigger on the AWS Glue console:
- On the AWS Glue console, under ETL, choose Triggers.
- Choose <Workflow-name>_pre_job_trigger.
- Choose Edit.
We can see the trigger’s type is set to EventBridge event, which means it’s an event-based trigger. Let’s change the event batching condition to run the workflow after five files are uploaded to Amazon S3.
- For Number of events, enter 5.
- For Time delay (sec), enter 900.
- Choose Next.
- On the next screen, under Choose jobs to trigger, leave as the default and choose Next.
- Choose Finish.
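If you prefer to confirm the change outside the console, the following sketch reads a trigger definition with the AWS Glue `GetTrigger` API and summarizes its batching condition. It assumes `boto3` and AWS credentials are available when `fetch_trigger` is called; the trigger name you pass in is whatever the stack created in your account.

```python
from typing import Any, Dict


def summarize_batching(trigger: Dict[str, Any]) -> str:
    """Describe the batching behavior of a trigger returned by get_trigger."""
    name = trigger.get("Name", "<unnamed>")
    if trigger.get("Type") != "EVENT":
        return f"{name} is not an event-based trigger (type {trigger.get('Type')})"

    condition = trigger.get("EventBatchingCondition")
    if not condition:
        # Without a batching condition, every matching event starts a run.
        return f"{name} starts a workflow run for every matching event"

    size = condition.get("BatchSize")
    window = condition.get("BatchWindow")
    return (
        f"{name} starts a workflow run after {size} events "
        f"or {window} seconds after the first event"
    )


def fetch_trigger(trigger_name: str) -> Dict[str, Any]:
    """Read the trigger definition. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so summarize_batching works offline

    glue = boto3.client("glue")
    return glue.get_trigger(Name=trigger_name)["Trigger"]
```

For example, `print(summarize_batching(fetch_trigger("<Workflow-name>_pre_job_trigger")))` reports the batch size and window configured in the preceding steps.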
Review the EventBridge rule
The CloudFormation template created an EventBridge rule to forward S3 PutObject API events to AWS Glue. Let’s review the configuration of the EventBridge rule:
- On the EventBridge console, under Events, choose Rules.
- Choose s3_file_upload_trigger_rule-<CloudFormation-stack-name>.
- Review the information in the Event pattern section.
The event pattern shows that this rule is triggered when an S3 object is uploaded to s3://<bucket_name>/data/products_raw/. CloudTrail captures the PutObject API calls made and relays them as events to EventBridge.
- In the Targets section, you can verify that this EventBridge rule is configured with an AWS Glue workflow as a target.
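A rule of this kind can be sketched in Python as follows. The event pattern matches CloudTrail-recorded `PutObject` calls on one bucket and key prefix, and the target points at the workflow ARN with an IAM role that EventBridge assumes. The pattern is an approximation written for illustration and may differ in detail from the one in the CloudFormation template; the rule name, ARNs, and target ID are placeholders.

```python
import json
from typing import Any, Dict


def build_put_object_pattern(bucket_name: str, key_prefix: str) -> Dict[str, Any]:
    """Event pattern for S3 PutObject calls recorded by CloudTrail."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {
                "bucketName": [bucket_name],
                # Prefix matching limits the rule to one folder in the bucket.
                "key": [{"prefix": key_prefix}],
            },
        },
    }


def build_glue_workflow_target(
    workflow_arn: str, role_arn: str, target_id: str = "glue-workflow"
) -> Dict[str, str]:
    """Target entry that forwards matching events to an AWS Glue workflow."""
    return {"Id": target_id, "Arn": workflow_arn, "RoleArn": role_arn}


def create_rule_with_target(
    rule_name: str, pattern: Dict[str, Any], target: Dict[str, str]
) -> None:
    """Create the rule and attach the target. Requires boto3 and credentials."""
    import boto3  # imported lazily so the builders work without boto3

    events = boto3.client("events")
    events.put_rule(Name=rule_name, EventPattern=json.dumps(pattern), State="ENABLED")
    events.put_targets(Rule=rule_name, Targets=[target])
```

Calling `build_put_object_pattern("<bucket_name>", "data/products_raw/")` produces a pattern for the upload location used in this walkthrough.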
Trigger the AWS Glue workflow by uploading files to Amazon S3
To test your workflow, we upload files to Amazon S3 using the AWS Command Line Interface (AWS CLI). If you don’t have the AWS CLI, see Installing, updating, and uninstalling the AWS CLI.
Let’s upload some small files to your S3 bucket.
- Run the following command to upload the first file to your S3 bucket:
- Run the following command to upload the second file:
- Run the following command to upload the third file:
- Run the following command to upload the fourth file:
These four uploads didn’t trigger the workflow because they didn’t meet the batch condition of five events.
- Run the following command to upload the fifth file:
Now the five JSON files have been uploaded to Amazon S3.
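As an alternative to the AWS CLI, the same test can be sketched with Python and `boto3`. The file names and record contents below are made-up placeholders, not the sample data used in this post, so for a real run you should upload files that match the schema your workflow expects. The sketch calls `put_object` directly so that each upload is recorded as a single PutObject API call.

```python
import json
from typing import List


def build_object_keys(prefix: str, count: int = 5) -> List[str]:
    """Generate placeholder object keys under the given prefix."""
    return [f"{prefix}sample_{i}.json" for i in range(1, count + 1)]


def upload_sample_files(bucket_name: str, keys: List[str]) -> None:
    """Upload one small JSON object per key. Requires boto3 and credentials."""
    import boto3  # imported lazily so build_object_keys works without boto3

    s3 = boto3.client("s3")
    for i, key in enumerate(keys, start=1):
        # Placeholder record; replace with data that matches your table schema.
        body = json.dumps({"id": i}).encode("utf-8")
        s3.put_object(Bucket=bucket_name, Key=key, Body=body)
```

With the batching condition from this walkthrough, calling `upload_sample_files("<bucket_name>", build_object_keys("data/products_raw/"))` sends five events, which is enough to start one workflow run.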
Verify the AWS Glue workflow is triggered successfully
Now the workflow should be triggered. Open the AWS Glue console to validate that your workflow is in the Running state.
To view the run details, complete the following steps:
- On the History tab of the workflow, choose the current or most recent workflow run.
- Choose View run details.
When the workflow run status changes to Completed, check the converted files in your S3 bucket.
- Switch to the Amazon S3 console, and navigate to your bucket.
You can see the Parquet files under
Congratulations! Your workflow ran successfully based on S3 events triggered by uploading files to your bucket. You can verify everything works as expected by running a query against the generated table using Amazon Athena.
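The run status can also be checked programmatically. The following sketch uses the AWS Glue `GetWorkflowRuns` API and picks the most recently started run; it assumes `boto3` and AWS credentials are available when `fetch_runs` is called, and the workflow name is whatever you chose when deploying the stack.

```python
from typing import Any, Dict, List, Optional


def latest_run_status(runs: List[Dict[str, Any]]) -> Optional[str]:
    """Return the status of the most recently started run, or None if no runs."""
    if not runs:
        return None
    # Sort explicitly by start time rather than relying on response order.
    latest = max(runs, key=lambda run: run["StartedOn"])
    return latest["Status"]


def fetch_runs(workflow_name: str) -> List[Dict[str, Any]]:
    """List workflow runs. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so latest_run_status works offline

    glue = boto3.client("glue")
    return glue.get_workflow_runs(Name=workflow_name)["Runs"]
```

A status of `RUNNING` means the run is still in progress, and `COMPLETED` means the converted Parquet files should be available in the bucket.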
Verify the metrics for the EventBridge rule
Optionally, you can use Amazon CloudWatch metrics to validate the events were sent to the AWS Glue workflow.
- On the EventBridge console, in the navigation pane, choose Rules.
- Select your EventBridge rule s3_file_upload_trigger_rule-<Workflow-name> and choose Metrics for the rule.
When the target workflow is invoked by the rule, the TriggeredRules metric is published.
FailedInvocations is published if the EventBridge rule is unable to trigger the AWS Glue workflow. In that case, we recommend you check the following configurations:
- Verify the IAM role provided to the EventBridge rule allows the glue:NotifyEvent permission on the AWS Glue workflow.
- Verify the trust relationship on the IAM role provides the events.amazonaws.com service principal the ability to assume the role.
- Verify the starting trigger on your target AWS Glue workflow is an event-based trigger.
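The same metrics can be read with the CloudWatch `GetMetricStatistics` API. The sketch below builds a query for a rule's metric in the `AWS/Events` namespace and sums the returned data points; the lookback window and period are arbitrary choices for illustration, and `fetch_metric_total` assumes `boto3` and AWS credentials are available.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def build_metric_query(
    rule_name: str,
    metric_name: str,
    end_time: datetime,
    lookback_minutes: int = 60,
) -> Dict[str, Any]:
    """Build keyword arguments for cloudwatch.get_metric_statistics."""
    return {
        "Namespace": "AWS/Events",
        "MetricName": metric_name,  # for example, FailedInvocations
        "Dimensions": [{"Name": "RuleName", "Value": rule_name}],
        "StartTime": end_time - timedelta(minutes=lookback_minutes),
        "EndTime": end_time,
        "Period": 300,  # five-minute buckets
        "Statistics": ["Sum"],
    }


def total(datapoints: List[Dict[str, Any]]) -> float:
    """Add up the Sum statistic across all returned data points."""
    return sum(point["Sum"] for point in datapoints)


def fetch_metric_total(rule_name: str, metric_name: str) -> float:
    """Query CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers work without boto3

    cloudwatch = boto3.client("cloudwatch")
    query = build_metric_query(rule_name, metric_name, datetime.now(timezone.utc))
    return total(cloudwatch.get_metric_statistics(**query)["Datapoints"])
```

A nonzero total for `FailedInvocations` suggests working through the checklist above, starting with the IAM role.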
Now to the final step, cleaning up the resources. Delete the CloudFormation stack to remove any resources you created as part of this walkthrough.
AWS Glue event-driven workflows enable data engineers to easily build event-driven ETL pipelines that respond in near-real time, delivering fresh data to business users. In this post, we demonstrated how to configure a rule in EventBridge to forward events to AWS Glue. We also saw how to create an event-based trigger that starts an AWS Glue ETL workflow either immediately or after a set number of events or period of time. Migrating your existing AWS Glue workflows to make them event-driven is easy: change the first trigger in the workflow to type EVENT and add the workflow as a target to an EventBridge rule that captures the events you are interested in.
For more information about event-driven AWS Glue workflows, see Starting an AWS Glue Workflow with an Amazon EventBridge Event.
About the Authors
Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. In his spare time, he enjoys playing with his children. They are addicted to grabbing crayfish and worms in the park, and putting them in the same jar to observe what happens.