This post is co-written with data engineers, Anton Morozov and James Phillips, from Weatherbug.
WeatherBug is a brand owned by GroundTruth, based in New York City, that provides location-based advertising solutions to businesses. WeatherBug consists of a mobile app reporting live and forecast data on hyperlocal weather to consumer users. The WeatherBug Data Engineering team has built a modern analytics platform to serve multiple use cases, including weather forecasting and location-based advertising, that is completely built on AWS. They use an Amazon Simple Storage Service (Amazon S3) data lake to store clickstream data and use Amazon Redshift as their cloud data warehouse platform.
In this post, we share how WeatherBug reduced their extract, transform, and load (ETL) latency using Amazon Redshift Spectrum.
Amazon Redshift Spectrum overview
Amazon Redshift is the most widely used cloud data warehouse. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.
Redshift Spectrum allows you to query open format data directly in the S3 data lake without having to load the data or duplicate your infrastructure. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. For more information, see Amazon Redshift Spectrum overview and Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.
Redshift Spectrum runs on a massive compute fleet independent of your Amazon Redshift cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Therefore, Redshift Spectrum queries use much less of your cluster’s processing capacity than other queries. With Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.
To summarize performance metrics for internal BI needs, WeatherBug has to move a lot of data between their S3 data lake and Amazon Redshift cluster using complex ETL processings. They used Apache Airflow to orchestrate their pipeline during the ETL process and used Apache Hive for large-scale ETL jobs in order to offload the data from the Amazon Redshift cluster. The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. You can project this structure onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
WeatherBug chose Hive as a solution because it was easy to use with their Python/Airflow pipeline and they were able to launch multiple jobs in parallel. This solution was working well for WeatherBug but it needed more engineering effort to build the ETL processes along with operational overheads on the current solution because it involved multiple technologies.
New Amazon Redshift Spectrum-based solution
To optimize the current Hadoop-based solution to move data between their Amazon Redshift cluster and S3 buckets, the WeatherBug team considered Redshift Spectrum as an alternative solution. The following diagram shows their updated architecture.
WeatherBug created Redshift Spectrum external tables that pointed to the data stored in their S3 buckets. This helped them perform the data movement and transformations from Amazon Redshift to Amazon S3 using
Insert into <external table> select from <Redshift table> and from Amazon S3 to Amazon Redshift using
Insert into <Redshift table> select from <external table> along with the data transformations in the inline SQL queries.
During the pilot implementation on a few sample tables with this current solution, WeatherBug found it very easy to learn how to use the Amazon Redshift Spectrum features, and not long after they had a proof of concept far superior to their existing Hadoop-based solution. They reduced the targeted pipeline’s runtime from 17 minutes to 30 seconds, a 3300% improvement, with the additional benefit of eliminating the cost and management of Hadoop cluster. They were excited to apply this approach to additional pipelines that support their Amazon Redshift clusters. This was a nice win for them to be able to improve processing times and reduce cost and overhead with low engineering effort.
In addition to these improvements, they replaced some of their jobs that use Apache Hive to query tables with Amazon Redshift Spectrum.
In their initial testing, WeatherBug is seeing costs for Amazon Redshift Spectrum of $0.14 per day instead of $16.12 per day on Hive for a selected sample job, which is a 115 times reduction in cost.
When you have a data lake and cloud data warehouse built in Amazon S3 and Amazon Redshift, you may need frequent ETL jobs between the two systems for different use cases. Amazon Redshift Spectrum provides an easy-to-implement, cost-effective, and high-performance solution to interact between Amazon Redshift and Amazon S3 to query the Amazon S3 data from Amazon Redshift, join Amazon Redshift tables with S3 objects, and transform using simple SQL queries. Many Data Engineering use cases similar to the WeatherBug example discussed here can be optimized using Amazon Redshift Spectrum.
About the Authors
Anton Morozov is a WeatherBug data engineer working on analytics pipelines. He considers AWS Redshift an essential technology to understanding data for business intelligence needs.
James Phillips is a WeatherBug data engineer who uses many AWS technologies on a daily basis. Some of these include Amazon Redshift, Amazon EMR, and Amazon SageMaker.
Avijit Goswami is a Principal Solutions Architect at AWS, helping his digital native and startup customers become tomorrow’s enterprises using AWS services. He is part of the analytics specialists field community in AWS and works closely with the WeatherBug Data Engineering team.