How Flux manages complex ETL scripting with its file orchestrator
Flux recently helped a large media and entertainment company that was looking for assistance orchestrating the execution of its suite of ETL scripts. The company uses Python scripts to run its ETL processes.
This M&E company needs to run multiple queries against its backend database. The team wants control over when those queries run, and if there is an issue with a certain data set, they want to be able to adjust how many queries run against that data set. They had been handling error resolution manually, which is time-consuming.
The original setup involved a series of scripts that were each responsible for a piece of the ETL process:
- Script #1 accepts an XML file as a parameter, containing info about their database schema, a table, and a query. The script then runs the query against SQL Server and dumps the result to a local machine.
- Script #2 picks up the data, breaks it up into chunks, applies the correct field, map, and row delimiters, and then dumps the data in a relevant Amazon S3 location.
- Script #3 is on an Amazon machine and it reads the data via a Hive table, transforms it via an SQL query, and inserts it into another Hive table.
- Administrative scripts then refresh partitions, point tables to the latest data, etc.
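The chunking and delimiting stage (Script #2) can be sketched in Python. This is an illustrative sketch, not the company's actual script: the chunk size and the delimiter characters are assumptions (the `\x01` field delimiter shown here is simply Hive's default).

```python
# Sketch of the Script #2 stage: split exported rows into fixed-size
# chunks and apply field and row delimiters before the S3 upload.
# chunk_size and the delimiter defaults are illustrative assumptions.

def chunk_records(records, chunk_size=1000):
    """Yield successive lists of at most chunk_size records."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

def apply_delimiters(chunk, field_delim="\x01", row_delim="\n"):
    """Join each record's fields, then join records into one text blob."""
    return row_delim.join(field_delim.join(fields) for fields in chunk)
```

Each delimited blob would then be written to its own file and handed to the upload step, so that downstream Hive tables can read the data with matching `ROW FORMAT DELIMITED` settings.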
The biggest question the company had was: How could Flux read from SQL Server and then write to S3?
First, a Flux workflow ran an SQL script using the SQL Server bulk copy utility, BCP, in a process action to copy the selected rows to a CSV file.
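A process action of this kind boils down to invoking BCP as an external command. The sketch below shows what that invocation might look like from Python; the query, output path, and server name are placeholders, and the flags are BCP's standard ones (`queryout` to export a query, `-S` for the server, `-T` for a trusted connection, `-c` for character mode, `-t,` for a comma field terminator).

```python
import subprocess

def build_bcp_command(query, out_path, server):
    """Build a SQL Server bcp invocation that exports query results to a CSV.

    queryout exports the result of a query; -S names the server; -T uses a
    trusted (Windows) connection; -c selects character mode; -t, makes the
    field terminator a comma.
    """
    return ["bcp", query, "queryout", out_path, "-S", server, "-T", "-c", "-t,"]

def export_to_csv(query, out_path, server):
    """Run bcp as a subprocess, much as a process action would."""
    subprocess.run(build_bcp_command(query, out_path, server), check=True)
```

The workflow only needs the command to exit successfully and leave the CSV at a known path for the next step to pick up.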
Then, with a small tweak to some S3 Java client code to make it a Flux listener, the workflow was able to upload the CSV file into an S3 bucket.
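The upload step itself is straightforward. The sketch below uses Python's boto3 library rather than the Java S3 client the workflow actually wrapped, and the bucket name and key layout are placeholders, not the company's real ones.

```python
def s3_key_for(table, date_str):
    """Derive a partition-style S3 key for an export (layout is assumed)."""
    return f"etl/{table}/dt={date_str}/export.csv"

def upload_csv_to_s3(local_path, bucket, key):
    """Upload the exported CSV to S3 (boto3 stands in for the Java client)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
```

A date-partitioned key layout like this lets the downstream administrative scripts refresh Hive partitions and point tables at the latest data by path alone.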
Flux’s orchestration of ETL processes with scheduling, dependency, and error handling requirements gave the company the scalable, automated solution it needed. With workflows built around sound business logic, Flux was able to manage the execution of their ETL tools from start to finish.
If there is an issue with querying a data set, Flux responds to the dynamic load and can throttle the number of queries on that data set. Flux performs automatic error recovery based on business rules and can be monitored through a centralized web dashboard.
This is a relatively standard use case in which an IT team wants to focus on their core competencies and leave the complexities such as dependency management, error resolution, and resource optimization to Flux.