Enterprise Data Model in Redshift: Our Technology Story

This article is one in a series of three about implementing an enterprise data model (EDM) in AWS Redshift serving data science and data visualization. This EDM, a set of data marts, is the go-to point for integrated data sourced from multiple SAP systems, from various proprietary R&D systems, from salesforce.com and from other internal and external sources. All those data sources are combined and integrated into a coherent modelling framework, accelerating data provisioning for reporting, analytical and other digital use cases, including advanced analytics and data science.

This article focuses on the ETL technology and the journey of getting there; the other articles address the design and the transition.

Technically, the ETL process builds on S3 objects, maps them via the Glue Data Catalog and Redshift Spectrum to external "SQL tables" in Redshift, and then uses Glue jobs with embedded SQL to transform the data into conformed dimension tables in Redshift.
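To make that mapping concrete, here is a minimal sketch of how a Glue catalog database built over the landed S3 objects can be exposed to Redshift as an external Spectrum schema. This is illustrative only; the schema, database, cluster, user and role names are assumptions, not our actual setup.

```python
import boto3

# Minimal sketch: expose the Glue catalog database over the landed S3 objects
# as an external (Spectrum) schema in Redshift.
# All names below (cluster, database, user, role, schema) are hypothetical.
redshift_data = boto3.client("redshift-data")

create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS landing
FROM DATA CATALOG
DATABASE 'edm_landing'
IAM_ROLE 'arn:aws:iam::123456789012:role/edm-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

redshift_data.execute_statement(
    ClusterIdentifier="edm-cluster",
    Database="edm",
    DbUser="etl_user",
    Sql=create_external_schema,
)
```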

Getting the data into S3 was relatively easy

Getting the data into S3 was relatively easy: either S3 became one more subscriber to our standardized messages on the integration middleware, or we used means within the SAP environment, or API calls delivering JSON files. The most challenging aspect was probably getting salesforce.com data at scale and reliably, and actually getting at the data's true source, straight from the horse's mouth. Any step in between is a potential source of error, and trust me, the user of the data will find every single missing entry or attribute.
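As a simplified illustration of the API-call / JSON route (not our production code; the endpoint, query, bucket and key naming are made up), landing a raw payload in S3 can be as simple as:

```python
import json

import boto3
import requests  # assumed HTTP client for the REST source, e.g. salesforce.com

s3 = boto3.client("s3")

# Pull a page of records straight from the source system's API
# (endpoint, query and token are placeholders).
response = requests.get(
    "https://example.my.salesforce.com/services/data/v57.0/query",
    headers={"Authorization": "Bearer <access-token>"},
    params={"q": "SELECT Id, Name, BillingCountry FROM Account"},
)
response.raise_for_status()

# Land the raw JSON unchanged in S3, partitioned by source and load date,
# so the true source stays traceable downstream.
s3.put_object(
    Bucket="edm-landing",
    Key="salesforce/account/load_date=2020-06-01/accounts_0001.json",
    Body=json.dumps(response.json()).encode("utf-8"),
)
```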

The data from those sources are largely incompatible, or at least need specialist knowledge to combine. This is not so much for technology reasons as for their content: all these sources have their own way of looking at and coding data, and each system speaks its own language. We had to specifically convert, format and adapt the data to achieve an integrated enterprise data model.
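A deliberately simple, made-up example of what that content-level translation looks like (the real mapping rules are far richer and business-specific): two systems encode the same attribute differently, and the EDM conforms them to one code.

```python
# Illustrative only: hypothetical mappings from source-specific country values
# to one conformed ISO code used across the enterprise data model.
SAP_TO_ISO = {"DE": "DEU", "US": "USA", "BR": "BRA"}
SFDC_TO_ISO = {"Germany": "DEU", "United States": "USA", "Brazil": "BRA"}


def conform_country(source_system: str, raw_value: str) -> str:
    """Translate a source-specific country value into the conformed code."""
    mapping = SAP_TO_ISO if source_system == "SAP" else SFDC_TO_ISO
    return mapping.get(raw_value, "UNKNOWN")


print(conform_country("SAP", "DE"))        # -> DEU
print(conform_country("SFDC", "Germany"))  # -> DEU
```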

We experimented with several technologies

We experimented with several technologies for performing those transformations. PDI (Pentaho Data Integration) was abandoned for reasons of available knowledge and the complexity of operating it. Thereafter we tried native AWS Glue and its built-in transformations, yet we were largely unsuccessful: the functionality is too limited and not robust on real-life data sets. The processes were slow, knowledge was scarce and the development process was cumbersome. The resulting solutions were unstable and difficult to maintain. In particular, rationalizing complex JSONs with Glue's built-in transforms was not stable enough to cope with real-life data quality.

Spark, catalog and plain SQL

Finally, we settled on Spark for rationalizing the JSONs, on the Glue catalog and Spectrum for the S3 mapping, and on plain old-fashioned SQL embedded in the Glue framework for the transformations and updates. The resulting scripts are automated using Lambda and the basics of the Glue framework. That deliberately simple approach has the advantages of readily available knowledge, fast and agile development cycles, good collaboration, and superior performance and transparency. Ultimately, we templated the approach, resulting in low cost and fast time to benefits.
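In outline, a job built on this pattern looks something like the sketch below: Spark flattens the nested JSON, the result is written back to S3 where the Glue catalog and Spectrum expose it to Redshift, and embedded SQL loads the conformed dimension. This is a minimal, hypothetical skeleton; the field names, paths, cluster and table names are assumptions, and our templated jobs carry considerably more error handling.

```python
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Hypothetical job parameters; real jobs receive their paths from the template.
args = getResolvedOptions(sys.argv, ["LANDING_PATH", "STAGING_PATH"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# 1) Rationalize the nested JSON with Spark itself (not Glue's built-in
#    transforms): read, pick the fields needed, flatten the structure.
raw = spark.read.json(args["LANDING_PATH"])
flat = raw.selectExpr(
    "payload.customer.id      AS customer_id",   # assumed source structure
    "payload.customer.name    AS customer_name",
    "payload.customer.country AS country_code",
    "metadata.load_ts         AS load_ts",
)

# 2) Write the flattened records back to S3 in a columnar format; the Glue
#    catalog and Spectrum expose this location in Redshift as landing.stg_customer.
flat.write.mode("overwrite").parquet(args["STAGING_PATH"])

# 3) Plain SQL, embedded in the job, merges the staged rows into the
#    conformed dimension table inside Redshift.
merge_statements = [
    """DELETE FROM edm.dim_customer
       USING landing.stg_customer s
       WHERE dim_customer.customer_id = s.customer_id""",
    """INSERT INTO edm.dim_customer (customer_id, customer_name, country_code, load_ts)
       SELECT customer_id, customer_name, country_code, load_ts
       FROM landing.stg_customer""",
]

boto3.client("redshift-data").batch_execute_statement(
    ClusterIdentifier="edm-cluster",  # assumed cluster and credentials
    Database="edm",
    DbUser="etl_user",
    Sqls=merge_statements,
)
```

A Lambda function, triggered by new S3 objects or on a schedule, simply starts such a job via the Glue API; that is where the automation with Lambda comes in.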


We can't beat physics or re-invent big data. Yet this approach makes integrated data available very early in the process. That enables close interaction with the business users and early validation of the business case on integrated data, thus focusing the development effort and accelerating time to value.

