Enterprise Data Model in Redshift: Our Technology Story

This article is one in a series of three about implementing an enterprise data model (EDM) in AWS Redshift serving data science and data visualization. This EDM, a set of data marts, is the go-to point for integrated data sourced from multiple SAP systems, from various proprietary R&D systems, from salesforce.com and from other internal and external sources. All those data sources are combined and integrated into a coherent modelling framework, accelerating data provisioning for reporting, analytical and other digital use cases, including advanced analytics and data science.

This article focuses on the ETL technology and the journey of getting there; the other articles address the design and the transition.

Technically, the ETL process builds on S3 objects, maps them via the Glue Data Catalog and Redshift Spectrum to external "SQL tables" in Redshift, and then uses Glue jobs with embedded SQL to transform the data into conformed dimension tables in Redshift.
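To make that mapping concrete, here is a minimal sketch of how a Glue catalog database built over the landed S3 objects can be exposed to Redshift as an external Spectrum schema. This is illustrative only; the schema, database, cluster, user and role names are assumptions, not our actual setup.

```python
import boto3

# Minimal sketch: expose the Glue catalog database over the landed S3 objects
# as an external (Spectrum) schema in Redshift.
# All names below (cluster, database, user, role, schema) are hypothetical.
redshift_data = boto3.client("redshift-data")

create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS landing
FROM DATA CATALOG
DATABASE 'edm_landing'
IAM_ROLE 'arn:aws:iam::123456789012:role/edm-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

redshift_data.execute_statement(
    ClusterIdentifier="edm-cluster",
    Database="edm",
    DbUser="etl_user",
    Sql=create_external_schema,
)
```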

Getting the data into S3 was relatively easy

Getting the data into S3 was relatively easy: either S3 became one more subscriber to our standardized messages on the integration middleware, or we used means within the SAP environment, or API calls delivering JSON files. The most challenging aspect was probably getting salesforce.com data at scale and reliably, and actually getting at the data's true source, straight from the horse's mouth. Any step in between is a potential source of error, and trust me, the user of the data will find every single missing entry or attribute.
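As a simplified illustration of the API-call / JSON route (not our production code; the endpoint, query, bucket and key naming are made up), landing a raw payload in S3 can be as simple as:

```python
import json

import boto3
import requests  # assumed HTTP client for the REST source, e.g. salesforce.com

s3 = boto3.client("s3")

# Pull a page of records straight from the source system's API
# (endpoint, query and token are placeholders).
response = requests.get(
    "https://example.my.salesforce.com/services/data/v57.0/query",
    headers={"Authorization": "Bearer <access-token>"},
    params={"q": "SELECT Id, Name, BillingCountry FROM Account"},
)
response.raise_for_status()

# Land the raw JSON unchanged in S3, partitioned by source and load date,
# so the true source stays traceable downstream.
s3.put_object(
    Bucket="edm-landing",
    Key="salesforce/account/load_date=2020-06-01/accounts_0001.json",
    Body=json.dumps(response.json()).encode("utf-8"),
)
```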

The data from those sources are largely incompatible, or at least need specialist knowledge to combine. This is not so much for technology reasons as for their content: all these sources have their own way of looking at and coding data, and each system speaks its own language. We had to specifically convert, format and adapt the data to achieve an integrated enterprise data model.
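A deliberately simple, made-up example of what that content-level translation looks like (the real mapping rules are far richer and business-specific): two systems encode the same attribute differently, and the EDM conforms them to one code.

```python
# Illustrative only: hypothetical mappings from source-specific country values
# to one conformed ISO code used across the enterprise data model.
SAP_TO_ISO = {"DE": "DEU", "US": "USA", "BR": "BRA"}
SFDC_TO_ISO = {"Germany": "DEU", "United States": "USA", "Brazil": "BRA"}


def conform_country(source_system: str, raw_value: str) -> str:
    """Translate a source-specific country value into the conformed code."""
    mapping = SAP_TO_ISO if source_system == "SAP" else SFDC_TO_ISO
    return mapping.get(raw_value, "UNKNOWN")


print(conform_country("SAP", "DE"))        # -> DEU
print(conform_country("SFDC", "Germany"))  # -> DEU
```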

We experimented with several technologies

We experimented with several technologies for performing those transformations. PDI (Pentaho Data Integration) was abandoned for reasons of available knowledge and the complexity of operating it. Thereafter we tried native AWS Glue and its built-in transformations, yet we were largely unsuccessful: the functionality is too limited and not robust on real-life data sets. The processes were slow, knowledge was scarce and the development process was cumbersome. The resulting solutions were unstable and difficult to maintain. In particular, rationalizing complex JSONs with Glue's built-in transforms was not stable enough to cope with real-life data quality.

Spark, catalog and plain SQL

Finally, we settled on Spark for rationalizing the JSONs, on the Glue catalog and Spectrum for the S3 mapping, and on plain old-fashioned SQL embedded in the Glue framework for the transformations and updates. The resulting scripts are automated using Lambda and the basics of the Glue framework. That deliberately simple approach has the advantages of readily available knowledge, fast and agile development cycles, good collaboration, and superior performance and transparency. Ultimately, we templated the approach, resulting in low cost and fast time to benefits.
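In outline, a job built on this pattern looks something like the sketch below: Spark flattens the nested JSON, the result is written back to S3 where the Glue catalog and Spectrum expose it to Redshift, and embedded SQL loads the conformed dimension. This is a minimal, hypothetical skeleton; the field names, paths, cluster and table names are assumptions, and our templated jobs carry considerably more error handling.

```python
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Hypothetical job parameters; real jobs receive their paths from the template.
args = getResolvedOptions(sys.argv, ["LANDING_PATH", "STAGING_PATH"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# 1) Rationalize the nested JSON with Spark itself (not Glue's built-in
#    transforms): read, pick the fields needed, flatten the structure.
raw = spark.read.json(args["LANDING_PATH"])
flat = raw.selectExpr(
    "payload.customer.id      AS customer_id",   # assumed source structure
    "payload.customer.name    AS customer_name",
    "payload.customer.country AS country_code",
    "metadata.load_ts         AS load_ts",
)

# 2) Write the flattened records back to S3 in a columnar format; the Glue
#    catalog and Spectrum expose this location in Redshift as landing.stg_customer.
flat.write.mode("overwrite").parquet(args["STAGING_PATH"])

# 3) Plain SQL, embedded in the job, merges the staged rows into the
#    conformed dimension table inside Redshift.
merge_statements = [
    """DELETE FROM edm.dim_customer
       USING landing.stg_customer s
       WHERE dim_customer.customer_id = s.customer_id""",
    """INSERT INTO edm.dim_customer (customer_id, customer_name, country_code, load_ts)
       SELECT customer_id, customer_name, country_code, load_ts
       FROM landing.stg_customer""",
]

boto3.client("redshift-data").batch_execute_statement(
    ClusterIdentifier="edm-cluster",  # assumed cluster and credentials
    Database="edm",
    DbUser="etl_user",
    Sqls=merge_statements,
)
```

A Lambda function, triggered by new S3 objects or on a schedule, simply starts such a job via the Glue API; that is where the automation with Lambda comes in.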


We can't beat physics or re-invent big data. Yet this approach makes integrated data available very early in the process. That enables close interaction with the business users and early validation of the business case on integrated data, thus focusing the development effort and accelerating time to value.

