High-volume data ingestion from SAP S/4HANA Public Cloud to the Azure cloud using Azure Data Factory (ADF)
Problem statement:
Public cloud applications (SaaS) offer limited data sharing options. At the same time, there is an increasing need to move data from the public cloud into a modern data platform such as Azure Synapse to address an organization's requirements for advanced, comprehensive analytics.
In this blog, we focus on extracting data from SAP S/4HANA Public Cloud to Azure Synapse using Azure Data Factory. SAP S/4HANA Public Cloud provides an OData interface to extract data through standard and custom CDS views. Unfortunately, as OData is based on the HTTP protocol, it is not well suited for large data extractions. This blog explores the nuances of dealing with high-volume data extraction from SAP S/4HANA Public Cloud.
Solution:
For one of our recent implementations with a customer interested in drawing insights from data residing in SAP S/4HANA Public Cloud, we started by evaluating the available data integration options.
The SAP CDC connector supports only standard, extraction-enabled CDS views; it does not support custom CDS views, so this option was ruled out for extraction from custom views. OData is the only option for data extraction from custom views, but dealing with medium and high data volumes over it is a challenge. To overcome this, we decided to investigate paginating the dataset, which divides the data into sizable chunks. That raised a new consideration: deciding on the optimal page size. There are two ways in which the data can be divided into chunks; Microsoft has published a series of blogs (referenced at the end of this post) that explain this solution in detail. The sketch below illustrates the basic client-side pagination pattern.
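To make the pattern concrete, here is a minimal Python sketch of client-side pagination over an OData service using the $skip and $top query options. The service URL, entity set, and credentials are hypothetical placeholders, and the response parsing assumes an OData V2 payload shape; in the actual solution, ADF's OData connector applies the same pattern declaratively.

```python
import requests

# Hypothetical service URL and credentials -- substitute your tenant's values.
BASE_URL = "https://my-s4hana-host/sap/opu/odata/sap/ZMY_CDS_SRV/ZMyCdsView"
AUTH = ("extraction_user", "secret")  # basic auth for brevity; OAuth is also common
PAGE_SIZE = 15_000                    # batch size; see the PoC results below

def fetch_page(skip: int, top: int) -> list:
    """Fetch one chunk of the CDS view using OData $skip/$top pagination."""
    params = {"$skip": skip, "$top": top, "$format": "json"}
    resp = requests.get(BASE_URL, params=params, auth=AUTH, timeout=300)
    resp.raise_for_status()
    return resp.json()["d"]["results"]  # OData V2 payload shape

def fetch_all() -> list:
    """Page through the view serially until a short page signals the end."""
    rows, skip = [], 0
    while True:
        page = fetch_page(skip, PAGE_SIZE)
        rows.extend(page)
        if len(page) < PAGE_SIZE:  # last (possibly partial) page reached
            break
        skip += PAGE_SIZE
    return rows
```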
Following the rule-based approach customized to our customer's requirements, we observed the following results.
POC results:
Inputs:
# of Records in custom CDS view: 100,086
# of columns in custom CDS view: 393 columns
Run Type: Serial as well as parallel.
Batch size: 5K to 25K
Technologies: Azure Data Factory (ADF), CDS views from SAP S/4HANA (Public Cloud), Azure Data Lake Storage (ADLS); data is written to ADLS in Parquet file format (see the sketch below).
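As an illustration of the landing step, here is a minimal sketch of writing one extracted chunk to ADLS Gen2 as Parquet, assuming the pandas, pyarrow, and adlfs packages; the storage account, container, and path are hypothetical placeholders (in the PoC itself, the ADF copy activity performed this write).

```python
import pandas as pd

# Hypothetical storage account and key -- substitute your own.
STORAGE_OPTIONS = {"account_name": "mydatalake", "account_key": "<access-key>"}

def write_chunk_to_adls(rows: list, chunk_id: int) -> None:
    """Persist one extracted chunk to ADLS Gen2 as a Parquet file."""
    df = pd.DataFrame(rows)
    # abfs:// paths need the adlfs package; Parquet writing needs pyarrow.
    df.to_parquet(
        f"abfs://raw/sap/zmycdsview/chunk_{chunk_id:04d}.parquet",
        storage_options=STORAGE_OPTIONS,
        index=False,
    )
```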
Results
Observations and Recommendations
Suggestions based on our executions:
Since the initial full load for any analytics system is a resource- and data-intensive activity, we have given some recommendations and alternative options to consider.
Suggestions for the initial full load:
Conclusion
Businesses are moving more and more to SaaS business applications. For public cloud solutions specifically, OData is the common REST API protocol available for data sharing. The OData API works well for small data volumes; for larger data sets, however, the extraction process may need further tuning using pagination and parallel processing. Using the OData protocol in Azure Data Factory allowed us to meet the business goals and ensure smooth data extraction even with large data volumes. Using the approach described in this blog, we avoided extraction errors without exhausting the resources of the source system.
PoC results summary
Is the OData protocol a viable option for extraction of high volume data?
Yes. Performance is acceptable and we were able to meet business goals using this approach.
Is it possible to improve the extraction performance through parallel processing of the same source data object?
During the PoC we successfully used the combination of paginating the data source and extracting chunks in parallel to optimize the performance.
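A minimal sketch of that combination, reusing the hypothetical fetch_page and write_chunk_to_adls helpers (and BASE_URL/AUTH) from the earlier sketches: the row count is read via the OData $count endpoint, and the resulting $skip offsets are fetched concurrently. In the PoC, the equivalent fan-out was achieved with parallel processing in ADF.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

PAGE_SIZE = 15_000  # batch size used for the parallel runs in the PoC

def extract_in_parallel(max_workers: int = 4) -> None:
    """Fetch all $skip offsets concurrently and land each chunk as Parquet."""
    # Total row count via the OData $count endpoint (returns a plain integer).
    total = int(requests.get(BASE_URL + "/$count", auth=AUTH, timeout=60).text)
    offsets = range(0, total, PAGE_SIZE)  # one chunk per $skip offset
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(lambda skip: fetch_page(skip, PAGE_SIZE), offsets)
        for chunk_id, rows in enumerate(pages):
            write_chunk_to_adls(rows, chunk_id)
```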
What is the recommendation on the optimum batch size for data load?
The batch size heavily depends on the source object and the number of included attributes. While we did not notice much difference in extraction performance when changing the batch size, pagination allowed us to minimize the chances of requesting too much data at once, which could result in an error while the response is generated. After testing, we decided to use a batch size of 15k records for parallel runs and 25k records for serial runs.
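As a worked example using the PoC's own numbers, those batch sizes translate into the following request counts against the 100,086-record view:

```python
import math

records = 100_086  # row count of the custom CDS view in the PoC
for label, batch in (("parallel", 15_000), ("serial", 25_000)):
    print(f"{label}: batch={batch:,} -> {math.ceil(records / batch)} OData requests")
# parallel: batch=15,000 -> 7 OData requests
# serial: batch=25,000 -> 5 OData requests
```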
The solution should also leverage the parallel processing features and robust data loading capacity of Azure Data Factory. ADF is built for large data volumes, but it also needs performant and predictable data reads from the source. With the pagination feature of OData, we can build an ADF solution that ingests medium to large volumes of data into the Azure cloud. Once the data is ingested into the cloud, it can be used for any further analytics requirements.
Reference terms
SAP CDS Views - Core Data Services (CDS) views are virtual data models of SAP HANA that allow direct access to the underlying tables of the HANA database.
SAP S/4HANA - is the successor to SAP R/3 and SAP ERP and is optimized for SAP's in-memory database SAP HANA.
Azure Data Factory (ADF) - is a powerful cloud-based data integration service from Microsoft Azure that allows organizations to create, schedule, and orchestrate data pipelines.
OData (Open Data Protocol) - is an ISO/IEC approved, OASIS standard that defines a set of best practices for building and consuming RESTful APIs.
Reference Microsoft blogs
• Extracting SAP data using OData - Part 4 - Handling large volumes of data - Microsoft Tech Community
Thanks to the Microsoft team - Bartosz Jarkowski, Tushar Gupta, Sarah Ebert, Ulrich Christ - for working closely with us.