High volume data ingestion from SAP S/4HANA Public Cloud to the Azure cloud using Azure Data Factory (ADF)

Problem statement:

Public cloud applications (SaaS) offer limited data sharing options. At the same time, there is an increasing need to move data from the public cloud into a modern data platform such as Azure Synapse to address an organization's requirements around advanced and comprehensive analytics.

In this blog, we focus on extracting data from SAP S/4HANA Public Cloud to Azure Synapse using Azure Data Factory. SAP S/4HANA Public Cloud provides an OData interface to extract data through standard and custom CDS views. Unfortunately, as OData is based on the HTTP protocol, it is not well suited for large data extractions. This blog explores the nuances of dealing with high volume data extraction from SAP S/4HANA Public Cloud.

Solution:

For one of our recent implementations with a customer interested in drawing insights from data residing in SAP S/4 HANA public cloud, we started off by evaluating the data integration options mentioned below:

[Figure: data integration options evaluated]

The SAP CDC option supports only standard, extraction-enabled CDS views; it does not support custom CDS views, so it was ruled out for extraction from custom views. OData is the only option for data extraction from custom views, but handling medium and high data volumes over OData is a challenge. To overcome it, we decided to investigate paginating the dataset, which divides the data into sizable chunks. This brought us to a new consideration: deciding on the optimal page size. There are two ways in which data can be divided into chunks:

  1. Business-logic based: use business rules for pagination. Taking Sales Orders as an example, we could chunk the data into smaller pieces using business fields such as SalesOrganization or SalesOrderType.
  2. Rule based: this approach requires three parameters (illustrated in the sketch after this list):

  • The total count of records in the OData source
  • Number of records to fetch in a single request (batch size)
  • Number of requests (the total count divided by the batch size, rounded up)
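
To make the rule-based approach concrete, here is a minimal Python sketch of the pagination loop. The service URL, entity set name, and credentials are hypothetical placeholders, not the actual customer endpoint; in ADF the same loop can be expressed with a Lookup activity for the record count and a ForEach activity over the page indexes.

    import math
    import requests

    # Hypothetical OData endpoint for a custom CDS view; replace with the
    # actual S/4HANA Public Cloud service URL and credentials.
    BASE_URL = "https://my-s4hana.example.com/sap/opu/odata/sap/ZSALESORDER_CDS/ZSalesOrder"
    AUTH = ("communication_user", "password")  # placeholder credentials

    BATCH_SIZE = 15_000  # records per request (parameter 2)

    # Parameter 1: total count of records in the OData source.
    total = int(requests.get(f"{BASE_URL}/$count", auth=AUTH).text)

    # Parameter 3: number of requests = ceil(total / batch size).
    num_requests = math.ceil(total / BATCH_SIZE)

    # Fetch each chunk with $top/$skip.
    for page in range(num_requests):
        params = {
            "$top": BATCH_SIZE,
            "$skip": page * BATCH_SIZE,
            "$format": "json",
        }
        resp = requests.get(BASE_URL, params=params, auth=AUTH)
        resp.raise_for_status()
        records = resp.json()["d"]["results"]  # OData V2 payload shape
        # ... write the chunk to the target (e.g. ADLS as Parquet) ...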

Microsoft has published a series of blogs that explain this solution in detail (listed in the references below).

Following the rule-based approach, customized to our customer's requirements, we observed the following results.

PoC results:

Inputs:

# of Records in custom CDS view: 100,086

# of columns in custom CDS view: 393 columns

Run Type: Serial as well as parallel.

Batch size: 5K to 25K

Technologies: Azure Data Factory (ADF), CDS views from SAP S/4HANA (public cloud), Azure Data Lake Storage (ADLS). Data is written to ADLS in Parquet file format.

Results

[Figures: PoC execution results for serial and parallel runs]

Observations and Recommendations

Suggestions based on executions:

  1. Performance improvement opportunities lie mostly on the OData source side, in the data read: about 95% of job execution time is spent waiting for the first byte of data to reach ADF, and the rest is consumed by writing to the target.
  2. Batch size can be raised to gain some performance, but it should stay within the resource capacity of the source server.

Since the initial full load for any analytics system is a resource- and data-intensive activity, we also offer some recommendations on alternative options to consider.

Suggestions for Initial Full load:

  1. Since both source and target are cloud technologies, resources can be scaled up as required for resource-intensive workloads. In this case, we may need to scale up resources on the source/CDS view side.
  2. Full load is a one-time activity, and it generally takes significant time for any data warehousing system. Stakeholder expectations should be set accordingly.
  3. Ingestion of large data sets can be done in batches (instead of all data at once) based on a partitioning scheme, e.g. by time period, so that data is loaded for one year or one fiscal period at a time (see the sketch below).
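
As an illustration of the time-period partitioning idea, here is a minimal Python sketch that slices the extraction by year using an OData $filter. The endpoint and credentials are the same hypothetical placeholders as in the earlier sketch, and CreationDate is a hypothetical field name; substitute the date field actually exposed by your CDS view.

    import requests

    # Same hypothetical endpoint as in the pagination sketch above.
    BASE_URL = "https://my-s4hana.example.com/sap/opu/odata/sap/ZSALESORDER_CDS/ZSalesOrder"
    AUTH = ("communication_user", "password")  # placeholder credentials

    # Load one year at a time; each yearly slice can then be paginated
    # with $top/$skip as shown earlier.
    for year in range(2018, 2024):
        params = {
            "$filter": (
                f"CreationDate ge datetime'{year}-01-01T00:00:00' and "
                f"CreationDate lt datetime'{year + 1}-01-01T00:00:00'"
            ),
            "$format": "json",
        }
        resp = requests.get(BASE_URL, params=params, auth=AUTH)
        resp.raise_for_status()
        records = resp.json()["d"]["results"]
        # ... hand the yearly slice to the paginated extraction pipeline ...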

Conclusion

Businesses are increasingly moving to SaaS business applications. For public cloud solutions in particular, OData is the common REST API protocol available for data sharing. The OData API works well for small data volumes; for larger data sets, however, the extraction process may need further tuning through pagination and parallel processing. Using the OData protocol in Azure Data Factory allowed us to meet our business goals and ensure smooth data extraction even with large data volumes. Using the approach described in this blog, we avoided extraction errors and avoided exhausting the resources of the source system.

PoC results summary

Is the OData protocol a viable option for extraction of high volume data?

Yes. Performance is acceptable, and we were able to meet our business goals using this approach.

Is it possible to improve the extraction performance through parallel processing of the same source data object?

During the PoC we successfully combined paginating the data source with extracting the chunks in parallel to optimize performance (see the sketch below).
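
For illustration, here is a minimal Python sketch of that combination, reusing the hypothetical endpoint from the earlier sketches: a thread pool fetches the $top/$skip pages concurrently. In ADF itself, the equivalent knob is the ForEach activity's batch count setting.

    import math
    import requests
    from concurrent.futures import ThreadPoolExecutor

    # Same hypothetical endpoint and credentials as in the earlier sketches.
    BASE_URL = "https://my-s4hana.example.com/sap/opu/odata/sap/ZSALESORDER_CDS/ZSalesOrder"
    AUTH = ("communication_user", "password")  # placeholder credentials
    BATCH_SIZE = 15_000  # the batch size we settled on for parallel runs

    def fetch_page(page: int) -> list:
        """Fetch one $top/$skip page of the paginated data set."""
        params = {"$top": BATCH_SIZE, "$skip": page * BATCH_SIZE, "$format": "json"}
        resp = requests.get(BASE_URL, params=params, auth=AUTH)
        resp.raise_for_status()
        return resp.json()["d"]["results"]

    total = int(requests.get(f"{BASE_URL}/$count", auth=AUTH).text)
    pages = range(math.ceil(total / BATCH_SIZE))

    # Cap the concurrency so the source system is not overloaded; in ADF
    # the equivalent setting is the ForEach activity's batch count.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for records in pool.map(fetch_page, pages):
            pass  # write each chunk to the target here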

What is the recommendation on the optimum batch size for data load?

The batch size depends heavily on the source object and the number of included attributes. While we did not notice much difference in extraction performance when changing the batch size, pagination allowed us to minimize the chance of requesting too much data at once, which could cause an error while the response is generated. After testing, we settled on a batch size of 15K records for parallel runs and 25K records for serial runs.

The solution should also leverage the parallel processing features and robust data loading capacity of Azure Data Factory. ADF is built for large data volumes, but it also needs performant and predictable data reads from the source. With the pagination capability of OData, we can build an ADF solution that ingests medium to large volumes of data into the Azure cloud. Once the data is ingested, it can be used for any further analytics requirements.

Reference terms

SAP CDS Views - Core Data Services (CDS) views are virtual data models of SAP HANA that allow direct access to the underlying tables of the HANA database.

SAP S/4HANA - is the successor to SAP R/3 and SAP ERP and is optimized for SAP's in-memory database SAP HANA.

Azure Data Factory (ADF) - is a powerful cloud-based data integration service from Microsoft Azure that allows organizations to create, schedule, and orchestrate data pipelines.

OData (Open Data Protocol) - is an ISO/IEC approved, OASIS standard that defines a set of best practices for building and consuming RESTful APIs.

Reference Microsoft blogs

•      Extracting SAP data using OData - Part 4 - Handling large volumes of data - Microsoft Tech Community

•      Extracting SAP data using OData - Part 5 - Filter and Select - Microsoft Tech Community

•      Extracting SAP data using OData - Part 7 - Delta extraction using SAP Extractors and CDS Views - Microsoft Tech Community

#saponazure, #azuredatafactory, #wipro

Thanks to the Microsoft team - Bartosz Jarkowski, Tushar Gupta, Sarah Ebert, Ulrich Christ - for working closely with us.

Kunal Turakhia , Balasubramani Harikrishnan , Sujal Parulekar , Somnath Kumar , Chris Dearing
