BIG DATA TESTING

Big Data testing is the testing of applications that handle Big Data: collections of large datasets that cannot be processed using traditional computing techniques. Testing these datasets requires specialized tools, techniques, and frameworks. Big Data relates to data creation, storage, retrieval, and analysis that is remarkable in terms of volume, variety, and velocity.

Big Data Testing Strategy

Testing a Big Data application is more about verifying its data processing than about testing the individual features of the software product. In Big Data testing, performance and functional testing are key.

In Big Data testing, QA engineers verify the successful processing of terabytes of data using a commodity cluster and other supportive components. It demands a high level of testing skill because the processing is very fast. Processing may be of three types: batch, real-time, and interactive.

Along with this, data quality is also an important factor in Hadoop testing. Before testing the application, it is necessary to check the quality of the data, and this check should be considered part of database testing. It involves examining characteristics such as conformity, accuracy, duplication, consistency, validity, and data completeness.
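
As an illustration, a minimal set of such checks could be scripted in Python with pandas; the file name and column names below are hypothetical, not part of the original article:

```python
import pandas as pd

# Hypothetical sample of the incoming data; column names are assumptions
df = pd.read_csv("customer_feed_sample.csv")

# Completeness: mandatory fields must not contain nulls
assert df["customer_id"].notna().all(), "customer_id contains nulls"

# Duplication: the primary key must be unique
assert not df["customer_id"].duplicated().any(), "duplicate customer_id values"

# Conformity / validity: dates must parse and fall into an expected range
signup = pd.to_datetime(df["signup_date"], errors="coerce")
assert signup.notna().all(), "unparseable signup_date values"
assert (signup >= "2000-01-01").all(), "signup_date out of expected range"
```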

Big Data Testing can be broadly divided into three steps

Step 1: Data Staging Validation

The first step of Big Data testing, also referred to as the pre-Hadoop stage, involves process validation.

  • Data from various sources like RDBMS, weblogs, social media, etc. should be validated to make sure that correct data is pulled into the system
  • Comparing source data with the data pushed into the Hadoop system to make sure they match
  • Verifying that the right data is extracted and loaded into the correct HDFS location

Tools like Talend and Datameer can be used for data staging validation.
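
A minimal count-and-key reconciliation for this step, sketched in PySpark (the JDBC connection details, table names, and HDFS path are hypothetical assumptions, not part of the original article):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staging-validation").getOrCreate()

# Source table in the RDBMS (hypothetical JDBC URL, table, and credentials)
source_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/sales")
             .option("dbtable", "orders")
             .option("user", "qa")
             .option("password", "secret")
             .load())

# The same data after it was pushed into HDFS (hypothetical location)
hdfs_df = spark.read.parquet("hdfs:///staging/orders/")

# No rows should be dropped or duplicated on the way in
assert source_df.count() == hdfs_df.count(), "row count mismatch between source and HDFS"

# No source keys should be missing from HDFS
missing = source_df.select("order_id").subtract(hdfs_df.select("order_id"))
assert missing.count() == 0, "order_id values present in source but missing in HDFS"
```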

Step 2: "MapReduce" Validation

The second step is validation of the "MapReduce" logic. In this stage, the tester verifies the business logic on every node and then validates it after running against multiple nodes, ensuring that:

  • The MapReduce process works correctly
  • Data aggregation or segregation rules are implemented on the data
  • Key-value pairs are generated
  • The data is valid after the MapReduce process (a unit-level sketch of such a check follows this list)
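
Before a job is run on the cluster, its map and reduce logic can often be unit-tested locally. The sketch below uses a simple word-count example in Python as an illustrative stand-in for the real business logic, which will differ per application:

```python
from collections import defaultdict

def mapper(line):
    # Emit one (key, value) pair per word; stands in for the real business logic
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    # Aggregate values per key, as the reduce phase would
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Known input -> expected key/value output
lines = ["big data testing", "big data"]
pairs = [pair for line in lines for pair in mapper(line)]
assert reducer(pairs) == {"big": 2, "data": 2, "testing": 1}
```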

Step 3: Output Validation Phase

The third and final stage of Big Data testing is the output validation process. The output data files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the requirements.

Activities in the third stage include

  • To check that the transformation rules are correctly applied
  • To check the data integrity and successful data load into the target system
  • To check that there is no data corruption by comparing the target data with the HDFS file system data (a reconciliation sketch follows this list)
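
A small reconciliation sketch, again in PySpark, comparing the HDFS output with the loaded target table; the EDW connection details, table name, and path are hypothetical, and both sides are assumed to share the same schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-validation").getOrCreate()

# Processed output sitting in HDFS (hypothetical path)
hdfs_out = spark.read.parquet("hdfs:///output/orders_aggregated/")

# The same data after the load into the EDW (hypothetical JDBC target)
edw = (spark.read.format("jdbc")
       .option("url", "jdbc:postgresql://edwhost:5432/warehouse")
       .option("dbtable", "orders_aggregated")
       .option("user", "qa")
       .option("password", "secret")
       .load())

# No data loss or duplication during the load
assert hdfs_out.count() == edw.count(), "row count mismatch between HDFS output and EDW"

# No corruption: row-level comparison, assuming matching schemas on both sides
assert hdfs_out.exceptAll(edw).count() == 0, "rows present in HDFS output but not in the EDW"
```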

Architecture Testing

Hadoop processes very large volumes of data and is highly resource intensive, so architecture testing is crucial to the success of a Big Data project. A poorly or improperly designed system may lead to performance degradation, and the system could fail to meet the requirements. At a minimum, performance and failover testing should be done in a Hadoop environment.

Performance testing covers job completion time, memory utilization, data throughput, and similar system metrics, while failover testing verifies that data processing continues seamlessly when data nodes fail.

Performance Testing

Performance testing for Big Data includes three main actions

  • Data ingestion and throughput: In this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying the number of messages that the queue can process in a given time frame. It also includes how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database (a rough measurement sketch follows this list).
  • Data processing: This involves verifying the speed with which queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example by running MapReduce jobs on the underlying HDFS.
  • Sub-component performance: These systems are made up of multiple components, and it is essential to test each of these components in isolation, for example how quickly messages are indexed and consumed, MapReduce jobs, query performance, search, etc.
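
As a rough illustration of measuring insertion rate, the sketch below uses Python with pymongo against a hypothetical local MongoDB instance; the connection string, database, collection, and document shape are all assumptions:

```python
import time
from pymongo import MongoClient

# Hypothetical MongoDB connection, database, and collection
client = MongoClient("mongodb://localhost:27017")
events = client["perf_test"]["events"]

# A fixed batch of synthetic documents to load
batch = [{"event_id": i, "payload": "x" * 256} for i in range(100_000)]

start = time.perf_counter()
events.insert_many(batch, ordered=False)  # unordered bulk insert for speed
elapsed = time.perf_counter() - start

print(f"Inserted {len(batch)} documents in {elapsed:.2f}s "
      f"({len(batch) / elapsed:,.0f} docs/sec)")
```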

