Big Data Testing - Where to Start

With the explosion in big data adoption and customers moving to implement it, testing has become critical. This article walks through the test scenarios commonly adopted for big data projects.

Problem statement: Developers are usually given proper training in the big data ecosystem, but testers are often asked to test a big data implementation without any such training. This leaves testers asking, "What should we test?" This article lists scenarios that can serve as a starting point for what to verify in the big data ecosystem.

What is Big Data (a simple use case)

Assuming you are already familiar with big data concepts, here is a short introduction:

  1. Organizations run various databases for different purposes, e.g. one team uses Oracle to store customer information while another uses SAP to store transactions. These are OLTP databases, i.e. they handle transaction processing.
  2. With advances in technology and the growing availability of user data online, we can now use this data for prediction, e.g. we can combine the databases from the step above to predict customer buying behavior. To combine that data and perform analysis, we move it into an OLAP system, i.e. a data warehouse.
  3. Because of unstructured and semi-structured data, ever-growing data volumes and other reasons, a traditional warehouse cannot handle this amount of data, so the concept of big data emerged; in simpler terms, a smarter warehouse.

Moving ahead requires a basic understanding of big data, ETL and HDFS.

Test Scenarios

When you are asked to test a big data implementation, many scenarios will be specific to the project, but the scenarios below are common to almost every big data testing effort. Should you automate? Automation is key in big data testing and, unlike web testing, the automation scripts are prepared first and then used to drive test execution, i.e. manual testing is avoided.

Functional Testing

  1. Validating the data that gets dumped into HDFS, i.e. comparing the source (any source of data) against the target (HDFS).
  2. If a staging area is used, identify what activities the developers perform there and validate the data correctness of those activities as well.
  3. Data will be dumped daily; to handle this incremental data, two types of scripts can be created:
  • A weekly run that verifies the complete test data between source and target (HDFS)
  • A daily run that verifies only the records added that day

4. Test cases can include scenarios like the following (a sketch follows this list):

  • validating the total number of records between source and target
  • validating the total number of columns between source and target
  • validating the column names between source and target
  • validating duplicate records, where possible using the primary or foreign key
  • validating NOT NULL columns
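As an illustration, below is a minimal PySpark sketch of such source-vs-target checks. The JDBC connection details, table name (customers) and HDFS path are placeholders assumed for this sketch; the daily run would simply add a filter on the load-date partition.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("hdfs-ingest-validation").getOrCreate()

    # Source: read the OLTP table over JDBC (URL, table and credentials are placeholders)
    source_df = (spark.read.format("jdbc")
                 .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
                 .option("dbtable", "customers")
                 .option("user", "test_user")
                 .option("password", "test_password")
                 .load())

    # Target: read what was dumped into HDFS (path is an assumption)
    target_df = spark.read.parquet("hdfs:///data/raw/customers")

    # Record count, column count and column name checks
    assert source_df.count() == target_df.count(), "record count mismatch"
    assert source_df.columns == target_df.columns, "column mismatch"

    # Duplicate check on the assumed primary key, plus a NOT NULL check
    assert target_df.groupBy("customer_id").count().filter(col("count") > 1).count() == 0
    assert target_df.filter(col("customer_id").isNull()).count() == 0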

5. Testing the complete data set is not feasible and would take a large amount of time, hence a representative subset of the data is used; this is the test data.

6. The main reason for moving data into big data storage is to analyze and process it to get meaningful results. Those transformations need to be validated, whether they are implemented in MapReduce, HQL, etc., depending on the technology used.
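For example, if a transformation aggregates transactions per customer into a Hive table, the expected output can be recomputed independently from the raw data and compared with the transformed table. The table names (raw_transactions, customer_spend) are assumptions made for this sketch.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transformation-validation").enableHiveSupport().getOrCreate()

    # Recompute the expected aggregation directly from the raw layer
    expected = spark.sql("""
        SELECT customer_id, SUM(amount) AS total_spend
        FROM raw_transactions
        GROUP BY customer_id
    """)

    # Output of the transformation under test
    actual = spark.sql("SELECT customer_id, total_spend FROM customer_spend")

    # Rows present on one side but not the other point to a transformation defect
    assert expected.subtract(actual).count() == 0
    assert actual.subtract(expected).count() == 0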

7. Once the data is analyzed/transformed, it is moved to the data warehouse so that reports can be generated, since reports are not generated directly from HDFS. We need to validate that the transformations applied between the source (HDFS) and the target (warehouse) are correct in terms of data correctness, logic, etc.

8. Once the data is available in the warehouse, BI tools fetch it with queries and display a graphical representation, i.e. a dashboard, which helps customers get information and make business decisions. As testers we need to verify that the data presented on the dashboard is the same as the data we get from the warehouse.

9. The BI dashboard itself needs to be validated, i.e. features like dropdowns, filters, etc. should work correctly; Selenium can be used to automate this.
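A minimal Selenium sketch of such a dashboard check could look like the following; the URL, element IDs and expected value are purely illustrative placeholders, and the warehouse query that produces the expected value is not shown.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Chrome()
    driver.get("https://bi.example.com/sales-dashboard")  # placeholder URL

    # Apply a filter through the dashboard UI (element IDs are assumptions)
    Select(driver.find_element(By.ID, "region-filter")).select_by_visible_text("EMEA")

    # Read the figure shown on the dashboard and compare it against the value
    # fetched from the warehouse with an equivalent query
    dashboard_total = driver.find_element(By.ID, "total-sales").text
    expected_total = "1,234,567"  # in a real test this comes from the warehouse query
    assert dashboard_total == expected_total, "dashboard does not match warehouse"

    driver.quit()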

10. A number of other data checks can be applied during this ETL process depending on the business rules, e.g. the name field cannot be null, an employee's salary cannot be negative, etc.
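Such business-rule checks can be expressed as simple queries over the loaded data; the sketch below assumes an employees table with name and salary columns available in Hive.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("business-rule-checks").enableHiveSupport().getOrCreate()

    # Business-rule checks (table and column names are assumptions)
    null_names = spark.sql("SELECT COUNT(*) AS c FROM employees WHERE name IS NULL").first()["c"]
    negative_salaries = spark.sql("SELECT COUNT(*) AS c FROM employees WHERE salary < 0").first()["c"]

    assert null_names == 0, f"{null_names} employees have a null name"
    assert negative_salaries == 0, f"{negative_salaries} employees have a negative salary"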

11. Depending on the project, further scenarios come up, e.g. validating the NiFi flows, Oozie jobs, etc.

Performance Testing

  • The speed at which data is loaded into HDFS (a simple timing sketch follows this list)
  • The speed at which data is processed within HDFS
  • Memory utilization
  • Query performance: you will be writing queries to work on the data, and those queries need to be optimized
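As a rough example, the load speed into HDFS can be measured by timing the standard hdfs dfs -put command; the local file and HDFS path below are placeholders.

    import subprocess
    import time

    local_file = "/tmp/sample_data.csv"            # placeholder test file
    hdfs_path = "/data/perf_test/sample_data.csv"  # placeholder HDFS target

    start = time.time()
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_path], check=True)
    elapsed = time.time() - start

    print(f"Loaded {local_file} into HDFS in {elapsed:.2f} seconds")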

The following tools can help with performance testing:

  • Sandstormsolution
  • JMeter
  • Yahoo! Cloud Serving Benchmark (YCSB)

The Hadoop ecosystem has matured to the point that tools like ZooKeeper, etc. expose various performance parameters and notifications.

Failover Testing

  • No data corruption occurs because of a NameNode failure.
  • Data is recovered when a DataNode fails.
  • Replication is initiated when one of the DataNodes fails or data becomes corrupted (see the sketch after this list).
  • Security testing is also required to validate the implementation.
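A basic replication check can be scripted against HDFS itself; the sketch below verifies that a file's replication factor has not dropped below the expected value (the path and expected factor are assumptions).

    import subprocess

    hdfs_file = "/data/raw/customers/part-00000"  # placeholder HDFS file
    expected_replication = 3                      # assumed minimum replication factor

    # hdfs dfs -stat %r prints the replication factor of the file
    result = subprocess.run(
        ["hdfs", "dfs", "-stat", "%r", hdfs_file],
        capture_output=True, text=True, check=True,
    )
    actual_replication = int(result.stdout.strip())

    assert actual_replication >= expected_replication, (
        f"replication dropped to {actual_replication} for {hdfs_file}"
    )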

A lot of new quality-focused components are being added, like Apache Griffin, QuerySurge, etc.; you can also develop your own framework if you don't want to go with these tools.




