Insights of building better data pipelines and gaining accuracy

The scale of the data has the largest impact on the accuracy of a pipeline. This post focuses on the following aspects of working with data at that scale:

Distributed systems

Performance

Consistency of data

Signal-to-noise ratio (S/N)


The "signal-to-noise ratio" measures how much of the data is useful relative to all the data that is not. In an e-mail inbox, for example, the messages you actually need are the signal, while spam and irrelevant information are the noise. In a pipeline the volume of data is large, so the noise must be carefully cleared away (removal of irrelevant data, i.e. data cleaning).
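Data cleaning of this kind can be sketched as a filter that keeps only records carrying useful signal. The record layout and the noise keywords below are illustrative assumptions, not part of any real pipeline:

```python
# A minimal sketch of "clearing the noise": keep only records whose body
# does not contain a known noise keyword. Records and keywords are made up.
records = [
    {"id": 1, "body": "Quarterly sales report attached"},
    {"id": 2, "body": "WIN A FREE PRIZE NOW!!!"},
    {"id": 3, "body": "Meeting moved to 3pm"},
]

NOISE_KEYWORDS = {"free", "prize", "win"}

def is_signal(record):
    """Treat a record as noise if its body mentions any noise keyword."""
    words = {w.strip("!.,").lower() for w in record["body"].split()}
    return words.isdisjoint(NOISE_KEYWORDS)

cleaned = [r for r in records if is_signal(r)]
```

In a real pipeline the filter would be a trained classifier or a set of validation rules, but the shape is the same: raw records in, signal out.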

There are many ways to express an S/N ratio; for large datasets it is convenient to state it in decibels (dB). For example, a web scraper might receive 10,000 HTTP responses per hour, or over 1.5 TB of raw data per day, of which only a small fraction is useful signal; until the data is cleaned, the ratio is far below one (negative in dB).
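In decibels the ratio is 10 * log10(signal / noise). The byte figures below are assumptions for illustration (1 GB of useful records extracted from the 1.5 TB of raw responses mentioned above):

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(signal / noise)."""
    return 10 * math.log10(signal / noise)

# Assumed figures: 1 GB of useful records out of 1.5 TB of raw responses.
useful_bytes = 1e9
raw_bytes = 1.5e12
noise_bytes = raw_bytes - useful_bytes

ratio_db = snr_db(useful_bytes, noise_bytes)  # strongly negative: mostly noise
```

A strongly negative dB value like this is exactly why the cleaning step matters: almost everything the scraper fetches is noise.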

The signal-to-noise ratio also helps us estimate the number of machines needed to crawl the internet and collect the data.
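A back-of-the-envelope version of that estimate might look like the following. Every number here is an assumption for illustration; in practice you would plug in throughput figures measured from your own scraper:

```python
import math

# Assumed figures for a rough capacity estimate.
pages_to_crawl = 1_000_000_000       # target corpus size (assumed)
pages_per_machine_per_hour = 10_000  # per-machine throughput, as quoted above
deadline_hours = 24 * 7              # finish within one week (assumed)

machines_needed = math.ceil(
    pages_to_crawl / (pages_per_machine_per_hour * deadline_hours)
)
```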

[Image: data scraper]


This article introduces three basic approaches to reducing the size of the data by discretization, i.e. representing the data as distinct points so that each record is uniquely identified.


In this article, we will study the case where the dataset has a specific structure (e.g. the data model of a database table). In general, an ETL (extract, transform, load) process is how you make your data more specific and structured.
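A minimal ETL sketch under these assumptions: extract raw rows, transform them into a fixed structure, and load them into a target list standing in for a database table. The field names are illustrative:

```python
import json

# Raw input: JSON strings with inconsistent types (age arrives as a string).
raw_rows = [
    '{"name": "Alice", "age": "34"}',
    '{"name": "Bob", "age": "29"}',
]

def extract(rows):
    """Parse each raw row into a Python dict."""
    return [json.loads(r) for r in rows]

def transform(records):
    """Enforce the target schema: lower-case name, integer age."""
    return [{"name": r["name"].lower(), "age": int(r["age"])} for r in records]

def load(records, table):
    """Append the cleaned records to the target 'table'."""
    table.extend(records)
    return table

table = load(transform(extract(raw_rows)), [])
```

Real pipelines swap the list for a database and add error handling, but the extract/transform/load split stays the same.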


Types of Data:

The first type is unstructured data (e.g. Twitter data). Unstructured data can be stored as-is, without any schema. However, if you need a specific structure later, you are forced to impose a schema on the data (for example, a JSON schema).

The second type is semi-structured data, i.e. data whose schema cannot be fully determined in advance (e.g. Stack Overflow data).

The third type of data has a pre-defined schema. For example, the data model of a database table can be defined in advance.
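The three shapes can be put side by side in a short sketch. Everything here is illustrative; the schema-validation helper is a hypothetical stand-in for what a real database enforces:

```python
import json

# 1. Unstructured: a raw string, no schema at all.
unstructured = "Just landed in Tokyo! #travel"

# 2. Semi-structured: self-describing JSON; keys may vary record to record.
semi_structured = json.loads(
    '{"title": "How do I sort a list?", "tags": ["python"]}'
)

# 3. Structured: a pre-defined schema, enforced before the row is stored.
SCHEMA = {"user_id": int, "score": float}

def validate(row, schema):
    """Accept a row only if it has exactly the schema's fields and types."""
    return set(row) == set(schema) and all(
        isinstance(row[k], t) for k, t in schema.items()
    )

structured_row = {"user_id": 42, "score": 3.5}
ok = validate(structured_row, SCHEMA)
```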


The storage difference:

Non-serial data

By its nature, non-serial data forces us to store repeated values many times. Statistical data is relatively easy to handle because it is continuous, but semi-structured and unstructured data often have to be stored many times over.

Serial data

Serial data requires less storage space because repeated values can be written once and referenced thereafter. For example, if you store a photo, you only need to store it once. Storage for serial data is therefore efficient.
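The storage difference can be sketched by comparing a naive store, which writes every occurrence in full, with a deduplicated store, which writes each distinct value once plus a small reference per occurrence. The "photo" blobs and reference size below are stand-in assumptions:

```python
# Stand-in photo blobs: two identical "cat" photos and one "dog" photo.
photos = ["cat" * 1000, "cat" * 1000, "dog" * 1000]

# Naive store: every copy is written out in full.
naive_bytes = sum(len(p) for p in photos)

# Deduplicated store: one copy per distinct value, plus a reference each.
REF_SIZE = 8  # assumed size of a reference/pointer, in bytes
unique = set(photos)
dedup_bytes = sum(len(p) for p in unique) + REF_SIZE * len(photos)
```

The more often a value repeats, the more the deduplicated layout wins; for data with no repetition the two converge.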

Data loss

If a database goes down, it can usually be brought back up; but if the data itself is lost, there may be no way to recover it. Managers should therefore be careful about how this type of data is stored.


Consider, for example, a survey of food preferences across an entire population. If that database is lost, the data cannot be re-collected, so it must be stored with particular care.
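The simplest guard against that kind of loss is to write every record to more than one independent store, so a surviving copy remains if one store is lost. The in-memory dicts below are stand-ins for real databases:

```python
# A minimal sketch of redundant writes: every record goes to two stores.
primary, replica = {}, {}

def durable_put(key, value):
    """Write the record to both stores."""
    primary[key] = value
    replica[key] = value

durable_put("survey:food", {"pizza": 0.4, "sushi": 0.6})

# Simulate the primary going down and losing its data.
primary.clear()

# The record survives in the replica.
recovered = replica["survey:food"]
```

Real systems use database replication or regular backups for this, but the principle is the same: irreplaceable data should never exist in only one place.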


Summary

1. Non-serial data

2. Serial data

3. Non-serial data and serial data combined

This covers, at a basic level, storage algorithms and the problem of data storage.

#data #dataengineering #dataengineer #datascience #AmazonWebServices #GoogleCloud #datastorage #algorithm #datapipeline
