An Introduction to Big Data


In this article, we discuss what we mean by Big Data, structured and unstructured data, some real-world applications of Big Data, and how we can store and process Big Data using Hadoop.

Introduction

Big data is a blanket term for the non-traditional strategies and technologies needed to collect, organize, and process large datasets and to gather insights from them. While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years. In this article, we will talk about big data on a fundamental level and define common concepts you might come across while researching the subject. We will also take a high-level look at some of the processes and technologies currently being used in this space. With such a massive amount of data being collected, it only makes sense for companies to use this data to understand their customers and their behavior better. This is why the popularity of Data Science has grown manifold over the last few years.

Structured and Unstructured Data

Before we deep dive into the nuances of Big Data, it is important to understand the different kinds of data, namely structured and unstructured data.

Structured data

Structured data is quantitative data stored in an organized manner, typically numerical and text fields with a fixed schema. It is easy to analyze and process, is generally stored in a relational database, and can be queried using Structured Query Language (SQL).
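As a minimal sketch of this idea (using Python’s built-in sqlite3 module and a made-up customers table), rows with a fixed schema can be stored and queried with SQL:

    import sqlite3

    # An in-memory relational database with a fixed schema
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Mumbai"))
    conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Ravi", "Delhi"))

    # SQL retrieves rows by their well-defined columns
    for row in conn.execute("SELECT name FROM customers WHERE city = ?", ("Delhi",)):
        print(row)  # ('Ravi',)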

Unstructured data

Unstructured data includes qualitative data that lacks any predefined structure and can come in a variety of formats (images, mp3 files, wav files, etc.). It is typically stored in a non-relational (NoSQL) database and queried through that database’s own interface rather than with standard SQL.
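As a rough illustration (plain Python dictionaries standing in for a document-oriented NoSQL store; the records are made up), unstructured or loosely structured records need not share a schema:

    # Documents in a NoSQL-style store need not share a fixed schema
    documents = [
        {"type": "image", "file": "photo.jpg", "tags": ["holiday", "beach"]},
        {"type": "audio", "file": "song.mp3", "duration_sec": 215},
        {"type": "post", "text": "Loved the new phone!", "likes": 42},
    ]

    # Querying means filtering on whatever fields a document happens to have
    audio = [d for d in documents if d.get("type") == "audio"]
    print(audio)  # [{'type': 'audio', 'file': 'song.mp3', 'duration_sec': 215}]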

There can be semi-structured data as well, which lies somewhat in between structured and unstructured data.


What Is Big Data?

An exact definition of “big data” is difficult to nail down because projects, vendors, practitioners, and business professionals use it quite differently. With that in mind, generally speaking, big data is:

  • large datasets
  • the category of computing strategies and technologies that are used to handle large datasets

In this context, “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer. This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
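To make “too large for traditional tooling” concrete, even a file that merely exceeds available memory already forces a change of technique. A small sketch (the file name is hypothetical): stream the file line by line instead of loading it whole.

    # Count records in a file too large to load into memory at once
    # by processing it as a stream, one line at a time
    def count_lines(path):
        count = 0
        with open(path) as f:
            for _ in f:  # the file object yields lines lazily
                count += 1
        return count

    # count_lines("huge_events.log")  # hypothetical multi-terabyte log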

Why Are Big Data Systems Different?

The basic requirements for working with big data are the same as the requirements for working with datasets of any size. However, the massive scale, the speed of ingesting and processing, and the characteristics of the data that must be dealt with at each stage of the process present significant new challenges when designing solutions. The goal of most big data systems is to surface insights and connections from large volumes of heterogeneous data that would not be possible using conventional methods.

What are the 5 Vs of Big Data?

In 2001, Gartner’s Doug Laney first presented what became known as the “three Vs of big data” to describe some of the characteristics that make big data different from other data processing: Volume, Velocity, and Variety.

Volume refers to the amount of data that is being collected. The data could be structured or unstructured.

Velocity refers to the rate at which data is coming in.

Variety refers to the different kinds of data (data types, formats, etc.) that are coming in for analysis.

Over the last few years, two additional Vs have also emerged: value and veracity.

Value refers to the usefulness of the collected data.

Veracity refers to the quality of data that is coming in from different sources.



Volume

The sheer scale of the information processed helps define big data systems. These datasets can be orders of magnitude larger than traditional datasets, which demands more thought at each stage of the processing and storage life cycle.

Often, because the work requirements exceed the capabilities of a single computer, this becomes a challenge of pooling, allocating, and coordinating resources from groups of computers. Cluster management and algorithms capable of breaking tasks into smaller pieces become increasingly important.
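A toy, single-machine analogue of this divide-and-conquer pattern (using Python’s multiprocessing pool in place of a real cluster manager): the task is split into chunks, workers process them in parallel, and the partial results are combined.

    from multiprocessing import Pool

    def count_words(chunk):
        # Each worker handles one piece of the larger task
        return len(chunk.split())

    if __name__ == "__main__":
        chunks = ["big data systems split work", "across many machines",
                  "and combine the partial results"]
        with Pool(processes=3) as pool:
            partial = pool.map(count_words, chunks)  # scatter the pieces
        print(sum(partial))                          # gather the results: 13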

Velocity

Another way in which big data differs significantly from other data systems is the speed at which information moves through the system. Data is frequently flowing into the system from multiple sources and is often expected to be processed in real time to gain insights and update the current understanding of the system.

This focus on near instant feedback has driven many big data practitioners away from a batch-oriented approach and closer to a real-time streaming system. Data is constantly being added, massaged, processed, and analyzed in order to keep up with the influx of new information and to surface valuable information early when it is most relevant. These ideas require robust systems with highly available components to guard against failures along the data pipeline.
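A minimal sketch of this streaming mindset (pure Python, with a generator standing in for a real message queue such as Kafka): each event updates the running picture as it arrives, rather than waiting for a nightly batch.

    import random
    import itertools

    def sensor_stream():
        # Stands in for an endless feed of incoming events
        while True:
            yield {"sensor": "s1", "temp": random.uniform(18.0, 26.0)}

    # Process events one at a time, keeping a running aggregate
    total, count = 0.0, 0
    for event in itertools.islice(sensor_stream(), 100):
        total += event["temp"]
        count += 1
        if count % 25 == 0:
            print(f"after {count} events, mean temp = {total / count:.2f}")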

Variety

Big data problems are often unique because of the wide range of both the sources being processed and their relative quality.

Data can be ingested from internal systems like application and server logs, from social media feeds and other external APIs, from physical device sensors, and from other providers. Big data seeks to handle potentially useful data regardless of where it’s coming from by consolidating all information into a single system.

The formats and types of media can vary significantly as well. Rich media like images, video files, and audio recordings are ingested alongside text files, structured logs, etc. While more traditional data processing systems might expect data to enter the pipeline already labeled, formatted, and organized, big data systems usually accept and store data closer to its raw state. Ideally, any transformations or changes to the raw data will happen in memory at the time of processing.
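This “store raw, interpret later” approach is often called schema-on-read. A small illustration (the log format is made up): the raw lines are kept untouched, and structure is imposed only when a question is asked.

    # Raw lines are stored as-is; structure is applied at query time
    raw_logs = [
        "2023-04-01 12:00:01 INFO user=42 action=login",
        "2023-04-01 12:00:05 ERROR user=42 action=payment",
        "2023-04-01 12:00:09 INFO user=7 action=logout",
    ]

    def parse(line):
        # The transformation happens in memory, at the time of processing
        date, time, level, *fields = line.split()
        record = dict(f.split("=") for f in fields)
        record.update({"date": date, "time": time, "level": level})
        return record

    errors = [parse(line) for line in raw_logs if " ERROR " in line]
    print(errors)  # the one failed payment event, now structured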

Other Characteristics

Various individuals and organizations have suggested expanding the original three Vs, though these proposals have tended to describe challenges rather than qualities of big data. Some common additions are:

  • Veracity: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and consequently, the quality of the resulting analysis).
  • Variability: Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low-quality data to make it more useful.
  • Value: The ultimate challenge of big data is delivering value. Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value can become difficult.

Applications in the real world

Big Data helps corporations make better and faster decisions, because they have more information available to solve problems and more data to test their hypotheses on.

Customer experience is a major field that has been revolutionized by the advent of Big Data. Companies are collecting more data about their customers and their preferences than ever. This data is leveraged in a positive way, by giving personalized recommendations and offers to customers, who are more than happy to allow companies to collect this data in return for personalized services. The recommendations you get on Netflix, Amazon, or Flipkart are a gift of Big Data!

Machine Learning is another field that has benefited greatly from the increasing popularity of Big Data. More data means larger datasets to train our ML models on, and a model trained on more data (generally) performs better. Also, with the help of Machine Learning, we are now able to automate tasks that were earlier done manually, all thanks to Big Data.

Demand forecasting has become more accurate as more and more data is collected about customer purchases. This helps companies build forecasting models that predict future demand, so production can be scaled accordingly. It helps companies, especially those in manufacturing, reduce the cost of storing unsold inventory in warehouses.

Big data also has extensive use in applications such as product development and fraud detection.


The volume and velocity of Big Data can be huge, which makes it almost impossible to store in traditional data warehouses. Although some sensitive information can be stored on company premises, for most of the data, companies have to opt for cloud storage or Hadoop.

Hadoop addresses this problem by giving you the ability to store and process large amounts of data at once. Hadoop is a free, open-source software framework that allows users to store and process large datasets across clusters of computers.
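Hadoop’s classic programming model is MapReduce: a map phase that turns raw input into key-value pairs, and a reduce phase that aggregates the values for each key. The sketch below simulates the canonical word-count job in a single script; in a real Hadoop Streaming job, the mapper and reducer run as separate programs and Hadoop handles the shuffle and sort between them.

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map phase: emit (word, 1) for every word seen
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce phase: sum the counts for each distinct word
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        for word, total in reducer(mapper(sys.stdin)):
            print(f"{word}\t{total}")

As a local smoke test, echo "big data big" | python wordcount.py prints big 2 and data 1.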

Challenges

1. Data growth

Managing datasets containing terabytes of information can be a big challenge for companies. As datasets grow in size, storing them not only becomes a challenge but also becomes an expensive affair.

To overcome this, companies are now starting to pay attention to data compression and de-duplication. Data compression reduces the number of bits needed to represent the data, reducing the storage space it consumes, while de-duplication removes redundant copies of the same data.
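A quick illustration with Python’s built-in gzip module (the sample data is deliberately repetitive, which is exactly where compression shines):

    import gzip

    # Repetitive data compresses extremely well
    data = b"sensor=s1 temp=22.5\n" * 10_000
    compressed = gzip.compress(data)

    print(f"original:   {len(data)} bytes")
    print(f"compressed: {len(compressed)} bytes")  # a small fraction of the original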

2. Data security

Data security is often prioritized quite low in the Big Data workflow, which can backfire at times.

Mining of sensitive information, fake data generation, and lack of cryptographic protection (encryption) are some of the challenges businesses face when trying to adopt Big Data techniques. Companies need to understand the importance of data security and prioritize it accordingly.
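As one hedged example of adding cryptographic protection (this assumes the third-party cryptography package, installed with pip install cryptography; the payload is made up), sensitive records can be encrypted before they are stored:

    from cryptography.fernet import Fernet

    # Generate a key and keep it safe; losing the key means losing the data
    key = Fernet.generate_key()
    f = Fernet(key)

    token = f.encrypt(b"customer_id=42,card=XXXX-XXXX")  # ciphertext, safe to store
    print(f.decrypt(token))                              # plaintext, key required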

3. Data integration

Data arriving from disparate sources must be combined into a consistent, unified view before it can be analyzed. There are several Big Data solution vendors that offer ETL (Extract, Transform, Load) and data integration solutions to companies trying to overcome data integration problems, and several APIs have already been built to tackle issues related to data integration.
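A toy end-to-end ETL pipeline (standard library only; the CSV content and table name are made up) to make the three steps concrete:

    import csv, io, sqlite3

    # Extract: read raw rows from a source (an in-memory CSV here)
    source = io.StringIO("name,amount\nAsha,100\nRavi,250\n")
    rows = list(csv.DictReader(source))

    # Transform: clean and reshape the rows into the target schema
    cleaned = [(r["name"].strip(), int(r["amount"])) for r in rows]

    # Load: write the result into the destination system
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    print(conn.execute("SELECT SUM(amount) FROM sales").fetchone())  # (350,)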

The future of Big Data

The volume of data being produced every day is continuously increasing with increasing digitization. More and more businesses are shifting from traditional data storage and analysis methods to cloud solutions, and companies are realizing the importance of data. All of this implies one thing: the future of Big Data looks promising! It will change the way businesses operate and the way decisions are made.

Conclusion

Big data is a broad, rapidly evolving topic. While it is not well-suited to all types of computing, many organizations are turning to big data for certain types of workloads and using it to supplement their existing analysis and business tools. Big data systems are uniquely suited to surfacing difficult-to-detect patterns and providing insight into behaviors that are impossible to find through conventional means. By correctly implementing systems that deal with big data, organizations can gain incredible value from data that is already available.


