DBMS Schemas for Decision Support, Star Schema, Snowflake Schema, Fact Constellation Schema, Schema Definition, Data extraction, clean-up and transformation tools.
2. DBMS Schemas for Decision Support
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates.
Much like a database, a data warehouse also needs to maintain a schema.
A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.
3. Star Schema
• Each dimension in a star schema is represented by only one dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to four dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes (measures), namely dollars sold and units sold.
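A minimal relational sketch of such a star schema, assuming illustrative table and column names (the diagram itself is not reproduced here):

    -- Dimension tables: one denormalized table per dimension
    CREATE TABLE time_dim     (time_key INT PRIMARY KEY, day INT, month INT, quarter INT, year INT);
    CREATE TABLE item_dim     (item_key INT PRIMARY KEY, item_name VARCHAR(50), brand VARCHAR(50),
                               type VARCHAR(50), supplier_type VARCHAR(50));
    CREATE TABLE branch_dim   (branch_key INT PRIMARY KEY, branch_name VARCHAR(50), branch_type VARCHAR(50));
    CREATE TABLE location_dim (location_key INT PRIMARY KEY, street VARCHAR(50), city VARCHAR(50),
                               state VARCHAR(50), country VARCHAR(50));

    -- Central fact table: one foreign key per dimension plus the measures
    CREATE TABLE sales_fact (
        time_key     INT REFERENCES time_dim(time_key),
        item_key     INT REFERENCES item_dim(item_key),
        branch_key   INT REFERENCES branch_dim(branch_key),
        location_key INT REFERENCES location_dim(location_key),
        dollars_sold DECIMAL(12,2),
        units_sold   INT
    );

Every dimension joins to the fact table through a single key, which is what gives the schema its star shape.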
5. Snowflake Schema
• Some dimension tables in the snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized.
• For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
• The supplier_key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
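Continuing the sketch above, the normalization of the item dimension might look like this; it replaces the denormalized item_dim of the star-schema sketch, and column names are again assumptions:

    -- Supplier attributes are moved into their own table...
    CREATE TABLE supplier_dim (supplier_key INT PRIMARY KEY, supplier_type VARCHAR(50));

    -- ...and the item dimension now references it instead of storing supplier_type directly
    CREATE TABLE item_dim (
        item_key     INT PRIMARY KEY,
        item_name    VARCHAR(50),
        brand        VARCHAR(50),
        type         VARCHAR(50),
        supplier_key INT REFERENCES supplier_dim(supplier_key)
    );

The trade-off is less redundancy in the dimension data at the cost of extra joins at query time.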
7. Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
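A sketch of the second fact table, reusing the dimension tables from the star-schema sketch above; the shipper dimension's attributes are assumptions for illustration:

    CREATE TABLE shipper_dim (shipper_key INT PRIMARY KEY, shipper_name VARCHAR(50), shipper_type VARCHAR(50));

    -- Second fact table: shares the time, item and location dimensions with sales_fact
    CREATE TABLE shipping_fact (
        item_key      INT REFERENCES item_dim(item_key),
        time_key      INT REFERENCES time_dim(time_key),
        shipper_key   INT REFERENCES shipper_dim(shipper_key),
        from_location INT REFERENCES location_dim(location_key),
        to_location   INT REFERENCES location_dim(location_key),
        dollars_sold  DECIMAL(12,2),
        units_sold    INT
    );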
9. Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL).
The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
10. Data extraction, clean up and transformation tools
1. Tool requirements:
The tools enable sourcing of the proper data contents and formats from operational and external data stores into the data warehouse. The tasks include:
Data transformation from one format to another
Data transformation and calculation based on the application of business rules, e.g., deriving age from date of birth (see the sketch below)
Data consolidation (several source records into a single record) and integration
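A minimal sketch of such a rule-based transformation applied during load, in PostgreSQL-flavoured SQL; the staging and dimension tables and their columns are assumptions:

    CREATE TABLE staging_customer (customer_id INT, customer_name VARCHAR(100), date_of_birth DATE);
    CREATE TABLE customer_dim     (customer_key INT, customer_name VARCHAR(100), age INT);

    -- Business rule applied while moving data from staging into the warehouse:
    -- derive age from date of birth instead of storing the raw birth date.
    INSERT INTO customer_dim (customer_key, customer_name, age)
    SELECT customer_id,
           customer_name,
           EXTRACT(YEAR FROM AGE(CURRENT_DATE, date_of_birth))::INT
    FROM   staging_customer;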
11. Data extraction, clean up and transformation tools
Metadata synchronization and management include storing or updating metadata definitions.
When implementing a data warehouse, several selection criteria that affect the tools' ability to transform, integrate, and repair the data should be considered:
The ability to identify the data source
Support for flat files and indexed files
The ability to merge data from multiple data sources
The ability to read information from data dictionaries
The code generated by the tool should be maintainable in the development environment
The ability to perform data type and character set translation, which is a requirement when moving data between incompatible systems
12. Data extraction, clean up and transformation tools
The ability to perform summarization and aggregation of records
The data warehouse database management system should be able to perform the load directly from the tool using the native API.
2. Vendor approaches:
The tasks are capturing data from a source data system, cleaning and transforming it, and loading the result into a target data system.
This can be carried out either by separate products or by a single integrated solution. The integrated solutions are described below:
Code generators:
Create tailored 3GL/4GL transformation programs based on source and target data definitions.
The data transformation and enhancement rules are defined by the developer, and the generated programs employ a data manipulation language.
13. Data extraction, clean up and transformation tools
Database data replication tools:
These tools capture changes to a single data source on one system and apply the changes to a copy of the source data loaded on a different system.
Rule-driven dynamic transformation engines (also known as data mart builders):
Capture data from a source system at user-defined intervals, transform the data, then send and load the result into a target system.
Data transformation and enhancement is based on a script or function logic defined to the tool.
14. Data extraction, clean up and transformation tools
3. Access to legacy data:
Today many businesses are adopting client/server technologies and data warehousing to meet customer demand for new products and services and to obtain competitive advantages.
The majority of the information required to support business applications and the analytical power of data warehousing is located behind mainframe-based legacy systems.
While protecting their heavy financial investment in existing hardware and software, many organizations turn to middleware solutions to meet this goal.
The middleware strategy is the foundation for enterprise data access; it is designed for scalability and manageability in a data warehousing environment.
15. Data extraction, clean up and transformation tools
4. Vendor solutions:
4.1 Prism Solutions:
Prism Warehouse Manager provides a solution for data warehousing by mapping source data to the target database management system.
The Prism Warehouse Manager generates code to extract and integrate data, create and manage metadata, and create subject-oriented historical databases.
It extracts data from multiple sources – DB2, IMS, VSAM, RMS, and sequential files.
16. Data extraction, clean up and transformation tools
4.2 SAS Institute:
SAS data access engines serve as extraction tools to combine common variables and transform data representation forms for consistency.
SAS also supports decision reporting and graphing, so it acts as the front end.
4.3 Carleton Corporation's PASSPORT and MetaCenter:
Carleton's PASSPORT and the MetaCenter fulfill the data extraction and transformation needs of data warehousing.
17. Metadata
1. Metadata defined
Metadata is data about data. It contains:
The location and description of the data warehouse components.
The names, definitions, structure, and content of the data warehouse.
Identification of data sources.
Integration and transformation rules used to populate the data warehouse and deliver data to end users.
Information delivery information.
Data warehouse operational information.
Security authorization.
Metadata Interchange Initiative
It is used to develop standard specifications to exchange metadata.
18. Metadata
2. Metadata Interchange Initiative
It is used to develop standard specifications for a metadata interchange format; this allows vendors to exchange common metadata and avoid the difficulties of exchanging, sharing, and managing metadata.
The initial goals include:
Creating a vendor-independent, industry-defined and maintained standard access mechanism and standard API.
Enabling individual tools to satisfy their specific metadata access requirements freely and easily within the context of an interchange model.
Defining a clean, simple interchange implementation infrastructure.
Creating a process and procedures for extending and updating the standard.
19. Metadata
The Metadata Interchange Initiative has defined two distinct metamodels:
The application metamodel – holds the metadata for a particular application.
The metadata metamodel – the set of objects that the metadata interchange standard can be used to describe.
These models can be represented by one or more classes of tools (data extraction, cleanup, replication, etc.).
Metadata interchange standard framework
Metadata itself may be stored in any type of storage facility or format, such as relational tables, ASCII files, fixed formats, or customized formats. The metadata interchange standard framework translates an access request into the interchange standard syntax and format.
20. Metadata
The metadata interchange standard framework can be accomplished with the following approaches:
Procedural approach
ASCII batch approach – an ASCII file containing the metadata standard schema and access parameters is reloaded whenever a tool accesses metadata through the API.
Hybrid approach – follows a data-driven model by implementing a table-driven API that supports only fully qualified references for each metadata element.
The components of the metadata interchange standard framework:
The standard metadata model – refers to the ASCII file format used to represent the metadata.
21. Metadata
The standard access framework – describes the minimum number of API functions needed to communicate metadata.
Tool profile – a file that describes what aspects of the interchange standard metamodel a particular tool supports.
The user configuration – a file describing the legal interchange paths for metadata in the user's environment.
22. Metadata
3. Metadata Repository
It is implemented as part of the data warehouse framework and provides the following benefits:
It provides enterprise-wide metadata management.
It reduces and eliminates information redundancy and inconsistency.
It simplifies management and improves organizational control.
It increases flexibility, control, and reliability of application development.
It provides the ability to utilize existing applications.
It eliminates redundancy with the ability to share and reuse metadata.
23. Metadata
4. Metadata Management
Collecting, maintaining, and distributing metadata is needed for a successful data warehouse implementation, so these tools need to be carefully evaluated before any purchasing decision is made.
5. Implementation Example
Implementation approaches adopted by:
Platinum Technology
R&O
Prism Solutions
Logic Works
24. Metadata
6. Metadata trends
The process of integrating external and internal data into the warehouse faces a number of challenges:
Inconsistent data formats
Missing or invalid data
Different levels of aggregation
Semantic inconsistency
Different types of databases (text, audio, full-motion video, images, temporal databases, etc.)
These issues put an additional burden on the collection and management of common metadata definitions; this is addressed by the Metadata Coalition's metadata interchange specification.
25. Reporting, Query Tools and
Applications
Tool Categories: There are five categories of decision support tools
Reporting
Managed query
Executive information system
OLAP
Data Mining
Reporting Tools
Production Reporting Tools
Companies generate regular operational reports or support high-volume batch
jobs, such as calculating and printing paychecks
Report writers
Crystal Reports / Actuate Reporting System
Users design and run reports without having to rely on the IS department
26. Reporting, Query Tools and
Applications
Managed query tools
Managed query tools shield end users from the complexities of SQL and database
structures by inserting a metalayer between the user and the database.
Metalayer: software that provides subject-oriented views of a database and
supports point-and-click creation of SQL (see the sketch below)
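As a rough illustration of what a metalayer does, the sketch below maps business-friendly names onto physical tables and columns and assembles the SQL on the user's behalf. The mappings, table names, and join key are invented for the example, not taken from any particular product.

```python
# Hypothetical subject-oriented mappings a metalayer might maintain.
SUBJECTS = {
    "Sales Amount": ("fact_sales", "SUM(f.sales_amount)"),
    "Region":       ("dim_store",  "region_name"),
}

def build_query(measure, group_by):
    """Turn point-and-click selections into SQL so the end user never
    has to write joins or aggregates by hand."""
    fact_table, measure_expr = SUBJECTS[measure]
    dim_table, dim_column = SUBJECTS[group_by]
    return (
        f"SELECT d.{dim_column}, {measure_expr} AS {measure.replace(' ', '_').lower()} "
        f"FROM {fact_table} f JOIN {dim_table} d ON f.store_id = d.store_id "
        f"GROUP BY d.{dim_column}"
    )

print(build_query("Sales Amount", "Region"))
```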
Executive information systems (EIS)
First deployed on mainframe systems
Predate report writers and managed query tools
Build customized, graphical decision support applications or briefing books
Provide a high-level view of the business and access to external sources, e.g.
custom, online news feeds
EIS applications highlight exceptions to business activity or rules by using
color-coded graphics, as in the sketch below
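A minimal sketch of exception highlighting, assuming invented KPIs, targets, and thresholds: each metric is mapped to a traffic-light color so out-of-tolerance values stand out in a briefing view.

```python
# Hypothetical KPI snapshot an EIS briefing book might render; figures invented.
KPIS = {"Revenue": (950, 1000), "Churn %": (6.5, 5.0), "On-time delivery %": (97, 95)}

def status(actual, target, higher_is_better=True):
    """Map a KPI to a traffic-light color so exceptions stand out."""
    ratio = actual / target if higher_is_better else target / actual
    if ratio >= 1.0:
        return "green"
    return "yellow" if ratio >= 0.95 else "red"

for name, (actual, target) in KPIS.items():
    better = name != "Churn %"           # for churn, lower is better
    print(f"{name:>20}: {actual:>7} vs {target:>7} -> {status(actual, target, better)}")
```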
27. Reporting, Query Tools and
Applications
OLAP Tools
Provide an intuitive way to view corporate data
Provide navigation through hierarchies and dimensions with a single click
Aggregate data along common business subjects or dimensions
Users can drill down, across, or up levels
Data mining Tools
Provide insights into corporate data that are not easily discerned with managed
query or OLAP tools
Use a variety of statistical and AI algorithms to analyze the correlation of
variables in the data, as illustrated below
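For instance, one very small data mining step is measuring the correlation between two variables drawn from the warehouse, something an ordinary report would not surface on its own. The numbers below are made up, and the example relies on the standard-library statistics.correlation function (Python 3.10+).

```python
from statistics import correlation  # available in Python 3.10 and later

# Illustrative values only: monthly ad spend vs. units sold.
ad_spend   = [10, 12, 15, 14, 20, 22, 25]
units_sold = [100, 110, 130, 125, 160, 170, 190]

r = correlation(ad_spend, units_sold)
print(f"Pearson correlation: {r:.2f}")  # near 1.0 suggests a strong linear relationship
```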
28. Data Warehousing - OLAP
OLAP stands for Online Analytical Processing.
It uses database tables (fact and dimension tables) to enable multidimensional
viewing, analysis and querying of large amounts of data.
E.g. OLAP technology could provide management with fast answers to complex
queries on their operational data or enable them to analyze their company’s
historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are
designed to ask "complex queries of large multidimensional collections of
data." For this reason, OLAP is closely associated with data warehousing.
29. Data Warehousing - OLAP
Need
The key driver of OLAP is the multidimensional nature of the business
problem.
These problems are characterized by retrieving a very large number of
records, which can reach gigabytes and terabytes, and summarizing this data
into a form of information that can be used by business analysts.
One of the limitations of SQL is that it cannot easily represent these complex
problems.
A query will be translated into several SQL statements. These SQL
statements will involve multiple joins, intermediate tables, sorting,
aggregations, and large amounts of temporary storage, as in the sketch below.
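To see how a single business question turns into joins and aggregations, consider, purely as an illustration with invented star-schema table and column names, the SQL a front-end tool might generate for "total sales by product category and quarter, with this year compared to last year":

```python
# Hypothetical query a tool might generate against an invented star schema.
query = """
SELECT d.quarter,
       p.category,
       SUM(f.sales_amount)                                  AS total_sales,
       SUM(CASE WHEN d.year = 2023 THEN f.sales_amount END) AS sales_2023,
       SUM(CASE WHEN d.year = 2022 THEN f.sales_amount END) AS sales_2022
FROM   fact_sales f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_product p ON f.product_key = p.product_key
GROUP BY d.quarter, p.category
ORDER BY p.category, d.quarter;
"""
print(query)
```

A question that a user would phrase in one sentence already needs two joins, conditional aggregation, grouping, and sorting, which is the burden OLAP tools are designed to hide.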
30. Data Warehousing - OLAP
Online Analytical Processing Server (OLAP) is based on the
multidimensional data model.
It allows managers and analysts to gain insight into the information through
fast, consistent, and interactive access to information.
Provide an intuitive way to view corporate data.
Types of OLAP Servers:
There are four types of OLAP servers:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers
31. OLAP Vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on information out. | It focuses on data in.
5 | Based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on the Entity-Relationship Model.
6 | Contains historical data. | Contains current data.
32. OLAP Vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
9 | Number of users is in the hundreds. | Number of users is in the thousands.
10 | Number of records accessed is in the millions. | Number of records accessed is in the tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
33. Multidimensional Data Model
The multidimensional data model is an integral part of On-Line Analytical
Processing, or OLAP.
The multidimensional data model can be viewed as a cube. The table on the left
contains detailed sales data by product, market, and time. The cube on the
right associates sales numbers (units sold) with the dimensions product type,
market, and time, with the unit variables organized as cells in an array.
This cube can be expanded to include another array - price - which can be
associated with all or only some dimensions. As the number of dimensions
increases, the number of cube cells increases exponentially, as the sketch
below illustrates.
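A minimal sketch of the cube idea using an in-memory array; the products, markets, periods, and sales figures are invented. Each cell holds units sold for one (product, market, time) combination, and every added dimension multiplies the number of cells.

```python
import numpy as np

products = ["Laptop", "Phone"]
markets  = ["North", "South", "West"]
periods  = ["Q1", "Q2", "Q3", "Q4"]

# Units sold, indexed as cube[product, market, time]; random data for illustration.
rng = np.random.default_rng(0)
cube = rng.integers(50, 200, size=(len(products), len(markets), len(periods)))

# Slice: all sales of "Phone" across markets and quarters.
phone_slice = cube[products.index("Phone"), :, :]

# Aggregate along the market dimension: sales per product per quarter.
by_product_quarter = cube.sum(axis=1)

print("cells in the cube:", cube.size)  # 2 * 3 * 4 = 24; grows quickly with more dimensions
print(by_product_quarter)
```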
34. ETL Process in Data Warehouse
ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse. The process of ETL can be broken down into the following three
stages:
Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This step
involves reading data from the source systems and storing it in a staging area.
Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and
validating the data, converting data types, combining data from multiple
sources, and creating new data fields.
35. ETL Process in Data Warehouse
Load: After the data is transformed, it is loaded into the data warehouse. This
step involves creating the physical data structures and loading the data into
the warehouse.
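A toy end-to-end sketch of the three stages, using an in-memory SQLite database as the "warehouse". The source file name, column names, cleaning rules, and derived tax field are assumptions made for the example, not a description of any particular ETL tool.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a flat-file source into a staging list."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and validate, convert types, derive a new field."""
    out = []
    for r in rows:
        if not r.get("customer_id"):        # drop invalid rows
            continue
        amount = float(r["amount"])         # convert data types
        out.append((r["customer_id"].strip(), amount, amount * 0.1))  # derived tax field
    return out

def load(rows, conn):
    """Load: create the physical structure and insert the transformed data."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (customer_id TEXT, amount REAL, tax REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("orders.csv")), conn)   # "orders.csv" is an assumed input file
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone())
```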
The ETL process is an iterative process that is repeated as new data is added
to the warehouse. The process is important because it ensures that the data in
the data warehouse is accurate, complete, and up-to-date. It also helps to
ensure that the data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available,
such as Informatica, Talend, DataStage, and others, that can automate and
simplify the ETL process.
ETL is a process in Data Warehousing and it stands for Extract, Transform
and Load. It is a process in which an ETL tool extracts the data from various
data source systems, transforms it in the staging area, and then finally, loads it
into the Data Warehouse system.
36. ETL Process in Data Warehouse
ETL Tools: The most commonly used ETL tools are Hevo, Sybase, Oracle
Warehouse Builder, CloverETL, and MarkLogic.
Data Warehouses: The most commonly used data warehouses are Snowflake,
Redshift, BigQuery, and ...
Overall, the ETL process is an essential process in data warehousing that
helps to ensure that the data in the data warehouse is accurate, complete,
and up-to-date.
37. ETL Process
ADVANTAGES and DISADVANTAGES
Advantages of ETL process in data warehousing:
Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
Better data integration: ETL process helps to integrate data from multiple
sources and systems, making it more accessible and usable.
Increased data security: ETL process can help to improve data security by
controlling access to the data warehouse and ensuring that only authorized
users can access the data.
Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
Increased automation: ETL tools and technologies can automate and simplify
the ETL process, reducing the time and effort required to load and update data
in the warehouse.
38. ETL Process
ADVANTAGES AND DISADVANTAGES
Disadvantages of ETL process in data warehousing:
High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or resources.
Limited flexibility: ETL process can be limited in terms of flexibility, as it
may not be able to handle unstructured data or real-time data streams.
Limited scalability: ETL process can be limited in terms of scalability, as it
may not be able to handle very large amounts of data.
Data privacy concerns: ETL process can raise concerns about data privacy, as
large amounts of data are collected, stored, and analyzed.
39. 10 Best Data Warehouse Tools to Explore
in 2023
1. Hevo Data
2. Amazon Web Services Data Warehouse Tools
3. Google Data Warehouse Tools
4. Microsoft Azure Data Warehouse Tools
5. Oracle Autonomous Data Warehouse
6. Snowflake
7. IBM Data Warehouse Tools
8. Teradata Vantage
9. SAS Cloud
10. SAP Data Warehouse Cloud
40. IMPORTANT WEBSITE LINKS
1. AWS Redshift: Best for real-time and predictive analytics
2. Oracle Autonomous Data Warehouse: Best for autonomous management
capabilities
3. Azure Synapse Analytics: Best for intelligent workload management
4. IBM Db2 Warehouse: Best for fully managed cloud versions
5. Teradata Vantage: Best for enhanced analytics capabilities
6. SAP BW/4HANA: Best for advanced analytics and tailored applications
7. Google BigQuery: Best for built-in query acceleration and serverless
architecture
8. Snowflake for Data Warehouse: Best for separate computation and storage
41. IMPORTANT WEBSITE LINKS
9. Cloudera Data Platform: Best for faster scaling
10. Micro Focus Vertica: Best for improved query performance
11. MarkLogic: Best for complex data challenges
12. MongoDB: Best for sophisticated access management
13. Talend: Best for simplified data governance
14. Informatica: Best for intelligent data management
15. Arm Treasure Data: Best for connected customer experience