Intro to Data warehousing lecture 08

1
Data Warehousing
Dimensional Modeling (DM)
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software
Engineering
Capital University of Sciences & Technology,
Islamabad Pakistan
anwarchaudary@gmail.com

2
The need for ER modeling?
 Problems with early COBOLian data
processing systems.
 Data redundancies
 From flat file to Table, each entity ultimately
becomes a Table in the physical schema.
 Simple O(n2) join to work with tables

3
Why ER Modeling has been so successful?
 Coupled with normalization drives out all
the redundancy out of the database.
 Change (or add or delete) the data at just
one point.
 Can be used with indexing for very fast
access.
 Resulted in success of OLTP systems.

4
Need for DM: Un-answered Qs
 Lets have a look at a typical ER data model first.
 Some Observations:
 All tables look-alike, as a consequence it is difficult to
identify:
 Which table is more important ?
 Which is the largest?
 Which tables contain numerical measurements of the
business?
 Which table contain nearly static descriptive attributes?

Is DM really needed? In order to better understand the need for DM lets have a
look at the diagram showing the retail data in simplified 3N
5

6
Need for DM: Complexity of Representation
 Many topologies for the same ER
diagram, all appearing different.
Very hard to visualize and remember.
A large number of possible connections to
any two (or more) tables
1
10
3
12
2
6
5
11 4
7
8
9
1
10
3
12
2
6
5
11
4
7
8
9

7
Need for DM: The Paradox
 The Paradox: Trying to make information accessible using
tables resulted in an inability to query them!
 ER and Normalization result in large number of tables which
are:
 Hard to understand by the users (DB programmers)
 Hard to navigate optimally by DBMS software
 Real value of ER is in using tables individually or in pairs
 Too complex for queries that span multiple tables with a
large number of records

8
ER vs. DM
ER DM
Constituted to optimize OLTP
performance.
Constituted to optimize DSS
query performance.
Models the micro relationships
among data elements.
Models the macro
relationships among data
elements with an overall
deterministic strategy.
A wild variability of the
structure of ER models.
All dimensions serve as
equal entry points to the
fact table.
Very vulnerable to changes in
the user's querying habits,
because such schemas are
asymmetrical.
Changes in users' querying
habits can be
accommodated by
automatic SQL generators.

9
How to simplify a ER data model?
Two general methods:
 De-Normalization
 Dimensional Modeling (DM)

10
What is DM?…
 A simpler logical model optimized for decision
support.
 Inherently dimensional in nature, with a single
central fact table and a set of smaller
dimensional tables.
 Multi-part key for the fact table
 Dimensional tables with a single-part PK.
 Keys are usually system generated

11
What is DM?...
 Results in a star like structure, called star
schema or star join.
 All relationships mandatory M-1.
 Single path between any two levels.
 Supports ROLAP operations.

12
Dimensions have Hierarchies
Items
Books Cloths
Fiction Text Men Women
MedicalEngg
Analysts tend to look at the data through dimension at a
particular “level” in the hierarchy

13
The two Schemas
Star
Snow-flake

14
“Simplified” 3NF (Retail)
CITY DISTRICT
1
ZONE CITY
DISTRICT DIVISION
MONTH QTR
STORE # STREET ZONE ...
WEEK MONTH
DATE WEEK
RECEIPT # STORE # DATE ...
ITEM #RECEIPT # ... $
ITEM # CATEGORY
ITEM #
DEPTCATEGORY
year
month
week
sale_header
store
sale_detail
item_x_cat
item_x_splir
cat_x_dept
M
1
M
1M
1
M
1
1
M M
1
M
M M1
1
M
1
1
M
YEAR QTR
1
M
quarter
SUPPLIER
DIVISION PROVINCEM
1
BACK
division
district
zone

15
Vastly Simplified Star Schema
RECEIPT#
STORE#
DATE
ITEM# M
Fact Table
ITEM#
CATEGORY
DEPT
SUPPLIER
Product Dim
M
Sale Rs.
M
STORE#
ZONE
CITY
PROVINCE
Geography Dim
DISTRICT
DATE
WEEK
QUARTER
YEAR
Time Dim
MONTH
.
.
.
1
1
1
facts
DIVISION

16
The Benefit of Simplicity
Beauty lies in close correspondence
with the business, evident even to
business users.

17
Features of Star Schema
Dimensional hierarchies are collapsed into a single
table for each dimension. Loss of Information?
A single fact table created with a single header from the
detail records, resulting in:
 A vastly simplified physical data model!
 Fewer tables (thousands of tables in some ERP systems).
 Fewer joins resulting in high performance.
 Quantifying space requirement

18
The Process of Dimensional Modeling
Four Step Method from ER to DM
1. Choose the Business Process
2. Choose the Grain
3. Choose the Facts
4. Choose the Dimensions

19
Step-1: Choose the Business Process
 A business process is a major operational
process in an organization.
 Typically supported by a legacy system
(database) or an OLTP.
 Examples: Orders, Invoices, Inventory etc.
 Business Processes are often termed as
Data Marts and that is why many people
criticize DM as being data mart oriented.

20
Star-1
Star-2
Snow-flake
Step-1: Separating the Process

21
Step-2: Choosing the Grain
 Grain is the fundamental, atomic level of data to be
represented.
 Grain is also termed as the unit of analyses.
 Example grain statements
 Typical grains
 Individual Transactions
 Daily aggregates (snapshots)
 Monthly aggregates
 Relationship between grain and expressiveness.
 Grain vs. hardware trade-off.

22
Step-2: Relationship b/w Grain
Daily aggregates
6 x 4 = 24 values
Four aggregates per week
4 x 4 = 16 values
Two aggregates per week
2 x 4 = 8 values
LOW Granularity HIGH Granularity

23
The case FOR data aggregation
 Works well for repetitive queries.
 Follows the known thought process.
 Justifiable if used for max number of queries.
 Provides a “big picture” or macroscopic view.

24
The case AGAINST data aggregation
Aggregation is irreversible.
 Can create monthly sales data from weekly sales
data, but the reverse is not possible.
Aggregation limits the questions that can be
answered.
 What, when, why, where, what-else, what-next

25
The case AGAINST data aggregation
Aggregation can hide crucial facts.
The average of 100 & 100 is same as 150 & 50

26
Aggregation hides crucial facts Example
Week-1 Week-2 Week-3 Week-4 Average
Zone-1 100 100 100 100 100
Zone-2 50 100 150 100 100
Zone-3 50 100 100 150 100
Zone-4 200 100 50 50 100
Average 100 100 100 100
Just looking at the averages i.e. aggregate

27
Aggregation hides crucial facts chart
0
50
100
150
200
250
Week-1 Week-2 Week-3 Week-4
Z1 Z2 Z3 Z4
Z1: Sale is constant (need to work on it)
Z2: Sale went up, then fell (need of concern)
Z3: Sale is on the rise, why?
Z4: Sale dropped sharply, need to look deeply.
W2: Static sale

28
“We need monthly sales
volume and Rs. by
week, product and Zone”
Facts
Dimensions
Step 3: Choose Facts statement

29
 Choose the facts that will populate
each fact table record.
 Remember that best Facts are Numeric,
Continuously Valued and Additive.
 Example: Quantity Sold, Amount etc.
Step 3: Choose Facts

30
 Choose the dimensions that apply to
each fact in the fact table.
 Typical dimensions: time, product,
geography etc.
 Identify the descriptive attributes that
explain each dimension.
 Determine hierarchies within each
dimension.
Step 4: Choose Dimensions

31
Step-4: How to Identify a Dimension?
 The single valued attributes during recording of a
transaction are dimensions.
Calendar_Date
Time_of_Day
Account _No
ATM_Location
Transaction_Type
Transaction_Rs
Fact Table
Dim
Time_of_day: Morning, Mid Morning, Lunch Break etc.
Transaction_Type: Withdrawal, Deposit, Check balance etc.

32
Step-4: Can Dimensions be Multi-valued?
 Are dimensions ALWYS single?
 Not really
 What are the problems? And how to handle them
 Calendar_Date (of inspection)
 Reg_No
 Technician
 Workshop
 Maintenance_Operation
 How many maintenance operations are possible?
 Few
 Maybe more for old cars.

33
Step-4: Dimensions & Grain
 Several grains are possible as per business
requirement.
 For some aggregations certain descriptions do not
remain atomic.
 Example: Time_of_Day may change several times
during daily aggregate, but not during a transaction
 Choose the dimensions that are applicable
within the selected grain.

Intro to Data warehousing lecture 08

Recommended

More Related Content

What's hot (20)

Similar to Intro to Data warehousing lecture 08 (20)

More from AnwarrChaudary (20)

Recently uploaded (20)

Intro to Data warehousing lecture 08