Advanced Computer Architecture: Computer Architecture: A Quantitative Approach, Chapter 1, Fundamentals of Quantitative Design and Analysis

1
Copyright © 2019, Elsevier Inc. All rights reserved.
Advanced Computer Architecture
Session One
In the name of the Lord,
the Most Benevolent

2
Chapter 1
Fundamentals of Quantitative
Design and Analysis
Computer Architecture
A Quantitative Approach, Sixth Edition

3
Computer Technology
 Performance improvements:
 Improvements in semiconductor technology

Feature size, clock speed
 Improvements in computer architectures

Enabled by HLL compilers, UNIX

Lead to RISC architectures
 Together have enabled:

Lightweight computers

Productivity-based managed/interpreted
programming languages
Introduction

4
Single Processor Performance

5
Current Trends in Architecture
 Cannot continue to leverage Instruction-Level
parallelism (ILP)
 Single processor performance improvement ended in
2003
 New models for performance:
 Data-level parallelism (DLP)
 Thread-level parallelism (TLP)
 Request-level parallelism (RLP)
 These require explicit restructuring of the
application

6
Classes of Computers
 Personal Mobile Device (PMD)
 e.g. smart phones, tablet computers
 Emphasis on energy efficiency and real-time performance
 Desktop Computing
 Emphasis on price-performance
 Servers
 Emphasis on availability, scalability, throughput
 Clusters / Warehouse Scale Computers
 Used for “Software as a Service (SaaS)”
 Emphasis on availability and price-performance
 Sub-class: Supercomputers, emphasis: floating-point
performance and fast internal networks
 Internet of Things/Embedded Computers
 Emphasis: price

7
Parallelism
 Classes of parallelism in applications:
 Data-Level Parallelism (DLP)
 Task-Level Parallelism (TLP)
 Classes of architectural parallelism:
 Instruction-Level Parallelism (ILP)
 Vector architectures/Graphics Processor Units (GPUs)
 Thread-Level Parallelism
 Request-Level Parallelism

8
Flynn’s Taxonomy
 Single instruction stream, single data stream (SISD)
 Single instruction stream, multiple data streams (SIMD)
 Vector architectures
 Multimedia extensions
 Graphics processor units
 Multiple instruction streams, single data stream (MISD)
 No commercial implementation
 Multiple instruction streams, multiple data streams
(MIMD)
 Tightly-coupled MIMD
 Loosely-coupled MIMD

9
1- Single Instruction, Single Data (SISD)
 This category is the uniprocessor.
 The programmer thinks of it as the standard sequential
computer, but it can exploit ILP.

10
2- Single Instruction, Multiple Data (SIMD)
 The same instruction is executed by multiple processors using
different data streams.
 SIMD computers exploit data-level parallelism by applying the same
operations to multiple items of data in parallel.
 Each processor has its own data memory,
 but there is a single instruction memory and control processor, which
fetches and dispatches instructions.
 Examples: vector architectures, multimedia extensions to standard
instruction sets, and GPUs.

11
3- Multiple Instruction, Single Data (MISD)
No commercial multiprocessor of this type has been built
to date, but it rounds out this simple classification.

12
4- Multiple Instruction, Multiple Data (MIMD)
 Each processor fetches its own instructions and operates
on its own data, and it targets task-level parallelism (TLP).
 MIMD can also exploit DLP, but it is more expensive than SIMD.
 Tightly coupled MIMD architectures: exploit TLP
 Loosely coupled MIMD architectures: exploit RLP
 Clusters
 Warehouse-scale computers

13
Defining Computer Architecture
 “Old” view of computer architecture:
 Instruction Set Architecture (ISA) design
 i.e. decisions regarding:

registers, memory addressing, addressing modes,
instruction operands, available operations, control flow
instructions, instruction encoding
 “Real” computer architecture:
 Specific requirements of the target machine
 Design to maximize performance within constraints:
cost, power, and availability
 Includes ISA, microarchitecture, hardware

14
Instruction Set Architecture
 Class of ISA
 General-purpose registers
 Register-memory vs load-store
 RISC-V registers
 32 general-purpose, 32 floating-point

Register   Name      Use                   Saver
x0         zero      constant 0            n/a
x1         ra        return address        caller
x2         sp        stack pointer         callee
x3         gp        global pointer
x4         tp        thread pointer
x5-x7      t0-t2     temporaries           caller
x8         s0/fp     saved/frame pointer   callee
x9         s1        saved                 callee
x10-x17    a0-a7     arguments             caller
x18-x27    s2-s11    saved                 callee
x28-x31    t3-t6     temporaries           caller
f0-f7      ft0-ft7   FP temporaries        caller
f8-f9      fs0-fs1   FP saved              callee
f10-f17    fa0-fa7   FP arguments          caller
f18-f27    fs2-fs11  FP saved              callee
f28-f31    ft8-ft11  FP temporaries        caller

15
 Memory addressing
 RISC-V: byte addressed, aligned accesses faster

An access to an object of size s bytes at byte address A is aligned if
A mod s=0.
 Addressing modes
 RISC-V: Register, immediate, displacement (base+offset)
 Other examples: autoincrement, indexed, PC-relative
 Types and size of operands
 RISC-V: 8-bit, 32-bit, 64-bit
 IEEE 754 floating point in 32-bit (single precision) and 64-bit
(double precision).
 The 80x86 also supports 80-bit floating point (extended
double precision).
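
The alignment rule on this slide (an access of s bytes at address A is aligned if A mod s = 0) can be checked mechanically. The following is a minimal Python sketch; the function name `is_aligned` is illustrative and is not part of RISC-V or the textbook.

```python
def is_aligned(address: int, size: int) -> bool:
    """Return True if an access of `size` bytes at byte address `address` is aligned."""
    if size <= 0:
        raise ValueError("size must be positive")
    return address % size == 0


# A 4-byte word at address 8 is aligned; the same word at address 6 is not.
print(is_aligned(8, 4))  # True
print(is_aligned(6, 4))  # False
```

An 8-byte double word placed at address 12 would fail the same check, since 12 mod 8 = 4.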

16
Floating point instructions for RISC-V.

17
IEEE 754 Format

18
 Operations
 RISC-V: data transfer, arithmetic, logical, control,
floating point
 See Fig. 1.5 in text
 Control flow instructions
 Use content of registers (RISC-V) vs. status bits (x86,
ARMv7, ARMv8)
 Return address in register (RISC-V, ARMv7, ARMv8)
vs. on stack (x86)
 Encoding
 Fixed (RISC-V, ARMv7/v8 except compact instruction
set) vs. variable length (x86)

19
Encoding

20

21
Session Two

Chapter 1
Design and Analysis…(Cont.)

23
Genuine Computer Architecture
 The implementation of a computer
has two components:
 organization
 hardware

24
…Genuine Computer Architecture
 Organization
 Includes the high-level aspects of a computer’s design, such as
the memory system, the memory interconnect, and the
design of the internal processor or CPU (central
processing unit, where arithmetic, logic, branching, and
data transfer are implemented).
 The term microarchitecture is also used instead of
organization.

25
 Two processors with the same instruction set
architectures but different organizations are
the AMD Opteron and the Intel Core i7.

Both processors implement the 80x86 instruction
set, but they have very different pipeline and cache
organizations.

26
 Hardware
 refers to the specifics of a computer:

the detailed logic design

the packaging technology of the computer.
 Often a line of computers contains computers with:
 identical instruction set architectures,
 very similar organizations,
 but differences in the detailed hardware implementation.

27
 Example: the Intel Core i7 and the Intel Xeon E7
 Nearly identical designs
 Different clock rates
 Different memory systems
 These differences make the Xeon E7 more effective for server
computers.

28
 Computer architects must design a
computer to meet
 functional requirements as well as
price, power, performance, and availability goals.
 Architects also must determine what the functional
requirements are, which can be a major task.
 The requirements may be specific features inspired
by the market.
 Application software typically drives the choice of
certain functional requirements by determining how
the computer will be used.

29
Summary of some of the most important functional requirements an architect faces

30
Trends in Technology
 Integrated circuit technology (Moore’s Law)
 Transistor density: 35%/year
 Die size: 10-20%/year
 Integration overall: 40-55%/year
 DRAM capacity: 25-40%/year (slowing)
 8 Gb (2014), 16 Gb (2019), possibly no 32 Gb
 Flash capacity: 50-60%/year
 8-10X cheaper/bit than DRAM
 Magnetic disk capacity: recently slowed to 5%/year
 Density increases may no longer be possible, maybe increase from 7 to 9 platters
 8-10X cheaper/bit than Flash
 200-300X cheaper/bit than DRAM
 Network technology
 Network Performance depends both on the performance of switches and on the
performance of the transmission system.

Designers often design for the next
technology.

Cost has decreased at about the rate
at which density increases.

31
Bandwidth and Latency
 Bandwidth or throughput
 Total work done in a given time
 32,000-40,000X improvement for processors
 300-1200X improvement for memory and disks
 Latency or response time
 Time between start and completion of an event
 50-90X improvement for processors
 6-8X improvement for memory and disks

32
Bandwidth and Latency…
 Performance is the primary differentiator
for microprocessors and networks.
 They have seen the greatest gains: 32,000–40,000X in
bandwidth and 50–90X in latency.
 Capacity is generally more important than
performance for memory and disks.
 Capacity has improved more,
 with bandwidth advances of 400–2400X
 and gains in latency of 8–9X.

33
Performance milestones over 25–40 years for
microprocessors

34
Performance milestones over 25–40 years for memory

35
Performance milestones over 25–40 years for networks

36
Performance milestones over 25–40 years for disks

37
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones relative to the first milestone.
Latency improved 8–91X, while bandwidth improved about 400–32,000X.
Except for networking, there were modest improvements in latency and bandwidth in the other three
technologies in the six years (2011-2017): 0%–23% in latency and 23%–70% in bandwidth.

38

39
Session Three

Chapter 1

41
Transistors and Wires
 Feature size
 Minimum size of transistor or wire in x or y
dimension
 10 microns in 1971 to 0.011 microns in 2017
 Transistor performance scales linearly

Wire delay does not improve with feature size!
 Integration density scales quadratically
 Larger and larger fractions of the clock cycle have been
consumed by the propagation delay of signals on wires.
 But power now plays an even greater role than wire delay.

42
Transistors and Wires

43
Power and Energy

44
Power and Energy concerns
1. What is the maximum power a processor
ever requires?
 Modern processors provide voltage indexing methods that allow the
processor to slow down and regulate voltage
within a wider margin.
2. What is the sustained power
consumption? This is the thermal design power (TDP),
and it determines the cooling requirement.
3. Which metric is the right one for comparing
processors: energy or power?

45
Power and Energy
 Problem: Get power in, get power out
 Thermal Design Power (TDP)
 Characterizes sustained power consumption
 Used as target for power supply and cooling system
 Lower than peak power (which is often 1.5X higher), higher than
average power consumption
 Clock rate can be reduced dynamically to limit
power consumption
 Energy per task is often a better measurement

46
Power and Energy
 Power: energy per unit time
 1 watt = 1 joule per second.
 E = P x T
 Which metric is the right one for comparing
processors: energy or power?
 In general, energy is always a better metric

because it is tied to a specific task and the time
required for that task.

47
Power and Energy
 if we want to know which of two
processors is more efficient for a given
task, we should compare energy
consumption (not power) for executing the
task.
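
To see why energy rather than power is the right basis for comparing two processors on a task, it helps to apply E = P x T to concrete numbers. The sketch below uses hypothetical values: processor A draws 20% more average power than processor B but finishes the task in 70% of B's time. The function name and the wattage and time figures are assumptions for illustration only.

```python
def energy_joules(avg_power_watts: float, time_seconds: float) -> float:
    """Energy consumed by a task: E = P x T."""
    return avg_power_watts * time_seconds


# Hypothetical processors running the same task.
energy_b = energy_joules(100.0, 10.0)  # B: 100 W for 10 s
energy_a = energy_joules(120.0, 7.0)   # A: 20% more power, 70% of the time

# A uses less energy overall even though its power draw is higher.
print(energy_a / energy_b)  # 0.84
```

Comparing power alone would favor B, while comparing energy for the task favors A.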

48
Power and Energy
 When is power consumption a useful
measure?
 as a constraint.

for example, a chip might be limited to 100 watts.

49
Power and Energy
 Static power
 Dynamic power

50
Dynamic Energy and Power

51
 Dynamic energy
 Transistor switch from 0 -> 1 or 1 -> 0
 ½ x Capacitive load x Voltage^2
 Dynamic power
 ½ x Capacitive load x Voltage^2 x Frequency switched
 Reducing clock rate reduces power, not energy
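
The two expressions above are proportionalities, so they are most useful for comparing ratios. The following sketch, with hypothetical function names and normalized inputs, compares a design before and after a 15% reduction in both voltage and frequency, and also shows the case where only the clock rate is lowered.

```python
def dynamic_energy(cap_load: float, voltage: float) -> float:
    """Energy per transition, up to a constant: 1/2 x C x V^2."""
    return 0.5 * cap_load * voltage ** 2


def dynamic_power(cap_load: float, voltage: float, frequency: float) -> float:
    """Power, up to a constant: 1/2 x C x V^2 x f."""
    return dynamic_energy(cap_load, voltage) * frequency


# Normalized baseline values (assumed for illustration).
C, V, F = 1.0, 1.0, 1.0

# Lower both voltage and frequency by 15%.
energy_ratio = dynamic_energy(C, 0.85 * V) / dynamic_energy(C, V)
power_ratio = dynamic_power(C, 0.85 * V, 0.85 * F) / dynamic_power(C, V, F)

# Lower only the clock rate by 15%: power drops, energy per transition does not.
freq_only_power_ratio = dynamic_power(C, V, 0.85 * F) / dynamic_power(C, V, F)
freq_only_energy_ratio = dynamic_energy(C, V) / dynamic_energy(C, V)

print(round(energy_ratio, 4), round(power_ratio, 4))
print(round(freq_only_power_ratio, 4), round(freq_only_energy_ratio, 4))
```

Under these assumptions energy falls to about 72% and power to about 61% of the original, while slowing the clock alone leaves the energy per transition unchanged.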

52

53
Power
 Intel 80386 consumed ~2 W
 3.3 GHz Intel Core i7 consumes 130 W
 Heat must be dissipated from a 1.5 x 1.5 cm chip
 This is the limit of what can be cooled by air

54
Power

55
Reducing Power

56
Reducing Power
 Techniques for reducing power:
 Do nothing well: (clock gating)

Most microprocessors today turn off the clock of inactive modules to
save energy and dynamic power
 Dynamic Voltage-Frequency Scaling (DVFS).

Personal mobile devices, laptops, and even servers have periods of
low activity where there is no need to operate at the highest clock
frequency and voltages.
 Low power state for DRAM, disks :

Given that PMDs and laptops are often idle, memory and storage
offer low power modes to save energy
 Overclocking, turning off cores
 The 3.3 GHz Core i7 can run in short bursts at 3.6 GHz.
 For single-threaded code, these microprocessors can turn off
all cores but one and run it at an even higher clock rate.

57
Reducing Power
 Techniques for reducing power:
 Do nothing well
 Dynamic Voltage-Frequency Scaling
 Low power state for DRAM, disks

58
Static Power

59
Static Power
 Static power consumption
 25-50% of total power

Current_static x Voltage
 Scales with number of transistors
 To reduce: power gating

60
Static Power
 Leakage is significant in part because of large SRAM caches that need power to
maintain the storage values. (The S in
SRAM is for static.)
 The only hope to stop leakage is to turn off
power to subsets of the chip.

61
Race-to-halt
 Because the processor is just a portion of
the whole energy cost of a system,
 it can make sense to use a faster, less
energy-efficient processor to allow the rest
of the system to go into a sleep mode. This
strategy is known as race-to-halt.

62
Domain specific processors
A computer will consist of
 standard processors to run conventional
large programs such as operating systems
 domain-specific processors that
do only a narrow range of tasks, but do them
extremely well.
 Such computers will be much more
heterogeneous than the homogeneous
multicore chips of the past.

63

64
Session Four

65

Chapter 1

67
Trends in Cost
 Although costs tend to be less important in some
computer designs (specifically supercomputers),
cost-sensitive designs are of growing
significance.
 Learning curve: manufacturing costs
decrease over time.

Example
 Price per megabyte of DRAM has dropped over the long
term. Price and cost of DRAM track closely.
 Microprocessor prices also drop over time, but because
they are less standardized than DRAMs, the relationship
between price and cost is more complex.
 The learning curve is best measured by change in yield.

68
Trends in Cost
 Cost driven down by learning curve
 Yield
 DRAM: price closely tracks cost
 Microprocessors: price depends on
volume
 10% less for each doubling of volume
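
The rule of thumb that price falls about 10% for each doubling of volume can be turned into a small estimator. This is a rough sketch: the function name, the starting cost and volumes, and the extension of the rule to fractional doublings are all assumptions made for illustration.

```python
import math


def cost_at_volume(base_cost: float, base_volume: float, volume: float,
                   drop_per_doubling: float = 0.10) -> float:
    """Scale cost by (1 - drop_per_doubling) for every doubling of volume."""
    if base_volume <= 0 or volume <= 0:
        raise ValueError("volumes must be positive")
    doublings = math.log2(volume / base_volume)
    return base_cost * (1.0 - drop_per_doubling) ** doublings


# Hypothetical part costing 100 units at a volume of 1,000:
# quadrupling the volume is two doublings, so the cost is about 100 x 0.9^2.
print(round(cost_at_volume(100.0, 1_000, 4_000), 2))  # 81.0
```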

69
Trends in Cost
 key factor in determining cost:

70
Cost of an Integrated Circuit
 standard parts—disks, Flash memory, DRAMs,
and so on—are becoming a significant portion of
any system’s cost.
 With PMDs’ increasing reliance on whole systems
on a chip (SOC), the cost of the integrated
circuits is much of the cost of the PMD.

71
Trends in Cost

72

73

74

75
Integrated Circuit Cost
 Integrated circuit
 Bose-Einstein formula:
 Defects per unit area = 0.016-0.057 defects per square cm (2010)
 N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
 For 28 nm processes in 2017, N is 7.5–9.5. For a 16 nm process,
N ranges from 10 to 14.
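
The formulas for this slide were images and did not survive in the text. As a rough sketch, assuming the commonly used approximation for dies per wafer and the Bose-Einstein yield model with process-complexity factor N, the cost of a good die can be estimated as follows. The wafer diameter, die size, and wafer cost are hypothetical, and the defect density and N are simply picked from within the 2010 ranges quoted above.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area divided by die area, minus an estimate of partial dies at the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2: float, die_area_cm2: float, n: float,
              wafer_yield: float = 1.0) -> float:
    """Bose-Einstein model: wafer yield x 1 / (1 + defect density x die area)^N."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n


def cost_per_good_die(wafer_cost: float, wafer_diameter_cm: float,
                      die_area_cm2: float, defects_per_cm2: float, n: float) -> float:
    """Wafer cost spread over the dies that are expected to work."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2, n))
    return wafer_cost / good_dies


# Hypothetical case: 30 cm wafer, 1.5 cm x 1.5 cm die, 0.047 defects/cm^2, N = 12.
area = 1.5 * 1.5
print(round(dies_per_wafer(30, area)))         # 270
print(round(die_yield(0.047, area, 12), 2))    # 0.3
print(round(cost_per_good_die(5000.0, 30, area, 0.047, 12), 2))
```

Because die area appears both in the die count and inside the yield exponent, cost per good die grows much faster than linearly with die area.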

76

77

78
Integrated Circuit Cost: redundancy as a way to
raise yield.
 Given the tremendous price pressures on commodity products such
as DRAM and SRAM, designers have included redundancy as a
way to raise yield.
 DRAMs have regularly included some redundant memory cells so
that a certain number of flaws can be accommodated.
 Designers have used similar techniques in both standard SRAMs
and in large SRAM arrays used for caches within microprocessors.
 GPUs have 4 redundant processors out of 84 for the same reason.
Obviously, the presence of redundant entries can be used to boost
the yield significantly.

79
Cost Versus Price
 Margin between the cost to manufacture a
product and the price the product sells for has
been shrinking.
 Those margins pay for
 company’s research and development (R&D),
 marketing,
 sales,
 manufacturing equipment maintenance,
 building rental,
 cost of financing,
 Pretax profits, and taxes.

80
Cost of Manufacturing Versus Cost of Operation
 Before
 cost meant the cost to build a computer
 price meant price to purchase a computer.
 With the advent of WSCs,
 capital expenses (CAPEX):

tens of thousands of servers,
 operational expenses (OPEX):

the cost to operate the computers

81
(CAPEX) & (OPEX)

82

83

84

85

86

87
Dependability
 Before :
 ICs were one of the most reliable components
of a computer.

Although their pins may be vulnerable, and faults may occur
over communication channels, the failure rate
inside the chip was very low.
 Now,
 because of feature sizes of 16 nm and
smaller,

Transient faults and permanent faults are
becoming more commonplace.

88
Dependability
 Service level agreements (SLAs)

an SLA could be used to decide whether
the system was up or down.

89
Dependability
 Systems alternate between two states:
1. Service accomplishment:
where the service is delivered as specified.
2. Service interruption:
where the delivered service is different from the SLA
 Transitions between these two states are
caused by

Failures (from state 1 to state 2)

Restorations (2 to 1).

90
Dependability
 Quantifying these transitions leads to the
two main measures of dependability:
 Module reliability
 a measure of the continuous service accomplishment

the time to failure from a reference initial instant.
 Module availability
 a measure of the service accomplishment with respect
to the alternation between the two states of
accomplishment and interruption.

91
Dependability
 Module reliability
 Mean time to failure (MTTF)

mean time to failure
 FIT (=1/MTTF)

failures in time
 rate of failures, generally reported as failures per billion
hours of operation
 Mean time to repair (MTTR)
 Mean time between failures (MTBF) = MTTF + MTTR
 Module Availability = MTTF / MTBF
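
The two relations above, MTBF = MTTF + MTTR and availability = MTTF / MTBF, are easy to evaluate directly. The sketch below uses hypothetical names and an assumed module with a 1,000,000-hour MTTF and a 24-hour MTTR.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is delivered: MTTF / (MTTF + MTTR)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Assumed module: fails on average every 1,000,000 hours, takes 24 hours to repair.
print(mtbf(1_000_000, 24))                    # 1000024
print(round(availability(1_000_000, 24), 6))  # 0.999976
```

Availability depends on both terms: halving the repair time helps about as much as doubling the MTTF when MTTR is small relative to MTTF.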

92
Dependability
 Assume a disk subsystem with the following components
and MTTF:
 10 disks, each rated at 1,000,000-hour MTTF
 1 ATA controller, 500,000-hour MTTF
 1 power supply, 200,000-hour MTTF
 1 fan, 200,000-hour MTTF
 1 ATA cable, 1,000,000-hour MTTF
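
The slide lists the components but not the result. A minimal sketch of the calculation follows, assuming that component lifetimes are exponentially distributed and that failures are independent, so the system failure rate is the sum of the component failure rates. The variable and function names are illustrative.

```python
# (component, count, MTTF in hours) taken from the list above
components = [
    ("disk", 10, 1_000_000),
    ("ATA controller", 1, 500_000),
    ("power supply", 1, 200_000),
    ("fan", 1, 200_000),
    ("ATA cable", 1, 1_000_000),
]


def system_failure_rate(parts) -> float:
    """Failures per hour for a system that fails when any one part fails."""
    return sum(count / mttf for _, count, mttf in parts)


rate = system_failure_rate(components)
fit = rate * 1e9          # failures per billion hours
system_mttf = 1.0 / rate  # hours

print(round(fit))          # 23000
print(round(system_mttf))  # 43478
```

The subsystem MTTF of roughly 43,500 hours is a little under five years, far lower than that of any single component.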

93
Dependability
 Redundancy
 The primary way to cope with failure

in time (repeat the operation to see if it still
is erroneous)

in resources (have other components to
take over from the one that failed).

94
Dependability
 Redundancy example
 Assume that one power supply is sufficient to run the disk subsystem
and that we are adding one redundant power supply.
 2 power supplies and independent failures
 MTTF for redundant power supplies
 Mean time until one of the two supplies fails: MTTF_power supply / 2
 MTTF_pair: the mean time until one power supply fails divided by the chance that
the other will fail before the first one is replaced.
 The probability of a second failure is MTTR over the mean time until the other
power supply fails.
 24 hours to notice that a power supply has failed and to replace it
 4150 times more reliable than a single power supply
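
The reasoning on this slide can be written out as a short calculation. This sketch assumes independent failures and the 200,000-hour power supply MTTF from the earlier disk subsystem example; the function name is hypothetical.

```python
def mttf_redundant_pair(mttf_single: float, mttr: float) -> float:
    """MTTF of a pair where one unit suffices and a failed unit is replaced within MTTR."""
    time_to_first_failure = mttf_single / 2      # either of the two supplies may fail
    prob_second_failure = mttr / mttf_single     # chance the survivor fails before the repair
    return time_to_first_failure / prob_second_failure


pair_mttf = mttf_redundant_pair(200_000, 24)
improvement = pair_mttf / 200_000

print(round(pair_mttf))    # 833333333
print(round(improvement))  # 4167
```

The unrounded ratio is about 4167; the figure of 4150 quoted above is consistent with first rounding the pair MTTF to 830,000,000 hours.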

95
Measuring Performance
 Typical performance metrics:
 Response time: execution time
 Throughput
 Speedup of X relative to Y
 Execution time_Y / Execution time_X
 Execution time
 the time between the start and the completion of an event
 Wall clock time: includes all system overheads

storage accesses, memory accesses, input/output activities, operating
system, …
 CPU time: only computation time

96
Benchmarks
 Kernels (e.g. matrix multiply)
 Toy programs (e.g. sorting)
 Synthetic benchmarks (e.g. Dhrystone)
 Benchmark suites (e.g. SPEC06fp, TPC-C)
 Standard test suites
 CPU tests: mathematical operations, compression, encryption, physics.
 2D graphics tests: vectors, bitmaps, fonts, text, and GUI elements.
 3D graphics tests: DirectX 9 to DirectX 12 in 4K resolution, DirectCompute and
OpenCL.
 Disk tests: reading, writing, and seeking within disk files, plus IOPS.
 Memory tests: memory access speeds and latency.

97
Benchmarks

98
Principles of Computer Design
 Take Advantage of Parallelism
 e.g. multiple processors, disks, memory banks,
pipelining, multiple functional units
 ILP, DLP, TLP, RLP
 Principle of Locality
 Reuse of data and instructions
 a program spends 90% of its execution time in only 10% of the
code.
 Focus on the Common Case : energy, resource allocation,
and performance.
 The instruction fetch and decode unit of a processor may be used much more
frequently than a multiplier, so optimize it first.
 Amdahl’s Law

99
Amdahl’s Law

100
Amdahl’s Law
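
Amdahl's Law states that the overall speedup from an enhancement is limited by the fraction of execution time that can use it: speedup = 1 / ((1 - fraction enhanced) + fraction enhanced / speedup enhanced). A minimal sketch follows; the function name and the example numbers are assumptions for illustration.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the time runs `speedup_enhanced` times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical case: 40% of execution time is made 10 times faster.
print(round(amdahl_speedup(0.4, 10), 2))   # 1.56

# Even an enormous speedup of that 40% cannot beat 1 / (1 - 0.4).
print(round(amdahl_speedup(0.4, 1e9), 2))  # 1.67
```

The second call illustrates the law's corollary: the unenhanced fraction sets an upper bound on the achievable speedup.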

101
 The Processor Performance Equation

102
Principles
 Different instruction types having different
CPIs

103
 Example: Suppose we made the following measurements:
 Frequency of FP operations=25%
 Average CPI of FP operations=4.0
 Average CPI of other instructions=1.33
 Frequency of FSQRT=2%
 CPI of FSQRT=20
 Compare these two design alternatives:
 decrease the CPI of FSQRT to 2
 decrease the average CPI of all FP operations to 2.5.
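
The comparison can be worked through with the processor performance equation, CPU time = instruction count x CPI x clock cycle time. The sketch below assumes the instruction count and clock rate are the same for both alternatives, so only the average CPI changes; the helper name is illustrative.

```python
def average_cpi(mix) -> float:
    """Weighted CPI for a list of (instruction frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in mix)


# Original design: 25% FP operations at CPI 4.0, 75% other instructions at CPI 1.33.
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])

# Alternative 1: FSQRT (2% of instructions) improves from CPI 20 to CPI 2.
cpi_new_fsqrt = cpi_original - 0.02 * (20 - 2)

# Alternative 2: all FP operations improve from CPI 4.0 to CPI 2.5.
cpi_new_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])

print(round(cpi_original, 4), round(cpi_new_fsqrt, 4), round(cpi_new_fp, 4))
print(round(cpi_original / cpi_new_fsqrt, 2), round(cpi_original / cpi_new_fp, 2))
```

Under these assumptions, improving all FP operations gives the slightly lower CPI (about 1.62 versus 1.64) and therefore the slightly better speedup (about 1.23 versus 1.22).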

104

105
Fallacies and Pitfalls
 All exponential laws must come to an end
 Dennard scaling (constant power density)

Stopped by threshold voltage
 Disk capacity

30-100% per year to 5% per year
 Moore’s Law

Most visible with DRAM capacity

ITRS disbanded

Only four foundries left producing state-of-the-art
logic chips

11 nm, 3 nm might be the limit

106
 Multiprocessors are a silver bullet
 Performance is now a programmer’s burden
 Falling prey to Amdahl’s Law
 A single point of failure
 Hardware enhancements that increase
performance also improve energy
efficiency, or are at worst energy neutral
 Benchmarks remain valid indefinitely
 Compiler optimizations target benchmarks

107
 The rated mean time to failure of disks is
1,200,000 hours or almost 140 years, so
disks practically never fail
 MTTF values from manufacturers assume
regular replacement
 Peak performance tracks observed
performance
 Fault detection can lower availability
 Not all operations are needed for correct
execution

108
