
Mapping Data to Processors
2
 Processor arrays and multi-computers are characterized by a non-uniform memory structure:
 each processor is able to get data from local memory much faster
than from nonlocal memory.
 When designing algorithms for these machines, it makes sense to
have processors manipulate local data as much as possible.
 For this reason, the distribution of parallel data structures often
dictates which processor is responsible for performing a particular
operation.

 An algorithm's data manipulation patterns can be represented as a
graph:
 each vertex represents a data subset allocated to the same local memory,
 and each edge represents a computation involving data from two data
sets.
 These graphs are often regular.
An important goal of a parallel algorithm designer is to map
the algorithm graph into the corresponding graph of the target
machine's processor organization (Bokhari 1981).
Mapping Data to Processors
3
Performance may suffer if the algorithm graph is not
a sub-graph of the parallel architecture's processor
organization.
 Suppose that on a multicomputer two connected vertices in the
algorithm graph map to a pair of vertices distance two apart on
the machine.
 Passing a message from one processor to the other then requires
roughly twice the time it takes to pass a message between
adjacent processors.
Mapping Data to Processors
4
 Suppose a parallel algorithm is implemented on a
multicomputer with circuit-switched routing, and two
different edges in the algorithm graph map to the same
link on the machine.
 If simultaneous communications occur between both pairs
of nodes, the speed of each communication is reduced by
the shared use of the same link.
Mapping Data to Processors
5
 On some systems one message would be blocked until the other message had finished.
 On other systems the physical link would be multiplexed between the two virtual links,
cutting the bandwidth of each link in half.
 In either case, performance is lower because each message does not have exclusive use of a
communication path.

Mapping Data to Processors
6
Definition 5.1:
• An embedding of a graph G = (V, E) into a graph G' = (V', E') is a
function φ from V to V'.
Definition 5.2:
• Let φ be the function that embeds graph G = (V, E) into graph
G' = (V', E'). The dilation of the embedding is defined as follows:
dil(φ) = max{dist(φ(u), φ(v)) : (u, v) ∈ E}, where dist(a, b) is the distance between
vertices a and b in G'.
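The two definitions above can be checked mechanically. Below is a minimal sketch (the function names `dist` and `dilation` are ours, not from the text) that computes dil(φ) for a small embedding, measuring distances in G' by breadth-first search:

```python
from collections import deque

def dist(adj, a, b):
    """Breadth-first-search distance between vertices a and b in G'."""
    seen, frontier, d = {a}, deque([a]), {a: 0}
    while frontier:
        v = frontier.popleft()
        if v == b:
            return d[v]
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                d[w] = d[v] + 1
                frontier.append(w)
    raise ValueError("vertices are not connected in G'")

def dilation(g_edges, gp_adj, phi):
    """dil(phi): maximum over edges (u, v) of G of dist(phi(u), phi(v)) in G'."""
    return max(dist(gp_adj, phi[u], phi[v]) for u, v in g_edges)
```

For example, embedding a 4-node ring into a 4-node path with the identity function forces the ring edge (3, 0) to stretch across the whole path, giving dilation 3.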

Mapping Data to Processors
7
There exists a dilation-1 embedding of G into G' if G is a subgraph of G'.
dilation-3 embedding of G into G'    dilation-1 embedding of G into G'

Definition 5.3:
• The load of an embedding φ : G → G' is
the maximum number of vertices of G
that are mapped to a single vertex of G'.
8
Mapping Data to Processors

9

 The ring and the mesh have the same number of vertices.
 A dilation-1 embedding exists if the mesh has an even number of rows
and/or columns.
 If the mesh has an odd number of rows and an odd number of
columns, there is no way to embed the ring without increasing the dilation.
10
Embedding a Ring into 2D-Mesh
[Figure: a dilation-1 embedding of a 20-node ring, positions 0–19, snaking through a 2-D mesh.]

Embedding a 2D-Mesh into 2D-Mesh
11
Ar : the number of rows in the algorithm graph
Ac : the number of columns in the algorithm graph
Mr : the number of rows in the machine graph
Mc : the number of columns in the machine graph
The algorithm graph can be embedded with dilation 1 in the
machine graph if and only if
Ar <= Mr and Ac <= Mc, or
Ac <= Mr and Ar <= Mc
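The condition reduces to a one-line predicate. In this sketch `mesh_fits` is a hypothetical helper name; the two disjuncts correspond to placing the algorithm mesh directly or rotated 90 degrees:

```python
def mesh_fits(a_r, a_c, m_r, m_c):
    """True when an a_r x a_c algorithm mesh has a dilation-1 embedding
    in an m_r x m_c machine mesh: it fits directly or after rotation."""
    return (a_r <= m_r and a_c <= m_c) or (a_c <= m_r and a_r <= m_c)
```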

Embedding a 2D-Mesh into 2D-Mesh
12
Ar <= Mr and Ac <= Mc, or
Ac <= Mr and Ar <= Mc

Complete Binary Tree into 2D-Mesh
13
dilation-1 embedding of a complete binary tree of height 3 into a 2-D mesh.
[Figure: the 15 tree nodes, numbered 1–15, placed on a 4 × 4 mesh.]

Complete Binary Tree into 2D-Mesh
14
Theorem 5.1:
• A complete binary tree of height greater than 4 cannot be embedded in
a 2-D mesh without increasing the dilation beyond 1.
Proof:
• The total number of mesh points k or fewer hops away from an arbitrary
point in a 2-D mesh is 2k^2 + 2k + 1.
• The total number of nodes in a complete binary tree of height k is 2^(k+1) - 1.
• 2^(k+1) - 1 > 2k^2 + 2k + 1 for k > 4, so for height greater than 4 the tree
has more nodes than there are mesh points within distance k of any point.
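The proof's inequality can be checked numerically (the helper names are ours):

```python
def mesh_points(k):
    """Mesh points within k hops of a point in a 2-D mesh: 2k^2 + 2k + 1."""
    return 2 * k * k + 2 * k + 1

def tree_nodes(k):
    """Nodes in a complete binary tree of height k: 2^(k+1) - 1."""
    return 2 ** (k + 1) - 1
```

At k = 4 the mesh still has room (31 tree nodes vs. 41 mesh points), but from k = 5 onward the exponential term wins (63 vs. 61) and keeps winning.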

Complete Binary Tree into 2D-Mesh
 The H-tree is a common way of embedding a complete binary tree into
a 2-D mesh. The name H-tree arises from the recursive "H" pattern
used to construct a large H-tree out of four small H-trees.
15

Complete Binary Tree into 2D-Mesh
16
Theorem 5.2:
• A complete binary tree of height n has a dilation-⌈n/2⌉ embedding in a
2-D mesh.
Proof:
• We use H-trees to map binary trees to the nodes of a 2-D mesh. The
longest edges in an H-tree are the edges from the root to its two
children, and these edges have the same length. The length of the root
edges in an H-tree of height n is ⌈n/2⌉.

Binomial Tree into 2D-Mesh
Construction of a binomial tree of height k from two binomial trees of height k − 1.

18
Binomial Tree into 2D-Mesh
[Figure: the 16 nodes of the binomial tree, numbered 1–16, laid out on a 4 × 4 mesh.]
dilation-1 embedding of a binomial tree of height 4 into a 2-D mesh.

Theorem 5.3:
• A binomial tree of height greater than 4 cannot be embedded in a 2-D mesh
without increasing the dilation beyond 1.
Proof:
• The root node of a binomial tree of height d is connected to d other
nodes. No node in a 2-D mesh has more than 4 neighbours. Hence a
binomial tree of height greater than 4 cannot be embedded in a 2-D mesh
without increasing the dilation beyond 1.
19
Binomial Tree into 2D-Mesh

Theorem 5.4:
• A binomial tree of height n has a dilation-⌈n/2⌉
embedding in a 2-D mesh.
Proof:
• The construction is illustrated in the figure.
20
Binomial Tree into 2D-Mesh

21
Binomial Tree into 2D-Mesh

22

Definition 5.4:
• A graph G is cubical if there is a dilation-1 embedding of G
into a hypercube.
Theorem 5.5:
• The problem of determining whether an arbitrary graph G is
cubical is NP-complete (Afrati et al. 1985; Cybenko et al. 1986).
23
Embedding Graphs Into Hypercubes

Theorem 5.6: A dilation-1 embedding of a connected graph G into a
hypercube of dimension n exists if and only if it is possible to label the
edges of G with the integers {1, ..., n} such that:
1. Edges incident with a common vertex have different labels.
2. In every path of G at least one label appears an odd number of times.
3. In every cycle of G no label appears an odd number of times.
24
Embedding Graphs Into Hypercubes

25
Embedding Graphs Into Hypercubes

Theorem 5.7:
• A dilation-1 embedding of a complete binary tree of height n into a
hypercube of dimension n + 1 does not exist if n > 1.
Proof:
• A complete binary tree of height n has 2^(n+1) - 1 nodes.
• A hypercube of dimension n + 1 has 2^(n+1) nodes.
• Let the root of the tree be mapped to node X of the cube, and let n be the height of the tree.
26
Complete Binary Tree Into Hypercube

Proof (continued):
• A hypercube is a bipartite graph.
• Half the nodes can be reached only by following an even number of edges from X,
and half only by following an odd number of edges from X.
• The deepest level of the tree alone contains 2^n nodes, which is more than half of the
tree's 2^(n+1) - 1 nodes, and all of these nodes lie at the same distance parity from the root.
• So whether n is odd or even, more than half of the tree's nodes must map to hypercube
nodes of a single parity, but only exactly half the hypercube's nodes have that parity.
Hence there is no way to embed the binary tree into the hypercube and keep the dilation at 1.
27
Complete Binary Tree Into Hypercube

Theorem 5.8:
• A balanced binary tree of height n has a dilation-1
embedding into a hypercube of dimension n + 2 (see Nebesky 1974).
Theorem 5.9:
• A complete binary tree of height n has a dilation-2 embedding in a
hypercube of dimension n + 1, for all n > 1 (see Leighton 1992).
28
Complete Binary Tree Into Hypercube

Theorem 5.10:
• A binomial tree of height n can be embedded in a hypercube of dimension n such that the
dilation is 1.
Proof:
• Organize the sub-trees so that nodes that are roots of larger sub-trees appear to the left of
nodes that are roots of smaller sub-trees.
• Give the edge to the leftmost child of the root node the label 1, the edge to the second child
the label 2, and so on, up to the edge to the last child, which gets label n.
• For all remaining interior nodes of the tree, if the edge above a node has label i, then the
edges to the k children of that node are given labels i + 1, i + 2, ..., i + k, from left to right.
29
Binomial Tree Into Hypercube
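One concrete dilation-1 labelling can be verified directly: if the 2^n tree nodes are given hypercube addresses such that each node i > 0 hangs off i with its lowest set bit cleared, every tree edge joins addresses differing in exactly one bit. This labelling is an assumption consistent with the proof's construction, not taken verbatim from it:

```python
def binomial_tree_edges(n):
    """Edges of a binomial tree of height n on hypercube labels 0..2^n - 1.
    Node i > 0 is attached to i with its lowest set bit cleared."""
    return [(i & (i - 1), i) for i in range(1, 2 ** n)]

def is_hypercube_edge(a, b):
    """True when a and b differ in exactly one bit position."""
    x = a ^ b
    return x != 0 and x & (x - 1) == 0
```

Every edge of the resulting tree is a hypercube edge, and the root (node 0) has exactly n children, the n single-bit addresses.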

30
Binomial Tree Into Hypercube

31
Binomial Tree Into Hypercube
[Figure: dilation-1 embedding of a binomial tree of height 3, nodes 0–7, into a 3-dimensional hypercube.]

32
Ring Into Hypercube
[Figure: an 8-node ring, positions 0–7, and a 3-dimensional hypercube with nodes labeled 0–7.]

 Assume the hypercube contains p = 2^d processors.
 Let G(i) be the number of the processor assigned to ring
position i, where 0 <= i < p.
 The naive assignment G(0) = 0, G(1) = 1, G(2) = 2, ..., G(p - 1) = p - 1
will not work, because consecutive integers can differ in more than one bit position.
 The encoding must have the following properties:
 1. The values must be unique; in other words, G(i) ≠ G(j) for i ≠ j.
 2. G(i) and G(i + 1) must differ in exactly one bit position, for
all i, where 0 <= i < p - 1.
 3. G(p - 1) and G(0) must differ in exactly one bit position.
33
Ring Into Hypercube

Gray Code
• A Gray code is an ordering of binary numerals in which two successive values differ in only
one bit position.
• There are many possible n-bit Gray codes.
Gray Code Generation
• Longer Gray codes can be constructed from shorter Gray codes.
• Given a d-bit Gray code, a (d + 1)-bit Gray code can be constructed by listing the
d-bit Gray code with the prefix 0, followed by the d-bit Gray code in reverse
order with the prefix 1.
34
Ring Into Hypercube
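The reflect-and-prefix construction described above can be sketched as follows (the function name `gray_code` is ours):

```python
def gray_code(d):
    """Binary-reflected Gray code on d bits: the (d-1)-bit code with
    prefix 0, then the reversed (d-1)-bit code with prefix 1."""
    if d == 0:
        return [0]
    prev = gray_code(d - 1)
    return prev + [x | (1 << (d - 1)) for x in reversed(prev)]
```

For d = 3 this reproduces the sequence used on the following slides: 0, 1, 3, 2, 6, 7, 5, 4, and consecutive entries (including the wraparound) differ in one bit.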

 A one bit Gray code is {0,1}
 G(0) = 0 and
 G(1) = 1
35
Gray Code Generation

36
Gray Code Generation
 A two-bit Gray code can be generated from two one-bit Gray codes: {00, 01, 11, 10}
G(0) = 0
G(1) = 1
G(2) = 3
G(3) = 2

37
Gray Code Generation
 A three-bit Gray code can be generated from two two-bit Gray codes: {000, 001, 011, 010, 110, 111, 101, 100}
G(0) = 0
G(1) = 1
G(2) = 3
G(3) = 2
G(4) = 6
G(5) = 7
G(6) = 5
G(7) = 4

38
Gray Code Generation
G(i):  G(0) = 0, G(1) = 1, G(2) = 3, G(3) = 2, G(4) = 6, G(5) = 7, G(6) = 5, G(7) = 4
G⁻¹(i): G⁻¹(0) = 0, G⁻¹(1) = 1, G⁻¹(2) = 3, G⁻¹(3) = 2, G⁻¹(4) = 7, G⁻¹(5) = 6, G⁻¹(6) = 4, G⁻¹(7) = 5
successor(i): successor(0) = 1, successor(1) = 3, successor(2) = 6, successor(3) = 2,
successor(4) = 0, successor(5) = 4, successor(6) = 7, successor(7) = 5
The ring traversal through the hypercube is 0-1-3-2-6-7-5-4-0.
successor(i) = ?
X = G⁻¹(i)
Y = X + 1
successor(i) = G(Y) = G(X + 1) = G(G⁻¹(i) + 1)

Gray Code G(i)
Input: ring position i. Output: hypercube node.
Inverse Gray Code G⁻¹(i)
Input: hypercube node i. Output: ring position.
G⁻¹(i) = j if and only if G(j) = i.
Successor(i)
i is a hypercube node.
successor(i) = 0 if G⁻¹(i) = 2^d − 1, and G(G⁻¹(i) + 1) otherwise.
39
Gray Code Generation

40
Gray Code Generation
i        i/2        i XOR i/2        G(i)
0: 000 0: 000 000: 0 G(0) = 0
1: 001 0: 000 001: 1 G(1) = 1
2: 010 1: 001 011: 3 G(2) = 3
3: 011 1: 001 010: 2 G(3) = 2
4: 100 2: 010 110: 6 G(4) = 6
5: 101 2: 010 111: 7 G(5) = 7
6: 110 3: 011 101: 5 G(6) = 5
7: 111 3: 011 100: 4 G(7) = 4
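The table shows that G(i) is simply i XOR ⌊i/2⌋, a standard closed form for the binary-reflected Gray code, which gives a one-line implementation:

```python
def gray(i):
    """Binary-reflected Gray code: G(i) = i XOR floor(i / 2)."""
    return i ^ (i >> 1)
```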

41
Gray-1 Code Generation
The inverse is computed iteratively: start with Ans = i and mask = i shifted right by
one bit, then repeat Ans = Ans XOR mask, mask = mask >> 1, until mask is 0.
For example, for i = 7: Ans goes 111 → 100 → 101 while mask goes 011 → 001 → 000,
so G⁻¹(7) = 5.
G⁻¹(0) = 0
G⁻¹(1) = 1
G⁻¹(2) = 3
G⁻¹(3) = 2
G⁻¹(4) = 7
G⁻¹(5) = 6
G⁻¹(6) = 4
G⁻¹(7) = 5
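The shift-and-XOR loop traced above, together with the successor function from the previous slides, can be sketched as follows (function names are ours; the modulo handles the wrap from the last ring position back to position 0):

```python
def gray(i):
    # G(i) = i XOR floor(i/2)
    return i ^ (i >> 1)

def gray_inverse(i):
    """Inverse Gray code: fold in right-shifted copies of the input."""
    ans, mask = i, i >> 1
    while mask:
        ans ^= mask
        mask >>= 1
    return ans

def successor(i, d):
    """Hypercube node holding the next ring position after node i,
    in a 2**d-node ring embedded by the Gray code."""
    return gray((gray_inverse(i) + 1) % (1 << d))
```

Starting from node 0 and applying `successor` repeatedly reproduces the traversal 0-1-3-2-6-7-5-4 and then returns to 0.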

42
Mesh into Hypercube
 The use of Gray codes yields a straightforward
solution, with the constraint that the size of the
mesh in each dimension must be a power of 2.
 Each dimension of the mesh is assigned an
appropriate number of bit positions of the
encoding string.
 Traversing mesh nodes along that
dimension yields a cycle.
 Gray codes determine the values assigned to
the bit field.

43
Mesh into Hypercube
 For example, consider mapping a 4 x 8 mesh into a 32-node hypercube.
 Two bit positions are reserved for the row and three for the column.
 Let us assume that the first two bit positions are used for the row. The 2-bit Gray
code {00, 01, 11, 10} corresponds to a traversal through rows 0, 1, 2, and 3.
 The 3-bit Gray code {000, 001, 011, 010, 110, 111, 101, 100} corresponds to a
traversal through columns 0, 1, 2, 3, 4, 5, 6, and 7.
 Hence we have the following mapping of a 4 x 8 mesh into a 32-node hypercube:
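The mapping can be sketched by concatenating the two Gray codes; the helper name `mesh_to_hypercube` and the placement of the row bits in the high-order positions follow the slide's assumption:

```python
def gray(i):
    # Binary-reflected Gray code: i XOR floor(i/2).
    return i ^ (i >> 1)

def mesh_to_hypercube(row, col, col_bits=3):
    """Hypercube node for mesh position (row, col): the row Gray code
    in the high-order bits, the column Gray code in the low-order bits."""
    return (gray(row) << col_bits) | gray(col)
```

Mesh neighbours then map to hypercube neighbours: moving one row or one column changes exactly one Gray-code bit, so the embedding has dilation 1.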

44
Mesh into Hypercube

Theorem 5.11:
• Any two-dimensional mesh with n
vertices can be embedded in a
⌈log n⌉-dimensional hypercube with
dilation 2 (Leighton 1992).
45
Mesh into Hypercube

46

 For multicomputers it has been assumed that if data were
distributed evenly among the local memories of the
processors, the processors' workloads would be balanced
for the entire computation.
 In many cases this assumption is true, but not always.
 If nothing is done to change the size of each processor's area
of responsibility, the processors' workloads may become
severely imbalanced.
47
Dynamic Load Balancing in Multi-computer

• Dynamic load balancing is the process of making changes to
the distribution of work among the processors at run time.
• The measure of success of dynamic load balancing is the net
reduction of execution time achieved by applying the load
balancing algorithm.
• Dynamic load balancing may increase the execution time of the
parallel algorithm if the time spent performing the load
balancing is more than the time saved by reducing the variance
in the execution time of tasks on the various processors.
Dynamic Load Balancing
48
Dynamic Load Balancing in Multi-computer

49
Centralized Load Balancing Algorithms
 Centralized algorithms make a global decision about the
reallocation of work to processors.
 Some centralized algorithms assign the maintenance
of the system's global state information to a particular
node.
 Global state information can allow the algorithm to do
a good job balancing the work among the processors.
 However, this approach does not scale well, because
the amount of information increases linearly with the
number of processors.

50
Fully Distributed Load Balancing Algorithms
 Each processor builds its own view of the state of the system.
 Processors exchange information with neighbouring processors and
use this information to make local changes in the allocation of work.
 A fully distributed algorithm has the advantage of lower scheduling
overhead.
 However, since processors have only local state information, the
workload may not be balanced as well as it would be by a centralized
algorithm.
51
Semi Distributed Load Balancing Algorithms
 A semi-distributed load balancing algorithm divides the
processors into regions.
 Within each region a centralized algorithm distributes the
workload among the processors.
 A higher level scheduling mechanism balances the work load
between regions.

52
Dynamic Load Balancing
Load Balancing
Sender Initiated: a processor with too much work sends some work to
another processor. Sender-initiated algorithms perform better in an
environment with a light to medium workload per processor.
Receiver Initiated: a processor with too little work takes some work
from another processor. Receiver-initiated algorithms perform better
in an environment with a heavy workload per processor. Task migration
can be expensive if the receiver grabs a partially completed task.

53

54
Scheduling on UMA Multiprocessors

First
• Static scheduling sometimes results in lower execution times than
dynamic scheduling.
Second
• Static scheduling can allow the generation of only one process per
processor, reducing process creation, synchronization, and termination
overhead.
Third
• Static scheduling can be used to predict the speedup that can be achieved
by a particular parallel algorithm on a target machine, assuming no
preemption of processes occurs.
55
Static Vs. Dynamic Scheduling

 One way to view a parallel algorithm is as a collection of
tasks, some of which must be completed before others begin.
 In a deterministic model, the execution time needed by each
task and the precedence relations between the tasks are
fixed and known in advance.
 This information can be represented by a directed graph
called a task graph, which ignores variances in tasks'
execution times due to interrupts, contention for shared
memory, etc.
 Task graphs do provide a basis for the static allocation of
tasks to processors.
56
Deterministic Static Scheduling

Schedule
• A schedule is an allocation of tasks to processors. Schedules are often
illustrated with Gantt charts.
Gantt Chart
• A Gantt chart indicates the time each task spends in
execution, as well as the processor on which it
executes. A desirable feature of Gantt charts is that
they graphically illustrate the utilization of each
processor (the percentage of time spent executing tasks).
57
Deterministic Static Scheduling

58
Deterministic Static Scheduling
 Some simple scheduling problems are solvable in polynomial time, while other
problems, only slightly more complex, are intractable.
 For example, if all of the tasks take unit time, and the task graph is a forest, then a
polynomial time algorithm exists to find an optimal schedule (Hu 1961).
 If all of the tasks take unit time, and the number of processors is two, then a
polynomial time algorithm exists to find an optimal schedule (Coffman and Graham 1972).
 If the task lengths vary at all, or if there are more than two processors, then the
problem of finding an optimal schedule is NP-hard (Ullman 1975), meaning the only
known algorithms that find an optimal schedule require exponential time in the
worst case.

59
Deterministic Static Scheduling
 In general we are interested in scheduling arbitrary task graphs
onto a reasonable number of processors, with polynomial time
scheduling algorithms that do a good, but not perfect, job.
 Given a list of tasks ordered by their relative priority, it is possible to
assign tasks to processors by always giving each available processor
the first unassigned task on the list whose predecessor tasks have
already finished execution.
 This list-scheduling algorithm was proposed by Graham (1966,
1969, 1972), and we formalize it next.

 Let T = {T1, T2, ..., Tn} be a set of tasks.
 Let τ : T → (0, ∞) be a function that associates an execution time with each task.
 We are also given a partial order ≺ on T.
 Let L be a list of the tasks in T.
 Whenever a processor has no work to do, it instantaneously removes from L
the first ready task; that is, an unscheduled task whose predecessors under ≺
have all completed execution.
 If two or more processors simultaneously attempt to execute the same task,
the processor with the lowest index succeeds, and the other processors look
for another suitable task.
60
Graham‘s List scheduling Algorithm
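The rule above can be simulated directly. This is a minimal sketch (data-structure choices and the name `list_schedule` are ours; since the processors are identical, assigning ready tasks in list order gives the same schedule length as breaking ties by processor index):

```python
import heapq

def list_schedule(times, preds, order, p):
    """Graham's list scheduling: times[t] is task t's execution time,
    preds[t] its predecessor set, order the priority list L, and p the
    number of processors. Returns the length of the schedule."""
    done = set()
    running = []                      # heap of (finish_time, task)
    unscheduled = list(order)
    free, now = p, 0.0
    while unscheduled or running:
        # Give each idle processor the first ready task on the list.
        for t in list(unscheduled):
            if free == 0:
                break
            if preds[t] <= done:
                unscheduled.remove(t)
                heapq.heappush(running, (now + times[t], t))
                free -= 1
        # Advance time to the next task completion.
        now, t = heapq.heappop(running)
        done.add(t)
        free += 1
    return now
```

For a hypothetical task set with times {A: 2, B: 2, C: 1} and C depending on A and B, two processors finish at time 3 (A and B in parallel, then C), while one processor needs time 5.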

61
Graham‘s List scheduling Algorithm
L = (T1, T2, T3, T4, T5 , T6 , T7 )
P = 3
[Gantt chart: the schedule of tasks T1–T7 on processors P0, P1, and P2 over time steps 1–9; T1 runs first, T2–T6 overlap in the middle, and T7 finishes at time 9.]

Graham‘s List scheduling Algorithm
It is expected that the length of the schedule should decrease by
 increasing the number of processors,
 decreasing the execution times of one or more tasks, or
 eliminating some of the precedence constraints.
62

Graham‘s List scheduling Algorithm
63
An example illustrating that increasing the number of processors can increase
the length of schedule generated using Graham’s heuristic.

Graham‘s List scheduling Algorithm
64
An example illustrating that decreasing the execution times of one or more tasks
can increase the length of schedule generated using Graham’s heuristic.

Graham‘s List scheduling Algorithm
65
An example illustrating that eliminating some of the precedence constraints can
increase the length of schedule generated using Graham’s heuristic.

 Graham's list-scheduling algorithm depends upon a
prioritized list of tasks to execute.
 A well-known and intuitive scheduling algorithm due to
Coffman and Graham (1972) constructs the list of tasks for
the simple case when all tasks take the same amount of time.
 Once this list L has been constructed, the algorithm applies
Graham's list-scheduling algorithm, already described.
Coffman & Graham’s List Construction Algo.
66

 Let T = {T1, T2, ..., Tn} be a set of n unit-time tasks to be executed on p processors.
 Let ≺ be a partial order on T that specifies which tasks must complete before
other tasks begin.
 If Ti ≺ Tj then task Ti is an immediate predecessor of task Tj and Tj is an
immediate successor of Ti.
 Let S(Ti) denote the set of immediate successors of Ti.
 Let α(T) be an integer label assigned to T.
 N(T) denotes the decreasing sequence of integers formed by
ordering the set {α(T') | T' ∈ S(T)}.
Coffman & Graham’s List Construction Algo.
67

Coffman & Graham’s List Construction Algo.
68

Coffman & Graham’s List Construction Algo.
69

Coffman & Graham’s List Construction Algo.
70

Coffman & Graham’s List Construction Algo.
• If  is the length of a schedule produced by the Coffman-Graham
algorithm and 0 is the length of an optimal schedule, then /0 <2 -
2/p, where p is the number of processors and p 2 (see Lam and
Sethi 1977).
Theorem 5.12.
• The Coffman-Graham algorithm generates an optimal schedule if the
number of processors is two (see Lan and Sethi 1977).
Corollary 5.1.
71

 In a nondeterministic model, the execution time of a task
is represented by a random variable, making the
scheduling problem more difficult.
 This subsection summarizes mathematics developed by
Robinson (1979) that allow an estimate of the execution
time of parallel programs with "simple" task graphs on
UMA multiprocessors.
72
Nondeterministic Model

Initial Tasks: Tasks with no predecessors are called initial tasks.
Independent Tasks: A set of tasks is independent if, for every pair of tasks Ti and Tj in the set,
neither is a predecessor of the other.
Width: The width of a task graph is the size of the maximal set of independent tasks.
Chain: A chain is a totally ordered task graph.
Chain Length: The length of a chain is the number of tasks in the chain.
Level: The level of a task T in a task graph G is the maximum chain length in G from
an initial task to T.
Depth: The depth of a task graph G is the maximum level of any task in G.
73
Nondeterministic Model

74
Nondeterministic Model
Definition 5.5:
• Given a task graph G, let C1, C2, ..., Cm be all the m chains from initial
to final tasks in G.
• For every chain Ci, consisting of tasks Ti1, Ti2, ..., Tij, let Xi be the
product xi1 xi2 ... xij, where x1, x2, ..., xn are polynomial variables.
• Then G is a simple task graph if the polynomial X1 + X2 + ... + Xm can
be factored so that every variable appears exactly once (see
Robinson 1979).

75
Nondeterministic Model
x1x2x4 + x1x3x4 + x1x3x5
x1[x2x4 + x3x4 + x3x5]
Simple Graph

76
Nondeterministic Model
x1x2x4 + x1x3x4
x1[x2 + x3] x4
Simple Graph

77
Nondeterministic Model
x1x3 + x1x4 + x2x3 + x2x4
x1[x3 + x4] + x2[x3 + x4]
[x1 + x2] [x3 + x4]
Simple Graph

78
Nondeterministic Model
x1x3 + x1x4 + x2x4
x1[x3 + x4] + x2x4
Non-simple Graph

79
Nondeterministic Model

80

 A set of active concurrent processes is said to be deadlocked
if each holds non-pre-emptible resources that must be
acquired by some other process in order to proceed.
 The potential for deadlock exists whenever multiple
processes share resources in an unsupervised way.
 Hence deadlock can exist in multi-programmed operating
systems as well as in multiprocessors and multi-computers.
81
Deadlock
 Consider the two processes executing simultaneously.
 Each process attempts to lock on two resources.
 Note that lock and unlock correspond to P and V
operations on binary semaphores.
 Process 1 locks A while process 2 locks B.
 Process 1 is blocked when it tries to lock B; likewise,
process 2 is blocked when it tries to lock A. Neither
process can proceed.
 If neither of the processes can be made to back up and yield its
semaphore, the two processes will remain deadlocked indefinitely.
82
Deadlock
Proc.-1        Proc.-2
  ...            ...
lock(A)        lock(B)
  ...            ...
lock(B)        lock(A)
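One standard prevention, noted later in these slides, is to impose a global ordering on the locks. A small sketch with Python threads (names are illustrative): both workers acquire A before B, so the cycle of waiting shown above cannot arise:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
results = []

def worker(name):
    # Every thread requests the locks in the same global order: A, then B.
    # This breaks the "cycle of waiting processes" condition, so the
    # deadlock in the table above cannot occur.
    with lock_a:
        with lock_b:
            results.append(name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("Proc-1", "Proc-2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```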
 Consider a multicomputer in which processors communicate asynchronously.
 When a processor sends a message to another processor, the message is stored in
a system buffer until the receiving processor reads it.
 Suppose so many processors are sending data to processor 0 that its system buffer fills up.
 Further attempts to send data are blocked until processor 0 reads one or more
messages, making room in the system buffer.
 Let processor i be one of the processors unable to send its message to processor 0.
 If processor 0 attempts to read the message sent by processor i, it will block until
the data appears in the system buffer.
 But processor i is blocked until processor 0 removes one or more messages from
the system buffer.
 The two processors are deadlocked.
83
Buffer - Deadlock

84
Buffer - Deadlock

85
Deadlock
Four necessary conditions for deadlock to occur (Coffman and Denning 1973):
Mutual exclusion: each process has exclusive use of its resources.
Non-pre-emption: a process never releases resources it holds until it is through using them.
Resource waiting: each process holds resources while waiting for other processes to release theirs.
A cycle of waiting processes: each process in the cycle waits for resources that the next process
owns and will not relinquish.

The problem of deadlock is commonly addressed in one of three ways.
• One approach is to detect deadlocks when they occur and try to
recover from them.
• Another approach is to avoid deadlocks by using advance
information about requests for resources to control allocation, so
that the next allocation of a resource will not cause processes to
enter a situation in which deadlock may occur.
• The third approach is to prevent deadlock by forbidding one of the
last three conditions listed above.
86
Deadlock

87
Deadlock
 A cycle of waiting processes can be prevented by ordering
shared resources and forcing processes to request
resources in that order.
 Deadlock can also be prevented by requiring processes to
acquire all their resources at once.
 The second approach often leads to underutilization of
resources.
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
graphhoc
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
Fransiskeran
 
SuperGraph visualization
SuperGraph visualizationSuperGraph visualization
SuperGraph visualization
Universidade de São Paulo
 
P2P Supernodes
P2P SupernodesP2P Supernodes
P2P Supernodes
Kevin Regan
 
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph KernelsDDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
ivaderivader
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Universitat Politècnica de Catalunya
 
for sbi so Ds c c++ unix rdbms sql cn os
for sbi so   Ds c c++ unix rdbms sql cn osfor sbi so   Ds c c++ unix rdbms sql cn os
for sbi so Ds c c++ unix rdbms sql cn os
alisha230390
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
GiselleginaGloria
 
Siegel
SiegelSiegel
Siegel
Joel Siegel
 
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
An Efficient Method of Partitioning High Volumes of Multidimensional Data for...
IJERA Editor
 
Analysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topologyAnalysis and design of a half hypercube interconnection network topology
Analysis and design of a half hypercube interconnection network topology
Amir Masoud Sefidian
 
NS - CUK Seminar: S.T.Nguyen, Review on "Hypergraph Neural Networks", AAAI 2019
NS - CUK Seminar: S.T.Nguyen, Review on "Hypergraph Neural Networks", AAAI 2019NS - CUK Seminar: S.T.Nguyen, Review on "Hypergraph Neural Networks", AAAI 2019
NS - CUK Seminar: S.T.Nguyen, Review on "Hypergraph Neural Networks", AAAI 2019
ssuser4b1f48
 
SCALABLE PATTERN MATCHING OVER COMPRESSED GRAPHS VIA DE-DENSIFICATION
SCALABLE PATTERN MATCHING OVER COMPRESSED GRAPHS VIA DE-DENSIFICATIONSCALABLE PATTERN MATCHING OVER COMPRESSED GRAPHS VIA DE-DENSIFICATION
SCALABLE PATTERN MATCHING OVER COMPRESSED GRAPHS VIA DE-DENSIFICATION
aftab alam
 
mesh generation techniqure of structured gridpdf
mesh generation techniqure of structured gridpdfmesh generation techniqure of structured gridpdf
mesh generation techniqure of structured gridpdf
ChrisLenard92
 
Arithmetic Operations in Multi-Valued Logic
Arithmetic Operations in Multi-Valued LogicArithmetic Operations in Multi-Valued Logic
Arithmetic Operations in Multi-Valued Logic
VLSICS Design
 
Arithmetic Operations in Multi-Valued Logic
Arithmetic Operations in Multi-Valued LogicArithmetic Operations in Multi-Valued Logic
Arithmetic Operations in Multi-Valued Logic
VLSICS Design
 
Arithmetic Operations in Multi-Valued Logic
Arithmetic Operations in Multi-Valued Logic   Arithmetic Operations in Multi-Valued Logic
Arithmetic Operations in Multi-Valued Logic
VLSICS Design
 
cs201-tree-graph_DATAStructur_Algorithm.ppt
cs201-tree-graph_DATAStructur_Algorithm.pptcs201-tree-graph_DATAStructur_Algorithm.ppt
cs201-tree-graph_DATAStructur_Algorithm.ppt
abhaysharma999437
 
Embedding of Poly Honeycomb Networks and the Metric dimension of Star of Davi...
Embedding of Poly Honeycomb Networks and the Metric dimension of Star of Davi...Embedding of Poly Honeycomb Networks and the Metric dimension of Star of Davi...
Embedding of Poly Honeycomb Networks and the Metric dimension of Star of Davi...
GiselleginaGloria
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
GiselleginaGloria
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
graphhoc
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
Fransiskeran
 
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph KernelsDDGK: Learning Graph Representations for Deep Divergence Graph Kernels
DDGK: Learning Graph Representations for Deep Divergence Graph Kernels
ivaderivader
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Universitat Politècnica de Catalunya
 
for sbi so Ds c c++ unix rdbms sql cn os
for sbi so   Ds c c++ unix rdbms sql cn osfor sbi so   Ds c c++ unix rdbms sql cn os
for sbi so Ds c c++ unix rdbms sql cn os
alisha230390
 
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor NetworksThe Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
The Neighborhood Broadcast Problem in Wireless Ad Hoc Sensor Networks
GiselleginaGloria
 
Ad

More from Heman Pathak (13)

Interconnection Network
Interconnection NetworkInterconnection Network
Interconnection Network
Heman Pathak
 
Central processing unit
Central processing unitCentral processing unit
Central processing unit
Heman Pathak
 
Registers and counters
Registers and countersRegisters and counters
Registers and counters
Heman Pathak
 
Sequential Circuit
Sequential CircuitSequential Circuit
Sequential Circuit
Heman Pathak
 
Combinational logic 2
Combinational logic 2Combinational logic 2
Combinational logic 2
Heman Pathak
 
Combinational logic 1
Combinational logic 1Combinational logic 1
Combinational logic 1
Heman Pathak
 
Simplification of Boolean Function
Simplification of Boolean FunctionSimplification of Boolean Function
Simplification of Boolean Function
Heman Pathak
 
Chapter 2: Boolean Algebra and Logic Gates
Chapter 2: Boolean Algebra and Logic GatesChapter 2: Boolean Algebra and Logic Gates
Chapter 2: Boolean Algebra and Logic Gates
Heman Pathak
 
Chapter 7: Matrix Multiplication
Chapter 7: Matrix MultiplicationChapter 7: Matrix Multiplication
Chapter 7: Matrix Multiplication
Heman Pathak
 
Cost optimal algorithm
Cost optimal algorithmCost optimal algorithm
Cost optimal algorithm
Heman Pathak
 
Chapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming LanguagesChapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming Languages
Heman Pathak
 
Parallel Algorithm for Graph Coloring
Parallel Algorithm for Graph Coloring Parallel Algorithm for Graph Coloring
Parallel Algorithm for Graph Coloring
Heman Pathak
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
Heman Pathak
 
Interconnection Network
Interconnection NetworkInterconnection Network
Interconnection Network
Heman Pathak
 
Central processing unit
Central processing unitCentral processing unit
Central processing unit
Heman Pathak
 
Registers and counters
Registers and countersRegisters and counters
Registers and counters
Heman Pathak
 
Sequential Circuit
Sequential CircuitSequential Circuit
Sequential Circuit
Heman Pathak
 
Combinational logic 2
Combinational logic 2Combinational logic 2
Combinational logic 2
Heman Pathak
 
Combinational logic 1
Combinational logic 1Combinational logic 1
Combinational logic 1
Heman Pathak
 
Simplification of Boolean Function
Simplification of Boolean FunctionSimplification of Boolean Function
Simplification of Boolean Function
Heman Pathak
 
Chapter 2: Boolean Algebra and Logic Gates
Chapter 2: Boolean Algebra and Logic GatesChapter 2: Boolean Algebra and Logic Gates
Chapter 2: Boolean Algebra and Logic Gates
Heman Pathak
 
Chapter 7: Matrix Multiplication
Chapter 7: Matrix MultiplicationChapter 7: Matrix Multiplication
Chapter 7: Matrix Multiplication
Heman Pathak
 
Cost optimal algorithm
Cost optimal algorithmCost optimal algorithm
Cost optimal algorithm
Heman Pathak
 
Chapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming LanguagesChapter 4: Parallel Programming Languages
Chapter 4: Parallel Programming Languages
Heman Pathak
 
Parallel Algorithm for Graph Coloring
Parallel Algorithm for Graph Coloring Parallel Algorithm for Graph Coloring
Parallel Algorithm for Graph Coloring
Heman Pathak
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
Heman Pathak
 
Ad

Recently uploaded (20)

Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Lecture - 7 Canals of the topic of the civil engineering
Lecture - 7  Canals of the topic of the civil engineeringLecture - 7  Canals of the topic of the civil engineering
Lecture - 7 Canals of the topic of the civil engineering
MJawadkhan1
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdfIBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
VigneshPalaniappanM
 
JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...
JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...
JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...
Reflections on Morality, Philosophy, and History
 
vtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdfvtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdf
RaghavaGD1
 
Construction Materials (Paints) in Civil Engineering
Construction Materials (Paints) in Civil EngineeringConstruction Materials (Paints) in Civil Engineering
Construction Materials (Paints) in Civil Engineering
Lavish Kashyap
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
Generative AI & Large Language Models Agents
Generative AI & Large Language Models AgentsGenerative AI & Large Language Models Agents
Generative AI & Large Language Models Agents
aasgharbee22seecs
 
introduction technology technology tec.pptx
introduction technology technology tec.pptxintroduction technology technology tec.pptx
introduction technology technology tec.pptx
Iftikhar70
 
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdfML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
rameshwarchintamani
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
Physical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A ReviewPhysical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A Review
Journal of Soft Computing in Civil Engineering
 
twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf
22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf
22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf
Guru Nanak Technical Institutions
 
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning ModelsMode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Journal of Soft Computing in Civil Engineering
 
Modeling the Influence of Environmental Factors on Concrete Evaporation Rate
Modeling the Influence of Environmental Factors on Concrete Evaporation RateModeling the Influence of Environmental Factors on Concrete Evaporation Rate
Modeling the Influence of Environmental Factors on Concrete Evaporation Rate
Journal of Soft Computing in Civil Engineering
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Lecture - 7 Canals of the topic of the civil engineering
Lecture - 7  Canals of the topic of the civil engineeringLecture - 7  Canals of the topic of the civil engineering
Lecture - 7 Canals of the topic of the civil engineering
MJawadkhan1
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdfIBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
VigneshPalaniappanM
 
vtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdfvtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdf
RaghavaGD1
 
Construction Materials (Paints) in Civil Engineering
Construction Materials (Paints) in Civil EngineeringConstruction Materials (Paints) in Civil Engineering
Construction Materials (Paints) in Civil Engineering
Lavish Kashyap
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
Generative AI & Large Language Models Agents
Generative AI & Large Language Models AgentsGenerative AI & Large Language Models Agents
Generative AI & Large Language Models Agents
aasgharbee22seecs
 
introduction technology technology tec.pptx
introduction technology technology tec.pptxintroduction technology technology tec.pptx
introduction technology technology tec.pptx
Iftikhar70
 
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdfML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
rameshwarchintamani
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 

Chapter 5: Mapping and Scheduling

  • 1. 1
  • 2.  Mapping Data to Processors 2  Processor arrays and multi-computers are characterized by a non- uniform memory structure:  each processor is able to get data from local memory much faster than from nonlocal memory.  When designing algorithms for these machines, it makes sense to have processors manipulate local data as much as possible.  For this reason, the distribution of parallel data structures often dictates which processor is responsible for performing a particular operation.
  • 3.   An algorithm's data manipulation patterns can be represented as a graph:  each vertex represents a data subset allocated to the same local memory,  and each edge represents a computation involving data from two data sets.  These graphs are often regular. An important goal of a parallel algorithm designer is to map the algorithm graph into the corresponding graph of the target machine's processor organization (Bokhari 1981). Mapping Data to Processors 3
  • 4. Performance may suffer if the algorithm graph is not a sub-graph of the parallel architecture's processor organization.  On a multicomputer two connected vertices in the algorithm graph map to a pair of vertices distance two apart on the machine.  Passing a message from one processor to the other requires roughly twice the time it takes to pass a message between adjacent processors. Mapping Data to Processors 4
  • 5.  A parallel algorithm is implemented on a multicomputer with circuit-switched routing.  Different edges in the algorithm graph map to a shared link on the machine.  If simultaneous communications occur between both pairs of nodes, the speed of the communication will be affected by the shared use of the same link. Mapping Data to Processors 5  On some systems one message would be blocked until the other message had finished.  On other systems the physical link would be multiplexed between the two virtual links. Cutting the bandwidth of each link in half. In either case, performance is lower because each message does not have exclusive use of a communication path.
Mapping Data to Processors
6
Definition 5.1:
 An embedding of a graph G = (V, E) into a graph G' = (V', E') is a function φ from V to V'.
Definition 5.2:
 Let φ be the function that embeds graph G = (V, E) into graph G' = (V', E'). The dilation of the embedding is defined as follows:
dil(φ) = max{dist(φ(u), φ(v)) : (u, v) ∈ E}
 where dist(a, b) is the distance between vertices a and b in G'.
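The two definitions can be made concrete with a short sketch (the graph representation and all names here are illustrative, not from the text):

```python
from collections import deque

def distances_from(adj, src):
    """BFS distances from src in a graph given as {vertex: [neighbours]}."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def dilation(edges_G, adj_Gp, phi):
    """dil(phi): maximum over edges (u, v) of G of dist(phi(u), phi(v)) in G'."""
    worst = 0
    for u, v in edges_G:
        worst = max(worst, distances_from(adj_Gp, phi[u])[phi[v]])
    return worst

# Toy example: embed the 3-node path a-b-c into the 4-cycle 0-1-2-3-0.
path_edges = [("a", "b"), ("b", "c")]
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(dilation(path_edges, cycle, {"a": 0, "b": 1, "c": 3}))  # edge (b, c) maps two hops apart
```

With the subgraph embedding {"a": 0, "b": 1, "c": 2} the dilation drops to 1, matching the dilation-1 remark on the next slide.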
Mapping Data to Processors
7
 There exists a dilation-1 embedding of G into G' if G is a sub-graph of G'.
[Figure: a dilation-3 embedding of G into G', and a dilation-1 embedding]
Mapping Data to Processors
8
Definition 5.3:
 The load of an embedding φ : G → G' is the maximum number of vertices of G that are mapped to a single vertex of G'.
Embedding a Ring into 2D-Mesh
10
 The ring and the mesh have the same number of vertices.
 A dilation-1 embedding exists if the mesh has an even number of rows and/or columns.
 A mesh with an odd number of rows and columns has no way to embed a ring without increasing the dilation.
[Figure: a 20-node ring embedded in a 2-D mesh]
Embedding a 2D-Mesh into 2D-Mesh
11
 Ar : the number of rows in the algorithm graph
 Ac : the number of columns in the algorithm graph
 Mr : the number of rows in the machine graph
 Mc : the number of columns in the machine graph
 The algorithm graph can be embedded with dilation-1 in the machine graph if and only if
Ar <= Mr and Ac <= Mc, or
Ac <= Mr and Ar <= Mc
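The condition amounts to checking whether the algorithm mesh fits in either orientation; a one-line sketch (function name is my own):

```python
def mesh_embeds_dilation1(Ar, Ac, Mr, Mc):
    """True iff an Ar x Ac algorithm mesh has a dilation-1 embedding in an
    Mr x Mc machine mesh: it must fit as-is or rotated 90 degrees."""
    return (Ar <= Mr and Ac <= Mc) or (Ac <= Mr and Ar <= Mc)

print(mesh_embeds_dilation1(3, 8, 8, 4))  # True: fits after rotation
print(mesh_embeds_dilation1(5, 5, 4, 8))  # False: a side of 5 exceeds 4 either way
```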
Embedding a 2D-Mesh into 2D-Mesh
12
[Figure: dilation-1 embeddings for the two cases, Ar <= Mr and Ac <= Mc, and Ac <= Mr and Ar <= Mc]
Complete Binary Tree into 2D-Mesh
13
[Figure: a dilation-1 embedding of a complete binary tree of height 3 into a 2-D mesh]
Complete Binary Tree into 2D-Mesh
14
Theorem 5.1:
 A complete binary tree of height greater than 4 cannot be embedded in a 2-D mesh without increasing the dilation beyond 1.
Proof:
 The total number of mesh points k or fewer jumps away from an arbitrary point in a 2D-mesh is 2k^2 + 2k + 1.
 The total number of nodes in a complete binary tree of depth k is 2^(k+1) - 1.
 2^(k+1) - 1 > 2k^2 + 2k + 1 for k > 4.
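The counting argument is easy to check numerically (a quick sketch; function names are my own):

```python
def mesh_points_within(k):
    # Points at distance <= k from an arbitrary point in a large 2-D mesh.
    return 2 * k * k + 2 * k + 1

def tree_nodes(k):
    # Nodes in a complete binary tree of depth k.
    return 2 ** (k + 1) - 1

# The tree first outgrows the reachable mesh neighbourhood at k = 5.
for k in range(1, 8):
    print(k, tree_nodes(k), mesh_points_within(k), tree_nodes(k) > mesh_points_within(k))
```

The loop shows the inequality failing through k = 4 (31 vs 41) and holding from k = 5 on (63 vs 61), which is exactly why height 4 is the cutoff.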
Complete Binary Tree into 2D-Mesh
15
 The H-tree is a common way of embedding a complete binary tree into a 2-D mesh. The name H-tree arises from the recursive "H" pattern used to construct a large H-tree out of four smaller H-trees.
Complete Binary Tree into 2D-Mesh
16
Theorem 5.2:
 A complete binary tree of height n has a dilation-⌈n/2⌉ embedding in a 2-D mesh.
Proof:
 We use H-trees to map binary trees to nodes of a 2-D mesh. The longest edges in an H-tree are the edges from the root to its two children; these edges have the same length. The length of the root edges in an H-tree of height n is ⌈n/2⌉.
Binomial Tree into 2D-Mesh
17
[Figure: construction of a binomial tree of height k using two binomial trees of height k-1]
Binomial Tree into 2D-Mesh
18
[Figure: a dilation-1 embedding of a binomial tree of height 4 into a 2-D mesh]
Binomial Tree into 2D-Mesh
19
Theorem 5.3:
 A binomial tree of height greater than 4 cannot be embedded in a 2-D mesh without increasing the dilation beyond 1.
Proof:
 The root node of a binomial tree of height d is connected to d other nodes. No node in a 2-D mesh has more than 4 neighbours. Hence a binomial tree of height greater than 4 cannot be embedded in a 2-D mesh without increasing the dilation beyond 1.
Binomial Tree into 2D-Mesh
20
Theorem 5.4:
 A binomial tree of height n has a dilation-⌈n/2⌉ embedding in a 2-D mesh.
Proof:
 The construction is illustrated in the figure.
Embedding Graphs Into Hypercubes
23
Definition 5.4:
 A graph G is cubical if there is a dilation-1 embedding of G into a hypercube.
Theorem 5.5:
 The problem of determining whether an arbitrary graph G is cubical is NP-complete (Afrati et al. 1985; Cybenko et al. 1986).
Embedding Graphs Into Hypercubes
24
Theorem 5.6:
 A dilation-1 embedding of a connected graph G into a hypercube with n nodes exists if and only if it is possible to label the edges of G with the integers {1, ..., n} such that:
 1. Edges incident with a common vertex have different labels.
 2. In every path of G at least one label appears an odd number of times.
 3. In every cycle of G no label appears an odd number of times.
Complete Binary Tree Into Hypercube
26
Theorem 5.7:
 A dilation-1 embedding of a complete binary tree of height n into a hypercube of dimension n + 1 does not exist if n > 1.
Proof:
 A complete binary tree of height n has 2^(n+1) - 1 nodes.
 A hypercube of dimension n + 1 has 2^(n+1) nodes.
 Let the root of the tree be mapped to node X of the cube, and let the height of the tree be k.
Complete Binary Tree Into Hypercube
27
Proof (continued):
 A hypercube is a bipartite graph:
 half the nodes can be reached only by following an even number of edges from X;
 half the nodes can be reached only by following an odd number of edges from X.
 If k is odd, then more than half the nodes of the binary tree are an odd distance away from the root of the binary tree.
 If k is even, then more than half the nodes of the binary tree are an even distance away from the root of the binary tree.
 In either case, there is no way to embed the binary tree into the hypercube and keep the dilation at 1, because there are not enough hypercube nodes to accommodate the leaves of the tree while maintaining parity (odd or even) with the interior nodes.
Complete Binary Tree Into Hypercube
28
Theorem 5.8:
 A balanced binary tree of height n has a dilation-1 embedding into a hypercube of dimension n + 2 (see Nebesky 1974).
Theorem 5.9:
 A complete binary tree of height n has a dilation-2 embedding in a hypercube of dimension n + 1, for all n > 1 (see Leighton 1992).
Binomial Tree Into Hypercube
29
Theorem 5.10:
 A binomial tree of height n can be embedded in a hypercube of dimension n such that the dilation is 1.
Proof:
 Organize the sub-trees so that the nodes that are roots of larger sub-trees appear to the left of nodes that are roots of smaller sub-trees.
 Give the edge to the leftmost child of the root node the label 1, the edge to the second child of the root node the label 2, and so on, to the edge to the last child, which gets label n.
 For all remaining interior nodes of the tree, if the edge above a node has label i, then the edges to the k children of that node are given labels i + 1, i + 2, ..., i + k, from left to right.
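One common indexing scheme consistent with Theorem 5.10 (my own choice of layout, not necessarily the text's labeling) maps node i of the binomial tree directly to hypercube node i: the parent of a node is obtained by clearing its highest set bit, so every tree edge joins hypercube nodes differing in exactly one bit.

```python
def parent(i):
    """Parent of node i (> 0) when B_n is laid out on hypercube nodes
    0 .. 2^n - 1: clear the highest set bit of i."""
    return i & ~(1 << (i.bit_length() - 1))

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

n = 4
# Every tree edge (parent(i), i) joins hypercube neighbours, i.e. dilation 1.
print(all(hamming(i, parent(i)) == 1 for i in range(1, 2 ** n)))
```

Under this layout the root 0 has children 1, 2, 4, ..., 2^(n-1), giving it degree n, as the binomial tree requires.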
Binomial Tree Into Hypercube
31
[Figure: a binomial tree of height 3 embedded in a 3-dimensional hypercube]
Ring Into Hypercube
32
[Figure: an 8-node ring embedded in a 3-dimensional hypercube]
Ring Into Hypercube
33
 Assume the hypercube contains p = 2^d processors.
 Let G(i) be the number of the processor assigned to ring position i, where 0 <= i < p.
 G(0) = 0, G(1) = 1, G(2) = 2, ..., G(p-1) = p-1 will not work.
 The encoding must have the following properties:
 1. The values must be unique; in other words, G(i) ≠ G(j) if i ≠ j.
 2. G(i) and G(i+1) must differ in exactly one bit position, for all i, where 0 <= i < p-1.
 3. G(p-1) and G(0) must differ in exactly one bit position.
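The three properties can be verified for the standard reflected binary Gray code (a sketch; variable names are my own):

```python
def G(i):
    """Reflected binary Gray code of i."""
    return i ^ (i >> 1)

d = 4
p = 2 ** d
codes = [G(i) for i in range(p)]
one_bit = lambda a, b: bin(a ^ b).count("1") == 1

print(len(set(codes)) == p)                                    # property 1: unique
print(all(one_bit(codes[i], codes[i + 1]) for i in range(p - 1)))  # property 2: neighbours
print(one_bit(codes[-1], codes[0]))                            # property 3: wraparound
```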
Ring Into Hypercube
34
Gray Code:
 A Gray code is a binary numeral system in which two successive values differ in only one bit position.
 There are many possible n-bit Gray codes.
Gray Code Generation:
 Longer Gray codes can be constructed from shorter Gray codes.
 Given a d-bit Gray code, a (d + 1)-bit Gray code can be constructed by listing the d-bit Gray code with the prefix 0, followed by the d-bit Gray code in reverse order with the prefix 1.
Gray Code Generation
35
 A one-bit Gray code is {0, 1}:
G(0) = 0
G(1) = 1
Gray Code Generation
36
 A two-bit Gray code can be generated from two one-bit Gray codes:
0 0 -> G(0) = 0
0 1 -> G(1) = 1
1 1 -> G(2) = 3
1 0 -> G(3) = 2
Gray Code Generation
37
 A three-bit Gray code can be generated from two two-bit Gray codes:
0 0 0 -> G(0) = 0
0 0 1 -> G(1) = 1
0 1 1 -> G(2) = 3
0 1 0 -> G(3) = 2
1 1 0 -> G(4) = 6
1 1 1 -> G(5) = 7
1 0 1 -> G(6) = 5
1 0 0 -> G(7) = 4
Gray Code Generation
38
G(0) = 0, G(1) = 1, G(2) = 3, G(3) = 2, G(4) = 6, G(5) = 7, G(6) = 5, G(7) = 4
G^-1(0) = 0, G^-1(1) = 1, G^-1(2) = 3, G^-1(3) = 2, G^-1(4) = 7, G^-1(5) = 6, G^-1(6) = 4, G^-1(7) = 5
 The ring traversal of hypercube nodes is 0-1-3-2-6-7-5-4-0, so:
successor(0) = 1, successor(1) = 3, successor(2) = 6, successor(3) = 2, successor(4) = 0, successor(5) = 4, successor(6) = 7, successor(7) = 5
 Successor(i) = ?  Let X = G^-1(i) and Y = X + 1; then Successor(i) = G(Y) = G(X + 1) = G(G^-1(i) + 1).
Gray Code Generation
39
 Gray code G(i): input is a ring node; output is a hypercube node.
 Inverse Gray code G^-1(i): input is a hypercube node; output is a ring node. G^-1(i) = j if and only if G(j) = i.
 Successor(i), where i is a hypercube node:
Successor(i) = 0 if G^-1(i) = 2^d - 1
Successor(i) = G(G^-1(i) + 1) otherwise
Gray Code Generation
40
 G(i) = i XOR floor(i/2):
 i        floor(i/2)   i XOR floor(i/2)
 0: 000   0: 000       000: 0   G(0) = 0
 1: 001   0: 000       001: 1   G(1) = 1
 2: 010   1: 001       011: 3   G(2) = 3
 3: 011   1: 001       010: 2   G(3) = 2
 4: 100   2: 010       110: 6   G(4) = 6
 5: 101   2: 010       111: 7   G(5) = 7
 6: 110   3: 011       101: 5   G(6) = 5
 7: 111   3: 011       100: 4   G(7) = 4
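The rule tabulated above is a single XOR with a one-bit right shift (a sketch):

```python
def G(i):
    # G(i) = i XOR floor(i/2), computed with a bitwise shift
    return i ^ (i >> 1)

# Reproduce the 3-bit table: G = 0, 1, 3, 2, 6, 7, 5, 4
for i in range(8):
    print(f"{i}: {i:03b}  {i >> 1}: {i >> 1:03b}  {G(i):03b}: {G(i)}")
```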
Gray^-1 Code Generation
41
 G^-1(i) is computed by starting with ans = i and mask = i shifted right by one bit, then repeatedly XOR-ing mask into ans and shifting mask right until it is zero. For example, for i = 4:
ans = 100, mask = 010 -> ans = 110
ans = 110, mask = 001 -> ans = 111
mask = 000, so G^-1(4) = 111 = 7
 The resulting values are:
G^-1(0) = 0, G^-1(1) = 1, G^-1(2) = 3, G^-1(3) = 2, G^-1(4) = 7, G^-1(5) = 6, G^-1(6) = 4, G^-1(7) = 5
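The inverse loop and the successor rule can be sketched together (function names are my own; note the wraparound test is applied to the ring position G^-1(i), which is what makes successor(4) = 0 in the 3-bit tables):

```python
def G(i):
    """Gray code: ring position -> hypercube node."""
    return i ^ (i >> 1)

def G_inv(i):
    """Invert the Gray code by XOR-ing in right-shifted copies of i,
    exactly the mask sequence traced in the table above."""
    ans, mask = i, i >> 1
    while mask:
        ans ^= mask
        mask >>= 1
    return ans

def successor(i, d):
    """Hypercube node holding the next ring position after node i."""
    if G_inv(i) == 2 ** d - 1:   # last ring position wraps around to G(0) = 0
        return 0
    return G(G_inv(i) + 1)

print([G_inv(i) for i in range(8)])        # [0, 1, 3, 2, 7, 6, 4, 5]
print([successor(i, 3) for i in range(8)]) # [1, 3, 6, 2, 0, 4, 7, 5]
```

Both printed lists match the G^-1 and successor tables on the preceding slides.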
Mesh into Hypercube
42
 The use of Gray codes yields a straightforward solution, with the constraint that the size of the mesh in each dimension must be a power of 2.
 Each dimension of the mesh is assigned an appropriate number of bit positions in the encoding string.
 Traversing the mesh nodes along a dimension yields a cycle.
 Gray codes determine the values assigned to each bit field.
  • 43. Mesh into Hypercube — For example, consider mapping a 4 × 8 mesh into a 32-node hypercube. Two bit positions are reserved for the row and three for the column. Assume the first two bit positions are used for the row. The 2-bit Gray code {00, 01, 11, 10} corresponds to a traversal through rows 0, 1, 2, and 3. The 3-bit Gray code {000, 001, 011, 010, 110, 111, 101, 100} corresponds to a traversal through columns 0, 1, 2, 3, 4, 5, 6, and 7. Hence we have the following mapping of a 4 × 8 mesh into a 32-node hypercube:
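The row/column encoding just described can be sketched as follows (the helper name and the bit-field layout, row bits first, are assumptions matching the example):

```python
def gray(i):
    # Binary-reflected Gray code
    return i ^ (i >> 1)

def mesh_to_hypercube(row, col, col_bits=3):
    # 4 x 8 mesh -> 32-node hypercube: the two leading bits carry the
    # row's Gray code, the trailing col_bits carry the column's.
    return (gray(row) << col_bits) | gray(col)

# Neighbouring mesh nodes land on hypercube nodes one bit apart,
# i.e. on adjacent hypercube processors.
```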
  • 45. Mesh into Hypercube — Theorem 5.11: Any two-dimensional mesh with n vertices can be embedded in a ⌈log n⌉-dimensional hypercube with dilation 2 (Leighton 1992).
  • 47. Dynamic Load Balancing in Multi-computers — It has been assumed that if data were distributed evenly among the local memories of a multicomputer's processors, the processors' workloads would be balanced for the entire computation. In many cases this assumption holds, but not always. If nothing is done to change the size of each processor's area of responsibility, the processors' workloads may become severely imbalanced.
  • 48.  • Dynamic load balancing is the process of making changes to the distribution of work among the processors at run time. • The measure of success of dynamic load balancing is the net reduction of execution time achieved by applying the load balancing algorithm. • Dynamic load balancing may increase the execution time of the parallel algorithm if the time spent performing the load balancing is more than the time saved by reducing the variance in the execution time of tasks on the various processors. Dynamic Load Balancing 48 Dynamic Load Balancing in Multi-computer
  • 49. Centralized Load Balancing Algorithms — These algorithms make a global decision about the reallocation of work to processors. Some centralized algorithms assign the maintenance of the system's global state information to a particular node. Global state information can allow the algorithm to do a good job balancing the work among the processors. However, this approach does not scale well, because the amount of information increases linearly with the number of processors.
  • 50. Fully Distributed Load Balancing Algorithms — Each processor builds its own view of the state of the system. Processors exchange information with neighbouring processors and use this information to make local changes in the allocation of work. A fully distributed algorithm has the advantage of lower scheduling overhead. However, since processors have only local state information, the workload may not be balanced as well as it would be by a centralized algorithm.
  • 51. Semi-Distributed Load Balancing Algorithms — A semi-distributed load balancing algorithm divides the processors into regions. Within each region, a centralized algorithm distributes the workload among the processors. A higher-level scheduling mechanism balances the workload between regions.
  • 52. Dynamic Load Balancing — Load-balancing algorithms can be sender initiated or receiver initiated. Sender initiated: a processor with too much work sends some of it to another processor; these algorithms perform better in an environment with light to medium workload per processor. Receiver initiated: a processor with too little work takes work from another processor; these perform better under heavy workload per processor, though task migration can be expensive if the receiver grabs a partially completed task.
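A toy sender-initiated scheme can be sketched as follows (illustrative only: the threshold value and the "send surplus to the lightest peer" policy are assumptions, not from the slides):

```python
THRESHOLD = 4  # assumed per-processor queue limit

def balance(queues):
    # Sender initiated: each overloaded queue pushes surplus tasks
    # to the currently lightest queue until it is back under THRESHOLD.
    for q in queues:
        while len(q) > THRESHOLD:
            lightest = min(queues, key=len)
            if lightest is q:
                break
            lightest.append(q.pop())

queues = [list(range(10)), [], []]
balance(queues)
print([len(q) for q in queues])  # -> [4, 3, 3]
```

A receiver-initiated variant would instead have light queues pull tasks from the heaviest peer; the trade-off between the two is exactly the workload regime described above.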
  • 54.  54 Scheduling on UMA Multiprocessors
  • 55. Static vs. Dynamic Scheduling — First, static scheduling sometimes results in lower execution times than dynamic scheduling. Second, static scheduling can allow the generation of only one process per processor, reducing process creation, synchronization, and termination overhead. Third, static scheduling can be used to predict the speedup that can be achieved by a particular parallel algorithm on a target machine, assuming no preemption of processes occurs.
  • 56.   One way to view a parallel algorithm is as a collection of tasks, some of which must be completed before others begin.  In a deterministic model, the execution time needed by each task and the precedence relations between the tasks are fixed and known in advance.  This information can be represented by a directed graph called a task graph, which ignores variances in tasks' execution times due to interrupts, contention for shared memory, etc.  Task graphs do provide a basis for the static allocation of tasks to processors. 56 Deterministic Static Scheduling
  • 57. Deterministic Static Scheduling — Schedule: a schedule is an allocation of tasks to processors. Schedules are often illustrated with Gantt charts. Gantt chart: a Gantt chart indicates the time each task spends in execution, as well as the processor on which it executes. A desirable feature of Gantt charts is that they graphically illustrate the utilization of each processor (the percentage of time spent executing tasks).
  • 58. Deterministic Static Scheduling — Some simple scheduling problems are solvable in polynomial time, while other problems, only slightly more complex, are intractable. For example, if all of the tasks take unit time and the task graph is a forest, then a polynomial-time algorithm exists to find an optimal schedule (Hu 1961). If all of the tasks take unit time and the number of processors is two, then a polynomial-time algorithm exists to find an optimal schedule (Coffman and Graham 1972). If the task lengths vary at all, or if there are more than two processors, then the problem of finding an optimal schedule is NP-hard (Ullman 1975), meaning the only known algorithms that find an optimal schedule require exponential time in the worst case.
  • 59.  59 Deterministic Static Scheduling  In general we are interested in scheduling arbitrary task graphs onto a reasonable number of processors, with polynomial time scheduling algorithms that do a good, but not perfect, job.  Given a list of tasks ordered by their relative priority, it is possible to assign tasks to processors by always giving each available processor the first unassigned task on the list whose predecessor tasks have already finished execution.  This list-scheduling algorithm was proposed by Graham (1966, 1969, 1972), and we formalize it next.
  • 60. Graham's List Scheduling Algorithm — Let T = {T1, T2, ..., Tn} be a set of tasks. Let τ : T → (0, ∞) be a function that associates an execution time with each task. We are also given a partial order ≺ on T. Let L be a list of the tasks in T. Whenever a processor has no work to do, it instantaneously removes from L the first ready task, that is, an unscheduled task whose predecessors under ≺ have all completed execution. If two or more processors simultaneously attempt to execute the same task, the processor with the lowest index succeeds, and the other processors look for another suitable task.
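The rule just stated can be sketched in a few dozen lines (a simplified model; the simultaneous-access tie is broken by lowest processor index, and all names are illustrative):

```python
import heapq

def list_schedule(times, preds, order, p):
    # Graham's list scheduling: whenever a processor is idle, give it the
    # first ready task on the priority list L (`order`).
    free = list(range(p))          # idle processors
    busy = []                      # heap of (finish_time, processor, task)
    finished, schedule, now = set(), {}, 0
    remaining = list(order)
    while remaining:
        progressed = True
        while progressed:          # dispatch every ready task, in list order
            progressed = False
            for t in remaining:
                if free and preds[t] <= finished:
                    proc = min(free)           # lowest-index processor wins
                    free.remove(proc)
                    schedule[t] = (proc, now)
                    heapq.heappush(busy, (now + times[t], proc, t))
                    remaining.remove(t)
                    progressed = True
                    break
        now, proc, t = heapq.heappop(busy)     # advance to next completion
        finished.add(t)
        free.append(proc)
        while busy and busy[0][0] == now:      # simultaneous completions
            _, proc2, t2 = heapq.heappop(busy)
            finished.add(t2)
            free.append(proc2)
    makespan = now
    while busy:                    # drain tasks still running at the end
        makespan = heapq.heappop(busy)[0]
    return schedule, makespan
```

For a chain T1 ≺ T2 ≺ T3 of unit-time tasks on two processors this yields a makespan of 3, since the precedence constraints leave no parallelism to exploit.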
  • 61. Graham's List Scheduling Algorithm — Example with L = (T1, T2, T3, T4, T5, T6, T7) and p = 3 processors (P0, P1, P2). The resulting Gantt chart, by time step: 1–2: T1; 3: T2, T3, T4; 4: T2, T4; 5: T2, T6; 6–7: T5, T6; 8: T5; 9: T7.
  • 62. Graham's List Scheduling Algorithm — It is expected that the length of the schedule should decrease by: increasing the number of processors; decreasing the execution times of one or more tasks; or eliminating some of the precedence constraints.
  • 63.  Graham‘s List scheduling Algorithm 63 An example illustrating that increasing the number of processors can increase the length of schedule generated using Graham’s heuristic.
  • 64.  Graham‘s List scheduling Algorithm 64 An example illustrating that decreasing the execution times of one or more tasks can increase the length of schedule generated using Graham’s heuristic.
  • 65.  Graham‘s List scheduling Algorithm 65 An example illustrating that eliminating some of the precedence constraints can increase the length of schedule generated using Graham’s heuristic.
  • 66.   Graham's list-scheduling algorithm depends upon a prioritized list of tasks to execute.  A well-known and intuitive scheduling algorithm due to Coffman and Graham (1972) constructs the list of tasks for the simple case when all tasks take the same amount of time.  Once this list L has been constructed, the algorithm applies Graham's list-scheduling algorithm, already described. Coffman & Graham’s List Construction Algo. 66
  • 67. Coffman & Graham's List Construction Algo. — Let T = {T1, T2, ..., Tn} be a set of n unit-time tasks to be executed on p processors. Let ≺ be a partial order on T that specifies which tasks must complete before other tasks begin. If Ti ≺ Tj, then task Ti is an immediate predecessor of task Tj and Tj is an immediate successor of Ti. Let S(Ti) denote the set of immediate successors of Ti. Let α(T) be an integer label assigned to T. N(T) denotes the decreasing sequence of integers formed by ordering the set {α(T′) | T′ ∈ S(T)}.
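Under these definitions, the label construction can be sketched as follows (an illustrative implementation; the tie-break among tasks with equal N(T) sequences simply takes the first candidate in task order):

```python
def coffman_graham_list(tasks, succ):
    # Coffman-Graham label construction for unit-time tasks.
    # `succ` maps each task to the set of its immediate successors.
    # Returns the priority list L: tasks in decreasing label order.
    label = {}
    for k in range(1, len(tasks) + 1):
        # Candidates: unlabeled tasks whose successors are all labeled.
        candidates = [t for t in tasks
                      if t not in label
                      and all(s in label for s in succ[t])]
        # N(T): successor labels in decreasing order; assign label k to a
        # candidate whose N(T) is lexicographically smallest.
        best = min(candidates,
                   key=lambda t: sorted((label[s] for s in succ[t]),
                                        reverse=True))
        label[best] = k
    return sorted(tasks, key=lambda t: -label[t])
```

For a chain a ≺ b ≺ c this labels c = 1, b = 2, a = 3 and returns L = (a, b, c); the list is then fed to Graham's list-scheduling algorithm as described above.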
  • 68.  Coffman & Graham’s List Construction Algo. 68
  • 69.  Coffman & Graham’s List Construction Algo. 69
  • 70.  Coffman & Graham’s List Construction Algo. 70
  • 71. Coffman & Graham's List Construction Algo. — Theorem 5.12: If ω is the length of a schedule produced by the Coffman-Graham algorithm and ω0 is the length of an optimal schedule, then ω/ω0 ≤ 2 − 2/p, where p is the number of processors and p ≥ 2 (see Lam and Sethi 1977). Corollary 5.1: The Coffman-Graham algorithm generates an optimal schedule if the number of processors is two (see Lam and Sethi 1977).
  • 72.   In a nondeterministic model, the execution time of a task is represented by a random variable, making the scheduling problem more difficult.  This subsection summarizes mathematics developed by Robinson (1979) that allow an estimate of the execution time of parallel programs with "simple" task graphs on UMA multiprocessors. 72 Nondeterministic Model
  • 73.  Initial Tasks Tasks with no predecessors are called initial tasks. Independent Tasks A set of tasks is independent if, for every pair of tasks Ti and Tj in the set, neither is a predecessor of the other. Width The width of a task graph is the size of the maximal set of independent tasks. Chain A chain is a totally ordered task graph. Chain Length The length of a chain is the number of tasks in the chain. Level The level of a task T in a task graph G is the maximum chain length in G from an initial task to T. Depth The depth of a task graph G is the maximum level of any task in G. 73 Nondeterministic Model
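The level and depth definitions above translate directly into a short recursive computation (a sketch; width is omitted, since it requires finding a maximum set of independent tasks):

```python
def levels(preds):
    # Level of task T: the maximum chain length (counted in tasks) from an
    # initial task to T. Depth of the graph: the maximum level of any task.
    level = {}
    def lv(t):
        if t not in level:
            ps = preds[t]
            level[t] = 1 if not ps else 1 + max(lv(p) for p in ps)
        return level[t]
    for t in preds:
        lv(t)
    return level, max(level.values())
```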
  • 74. Nondeterministic Model — Definition 5.5: Given a task graph G, let C1, C2, ..., Cm be all the m chains from initial to final tasks in G. For every chain Ci consisting of tasks Ti1, Ti2, ..., Tij, let Xi be the product xi1 xi2 ··· xij, where x1, x2, ..., xn are polynomial variables. Then G is a simple task graph if the polynomial X1 + X2 + ··· + Xm can be factored so that every variable appears exactly once (see Robinson 1979).
  • 75.  75 Nondeterministic Model x1x2x4 + x1x3x4 + x1x3x5 x1[x2x4 + x3x4 + x3x5] Simple Graph
  • 76.  76 Nondeterministic Model x1x2x4 + x1x3x4 x1[x2 + x3] x4 Simple Graph
  • 77.  77 Nondeterministic Model x1x3 + x1x4 + x2x3 + x2x4 x1[x3 + x4] + x2[x3 + x4] [x1 + x2] [x3 + x4] Simple Graph
  • 78.  78 Nondeterministic Model x1x3 + x1x4 + x2x4 x1[x3 + x4] + x2x4 Non-simple Graph
  • 81.   A set of active concurrent processes is said to be deadlocked if each holds non-pre-emptible resources that must be acquired by some other process in order to proceed.  The potential for deadlock exists whenever multiple processes share resources in an unsupervised way.  Hence deadlock can exist in multi-programmed operating systems as well as in multiprocessors and multi-computers. 81 Deadlock
  • 82. Deadlock — Consider two processes executing simultaneously, each of which attempts to lock two resources. (Note that lock and unlock correspond to P and V operations on binary semaphores.)

  Proc. 1        Proc. 2
  ...            ...
  lock(A)        lock(B)
  ...            ...
  lock(B)        lock(A)

  Process 1 locks A while process 2 locks B. Process 1 is blocked when it tries to lock B; likewise, process 2 is blocked when it tries to lock A. Neither process can proceed. If neither process can be made to back up and yield its semaphore, the two processes will remain deadlocked indefinitely.
  • 83. Buffer Deadlock — Consider a multicomputer in which processors communicate asynchronously. When a processor sends a message to another processor, the message is stored in a system buffer until the receiving processor reads it. Suppose so many processors send data to processor 0 that its system buffer fills up; further attempts to send data are blocked until processor 0 reads one or more messages, making room in the buffer. Let processor i be one of the processors unable to send its message to processor 0. If processor 0 attempts to read the message sent by processor i, it will block until the data appears in the system buffer, but processor i is blocked until processor 0 removes one or more messages from that buffer. The two processors are deadlocked.
  • 85.  85 Deadlock Four necessary conditions for deadlock to occur (Coffman and Denning 1973) Mutual exclusion: each process has exclusive use of its resources Non-pre-emption: a process never releases resources it holds until it is through using them Resource waiting: each process holds resources while waiting for other processes to release theirs A cycle of waiting processes: each process in the cycle waits for resources that the next process owns and will not relinquish
  • 86.  The problem of deadlock is commonly addressed in one of three ways. • One approach is to detect deadlocks when they occur and try to recover from them. • Another approach is to avoid deadlocks by using advance information about requests for resources to control allocation, so that the next allocation of a resource will not cause processes to enter a situation in which deadlock may occur. • The third approach is to prevent deadlock by forbidding one of the last three conditions listed above. 86 Deadlock
  • 87.  87 Deadlock  A cycle of waiting processes can be prevented by ordering shared resources and forcing processes to request resources in that order.  Deadlock can also be prevented by requiring processes to acquire all their resources at once.  The second approach often leads to underutilization of resources.
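The resource-ordering idea can be sketched with two threads that both acquire locks in the same fixed order, so the wait-for cycle from the earlier lock(A)/lock(B) example cannot form (the lock names and the chosen order are illustrative):

```python
import threading

a_lock, b_lock = threading.Lock(), threading.Lock()
ORDER = [a_lock, b_lock]   # assumed global order: A before B

def worker(result, tag):
    for lk in ORDER:            # both threads lock A first, then B
        lk.acquire()
    result.append(tag)          # critical section using both resources
    for lk in reversed(ORDER):
        lk.release()

result = []
threads = [threading.Thread(target=worker, args=(result, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(result))  # -> [0, 1]
```

Had the second thread acquired B before A, the two threads could each hold one lock while waiting for the other, reproducing the deadlock on slide 82.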