Query Evaluation Techniques for Large Databases
GOETZ GRAEFE
Portland State University, Computer Science Department, P. O. Box 751, Portland, Oregon 97207-0751
Database management systems will continue to manage large data volumes. Thus,
efficient algorithms for accessing and manipulating large sets and sequences will be
required to provide acceptable performance. The advent of object-oriented and extensible
database systems will not solve this problem. On the contrary, modern data models
exacerbate the problem: In order to manipulate large sets of complex objects as
efficiently as today’s database systems manipulate simple records, query processing
algorithms and software will become more complex, and a solid understanding of
algorithm and architectural issues is essential for the designer of database management
software.
This survey provides a foundation for the design and implementation of query
execution facilities in new database management systems. It describes a wide array of
practical query evaluation techniques for both relational and postrelational database
systems, including iterative execution of complex query evaluation plans, the duality of
sort- and hash-based set-matching algorithms, types of parallel query execution and
their implementation, and special operators for emerging database application domains.
Categories and Subject Descriptors: E.5 [Data]: Files; H.2.4 [Database Management]:
Systems—query processing
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Complex query evaluation plans, dynamic query
evaluation plans, extensible database systems, iterators, object-oriented database
systems, operator model of parallelization, parallel algorithms, relational database
systems, set-matching algorithms, sort-hash duality
INTRODUCTION
Effective and efficient management of
large data volumes is necessary in virtu-
ally all computer applications, from busi-
ness data processing to library infor-
mation retrieval systems, multimedia
applications with images and sound,
computer-aided design and manufactur-
ing, real-time process control, and scien-
tific computation. While database
management systems are standard
tools in business data processing, they
are only slowly being introduced to all
the other emerging database application
areas.
In most of these new application do-
mains, database management systems
have traditionally not been used for two
reasons. First, restrictive data definition
and manipulation languages can make
application development and mainte-
nance unbearably cumbersome. Research
into semantic and object-oriented data
models and into persistent database pro-
gramming languages has been address-
ing this problem and will eventually lead
to acceptable solutions. Second, data vol-
CONTENTS

INTRODUCTION
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
2. SORTING AND HASHING
  2.1 Sorting
  2.2 Hashing
3. DISK ACCESS
  3.1 File Scans
  3.2 Associative Access Using Indices
  3.3 Buffer Management
4. AGGREGATION AND DUPLICATE REMOVAL
  4.1 Aggregation Algorithms Based on Nested Loops
  4.2 Aggregation Algorithms Based on Sorting
  4.3 Aggregation Algorithms Based on Hashing
  4.4 A Rough Performance Comparison
  4.5 Additional Remarks on Aggregation
5. BINARY MATCHING OPERATIONS
  5.1 Nested-Loops Join Algorithms
  5.2 Merge-Join Algorithms
  5.3 Hash Join Algorithms
  5.4 Pointer-Based Joins
  5.5 A Rough Performance Comparison
6. UNIVERSAL QUANTIFICATION
7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
8. EXECUTION OF COMPLEX QUERY PLANS
9. MECHANISMS FOR PARALLEL QUERY EXECUTION
  9.1 Parallel versus Distributed Database Systems
  9.2 Forms of Parallelism
  9.3 Implementation Strategies
  9.4 Load Balancing and Skew
  9.5 Architectures and Architecture Independence
10. PARALLEL ALGORITHMS
  10.1 Parallel Selections and Updates
  10.2 Parallel Sorting
  10.3 Parallel Aggregation and Duplicate Removal
  10.4 Parallel Joins and Other Binary Matching Operations
  10.5 Parallel Universal Quantification
11. NONSTANDARD QUERY PROCESSING ALGORITHMS
  11.1 Nested Relations
  11.2 Temporal and Scientific Database Management
  11.3 Object-Oriented Database Systems
  11.4 More Control Operators
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
  12.1 Precomputation and Derived Data
  12.2 Data Compression
  12.3 Surrogate Processing
  12.4 Bit Vector Filtering
  12.5 Specialized Hardware
SUMMARY AND OUTLOOK
ACKNOWLEDGMENTS
REFERENCES
umes might be so large or complex that
the real or perceived performance advan-
tage of file systems is considered more
important than all other criteria, e.g., the
higher levels of abstraction and program-
mer productivity typically achieved with
database management systems. Thus,
object-oriented database management
systems that are designed for nontradi-
tional database application domains and
extensible database management system
toolkits that support a variety of data
models must provide excellent perfor-
mance to meet the challenges of very
large data volumes, and techniques for
manipulating large data sets will find
renewed and increased interest in the
database community.
The purpose of this paper is to survey
efficient algorithms and software archi-
tectures of database query execution en-
gines for executing complex queries over
large databases. A “complex” query is
one that requires a number of query-
processing algorithms to work together,
and a “large” database uses files with
sizes from several megabytes to many
terabytes, which are typical for database
applications at present and in the near
future [Dozier 1992; Silberschatz et al.
1991]. This survey discusses a large vari-
ety of query execution techniques that
must be considered when designing and
implementing the query execution mod-
ule of a new database management sys-
tem: algorithms and their execution costs,
sorting versus hashing, parallelism, re-
source allocation and scheduling issues
in complex queries, special operations for
emerging database application domains
such as statistical and scientific data-
bases, and general performance-enhanc-
ing techniques such as precomputation
and compression. While many, although
not all, techniques discussed in this pa-
per have been developed in the context of
relational database systems, most of
them are applicable to and useful in the
query processing facility for any database
management system and any data model,
provided the data model permits queries
over “bulk” data types such as sets and
lists.
[Figure 1. Query processing in a database system: a stack of layers consisting of User Interface, Database Query Language, Query Optimizer, Query Execution Engine, Files and Indices, I/O Buffer, and Disk.]
It is assumed that the reader possesses
basic textbook knowledge of database
query languages, in particular of rela-
tional algebra, and of file systems, in-
cluding some basic knowledge of index
structures. As shown in Figure 1, query
processing fills the gap between database
query languages and file systems. It can
be divided into query optimization and
query execution. A query optimizer
translates a query expressed in a high-
level query language into a sequence of
operations that are implemented in the
query execution engine or the file system.
The goal of query optimization is to find
a query evaluation plan that minimizes
the most relevant performance measure,
which can be the database user’s wait for
the first or last result item, CPU, I/O,
and network time and effort (time and
effort can differ due to parallelism),
memory costs (as maximum allocation or
as time-space product), total resource us-
age, even energy consumption (e.g., for
battery-powered laptop systems or spacecraft), a combination of the above, or some
other performance measure. Query opti-
mization is a special form of planning,
employing techniques from artificial in-
telligence such as plan representation,
search including directed search and
pruning, dynamic programming, branch-
and-bound algorithms, etc. The query ex-
ecution engine is a collection of query
execution operators and mechanisms for
operator communication and synchro-
nization—it employs concepts from al-
gorithm design, operating systems,
networks, and parallel and distributed
computation. The facilities of the query
execution engine define the space of
possible plans that can be chosen by
the query optimizer.
A general outline of the steps required for processing a database query is shown in Figure 2. Of course, this se-
quence is only a general guideline, and
different database systems may use dif-
ferent steps or merge multiple steps into
one. After a query or request has been
entered into the database system, be it
interactively or by an application pro-
gram, the query is parsed into an inter-
nal form. Next, the query is validated
against the metadata (data about the
data, also called schema or catalogs) to
ensure that the query contains only valid
references to existing database objects. If
the database system provides a macro
facility such as relational views, refer-
enced macros and views are expanded
into the query [Stonebraker 1975]. In-
tegrity constraints might be expressed as
views (externally or internally) and would
also be integrated into the query at this
point in most systems [Metro 1989]. The
query optimizer then maps the expanded
query expression into an optimized plan
that operates directly on the stored
database objects. This mapping process
can be very complex and might require
substantial search and cost estimation
effort. (Optimization is not discussed in this paper; a survey can be found in Jarke and Koch [1984].) The optimizer's output
is called a query execution plan, query
evaluation plan, QEP, or simply plan.
Using a simple tree traversal algorithm,
this plan is translated into a representa-
tion ready for execution by the database's query execution engine; the result of this
translation can be compiled machine code
or a semicompiled or interpreted lan-
guage or data structure.
This survey discusses only read-only
queries explicitly; however, most of the
techniques are also applicable to update
requests. In most database management
systems, update requests may include a
search predicate to determine which
database objects are to be modified. Stan-
dard query optimization and execution
techniques apply to this search; the ac-
tual update procedure can be either ap-
[Figure 2. Query processing steps: Parsing, Query Validation, View Resolution, Optimization, Plan Compilation, Execution.]
plied in a second phase, a method called
deferred updates, or merged into the
search phase if there is no danger of
creating ambiguous update semantics.1
The problem of ensuring ACID seman-
tics for updates—making updates
Atomic (all-or-nothing semantics), Con-
sistent (translating any consistent
database state into another consistent
database state), Isolated (from other
queries and requests), and Durable (per-
sistent across all failures)—is beyond the
scope of this paper; suitable techniques
have been described by many other au-
thors, e.g., Bernstein and Goodman
[1981], Bernstein et al. [1987], Gray and
Reuter [1991], and Haerder and Reuter
[1983].
Most research into providing ACID se-
mantics focuses on efficient techniques
for processing very large numbers of
relatively small requests. For example,
increasing the balance of one account and
decreasing the balance of another account
require exclusive access to only two
database records and writing some
information to an update log. Current
research and development efforts in
transaction processing target hundreds
and even thousands of small transactions
per second [Davis 1992; Serlin 1991].
1 A standard example for this danger is the "Halloween" problem: Consider the request to "give all employees with salaries greater than $30,000 a 3% raise." If (i) these employees are found using an index on salaries, (ii) index entries are scanned in increasing salary order, and (iii) the index is updated immediately as index entries are found, then each qualifying employee will get an infinite number of raises.
Query processing, on the other hand, fo-
cuses on extracting information from a
large amount of data without actually
changing the database. For example,
printing reports for each branch office
with average salaries of employees under
30 years old requires shared access to a
large number of records. Mixed requests
are also possible, e.g., for crediting
monthly earnings to a stock account by
combining information about a number
of sales transactions. The techniques dis-
cussed here apply to the search effort for
such a mixed request, e.g., for finding the
relevant sales transactions for each stock
account.
Embedded queries, i.e., database
queries that are contained in an applica-
tion program written in a standard pro-
gramming language such as Cobol, PL/1,
C, or Fortran, are also not addressed
specifically in this paper because all
techniques discussed here can be used
for interactive as well as embedded
queries. Embedded queries usually are
optimized when the program is compiled,
in order to avoid the optimization over-
head when the program runs. This
method was pioneered in System R, in-
cluding mechanisms for storing opti-
mized plans and invalidating stored plans
when they become infeasible, e.g., when
an index is dropped from the database
[Chamberlin et al. 1981b]. Of course, the
cut between compile-time and run-time
can be placed at any other point in the
sequence in Figure 2.
Recursive queries are omitted from this
survey, because the entire field of recur-
sive query processing—optimization
rules and heuristics, selectivity and cost
estimation, algorithms and their paral-
lelization—is still developing rapidly
(suffice it to point to two recent surveys
by Bancilhon and Ramakrishnan [1986]
and Cacace et al. [1993]).
The present paper surveys query exe-
cution techniques; other surveys that
pertain to the wide subject of database
systems have considered data models and
query languages [Gallaire et al. 1984;
Hull and King 1987; Jarke and Vassiliou
1985; McKenzie and Snodgrass 1991;
Peckham and Maryanski 1988], access
methods [Comer 1979; Enbody and Du
1988; Faloutsos 1985; Samet 1984;
Sockut and Goldberg 1979], compression
techniques [Bell et al. 1989; Lelewer and
Hirschberg 1987], distributed and het-
erogeneous systems [Batini et al. 1986;
Litwin et al. 1990; Sheth and Larson
1990; Thomas et al. 1990], concurrency
control and recovery [Barghouti and
Kaiser 1991; Bernstein and Goodman
1981; Gray et al. 1981; Haerder and
Reuter 1983; Knapp 1987], availability
and reliability [Davidson et al. 1985; Kim
1984], query optimization [Jarke and
Koch 1984; Mannino et al. 1988; Yu and
Chang 1984], and a variety of other
database-related topics [Adam and Wort-
mann 1989; Atkinson and Bunemann
1987; Katz 1990; Kemper and Wallrath
1987; Lyytinen 1987; Teorey et al. 1986].
Bitton et al. [1984] have discussed a
number of parallel-sorting techniques,
only a few of which are really used in
database systems. Mishra and Eich’s
[1992] recent survey of relational join al-
gorithms compares their behavior using
diagrams derived from one by Kitsure-
gawa et al. [1983] and also describes join
methods using index structures and join
methods for distributed systems. The
present survey is much broader in scope
as it also considers system architectures
for complex query plans and for parallel
execution, selection and aggregation al-
gorithms, the relationship of sorting and
hashing as it pertains to database query
processing, special operations for nontra-
ditional data models, and auxiliary tech-
niques such as compression.
Section 1 discusses the architecture of
query execution engines. Sorting and
hashing, the two general approaches to
managing and matching elements of large
sets, are described in Section 2. Section 3
focuses on accessing large data sets on
disk. Section 4 begins the discussion of
actual data manipulation methods with
algorithms for aggregation and duplicate
removal, continued in Section 5 with bi-
nary matching operations such as join
and intersection and in Section 6 with
operations for universal quantification.
Section 7 reviews the many dualities between sorting and hashing and points out their differences that have an impact on the performance of algorithms based on either one of these approaches. Execution of very complex query plans with many operators and with nontrivial plan shapes is discussed in Section 8. Section 9 is devoted to mechanisms for parallel
execution, including architectural issues
and load balancing, and Section 10
discusses specific parallel algorithms.
Section 11 outlines some nonstandard
operators for emerging database appli-
cations such as statistical and scientific
database management systems. Section
12 is a potpourri of additional techniques
that enhance the performance of many
algorithms, e.g., compression, precompu-
tation, and specialized hardware. The fi-
nal section contains a brief summary and an outlook on query processing research
and its future.
For readers who are more interested in some topics than others, most sections are fairly self-contained. Moreover, the hurried reader may want to skip the derivation of cost functions;2 their results and effects are summarized later in diagrams.
1. ARCHITECTURE OF QUERY
EXECUTION ENGINES
This survey focuses on useful mecha-
nisms for processing sets of items. These
items can be records, tuples, entities, or
objects. Furthermore, most of the tech-
2 In any case, our cost functions cover only a limited, though important, aspect of query execution cost, namely I/O effort.
niques discussed in this survey apply to
sequences, not only sets, of items, al-
though most query processing research
has assumed relations and sets. All query
processing algorithm implementations it-
erate over the members of their input
sets; thus, sets are always represented
by sequences. Sequences can be used to
represent not only sets but also other
one-dimensional "bulk" types such as
lists, arrays, and time series, and many
database query processing algorithms
and techniques can be used to manipu-
late these other bulk types as well as
sets. The important point is to think of
these algorithms as algebra operators
consuming zero or more inputs (sets or
sequences) and producing one (or some-
times more) outputs. A complete query
execution engine consists of a collection
of operators and mechanisms to execute
complex expressions using multiple oper-
ators, including multiple occurrences of
the same operator. Taken as a whole, the
query processing algorithms form an al-
gebra which we call the physical algebra
of a database system.
The physical algebra is equivalent to,
but quite different from, the logical alge-
bra of the data model or the database
system. The logical algebra is more
closely related to the data model and
defines what queries can be expressed in
the data model; for example, the rela-
tional algebra is a logical algebra. A
physical algebra, on the other hand, is
system specific. Different systems may
implement the same data model and the
same logical algebra but may use very
different physical algebras. For example,
while one relational system may use only
nested-loops joins, another system may
provide both nested-loops join and
merge-join, while a third one may rely
entirely on hash join algorithms. (Join
algorithms are discussed in detail later
in the section on binary matching opera-
tors and algorithms.)
Another significant difference between
logical and physical algebras is the fact
that specific algorithms and therefore
cost functions are associated only with
physical operators, not with logical alge-
bra operators. Because of the lack of an
algorithm specification, a logical algebra
expression is not directly executable and
must be mapped into a physical algebra
expression. For example, it is impossible
to determine the execution time for the
left expression in Figure 3, i.e., a logical
algebra expression, without mapping it
first into a physical algebra expression
such as the query evaluation plan on the
right of Figure 3. This mapping process
can be trivial in some database systems
but usually is fairly complex in real database systems because it involves algorithm choices and because logical and
physical operators frequently do not map
directly into one another, as shown in the
following four examples. First, some op-
erators in the physical algebra may im-
plement multiple logical operators. For
example, all serious implementations of
relational join algorithms include a facil-
ity to output fewer than all attributes,
i.e., a relational delta-project (a projection without duplicate removal) is included in the physical join operator. Sec-
ond, some physical operators implement
only part of a logical operator. For exam-
ple, a duplicate removal algorithm imple-
ments only the “second half” of a rela-
tional projection operator. Third, some
physical operators do not exist in the
logical algebra. Concretely, a sort opera-
tor has no place in pure relational alge-
bra because it is an algebra of sets, and sets are, by their definition, unordered.
Finally, some properties that hold for
logical operators do not hold, or only with
some qualifications, for the counterparts
in physical algebra. For example, while
intersection and union are entirely sym-
metric and commutative, algorithms im-
plementing them (e.g., nested loops or
hybrid hash join) do not treat their two
inputs equally.
The difference of logical and physical
algebras can also be looked at in a differ-
ent way. Any database system raises the
level of abstraction above files and
records; to do so, there are some logical
type constructors such as tuple, relation,
set, list, array, pointer, etc. Each logical
type constructor is complemented by
[Figure 3. Logical and physical algebra expressions: a logical expression, the Intersection of Set A and Set B, next to an equivalent physical plan, a Merge-Join (Intersect) over sorts of File Scan A and File Scan B.]
some operations that are permitted on
instances of such types, e.g., attribute
extraction, selection, insertion, deletion,
etc.
On the physical or representation level,
there is typically a smaller set of repre-
sentation types and structures, e.g., file,
record, record identifier (RID), and maybe
very large byte arrays [Carey et al. 1986].
For manipulation, the representation
types have their own operations, which
will be different from the operations on
logical types. Multiple logical types and
type constructors can be mapped to the
same physical concept. There may also be situations in which one logical type con-
structor can be mapped to multiple phys-
ical concepts, e.g., a set depending on its
size. The mapping from logical types to
physical representation types and struc-
tures is called physical database design.
Query optimization is the mapping from
logical to physical operations, and the
query execution engine is the imple-
mentation of operations on physical rep-
resentation types and of mechanisms
for coordination and cooperation among
multiple such operations in complex que-
ries. The policies for using these mech-
anisms are part of the query optimizer.
Synchronization and data transfer be-
tween operators is the main issue to be
addressed in the architecture of the query
execution engine. Imagine a query with
two joins, and consider how the result of
the first join is passed to the second one.
The simplest method is to create (write)
and read a temporary file. The need for
temporary files, whether they are kept in
the buffer or not, is a direct result of
executing an operator’s input subplans
completely before starting the operator.
Alternatively, it is possible to create one
process for each operator and then to use
interprocess communication mechanisms
(e.g., pipes) to transfer data between op-
erators, leaving it to the operating sys-
tem to schedule and suspend operator
processes as pipes are full or empty.
While such data-driven execution re-
moves the need for temporary disk files,
it introduces another cost, that of operat-
ing system scheduling and interprocess
communication. In order to avoid both
temporary files and operating system
scheduling, Freytag and Goodman [1989]
proposed writing rule-based translation
programs that transform a plan repre-
sented as a tree structure into a single
iterative program with nested loops and
other control structures. However, the re-
quired rule set is not simple, in particu-
lar for algorithms with complex control
logic such as sorting, merge-join, or even
hybrid hash join (to be discussed later in
the section on matching).
The most practical alternative is to im-
plement all operators in such a way that
they schedule each other within a single
operating system process. The basic idea
is to define a granule, typically a single
record, and to iterate over all granules
comprising an intermediate query result.3
Each time an operator needs another
granule, it calls its input (operator) to
produce one. This call is a simple pro-
3 It is possible to use multiple granule sizes within a single query-processing system and to provide special operators with the sole purpose of translating from one granule size to another. An example is a query processing system that uses records as an
iteration granule except for the inputs of merge-join
(see later in the section on binary matching), for
which it uses “value packets,” i.e., groups of records
with equal join attribute values.
cedure call, much cheaper than inter-
process communication since it does not
involve the operating system. The calling
operator waits (just as any calling rou-
tine waits) until the input operator has
produced an item. That input operator,
in a complex query plan, might require
an item from its own input to produce an
item; in that case, it calls its own input
(operator) to produce one. Two important
features of operators implemented in this
way are that they can be combined into
arbitrarily complex query evaluation
plans and that any number of operators
can execute and schedule each other in a
single process without assistance from or
interaction with the underlying operat-
ing system. This model of operator imple-
mentation and scheduling resembles very
closely those used in relational systems,
e.g., System R (and later SQL/DS and
DB2), Ingres, Informix, and Oracle, as
well as in experimental systems, e.g., the
E programming language used in EXO-
DUS [Richardson and Carey 1987], Gen-
esis [Batory et al. 1988a; 1988b], and
Starburst [Haas et al. 1989; 1990]. Oper-
ators implemented in this model are called iterators, streams, synchronous pipelines, row-sources, or similar names in the "lingo" of commercial systems.
To make the implementation of opera-
tors a little easier, it makes sense to
separate the functions (a) to prepare an
operator for producing data, (b) to pro-
duce an item, and (c) to perform final
housekeeping. In a file scan, these func-
tions are called open, next, and close
procedures; we adopt these names for all
operators. Table 1 gives a rough idea of
what the open, next, and close proce-
dures for some operators do, as well as
the principal local state that needs to be
saved from one invocation to the next.
(Later sections will discuss sort and join
operations in detail.) The first three
examples are trivial, but the hash join
operator shows how an operator can
schedule its inputs in a nontrivial
manner. The interesting observations
are that (i) the entire query plan is exe-
cuted within a single process, (ii) oper-
ators produce one item at a time on
request, (iii) this model effectively implements, within a single process, (special-purpose) coroutines and demand-driven data flow, (iv) items never wait in a temporary file or buffer between operators because they are never produced before they are needed, (v) therefore this model is very efficient in its time-space-product memory costs, (vi) iterators can schedule any tree, including bushy trees (see below), and (vii) no operator is affected by the complexity of the whole plan, i.e., this model of operator implementation and synchronization works for simple as well as very complex query plans. As a final remark, there are effective ways to combine the iterator model with parallel query processing, as will be discussed in Section 9.
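To make this control flow concrete, the following minimal C sketch shows one plausible shape for such an iterator: a structure of open, next, and close function pointers plus a pointer to operator-private state, with a selection operator driving a scan. All names and the int-item representation are illustrative assumptions, not the interface of any particular system.

```c
#include <stdio.h>

/* A generic iterator: three entry points plus operator-private state.
   next returns 1 and fills *out with the next item, or 0 at end of stream. */
typedef struct Iterator Iterator;
struct Iterator {
    void (*open)(Iterator *self);
    int  (*next)(Iterator *self, void **out);
    void (*close)(Iterator *self);
    void *state;                          /* operator-specific state record */
};

/* A trivial in-memory "scan" over an array of ints. */
typedef struct { int *items; int count; int pos; } ScanState;

static void scan_open(Iterator *self)  { ((ScanState *)self->state)->pos = 0; }
static void scan_close(Iterator *self) { (void)self; /* nothing to release */ }
static int  scan_next(Iterator *self, void **out) {
    ScanState *s = self->state;
    if (s->pos >= s->count) return 0;     /* end of stream */
    *out = &s->items[s->pos++];
    return 1;
}

/* A selection: calls next on its input until an item qualifies. */
typedef struct { Iterator *input; int (*pred)(const void *); } SelectState;

static void select_open(Iterator *self)  { SelectState *s = self->state; s->input->open(s->input); }
static void select_close(Iterator *self) { SelectState *s = self->state; s->input->close(s->input); }
static int  select_next(Iterator *self, void **out) {
    SelectState *s = self->state;
    while (s->input->next(s->input, out))
        if (s->pred(*out)) return 1;      /* pass a qualifying item upward */
    return 0;
}

static int is_even(const void *item) { return *(const int *)item % 2 == 0; }

int main(void) {
    int data[] = {1, 2, 3, 4, 5, 6};
    ScanState  ss = { data, 6, 0 };
    Iterator scan  = { scan_open, scan_next, scan_close, &ss };
    SelectState fs = { &scan, is_even };
    Iterator sel   = { select_open, select_next, select_close, &fs };

    void *item;
    sel.open(&sel);                       /* the plan schedules itself: */
    while (sel.next(&sel, &item))         /* one procedure call per item */
        printf("%d\n", *(int *)item);
    sel.close(&sel);
    return 0;
}
```

Note that nothing in select_next knows which operator feeds it; its input is just another iterator, which is precisely what allows arbitrarily complex plans to be assembled and scheduled within a single process.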
Since query plans are algebra expres-
sions, they can be represented as trees.
Query plans can be divided into prototyp-
ical shapes, and query execution engines
can be divided into groups according to
which shapes of plans they can evaluate.
Figure 4 shows prototypical left-deep,
right-deep, and bushy plans for a join of
four inputs. Left-deep and right-deep
plans are different because join al-
gorithms use their two inputs in differ-
ent ways; for example, in the nested-loops
join algorithm, the outer loop iterates
over one input (usually drawn as left
input) while the inner loop iterates over
the other input. The set of bushy plans is
the most general as it includes the sets of
both left-deep and right-deep plans.
These names are taken from Graefe and
DeWitt [1987]; left-deep plans are also
called “linear processing trees”
[Krishnamurthy et al. 1986] or “plans
with no composite inner” [Ono and
Lehman 1990].
For queries with common subexpres-
sions, the query evaluation plan is not a
tree but an acyclic directed graph (DAG).
Most systems that identify and exploit
common subexpressions execute the plan
equivalent to a common subexpression
separately, saving the intermediate re-
sult in a temporary file to be scanned
repeatedly and destroyed after the last
Table 1. Examples of Iterator Functions

Print
  Open: open input
  Next: call next on input; format the item on screen
  Close: close input

Scan
  Open: open file
  Next: read next item
  Close: close file
  Local State: open file descriptor

Select
  Open: open input
  Next: call next on input until an item qualifies
  Close: close input

Hash join (without overflow resolution)
  Open: allocate hash directory; open left "build" input; build hash table calling next on build input; close build input; open right "probe" input
  Next: call next on probe input until a match is found
  Close: close probe input; deallocate hash directory
  Local State: hash directory

Merge-Join (without duplicates)
  Open: open both inputs
  Next: get next item from input with smaller key until a match is found
  Close: close both inputs

Sort
  Open: open input; build all initial run files calling next on input; close input; merge run files until only one merge step is left
  Next: determine next output item; read new item from the correct run file
  Close: destroy remaining run files
  Local State: merge heap, open file descriptors for run files
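The hash join row of Table 1 is the one nontrivial scheduler: its open drains the build input completely before the probe input is even opened, while next consumes probe items lazily. A minimal sketch in the same hypothetical C interface as above; int keys, a fixed-size chained hash table, no overflow resolution, and a zero-initialized state record are all simplifying assumptions.

```c
#include <stdlib.h>   /* malloc, free; Iterator is the struct sketched earlier */

#define BUCKETS 1024

typedef struct Node { int key; struct Node *next; } Node;

typedef struct {
    Iterator *build, *probe;   /* the two input subplans */
    Node *dir[BUCKETS];        /* hash directory, filled during open */
    Node *candidate;           /* rest of the bucket for the current probe item */
    int probe_key;
} HashJoinState;

static void hj_open(Iterator *self) {
    HashJoinState *s = self->state;
    void *item;
    /* Phase 1: drain the build input completely into the hash table. */
    s->build->open(s->build);
    while (s->build->next(s->build, &item)) {
        Node *n = malloc(sizeof *n);
        n->key = *(int *)item;
        n->next = s->dir[n->key % BUCKETS];
        s->dir[n->key % BUCKETS] = n;
    }
    s->build->close(s->build);
    /* Phase 2 (probing) is driven lazily, one call to next at a time. */
    s->probe->open(s->probe);
    s->candidate = NULL;
}

static int hj_next(Iterator *self, void **out) {
    HashJoinState *s = self->state;
    void *item;
    for (;;) {
        while (s->candidate) {             /* finish the current bucket first */
            Node *n = s->candidate;
            s->candidate = n->next;
            if (n->key == s->probe_key) { *out = &n->key; return 1; }
        }
        if (!s->probe->next(s->probe, &item))
            return 0;                      /* probe input exhausted */
        s->probe_key = *(int *)item;
        s->candidate = s->dir[s->probe_key % BUCKETS];
    }
}

static void hj_close(Iterator *self) {
    HashJoinState *s = self->state;
    s->probe->close(s->probe);
    for (int i = 0; i < BUCKETS; i++)      /* deallocate the hash directory */
        while (s->dir[i]) { Node *n = s->dir[i]; s->dir[i] = n->next; free(n); }
}
```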
[Figure 4. Left-deep, bushy, and right-deep plans: three shapes of join plans over four inputs A, B, C, and D.]
scan. Each plan fragment that is exe-
cuted as a unit is indeed a tree. The
alternative is a “split” iterator that can
deliver data to multiple consumers, i.e.,
that can be invoked as an iterator by multi-
ple consumer iterators. The split iterator
paces its input subtree as fast as the
fastest consumer requires it and holds
items until the slowest consumer has
consumed them. If the consumers re-
quest data at about the same rate, the
split operator does not require a tempo-
rary spool file; such a file and its associ-
ated 1/0 cost are required only if the
data rate required by the consumers diverges above some predefined threshold.
Among the implementations of itera-
tors for query processing, one group can
be called "stored-set oriented" and the
other “algebra oriented.” In System R, an
example for the first group, complex join
plans are constructed using binary join
iterators that “attach” one more set
(stored relation) to an existing intermedi-
ate result [Astrahan et al. 1976; Lorie
and Nilsson 1979], a design that sup-
ports only left-deep plans. This design
led to a significant simplification of the
System R optimizer which could be based
on dynamic programming techniques, but
it ignores the optimal plan for some
queries [Selinger et al. 1979].4 A similar
design was used, although not strictly
required by the design of the execution
engine, in the Gamma database machine
[DeWitt et al. 1986; 1990; Gerber 1986].
On the other hand, some systems use
binary operators for which both inputs
can be intermediate results, i.e., the out-
put of arbitrarily complex subplans. This
design is more general as it also permits
bushy plans. Examples for this approach
are the second query processing engine of
Ingres based on Kooi’s thesis [Kooi 1980;
Kooi and Frankforth 1982], the Starburst
execution engine [Haas et al. 1989], and
the Volcano query execution engine
[Graefe 1993b]. The tradeoff between
left-deep and bushy query evaluation
plans is reduction of the search space in
the query optimizer against generality of
the execution engine and efficiency for
some queries. Right-deep plans have only
recently received more interest and may
actually turn out to be very efficient, in
particular in systems with ample mem-
ory [Schneider 1990; Schneider and De-
Witt 1990].
The remainder of this section provides
more details of how iterators are imple-
mented in the Volcano extensible query
processing system. We use this system
repeatedly as an example in this survey
because it provides a large variety of
mechanisms for database query process-
ing, but mostly because its model of oper-
ator implementation and scheduling
resembles very closely those used in
many relational and extensible systems.
The purpose of this section is to provide
implementation concepts from which a
new query processing engine could be
derived.
Figure 5 shows how iterators are rep-
resented in Volcano. A box stands for a
4 Since each operator in such a query execution system will access a permanent relation, the name "access path selection" used for System R optimization, although including and actually focusing on join optimization, was entirely correct and more descriptive than "query optimization."
record structure in Volcano’s implemen-
tation language (C [Kernighan and
Ritchie 1978]), and an arrow represents
a pointer. Each operator in a query eval-
uation plan consists of two record structures, a small structure of four pointers and a state record. The small structure is
the same for all algorithms. It represents
the stream or iterator abstraction and
can be invoked with the open, next, and
close procedures. The purpose of state
records is similar to that of activation
records allocated by compiler-generated code upon entry into a procedure. Both
hold values local to the procedure or the
iterator. Their main difference is that
activation records reside on the stack and
vanish upon procedure exit, while state
records must persist from one invocation of the iterator to the next, e.g., from the invocation of open to each invocation of next and the invocation of close. Thus,
state records do not reside on the stack
but in heap space.
The type of state records is different
for each iterator as it contains iterator-
specific arguments and local variables
(state) while the iterator is suspended,
e.g., currently not active between invoca-
tions of the operator's next procedure.
Query plan nodes are linked together by means of input pointers, which are also kept in the state records. Since pointers to functions are used extensively in this design, all operator code (i.e., the open, next, and close procedures) can be written in such a way that the names of input operators and their iterator procedures are not "hard-wired" into the code, and the operator modules do not need to be recompiled for each query. Furthermore, all operations on individual items, e.g., printing, are imported into Volcano operators as functions, making the operators independent of the semantics and representation of items in the data streams they are processing. This organi-
zation using function pointers for input
operators is fairly standard in commer-
cial database management systems.
In order to make this discussion more
concrete, Figure 5 shows two operators in
a query evaluation plan that prints se-
ACM Computmg Surveys, Vol 25, No 2, June 1993
Query Evaluation Techniques ● 83
* open-filter
* next-filter
e
+ close-filter
1 [
Arguments I Input I State
I t I I + open-filescan
T + next-filescan
e
print () + close-filescan
I i
I 3+
Arguments i Input 1 State
1 1 I
predicate () (none)
Figure 5. Two operators in a Volcano query plan.
Iected records from a file. The purpose
and capabilities of the filter operator in
Volcano include printing items of a
stream using a print function passed to
the filter operator as one of its argu-
ments. The small structure at the top
gives access to the filter operator’s itera-
tor functions (the open, next, and close
procedures) as well as to its state record.
Using a pointer to this structure, the
open, next, and close procedures of the
filter operator can be invoked, and their
local state can be passed to them as a
procedure argument. The filter’s iterator
functions themselves, e.g., open-filter, can
use the input pointer contained in the
state record to invoke the input operator’s
functions, e.g., open-file-scan. Thus, the
filter functions can invoke the file scan
functions as needed and can pace the file
scan according to the needs of the filter.
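A minimal C rendering of the arrangement in Figure 5, under the same hypothetical iterator interface sketched in Section 1; the field and procedure names are illustrative, not Volcano's actual declarations. The point is that the filter names neither the file scan nor the item type: its input is a pointer set at plan-construction time, and the predicate and print functions are imported item operations.

```c
/* Requires the Iterator struct from the earlier sketch. */
typedef struct {
    Iterator *input;                      /* set when the plan is assembled */
    int  (*predicate)(const void *item);  /* imported item operation (may be NULL) */
    void (*print)(const void *item);      /* imported item operation (may be NULL) */
} FilterState;

static void filter_open(Iterator *self)  { FilterState *s = self->state; s->input->open(s->input); }
static void filter_close(Iterator *self) { FilterState *s = self->state; s->input->close(s->input); }

static int filter_next(Iterator *self, void **out) {
    FilterState *s = self->state;
    while (s->input->next(s->input, out)) {       /* pace the input as needed */
        if (s->predicate && !s->predicate(*out)) continue;
        if (s->print) s->print(*out);             /* e.g., format item on screen */
        return 1;
    }
    return 0;                                     /* input exhausted */
}
```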
In this section, we have discussed gen-
eral physical algebra issues and syn-
chronization and data transfer between
operators. Iterators are relatively
straightforward to implement and are
suitable building blocks for efficient, ex-
tensible query processing engines. In the
following sections, we consider individual
operators and algorithms including a
comparison of sorting and hashing, de-
tailed treatment of parallelism, special
operators for emerging database applica-
tions such as scientific databases, and
auxiliary techniques such as precompu-
tation and compression.
2. SORTING AND HASHING
Before discussing specific algorithms, two
general approaches to managing sets of
data are introduced. The purpose of many
query-processing algorithms is to per-
form some kind of matching, i.e., bring-
ing items that are “alike” together and
performing some operation on them.
There are two basic approaches used for
this purpose, sorting and hashing. This
pair permeates many aspects of query
processing, from indexing and clustering
over aggregation and join algorithms to
methods for parallelizing database oper-
ations. Therefore, we discuss these ap-
proaches first in general terms, without
regard to specific algorithms. After a sur-
vey of specific algorithms for unary (ag-
gregation, duplicate removal) and binary
(join, semi-join, intersection, division,
etc.) matching problems in the following
sections, the duality of sort- and hash-
based algorithms is discussed in detail.
2.1 Sorting
Sorting is used very frequently in
database systems, both for presentation
to the user in sorted reports or listings
and for query processing in sort-based
algorithms such as merge-join. There-
fore, the performance effects of the many
algorithmic tricks and variants of exter-
nal sorting deserve detailed discussion in
this survey. All sorting algorithms actu-
ally used in database systems use merg-
ing, i.e., the input data are written into initial sorted runs and then merged into larger and larger runs until only one run is left, the sorted output. Only in the unusual case that a data set is smaller than the available memory can in-memory techniques such as quicksort be used. An excellent reference for many of the issues discussed here is Knuth [1973], who analyzes algorithms much more accurately than we do in this introductory survey.
In order to ensure that the sort module
interfaces well with the other operators,
e.g., file scan or merge-join, sorting should
be implemented as an iterator, i.e., with open, next, and close procedures as all
other operators of the physical algebra.
In the Volcano query-processing system
(which is based on iterators), most of the
sort work is done during open-sort
[Graefe 1990a; 1993b]. This procedure
consumes the entire input and leaves ap-
propriate data structures for next-sort to
produce the final sorted output. If the
entire input fits into the sort space in
main memory, open-sort leaves a sorted
array of pointers to records in I/O buffer
memory which is used by next-sort to
produce the records in sorted order. If the input is larger than main memory,
the open-sort procedure creates sorted
runs and merges them until only one final merge phase is left. The last merge
step is performed in the next-sort proce-
dure, i.e., when demanded by the con-
sumer of the sorted stream, e.g., a
merge-join. The input to the sort module
must be an iterator, and sort uses open,
next, and close procedures to request its
input; therefore, sort input can come from
a scan or a complex query plan, and the
sort operator can be inserted into a query
plan at any place or at several places.
Table 2, which summarizes a taxon-
omy of parallel sort algorithms [Graefe
1990a], indicates some main characteris-
tics of database sort algorithms. The first
few items apply to any database sort and
will be discussed in this section. The
questions pertaining to parallel inputs
and outputs and to data exchange will be
considered in a later section on parallel
algorithms, and the last question regard-
ing substitute sorts will be touched on in
the section on surrogate processing.
All sort algorithms try to exploit the
duality between main-memory mergesort
and quicksort. Both of these algorithms
are recursive divide-and-conquer algo-
rithms. The difference is that merge sort
first divides physically and then merges
on logical keys, whereas quicksort first
divides on logical keys and then com-
bines physically by trivially concatenat-
ing sorted subarrays. In general, one of
the two phases—dividing and combining
—is based on logical keys, whereas the
other arranges data items only physi-
cally. We call these the logical and the
physical phases. Sorting algorithms for
very large data sets stored on disk or tape are also based on dividing and combining. Usually, there are two distinct subalgorithms, one for sorting within
main memory and one for managing sub-
sets of the data set on disk or tape. The
choices for mapping logical and physical
phases to dividing and combining steps
are independent for these two subalgo-
rithms. For practical reasons, e.g., ensur-
ing that a run fits into main memory, the
disk management algorithm typically
uses physical dividing and logical com-
bining (merging). A point of practical im-
portance is the fan-in or degree of
merging, but this is a parameter rather
than a defining algorithm property.
There are two alternative methods for
creating initial runs, also called "level-0
runs” here. First, an in-memory sort al-
gorithm can be used, typically quicksort.
Using this method, each run will have
the size of allocated memory, and the
number of initial runs W will be W = ⌈R/M⌉ for input size R and memory size
M. (Table 3 summarizes variables and
their meaning in cost calculations in this
survey.) Second, runs can be produced
using replacement selection [Knuth
1973]. Replacement selection starts by
filling memory with items which are or-
ganized into a priority heap, i.e., a data
structure that efficiently supports the op-
erations insert and remove-smallest.
Next, the item with the smallest key is
Table 2. A Taxonomy of Database Sorting Algorithms

Determinant: Possible Options
Input division: Logical keys (partitioning) or physical division
Result combination: Logical keys (merging) or physical concatenation
Main-memory sort: Quicksort or replacement selection
Merging: Eager or lazy or semi-eager; lazy and semi-eager with or without optimizations
Read-ahead: No read-ahead or double-buffering or read-ahead with forecasting
Input: Single-stream or parallel
Output: Single-stream or parallel
Number of data exchanges: One or multiple
Data exchange: Before or after local sort
Sort objects: Original records or key-RID pairs (substitute sort)
removed from the priority heap and writ-
ten to a run file and then immediately
replaced in the priority heap with an-
other item from the input. With high
probability, this new item has a key
larger than the item just written and
therefore will be included in the same
run file. Notice that if this is the case,
the first run file will be larger than mem-
ory. Now the second item (the currently
smallest item in the priority heap) is
written to the run file and is also re-
placed immediately in memory by an-
other item from the input. This process
repeats, always keeping the memory and
the priority heap entirely filled. If a new
item has a key smaller than the last key
written, the new item cannot be included
in the current run file and is marked for
the next run file. In comparisons among
items in the heap, items marked for the
current run file are always considered
“smaller” than items marked for the next
run file. Eventually, all items in memory
are marked for the next run file, at which
point the current run file is closed, and a
new one is created.
Using replacement selection, run files
are typically larger than memory. If the
input is already sorted or almost sorted,
there will be only one run file. This situa-
tion could arise, for example, if a file is
sorted on field A but should be sorted on
A as major and B as the minor sort key.
If the input is sorted in reverse order,
which is the worst case, each run file will
be exactly as large as memory. If the
input is random, the average run file will
be twice the size of memory, except the
first few runs (which get the process
started) and the last run. On the aver-
age, the expected number of runs is about
W = ⌈R/(2 × M)⌉ + 1, i.e., about half as
many runs as created with quicksort. A
more detailed discussion and an analysis
of replacement selection were provided
by Knuth [1973].
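The following C sketch illustrates replacement selection under simplifying assumptions: integer keys, a tiny fixed heap capacity standing in for all of memory, and a write_to_run stub in place of real run files. Ordering heap entries by (run number, key) is what makes items marked for the next run file compare as "larger" than every item still belonging to the current run.

```c
#include <limits.h>
#include <stdio.h>

#define HEAP_CAP 4   /* stands in for "all of memory" */

typedef struct { int run, key; } Entry;
static Entry heap[HEAP_CAP];
static int heap_size = 0;

static int less(Entry a, Entry b) {
    return a.run != b.run ? a.run < b.run : a.key < b.key;
}
static void swap(int i, int j) { Entry t = heap[i]; heap[i] = heap[j]; heap[j] = t; }
static void sift_up(int i) {
    while (i > 0 && less(heap[i], heap[(i - 1) / 2])) { swap(i, (i - 1) / 2); i = (i - 1) / 2; }
}
static void sift_down(int i) {
    for (;;) {
        int c = 2 * i + 1;
        if (c >= heap_size) return;
        if (c + 1 < heap_size && less(heap[c + 1], heap[c])) c++;
        if (!less(heap[c], heap[i])) return;
        swap(i, c); i = c;
    }
}
static void write_to_run(int run, int key) { printf("run %d: %d\n", run, key); }

int main(void) {
    int input[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0}, n = 10, pos = 0;
    int current_run = 0, last_key = INT_MIN;

    while (pos < n && heap_size < HEAP_CAP) {      /* fill memory */
        heap[heap_size] = (Entry){0, input[pos++]};
        sift_up(heap_size++);
    }
    while (heap_size > 0) {
        Entry top = heap[0];
        if (top.run > current_run) {               /* all of memory is marked for */
            current_run = top.run;                 /* the next run: close the file */
            last_key = INT_MIN;                    /* and start a new one */
        }
        write_to_run(top.run, top.key);
        last_key = top.key;
        if (pos < n) {                             /* replace immediately from input; */
            int k = input[pos++];                  /* smaller than the last key written */
            heap[0].run = k < last_key ? current_run + 1 : current_run;
            heap[0].key = k;                       /* means: mark for the next run */
        } else {
            heap[0] = heap[--heap_size];
        }
        sift_down(0);
    }
    return 0;
}
```

Running this on the sample input produces two runs (1 3 5 7 8 9, then 0 2 4 6), a first run longer than the heap capacity, as the text predicts.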
An additional difference between
quicksort and replacement selection is
the resulting I/O pattern during run cre-
ation. Quicksort results in bursts of reads
and writes for entire memory loads from
the input file and to initial run files,
while replacement selection alternates
between individual read and write opera-
tions. If only a single device is used,
quicksort may result in faster I/O be-
cause fewer disk arm movements are re-
quired. However, if different devices are
used for input and temporary files, or if
the input comes as a stream from an-
other operator, the alternating behavior
of replacement selection may permit more
overlap of I/O and processing and there-
fore result in faster sorting.
The problem with replacement selec-
tion is memory management. If input
items are kept in their original pages in
the buffer (in order to save copying data,
a real concern for large data volumes)
each page must be kept in the buffer
until its last record has been written to a
run file. On the average, half a page’s
records will be in the priority heap. Thus,
the priority heap must be reduced to half
the size (the number of items in the heap
Table 3. Variables, Their Meaning, and Units

M: Memory size (pages)
R, S: Inputs or their sizes (pages)
C: Cluster or unit of I/O (pages)
F, K: Fan-in or fan-out (none)
W: Number of level-0 run files (none)
L: Number of merge levels (none)
is one half the number of records that fit
into memory), canceling the advantage of
longer and fewer run files. The solution
to this problem is to copy records into a
holding space and to keep them there
while they are in the priority heap and
until they are written to a run file. If the
input items are of varying sizes, memory
management is more complex than for
quicksort because a new item may not fit
into the space vacated in the holding
space by the last item written into a run
file. Solutions to this problem will intro-
duce memory management overhead and
some amount of fragmentation, i.e., the
size of runs will be less than twice the
size of memory. Thus, the advantage of
having fewer runs must be balanced with
the different I/O pattern and the dis-
advantage of more complex memory
management.
The level-0 runs are merged into level-
1 runs, which are merged into level-2
runs, etc., to produce the sorted output.
During merging, a certain amount of
buffer memory must be dedicated to each
input run and the merge output. We call
the unit of I/O a cluster in this survey,
which is a number of pages located con-
tiguously on disk. We indicate the cluster
size with C, which we measure in pages
just like memory and input sizes. The
number of I/O clusters that fit in memory is the quotient of memory size and cluster size. The maximal merge fan-in F, i.e., the number of runs that can be merged at one time, is this quotient minus one cluster for the output. Thus, F = ⌊M/C⌋ - 1. Since the sizes of runs grow
by a factor F from level to level, the
number of merge levels L, i.e., the num-
ber of times each item is written to a run
file, is logarithmic with the input size,
namely L = ⌈log_F(W)⌉.
There are four considerations that can
improve the merge efficiency. The first
two issues pertain to scheduling of 1/0
operations, First, scans are faster if
read-ahead and write-behind are used;
therefore, double-buffering using two
pages of memory per input run and two
for the merge output might speed the
merge process [Salzberg 1990; Salzberg
et al. 1990]. The obvious disadvantage is
that the fan-in is cut in half. However,
instead of reserving 2 × F + 2 clusters, a
predictive method called forecasting can
be employed in which the largest key in
each input buffer is used to determine
from which input run the next cluster
will be read. Thus, the fan-in can be set
to any number in the range ⌊M/(2 × C)⌋ - 2 ≤ F ≤ ⌊M/C⌋ - 1. One or two
read-ahead buffers per input disk are
sufficient, and F = ⌊M/C⌋ - 3 will be
reasonable in most cases because it uses
maximal fan-in with one forecasting in-
put buffer and double-buffering for the
merge output.
Second, if the operating system and
the I/O hardware support them, using
large cluster sizes for the run files is very
beneficial. Larger cluster sizes will re-
duce the fan-in and therefore may in-
crease the number of merge levels.5
However, each merging level is per-
formed much faster because fewer I/O
operations and disk seeks and latency
delays are required. Furthermore, if the
unit of I/O is equal to a disk track,
rotational latencies can be avoided en-
tirely with a sufficiently smart disk con-
5 In files storing permanent data, large clusters (units of I/O) containing many records may also create artificial buffer contention (if much more disk space is copied into the buffer than truly necessary for one record) and "false sharing" in environments with page (cluster) locks, i.e., artificial concurrency conflicts. Since run files in a sort operation are not shared but temporary, these problems do not exist in this context.
troller. Usually, relatively small fan-ins
with large cluster sizes are the optimal
choice, even if the sort requires multiple
merge levels [Graefe 1990a]. The precise
tradeoff depends on disk seek, latency,
and transfer times. It is interesting to
note that the optimal cluster size and
fan-in basically do not depend on the
input size.
As a concrete example, consider sorting a file of R = 50 MB = 51,200 KB using M = 160 KB of memory. The number of runs created by quicksort will be W = ⌈51200/160⌉ = 320. Depending on the disk access and transfer times (e.g., 25 ms disk seek and latency, 2 ms transfer time for a page of 4 KB), C = 16 KB will typically be a good cluster size for fast merging. If one cluster is used for read-ahead and two for the merge output, the fan-in will be F = ⌊160/16⌋ - 3 = 7. The number of merge levels will be L = ⌈log_7(320)⌉ = 3. If a 16 KB I/O operation takes T = 33 ms, the total I/O time, including a factor of two for writing and reading at each merge level, for the entire sort will be 2 × L × ⌈R/C⌉ × T = 10.56 min.
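The arithmetic of this example can be checked with a few lines of C; the ceiling/floor reading of the bracketed formulas is assumed:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double R = 51200, M = 160, C = 16;     /* sizes in KB */
    double T = 0.033;                      /* seconds per 16 KB I/O operation */
    double W = ceil(R / M);                /* 320 initial runs from quicksort */
    double F = floor(M / C) - 3;           /* fan-in 7: one read-ahead cluster,
                                              two clusters for the merge output */
    double L = ceil(log(W) / log(F));      /* 3 merge levels */
    double io = 2 * L * ceil(R / C) * T;   /* write and read at every level */
    printf("W = %.0f, F = %.0f, L = %.0f, I/O time = %.2f min\n",
           W, F, L, io / 60.0);            /* prints 10.56 min */
    return 0;
}
```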
An entirely different approach to determining optimal cluster sizes and the amount of memory allocated to forecasting and read-ahead is based on processing and I/O bandwidths and latencies. The cluster sizes should be set such that the I/O bandwidth matches the processing bandwidth of the CPU. Bandwidths for both I/O and CPU are measured here in records or bytes per unit time; instructions per unit time (MIPS) are irrelevant.
It is interesting to note that the CPU’s
processing bandwidth is largely deter-
mined by how fast the CPU can assemble
new pages, in other words, how fast the
CPU can copy records within memory.
This performance measure is usually ignored in modern CPU and cache designs geared towards high MIPS or MFLOPS numbers [Ousterhout 1990].
Tuning the sort based on bandwidth
and latency proceeds in three steps. First,
the cluster size is set such that the pro-
cessing and I/O bandwidths are equal or very close to equal. If the sort is I/O
bound, the cluster size is increased for
less disk access overhead per record and
therefore faster I/O; if the sort is CPU
bound, the cluster size is decreased to
slow the I/O in favor of a larger merge
fan-in. Next, in order to ensure that the
two processing components (I/O and
CPU) never (or almost never) have to
wait for one another, the amount of space
dedicated to read-ahead is determined as
the I/O time for one cluster multiplied
by the processing bandwidth. Typically, this will result in one cluster of read-ahead space per disk used to store and read input runs into a merge. Of course,
in order to make read-ahead effective,
forecasting must be used. Finally, the
same amount of buffer space is allocated
for the merge output (access latency times
bandwidth) to ensure that merge pro-
cessing never has to wait for the com-
pletion of output I/O. It is an open
issue whether these two alternative ap-
proaches to tuning cluster size and read-
ahead space result in different alloca-
tions and sorting speeds or whether one
of them is more effective than the other.
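As a sketch of this bandwidth-balancing rule, the following C fragment solves C / (latency + C/transfer) = CPU bandwidth for the cluster size and then sizes the read-ahead space; the device and CPU numbers are assumed purely for illustration:

```c
#include <stdio.h>

int main(void) {
    double latency  = 0.025;   /* seconds per disk access (seek + rotation) */
    double transfer = 4096;    /* KB per second of sequential transfer */
    double cpu_bw   = 2048;    /* KB per second the CPU can copy records */

    /* Cluster size at which I/O bandwidth equals processing bandwidth. */
    double C = cpu_bw * latency / (1.0 - cpu_bw / transfer);
    double io_time = latency + C / transfer;   /* seconds per cluster */
    double read_ahead = io_time * cpu_bw;      /* KB per input disk */

    printf("cluster size = %.1f KB\n", C);                    /* 102.4 KB */
    printf("read-ahead   = %.1f KB per disk\n", read_ahead);  /* one cluster */
    return 0;
}
```

With these assumed numbers the read-ahead space works out to exactly one cluster per disk, matching the rule of thumb stated in the text.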
The third and fourth merging issues
focus on using (and exploiting) the maxi-
mal fan-in as effectively and often as
possible. Both issues require adjusting
the fan-in of the first merge step using
the formula given below, either the first
merge step of all merge steps or, in
semi-eager merging [Graefe 1990a], the
first merge step after the end of the in-
put has been reached. This adjustment is
used for only one merge step, called the
initial merge here, not for an entire merge
level.
The third issue to be considered is that
the number of runs W is typically not a
power of F; therefore, some merges pro-
ceed with fewer than F inputs, which
creates the opportunity for some opti-
mization. Instead of always merging runs
of only one level together, the optimal
strategy is to merge as many runs as
possible using the smallest run files
available. The only exception is the fan-in
of the first merge, which is determined to
ensure that all subsequent merges will
use the full fan-in F.
[Figure 6. Naive and optimized merging.]
Let us explain this idea with the exam-
ple shown in Figure 6. Consider a sort
with a maximal fan-in F = 10 and an
input file that requires W = 12 initial
runs. Instead of merging only runs of the
same level as shown in Figure 6, merging
is delayed until the end of the input has
been reached. In the first merge step,
only 3 of the 12 runs are combined, and
the result is then merged with the other
9 runs, as shown in Figure 6. The I/O cost (measured by the number of memory loads that must be written to any of the runs created) for the first strategy is 12 + 10 + 2 = 24, while for the second strategy it is 12 + 3 = 15. In other words, the first strategy requires 60% more I/O to temporary files than the second one.
The general rule is to merge just the
right number of runs after the end of the
input file has been reached, and to al-
ways merge the smallest runs available
for merging. More detailed examples are
given in Graefe [1990a]. One conse-
quence of this optimization is that the
merge depth L, i.e., the number of run
files a record is written to during the sort
or the number of times a record is writ-
ten to and read from disk, is not uniform
for all records. Therefore, it makes sense
to calculate an average merge depth (as
required in cost estimation during query
optimization), which may be a fraction.
Of course, there are much more sophisti-
cated merge optimizations, e.g., cascade
and polyphase merges [Knuth 1973].
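The rule can be replayed with a small simulation. The sketch below is my own illustration (run sizes are counted in memory loads, and the fan-in schedule is given explicitly), not code from any system described here; it reproduces the Figure 6 costs of 24 and 15.

    import heapq

    def merge_cost(run_sizes, fan_in_schedule):
        # Cost = memory loads written to run files; the last scheduled merge
        # produces the final output and is therefore not counted.
        runs = list(run_sizes)
        heapq.heapify(runs)                      # always merge the smallest runs
        cost = sum(runs)                         # writing the initial runs
        for step, fan_in in enumerate(fan_in_schedule):
            merged = sum(heapq.heappop(runs) for _ in range(min(fan_in, len(runs))))
            if step < len(fan_in_schedule) - 1:
                cost += merged                   # intermediate run is written
                heapq.heappush(runs, merged)
        return cost

    initial_runs = [1] * 12                      # W = 12 runs of one load each
    print(merge_cost(initial_runs, [10, 2, 2]))  # naive merging:     24
    print(merge_cost(initial_runs, [3, 10]))     # optimized merging: 15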
Fourth, since some operations require multiple sorted inputs, for example merge-join (to be discussed in the section on matching), and since sort output can be passed directly from the final merge into the next operation (as is natural when using iterators), memory must be divided among multiple final merges. Thus, the final fan-in f and the “normal” fan-in F should be specified separately in an actual sort implementation. Using a final fan-in of 1 also allows the sort operator to produce output into a very slow operator, e.g., a display operation that allows scrolling by a human user, without occupying a lot of buffer memory for merging input runs over an extended period of time.6

6 There is a similar case of resource sharing among the operators producing a sort’s input and the run generation phase of the sort. We will come back to these issues later in the section on executing and scheduling complex queries and plans.
Considering the last two optimization
options for merging, the following for-
mula determines the fan-in of the first
merge. Each merge with normal fan-in F
will reduce the number of run files by
F – 1 (removing F runs, creating one
new one). The goal is to reduce the num-
ber of runs from W to f and then to 1
(the final output). Thus, the first merge
should reduce the number of runs to f +
k(F – 1) for some integer k. In other
words, the first merge should use a fan-in
of F0 = ((W - f - 1) mod (F - 1)) + 2. In the example of Figure 6, (12 - 10 - 1) mod (10 - 1) + 2 results in a fan-in for the initial merge of F0 = 3. If the sort of Figure 6 were the input into a merge-join and if a final fan-in of 5 were desired, the initial merge should proceed with a fan-in of F0 = (12 - 5 - 1) mod (10 - 1) + 2 = 8.
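Transcribed into code, the formula is a one-liner; the helper name is mine, and the two calls reproduce the worked examples.

    def initial_fan_in(W, F, f=1):
        # F0 = ((W - f - 1) mod (F - 1)) + 2
        return (W - f - 1) % (F - 1) + 2

    print(initial_fan_in(12, 10))       # Figure 6 example: F0 = 3
    print(initial_fan_in(12, 10, 5))    # final fan-in of 5: F0 = 8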
If multiple sort operations produce input data for a common consumer operator, e.g., a merge-join, the two final fan-ins should be set proportionally to the sizes of the two inputs. For example, if two merge-join inputs are 1 MB and 9 MB, and if 20 clusters are available for inputs into the two final merges, then 2 clusters should be allocated for the first and 18 clusters for the second input (1/9 = 2/18).
Sorting is sometimes criticized because
it requires, unlike hybrid hashing (dis-
cussed in the next section), that the en-
tire input be written to run files and then
retrieved for merging. This difference has
a particularly large effect for files only
slightly larger than memory, e.g., 1.25
times the size of memory. Hybrid hash-
ing determines dynamically how much of
the input data truly must be written to
temporary disk files. In the example, only
slightly more than one quarter of the
memory size must be written to tempo-
rary files on disk while the remainder of
the file remains in memory. In sorting,
the entire file (1.25 memory sizes) is
written to one or two run files and then
read for merging. Thus, sorting seems to require five times more I/O for temporary files in this example than hybrid
hashing. However, this is not necessarily
true. The simple trick is to write initial
runs in decreasing (reverse) order. When
the input is exhausted, and merging in
increasing order commences, buffer memory is still full of useful pages with small sort keys that can be merged immediately without I/O and that never have to be written to disk. The effect of
writing runs in reverse order is compara-
ble to that of hybrid hashing, i.e., it
is particularly effective if the input is
only slightly larger than the available
memory.
To demonstrate the effect of cluster
size optimization (the second of the four
merging issues discussed above), we
sorted 100,000 100-byte records, about
10 MB, with the Volcano query process-
ing system, which includes all the merge optimizations described above with the exception of read-ahead and forecasting.
(This experiment and a similar one were
described in more detail earlier [Graefe
1990a; Graefe et al. 1993].) We used a
sort space of 40 pages (160 KB) within a 50-page (200 KB) I/O buffer, varying the cluster size from 1 page (4 KB) to 15 pages (60 KB). The initial run size was 1,600 records, for a total of 63 initial runs. We counted the number of I/O operations and the transferred pages for all run files and calculated the total I/O cost by charging 25 ms per I/O operation (for seek and rotational latency) and 2 ms for each transferred page (assuming a 2 MB/sec transfer rate). As can be seen in
Table 4 and Figure 7, there is an optimal cluster size with minimal I/O cost. The
curve is not as smooth as might have
been expected from the approximate cost
function because the curve reflects all
real-system effects such as rounding
(truncating) the fan-in if the cluster size
is not an exact divisor of the memory
size, the effectiveness of merge optimiza-
tion varying for different fan-ins, and
internal fragmentation in clusters. The
detailed data in Table 4, however, reflect
the trends that larger clusters and
smaller fan-ins clearly increase the
amount of data transferred but not the number of I/O operations (disk and latency time) until the fan-in has shrunk to very small values, e.g., 3. It is clearly
suboptimal to always choose the smallest
cluster size (1 page) in order to obtain
the largest fan-in and fewest merge lev-
els. Furthermore, it seems that the range
of cluster sizes that result in near-optimal total I/O costs is fairly large;
thus it is not as important to determine
the exact value as it is to use a cluster
size “in the right ball park.” The optimal
fan-in is typically fairly small; however,
it is not e or 3 as derived by Bratbergsengen [1984] under the (unrealistic) assumption that the cost of an I/O operation is independent of the amount of data being transferred.
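The cost model behind Table 4 below is simply 25 ms per disk operation plus 2 ms per transferred 4 KB page; the following check (my own) recomputes two of the table’s rows.

    def total_io_cost(operations, pages_transferred):
        # 25 ms per operation (seek and rotational latency), 2 ms per page
        return operations * 0.025 + pages_transferred * 0.002

    print(total_io_cost(6874, 6874))     # cluster size 1:  185.598 sec
    print(total_io_cost(1490, 14900))    # cluster size 10:  67.050 sec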
Table 4. Effect of Cluster Size Optimizations

Cluster Size   Fan-in   Average   Disk         Pages Transferred   Total I/O
[x 4 KB]                Depth     Operations   [x 4 KB]            Cost [sec]
 1             40       1.376     6874          6874               185.598
 2             20       1.728     4298          8596               124.642
 3             13       1.872     3176          9528                98.456
 4             10       1.936     2406          9624                79.398
 5              8       2.000     1984          9920                69.440
 6              6       2.520     2132         12792                78.884
 7              5       2.760     1980         13860                77.220
 8              5       2.760     1718         13744                70.438
 9              4       3.000     1732         15588                74.476
10              4       3.000     1490         14900                67.050
11              3       3.856     1798         19778                84.506
12              3       3.856     1686         20232                82.614
13              3       3.856     1628         21164                83.028
14              2       5.984     2182         30548               115.646
15              2       5.984     2070         31050               113.850

Figure 7. Effect of cluster size optimizations.

2.2 Hashing

For many matching tasks, hashing is an alternative to sorting. In general, when equality matching is required, hashing should be considered because the expected complexity of set algorithms based on hashing is O(N) rather than O(N log N) as for sorting. Of course, this makes intuitive sense if hashing is viewed as radix sorting on a virtual key [Knuth 1973].
Hash-based query processing algo-
rithms use an in-memory hash table of
database objects to perform their match-
ing task. If the entire hash table (includ-
ing all records or items) fits into memory,
hash-based query processing algorithms
are very easy to design, understand, and
implement, and they outperform sort-
based alternatives. Note that for binary
matching operations, such as join or in-
tersection, only one of the two inputs
must fit into memory. However, if the required hash table is larger than memory, hash table overflow occurs and must be dealt with.
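For the case without overflow, all hash-based set-matching algorithms reduce to a build phase and a probe phase. The sketch below is my own minimal illustration of an in-memory hash join, with a Python dictionary standing in for the hash table; it is not code from any system discussed in this survey.

    from collections import defaultdict

    def hash_join(build_input, probe_input, build_key, probe_key):
        table = defaultdict(list)
        for r in build_input:                  # build on the smaller input
            table[build_key(r)].append(r)
        for s in probe_input:                  # probe with the larger input
            for r in table.get(probe_key(s), []):
                yield (r, s)                   # each match contributes

    # Only the build input (departments) must fit into memory.
    departments = [(1, "Shoe"), (2, "Hardware")]
    employees = [("Ann", 1), ("Bob", 2), ("Cy", 1)]
    print(list(hash_join(departments, employees,
                         build_key=lambda d: d[0], probe_key=lambda e: e[1])))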
There are basically two methods for
managing hash table overflow, namely
avoidance and resolution. In either case,
the input is divided into multiple parti-
tion files such that partitions can be pro-
cessed independently from one another,
and the concatenation of the results of all
partitions is the result of the entire oper-
ation. Partitioning should ensure that the
partitioning files are of roughly even size
and can be done using either hash partitioning or range partitioning, i.e., based
on keys estimated to be quantiles. Usu-
ally, partition files can be processed us-
ing the original hash-based algorithm.
The maximal partitioning fan-out F, i.e., the number of partition files created, is determined by the memory size M divided by the cluster size C minus one cluster for the partitioning input, i.e., F = ⌊M/C - 1⌋, just like the fan-in for sorting.
In hash table overflow avoidance, the
input set is partitioned into F partition
files before any in-memory hash table is
built. If it turns out that fewer partitions
than have been created would have been
sufficient to obtain partition files that
will fit into memory, bucket tuning (col-
lapsing multiple small buckets into larger
ones) and dynamic destaging (determin-
ing which buckets should stay in mem-
ory) can improve the performance of
hash-based operations [Kitsuregawa et
al. 1989a; Nakayama et al. 1988].
Algorithms based on hash table over-
flow resolution start with the assumption
that overflow will not occur, but resort to
basically the same set of mechanisms as
hash table overflow avoidance once it
does occur. No real system uses this naive
hash table overflow resolution because
so-called hybrid hashing is as efficient
but more flexible. Hybrid hashing com-
bines in-memory hashing and overflow
resolution [DeWitt et al. 1984; Shapiro
1986]. Although invented for relational
join and known as hybrid hash join, hy-
brid hashing is equally applicable to all
hash-based query processing algorithms.
Hybrid hash algorithms start out with
the (optimistic) premise that no overflow
will occur; if it does, however, they parti-
tion the input into multiple partitions of
which only one is written immediately to
temporary files on disk. The other F – 1
partitions remain in memory. If another
overflow occurs, another partition is
written to disk. If necessary, all F parti-
tions are written to disk. Thus, hybrid
hash algorithms use all available mem-
ory for in-memory processing, but at the
same time are able to process large input
files by overflow resolution. Figure 8 shows the idea of hybrid hash algorithms.

Figure 8. Hybrid hashing.

As many hash buckets as possible are kept in memory, e.g., as linked
lists as indicated by solid arrows. The
other hash buckets are spooled to tempo-
rary disk files, called the overflow or par-
tition files, and are processed in later
stages of the algorithm. Hybrid hashing
is useful if the input size R is larger than
the memory size AL?but smaller than the
memory size multiplied by the fan-out F,
i.e., M< R< FXM.
In order to predict the number of I/O operations (which actually is not necessary for execution because the algorithm adapts to its input size but may be desirable for cost estimation during query optimization), the number of required partition files on disk must be determined. Call this number K, which must satisfy 0 ≤ K ≤ F. Presuming that the assignment of buckets to partitions is optimal and that each partition file is equal to the memory size M, the amount of data that may be written to K partition files is equal to K × M. The number of required I/O buffers is 1 for the input and K for the output partitions, leaving M - (K + 1) × C memory for the hash table. The optimal K for a given input size R is the minimal K for which K × M + (M - (K + 1) × C) ≥ R. Solving this inequality and taking the smallest such K results in K = ⌈(R - M + C)/(M - C)⌉. The minimal possible I/O cost, including a factor of 2 for writing and reading the partition files and measured in the amount of data that must be written or read, is 2 × (R - (M - (K + 1) × C)). To determine the I/O time, this amount must be divided by the cluster size and multiplied by the I/O time for one cluster.
For example, consider an input of R = 240 pages, a memory of M = 80 pages, and a cluster size of C = 8 pages. The maximal fan-out is F = ⌊80/8 - 1⌋ = 9. The number of partition files that need to be created on disk is K = ⌈(240 - 80 + 8)/(80 - 8)⌉ = 3. In other words, in the best case, K × C = 3 × 8 = 24 pages will be used as output buffers to write K = 3 partition files of no more than M = 80 pages, and M - (K + 1) × C = 80 - 4 × 8 = 48 pages of memory will be used as the hash table. The total amount of data written to and read from disk is 2 × (240 - (80 - 4 × 8)) = 384 pages. If writing or reading a cluster of C = 8 pages takes 40 msec, the total I/O time is 384/8 × 40 msec = 1.92 sec.
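The example translates directly into code; the helper below (naming is mine) recomputes the fan-out, the number of partition files, the hash table size, and the 384-page, 1.92-second result.

    import math

    def hybrid_hash_plan(R, M, C):
        F = M // C - 1                                 # maximal fan-out
        K = max(0, math.ceil((R - M + C) / (M - C)))   # partition files on disk
        hash_table = M - (K + 1) * C                   # pages left for the table
        io_pages = 2 * (R - hash_table)                # written and read again
        return F, K, hash_table, io_pages

    F, K, table_pages, io_pages = hybrid_hash_plan(R=240, M=80, C=8)
    print(F, K, table_pages, io_pages)    # 9 3 48 384
    print(io_pages / 8 * 0.040, "sec")    # 1.92 sec at 40 msec per cluster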
In the calculation of K, we assumed an
optimal assignment of hash buckets to
partition files. If buckets were assigned
in the most straightforward way, e.g., by
dividing the hash directory into F
equal-size regions and assigning the
buckets of one region to a partition as
indicated in Figure 8, all partitions would be of nearly the same size, and either all or none of them would fit into their output cluster and therefore into memory. In other words, once hash table overflow occurred, all input would be written to partition files. Thus, we presumed in the ear-
lier calculations that hash buckets were
assigned more intelligently to output
partitions.
There are three ways to assign hash buckets to partitions. First, each time a hash table overflow occurs, a fixed number of hash buckets is assigned to a new output partition. In the Gamma database machine, the number of disk partitions is chosen “such that each bucket [bucket here means what is called an output partition in this survey] can reasonably be expected to fit in memory” [DeWitt and Gerber 1985], e.g., 10% of the hash buckets in the hash directory for a fan-out of 10 [Schneider 1990]. In other words, the fan-out is set a priori by the query optimizer based on the expected (estimated) input size. Since the page size in Gamma is relatively small, only a fraction of memory is needed for output buffers, and an in-memory hash table can be used even while output partitions are being written to disk. Second, in bucket tuning and dynamic destaging [Kitsuregawa et al. 1989a; Nakayama 1988], a large number of small partition files is created and then collapsed into fewer partition files no larger than memory. In order to obtain a large number of partition files and, at the same time, retain some memory for a hash table, the cluster size is set quite small, e.g., C = 1 page, and the fan-out is very large though not maximal, e.g., F = M/C/2. In the example above, F = 40 output partitions with an average size of R/F = 6 pages could be created, even though only K = 3 output partitions are required. The smallest partitions are assigned to fill an in-memory hash table of size M - K × C = 80 - 3 × 1 = 77 pages. Hopefully, the dynamic destaging rule (when an overflow occurs, assign the largest partition still in memory to disk) ensures that indeed the smallest partitions are retained in memory. The partitions assigned to disk are collapsed into K = 3 partitions of no more than M = 80 pages, to be processed in K = 3 subsequent phases. In binary operations such as intersection and relational join, bucket tuning is quite effective for skew in the first input, i.e., if the hash value distribution is nonuniform and if the partition files are of uneven sizes. It avoids spooling parts of the second (typically larger) input to temporary partition files because the partitions in memory can be matched immediately using a hash table in the memory not required as output buffer and because a number of small partitions have been collapsed into fewer, larger partitions, increasing the memory available for the hash table. For skew in the second input, bucket tuning and dynamic destaging have no advantage. Another disadvan-
tage of bucket tuning and dynamic
destaging is that the cluster size has to
be relatively small, thus requiring a large
number of I/O operations with disk seeks
and rotational latencies to write data to
the overflow files. Third, statistics gath-
ered before hybrid hashing commences
can be used to assign hash buckets to
partitions [Graefe 1993a].
Unfortunately, it is possible that one
or more partition files are larger than
memory. In that case, partitioning is used
recursively until the file sizes have
shrunk to memory size. Figure 9 shows
how a hash-based algorithm for a unary
operation, such as aggregation or dupli-
cate removal, partitions its input file over
multiple recursion levels. The recursion
terminates when the files fit into mem-
ory. In the deepest recursion level, hy-
brid hashing may be employed.
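A compact sketch of this recursion for duplicate removal follows; it is my own simplification (Python’s built-in hash() stands in for a real hash function, sizes are counted in items rather than pages, and the in-memory hash table is a set).

    def distinct(items, memory_limit, fan_out, level=0):
        # Recursion terminates when the *output* fits into memory: duplicates
        # collapse in the in-memory hash table. (A real system tracks sizes
        # in pages instead of materializing the set to test its size.)
        unique = set(items)
        if len(unique) <= memory_limit:
            return list(unique)
        # Otherwise partition on a level-dependent hash and recurse; the
        # concatenation of all partition results is the overall result.
        partitions = [[] for _ in range(fan_out)]
        for x in items:
            partitions[hash((x, level)) % fan_out].append(x)
        result = []
        for p in partitions:
            result.extend(distinct(p, memory_limit, fan_out, level + 1))
        return result

    print(sorted(distinct(list(range(20)) * 3, memory_limit=5, fan_out=3)))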
If the partitioning (hash) function is good and creates a uniform hash value distribution, the file size in each recursion level shrinks by a factor equal to the fan-out, and therefore the number of recursion levels L is logarithmic with the size of the input being partitioned. After L partitioning levels, each partition file is of size R' = R/F^L. In order to obtain partition files suitable for hybrid hashing (with M < R' < F × M), the number of full recursion levels L, i.e., levels at which hybrid hashing is not applied, is L = ⌊log_F(R/M)⌋. The I/O cost of the remaining step using hybrid hashing can be estimated using the hybrid hash formula above with R replaced by R' and multiplying the cost by F^L because hybrid hashing is used for this number of partition files. Thus, the total I/O cost for partitioning an input and using hybrid hashing in the deepest recursion level is

2 × R × L + 2 × F^L × (R' - (M - K × C))
  = 2 × (R × (L + 1) - F^L × (M - K × C))
  = 2 × (R × (L + 1) - F^L × (M - ⌈(R' - M)/(M - C)⌉ × C)).
Figure 9. Recursive partitioning.
A major problem with hash-based algo-
rithms is that their performance depends
on the quality of the hash function. In
many situations, fairly simple hash func-
tions will perform reasonably well.
Remember that the purpose of using
hash-based algorithms usually is to find
database items with a specific key or to
bring like items together; thus, methods
as simple as using the value of a join key
as a hash value will frequently perform
satisfactorily. For string values, good
hash values can be determined by using
binary “exclusive or” operations or by de-
termining cyclic redundancy check (CRC)
values as used for reliable data storage
and transmission. If the quality of the
hash function is a potential problem, uni-
versal hash functions should be consid-
ered [Carter and Wegman 1979].
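As an illustration only, the sketch below pairs a simple shift-and-XOR string hash in the spirit of the suggestion above with a CRC value from Python’s standard library; neither is claimed to be what any particular system uses.

    import zlib

    def xor_hash(s, buckets):
        h = 0
        for byte in s.encode():
            h = ((h << 5) ^ (h >> 27) ^ byte) & 0xFFFFFFFF   # mix each byte in
        return h % buckets

    print(xor_hash("Shoe", 1024), zlib.crc32(b"Shoe") % 1024)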
If the partitioning is skewed, the re-
cursion depth may be unexpectedly high,
making the algorithm rather slow. This
is analogous to the worst-case performance of quicksort, O(N²) comparisons for an array of N items, if the partitioning pivots are chosen extremely poorly and do not divide arrays into nearly equal subarrays.
Skew is the major danger for inferior
performance of hash-based query-
processing algorithms. There are several
ways to deal with skew. For hash-based
algorithms using overflow avoidance,
bucket tuning and dynamic destaging are
quite effective. Another method is to ob-
tain statistical information about hash
values and to use it to carefully assign
hash buckets to partitions. Such statistical information can be kept in the form of histograms and can either come from permanent system catalogs (metadata), from sampling the input, or from previous recursion levels. For example, for an intermediate query processing result for which no statistical parameters are known a priori, the first partitioning level might have to proceed naively pretending that the partitioning hash function is perfect, but the second and further recursion levels should be able to use statistics gathered in earlier levels to ensure that each partitioning step creates even partitions, i.e., that the data is partitioned with maximal effectiveness [Graefe 1993a]. As a final resort, if skew cannot be managed otherwise, or if distribution skew is not the problem but duplicates are, some systems resort to algorithms that are not affected by data or hash value skew. For example, Tandem’s hash join algorithm resorts to nested-loops join (to be discussed later) [Zeller and Gray 1990].
As for sorting, larger cluster sizes result in faster I/O at the expense of
smaller fan-outs, with the optimal fan-out
being fairly small [Graefe 1993a; Graefe
et al. 1993]. Thus, multiple recursion lev-
els are not uncommon for large files, and
statistics gathered on one level to limit
skew effects on the next level are a real-
istic method for large files to control
the performance penalties of uneven
partitioning.
3. DISK ACCESS
All query evaluation systems have to
access base data stored in the data-
base. For databases in the megabyte to
terabyte range, base data are typically
stored on secondary storage in the form
of rotating random-access disks. How-
ever, deeper storage hierarchies includ-
ing optical storage, (maybe robot-oper-
ated) tape archives, and remote storage
servers will also have to be considered in
future high-functionality high-volume
database management systems, e.g., as
outlined by Stonebraker [1991]. Research
into database systems supporting and ex-
ploiting a deep storage hierarchy is still
in its infancy.
On the other hand, some researchers
have considered in-memory or main-
memory databases, motivated both by the
desire for faster transaction and query-
processing performance and by the de-
creasing cost of semi-conductor memory
[Analyti and Pramanik 1992; Bitton et
al. 1987; Bucheral et al. 1990; DeWitt et
al. 1984; Gruenwald and Eich 1991; Ku-
mar and Burger 1991; Lehman and Carey
1986; Li and Naughton 1988; Severance
et al. 1990; Whang and Krishnamurthy
1990]. However, for most applications, an
analysis by Gray and Putzolu [1987] demonstrated that main memory is cost effective only for the most frequently accessed data. The time interval between
accesses with equal disk and memory
costs was five minutes for their values of
memory and disk prices and was ex-
pected to grow as main-memory prices
decrease faster than disk prices. For the
purposes of this survey, we will presume
a disk-based storage architecture and will consider disk I/O one of the major costs of query evaluation over large databases.
3.1 File Scans
The first operator to access base data is
the file scan, typically combined with
a built-in selection facility. There is
not much to be said about file scan ex-
cept that it can be made very fast using
read-ahead, particularly, large-chunk
(“track-at-a-crack”) read-ahead. In some
database systems, e.g., IBM’s DB2, the
read-ahead size is coordinated with the
free space left for future insertions dur-
ing database reorganization. If a free
page is left after every 15 full data pages,
the read-ahead unit of 16 pages (64 KB)
ensures that overflow records are imme-
diately available in the buffer.
Efficient read-ahead requires contigu-
ous file allocation, which is supported by
many operating systems. Such contigu-
ous disk regions are frequently called ex-
tents. The UNIX operating system does
not provide contiguous files, and many
database systems running on UNIX use
“raw” devices instead, even though this
means that the database management
system must provide operating-system
functionality such as file structures, disk
space allocation, and buffering.
The disadvantages of large units of I/O are buffer fragmentation and the waste of I/O and bus bandwidth if only individual records are required. Permitting dif-
ual records are required. Permitting dif-
ferent page sizes may seem to be a good
idea, even at the added complexity in the
buffer manager [Carey et al. 1986; Sikeler
1988], but this does not solve the prob-
lem of mixed sequential scans and ran-
dom record accesses within one file. The
common solution is to choose a middle-of-
the-road page size, e.g., 8 KB, and to
support multipage read-ahead.
3.2 Associative Access Using Indices
In order to reduce the number of accesses
to secondary storage (which is relatively
slow compared to main memory), most
database systems employ associative
search techniques in the form of indices
that map key or attribute values to loca-
tor information with which database
objects can be retrieved. The best known
and most often used database index
structure is the B-tree [Bayer and
McCreight 1972; Comer 1979]. A large
number of extensions to the basic struc-
ture and its algorithms have been pro-
posed, e.g., B+-trees for faster scans, fast
loading from a sorted file, increased fan-
out and reduced depth by prefix and suf-
fix truncation, B*-trees for better space
utilization in random insertions, and
top-down B-trees for better locking be-
havior through preventive maintenance
[Guibas and Sedgewick 1978]. Interest-
ingly, B-trees seem to be having a renais-
sance as a research subject, in particular
with respect to improved space utiliza-
tion [Baeza-Yates and Larson 1989], con-
currency control [Srinivasan and Carey
1991], recovery [Lanka and Mays 1991],
parallelism [Seeger and Larson 1991],
and on-line creation of B-trees for very
large databases [Srinivasan and Carey
1992]. On-line reorganization and modifi-
cation of storage structures, though not a
new idea [Omiecinski 1985], is likely to
become an important research topic
within database research over the next
few years as databases become larger
and larger and are spread over many
disks and many nodes in parallel and
distributed systems.
While most current database system
implementations only use some form of
B-trees, an amazing variety of index
structures has been described in the lit-
erature [Becker et al. 1991; Beckmann et
al. 1990; Bentley 1975; Finkel and Bent-
ley 1974; Guenther and Bilmes 1991;
Gunther and Wong 1987; Gunther 1989;
Guttman 1984; Henrich et al. 1989; Hoel
and Samet 1992; Hutflesz et al. 1988a;
1988b; 1990; Jagadish 1991; Kemper and
Wallrath 1987; Kolovson and Stone-
braker 1991; Kriegel and Seeger 1987;
1988; Lomet and Salzberg 1990a; Lomet
1992; Neugebauer 1991; Robinson 1981;
Samet 1984; Six and Widmayer 1988].
One of the few multidimensional index structures actually implemented in a complete database management system is the R-tree in Postgres [Guttman 1984; Stonebraker et al. 1990b].
Table 5 shows some example index
structures classified according to four
characteristics, namely their support
for ordering and sorted scans, their
dynamic-versus-static behavior upon
insertions and deletions, their support
for multiple dimensions, and their sup-
port for point data versus range data. We
omitted hierarchical concatenation of at-
tributes and uniqueness, because all in-
dex structures can be implemented to
support these. The indication “no range
data” for multidimensional index struc-
tures indicates that range data are not
part of the basic structure, although they
can be simulated using twice the number
of dimensions. We included a reference or
two with each structure; we selected orig-
inal descriptions and surveys over the
many subsequent papers on special as-
pects such as performance analyses, mul-
tidisk’ and multiprocessor implementa-
tions, page placement on disk, concur-
rency control, recovery, order-preserving
Table 5. Classification of Some Index Structures

Structure            Ordered   Dynamic   Multi-Dim.   Range Data   References
ISAM                 Yes       No        No           No           [Larson 1981]
B-trees              Yes       Yes       No           No           [Bayer and McCreight 1972; Comer 1979]
Quad-tree            Yes       Yes       Yes          No           [Finkel and Bentley 1974; Samet 1984]
kD-trees             Yes       Yes       Yes          No           [Bentley 1975]
KDB-trees            Yes       Yes       Yes          No           [Robinson 1981]
hB-trees             Yes       Yes       Yes          No           [Lomet and Salzberg 1990a]
R-trees              Yes       Yes       Yes          Yes          [Guttman 1984]
Extendible Hashing   No        Yes       No           No           [Fagin et al. 1979]
Linear Hashing       No        Yes       No           No           [Litwin 1980]
Grid Files           Yes       Yes       Yes          No           [Nievergelt et al. 1984]
hashing, mapping range data of N di-
mensions into point data of 2 N dimen-
sions, etc.—this list suggests the wealth
of subsequent research, in particular on
B-trees, linear hashing, and refined mul-
tidimensional index structures.
Storage structures typically thought of
as index structures may be used as pri-
mary structures to store actual data or
as redundant structures (“access paths”)
that do not contain actual data but point-
ers to the actual data items in a separate
data file. For example, Tandem’s Non-
Stop SQL system uses B-trees for actual
data as well as for redundant index
structures. In this case, a redundant in-
dex structure contains not absolute loca-
tions of the data items but keys used to
search the primary B-tree. If indices are
redundant structures, they can still be
used to cluster the actual data items, i.e.,
the order or organization of index entries
determines the order of items in the data
file. Such indices are called clustering
indices; other indices are called nonclus-
tering indices. Clustering indices do not
necessarily contain an entry for each data
item in the primary file, but only one
entry for each page of the primary file; in
this case, the index is called sparse. Non-
clustering indices must always be dense,
i.e., there are the same number of entries
in the index as there are items in the
primary file.
The common theme for all index struc-
tures is that they associatively map some
attribute of a data object to some locator
information that can then be used to re-
trieve the actual data object. Typically, in
relational systems, an attribute value is
mapped to a tuple or record identifier
(TID or RID). Different systems use dif-
ferent approaches, but it seems that most
new designs do not firmly attach the
record lookup to the index scan.
There are several advantages to sepa-
rating index scan and record lookup.
First, it is possible to scan an index only
without ever retrieving records from the
underlying data file. For example, if only
salary values are needed (e.g., to deter-
mine the count or sum of all salaries), it
is sufficient to access the salary index
only without actually retrieving the data
records. The advantages are that (i) fewer I/Os are required (consider the number of I/Os for retrieving N successive index entries versus retrieving N index entries plus N full records, in particular if the index is nonclustering [Mackert and Lohman 1989]) and (ii) the remaining I/O operations are basically sequential along the leaves of the index (at least for B+-trees; other index types behave differently). The optimizers of several commer-
cial relational products have recently
been revised to recognize situations in
which an index-only scan is sufficient.
Second, even if none of the existing in-
dices is sufficient by itself, multiple in-
dices may be “joined” on equal RIDs to
obtain all attributes required for a query
(join algorithms are discussed below in
the section on binary matching). For ex-
ample, by matching entries in indices on
salaries and on names by equal RIDs,
the correct salary-name pairs are estab-
lished. If a query requires only names
and salaries, this “join” has made access-
ing the underlying data file obsolete.
Third, if two or more indices apply to
individual clauses of a query, it may be
more effective to take the union or inter-
section of RID lists obtained from two
index scans than using only one index
(algorithms for union and intersection are
also discussed in the section on binary
matching). Fourth, joining two tables can
be accomplished by joining the indices on
the two join attributes followed by record
retrievals in the two underlying data sets;
the advantage of this method is that only
those records will be retrieved that truly
contribute to the join result [Kooi 1980].
Fifth, for nonclustering indices, sets of
RIDs can be sorted by physical location,
and the records can be retrieved very
efficiently, reducing substantially the
number of disk seeks and their seek dis-
tances. Obviously, several of these tech-
niques can be combined. In addition, some
systems such as Rdb/VMS and DB2 use
very sophisticated implementations of
multiindex scans that decide dynami-
cally, i.e., during run-time, which indices
to scan, whether scanning a particular
index reduces the resulting RID list suf-
ficiently to offset the cost of the index
scan, and whether to use bit vector filter-
ing for the RID list intersection (see a
later section on bit vector filtering)
[Antoshenkov 1993; Mohan et al. 1990].
Record access performance for nonclus-
tering indices can be addressed without
performing the entire index scan first (as
required if all RIDs are to be sorted) by
using a “window” of RIDs. Instead of
obtaining one RID from the index scan,
retrieving the record, getting the next
RID from the index scan, etc., the lookup
operator (sometimes called “functional
join”) could load N RIDs, sort them into
a priority heap, retrieve the most conve-
niently located record, get another RID,
insert it into the heap, retrieve a record,
etc. Thus, a functional join operator us-
ing a window always has N open refer-
ences to items that must be retrieved,
giving the functional join operator signif-
icant freedom to fetch items from disk
efficiently. Of course, this technique
works most effectively if no other trans-
actions or operators use the same disk
drive at the same time.
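A sketch of this windowed lookup follows; it is entirely my own illustration, with RIDs modeled as page numbers so that the “most conveniently located” record is simply the one with the lowest page number in the priority heap.

    import heapq

    def windowed_fetch(rid_iterator, fetch_record, window=8):
        heap = []
        for rid in rid_iterator:
            heapq.heappush(heap, rid)
            if len(heap) >= window:            # keep N open references
                yield fetch_record(heapq.heappop(heap))
        while heap:                            # drain after the index scan ends
            yield fetch_record(heapq.heappop(heap))

    # Records come back in roughly physical order, reducing seek distances.
    rids = [42, 7, 99, 3, 55, 18, 4, 76, 11, 60]
    for record in windowed_fetch(iter(rids), lambda rid: f"page {rid}"):
        print(record)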
This idea has been generalized to as-
semble complex objects. In object-oriented
systems, objects can contain pointers to
(identifiers of) other objects or compo-
nents, which in turn may contain further
pointers, etc. If multiple objects and all
their unresolved references can be con-
sidered concurrently when scheduling
disk accesses, significant savings in disk
seek times can be achieved [Keller et al.
1991].
3.3 Buffer Management
I/O cost can be further reduced by caching data in an I/O buffer. A large
number of buffer management tech-
niques have been devised; we give only a
few references. Effelsberg and Haerder [1984] survey many of the buffer man-
agement issues, including those pertain-
ing to issues of recovery, e.g., write-ahead
logging. In a survey paper on the interac-
tions of operating systems and database
management systems, Stonebraker
[1981] pointed out that the “standard”
buffer replacement policy, LRU (least re-
cently used), is wrong for many database
situations. For example, a file scan reads
a large set of pages but uses them only
once, “sweeping” the buffer clean of all
other pages, even if they might be useful
in the future and should be kept in mem-
ory. Sacco and Schkolnick [1982; 19861
focused on the nonlinear performance
effects of buffer allocation to many rela-
tional algorithms, e.g., nested-loops join.
Chou [1985] and Chou and DeWitt [1985]
combined these two ideas in their DBMIN
algorithm which allocates a fixed number
of buffer pages to each scan, depending
on its needs, and uses a local replace-
ment policy for each scan appropriate to
its reference pattern. A recent study into
buffer allocation is by Faloutsos et al. [1991] and Ng et al. [1991] on using
marginal gain for buffer allocation. A very
promising research direction for buffer
management in object-oriented database
systems is the work by Palmer and
Zdonik [1991] on saving reference pat-
terns and using them to predict future
object faults and to prevent them by
prefetching the required pages.
The interactions of index retrieval and
buffer management were studied by
Sacco [1987] as well as Mackert and Lohman [1989], and several authors
studied database buffer management and
virtual memory provided by the operat-
ing system [Sherman and Brice 1976;
Stonebraker 1981; Traiger 1982].
On the level of buffer manager implementation, most database buffer managers do not provide read and write interfaces to their client modules but fixing and unfixing, also called pinning and unpinning. The semantics of fixing is that a
fixed page is not subject to replacement
or relocation in the buffer pool, and a
client module may therefore safely use a
memory address within a fixed page. If
the buffer manager needs to replace a
page but all its buffer frames are fixed,
some special action must occur such as
dynamic growth of the buffer pool or
transaction abort.
The iterator implementation of query
evaluation algorithms can exploit the
buffer’s fix/unfix interface by passing
pointers to items (records, objects) fixed
in the buffer from iterator to iterator.
The receiving iterator then owns the fixed
item; it may unfix it immediately (e.g.,
after a predicate fails), hold on to the
fixed record for a while (e.g., in a hash
table), or pass it on to the next iterator
(e.g., if a predicate succeeds). Because the
iterator control and interaction of opera-
tors ensure that items are never pro-
duced and fixed before they are required,
the iterator protocol is very efficient in its buffer usage.
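A toy sketch of this fix/unfix discipline (my own, with invented names): each iterator passes ownership of pinned pages downstream, and whichever operator drops an item unpins it.

    class BufferManager:
        def __init__(self):
            self.pin_counts = {}
        def fix(self, page_id):               # pinned pages cannot be replaced
            self.pin_counts[page_id] = self.pin_counts.get(page_id, 0) + 1
            return f"<frame for page {page_id}>"
        def unfix(self, page_id):             # unpinned pages become evictable
            self.pin_counts[page_id] -= 1

    def scan(bufmgr, page_ids):
        for pid in page_ids:
            yield pid, bufmgr.fix(pid)        # consumer now owns the pin

    def select(bufmgr, child, predicate):
        for pid, frame in child:
            if predicate(frame):
                yield pid, frame              # pass ownership downstream
            else:
                bufmgr.unfix(pid)             # predicate failed: unpin at once

    bm = BufferManager()
    for pid, frame in select(bm, scan(bm, [1, 2, 3]), lambda f: "2" not in f):
        bm.unfix(pid)                         # final consumer unpins
    print(bm.pin_counts)                      # every count is back to zero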
Some implementors, however, have felt
that intermediate results should not be
materialized or kept in the database sys-
tem’s I/O buffer, e.g., in order to ease
implementation of transaction (ACID) se-
mantics, and have designed a separate
memory management scheme for inter-
mediate results and items passed from
iterator to iterator. The cost of this deci-
sion is additional in-memory copying as
well as the possible inefficiencies associ-
ated with, in effect, two buffer and mem-
ory managers.
4. AGGREGATION AND DUPLICATE
REMOVAL
Aggregation is a very important statisti-
cal concept to summarize information
about large amounts of data. The idea is
to represent a set of items by a single
value or to classify items into groups and
determine one value per group. Most
database systems support aggregate
functions for minimum, maximum, sum,
count, and average (arithmetic mean).
Other aggregates, e.g., geometric mean
or standard deviation, are typically not
provided, but may be constructed in some
systems with extensibility features. Ag-
gregation has been added to both rela-
tional calculus and algebra and adds the
same expressive power to each of them
[Klug 1982].
Aggregation is typically supported in
two forms, called scalar aggregates and
aggregate functions [Epstein 1979].
Scalar aggregates calculate a single
scalar value from a unary input relation,
e.g., the sum of the salaries of all employ-
ees. Scalar aggregates can easily be de-
termined using a single pass over a data
set. Some systems exploit indices, in par-
ticular for minimum, maximum, and
count.
Aggregate functions, on the other hand,
determine a set of values from a binary
input relation, e.g., the sum of salaries
for each department. Aggregate functions
are relational operators, i.e., they con-
sume and produce relations. Figure 10
shows the output of the query “count of
employees by department.” The “by-list”
or grouping attributes are the key of the
new relation, the Department attribute
in this example.
Algorithms for aggregate functions re-
quire grouping, e.g., employee items may
be grouped by department, and then one
Shoe       9
Hardware   7

Figure 10. Count of employees by department.
output item is calculated per group. This grouping process is very similar to duplicate removal in which equal data items must be brought together, compared, and removed. Thus, aggregate functions and duplicate removal are typically implemented in the same module. There are only two differences between aggregate functions and duplicate removal. First, in duplicate removal, items are compared on all their attributes, but only on the attributes in the by-list of aggregate functions. Second, an identical item is immediately dropped from further consideration in duplicate removal whereas in aggregate functions some computation is performed before the second item of the same group is dropped. Both differences can easily be dealt with using a switch in an actual algorithm implementation. Because of their similarity, duplicate removal and aggregation are described and used interchangeably here.
In most existing commercial relational
systems, aggregation and duplicate re-
moval algorithms are based on sorting,
following Epstein’s [1979] work. Since
aggregation requires that all data be con-
sumed before any output can be pro-
duced, and since main memories were
significantly smaller 15 years ago when
the prototypes of these systems were de-
signed, these implementations used tem-
porary files for output, not streams and
iterator algorithms. However, there is no
reason why aggregation and duplicate re-
moval cannot be implemented using iter-
ators exploiting today’s memory sizes.
4.1 Aggregation Algorithms Based on
Nested Loops
There are three types of algorithms for
aggregation and duplicate removal based
on nested loops, sorting, and hashing.
The first algorithm, which we call
nested-loops aggregation, is the most
simple-minded one. Using a temporary
file to accumulate the output, it loops for
each input item over the output file accu-
mulated so far and either aggregates the
input item into the appropriate output
item or creates a new output item and
appends it to the output file. Obviously,
this algorithm is quite inefficient for large
inputs, even if some performance en-
hancements can be applied.7 We mention it here because it corresponds to the algorithm choices available for relational
joins and other binary matching prob-
lems (discussed in the next section),
which are the nested-loops join and the
more efficient sort-based and hash-based
join algorithms. As for joins and binary
matching, where the nested-loops algo-
rithm is the only algorithm that can eval-
uate any join predicate, the nested-loops
aggregation algorithm can support un-
usual aggregations where the input items
are not divided into disjoint equivalence
classes but where a single input item
may contribute to multiple output items.
While such aggregations are not sup-
ported in today’s database systems, clas-
sifications that do not divide the input
into equivalence classes can be useful in
both commercial and scientific applica-
tions. If the number of classifications is
small enough that all output items can
be kept in memory, the performance of
this algorithm is acceptable. However, for
the more standard database aggregation
problems, sort-based and hash-based du-
plicate removal and aggregation algo-
rithms are more appropriate.
7The possible improvements are (i) looping over
pages or clusters rather than over records of input
and output items (block nested loops), (ii) speeding
the inner loop by an index (index nested loops), a
method that has been used in some commercial
relational systems, and (iii) bit vector filtering to determine without inner loop or index lookup that an
item in the outer loop cannot possibly have a match
in the inner loop. All three of these issues are
discussed later in this survey as they apply to
binary operations such as joins and intersection.
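A minimal sketch of the nested-loops aggregation algorithm described above (my own illustration; a Python list stands in for the temporary output file):

    def nested_loops_count(items, group_of):
        output = []                                 # "temporary file" of results
        for item in items:
            for entry in output:                    # inner loop over output so far
                if entry[0] == group_of(item):
                    entry[1] += 1                   # aggregate into existing item
                    break
            else:
                output.append([group_of(item), 1])  # append a new output item
        return output

    employees = [("Ann", "Shoe"), ("Bob", "Hardware"), ("Cy", "Shoe")]
    print(nested_loops_count(employees, group_of=lambda e: e[1]))
    # [['Shoe', 2], ['Hardware', 1]]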
ACM Computing Surveys, Vol. 25, No 2, June 1993
100 “ Goetz Graefe
4.2 Aggregation Algorithms Based on
Sorting
Sorting will bring equal items together,
and duplicate removal will then be easy.
The cost of duplicate removal is domi-
nated by the sort cost, and the cost of this naive duplicate removal algorithm based on sorting can be assumed to be
that of the sort operation. For aggrega-
tion, items are sorted on their grouping
attributes.
This simple method can be improved
by detecting and removing duplicates as
early as possible, easily implemented in
the routines that write run files during
sorting. With such “early” duplicate re-
moval or aggregation, a run file can never
contain more items than the final output
(because otherwise it would contain du-
plicates!), which may speed up the final
merges significantly [Bitton and DeWitt 1983].
As for any external sort operation, the
optimizations discussed in the section on
sorting, namely read-ahead using fore-
casting, merge optimizations, large clus-
ter sizes, and reduced final fan-in for
binary consumer operations, are fully ap-
plicable when sorting is used for aggre-
gation and duplicate removal. However,
to limit the complexity of the formulas,
we derive I/O cost formulas without the
effects of these optimizations.
The amount of I/O in sort-based aggregation is determined by the number of
merge levels and the effect of early dupli-
cate removal on each merge step. The
total number of merge levels is unaf-
fected by aggregation; in sorting with
quicksort and without optimized merg-
ing, the number of merge levels is L = ⌈log_F(R/M)⌉ for input size R, memory
size M, and fan-in F. In the first merge
levels, the likelihood is negligible that
items of the same group end up in the
same run file, and we therefore assume
that the sizes of run files are unaffected
until their sizes would exceed the size of
the final output. Runs on the first few
merge levels are of size M × F^i for level
i, and runs of the last levels have the
same size as the final output. Assuming
the output cardinality (number of items) is G times less than the input cardinality (G = R/O), where G is called the average group size or the reduction factor, only the last ⌈log_F(G)⌉ merge levels, including the final merge, are affected by
early aggregation because in earlier lev-
els more than G runs exist, and items
from each group are distributed over all
those runs, giving a negligible chance of
early aggregation.
In the first merge levels, all input items
participate, and the cost for these levels
can be determined without explicitly cal-
culating the size and number of run files
on these levels. In the affected levels, the
size of the output runs is constant, equal
to the size of the final output O = R/G,
while the number of run files decreases
by a factor equal to the fan-in F in each
level. The number of affected levels that
create run files is L2 = ⌈log_F(G)⌉ - 1; the subtraction of 1 is necessary because the final merge does not create a run file but the output stream. The number of unaffected levels is L1 = L - L2. The number of input runs is W/F^i on level i (recall
the number of initial runs W = R/M
from the discussion of sorting). The total cost,8 including a factor of 2 for writing and reading, is

2 × R × L1 + 2 × O × Σ_{i=L1}^{L-1} (W/F^i)
  = 2 × R × L1 + 2 × O × W × (1/F^L1 - 1/F^L)/(1 - 1/F).
For example, consider aggregating R = 100 MB of input into O = 1 MB of output (i.e., reduction factor G = 100) using a system with M = 100 KB memory and fan-in F = 10. Since the input is W = 1,000 times the size of memory, L = 3 merge levels will be needed. The last L2 = ⌈log_F(G)⌉ - 1 = 1 merge level into temporary run files will permit early aggregation. Thus, the total I/O will be

2 × 100 × 2 + 2 × 1 × 1000 × (1/10^2 - 1/10^3)/(1 - 1/10)
  = 400 + 2 × 1000 × 0.009/0.9 = 420 MB,

which has to be divided by the cluster size used and multiplied by the time to read or write a cluster to estimate the I/O time for aggregation based on sorting. Naive separation of sorting and subsequent aggregation would have required reading and writing the entire input file three times, for a total of 600 MB of I/O. Thus, early aggregation realizes almost 30% savings in this case.

8 Using Σ_{i=0}^{K} a^i = (1 - a^(K+1))/(1 - a) and Σ_{i=K}^{L} a^i = Σ_{i=0}^{L} a^i - Σ_{i=0}^{K-1} a^i = (a^K - a^(L+1))/(1 - a).
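The cost formula reproduces this example when transcribed into code (the helper name is mine; sizes are in MB):

    import math

    def sort_agg_io(R, O, M, F):
        W = R / M                                  # number of initial runs
        L = math.ceil(math.log(R / M, F))          # total merge levels
        L2 = math.ceil(math.log(R / O, F)) - 1     # levels with early aggregation
        L1 = L - L2                                # unaffected levels
        return 2 * R * L1 + 2 * O * W * (F**-L1 - F**-L) / (1 - 1 / F)

    print(sort_agg_io(R=100, O=1, M=0.1, F=10))    # -> ~420 MB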
Aggrega~e queries may require that
duplicates be removed from the input set
to the aggregate functions,9 e.g., if the
SQL distinct keyword is used. If such an
aggregate function is to be executed us-
ing sorting, early aggregation can be used
only for the duplicate removal part. How-
ever, the sort order used for duplicate
removal can be suitable to permit the subsequent aggregation as a simple filter
operation on the duplicate removal’s out-
put stream.
4.3 Aggregation Algorithms Based on
Hashing
Hashing can also be used for aggregation
by hashing on the grouping attributes.
Items of the same group (or duplicate
items in duplicate removal) can be found
and aggregated when inserting them into
the hash table. Since only output items,
not input items, are kept in memory,
hash table overflow occurs only if the
output does not fit into memory. How-
ever, if overflow does occur, the partition
9 Consider two queries, both counting salaries per
department. In order to determine the number of
(salaried) employees per department, all salaries
are counted without removing duplicate salary val-
ues. On the other hand, in order to assess salary
differentiation in each department, one might want
to determine the number of distinct salary levels in
each department. For this query, only distinct
salaries are counted, i.e., duplicate department-
salary pairs must be removed prior to counting.
(This refers to the latter type of query.)
files (all partitioning files in any one re-
cursion level) will basically be as large as
the entire input because once a partition
is being written to disk, no further aggre-
gation can occur until the partition files
are read back into memory.
The amount of I/O for hash-based
aggregation depends on the number of
partitioning (recursion) levels required
before the output (not the input) of one
partition fits into memory. This will be
the case when partition files have been
reduced to the size G × M. Since the par-
titioning files shrink by a factor of F at
each level (presuming hash value skew is
absent or effectively counteracted), the
number of partitioning (recursion) levels
is ⌈log_F(R/G/M)⌉ = ⌈log_F(O/M)⌉ for input size R, output size O, reduction factor G, and fan-out F. The costs at each
level are proportional to the input file
size R. The total I/O volume for hashing with overflow avoidance, including a factor of 2 for writing and reading, is 2 × R × ⌈log_F(O/M)⌉.
The last partitioning level may use hybrid hashing, i.e., it may not involve I/O for the entire input file. In that case, L = ⌊log_F(O/M)⌋ complete recursion levels involving all input records are required, partitioning the input into files of size R' = R/F^L. In each remaining hybrid hash aggregation, the size limit for overflow files is M × G because such an overflow file can be aggregated in memory. The number of partition files K must satisfy K × M × G + (M - K × C) × G ≥ R', meaning K = ⌈(R'/G - M)/(M - C)⌉ partition files will be created. The total I/O cost for hybrid hash aggregation is

2 × R × L + 2 × F^L × (R' - (M - K × C) × G)
  = 2 × (R × (L + 1) - F^L × (M - K × C) × G)
  = 2 × (R × (L + 1) - F^L × (M - ⌈(R'/G - M)/(M - C)⌉ × C) × G).
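The same formula in executable form (names are mine); the parameters match Figure 11 below, with sizes in KB to avoid fractional pages: R = 100 MB of input, M = 100 KB of memory, C = 8 KB clusters, and F = 10.

    import math

    def hybrid_hash_agg_io(R, G, M, C, F):
        O = R / G                                # output size
        if O <= M:
            return 0.0                           # output fits: no overflow I/O
        L = math.floor(math.log(O / M, F))       # full partitioning levels
        Rp = R / F**L                            # partition file size after L
        K = math.ceil((Rp / G - M) / (M - C))    # overflow files per partition
        return 2 * (R * (L + 1) - F**L * (M - K * C) * G)

    for G in (10, 100, 1000):
        print(G, hybrid_hash_agg_io(102400, G, 100, 8, 10) / 1024, "MB")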
Figure 11. Performance of sort- and hash-based aggregation. [The figure plots total I/O in MB against the group size or reduction factor (1 to 1,000); the four curves are sorting without early aggregation, sorting with early aggregation, hashing without hybrid hashing, and hashing with hybrid hashing.]
As for sorting, if an aggregate query
requires duplicate removal for the input
set to the aggregate function,10 the group
size or reduction factor of the duplicate
removal step determines the perfor-
mance of hybrid hash duplicate removal.
The subsequent aggregation can be per-
formed after the duplicate removal as an
additional operation within each hash
bucket or as a simple filter operation on
the duplicate removal’s output stream.
4.4 A Rough Performance Comparison
It is interesting to note that the perfor-
mance of both sort-based and hash-based
aggregation is logarithmic and improves
with increasing reduction factors. Figure
11 compares the performance of sort- and
hash-based aggregation11 using the for-
mulas developed above for 100 MB input
data, 100 KB memory, clusters of 8 KB,
fan-in or fan-out of 10, and varying group
sizes or reduction factors. The output size
is the input size divided by the group
size.
It is immediately obvious in Figure 11
that sorting without early aggregation is
not competitive because it does not limit
the sizes of run files, confirming the re-
10 See footnote 9.
11 Aggregation by nested-loops methods is omitted
from Figure 11 because it is not competitive for
large data sets.
suits of Bitton and DeWitt [1983]. The
other algorithms all exhibit similar,
though far from equal, performance im-
provements for larger reduction factors.
Sorting with early aggregation improves
once the reduction factor is large enough
to affect not only the final but also previ-
ous merge steps. Hashing without hybrid
hashing improves in steps as the number
of partitioning levels can be reduced, with
“step” points where G = F^i for some i.
Hybrid hashing exploits all available
memory to improve performance and
generally outperforms overflow avoid-
ance hashing. At points where overflow
avoidance hashing shows a step, hybrid
hashing has no effect, and the two hash-
ing schemes have the same performance.
While hash-based aggregation and du-
plicate removal seem superior in this
rough analytical performance compari-
son, recall that the cost formula for sort-
based aggregation does not include the
effects of replacement selection or the
merge optimizations discussed earlier in
the section on sorting; therefore, Figure
11 shows an upper bound for the I/O
cost of sort-based aggregation and dupli-
cate removal. Furthermore, since the cost
formula for hashing presumes optimal
assignments of hash buckets to output
partitions, the real costs of sort- and
hash-based aggregation will be much
more similar than they appear in Figure
11. The important point is that both their
ACM Computing Surveys, Vol 25, No 2, June 1993
Query Evaluation Techniques “ 103
costs are logarithmic with the input size,
improve with the group size or reduction
factor, and are quite similar overall.
4.5 Additional Remarks on Aggregation
Some applications require multilevel ag-
gregation. For example, a report genera-
tion language might permit a request like “sum (employee.salary by employee.id by employee.department by employee.division)” to create a report with an entry for
ifying such reports concisely was the
driving design goal for the report genera-
tion language RPG. In SQL, this requires
multiple cursors within an application
program, one for each level of detail. This
is very undesirable for two reasons. First,
the application program performs what
is essentially a join of three inputs. Such
joins should be provided by the database
system, not required to be performed
within application programs. Second, the
database system more likely than not
executes the operations for these cursors
independently from one another, result-
ing in three sort operations on the em-
ployee file instead of one.
If complex reporting applications are
to be supported, the query language
should support direct requests (perhaps
similar to the syntax suggested above),
and the sort operator should be imple-
mented such that it can perform the en-
tire operation in a single sort and one
final pass over the sorted data. An analo-
gous algorithm based on hashing can be
defined; however, if the aggregated data
are required in sort order, sort-based ag-
gregation will be the algorithm of choice.
For some applications, exact aggregate
functions are not required; reasonably
close approximations will do. For exam-
ple, exploratory (rather than final pre-
cise) data analysis is frequently very use-
ful in “approaching” a new set of data
[Tukey 1977]. In real-time systems, pre-
cision and response time may be reason-
able tradeoffs. For database query opti-
mization, approximate statistics are a
sufficient basis for selectivity estimation,
cost calculation, and comparison of alter-
native plans. For these applications,
faster algorithms can be designed that
rely either on a single sequential scan of
the data (no run files, no overflow files)
or on sampling [Astrahan et al. 1987;
Hou and Ozsoyoglu 1991; 1993; Hou et
al. 1991].
5. BINARY MATCHING OPERATIONS
While aggregation is essential to con-
dense information, there are a number of
database operations that combine infor-
mation from two inputs, files, or sets and
therefore are essential for database
systems’ ability to provide more than
reliable shared storage and to perform
inferences, albeit limited. A group of op-
erators that all do basically the same
task are called the one-to-one match op-
erations here because an input item con-
tributes to the output depending on its
match with one other item. The most
prominent among these operations is the
relational join. Mishra and Eich [1992]
have recently written a survey of join
algorithms, which includes an interest-
ing analysis and comparison of algo-
rithms focusing on how data items from
the two inputs are compared with one
another. The other one-to-one match op-
erations are left and right semi-joins, left,
right, and symmetric outer-joins, left and
right anti-semi-joins, symmetric anti-
join, intersection, union, left and right
differences, and symmetric or anti-dif-
ference.12 Figure 12 shows the basic
12 The anti-semijoin of R and S is R ANTI-SEMIJOIN
S = R − (R SEMIJOIN S), i.e., the items in R
without matches in S. The (symmetric) anti-join
contains those items from both inputs that do not
have matches, suitably padded as in outer joins to
make them union compatible. Formally, the (sym-
metric) anti-join of R and S is R ANTI-JOIN S =
(R ANTI-SEMIJOIN S) ∪ (S ANTI-SEMIJOIN R), with the tuples
of the two union arguments suitably extended with
null values. The symmetric or anti-difference is the
union of the two differences. Formally, the anti-dif-
ference of R and S is (R ∪ S) − (R ∩ S) = (R −
S) ∪ (S − R) [Maier 1983]. Among these three op-
erations, the anti-semijoin is probably the most
useful one, as in the query to “find the courses that
don’t have any enrollment.”
[Diagram: two overlapping sets R and S; region A holds the items of R without a match in S, region B the matching items, and region C the items of S without a match in R. The accompanying table maps output regions to operations:]

Output     Match on all attributes    Match on some attributes
A          Difference                 Anti-semi-join
B          Intersection               Join, semi-join
C          Difference                 Anti-semi-join
A, B                                  Left outer join
A, C       Symmetric difference       Anti-join
B, C                                  Right outer join
A, B, C    Union                      Symmetric outer join

Figure 12. Binary one-to-one matching.
principle underlying all these operations,
namely separation of the matching and
nonmatching components of two sets,
called R and S in the figure, and produc-
tion of appropriate subsets, possibly after
some transformation and combination of
records as in the case of a join. If the sets
R and S have different schemas as in
relational joins, it might make sense to
think of the set B as two sets B_R and
B_S, i.e., the matching elements from R
and S. This distinction permits a clearer
definition of left semi-join and right
semi-join, etc. Since all these operations
require basically the same steps and can
be implemented with the same algo-
rithms, it is logical to implement them in
one general and efficient module. For
simplicity, only join algorithms are dis-
cussed here. Moreover, we discuss algo-
rithms for only one join attribute since
the algorithms for multi-attribute joins
(and their performance) are not different
from those for single-attribute joins.
Since set operations such as intersec-
tion and difference will be used and must
be implemented efficiently for any data
model, this discussion is relevant to rela-
tional, extensible, and object-oriented
database systems alike. Furthermore, bi-
nary matching problems occur in some
surprising places. Consider an object-
oriented database system that uses a
table to map logical object identifiers
(OIDs) to physical locations (record iden-
tifiers or RIDs). Resolving a set of OIDs
to RIDs can be regarded (as well as opti-
mized and executed) as a semi-join of the
mapping table and the set of OIDs, and
all conventional join strategies can be
employed. Another example that can oc-
cur in a database management system
for any data model is the use of multiple
indices in a query: the pointer (OID or
RID) lists obtained from the indices must
be intersected (for a conjunction) or
united (for a disjunction) to obtain the
list of pointers to items that satisfy the
whole query. Moreover, the actual lookup
of the items using the pointer list can be
regarded as a semi-join of the underlying
data set and the list, as in Kooi’s [1980]
thesis and the Ingres product [Kooi and
Frankforth 1982] and a recent study by
Shekita and Carey [1990]. Finally, many
path expressions in object-oriented
database systems such as “employee.de-
partment.manager.office.location” can
frequently be interpreted, optimized, and
executed as a sequence of one-to-one
match operations using existing join and
semi-join algorithms. Thus, even if rela-
tional systems were completely abolished
and replaced by object-oriented database
systems, set matching and join tech-
niques developed in the relational con-
text would continue to be important for
the performance of database systems.
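As a small illustration of the multiple-index case (a sketch under our own assumptions, not code from any of the systems cited), two ascending RID lists obtained from two indices can be intersected for a conjunctive query before the actual record lookups:

    # Sketch: intersecting two sorted RID lists for a conjunction.
    def intersect_rid_lists(a, b):
        # a, b: ascending lists of record identifiers from two indices
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

The subsequent lookup of the qualifying records using the resulting list is then the semi-join of the underlying data set and the list mentioned above; a union of the lists would serve a disjunction analogously.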
Most of today’s commercial database
systems use only nested loops and
merge-join because an analysis per-
formed in connection with the System R
project determined that of all the join
methods considered, one of these two al-
ways provided either the best or very
close to the best performance [Blasgen
and Eswaran 1976; 1977]. However, the
System R study did not consider hash
join algorithms, which are now regarded
as more efficient in many cases.
There continues to be a strong interest
in join techniques, although the interest
has shifted over the last 20 years from
basic algorithmic concerns to parallel
techniques and to techniques that adapt
to unpredictable run-time situations such
as data skew and changing resource
availability. Unfortunately, many new
proposed techniques fail a very simple
test (which we call the “Guy Lohman test
for join techniques” after the first person
who pointed this test out to us), making
them problematic for nontrivial queries.
The crucial test question is: Does this
new technique apply to joining three in-
puts without interrupting data flow be-
tween the join operators? For example, a
technique fails this test if it requires ma-
terializing the entire intermediate join
result for random sampling of both join
inputs or for obtaining exact knowledge
about both join input sizes. Given its im-
portance, this test should be applied to
both proposed query optimization and
query execution techniques.
For the I/O cost formulas given here,
we assume that the left and right inputs
have R and S pages, respectively, and
that the memory size is M pages. We
assume that the algorithms are imple-
mented as iterators and omit the cost of
reading stored inputs and writing an op-
eration’s output from the cost formulas
because both inputs and output may be
iterators, i.e., these intermediate results
are never written to disk, and because
these costs are equal for all algorithms.
5.1 Nested-Loops Join Algorithms
The simplest and, in some sense, most
direct algorithm for binary matching is
the nested-loops join: for each item in one
input (called the outer input), scan the
entire other input (called the inner in-
put) and find matches. The main advan-
tage of this algorithm is its simplicity.
Another advantage is that it can com-
pute a Cartesian product and any θ-join
of two relations, i.e., a join with an arbi-
trary two-relation comparison predicate.
However, Cartesian products are avoided
by query optimizers because their out-
puts tend to contain many data items
that will eventually not satisfy a query
predicate verified later in the query eval-
uation plan.
Since the inner input is scanned re-
peatedly, it must be stored in a file, i.e., a
temporary file if the inner input is pro-
duced by a complex subplan. This situa-
tion does not change the cost of nested
loops; it just replaces the first read of the
inner input with a write.
Except for very small inputs, the per-
formance of nested-loops join is disas-
trous because the inner input is scanned
very often, once for each item in the outer
input. There are a number of improve-
ments that can be made to this naive
nested-loops join. First, for one-to-one
match operations in which a single match
carries all necessary information, e.g.,
semi-join and intersection, a scan of the
inner input can be terminated after the
first match for an item of the outer input.
Second, instead of scanning the inner in-
put once for each item from the outer
input, the inner input can be scanned
once for each page of the outer input, an
algorithm called block nested-loops join
[Kim 1980]. Third, the performance can
be improved further by filling all of mem-
ory except K pages with pages of the
outer input and by using the remaining
K pages to scan the inner input and to
save pages of the inner input in memory.
Finally, scans of the inner input can be
made a little faster by scanning the inner
input alternatingly forward and back-
ward, thus reusing the last page of the
previous scan and therefore saving one
I/O per inner scan. The I/O cost for this
version of nested-loops join is the product
of the number of scans (determined by
the size of the outer input) and the cost
per scan of the inner input, plus K I/Os
because the first inner scan has to scan
or save the entire inner input. Thus, the
total cost for scanning the inner input
repeatedly is ⌈R/(M − K)⌉ × (S − K) +
K. This expression is minimized if K = 1
and R ≥ S, i.e., the larger input should
be the outer.
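A minimal sketch of block nested-loops join follows; it is our own illustration, with memory measured in records rather than pages and inner_scan standing for a fresh scan of the stored inner input:

    # Sketch: block nested-loops join.
    from itertools import islice

    def block_nested_loops_join(outer, inner_scan, memory, predicate):
        # outer: iterator of records; inner_scan(): returns a new scan
        # of the inner input; memory: outer records buffered per scan.
        outer = iter(outer)
        while True:
            block = list(islice(outer, memory))  # fill memory with outer
            if not block:
                break
            for s in inner_scan():               # one inner scan per block
                for r in block:
                    if predicate(r, s):
                        yield (r, s)

The rule derived above, K = 1 with the larger input as the outer, corresponds to giving the block of outer records nearly all of memory.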
If the critical performance measure is
not the amount of data read in the re-
peated inner scans but the number of
I/O operations, more than one page
should be moved in each 1/0, even if
more memory has to be dedicated to the
inner input and less to the outer input,
thus increasing the number of passes over
the inner input. If C pages are moved in
each I/O on the inner input and M − C
pages for the outer input, the number of
I/Os is ⌈R/(M − C)⌉ × (S/C) + 1,
which is minimized if C = M/2. In other
words, in order to minimize the number
of large-chunk I/O operations, the clus-
ter size should be chosen as half the
available memory size [Hagmann 1986].
Finally, index nested-loops join ex-
ploits a permanent or temporary index
on the inner input’s join attribute to re-
place file scans by index lookups. In prin-
ciple, each scan of the inner input in
naive nested-loops join is used to find
matches, i.e., to provide associativity. Not
surprisingly, since all index structures
are designed and used for the purpose of
associativity, any index structure sup-
porting the join predicate (such as =, <,
etc.) can be used for index nested-loops
join. The fastest indices for exact match
queries are hash indices, but any index
structure can be used, ordered or un-
ordered (hash), single- or multi-attribute,
single- or multidimensional. Therefore,
indices on frequently used join attributes
(keys and foreign keys in relational sys-
tems) may be useful. Index nested-loops
join is also used sometimes with indices
built on the fly, i.e., indices built on inter-
mediate query processing results.
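For illustration, here is a sketch (ours; a Python dictionary stands in for a temporary hash index, the fastest kind of index for exact-match queries) of index nested-loops join with an index built on the fly:

    # Sketch: index nested-loops join with a temporary hash index.
    from collections import defaultdict

    def index_nested_loops_join(outer, inner, outer_key, inner_key):
        index = defaultdict(list)
        for s in inner:                        # build the index on the fly
            index[inner_key(s)].append(s)
        for r in outer:
            for s in index.get(outer_key(r), ()):  # lookup replaces a scan
                yield (r, s)

For a preexisting index, only the probing loop remains, which is the usual case for joins on keys and foreign keys.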
A recent investigation by DeWitt et al.
[1993] demonstrated that index nested-
loops join can be the fastest join method
if one of the inputs is so small and if the
other indexed input is so large that the
number of index and data page re-
trievals, i.e., about the product of the
index depth and the cardinality of the
smaller input, is smaller than the num-
ber of pages in the larger input.
Another interesting idea using two or-
dered indices, e.g., a B-tree on each of the
two join columns, is to switch roles of
inner and outer join inputs after each
index lookup, which leads to the name
“zig-zag join.” For example, for a join
predicate R.a = S.a, a scan in the index
on R.a finds the lowest join attribute
value in R, which is then looked up in
the index on S.a. A continuing scan in
the index on S.a yields the next possible
join attribute value, which is looked up
in the index on R.a, etc. It is not immedi-
ately clear under which circumstances
this join method is most efficient.
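The following sketch conveys the zig-zag idea on two sorted, duplicate-free key lists standing in for the leaf levels of the two ordered indices; it is our simplification, with binary search in place of index lookups:

    # Sketch: zig-zag join over two sorted, duplicate-free key lists.
    from bisect import bisect_left

    def zigzag_join_keys(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i = bisect_left(a, b[j], i + 1)  # look b's value up in a
            else:
                j = bisect_left(b, a[i], j + 1)  # look a's value up in b
        return out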
For complex queries, N-ary joins are
sometimes written as a single module,
i.e., a module that performs index lookups
into indices of multiple relations and joins
all relations simultaneously. However, it
is not clear how such a multi-input join
implementation is superior to multiple
index nested-loops joins.
5.2 Merge-Join Algorithms
The second commonly used join method
is the merge-join. It requires that both
inputs are sorted on the join attribute.
Merging the two inputs is similar to the
merge process used in sorting. An impor-
tant difference, however, is that one of
the two merging scans (the one which is
advanced on equality, usually called the
inner input) must be backed up when
both inputs contain duplicates of a join
attribute value and when the specific
one-to-one match operation requires that
all matches be found, not just one match.
Thus, the control logic for merge-join
variants for join and semi-join are slightly
different. Some systems include the no-
tion of “value packet,” meaning all items
with equal join attribute values [Kooi
1980; Kooi and Frankforth 1982]. An it-
erator’s next call returns a value packet,
not an individual item, which makes the
control logic for merge-join much easier.
If (or after) both inputs have been sorted,
the merge-join algorithm typically does
not require any I/O, except when “value
packets” are larger than memory. (See
footnote 1.)
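The value-packet formulation makes the control logic easy to show. The sketch below (ours, for sorted inputs with non-null, comparable join keys) forms a value packet per distinct key on each input and produces the cross product of matching packets, which takes the place of backing up the inner scan on duplicates:

    # Sketch: merge-join on value packets.
    from itertools import groupby

    def merge_join(left, right, key):
        # left, right: iterators sorted on the join attribute
        L = groupby(left, key)
        R = groupby(right, key)
        lk, lp = next(L, (None, None))
        rk, rp = next(R, (None, None))
        while lp is not None and rp is not None:
            if lk == rk:                   # matching value packets
                rp = list(rp)              # materialize the right packet
                for l in lp:
                    for r in rp:
                        yield (l, r)
                lk, lp = next(L, (None, None))
                rk, rp = next(R, (None, None))
            elif lk < rk:
                lk, lp = next(L, (None, None))
            else:
                rk, rp = next(R, (None, None))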
An input may be sorted because a
stored database file was sorted, an or-
dered index was used, an input was
sorted explicitly, or the input came from
an operation that produced sorted out-
put, e.g., another merge-join. The last
point makes merge-join an efficient algo-
rithm if items from multiple sources are
matched on the same join attribute(s) in
multiple binary steps because sorting in-
termediate results is not required for
later merge-joins, which led to the con-
cept of interesting orderings in the Sys-
tem R query optimizer [Selinger et al.
1979]. Since set operations such as inter-
section and union can be evaluated using
any sort order, as long as the same sort
order is present in both inputs, the effect
of interesting orderings for one-to-one
match operators based on merge-join can
always be exploited for set operations.
A combination of nested-loops join and
merge-join is the heap-filter merge-join
[Graefe 1991]. It first sorts the smaller
inner input by the join attribute and
saves it in a temporary file. Next, it uses
all available memory to create sorted
runs from the larger outer input using
replacement selection. As discussed in the
section on sorting, there will be about
W = R/(2 x M) + 1 such runs for outer
input size R. These runs are not written
to disk; instead, they are joined immedi-
ately with the sorted inner input using
merge-join. Thus, the number of scans of
the inner input is reduced to about one
half when compared to block nested loops.
On the other hand, when compared to
merge-join, it saves writing and read-
ing temporary files for the larger outer
input.
Another derivation of merge-join is the
hybrid join used in IBM’s DB2 product
[Cheng et al. 1991], combining elements
from index nested-loops join, merge-join,
and techniques joining sorted lists of in-
dex leaf entries. After sorting the outer
input on its join attribute, hybrid join
uses a merge algorithm to “join” the outer
input with the leaf entries of a preexist-
ing B-tree index on the join attribute of
the inner input. The result file contains
entire tuples from the outer input and
record identifiers (RIDs, physical ad-
dresses) for tuples of the inner input.
This file is then sorted on the physical
locations, and the tuples of the inner re-
lation can then be retrieved from disk
very efficiently. This algorithm is not en-
tirely new as it is a special combination
of techniques explored by Blasgen and
Eswaran [1976; 1977], Kooi [1980], and
Whang et al. [1984; 1985]. Blasgen and
Eswaran considered the manipulation of
RID lists but concluded that either
merge-join or nested-loops join is the op-
timal choice in almost all cases; based on
this study, only these two algorithms
were implemented in System R [Astra-
han et al. 1976] and subsequent rela-
tional database systems. Kooi’s optimizer
treated an index similarly to a base rela-
tion and the lookup of data records from
index entries as a join; this naturally
permitted joining two indices or an index
with a base relation as in hybrid join.
5.3 Hash Join Algorithms
Hash join algorithms are based on the
idea of building an in-memory hash table
on one input (the smaller one, frequently
called the build input) and then probing
this hash table using items from the other
input (frequently called the probe input ).
These algorithms have only recently
found greater interest [Bratbergsengen
1984; DeWitt et al. 1984; DeWitt and
Gerber 1985; DeWitt et al. 1986; Fushimi
et al. 1986; Kitsuregawa et al. 1983;
1989a; Nakayama et al. 1988; Omiecin-
ski 1991; Schneider and DeWitt 1989;
Shapiro 1986; Zeller and Gray 1990]. One
reason is that they work very fast, i.e.,
without any temporary files, if the build
input does indeed fit into memory, inde-
pendently of the size of the probe input.
However, they require overflow avoid-
ance or resolution methods for larger
build inputs, and suitable methods were
developed and experimentally verified
only in the mid-1980’s, most notably in
connection with the Grace and Gamma
database machine projects [DeWitt et al.
1986; 1990; Fushimi et al. 1986; Kitsure-
gawa et al. 1983].
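The in-memory case is easily sketched (our illustration; Python lists and dictionaries stand in for files and the hash table):

    # Sketch: classic in-memory hash join, build then probe.
    from collections import defaultdict

    def hash_join(build, probe, build_key, probe_key):
        table = defaultdict(list)
        for b in build:                        # build phase: smaller input
            table[build_key(b)].append(b)
        for p in probe:                        # probe phase: any size
            for b in table.get(probe_key(p), ()):
                yield (b, p)

No temporary files arise as long as the build input fits into memory, however large the probe input is.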
In hash-based join methods, build and
probe inputs are partitioned using the
same partitioning function, e.g., the join
key value modulo the number of parti-
tions. The final join result can be formed
by concatenating the join results of pairs
of partitioning files. Figure 13 shows the
effect of partitioning the two inputs of a
binary operation such as join into hash
buckets and partitions. (This figure was
adapted from a similar diagram by Kit-
suregawa et al. [1983]. Mishra and Eich
[1992] recently adapted and generalized
it in their survey and comparison of rela-
tional join algorithms.) Without parti-
tioning, each item in the first input must
be compared with each item in the sec-
ond input; this would be represented by
complete shading of the entire diagram.
With partitioning, items are grouped into
partition files, and only pairs in the se-
ries of small rectangles (representing the
partitions) must be compared.
If a build partition file is still larger
than memory, recursive partitioning is
required. Recursive partitioning is used
for both build- and probe-partitioning
files using the same hash and partition-
ing functions. Figure 14 shows how both
input files are partitioned together. The
partial results obtained from pairs of
partition files are concatenated to form
the result of the entire match operation.
Recursive partitioning stops when the
build partition fits into memory. Thus,
the recursion depth of partitioning for
binary match operators depends only on
the size of the build input (which there-
fore should be chosen to be the smaller
input) and is independent of the size of
the probe input. Compared to sort-based
binary matching operators, i.e., variants
of merge-join in which the number of
merge levels is determined for each input
[Diagram: the first join input along one axis, the second join input along the other; only the small rectangles along the diagonal, i.e., pairs of corresponding partitions, must be compared.]

Figure 13. Effect of partitioning for join operations.

Figure 14. Recursive partitioning in binary operations.
file individually, hash-based binary
matching operators are particularly ef-
fective when the input sizes are very dif-
ferent [Bratbergsengen 1984; Graefe et
al. 1993].
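A rough sketch of the recursive scheme follows (ours, reusing the hash_join sketch above and counting memory in records; real systems partition to files and add hybrid hashing):

    # Sketch: recursive, Grace-style partitioning for binary matching.
    def partitioned_hash_join(build, probe, bkey, pkey, memory, fanout=10):
        if len(build) <= memory:              # recursion ends on build size
            yield from hash_join(build, probe, bkey, pkey)
            return
        b_parts = [[] for _ in range(fanout)]
        p_parts = [[] for _ in range(fanout)]
        for b in build:                       # same partitioning function
            b_parts[hash(bkey(b)) % fanout].append(b)
        for p in probe:                       # ... applied to both inputs
            p_parts[hash(pkey(p)) % fanout].append(p)
        for bp, pp in zip(b_parts, p_parts):  # concatenate pair results
            yield from partitioned_hash_join(bp, pp, bkey, pkey,
                                             memory, fanout)

Note that the recursion terminates on the build partition's size alone, mirroring the independence from the probe input's size discussed above.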
The I/O cost for binary hybrid hash
operations can be determined by the
number of complete levels (i.e., levels
without hash table) and the fraction of
the input remaining in memory in the
deepest recursion level. For memory size
M, cluster size C, partitioning fan-out
F = ⌊M/C − 1⌋, build input size R, and
probe input size S, the number of com-
plete levels is L = ⌊log_F(R/M)⌋, after
which the build input partitions should
be of size R′ = R/F^L. The I/O cost for
the binary operation is the cost of parti-
tioning the build input divided by the
size of the build input and multiplied by
the sum of the input sizes. Adapting the
cost formula for unary hashing discussed
earlier, the total amount of I/O for a
recursive binary hash operation is

2 × (R × (L + 1) − F^L × (M − ⌈(R′ − M + C)/(M − C)⌉ × C)) / R × (R + S)

which can be approximated with 2 ×
log_F(R/M) × (R + S).
the cost of binary hash operations on
large inputs is logarithmic; the main
difference to the cost of merge-join is
that the recursion depth (the logarithm)
depends only on one file, the build input,
and is not taken for each file individually.
As for all operations based on parti-
tioning, partitioning (hash) value skew is
the main danger to effectiveness. When
using statistics on hash value distribu-
tions to determine which buckets should
stay in memory in hybrid hash algo-
rithms, the goal is to avoid as much I/O
as possible with the least memory “in-
vestment.” Thus, it is most effective to
retain those buckets in memory with few
build items but many probe items or,
more formally, the buckets with the
smallest value for r_i/(r_i + s_i), where r_i
and s_i indicate the total size of a bucket’s
build and probe items [Graefe 1993b].
5.4 Pointer-Based Joins
Recently, links between data items have
found renewed interest, be it in object-
oriented systems in the form of object
identifiers (OIDs) or as access paths for
faster execution of relational joins. In a
sense, links represent a limited form of
precomputed results, somewhat similar
to indices and join indices, and have the
usual cost-versus-benefit tradeoff be-
tween query performance enhancement
and maintenance effort. Kooi [1980] mod-
eled the retrieval of actual records after
index searches as “TID joins” (tuple iden-
tifiers permitting direct record access) in
his query optimizer for Ingres; together
with standard join commutativity and
associativity rules, this model permitted
exploring joins of indices of different re-
lations (joining lists of key-TID pairs) or
joins of one relation with another rela-
tion’s index. In the Genesis data model
and database system, Batory et al.
[1988a; 1988b] modeled joins in a func-
tional way, borrowing from research into
the database languages FQL [Buneman
et al. 1982; Buneman and Frankel 1979],
DAPLEX [Shipman 1981], and Gem [Tsur
and Zaniolo 1984; Zaniolo 1983] and
permitting pointer-based join imple-
mentations in addition to traditional
value-based implementations such as
nested-loops join, merge-join, and hy-
brid hash join.
Shekita and Carey [ 1990] recently ana-
lyzed three pointer-based join methods
based on nested-loops join, merge-join,
and hybrid hash join. Presuming rela-
tions R and S, with a pointer to an S
tuple embedded in each R tuple, the
nested-loops join algorithm simply scans
through R and retrieves the appropriate
S tuple for each R tuple. This algorithm
is very reminiscent of unclustered index
scans and performs similarly poorly for
larger set sizes. Their conclusion on naive
pointer-based join algorithms is that “it
is unwise for object-oriented database
systems to support only pointer-based
join algorithms.”
The merge-join variant starts with
sorting R on the pointers (i.e., according
to the disk address they point to) and
then retrieves all S items in one elevator
pass over the disk, reading each S page
at most once. Again, this idea was sug-
gested before for unclustered index scans,
and variants similar to heap-filter
merge-join [Graefe 1991] and complex ob-
ject assembly using a window and prior-
ity heap of open references [Keller et al.
1991] can be designed.
The hybrid hash join variant partitions
only relation R on pointer values, ensur-
ing that R tuples with S pointers to the
same page are brought together, and then
retrieves S pages and tuples. Notice that
the two relations’ roles are fixed by the
direction of the pointers, whereas for
standard hybrid hash join the smaller
relation should be the build input. Differ-
[Plot: I/O count (×1,000, from 0 to 125) against the size of R (100 to 1,500, with S = 10 × R) for five methods: pointer join with pointers from S to R, nested-loops join, merge-join with two sorts, hashing not using hybrid hashing, and pointer join with pointers from R to S.]

Figure 15. Performance of alternative join methods.
ently than standard hybrid hash join,
relation S is not partitioned. This algo-
rithm performs somewhat faster than
pointer-based merge-join if it keeps some
partitions of R in memory and if sorting
writes all R tuples into runs before merg-
ing them.
Pointer-based join algorithms tend to
outperform their standard value-based
counterparts in many situations, in par-
ticular if only a small fraction of S actu-
ally participates in the join and can be
selected effectively using the pointers in
R. Historically, due to the difficulty of
correctly maintaining pointers (nones-
sential links), they were rejected as a
relational access method in System R
[Chamberlin et al. 1981a] and subse-
quently in basically all other systems,
perhaps with the exception of Kooi’s
modified Ingres [Kooi 1980; Kooi and
Frankforth 1982]. However, they were
reevaluated and implemented in the
Starburst project, both as a test of Star-
burst’s extensibility and as a means of
supporting “more object-oriented” modes
of operation [Haas et al. 1990].
5.5 A Rough Performance Comparison
Figure 15 shows an approximate perfor-
mance comparison using the cost formu-
las developed above for block nested-loops
join; merge-join with sorting both inputs
without optimized merging; hash join
without hybrid hashing, bucket tuning,
or dynamic destaging; and pointer joins
with pointers from R to S and from S to
R without grouping pointers to the same
target page together. This comparison is
not precise; its sole purpose is to give a
rough idea of the relative performance of
the algorithm groups, deliberately ignor-
ing the many tricks used to improve and
fine-tune the basic algorithms. The rela-
tion sizes vary; S is always 10 times
larger than R. The memory size is 100
KB; the cluster size is 8 KB; merge fan-in
and partitioning fan-out are 10; and the
number of R-records per cluster is 20.
It is immediately obvious in Figure 15
that nested-loops join is unsuitable for
medium-size and large relations, because
the cost of nested-loops join is propor-
tional to the size of the Cartesian prod-
uct of the two inputs. Both merge-join
(sorting) and hash join have logarithmic
cost functions; the sudden rise in merge-
join and hash join cost around R = 1000
is due to the fact that additional parti-
tioning or merging levels become neces-
sary at that point. The sort-based
merge-join is not quite as fast as hash
join because the merge levels are deter-
mined individually for each file, includ-
ing the bigger S file, while only the
smaller build relation R determines the
partitioning depth of hash join. Pointer
joins are competitive with hash and
merge-joins due to their linear cost func-
tion, but only when the pointers are em-
bedded in the smaller relation R. When
S-records point to R-records, the cost of
the pointer join is even higher than for
nested-loops join.
The important point of Figure 15 is to
illustrate that pointer joins can be very
efficient or very inefficient, that one-to-
one match algorithms based on nested-
loops join are not competitive for
medium-size and large inputs, and that
sort- and hash-based algorithms for one-
to-one match operations both have loga-
rithmic cost growth. Of course, this com-
parison is quite naive since it uses only
the simplest form of each algorithm.
Thus, a comparison among alternative
algorithms in a query optimizer must use
the precise cost function for the available
algorithm variant.
6. UNIVERSAL QUANTIFICATION13
Universal quantification permits queries
such as “find the students who have
taken all database courses”; the differ-
ence to one-to-one match operations is
that a student qualifies because his or
her transcript matches an entire set of
courses, not only one item as in an exis-
tentially quantified query (e.g., “find stu-
dents who have taken a (at least one)
database course”) that can be executed
using a semi-join. In the past, universal
quantification has been largely ignored
for four reasons. First, typical data-
base applications, e.g., record-keeping
and accounting applications, rarely re-
quire universal quantification. Second, it
can be circumvented using a complex ex-
pression involving a Cartesian product.
Third, it can be circumvented using com-
plex aggregation expressions. Fourth,
there seemed to be a lack of efficient
algorithms.
The first reason will not remain true
for database systems supporting logic
programming, rules, and quantifiers, and
algorithms for universal quantification
13 This section is a summary of earlier work [Graefe
1989; Graefe and Cole 1993].
will become more important. The second
reason is valid; however, the substitute
expressions are very slow to execute be-
cause of the Cartesian product. The third
reason is also valid, but replacing a uni-
versal quantifier may require very com-
plex aggregation clauses that are easy to
“get wrong” for the database user. Fur-
thermore, they might be too complex for
the optimizer to recognize as universal
quantification and to execute with a di-
rect algorithm. The fourth reason is not
valid; universal quantification algo-
rithms can be very efficient (in fact, as
fast as semi-join, the operator for exis-
tential quantification), useful for very
large inputs, and easy to parallelize. In
the remainder of this section, we discuss
sort- and hash-based direct and indirect
(aggregation-based) algorithms for uni-
versal quantification.
In the relational world, universal
quantification is expressed with the uni-
versal quantifier in relational calculus
and with the division operator in rela-
tional algebra. We will explain algo-
rithms for universal quantification using
relational terminology. The running ex-
ample in this section uses the relations
Student (student-id, name, major),
Course (course-no, title), Transcript (stu-
dent-id, course-no, grade), and Require-
ment (major, course-no) with the obvious
key attributes. The query to find the stu-
dents who have taken all courses can be
expressed in relational algebra as
π_{student-id, course-no}(Transcript) ÷ π_{course-no}(Course).
The projection of the Transcript relation
is called the dividend, the projection of
the Course relation the divisor, and the
result relation the quotient. The quotient
attributes are those attributes of the div-
idend that do not appear in the divisor.
The dividend relation semi-joined with
the divisor relation and projected on the
quotient attributes, in the example the
set of student-ids of Students who have
taken at least one course, is called the
set of quotient candidates here.
Some universal quantification queries
seem to require relational division but
actually do not. Consider the query for
the students who have taken all courses
required for their major. This query can
be answered with a sequence of one-to-
one match operations. A join of Student
and Requirement projected on the stu-
dent-id and course-no attributes minus
the Transcript relation can be projected
on student-ids to obtain a set of students
who have not taken all their require-
ments. An anti-semi-join of the Student
relation with this set finds the students
who have satisfied all their require-
ments. This sequence will have accept-
able performance because its required
set-matching algorithms (join, difference,
anti-semi-join) all belong to the family of
one-to-one match operations, for which
efficient algorithms are available as dis-
cussed in the previous section.
Division algorithms differ not only in
their performance but also in how they
fit into complex queries. Prior to the divi-
sion, selections on the dividend, e.g., only
Transcript entries with “A” grades, or on
the divisor, e.g., only the database
courses, may be required. Restrictions on
the dividend can easily be enforced with-
out much effect on the division operation,
while restrictions on the divisor can im-
ply a significant difference for the query
evaluation plan. Subsequent to the divi-
sion operation, the resulting quotient re-
lation (e.g., a set of student-ids) may be
joined with another relation, e.g., the
Student relation to obtain student
names. Thus, obtaining the quotient in a
form suitable for further processing (e.g.,
join or semi-join with a third relation)
can be advantageous.
Typically, universal quantification can
easily be replaced by aggregations. (In-
tuitively, all universal quantification can
be replaced by aggregation. However, we
have not found a proof for this statement.)
For example, the example query about
database courses can be restated as “find
the students who have taken as many
database courses as there are database
courses.” When specifying the aggregate
function, it is important to count only
database courses both in the dividend
(the Transcript relation) and in the divi-
sor (the Course relation). Counting only
database courses might be easy for the
divisor relation, but requires a semi-join
of the dividend with the divisor relation
to propagate the restriction on the divi-
sor to the dividend if it is not known a
priori whether or not referential in-
tegrity holds between the dividend’s divi-
sor attributes and the divisor, i.e.,
whether or not there are divisor attribute
values in the dividend that cannot be
found in the divisor. For example, course-
nos in the Transcript relation that do not
pertain to database courses (and are
therefore not in the divisor) must be re-
moved from the dividend by a semi-join
with the divisor. In general, if the divisor
is the result of a prior selection, any
referential integrity constraints known
for stored relations will not hold and must
be explicitly enforced using a semi-join.
Furthermore, in order to ensure correct
counting, duplicates have to be removed
from either input if the inputs are projec-
tions on nonkey attributes.
There are four methods to compute the
quotient of two relations, a sort- and a
hash-based direct method, and sort- and
hash-based aggregation. Table 6 shows
this classification of relational division
algorithms. Methods for sort- and hash-
based aggregation and the possible sort-
or hash-based semi-join have already
been discussed, including their variants
for inputs larger than memory and their
cost functions. Therefore, we focus here
on the direct division algorithms.
The sort-based direct method, pro-
posed by Smith and Chang [1975] and
called naive division here, sorts the
divisor input on all its attributes and the
dividend relation with the quotient at-
tributes as major and the divisor at-
tributes as minor sort keys. It then pro-
ceeds with a merging scan of the two
sorted inputs to determine which items
belong in the quotient. Notice that the
scan can be programmed such that it
ignores duplicates in either input (in case
those had not been removed yet in the
sort) as well as dividend items that do
Table 6. Classification of Relational Division Algorithms

                             Based on Sorting                    Based on Hashing
Direct                       Naive division                      Hash-division
Indirect by semi-join        Sorting with duplicate removal,     Hash-based duplicate removal,
and aggregation              merge-join, sorting with            hybrid hash join, hash-based
                             aggregation                         aggregation
not refer to items in the divisor. Thus,
neither a preceding semi-join nor explicit
duplicate removal steps are necessary for
naive division. The I/O cost of naive di-
vision is the cost of sorting the two in-
puts plus the cost of repeated scans of
the divisor input.
Figure 16 shows two tables, a dividend
and a divisor, properly sorted for naive
division. Concurrent scans of the “Jack”
tuples (only one) in the dividend and of
the entire divisor determine that “Jack”
is not part of the quotient because he has
not taken the “Readings in Databases”
course. A continuing scan through the
“Jill” tuples in the dividend and a new
scan of the entire divisor include “Jill” in
the output of the naive division. The fact
that “Jill” has also taken an “Intro to
Graphics” course is ignored by a suitably
general scan logic for naive division.
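The scan logic just described can be sketched as follows (our rendering; the dividend is a sequence of (quotient value, divisor value) pairs sorted as stated, and the divisor is a sorted, duplicate-free list):

    # Sketch: merging scan of naive division.
    from itertools import groupby
    from operator import itemgetter

    def naive_division(dividend, divisor):
        for candidate, items in groupby(dividend, key=itemgetter(0)):
            values = (v for _, v in items)     # this candidate's items
            v = next(values, None)
            qualified = True
            for d in divisor:                  # fresh scan of the divisor
                while v is not None and v < d: # skip extra dividend items
                    v = next(values, None)
                if v != d:                     # divisor item not matched
                    qualified = False
                    break
                v = next(values, None)
            if qualified:
                yield candidate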
The hash-based direct method, called
hash-division, uses two hash tables, one
for the divisor and one for the quotient
candidates. While building the divisor
table, a unique sequence number is as-
signed to each divisor item. After the
divisor table has been built, the dividend
is consumed. For each quotient candi-
date, a bit map is kept with one bit for
each divisor item. The bit map is indexed
with the sequence numbers assigned to
the divisor items. If a dividend item does
not match with an item in the divisor
table, it can be ignored immediately.
Otherwise, a quotient candidate is either
found or created, and the bit correspond-
ing to the matching divisor item is set.
When the entire dividend has been con-
sumed, the quotient consists of those
quotient candidates for which all bits are
set.
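A compact sketch of this algorithm (ours; a Python set of matched sequence numbers stands in for the bit map):

    # Sketch: hash-division with a divisor table and quotient candidates.
    def hash_division(dividend, divisor):
        # dividend: (quotient value, divisor value) pairs
        seqno = {}
        for d in divisor:                      # build the divisor table;
            seqno.setdefault(d, len(seqno))    # duplicates get no new number
        candidates = {}
        for q, d in dividend:
            n = seqno.get(d)
            if n is None:
                continue                       # no divisor match: ignore
            candidates.setdefault(q, set()).add(n)   # set the 'bit'
        return [q for q, bits in candidates.items()
                if len(bits) == len(seqno)]    # all bits set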
This algorithm can ignore duplicates in
the divisor (using hash-based duplicate
Dividend                           Divisor
Student  Course                    Course
Jack     Intro to Databases        Intro to Databases
Jill     Intro to Databases        Readings in Databases
Jill     Intro to Graphics
Jill     Readings in Databases

Figure 16. Sorted inputs into naive division.
removal during insertion into the divisor
table) and automatically ignores dupli-
cates in the dividend as well as dividend
items that do not refer to items in the
divisor (e.g., the AI course in the exam-
ple). Thus, neither prior semi-join nor
duplicate removal are required. However,
if both inputs are known to be duplicate
free, the bit maps can be replaced by
counters. Furthermore, if referential in-
tegrity is known to hold, the divisor table
can be omitted and replaced by a single
counter. Hash-division, including these
variants, has been implemented in the
Volcano query execution engine and has
shown better performance than the other
three algorithms [Graefe 1989; Graefe
and Cole 1993]. In fact, the performance
of hash-division is almost equal to a
hash-based join or semi-join of
dividend and divisor relations (a semi-
join corresponds to existential quantifica-
tion), making universal quantification
and relational division realistic opera-
tions and algorithms to use in database
applications.
The aspect of hash-division that makes
it an efficient algorithm is that the set of
matches between a quotient candidate
and the divisor is represented efficiently
using a bit map. Bit maps are one of the
standard data structures to represent
sets, and just as bit maps can be used for
a number of set operations, the bit maps
associated with each quotient candidate
can also be used for a number of opera-
tions similar to relational division. For
example, Carlis [1986] proposed a gener-
alized division operator called “HAS” that
included relational division as a special
case. The hash-division algorithm can
easily be extended to compute quotient
candidates in the dividend that match a
majority or given fraction of divisor items
as well as (with one more bit in each bit
map) quotient candidates that do or do
not match exactly the divisor items.
For real queries containing a division,
consider the operation that frequently
follows a division. In the example, a
user is typically not really interested in
student-ids only but in information about
the students. Thus, in many cases, rela-
tional division results will be used to
select items from another relation using
a semi-join. The sort-based algorithms
produce their output sorted, which will
facilitate a subsequent (semi-) merge-join.
The hash-based algorithms produce their
output in hash order; if overflow oc-
curred, there is no predictable order at
all. However, both aggregation-based and
direct hash-based algorithms use a hash
table on the quotient attributes, which
may be used immediately for a subse-
quent (semi-) join. It seems quite
straightforward to use the same hash
table for the aggregation and a subse-
quent join as well as to modify hash-
division such that it removes quotient
candidates from the quotient table that
do not belong to the final quotient and
then performs a semi-join with a third
input relation.
If the two hash tables do not fit into
memory, the divisor table or the quotient
table or both can be partitioned, and in-
dividual partitions can be held on disk
for processing in multiple steps. In divi-
sor partitioning, the final result consists
of those items that are found in all
partial results; the final result is the
intersection of all partial results. For ex-
ample, if the Course relation in the ex-
ample above is partitioned into under-
graduate and graduate courses, the final
result consists of the students who have
taken all undergraduate courses and all
graduate courses, i.e., those that can be
found in the division result of each parti-
tion. In quotient partitioning, the entire
divisor must be kept in memory for all
partitions. The final result is the concate-
nation (union) of all partial results. For
example, if Transcript items are parti-
tioned by odd and even student-ids, the
final result is the union (concatenation)
of all students with odd student-id who
have taken all courses and those with
even student-id who have taken all
courses. If warranted by the input data,
divisor partitioning and quotient parti-
tioning can be combined.
Hash-division can be modified into an
algorithm for duplicate removal. Con-
sider the problem of removing duplicates
from a relation R(X, Y) where X and Y
are suitably chosen attribute groups. This
relation can be stored using two hash
tables, one storing all values of X (simi-
lar to the divisor table) and assigning
each of them a unique sequence number,
the other storing all values of Y and bit
maps that indicate which X values have
occurred with each Y value. Consider a
brief example for this algorithm: Say re-
lation R(X, Y) contains 1 million tuples,
but only 100,000 tuples if duplicates were
removed. Let X and Y be each 100 bytes
long (total record size 200), and assume
there are 4,000 unique values of each X
and Y. For the standard hash-based du-
plicate removal algorithm, 100,000 x 200
bytes of memory are needed for duplicate
removal without use of temporary files.
For the redesigned hash-division algo-
rithm, 2 x 4,000 x 100 bytes are needed
for data values, 4,000 x 4 for unique se-
quence numbers, and 4,000 x 4,000 bits
for bit maps. Thus, the new algorithm
works efficiently with less than 3 MB of
memory while conventional duplicate re-
moval requires slightly more than 19 MB
of memory, or seven times more than the
duplicate removal algorithm adapted
from hash-division. Clearly, choosing at-
tribute groups X and Y to find attribute
groups with relatively few unique values
is crucial for the performance and mem-
ory efficiency of this new algorithm. Since
such knowledge is not available in most
systems and queries (even though some
efficient and helpful algorithms exist, e.g.,
Astrahan et al. [1987]), optimizer heuris-
tics for choosing this algorithm might be
difficult to design and verify.
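For concreteness, here is a sketch of the redesigned algorithm (ours; sets of sequence numbers again model the bit maps):

    # Sketch: duplicate removal adapted from hash-division.
    def duplicates_removed(pairs):
        x_seq = {}                  # X value -> unique sequence number
        y_bits = {}                 # Y value -> set of X sequence numbers
        for x, y in pairs:
            n = x_seq.setdefault(x, len(x_seq))
            y_bits.setdefault(y, set()).add(n)
        x_of = {n: x for x, n in x_seq.items()}
        for y, bits in y_bits.items():   # enumerating the bit maps
            for n in bits:               # reproduces the unique pairs
                yield (x_of[n], y)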
To summarize the discussion on uni-
versal quantification algorithms, aggre-
gation can be used in systems that lack
direct division algorithms, and hash-
division performs universal quantifica-
tion and relational division generally, i.e.,
it covers cases with duplicates in the in-
puts and with referential integrity viola-
tions, and efficiently, i.e., it permits par-
titioning and using hybrid hashing tech-
niques similar to hybrid hash join, mak-
ing universal quantification (division) as
fast as existential quantification (semi-
join). As will be discussed later, it can
also be effectively parallelized.
7. DUALITY OF SORT- AND HASH-BASED
QUERY PROCESSING ALGORITHMS14
We conclude the discussion of individual
query processing by outlining the many
existing similarities and dualities of sort-
and hash-based query-processing algo-
rithms as well as the points where the
two types of algorithms differ. The pur-
pose is to contribute to a better under-
standing of the two approaches and their
tradeoffs. We try to discuss the ap-
proaches in general terms, ignoring
whether the algorithms are used for rela-
tional join, union, intersection, aggre-
gation, duplicate removal, or other
operations. Where appropriate, however,
we indicate specific operations.
Table 7 gives an overview of the fea-
tures that correspond to one another.
14 Parts of this section have been derived from
Graefe et al. [1993], which also provides experimen-
tal evidence for the relative performance of sort-
and hash-based query processing algorithms and
discusses simple cases of transferring tuning ideas
from one type of algorithm to the other. The discus-
sion of this section is continued in Graefe [1993a;
1993c].
Both approaches permit in-memory ver-
sions for small data sets and disk-based
versions for larger data sets. If a data set
fits into memory, quicksort is the sort-
based method to manage data sets while
classic (in-memory) hashing can be used
as a hashing technique. It is interesting
to note that both quicksort and classic
hashing are also used in memory to oper-
ate on subsets after “cutting” an entire
large data set into pieces. The cutting
process is part of the divide-and-conquer
paradigm employed for both sort- and
hash-based query-processing algorithms.
This important similarity of sorting and
hashing has been observed before, e.g.,
by Bratbergsengen [ 1984] and Salzberg
[1988]. There exists, however, an impor-
tant difference. In the sort-based algo-
rithms, a large data set is divided into
subsets using a physical rule, namely into
chunks as large as memory. These chunks
are later combined using a logical step,
merging. In the hash-based algorithms,
large inputs are cut into subsets using a
logical rule, by hash values. The result-
ing partitions are later combined using a
physical step, i.e., by simply concatenat-
ing the subsets or result subsets. In other
words, a single-level merge in a sort algo-
rithm is a dual to partitioning in hash
algorithms. Figure 17 illustrates this du-
ality and the opposite directions.
This duality can also be observed in
the behavior of a disk arm performing
the I/O operations for merging or parti-
tioning. While writing initial runs after
sorting them with quicksort, the I/O is
sequential. During merging, read opera-
tions access the many files being merged
and require random I/O capabilities.
During partitioning, the I/O operations
are random, but when reading a parti-
tion later on, they are sequential.
For both approaches, sorting and hash-
ing, the amount of available memory lim-
its not only the amount of data in a basic
unit processed using quicksort or classic
hashing, but also the number of basic
units that can be accessed simultane-
ously. For sorting, it is well known that
merging is limited to the quotient of
memory size and buffer space required
for each run, called the merge fan-in.
Table 7. Duality of Sort- and Hash-Based Algorithms

Aspect                                Sorting                                  Hashing
In-memory algorithm                   Quicksort                                Classic hashing
Divide-and-conquer paradigm           Physical division, logical               Logical division, physical
                                      combination                              combination
Large inputs                          Single-level merge                       Partitioning
I/O patterns                          Sequential write, random read            Random write, sequential read
Temporary files accessed              Fan-in                                   Fan-out
simultaneously
I/O optimizations                     Read-ahead, forecasting;                 Write-behind;
                                      double-buffering, striping               double-buffering, striping
                                      merge output                             partitioning input
Very large inputs                     Multi-level merge;                       Recursive partitioning;
                                      merge levels                             recursion depth
Optimizations                         Nonoptimal final fan-in;                 Nonoptimal hash table size;
                                      merge optimizations;                     bucket tuning;
                                      reverse runs & LRU                       hybrid hashing
Better use of memory                  Replacement selection; ?                 ?; single input in memory
Aggregation and                       Aggregation in replacement               Aggregation in hash table
duplicate removal                     selection
Algorithm phases                      Run generation, intermediate             Initial and intermediate partitioning,
                                      and final merge                          in-memory (hybrid) hashing
Resource sharing                      Eager merging;                           Depth-first partitioning;
                                      lazy merging                             breadth-first partitioning
Partitioning skew and                 Merging run files of                     Uneven output file sizes
effectiveness                         different sizes
“Item value”                          log (run size)                           log (build partition size /
                                                                               original build input size)
Bit vector filtering                  For both inputs and on each              For both inputs and on each
                                      merge level?                             recursion level
Interesting orderings,                Multiple merge-joins without             N-ary partitioning and joins
multiple joins                        sorting intermediate results
Interesting orderings:                Sorted grouping on foreign key           Grouping while building the
grouping/aggregation                  useful for subsequent join               hash table in hash join
followed by join
Interesting orderings in              B-trees feeding into a                   Merging in hash value order
index structures                      merge-join
Figure 17. Duality of partitioning and merging.
Similarly, partitioning is limited to the
same fraction, called the fan-out since
the limitation is encountered while writ-
ing partition files.
In order to keep the merge process
active at all times, many merge imple-
mentations use read-ahead controlled by
forecasting, trading reduced I/O delays
for a reduced fan-in. In the ideal case,
the bandwidths of I/O and processing
(merging) match, and I/O latencies for
both the merge input and output are hid-
den by read-ahead and double-buffering,
as mentioned earlier in the section on
sorting. The dual to read-ahead during
merging is write-behind during partition-
ing, i.e., keeping a free output buffer that
can be allocated to an output file while
the previous page for that file is being
written to disk. There is no dual to fore-
casting because it is trivial that the next
output partition to write to is the one for
which an output cluster has just filled
up. Both read-ahead in merging and
write-behind in partitioning are used to
ensure that the processor never has to
wait for the completion of an I/O opera-
tion. Another dual is double-buffering and
striping over multiple disks for the out-
put of sorting and the input of
partitioning.
Considering the limitation on fan-in
and fan-out, additional techniques must
be used for very large inputs. Merging
can be performed in multiple levels, each
combining multiple runs into larger ones.
Similarly, partitioning can be repeated
recursively, i.e., partition files are repar-
titioned, the results repartitioned, etc.,
until the partition files fit into main
memory. In sorting and merging, the runs
grow in each level by a factor equal to
the fan-in. In partitioning, the partition
files decrease in size by a factor equal to
the fan-out in each recursion level. Thus,
the number of levels during merging is
equal to the recursion depth during par-
titioning. There are two exceptions to be
made regarding hash value distributions
and relative sizes of inputs in binary op-
erations such as join; we ignore those for
now and will come back to them later.
If merging is done in the most naive
way, i.e., merging all runs of a level as
soon as their number reaches the fan-in,
the last merge on each level might not be
optimal. Similarly, if the highest possible
fan-out is used in each partitioning step,
the partition files in the deepest recur-
sion level might be smaller than mem-
ory, and less than the entire memory is
used when processing these files. Thus,
in both approaches the memory re-
sources are not used optimally in the
most naive versions of the algorithms.
In order to make best use of the final
merge (which, by definition, includes all
output items and is therefore the most
expensive merge), it should proceed with
the maximal possible fan-in. Making best
use of the final merge can be ensured by
merging fewer runs than the maximal
fan-in after the end of the input file has
been reached (as discussed in the earlier
section on sorting). There is no direct
dual in hash-based algorithms for this
optimization. With respect to memory
utilization, the fact that a partition file
and therefore a hash table might actu-
ally be smaller than memory is the clos-
est to a dual. Utilizing memory more
effectively and using less than the maxi-
mal fan-out in hashing has been ad-
dressed in research on bucket tuning
[Kitsuregawa et al. 1989a] and on his-
togram-driven recursive hybrid hash join
[Graefe 1993a].
The development of hybrid hash algo-
rithms [DeWitt et al. 1984; Shapiro 1986]
was a consequence of the advent of large
main memories that had led to the con-
sideration of hash-based join algorithms
in the first place. If the data set is only
slightly larger than the available mem-
ory, e.g., 10% larger or twice as large,
much of the input can remain in memory
and is never written to a disk-resident
partition file. To obtain the same effect
for sort-based algorithms, if the database
system’s buffer manager is sufficiently
smart or receives and accepts appropri-
ate hints, it is possible to retain some or
all of the pages of the last run written in
memory and thus achieve the same effect
of saving I/O operations. This effect can
be used particularly easily if the initial
runs are written in reverse (descending)
order and scanned backward for merg-
ing. However, if one does not believe in
buffer hints or prefers to absolutely en-
sure these I/O savings, then using a
final memory-resident run explicitly in
the sort algorithm and merging it with
the disk-resident runs can guarantee this
effect.
Another well-known technique to use
memory more effectively and to improve
sort performance is to generate runs
twice as large as main memory using a
priority heap for replacement selection
[Knuth 1973], as discussed in the earlier
section on sorting. If the runs’ sizes are
doubled, their number is cut in half.
Therefore, merging can be reduced by
some amount, namely log_F(2) =
1/log_2(F) merge levels. This optimiza-
tion for sorting has no direct dual in the
realm of hash-based query-processing
algorithms.
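For reference, a sketch of run generation by replacement selection (ours; records are assumed directly comparable and never None, and runs are returned as in-memory lists rather than files):

    # Sketch: run generation with a priority heap (replacement selection).
    import heapq

    def replacement_selection(records, memory):
        it = iter(records)
        heap = [(0, r) for r in
                (next(it, None) for _ in range(memory)) if r is not None]
        heapq.heapify(heap)
        runs = {}
        while heap:
            run, r = heapq.heappop(heap)
            runs.setdefault(run, []).append(r)   # write r to current run
            nxt = next(it, None)
            if nxt is not None:
                # a record smaller than the one just written must wait
                # for the next run to keep each run sorted
                heapq.heappush(heap, (run if nxt >= r else run + 1, nxt))
        return [runs[k] for k in sorted(runs)]

On random input, each run covers about twice the memory size, which is exactly the doubling exploited in the argument above.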
If two sort operations produce input
data for a binary operator such as a
merge-join and if both sort operators’ fi-
nal merges are interleaved with the join,
each final merge can employ only half
the memory. In hash-based one-to-one
match algorithms, only one of the two
inputs resides in and consumes memory
beyond a single input buffer, not both as
in two final merges interleaved with a
merge-join. This difference in the use of
the two inputs is a distinct advantage of
hash-based one-to-one match algorithms
that does not have a dual in sort-based
algorithms.
Interestingly, these two differences of
sort- and hash-based one-to-one match
algorithms cancel each other out. Cutting
the number of runs in half (on each merge
level, including the last one) by using
replacement selection for run generation
exactly offsets this disadvantage of sort-
based one-to-one match operations.
Run generation using replacement se-
lection has a second advantage over
quicksort; this advantage has a direct
dual in hashing. If a hash table is used to
compute an aggregate function using
grouping, e.g., sum of salaries by depart-
ment, hash table overflow occurs only if
the operation’s output does not fit in
memory. Consider, for example, the sum
of salaries by department for 100,000
employees in 1,000 departments. If the
1,000 result records fit in memory, clas-
sic hashing (without overflow) is suffi-
cient. On the other hand, if sorting based
on quicksort is used to compute this ag-
gregate function, the input must fit into
memory to avoid temporary files.15 If re-
placement selection is used for run gen-
eration, however, the same behavior as
with classic hashing is easy to achieve.
15 A scheme using quicksort and avoiding tempo-
rary I/O in this case can be devised but would be
extremely cumbersome; we do not know of any
report or system with such a scheme.
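For instance, classic in-memory hash aggregation for the salary example can be sketched in a few lines (Python; the records are hypothetical):

    def hash_aggregate(rows):
        # sum of salaries by department: the hash table holds one
        # entry per output group, not one per input record, so
        # overflow occurs only if the output exceeds memory
        totals = {}
        for department, salary in rows:
            totals[department] = totals.get(department, 0) + salary
        return totals

    rows = [("sales", 50), ("eng", 70), ("sales", 60)]
    assert hash_aggregate(rows) == {"sales": 110, "eng": 70}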
If an iterator interface is used for both
its input and output, and therefore mul-
tiple operators overlap in time, a sort
operator can be divided into three dis-
tinct algorithm phases. First, input items
are consumed and sorted into initial runs.
Second, intermediate merging reduces
the number of runs such that only one
final merge step is left. Third, the final
merge is performed on demand from the
consumer of the sorted data stream. Dur-
ing the first phase, the sort iterator has
to share resources, most notably memory
and disk bandwidth, with its producer
operators in a query evaluation plan.
Similarly, the third phase must share
resources with the consumers.
In many sort implementations, namely
those using eager merging, the first and
second phase interleave as a merge step
is initiated whenever the number of runs
on one level becomes equal to the fan-in.
Thus, some intermediate merge steps
cannot use all resources. In lazy merging,
which starts intermediate merges only
after all initial runs have been created,
the intermediate merges do not share
resources with other operators and can
use the entire memory allocated to a
query evaluation plan; thus, intermedi-
ate merges can be more effective in lazy
merging than in eager merging.
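The two merge schedules can be contrasted in a small sketch (Python; runs are represented only by their sizes, and the smallest-runs-first policy in the lazy variant is one plausible choice, not prescribed here):

    def eager_merges(run_sizes, fan_in):
        # merge as soon as a level accumulates fan_in runs; these
        # intermediate merges interleave with run generation and
        # must share resources with the producer operators
        levels, merges = [[]], []
        for size in run_sizes:
            levels[0].append(size)
            lvl = 0
            while len(levels[lvl]) == fan_in:
                merged = sum(levels[lvl])
                merges.append(merged)
                levels[lvl] = []
                if lvl + 1 == len(levels):
                    levels.append([])
                levels[lvl + 1].append(merged)
                lvl += 1
        return merges, levels

    def lazy_merges(run_sizes, fan_in):
        # start intermediate merges only after all runs exist, so
        # they can use the entire memory allocated to the plan
        runs = sorted(run_sizes)
        while len(runs) > fan_in:          # leave one final merge
            runs = sorted(runs[fan_in:] + [sum(runs[:fan_in])])
        return runs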
Hash-based query processing algo-
rithms exhibit three similar phases.
First, the first partitioning step executes
concurrently with the input operator or
operators. Second, intermediate parti-
tioning steps divide the partition files to
ensure that they can be processed with
hybrid hashing. Third, hybrid and in-
memory hash methods process these par-
tition files and produce output passed to
the consumer operators. As in sorting,
the first and third phases must share
resources with other concurrent opera-
tions in the same query evaluation plan.
The standard implementation of hash-
based query processing algorithms for
very large inputs uses recursion, i.e., the
original algorithm is invoked for each
partition file (or pair of partition files).
While conceptually simple, this method
has the disadvantage that output is
produced before all intermediate parti-
tioning steps are complete. Thus, the op-
erators that consume the output must
allocate resources to receive this output,
typically memory (e.g., a hash table).
Further intermediate partitioning steps
will have to share resources with the
consumer operators, making them less
effective. We call this direct recursive im-
plementation of hash-based partitioning
depth-first partitioning and consider its
behavior as well as its resource sharing
and performance effects a dual to eager
merging in sorting. The alternative
schedule is breadth-first partitioning,
which completes each level of partition-
ing before starting the next one. Thus,
hybrid and in-memory hashing are not
initiated until all partition files have be-
come small enough to permit hybrid and
in-memory hashing, and intermediate
partitioning steps never have to share
resources with consumer operators.
Breadth-first partitioning is a dual to lazy
merging, and it is not surprising that
they are both equally more effective than
depth-first partitioning and eager merg-
ing, respectively.
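The two partitioning schedules can be sketched as follows (Python; fits and process stand for a memory test and for hybrid or in-memory hashing, the level-dependent hash is one way of ensuring that each recursion level partitions differently, and the sketch assumes the hash function eventually splits every oversized partition):

    def partition(data, fan_out, level):
        parts = [[] for _ in range(fan_out)]
        for item in data:
            parts[hash((item, level)) % fan_out].append(item)
        return parts

    def depth_first(data, fan_out, fits, process, level=0):
        if fits(data):
            process(data)      # output may appear while sibling
            return             # partitions are still being split
        for part in partition(data, fan_out, level):
            depth_first(part, fan_out, fits, process, level + 1)

    def breadth_first(data, fan_out, fits, process):
        oversized, ready, level = [data], [], 0
        while oversized:       # complete each level before the next
            too_big = [p for p in oversized if not fits(p)]
            ready += [p for p in oversized if fits(p)]
            oversized = [q for p in too_big
                         for q in partition(p, fan_out, level)]
            level += 1
        for p in ready:        # in-memory processing starts only
            process(p)         # after all partitioning is finished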
It is well known that partitioning skew
reduces the effectiveness of hash-based
algorithms. Thus, the situation shown in
Figure 18 is undesirable. In the extreme
case, one of the partition files is as large
as the input, and an entire partitioning
step has been wasted. It is less well rec-
ognized that the same issue also pertains
to sort-based query processing algo-
rithms [Graefe 1993c]. Unfortunately, in
order to reduce the number of merge
steps, it is often necessary to merge files
from different merge levels and therefore
of different sizes. In other words, the
goals of optimized merging and of maxi-
mal merge effectiveness do not always
match, and very sophisticated merge
plans, e.g., polyphase merging, might be
required [Knuth 1973].
Figure 18. Partitioning skew.
The same effect can also be observed if
“values” are attached to items in runs
and in partition files. Values should re-
flect the work already performed on an
item. Thus, the value should increase
with run sizes in sorting while the value
must increase as partition files get
smaller in hash-based query processing
algorithms. For sorting, a suitable choice
for such a value is the logarithm of the
run size [Graefe 1993c]. The value of a
sorted run is the product of the run’s size
and the size’s logarithm. The optimal
merge effectiveness is achieved if each
item’s value increases in each merge step
by the logarithm of the fan-in, and the
overall value of all items increases with
this logarithm multiplied with the data
volume participating in the merge step.
However, only if all runs in a merge step
are of the same size will the value of all
items increase with the logarithm of the
fan-in.
In hash-based query processing, the
corresponding value is the fraction of a
partition size relative to the original in-
put size [Graefe 1993c]. Since only the
build input determines the number of
recursion levels in binary hash partition-
ing, we consider only the build partition.
If the partitioning is skewed, i.e., output
partition files are not of uniform length,
the overall effectiveness of the partition-
ing step is not optimal, i.e., equal to the
logarithm of the partitioning fan-out.
Thus, preventing or managing skew in
partitioning hash functions is very im-
portant [Graefe 1993a].
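A small computation (Python) confirms the accounting: merging F equal-sized runs raises each item's value by exactly log2(F), whereas a skewed merge of the same data volume raises it by less:

    import math

    def run_value(size):
        # value of a sorted run = size * log2(size)
        return size * math.log2(size) if size > 1 else 0.0

    for runs in ([16, 16, 16, 16], [61, 1, 1, 1]):  # both sum to 64
        gain = run_value(sum(runs)) - sum(run_value(s) for s in runs)
        print(runs, gain / sum(runs))
    # equal runs: 2.0 per item, i.e., log2 of the fan-in 4
    # skewed runs: about 0.35 per item, far from optimal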
Bit vector filtering, which will be dis-
cussed later in more detail, can be used
for both sort- and hash-based one-to-one
match operations, although it has been
used mainly for parallel joins to date.
Basically, a bit vector filter is a large
array of bits initialized by hashing items
in the first input of a one-to-one match
operator and used to detect items in the
second input that cannot possibly have a
match in the first input. In effect, bit
vector filtering reduces the second input
to the items that truly participate in the
binary operation plus some “false passes”
due to hash collisions in the bit vector
filter. In a merge-join with two sort oper-
ations, if the bit vector filter is used be-
fore the second sort, bit vector filtering is
as effective as in hybrid hash join in
reducing the cost of processing the sec-
ond input. In merge-join, it can also be
used symmetrically as shown in Figure
19. Notice that for the right input, bit
vector filtering reduces the sort input
size, whereas for the left input, it only
reduces the merge-join input. In recur-
sive hybrid hash join, bit vector filtering
can be used in each recursion level. The
effectiveness of bit vector filtering in-
creases in deeper recursion levels, be-
cause the number of distinct data values
in each partition file decreases, thus re-
ducing the number of hash collisions and
false passes if bit vector filters of the
same size are used in each recursion level.
Moreover, it can be used in both direc-
tions, i.e., to reduce the second input us-
ing a bit vector filter based on the first
input and to reduce the first input (in the
next recursion level) using a bit vector
filter based on the second input. The same
effect could be achieved for sort-based
binary operations requiring multilevel
sorting and merging, although to do so
implies switching back and forth be-
tween the two sorts for the two inputs
after each merge level. Not surprisingly,
switching back and forth after each merge
level would be the dual to the partition-
ing process of both inputs in recursive
hybrid hash join. However, sort operators
that switch back and forth on each merge
level are not only complex to implement
but may also inhibit the merge optimiza-
tion discussed earlier.
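The basic mechanism can be sketched as follows (Python; the filter size and the use of Python's built-in hash function are arbitrary choices for illustration):

    def build_bit_vector(build_input, key, bits=1 << 16):
        # set one bit per build key; collisions are tolerated
        filt = bytearray(bits // 8)
        for rec in build_input:
            h = hash(key(rec)) % bits
            filt[h // 8] |= 1 << (h % 8)
        return filt

    def filter_probe(probe_input, key, filt, bits=1 << 16):
        # drop probe items that cannot possibly find a match; the
        # survivors include some "false passes" due to collisions
        for rec in probe_input:
            h = hash(key(rec)) % bits
            if filt[h // 8] & (1 << (h % 8)):
                yield rec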
The final entries in Table 7 concern
interesting orderings used in the System
R query optimizer [Selinger et al. 1979]
and presumably other query optimizers
as well.
Figure 19. Merge-join with symmetric bit vector
filtering.
A strong argument in favor of
sorting and merge-join is the fact that
merge-join delivers its output in sorted
order; thus, multiple merge-joins on the
same attribute can be performed without
sorting intermediate join results. For
joining three relations, as shown in Fig-
ure 20, pipelining data from one merge-
join to the next without sorting trans-
lates into a 3:4 advantage in the number
of sort operations compared to two joins
on different join keys, because the inter-
mediate result O1 does not need to be
sorted. For joining N relations on the
same key, only N sorts are required in-
stead of 2 × N – 2 for joins on different
attributes. Since set operations such as
the union or intersection of N sets can
always be performed using a merge-join
algorithm without sorting intermediate
results, the effect of interesting orderings
is even more important for set operations
than for relational joins.
Hash-based algorithms tend to produce
their outputs in a very unpredictable or-
der, depending on the hash function and
on overflow management. In order to take
advantage of multiple joins on the same
attribute (or of multiple intersections,
etc.) similar to the advantage derived
from interesting orderings in sort-based
query processing, the equality of at-
tributes has to be exploited during the
logical step of hashing, i.e., during parti-
tioning. In other words, such set opera-
tions and join queries can be executed
effectively by a hash join algorithm that
recursively partitions N inputs concur-
rently. The recursion terminates when
N – 1 inputs fit into memory and when
the Nth input is used to probe N – 1
hash tables.
Figure 20. The effect of interesting orderings.
Figure 21. Partitioning in a multiinput hash join.
Thus, the basic operation of
this N-ary join (intersection, etc.) is an
N-ary join of an N-tuple of partition files,
not pairs as in binary hash join with one
build and one probe file for each parti-
tion. Figure 21 illustrates recursive par-
titioning for a join of three inputs. In-
stead of partitioning and joining a pair of
inputs and pairs of partition files as in
traditional binary hybrid hash join, there
are file triples (or N-tuples) at each step.
However, N-ary recursive partitioning
is cumbersome to implement, in particu-
lar if some of the “join” operations are
actually semi-join, outer join, set inter-
section, union, or difference. Therefore,
until a clean implementation method for
hash-based N-ary matching has been
found, it might well be that this distinc-
tion, joins on the same or on different
attributes, contributes to the right choice
between sort- and hash-based algorithms
for complex queries.
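The control structure of such an N-ary algorithm can nonetheless be sketched (Python; inputs are lists, memory_limit is a hypothetical record count, and the result is simply the key values present in all N inputs, i.e., an N-ary intersection on keys):

    def nary_hash_match(inputs, key, memory_limit, fan_out=8, level=0):
        sized = sorted(inputs, key=len)
        small, probe = sized[:-1], sized[-1]
        if sum(len(i) for i in small) <= memory_limit:
            # recursion ends: N-1 inputs fit in memory as hash
            # tables, and the Nth (largest) input probes them all
            tables = [{key(r) for r in inp} for inp in small]
            for rec in probe:
                k = key(rec)
                if all(k in t for t in tables):
                    yield k
            return
        # otherwise partition the N-tuple of inputs concurrently,
        # producing an N-tuple of partition files per bucket
        buckets = [[[] for _ in inputs] for _ in range(fan_out)]
        for i, inp in enumerate(inputs):
            for rec in inp:
                buckets[hash((key(rec), level)) % fan_out][i].append(rec)
        for file_tuple in buckets:
            yield from nary_hash_match(file_tuple, key, memory_limit,
                                       fan_out, level + 1)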
Another situation with interesting or-
derings is an aggregation followed by a
join. Many aggregations condense infor-
mation about individual entities; thus,
the aggregation operation is performed
on a relation representing the “many”
side of a many-to-one relationship or on
the relation that represents relationship
instances of a many-to-many relation-
ship. For example, students’ grade point
averages are computed by grouping and
averaging transcript entries in a many-
to-many relationship called transcript
between students and courses. The im-
portant point to note here and in many
similar situations is that the grouping at-
tribute is a foreign key. In order to relate
the aggregation output with other infor-
mation pertaining to the entities about
which information was condensed, aggre-
gations are frequently followed by a join.
If the grouping operation is based on
sorting (on the grouping attribute, which
very frequently is a foreign key), the nat-
ural sort order of the aggregation output
can be exploited for an efficient merge-
join without sorting.
While this seems to be an advantage of
sort-based aggregation and join, this
combination of operations also permits a
special trick in hash-based query pro-
cessing [Graefe 1993b]. Hash-based ag-
gregation is based on identifying items of
the same group while building the hash
table. At the end of the operation, the
hash table contains all output items
hashed on the grouping attribute. If the
grouping attribute is the join attribute in
the next operation, this hash table can
immediately be probed with the other
join input. Thus, the combined aggrega-
tion-join operation uses only one hash
table, not two hash tables as two sepa-
rate operations would do. The differences
to two separate operations are that only
one join input can be aggregated effi-
ciently and that the aggregated input
must be the join’s build input. Both is-
sues could be addressed by symmetric
hash joins with a hash table on each of
the inputs, which would be as efficient as
sorting and grouping both join inputs.
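The trick amounts to reusing one hash table for both operations, e.g. (Python; the aggregate is simplified to summing a value per group):

    def aggregate_then_join(group_input, probe_input, group_key,
                            join_key, value):
        # one hash table serves both operations: it is built by the
        # grouping/aggregation, then probed by the other join input
        table = {}
        for rec in group_input:
            k = group_key(rec)
            table[k] = table.get(k, 0) + value(rec)
        for rec in probe_input:
            k = join_key(rec)
            if k in table:
                yield rec, table[k]  # probe record with its aggregate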
A third use of interesting orderings is
the positive interaction of (sorted, B-tree)
index scans and merge-join. While it has
not been reported explicitly in the litera-
ture, the leaves and entries of two hash
indices can be merge-joined just like those
of two B-trees, provided the same hash
function was used to create the indices.
For example, it is easy to imagine “merg-
ing” the leaves (data pages) of two ex-
tendible hash indices [Fagin et al. 1979],
even if the key cardinalities and distribu-
tions are very different.
In summary, there exist many duali-
ties between sorting using multilevel
merging and recursive hash table over-
flow management. Two special cases ex-
ist which favor one or the other, however.
First, if two join inputs are of different
size (and the query optimizer can reli-
ably predict this difference), hybrid hash
join outperforms merge-join because only
the smaller of the two inputs determines
what fraction of the input files has to be
written to temporary disk files during
partitioning (or how often each record
has to be written to disk during recursive
partitioning), while each file determines
its own disk I/O in sorting [Bratberg-
sengen 1984]. For example, sorting the
larger of two join inputs using multiple
merge levels is more expensive than
writing a small fraction of that file to
hash overflow files. This performance ad-
vantage of hashing grows with the rela-
tive size difference of the two inputs, not
with their absolute sizes or with the
memory size.
Second, if the hash function is very
poor, e.g., because of a prior selection on
the join attribute or a correlated at-
tribute, hash partitioning can perform
very poorly and create significantly
higher costs than sorting and merge-join.
If the quality of the hash function cannot
be predicted or improved (tuned) dynam-
ically [Graefe 1993a], sort-based query-
processing algorithms are superior be-
cause they are less vulnerable to nonuni-
form data distributions. Since both cases,
join of differently sized files and skewed
hash value distributions, are realistic sit-
uations in database query processing, we
recommend that both sort- and hash-
based algorithms be included in a query-
processing engine and chosen by the
query optimizer according to the two
cases above. If both cases arise simulta-
neously, i.e., a join of differently sized
inputs with unpredictable hash value
distribution, the query optimizer has to
estimate which one poses the greater
danger to system performance and choose
accordingly.
The important conclusion from these
dualities is that neither the absolute in-
put sizes nor the absolute memory size
nor the input sizes relative to the mem-
ory size determine the choice between
sort- and hash-based query-processing
algorithms. Instead, the choice should be
governed by the sizes of the two inputs
into binary operators relative to each
other and by the danger of performance
impairments due to skewed data or hash
value distributions. Furthermore, be-
cause neither algorithm type outper-
forms the other in all situations, both
should be available in a query execution
engine for a choice to be made in each
case by the query optimizer.
8. EXECUTION OF COMPLEX QUERY
PLANS
When multiple operators such as aggre-
gations and joins execute concurrently in
a pipelined execution engine, physical re-
sources such as memory and disk band-
width must be shared by all operators.
Thus, optimal scheduling of multiple op-
erators and the division and allocation of
resources in a complex plan are impor-
tant issues.
In earlier relational execution engines,
these issues were largely ignored for two
reasons. First, only left-deep trees were
used for query execution, i.e., the right
(inner) input of a binary operator had to
be a scan. In other words, concurrent
execution of multiple subplans in a single
query was not possible. Second, under
the assumption that sorting was needed
at each step and considering that sorting
for nontrivial file sizes requires that the
entire input be written to temporary files
at least once, concurrency and the need
for resource allocation were basically ab-
sent. Today’s query execution engines
consider more join algorithms that per-
mit extensive pipelining, e.g., hybrid hash
join, and more complex query plans, in-
cluding bushy trees. Moreover, today’s
systems support more concurrent users
and use parallel-processing capabilities.
Thus, resource allocation for complex
queries is of increasing importance for
database query processing.
Some researchers have considered re-
source contention among multiple query
processing operators with the focus on
buffer management. The goal in these
efforts was to assign disk pages to buffer
slots such that the benefit of each buffer
slot would be maximized, i.e., the number
of 1/0 operations avoided in the future.
Sacco and Schkolnick [1982; 1986] ana-
lyzed several database algorithms and
found that their cost functions exhibit
steps when plotted over available buffer
space, and they suggested that buffer
space should be allocated at the low end
of a step for the least buffer use at a
given cost. Chou [1985] and Chou and
DeWitt [1985] took this idea further by
combining it with separate page replace-
ment algorithms for each relation or scan,
following observations by Stonebraker
[1981] on operating system support for
database systems, and with load control,
calling the resulting algorithm DBMIN.
Faloutsos et al. [1991] and Ng et al. [1991]
generalized this goal and used the classic
economic concepts of decreasing marginal
gain and balanced marginal gains for
maximal overall gain. Their measure of
gain was the reduction in the number of
page faults. Zeller and Gray [1990] de-
signed a hash join algorithm that adapts
to the current memory and buffer con-
tention each time a new hash table is
built. Most recently, Brown et al. [1992]
have considered resource allocation
tradeoffs among short transactions and
complex queries.
Schneider [1990] and Schneider and
DeWitt [1990] were the first to systemat-
ically examine execution schedules and
costs for right-deep trees, i.e., query eval-
uation plans with multiple binary hash
joins for which all build phases proceed
concurrently or at least could proceed
concurrently (notice that in a left-deep
plan, each build phase receives its data
from the probe phase of the previous join,
limiting left-deep plans to two concurrent
joins in different phases). Among the
most interesting findings are that
through effective use of bit vector fil-
tering (discussed later), memory re-
quirements for right-deep plans might
actually be comparable to those of
left-deep plans [Schneider 1991]. This
work has recently been extended by
Chen et al. [1992] to bushy plans in-
terpreted and executed as multiple
right-deep subplans.
For binary matching iterators to be
used in bushy plans, we have identified
several concerns. First, some query-
processing algorithms include a point at
which all data are in temporary files on
disk and at which no intermediate result
data reside in memory. Such “stop” points
can be used to switch efficiently between
different subplans. For example, if two
subplans produce and sort two merge-join
inputs, stopping work on the first sub-
plan and switching to the second one
should be done when the first sort opera-
tor has all its data in sorted runs and
when only the final merge is left but no
output has been produced yet. Figure 22
illustrates this point in time. Fortu-
nately, this timing can be realized natu-
rally in the iterator implementation of
sorting if input runs for the final merge
are opened in the first call of the next
procedure, not at the end of the open
phase. A similar stop point is available in
hash join when using overflow avoidance.
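In iterator terms, realizing the stop point requires only that the sort's open procedure end after run generation and that the final merge be initiated lazily; a sketch (Python, with sorted in-memory lists as a simplified stand-in for run files):

    import heapq

    def create_runs(input_iter, capacity):
        # simplified run generation; a real system writes runs to disk
        runs, buf = [], []
        for rec in input_iter:
            buf.append(rec)
            if len(buf) == capacity:
                runs.append(sorted(buf))
                buf = []
        if buf:
            runs.append(sorted(buf))
        return runs

    class SortIterator:
        def __init__(self, input_iter, capacity):
            self.input_iter, self.capacity = input_iter, capacity
            self.runs, self.merge = None, None

        def open(self):
            # after open(), all data rest in sorted runs and none in
            # memory: the operator's natural stop point
            self.runs = create_runs(self.input_iter, self.capacity)

        def next(self):
            if self.merge is None:
                # the final merge is opened on the first next() call,
                # so a scheduler may switch subplans at the stop point
                self.merge = heapq.merge(*self.runs)
            return next(self.merge, None)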
Figure 22. The stop point during sorting.
Second, since hybrid hashing produces
some output data before the memory con-
tents (output buffers and hash table) can
be discarded and since, therefore, such a
stop point does not occur in hybrid hash
join, implementations of hybrid hash join
and other binary match operations should
be parameterized to permit overflow
avoidance as a run time option to be
chosen by the query optimizer. This dy-
namic choice will permit the query opti-
mizer to force a stop point in some opera-
tors while using hybrid hash in most
operations.
Third, binary-operator implementa-
tions should include a switch that con-
trols which subplan is initiated first. In
Table 1 with algorithm outlines for itera-
tors’ open, next, and close procedures,
the hash join open procedure executes
the entire build-input plan first before
opening the probe input. However, there
might be situations in which it would be
better to open the probe input before
executing the build input. If the probe
input does not hold any resources such
as memory between open and next calls,
initiating the probe input first is not a
problem. However, there are situations
in which it creates a big benefit, in par-
ticular in bushy query evaluation plans
and in parallel systems to be discussed
later.
Fourth, if multiple operators are active
concurrently, memory has to be divided
among them. If two sorts produce input
data for a merge-join, which in turn
passes its output into another sort using
quicksort, memory should be divided pro-
portionally to the sizes of the three files
involved. We believe that for multiple
sorts producing data for multiple merge-
joins on the same attribute, proportional
memory division will also work best. If a
sort in its run generation phase shares
resources with other operations, e.g., a
sort following two sorts in their final
merges and a merge-join, it should also
use resources proportional to its input
size. For example, if two merge-join in-
puts are of the same size and if the
merge-join output which is sorted imme-
diately following the merge-join is as
large as the two inputs together, the two
final merges should each use one quar-
ter of memory while the run gener-
ation (quicksort) should use one half of
memory.
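The proportional rule itself is trivial to state (Python; the page counts are hypothetical):

    def divide_memory(total_memory, input_sizes):
        # each concurrent operator receives memory in proportion
        # to the size of the file it processes
        volume = sum(input_sizes)
        return [total_memory * s / volume for s in input_sizes]

    # two final merges of 100 pages each plus run generation for
    # their 200-page join output: one quarter, one quarter, one half
    print(divide_memory(64, [100, 100, 200]))  # [16.0, 16.0, 32.0]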
Fifth, in recursive hybrid hash join,
the recursion levels should be executed
level by level. In the most straightfor-
ward recursive algorithm, recursive invo-
cation of the original algorithm for each
output partition results in depth-first
partitioning, and the algorithm produces
output as soon as the first leaf in the
recursion tree is reached. However, if the
operator that consumes the output re-
quires memory as soon as it receives in-
put, for example, hybrid hash join (ii) in
Figure 23 as soon as hybrid hash join (i)
produces output, the remaining parti-
tioning operations in the producer opera-
tor (hybrid hash join (i)) must share
memory with the consumer operator (hy-
brid hash join (ii)), effectively cutting the
partitioning fan-out in the producer in
half. Thus, hash-based recursive match-
ing algorithms should proceed in three
distinct phases—consuming input and
initial partitioning, partitioning into files
suitable for hybrid hash join, and final
hybrid hash join for all partitions—with
phase two completed entirely before
phase three commences. This sequence of
partitioning steps was introduced as
breadth-first partitioning in the previous
section as opposed to depth-first parti-
tioning used in the most straightforward
recursive algorithms. Of course, the top-
most operator in a query evaluation plan
does not have a consumer operator with
which it shares resources; therefore, this
operator should use depth-first partition-
ing in order to provide a better response
time, i.e., earlier delivery of the first data
item.
Figure 23. Plan for joining three inputs: hybrid hash join (i) joins inputs I1 and I2; its output O1 is joined with input I3 by hybrid hash join (ii).
Sixth, the allocation of resources other
than memory, e.g., disk bandwidth and
disk arms for seeking in partitioning and
merging, is an open issue that should be
addressed soon, because the different im-
provement rates in CPU and disk speeds
will increase the importance of disk per-
formance for overall query processing
performance. One possible alleviation of
this problem might come from disk ar-
rays configured exclusively for perfor-
mance, not for reliability. Disk arrays
might not deliver the entire performance
gain the large number of disk drives could
provide if it is not possible to disable a
disk array’s parity mechanisms and to
access specific disks within an array,
particularly during partitioning and
merging.
Finally, scheduling bushy trees in
multiprocessor systems is not entirely
understood yet. While all considerations
discussed above apply in principle, multi-
processors permit truly concurrent exe-
cution of multiple subplans in a bushy
tree. However, it is a very hard problem
to schedule two or more subplans such
that their result streams are available at
the right times and at the right rates, in
particular in light of the unavoidable er-
rors in selectivity and cost estimation
during query optimization [Christodoula-
kis 1984; Ioannidis and Christodoulakis
1991].
The last point, estimation errors, leads
us to suspect that plans with 30 (or even
100) joins or other operations cannot be
optimized completely before execution.
Thus, we suspect that a technique remi-
niscent of Ingres Decomposition [Wong
and Youssefi 1976; Youssefi and Wong
1979] will prove to be more effective. One
of the principal ideas of Ingres Decompo-
sition is a repetitive cycle consisting of
three steps. First, the next step is se-
lected, e.g., a selection or join. Second,
the chosen step is executed into a tempo-
rary table. Third, the query is simplified
by removing predicates evaluated in the
completed execution step and replacing
one range variable (relation) in the query
with the new temporary table. The justi-
fication and advantage of this approach
are that all earlier selectivities are known
for each decision, because the intermedi-
ate results are materialized. The disad-
vantage is that data flow between opera-
tors cannot be exploited, resulting in a
significant cost for writing and reading
intermediate files. For very complex
queries, we suggest modifying Decompo-
sition to decide on and execute multiple
steps in each cycle, e.g., 3 to 9 joins,
instead of executing only one selection or
join as in Ingres. Such a hybrid approach
might very well combine the advantages
of a priori optimization, namely, in-
memory data flow between iterators, and
optimization with exactly known inter-
mediate result sizes.
An optimization and execution envi-
ronment even further tuned for very
complex queries would anticipate possi-
ble outcomes of executing subplans and
provide multiple alternative subsequent
plans. Figure 24 shows the structure of
such a dynamic plan for a complex query.
First, subplan A is executed, and statis-
tics about its result are gathered while it
is saved on disk. Depending on these
statistics, either B or C is executed next.
If B is chosen and executed, one of D, E,
and F will complete the query; in the
case of C instead of B, it will be G or H.
Notice that each letter A–H can be an
arbitrarily complex subplan, although
probably not more than 10 operations
due to the limitations of current selectiv-
ity estimation methods. Unfortunately,
realization of such sophisticated query
optimizers will require further research,
e.g., into determination of when separate
cases are warranted and limitation of
the possibly exponential growth in the
number of subplans.
Figure 24. A decision tree of partial plans.
9. MECHANISMS FOR PARALLEL QUERY
EXECUTION
Considering that all high-performance
computers today employ some form of
parallelism in their processing hardware,
it seems obvious that software written to
manage large data volumes ought to be
able to exploit parallel execution capabil-
ities [DeWitt and Gray 1992]. In fact, we
believe that five years from now it will be
argued that a database management sys-
tem without parallel query execution will
be as handicapped in the market place as
one without indices.
The goal of parallel algorithms and
systems is to obtain speedup and scaleup,
and speedup results are frequently used
to demonstrate the accomplishments of a
design and its implementation. Speedup
considers additional hardware resources
for a constant problem size; linear
speedup is considered optimal. In other
words, N times as many resources should
solve a constant-size problem in 1/N of
the time. Speedup can also be expressed
as parallel efficiency, i.e., a measure of
how close a system comes to linear
speedup. For example, if solving a prob-
lem takes 1,200 seconds on a single ma-
chine and 100 seconds on 16 machines,
the speedup is somewhat less than lin-
ear. The parallel efficiency is (1 ×
1200)/(16 × 100) = 75%.
An alternative measure for a parallel
system’s design and implementation is
scaleup, in which the problem size is al-
tered with the resources. Linear scaleup
is achieved when N times as many re-
sources can solve a problem with N times
as much data in the same amount of
time. Scaleup can also be expressed us-
ing parallel efficiency, but since speedup
and scaleup are different, it should al-
ways be clearly indicated which parallel
efficiency measure is being reported.
A third measure for the success of a
parallel algorithm based on Amdahl’s law
is the fraction of the sequential program
for which linear speedup was attained,
defined by p = f × s/d + (1 – f) × s for
sequential execution time s, parallel exe-
cution time p, and degree of parallelism
d. Resolved for f, this is f = (s – p)/(s
– s/d) = ((s – p)/s)/((d – 1)/d). For
the example above, this fraction is f =
((1200 – 100)/1200)/((16 – 1)/16) =
97.78%. Notice that this measure gives
much higher percentage values than the
parallel efficiency calculated earlier;
therefore, the two measures should not
be confused.
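Both measures for this example can be computed directly (Python):

    def parallel_efficiency(s, p, d):
        # sequential time s, parallel time p, degree of parallelism d
        return s / (d * p)

    def amdahl_fraction(s, p, d):
        # fraction of the sequential program with linear speedup,
        # from p = f * s/d + (1 - f) * s resolved for f
        return ((s - p) / s) / ((d - 1) / d)

    s, p, d = 1200, 100, 16
    print(parallel_efficiency(s, p, d))  # 0.75
    print(amdahl_fraction(s, p, d))      # 0.9777...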
For query processing problems involv-
ing sorting or hashing in which multiple
merge or partitioning levels are expected,
the speedup can frequently be more than
linear, or superlinear. Consider a sorting
problem that requires two merge levels
in a single machine. If multiple machines
are used, the sort problem can be parti-
tioned such that each machine sorts a
fraction of the entire data amount. Such
partitioning will, in a good implementa-
tion, result in linear speedup. If, in addi-
tion, each machine has its own memory
such that the total memory in the system
grows with the size of the machine, fewer
than two merge levels will suffice, mak-
ing the speedup superlinear.
9.1 Parallel versus Distributed Database
Systems
It might be useful to start the discussion
of parallel and distributed query process-
ing with a distinction of the two concepts.
In the database literature, “distributed”
usually implies “locally autonomous,” i.e.,
each participating system is a complete
database management system in itself,
with access control, metadata (catalogs),
query processing, etc. In other words,
each node in a distributed database man-
agement system can function entirely on
its own, whether or not the other nodes
are present or accessible. Each node per-
forms its own access control, and co-
operation of each node in a distributed
transaction is voluntary. Examples of
distributed (research) systems are R*
[Haas et al. 1982; Traiger et al. 1982],
distributed Ingres [Epstein and Stone-
braker 1980; Stonebraker 1986a], and
SDD-1 [Bernstein et al. 1981; Rothnie et
al. 1980]. There are now several commer-
cial distributed relational database man-
agement systems. Ozsu and Valduriez
[1991a; 1991b] have discussed dis-
tributed database systems in much more
detail. If the cooperation among multiple
database systems is only limited, the sys-
tem can be called a “federated” database
system [Sheth and Larson 1990].
In parallel systems, on the other hand,
there is only one locus of control. In other
words, there is only one database man-
agement system that divides individual
queries into fragments and executes the
fragments in parallel. Access control to
data is independent of where data objects
currently reside in the system. The query
optimizer and the query execution engine
typically assume that all nodes in the
system are available to participate in ef-
ficient execution of complex queries, and
participation of nodes in a given transac-
tion is either presumed or controlled by a
global resource manager, but is not based
on voluntary cooperation as in dis-
tributed systems. There are several par-
allel research prototypes, e.g., Gamma
[DeWitt et al. 1986; DeWitt et al. 1990],
Bubba [Boral 1988; Boral et al. 1990],
Grace [Fushimi et al. 1986; Kitsuregawa
et al. 1983], and Volcano [Graefe 1990b;
1993b; Graefe and Davison 1993], and
products, e.g., Tandem’s NonStop SQL
[Englert et al. 1989; Zeller 1990], Tera-
data’s DBC/1012 [Neches 1984; 1988;
Teradata 1983], and Informix [Davison
1992].
Both distributed database systems and
parallel database systems have been de-
signed in various kinds, which may cre-
ate some confusion. Distributed systems
can be either homogeneous, meaning that
all participating database management
systems are of the same type (the hard-
ware and the operating system may even
be of the same types), or heterogeneous,
meaning that multiple database manage-
ment systems work together using stan-
dardized interfaces but are internally
different.16 Furthermore, distributed
systems may employ parallelism, e.g.,
by pipelining datasets between nodes
with the receiver already working on
some items while the producer is still
sending more. Parallel systems can be
based on shared-memory (also called
shared-everything), shared-disk (multi-
ple processors sharing disks but not
memory), distributed-memory (with-
out sharing disks, also called shared-
nothing), or hierarchical computer
architectures consisting of multiple clus-
ters, each with multiple CPUs and disks
and a large shared memory. Stone-
braker [1986b] compared the first three
alternatives using several aspects of
database management, and came to
the conclusion that distributed memory
is the most promising database man-
agement system platform. Each of
these approaches has advantages and
disadvantages; our belief is that the hi-
erarchical architecture is the most gen-
eral of these architectures and should
be the target architecture for new data-
base software development [Graefe and
Davison 1993].
9.2 Forms of Parallelism
There are several forms of parallelism
that are interesting to designers and im-
plementors of query processing systems.
Interquery parallelism is a direct result
of the fact that most database manage-
ment systems can service multiple re-
quests concurrently. In other words,
multiple queries (transactions) can be
executing concurrently within a single
database management system. In this
form of parallelism, resource contention
16 In some organizations, two different database
management systems may run on the same (fairly
large) computer. Their interactions could be called
“nondistributed heterogeneous.” However, since the
rules governing such interactions are the same as
for distributed heterogeneous systems, the case is
usually ignored in research and system design.
is of great concern, in particular, con-
tention for memory and disk arms.
The other forms of parallelism are all
based on the use of algebraic operations
on sets for database query processing,
e.g., selection, join, and intersection. The
theory and practice of exploiting other
“bulk” types such as lists for parallel
database query execution are only now
developing. Interoperator parallelism is
basically pipelining, or parallel execution
of different operators in a single query.
For example, the iterator concept dis-
cussed earlier has also been called “syn-
chronous pipelines” [Pirahesh et al.
1990]; there is no reason not to consider
asynchronous pipelines in which opera-
tors work independently connected by a
buffering mechanism to provide flow
control.
Interoperator parallelism can be used
in two forms, either to execute producers
and consumers in pipelines, called verti-
cal interoperator parallelism here, or to
execute independent subtrees in a com-
plex bushy-query evaluation plan concur-
rently, called horizontal interoperator or
bushy parallelism here. A simple exam-
ple for bushy parallelism is a merge-join
receiving its input data from two sort
processes. The main problem with bushy
parallelism is that it is hard or impossi-
ble to ensure that the two subplans start
generating data at the right time and
generate them at the right rates. Note
that the right time does not necessarily
mean the same time, e.g., for the two
inputs of a hash join, and that the right
rates are not necessarily equal, e.g., if
two inputs of a merge-join have different
sizes. Therefore, bushy parallelism pre-
sents too many open research issues and
is hardly used in practice at this time.
The final form of parallelism in
database query processing is intraopera-
tor parallelism in which a single opera-
tor in a query plan is executed in multi-
ple processes, typically on disjoint pieces
of the problem and disjoint subsets of the
data. This form, also called parallelism
based on fragmentation or partitioning,
is enabled by the fact that query process-
ing focuses on sets. If the underlying data
represented sequences or time series in
a scientific database management sys-
tem, partitioning into subsets to be oper-
ated on independently would not be
feasible or would require additional
synchronization when putting the in-
dependently obtained results together.
Both vertical interoperator parallelism
and intraoperator parallelism are used in
database query processing to obtain
higher performance. Beyond the obvious
opportunities for speedup and scaleup
that these two concepts offer, they both
have significant problems. Pipelining
does not easily lend itself to load balanc-
ing because each process or processor
in the pipeline is loaded proportionally to
the amount of data it has to process. This
amount cannot be chosen by the imple-
mentor or the query optimizer and can-
not be predicted very well. For intraoper-
ator, partitioning-based parallelism, load
balance and performance are optimal if
the partitions are all of equal size; how-
ever, this can be hard to achieve if value
distributions in the inputs are skewed.
9.3 Implementation Strategies
The purpose of the query execution en-
gine is to provide mechanisms for query
execution from which the query opti-
mizer can choose—the same applies for
the means and mechanisms for parallel
execution. There are two general ap-
proaches to parallelizing a query execu-
tion engine, which we call the bracket
and operator models and which are used,
for example, in the Gamma and Volcano
systems, respectively.
In the bracket model, there is a generic
process template that can receive and
send data and can execute exactly one
operator at any point of time. A schematic
diagram of a template process is shown
in Figure 25, together with two possible
operators, join and aggregation. In order
to execute a specific operator, e.g., a join,
the code that makes up the generic tem-
plate “loads” the operator into its place
(by switching to this operator’s code) and
initiates the operator which then controls
execution; network I/O on the receiving
and sending sides is performed as a ser-
vice to the operator on its request and
initiation and is implemented as proce-
dures to be called by the operator.
Figure 25. Bracket model of parallelization.
number of inputs that can be active at
any point of time is limited to two since
there are only unary and binary opera-
tors in most database systems. The oper-
ator is surrounded by generic template
code, which shields it from its environ-
ment, for example, the operator(s) that
produce its input and consume its out-
put. For parallel query execution, many
templates are executed concurrently in
the system, using one process per tem-
plate. Because each operator is written
with the implicit assumption that this
operator controls all activities in its pro-
cess, it is not possible to execute two
operators in one process without resort-
ing to some thread or coroutine facility,
i.e., a second implementation level of the
process concept.
In a query-processing system using the
bracket model, operators are coded in
such a way that network I/O is their
only means of obtaining input and deliv-
ering output (with the exception of scan
and store operators). The reason is that
each operator is its own locus of control,
and network flow control must be used to
coordinate multiple operators, e.g., to
match two operators’ speed in a pro-
ducer-consumer relationship. Unfortu-
nately, this coordination requirement
also implies that passing a data item
from one operator to another always in-
volves expensive interprocess communi-
cation system calls, even in the cases
when an entire query is evaluated on a
single CPU (and could therefore be eval-
uated in a single process, without inter-
process communication and operating
system involvement) or when data do not
need to be repartitioned among nodes in
a network. An example for the latter is
the query “joinCselAselB” in the Wiscon-
sin Benchmark, which requires joining
three inputs on the same attribute [De-
Witt 1991], or any other query that per-
mits interesting orderings [Selinger et al.
1979], i.e., any query that uses the same
join attribute for multiple binary joins.
Thus, in queries with multiple operators
(meaning almost all queries), interpro-
cess communication and its overhead are
mandatory in the bracket model rather
than optional.
An alternative to the bracket model is
the operator model. Figure 26 shows a
possible parallelization of a join plan us-
ing the operator model, i.e., by inserting
“parallelism” operators into a sequential
plan, called exchange operators in the
Volcano system [Graefe 1990b; Graefe
and Davison 1993]. The exchange opera-
tor is an iterator like all other operators
in the system with open, next, and close
procedures; therefore, the other opera-
tors are entirely unaffected by the pres-
ence of exchange operators in a query
evaluation plan. The exchange operator
does not contribute to data manipulation;
thus, on the logical level, it is a “no-op”
that has no place in a logical query alge-
bra such as the relational algebra. On
the physical level of algorithms and pro-
cesses, however, it provides control not
provided by any of the normal operators,
i.e., process management, data redistri-
bution, and flow control. Therefore, it is
a control operator or a meta-operator.
Separation of data manipulation from
process control and interprocess com-
munication can be considered an im-
portant advantage of the operator
model of parallel query processing, be-
cause it permits design, implementation,
and execution of new data manipulation
algorithms such as N-ary hybrid hash
join [Graefe 1993a] without regard to the
execution environment.
Figure 26. Operator model of parallelization.
A second issue important to point out
is that the exchange operator only pro-
vides mechanisms for parallel query pro-
cessing; it does not determine or presup-
pose policies for using its mechanisms.
Policies for parallel processing such as
the degree of parallelism, partitioning
functions, and allocation of processes to
processors can be set either by a query
optimizer or by a human experimenter in
the Volcano system as they are still sub-
ject to intense research. The design of the
exchange operator permits execution of a
complex query in a single process (by
using a query plan without any exchange
operators, which is useful in single-
processor environments) or with a num-
ber of processes by using one or more
exchange operators in the query evalua-
tion plan. The mapping of a sequential
plan to a parallel plan by inserting ex-
change operators permits one process per
operator as well as multiple processes for
one operator (using data partitioning) or
multiple operators per process, which is
useful for executing a complex query plan
with a moderate number of processes.
Earlier parallel query execution engines
did not provide this degree of flexibility;
the bracket model used in the Gamma
design, for example, requires a separate
process for each operator [DeWitt et al.
1986].
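A minimal single-site sketch of an exchange iterator (Python, with a thread and a bounded queue standing in for processes and network flow control; an illustration of the concept, not Volcano's implementation):

    import queue, threading

    class Exchange:
        # a meta-operator: it moves items across a process (here,
        # thread) boundary but performs no data manipulation
        _END = object()                   # end-of-stream marker

        def __init__(self, child_next, capacity=64):
            self.child_next = child_next  # the child's next() procedure
            self.q = queue.Queue(maxsize=capacity)  # flow control

        def open(self):
            def produce():
                while True:
                    item = self.child_next()
                    self.q.put(self._END if item is None else item)
                    if item is None:
                        return
            threading.Thread(target=produce, daemon=True).start()

        def next(self):
            item = self.q.get()           # blocks; the bounded queue
            return None if item is self._END else item  # paces both sides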
Figure 27. Processes created by exchange operators.
Figure 27 shows the processes created
by the exchange operators in the previ-
ous figure, with each circle representing
a process. Note that this set of processes
is only one possible parallelization, which
makes sense if the joins are on the same
join attributes. Furthermore, the degrees
of data parallelism, i.e., the number of
processes in each process group, can be
controlled using an argument to the ex-
change operator.
There is no reason to assume that the
two models differ significantly in their
performance if implemented with similar
care. Both models can be implemented
with a minimum of control overhead and
can be combined with any partitioning
scheme for load balancing. The only dif-
ference with respect to performance is
that the operator model permits multiple
data manipulation operators such as join
in a single process, i.e., operator synchro-
nization and data transfer between oper-
ators with a single procedure call with-
out operating system involvement. The
important advantages of the operator
model are that it permits easy paral-
lelization of an existing sequential sys-
tem as well as development and mainte-
nance of operators and algorithms in a
familiar and relatively simple single-pro-
cess environment [Graefe and Davison
1993].
The bracket and operator models both
provide pipelining and partitioning as
part of pipelined data transfer between
process groups. For most algebraic opera-
tors used in database query processing,
these two forms of parallelism are suffi-
cient. However, not all operations can be
easily supported by these two models.
For example, in a transitive closure oper-
ator, newly inferred data is equal to in-
put data in its importance and role for
creating further data. Thus, to paral-
lelize a single transitive closure operator,
the newly created data must also be par-
titioned like the input data. Neither
bracket nor operator model immediately
allow for this need. Hence, for transitive
closure operators, intraoperator paral-
lelism based on partitioning requires that
the processes exchange data among
themselves outside of the stream
paradigm.
The transitive closure operator is not
the only operation for which this restric-
tion holds. Other examples include the
complex object assembly operator de-
scribed by Keller et al. [1991] and opera-
tors for numerical optimizations as might
be used in scientific databases. Both
models, the bracket model and the opera-
tor model, could be extended to provide a
general and efficient solution to intraop-
erator data exchange for intraoperator
parallelism.
9.4 Load Balancing and Skew
For optimal speedup and scaleup, pieces
of the processing load must be assigned
carefully to individual processors and
disks to ensure equal completion times
for all pieces. In interoperator paral-
lelism, operators must be grouped to en-
sure that no one processor becomes the
bottleneck for an entire pipeline. Bal-
anced processing loads are very hard to
achieve because intermediate set sizes
cannot be anticipated with accuracy and
certainty in database query optimization.
Thus, no existing or proposed query-
processing engine relies solely on inter-
operator parallelism. In intraoperator
parallelism, data sets must be parti-
tioned such that the processing load is
nearly equal for each processor. Notice
that in particular for binary operations
such as join, equal processing loads can
be different from equal-sized partitions.
There are several research efforts de-
veloping techniques to avoid skew or to
limit the effects of skew in parallel query
processing, e.g., Baru and Frieder [1989],
DeWitt et al. [1991b], Hua and Lee
[1991], Kitsuregawa and Ogawa [1990],
Lakshmi and Yu [1988; 1990], Omiecin-
ski [1991], Seshadri and Naughton
[1992], Walton [1989], Walton et al.
[1991], and Wolf et al. [1990; 1991]. How-
ever, all of these methods have their
drawbacks, for example, additional re-
quirements for local processing to deter-
mine quantiles.
Skew management methods can be di-
vided into basically two groups. First,
skew avoidance methods rely on deter-
mining suitable partitioning rules before
data is exchanged between processing
nodes or processes. For range partition-
ing, quantiles can be determined or esti-
mated from sampling the data set to be
partitioned, from catalog data, e.g., his-
tograms, or from a preprocessing step.
Histograms kept on permanent base data
have only limited use for intermediate
query processing results, in particular, if
the partitioning attribute or a correlated
attribute has been used in a prior selec-
tion or matching operation. However, for
stored data they may be very beneficial.
Sampling implies that the entire popula-
tion is available for sampling because the
first memory load of an intermediate re-
sult may be a very poor sample for parti-
tioning decisions. Thus, sampling might
imply that the data flow between opera-
tors be halted and an entire intermediate
result be materialized on disk to ensure
proper random sampling and subsequent
partitioning. However, if such a halt is
required anyway for processing a large
set, it can be used for both purposes. For
example, while creating and writing ini-
tial run files without partitioning in a
parallel sort, quantiles can be deter-
mined or estimated and used in a com-
bined partitioning and merging step.
Figure 28. Skew limit, confidence, and sample size per partition. (Sample size per partition, on a logarithmic scale, plotted against skew limits from 1 to 2, for 2 and 1,024 partitions at 95% and 99.9% confidence.)
Second, skew resolution repartitions
some or all of the data if an initial parti-
tioning has resulted in skewed loads.
Repartitioning is relatively easy in
shared-memory machines, but can also
be done in distributed-memory architec-
tures, albeit at the expense of more net-
work activity. Skew resolution can be
based on rehashing in hash partitioning
or on quantile adjustment in range parti-
tioning. Since hash partitioning tends to
create fairly even loads and since net-
work bandwidth will increase in the near
future within distributed-memory ma-
chines as well as in local- and wide-area
networks, skew resolution is a reason-
able method for cases in which a prior
processing step cannot be exploited to
gather the information necessary for
skew avoidance as in the sort example
above.
In their recent research into sampling
for load balancing, DeWitt et al. [ 1991b]
and Seshadri and Naughton [1992] have
shown that stratified random sampling
can be used, i.e., samples are selected
randomly not from the entire distributed
data set but from each local data set at
each site, and that even small sets of
samples ensure reasonably balanced
loads. Their definition of skew is the quo-
tient of sizes of the largest partition and
the average partition, i.e., the sum of
sizes of all partitions divided by the de-
gree of parallelism. In other words, a
skew of 1.0 indicates a perfectly even
distribution. Figure 28 shows the re-
quired sample sizes per partition for var-
ious skew limits, degrees of parallelism,
and confidence levels. For example, to
ensure a maximal skew of 1.5 among
1,000 partitions with 95% confidence, 110
random samples must be taken at each
site. Thus, relatively small samples suf-
fice for reasonably safe skew avoidance
and load balancing, making precise
methods unnecessary. Typically, only
tens of samples per partition are needed,
not several hundreds of samples at each
site.
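A sketch of the sampling step (Python; the sample size per site would be chosen from an analysis like the one behind the figure):

    import random

    def range_partition_bounds(local_data_sets, samples_per_site, degree):
        # stratified random sampling: sample each site's local data,
        # then estimate quantiles from the combined sample
        sample = []
        for data in local_data_sets:
            sample.extend(random.sample(data, min(samples_per_site, len(data))))
        sample.sort()
        step = len(sample) / degree
        # degree - 1 bounds split the data into 'degree' ranges
        return [sample[int(i * step)] for i in range(1, degree)]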
For allocation of active processing ele-
ments, i.e., CPUs and disks, the band-
width considerations discussed briefly in
the section on sorting can be generalized
for parallel processes. In principle, all
stages of a pipeline should be sized such
that they all have bandwidths propor-
tional to their respective data volumes in
order to ensure that no stage in the
pipeline becomes a bottleneck and slows
the other ones down. The latency almost
unavoidable in data transfer between
pipeline stages should be hidden by the
use of buffer memory equal in size to the
product of bandwidth and latency.
9.5 Architectures and Architecture
Independence
Many database research projects have
investigated hardware architectures for
parallelism in database systems. Stone-
braker [1986b] compared shared-nothing
(distributed-memory), shared-disk (dis-
tributed-memory with multiported disks),
and shared-everything (shared-memory)
architectures for database use based on a
number of issues including scalability,
communication overhead, locking over-
head, and load balancing. His conclusion
at that time was that shared-everything
excels in none of the points considered;
shared-disk introduces too many locking
and buffer coherency problems; and
shared-nothing has the significant bene-
fit of scalability to very high degrees of
parallelism. Therefore, he concluded that
overall shared-nothing is the preferable
architecture for database system imple-
mentation. (Much of this section has been
derived from Graefe et al. [1992] and
Graefe and Davison [1993].)
Bhide [1988] and Bhide and Stone-
braker [1988] compared architectural al-
ternatives for transaction processing and
concluded that a shared-everything
(shared-memory) design achieves the best
performance, up to its scalability limit.
To achieve higher performance, reliabil-
ity, and scalability, Bhide suggested
considering shared-nothing (distributed-
memory) machines with shared-every-
thing parallel nodes. The same idea is
mentioned in equally general terms by
Pirahesh et al. [1990] and Boral et al.
[1990], but none of these authors elabo-
rate on the idea’s generality or potential.
Kitsuregawa and Ogawa’s [1990] new
database machine SDC uses multiple
shared-memory nodes (plus custom hard-
ware such as the Omega network and a
hardware sorter), although the effect of
the hardware design on operators other
than join is not evaluated in their article.
Customized parallel hardware was in-
vestigated but largely abandoned after
Boral and DeWitt’s [1983] influential
analysis that compared CPU and I/O speeds and their trends. Their analysis concluded that I/O, not processing, is
the most likely bottleneck in future
high-performance query execution. Sub-
sequently, both Boral and DeWitt em-
barked on new database machine
projects, Bubba and Gamma, that exe-
cuted customized software on standard
processors with local disks [Boral et al.
1990; DeWitt et al. 1990]. For scalabil-
ity and availability, both projects
used distributed-memory hardware
with single-CPU nodes and investi-
gated scaling questions for very large
configurations.
The XPRS system, on the other hand,
has been based on shared memory [Hong
and Stonebraker 1991; Stonebraker et al.
1988a; 1988b]. Its designers believe that
modern bus architectures can handle up
to 2,000 transactions per second, and that
shared-memory architectures provide
automatic load balancing and faster
communication than shared-nothing
machines and are equally reliable and
available for most errors, i.e., media
failures, software, and operator errors
[Gray 1990]. However, we believe that
attaching 250 disks to a single machine
as necessary for 2,000 transactions per
second [Stonebraker et al. 1988b] requires significant special hardware, e.g., channels or I/O processors, and it is quite likely that the investment for such hardware can have greater impact on overall system performance if spent on general-purpose CPUs or disks. Without such
special hardware, the performance limit
for shared-memory machines is probably
much lower than 2,000 transactions per
second. Furthermore, there already are
applications that require larger storage
and access capacities.
Richardson et al. [1987] performed an
analytical study of parallel join algo-
rithms on multiple shared-memory “clus-
ters” of CPUs. They assumed a group of
clusters connected by a global bus with
multiple microprocessors and shared
memory in each cluster. Disk drives were
attached to the busses within clusters.
Their analysis suggested that the best
performance is obtained by using only
one cluster, i.e., a shared-memory archi-
tecture. We contend, however, that their
results are due to their parameter set-
tings, in particular small relations (typi-
cally 100 pages of 32 KB), slow CPUs (e.g., 5 μsec for a comparison, about 2–5 MIPS), a slow global network (a bus with
typically 100 Mbit/sec), and a modest number of CPUs in the entire system
(128). It would be very interesting to see
the analysis with larger relations (e.g.,
1–10 GB), a faster network, e.g., a modern hypercube or mesh with hardware routing, and consideration of bus load and bus contention in each cluster, which
might lead to multiple clusters being the
better choice. On the other hand, commu-
nication between clusters will remain a
significant expense. Wong and Katz
[1983] developed the concept of “local
sufficiency” that might provide guidance
in declustering and replication to reduce data movement between nodes. Other
work on declustering and limiting declus-
tering includes Copeland et al. [1988],
Fang et al. [1986], Ghandeharizadeh and
DeWitt [1990], Hsiao and DeWitt [1990],
and Hua and Lee [1990].
Finally, there are several hardware de-
signs that attempt to overcome the
shared-memory scaling problem, e.g., the
DASH project [Anderson et al. 1988], the
Wisconsin Multicube [Goodman and
Woest 1988], and the Paradigm project
[Cheriton et al. 1991]. However, these
designs follow the traditional separation
of operating system and application pro-
gram. They rely on page or cache-line
faulting and do not provide typical
database concepts such as read-ahead
and dataflow. The lack of separation between mechanism and policy in these designs almost makes it imperative to implement
dataflow and flow control for database
query processing within the query execu-
tion engine. At this point, none of these
hardware designs has been experimen-
tally tested for database query
processing.
New software systems designed to ex-
ploit parallel hardware should be able to
exploit both the advantages of shared
memory, namely efficient communica-
tion, synchronization, and load balanc-
ing, and of distributed memory, namely
scalability to very high degrees of paral-
lelism and reliability and availability
through independent failures. Figure 29
shows a general hierarchical architec-
ture, which we believe combines these
advantages. The important point is the
combination of local busses within shared-memory parallel machines and a global interconnection network among machines. The diagram is only a very general outline of such an architecture; many details are deliberately left out and unspecified. The network could be imple-
mented using a bus such as an ethernet,
a ring, a hypercube, a mesh, or a set of
point-to-point connections. The local busses may or may not be split into code and data or by address range to obtain less contention and higher bus bandwidth and hence higher scalability limits for the use of shared memory. Design and placement of caches, disk controllers, terminal connections, and local- and
wide-area network connections are also
left open. Tape drives or other backup
devices would be connected to local
busses.
Modularity is a very important consid-
eration for such an architecture. For ex-
ample, it should be possible to replace all
CPU boards with upgraded models without having to replace memories or disks.
Considering that new components will
change communication demands, e.g.,
faster CPUS might require more local bus
bandwidth, it is also important that the
allocation of boards to local busses can be
changed. For example, it should be easy to reconfigure a machine with 4 × 16 CPUs into one with 8 × 8 CPUs.
Beyond the effect of faster communica-
tion and synchronization, this architec-
ture can also have a significant effect on
control overhead, load balancing, and re-
sulting response time problems. Investi-
gations in the Bubba project at MCC
demonstrated that large degrees of paral-
lelism may reduce performance unless load imbalance and overhead for startup,
synchronization, and communication can
be kept low [Copeland et al. 1988]. For
example, when placing 100 CPUS either
in 100 nodes or in 10 nodes of 10 CPUS
each, it is much faster to distribute query
plans to all CPUS and much easier to
achieve reasonably balanced loads in the second case than in the first case. Within
each shared-memory parallel node, load
Figure 29. A hierarchical-memory architecture (an interconnection network connecting shared-memory nodes, each with a local bus and multiple CPUs).
imbalance can be dealt with either by
compensating allocation of resources, e.g.,
memory for sorting or hashing, or by rel-
atively efficient reassignment of data to
processors.
Many of today’s parallel machines are
built as one of the two extreme cases of
this hierarchical design: a distributed-
memory machine uses single-CPU nodes,
while a shared-memory machine consists
of a single node. Software designed for
this hierarchical architecture will run on
either conventional design as well as on a
genuinely hierarchical machine and will
allow the exploration of tradeoffs in the
range of alternatives in between. The
most recent version of Volcano’s ex-
change operator is designed for hierar-
chical memory, demonstrating that the
operator model of parallelization also of-
fers architecture- and topology-indepen-
dent parallel query evaluation [Graefe
and Davison 1993]. In other words, the
parallelism operator is the only operator
that needs to “understand” the underly-
ing architecture, while all data manipu-
lation operators can be implemented
without concern for parallelism, data dis-
tribution, and flow control.
10. PARALLEL ALGORITHMS
In the previous section, mechanisms for
parallelizing a database query execution engine were discussed. In this section,
individual algorithms and their special
cases for parallel execution are consid-
ered in more detail. Parallel database
query processing algorithms are typically
based on partitioning an input using
range or hash partitioning. Either form
of partitioning can be combined with sort-
and hash-based query processing algo-
rithms; in other words, the choices of
partitioning scheme and local algorithm
are almost always entirely orthogonal.
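The two partitioning schemes can be sketched as follows; either one can feed a sort- or hash-based local algorithm (a minimal illustration with hypothetical names, not a production data exchange mechanism):

    import bisect

    def hash_partition(items, key, degree):
        # Assign each item to a partition by hashing its key.
        partitions = [[] for _ in range(degree)]
        for item in items:
            partitions[hash(key(item)) % degree].append(item)
        return partitions

    def range_partition(items, key, splitters):
        # splitters: degree-1 sorted key values separating the ranges.
        partitions = [[] for _ in range(len(splitters) + 1)]
        for item in items:
            partitions[bisect.bisect_right(splitters, key(item))].append(item)
        return partitions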
When building a parallel system, there
is sometimes a question whether it is
better to parallelize a slower sequential
algorithm with better speedup behavior
or a fast sequential algorithm with infe-
rior speedup behavior. The answer to this
question depends on the design goal and
the planned degree of parallelism. In the
few single-user database systems in use,
the goal has been to minimize response
time; for this goal, a slow algorithm with
linear speedup implemented on highly
parallel hardware might be the right
choice. In multi-user systems, the goal
typically is to minimize resource con-
sumption in order to maximize through-
put. For this goal, only the best sequen-
tial algorithms should be parallelized. For
example, Boral and DeWitt [1983] con-
cluded that parallelism is no substitute
for effective and efficient indices. For a
new parallel algorithm with impressive
speedup behavior, the question of
whether or not the underlying sequential
algorithm is the most efficient choice
should always be considered.
10.1 Parallel Selections and Updates
Since disk I/O is a performance bottleneck in many systems, it is natural to parallelize it. Typically, either asynchronous I/O or one process per participating I/O device is used, be it a disk or
an array of disks under a single con-
troller. If a selection attribute is also the
partitioning attribute, fewer than all
disks will contain selection results, and
the number of processes and activated
disks can be limited. Notice that parallel
selection can be combined very effec-
tively with local indices, i.e., indices cov-
ering the data of a single disk or node. In
general, it is most efficient to maintain
indices close to the stored data sets, i.e.,
on the same node in a parallel database
system.
For updates of partitioning attributes
in a partitioned data set, items may need
to move between disks and sites, just as
items may move if a clustering attribute
is updated. Thus, updates of partitioning
attributes may require setting up data
transfers from old to new locations of
modified items in order to maintain the
consistency of the partitioning. The fact
that updating partitioning attributes is
more expensive is one reason why im-
mutable (or nearly immutable) iden-
tifiers or keys are usually used as
partitioning attributes.
10.2 Parallel Sorting
Since sorting is the most expensive oper-
ation in many of today’s database man-
agement systems, much research has
been dedicated to parallel sorting
[Baugsto and Greipsland 1989; Beck et
al. 1988; Bitton and Friedland 1982;
Graefe 1990a; Iyer and Dias 1990; Kit-
suregawa et al. 1989b; Lorie and Young
1989; Menon 1986; Salzberg et al. 1990].
There are two dimensions along which
parallel sorting methods can be classi-
fied: the number of their parallel inputs
(e.g., scan or subplans executed in paral-
lel) and the number of parallel outputs
(consumers) [Graefe 1990a]. As sequen-
tial input or output restrict the through-
put of parallel sorts, we assume a
multiple-input multiple-output paral-
lel sort here, and we further assume
that the input items are partitioned
randomly with respect to the sort at-
tribute and that the output items should be range-partitioned and sorted
within each range.
Considering that data exchange is ex-
pensive, both in terms of communication
and synchronization delays, each data
item should be exchanged only once be-
tween processes. Thus, most parallel sort algorithms consist of a local sort and a
data exchange step. If the data exchange
step is done first, quantiles must be
known to ensure load balancing during
the local sort step. Such quantiles can be
obtained from histograms in the catalogs
or by sampling. It is not necessary that
the quantiles be precise; a reasonable
approximation will suffice.
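A minimal sketch of this step (hypothetical names): given a, possibly stratified, sample of the sort attribute, approximate quantiles serve as splitters for the subsequent range partitioning:

    def choose_splitters(sample, degree):
        # Pick degree-1 approximate quantiles from the sorted sample;
        # precision is unnecessary, a reasonable approximation suffices.
        ordered = sorted(sample)
        step = len(ordered) / degree
        return [ordered[int(i * step)] for i in range(1, degree)]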
If the local sort is done first, the final
local merging should pass data directly
into the data exchange step. On each
receiving site, multiple sorted streams
must be merged during the data ex-
change step. One of the possible prob-
lems is that all producers of sorted
streams first produce low key values,
limiting performance by the speed of the
first (single!) consumer; then all produc-
ers switch to the next consumer, etc.
If a different partitioning strategy than
range partitioning is used, sorting with
subsequent partitioning is not guaran-
teed to be deadlock free in all situations.
Deadlock will occur if (1) multiple con-
sumers feed multiple producers, (2) each
producer produces a sorted stream, and each consumer merges multiple sorted
streams, (3) some key-based partitioning
rule is used other than range partition-
ing, i.e., hash partitioning, (4) flow con-
trol is enabled, and (5) the data distribu-
tion is particularly unfortunate.
Figure 30 shows a scenario with two
producer and two consumer processes,
i.e., both the producer operators and the
consumer operators are executed with a
degree of parallelism of two. The circles
in Figure 30 indicate processes, and the
arrows indicate data paths. Presume that
the left sort produces the stream 1, 3, 5, 7, ..., 999, 1002, 1004, 1006, 1008, ..., 2000 while the right sort produces 2, 4, 6, 8, ..., 1000, 1001, 1003, 1005, 1007, ..., 1999. The merge opera-
tions in the consumer processes must re-
ceive the first item from each producer
process before they can create their first
output item and remove additional items
from their input buffers. However, the
producers will need to produce 500 items
each (and insert them into one
consumer’s input buffer, all 500 for one
consumer) before they will send their first
item to the other consumer. The data
exchange buffer needs to hold 1,000 items
at one point of time, 500 on each side of
Figure 30. If flow control is enabled and
if the exchange buffer (flow control slack)
is less than 500 items, deadlock will
occur.
The reason deadlock can occur in this
situation is that the producer processes
need to ship data in the order obtained
from their input subplan (the sort in Fig-
ure 30) while the consumer processes
need to receive data in sorted order as
required by the merge. Thus, there are
two sides which both require absolute
control over the order in which data pass
over the process boundary. If the two
requirements are incompatible, an un-
bounded buffer is required to ensure
freedom from deadlock.
In order to avoid deadlock, it must be
ensured that one of the five conditions
listed above is not satisfied. The second
condition is the easiest to avoid and is therefore the one to focus on. If the receiving
processes do not perform a merge, i.e.,
the individual input streams are not
sorted, deadlock cannot occur because the
slack given in the flow control must be
somewhere, either at some producer or
some consumer or several of them, and
the process holding the slack can con-
tinue to process data, thus preventing
deadlock.
Figure 30. Scenario with possible deadlock.

Figure 31. Deadlock-free scenario.
Our recommendation is to avoid the
above situation, i.e., to ensure that such
query plans are never generated by the
optimizer. Consider for which purposes
such a query plan would be used. The
typical scenario is that multiple pro-
cesses perform a merge join of two in-
puts, and each (or at least one) input is
sorted by several producer processes. An
alternative scenario that avoids the prob-
lem is shown in Figure 31. Result data
are partitioned and sorted as in the pre-
vious scenario. The important difference
is that the consumer processes do not
merge multiple sorted incoming streams.
One of the conditions for the deadlock
problem illustrated in Figure 30 is that
Figure 32. Deadlock danger due to a binary operator in the consumer.
there are multiple producers and multi-
ple consumers of a single logical data
stream. However, a very similar deadlock
situation can occur with single-process
producers if the consumer includes an
operation that depends on ordering, typi-
cally merge-join. Figure 32 illustrates the
problem with a merge-join operation exe-
cuted in two consumer processes. Notice
that the left and right producers in Fig-
ure 32 are different inputs of the merge-
join, not processes executing the same
operators as in Figures 30 and 31. The
consumer in Figure 32 is still one opera-
tor executed by two processes. Presume
that the left sort produces the stream 1,
3, 5, 7, ..., 999, 1002, 1004, 1006,
1008, ..., 2000 while the right sort produces 2, 4, 6, 8, ..., 1000, 1001, 1003, 1005, 1007, ..., 1999. In this case, the
merge-join has precisely the same effect
as the merging of two parts of one logical
data stream in Figure 30. Again, if the
data exchange buffer (flow control slack)
is too small, deadlock will occur. Similar
to the deadlock avoidance tactic in Fig-
ure 31, deadlock in Figure 32 can be
avoided by placing the sort operations
into the consumer processes rather than
into the producers. However, there is an
additional solution for the scenario in
Figure 32, namely, moving only one of
the sort operators, not both, into the con-
sumer processes.
If moving a sort operation into the con-
sumer process is not realistic, e.g., be-
cause the data already are sorted when
they are retrieved from disk as in a B-tree
scan, alternative parallel execution
strategies must be found that do not re-
quire repartitioning and merging of
sorted data between the producers and
consumers. There are two possible cases.
In the first case, if the input data are not
only sorted but also already partitioned
systematically, i.e., range or hash parti-
tioned, on the attribute(s) considered by
the consumer operator, e.g., the by-list of
an aggregate function or the join at-
tribute, the process boundary and data
exchange could be removed entirely. This
implies that the producer operator, e.g.,
the B-tree scan, and the consumer, e.g.,
the merge-join, are executed by the same
process group and therefore with the
same degree of parallelism.
In the second case, although sorted
on the relevant attribute within each
partition, the operator’s data could be
partitioned either round-robin or on a
different attribute. For a join, a
fragment-and-replicate matching strat-
egy could be used [Epstein et al. 1978;
Epstein and Stonebraker 1980; Lehman
et al. 1985], i.e., the join should execute
within the same threads as the operator
producing sorted output while the second
input is replicated to all instances of the
Figure 33. Merge depth as a function of the degree of parallelism (1 to 13), for total memory size M = 40 and input size R = 125,000.
join. Note that fragment-and-replicate
methods do not work correctly for semi-
join, outer join, difference, and union, i.e.,
when an item is replicated and is in-
serted (incorrectly) multiple times into
the global output. A second solution that
works for all operators, not only joins, is
to execute the consumer of the sorted
data in a single thread. Recall that mul-
tiple consumers are required for a dead-
lock to occur. A third solution that is
correct for all operators is to send dummy
items containing the largest key seen so
far from a producer to a consumer if no
data have been exchanged for a predeter-
mined amount of time (data volume, key
range). In the examples above, if a pro-
ducer must send a key to all consumers
at least after every 100 data items pro-
cessed in the producer, the required
buffer space is bounded, and deadlock
can be avoided. In some sense, this solu-
tion is very simple; however, it requires
that not only the data exchange mecha-
nism but also sort-based algorithms such
as merge-join must “understand” dummy
items. Another solution is to exchange all
data without regard to sort order, i.e., to
omit merging in the data exchange mech-
anism, and to sort explicitly after repar-
titioning is complete. For this sort, re-
placement selection might be more effec-
tive than quicksort for generating initial
runs because the runs would probably
be much larger than twice the size of
memory.
A final remark on deadlock avoidance:
Since deadlock can only occur if the con-
sumer process merges, i.e., not only the
producer but also the consumer operator
try to determine the order in which data
cross process boundaries, the deadlock
problem only exists in a query execution
engine based on sort-based set-processing
algorithms. If hash-based algorithms
were used for aggregation, duplicate re-
moval, join, semi-join, outer join, inter-
section, difference, and union, the need
for merging and therefore the danger of
deadlock would vanish.
An interesting parallel sorting method
with balanced communication and with-
out the possibility of deadlock in spite of
local sort followed by data exchange (if
the data distribution is known a priori) is
to sort locally only by the position within
the final partition and then exchange
data guaranteeing a balanced data flow.
This method might be best seen in an
example: Consider 10 partitions with key
values from 0 to 999 in a uniform distribution. The goal is to have all key values between 0 and 99 sorted on site 0, between 100 and 199 sorted on site 1, etc. First,
each partition is sorted locally at its orig-
inal site, without data exchange, on the
last two digits only, ignoring the first
digit. Thus, each site has a sequence such
as 200, 301, 401, 902, 2, 603, 804, 605,
105, 705, ..., 999, 399. Now each site
sends data to its correct final destination.
Notice that each site sends data simulta-
neously to all other sites, creating a bal-
anced data flow among all producers and
consumers. While this method seems ele-
gant, its problem is that it requires fairly
detailed distribution information to en-
sure the desired balanced data flow.
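The example can be simulated in a few lines (a sketch that assumes the uniform 0–999 distribution described above; the lock-step loop stands in for the synchronized exchange):

    import random

    SITES = 10
    local = [random.sample(range(1000), 60) for _ in range(SITES)]

    # Local sort by position within the final partition: the last two
    # digits only, ignoring the leading (destination) digit.
    for partition in local:
        partition.sort(key=lambda k: k % 100)

    # Exchange in lock step over positions; every site sends to all
    # sites simultaneously, giving a balanced data flow, and each
    # destination receives its keys already in sorted order.
    received = [[] for _ in range(SITES)]
    for position in range(100):
        for partition in local:
            for k in partition:
                if k % 100 == position:
                    received[k // 100].append(k)

    assert all(r == sorted(r) for r in received)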
In shared-memory machines, memory
must be divided over all concurrent sort
processes. Thus, the more processes are
active, the less memory each one can get.
The importance of this memory division
is the limitation it puts on the size of
initial runs and on the fan-in in each
merge process. In other words, large de-
grees of parallelism may impede perfor-
mance because they increase the number
of merge levels. Figure 33 shows how the
number of merge levels grows with in-
creasing degrees of parallelism, i.e., de-
creasing memory per process and merge
fan-in. For input size R, total memory size M, and P parallel processes, the merge depth L is L = log_{M/P-1}((R/P)/(M/P)) = log_{M/P-1}(R/M). The optimal degree of parallelism must be determined considering the tradeoff between parallel processing and large fan-ins, somewhat similar to the tradeoff between fan-in and cluster size. Extending
this argument using the duality of sort-
ing and hashing, too much parallelism in
hash partitioning on shared-memory ma-
chines can also be detrimental, both for
aggregation and for binary matching
[Hong and Stonebraker 1993].
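The merge-depth formula above is easy to evaluate; the following sketch (illustrative only, using the parameters of Figure 33) shows how the depth grows as processes divide the available memory:

    import math

    def merge_depth(R, M, P):
        # Each of the P processes gets memory M/P, so its merge fan-in
        # is M/P - 1, and it creates initial runs of about M/P units,
        # i.e., R/M runs per process (requires M/P > 2).
        fan_in = M / P - 1
        runs_per_process = R / M
        return math.log(runs_per_process, fan_in)

    for P in (1, 3, 5, 7, 9, 11, 13):
        print(P, round(merge_depth(R=125000, M=40, P=P), 1))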
10.3 Parallel Aggregation and Duplicate
Removal
Parallel algorithms for aggregation and
duplicate removal are best divided into a
local step and a global step. First, dupli-
cates are eliminated locally, and then
data are partitioned to detect and re-
move duplicates from different original
sites. For aggregation, local and global
aggregate functions may differ. For ex-
ample, to perform a global count, the
local aggregation counts while the global
aggregation sums local counts into a
global count.
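As a minimal sketch of this division of labor (hypothetical names), counting items per group proceeds in two steps:

    from collections import Counter

    def local_count(partition, group_of):
        # Local aggregation: count within each site's partition.
        return Counter(group_of(item) for item in partition)

    def global_count(local_counts):
        # Global aggregation: sum the local counts into global counts.
        total = Counter()
        for counts in local_counts:
            total.update(counts)
        return total

    partitions = [["a", "b", "a"], ["b", "c", "b"]]   # two hypothetical sites
    print(global_count([local_count(p, lambda x: x) for p in partitions]))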
For local hash-based aggregation, a
special technique might improve perfor-
mance. Instead of creating overflow files
locally to resolve hash table overflow,
items can be moved directly to their final site. Hopefully, this site can aggregate them immediately into the local hash table because a similar item already exists. In many recent distributed-memory machines, it is faster to ship an item to another site than to do a local disk I/O. In fact, some distributed-memory vendors attach disk drives not to the primary processing nodes but to special “I/O nodes” because network delay is negligible compared to I/O time, e.g., in Intel’s iPSC/2 and its subsequent parallel architectures.
The advantage is that disk I/O is required only when the aggregation output size does not fit into the aggregate memory available on all machines, while the standard local aggregation-exchange-global aggregation scheme requires local disk I/O if any local output size does not fit into a local memory. The difference between the two is determined by the
degree to which the original input is al-
ready partitioned (usually not at all),
making this technique very beneficial.
10.4 Parallel Joins and Other Binary
Matching Operations
Binary matching operations such as join,
semi-join, outer join, intersection, union,
and difference are different than the pre-
vious operations exactly because they are
binary. For bushy parallelism, i.e., a join
for which two subplans create the two
inputs independently from one another
in parallel, we might consider symmetric
hash join algorithms. Instead of differen-
tiating between build and probe
inputs, the symmetric hash join uses two
hash tables, one for each input. When a data item (or packet of items) arrives, the join algorithm first determines which input it came from and then joins the new data item with the hash table built
from the other input as well as inserting
the new data item into its hash table
such that data items from the other in-
put arriving later can be joined correctly.
Such a symmetric hash join algorithm
has been used in XPRS, a shared-
memory high-performance extensible-
relational database system [Hong and
Stonebraker 1991; 1993; Stonebraker
et al. 1988a; 1988b] as well as in Pris-
ma/DB, a shared-nothing main-memory
database system [Wilschut 1993; Wil-
schut and Apers 1993]. The advantage
of symmetric matching algorithms is that
they are independent of the data rates of
the inputs; their disadvantage is that
they require that both inputs fit in mem-
ory, although one hash table can be
dropped when one input is exhausted.
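A minimal single-process sketch conveys the idea (the interleaved arrivals stream stands in for two independent parallel inputs; all names are hypothetical):

    from collections import defaultdict

    def symmetric_hash_join(arrivals, key_left, key_right):
        # arrivals: a stream of ('L', item) and ('R', item) pairs, in any order.
        tables = {"L": defaultdict(list), "R": defaultdict(list)}
        keys = {"L": key_left, "R": key_right}
        for side, item in arrivals:
            other = "R" if side == "L" else "L"
            k = keys[side](item)
            tables[side][k].append(item)       # insert into this input's table
            for match in tables[other][k]:     # probe the other input's table
                yield (item, match) if side == "L" else (match, item)

Each result pair is produced exactly once, when its later item arrives, which is why the algorithm is independent of the two inputs’ data rates.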
For parallelizing a single binary
matching operation, there are basically
two techniques, called here symmetric
partitioning and fragment and replicate.
In both cases, the global result is the
union (concatenation) of all local results.
Some algorithms exploit the topology of
certain architectures, e.g., ring- or cube-
based communication networks [Baru
and Frieder 1989; Omiecinski and Lin
1989].
In the symmetric partitioning meth-
ods, both inputs are partitioned on the
attributes relevant to the operation (i.e.,
the join attribute for joins or all at-
tributes for set operations), and then the
operation is performed at each site. Both
the Gamma and the Teradata database
machines use this method. Notice that
the partitioning method (usually hashed)
and the local join method are indepen-
dent of each other; Gamma and Grace
use hash joins while Teradata uses
merge-join.
In the fragment-and-replicate meth-
ods, one input is partitioned, and the
other one is broadcast to all sites. Typi-
cally, the larger input is partitioned by
not moving it at all, i.e., the existing
partitions are processed at their loca-
tions prior to the binary matching opera-
tion. Fragment-and-replicate methods
were considered the join algorithms of
choice in early distributed database sys-
tems such as R*, SDD-1, and distributed
Ingres, because communication costs
overshadowed local processing costs and
because it was cheaper to send a small
input to a small number of sites than to
partition both a small and a large input.
Note that fragment-and-replicate meth-
ods do not work correctly for semi-join,
outer join, difference, and union, namely,
when an item is replicated and is in-
serted into the output (incorrectly) multi-
ple times.
A technique for reducing network traf-
fic during join processing in distributed
database systems uses redundant semi-
joins [Bernstein et al. 1981; Chiu and Ho
1980; Gouda and Dayal 1981], an idea
that can also be used in distributed-
memory parallel systems. For example,
consider the join on a common attribute
A of relations R and S stored on two
different nodes in a network, say r and
s. The semi-join method transfers a
duplicate-free projection of R on A to s,
performs a semi-join there to determine
the items in S that actually participate
in the join result, and ships these items
to r for the actual join. In other words,
based on the relational algebra law that
R JOIN S = R JOIN (S SEMIJOIN R),
cost savings of not shipping all of S were
realized at the expense of projecting and
shipping the R.A column and executing
the semi-join. Of course, this idea can be
used symmetrically to reduce R or S or
both, and all operations (projection, du-
plicate removal, semi-join, and final join)
can be executed in parallel on both r and
s or on more than two nodes using the
parallel join strategies discussed earlier
in this section. Furthermore, there are
probabilistic variants of this idea that
use bit vector filtering instead of semi-
joins, discussed later in its own section.
Roussopoulos and Kang [1991] recently
showed that symmetric semi-joins are
particularly useful. Using the equalities (for a join of relations R and S on attribute A)

R JOIN S
= R JOIN (S SEMIJOIN π_A R)
= (R SEMIJOIN π_A(S SEMIJOIN π_A R))
  JOIN (S SEMIJOIN π_A R)   (a)
= (R ANTISEMIJOIN (π_A R ANTISEMIJOIN S))
  JOIN (S SEMIJOIN π_A R),  (b)

where ANTISEMIJOIN denotes the anti-semi-join, which determines those items in the first input that do not have a match in the second input,
they designed a four-step procedure to compute the join of two relations stored at two sites. First, the first relation’s join attribute column R.A is sent duplicate-free to the other relation’s site, s. Second, the first semi-join is computed at s, and either the matching values (term (a) above) or the nonmatching values (term (b) above) of the join column S.A are sent back to the first site, r. The choice between (a) and (b) is made based on the numbers of matching and nonmatching values of S.A. Third, site r determines
which items of R will participate in the
join R JOIN S, i.e., R SEMIJOIN S.
Fourth, both input sites send exactly
those items that will participate in the
join R JOIN S to the site that will com-
pute the final result, which may or may
not be one of the two input sites. Of
course, this two-site algorithm can be
used across any number of sites in a
parallel query evaluation system.
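The four steps translate into a short sketch (a simplification that represents each relation as (A, payload) pairs and performs the final join at site r; the anti-semi-join appears as an explicit set difference):

    def two_way_semijoin_join(R, S):
        # Step 1: ship the duplicate-free join column R.A to site s.
        R_A = {a for a, _ in R}
        # Step 2: semi-join at s; send back the smaller of the matching
        # and nonmatching value sets (terms (a) and (b) above).
        matching = {a for a, _ in S if a in R_A}
        nonmatching = R_A - matching
        # Step 3: site r determines R's participating items either way.
        if len(matching) <= len(nonmatching):
            R_part = [(a, x) for a, x in R if a in matching]
        else:
            R_part = [(a, x) for a, x in R if a not in nonmatching]
        S_part = [(a, y) for a, y in S if a in matching]
        # Step 4: only participating items are shipped to the join site.
        return [(a, x, y) for a, x in R_part for b, y in S_part if a == b]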
Typically, each data item is exchanged
only once across the interconnection net-
work in a parallel algorithm. However,
for parallel systems with small communi-
cation overhead, in particular for
shared-memory systems, and in parallel
processing systems with processors with-
out local disk(s), it may be useful to
spread each overflow file over all avail-
able nodes and disks in the system. The
disadvantage of the scheme may be
communication overhead; however, the
advantages of load balancing and cumu-
lative bandwidth while reading a parti-
tion file have led to the use of this scheme
both in the Gamma and SDC database
machines, called bucket spreading in the
SDC design [DeWitt et al. 1990;
Kitsuregawa and Ogawa 1990].
For parallel non-equi-joins, a symmet-
ric fragment-and-replicate method has
been proposed by Stamos and Young
[Stamos and Young 1989]. As shown in
Figure 34, processors are organized into
rows and columns. One input relation is
partitioned over rows, and partitions are
Figure 34. Symmetric fragment-and-replicate join (one input partitioned over rows of processors, the other over columns).
replicated within each row, while the
other input is partitioned and replicated
over columns. Each item from one input
“meets” each item from the other input
at exactly one site, and the global join
result is the concatenation of all local
joins.
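A sketch of the placement logic (hypothetical names; each site would then run any local join algorithm on its R and S fragments, and the global result is the concatenation of all local results):

    def symmetric_fragment_and_replicate(R, S, rows, columns):
        # Site (i, j) holds row fragment i of R and column fragment j of S,
        # so every R item meets every S item at exactly one site.
        sites = [[([], []) for _ in range(columns)] for _ in range(rows)]
        for n, r in enumerate(R):
            i = n % rows                       # partition R over rows
            for j in range(columns):           # replicate within the row
                sites[i][j][0].append(r)
        for n, s in enumerate(S):
            j = n % columns                    # partition S over columns
            for i in range(rows):              # replicate within the column
                sites[i][j][1].append(s)
        return sites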
Avoiding partitioning as well as broad-
casting for many joins can be accom-
plished with a physical database design
that considers frequently performed joins
and distributes and replicates data over
the nodes of a parallel or distributed sys-
tem such that many joins already have
their input data suitably partitioned.
Katz and Wong formalized this notion as
local sufficiency [Katz and Wong 1983;
Wong and Katz 1983]; more recent re-
search on the issue was performed in the
Bubba project [Copeland et al. 1988].
For joins in distributed systems, a third
class of algorithms, called fetch-as-
needed, was explored. The idea of these
algorithms is that one site performs the
join by explicitly requesting (fetching)
only those items from the other input
needed to perform the join [Daniels and
Ng 1982; Williams et al. 1982]. If one
input is very small, fetching only the
necessary items of the larger input might
seem advantageous. However, this algo-
rithm is a particularly poor implementa-
tion of a semi-join technique discussed
above. Instead of requesting items or val-
ues one by one, it seems better to first
project all join attribute values, ship
(stream) them across the network, per-
form the semi-join using any local binary
matching algorithm, and then stream ex-
actly those items back that will be re-
quired for the join to the first site.
The difference between the semi-join
technique and fetch-as-needed is that the
semi-join scans the first input twice, once
to extract the join values and once to
perform the real join, while fetch-as-needed works on each data item
only once.
10.5 Parallel Universal Quantification
In our earlier discussion on sequential
universal quantification, we discussed
four algorithms for universal quantifica-
tion or relational division, namely, naive
division (a direct, sort-based algorithm),
hash-division (direct, hash based), and
sort- and hash-based aggregation (indi-
rect) algorithms, which might require
semi-joins and duplicate removal in the
inputs.
For naive division, pipelining can be
used between the two sort operators and
the division operator. However, both quo-
tient partitioning and divisor partition-
ing can be employed as described below
for hash-division.
For algorithms based on aggregation,
both pipelining and partitioning can be
applied immediately using standard
techniques for parallel query execution.
While partitioning seems to be a promis-
ing approach, it has an inherent problem
due to the possible need for a semi-join.
Recall that in the example for universal
quantification using Transcript and
Course relations, the join attribute in the
semi-join (course-no) is different than the
grouping attribute in the subsequent ag-
gregation (student-id). Thus, the Tran-
script relation has to be partitioned twice,
once for the semi-join and once for the
aggregation.
For hash-division, pipelining has only
limited promise because the entire
division is performed within a single
operator. However, both partitioning
strategies discussed earlier for hash table
overflow can be employed for parallel ex-
ecution, i.e., quotient partitioning and di-
visor partitioning [Graefe 1989; Graefe
and Cole 1993].
For hash-division with quotient par-
titioning, the divisor table must be
replicated in the main memory of all par-
ticipating processors. After replication,
all local hash-division operators work
completely independent of each other.
Clearly, replication is trivial for shared-
memory machines, in particular since a
single copy of the divisor table can be
shared without synchronization among
multiple processes once it is complete.
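A sketch of hash-division with quotient partitioning (hypothetical names; the dividend is represented as a set of (quotient, divisor) pairs):

    def hash_division(dividend, divisors):
        # The divisor table maps each divisor value to a bit position; the
        # quotient table maps each quotient candidate to its bit map.
        bit = {d: i for i, d in enumerate(divisors)}
        bitmaps = {}
        for q, d in dividend:
            if d in bit:
                bitmaps.setdefault(q, set()).add(bit[d])
        return [q for q, bits in bitmaps.items() if len(bits) == len(bit)]

    def quotient_partitioned_division(dividend, divisors, degree):
        # The divisor table is replicated; the dividend is partitioned on
        # the quotient attribute, and the partitions are divided independently.
        parts = [[] for _ in range(degree)]
        for q, d in dividend:
            parts[hash(q) % degree].append((q, d))
        result = []
        for part in parts:                     # conceptually in parallel
            result.extend(hash_division(part, divisors))
        return result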
When using divisor partitioning, the
resulting partitions are processed in par-
allel instead of in phases as discussed for
hash table overflow. However, instead of
tagging the quotient items with phase
numbers, processor network addresses
are attached to the data items, and the
collection site divides the set of all incom-
ing data items over the set of processor
network addresses. In the case that the
central collection site is a bottleneck, the
collection step can be decentralized using
quotient partitioning.
11. NONSTANDARD QUERY PROCESSING
ALGORITHMS
In this section, we briefly review the
query processing needs of data models
and database systems for nonstandard
applications. In many cases, the logical
operators defined for new data models
can use existing algorithms, e.g., for in-
tersection. The reason is that for process-
ing, bulk data types such as array, set,
bag (multi-set), or list are represented as
sequences similar to the streams used in
the query processing techniques dis-
cussed earlier, and the algorithms to ma-
nipulate these bulk types are equal to
the ones used for sets of tuples, i.e., rela-
tions. However, some algorithms are gen-
uinely different from the algorithms we
have surveyed so far. In this section, we
review operators for nested relations,
temporal and scientific databases,
object-oriented databases, and more
meta-operators for additional query pro-
cessing control.
There are several reasons for integrat-
ing these operators into an algebraic
query-processing system. First, it per-
mits efficient data transfer from the
database to the application embodied in
these operators. The interface between
database operators is designed to be as
efficient as possible; the same efficient
interface should also be used for applica-
tions. Second, operator implementors can
take advantage of the control provided by
the meta-operators. For example, an op-
erator for a scientific application can
be implemented in a single-process envi-
ronment and later parallelized with the
exchange operator. Third, query opti-
mization based on algebraic transformation rules can cover all operators, includ-
ing operations that are normally consid-
ered database application code. For ex-
ample, using algebraic optimization tools
such as the EXODUS and Volcano opti-
mizer generators [Graefe and DeWitt
1987; Graefe et al. 1992; Graefe and
McKenna 1993], optimization rules that
can move an unusual database operator
in a query plan are easy to implement.
For a sampling operator, a rule might
permit transforming an algebra expres-
sion to query a sample instead of sam-
pling a query result.
11.1 Nested Relations
Nested relations, or Non-First-Normal-
Form (NF2) relations, permit relation-
valued attributes in addition to atomic
values such as integers and strings used
in the normal or “flat” relational model.
For example, in an order-processing ap-
plication, the set of individual line items
on each order could be represented as a
nested relation, i.e., as part of an order
tuple. Figure 35 shows an NF2 relation
with two tuples with two and three nested
tuples and the equivalent normalized re-
lations, which we call the master and
detail relations. Nested relations can be
used for all one-to-many relationships but
are particularly well suited for the repre-
sentation of “weak entities” in the
Entity-Relationship (ER) Model [Chen
1976], i.e., entities whose existence and
identification depend on another entity
as for order entries in Figure 35. In gen-
eral, nested subtuples may include rela-
tion-valued attributes, with arbitrary
nesting depth. The advantages of the NF2
model are that component relationships
can be represented more naturally than
in the fully normalized model; many fre-
quent join operations can be avoided, and
structural information can be used for
physical clustering. Its disadvantage is
the added complexity, in particular, in
storage management and query
processing.
Several algebras for nested relations have been defined, e.g., Deshpande and Larson [1991], Ozsoyoglu et al. [1987], Roth et al. [1988], Schek and Scholl
[1986], Tansel and Garnett [1992]. Our
discussion here focuses not on the concep-
tual design of NF2 algebras but on algo-
rithms to manipulate nested relations.
Two operations required in NF2 database systems are operations that transform an NF2 relation into a normalized relation with atomic attributes only, and vice versa. The first operation is
frequently called unnest or flatten; the
opposite direction is called the nest oper-
ation. The unnest operation can be per-
formed in a single scan over the NF2
relation that includes the nested subtu-
ples; both normalized relations in Figure
35 and their join can be derived readily
enough from the NF2 relation. The nest
operation requires grouping of tuples in
the detail relation and a join with the
master relation. Grouping and join can
be implemented using any of the algo-
rithms for aggregate functions and bi-
nary matching discussed earlier, i.e., sort-
and hash-based sequential and parallel
methods. However, in order to ensure
that unnest and nest operations are exact inverses of each other, some struc-
tural information might have to be pre-
served in the unnest operation. Ozsoyoglu
and Wang [1992] present a recent inves-
tigation of “keying methods” for this pur-
pose.
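Using the data of Figure 35, unnest and nest can be sketched as follows (a simplification that uses the order number as the key preserved by unnest):

    orders = [   # the NF2 relation of Figure 35
        {"order_no": 110, "customer_no": 911, "date": 910902,
         "items": [(4711, 8), (2345, 7)]},
        {"order_no": 112, "customer_no": 912, "date": 910902,
         "items": [(9876, 3), (2222, 1), (2357, 9)]},
    ]

    def unnest(orders):
        # A single scan yields the equivalent master and detail relations.
        master = [(o["order_no"], o["customer_no"], o["date"]) for o in orders]
        detail = [(o["order_no"], p, c) for o in orders for p, c in o["items"]]
        return master, detail

    def nest(master, detail):
        # Group the detail relation and join the groups with the master.
        groups = {}
        for order_no, part_no, count in detail:
            groups.setdefault(order_no, []).append((part_no, count))
        return [{"order_no": o, "customer_no": c, "date": d,
                 "items": groups.get(o, [])} for o, c, d in master]

    m, dt = unnest(orders)
    assert nest(m, dt) == orders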
All operations defined for flat relations
can also be defined for nested relations,
in particular: selection, join, and set op-
erations (union, intersection, difference).
For selections, additional power is gained
with selection conditions on subtuples
and sets of subtuples using set compar-
isons or existential or universal quantification. In principle, since a nested rela-
Order-No  Customer-No  Date     Items (Part-No, Count)
110       911          910902   (4711, 8), (2345, 7)
112       912          910902   (9876, 3), (2222, 1), (2357, 9)

Order-No  Customer-No  Date         Order-No  Part-No  Quantity
110       911          910902       110       4711     8
112       912          910902       110       2345     7
                                    112       9876     3
                                    112       2222     1
                                    112       2357     9

Figure 35. Nested relation and equivalent flat relations.
tion is a relation, any relational calculus
and algebra expression should be permit-
ted for it. In the example in Figure 35,
there may be a selection of orders in
which the ordered quantity of all items is
more than 100, which is a universal
quantification. The algorithms for selec-
tions with quantifier are similar to the
ones discussed earlier for flat relations,
e.g., relational semi-join and division, but
are easier to implement because the
grouping process built into the flat-
relational algorithms is inherent in the
nested tuple structure.
For joins, similar considerations apply.
Matching algorithms discussed earlier
can be used in principle. They may be
more complex if the join predicate in-
volves subrelations, and algorithm com-
binations may be required that are de-
rived from a flat-relation query over flat
relations equivalent to the NF 2 query
over the nested relations. However, there
should be some performance improve-
ments possible if the grouping of values
in the nested relations can be exploited,
as for example, in the join algorithms
described by Rosenthal et al. [1991].
Deshpande and Larson [ 1992] investi-
gated join algorithms for nested relations
because “the purpose of nesting in order
to store precomputed joins is defeated
if it is unnested every time a join is
performed on a subrelation.” Their algo-
rithm, (parallel) partitioned nested-
hashed-loops, joins one relation’s subre-
lations with a second, flat relation by
creating an in-memory hash table with
the flat relation. If the flat relation is
larger than memory, memory-sized seg-
ments are loaded one at a time, and the
nested relation is scanned repeatedly.
Since an outer tuple of the nested rela-
tion might have matches in multiple seg-
ments of the flat relation, a final merging
pass is required. This join algorithm is
reminiscent of hash-division, the flat re-
lation taking the role of the divisor and
the nested tuples replacing the quotient
table entries with their bit maps.
Sort-based join algorithms for nested
relations require either flattening the
nested tuples or scanning the sorted flat
relation for each nested tuple, somewhat
reminiscent of naive division. Neither al-
ternative seems very promising for large
inputs. Sort semantics and appropriate
sort algorithms including duplicate re-
moval and grouping have been consid-
ered by Saake et al. [1989] and Kuespert
et al. [1989]. Other researchers have fo-
cused on storage and retrieval methods
for nested relations and operations possi-
ble with single scans [Dadam et al. 1986;
Deppisch et al. 1986; Deshpande and Van
Gucht 1988; Hafez and Ozsoyoglu 1988;
Ozsoyoglu and Wang 1992; Scholl et al.
1987; Scholl 1988].
11.2 Temporal and Scientific Database
Management
For a variety of reasons, management
and manipulation of statistical, tempo-
ral, and scientific data are gaining inter-
est in the database research community.
Most work on temporal databases has
focused on semantics and representation
in data models and query languages
[McKenzie and Snodgrass 1991; Snod-
grass 1990]; some work has considered
special storage structures, e.g., Ahn and
Snodgrass [1988], Lomet and Salzberg
[ 1990b], Rotem and Segev [1987], Sever-
ance and Lehman [1976], algebraic oper-
ators, e.g., temporal joins [Gunadhi and
Segev 1991], and optimization of tempo-
ral queries, e.g., Gunadhi and Segev
[1990], Leung and Muntz [1990; 1992],
Segev and Gunadhi [1989]. While logical
query algebras require extensions to ac-
commodate time, only some storage
structures and algorithms, e.g., multidi-
mensional indices, differential files, and
versioning, and the need for approximate
selection and matching (join) predicates
are new in the query execution algo-
rithms for temporal databases.
A number of operators can be identi-
fied that both add functionality to
database systems used to process scien-
tific data and fit into the database query
processing paradigm. DeWitt et al.
[1991a] considered algorithms for join
predicates that express proximity, i.e.,
join predicates of the form R. A – c1 <
S.B < R.A + Cz for some constants c1
and Cz. Such join predicates are very
different from the usual use of relational
join. They do not reestablish relation-
ships based on identifying keys but match
data values that express a dimension in
which distance can be defined, in partic-
ular, time. Traditionally, such join predi-
cates have been considered non-equi-joins
and were evaluated by a variant of
nested-loops join. However, such “band
joins” can be executed much more effi-
ciently by a variant of merge-join that
keeps a “window” of inner relation tuples
in memory or by a variant of hash join
that uses range partitioning and assigns
some build tuples to multiple partition
files. A similar partitioning model must
be used for parallel execution, requiring
multi-cast for some tuples. Clearly, these
variants of merge-join and hash join will
outperform nested loops for large inputs,
unless the band is so wide that the join
result approaches the Cartesian product.
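A sketch of the merge-join variant with a sliding window of inner tuples (assuming, for illustration, that both inputs are lists of numeric values already sorted, with c1, c2 >= 0):

    def band_join(outer, inner, c1, c2):
        # For each outer value a, report all inner values b with
        # a - c1 < b < a + c2; the window start only moves forward.
        results, start = [], 0
        for a in outer:
            while start < len(inner) and inner[start] <= a - c1:
                start += 1                     # slide the window's lower edge
            i = start
            while i < len(inner) and inner[i] < a + c2:
                results.append((a, inner[i]))
                i += 1
        return results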
For storage and management of the
massive amounts of data resulting from
scientific experiments, database tech-
niques are very desirable. Operators for
processing time series in scientific
databases are based on an interpretation
of a stream between operators not as a
set of items (as in most database applica-
tions) but as a sequence in which the
order of items in a stream has semantic
meaning. For example, data reduction
using interpolation as well as extrapola-
tion can be performed within the stream
paradigm. Similarly, digital filtering
[Hamming 1977] also fits the stream-
processing protocol very easily. Interpo-
lation, extrapolation, and digital filtering
were implemented in the Volcano system
with a single algorithm (physical opera-
tor) to verify this fit, including their opti-
mization and parallelization [Graefe and
Wolniewicz 1992; Wolniewicz and Graefe
1993]. Another promising candidate is vi-
sualization of single-dimensional arrays
such as time series.
Problems that do not fit the stream
paradigm, e.g., many matrix operations
such as transformations used in linear
algebra, Laplace or Fast Fourier Trans-
form, and slab (multidimensional sub-
array) extraction, are not as easy to
integrate into database query processing
systems. Some of them seem to fit better
into the storage management subsystem
rather than the algebraic query execu-
tion engine. For example, slab extraction
has been integrated into the NetCDF
storage and access software [Rew and
Davis 1990; Unidata 1991]. However, it
is interesting to note that sorting is a
suitable algorithm for permuting the lin-
ear representation of a multidimensional
array, e.g., to modify the hierarchy of
dimensions in the linearization (row- vs.
column-major linearization). Since the
final position of all elements can be
predicted from the beginning of the op-
eration, such “sort” algorithms can be
based on merging or range partitioning
(which is yet another example of the
duality of sort- and hash- (partitioning-)
based data manipulation algorithms).
11.3 Object-Oriented Database Systems
Research into query processing for exten-
sible and object-oriented systems has
been growing rapidly in the last few
years. Most proposals or implementa-
tions use algebras for query processing,
e.g., Albert [1991], Cluet et al. [1989],
Graefe and Maier [1988], Guo et al.
[1991], Mitschang [1989], Shaw and
Zdonik [1989a; 1989b; 1990], Straube and
Ozsu [1989], Vandenberg and DeWitt
[1991], Yu and Osborn [1991]. These al-
gebras resemble relational algebra in the
sense that they focus on bulk data types
but are generalized to support operations
on arrays, lists, etc., user-defined opera-
tions (methods) on instances, heteroge-
neous bulk types, and inheritance. The
use of algebras permits several impor-
tant conclusions. First, naive execution
models that execute programs as if all
data were in memory are not the only
alternative. Second, data manipulation
operators can be designed and imple-
mented that go beyond data retrieval and
permit some amount of data reduction,
aggregation, and even inference. Third,
algebraic execution techniques including
the stream paradigm and parallel execu-
tion can be used in object-oriented data
models and database systems. Fourth, al-
gebraic optimization techniques will con-
tinue to be useful.
Associative operations are an impor-
tant part in all object-oriented algebras
because they permit reducing large
amounts of data to the interesting subset
of the database suitable for further con-
sideration and processing. Thus, set-
processing and set-matching algorithms
as discussed earlier in this survey will be
found in object-oriented systems, imple-
mented in such a way that they can oper-
ate on heterogeneous sets. The challenge
for query optimization is to map a com-
plex query involving complex behavior
and complex object structures to primi-
tives available in a query execution en-
gine. Translating an initial request with
abstract data types and encapsulated be-
havior coded in a computationally com-
plete language into an internal form that
both captures the entire query’s seman-
tics and allows effective query optimiza-
tion is still an open research issue
[Daniels et al. 1991; Graefe and Maier
1988].
Beyond associative indices discussed
earlier, object-oriented systems can also
benefit from special relationship indices,
i.e., indices that contain condensed infor-
mation about interobject references. In
principle, these index structures are sim-
ilar to join indices [Valduriez 1987] but
can be generalized to support multiple
levels of referencing. Examples for in-
dices in object-oriented database systems
include the work of Maier and Stein
[1986] in the Gemstone object-oriented
database system product, Bertino [1990;
1991] and Bertino and Kim [1989], in the
Orion project, and Kemper et al. [1991]
and Kemper and Moerkotte [1990a;
1990b] in the GOM project. At this point,
it is too early to decide which index
structures will be the most useful be-
cause the entire field of query processing
in object-oriented systems is still devel-
oping rapidly, from query languages to
algebra design, algorithm repertoire, and
optimization techniques. Other areas of
intense current research interest are
buffer management and clustering of
objects on disk.
One of the big performance penalties
in object-oriented database systems is
“pointer chasing” (using OID references)
which may involve object faults and disk
read operations at widely scattered loca-
tions, also called “goto’s on disk.” In or-
der to reduce I/O costs, some systems
use what amounts to main-memory
databases or map the entire database
into virtual memory. For systems with
an explicit database on disk and an in-
memory buffer, there are various tech-
niques to detect object faults; some
commercial object-oriented database sys-
tems use hardware mechanisms origi-
nally perceived and implemented for
virtual-memory systems. While such
hardware support makes fault detection
faster, it does not address the problem of expensive I/O operations. In order to reduce actual I/O cost, read-ahead and
planned buffering must be used. Palmer
and Zdonik [1991] recently proposed
keeping access patterns or sequences and
activating read-ahead if accesses equal
or similar to a stored pattern are detected. Another recent proposal for effi-
cient assembly of complex objects uses a
window (a small set) of open references
and resolves, at any point of time, the
most convenient one by fetching this ob-
ject or component from disk, which has
shown dramatic improvements in disk
seek times and makes complex object re-
trieval more efficient and more indepen-
dent of object clustering [Keller et al. 1991]. Policies and mechanisms for effi-
cient parallel complex object assembly are
an important challenge for the develop-
ers of next-generation object-oriented
database management systems [Maier et
al. 1992].
11.4 More Control Operators
The exchange operator used for parallel
query processing is not a normal opera-
tor in the sense that it does not manipu-
late, select, or transform data. Instead,
the exchange operator provides control of
query processing in a way orthogonal to
what a query does and what algorithms
it uses. Therefore, we call it a meta-
or control operator. There are several
other control operators that can be used in database query processing, and we
survey them briefly in this section.
In situations in which an intermediate
result is used repeatedly, e.g., a nested-
loops join with a composite inner input,
either the intermediate result is derived
many times, or it is saved in a temporary
file during its first derivation and then
retrieved from this file while serving sub-
sequent requests. This situation arises
not only with nested-loops join but also
with other algorithms, e.g., sort-based
universal quantification [Smith and
Chang 1975]. Thus, it might be useful to
encapsulate this functionality in a new
algorithm, which we call the store-and-
scan operator.
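In its simplest form, such an operator might look as follows (a minimal in-memory sketch with hypothetical names; a real implementation would spool to a temporary file and support the generalizations discussed next):

    class StoreAndScan:
        # Derives its input once, on the first scan, and saves the result;
        # subsequent scans read the saved result instead of rederiving it.
        def __init__(self, input_plan):
            self.input_plan = input_plan      # callable returning an iterator
            self.saved = None

        def scan(self):
            if self.saved is None:
                self.saved = list(self.input_plan())   # first derivation
            return iter(self.saved)                    # serve later requests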
The store-and-scan operator permits
three generalizations. First, if the first
consumption of the intermediate result
might actually not need it entirely, e.g., a
nested-loops semi-join which terminates
each inner scan after the first match, the
operator should be switched to derive only
the necessary data items (which implies
leaving the input plan ready to produce
more data later) or to save the entire
intermediate result in the temporary file
right away in order to permit release of
all resources in the subplan. Second, sub-
sequent scans might permit starting not
at the beginning of the temporary file but
at some later point. This version is useful
if many duplicates exist in the inputs of
one-to-one matching algorithms based on
merge-join. Third, in some execution
strategies for correlated SQL subqueries,
the plan corresponding to the inner block
is executed once for each tuple in the
outer block. The tuples of the outer block
provide different correlation values, al-
though each value may occur repeatedly.
In order to ensure that the inner plan is
executed only once for each outer correla-
tion value, the store-and-scan operator
could retain information about which part
of its temporary file corresponds to which
correlation value and restrict each scan
appropriately.
Another use of a temporary file is to
support common subexpressions, which
can be executed efficiently with an opera-
tor that passes the result of a common
subexpression to multiple consumers, as
mentioned briefly in the section on the
architecture of query execution engines.
The problem is that multiple consumers,
typically demand-driven and demand-
driving their inputs, will request items of
the common subexpression result at dif-
ferent times or rates. The two standard
solutions are either to execute the com-
mon subexpression into a temporary file
and let each consumer scan this file at
will or to determine which consumer will
be the first to require the result of the
common subexpression, to execute the
common subexpression as part of this
consumer and to create a file with the
common subexpression result as a by-
product of the first consumer’s execution.
Instead, we suggest a new meta-operator,
which we call the split operator, to be
placed at the top of the common subex-
pression’s plan and which can serve mul-
tiple consumers at their own paces. It
automatically performs buffering to ac-
count for different paces, uses temporary
disk space if the discrepancies are too
wide, and is suitably parameterized to
permit both standard solutions described
above.
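As an illustration, the following Python sketch (ours, and in-memory only) shows the core of such a split operator: each consumer draws items at its own pace, and the operator buffers only the items between the slowest and the fastest consumer; a complete implementation would spill to temporary disk space when that gap grows too wide, as described above.

    from collections import deque

    class Split:
        """Serve one common subexpression to several demand-driven
        consumers at their own paces, buffering the items between
        the slowest and the fastest consumer in memory."""
        def __init__(self, input_iterator, num_consumers):
            self.input = input_iterator
            self.queues = [deque() for _ in range(num_consumers)]

        def consumer(self, i):
            queue = self.queues[i]
            while True:
                if not queue:
                    # This consumer has caught up: demand one item from
                    # the common subexpression on behalf of all consumers.
                    try:
                        item = next(self.input)
                    except StopIteration:
                        return
                    for q in self.queues:
                        q.append(item)
                yield queue.popleft()

    split = Split(iter(range(4)), 2)
    left, right = split.consumer(0), split.consumer(1)
    assert next(left) == 0 and next(right) == 0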
In query processing systems, data flow
is usually paced or driven from the top,
the consumer. The leftmost diagram of
Figure 36 shows the control flow of nor-
mal iterators. (Notice that the arrows in
Figure 36 show the control flow; the data
flow of all diagrams in Figure 36 is as-
sumed to be upward. In data-driven data
flow, control and data flows point in the
same direction; in demand-driven data
flow, their directions oppose each other.)
However, in real-time systems that cap-
ture data from experiments, this ap-
proach may not be realistic because the
data source, e.g., a satellite receiver, has
to be able to unload data as they arrive.
In such systems, data-driven operators,
shown in the second diagram of Figure
36, might be more appropriate. To com-
bine the algorithms implemented and
used for query processing with such
real-time data capture requirements, one
could design data flow translation con-
trol operators. The first such operator,
which we call the active scheduler, can be
used between a demand-driven producer
and a data-driven consumer. In this case,
neither operator will schedule the other;
therefore, an active scheduler that de-
mands items from the producer and forces
them onto the consumer will glue these
two operators together. An active-
scheduler schematic is shown in the third
diagram of Figure 36. The opposite case,
a data-driven producer and a demand-
driven consumer, has two operators, each
trying to schedule the other one. A sec-
ond flow control operator, called the pas-
sive scheduler, can be built that accepts
procedure calls from either neighbor and
resumes the other neighbor in a corou-
tine fashion to ensure that the resumed
neighbor will eventually demand the item
the scheduler just received. The final dia-
gram of Figure 36 shows the control flow
of a passive scheduler. (Notice that this
case is similar to the bracket model of
parallel operator implementations dis-
cussed earlier in which an operating sys-
tem or networking software layer had to
be placed between data manipulation op-
erators and perform buffering and flow
control.)
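The active scheduler amounts to a very small piece of code. The sketch below (Python; the push/close interface of the data-driven consumer is an assumption of ours) demands items from a demand-driven producer and forces them onto a data-driven consumer:

    def active_scheduler(producer, consumer):
        """Glue a demand-driven producer (an iterator) to a
        data-driven consumer (an object accepting pushed items)."""
        for item in producer:    # demand an item from the producer
            consumer.push(item)  # force it onto the consumer
        consumer.close()         # propagate end-of-stream

    class PrintingConsumer:
        # A trivial data-driven consumer, for illustration only.
        def push(self, item):
            print(item)
        def close(self):
            print("end of stream")

    active_scheduler(iter(range(3)), PrintingConsumer())

The passive scheduler cannot be written quite this simply, because both neighbors call it; it must suspend and resume them in coroutine fashion, e.g., using threads or generators.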
Finally, for very complex queries, it
might be useful to break the data flow
between operators at some point, for two
reasons. First, if too many operators run
in parallel, contention for memory or
temporary disks might be too intense,
and none of the operators will run as
efficiently as possible. A long series of
hybrid hash joins in a right-deep query
plan illustrates this situation. Second,
due to the inherent error in selectivity
estimation during query optimization
[Ioannidis and Christodoulakis 1991;
Mannino et al. 1988], it might be worth-
while to execute only a subset of a plan,
verify the correctness of the estimation,
and then resume query processing with
another few steps. After a few processing
steps have been performed, their result
size and other statistical properties such
as minimum and maximum and approxi-
mate number of duplicate values can be
Figure 36. Operators, schedulers, and control flow. [Four diagrams showing the control flow of a standard iterator, a data-driven operator, an active scheduler, and a passive scheduler.]
easily determined while saving the result
on temporary disk.
In principle, this was done in Ingres’
original optimization method called De-
composition, except that Ingres per-
formed only one operation at a time
before optimizing the remaining query
[Wong and Youssefi 1976; Youssefi and
Wong 1979]. We propose alternating more
slowly between optimization and execu-
tion, i.e., to perform a “reasonable” num-
ber of steps between optimizations, where
reasonable may be three to ten selections
and joins depending on errors and error
propagation in selectivity estimation.
Stopping the data flow and resuming af-
ter additional optimization could very
well turn out to be the most reliable
technique for very large complex queries.
Implementation of this technique could
be embodied in another control operator,
the choose-plan operator first described
in Graefe and Ward [1989]. Its current
implementation executes zero or more
subplans and then invokes a decision
function provided by the optimizer that
decides which of multiple equivalent
plans to execute depending on intermedi-
ate result statistics, current system load,
and run-time values of query parameters
unknown at optimization time. Unfor-
tunately, further research is needed to
develop techniques for placing such op-
erators in very complex query plans.
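A minimal sketch of the choose-plan idea, written in Python for illustration (the interface is ours; Graefe and Ward [1989] describe the actual operator): the decision function supplied by the optimizer is invoked when the operator is opened, and only the chosen plan is executed.

    class ChoosePlan:
        """Defer the choice among equivalent plans to execution time.
        decision_fn inspects run-time state (parameter values, result
        statistics, system load) and returns the index of the plan
        to execute; plans are zero-argument callables yielding items."""
        def __init__(self, decision_fn, *plans):
            self.decision_fn = decision_fn
            self.plans = plans

        def __iter__(self):
            chosen = self.plans[self.decision_fn()]
            return iter(chosen())

    # Toy example with two trivially equivalent plans.
    plan_a = lambda: iter([1, 2, 3])
    plan_b = lambda: iter([1, 2, 3])
    plan = ChoosePlan(lambda: 0, plan_a, plan_b)
    assert list(plan) == [1, 2, 3]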
One possible purpose of the subplans
executed prior to a decision could be to
sample the values in the database.
A very interesting research direction
quantifies the value of sampling by ana-
lyzing the resulting improvement in the
decision quality [Seppi et al. 1989].
12. ADDITIONAL TECHNIQUES FOR
PERFORMANCE IMPROVEMENT
In this section, we consider some addi-
tional techniques that have been pro-
posed in the literature or used in real
systems and that have not been dis-
cussed in earlier sections of this survey.
In particular, we consider precomputa-
tion, data compression, surrogate pro-
cessing, bit vector filters, and specialized
hardware. Recently proposed techniques
that have not been fully developed are
not discussed here, e.g., “racing” equiva-
lent plans and terminating the ones that
seem not competitive after some small
amount of time.
12.1 Precomputation and Derived Data
It is trivial to answer a query for which
the answer is already known—therefore,
precomputation of frequently requested
information is an obvious idea. The prob-
lem with keeping preprocessed informa-
tion in addition to base data is that it is
redundant and must be invalidated or
maintained on updates to the base data.
Precomputation and derived data such
as relational views are duals. Thus, con-
cepts and algorithms designed for one
will typically work well for the other. The
main difference is the database user’s
view: precomputed data are typically
used after a query optimizer has deter-
mined that they can be used to answer a
user query against the base data, while
derived data are known to the user and
can be queried without regard to the fact
that they actually must be derived at
run-time from stored base data. Not sur-
prisingly, since derived data are likely to
be referenced and requested by users and
application programs, precomputation of
derived data has been investigated both
for relational and object-oriented data
models.
Indices are the simplest form of pre-
computed data since they are a re-
dundant and, in a sense, precomputed
selection. They represent a compromise
between a nonredundant database and
one with complex precomputed data be-
cause they can be maintained relatively
efficiently.
The next more sophisticated form
of precomputation are inversions as pro-
vided in System R's "0th" prototype
[Chamberlin et al. 1981a], view indices
as analyzed by Roussopoulos [1991],
two-relation join indices as proposed by
Valduriez [1987], or domain indices as
used in the ANDA project (called VAL-
TREE there) [Deshpande and Van Gucht
1988] in which all occurrences of one do-
main (e.g., part number) are indexed to-
gether, and each index entry contains a
relation identification with each record
identifier. With join or domain indices,
join queries can be answered very fast,
typically faster than using multiple sin-
gle-relation indices. On the other hand,
single-clause selections and updates may
be slightly slower if there are more en-
tries for each indexed key.
For binary operators, there is a spec-
trum of possible levels of precomputa-
tions (as suggested by J. A. Blakeley),
explored predominantly for joins. The
simplest form of precomputation in sup-
port of binary operations is individual
indices, e.g., clustering B-trees that en-
sure and maintain sorted relations. On
the other extreme are completely materi-
alized join results. Intermediate levels
are pointer-based joins [Shekita and
Carey 1990] (discussed earlier in the sec-
tion on matching) and join indices
[Valduriez 1987]. For each form of pre-
computed result, the required redundant
data structures must be maintained each
time the underlying base data are up-
dated, and larger retrieval speedup might
be paid for with larger maintenance
overhead.
Babb [1982] explored storing only
results of outer joins, but not the nor-
malized base relations, in the content-
addressable file store (CAFS), and called
this encoding join normal form. Blakeley
et al. [1989], Blakeley and Martin [1990],
Larson and Yang [1985], Medeiros and
Tompa [1985], Tompa and Blakeley
[1988], and Yang and Larson [1987] in-
vestigated storing and maintaining ma-
terialized views in relational database
systems. Their hope was to speed rela-
tional query processing by using derived
data, possibly without storing all base
data, and ensuring that their mainte-
nance overhead would be less than their
benefits in faster query processing. For
example, Blakeley and Martin [1990]
demonstrated that for a single join there
exists a large range of retrieval and up-
date mixes in which materialized views
outperform both join indices and hybrid
hash join. This investigation should be
extended, however, for more complex
queries, e.g., joins of three and four in-
puts, and for queries in object-oriented
systems and emerging database applica-
tions.
Hanson [1987] compared query modifi-
cation (i.e., query evaluation from base
relations) against the maintenance costs
of materialized views and considered in
particular the cost of immediate versus
deferred updates. His results indicate
that for modest update rates, material-
ized views provide better system perfor-
mance. Furthermore, for modest selectiv-
ities of the view predicate, deferred-view
maintenance using differential files
[Severance and Lohman 1976] outper-
forms immediate maintenance of materi-
alized views. However, Hanson also did
not include multi-input joins in his study.
Sellis [1987] analyzed caching of re-
sults in a query language called Quel+
(which is a subset of Postquel
[Stonebraker et al. 1990b]) over a rela-
tional database with procedural (QUEL)
fields [Sellis 1987]. He also considered
the case of limited space on secondary
storage used for caching query results,
and replacement algorithms for query re-
sults in the cache when the space be-
comes insufficient.
Links between records (pointers of
some sort, e.g., record, tuple, or object
identifiers) are another form of precom-
putation. Links are particularly effective
for system performance if they are com-
bined with clustering (assignment of
records to pages). Database systems for
the hierarchical and network models have
used physical links and clustering, but
supported basically only queries and op-
erations that were "precomputed" in this
way. Some researchers tried to overcome
this restriction by building relational
query engines on top of network systems,
e.g., Chen and Kuck [1984], Rosenthal
and Reiner [1985], Zaniolo [1979]. How-
ever, with performance improvements in
the relational world, these efforts seem
to have been abandoned. With the advent
of extensible and object-oriented database
management systems, combining links
and ad hoc query processing might be-
come a more interesting topic again. A
recent effort for an extensible-relational
system is Starburst's pointer-based
joins discussed earlier [Haas et al. 1990;
Shekita and Carey 1990].
In order to ensure good performance
for its extensive rule-processing facilities,
Postgres uses precomputation and
caching of the action parts of production
rules [Stonebraker 1987; Stonebraker et
al. 1990a; 1990b]. For automatic mainte-
nance of such derived data, persistent
“invalidation locks” are stored for detec-
tion of invalid data after updates to the
base data.
Finally, the Cactis project focused on
maintenance of derived data in object-
oriented environments [Hudson and King
1989]. The conclusions of this project in-
clude that incremental maintenance cou-
pled with a fairly simple adaptive clus-
tering algorithm is an efficient way to
propagate updates to derived data.
One issue that many investigations
into materialized views ignore is the fact
that many queries do not require views
in their entirety. For example, if a rela-
tional student information system in-
cludes a view that computes each stu-
dent’s grade point average from the en-
rollment data, most queries using this
view will select only a single student, not
all students at the school. Thus, if the
view definition is merged into the query
before query optimization, as discussed
in the introduction, only one student’s
grade point average, not the entire view,
will be computed for each query. Obvi-
ously, the treatment of this difference
will affect an analysis of costs and bene-
fits of materialized views.
12.2 Data Compression
A number of researchers have investi-
gated the effect of compression on
database systems and their performance
[Graefe and Shapiro 1991; Lynch and
Brownrigg 1981; Ruth and Keutzer 1972;
Severance 1983]. There are two types of
compression in database systems. First,
the amount of redundancy can be re-
duced by prefix and suffix truncation, in
particular, in indices, and by use of en-
coding tables (e.g., color combination “9”
means “red car with black interior”). Sec-
ond, compression schemes can be applied
to attribute values, e.g., adaptive
Huffman coding or Ziv-Lempel methods
[Bell et al. 1989; Lelewer and Hirschberg
1987]. This type of compression can be
exploited most effectively in database
query processing if all attributes of the
same domain use the same encoding, e.g.,
the “Part-No” attributes of data sets rep-
resenting parts, orders, shipments, etc.,
because common encodings permit com-
parisons without decompression.
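A small Python sketch of such a shared encoding table (names and structure ours): assigning codes in sorted value order makes the encoding order-preserving, so equality and even range comparisons can be evaluated on codes alone, without decompression.

    class DomainDictionary:
        """One encoding table per domain, e.g., for all Part-No
        attributes; codes are assigned in sorted value order."""
        def __init__(self, values):
            self.decode_table = sorted(set(values))
            self.codes = {v: i for i, v in enumerate(self.decode_table)}

        def encode(self, value):
            return self.codes[value]

        def decode(self, code):
            return self.decode_table[code]

    parts = DomainDictionary(["axle", "bolt", "gear", "nut"])
    # Comparisons on codes agree with comparisons on values.
    assert (parts.encode("bolt") < parts.encode("nut")) == ("bolt" < "nut")

Note that such a dictionary must be fixed for the whole domain; inserting new values into an order-preserving code space is a problem of its own and is ignored here.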
Most obviously, compression can re-
duce the amount of disk space required
for a given data set. Disk space savings
has a number of ramifications on I/O
performance. First, the reduced data
space fits into a smaller physical disk
area; therefore, the seek distances and
seek times are reduced. Second, more
data fit into each disk page, track, and
cylinder, allowing more intelligent clus-
tering of related objects into physically
near locations. Third, the unused disk
space can be used for disk shadowing to
increase reliability, availability, and I/O
performance [Bitton and Gray 1988].
Fourth, compressed data can be trans-
ferred faster to and from disk. In other
words, data compression is an effective
means to increase disk bandwidth (not
by increasing physical transfer rates but
by increasing the information density
of transferred data) and to relieve the
I/O bottleneck found in many high-
performance database management sys-
tems [Boral and DeWitt 1983]. Fifth, in
distributed database systems and in
client-server situations, compressed da-
ta can be transferred faster across the
network than uncompressed data. Un-
compressed data require either more
network time or a separate compression
step. Finally, retaining data in com-
pressed form in the I/O buffer allows
more records to remain in the buffer,
thus increasing the buffer hit rate and
reducing the number of I/Os. The last
three points are actually more general.
They apply to the entire storage hierar-
chy of tape, disk, controller caches, local
and remote main memories, and CPU
caches.
For query processing, compression can
be exploited far beyond improved I/O
performance because decompression can
often be delayed until a relatively small
data set is presented to the user or an
application program. First, exact-match
comparisons can be performed on com-
pressed data. Second, projection and du-
plicate removal can be performed with-
out decompressing data. The situation for
aggregation is a little more complex since
the attribute on which arithmetic is per-
formed typically must be decompressed.
Third, neither the join attributes nor
other attributes need to be decompressed
for most joins. Since keys and foreign
keys are from the same domain, and if
compression schemes are fixed for each
domain, a join on compressed key values
will give the same results as a join on
normal uncompressed key values. It
might seem unusual to perform a merge-
join in the order of compressed values,
but it nonetheless is possible and will
produce correct results.
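The following sketch (ours, in Python) shows such a merge-join running entirely in the order of compressed values; each item is a (code, payload) pair, and both inputs are assumed sorted on the code.

    def merge_join_on_codes(r, s):
        """Merge-join two inputs sorted on their encoded join key.
        Equal values carry equal codes within a domain, so no
        decompression is needed to find matches."""
        i = j = 0
        while i < len(r) and j < len(s):
            if r[i][0] < s[j][0]:
                i += 1
            elif r[i][0] > s[j][0]:
                j += 1
            else:
                code, i0, j0 = r[i][0], i, j
                while i < len(r) and r[i][0] == code:
                    i += 1
                while j < len(s) and s[j][0] == code:
                    j += 1
                # Emit the cross product of the matching groups.
                for x in r[i0:i]:
                    for y in s[j0:j]:
                        yield (x, y)

    r = [(0, "r1"), (1, "r2")]
    s = [(1, "s1"), (2, "s2")]
    assert list(merge_join_on_codes(r, s)) == [((1, "r2"), (1, "s1"))]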
There are a number of benefits from
processing compressed data. First, mate-
rializing output records is faster because
records are shorter, i.e., less copying is
required. Second, for inputs larger than
memory, more records fit into memory.
In hybrid hash join and duplicate re-
moval, for instance, the fraction of the
file that can be retained in the hash table
and thus be joined without any I/O is
larger. During sorting, the number of
records in memory and thus per run is
larger, leading to fewer runs and possibly
fewer merge levels. Third, and very in-
terestingly, skew is less likely to be a
problem. The goal of compression is to
represent the information with as few
bits as possible. Therefore, each bit in
the output of a good compression scheme
has close to maximal information con-
tent, and bit columns seen over the
entire file are unlikely to be skewed.
Furthermore, bit columns will not be cor-
related. Thus, the compressed key values
can be used to create a hash value distri-
bution that is almost guaranteed to be
uniform, i.e., optimal for hashing in
memory and partitioning to overflow files
as well as to multiple processors in paral-
lel join algorithms.
We believe that data compression is
undervalued in current query processing
research, mostly because it was not real-
ized that many operations can often be
performed faster on compressed data
than on uncompressed data, and we hope
that future database management sys-
tems make extensive use of data com-
pression. Considering the current growth
rates in CPU and I/O performance, it
might even make sense to exploit data
compression on the fly for hash table
overflow resolution.
12.3 Surrogate Processing
Another very useful technique in query
processing is the use of surrogates for
intermediate results. A surrogate is a ref-
erence to a data item, be it a logical
object identifier (OID) used in object-
oriented systems or a physical record
identifier (RID) or location. Instead of
keeping a complete record in memory,
only the fields that are used immediately
are kept, and the remainder replaced by
a surrogate, which has in principle the
same effect as compression. While this
technique has traditionally been used to
reduce main-memory requirements, it
can also be employed to improve board-
and CPU-level caching [Nyberg et al.
1993].
The simplest case in which surrogate
processing can be exploited is in avoiding
copying. Consider a relational join; when
two items satisfy the join predicate, a
new tuple is created from the two origi-
nal ones. Instead of copying the data
fields, it is possible to create only a pair
of RIDs or pointers to the original records
if they are kept in memory. If a record is
50 times larger than an RID, e.g., 8 vs.
400 bytes, the effort spent on copying
bytes is reduced by that factor.
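A hash join that emits surrogate pairs instead of copied fields can be sketched as follows (Python, illustrative only; a RID here is just an index into an in-memory list).

    def rid_join(build, probe, build_key, probe_key):
        """Hash join emitting (build_RID, probe_RID) pairs instead
        of copying data fields into new result records."""
        table = {}
        for rid, record in enumerate(build):
            table.setdefault(build_key(record), []).append(rid)
        for prid, record in enumerate(probe):
            for brid in table.get(probe_key(record), ()):
                # Surrogate pair; fields are fetched only when a later
                # operator actually needs them.
                yield (brid, prid)

    build = [("a", 1), ("b", 2)]
    probe = [("b", 9)]
    assert list(rid_join(build, probe, lambda r: r[0], lambda r: r[0])) == [(1, 0)]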
Copying is already a major part of the
CPU time spent in many query process-
ing systems, but it is becoming more ex-
pensive for two reasons. First, many
modern CPU designs and implementa-
tions are optimized for an impressive
number of instructions per second but do
not provide the performance improve-
ments in mundane tasks such as moving
bytes from one memory location to an-
other [Ousterhout 1990]. Second, many
modern computer architectures employ
multiple CPUs accessing shared memory
over one bus because this design permits
fast and inexpensive parallelism. Al-
though alleviated by local caches, bus
contention is the major bottleneck and
limitation to scalability in shared-
memory parallel machines. Therefore, re-
ductions in memory-to-memory copying
in database query execution engines per-
mit higher useful degrees of parallelism
in shared-memory machines.
A second example for surrogate pro-
cessing was mentioned earlier in connec-
tion with indices. To evaluate a conjunc-
tion with multiple clauses, each of which
is supported by an index, it might be
useful to perform an intersection of RID-
lists to reduce the number of records
needed before actual data are accessed.
A third case is the use of indices and
RIDs to evaluate joins, for example, in
the query processing techniques used in
Ingres [Kooi 1980; Kooi and Frankforth
1982] and IBM's hybrid join [Cheng et al.
1991] discussed in the section on binary
matching.
Surrogate processing has also been
used in parallel systems, in particular,
distributed-memory implementations, to
reduce network traffic. For example,
Lorie and Young [1989] used RIDs to
reduce the communication time in paral-
lel sorting by sending (sort key, RID)
pairs to a central site, which determines
each record's global rank, and then
repartitioning and merging records very
quickly by their rank alone without fur-
ther data comparisons.
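A single-process Python sketch of this idea (ours; Lorie and Young [1989] describe the actual parallel algorithm): given each site's (sort key, RID) pairs, the central step computes every record's global rank from keys and RIDs alone, without moving the records themselves.

    def global_ranks(per_site_pairs):
        """per_site_pairs: one list of (sort_key, rid) pairs per site.
        Returns one dict per site mapping rid -> global rank."""
        tagged = [(key, site, rid)
                  for site, pairs in enumerate(per_site_pairs)
                  for key, rid in pairs]
        tagged.sort(key=lambda t: t[0])
        ranks = [dict() for _ in per_site_pairs]
        for rank, (_, site, rid) in enumerate(tagged):
            ranks[site][rid] = rank
        return ranks

    site0 = [(5, "r0"), (1, "r1")]
    site1 = [(3, "s0")]
    assert global_ranks([site0, site1]) == [{"r1": 0, "r0": 2}, {"s0": 1}]

The records can then be repartitioned and merged by rank alone.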
Another form of surrogates are encod-
ings with lossy compressions, such as su-
perimposed coding used for efficient ac-
cess methods [Bloom 1970; Faloutsos
1985; Sacks-Davis and Ramamohanarao
1983; Sacks-Davis et al. 1987]. Berra et
al. [1987] and Chung and Berra [1988]
considered indexing and retrieval organi-
zations for very large (relational) knowl-
edge bases and databases. They em-
ployed three techniques, concatenated
code words (CCWs), superimposed code
words (SCWs), and transformed inverted
lists (TILs). TILs are normal index struc-
tures for all attributes of a relation that
permit answering conjunctive queries by
bitwise anding. CCWs and SCWs use
hash values of all attributes of a tuple
and either concatenate such hash values
or bitwise or them together. The result-
ing code words are then used as keys in
indices. In their particular architecture,
Berra et al. and Chung and Berra con-
sider associative memory and optical
computing to search efficiently through
such indices, although conventional soft-
ware techniques could be used as well.
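The two hash-based code words can be sketched in a few lines (Python; the attribute hash width is an arbitrary choice of ours):

    def code_words(attribute_values, bits_per_attr=8):
        """Build a concatenated code word (CCW) and a superimposed
        code word (SCW) from the hash values of all attributes
        of one tuple."""
        mask = (1 << bits_per_attr) - 1
        ccw = scw = 0
        for value in attribute_values:
            h = hash(value) & mask
            ccw = (ccw << bits_per_attr) | h  # concatenate hash values
            scw |= h                          # bitwise-or them together
        return ccw, scw

The resulting code words serve as keys in indices; a conjunctive query hashes its constants the same way and compares or masks code words instead of attribute values.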
12.4 Bit Vector Filtering
In parallel systems, bit vector filters have
been used very effectively for what we
call here “probabilistic semi-joins.” Con-
sider a relational join to be executed on a
distributed-memory machine with repar-
titioning of both input relations on the
join attribute. It is clear that communica-
tion effort could be reduced if only the
tuples that actually contribute to the join
result, i.e., those with a match in the
other relation, needed to be shipped
across the network. To accomplish this,
distributed database systems were de-
signed to make extensive use of semi-
joins, e.g., SDD-1 [Bernstein et al. 1981].
A faster alternative to semi-joins,
which, as discussed earlier, requires ba-
sically the same computational effort as
natural joins, is the use of bit vector
filters [Babb 1979], also called Bloom-
filters [Bloom 1970]. A bit vector filter
with N bits is initialized with zeroes;
and all items in the first (preferably the
smaller) input are hashed on their join
key to 0, ..., N - 1. For each item, one
bit in the bit vector filter is set to one;
hash collisions are ignored. After the first
join input has been exhausted, the bit
vector filter is used to filter the second
input. Data items of the second input are
hashed on their join key value, and only
items for which the bit is set to one can
possibly participate in the join. There is
some chance for false passes in the case
of collisions, i.e., items of the second in-
put pass the bit vector filter although
they actually do not participate in the
join, but if the bit vector filter is suffi-
ciently large, the number of false passes
is very small.
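The mechanism is simple enough to state in a few lines of Python (an illustrative sketch of ours; real systems use packed bit arrays and cheaper hash functions):

    def build_filter(build_keys, n_bits):
        """Set one bit per build key; hash collisions are ignored."""
        bits = bytearray(n_bits)  # one byte per bit, for simplicity
        for key in build_keys:
            bits[hash(key) % n_bits] = 1
        return bits

    def filter_probe(probe_items, key_fn, bits):
        """Pass only items whose bit is set. False passes are possible
        on collisions, but no matching item is ever dropped."""
        n_bits = len(bits)
        for item in probe_items:
            if bits[hash(key_fn(item)) % n_bits]:
                yield item

    bits = build_filter(["x", "y"], 64)
    passed = list(filter_probe([("x", 1), ("z", 2)], lambda t: t[0], bits))
    assert ("x", 1) in passed  # ("z", 2) passes only on a rare collision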
In general, if the number of bits is
about twice the number of items in the
first input, bit vector filters are very ef-
fective. If many more bits are available,
the bit vector filter can be split into mul-
tiple subvectors, or multiple bits can be
set for each item using multiple hash
functions, reducing the number of false
passes. Babb [1979] analyzed the use of
multiple bit vector filters in detail.
The Gamma relational database ma-
chine demonstrated the effectiveness of
bit vector filtering in relational join pro-
cessing on distributed-memory hardware
[DeWitt et al. 1986; 1988; 1990; Gerber
1986]. When scanning and redistributing
the build input of a join, the Gamma
machine creates a bit vector filter that is
then distributed to the scanning sites of
the probe input. Based on the bit vector
filter, a large fraction of the probe tuples
can often be discarded before incurring
network costs. The decision whether to
create one bit vector filter for the entire
build input or to create a bit vector filter
for each of the join sites depends on the
space available for bit vector filters and
the communication costs for bit arrays.
Mullin [1990] generalized bit vector fil-
tering to sending bit vector filters back
and forth between sites. In his words,
“the central notion is to send small but
optimally information-dense Bloom fil-
ters between sites as long as these filters
serve to reduce the volume of tuples
which need to be transmitted by more
than their own size.” While this proce-
dure achieves very low communication
costs, it ignores the 1/0 cost at each site
if the reduced relations must be scanned
from disk in each step. Qadah [1988] dis-
cussed a limited form of this idea using
only two bit vector filters and augment-
ing it with bit vector filter compression.
While bit vector filtering is typically
used only for joins, it is equally applica-
ble to all other one-to-one match oper-
ators, including semi-join, outer join,
intersection, union, and difference. For
operators that include nonmatching
items in their output, e.g., outer joins
and unions, part of the result can be
obtained before network transfer, based
solely on the bit vector filter. For one-to-
one match operations other than join,
e.g., outer join and union, bit vector fil-
ters can also be used, but the algorithm
must be modified to ensure that items
that do not pass the bit vector filter are
properly included in the operation’s out-
put stream. For parallel relational divi-
sion (universal quantification), bit vector
filtering can be used on the divisor at-
tributes to eliminate most of the dividend
items that do not pertain to any divisor
item. Thus, our earlier assessment that
universal quantification can be per-
formed as fast as existential quantifica-
tion (a semi-join of dividend and divisor
relations) even extends to special tech-
niques used to boost join performance.
Bit vector filtering can also be ex-
ploited in sequential systems. Consider a
merge-join with sort operations on both
inputs. The bit vector filter is built
from the input of the first sort, i.e., the
bit vector filter is completed when all
data have reached the first sort operator.
This bit vector filter can then be used to
reduce the input into the second sort op-
erator on the (presumably larger) second
input. Depending on how the sort opera-
tion is organized into phases, it might
even be possible to create a second bit
vector filter from the second merge-join
input and use it to reduce the first join
input while it is being merged.
For sequential hash joins, bit vector
filters can be used in two ways. First,
they can be used to filter items of the
probe input using a bit vector filter cre-
ated from items of the build input. This
use of bit vector filters is analogous to bit
vector filter usage in parallel systems
and for merge-join. In Rdb/VMS and
DB2, bit vector filters are used when
intersecting large RID lists obtained from
multiple indices on the same table
[Antoshenkov 1993; Mohan et al. 1990].
Second, new bit vector filters can be cre-
ated and used for each partition in each
recursion level. In the Volcano query-
processing system, the operator imple-
menting hash join, intersection, etc. re-
uses the space that anchors each bucket's
linked list as a small bit vector filter
after the bucket has been spilled to an
overflow file. Only those items from the
probe input that pass the bit vector filter
are written to the probe overflow file.
This technique is used in each recursion
level of overflow resolution. Thus, during
recursive partitioning, relatively small
bit vector filters can be used repeatedly
and at increasingly finer granularity to
remove items from the probe input that
do not contribute to the join result. Bit
vectors could also be used to remove items
from the build input using bit vector fil-
ters created from the probe input; how-
ever, since the probe input is presumed
the larger input and hash collisions in
the bit vector filter would make the filter
less effective, it may or may not be an
effective technique.
With some modifications of the stan-
dard algorithm, bit vector filters can also
be used in hash-based duplicate removal.
Since bit vector filters can only deter-
mine safely which item has not been seen
yet, but not which item has been seen yet
(due to possible hash collisions), bit vec-
tor filters cannot be used in the most
direct way in hash-based duplicate re-
moval. However, hash-based duplicate
removal can be modified to become simi-
lar to a hash join or actually a hash-based
set intersection. Consider a large file R
and a partitioning fan-out F. First, R is
partitioned into F/2 partitions. For each
partition, two files are created; thus, this
step uses the entire fan-out to create a
total of F files. Within each partition, a
bit vector filter is used to determine
whether an item belongs into the first or
the second file of that partition. If an
item is guaranteed to be unique, i.e., there
is no earlier item indicated in the bit
vector filter, the item is assigned to the
first file, and a bit in the bit vector filter
is set. Otherwise, the item is assigned
into the partition’s second file. At the end
of this partitioning step, there are F files,
half of them guaranteed to be free of
duplicate data items. The possible size of
the duplicate-free files is limited by the
size of the bit vector filters; therefore,
this step should use the largest bit vector
filters possible. After the first partition-
ing step, each partition’s pair of files is
intersected using the duplicate-free file
as probe input. Recall that duplicate re-
moval for a join’s build input can be ac-
complished easily and inexpensively
while building the in-memory hash table.
Remaining duplicates with one copy in
the duplicate-free (probe) file and an-
other copy in the other file (the build
input) in the hash table are found when
the probe input is matched against the
hash table. This algorithm performs very
well if many output items depend on only
one input item and if the bit vectors are
quite large. In that case, the duplicate-
free partition files are very large, and the
smaller partition file with duplicates can
be processed very efficiently.
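The partitioning step of this modified algorithm can be sketched as follows (Python, ours; files are modeled as in-memory lists):

    def split_partition(items, key_fn, n_bits):
        """One partition's pass: route each item either into a file
        guaranteed free of duplicates or into a residue file that is
        later intersected against the first (the probe input)."""
        bits = bytearray(n_bits)
        unique_file, residue_file = [], []
        for item in items:
            b = hash(key_fn(item)) % n_bits
            if bits[b] == 0:
                bits[b] = 1
                unique_file.append(item)   # certainly no earlier copy
            else:
                residue_file.append(item)  # duplicate or collision
        return unique_file, residue_file

    uniq, rest = split_partition(["a", "b", "a"], lambda x: x, 1024)
    # Barring a rare collision, "a" and "b" land in the unique file
    # and the second "a" in the residue file.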
In order to find and exploit a dual in
the realm of sorting and merge-join to bit
vector filtering in each recursion level of
recursive hash join, sorting of multiple
inputs must be divided into individual
merge levels. In other words, for a
merge-join of inputs R and S, the sort
activity should switch back and forth be-
tween R and S, level by level, creating
and using a new bit vector filter in each
merge level. Unfortunately, even with a
sophisticated sort implementation that
supports this use of bit vector filters in
each merge level, recursive hybrid hash
join will make more effective use of bit
vector filters because the inputs are par-
titioned, thus reducing the number of
distinct values in each partition in each
recursion level.
12.5 Specialized Hardware
Specialized hardware was considered by
a number of researchers, e.g., in the forms
of hardware sorters and logic-per-track
selection. A relatively recent survey of
database machine research is given by
Su [1988]. Most of this research was
abandoned after Boral and DeWitt’s
[1983] influential analysis that compared
CPU and I/O speeds and their trends.
They concluded that I/O is most likely
the bottleneck in future high-perfor-
mance query execution, not processing.
Therefore, they recommended moving
from research on custom processors to
techniques for overcoming the I/O bot-
tleneck, e.g., by use of parallel readout
disks, disk caching and read-ahead, and
indexing to reduce the amount of data to
be read for a query. Other investigations
also came to the conclusion that par-
allelism is no substitute for effective
storage structures and query execution
algorithms [DeWitt and Hawthorn
1981; Neches 1984]. An additional very
strong argument against custom VLSI
processors is that microprocessor speed
is currently improving so rapidly that it
is likely that, by the time a special hard-
ware component has been designed, fab-
ricated, tested, and integrated into a
larger hardware and software system, the
next generation of general-purpose CPUs
will be available and will be able to exe-
cute database functions programmed in a
high-level language at the same speed as
the specialized hardware component.
Furthermore, it is not clear what special-
ized hardware would be most beneficial
to design, in particular, in light of today’s
directions toward extensible database
systems and emerging database applica-
tion domains. Therefore, we do not favor
specialized database hardware modules
beyond general-purpose processing, stor-
age, and communication hardware dedi-
cated to executing database software.
SUMMARY AND OUTLOOK
Database management systems provide
three essential groups of services. First,
they maintain both data and associated
metadata in order to make databases
self-contained and self-explanatory, at
least to some extent, and to provide data
independence. Second, they support safe
data sharing among multiple users as
well as prevention and recovery of fail-
ures and data loss. Third, they raise the
level of abstraction for data manipula-
tion above the primitive access com-
mands provided by file systems with more
or less sophisticated matching and infer-
ence mechanisms, commonly called the
query language or query-processing facil-
ity. We have surveyed execution algo-
rithms and software architectures used
in providing this third essential service.
Query processing has been explored
extensively in the last 20 years in the
context of relational database manage-
ment systems and is slowly gaining
interest in the research community for
extensible and object-oriented systems.
This is a very encouraging development,
because if these new systems have in-
creased modeling power over previous
data models and database management
systems but cannot execute even simple
requests efficiently, they will never gain
widespread use and acceptance.
Databases will continue to manage mas-
sive amounts of data; therefore, efficient
query and request execution will con-
tinue to represent both an important re-
search direction and an important crite-
rion in investment decisions in the “real
world.” In other words, new database
management systems should provide
greater modeling power (this is widely
accepted and intensely pursued), but also
competitive or better performance than
previous systems. We hope that this sur-
vey will contribute to the use of efficient
and parallel algorithms for query pro-
cessing tasks in new database manage-
ment systems.
A large set of query processing algo-
rithms has been developed for relational
systems. Sort- and hash-based tech-
niques have been used for physical-
storage design, for associative index
structures, for algorithms for unary and
binary matching operations such as ag-
gregation, duplicate removal, join, inter-
section, and division, and for parallel
query processing using hash- or range-
partitioning. Additional techniques such
as precomputation and compression have
been shown to provide substantial per-
formance benefits when manipulating
large volumes of data. Many of the exist-
ing algorithms will continue to be useful
for extensible and object-oriented sys-
tems, and many can easily be general-
ized from sets of tuples to more general
pattern-matching functions. Some
emerging database applications will re-
quire new operators, however, both for
translation between alternative data rep-
resentations and for actual data manipu-
lation.
The most promising aspect of current
research into database query processing
for new application domains is that the
concept of a fixed number of parametri-
zed operators, each performing a part of
the required data manipulation and each
passing an intermediate result to the next
operator, is versatile enough to meet the
new challenges. This concept permits
specification of database queries and re-
quests in a logical algebra as well as
concise representation of database pro-
grams in a physical algebra. Further-
more, it allows algebraic optimizations of
requests, i.e., optimizing transformations
of algebra expressions and cost-sensitive
translations of logical into physical ex-
pressions. Finally, it permits pipelining
between operators to exploit parallel
computer architectures and partitioning
of stored data and intermediate results
for most operators, in particular, for op-
erators on sets but also for other bulk
types such as arrays, lists, and time
series.
We can hope that much of the existing
relational technology for query optimiza-
tion and parallel execution will remain
relevant and that research into extensi-
ble optimization and parallelization will
have a significant impact on future
database applications such as scientific
data. For database management systems
to become acceptable for new application
domains, their performance must at least
match that of the file systems currently
in use. Automatic optimization and par-
allelization may be crucial contributions
to achieving this goal, in addition to the
query execution techniques surveyed
here.
ACKNOWLEDGMENTS
José A. Blakeley, Cathy Brand, Rick Cole, Diane
Davison, David Helman, Ann Linville, Bill
McKenna, Gail Mitchell, Shengsong Ni, Barb Pe-
ters, Leonard Shapiro, the students of "Readings in
Database Systems" at the University of Colorado at
Boulder (Fall 1991) and "Database Implementation
Techniques" at Portland State University (Winter
1993), David Maier's weekly reading group at the
Oregon Graduate Institute (Winter 1992), the
anonymous referees, and the Computing Surveys
editors Shamkant Navathe and Dick Muntz gave
many valuable comments on earlier drafts of this
survey, which have improved the paper very much.
This paper is based on research partially supported
by the National Science Foundation with grants
IRI-8996270, IRI-8912618, IRI-9006348, IRI-
9116547, IRI-9119446, and ASC-9217394, ARPA
with contract DAAB 07-91-C-Q518, Texas Instru-
ments, Digital Equipment Corp., Intel Super-
computer Systems Division, Sequent Computer
Systems, ADP, and the Oregon Advanced Com-
puting Institute (OACIS).
REFERENCES
ADAM, N. R., AND WORTMANN, J. C. 1989. Secu-
rity-control methods for statistical databases:
A comparative study. ACM Comput. Surv. 21,
4 (Dec.), 515.
AHN, I., AND SNODGRASS, R. 1988. Partitioned
storage for temporal databases. Inf. Syst. 13, 4,
369.
ALBERT, J. 1991. Algebraic properties of bag data
types. In Proceedings of the International Con-
ference on Very Large Data Bases. VLDB En-
dowment, 211.
ANALYTI, A., AND PRAMANIK, S. 1992. Fast search
in main memory databases. In Proceedings of
the ACM SIGMOD Conference. ACM, New
York, 215.
ANDERSON, D. P., TZOU, S. Y., AND GRAHAM, G. S.
1988. The DASH virtual memory system.
Tech. Rep. 88/461, Univ. of California—Berke-
ley, CS Division, Berkeley, Calif.
ANTOSHENKOV, G. 1993. Dynamic query opti-
mization in Rdb/VMS. In Proceedings of the
IEEE Conference on Data Engineering. IEEE,
New York.
ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN,
D. D., ESWARAN, K. P., GRAY, J. N., GRIFFITHS,
P. P., KING, W. F., LORIE, R. A., MCJONES,
P. R., MEHL, J. W., PUTZOLU, G. R., TRAIGER,
I. L., WADE, B. W., AND WATSON, V. 1976.
System R: A relational approach to database
management. ACM Trans. Database Syst. 1, 2
(June), 97.
ASTRAHAN, M. M., SCHKOLNICK, M., AND WHANG,
K. Y. 1987. Approximating the number of
unique values of an attribute without sorting.
Inf. Syst. 12, 1, 11.
ATKINSON, M. P., AND BUNEMAN, O. P. 1987.
Types and persistence in database program-
ming languages. ACM Comput. Surv. 19, 2
(June), 105.
BABB, E. 1982. Joined Normal Form: A storage
encoding for relational databases. ACM Trans.
Database Syst. 7, 4 (Dec.), 588.
BABB, E. 1979. Implementing a relational
database by means of specialized hardware.
ACM Trans. Database Syst. 4, 1 (Mar.), 1.
BAEZA-YATES, R. A., AND LARSON, P. A. 1989. Per-
formance of B+-trees with partial expansions.
IEEE Trans. Knowledge Data Eng. 1, 2 (June),
248.
BANCILHON, F., AND RAMAKRISHNAN, R. 1986. An
amateur’s introduction to recursive query pro-
cessing strategies. In Proceedings of the ACM
SIGMOD Conference. ACM, New York, 16.
BARGHOUTI, N. S., AND KAISER, G. E. 1991. Con-
currency control in advanced database applica-
tions. ACM Comput. Surv. 23, 3 (Sept.), 269.
BARU, C. K., AND FRIEDER, O. 1989. Database op-
erations in a cube-connected multicomputer
system. IEEE Trans. Comput. 38, 6 (June),
920.
BATINI, C., LENZERINI, M., AND NAVATHE, S. B.
1986. A comparative analysis of methodolo-
gies for database schema integration. ACM
Comput. Surv. 18, 4 (Dec.), 323.
BATORY, D. S., BARNETT, J. R., GARZA, J. F., SMITH,
K. P., TSUKUDA, K., TWICHELL, B. C., AND WISE,
T. E. 1988a. GENESIS: An extensible
database management system. IEEE Trans.
Softw. Eng. 14, 11 (Nov.), 1711.
BATORY, D. S., LEUNG, T. Y., AND WISE, T. E. 1988b.
Implementation concepts for an extensible data
model and data language. ACM Trans.
Database Syst. 13, 3 (Sept.), 231.
BAUGSTO, B., AND GREIPSLAND, J. 1989. Parallel
sorting methods for large data volumes on a
hypercube database computer. In Proceedings
of the 6th International Workshop on Database
Machines (Deauville, France, June 19-21).
BAYER, R., AND MCCREIGHT, E. 1972. Organi-
zation and maintenance of large ordered in-
dices. Acta Informatica 1, 3, 173.
BECK, M., BITTON, D., AND WILKINSON, W. K. 1988.
Sorting large files on a backend multiprocessor.
IEEE Trans. Comput. 37, 7 (July), 769.
BECKER, B., SIX, H. W., AND WIDMAYER, P. 1991.
Spatial priority search: An access technique for
scaleless maps. In Proceedings of ACM SIG-
MOD Conference. ACM, New York, 128.
BECKMANN, N., KRIEGEL, H. P., SCHNEIDER, R., AND
SEEGER, B. 1990. The R*-tree: An efficient
and robust access method for points and rect-
angles. In Proceedings of ACM SIGMOD Con-
ference. ACM, New York, 322.
BELL, T., WITTEN, I. H., AND CLEARY, J. G. 1989.
Modelling for text compression. ACM Comput.
Surv. 21, 4 (Dec.), 557.
BENTLEY, J. L. 1975. Multidimensional binary
search trees used for associative searching.
Commun. ACM 18, 9 (Sept.), 509.
BERNSTEIN, P. A., AND GOODMAN, N. 1981. Con-
currency control in distributed database sys-
tems. ACM Comput. Surv. 13, 2 (June), 185.
BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE,
C. L., AND ROTHNIE, J. B. 1981. Query pro-
cessing in a system for distributed databases
(SDD-1). ACM Trans. Database Syst. 6, 4
(Dec.), 602.
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N.
1987. Concurrency Control and Recovery in
Database Systems. Addison-Wesley, Reading,
Mass.
BERRA, P. B., CHUNG, S. M., AND HACHEM, N. I.
1987. Computer architecture for a surrogate
file to a very large data/knowledge base. IEEE
Comput. 20, 3 (Mar.), 25.
BERTINO, E. 1991. An indexing technique for ob-
ject-oriented databases. In Proceedings of the
IEEE Conference on Data Engineering. IEEE,
New York, 160.
BERTINO, E. 1990. Optimization of queries using
nested indices. In Lecture Notes in Computer
Science, vol. 416. Springer-Verlag, New York.
BERTINO, E., AND KIM, W. 1989. Indexing tech-
niques for queries on nested objects. IEEE
Trans. Knowledge Data Eng. 1, 2 (June), 196.
BHIDE, A. 1988. An analysis of three transaction
processing architectures. In Proceedings of the
International Conference on Very Large Data
Bases (Los Angeles, Aug.). VLDB Endowment,
339.
BHIDE, A., AND STONEBRAKER, M. 1988. A perfor-
mance comparison of two architectures for fast
transaction processing. In Proceedings of the
IEEE Conference on Data Engineering. IEEE,
New York, 536.
BITTON, D., AND DEWITT, D. J. 1983. Duplicate
record elimination in large data files. ACM
Trans. Database Syst. 8, 2 (June), 255.
BITTON-FRIEDLAND, D. 1982. Design, analysis,
and implementation of parallel external sorting
algorithms. Ph.D. thesis, Univ. of Wiscon-
sin—Madison.
BITTON, D., AND GRAY, J. 1988. Disk shadowing.
In Proceedings of the International Conference
on Very Large Data Bases (Los Angeles, Aug.).
VLDB Endowment, 331.
BITTON, D., DEWITT, D. J., HSIAO, D. K., AND MENON,
J. 1984. A taxonomy of parallel sorting.
ACM Comput. Surv. 16, 3 (Sept.), 287.
BITTON, D., HANRAHAN, M. B., AND TURBYFILL, C.
1987. Performance of complex queries in main
memory database systems. In Proceedings of
the IEEE Conference on Data Engineering.
IEEE, New York.
BLAKELEY, J. A., AND MARTIN, N. L. 1990. Join
index, materialized view, and hybrid hash-join:
A performance analysis. In Proceedings of the
IEEE Conference on Data Engineering. IEEE,
New York.
BLAKELEY, J. A., COBURN, N., AND LARSON, P. A.
1989. Updating derived relations: Detecting
irrelevant and autonomously computable up-
dates. ACM Trans. Database Syst. 14, 3 (Sept.),
369.
BLASGEN, M., AND ESWARAN, K. 1977. Storage and
access in relational databases. IBM Syst. J.
16, 4, 363.
BLASGEN, M., AND ESWARAN, K. 1976. On the
evaluation of queries in a relational database
system. IBM Res. Rep. RJ 1745, IBM, San Jose,
Calif.
BLOOM, B. H. 1970. Space/time tradeoffs in hash
coding with allowable errors. Commun. ACM
13, 7 (July), 422.
BORAL, H. 1988. Parallelism in Bubba. In Pro-
ceedings of the International Symposium on
Databases in Parallel and Distributed Systems
(Austin, Tex., Dec.), 68.
BORAL, H., AND DEWITT, D. J. 1983. Database
machines: An idea whose time has passed? A
critique of the future of database machines. In
Proceedings of the International Workshop on
Database Machines. Reprinted in Parallel Ar-
chitectures for Database Systems. IEEE Com-
puter Society Press, Washington, D. C., 1989.
BORAL, H., ALEXANDER, W., CLAY, L., COPELAND, G.,
DANFORTH, S., FRANKLIN, M., HART, B., SMITH,
M., AND VALDURIEZ, P. 1990. Prototyping
Bubba, A Highly Parallel Database System.
IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.),
4.
BRATBERGSENGEN, K. 1984. Hashing methods
and relational algebra operations. In Proceed-
ings of the International Conference on Very
Large Data Bases. VLDB Endowment, 323.
BROWN, K. P., CAREY, M. J., DEWITT, D, J., MEHTA,
M., AND NAUGHTON, J. F. 1992. Scheduling
issues for complex database workloads. Com-
puter Science Tech. Rep. 1095, Univ. of
Wisconsin—Madison.
BUCHERAL, P., THEVENIN, J. M., AND VALDURIEZ, P.
1990. Efficient main memory data manage-
ment using the DBGraph storage model. In
Proceedings of the International Conference on
Very Large Data Bases. VLDB Endowment, 683.
BUNEMAN, P., AND FRANKEL, R. E. 1979. FQL—A
Functional Query Language. In Proceedings of
ACM SIGMOD Conference. ACM, New York,
52.
BUNEMAN, P., FRANKEL, R. E., AND NIKHIL, R. 1982.
An implementation technique for database
query languages. ACM Trans. Database Syst.
7, 2 (June), 164.
CACACE, F., CERI, S., AND HOUTSMA, M. A. W. 1992.
A survey of parallel execution strategies for
transitive closures and logic programs. To ap-
pear in Distrib. Parall. Databases.
CAREY, M. J,, DEWITT, D. J., RICHARDSON, J. E., AND
SHEKITA, E. J. 1986. Object and file manage-
ment in the EXODUS extensible database sys-
tem. In Proceedings of the International
Conference on Very Large Data Bases. VLDB
Endowment, 91.
CARLIS, J. V. 1986. HAS: A relational algebra
operator, or divide is not enough to conquer.
In Proceedings of the IEEE Conference on Data
Engineering. IEEE, New York, 254.
CARTER, J. L., AND WEGMAN, M. N. 1979. Univer-
sal classes of hash functions. J. Comput. Syst.
Sci. 18, 2, 143.
CHAMBERLIN, D. D., ASTRAHAN, M. M., BLASGEN, M.
W., GRAY, J. N., KING, W. F., LINDSAY, B. G.,
LORIE, R., MEHL, J. W., PRICE, T. G., PUTZOLU,
F., SELINGER, P. G., SCHKOLNIK, M., SLUTZ, D.
R., TRAIGER, I. L., WADE, B. W., AND YOST, R. A.
1981a. A history and evaluation of System R.
Commun. ACM 24, 10 (Oct.), 632.
CHAMBERLIN, D. D., ASTRAHAN, M. M., KING, W. F.,
LORIE, R. A., MEHL, J. W., PRICE, T. G.,
SCHKOLNIK, M., SELINGER, P. G., SLUTZ, D. R.,
WADE, B. W., AND YOST, R. A. 1981b. Sup-
port for repetitive transactions and ad hoc
queries in System R. ACM Trans. Database
Syst. 6, 1 (Mar.), 70.
CHEN, P. P. 1976. The entity relationship model
—Toward a unified view of data. ACM Trans.
Database Syst. 1, 1 (Mar.), 9.
CHEN, H., AND KUCK, S. M. 1984. Combining re-
lational and network retrieval methods. In
Proceedings of ACM SIGMOD Conference.
ACM, New York, 131.
CHEN, M. S., LO, M. L., YU, P. S., AND YOUNG, H. C.
1992. Using segmented right-deep trees for
the execution of pipelined hash joins. In Pro-
ceedings of the International Conference on Very
Large Data Bases (Vancouver, B.C., Canada).
VLDB Endowment, 15.
CHENG, J., HADERLE, D., HEDGES, R., IYER, B. R.,
MESSINGER, T., MOHAN, C., AND WANG, Y.
1991. An efficient hybrid join algorithm: A
DB2 prototype. In Proceedings of the IEEE
Conference on Data Engineering. IEEE, New
York, 171.
CHERITON, D. R., GOOSEN, H. A., AND BOYLE, P. D.
1991. Paradigm: A highly scalable shared-
memory multicomputer. IEEE Comput. 24, 2
(Feb.), 33.
CHIU, D. M., AND HO, Y. C. 1980. A methodology
for interpreting tree queries into optimal semi-
join expressions. In Proceedings of ACM SIG-
MOD Conference. ACM, New York, 169.
CHOU, H. T. 1985. Buffer management of
database systems. Ph.D. thesis, Univ. of
Wisconsin—Madison.
CHOU, H. T., AND DEWITT, D. J. 1985. An evalua-
tion of buffer management strategies for rela-
tional database systems. In Proceedings of the
International Conference on Very Large Data
Bases (Stockholm, Sweden, Aug.). VLDB En-
dowment, 127. Reprinted in Readings in
Database Systems. Morgan-Kaufman, San
Mateo, Calif., 1988.
CHRISTODOULAKIS, S. 1984. Implications of cer-
tain assumptions in database performance
evaluation. ACM Trans. Database Syst. 9, 2
(June), 163.
CHUNG, S. M., AND BERRA, P. B. 1988. A compari-
son of concatenated and superimposed code
word surrogate files for very large data/knowl-
edge bases. In Lecture Notes in Computer Sci-
ence, vol. 303. Springer-Verlag, New York, 364.
CLUET, S., DELOBEL, C., LECLUSE, C., AND RICHARD,
P. 1989. Reloops, an algebra based query
language for an object-oriented database sys-
tem. In Proceedings of the 1st International
Conference on Deductive and Object-Oriented
Databases (Kyoto, Japan, Dec. 4-6).
COMER, D. 1979. The ubiquitous B-tree. ACM
Comput. Surv. 11, 2 (June), 121.
COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND
KELLER, T. 1988. Data placement in Bubba.
In Proceedings of ACM SIGMOD Conference.
ACM, New York, 99.
DADAM, P., KUESPERT, K., ANDERSON, F., BLANKEN,
H., ERBE, R., GUENAUER, J., LUM, V., PISTOR, P.,
AND WALCH, G. 1986. A database manage-
ment prototype to support extended NF2 rela-
tions: An integrated view on flat tables and
hierarchies. In Proceedings of ACM SIGMOD
Conference. ACM, New York, 356.
DANIELS, D., AND NG, P. 1982. Distributed query
compilation and processing in R*. IEEE
Database Eng. 5, 3 (Sept.).
DANIELS, S., GRAEFE, G., KELLER, T., MAIER, D.,
SCHMIDT, D., AND VANCE, B. 1991. Query op-
timization in Revelation, an overview. IEEE
Database Eng. 14, 2 (June).
DAVIDSON, S. B., GARCIA-MOLINA, H., AND SKEEN, D.
1985. Consistency in partitioned networks.
ACM Comput. Surv. 17, 3 (Sept.), 341.
DAVIS, D. D. 1992. Oracle’s parallel punch for
OLTP. Datamation (Aug. 1), 67.
DAVISON, W. 1992. Parallel index building in In-
formix OnLine 6.0. In Proceedings of ACM
SIGMOD Conference. ACM, New York, 103.
DEPPISCH, U., PAUL, H. B., AND SCHEK, H. J. 1986.
A storage system for complex objects. In Pro-
ceedings of the International Workshop on Ob-
ject-Oriented Database Systems (Pacific Grove,
Calif., Sept.), 183.
DESHPANDE, V., AND LARSON, P. A. 1992. The de-
sign and implementation of a parallel join algo-
rithm for nested relations on shared-memory
multiprocessors. In Proceedings of the IEEE
Conference on Data Engineering. IEEE, New
York, 68.
DESHPANDE, V., AND LARSON, P, A, 1991. An alge-
bra for nested relations with support for nulls
and aggregates. Computer Science Dept., Univ.
of Waterloo, Waterloo, Ontario, Canada.
DESHPANDE, A., AND VAN GUCHT, D. 1988. An im-
plementation for nested relational databases.
In Proceedings of the International Conference
on Very Large Data Bases (Los Angeles, Calif.,
Aug.). VLDB Endowment, 76.
DEWITT, D. J. 1991. The Wisconsin benchmark:
Past, present, and future. In Database and
Transaction Processing System Performance
Handbook. Morgan-Kaufman, San Mateo, Calif.
DEWITT, D. J., AND GERBER, R. H. 1985. Multi-
processor hash-based join algorithms. In Pro-
ceedings of the International Conference on Very
Large Data Bases (Stockholm, Sweden, Aug.).
VLDB Endowment, 151.
DEWITT, D. J., AND GRAY, J. 1992. Parallel
database systems: The future of high-perfor-
mance database systems. Commun. ACM 35, 6
(June), 85.
DEWITT, D. J., AND HAWTHORN, P. B. 1981. A
performance evaluation of database machine
architectures. In Proceedings of the Interna-
tional Conference on Very Large Data Bases
(Cannes, France, Sept.). VLDB Endowment,
199.
DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS,
M. L., KUMAR, K. B., AND MURALIKRISHNA, M.
1986. GAMMA-A high performance dataflow
database machine. In Proceedings of the Inter-
national Conference on Very Large Data Bases.
VLDB Endowment, 228. Reprinted in Read-
ings in Database Systems. Morgan-Kaufman,
San Mateo, Calif., 1988.
DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEI-
DER, D. 1988. A performance analysis of the
GAMMA database machine. In Proceedings of
ACM SIGMOD Conference. ACM, New York,
350.
DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H. I., AND RASMUSSEN, R. 1990. The Gamma database machine project. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 44.
DEWITT, D. J., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Implementation techniques for main memory database systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 1.
DEWITT, D., NAUGHTON, J., AND BURGER, J. 1993. Nested loops revisited. In Proceedings of Parallel and Distributed Information Systems (San Diego, Calif., Jan.).
DEWITT, D. J., NAUGHTON, J. E., AND SCHNEIDER, D. A. 1991a. An evaluation of non-equijoin algorithms. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 443.
DEWITT, D., NAUGHTON, J., AND SCHNEIDER, D. 1991b. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the International Conference on Parallel and Distributed Information Systems (Miami Beach, Fla., Dec.).
DOZIER, J. 1992. Access to data in NASA's Earth observing systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 1.
EFFELSBERG, W., AND HAERDER, T. 1984. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec.), 560.
ENBODY, R. J., AND DU, H. C. 1988. Dynamic hashing schemes. ACM Comput. Surv. 20, 2 (June), 85.
ENGLERT, S., GRAY, J., KOCHER, R., AND SHAH, P. 1989. A benchmark of NonStop SQL Release 2 demonstrating near-linear speedup and scaleup on large databases. Tandem Computers Tech. Rep. 89.4, Tandem Corp., Cupertino, Calif.
EPSTEIN, R. 1979. Techniques for processing of aggregates in relational database systems. UCB/ERL Memo. M79/8, Univ. of California, Berkeley, Calif.
EPSTEIN, R., AND STONEBRAKER, M. 1980. Analysis of distributed data base processing strategies. In Proceedings of the International Conference on Very Large Data Bases (Montreal, Canada, Oct.). VLDB Endowment, 92.
EPSTEIN, R., STONEBRAKER, M., AND WONG, E. 1978. Distributed query processing in a relational database system. In Proceedings of ACM SIGMOD Conference. ACM, New York.
FAGIN, R., NIEVERGELT, J., PIPPENGER, N., AND STRONG, H. R. 1979. Extendible hashing: A fast access method for dynamic files. ACM Trans. Database Syst. 4, 3 (Sept.), 315.
FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49.
FALOUTSOS, C., NG, R., AND SELLIS, T. 1991. Predictive load control for flexible buffer allocation. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 265.
FANG, M. T., LEE, R. C. T., AND CHANG, C. C. 1986. The idea of declustering and its applications. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 181.
FINKEL, R. A., AND BENTLEY, J. L. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Informatica 4, 1, 1.
FREYTAG, J. C., AND GOODMAN, N. 1989. On the translation of relational queries into iterative programs. ACM Trans. Database Syst. 14, 1 (Mar.), 1.
FUSHIMI, S., KITSUREGAWA, M., AND TANAKA, H. 1986. An overview of the system software of a parallel relational database machine GRACE. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). ACM, New York, 209.
GALLAIRE, H., MINKER, J., AND NICOLAS, J. M. 1984. Logic and databases: A deductive approach. ACM Comput. Surv. 16, 2 (June), 153.
GERBER, R. H. 1986. Dataflow query processing using multiprocessor hash-partitioned algorithms. Ph.D. thesis, Univ. of Wisconsin—Madison.
GHANDEHARIZADEH, S., AND DEWITT, D. J. 1990. Hybrid-range partitioning strategy: A new declustering strategy for multiprocessor database machines. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 481.
GOODMAN, J. R., AND WOEST, P. J. 1988. The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor. Computer Science Tech. Rep. 766, Univ. of Wisconsin—Madison.
GOUDA, M. G., AND DAYAL, U. 1981. Optimal semijoin schedules for query processing in local distributed database systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 164.
GRAEFE, G. 1993a. Volcano, an extensible and parallel dataflow query processing system. IEEE Trans. Knowledge Data Eng. To be published.
GRAEFE, G. 1993b. Performance enhancements for hybrid hash join. Available as Computer Science Tech. Rep. 606, Univ. of Colorado, Boulder.
GRAEFE, G. 1993c. Sort-merge-join: An idea whose time has passed? Revised in Portland State Univ. Computer Science Tech. Rep. 93-4.
GRAEFE, G. 1991. Heap-filter merge join: A new algorithm for joining medium-size inputs. IEEE Trans. Softw. Eng. 17, 9 (Sept.), 979.
GRAEFE, G. 1990a. Parallel external sorting in Volcano. Computer Science Tech. Rep. 459, Univ. of Colorado, Boulder.
GRAEFE, G. 1990b. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of ACM SIGMOD Conference. ACM, New York, 102.
GRAEFE, G. 1989. Relational division: Four algorithms and their performance. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 94.
GRAEFE, G., AND COLE, R. L. 1993. Fast algorithms for universal quantification in large databases. Portland State Univ. and Univ. of Colorado at Boulder.
GRAEFE, G., AND DAVISON, D. L. 1993. Encapsulation of parallelism and architecture-independence in extensible database query processing. IEEE Trans. Softw. Eng. 19, 7 (July).
GRAEFE, G., AND DEWITT, D. J. 1987. The EXODUS optimizer generator. In Proceedings of ACM SIGMOD Conference. ACM, New York, 160.
GRAEFE, G., AND MAIER, D. 1988. Query optimization in object-oriented database systems: A prospectus. In Advances in Object-Oriented Database Systems, vol. 334. Springer-Verlag, New York, 358.
GRAEFE, G., AND MCKENNA, W. J. 1993. The Volcano optimizer generator: Extensibility and efficient search. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
GRAEFE, G., AND SHAPIRO, L. D. 1991. Data compression and database performance. In Proceedings of the ACM/IEEE-Computer Science Symposium on Applied Computing. ACM/IEEE, New York.
GRAEFE, G., AND WARD, K. 1989. Dynamic query evaluation plans. In Proceedings of ACM SIGMOD Conference. ACM, New York, 358.
GRAEFE, G., AND WOLNIEWICZ, R. H. 1992. Algebraic optimization and parallel execution of computations over scientific databases. In Proceedings of the Workshop on Metadata Management in Scientific Databases (Salt Lake City, Utah, Nov. 3-5).
GRAEFE, G., COLE, R. L., DAVISON, D. L., MCKENNA, W. J., AND WOLNIEWICZ, R. H. 1992. Extensible query optimization and parallel execution in Volcano. In Query Processing for Advanced Database Applications. Morgan-Kaufman, San Mateo, Calif.
GRAEFE, G., LINVILLE, A., AND SHAPIRO, L. D. 1993. Sort versus hash revisited. IEEE Trans. Knowledge Data Eng. To be published.
GRAY, J. 1990. A census of Tandem system availability between 1985 and 1990. Tandem Computers Tech. Rep. 90.1, Tandem Corp., Cupertino, Calif.
GRAY, J., AND PUTZOLO, F. 1987. The 5 minute rule for trading memory for disc accesses and the 10 byte rule for trading memory for CPU time. In Proceedings of ACM SIGMOD Conference. ACM, New York, 395.
GRAY, J., AND REUTER, A. 1991. Transaction Processing: Concepts and Techniques. Morgan-Kaufman, San Mateo, Calif.
GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLO, F., AND TRAIGER, I. 1981. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June), 223.
GRUENWALD, L., AND EICH, M. H. 1991. MMDB reload algorithms. In Proceedings of ACM SIGMOD Conference. ACM, New York, 397.
GUENTHER, O., AND BILMES, J. 1991. Tree-based access methods for spatial databases: Implementation and performance evaluation. IEEE Trans. Knowledge Data Eng. 3, 3 (Sept.), 342.
GUIBAS, L., AND SEDGEWICK, R. 1978. A dichromatic framework for balanced trees. In Proceedings of the 19th Symposium on the Foundations of Computer Science.
GUNADHI, H., AND SEGEV, A. 1991. Query processing algorithms for temporal intersection joins. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 336.
GUNADHI, H., AND SEGEV, A. 1990. A framework for query optimization in temporal databases. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management.
GUNTHER, O. 1989. The design of the cell tree: An object-oriented index structure for geometric databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 598.
GUNTHER, O., AND WONG, E. 1987. A dual space representation for geometric data. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 501.
GUO, M., SU, S. Y. W., AND LAM, H. 1991. An association algebra for processing object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 23.
GUTTMAN, A. 1984. R-Trees: A dynamic index structure for spatial searching. In Proceedings of ACM SIGMOD Conference. ACM, New York, 47. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
HAAS, L., CHANG, W., LOHMAN, G., MCPHERSON, J., WILMS, P. F., LAPIS, G., LINDSAY, B., PIRAHESH, H., CAREY, M. J., AND SHEKITA, E. 1990. Starburst mid-flight: As the dust clears. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 143.
HAAS, L., FREYTAG, J. C., LOHMAN, G., AND PIRAHESH, H. 1989. Extensible query processing in Starburst. In Proceedings of ACM SIGMOD Conference. ACM, New York, 377.
HAAS, L. M., SELINGER, P. G., BERTINO, E., DANIELS, D., LINDSAY, B., LOHMAN, G., MASUNAGA, Y., MOHAN, C., NG, P., WILMS, P., AND YOST, R. 1982. R*: A research project on distributed relational database management. IBM Res. Division, San Jose, Calif.
HAERDER, T., AND REUTER, A. 1983. Principles of transaction-oriented database recovery. ACM Comput. Surv. 15, 4 (Dec.).
HAFEZ, A., AND OZSOYOGLU, G. 1988. Storage structures for nested relations. IEEE Database Eng. 11, 3 (Sept.), 31.
HAGMANN, R. B. 1986. An observation on database buffering performance metrics. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 289.
HAMMING, R. W. 1977. Digital Filters. Prentice-Hall, Englewood Cliffs, N.J.
HANSON, E. N. 1987. A performance analysis of view materialization strategies. In Proceedings of ACM SIGMOD Conference. ACM, New York, 440.
HENRICH, A., SIX, H. W., AND WIDMAYER, P. 1989. The LSD tree: Spatial access to multidimensional point and nonpoint objects. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 45.
HOEL, E. G., AND SAMET, H. 1992. A qualitative comparison study of data structures for large linear segment databases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 205.
HONG, W., AND STONEBRAKER, M. 1993. Optimization of parallel query execution plans in XPRS. Distrib. Parall. Databases 1, 1 (Jan.), 9.
HONG, W., AND STONEBRAKER, M. 1991. Optimization of parallel query execution plans in XPRS. In Proceedings of the International Conference on Parallel and Distributed Information Systems (Miami Beach, Fla., Dec.).
HOU, W. C., AND OZSOYOGLU, G. 1993. Processing time-constrained aggregation queries in CASE-DB. ACM Trans. Database Syst. To be published.
HOU, W. C., AND OZSOYOGLU, G. 1991. Statistical estimators for aggregate relational algebra queries. ACM Trans. Database Syst. 16, 4 (Dec.), 600.
HOU, W. C., OZSOYOGLU, G., AND DOGDU, E. 1991. Error-constrained COUNT query evaluation in relational databases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 278.
HSIAO, H. I., AND DEWITT, D. J. 1990. Chained declustering: A new availability strategy for multiprocessor database machines. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 456.
HUA, K. A., AND LEE, C. 1991. Handling data skew in multicomputer database computers using partition tuning. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 525.
HUA, K. A., AND LEE, C. 1990. An adaptive data placement scheme for parallel database computer systems. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 493.
HUDSON, S. E., AND KING, R. 1989. Cactis: A self-adaptive, concurrent implementation of an object-oriented database management system. ACM Trans. Database Syst. 14, 3 (Sept.), 291.
HULL, R., AND KING, R. 1987. Semantic database modeling: Survey, applications, and research issues. ACM Comput. Surv. 19, 3 (Sept.), 201.
HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1990. The R-File: An efficient access structure for proximity queries. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 372.
HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1988a. Twin grid files: Space optimizing access schemes. In Proceedings of ACM SIGMOD Conference. ACM, New York, 183.
HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1988b. The twin grid file: A nearly space optimal index structure. In Lecture Notes in Computer Science, vol. 303. Springer-Verlag, New York, 352.
IOANNIDIS, Y. E., AND CHRISTODOULAKIS, S. 1991. On the propagation of errors in the size of join results. In Proceedings of ACM SIGMOD Conference. ACM, New York, 268.
IYER, B. R., AND DIAS, D. M. 1990. System issues in parallel sorting for database systems. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 246.
JAGADISH, H. V. 1991. A retrieval technique for similar shapes. In Proceedings of ACM SIGMOD Conference. ACM, New York, 208.
JARKE, M., AND KOCH, J. 1984. Query optimization in database systems. ACM Comput. Surv. 16, 2 (June), 111.
JARKE, M., AND VASSILIOU, Y. 1985. A framework for choosing a database query language. ACM Comput. Surv. 17, 3 (Sept.), 313.
KATZ, R. H. 1990. Towards a unified framework for version modeling in engineering databases. ACM Comput. Surv. 22, 3 (Dec.), 375.
KATZ, R. H., AND WONG, E. 1983. Resolving conflicts in global storage design through replication. ACM Trans. Database Syst. 8, 1 (Mar.), 110.
KELLER, T., GRAEFE, G., AND MAIER, D. 1991. Efficient assembly of complex objects. In Proceedings of ACM SIGMOD Conference. ACM, New York, 148.
KEMPER, A., AND MOERKOTTE, G. 1990a. Access support in object bases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 364.
KEMPER, A., AND MOERKOTTE, G. 1990b. Advanced query processing in object bases using access support relations. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 290.
KEMPER, A., AND WALLRATH, M. 1987. An analysis of geometric modeling in database systems. ACM Comput. Surv. 19, 1 (Mar.), 47.
KEMPER, A., KILGER, C., AND MOERKOTTE, G. 1991. Function materialization in object bases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 258.
KERNIGHAN, B. W., AND RITCHIE, D. M. 1978. The C Programming Language. Prentice-Hall, Englewood Cliffs, N.J.
KIM, W. 1984. Highly available systems for database applications. ACM Comput. Surv. 16, 1 (Mar.), 71.
KIM, W. 1980. A new way to compute the product and join of relations. In Proceedings of ACM SIGMOD Conference. ACM, New York, 179.
KITSUREGAWA, M., AND OGAWA, Y. 1990. Bucket spreading parallel hash: A new, robust, parallel hash join method for skew in the super database computer (SDC). In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 210.
KITSUREGAWA, M., NAKAYAMA, M., AND TAKAGI, M. 1989a. The effect of bucket size tuning in the dynamic hybrid GRACE hash join method. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 257.
KITSUREGAWA, M., TANAKA, H., AND MOTOOKA, T. 1983. Application of hash to data base machine and its architecture. New Gener. Comput. 1, 1, 63.
KITSUREGAWA, M., YANG, W., AND FUSHIMI, S. 1989b. Evaluation of 18-stage pipeline hardware sorter. In Proceedings of the 6th International Workshop on Database Machines (Deauville, France, June 19-21).
KLUG, A. 1982. Equivalence of relational algebra and relational calculus query languages having aggregate functions. J. ACM 29, 3 (July), 699.
KNAPP, E. 1987. Deadlock detection in distributed databases. ACM Comput. Surv. 19, 4 (Dec.), 303.
KNUTH, D. 1973. The Art of Computer Programming, Vol. III, Sorting and Searching. Addison-Wesley, Reading, Mass.
KOLOVSON, C. P., AND STONEBRAKER, M. 1991. Segment indexes: Dynamic indexing techniques for multi-dimensional interval data. In Proceedings of ACM SIGMOD Conference. ACM, New York, 138.
KOOI, R. P. 1980. The optimization of queries in relational databases. Ph.D. thesis, Case Western Reserve Univ., Cleveland, Ohio.
KOOI, R. P., AND FRANKFORTH, D. 1982. Query optimization in Ingres. IEEE Database Eng. 5, 3 (Sept.), 2.
KRIEGEL, H. P., AND SEEGER, B. 1988. PLOP-Hashing: A grid file without directory. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 369.
KRIEGEL, H. P., AND SEEGER, B. 1987. Multidimensional dynamic hashing is very efficient for nonuniform record distributions. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 10.
KRISHNAMURTHY, R., BORAL, H., AND ZANIOLO, C. 1986. Optimization of nonrecursive queries. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 128.
KUESPERT, K., SAAKE, G., AND WEGNER, L. 1989. Duplicate detection and deletion in the extended NF2 data model. In Proceedings of the 3rd International Conference on the Foundations of Data Organization and Algorithms (Paris, France, June).
KUMAR, V., AND BURGER, A. 1991. Performance measurement of some main memory database recovery algorithms. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 436.
LAKSHMI, M. S., AND YU, P. S. 1990. Effectiveness of parallel joins. IEEE Trans. Knowledge Data Eng. 2, 4 (Dec.), 410.
LAKSHMI, M. S., AND YU, P. S. 1988. Effect of skew on join performance in parallel architectures. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Austin, Tex., Dec.), 107.
LANKA, S., AND MAYS, E. 1991. Fully persistent B+-trees. In Proceedings of ACM SIGMOD Conference. ACM, New York, 426.
LARSON, P. A. 1981. Analysis of index-sequential files with overflow chaining. ACM Trans. Database Syst. 6, 4 (Dec.), 671.
LARSON, P., AND YANG, H. 1985. Computing queries from derived relations. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 259.
LEHMAN, T. J., AND CAREY, M. J. 1986. Query processing in main memory database systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 239.
LELEWER, D. A., AND HIRSCHBERG, D. S. 1987. Data compression. ACM Comput. Surv. 19, 3 (Sept.), 261.
LEUNG, T. Y. C., AND MUNTZ, R. R. 1992. Temporal query processing and optimization in multiprocessor database machines. In Proceedings of the International Conference on Very Large Data Bases (Vancouver, BC, Canada). VLDB Endowment, 383.
LEUNG, T. Y. C., AND MUNTZ, R. R. 1990. Query processing in temporal databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 200.
LI, K., AND NAUGHTON, J. 1988. Multiprocessor main memory transaction processing. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Austin, Tex., Dec.), 177.
LITWIN, W. 1980. Linear hashing: A new tool for file and table addressing. In Proceedings of the International Conference on Very Large Data Bases (Montreal, Canada, Oct.). VLDB Endowment, 212. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
LITWIN, W., MARK, L., AND ROUSSOPOULOS, N. 1990. Interoperability of multiple autonomous databases. ACM Comput. Surv. 22, 3 (Sept.), 267.
LOHMAN, G., MOHAN, C., HAAS, L., DANIELS, D., LINDSAY, B., SELINGER, P., AND WILMS, P. 1985. Query processing in R*. In Query Processing in Database Systems. Springer, Berlin, 31.
LOMET, D. 1992. A review of recent work on multi-attribute access methods. ACM SIGMOD Rec. 21, 3 (Sept.), 56.
LOMET, D., AND SALZBERG, B. 1990a. The performance of a multiversion access method. In Proceedings of ACM SIGMOD Conference. ACM, New York, 353.
LOMET, D. B., AND SALZBERG, B. 1990b. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst. 15, 4 (Dec.), 625.
LORIE, R. A., AND NILSSON, J. F. 1979. An access specification language for a relational database management system. IBM J. Res. Devel. 23, 3 (May), 286.
LORIE, R. A., AND YOUNG, H. C. 1989. A low communication sort algorithm for a parallel database machine. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 125.
LYNCH, C. A., AND BROWNRIGG, E. B. 1981. Application of data compression to a large bibliographic data base. In Proceedings of the International Conference on Very Large Data Bases (Cannes, France, Sept.). VLDB Endowment, 435.
LYYTINEN, K. 1987. Different perspectives on information systems: Problems and solutions. ACM Comput. Surv. 19, 1 (Mar.), 5.
MACKERT, L. F., AND LOHMAN, G. M. 1989. Index scans using a finite LRU buffer: A validated I/O model. ACM Trans. Database Syst. 14, 3 (Sept.), 401.
MAIER, D. 1983. The Theory of Relational Databases. CS Press, Rockville, Md.
MAIER, D., AND STEIN, J. 1986. Indexing in an object-oriented DBMS. In Proceedings of the International Workshop on Object-Oriented Database Systems (Pacific Grove, Calif., Sept.), 171.
MAIER, D., GRAEFE, G., SHAPIRO, L., DANIELS, S., KELLER, T., AND VANCE, B. 1992. Issues in distributed complex object assembly. In Proceedings of the Workshop on Distributed Object Management (Edmonton, BC, Canada, Aug.).
MANNINO, M. V., CHU, P., AND SAGER, T. 1988. Statistical profile estimation in database systems. ACM Comput. Surv. 20, 3 (Sept.).
MCKENZIE, L. E., AND SNODGRASS, R. T. 1991. Evaluation of relational algebras incorporating the time dimension in databases. ACM Comput. Surv. 23, 4 (Dec.).
MEDEIROS, C., AND TOMPA, F. 1985. Understanding the implications of view update policies. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 316.
MENON, J. 1986. A study of sort algorithms for multiprocessor database machines. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 197.
MISHRA, P., AND EICH, M. H. 1992. Join processing in relational databases. ACM Comput. Surv. 24, 1 (Mar.), 63.
MITSCHANG, B. 1989. Extending the relational algebra to capture complex objects. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 297.
MOHAN, C., HADERLE, D., WANG, Y., AND CHENG, J. 1990. Single table access using multiple indexes: Optimization, execution and concurrency control techniques. In Lecture Notes in Computer Science, vol. 416. Springer-Verlag, New York, 29.
MOTRO, A. 1989. An access authorization model for relational databases based on algebraic manipulation of view definitions. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 339.
MULLIN, J. K. 1990. Optimal semijoins for distributed database systems. IEEE Trans. Softw. Eng. 16, 5 (May), 558.
NAKAYAMA, M., KITSUREGAWA, M., AND TAKAGI, M. 1988. Hash-partitioned join method using dynamic destaging strategy. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 468.
NECHES, P. M. 1988. The Ynet: An interconnect structure for a highly concurrent data base computer system. In Proceedings of the 2nd Symposium on the Frontiers of Massively Parallel Computation (Fairfax, Virginia, Oct.).
NECHES, P. M. 1984. Hardware support for advanced data management systems. IEEE Comput. 17, 11 (Nov.), 29.
NEUGEBAUER, L. 1991. Optimization and evaluation of database queries including embedded interpolation procedures. In Proceedings of ACM SIGMOD Conference. ACM, New York, 118.
NG, R., FALOUTSOS, C., AND SELLIS, T. 1991. Flexible buffer allocation based on marginal gains. In Proceedings of ACM SIGMOD Conference. ACM, New York, 387.
NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1 (Mar.), 38.
NYBERG, C., BARCLAY, T., CVETANOVIC, Z., GRAY, J., AND LOMET, D. 1993. AlphaSort: A RISC machine sort. Tech. Rep. 93.2, DEC San Francisco Systems Center, Digital Equipment Corp., San Francisco.
OMIECINSKI, E. 1991. Performance analysis of a load balancing relational hash-join algorithm for a shared-memory multiprocessor. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 375.
OMIECINSKI, E. 1985. Incremental file reorganization schemes. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 346.
OMIECINSKI, E., AND LIN, E. 1989. Hash-based and index-based join algorithms for cube and ring connected multicomputers. IEEE Trans. Knowledge Data Eng. 1, 3 (Sept.), 329.
ONO, K., AND LOHMAN, G. M. 1990. Measuring the complexity of join enumeration in query optimization. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 314.
OUSTERHOUT, J. 1990. Why aren't operating systems getting faster as fast as hardware? In USENIX Summer Conference (Anaheim, Calif., June). USENIX.
OZSOYOGLU, Z. M., AND WANG, J. 1992. A keying method for a nested relational database management system. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 438.
OZSOYOGLU, G., OZSOYOGLU, Z. M., AND MATOS, V. 1987. Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Trans. Database Syst. 12, 4 (Dec.), 566.
OZSU, M. T., AND VALDURIEZ, P. 1991a. Distributed database systems: Where are we now? IEEE Comput. 24, 8 (Aug.), 68.
OZSU, M. T., AND VALDURIEZ, P. 1991b. Principles of Distributed Database Systems. Prentice-Hall, Englewood Cliffs, N.J.
PALMER, M., AND ZDONIK, S. B. 1991. FIDO: A cache that learns to fetch. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 255.
PECKHAM, J., AND MARYANSKI, F. 1988. Semantic data models. ACM Comput. Surv. 20, 3 (Sept.), 153.
PIRAHESH, H., MOHAN, C., CHENG, J., LIU, T. S., AND SELINGER, P. 1990. Parallelism in relational data base systems: Architectural issues and design approaches. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Dublin, Ireland, July).
QADAH, G. Z. 1988. Filter-based join algorithms on uniprocessor and distributed-memory multiprocessor database machines. In Lecture Notes in Computer Science, vol. 303. Springer-Verlag, New York, 388.
REW, R. K., AND DAVIS, G. P. 1990. The Unidata NetCDF: Software for scientific data access. In the 6th International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology (Anaheim, Calif.).
RICHARDSON, J. E., AND CAREY, M. J. 1987. Programming constructs for database system implementation in EXODUS. In Proceedings of ACM SIGMOD Conference. ACM, New York, 208.
RICHARDSON, J. P., LU, H., AND MIKKILINENI, K. 1987. Design and evaluation of parallel pipelined join algorithms. In Proceedings of ACM SIGMOD Conference. ACM, New York, 399.
ROBINSON, J. T. 1981. The K-D-B-Tree: A search structure for large multidimensional dynamic indices. In Proceedings of ACM SIGMOD Conference. ACM, New York, 10.
ROSENTHAL, A., AND REINER, D. S. 1985. Querying relational views of networks. In Query Processing in Database Systems. Springer, Berlin, 109.
ROSENTHAL, A., RICH, C., AND SCHOLL, M. 1991. Reducing duplicate work in relational join(s): A modular approach using nested relations. ETH Tech. Rep., Zurich, Switzerland.
ROTEM, D., AND SEGEV, A. 1987. Physical organization of temporal data. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 547.
ROTH, M. A., KORTH, H. F., AND SILBERSCHATZ, A. 1988. Extended algebra and calculus for nested relational databases. ACM Trans. Database Syst. 13, 4 (Dec.), 389.
ROTHNIE, J. B., BERNSTEIN, P. A., FOX, S., GOODMAN, N., HAMMER, M., LANDERS, T. A., REEVE, C., SHIPMAN, D. W., AND WONG, E. 1980. Introduction to a system for distributed databases (SDD-1). ACM Trans. Database Syst. 5, 1 (Mar.), 1.
ROUSSOPOULOS, N. 1991. An incremental access method for ViewCache: Concept, algorithms, and cost analysis. ACM Trans. Database Syst. 16, 3 (Sept.), 535.
ROUSSOPOULOS, N., AND KANG, H. 1991. A pipeline N-way join algorithm based on the 2-way semijoin program. IEEE Trans. Knowledge Data Eng. 3, 4 (Dec.), 486.
RUTH, S. S., AND KEUTZER, P. J. 1972. Data compression for business files. Datamation 18 (Sept.), 62.
SAAKE, G., LINNEMANN, V., PISTOR, P., AND WEGNER, L. 1989. Sorting, grouping and duplicate elimination in the advanced information management prototype. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 307. Extended version in IBM Sci. Ctr. Heidelberg Tech. Rep. 8903.008, March 1989.
SACCO, G. 1987. Index access with a finite buffer. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 301.
SACCO, G. M., AND SCHKOLNIK, M. 1986. Buffer management in relational database systems. ACM Trans. Database Syst. 11, 4 (Dec.), 473.
SACCO, G. M., AND SCHKOLNIK, M. 1982. A mechanism for managing the buffer pool in a relational database system using the hot set model. In Proceedings of the International Conference on Very Large Data Bases (Mexico City, Mexico, Sept.). VLDB Endowment, 257.
SACKS-DAVIS, R., AND RAMAMOHANARAO, K. 1983. A two-level superimposed coding scheme for partial match retrieval. Inf. Syst. 8, 4, 273.
SACKS-DAVIS, R., KENT, A., AND RAMAMOHANARAO, K. 1987. Multikey access methods based on superimposed coding techniques. ACM Trans. Database Syst. 12, 4 (Dec.), 655.
SALZBERG, B. 1990. Merging sorted runs using large main memory. Acta Informatica 27, 195.
SALZBERG, B. 1988. File Structures: An Analytic Approach. Prentice-Hall, Englewood Cliffs, N.J.
SALZBERG, B., TSUKERMAN, A., GRAY, J., STEWART, M., UREN, S., AND VAUGHAN, B. 1990. FastSort: A distributed single-input single-output external sort. In Proceedings of ACM SIGMOD Conference. ACM, New York, 94.
SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2 (June), 187.
SCHEK, H. J., AND SCHOLL, M. H. 1986. The relational model with relation-valued attributes. Inf. Syst. 11, 2, 137.
SCHNEIDER, D. A. 1991. Bit filtering and multiway join query processing. Hewlett-Packard Labs, Palo Alto, Calif. Unpublished manuscript.
SCHNEIDER, D. A. 1990. Complex query processing in multiprocessor database machines. Ph.D. thesis, Univ. of Wisconsin—Madison.
SCHNEIDER, D. A., AND DEWITT, D. J. 1990. Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 469.
SCHNEIDER, D., AND DEWITT, D. 1989. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In Proceedings of ACM SIGMOD Conference. ACM, New York, 110.
SCHOLL, M. H. 1988. The nested relational model—Efficient support for a relational database interface. Ph.D. thesis, Technical Univ. Darmstadt. In German.
SCHOLL, M., PAUL, H. B., AND SCHEK, H. J. 1987. Supporting flat relations by a nested relational kernel. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 137.
SEEGER, B., AND LARSON, P. A. 1991. Multi-disk B-trees. In Proceedings of ACM SIGMOD Conference. ACM, New York, 436.
SEGEV, A., AND GUNADHI, H. 1989. Event-join optimization in temporal relational databases. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 205.
SELINGER, P. G., ASTRAHAN, M. M., CHAMBERLIN, D. D., LORIE, R. A., AND PRICE, T. G. 1979. Access path selection in a relational database management system. In Proceedings of ACM SIGMOD Conference. ACM, New York, 23. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
SELLIS, T. K. 1987. Efficiently supporting procedures in relational database systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 278.
SEPPI, K., BARNES, J., AND MORRIS, C. 1989. A Bayesian approach to query optimization in large scale data bases. The Univ. of Texas at Austin ORP 89-19, Austin.
SERLIN, O. 1991. The TPC benchmarks. In Database and Transaction Processing System Performance Handbook. Morgan-Kaufman, San Mateo, Calif.
SESHADRI, S., AND NAUGHTON, J. F. 1992. Sampling issues in parallel database systems. In Proceedings of the International Conference on Extending Database Technology (Vienna, Austria, Mar.).
SEVERANCE, D. G. 1983. A practitioner's guide to data base compression. Inf. Syst. 8, 1, 51.
SEVERANCE, D., AND LOHMAN, G. 1976. Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst. 1, 3 (Sept.).
SEVERANCE, C., PRAMANIK, S., AND WOLBERG, P. 1990. Distributed linear hashing and parallel projection in main memory databases. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 674.
SHAPIRO, L. D. 1986. Join processing in database systems with large main memories. ACM Trans. Database Syst. 11, 3 (Sept.), 239.
SHAW, G. M., AND ZDONIK, S. B. 1990. A query algebra for object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 154.
SHAW, G., AND ZDONIK, S. 1989a. An object-oriented query algebra. IEEE Database Eng. 12, 3 (Sept.), 29.
SHAW, G. M., AND ZDONIK, S. B. 1989b. An object-oriented query algebra. In Proceedings of the 2nd International Workshop on Database Programming Languages. Morgan-Kaufmann, San Mateo, Calif., 103.
SHEKITA, E. J., AND CAREY, M. J. 1990. A performance evaluation of pointer-based joins. In Proceedings of ACM SIGMOD Conference. ACM, New York, 300.
SHERMAN, S. W., AND BRICE, R. S. 1976. Performance of a database manager in a virtual memory system. ACM Trans. Database Syst. 1, 4 (Dec.), 317.
SHETH, A. P., AND LARSON, J. A. 1990. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22, 3 (Sept.), 183.
SHIPMAN, D. W. 1981. The functional data model and the data language DAPLEX. ACM Trans. Database Syst. 6, 1 (Mar.), 140.
SIKELER, A. 1988. VAR-PAGE-LRU: A buffer replacement algorithm supporting different page sizes. In Lecture Notes in Computer Science, vol. 303. Springer-Verlag, New York, 336.
SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN, J. 1991. Database systems: Achievements and opportunities. Commun. ACM 34, 10 (Oct.), 110.
SIX, H. W., AND WIDMAYER, P. 1988. Spatial searching in geometric databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 496.
SMITH, J. M., AND CHANG, P. Y. T. 1975. Optimizing the performance of a relational algebra database interface. Commun. ACM 18, 10 (Oct.), 568.
SNODGRASS, R. 1990. Temporal databases: Status and research directions. ACM SIGMOD Rec. 19, 4 (Dec.), 83.
SOCKUT, G. H., AND GOLDBERG, R. P. 1979. Database reorganization—Principles and practice. ACM Comput. Surv. 11, 4 (Dec.), 371.
SRINIVASAN, V., AND CAREY, M. J. 1992. Performance of on-line index construction algorithms. In Proceedings of the International Conference on Extending Database Technology (Vienna, Austria, Mar.).
SRINIVASAN, V., AND CAREY, M. J. 1991. Performance of B-tree concurrency control algorithms. In Proceedings of ACM SIGMOD Conference. ACM, New York, 416.
STAMOS, J. W., AND YOUNG, H. C. 1989. A symmetric fragment and replicate algorithm for distributed joins. Tech. Rep. RJ7188, IBM Research Labs, San Jose, Calif.
STONEBRAKER, M. 1991. Managing persistent objects in a multi-level store. In Proceedings of ACM SIGMOD Conference. ACM, New York, 2.
STONEBRAKER, M. 1987. The design of the POSTGRES storage system. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 289. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
STONEBRAKER, M. 1986a. The case for shared-nothing. IEEE Database Eng. 9, 1 (Mar.).
STONEBRAKER, M. 1986b. The design and implementation of distributed INGRES. In The INGRES Papers. Addison-Wesley, Reading, Mass., 187.
STONEBRAKER, M. 1981. Operating system support for database management. Commun. ACM 24, 7 (July), 412.
STONEBRAKER, M. 1975. Implementation of integrity constraints and views by query modification. In Proceedings of ACM SIGMOD Conference. ACM, New York.
STONEBRAKER, M., AOKI, P., AND SELTZER, M. 1988a. Parallelism in XPRS. UCB/ERL Memorandum M89/16, Univ. of California, Berkeley.
STONEBRAKER, M., JHINGRAN, A., GOH, J., AND POTAMIANOS, S. 1990a. On rules, procedures, caching and views in data base systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 281.
STONEBRAKER, M., KATZ, R., PATTERSON, D., AND OUSTERHOUT, J. 1988b. The design of XPRS. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 318.
STONEBRAKER, M., ROWE, L. A., AND HIROHAMA, M. 1990b. The implementation of Postgres. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 125.
STRAUBE, D. D., AND OZSU, M. T. 1989. Query transformation rules for an object algebra. Dept. of Computing Sciences Tech. Rep. 89-23, Univ. of Alberta, Alberta, Canada.
SU, S. Y. W. 1988. Database Computers: Principles, Architectures and Techniques. McGraw-Hill, New York.
TANSEL, A. U., AND GARNETT, L. 1992. On Roth, Korth, and Silberschatz's extended algebra and calculus for nested relational databases. ACM Trans. Database Syst. 17, 2 (June), 374.
TEOREY, T. J., YANG, D., AND FRY, J. P. 1986. A logical design methodology for relational databases using the extended entity-relationship model. ACM Comput. Surv. 18, 2 (June), 197.
TERADATA. 1983. DBC/1012 Data Base Computer, Concepts and Facilities. Teradata Corporation, Los Angeles.
THOMAS, G., THOMPSON, G. R., CHUNG, C. W., BARKMEYER, E., CARTER, F., TEMPLETON, M., FOX, S., AND HARTMAN, B. 1990. Heterogeneous distributed database systems for production use. ACM Comput. Surv. 22, 3 (Sept.), 237.
TOMPA, F. W., AND BLAKELEY, J. A. 1988. Maintaining materialized views without accessing base data. Inf. Syst. 13, 4, 393.
TRAIGER, I. L. 1982. Virtual memory management for data base systems. ACM Oper. Syst. Rev. 16, 4 (Oct.), 26.
TRAIGER, I. L., GRAY, J., GALTIERI, C. A., AND LINDSAY, B. G. 1982. Transactions and consistency in distributed database systems. ACM Trans. Database Syst. 7, 3 (Sept.), 323.
TSUR, S., AND ZANIOLO, C. 1984. An implementation of GEM—Supporting a semantic data model on a relational back-end. In Proceedings of ACM SIGMOD Conference. ACM, New York, 286.
TUKEY, J. W. 1977. Exploratory Data Analysis. Addison-Wesley, Reading, Mass.
UNIDATA. 1991. NetCDF User's Guide, An Interface for Data Access, Version III. NCAR Tech. Note TS-334+1A, Boulder, Colo.
VALDURIEZ, P. 1987. Join indices. ACM Trans. Database Syst. 12, 2 (June), 218.
VANDENBERG, S. L., AND DEWITT, D. J. 1991. Algebraic support for complex objects with arrays, identity, and inheritance. In Proceedings of ACM SIGMOD Conference. ACM, New York, 158.
WALTON, C. B. 1989. Investigating skew and scalability in parallel joins. Computer Science Tech. Rep. 89-39, Univ. of Texas, Austin.
WALTON, C. B., DALE, A. G., AND JENEVEIN, R. M. 1991. A taxonomy and performance model of data skew effects in parallel joins. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 537.
WHANG, K. Y., AND KRISHNAMURTHY, R. 1990. Query optimization in a memory-resident domain relational calculus database system. ACM Trans. Database Syst. 15, 1 (Mar.), 67.
WHANG, K. Y., WIEDERHOLD, G., AND SAGALOWICZ, D. 1985. The property of separability and its application to physical database design. In Query Processing in Database Systems. Springer, Berlin, 297.
WHANG, K. Y., WIEDERHOLD, G., AND SAGALOWICZ, D. 1984. Separability—An approach to physical database design. IEEE Trans. Comput. 33, 3 (Mar.), 209.
WILLIAMS, P., DANIELS, D., HAAS, L., LAPIS, G., LINDSAY, B., NG, P., OBERMARCK, R., SELINGER, P., WALKER, A., WILMS, P., AND YOST, R. 1982. R*: An overview of the architecture. In Improving Database Usability and Responsiveness. Academic Press, New York. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
WILSCHUT, A. N. 1993. Parallel query execution in a main memory database system. Ph.D. thesis, Univ. of Twente, The Netherlands.
WILSCHUT, A. N., AND APERS, P. M. G. 1993. Dataflow query execution in a parallel main-memory environment. Distrib. Parall. Databases 1, 1 (Jan.), 103.
WOLF, J. L., DIAS, D. M., AND YU, P. S. 1990. An effective algorithm for parallelizing sort merge in the presence of data skew. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Dublin, Ireland, July).
WOLF, J. L., DIAS, D. M., YU, P. S., AND TUREK, J. 1991. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 200.
WOLNIEWICZ, R. H., AND GRAEFE, G. 1993. Algebraic optimization of computations over scientific databases. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment.
WONG, E., AND KATZ, R. H. 1983. Distributing a database for parallelism. In Proceedings of ACM SIGMOD Conference. ACM, New York, 23.
WONG, E., AND YOUSSEFI, K. 1976. Decomposition—A strategy for query processing. ACM Trans. Database Syst. 1, 3 (Sept.), 223.
YANG, H., AND LARSON, P. A. 1987. Query transformation for PSJ-queries. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 245.
YOUSSEFI, K., AND WONG, E. 1979. Query processing in a relational database management system. In Proceedings of the International Conference on Very Large Data Bases (Rio de Janeiro, Oct.). VLDB Endowment, 409.
YU, C. T., AND CHANG, C. C. 1984. Distributed query processing. ACM Comput. Surv. 16, 4 (Dec.), 399.
YU, L., AND OSBORN, S. L. 1991. An evaluation framework for algebraic object-oriented query models. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 670.
ZANIOLO, C. 1983. The database language GEM. In Proceedings of ACM SIGMOD Conference. ACM, New York, 207. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
ZANIOLO, C. 1979. Design of relational views over network schemas. In Proceedings of ACM SIGMOD Conference. ACM, New York, 179.
ZELLER, H. 1990. Parallel query execution in NonStop SQL. In Digest of Papers, 35th CompCon Conference, San Francisco.
ZELLER, H., AND GRAY, J. 1990. An adaptive hash join algorithm for multiuser environments. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 186.
Received January 1992; final revision accepted February 1993.
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control Monthly May 2025Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control
 
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdfIBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
VigneshPalaniappanM
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
Physical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A ReviewPhysical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A Review
Journal of Soft Computing in Civil Engineering
 
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Optimizing Reinforced Concrete Cantilever Retaining Walls Using Gases Brownia...
Journal of Soft Computing in Civil Engineering
 
2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt
rakshaiya16
 
22PCOAM16_MACHINE_LEARNING_UNIT_IV_NOTES_with_QB
22PCOAM16_MACHINE_LEARNING_UNIT_IV_NOTES_with_QB22PCOAM16_MACHINE_LEARNING_UNIT_IV_NOTES_with_QB
22PCOAM16_MACHINE_LEARNING_UNIT_IV_NOTES_with_QB
Guru Nanak Technical Institutions
 
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdfML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
rameshwarchintamani
 
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning ModelsMode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Mode-Wise Corridor Level Travel-Time Estimation Using Machine Learning Models
Journal of Soft Computing in Civil Engineering
 
Jacob Murphy Australia - Excels In Optimizing Software Applications
Jacob Murphy Australia - Excels In Optimizing Software ApplicationsJacob Murphy Australia - Excels In Optimizing Software Applications
Jacob Murphy Australia - Excels In Optimizing Software Applications
Jacob Murphy Australia
 
Personal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.pptPersonal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.ppt
ganjangbegu579
 
22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf
22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf
22PCOAM16 ML Unit 3 Full notes PDF & QB.pdf
Guru Nanak Technical Institutions
 
🚀 TDX Bengaluru 2025 Unwrapped: Key Highlights, Innovations & Trailblazer Tak...
🚀 TDX Bengaluru 2025 Unwrapped: Key Highlights, Innovations & Trailblazer Tak...🚀 TDX Bengaluru 2025 Unwrapped: Key Highlights, Innovations & Trailblazer Tak...
🚀 TDX Bengaluru 2025 Unwrapped: Key Highlights, Innovations & Trailblazer Tak...
SanjeetMishra29
 
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
PawachMetharattanara
 
Control Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptxControl Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptx
vvsasane
 
ATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ATAL 6 Days Online FDP Scheme Document 2025-26.pdfATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ssuserda39791
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdfIBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
IBAAS 2023 Series_Lecture 8- Dr. Nandi.pdf
VigneshPalaniappanM
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt
rakshaiya16
 
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdfML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
rameshwarchintamani
 
Jacob Murphy Australia - Excels In Optimizing Software Applications
Jacob Murphy Australia - Excels In Optimizing Software ApplicationsJacob Murphy Australia - Excels In Optimizing Software Applications
Jacob Murphy Australia - Excels In Optimizing Software Applications
Jacob Murphy Australia
 
Personal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.pptPersonal Protective Efsgfgsffquipment.ppt
Personal Protective Efsgfgsffquipment.ppt
ganjangbegu579
 
Ad

Query Evaluation Techniques for Large Databases.pdf

CONTENTS

INTRODUCTION
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
2. SORTING AND HASHING
   2.1 Sorting
   2.2 Hashing
3. DISK ACCESS
   3.1 File Scans
   3.2 Associative Access Using Indices
   3.3 Buffer Management
4. AGGREGATION AND DUPLICATE REMOVAL
   4.1 Aggregation Algorithms Based on Nested Loops
   4.2 Aggregation Algorithms Based on Sorting
   4.3 Aggregation Algorithms Based on Hashing
   4.4 A Rough Performance Comparison
   4.5 Additional Remarks on Aggregation
5. BINARY MATCHING OPERATIONS
   5.1 Nested-Loops Join Algorithms
   5.2 Merge-Join Algorithms
   5.3 Hash Join Algorithms
   5.4 Pointer-Based Joins
   5.5 A Rough Performance Comparison
6. UNIVERSAL QUANTIFICATION
7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
8. EXECUTION OF COMPLEX QUERY PLANS
9. MECHANISMS FOR PARALLEL QUERY EXECUTION
   9.1 Parallel versus Distributed Database Systems
   9.2 Forms of Parallelism
   9.3 Implementation Strategies
   9.4 Load Balancing and Skew
   9.5 Architectures and Architecture Independence
10. PARALLEL ALGORITHMS
   10.1 Parallel Selections and Updates
   10.2 Parallel Sorting
   10.3 Parallel Aggregation and Duplicate Removal
   10.4 Parallel Joins and Other Binary Matching Operations
   10.5 Parallel Universal Quantification
11. NONSTANDARD QUERY PROCESSING ALGORITHMS
   11.1 Nested Relations
   11.2 Temporal and Scientific Database Management
   11.3 Object-oriented Database Systems
   11.4 More Control Operators
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
   12.1 Precomputation and Derived Data
   12.2 Data Compression
   12.3 Surrogate Processing
   12.4 Bit Vector Filtering
   12.5 Specialized Hardware
SUMMARY AND OUTLOOK
ACKNOWLEDGMENTS
REFERENCES

Second, data volumes might be so large or complex that the real or perceived performance advantage of file systems is considered more important than all other criteria, e.g., the higher levels of abstraction and programmer productivity typically achieved with database management systems. Thus, object-oriented database management systems that are designed for nontraditional database application domains and extensible database management system toolkits that support a variety of data models must provide excellent performance to meet the challenges of very large data volumes, and techniques for manipulating large data sets will find renewed and increased interest in the database community.

The purpose of this paper is to survey efficient algorithms and software architectures of database query execution engines for executing complex queries over large databases. A "complex" query is one that requires a number of query-processing algorithms to work together, and a "large" database uses files with sizes from several megabytes to many terabytes, which are typical for database applications at present and in the near future [Dozier 1992; Silberschatz et al. 1991]. This survey discusses a large variety of query execution techniques that must be considered when designing and implementing the query execution module of a new database management system: algorithms and their execution costs, sorting versus hashing, parallelism, resource allocation and scheduling issues in complex queries, special operations for emerging database application domains such as statistical and scientific databases, and general performance-enhancing techniques such as precomputation and compression.
While many, although not all, techniques discussed in this paper have been developed in the context of relational database systems, most of them are applicable to and useful in the query processing facility for any database management system and any data model, provided the data model permits queries over "bulk" data types such as sets and lists.
Figure 1. Query processing in a database system: user interface, database query language, query optimizer, query execution engine, files and indices, I/O buffer, disk.

It is assumed that the reader possesses basic textbook knowledge of database query languages, in particular of relational algebra, and of file systems, including some basic knowledge of index structures. As shown in Figure 1, query processing fills the gap between database query languages and file systems. It can be divided into query optimization and query execution. A query optimizer translates a query expressed in a high-level query language into a sequence of operations that are implemented in the query execution engine or the file system. The goal of query optimization is to find a query evaluation plan that minimizes the most relevant performance measure, which can be the database user's wait for the first or last result item; CPU, I/O, and network time and effort (time and effort can differ due to parallelism); memory costs (as maximum allocation or as time-space product); total resource usage; even energy consumption (e.g., for battery-powered laptop systems or spacecraft); a combination of the above; or some other performance measure. Query optimization is a special form of planning, employing techniques from artificial intelligence such as plan representation, search including directed search and pruning, dynamic programming, branch-and-bound algorithms, etc. The query execution engine is a collection of query execution operators and mechanisms for operator communication and synchronization; it employs concepts from algorithm design, operating systems, networks, and parallel and distributed computation. The facilities of the query execution engine define the space of possible plans that can be chosen by the query optimizer.

A general outline of the steps required for processing a database query is shown in Figure 2. Of course, this sequence is only a general guideline, and different database systems may use different steps or merge multiple steps into one. After a query or request has been entered into the database system, be it interactively or by an application program, the query is parsed into an internal form. Next, the query is validated against the metadata (data about the data, also called schema or catalogs) to ensure that the query contains only valid references to existing database objects. If the database system provides a macro facility such as relational views, referenced macros and views are expanded into the query [Stonebraker 1975]. Integrity constraints might be expressed as views (externally or internally) and would also be integrated into the query at this point in most systems [Motro 1989]. The query optimizer then maps the expanded query expression into an optimized plan that operates directly on the stored database objects. This mapping process can be very complex and might require substantial search and cost estimation effort. (Optimization is not discussed in this paper; a survey can be found in Jarke and Koch [1984].) The optimizer's output is called a query execution plan, query evaluation plan, QEP, or simply plan. Using a simple tree traversal algorithm, this plan is translated into a representation ready for execution by the database's query execution engine; the result of this translation can be compiled machine code or a semicompiled or interpreted language or data structure.
This survey discusses only read-only queries explicitly; however, most of the techniques are also applicable to update requests. In most database management systems, update requests may include a search predicate to determine which database objects are to be modified. Standard query optimization and execution techniques apply to this search; the actual update procedure can be either applied in a second phase, a method called deferred updates, or merged into the search phase if there is no danger of creating ambiguous update semantics.¹
¹ A standard example for this danger is the "Halloween" problem: Consider the request to "give all employees with salaries greater than $30,000 a 3% raise." If (i) these employees are found using an index on salaries, (ii) index entries are scanned in increasing salary order, and (iii) the index is updated immediately as index entries are found, then each qualifying employee will get an infinite number of raises.

Figure 2. Query processing steps: parsing, query validation, view resolution, optimization, plan compilation, execution.

The problem of ensuring ACID semantics for updates, i.e., making updates Atomic (all-or-nothing semantics), Consistent (translating any consistent database state into another consistent database state), Isolated (from other queries and requests), and Durable (persistent across all failures), is beyond the scope of this paper; suitable techniques have been described by many other authors, e.g., Bernstein and Goodman [1981], Bernstein et al. [1987], Gray and Reuter [1991], and Haerder and Reuter [1983].

Most research into providing ACID semantics focuses on efficient techniques for processing very large numbers of relatively small requests. For example, increasing the balance of one account and decreasing the balance of another account require exclusive access to only two database records and writing some information to an update log. Current research and development efforts in transaction processing target hundreds and even thousands of small transactions per second [Davis 1992; Serlin 1991]. Query processing, on the other hand, focuses on extracting information from a large amount of data without actually changing the database. For example, printing reports for each branch office with average salaries of employees under 30 years old requires shared access to a large number of records. Mixed requests are also possible, e.g., for crediting monthly earnings to a stock account by combining information about a number of sales transactions. The techniques discussed here apply to the search effort for such a mixed request, e.g., for finding the relevant sales transactions for each stock account.

Embedded queries, i.e., database queries that are contained in an application program written in a standard programming language such as Cobol, PL/1, C, or Fortran, are also not addressed specifically in this paper because all techniques discussed here can be used for interactive as well as embedded queries. Embedded queries usually are optimized when the program is compiled, in order to avoid the optimization overhead when the program runs. This method was pioneered in System R, including mechanisms for storing optimized plans and invalidating stored plans when they become infeasible, e.g., when an index is dropped from the database [Chamberlin et al. 1981b]. Of course, the cut between compile-time and run-time can be placed at any other point in the sequence in Figure 2.

Recursive queries are omitted from this survey, because the entire field of recursive query processing (optimization rules and heuristics, selectivity and cost
estimation, algorithms and their parallelization) is still developing rapidly (suffice it to point to two recent surveys by Bancilhon and Ramakrishnan [1986] and Cacace et al. [1993]).

The present paper surveys query execution techniques; other surveys that pertain to the wide subject of database systems have considered data models and query languages [Gallaire et al. 1984; Hull and King 1987; Jarke and Vassiliou 1985; McKenzie and Snodgrass 1991; Peckham and Maryanski 1988], access methods [Comer 1979; Enbody and Du 1988; Faloutsos 1985; Samet 1984; Sockut and Goldberg 1979], compression techniques [Bell et al. 1989; Lelewer and Hirschberg 1987], distributed and heterogeneous systems [Batini et al. 1986; Litwin et al. 1990; Sheth and Larson 1990; Thomas et al. 1990], concurrency control and recovery [Barghouti and Kaiser 1991; Bernstein and Goodman 1981; Gray et al. 1981; Haerder and Reuter 1983; Knapp 1987], availability and reliability [Davidson et al. 1985; Kim 1984], query optimization [Jarke and Koch 1984; Mannino et al. 1988; Yu and Chang 1984], and a variety of other database-related topics [Adam and Wortmann 1989; Atkinson and Buneman 1987; Katz 1990; Kemper and Wallrath 1987; Lyytinen 1987; Teorey et al. 1986]. Bitton et al. [1984] have discussed a number of parallel-sorting techniques, only a few of which are really used in database systems. Mishra and Eich's [1992] recent survey of relational join algorithms compares their behavior using diagrams derived from one by Kitsuregawa et al. [1983] and also describes join methods using index structures and join methods for distributed systems. The present survey is much broader in scope as it also considers system architectures for complex query plans and for parallel execution, selection and aggregation algorithms, the relationship of sorting and hashing as it pertains to database query processing, special operations for nontraditional data models, and auxiliary techniques such as compression.

Section 1 discusses the architecture of query execution engines. Sorting and hashing, the two general approaches to managing and matching elements of large sets, are described in Section 2. Section 3 focuses on accessing large data sets on disk. Section 4 begins the discussion of actual data manipulation methods with algorithms for aggregation and duplicate removal, continued in Section 5 with binary matching operations such as join and intersection and in Section 6 with operations for universal quantification. Section 7 reviews the many dualities between sorting and hashing and points out their differences that have an impact on the performance of algorithms based on either one of these approaches. Execution of very complex query plans with many operators and with nontrivial plan shapes is discussed in Section 8. Section 9 is devoted to mechanisms for parallel execution, including architectural issues and load balancing, and Section 10 discusses specific parallel algorithms. Section 11 outlines some nonstandard operators for emerging database applications such as statistical and scientific database management systems. Section 12 is a potpourri of additional techniques that enhance the performance of many algorithms, e.g., compression, precomputation, and specialized hardware. The final section contains a brief summary and an outlook on query processing research and its future. For readers who are more interested in some topics than others,
most sections are fairly self-contained. Moreover, the hurried reader may want to skip the derivations of cost functions²; their results and effects are summarized later in diagrams.

² In any case, our cost functions cover only a limited, though important, aspect of query execution cost, namely I/O effort.

1. ARCHITECTURE OF QUERY EXECUTION ENGINES

This survey focuses on useful mechanisms for processing sets of items. These items can be records, tuples, entities, or objects.
Furthermore, most of the techniques discussed in this survey apply to sequences, not only sets, of items, although most query processing research has assumed relations and sets. All query processing algorithm implementations iterate over the members of their input sets; thus, sets are always represented by sequences. Sequences can be used to represent not only sets but also other one-dimensional "bulk" types such as lists, arrays, and time series, and many database query processing algorithms and techniques can be used to manipulate these other bulk types as well as sets. The important point is to think of these algorithms as algebra operators consuming zero or more inputs (sets or sequences) and producing one (or sometimes more) outputs. A complete query execution engine consists of a collection of operators and mechanisms to execute complex expressions using multiple operators, including multiple occurrences of the same operator. Taken as a whole, the query processing algorithms form an algebra which we call the physical algebra of a database system.

The physical algebra is equivalent to, but quite different from, the logical algebra of the data model or the database system. The logical algebra is more closely related to the data model and defines what queries can be expressed in the data model; for example, the relational algebra is a logical algebra. A physical algebra, on the other hand, is system specific. Different systems may implement the same data model and the same logical algebra but may use very different physical algebras. For example, while one relational system may use only nested-loops joins, another system may provide both nested-loops join and merge-join, while a third one may rely entirely on hash join algorithms. (Join algorithms are discussed in detail later in the section on binary matching operators and algorithms.)

Another significant difference between logical and physical algebras is the fact that specific algorithms and therefore cost functions are associated only with physical operators, not with logical algebra operators. Because of the lack of an algorithm specification, a logical algebra expression is not directly executable and must be mapped into a physical algebra expression. For example, it is impossible to determine the execution time for the left expression in Figure 3, i.e., a logical algebra expression, without mapping it first into a physical algebra expression such as the query evaluation plan on the right of Figure 3. This mapping process can be trivial in some database systems but usually is fairly complex in real database systems because it involves algorithm choices and because logical and physical operators frequently do not map directly into one another, as shown in the following four examples. First, some operators in the physical algebra may implement multiple logical operators. For example, all serious implementations of relational join algorithms include a facility to output fewer than all attributes, i.e., a relational delta-project (a projection without duplicate removal) is included in the physical join operator. Second, some physical operators implement only part of a logical operator. For example, a duplicate removal algorithm implements only the "second half" of a relational projection operator. Third, some physical operators do not exist in the logical algebra. Concretely, a sort operator has no place in pure relational algebra because it is an algebra of sets,
and sets are, by their definition, unordered. Finally, some properties that hold for logical operators do not hold, or hold only with some qualifications, for their counterparts in the physical algebra. For example, while intersection and union are entirely symmetric and commutative, algorithms implementing them (e.g., nested loops or hybrid hash join) do not treat their two inputs equally.

The difference of logical and physical algebras can also be looked at in a different way. Any database system raises the level of abstraction above files and records; to do so, there are some logical type constructors such as tuple, relation, set, list, array, pointer, etc.
Figure 3. Logical and physical algebra expressions: on the left, the logical expression Intersection of Set A and Set B; on the right, a physical plan, Merge-Join (Intersect), whose two inputs are sorts over File Scan A and File Scan B.

Each logical type constructor is complemented by some operations that are permitted on instances of such types, e.g., attribute extraction, selection, insertion, deletion, etc. On the physical or representation level, there is typically a smaller set of representation types and structures, e.g., file, record, record identifier (RID), and maybe very large byte arrays [Carey et al. 1986]. For manipulation, the representation types have their own operations, which will be different from the operations on logical types. Multiple logical types and type constructors can be mapped to the same physical concept. There may also be situations in which one logical type constructor can be mapped to multiple physical concepts, e.g., a set depending on its size. The mapping from logical types to physical representation types and structures is called physical database design. Query optimization is the mapping from logical to physical operations, and the query execution engine is the implementation of operations on physical representation types and of mechanisms for coordination and cooperation among multiple such operations in complex queries. The policies for using these mechanisms are part of the query optimizer.

Synchronization and data transfer between operators is the main issue to be addressed in the architecture of the query execution engine. Imagine a query with two joins, and consider how the result of the first join is passed to the second one. The simplest method is to create (write) and read a temporary file. The need for temporary files, whether they are kept in the buffer or not, is a direct result of executing an operator's input subplans completely before starting the operator. Alternatively, it is possible to create one process for each operator and then to use interprocess communication mechanisms (e.g., pipes) to transfer data between operators, leaving it to the operating system to schedule and suspend operator processes as pipes are full or empty. While such data-driven execution removes the need for temporary disk files, it introduces another cost, that of operating system scheduling and interprocess communication. In order to avoid both temporary files and operating system scheduling, Freytag and Goodman [1989] proposed writing rule-based translation programs that transform a plan represented as a tree structure into a single iterative program with nested loops and other control structures. However, the required rule set is not simple, in particular for algorithms with complex control logic such as sorting, merge-join, or even hybrid hash join (to be discussed later in the section on matching).

The most practical alternative is to implement all operators in such a way that they schedule each other within a single operating system process. The basic idea is to define a granule, typically a single record, and to iterate over all granules comprising an intermediate query result.³ Each time an operator needs another granule, it calls its input (operator) to produce one.

³ It is possible to use multiple granule sizes within a single query-processing system and to provide special operators with the sole purpose of translating from one granule size to another.
An example is a query processing system that uses records as an iteration granule except for the inputs of merge-join (see later in the section on binary matching), for which it uses "value packets," i.e., groups of records with equal join attribute values.
This call is a simple procedure call, much cheaper than interprocess communication since it does not involve the operating system. The calling operator waits (just as any calling routine waits) until the input operator has produced an item. That input operator, in a complex query plan, might require an item from its own input to produce an item; in that case, it calls its own input (operator) to produce one. Two important features of operators implemented in this way are that they can be combined into arbitrarily complex query evaluation plans and that any number of operators can execute and schedule each other in a single process without assistance from or interaction with the underlying operating system. This model of operator implementation and scheduling resembles very closely those used in relational systems, e.g., System R (and later SQL/DS and DB2), Ingres, Informix, and Oracle, as well as in experimental systems, e.g., the E programming language used in EXODUS [Richardson and Carey 1987], Genesis [Batory et al. 1988a; 1988b], and Starburst [Haas et al. 1989; 1990]. Operators implemented in this model are called iterators, streams, synchronous pipelines, row-sources, or similar names in the "lingo" of commercial systems.

To make the implementation of operators a little easier, it makes sense to separate the functions (a) to prepare an operator for producing data, (b) to produce an item, and (c) to perform final housekeeping. In a file scan, these functions are called open, next, and close procedures; we adopt these names for all operators. Table 1 gives a rough idea of what the open, next, and close procedures for some operators do, as well as the principal local state that needs to be saved from one invocation to the next. (Later sections will discuss sort and join operations in detail.) The first three examples are trivial, but the hash join operator shows how an operator can schedule its inputs in a nontrivial manner. The interesting observations are that (i) the entire query plan is executed within a single process, (ii) operators produce one item at a time on request, (iii) this model effectively implements, within a single process, (special-purpose) coroutines and demand-driven data flow, (iv) items never wait in a temporary file or buffer between operators because they are never produced before they are needed, (v) therefore this model is very efficient in its time-space-product memory costs, (vi) iterators can schedule any tree, including bushy trees (see below), and (vii) no operator is affected by the complexity of the whole plan, i.e., this model of operator implementation and synchronization works for simple as well as very complex query plans. As a final remark, there are effective ways to combine the iterator model with parallel query processing, as will be discussed in Section 9.
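This demand-driven protocol can be sketched in C, the implementation language used by Volcano itself. The sketch below is illustrative only, not Volcano's actual code; the names iterator and run_plan and the out parameter are inventions of this example.

    typedef struct iterator iterator;
    struct iterator {
        void (*open)(iterator *self);             /* prepare to produce data    */
        int  (*next)(iterator *self, void *out);  /* produce one item; 0 at end */
        void (*close)(iterator *self);            /* final housekeeping         */
        void *state;                              /* operator-specific state    */
    };

    /* The consumer drives the entire plan; every operator, however deeply
       nested, runs inside this single process, and each next() call may in
       turn call next() on the operator's own inputs. */
    void run_plan(iterator *root, void *item_buffer)
    {
        root->open(root);
        while (root->next(root, item_buffer))
            ;  /* a real consumer would do something with each item here */
        root->close(root);
    }

Because every operator presents the same three procedures, any operator can serve as input to any other, which is exactly what allows arbitrarily complex plans to be scheduled without assistance from the operating system.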
Since query plans are algebra expressions, they can be represented as trees. Query plans can be divided into prototypical shapes, and query execution engines can be divided into groups according to which shapes of plans they can evaluate. Figure 4 shows prototypical left-deep, right-deep, and bushy plans for a join of four inputs. Left-deep and right-deep plans are different because join algorithms use their two inputs in different ways; for example, in the nested-loops join algorithm, the outer loop iterates over one input (usually drawn as left input) while the inner loop iterates over the other input. The set of bushy plans is the most general as it includes the sets of both left-deep and right-deep plans. These names are taken from Graefe and DeWitt [1987]; left-deep plans are also called "linear processing trees" [Krishnamurthy et al. 1986] or "plans with no composite inner" [Ono and Lohman 1990].

For queries with common subexpressions, the query evaluation plan is not a tree but an acyclic directed graph (DAG). Most systems that identify and exploit common subexpressions execute the plan equivalent to a common subexpression separately, saving the intermediate result in a temporary file to be scanned repeatedly and destroyed after the last scan.
Table 1. Examples of Iterator Functions

Iterator: Print
  Open:  open input
  Next:  call next on input; format the item on screen
  Close: close input
  Local state: (none)

Iterator: Scan
  Open:  open file
  Next:  read next item
  Close: close file
  Local state: open file descriptor

Iterator: Select
  Open:  open input
  Next:  call next on input until an item qualifies
  Close: close input
  Local state: (none)

Iterator: Hash join (without overflow resolution)
  Open:  allocate hash directory; open left "build" input; build hash table calling next on build input; close build input; open right "probe" input
  Next:  call next on probe input until a match is found
  Close: close probe input; deallocate hash directory
  Local state: hash directory

Iterator: Merge-Join (without duplicates)
  Open:  open both inputs
  Next:  get next item from input with smaller key until a match is found
  Close: close both inputs
  Local state: (none)

Iterator: Sort
  Open:  open input; build all initial run files calling next on input; close input; merge run files until only one merge step is left
  Next:  determine next output item; read new item from the correct run file
  Close: destroy remaining run files
  Local state: merge heap, open file descriptors for run files

Figure 4. Left-deep, bushy, and right-deep plans for joining four inputs A, B, C, and D.

Each plan fragment that is executed as a unit is indeed a tree. The alternative is a "split" iterator that can deliver data to multiple consumers, i.e., that can be invoked as iterator by multiple consumer iterators. The split iterator paces its input subtree as fast as the fastest consumer requires it and holds items until the slowest consumer has consumed them. If the consumers request data at about the same rate, the split operator does not require a temporary spool file; such a file and its associated I/O cost are required only if the data rate required by the consumers diverges above some predefined threshold.

Among the implementations of iterators for query processing, one group can be called "stored-set oriented" and the other "algebra oriented." In System R, an example for the first group, complex join plans are constructed using binary join iterators that "attach" one more set (stored relation) to an existing intermediate result [Astrahan et al. 1976; Lorie and Nilsson 1979], a design that supports only left-deep plans. This design led to a significant simplification of the System R optimizer, which could be based on dynamic programming techniques, but it ignores the optimal plan for some queries [Selinger et al. 1979].⁴

⁴ Since each operator in such a query execution system will access a permanent relation, the name "access path selection" used for System R optimization, although including and actually focusing on join optimization, was entirely correct and more descriptive than "query optimization."
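To see how an operator can schedule its inputs in a nontrivial manner, the hash join row of Table 1 can be spelled out in the same illustrative C style, reusing the iterator structure sketched above. The hash_table and record types and the hash_* helpers are assumptions of this sketch; it performs no overflow resolution, and for brevity each probe record yields at most one match.

    /* Assumed helpers, not from the survey: */
    typedef struct hash_table hash_table;
    typedef struct { int key; /* other fields elided */ } record;
    extern hash_table *hash_create(void);
    extern void hash_insert(hash_table *ht, const record *r);
    extern int  hash_probe(hash_table *ht, const record *r, void *match_out);
    extern void hash_destroy(hash_table *ht);

    typedef struct {
        iterator   *build, *probe;  /* left and right inputs       */
        hash_table *directory;      /* local state: hash directory */
    } hash_join_state;

    void hash_join_open(iterator *self)
    {
        hash_join_state *s = self->state;
        record r;
        s->directory = hash_create();         /* allocate hash directory    */
        s->build->open(s->build);             /* consume the entire build   */
        while (s->build->next(s->build, &r))  /* input into the hash table  */
            hash_insert(s->directory, &r);
        s->build->close(s->build);
        s->probe->open(s->probe);             /* then open the probe input  */
    }

    int hash_join_next(iterator *self, void *out)
    {
        hash_join_state *s = self->state;
        record r;
        while (s->probe->next(s->probe, &r))  /* call next on probe input   */
            if (hash_probe(s->directory, &r, out))
                return 1;                     /* until a match is found     */
        return 0;
    }

    void hash_join_close(iterator *self)
    {
        hash_join_state *s = self->state;
        s->probe->close(s->probe);            /* close probe input and      */
        hash_destroy(s->directory);           /* deallocate the directory   */
    }

Note how open consumes one input completely before the first item is ever requested from the other, an asymmetry invisible in the logical algebra.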
A similar design was used, although not strictly required by the design of the execution engine, in the Gamma database machine [DeWitt et al. 1986; 1990; Gerber 1986]. On the other hand, some systems use binary operators for which both inputs can be intermediate results, i.e., the output of arbitrarily complex subplans. This design is more general as it also permits bushy plans. Examples for this approach are the second query processing engine of Ingres based on Kooi's thesis [Kooi 1980; Kooi and Frankforth 1982], the Starburst execution engine [Haas et al. 1989], and the Volcano query execution engine [Graefe 1993b]. The tradeoff between left-deep and bushy query evaluation plans is reduction of the search space in the query optimizer against generality of the execution engine and efficiency for some queries. Right-deep plans have only recently received more interest and may actually turn out to be very efficient, in particular in systems with ample memory [Schneider 1990; Schneider and DeWitt 1990].

The remainder of this section provides more details of how iterators are implemented in the Volcano extensible query processing system. We use this system repeatedly as an example in this survey because it provides a large variety of mechanisms for database query processing, but mostly because its model of operator implementation and scheduling resembles very closely those used in many relational and extensible systems. The purpose of this section is to provide implementation concepts from which a new query processing engine could be derived.

Figure 5 shows how iterators are represented in Volcano. A box stands for a record structure in Volcano's implementation language (C [Kernighan and Ritchie 1978]), and an arrow represents a pointer. Each operator in a query evaluation plan consists of two record structures, a small structure of four pointers and a state record. The small structure is the same for all algorithms. It represents the stream or iterator abstraction and can be invoked with the open, next, and close procedures. The purpose of state records is similar to that of activation records allocated by compiler-generated code upon entry into a procedure. Both hold values local to the procedure or the iterator. Their main difference is that activation records reside on the stack and vanish upon procedure exit, while state records must persist from one invocation of the iterator to the next, e.g., from the invocation of open to each invocation of next and the invocation of close. Thus, state records do not reside on the stack but in heap space. The type of state records is different for each iterator as it contains iterator-specific arguments and local variables (state) while the iterator is suspended, e.g., currently not active between invocations of the operator's next procedure. Query plan nodes are linked together by means of input pointers, which are also kept in the state records.
Since pointers to functions are used extensively in this design, all operator code (i.e., the open, next, and close procedures) can be written in such a way that the names of input operators and their iterator procedures are not "hard-wired" into the code, and the operator modules do not need to be recompiled for each query. Furthermore, all operations on individual items, e.g., printing, are imported into Volcano operators as functions, making the operators independent of the semantics and representation of items in the data streams they are processing. This organization using function pointers for input operators is fairly standard in commercial database management systems.
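A minimal sketch of this arrangement, again with invented names rather than Volcano's actual identifiers: the filter's state record carries the input pointer and the imported item functions, so the compiled filter code is independent of both its input operator and the representation of the items it processes.

    typedef struct {
        iterator *input;                      /* plan wiring, not hard-wired */
        int  (*predicate)(const void *item);  /* imported item function      */
        void (*print)(const void *item);      /* imported item function      */
    } filter_state;

    int filter_next(iterator *self, void *out)
    {
        filter_state *s = self->state;
        /* pace the input: pull items only as fast as the filter needs them */
        while (s->input->next(s->input, out))
            if (s->predicate(out)) {
                s->print(out);                /* e.g., format item on screen */
                return 1;
            }
        return 0;                             /* input exhausted */
    }

Pointing the input field at a file-scan structure wires up the plan of Figure 5; neither module needs recompiling for a new query.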
Figure 5. Two operators in a Volcano query plan: a filter operator on top of a file scan operator, each represented by a small structure of pointers to its open, next, and close procedures together with a state record holding arguments (e.g., a print or predicate function), an input pointer, and local state.

In order to make this discussion more concrete, Figure 5 shows two operators in a query evaluation plan that prints selected records from a file. The purpose and capabilities of the filter operator in Volcano include printing items of a stream using a print function passed to the filter operator as one of its arguments. The small structure at the top gives access to the filter operator's iterator functions (the open, next, and close procedures) as well as to its state record. Using a pointer to this structure, the open, next, and close procedures of the filter operator can be invoked, and their local state can be passed to them as a procedure argument. The filter's iterator functions themselves, e.g., open-filter, can use the input pointer contained in the state record to invoke the input operator's functions, e.g., open-file-scan. Thus, the filter functions can invoke the file scan functions as needed and can pace the file scan according to the needs of the filter.

In this section, we have discussed general physical algebra issues and synchronization and data transfer between operators. Iterators are relatively straightforward to implement and are suitable building blocks for efficient, extensible query processing engines. In the following sections, we consider individual operators and algorithms including a comparison of sorting and hashing, detailed treatment of parallelism, special operators for emerging database applications such as scientific databases, and auxiliary techniques such as precomputation and compression.

2. SORTING AND HASHING

Before discussing specific algorithms, two general approaches to managing sets of data are introduced. The purpose of many query-processing algorithms is to perform some kind of matching, i.e., bringing items that are "alike" together and performing some operation on them. There are two basic approaches used for this purpose, sorting and hashing. This pair permeates many aspects of query processing, from indexing and clustering over aggregation and join algorithms to methods for parallelizing database operations. Therefore, we discuss these approaches first in general terms, without regard to specific algorithms. After a survey of specific algorithms for unary (aggregation, duplicate removal) and binary (join, semi-join, intersection, division, etc.) matching problems in the following sections, the duality of sort- and hash-based algorithms is discussed in detail.

2.1 Sorting

Sorting is used very frequently in database systems, both for presentation to the user in sorted reports or listings and for query processing in sort-based algorithms such as merge-join. Therefore, the performance effects of the many algorithmic tricks and variants of external sorting deserve detailed discussion in this survey. All sorting algorithms actually used in database systems use
merging, i.e., the input data are written into initial sorted runs and then merged into larger and larger runs until only one run is left, the sorted output. Only in the unusual case that a data set is smaller than the available memory can in-memory techniques such as quicksort be used. An excellent reference for many of the issues discussed here is Knuth [1973], who analyzes algorithms much more accurately than we do in this introductory survey.

In order to ensure that the sort module interfaces well with the other operators, e.g., file scan or merge-join, sorting should be implemented as an iterator, i.e., with open, next, and close procedures as all other operators of the physical algebra. In the Volcano query-processing system (which is based on iterators), most of the sort work is done during open-sort [Graefe 1990a; 1993b]. This procedure consumes the entire input and leaves appropriate data structures for next-sort to produce the final sorted output. If the entire input fits into the sort space in main memory, open-sort leaves a sorted array of pointers to records in I/O buffer memory which is used by next-sort to produce the records in sorted order. If the input is larger than main memory, the open-sort procedure creates sorted runs and merges them until only one final merge phase is left. The last merge step is performed in the next-sort procedure, i.e., when demanded by the consumer of the sorted stream, e.g., a merge-join. The input to the sort module must be an iterator, and sort uses open, next, and close procedures to request its input; therefore, sort input can come from a scan or a complex query plan, and the sort operator can be inserted into a query plan at any place or at several places.

Table 2, which summarizes a taxonomy of parallel sort algorithms [Graefe 1990a], indicates some main characteristics of database sort algorithms. The first few items apply to any database sort and will be discussed in this section. The questions pertaining to parallel inputs and outputs and to data exchange will be considered in a later section on parallel algorithms, and the last question regarding substitute sorts will be touched on in the section on surrogate processing.

All sort algorithms try to exploit the duality between main-memory mergesort and quicksort. Both of these algorithms are recursive divide-and-conquer algorithms. The difference is that mergesort first divides physically and then merges on logical keys, whereas quicksort first divides on logical keys and then combines physically by trivially concatenating sorted subarrays. In general, one of the two phases (dividing and combining) is based on logical keys, whereas the other arranges data items only physically. We call these the logical and the physical phases. Sorting algorithms for very large data sets stored on disk or tape are also based on dividing and combining. Usually, there are two distinct subalgorithms, one for sorting within main memory and one for managing subsets of the data set on disk or tape. The choices for mapping logical and physical phases to dividing and combining steps are independent for these two subalgorithms. For practical reasons, e.g., ensuring that a run fits into main memory, the disk management algorithm typically uses physical dividing and logical combining (merging). A point of practical importance is the fan-in or degree of merging, but this is a parameter rather than a defining algorithm property.
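The division of labor between open-sort and next-sort might look as follows. This is a sketch under stated assumptions: the run_list and merge_heap types, the helper functions, and the constant MAX_FANIN (some maximal merge fan-in) are invented here, and the in-memory special case is omitted.

    /* Assumed helpers, not from the survey: */
    typedef struct run_list   run_list;
    typedef struct merge_heap merge_heap;
    extern run_list   *create_initial_runs(iterator *input);
    extern int         run_count(const run_list *runs);
    extern run_list   *merge_once(run_list *runs, int fanin);
    extern merge_heap *start_final_merge(run_list *runs);
    extern int         merge_next(merge_heap *m, void *out);
    #define MAX_FANIN 24

    typedef struct {
        iterator   *input;
        run_list   *runs;    /* run files still to be merged        */
        merge_heap *final;   /* heap for the final, on-demand merge */
    } sort_state;

    void sort_open(iterator *self)
    {
        sort_state *s = self->state;
        s->runs = create_initial_runs(s->input);      /* consume the whole  */
        while (run_count(s->runs) > MAX_FANIN)        /* input; then merge  */
            s->runs = merge_once(s->runs, MAX_FANIN); /* eagerly until one  */
        s->final = NULL;                              /* merge step is left */
    }

    int sort_next(iterator *self, void *out)
    {
        sort_state *s = self->state;
        if (s->final == NULL)                       /* the first demand from */
            s->final = start_final_merge(s->runs);  /* the consumer starts   */
        return merge_next(s->final, out);           /* the last merge step   */
    }

Deferring the last merge step to next-sort means the sorted stream is produced one item at a time, on demand, exactly as the iterator protocol requires.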
There are two alternative methods for creating initial runs, also called "level-0 runs" here. First, an in-memory sort algorithm can be used, typically quicksort. Using this method, each run will have the size of allocated memory, and the number of initial runs W will be W = ⌈R/M⌉ for input size R and memory size M. (Table 3 summarizes variables and their meaning in cost calculations in this survey.) Second, runs can be produced using replacement selection [Knuth 1973]. Replacement selection starts by filling memory with items which are organized into a priority heap, i.e., a data structure that efficiently supports the operations insert and remove-smallest.
Table 2. A Taxonomy of Database Sorting Algorithms

Determinant               Possible Options
Input division            Logical keys (partitioning) or physical division
Result combination        Logical keys (merging) or physical concatenation
Main-memory sort          Quicksort or replacement selection
Merging                   Eager or lazy or semi-eager; lazy and semi-eager with or without optimizations
Read-ahead                No read-ahead or double-buffering or read-ahead with forecasting
Input                     Single-stream or parallel
Output                    Single-stream or parallel
Number of data exchanges  One or multiple
Data exchange             Before or after local sort
Sort objects              Original records or key-RID pairs (substitute sort)

Next, the item with the smallest key is removed from the priority heap and written to a run file and then immediately replaced in the priority heap with another item from the input. With high probability, this new item has a key larger than the item just written and therefore will be included in the same run file. Notice that if this is the case, the first run file will be larger than memory. Now the second item (the currently smallest item in the priority heap) is written to the run file and is also replaced immediately in memory by another item from the input. This process repeats, always keeping the memory and the priority heap entirely filled. If a new item has a key smaller than the last key written, the new item cannot be included in the current run file and is marked for the next run file. In comparisons among items in the heap, items marked for the current run file are always considered "smaller" than items marked for the next run file. Eventually, all items in memory are marked for the next run file, at which point the current run file is closed, and a new one is created.

Using replacement selection, run files are typically larger than memory. If the input is already sorted or almost sorted, there will be only one run file. This situation could arise, for example, if a file is sorted on field A but should be sorted on A as major and B as the minor sort key. If the input is sorted in reverse order, which is the worst case, each run file will be exactly as large as memory. If the input is random, the average run file will be twice the size of memory, except the first few runs (which get the process started) and the last run. On the average, the expected number of runs is about W = ⌈R/(2 × M)⌉ + 1, i.e., about half as many runs as created with quicksort. A more detailed discussion and an analysis of replacement selection were provided by Knuth [1973].

An additional difference between quicksort and replacement selection is the resulting I/O pattern during run creation. Quicksort results in bursts of reads and writes for entire memory loads from the input file and to initial run files, while replacement selection alternates between individual read and write operations. If only a single device is used, quicksort may result in faster I/O because fewer disk arm movements are required. However, if different devices are used for input and temporary files, or if the input comes as a stream from another operator, the alternating behavior of replacement selection may permit more overlap of I/O and processing and therefore result in faster sorting.

The problem with replacement selection is memory management. If input items are kept in their original pages in the buffer (in order to save copying data, a real concern for large data volumes), each page must be kept in the buffer until its last record has been written to a run file. On the average, half a page's records will be in the priority heap. Thus, the priority heap must be reduced to half the size (the number of items in the heap is one half the number of records that fit into memory), canceling the advantage of longer and fewer run files.
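Run creation by replacement selection can be sketched as follows, reusing the iterator structure from Section 1 (the input is assumed already opened). The item type, the heap_* helpers (a priority heap ordered on the pair of run number and key), and the run-file routines are assumptions of this sketch; the run-number tag implements the rule that items marked for the next run file compare "larger" than all items marked for the current one.

    /* Assumed helpers, not from the survey: */
    typedef struct heap heap;
    typedef struct { int key; /* payload elided */ } item;
    extern heap *heap_create(int capacity);
    extern int   heap_size(const heap *h);
    extern void  heap_insert(heap *h, int run, const item *it);
    extern void  heap_remove_smallest(heap *h, int *run, item *it);
    extern int   key_less(const item *a, const item *b);
    extern void  write_to_run_file(int run, const item *it);
    extern void  close_run_file(int run);

    void replacement_selection(iterator *input, int mem_items)
    {
        heap *h = heap_create(mem_items);  /* ordered on (run number, key)  */
        int  current_run = 0;
        item it, last_written = {0};       /* assigned before its first use */

        /* fill memory first; everything is marked for run 0 */
        while (heap_size(h) < mem_items && input->next(input, &it))
            heap_insert(h, 0, &it);

        while (heap_size(h) > 0) {
            int run;
            heap_remove_smallest(h, &run, &it);
            if (run > current_run) {          /* every item left in memory  */
                close_run_file(current_run);  /* is marked for the next run */
                current_run = run;            /* file: start a new one      */
            }
            write_to_run_file(current_run, &it);
            last_written = it;
            if (input->next(input, &it))      /* replace immediately, keeping */
                heap_insert(h,                /* memory entirely filled; keys */
                            key_less(&it, &last_written)  /* smaller than the */
                                ? current_run + 1  /* last key written go to  */
                                : current_run,     /* the next run file       */
                            &it);
        }
        close_run_file(current_run);
    }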
An additional difference between quicksort and replacement selection is the resulting I/O pattern during run creation. Quicksort results in bursts of reads and writes for entire memory loads from the input file and to initial run files, while replacement selection alternates between individual read and write operations. If only a single device is used, quicksort may result in faster I/O because fewer disk arm movements are required. However, if different devices are used for input and temporary files, or if the input comes as a stream from another operator, the alternating behavior of replacement selection may permit more overlap of I/O and processing and therefore result in faster sorting.

The problem with replacement selection is memory management. If input items are kept in their original pages in the buffer (in order to save copying data, a real concern for large data volumes), each page must be kept in the buffer until its last record has been written to a run file. On average, half a page's records will be in the priority heap. Thus, the priority heap must be reduced to half the size (the number of items in the heap is one half the number of records that fit into memory), canceling the advantage of longer and fewer run files. The solution to this problem is to copy records into a holding space and to keep them there while they are in the priority heap and until they are written to a run file. If the input items are of varying sizes, memory management is more complex than for quicksort because a new item may not fit into the space vacated in the holding space by the last item written into a run file. Solutions to this problem will introduce memory management overhead and some amount of fragmentation, i.e., the size of runs will be less than twice the size of memory. Thus, the advantage of having fewer runs must be balanced with the different I/O pattern and the disadvantage of more complex memory management.
Table 3. Variables, Their Meaning, and Units

    Variable    Description                    Units
    M           Memory size                    pages
    R, S        Inputs or their sizes          pages
    C           Cluster or unit of I/O         pages
    F, K        Fan-in or fan-out              (none)
    W           Number of level-0 run files    (none)
    L           Number of merge levels         (none)

The level-0 runs are merged into level-1 runs, which are merged into level-2 runs, etc., to produce the sorted output. During merging, a certain amount of buffer memory must be dedicated to each input run and the merge output. We call the unit of I/O a cluster in this survey, which is a number of pages located contiguously on disk. We indicate the cluster size with C, which we measure in pages just like memory and input sizes. The number of I/O clusters that fit in memory is the quotient of memory size and cluster size. The maximal merge fan-in F, i.e., the number of runs that can be merged at one time, is this quotient minus one cluster for the output. Thus, F = ⌊M/C⌋ − 1. Since the sizes of runs grow by a factor F from level to level, the number of merge levels L, i.e., the number of times each item is written to a run file, is logarithmic with the input size, namely L = ⌈log_F(W)⌉.

There are four considerations that can improve the merge efficiency. The first two issues pertain to the scheduling of I/O operations. First, scans are faster if read-ahead and write-behind are used; therefore, double-buffering using two pages of memory per input run and two for the merge output might speed the merge process [Salzberg 1990; Salzberg et al. 1990]. The obvious disadvantage is that the fan-in is cut in half. However, instead of reserving 2 × F + 2 clusters, a predictive method called forecasting can be employed, in which the largest key in each input buffer is used to determine from which input run the next cluster will be read. Thus, the fan-in can be set to any number in the range ⌊M/(2 × C)⌋ − 2 ≤ F ≤ ⌊M/C⌋ − 1. One or two read-ahead buffers per input disk are sufficient, and F = ⌊M/C⌋ − 3 will be reasonable in most cases because it uses maximal fan-in with one forecasting input buffer and double-buffering for the merge output.

Second, if the operating system and the I/O hardware support them, using large cluster sizes for the run files is very beneficial. Larger cluster sizes will reduce the fan-in and therefore may increase the number of merge levels.5 However, each merging level is performed much faster because fewer I/O operations and disk seeks and latency delays are required.
Furthermore, if the unit of I/O is equal to a disk track, rotational latencies can be avoided entirely with a sufficiently smart disk controller.

5 In files storing permanent data, large clusters (units of I/O) containing many records may also create artificial buffer contention (if much more disk space is copied into the buffer than truly necessary for one record) and "false sharing" in environments with page (cluster) locks, i.e., artificial concurrency conflicts. Since run files in a sort operation are not shared but temporary, these problems do not exist in this context.
Usually, relatively small fan-ins with large cluster sizes are the optimal choice, even if the sort requires multiple merge levels [Graefe 1990a]. The precise tradeoff depends on disk seek, latency, and transfer times. It is interesting to note that the optimal cluster size and fan-in basically do not depend on the input size.

As a concrete example, consider sorting a file of R = 50 MB = 51,200 KB using M = 160 KB of memory. The number of runs created by quicksort will be W = ⌈51200/160⌉ = 320. Depending on the disk access and transfer times (e.g., 25 ms disk seek and latency, 2 ms transfer time for a page of 4 KB), C = 16 KB will typically be a good cluster size for fast merging. If one cluster is used for read-ahead and two for the merge output, the fan-in will be F = ⌊160/16⌋ − 3 = 7. The number of merge levels will be L = ⌈log_7(320)⌉ = 3. If a 16 KB I/O operation takes T = 33 ms, the total I/O time, including a factor of two for writing and reading at each merge level, for the entire sort will be 2 × L × ⌈R/C⌉ × T = 10.56 min.

An entirely different approach to determining optimal cluster sizes and the amount of memory allocated to forecasting and read-ahead is based on processing and I/O bandwidths and latencies. The cluster sizes should be set such that the I/O bandwidth matches the processing bandwidth of the CPU. Bandwidths for both I/O and CPU are measured here in records or bytes per unit time; instructions per unit time (MIPS) are irrelevant. It is interesting to note that the CPU's processing bandwidth is largely determined by how fast the CPU can assemble new pages, in other words, how fast the CPU can copy records within memory. This performance measure is usually ignored in modern CPU and cache designs geared towards high MIPS or MFLOPS numbers [Ousterhout 1990].

Tuning the sort based on bandwidth and latency proceeds in three steps. First, the cluster size is set such that the processing and I/O bandwidths are equal or very close to equal. If the sort is I/O bound, the cluster size is increased for less disk access overhead per record and therefore faster I/O; if the sort is CPU bound, the cluster size is decreased to slow the I/O in favor of a larger merge fan-in. Next, in order to ensure that the two processing components (I/O and CPU) never (or almost never) have to wait for one another, the amount of space dedicated to read-ahead is determined as the I/O time for one cluster multiplied by the processing bandwidth. Typically, this will result in one cluster of read-ahead space per disk used to store and read input runs into a merge. Of course, in order to make read-ahead effective, forecasting must be used. Finally, the same amount of buffer space is allocated for the merge output (access latency times bandwidth) to ensure that merge processing never has to wait for the completion of output I/O. It is an open issue whether these two alternative approaches to tuning cluster size and read-ahead space result in different allocations and sorting speeds or whether one of them is more effective than the other.
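The formulas of the concrete example can be bundled into a small calculator. This is a sketch only: it assumes quicksort-generated runs, one forecasting input buffer plus double-buffered merge output (hence F = ⌊M/C⌋ − 3), and the simple cost model of a fixed seek-plus-latency charge per operation plus a per-page transfer charge; all function and parameter names are made up for illustration.

    from math import ceil, log

    def external_sort_io_time(R_kb, M_kb, C_kb, seek_latency_ms, transfer_ms_per_4kb):
        """Estimate merge-phase I/O time (seconds) for an external merge sort."""
        W = ceil(R_kb / M_kb)                 # initial runs from quicksort
        F = M_kb // C_kb - 3                  # merge fan-in (forecasting + output)
        L = ceil(log(W, F))                   # merge levels
        clusters = ceil(R_kb / C_kb)          # clusters per pass over the data
        T = seek_latency_ms + transfer_ms_per_4kb * (C_kb / 4)   # ms per cluster I/O
        return 2 * L * clusters * T / 1000.0  # write + read at each level

    # Example from the text: R = 50 MB, M = 160 KB, C = 16 KB,
    # 25 ms seek + latency, 2 ms per 4 KB page => about 10.56 minutes.
    print(external_sort_io_time(51200, 160, 16, 25, 2) / 60)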
The third and fourth merging issues focus on using (and exploiting) the maximal fan-in as effectively and as often as possible. Both issues require adjusting the fan-in of the first merge step using the formula given below, either the first merge step of all merge steps or, in semi-eager merging [Graefe 1990a], the first merge step after the end of the input has been reached. This adjustment is used for only one merge step, called the initial merge here, not for an entire merge level.

The third issue to be considered is that the number of runs W is typically not a power of F; therefore, some merges proceed with fewer than F inputs, which creates the opportunity for some optimization. Instead of always merging runs of only one level together, the optimal strategy is to merge as many runs as possible using the smallest run files available. The only exception is the fan-in of the first merge, which is determined to ensure that all subsequent merges will use the full fan-in F.
Figure 6. Naive and optimized merging.

Let us explain this idea with the example shown in Figure 6. Consider a sort with a maximal fan-in F = 10 and an input file that requires W = 12 initial runs. Instead of merging only runs of the same level as shown in Figure 6, merging is delayed until the end of the input has been reached. In the first merge step, only 3 of the 12 runs are combined, and the result is then merged with the other 9 runs, as shown in Figure 6. The I/O cost (measured by the number of memory loads that must be written to any of the runs created) for the first strategy is 12 + 10 + 2 = 24, while for the second strategy it is 12 + 3 = 15. In other words, the first strategy requires 60% more I/O to temporary files than the second one. The general rule is to merge just the right number of runs after the end of the input file has been reached, and to always merge the smallest runs available for merging. More detailed examples are given in Graefe [1990a]. One consequence of this optimization is that the merge depth L, i.e., the number of run files a record is written to during the sort or the number of times a record is written to and read from disk, is not uniform for all records. Therefore, it makes sense to calculate an average merge depth (as required in cost estimation during query optimization), which may be a fraction. Of course, there are much more sophisticated merge optimizations, e.g., cascade and polyphase merges [Knuth 1973].

Fourth, since some operations require multiple sorted inputs, for example merge-join (to be discussed in the section on matching), and since sort output can be passed directly from the final merge into the next operation (as is natural when using iterators), memory must be divided among multiple final merges. Thus, the final fan-in f and the "normal" fan-in F should be specified separately in an actual sort implementation. Using a final fan-in of 1 also allows the sort operator to produce output into a very slow operator, e.g., a display operation that allows scrolling by a human user, without occupying a lot of buffer memory for merging input runs over an extended period of time.6

Considering the last two optimization options for merging, the following formula determines the fan-in of the first merge. Each merge with normal fan-in F will reduce the number of run files by F − 1 (removing F runs, creating one new one). The goal is to reduce the number of runs from W to f and then to 1 (the final output). Thus, the first merge should reduce the number of runs to f + k × (F − 1) for some integer k. In other words, the first merge should use a fan-in of F0 = ((W − f − 1) mod (F − 1)) + 2. In the example of Figure 6, (12 − 10 − 1) mod (10 − 1) + 2 results in a fan-in for the initial merge of F0 = 3. If the sort of Figure 6 were the input into a merge-join, and if a final fan-in of 5 were desired, the initial merge should proceed with a fan-in of F0 = (12 − 5 − 1) mod (10 − 1) + 2 = 8.

6 There is a similar case of resource sharing among the operators producing a sort's input and the run generation phase of the sort. We will come back to these issues later in the section on executing and scheduling complex queries and plans.
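The fan-in formula for the initial merge is easy to encode; the following sketch (with illustrative names) reproduces the two examples just given.

    def initial_merge_fanin(W, F, f=1):
        """Fan-in of the initial merge so that every later merge uses the
        full fan-in F and the final merge uses fan-in f, per the formula
        F0 = ((W - f - 1) mod (F - 1)) + 2 derived above."""
        return (W - f - 1) % (F - 1) + 2

    print(initial_merge_fanin(12, 10, 10))   # 3, the Figure 6 example
    print(initial_merge_fanin(12, 10, 5))    # 8, the merge-join example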
If multiple sort operations produce input data for a common consumer operator, e.g., a merge-join, the two final fan-ins should be set proportionally to the sizes of the two inputs. For example, if two merge-join inputs are 1 MB and 9 MB, and if 20 clusters are available for inputs into the two final merges, then 2 clusters should be allocated for the first and 18 clusters for the second input (1/9 = 2/18).

Sorting is sometimes criticized because it requires, unlike hybrid hashing (discussed in the next section), that the entire input be written to run files and then retrieved for merging. This difference has a particularly large effect for files only slightly larger than memory, e.g., 1.25 times the size of memory. Hybrid hashing determines dynamically how much of the input data truly must be written to temporary disk files. In the example, only slightly more than one quarter of the memory size must be written to temporary files on disk while the remainder of the file remains in memory. In sorting, the entire file (1.25 memory sizes) is written to one or two run files and then read for merging. Thus, sorting seems to require five times more I/O for temporary files in this example than hybrid hashing. However, this is not necessarily true. The simple trick is to write initial runs in decreasing (reverse) order. When the input is exhausted and merging in increasing order commences, buffer memory is still full of useful pages with small sort keys that can be merged immediately without I/O and that never have to be written to disk. The effect of writing runs in reverse order is comparable to that of hybrid hashing, i.e., it is particularly effective if the input is only slightly larger than the available memory.

To demonstrate the effect of cluster size optimization (the second of the four merging issues discussed above), we sorted 100,000 100-byte records, about 10 MB, with the Volcano query processing system, which includes all merge optimizations described above with the exception of read-ahead and forecasting. (This experiment and a similar one were described in more detail earlier [Graefe 1990a; Graefe et al. 1993].) We used a sort space of 40 pages (160 KB) within a 50-page (200 KB) I/O buffer, varying the cluster size from 1 page (4 KB) to 15 pages (60 KB). The initial run size was 1,600 records, for a total of 63 initial runs. We counted the number of I/O operations and the transferred pages for all run files and calculated the total I/O cost by charging 25 ms per I/O operation (for seek and rotational latency) and 2 ms for each transferred page (assuming a 2 MB/sec transfer rate). As can be seen in Table 4 and Figure 7, there is an optimal cluster size with minimal I/O cost. The curve is not as smooth as might have been expected from the approximate cost function because it reflects all real-system effects such as rounding (truncating) the fan-in if the cluster size is not an exact divisor of the memory size, the effectiveness of merge optimization varying for different fan-ins, and internal fragmentation in clusters. The detailed data in Table 4, however, reflect the trends that larger clusters and smaller fan-ins clearly increase the amount of data transferred but not the number of I/O operations (disk and latency time) until the fan-in has shrunk to very small values, e.g., 3. It is clearly suboptimal to always choose the smallest cluster size (1 page) in order to obtain the largest fan-in and fewest merge levels.
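The charging scheme just described amounts to a two-line cost model; the function below is an illustrative sketch that reproduces, for instance, the optimum row of Table 4 below (cluster size 10, with 1,490 operations and 14,900 transferred pages).

    def total_io_cost_sec(disk_ops, pages_transferred,
                          ms_per_op=25, ms_per_page=2):
        """Cost model used for Table 4: 25 ms seek/rotational latency per
        I/O operation plus 2 ms per transferred 4 KB page."""
        return (disk_ops * ms_per_op + pages_transferred * ms_per_page) / 1000

    print(total_io_cost_sec(1490, 14900))    # 67.05 sec, the minimum in Table 4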
Furthermore, it seems that the range of cluster sizes that result in near-optimal total I/O costs is fairly large; thus it is not as important to determine the exact value as it is to use a cluster size "in the right ballpark." The optimal fan-in is typically fairly small; however, it is not e (the base of the natural logarithm) or 3 as derived by Bratbergsengen [1984] under the (unrealistic) assumption that the cost of an I/O operation is independent of the amount of data being transferred.

Table 4. Effect of Cluster Size Optimizations

    Cluster Size    Fan-in    Average    Disk          Pages Transferred    Total I/O
    [x 4 KB]                  Depth      Operations    [x 4 KB]             Cost [sec]
    1               40        1.376      6874           6874                185.598
    2               20        1.728      4298           8596                124.642
    3               13        1.872      3176           9528                 98.456
    4               10        1.936      2406           9624                 79.398
    5                8        2.000      1984           9920                 69.440
    6                6        2.520      2132          12792                 78.884
    7                5        2.760      1980          13860                 77.220
    8                5        2.760      1718          13744                 70.438
    9                4        3.000      1732          15588                 74.476
    10               4        3.000      1490          14900                 67.050
    11               3        3.856      1798          19778                 84.506
    12               3        3.856      1686          20232                 82.614
    13               3        3.856      1628          21164                 83.028
    14               2        5.984      2182          30548                115.646
    15               2        5.984      2070          31050                113.850

Figure 7. Effect of cluster size optimizations (total I/O cost [sec] as a function of cluster size [x 4 KB]).

2.2 Hashing

For many matching tasks, hashing is an alternative to sorting. In general, when equality matching is required, hashing should be considered because the expected complexity of set algorithms based on hashing is O(N) rather than O(N log N) as for sorting. Of course, this makes intuitive sense if hashing is viewed as radix sorting on a virtual key [Knuth 1973].
Hash-based query processing algorithms use an in-memory hash table of database objects to perform their matching task. If the entire hash table (including all records or items) fits into memory, hash-based query processing algorithms are very easy to design, understand, and implement, and they outperform sort-based alternatives. Note that for binary matching operations, such as join or intersection, only one of the two inputs must fit into memory. However, if the required hash table is larger than memory, hash table overflow occurs and must be dealt with.
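For the in-memory case, such an algorithm is indeed simple; the following sketch shows hash-based binary matching (a join) under the assumption that the build input fits entirely into memory. The names and the generator interface are illustrative, not a specific system's implementation.

    def hash_join(build, probe, key):
        """In-memory hash join (sketch): build a hash table on the smaller
        input, then probe it with each item of the other input."""
        table = {}
        for rec in build:
            table.setdefault(key(rec), []).append(rec)
        for rec in probe:
            # Each probe record is matched against all build records
            # sharing its key; non-matching records are simply skipped.
            for match in table.get(key(rec), []):
                yield match, rec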
There are basically two methods for managing hash table overflow, namely avoidance and resolution. In either case, the input is divided into multiple partition files such that partitions can be processed independently from one another, and the concatenation of the results of all partitions is the result of the entire operation. Partitioning should ensure that the partition files are of roughly even size and can be done using either hash partitioning or range partitioning, i.e., based on keys estimated to be quantiles. Usually, partition files can be processed using the original hash-based algorithm. The maximal partitioning fan-out F, i.e., the number of partition files created, is determined by the memory size M divided by the cluster size C minus one cluster for the partitioning input, i.e., F = ⌊M/C⌋ − 1, just like the fan-in for sorting.

In hash table overflow avoidance, the input set is partitioned into F partition files before any in-memory hash table is built. If it turns out that fewer partitions than have been created would have been sufficient to obtain partition files that will fit into memory, bucket tuning (collapsing multiple small buckets into larger ones) and dynamic destaging (determining which buckets should stay in memory) can improve the performance of hash-based operations [Kitsuregawa et al. 1989a; Nakayama et al. 1988].

Algorithms based on hash table overflow resolution start with the assumption that overflow will not occur, but resort to basically the same set of mechanisms as hash table overflow avoidance once it does occur. No real system uses this naive hash table overflow resolution because so-called hybrid hashing is as efficient but more flexible. Hybrid hashing combines in-memory hashing and overflow resolution [DeWitt et al. 1984; Shapiro 1986]. Although invented for relational join and known as hybrid hash join, hybrid hashing is equally applicable to all hash-based query processing algorithms. Hybrid hash algorithms start out with the (optimistic) premise that no overflow will occur; if it does, however, they partition the input into multiple partitions of which only one is written immediately to temporary files on disk. The other F − 1 partitions remain in memory. If another overflow occurs, another partition is written to disk. If necessary, all F partitions are written to disk. Thus, hybrid hash algorithms use all available memory for in-memory processing, but at the same time are able to process large input files by overflow resolution. Figure 8 shows the idea of hybrid hash algorithms.

Figure 8. Hybrid hashing.

As many hash buckets as possible are kept in memory, e.g., as linked lists as indicated by solid arrows. The other hash buckets are spooled to temporary disk files, called the overflow or partition files, and are processed in later stages of the algorithm. Hybrid hashing is useful if the input size R is larger than the memory size M but smaller than the memory size multiplied by the fan-out F, i.e., M < R < F × M.

In order to predict the number of I/O operations (which actually is not necessary for execution because the algorithm adapts to its input size, but may be desirable for cost estimation during query optimization), the number of required partition files on disk must be determined. Call this number K, which must satisfy 0 ≤ K ≤ F. Presuming that the assignment of buckets to partitions is optimal and that each partition file is equal to the memory size M, the amount of data that may be written to K partition files is equal to K × M. The number of required I/O buffers is 1 for the input and K for the output partitions, leaving M − (K + 1) × C memory for the hash table. The optimal K for a given input size R is the minimal K for which K × M + (M − (K + 1) × C) ≥ R.
Solving this inequality and taking the smallest such K results in K = ⌈(R − M + C)/(M − C)⌉.
The minimal possible I/O cost, including a factor of 2 for writing and reading the partition files and measured in the amount of data that must be written or read, is 2 × (R − (M − (K + 1) × C)). To determine the I/O time, this amount must be divided by the cluster size and multiplied by the I/O time for one cluster.

For example, consider an input of R = 240 pages, a memory of M = 80 pages, and a cluster size of C = 8 pages. The maximal fan-out is F = ⌊80/8⌋ − 1 = 9. The number of partition files that need to be created on disk is K = ⌈(240 − 80 + 8)/(80 − 8)⌉ = 3. In other words, in the best case, K × C = 3 × 8 = 24 pages will be used as output buffers to write K = 3 partition files of no more than M = 80 pages, and M − (K + 1) × C = 80 − 4 × 8 = 48 pages of memory will be used as the hash table. The total amount of data written to and read from disk is 2 × (240 − (80 − 4 × 8)) = 384 pages. If writing or reading a cluster of C = 8 pages takes 40 ms, the total I/O time is 384/8 × 40 = 1.92 sec.
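The partition count and I/O volume of this example follow directly from the formulas above; the sketch below (sizes in pages, illustrative names) reproduces K = 3, 384 pages of temporary I/O, and 48 pages of hash table space.

    from math import ceil

    def hybrid_hash_io(R, M, C):
        """Sketch of the hybrid-hash cost formulas above (sizes in pages).
        Returns (K, pages written + read, pages left for the hash table)."""
        K = ceil((R - M + C) / (M - C))   # partition files spilled to disk
        table = M - (K + 1) * C           # memory left for the hash table
        io = 2 * (R - table)              # write + read the spilled part
        return K, io, table

    print(hybrid_hash_io(240, 80, 8))     # (3, 384, 48), as in the example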
In the calculation of K, we assumed an optimal assignment of hash buckets to partition files. If buckets were assigned in the most straightforward way, e.g., by dividing the hash directory into F equal-size regions and assigning the buckets of one region to a partition as indicated in Figure 8, all partitions would be of nearly the same size, and either all or none of them would fit into their output cluster and therefore into memory. In other words, once hash table overflow occurred, all input would be written to partition files. Thus, we presumed in the earlier calculations that hash buckets are assigned more intelligently to output partitions.

There are three ways to assign hash buckets to partitions. First, each time a hash table overflow occurs, a fixed number of hash buckets is assigned to a new output partition. In the Gamma database machine, the number of disk partitions is chosen "such that each bucket [bucket here means what is called an output partition in this survey] can reasonably be expected to fit in memory" [DeWitt and Gerber 1985], e.g., 10% of the hash buckets in the hash directory for a fan-out of 10 [Schneider 1990]. In other words, the fan-out is set a priori by the query optimizer based on the expected (estimated) input size. Since the page size in Gamma is relatively small, only a fraction of memory is needed for output buffers, and an in-memory hash table can be used even while output partitions are being written to disk. Second, in bucket tuning and dynamic destaging [Kitsuregawa et al. 1989a; Nakayama 1988], a large number of small partition files is created and then collapsed into fewer partition files no larger than memory. In order to obtain a large number of partition files and, at the same time, retain some memory for a hash table, the cluster size is set quite small, e.g., C = 1 page, and the fan-out is very large though not maximal, e.g., F = M/C/2. In the example above, F = 40 output partitions with an average size of R/F = 6 pages could be created, even though only K = 3 output partitions are required. The smallest partitions are assigned to fill an in-memory hash table of size M − K × C = 80 − 3 × 1 = 77 pages. Hopefully, the dynamic destaging rule (when an overflow occurs, assign the largest partition still in memory to disk) ensures that indeed the smallest partitions are retained in memory. The partitions assigned to disk are collapsed into K = 3 partitions of no more than M = 80 pages, to be processed in K = 3 subsequent phases.

In binary operations such as intersection and relational join, bucket tuning is quite effective for skew in the first input, i.e., if the hash value distribution is nonuniform and if the partition files are of uneven sizes. It avoids spooling parts of the second (typically larger) input to temporary partition files because the partitions in memory can be matched immediately using a hash table in the memory not required as output buffer, and because a number of small partitions have been collapsed into fewer, larger partitions, increasing the memory available for the hash table. For skew in the second input, bucket tuning and dynamic destaging have no advantage. Another disadvantage of bucket tuning and dynamic destaging is that the cluster size has to be relatively small, thus requiring a large number of I/O operations with disk seeks and rotational latencies to write data to the overflow files. Third, statistics gathered before hybrid hashing commences can be used to assign hash buckets to partitions [Graefe 1993a].
Unfortunately, it is possible that one or more partition files are larger than memory. In that case, partitioning is used recursively until the file sizes have shrunk to memory size. Figure 9 shows how a hash-based algorithm for a unary operation, such as aggregation or duplicate removal, partitions its input file over multiple recursion levels. The recursion terminates when the files fit into memory. In the deepest recursion level, hybrid hashing may be employed.

Figure 9. Recursive partitioning.

If the partitioning (hash) function is good and creates a uniform hash value distribution, the file size in each recursion level shrinks by a factor equal to the fan-out, and therefore the number of recursion levels L is logarithmic with the size of the input being partitioned. After L partitioning levels, each partition file is of size R' = R/F^L. In order to obtain partition files suitable for hybrid hashing (with M < R' < F × M), the number of full recursion levels L, i.e., levels at which hybrid hashing is not applied, is L = ⌊log_F(R/M)⌋. The I/O cost of the remaining step using hybrid hashing can be estimated using the hybrid hash formula above with R replaced by R' and multiplying the cost by F^L because hybrid hashing is used for this number of partition files. Thus, the total I/O cost for partitioning an input and using hybrid hashing in the deepest recursion level is

    2 × R × L + 2 × F^L × (R' − (M − K × C))
    = 2 × (R × (L + 1) − F^L × (M − K × C))
    = 2 × (R × (L + 1) − F^L × (M − ⌈(R' − M)/(M − C)⌉ × C)).
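As a sketch, this total cost formula can be evaluated as follows; it assumes R ≥ M, a uniform hash value distribution, and the same illustrative conventions as before (sizes in pages).

    from math import ceil, floor, log

    def recursive_hybrid_hash_io(R, M, C):
        """Total temporary I/O (in pages) for recursive partitioning with
        hybrid hashing at the deepest level, per the formula above."""
        F = M // C - 1                     # partitioning fan-out
        L = floor(log(R / M, F))           # full recursion levels (R >= M)
        R1 = R / F**L                      # partition file size after L levels
        K = ceil((R1 - M) / (M - C))       # spilled partitions per hybrid hash
        return 2 * (R * (L + 1) - F**L * (M - K * C))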
A major problem with hash-based algorithms is that their performance depends on the quality of the hash function. In many situations, fairly simple hash functions will perform reasonably well. Remember that the purpose of using hash-based algorithms usually is to find database items with a specific key or to bring like items together; thus, methods as simple as using the value of a join key as a hash value will frequently perform satisfactorily. For string values, good hash values can be determined by using binary "exclusive or" operations or by determining cyclic redundancy check (CRC) values as used for reliable data storage and transmission. If the quality of the hash function is a potential problem, universal hash functions should be considered [Carter and Wegman 1979].

If the partitioning is skewed, the recursion depth may be unexpectedly high, making the algorithm rather slow. This is analogous to the worst-case performance of quicksort, O(N^2) comparisons for an array of N items, if the partitioning pivots are chosen extremely poorly and do not divide arrays into nearly equal subarrays. Skew is the major danger for inferior performance of hash-based query-processing algorithms. There are several ways to deal with skew. For hash-based algorithms using overflow avoidance, bucket tuning and dynamic destaging are quite effective. Another method is to obtain statistical information about hash values and to use it to carefully assign hash buckets to partitions. Such statistical information can be kept in the form of histograms and can either come from permanent system catalogs (metadata), from sampling the input, or from previous recursion levels. For example, for an intermediate query processing result for which no statistical parameters are known a priori, the first partitioning level might have to proceed naively, pretending that the partitioning hash function is perfect, but the second and further recursion levels should be able to use statistics gathered in earlier levels to ensure that each partitioning step creates even partitions, i.e., that the data is partitioned with maximal effectiveness [Graefe 1993a]. As a final resort, if skew cannot be managed otherwise, or if distribution skew is not the problem but duplicates are, some systems resort to algorithms that are not affected by data or hash value skew. For example, Tandem's hash join algorithm resorts to nested-loops join (to be discussed later) [Zeller and Gray 1990].

As for sorting, larger cluster sizes result in faster I/O at the expense of smaller fan-outs, with the optimal fan-out being fairly small [Graefe 1993a; Graefe et al. 1993]. Thus, multiple recursion levels are not uncommon for large files, and statistics gathered on one level to limit skew effects on the next level are a realistic method for large files to control the performance penalties of uneven partitioning.

3. DISK ACCESS

All query evaluation systems have to access base data stored in the database. For databases in the megabyte to terabyte range, base data are typically stored on secondary storage in the form of rotating random-access disks. However, deeper storage hierarchies including optical storage, (maybe robot-operated) tape archives, and remote storage servers will also have to be considered in future high-functionality high-volume database management systems, e.g., as outlined by Stonebraker [1991]. Research into database systems supporting and exploiting a deep storage hierarchy is still in its infancy.

On the other hand, some researchers have considered in-memory or main-memory databases, motivated both by the desire for faster transaction and query-processing performance and by the decreasing cost of semiconductor memory [Analyti and Pramanik 1992; Bitton et al. 1987; Bucheral et al. 1990; DeWitt et al. 1984; Gruenwald and Eich 1991; Kumar and Burger 1991; Lehman and Carey 1986; Li and Naughton 1988; Severance et al. 1990; Whang and Krishnamurthy 1990]. However, for most applications, an analysis by Gray and Putzolo [1987] demonstrated that main memory is cost effective only for the most frequently accessed data. The time interval between accesses with equal disk and memory costs was five minutes for their values of memory and disk prices and was expected to grow as main-memory prices decrease faster than disk prices. For the purposes of this survey, we will presume a disk-based storage architecture and will consider disk I/O one of the major costs of query evaluation over large databases.

3.1 File Scans

The first operator to access base data is the file scan, typically combined with a built-in selection facility. There is not much to be said about file scans except that they can be made very fast using read-ahead, particularly large-chunk ("track-at-a-crack") read-ahead.
In some database systems, e.g., IBM's DB2, the read-ahead size is coordinated with the free space left for future insertions during database reorganization. If a free page is left after every 15 full data pages, the read-ahead unit of 16 pages (64 KB) ensures that overflow records are immediately available in the buffer.

Efficient read-ahead requires contiguous file allocation, which is supported by many operating systems. Such contiguous disk regions are frequently called extents.
The UNIX operating system does not provide contiguous files, and many database systems running on UNIX use "raw" devices instead, even though this means that the database management system must provide operating-system functionality such as file structures, disk space allocation, and buffering.

The disadvantages of large units of I/O are buffer fragmentation and the waste of I/O and bus bandwidth if only individual records are required. Permitting different page sizes may seem to be a good idea, even at the added complexity in the buffer manager [Carey et al. 1986; Sikeler 1988], but this does not solve the problem of mixed sequential scans and random record accesses within one file. The common solution is to choose a middle-of-the-road page size, e.g., 8 KB, and to support multipage read-ahead.

3.2 Associative Access Using Indices

In order to reduce the number of accesses to secondary storage (which is relatively slow compared to main memory), most database systems employ associative search techniques in the form of indices that map key or attribute values to locator information with which database objects can be retrieved. The best known and most often used database index structure is the B-tree [Bayer and McCreight 1972; Comer 1979]. A large number of extensions to the basic structure and its algorithms have been proposed, e.g., B+-trees for faster scans, fast loading from a sorted file, increased fan-out and reduced depth by prefix and suffix truncation, B*-trees for better space utilization in random insertions, and top-down B-trees for better locking behavior through preventive maintenance [Guibas and Sedgewick 1978]. Interestingly, B-trees seem to be having a renaissance as a research subject, in particular with respect to improved space utilization [Baeza-Yates and Larson 1989], concurrency control [Srinivasan and Carey 1991], recovery [Lanka and Mays 1991], parallelism [Seeger and Larson 1991], and on-line creation of B-trees for very large databases [Srinivasan and Carey 1992]. On-line reorganization and modification of storage structures, though not a new idea [Omiecinski 1985], is likely to become an important research topic within database research over the next few years as databases become larger and larger and are spread over many disks and many nodes in parallel and distributed systems.

While most current database system implementations only use some form of B-trees, an amazing variety of index structures has been described in the literature [Becker et al. 1991; Beckmann et al. 1990; Bentley 1975; Finkel and Bentley 1974; Guenther and Bilmes 1991; Gunther and Wong 1987; Gunther 1989; Guttman 1984; Henrich et al. 1989; Hoel and Samet 1992; Hutflesz et al. 1988a; 1988b; 1990; Jagadish 1991; Kemper and Wallrath 1987; Kolovson and Stonebraker 1991; Kriegel and Seeger 1987; 1988; Lomet and Salzberg 1990a; Lomet 1992; Neugebauer 1991; Robinson 1981; Samet 1984; Six and Widmayer 1988]. One of the few multidimensional index structures actually implemented in a complete database management system are R-trees in Postgres [Guttman 1984; Stonebraker et al. 1990b].

Table 5 shows some example index structures classified according to four characteristics, namely their support for ordering and sorted scans, their dynamic-versus-static behavior upon insertions and deletions, their support for multiple dimensions, and their support for point data versus range data.
We omitted hierarchical concatenation of attributes and uniqueness, because all index structures can be implemented to support these. The indication "no range data" for multidimensional index structures indicates that range data are not part of the basic structure, although they can be simulated using twice the number of dimensions. We included a reference or two with each structure; we selected original descriptions and surveys over the many subsequent papers on special aspects such as performance analyses, multidisk and multiprocessor implementations, page placement on disk, concurrency control, recovery, order-preserving hashing, mapping range data of N dimensions into point data of 2N dimensions, etc.; this list suggests the wealth of subsequent research, in particular on B-trees, linear hashing, and refined multidimensional index structures.
Table 5. Classification of Some Index Structures

    Structure            Ordered   Dynamic   Multi-Dim.   Range Data   References
    ISAM                 Yes       No        No           No           [Larson 1981]
    B-trees              Yes       Yes       No           No           [Bayer and McCreight 1972; Comer 1979]
    Quad-tree            Yes       Yes       Yes          No           [Finkel and Bentley 1974; Samet 1984]
    kD-trees             Yes       Yes       Yes          No           [Bentley 1975]
    KDB-trees            Yes       Yes       Yes          No           [Robinson 1981]
    hB-trees             Yes       Yes       Yes          No           [Lomet and Salzberg 1990a]
    R-trees              Yes       Yes       Yes          Yes          [Guttman 1984]
    Extendible Hashing   No        Yes       No           No           [Fagin et al. 1979]
    Linear Hashing       No        Yes       No           No           [Litwin 1980]
    Grid Files           Yes       Yes       Yes          No           [Nievergelt et al. 1984]

Storage structures typically thought of as index structures may be used as primary structures to store actual data or as redundant structures ("access paths") that do not contain actual data but pointers to the actual data items in a separate data file. For example, Tandem's NonStop SQL system uses B-trees for actual data as well as for redundant index structures. In this case, a redundant index structure contains not absolute locations of the data items but keys used to search the primary B-tree. If indices are redundant structures, they can still be used to cluster the actual data items, i.e., the order or organization of index entries determines the order of items in the data file. Such indices are called clustering indices; other indices are called nonclustering indices. Clustering indices do not necessarily contain an entry for each data item in the primary file, but only one entry for each page of the primary file; in this case, the index is called sparse. Nonclustering indices must always be dense, i.e., there are the same number of entries in the index as there are items in the primary file.

The common theme for all index structures is that they associatively map some attribute of a data object to some locator information that can then be used to retrieve the actual data object. Typically, in relational systems, an attribute value is mapped to a tuple or record identifier (TID or RID). Different systems use different approaches, but it seems that most new designs do not firmly attach the record lookup to the index scan.

There are several advantages to separating index scan and record lookup. First, it is possible to scan an index only, without ever retrieving records from the underlying data file. For example, if only salary values are needed (e.g., to determine the count or sum of all salaries), it is sufficient to access the salary index only, without actually retrieving the data records. The advantages are that (i) fewer I/Os are required (consider the number of I/Os for retrieving N successive index entries versus those to retrieve N index entries plus N full records, in particular if the index is nonclustering [Mackert and Lehman 1989]) and (ii) the remaining I/O operations are basically sequential along the leaves of the index (at least for B+-trees; other index types behave differently). The optimizers of several commercial relational products have recently been revised to recognize situations in which an index-only scan is sufficient.
Second, even if none of the existing indices is sufficient by itself, multiple indices may be "joined" on equal RIDs to obtain all attributes required for a query (join algorithms are discussed below in the section on binary matching).
For example, by matching entries in indices on salaries and on names by equal RIDs, the correct salary-name pairs are established. If a query requires only names and salaries, this "join" has made accessing the underlying data file obsolete. Third, if two or more indices apply to individual clauses of a query, it may be more effective to take the union or intersection of RID lists obtained from two index scans than to use only one index (algorithms for union and intersection are also discussed in the section on binary matching). Fourth, joining two tables can be accomplished by joining the indices on the two join attributes followed by record retrievals in the two underlying data sets; the advantage of this method is that only those records will be retrieved that truly contribute to the join result [Kooi 1980]. Fifth, for nonclustering indices, sets of RIDs can be sorted by physical location, and the records can be retrieved very efficiently, reducing substantially the number of disk seeks and their seek distances. Obviously, several of these techniques can be combined. In addition, some systems such as Rdb/VMS and DB2 use very sophisticated implementations of multiindex scans that decide dynamically, i.e., during run-time, which indices to scan, whether scanning a particular index reduces the resulting RID list sufficiently to offset the cost of the index scan, and whether to use bit vector filtering for the RID list intersection (see a later section on bit vector filtering) [Antoshenkov 1993; Mohan et al. 1990].

Record access performance for nonclustering indices can be addressed without performing the entire index scan first (as required if all RIDs are to be sorted) by using a "window" of RIDs. Instead of obtaining one RID from the index scan, retrieving the record, getting the next RID from the index scan, etc., the lookup operator (sometimes called "functional join") could load N RIDs, sort them into a priority heap, retrieve the most conveniently located record, get another RID, insert it into the heap, retrieve a record, etc. Thus, a functional join operator using a window always has N open references to items that must be retrieved, giving the functional join operator significant freedom to fetch items from disk efficiently. Of course, this technique works most effectively if no other transactions or operators use the same disk drive at the same time.

This idea has been generalized to assemble complex objects. In object-oriented systems, objects can contain pointers to (identifiers of) other objects or components, which in turn may contain further pointers, etc. If multiple objects and all their unresolved references can be considered concurrently when scheduling disk accesses, significant savings in disk seek times can be achieved [Keller et al. 1991].
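A minimal sketch of such a functional join with a RID window might look as follows, assuming that RIDs sort by physical disk location and that fetch(rid) retrieves one record; note that the output order then follows disk location, not index order. All names are illustrative.

    import heapq

    def functional_join(rid_iter, fetch, window=16):
        """Record lookup with a sliding window of RIDs (sketch): keep
        'window' open references in a priority heap and always fetch the
        most conveniently located record next."""
        heap = []
        for rid in rid_iter:
            heapq.heappush(heap, rid)
            if len(heap) >= window:
                yield fetch(heapq.heappop(heap))
        # Index scan exhausted: retrieve the remaining referenced records.
        while heap:
            yield fetch(heapq.heappop(heap))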
3.3 Buffer Management

I/O cost can be further reduced by caching data in an I/O buffer. A large number of buffer management techniques have been devised; we give only a few references. Effelsberg and Haerder [1984] survey many of the buffer management issues, including those pertaining to issues of recovery, e.g., write-ahead logging. In a survey paper on the interactions of operating systems and database management systems, Stonebraker [1981] pointed out that the "standard" buffer replacement policy, LRU (least recently used), is wrong for many database situations. For example, a file scan reads a large set of pages but uses them only once, "sweeping" the buffer clean of all other pages, even if they might be useful in the future and should be kept in memory. Sacco and Schkolnick [1982; 1986] focused on the nonlinear performance effects of buffer allocation to many relational algorithms, e.g., nested-loops join. Chou [1985] and Chou and DeWitt [1985] combined these two ideas in their DBMIN algorithm, which allocates a fixed number of buffer pages to each scan, depending on its needs, and uses a local replacement policy for each scan appropriate to its reference pattern. A recent study into buffer allocation is by Faloutsos et al. [1991] and Ng et al. [1991] on using marginal gain for buffer allocation.
A very promising research direction for buffer management in object-oriented database systems is the work by Palmer and Zdonik [1991] on saving reference patterns and using them to predict future object faults and to prevent them by prefetching the required pages.

The interactions of index retrieval and buffer management were studied by Sacco [1987] as well as Mackert and Lehman [1989], and several authors studied database buffer management and virtual memory provided by the operating system [Sherman and Brice 1976; Stonebraker 1981; Traiger 1982].

On the level of buffer manager implementation, most database buffer managers do not provide read and write interfaces to their client modules but fixing and unfixing, also called pinning and unpinning. The semantics of fixing is that a fixed page is not subject to replacement or relocation in the buffer pool, and a client module may therefore safely use a memory address within a fixed page. If the buffer manager needs to replace a page but all its buffer frames are fixed, some special action must occur such as dynamic growth of the buffer pool or transaction abort.

The iterator implementation of query evaluation algorithms can exploit the buffer's fix/unfix interface by passing pointers to items (records, objects) fixed in the buffer from iterator to iterator. The receiving iterator then owns the fixed item; it may unfix it immediately (e.g., after a predicate fails), hold on to the fixed record for a while (e.g., in a hash table), or pass it on to the next iterator (e.g., if a predicate succeeds). Because the iterator control and interaction of operators ensure that items are never produced and fixed before they are required, the iterator protocol is very efficient in its buffer usage.

Some implementors, however, have felt that intermediate results should not be materialized or kept in the database system's I/O buffer, e.g., in order to ease implementation of transaction (ACID) semantics, and have designed a separate memory management scheme for intermediate results and items passed from iterator to iterator. The cost of this decision is additional in-memory copying as well as the possible inefficiencies associated with, in effect, two buffer and memory managers.

4. AGGREGATION AND DUPLICATE REMOVAL

Aggregation is a very important statistical concept to summarize information about large amounts of data. The idea is to represent a set of items by a single value or to classify items into groups and determine one value per group. Most database systems support aggregate functions for minimum, maximum, sum, count, and average (arithmetic mean). Other aggregates, e.g., geometric mean or standard deviation, are typically not provided, but may be constructed in some systems with extensibility features. Aggregation has been added to both relational calculus and algebra and adds the same expressive power to each of them [Klug 1982].

Aggregation is typically supported in two forms, called scalar aggregates and aggregate functions [Epstein 1979]. Scalar aggregates calculate a single scalar value from a unary input relation, e.g., the sum of the salaries of all employees. Scalar aggregates can easily be determined using a single pass over a data set. Some systems exploit indices, in particular for minimum, maximum, and count.
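As a sketch, a single pass indeed suffices for all the standard scalar aggregates; the function below (illustrative names, numeric input assumed) computes minimum, maximum, sum, count, and average together.

    def scalar_aggregates(values):
        """Single-pass scalar aggregates over a unary input (sketch)."""
        count = 0
        total = 0
        lo = hi = None
        for v in values:
            count += 1
            total += v
            lo = v if lo is None or v < lo else lo
            hi = v if hi is None or v > hi else hi
        return {"min": lo, "max": hi, "sum": total, "count": count,
                "avg": total / count if count else None}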
Aggregate functions, on the other hand, determine a set of values from a binary input relation, e.g., the sum of salaries for each department. Aggregate functions are relational operators, i.e., they consume and produce relations. Figure 10 shows the output of the query "count of employees by department." The "by-list" or grouping attributes are the key of the new relation, the Department attribute in this example.
    Shoe        9
    Hardware    7

Figure 10. Count of employees by department.

Algorithms for aggregate functions require grouping, e.g., employee items may be grouped by department, and then one output item is calculated per group. This grouping process is very similar to duplicate removal, in which equal data items must be brought together, compared, and removed. Thus, aggregate functions and duplicate removal are typically implemented in the same module. There are only two differences between aggregate functions and duplicate removal. First, in duplicate removal, items are compared on all their attributes, but only on the attributes in the by-list of aggregate functions. Second, an identical item is immediately dropped from further consideration in duplicate removal, whereas in aggregate functions some computation is performed before the second item of the same group is dropped. Both differences can easily be dealt with using a switch in an actual algorithm implementation. Because of their similarity, duplicate removal and aggregation are described and used interchangeably here.

In most existing commercial relational systems, aggregation and duplicate removal algorithms are based on sorting, following Epstein's [1979] work. Since aggregation requires that all data be consumed before any output can be produced, and since main memories were significantly smaller 15 years ago when the prototypes of these systems were designed, these implementations used temporary files for output, not streams and iterator algorithms. However, there is no reason why aggregation and duplicate removal cannot be implemented using iterators exploiting today's memory sizes.

4.1 Aggregation Algorithms Based on Nested Loops

There are three types of algorithms for aggregation and duplicate removal, based on nested loops, sorting, and hashing. The first algorithm, which we call nested-loops aggregation, is the most simple-minded one. Using a temporary file to accumulate the output, it loops for each input item over the output file accumulated so far and either aggregates the input item into the appropriate output item or creates a new output item and appends it to the output file; a sketch follows the footnote below. Obviously, this algorithm is quite inefficient for large inputs, even if some performance enhancements can be applied.7 We mention it here because it corresponds to the algorithm choices available for relational joins and other binary matching problems (discussed in the next section), which are the nested-loops join and the more efficient sort-based and hash-based join algorithms. As for joins and binary matching, where the nested-loops algorithm is the only algorithm that can evaluate any join predicate, the nested-loops aggregation algorithm can support unusual aggregations where the input items are not divided into disjoint equivalence classes but where a single input item may contribute to multiple output items. While such aggregations are not supported in today's database systems, classifications that do not divide the input into equivalence classes can be useful in both commercial and scientific applications. If the number of classifications is small enough that all output items can be kept in memory, the performance of this algorithm is acceptable. However, for the more standard database aggregation problems, sort-based and hash-based duplicate removal and aggregation algorithms are more appropriate.
7 The possible improvements are (i) looping over pages or clusters rather than over records of input and output items (block nested loops); (ii) speeding the inner loop by an index (index nested loops), a method that has been used in some commercial relational systems; and (iii) bit vector filtering to determine without inner loop or index lookup that an item in the outer loop cannot possibly have a match in the inner loop. All three of these issues are discussed later in this survey as they apply to binary operations such as joins and intersection.
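The nested-loops aggregation algorithm described above can be sketched as follows, with an in-memory list standing in for the temporary output file; group_key, init, and step are illustrative parameters for the grouping attribute and the aggregate computation.

    def nested_loops_aggregation(input_items, group_key, init, step):
        """Naive nested-loops aggregation (sketch): for each input item,
        scan the output accumulated so far; aggregate into a matching
        output item or append a new one."""
        output = []                        # stands in for the output file
        for item in input_items:
            for entry in output:
                if entry[0] == group_key(item):
                    entry[1] = step(entry[1], item)
                    break
            else:
                output.append([group_key(item), init(item)])
        return output

    # Count of employees by department, as in Figure 10:
    # nested_loops_aggregation(emps, lambda e: e["dept"],
    #                          lambda e: 1, lambda c, e: c + 1)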
4.2 Aggregation Algorithms Based on Sorting

Sorting will bring equal items together, and duplicate removal will then be easy. The cost of duplicate removal is dominated by the sort cost, and the cost of this naive duplicate removal algorithm based on sorting can be assumed to be that of the sort operation. For aggregation, items are sorted on their grouping attributes.

This simple method can be improved by detecting and removing duplicates as early as possible, easily implemented in the routines that write run files during sorting. With such "early" duplicate removal or aggregation, a run file can never contain more items than the final output (because otherwise it would contain duplicates!), which may speed up the final merges significantly [Bitton and DeWitt 1983].

As for any external sort operation, the optimizations discussed in the section on sorting, namely read-ahead using forecasting, merge optimizations, large cluster sizes, and reduced final fan-in for binary consumer operations, are fully applicable when sorting is used for aggregation and duplicate removal. However, to limit the complexity of the formulas, we derive I/O cost formulas without the effects of these optimizations.

The amount of I/O in sort-based aggregation is determined by the number of merge levels and the effect of early duplicate removal on each merge step. The total number of merge levels is unaffected by aggregation; in sorting with quicksort and without optimized merging, the number of merge levels is L = ⌈log_F(R/M)⌉ for input size R, memory size M, and fan-in F. In the first merge levels, the likelihood is negligible that items of the same group end up in the same run file, and we therefore assume that the sizes of run files are unaffected until their sizes would exceed the size of the final output. Runs on the first few merge levels are of size M × F^i for level i, and runs of the last levels have the same size as the final output. Assuming the output cardinality (number of items) is G times less than the input cardinality (G = R/O), where G is called the average group size or the reduction factor, only the last ⌈log_F(G)⌉ merge levels, including the final merge, are affected by early aggregation because in earlier levels more than G runs exist, and items from each group are distributed over all those runs, giving a negligible chance of early aggregation.

In the first merge levels, all input items participate, and the cost for these levels can be determined without explicitly calculating the size and number of run files on these levels. In the affected levels, the size of the output runs is constant, equal to the size of the final output O = R/G, while the number of run files decreases by a factor equal to the fan-in F in each level. The number of affected levels that create run files is L2 = ⌈log_F(G)⌉ − 1; the subtraction of 1 is necessary because the final merge does not create a run file but the output stream. The number of unaffected levels is L1 = L − L2. The number of input runs is W/F^i on level i (recall the number of initial runs W = R/M from the discussion of sorting). The total cost,8 including a factor 2 for writing and reading, is

    2 × R × L1 + 2 × O × (W/F^L1 + W/F^(L1+1) + ... + W/F^(L−1))
    = 2 × R × L1 + 2 × O × W × (1/F^L1 − 1/F^L)/(1 − 1/F).

For example, consider aggregating R = 100 MB of input into O = 1 MB of output (i.e., reduction factor G = 100) using a system with M = 100 KB of memory and fan-in F = 10.
Since the input is W = 1,000 times the size of memory, L = 3 merge levels will be needed. The last L2 = ⌈log_F(G)⌉ − 1 = 1 merge level into temporary run files will permit early aggregation. Thus, the total I/O will be

    2 × 100 × 2 + 2 × 1 × 1000 × (1/10^2 − 1/10^3)/(1 − 1/10)
    = 400 + 2 × 1000 × 0.009/0.9 = 420 MB,

which has to be divided by the cluster size used and multiplied by the time to read or write a cluster to estimate the I/O time for aggregation based on sorting. Naive separation of sorting and subsequent aggregation would have required reading and writing the entire input file three times, for a total of 600 MB of I/O. Thus, early aggregation realizes almost 30% savings in this case.

8 Using Σ_{i=0..K} a^i = (1 − a^(K+1))/(1 − a) and Σ_{i=K..L} a^i = Σ_{i=0..L} a^i − Σ_{i=0..K−1} a^i = (a^K − a^(L+1))/(1 − a).
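The cost formula and this example can be checked with a short sketch (sizes in MB, illustrative names); note that taking the ceiling of a floating-point logarithm is numerically fragile in general, which a real implementation would guard against.

    from math import ceil, log

    def early_aggregation_io(R, O, M, F):
        """Temporary I/O volume (same units as R) for sort-based
        aggregation with early aggregation, per the formula above."""
        W = R / M                        # initial runs
        L = ceil(log(R / M, F))          # total merge levels
        L2 = ceil(log(R / O, F)) - 1     # levels shrunk by early aggregation
        L1 = L - L2                      # unaffected levels
        return 2 * R * L1 + 2 * O * W * (F**-L1 - F**-L) / (1 - 1 / F)

    print(early_aggregation_io(100, 1, 0.1, 10))   # 420.0, as computed above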
$$2 \times 100 \times 2 + 2 \times 1 \times 1{,}000 \times (1/10^2 - 1/10^3)/(1 - 1/10) = 400 + 2 \times 1{,}000 \times 0.009/0.9 = 420 \text{ MB},$$

which has to be divided by the cluster size used and multiplied by the time to read or write a cluster to estimate the I/O time for aggregation based on sorting. Naive separation of sorting and subsequent aggregation would have required reading and writing the entire input file three times, for a total of 600 MB of I/O. Thus, early aggregation realizes almost 30% savings in this case.

Aggregate queries may require that duplicates be removed from the input set to the aggregate functions (see footnote 9), e.g., if the SQL distinct keyword is used. If such an aggregate function is to be executed using sorting, early aggregation can be used only for the duplicate removal part. However, the sort order used for duplicate removal can be suitable to permit the subsequent aggregation as a simple filter operation on the duplicate removal's output stream.

Footnote 9: Consider two queries, both counting salaries per department. In order to determine the number of (salaried) employees per department, all salaries are counted without removing duplicate salary values. On the other hand, in order to assess salary differentiation in each department, one might want to determine the number of distinct salary levels in each department. For this query, only distinct salaries are counted, i.e., duplicate department-salary pairs must be removed prior to counting. (This refers to the latter type of query.)

4.3 Aggregation Algorithms Based on Hashing

Hashing can also be used for aggregation by hashing on the grouping attributes. Items of the same group (or duplicate items in duplicate removal) can be found and aggregated when inserting them into the hash table. Since only output items, not input items, are kept in memory, hash table overflow occurs only if the output does not fit into memory. However, if overflow does occur, the partition files (all partitioning files in any one recursion level) will basically be as large as the entire input because once a partition is being written to disk, no further aggregation can occur until the partition files are read back into memory.

The amount of I/O for hash-based aggregation depends on the number of partitioning (recursion) levels required before the output (not the input) of one partition fits into memory. This will be the case when partition files have been reduced to the size $G \times M$. Since the partitioning files shrink by a factor of $F$ at each level (presuming hash value skew is absent or effectively counteracted), the number of partitioning (recursion) levels is $\lceil \log_F(R/G/M) \rceil = \lceil \log_F(O/M) \rceil$ for input size $R$, output size $O$, reduction factor $G$, and fan-out $F$. The costs at each level are proportional to the input file size $R$. The total I/O volume for hashing with overflow avoidance, including a factor of 2 for writing and reading, is $2 \times R \times \lceil \log_F(O/M) \rceil$.

The last partitioning level may use hybrid hashing, i.e., it may not involve I/O for the entire input file. In that case, $L = \lfloor \log_F(O/M) \rfloor$ complete recursion levels involving all input records are required, partitioning the input into files of size $R' = R/F^L$. In each remaining hybrid hash aggregation, the size limit for overflow files is $M \times G$ because such an overflow file can be aggregated in memory. The number of partition files $K$ must satisfy $K \times M \times G + (M - K \times C) \times G \geq R'$, meaning $K = \lceil (R'/G - M)/(M - C) \rceil$ partition files will be created.
The total I/O cost for hybrid hash aggregation is

$$2 \times R \times L + 2 \times F^L \times (R' - (M - K \times C) \times G)$$
$$= 2 \times (R \times (L + 1) - F^L \times (M - K \times C) \times G)$$
$$= 2 \times (R \times (L + 1) - F^L \times (M - \lceil (R'/G - M)/(M - C) \rceil \times C) \times G).$$
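To make these formulas concrete, the following small calculator (our own sketch; the function names and the choice of MB as the unit are ours) evaluates the I/O volume of sort-based aggregation with early aggregation and of hash-based aggregation with overflow avoidance. With the parameters of the example above it reproduces the 420 MB figure:

```python
from math import ceil, log

def sort_aggregation_io(R, M, F, G):
    """I/O volume for sort-based aggregation with early aggregation,
    following the formula derived above (same unit as R and M)."""
    O = R / G                        # final output size
    W = R / M                        # number of initial runs
    L = ceil(log(R / M, F))          # total merge levels
    L2 = ceil(log(G, F)) - 1         # merge levels helped by early aggregation
    L1 = L - L2                      # unaffected merge levels
    return 2 * R * L1 + 2 * O * W * (F**-L1 - F**-L) / (1 - 1 / F)

def hash_aggregation_io(R, M, F, G):
    """I/O volume for hash-based aggregation with overflow avoidance."""
    O = R / G
    return 2 * R * ceil(log(O / M, F))

# Example from the text: 100 MB input, 100 KB memory, fan-in/out 10, G = 100.
print(round(sort_aggregation_io(R=100, M=0.1, F=10, G=100)))   # 420 (MB)
print(round(hash_aggregation_io(R=100, M=0.1, F=10, G=100)))   # 200 (MB)
```

Note that floating-point logarithms can land just below an integer boundary, so a production cost model would derive the level counts with exact integer arithmetic.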
[Figure 11. Performance of sort- and hash-based aggregation. I/O volume in MB (0-600) plotted against group size or reduction factor (1-1,000) for sorting without early aggregation, sorting with early aggregation, hashing without hybrid hashing, and hashing with hybrid hashing.]

As for sorting, if an aggregate query requires duplicate removal for the input set to the aggregate function (see footnote 10), the group size or reduction factor of the duplicate removal step determines the performance of hybrid hash duplicate removal. The subsequent aggregation can be performed after the duplicate removal as an additional operation within each hash bucket or as a simple filter operation on the duplicate removal's output stream.

Footnote 10: See footnote 9.

4.4 A Rough Performance Comparison

It is interesting to note that the performance of both sort-based and hash-based aggregation is logarithmic and improves with increasing reduction factors. Figure 11 compares the performance of sort- and hash-based aggregation (see footnote 11) using the formulas developed above for 100 MB of input data, 100 KB of memory, clusters of 8 KB, fan-in or fan-out of 10, and varying group sizes or reduction factors. The output size is the input size divided by the group size.

Footnote 11: Aggregation by nested-loops methods is omitted from Figure 11 because it is not competitive for large data sets.

It is immediately obvious in Figure 11 that sorting without early aggregation is not competitive because it does not limit the sizes of run files, confirming the results of Bitton and DeWitt [1983]. The other algorithms all exhibit similar, though far from equal, performance improvements for larger reduction factors. Sorting with early aggregation improves once the reduction factor is large enough to affect not only the final but also previous merge steps. Hashing without hybrid hashing improves in steps as the number of partitioning levels can be reduced, with "step" points where $G = F^i$ for some $i$. Hybrid hashing exploits all available memory to improve performance and generally outperforms overflow avoidance hashing. At points where overflow avoidance hashing shows a step, hybrid hashing has no effect, and the two hashing schemes have the same performance.

While hash-based aggregation and duplicate removal seem superior in this rough analytical performance comparison, recall that the cost formula for sort-based aggregation does not include the effects of replacement selection or the merge optimizations discussed earlier in the section on sorting; therefore, Figure 11 shows an upper bound for the I/O cost of sort-based aggregation and duplicate removal. Furthermore, since the cost formula for hashing presumes optimal assignments of hash buckets to output partitions, the real costs of sort- and hash-based aggregation will be much more similar than they appear in Figure 11.
The important point is that both their costs are logarithmic with the input size, improve with the group size or reduction factor, and are quite similar overall.

4.5 Additional Remarks on Aggregation

Some applications require multilevel aggregation. For example, a report generation language might permit a request like "sum(employee.salary) by employee.id by employee.department by employee.division" to create a report with an entry for each employee and a sum for each department and each division. In fact, specifying such reports concisely was the driving design goal for the report generation language RPG. In SQL, this requires multiple cursors within an application program, one for each level of detail. This is very undesirable for two reasons. First, the application program performs what is essentially a join of three inputs. Such joins should be provided by the database system, not required to be performed within application programs. Second, the database system more likely than not executes the operations for these cursors independently from one another, resulting in three sort operations on the employee file instead of one.

If complex reporting applications are to be supported, the query language should support direct requests (perhaps similar to the syntax suggested above), and the sort operator should be implemented such that it can perform the entire operation in a single sort and one final pass over the sorted data. An analogous algorithm based on hashing can be defined; however, if the aggregated data are required in sort order, sort-based aggregation will be the algorithm of choice.

For some applications, exact aggregate functions are not required; reasonably close approximations will do. For example, exploratory (rather than final precise) data analysis is frequently very useful in "approaching" a new set of data [Tukey 1977]. In real-time systems, precision and response time may be reasonable tradeoffs. For database query optimization, approximate statistics are a sufficient basis for selectivity estimation, cost calculation, and comparison of alternative plans. For these applications, faster algorithms can be designed that rely either on a single sequential scan of the data (no run files, no overflow files) or on sampling [Astrahan et al. 1987; Hou and Ozsoyoglu 1991; 1993; Hou et al. 1991].

5. BINARY MATCHING OPERATIONS

While aggregation is essential to condense information, there are a number of database operations that combine information from two inputs, files, or sets and therefore are essential for database systems' ability to provide more than reliable shared storage and to perform inferences, albeit limited. A group of operators that all do basically the same task are called the one-to-one match operations here because an input item contributes to the output depending on its match with one other item. The most prominent among these operations is the relational join. Mishra and Eich [1992] have recently written a survey of join algorithms, which includes an interesting analysis and comparison of algorithms focusing on how data items from the two inputs are compared with one another.
The other one-to-one match operations are left and right semi-joins; left, right, and symmetric outer-joins; left and right anti-semi-joins; symmetric anti-join; intersection; union; left and right differences; and symmetric or anti-difference (see footnote 12).

Footnote 12: The anti-semi-join of R and S is R ANTI-SEMIJOIN S = R − (R SEMIJOIN S), i.e., the items in R without matches in S. The (symmetric) anti-join contains those items from both inputs that do not have matches, suitably padded as in outer joins to make them union compatible. Formally, the (symmetric) anti-join of R and S is R ANTI-JOIN S = (R ANTI-SEMIJOIN S) ∪ (S ANTI-SEMIJOIN R) with the tuples of the two union arguments suitably extended with null values. The symmetric or anti-difference is the union of the two differences. Formally, the anti-difference of R and S is (R ∪ S) − (R ∩ S) = (R − S) ∪ (S − R) [Maier 1983]. Among these three operations, the anti-semi-join is probably the most useful one, as in the query to "find the courses that don't have any enrollment."
Figure 12 shows the basic principle underlying all these operations, namely separation of the matching and nonmatching components of two sets, called R and S in the figure, and production of appropriate subsets, possibly after some transformation and combination of records as in the case of a join.

[Figure 12. Binary one-to-one matching. The inputs R and S are separated into component A (items of R without a match), component B (matching items), and component C (items of S without a match); an operation is defined by which components appear in the output:]

    Output    | Match on all attributes  | Match on some attributes
    A         | Difference               | Anti-semi-join
    B         | Intersection             | Join, semi-join
    C         | Difference               | Anti-semi-join
    A, B      |                          | Left outer join
    A, C      | Symmetric difference     | Anti-join
    B, C      |                          | Right outer join
    A, B, C   | Union                    | Symmetric outer join

If the sets R and S have different schemas as in relational joins, it might make sense to think of the set B as two sets B_R and B_S, i.e., the matching elements from R and S. This distinction permits a clearer definition of left semi-join and right semi-join, etc. Since all these operations require basically the same steps and can be implemented with the same algorithms, it is logical to implement them in one general and efficient module. For simplicity, only join algorithms are discussed here. Moreover, we discuss algorithms for only one join attribute since the algorithms for multi-attribute joins (and their performance) are not different from those for single-attribute joins.

Since set operations such as intersection and difference will be used and must be implemented efficiently for any data model, this discussion is relevant to relational, extensible, and object-oriented database systems alike. Furthermore, binary matching problems occur in some surprising places. Consider an object-oriented database system that uses a table to map logical object identifiers (OIDs) to physical locations (record identifiers or RIDs). Resolving a set of OIDs to RIDs can be regarded (as well as optimized and executed) as a semi-join of the mapping table and the set of OIDs, and all conventional join strategies can be employed. Another example that can occur in a database management system for any data model is the use of multiple indices in a query: the pointer (OID or RID) lists obtained from the indices must be intersected (for a conjunction) or united (for a disjunction) to obtain the list of pointers to items that satisfy the whole query. Moreover, the actual lookup of the items using the pointer list can be regarded as a semi-join of the underlying data set and the list, as in Kooi's [1980] thesis and the Ingres product [Kooi and Frankforth 1982] and a recent study by Shekita and Carey [1990]. Finally, many path expressions in object-oriented database systems such as "employee.department.manager.office.location" can frequently be interpreted, optimized, and executed as a sequence of one-to-one match operations using existing join and semi-join algorithms.
Thus, even if relational systems were completely abolished and replaced by object-oriented database systems, set matching and join techniques developed in the relational context would continue to be important for the performance of database systems.

Most of today's commercial database systems use only nested loops and merge-join because an analysis performed in connection with the System R project determined that of all the join methods considered, one of these two always provided either the best or very close to the best performance [Blasgen and Eswaran 1976; 1977]. However, the System R study did not consider hash join algorithms, which are now regarded as more efficient in many cases.

There continues to be a strong interest in join techniques, although the interest has shifted over the last 20 years from basic algorithmic concerns to parallel techniques and to techniques that adapt to unpredictable run-time situations such as data skew and changing resource availability. Unfortunately, many newly proposed techniques fail a very simple test (which we call the "Guy Lohman test for join techniques" after the first person who pointed this test out to us), making them problematic for nontrivial queries. The crucial test question is: Does this new technique apply to joining three inputs without interrupting data flow between the join operators? For example, a technique fails this test if it requires materializing the entire intermediate join result for random sampling of both join inputs or for obtaining exact knowledge about both join input sizes. Given its importance, this test should be applied to both proposed query optimization and query execution techniques.

For the I/O cost formulas given here, we assume that the left and right inputs have R and S pages, respectively, and that the memory size is M pages. We assume that the algorithms are implemented as iterators and omit the cost of reading stored inputs and writing an operation's output from the cost formulas because both inputs and output may be iterators, i.e., these intermediate results are never written to disk, and because these costs are equal for all algorithms.

5.1 Nested-Loops Join Algorithms

The simplest and, in some sense, most direct algorithm for binary matching is the nested-loops join: for each item in one input (called the outer input), scan the entire other input (called the inner input) and find matches. The main advantage of this algorithm is its simplicity. Another advantage is that it can compute a Cartesian product and any Θ-join of two relations, i.e., a join with an arbitrary two-relation comparison predicate. However, Cartesian products are avoided by query optimizers because their outputs tend to contain many data items that will eventually not satisfy a query predicate verified later in the query evaluation plan.

Since the inner input is scanned repeatedly, it must be stored in a file, i.e., a temporary file if the inner input is produced by a complex subplan. This situation does not change the cost of nested loops; it just replaces the first read of the inner input with a write.

Except for very small inputs, the performance of nested-loops join is disastrous because the inner input is scanned very often, once for each item in the outer input. There are a number of improvements that can be made to this naive nested-loops join.
First, for one-to-one match operations in which a single match carries all necessary information, e.g., semi-join and intersection, a scan of the inner input can be terminated after the first match for an item of the outer input. Second, instead of scanning the inner input once for each item from the outer input, the inner input can be scanned once for each page of the outer input, an algorithm called block nested-loops join [Kim 1980] (a code sketch appears at the end of this subsection). Third, the performance can be improved further by filling all of memory except K pages with pages of the outer input and by using the remaining K pages to scan the inner input and to save pages of the inner input in memory. Finally, scans of the inner input can be made a little faster by scanning the inner input alternatingly forward and backward, thus reusing the last page of the previous scan and therefore saving one I/O per inner scan.
The I/O cost for this version of nested-loops join is the product of the number of scans (determined by the size of the outer input) and the cost per scan of the inner input, plus K I/Os because the first inner scan has to scan or save the entire inner input. Thus, the total cost for scanning the inner input repeatedly is $\lceil R/(M - K) \rceil \times (S - K) + K$. This expression is minimized if $K = 1$ and $R \geq S$, i.e., the larger input should be the outer.

If the critical performance measure is not the amount of data read in the repeated inner scans but the number of I/O operations, more than one page should be moved in each I/O, even if more memory has to be dedicated to the inner input and less to the outer input, thus increasing the number of passes over the inner input. If C pages are moved in each I/O on the inner input and $M - C$ pages for the outer input, the number of I/Os is $\lceil R/(M - C) \rceil \times (S/C) + 1$, which is minimized if $C = M/2$. In other words, in order to minimize the number of large-chunk I/O operations, the cluster size should be chosen as half the available memory size [Hagmann 1986].

Finally, index nested-loops join exploits a permanent or temporary index on the inner input's join attribute to replace file scans by index lookups. In principle, each scan of the inner input in naive nested-loops join is used to find matches, i.e., to provide associativity. Not surprisingly, since all index structures are designed and used for the purpose of associativity, any index structure supporting the join predicate (such as =, <, etc.) can be used for index nested-loops join. The fastest indices for exact match queries are hash indices, but any index structure can be used, ordered or unordered (hash), single- or multi-attribute, single- or multidimensional. Therefore, indices on frequently used join attributes (keys and foreign keys in relational systems) may be useful. Index nested-loops join is also used sometimes with indices built on the fly, i.e., indices built on intermediate query processing results.

A recent investigation by DeWitt et al. [1993] demonstrated that index nested-loops join can be the fastest join method if one of the inputs is so small and if the other indexed input is so large that the number of index and data page retrievals, i.e., about the product of the index depth and the cardinality of the smaller input, is smaller than the number of pages in the larger input.

Another interesting idea using two ordered indices, e.g., a B-tree on each of the two join columns, is to switch roles of inner and outer join inputs after each index lookup, which leads to the name "zig-zag join." For example, for a join predicate R.a = S.a, a scan in the index on R.a finds the lowest join attribute value in R, which is then looked up in the index on S.a. A continuing scan in the index on S.a yields the next possible join attribute value, which is looked up in the index on R.a, etc. It is not immediately clear under which circumstances this join method is most efficient.

For complex queries, N-ary joins are sometimes written as a single module, i.e., a module that performs index lookups into indices of multiple relations and joins all relations simultaneously. However, it is not clear how such a multi-input join implementation is superior to multiple index nested-loops joins.
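A minimal sketch of the block nested-loops variant described earlier in this subsection (ours, not from the survey; lists and slices stand in for stored inputs and page-sized I/O units):

```python
def block_nested_loops_join(outer, inner, block_size, match):
    """Block nested-loops join: the inner input is scanned once per
    block of the outer input instead of once per outer item.

    `block_size` plays the role of the memory dedicated to the outer
    input; `match` is the join predicate."""
    if len(outer) < len(inner):        # the larger input should be the outer
        outer, inner = inner, outer    # (assumes a symmetric predicate)
    result = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]  # fill memory with outer items
        for s in inner:                # one full inner scan per outer block
            result.extend((r, s) for r in block if match(r, s))
    return result
```

For an equality predicate such as `lambda a, b: a[0] == b[0]`, the number of inner scans drops from the number of outer items to the number of outer blocks.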
5.2 Merge-Join Algorithms

The second commonly used join method is the merge-join. It requires that both inputs are sorted on the join attribute. Merging the two inputs is similar to the merge process used in sorting. An important difference, however, is that one of the two merging scans (the one which is advanced on equality, usually called the inner input) must be backed up when both inputs contain duplicates of a join attribute value and when the specific one-to-one match operation requires that all matches be found, not just one match. Thus, the control logic for merge-join variants for join and semi-join are slightly different. Some systems include the notion of "value packet," meaning all items with equal join attribute values [Kooi 1980; Kooi and Frankforth 1982].
An iterator's next call returns a value packet, not an individual item, which makes the control logic for merge-join much easier. If (or after) both inputs have been sorted, the merge-join algorithm typically does not require any I/O, except when "value packets" are larger than memory. (See footnote 1.)

An input may be sorted because a stored database file was sorted, an ordered index was used, an input was sorted explicitly, or the input came from an operation that produced sorted output, e.g., another merge-join. The last point makes merge-join an efficient algorithm if items from multiple sources are matched on the same join attribute(s) in multiple binary steps because sorting intermediate results is not required for later merge-joins, which led to the concept of interesting orderings in the System R query optimizer [Selinger et al. 1979]. Since set operations such as intersection and union can be evaluated using any sort order, as long as the same sort order is present in both inputs, the effect of interesting orderings for one-to-one match operators based on merge-join can always be exploited for set operations.

A combination of nested-loops join and merge-join is the heap-filter merge-join [Graefe 1991]. It first sorts the smaller inner input by the join attribute and saves it in a temporary file. Next, it uses all available memory to create sorted runs from the larger outer input using replacement selection. As discussed in the section on sorting, there will be about $W = R/(2 \times M) + 1$ such runs for outer input size R. These runs are not written to disk; instead, they are joined immediately with the sorted inner input using merge-join. Thus, the number of scans of the inner input is reduced to about one half when compared to block nested-loops. On the other hand, when compared to merge-join, it saves writing and reading temporary files for the larger outer input.

Another derivation of merge-join is the hybrid join used in IBM's DB2 product [Cheng et al. 1991], combining elements from index nested-loops join, merge-join, and techniques joining sorted lists of index leaf entries. After sorting the outer input on its join attribute, hybrid join uses a merge algorithm to "join" the outer input with the leaf entries of a preexisting B-tree index on the join attribute of the inner input. The result file contains entire tuples from the outer input and record identifiers (RIDs, physical addresses) for tuples of the inner input. This file is then sorted on the physical locations, and the tuples of the inner relation can then be retrieved from disk very efficiently. This algorithm is not entirely new as it is a special combination of techniques explored by Blasgen and Eswaran [1976; 1977], Kooi [1980], and Whang et al. [1984; 1985]. Blasgen and Eswaran considered the manipulation of RID lists but concluded that either merge-join or nested-loops join is the optimal choice in almost all cases; based on this study, only these two algorithms were implemented in System R [Astrahan et al. 1976] and subsequent relational database systems. Kooi's optimizer treated an index similarly to a base relation and the lookup of data records from index entries as a join; this naturally permitted joining two indices or an index with a base relation as in hybrid join.
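The following sketch (ours, not from the survey) shows merge-join control logic built directly on value packets; it assumes both inputs are iterables sorted on the join attribute, with the join attribute first in each item:

```python
from itertools import groupby

def merge_join(left, right, key=lambda item: item[0]):
    """Merge-join of two inputs sorted on the join attribute.  Both
    scans advance one 'value packet' (run of equal keys) at a time,
    so duplicates never force the inner scan to back up."""
    lpackets, rpackets = groupby(left, key), groupby(right, key)
    lkey, lpacket = next(lpackets, (None, None))
    rkey, rpacket = next(rpackets, (None, None))
    result = []
    while lpacket is not None and rpacket is not None:
        if lkey < rkey:                      # advance the side with the
            lkey, lpacket = next(lpackets, (None, None))  # smaller key
        elif lkey > rkey:
            rkey, rpacket = next(rpackets, (None, None))
        else:                                # matching packets: join them
            litems, ritems = list(lpacket), list(rpacket)
            result.extend((l, r) for l in litems for r in ritems)
            lkey, lpacket = next(lpackets, (None, None))
            rkey, rpacket = next(rpackets, (None, None))
    return result
```

The same skeleton serves semi-join, intersection, and the other one-to-one match operations by changing only what is emitted in each of the three branches.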
5.3 Hash Join Algorithms

Hash join algorithms are based on the idea of building an in-memory hash table on one input (the smaller one, frequently called the build input) and then probing this hash table using items from the other input (frequently called the probe input). These algorithms have only recently found greater interest [Bratbergsengen 1984; DeWitt et al. 1984; DeWitt and Gerber 1985; DeWitt et al. 1986; Fushimi et al. 1986; Kitsuregawa et al. 1983; 1989a; Nakayama et al. 1988; Omiecinski 1991; Schneider and DeWitt 1989; Shapiro 1986; Zeller and Gray 1990]. One reason is that they work very fast, i.e., without any temporary files, if the build input does indeed fit into memory, independently of the size of the probe input.
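In that simplest case, the algorithm reduces to the following sketch (ours; names are illustrative):

```python
def in_memory_hash_join(build, probe, build_key, probe_key):
    """Classic in-memory hash join: build a hash table on the smaller
    input, then probe it with each item of the larger input."""
    table = {}
    for b in build:                        # build phase
        table.setdefault(build_key(b), []).append(b)
    result = []
    for p in probe:                        # probe phase: a single pass,
        for b in table.get(probe_key(p), ()):  # no temporary files
            result.append((b, p))
    return result
```

Partitioning, discussed next, generalizes this scheme to build inputs larger than memory by applying it to each pair of partition files.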
However, they require overflow avoidance or resolution methods for larger build inputs, and suitable methods were developed and experimentally verified only in the mid-1980s, most notably in connection with the Grace and Gamma database machine projects [DeWitt et al. 1986; 1990; Fushimi et al. 1986; Kitsuregawa et al. 1983].

In hash-based join methods, build and probe inputs are partitioned using the same partitioning function, e.g., the join key value modulo the number of partitions. The final join result can be formed by concatenating the join results of pairs of partitioning files. Figure 13 shows the effect of partitioning the two inputs of a binary operation such as join into hash buckets and partitions. (This figure was adapted from a similar diagram by Kitsuregawa et al. [1983]. Mishra and Eich [1992] recently adapted and generalized it in their survey and comparison of relational join algorithms.) Without partitioning, each item in the first input must be compared with each item in the second input; this would be represented by complete shading of the entire diagram. With partitioning, items are grouped into partition files, and only pairs in the series of small rectangles (representing the partitions) must be compared.

[Figure 13. Effect of partitioning for join operations. Axes: first join input, second join input; only the diagonal rectangles of partition pairs are shaded.]

[Figure 14. Recursive partitioning in binary operations.]

If a build partition file is still larger than memory, recursive partitioning is required. Recursive partitioning is used for both build- and probe-partitioning files using the same hash and partitioning functions. Figure 14 shows how both input files are partitioned together. The partial results obtained from pairs of partition files are concatenated to form the result of the entire match operation. Recursive partitioning stops when the build partition fits into memory. Thus, the recursion depth of partitioning for binary match operators depends only on the size of the build input (which therefore should be chosen to be the smaller input) and is independent of the size of the probe input. Compared to sort-based binary matching operators, i.e., variants of merge-join in which the number of merge levels is determined for each input file individually, hash-based binary matching operators are particularly effective when the input sizes are very different [Bratbergsengen 1984; Graefe et al. 1993].

The I/O cost for binary hybrid hash operations can be determined by the number of complete levels (i.e., levels without hash table) and the fraction of the input remaining in memory in the deepest recursion level. For memory size $M$, cluster size $C$, partitioning fan-out $F = \lfloor M/C - 1 \rfloor$, build input size $R$, and probe input size $S$, the number of complete levels is $L = \lfloor \log_F(R/M) \rfloor$, after which the build input partitions should be of size $R' = R/F^L$. The I/O cost for the binary operation is the cost of partitioning the build input divided by the size of the build input and multiplied by the sum of the input sizes.
Adapting the cost formula for unary hashing discussed earlier, the total amount of I/O for a recursive binary hash operation is

$$2 \times (R \times (L + 1) - F^L \times (M - \lceil (R' - M + C)/(M - C) \rceil \times C))/R \times (R + S),$$

which can be approximated with $2 \times \log_F(R/M) \times (R + S)$. In other words, the cost of binary hash operations on large inputs is logarithmic; the main difference to the cost of merge-join is that the recursion depth (the logarithm) depends only on one file, the build input, and is not taken for each file individually.

As for all operations based on partitioning, partitioning (hash) value skew is the main danger to effectiveness. When using statistics on hash value distributions to determine which buckets should stay in memory in hybrid hash algorithms, the goal is to avoid as much I/O as possible with the least memory "investment." Thus, it is most effective to retain those buckets in memory with few build items but many probe items or, more formally, the buckets with the smallest value for $r_i/(r_i + s_i)$, where $r_i$ and $s_i$ indicate the total size of a bucket's build and probe items [Graefe 1993b].

5.4 Pointer-Based Joins

Recently, links between data items have found renewed interest, be it in object-oriented systems in the form of object identifiers (OIDs) or as access paths for faster execution of relational joins. In a sense, links represent a limited form of precomputed results, somewhat similar to indices and join indices, and have the usual cost-versus-benefit tradeoff between query performance enhancement and maintenance effort. Kooi [1980] modeled the retrieval of actual records after index searches as "TID joins" (tuple identifiers permitting direct record access) in his query optimizer for Ingres; together with standard join commutativity and associativity rules, this model permitted exploring joins of indices of different relations (joining lists of key-TID pairs) or joins of one relation with another relation's index. In the Genesis data model and database system, Batory et al. [1988a; 1988b] modeled joins in a functional way, borrowing from research into the database languages FQL [Buneman et al. 1982; Buneman and Frankel 1979], DAPLEX [Shipman 1981], and Gem [Tsur and Zaniolo 1984; Zaniolo 1983] and permitting pointer-based join implementations in addition to traditional value-based implementations such as nested-loops join, merge-join, and hybrid hash join.

Shekita and Carey [1990] recently analyzed three pointer-based join methods based on nested-loops join, merge-join, and hybrid hash join. Presuming relations R and S, with a pointer to an S tuple embedded in each R tuple, the nested-loops join algorithm simply scans through R and retrieves the appropriate S tuple for each R tuple. This algorithm is very reminiscent of unclustered index scans and performs similarly poorly for larger set sizes. Their conclusion on naive pointer-based join algorithms is that "it is unwise for object-oriented database systems to support only pointer-based join algorithms."

The merge-join variant starts with sorting R on the pointers (i.e., according to the disk addresses they point to) and then retrieves all S items in one elevator pass over the disk, reading each S page at most once. Again, this idea was suggested before for unclustered index scans, and variants similar to heap-filter merge-join [Graefe 1991] and complex object assembly using a window and priority heap of open references [Keller et al. 1991] can be designed.
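A sketch of this pointer-based merge-join (ours, not from the cited study; a dictionary of page contents stands in for the disk, and all names are illustrative):

```python
def pointer_merge_join(r_tuples, s_pages):
    """Pointer-based merge-join: sort R on its embedded (page, slot)
    pointers, then fetch S in one pass over the pages, reading each
    S page at most once.

    `r_tuples` are (payload, (page_no, slot)) pairs; `s_pages` maps
    page numbers to lists of S records, standing in for the disk."""
    result = []
    current_page_no, current_page = None, None
    for payload, (page_no, slot) in sorted(r_tuples, key=lambda r: r[1]):
        if page_no != current_page_no:        # at most one read per S page
            current_page_no = page_no
            current_page = s_pages[page_no]   # simulated disk read
        result.append((payload, current_page[slot]))
    return result
```

Sorting by disk address is what turns the random fetches of the naive pointer join into one sequential elevator pass.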
The hybrid hash join variant partitions only relation R on pointer values, ensuring that R tuples with S pointers to the same page are brought together, and then retrieves S pages and tuples. Notice that the two relations' roles are fixed by the direction of the pointers, whereas for standard hybrid hash join the smaller relation should be the build input. Differently than standard hybrid hash join, relation S is not partitioned. This algorithm performs somewhat faster than pointer-based merge-join if it keeps some partitions of R in memory and if sorting writes all R tuples into runs before merging them.
[Figure 15. Performance of alternative join methods. I/O count (×1,000, 0-125) plotted against the size of R (100-1,500, with S = 10 × R) for nested-loops join, merge-join with two sorts, hashing not using hybrid hashing, pointer join from R to S, and pointer join from S to R.]

Pointer-based join algorithms tend to outperform their standard value-based counterparts in many situations, in particular if only a small fraction of S actually participates in the join and can be selected effectively using the pointers in R. Historically, due to the difficulty of correctly maintaining pointers (nonessential links), they were rejected as a relational access method in System R [Chamberlin et al. 1981a] and subsequently in basically all other systems, perhaps with the exception of Kooi's modified Ingres [Kooi 1980; Kooi and Frankforth 1982]. However, they were reevaluated and implemented in the Starburst project, both as a test of Starburst's extensibility and as a means of supporting "more object-oriented" modes of operation [Haas et al. 1990].

5.5 A Rough Performance Comparison

Figure 15 shows an approximate performance comparison using the cost formulas developed above for block nested-loops join; merge-join with sorting both inputs without optimized merging; hash join without hybrid hashing, bucket tuning, or dynamic destaging; and pointer joins with pointers from R to S and from S to R without grouping pointers to the same target page together. This comparison is not precise; its sole purpose is to give a rough idea of the relative performance of the algorithm groups, deliberately ignoring the many tricks used to improve and fine-tune the basic algorithms. The relation sizes vary; S is always 10 times larger than R. The memory size is 100 KB; the cluster size is 8 KB; merge fan-in and partitioning fan-out are 10; and the number of R-records per cluster is 20.

It is immediately obvious in Figure 15 that nested-loops join is unsuitable for medium-size and large relations, because the cost of nested-loops join is proportional to the size of the Cartesian product of the two inputs. Both merge-join (sorting) and hash join have logarithmic cost functions; the sudden rise in merge-join and hash join cost around R = 1000 is due to the fact that additional partitioning or merging levels become necessary at that point. The sort-based merge-join is not quite as fast as hash join because the merge levels are determined individually for each file, including the bigger S file, while only the smaller build relation R determines the partitioning depth of hash join. Pointer joins are competitive with hash and merge-joins due to their linear cost function, but only when the pointers are embedded in the smaller relation R.
When S-records point to R-records, the cost of the pointer join is even higher than for nested-loops join.

The important point of Figure 15 is to illustrate that pointer joins can be very efficient or very inefficient, that one-to-one match algorithms based on nested-loops join are not competitive for medium-size and large inputs, and that sort- and hash-based algorithms for one-to-one match operations both have logarithmic cost growth. Of course, this comparison is quite naive since it uses only the simplest form of each algorithm. Thus, a comparison among alternative algorithms in a query optimizer must use the precise cost function for the available algorithm variant.

6. UNIVERSAL QUANTIFICATION¹³

Universal quantification permits queries such as "find the students who have taken all database courses"; the difference to one-to-one match operations is that a student qualifies because his or her transcript matches an entire set of courses, not only one item as in an existentially quantified query (e.g., "find students who have taken a (at least one) database course") that can be executed using a semi-join. In the past, universal quantification has been largely ignored for four reasons. First, typical database applications, e.g., record-keeping and accounting applications, rarely require universal quantification. Second, it can be circumvented using a complex expression involving a Cartesian product. Third, it can be circumvented using complex aggregation expressions. Fourth, there seemed to be a lack of efficient algorithms.

Footnote 13: This section is a summary of earlier work [Graefe 1989; Graefe and Cole 1993].

The first reason will not remain true for database systems supporting logic programming, rules, and quantifiers, and algorithms for universal quantification will become more important. The second reason is valid; however, the substitute expressions are very slow to execute because of the Cartesian product. The third reason is also valid, but replacing a universal quantifier may require very complex aggregation clauses that are easy to "get wrong" for the database user. Furthermore, they might be too complex for the optimizer to recognize as universal quantification and to execute with a direct algorithm. The fourth reason is not valid; universal quantification algorithms can be very efficient (in fact, as fast as semi-join, the operator for existential quantification), useful for very large inputs, and easy to parallelize. In the remainder of this section, we discuss sort- and hash-based direct and indirect (aggregation-based) algorithms for universal quantification.

In the relational world, universal quantification is expressed with the universal quantifier in relational calculus and with the division operator in relational algebra. We will explain algorithms for universal quantification using relational terminology. The running example in this section uses the relations Student (student-id, name, major), Course (course-no, title), Transcript (student-id, course-no, grade), and Requirement (major, course-no) with the obvious key attributes. The query to find the students who have taken all courses can be expressed in relational algebra as

$$\pi_{student\text{-}id,\, course\text{-}no}(Transcript) \div \pi_{course\text{-}no}(Course).$$
The projection of the Transcript relation is called the dividend, the projection of the Course relation the divisor, and the result relation the quotient. The quotient attributes are those attributes of the dividend that do not appear in the divisor. The dividend relation semi-joined with the divisor relation and projected on the quotient attributes, in the example the set of student-ids of Students who have taken at least one course, is called the set of quotient candidates here.
Some universal quantification queries seem to require relational division but actually do not. Consider the query for the students who have taken all courses required for their major. This query can be answered with a sequence of one-to-one match operations. A join of Student and Requirement projected on the student-id and course-no attributes minus the Transcript relation can be projected on student-ids to obtain a set of students who have not taken all their requirements. An anti-semi-join of the Student relation with this set finds the students who have satisfied all their requirements. This sequence will have acceptable performance because its required set-matching algorithms (join, difference, anti-semi-join) all belong to the family of one-to-one match operations, for which efficient algorithms are available as discussed in the previous section.

Division algorithms differ not only in their performance but also in how they fit into complex queries. Prior to the division, selections on the dividend, e.g., only Transcript entries with "A" grades, or on the divisor, e.g., only the database courses, may be required. Restrictions on the dividend can easily be enforced without much effect on the division operation, while restrictions on the divisor can imply a significant difference for the query evaluation plan. Subsequent to the division operation, the resulting quotient relation (e.g., a set of student-ids) may be joined with another relation, e.g., the Student relation to obtain student names. Thus, obtaining the quotient in a form suitable for further processing (e.g., join or semi-join with a third relation) can be advantageous.

Typically, universal quantification can easily be replaced by aggregations. (Intuitively, all universal quantification can be replaced by aggregation. However, we have not found a proof for this statement.) For example, the example query about database courses can be restated as "find the students who have taken as many database courses as there are database courses." When specifying the aggregate function, it is important to count only database courses both in the dividend (the Transcript relation) and in the divisor (the Course relation). Counting only database courses might be easy for the divisor relation, but requires a semi-join of the dividend with the divisor relation to propagate the restriction on the divisor to the dividend if it is not known a priori whether or not referential integrity holds between the dividend's divisor attributes and the divisor, i.e., whether or not there are divisor attribute values in the dividend that cannot be found in the divisor. For example, course-nos in the Transcript relation that do not pertain to database courses (and are therefore not in the divisor) must be removed from the dividend by a semi-join with the divisor. In general, if the divisor is the result of a prior selection, any referential integrity constraints known for stored relations will not hold and must be explicitly enforced using a semi-join. Furthermore, in order to ensure correct counting, duplicates have to be removed from either input if the inputs are projections on nonkey attributes.
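As an illustration of this aggregation-based approach, here is a small sketch (ours; the names and in-memory data structures are illustrative, not from the survey) that applies the counting idea, including the semi-join that enforces the restriction to divisor values and the sets that take care of duplicate removal:

```python
def division_by_counting(dividend, divisor):
    """Relational division via aggregation: a quotient candidate
    qualifies if it matches as many distinct divisor items as there
    are divisor items.

    `dividend` contains (quotient_attr, divisor_attr) pairs;
    `divisor` contains divisor_attr values."""
    divisor_set = set(divisor)                # duplicate removal on the divisor
    matches = {}
    for quotient_value, divisor_value in dividend:
        if divisor_value in divisor_set:      # semi-join with the divisor
            matches.setdefault(quotient_value, set()).add(divisor_value)
    return [q for q, seen in matches.items() if len(seen) == len(divisor_set)]
```

For the running example, calling this with (student-id, course-no) pairs from Transcript and the course-nos of the database courses returns the student-ids of students who have taken all database courses.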
There are four methods to compute the quotient of two relations: a sort- and a hash-based direct method, and sort- and hash-based aggregation. Table 6 shows this classification of relational division algorithms. Methods for sort- and hash-based aggregation and the possible sort- or hash-based semi-join have already been discussed, including their variants for inputs larger than memory and their cost functions. Therefore, we focus here on the direct division algorithms.

The sort-based direct method, proposed by Smith and Chang [1975] and called naive division here, sorts the divisor input on all its attributes and the dividend relation with the quotient attributes as major and the divisor attributes as minor sort keys. It then proceeds with a merging scan of the two sorted inputs to determine which items belong in the quotient. Notice that the scan can be programmed such that it ignores duplicates in either input (in case those had not been removed yet in the sort) as well as dividend items that do not refer to items in the divisor.
Table 6. Classification of Relational Division Algorithms

                                  | Based on Sorting                      | Based on Hashing
    Direct                        | Naive division                        | Hash-division
    Indirect by semi-join         | Sorting with duplicate removal,       | Hash-based duplicate removal,
    and aggregation               | merge-join, sorting with aggregation  | hybrid hash join, hash-based aggregation

Thus, neither a preceding semi-join nor explicit duplicate removal steps are necessary for naive division. The I/O cost of naive division is the cost of sorting the two inputs plus the cost of repeated scans of the divisor input.

Figure 16 shows two tables, a dividend and a divisor, properly sorted for naive division. Concurrent scans of the "Jack" tuples (only one) in the dividend and of the entire divisor determine that "Jack" is not part of the quotient because he has not taken the "Readings in Databases" course. A continuing scan through the "Jill" tuples in the dividend and a new scan of the entire divisor include "Jill" in the output of the naive division. The fact that "Jill" has also taken an "Intro to Graphics" course is ignored by a suitably general scan logic for naive division.

    Student    Course
    Jack       Intro to Databases
    Jill       Intro to Databases
    Jill       Intro to Graphics
    Jill       Readings in Databases

Figure 16. Sorted inputs into naive division. [The extracted figure shows only the dividend; per the surrounding text, the divisor lists the database courses "Intro to Databases" and "Readings in Databases."]

The hash-based direct method, called hash-division, uses two hash tables, one for the divisor and one for the quotient candidates. While building the divisor table, a unique sequence number is assigned to each divisor item. After the divisor table has been built, the dividend is consumed. For each quotient candidate, a bit map is kept with one bit for each divisor item. The bit map is indexed with the sequence numbers assigned to the divisor items. If a dividend item does not match with an item in the divisor table, it can be ignored immediately. Otherwise, a quotient candidate is either found or created, and the bit corresponding to the matching divisor item is set. When the entire dividend has been consumed, the quotient consists of those quotient candidates for which all bits are set.

This algorithm can ignore duplicates in the divisor (using hash-based duplicate removal during insertion into the divisor table) and automatically ignores duplicates in the dividend as well as dividend items that do not refer to items in the divisor (e.g., the AI course in the example). Thus, neither prior semi-join nor duplicate removal are required. However, if both inputs are known to be duplicate free, the bit maps can be replaced by counters. Furthermore, if referential integrity is known to hold, the divisor table can be omitted and replaced by a single counter. Hash-division, including these variants, has been implemented in the Volcano query execution engine and has shown better performance than the other three algorithms [Graefe 1989; Graefe and Cole 1993]. In fact, the performance of hash-division is almost equal to a hash-based join or semi-join of dividend and divisor relations (a semi-join corresponds to existential quantification), making universal quantification and relational division realistic operations and algorithms to use in database applications.
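The core of hash-division fits in a few lines; the sketch below (ours, not the Volcano implementation) uses Python integers as bit maps and assumes both hash tables fit in memory, i.e., no divisor or quotient partitioning:

```python
def hash_division(dividend, divisor):
    """Hash-division: a divisor table assigns each divisor item a
    sequence number; each quotient candidate keeps a bit map with
    one bit per divisor item.

    `dividend` contains (quotient_attr, divisor_attr) pairs."""
    divisor_table = {}                       # divisor item -> sequence number
    for d in divisor:
        if d not in divisor_table:           # ignore duplicates in the divisor
            divisor_table[d] = len(divisor_table)
    full = (1 << len(divisor_table)) - 1     # bit map with all bits set
    quotient_table = {}                      # quotient candidate -> bit map
    for q, d in dividend:
        seq = divisor_table.get(d)
        if seq is None:                      # no match in the divisor: ignore
            continue
        quotient_table[q] = quotient_table.get(q, 0) | (1 << seq)
    return [q for q, bits in quotient_table.items() if bits == full]
```

With the dividend of Figure 16 and the two database courses as divisor, this returns only "Jill": Jack's bit map never acquires the bit for "Readings in Databases," and Jill's "Intro to Graphics" tuple is ignored because it finds no match in the divisor table.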
The aspect of hash-division that makes it an efficient algorithm is that the set of matches between a quotient candidate and the divisor is represented efficiently using a bit map. Bit maps are one of the standard data structures to represent sets, and just as bit maps can be used for a number of set operations, the bit maps associated with each quotient candidate can also be used for a number of operations similar to relational division. For example, Carlis [1986] proposed a generalized division operator called "HAS" that included relational division as a special case. The hash-division algorithm can easily be extended to compute quotient candidates in the dividend that match a majority or given fraction of divisor items as well as (with one more bit in each bit map) quotient candidates that do or do not match exactly the divisor items.

For real queries containing a division, consider the operation that frequently follows a division. In the example, a user is typically not really interested in student-ids only but in information about the students. Thus, in many cases, relational division results will be used to select items from another relation using a semi-join. The sort-based algorithms produce their output sorted, which will facilitate a subsequent (semi-) merge-join. The hash-based algorithms produce their output in hash order; if overflow occurred, there is no predictable order at all. However, both aggregation-based and direct hash-based algorithms use a hash table on the quotient attributes, which may be used immediately for a subsequent (semi-) join. It seems quite straightforward to use the same hash table for the aggregation and a subsequent join as well as to modify hash-division such that it removes quotient candidates from the quotient table that do not belong to the final quotient and then performs a semi-join with a third input relation.

If the two hash tables do not fit into memory, the divisor table or the quotient table or both can be partitioned, and individual partitions can be held on disk for processing in multiple steps. In divisor partitioning, the final result consists of those items that are found in all partial results; the final result is the intersection of all partial results. For example, if the Course relations in the example above are partitioned into undergraduate and graduate courses, the final result consists of the students who have taken all undergraduate courses and all graduate courses, i.e., those that can be found in the division result of each partition. In quotient partitioning, the entire divisor must be kept in memory for all partitions. The final result is the concatenation (union) of all partial results. For example, if Transcript items are partitioned by odd and even student-ids, the final result is the union (concatenation) of all students with odd student-id who have taken all courses and those with even student-id who have taken all courses. If warranted by the input data, divisor partitioning and quotient partitioning can be combined.

Hash-division can be modified into an algorithm for duplicate removal. Consider the problem of removing duplicates from a relation R(X, Y) where X and Y are suitably chosen attribute groups. This relation can be stored using two hash tables, one storing all values of X (similar to the divisor table) and assigning each of them a unique sequence number, the other storing all values of Y and bit maps that indicate which X values have occurred with each Y value.
Consider a brief example for this algorithm: Say relation R(X, Y) contains 1 million tuples, but only 100,000 tuples if duplicates were removed. Let X and Y each be 100 bytes long (total record size 200), and assume there are 4,000 unique values of each X and Y. For the standard hash-based duplicate removal algorithm, 100,000 × 200 bytes of memory are needed for duplicate removal without use of temporary files. For the redesigned hash-division algorithm, 2 × 4,000 × 100 bytes are needed for data values, 4,000 × 4 for unique sequence numbers, and 4,000 × 4,000 bits for bit maps. Thus, the new algorithm works efficiently with less than 3 MB of memory while conventional duplicate removal requires slightly more than 19 MB of memory, or seven times more than the duplicate removal algorithm adapted from hash-division.
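For checking, the arithmetic of this comparison can be restated in a few lines (a sketch of ours; sizes in bytes):

```python
distinct_tuples = 100_000               # tuples remaining after duplicate removal
unique_values = 4_000                   # distinct X values; same for Y
conventional = distinct_tuples * 200    # 200-byte records kept in memory
adapted = (2 * unique_values * 100      # X and Y data values
           + unique_values * 4          # unique sequence numbers
           + unique_values * unique_values // 8)  # one bit per (X, Y) combination
print(conventional, adapted, round(conventional / adapted, 1))
# 20000000 2816000 7.1
```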
Clearly, choosing attribute groups X and Y with relatively few unique values is crucial for the performance and memory efficiency of this new algorithm. Since such knowledge is not available in most systems and queries (even though some efficient and helpful algorithms exist, e.g., Astrahan et al. [1987]), optimizer heuristics for choosing this algorithm might be difficult to design and verify.

To summarize the discussion on universal quantification algorithms: aggregation can be used in systems that lack direct division algorithms, and hash-division performs universal quantification and relational division generally, i.e., it covers cases with duplicates in the inputs and with referential integrity violations, and efficiently, i.e., it permits partitioning and using hybrid hashing techniques similar to hybrid hash join, making universal quantification (division) as fast as existential quantification (semi-join). As will be discussed later, it can also be effectively parallelized.

7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS¹⁴

We conclude the discussion of individual query processing by outlining the many existing similarities and dualities of sort- and hash-based query-processing algorithms as well as the points where the two types of algorithms differ. The purpose is to contribute to a better understanding of the two approaches and their tradeoffs. We try to discuss the approaches in general terms, ignoring whether the algorithms are used for relational join, union, intersection, aggregation, duplicate removal, or other operations. Where appropriate, however, we indicate specific operations. Table 7 gives an overview of the features that correspond to one another.

Footnote 14: Parts of this section have been derived from Graefe et al. [1993], which also provides experimental evidence for the relative performance of sort- and hash-based query processing algorithms and discusses simple cases of transferring tuning ideas from one type of algorithm to the other. The discussion of this section is continued in Graefe [1993a; 1993c].

Both approaches permit in-memory versions for small data sets and disk-based versions for larger data sets. If a data set fits into memory, quicksort is the sort-based method to manage data sets while classic (in-memory) hashing can be used as a hashing technique. It is interesting to note that both quicksort and classic hashing are also used in memory to operate on subsets after "cutting" an entire large data set into pieces. The cutting process is part of the divide-and-conquer paradigm employed for both sort- and hash-based query-processing algorithms. This important similarity of sorting and hashing has been observed before, e.g., by Bratbergsengen [1984] and Salzberg [1988]. There exists, however, an important difference. In the sort-based algorithms, a large data set is divided into subsets using a physical rule, namely into chunks as large as memory. These chunks are later combined using a logical step, merging. In the hash-based algorithms, large inputs are cut into subsets using a logical rule, by hash values. The resulting partitions are later combined using a physical step, i.e., by simply concatenating the subsets or result subsets. In other words, a single-level merge in a sort algorithm is a dual to partitioning in hash algorithms. Figure 17 illustrates this duality and the opposite directions.

This duality can also be observed in the behavior of a disk arm performing the I/O operations for merging or partitioning. While writing initial runs after sorting them with quicksort, the I/O is sequential. During merging, read operations access the many files being merged and require random I/O capabilities. During partitioning, the I/O operations are random, but when reading a partition later on, they are sequential.

For both approaches, sorting and hashing, the amount of available memory limits not only the amount of data in a basic unit processed using quicksort or classic hashing, but also the number of basic units that can be accessed simultaneously. For sorting, it is well known that merging is limited to the quotient of memory size and buffer space required for each run, called the merge fan-in.
While writing initial runs after sorting them with quicksort, the 1/0 is sequential. During merging, read opera- tions access the many files being merged and require random 1/O capabilities. During partitioning, the 1/0 operations are random, but when reading a parti- tion later on, they are sequential. For both approaches, sorting and hash- ing, the amount of available memory lim- its not only the amount of data in a basic unit processed using quicksort or classic hashing, but also the number of basic units that can be accessed simultane- ously. For sorting, it is well known that merging is limited to the quotient of memory size and buffer space required for each run, called the merge fan-in. ACM Computing Surveys, Vol 25, No. 2, June 1993
Table 7. Duality of Sort- and Hash-Based Algorithms

Aspect | Sorting | Hashing
In-memory algorithm | Quicksort | Classic hash
Divide-and-conquer paradigm | Physical division, logical combination | Logical division, physical combination
Large inputs | Single-level merge | Partitioning
I/O patterns | Sequential write, random read | Random write, sequential read
Temporary files accessed simultaneously | Fan-in | Fan-out
I/O optimizations | Read-ahead, forecasting | Write-behind
I/O optimizations | Double-buffering, striping merge output | Double-buffering, striping partitioning input
Very large inputs | Multilevel merge | Recursive partitioning
Very large inputs | Merge levels | Recursion depth
Optimizations | Nonoptimal final fan-in | Nonoptimal hash table size
Optimizations | Merge optimizations | Bucket tuning
Better use of memory | Reverse runs and LRU | Hybrid hashing
Better use of memory | Replacement selection | ?
Better use of memory | ? | Single input in memory
Aggregation and duplicate removal | Aggregation in replacement selection | Aggregation in hash table
Algorithm phases | Run generation, intermediate and final merge | Initial and intermediate partitioning, in-memory (hybrid) hashing
Resource sharing | Eager merging | Depth-first partitioning
Resource sharing | Lazy merging | Breadth-first partitioning
Partitioning skew and effectiveness | Merging run files of different sizes | Uneven output file sizes
"Item value" | log (run size) | log (build partition size / original build input size)
Bit vector filtering | For both inputs and on each merge level? | For both inputs and on each recursion level
Interesting orderings, multiple joins | Multiple merge-joins without sorting intermediate results | N-ary partitioning and joins
Interesting orderings: grouping/aggregation followed by join | Sorted grouping on foreign key useful for subsequent join | Grouping while building the hash table in hash join
Interesting orderings in index structures | B-trees feeding into a merge-join | Merging in hash value order
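The limits and level counts are easy to quantify. In the toy calculation below, the memory and buffer sizes are assumed values, not figures from the text; it shows that fan-in and fan-out are the same quotient:

    import math

    M = 128      # memory size in pages (assumed)
    b = 4        # buffer pages per run or partition file (assumed)
    F = M // b   # merge fan-in = partitioning fan-out

    # For an input N memory-sized units large, the number of merge levels
    # equals the recursion depth of partitioning: ceil(log_F(N)).
    N = 10_000
    print(F, math.ceil(math.log(N, F)))   # 32 and 3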
In order to keep the merge process active at all times, many merge implementations use read-ahead controlled by forecasting, trading reduced I/O delays for a reduced fan-in. In the ideal case, the bandwidths of I/O and processing (merging) match, and I/O latencies for both the merge input and output are hidden by read-ahead and double-buffering, as mentioned earlier in the section on sorting. The dual to read-ahead during merging is write-behind during partitioning, i.e., keeping a free output buffer that can be allocated to an output file while the previous page for that file is being written to disk. There is no dual to forecasting, because it is trivial that the next output partition to write to is the one whose output cluster has just filled up. Both read-ahead in merging and write-behind in partitioning are used to ensure that the processor never has to wait for the completion of an I/O operation. Another dual is double-buffering and striping over multiple disks for the output of sorting and the input of partitioning.

Considering the limitation on fan-in and fan-out, additional techniques must be used for very large inputs. Merging can be performed in multiple levels, each combining multiple runs into larger ones. Similarly, partitioning can be repeated recursively, i.e., partition files are repartitioned, the results repartitioned, etc., until the partition files fit into main memory. In sorting and merging, the runs grow in each level by a factor equal to the fan-in. In partitioning, the partition files decrease in size by a factor equal to the fan-out in each recursion level. Thus, the number of levels during merging is equal to the recursion depth during partitioning. There are two exceptions to be made regarding hash value distributions and relative sizes of inputs in binary operations such as join; we ignore those for now and will come back to them later.

If merging is done in the most naive way, i.e., merging all runs of a level as soon as their number reaches the fan-in, the last merge on each level might not be optimal. Similarly, if the highest possible fan-out is used in each partitioning step, the partition files in the deepest recursion level might be smaller than memory, and less than the entire memory is used when processing these files. Thus, in both approaches the memory resources are not used optimally in the most naive versions of the algorithms. In order to make best use of the final merge (which, by definition, includes all output items and is therefore the most expensive merge), it should proceed with the maximal possible fan-in. This can be ensured by merging fewer runs than the maximal fan-in after the end of the input file has been reached (as discussed in the earlier section on sorting). There is no direct dual in hash-based algorithms for this optimization. With respect to memory utilization, the fact that a partition file and therefore a hash table might actually be smaller than memory is the closest to a dual. Utilizing memory more effectively and using less than the maximal fan-out in hashing have been addressed in research on bucket tuning [Kitsuregawa et al. 1989a] and on histogram-driven recursive hybrid hash join [Graefe 1993a].

The development of hybrid hash algorithms [DeWitt et al. 1984; Shapiro 1986] was a consequence of the advent of large main memories that had led to the consideration of hash-based join algorithms in the first place. If the data set is only slightly larger than the available memory, e.g., 10% larger or twice as large, much of the input can remain in memory and is never written to a disk-resident partition file. To obtain the same effect for sort-based algorithms, if the database system's buffer manager is sufficiently smart or receives and accepts appropriate hints, it is possible to retain some or all of the pages of the last run written in memory and thus achieve the same effect of saving I/O operations. This effect can be exploited particularly easily if the initial runs are written in reverse (descending) order and scanned backward for merging.
However, if one does not believe in buffer hints or prefers to absolutely ensure these I/O savings, then using a final memory-resident run explicitly in the sort algorithm and merging it with the disk-resident runs can guarantee this effect.

Another well-known technique to use memory more effectively and to improve sort performance is to generate runs twice as large as main memory using a priority heap for replacement selection [Knuth 1973], as discussed in the earlier section on sorting. If the runs' sizes are doubled, their number is cut in half. Therefore, merging can be reduced by some amount, namely log_F(2) = 1/log_2(F) merge levels. This optimization for sorting has no direct dual in the realm of hash-based query-processing algorithms.
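A compact sketch of replacement selection with a priority heap follows. Tagging each key with a run number is one common way to keep keys that arrive too late for the current run out of it; this simplification sorts bare keys in Python lists rather than records on pages.

    import heapq
    from itertools import islice

    def replacement_selection(records, memory=3):
        """Yield sorted runs; for random input, runs average twice 'memory'."""
        it = iter(records)
        # Prime the heap; each entry is (run number, key).
        heap = [(0, k) for k in islice(it, memory)]
        heapq.heapify(heap)
        run_no, run = 0, []
        while heap:
            no, key = heapq.heappop(heap)
            if no != run_no:          # current run exhausted, start the next
                yield run
                run_no, run = no, []
            run.append(key)           # "write" key to the current run file
            nxt = next(it, None)
            if nxt is not None:
                # A key smaller than the last one written would violate the
                # sort order of the current run; defer it to the next run.
                heapq.heappush(heap, (run_no + (nxt < key), nxt))
        if run:
            yield run

    for r in replacement_selection([9, 3, 7, 1, 8, 2, 6, 5, 4, 0]):
        print(r)   # [3, 7, 8, 9], [1, 2, 4, 5, 6], [0]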
If two sort operations produce input data for a binary operator such as a merge-join, and if both sort operators' final merges are interleaved with the join, each final merge can employ only half the memory. In hash-based one-to-one match algorithms, only one of the two inputs resides in and consumes memory beyond a single input buffer, not both as in two final merges interleaved with a merge-join. This difference in the use of the two inputs is a distinct advantage of hash-based one-to-one match algorithms that does not have a dual in sort-based algorithms. Interestingly, these two differences of sort- and hash-based one-to-one match algorithms cancel each other out: cutting the number of runs in half (on each merge level, including the last one) by using replacement selection for run generation exactly offsets this disadvantage of sort-based one-to-one match operations.

Run generation using replacement selection has a second advantage over quicksort; this advantage has a direct dual in hashing. If a hash table is used to compute an aggregate function using grouping, e.g., sum of salaries by department, hash table overflow occurs only if the operation's output does not fit in memory. Consider, for example, the sum of salaries by department for 100,000 employees in 1,000 departments. If the 1,000 result records fit in memory, classic hashing (without overflow) is sufficient. On the other hand, if sorting based on quicksort is used to compute this aggregate function, the input must fit into memory to avoid temporary files. If replacement selection is used for run generation, however, the same behavior as with classic hashing is easy to achieve.

[Footnote 15: A scheme using quicksort and avoiding temporary I/O in this case can be devised but would be extremely cumbersome; we do not know of any report or system with such a scheme.]
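The salary example translates directly into code: with classic in-memory hashing, the hash table holds one entry per department (the output), not per employee (the input). The relation layout below is invented for illustration.

    def hash_aggregate(employees):
        """Sum of salaries by department using classic in-memory hashing.
        The hash table holds one entry per group, i.e., per output record."""
        sums = {}
        for dept, salary in employees:
            sums[dept] = sums.get(dept, 0) + salary
        return sums

    emps = [("toys", 100), ("books", 200), ("toys", 150), ("books", 50)]
    print(hash_aggregate(emps))   # {'toys': 250, 'books': 250}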
If an iterator interface is used for both its input and output, and therefore multiple operators overlap in time, a sort operator can be divided into three distinct algorithm phases. First, input items are consumed and sorted into initial runs. Second, intermediate merging reduces the number of runs such that only one final merge step is left. Third, the final merge is performed on demand from the consumer of the sorted data stream. During the first phase, the sort iterator has to share resources, most notably memory and disk bandwidth, with its producer operators in a query evaluation plan. Similarly, the third phase must share resources with the consumers.

In many sort implementations, namely those using eager merging, the first and second phases interleave, as a merge step is initiated whenever the number of runs on one level becomes equal to the fan-in. Thus, some intermediate merge steps cannot use all resources. In lazy merging, which starts intermediate merges only after all initial runs have been created, the intermediate merges do not share resources with other operators and can use the entire memory allocated to a query evaluation plan; thus, intermediate merges can be more effective in lazy merging than in eager merging.

Hash-based query-processing algorithms exhibit three similar phases. First, the first partitioning step executes concurrently with the input operator or operators. Second, intermediate partitioning steps divide the partition files to ensure that they can be processed with hybrid hashing. Third, hybrid and in-memory hash methods process these partition files and produce output passed to the consumer operators. As in sorting, the first and third phases must share resources with other concurrent operations in the same query evaluation plan.

The standard implementation of hash-based query-processing algorithms for very large inputs uses recursion, i.e., the original algorithm is invoked for each partition file (or pair of partition files).
While conceptually simple, this method has the disadvantage that output is produced before all intermediate partitioning steps are complete. Thus, the operators that consume the output must allocate resources to receive this output, typically memory (e.g., a hash table). Further intermediate partitioning steps will have to share resources with the consumer operators, making them less effective. We call this direct recursive implementation of hash-based partitioning depth-first partitioning and consider its behavior as well as its resource sharing and performance effects a dual to eager merging in sorting. The alternative schedule is breadth-first partitioning, which completes each level of partitioning before starting the next one. Thus, hybrid and in-memory hashing are not initiated until all partition files have become small enough to permit hybrid and in-memory hashing, and intermediate partitioning steps never have to share resources with consumer operators. Breadth-first partitioning is a dual to lazy merging, and it is not surprising that they are both equally more effective than depth-first partitioning and eager merging, respectively.

It is well known that partitioning skew reduces the effectiveness of hash-based algorithms. Thus, the situation shown in Figure 18 is undesirable. In the extreme case, one of the partition files is as large as the input, and an entire partitioning step has been wasted. It is less well recognized that the same issue also pertains to sort-based query-processing algorithms [Graefe 1993c]. Unfortunately, in order to reduce the number of merge steps, it is often necessary to merge files from different merge levels and therefore of different sizes. In other words, the goals of optimized merging and of maximal merge effectiveness do not always match, and very sophisticated merge plans, e.g., polyphase merging, might be required [Knuth 1973].

Figure 18. Partitioning skew.

The same effect can also be observed if "values" are attached to items in runs and in partition files. Values should reflect the work already performed on an item. Thus, the value should increase with run sizes in sorting, while the value must increase as partition files get smaller in hash-based query-processing algorithms. For sorting, a suitable choice for such a value is the logarithm of the run size [Graefe 1993c]. The value of a sorted run is then the product of the run's size and the size's logarithm. The optimal merge effectiveness is achieved if each item's value increases in each merge step by the logarithm of the fan-in, and the overall value of all items increases by this logarithm multiplied with the data volume participating in the merge step. However, only if all runs in a merge step are of the same size will the value of all items increase with the logarithm of the fan-in.

In hash-based query processing, the corresponding value is the fraction of a partition size relative to the original input size [Graefe 1993c]. Since only the build input determines the number of recursion levels in binary hash partitioning, we consider only the build partition. If the partitioning is skewed, i.e., if the output partition files are not of uniform length, the overall effectiveness of the partitioning step is not optimal, i.e., less than the logarithm of the partitioning fan-out. Thus, preventing or managing skew in partitioning hash functions is very important [Graefe 1993a].
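A small calculation illustrates the value measure for sorting: merging equal-sized runs raises every item's value by the logarithm of the fan-in, while a skewed merge of the same data volume gains much less. The run sizes are toy numbers, and base-2 logarithms are an arbitrary choice.

    import math

    def run_value(size):
        # Value of a sorted run: its size times the logarithm of its size.
        return size * math.log2(size)

    def merge_gain_per_item(run_sizes):
        # Increase in total value per item caused by one merge step.
        total = sum(run_sizes)
        return (run_value(total) - sum(run_value(s) for s in run_sizes)) / total

    print(merge_gain_per_item([100, 100, 100, 100]))  # 2.0 = log2(fan-in 4)
    print(merge_gain_per_item([370, 10, 10, 10]))     # ~0.5, same data volume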
Bit vector filtering, which will be discussed later in more detail, can be used for both sort- and hash-based one-to-one match operations, although it has been used mainly for parallel joins to date. Basically, a bit vector filter is a large array of bits initialized by hashing items in the first input of a one-to-one match operator and used to detect items in the second input that cannot possibly have a match in the first input. In effect, bit vector filtering reduces the second input to the items that truly participate in the binary operation plus some "false passes" due to hash collisions in the bit vector filter. In a merge-join with two sort operations, if the bit vector filter is used before the second sort, bit vector filtering is as effective as in hybrid hash join in reducing the cost of processing the second input. In merge-join, it can also be used symmetrically, as shown in Figure 19. Notice that for the right input, bit vector filtering reduces the sort input size, whereas for the left input, it only reduces the merge-join input.

Figure 19. Merge-join with symmetric bit vector filtering.

In recursive hybrid hash join, bit vector filtering can be used in each recursion level. The effectiveness of bit vector filtering increases in deeper recursion levels, because the number of distinct data values in each partition file decreases, thus reducing the number of hash collisions and false passes if bit vector filters of the same size are used in each recursion level. Moreover, it can be used in both directions, i.e., to reduce the second input using a bit vector filter based on the first input and to reduce the first input (in the next recursion level) using a bit vector filter based on the second input. The same effect could be achieved for sort-based binary operations requiring multilevel sorting and merging, although doing so implies switching back and forth between the two sorts for the two inputs after each merge level. Not surprisingly, such switching back and forth would be the dual to the partitioning process of both inputs in recursive hybrid hash join. However, sort operators that switch back and forth on each merge level are not only complex to implement but may also inhibit the merge optimization discussed earlier.
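A bit vector filter itself is only a few lines; the filter size and the use of a single hash function are arbitrary choices here, and the false passes mentioned above are expected by design.

    class BitVectorFilter:
        """Large bit array set from one input and probed with the other."""
        def __init__(self, nbits=1 << 16):
            self.nbits = nbits
            self.bits = bytearray(nbits // 8)

        def insert(self, key):
            h = hash(key) % self.nbits
            self.bits[h >> 3] |= 1 << (h & 7)

        def may_contain(self, key):
            h = hash(key) % self.nbits
            return bool(self.bits[h >> 3] & (1 << (h & 7)))

    # Set bits while consuming the first input, then drop second-input
    # items that cannot possibly have a match in the first input.
    bvf = BitVectorFilter()
    for key in [3, 17, 42]:
        bvf.insert(key)
    survivors = [k for k in [1, 3, 8, 42, 99] if bvf.may_contain(k)]
    print(survivors)   # [3, 42], plus possible false passes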
The final entries in Table 7 concern interesting orderings used in the System R query optimizer [Selinger et al. 1979] and presumably in other query optimizers as well. A strong argument in favor of sorting and merge-join is the fact that merge-join delivers its output in sorted order; thus, multiple merge-joins on the same attribute can be performed without sorting intermediate join results. For joining three relations, as shown in Figure 20, pipelining data from one merge-join to the next without sorting translates into a 3:4 advantage in the number of sort operations compared to two joins on different join keys, because the intermediate result O1 does not need to be sorted. For joining N relations on the same key, only N sorts are required, instead of 2 x N - 2 for joins on different attributes. Since set operations such as the union or intersection of N sets can always be performed using a merge-join algorithm without sorting intermediate results, the effect of interesting orderings is even more important for set operations than for relational joins.

Figure 20. The effect of interesting orderings.

Hash-based algorithms tend to produce their outputs in a very unpredictable order, depending on the hash function and on overflow management.
In order to take advantage of multiple joins on the same attribute (or of multiple intersections, etc.), similar to the advantage derived from interesting orderings in sort-based query processing, the equality of attributes has to be exploited during the logical step of hashing, i.e., during partitioning. In other words, such set operations and join queries can be executed effectively by a hash join algorithm that recursively partitions N inputs concurrently. The recursion terminates when N - 1 inputs fit into memory and when the Nth input is used to probe N - 1 hash tables.
Thus, the basic operation of this N-ary join (intersection, etc.) is an N-ary join of an N-tuple of partition files, not of pairs as in binary hash join with one build and one probe file for each partition. Figure 21 illustrates recursive partitioning for a join of three inputs. Instead of partitioning and joining a pair of inputs and pairs of partition files as in traditional binary hybrid hash join, there are file triples (or N-tuples) at each step.

Figure 21. Partitioning in a multi-input hash join.

However, N-ary recursive partitioning is cumbersome to implement, in particular if some of the "join" operations are actually semi-join, outer join, set intersection, union, or difference. Therefore, until a clean implementation method for hash-based N-ary matching has been found, it might well be that this distinction, joins on the same or on different attributes, contributes to the right choice between sort- and hash-based algorithms for complex queries.

Another situation with interesting orderings is an aggregation followed by a join. Many aggregations condense information about individual entities; thus, the aggregation operation is performed on a relation representing the "many" side of a many-to-one relationship or on the relation that represents relationship instances of a many-to-many relationship. For example, students' grade point averages are computed by grouping and averaging transcript entries in a many-to-many relationship called transcript between students and courses. The important point to note here and in many similar situations is that the grouping attribute is a foreign key. In order to relate the aggregation output with other information pertaining to the entities about which information was condensed, aggregations are frequently followed by a join. If the grouping operation is based on sorting (on the grouping attribute, which very frequently is a foreign key), the natural sort order of the aggregation output can be exploited for an efficient merge-join without sorting.

While this seems to be an advantage of sort-based aggregation and join, this combination of operations also permits a special trick in hash-based query processing [Graefe 1993b]. Hash-based aggregation is based on identifying items of the same group while building the hash table. At the end of the operation, the hash table contains all output items hashed on the grouping attribute. If the grouping attribute is the join attribute in the next operation, this hash table can immediately be probed with the other join input. Thus, the combined aggregation-join operation uses only one hash table, not two hash tables as two separate operations would. The differences to two separate operations are that only one join input can be aggregated efficiently and that the aggregated input must be the join's build input. Both issues could be addressed by symmetric hash joins with a hash table on each of the inputs, which would be as efficient as sorting and grouping both join inputs.
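The trick can be sketched with the transcript example from above; the relation layouts are invented for illustration, and a real operator would use the system's hash table and overflow logic rather than a Python dictionary.

    def aggregate_then_join(transcripts, students):
        """Grouping on a foreign key, then joining, with one hash table."""
        # Hash aggregation: grade point average per student id.
        acc = {}   # student_id -> (sum of grades, count)
        for student_id, grade in transcripts:
            s, c = acc.get(student_id, (0.0, 0))
            acc[student_id] = (s + grade, c + 1)
        # The aggregation output is already hashed on the join attribute,
        # so the table doubles as the join's build input; probe it directly.
        for student_id, name in students:
            if student_id in acc:
                s, c = acc[student_id]
                yield name, s / c

    ts = [(1, 4.0), (1, 3.0), (2, 2.0)]
    print(list(aggregate_then_join(ts, [(1, "Ann"), (2, "Bob")])))
    # [('Ann', 3.5), ('Bob', 2.0)]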
  • 50. 122 “ Goetz Graefe join input. Thus, the combined aggrega- tion-join operation uses only one hash table, not two hash tables as two sepa- rate o~erations would do. The differences to tw~ separate operations are that only one join input can be aggregated effi- ciently and that the aggregated input must be the join’s build input. Both is- sues could be addressed by symmetric hash ioins with a hash table on each of the in”~uts which would be as efficient as sorting and grouping both join inputs. A third use of interesting orderings is the positive interaction of (sorted, B-tree) index scans and merge-join. While it has not been reported explicitly in the litera- ture. the leaves and entries of two hash indices can be merge-joinedjust like those of two B-trees, provided the same hash function was used to create the indices. For example, it is easy to imagine “merg- ing” the leaves (data pages) of two ex- tendible hash indices [Fagin et al. 1979], even if the key cardinalities and distribu- tions are verv different. In summa~y, there exist many duali- ties between sorting using multilevel merging and recursive hash table over- flow management. Two special cases ex- ist which favor one or the other, however. First, if two join inputs are of different size (and the query optimizer can reli- ably predict this difference), hybrid hash join outperforms merge-join because only the smaller of the two inputs determines what fraction of the input files has to be written to temporary disk files during partitioning (or how often each record has to be written to disk during recursive partitioning), while each file determines its own disk 1/0 in sorting [Bratberg- sengen 1984]. For example, sorting the larger of two join inputs using multiple merge levels is more expensive than writing a small fraction of that file to hash overflow files. This performance ad- vantage of hashing grows with the rela- tive size difference of the two inputs, not with their absolute sizes or with the memory size. Second, if the hash function is very poor, e.g., because of a prior selection on the ioin attribute or a correlated at- tribu~e, hash partitioning can perform very poorly and create significantly higher costs than sorting and merge-join. If the quality of the hash function cannot be predicted or improved (tuned) dynam- ically [Graefe 1993a], sort-based query- processing algorithms are superior be- cause they are less vulnerable to nonuni- form data distributions. Since both cases, join of differently sized files and skewed hash value distributions, are realistic sit- uations in database query processing, we recommend that both sort- and hash- based algorithms be included in a query- processing engine and chosen by the query optimizer according to the two cases above. If both cases arise simulta- neously, i.e., a join of differently sized inputs with unpredictable hash value distribution, the query optimizer has to estimate which one poses the greater danger to system performance and choose accordingly. The important conclusion from these dualities is that neither the absolute in- put sizes nor the absolute memory size nor the input sizes relative to the mem- ory size determine the choice between sort- and hash-based query-processing algorithms. Instead, the choice should be governed by the sizes of the two inputs into binary operators relative to each other and by the danger of performance impairments due to skewed data or hash value distributions. 
Furthermore, because neither algorithm type outperforms the other in all situations, both should be available in a query execution engine for a choice to be made in each case by the query optimizer.

8. EXECUTION OF COMPLEX QUERY PLANS

When multiple operators such as aggregations and joins execute concurrently in a pipelined execution engine, physical resources such as memory and disk bandwidth must be shared by all operators. Thus, optimal scheduling of multiple operators and the division and allocation of resources in a complex plan are important issues.
In earlier relational execution engines, these issues were largely ignored for two reasons. First, only left-deep trees were used for query execution, i.e., the right (inner) input of a binary operator had to be a scan. In other words, concurrent execution of multiple subplans in a single query was not possible. Second, under the assumption that sorting was needed at each step, and considering that sorting for nontrivial file sizes requires that the entire input be written to temporary files at least once, concurrency and the need for resource allocation were basically absent. Today's query execution engines consider more join algorithms that permit extensive pipelining, e.g., hybrid hash join, and more complex query plans, including bushy trees. Moreover, today's systems support more concurrent users and use parallel-processing capabilities. Thus, resource allocation for complex queries is of increasing importance for database query processing.

Some researchers have considered resource contention among multiple query-processing operators with a focus on buffer management. The goal in these efforts was to assign disk pages to buffer slots such that the benefit of each buffer slot would be maximized, i.e., the number of I/O operations avoided in the future. Sacco and Schkolnick [1982; 1986] analyzed several database algorithms and found that their cost functions exhibit steps when plotted over available buffer space, and they suggested that buffer space should be allocated at the low end of a step for the least buffer use at a given cost. Chou [1985] and Chou and DeWitt [1985] took this idea further by combining it with separate page replacement algorithms for each relation or scan, following observations by Stonebraker [1981] on operating system support for database systems, and with load control, calling the resulting algorithm DBMIN. Faloutsos et al. [1991] and Ng et al. [1991] generalized this goal and used the classic economic concepts of decreasing marginal gain and balanced marginal gains for maximal overall gain. Their measure of gain was the reduction in the number of page faults. Zeller and Gray [1990] designed a hash join algorithm that adapts to the current memory and buffer contention each time a new hash table is built. Most recently, Brown et al. [1992] have considered resource allocation tradeoffs among short transactions and complex queries.

Schneider [1990] and Schneider and DeWitt [1990] were the first to systematically examine execution schedules and costs for right-deep trees, i.e., query evaluation plans with multiple binary hash joins for which all build phases proceed concurrently or at least could proceed concurrently (notice that in a left-deep plan, each build phase receives its data from the probe phase of the previous join, limiting left-deep plans to two concurrent joins in different phases). Among the most interesting findings is that, through effective use of bit vector filtering (discussed later), memory requirements for right-deep plans might actually be comparable to those of left-deep plans [Schneider 1991]. This work has recently been extended by Chen et al. [1992] to bushy plans interpreted and executed as multiple right-deep subplans.

For binary matching iterators to be used in bushy plans, we have identified several concerns. First, some query-processing algorithms include a point at which all data are in temporary files on disk and at which no intermediate result data reside in memory.
Such "stop" points can be used to switch efficiently between different subplans. For example, if two subplans produce and sort two merge-join inputs, stopping work on the first subplan and switching to the second one should be done when the first sort operator has all its data in sorted runs and when only the final merge is left but no output has been produced yet. Figure 22 illustrates this point in time. Fortunately, this timing can be realized naturally in the iterator implementation of sorting if input runs for the final merge are opened in the first call of the next procedure, not at the end of the open phase. A similar stop point is available in hash join when using overflow avoidance.

Figure 22. The stop point during sorting.
Second, since hybrid hashing produces some output data before the memory contents (output buffers and hash table) can be discarded, and since, therefore, such a stop point does not occur in hybrid hash join, implementations of hybrid hash join and other binary match operations should be parameterized to permit overflow avoidance as a run-time option to be chosen by the query optimizer. This dynamic choice will permit the query optimizer to force a stop point in some operators while using hybrid hash in most operations.

Third, binary-operator implementations should include a switch that controls which subplan is initiated first. In Table 1 with algorithm outlines for iterators' open, next, and close procedures, the hash join open procedure executes the entire build-input plan first before opening the probe input. However, there might be situations in which it would be better to open the probe input before executing the build input. If the probe input does not hold any resources such as memory between open and next calls, initiating the probe input first is not a problem. However, there are situations in which it creates a big benefit, in particular in bushy query evaluation plans and in parallel systems to be discussed later.

Fourth, if multiple operators are active concurrently, memory has to be divided among them. If two sorts produce input data for a merge-join, which in turn passes its output into another sort using quicksort, memory should be divided proportionally to the sizes of the three files involved. We believe that for multiple sorts producing data for multiple merge-joins on the same attribute, proportional memory division will also work best. If a sort in its run generation phase shares resources with other operations, e.g., a sort following two sorts in their final merges and a merge-join, it should also use resources proportional to its input size. For example, if two merge-join inputs are of the same size and if the merge-join output, which is sorted immediately following the merge-join, is as large as the two inputs together, the two final merges should each use one quarter of memory while the run generation (quicksort) should use one half of memory.
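The quarter/quarter/half example follows directly from proportional division over the three file sizes involved; the memory size below is an arbitrary toy value.

    def divide_memory(memory, file_sizes):
        """Divide memory proportionally to the file each operator processes."""
        total = sum(file_sizes)
        return [memory * s / total for s in file_sizes]

    # Two equal merge-join inputs; the join output, sorted immediately
    # afterwards, is as large as both inputs together.
    print(divide_memory(64, [100, 100, 200]))   # [16.0, 16.0, 32.0]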
Fifth, in recursive hybrid hash join, the recursion levels should be executed level by level. In the most straightforward recursive algorithm, recursive invocation of the original algorithm for each output partition results in depth-first partitioning, and the algorithm produces output as soon as the first leaf in the recursion tree is reached. However, if the operator that consumes the output requires memory as soon as it receives input, for example, hybrid hash join (ii) in Figure 23 as soon as hybrid hash join (i) produces output, the remaining partitioning operations in the producer operator (hybrid hash join (i)) must share memory with the consumer operator (hybrid hash join (ii)), effectively cutting the partitioning fan-out in the producer in half. Thus, hash-based recursive matching algorithms should proceed in three distinct phases—consuming input and initial partitioning, partitioning into files suitable for hybrid hash join, and final hybrid hash join for all partitions—with phase two completed entirely before phase three commences. This sequence of partitioning steps was introduced as breadth-first partitioning in the previous section, as opposed to the depth-first partitioning used in the most straightforward recursive algorithms. Of course, the topmost operator in a query evaluation plan does not have a consumer operator with which it shares resources; therefore, this operator should use depth-first partitioning in order to provide a better response time, i.e., earlier delivery of the first data item.
Figure 23. Plan for joining three inputs.

Sixth, the allocation of resources other than memory, e.g., disk bandwidth and disk arms for seeking in partitioning and merging, is an open issue that should be addressed soon, because the differing improvement rates in CPU and disk speeds will increase the importance of disk performance for overall query-processing performance. One possible alleviation of this problem might come from disk arrays configured exclusively for performance, not for reliability. Disk arrays might not deliver the entire performance gain their large number of disk drives could provide if it is not possible to disable a disk array's parity mechanisms and to access specific disks within an array, particularly during partitioning and merging.

Finally, scheduling bushy trees in multiprocessor systems is not entirely understood yet. While all considerations discussed above apply in principle, multiprocessors permit truly concurrent execution of multiple subplans in a bushy tree. However, it is a very hard problem to schedule two or more subplans such that their result streams are available at the right times and at the right rates, in particular in light of the unavoidable errors in selectivity and cost estimation during query optimization [Christodoulakis 1984; Ioannidis and Christodoulakis 1991].

The last point, estimation errors, leads us to suspect that plans with 30 (or even 100) joins or other operations cannot be optimized completely before execution. Thus, we suspect that a technique reminiscent of Ingres Decomposition [Wong and Youssefi 1976; Youssefi and Wong 1979] will prove to be more effective. One of the principal ideas of Ingres Decomposition is a repetitive cycle consisting of three steps. First, the next step is selected, e.g., a selection or join. Second, the chosen step is executed into a temporary table. Third, the query is simplified by removing predicates evaluated in the completed execution step and replacing one range variable (relation) in the query with the new temporary table. The justification and advantage of this approach are that all earlier selectivities are known for each decision, because the intermediate results are materialized. The disadvantage is that data flow between operators cannot be exploited, resulting in a significant cost for writing and reading intermediate files. For very complex queries, we suggest modifying Decomposition to decide on and execute multiple steps in each cycle, e.g., 3 to 9 joins, instead of executing only one selection or join as in Ingres. Such a hybrid approach might very well combine the advantages of a priori optimization, namely in-memory data flow between iterators, and of optimization with exactly known intermediate result sizes.

An optimization and execution environment even further tuned for very complex queries would anticipate possible outcomes of executing subplans and provide multiple alternative subsequent plans. Figure 24 shows the structure of such a dynamic plan for a complex query. First, subplan A is executed, and statistics about its result are gathered while it is saved on disk. Depending on these statistics, either B or C is executed next. If B is chosen and executed, one of D, E, and F will complete the query; in the case of C instead of B, it will be G or H.
Notice that each letter A-H can be an arbitrarily complex subplan, although probably not more than 10 operations, due to the limitations of current selectivity estimation methods. Unfortunately, realization of such sophisticated query optimizers will require further research, e.g., into determining when separate cases are warranted and into limiting the possibly exponential growth in the number of subplans.
Figure 24. A decision tree of partial plans.

9. MECHANISMS FOR PARALLEL QUERY EXECUTION

Considering that all high-performance computers today employ some form of parallelism in their processing hardware, it seems obvious that software written to manage large data volumes ought to be able to exploit parallel execution capabilities [DeWitt and Gray 1992]. In fact, we believe that five years from now it will be argued that a database management system without parallel query execution will be as handicapped in the marketplace as one without indices.

The goal of parallel algorithms and systems is to obtain speedup and scaleup, and speedup results are frequently used to demonstrate the accomplishments of a design and its implementation. Speedup considers additional hardware resources for a constant problem size; linear speedup is considered optimal. In other words, N times as many resources should solve a constant-size problem in 1/N of the time. Speedup can also be expressed as parallel efficiency, i.e., a measure of how close a system comes to linear speedup. For example, if solving a problem takes 1,200 seconds on a single machine and 100 seconds on 16 machines, the speedup is somewhat less than linear. The parallel efficiency is (1 x 1200)/(16 x 100) = 75%.

An alternative measure for a parallel system's design and implementation is scaleup, in which the problem size is altered with the resources. Linear scaleup is achieved when N times as many resources can solve a problem with N times as much data in the same amount of time. Scaleup can also be expressed using parallel efficiency, but since speedup and scaleup are different, it should always be clearly indicated which parallel efficiency measure is being reported.

A third measure for the success of a parallel algorithm, based on Amdahl's law, is the fraction f of the sequential program for which linear speedup was attained, defined by p = f x s/d + (1 - f) x s for sequential execution time s, parallel execution time p, and degree of parallelism d. Resolved for f, this is f = (s - p)/(s - s/d) = ((s - p)/s)/((d - 1)/d). For the example above, this fraction is f = ((1200 - 100)/1200)/((16 - 1)/16) = 97.78%. Notice that this measure gives much higher percentage values than the parallel efficiency calculated earlier; therefore, the two measures should not be confused.

For query-processing problems involving sorting or hashing in which multiple merge or partitioning levels are expected, the speedup can frequently be more than linear, or superlinear. Consider a sorting problem that requires two merge levels on a single machine. If multiple machines are used, the sort problem can be partitioned such that each machine sorts a fraction of the entire data amount. Such partitioning will, in a good implementation, result in linear speedup. If, in addition, each machine has its own memory such that the total memory in the system grows with the size of the machine, fewer than two merge levels will suffice, making the speedup superlinear.
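Both measures can be restated in a few lines, using the same numbers as the running example:

    def parallel_efficiency(s, p, d):
        return (s / p) / d              # achieved speedup over linear speedup

    def amdahl_fraction(s, p, d):
        # From p = f*s/d + (1 - f)*s, resolved for f.
        return (s - p) / (s - s / d)

    s, p, d = 1200, 100, 16             # seconds, seconds, machines
    print(parallel_efficiency(s, p, d)) # 0.75
    print(amdahl_fraction(s, p, d))     # 0.9777...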
9.1 Parallel versus Distributed Database Systems

It might be useful to start the discussion of parallel and distributed query processing with a distinction between the two concepts. In the database literature, "distributed" usually implies "locally autonomous," i.e., each participating system is a complete database management system in itself, with access control, metadata (catalogs), query processing, etc. In other words, each node in a distributed database management system can function entirely on its own, whether or not the other nodes are present or accessible.
Each node performs its own access control, and cooperation of each node in a distributed transaction is voluntary. Examples of distributed (research) systems are R* [Haas et al. 1982; Traiger et al. 1982], distributed Ingres [Epstein and Stonebraker 1980; Stonebraker 1986a], and SDD-1 [Bernstein et al. 1981; Rothnie et al. 1980]. There are now several commercial distributed relational database management systems. Ozsu and Valduriez [1991a; 1991b] have discussed distributed database systems in much more detail. If the cooperation among multiple database systems is only limited, the system can be called a "federated" database system [Sheth and Larson 1990].

In parallel systems, on the other hand, there is only one locus of control. In other words, there is only one database management system that divides individual queries into fragments and executes the fragments in parallel. Access control to data is independent of where data objects currently reside in the system. The query optimizer and the query execution engine typically assume that all nodes in the system are available to participate in efficient execution of complex queries, and participation of nodes in a given transaction is either presumed or controlled by a global resource manager, but is not based on voluntary cooperation as in distributed systems. There are several parallel research prototypes, e.g., Gamma [DeWitt et al. 1986; DeWitt et al. 1990], Bubba [Boral 1988; Boral et al. 1990], Grace [Fushimi et al. 1986; Kitsuregawa et al. 1983], and Volcano [Graefe 1990b; 1993b; Graefe and Davison 1993], and products, e.g., Tandem's NonStop SQL [Englert et al. 1989; Zeller 1990], Teradata's DBC/1012 [Neches 1984; 1988; Teradata 1983], and Informix [Davison 1992].

Both distributed database systems and parallel database systems have been designed in various kinds, which may create some confusion. Distributed systems can be either homogeneous, meaning that all participating database management systems are of the same type (the hardware and the operating system may even be of the same types), or heterogeneous, meaning that multiple database management systems work together using standardized interfaces but are internally different.

[Footnote 16: In some organizations, two different database management systems may run on the same (fairly large) computer. Their interactions could be called "nondistributed heterogeneous." However, since the rules governing such interactions are the same as for distributed heterogeneous systems, the case is usually ignored in research and system design.]

Furthermore, distributed systems may employ parallelism, e.g., by pipelining datasets between nodes with the receiver already working on some items while the producer is still sending more. Parallel systems can be based on shared-memory (also called shared-everything), shared-disk (multiple processors sharing disks but not memory), distributed-memory (without sharing disks, also called shared-nothing), or hierarchical computer architectures consisting of multiple clusters, each with multiple CPUs and disks and a large shared memory. Stonebraker [1986b] compared the first three alternatives using several aspects of database management and came to the conclusion that distributed memory is the most promising database management system platform. Each of these approaches has advantages and disadvantages; our belief is that the hierarchical architecture is the most general of these architectures and should be the target architecture for new database software development [Graefe and Davison 1993].

9.2 Forms of Parallelism

There are several forms of parallelism that are interesting to designers and implementors of query-processing systems.
Interquery parallelism is a direct result of the fact that most database management systems can service multiple requests concurrently. In other words, multiple queries (transactions) can be executing concurrently within a single database management system. In this form of parallelism, resource contention is of great concern, in particular contention for memory and disk arms.
The other forms of parallelism are all based on the use of algebraic operations on sets for database query processing, e.g., selection, join, and intersection. The theory and practice of exploiting other "bulk" types such as lists for parallel database query execution are only now developing. Interoperator parallelism is basically pipelining, or parallel execution of different operators in a single query. For example, the iterator concept discussed earlier has also been called "synchronous pipelines" [Pirahesh et al. 1990]; there is no reason not to consider asynchronous pipelines, in which operators work independently, connected by a buffering mechanism to provide flow control.

Interoperator parallelism can be used in two forms: either to execute producers and consumers in pipelines, called vertical interoperator parallelism here, or to execute independent subtrees in a complex bushy query evaluation plan concurrently, called horizontal interoperator or bushy parallelism here. A simple example of bushy parallelism is a merge-join receiving its input data from two sort processes. The main problem with bushy parallelism is that it is hard or impossible to ensure that the two subplans start generating data at the right time and generate them at the right rates. Note that the right time does not necessarily mean the same time, e.g., for the two inputs of a hash join, and that the right rates are not necessarily equal, e.g., if two inputs of a merge-join have different sizes. Therefore, bushy parallelism presents too many open research issues and is hardly used in practice at this time.

The final form of parallelism in database query processing is intraoperator parallelism, in which a single operator in a query plan is executed in multiple processes, typically on disjoint pieces of the problem and disjoint subsets of the data. This form, also called parallelism based on fragmentation or partitioning, is enabled by the fact that query processing focuses on sets. If the underlying data represented sequences or time series in a scientific database management system, partitioning into subsets to be operated on independently would not be feasible or would require additional synchronization when putting the independently obtained results together.

Both vertical interoperator parallelism and intraoperator parallelism are used in database query processing to obtain higher performance. Beyond the obvious opportunities for speedup and scaleup that these two concepts offer, they both have significant problems. Pipelining does not easily lend itself to load balancing, because each process or processor in the pipeline is loaded proportionally to the amount of data it has to process. This amount cannot be chosen by the implementor or the query optimizer and cannot be predicted very well. For intraoperator, partitioning-based parallelism, load balance and performance are optimal if the partitions are all of equal size; however, this can be hard to achieve if value distributions in the inputs are skewed.

9.3 Implementation Strategies

The purpose of the query execution engine is to provide mechanisms for query execution from which the query optimizer can choose—the same applies for the means and mechanisms for parallel execution.
There are two general approaches to parallelizing a query execution engine, which we call the bracket and operator models and which are used, for example, in the Gamma and Volcano systems, respectively.

In the bracket model, there is a generic process template that can receive and send data and can execute exactly one operator at any point of time. A schematic diagram of a template process is shown in Figure 25, together with two possible operators, join and aggregation. In order to execute a specific operator, e.g., a join, the code that makes up the generic template "loads" the operator into its place (by switching to this operator's code) and initiates the operator, which then controls execution; network I/O on the receiving and sending sides is performed as a service to the operator on its request and initiation and is implemented as procedures to be called by the operator.
Figure 25. Bracket model of parallelization.

The number of inputs that can be active at any point of time is limited to two, since there are only unary and binary operators in most database systems. The operator is surrounded by generic template code, which shields it from its environment, for example, the operator(s) that produce its input and consume its output. For parallel query execution, many templates are executed concurrently in the system, using one process per template. Because each operator is written with the implicit assumption that this operator controls all activities in its process, it is not possible to execute two operators in one process without resorting to some thread or coroutine facility, i.e., a second implementation level of the process concept.

In a query-processing system using the bracket model, operators are coded in such a way that network I/O is their only means of obtaining input and delivering output (with the exception of scan and store operators). The reason is that each operator is its own locus of control, and network flow control must be used to coordinate multiple operators, e.g., to match two operators' speeds in a producer-consumer relationship. Unfortunately, this coordination requirement also implies that passing a data item from one operator to another always involves expensive interprocess communication system calls, even in the cases when an entire query is evaluated on a single CPU (and could therefore be evaluated in a single process, without interprocess communication and operating system involvement) or when data do not need to be repartitioned among nodes in a network. An example of the latter is the query "joinCselAselB" in the Wisconsin Benchmark, which requires joining three inputs on the same attribute [DeWitt 1991], or any other query that permits interesting orderings [Selinger et al. 1979], i.e., any query that uses the same join attribute for multiple binary joins. Thus, in queries with multiple operators (meaning almost all queries), interprocess communication and its overhead are mandatory in the bracket model rather than optional.

An alternative to the bracket model is the operator model. Figure 26 shows a possible parallelization of a join plan using the operator model, i.e., by inserting "parallelism" operators into a sequential plan, called exchange operators in the Volcano system [Graefe 1990b; Graefe and Davison 1993]. The exchange operator is an iterator like all other operators in the system, with open, next, and close procedures; therefore, the other operators are entirely unaffected by the presence of exchange operators in a query evaluation plan. The exchange operator does not contribute to data manipulation; thus, on the logical level, it is a "no-op" that has no place in a logical query algebra such as the relational algebra. On the physical level of algorithms and processes, however, it provides control not provided by any of the normal operators, i.e., process management, data redistribution, and flow control. Therefore, it is a control operator or a meta-operator.
Separation of data manipulation from process control and interprocess communication can be considered an important advantage of the operator model of parallel query processing, because it permits the design, implementation, and execution of new data manipulation algorithms such as N-ary hybrid hash join [Graefe 1993a] without regard to the execution environment.
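A minimal sketch of such an exchange iterator, assuming the open/next/close iterator protocol described earlier: here a thread and a bounded queue stand in for processes and ports, so this illustrates the principle rather than Volcano's actual implementation.

    import queue, threading

    END = object()   # end-of-stream marker

    class Scan:
        """A trivial leaf iterator over an in-memory list."""
        def __init__(self, rows): self.rows = rows
        def open(self): self.it = iter(self.rows)
        def next(self): return next(self.it, None)   # None signals end
        def close(self): pass

    class Exchange:
        """Logically a no-op: forwards its input unchanged. Physically it
        moves production into another thread; the bounded queue provides
        flow control between producer and consumer."""
        def __init__(self, input_iterator, capacity=32):
            self.input = input_iterator
            self.q = queue.Queue(maxsize=capacity)

        def open(self):
            self.input.open()
            def produce():
                while (item := self.input.next()) is not None:
                    self.q.put(item)         # blocks when the consumer lags
                self.q.put(END)
            threading.Thread(target=produce, daemon=True).start()

        def next(self):
            item = self.q.get()
            return None if item is END else item

        def close(self):
            self.input.close()

    plan = Exchange(Scan([1, 2, 3]))   # other operators remain unaffected
    plan.open()
    while (row := plan.next()) is not None:
        print(row)
    plan.close()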
Figure 26. Operator model of parallelization.

A second issue important to point out is that the exchange operator only provides mechanisms for parallel query processing; it does not determine or presuppose policies for using its mechanisms. Policies for parallel processing, such as the degree of parallelism, partitioning functions, and allocation of processes to processors, can be set either by a query optimizer or by a human experimenter in the Volcano system, as they are still subject to intense research. The design of the exchange operator permits execution of a complex query in a single process (by using a query plan without any exchange operators, which is useful in single-processor environments) or with a number of processes by using one or more exchange operators in the query evaluation plan. The mapping of a sequential plan to a parallel plan by inserting exchange operators permits one process per operator as well as multiple processes for one operator (using data partitioning) or multiple operators per process, which is useful for executing a complex query plan with a moderate number of processes. Earlier parallel query execution engines did not provide this degree of flexibility; the bracket model used in the Gamma design, for example, requires a separate process for each operator [DeWitt et al. 1986].

Figure 27 shows the processes created by the exchange operators in the previous figure, with each circle representing a process. Note that this set of processes is only one possible parallelization, which makes sense if the joins are on the same join attributes. Furthermore, the degrees of data parallelism, i.e., the number of processes in each process group, can be controlled using an argument to the exchange operator.

Figure 27. Processes created by exchange operators.

There is no reason to assume that the two models differ significantly in their performance if implemented with similar care. Both models can be implemented with a minimum of control overhead and can be combined with any partitioning scheme for load balancing. The only difference with respect to performance is that the operator model permits multiple data manipulation operators such as join in a single process, i.e., operator synchronization and data transfer between operators with a single procedure call, without operating system involvement. The important advantages of the operator model are that it permits easy parallelization of an existing sequential system as well as development and maintenance of operators and algorithms in a familiar and relatively simple single-process environment [Graefe and Davison 1993].
The bracket and operator models both provide pipelining and partitioning as part of pipelined data transfer between process groups. For most algebraic operators used in database query processing, these two forms of parallelism are sufficient. However, not all operations can be easily supported by these two models. For example, in a transitive closure operator, newly inferred data is equal to input data in its importance and role for creating further data. Thus, to parallelize a single transitive closure operator, the newly created data must also be partitioned like the input data. Neither bracket nor operator model immediately allow for this need. Hence, for transitive closure operators, intraoperator parallelism based on partitioning requires that the processes exchange data among themselves outside of the stream paradigm.

The transitive closure operator is not the only operation for which this restriction holds. Other examples include the complex object assembly operator described by Keller et al. [1991] and operators for numerical optimizations as might be used in scientific databases. Both models, the bracket model and the operator model, could be extended to provide a general and efficient solution to intraoperator data exchange for intraoperator parallelism.

9.4 Load Balancing and Skew

For optimal speedup and scaleup, pieces of the processing load must be assigned carefully to individual processors and disks to ensure equal completion times for all pieces. In interoperator parallelism, operators must be grouped to ensure that no one processor becomes the bottleneck for an entire pipeline. Balanced processing loads are very hard to achieve because intermediate set sizes cannot be anticipated with accuracy and certainty in database query optimization. Thus, no existing or proposed query-processing engine relies solely on interoperator parallelism. In intraoperator parallelism, data sets must be partitioned such that the processing load is nearly equal for each processor. Notice that in particular for binary operations such as join, equal processing loads can be different from equal-sized partitions.

There are several research efforts developing techniques to avoid skew or to limit the effects of skew in parallel query processing, e.g., Baru and Frieder [1989], DeWitt et al. [1991b], Hua and Lee [1991], Kitsuregawa and Ogawa [1990], Lakshmi and Yu [1988; 1990], Omiecinski [1991], Seshadri and Naughton [1992], Walton [1989], Walton et al. [1991], and Wolf et al. [1990; 1991]. However, all of these methods have their drawbacks, for example, additional requirements for local processing to determine quantiles.

Skew management methods can be divided into basically two groups. First, skew avoidance methods rely on determining suitable partitioning rules before data is exchanged between processing nodes or processes. For range partitioning, quantiles can be determined or estimated from sampling the data set to be partitioned, from catalog data, e.g., histograms, or from a preprocessing step. Histograms kept on permanent base data have only limited use for intermediate query processing results, in particular, if the partitioning attribute or a correlated attribute has been used in a prior selection or matching operation. However, for stored data they may be very beneficial.
Sampling implies that the entire population is available for sampling, because the first memory load of an intermediate result may be a very poor sample for partitioning decisions. Thus, sampling might imply that the data flow between operators be halted and an entire intermediate result be materialized on disk to ensure proper random sampling and subsequent partitioning. However, if such a halt is required anyway for processing a large set, it can be used for both purposes. For example, while creating and writing initial run files without partitioning in a parallel sort, quantiles can be determined or estimated and used in a combined partitioning and merging step.
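A minimal sketch of such quantile estimation, assuming a random sample of keys is available (e.g., collected while writing initial run files); the function names and the Gaussian test distribution are illustrative only.

    import random

    def estimate_split_keys(sample, num_partitions):
        """Approximate quantiles from a sample: P-1 split keys for P ranges."""
        ordered = sorted(sample)
        return [ordered[len(ordered) * i // num_partitions]
                for i in range(1, num_partitions)]

    def assign_partition(key, split_keys):
        """Linear scan for clarity; a real system would use binary search."""
        for i, split in enumerate(split_keys):
            if key < split:
                return i
        return len(split_keys)

    keys = [random.gauss(500, 150) for _ in range(100_000)]
    splits = estimate_split_keys(random.sample(keys, 1_000), num_partitions=4)
    sizes = [0] * 4
    for k in keys:
        sizes[assign_partition(k, splits)] += 1
    print(sizes)   # the four range partitions should be roughly equal in size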
Second, skew resolution repartitions some or all of the data if an initial partitioning has resulted in skewed loads. Repartitioning is relatively easy in shared-memory machines, but can also be done in distributed-memory architectures, albeit at the expense of more network activity. Skew resolution can be based on rehashing in hash partitioning or on quantile adjustment in range partitioning. Since hash partitioning tends to create fairly even loads and since network bandwidth will increase in the near future within distributed-memory machines as well as in local- and wide-area networks, skew resolution is a reasonable method for cases in which a prior processing step cannot be exploited to gather the information necessary for skew avoidance as in the sort example above.

In their recent research into sampling for load balancing, DeWitt et al. [1991b] and Seshadri and Naughton [1992] have shown that stratified random sampling can be used, i.e., samples are selected randomly not from the entire distributed data set but from each local data set at each site, and that even small sets of samples ensure reasonably balanced loads. Their definition of skew is the quotient of sizes of the largest partition and the average partition, i.e., the sum of sizes of all partitions divided by the degree of parallelism. In other words, a skew of 1.0 indicates a perfectly even distribution.

Figure 28. Skew limit, confidence, and sample size per partition (sample size per partition vs. skew limit; solid curves for 99% confidence, dotted curves for 95% confidence; separate curves for 1,024 and for 2 partitions).

Figure 28 shows the required sample sizes per partition for various skew limits, degrees of parallelism, and confidence levels. For example, to ensure a maximal skew of 1.5 among 1,000 partitions with 95% confidence, 110 random samples must be taken at each site. Thus, relatively small samples suffice for reasonably safe skew avoidance and load balancing, making precise methods unnecessary. Typically, only tens of samples per partition are needed, not several hundreds of samples at each site.
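The skew definition above can be checked directly in a few lines; the exponential data distribution, the site count, and the 110 samples per site (taken from the example above) are assumptions for illustration.

    import random

    def skew(partition_sizes):
        """Largest partition over average partition; 1.0 is perfectly even."""
        return max(partition_sizes) / (sum(partition_sizes) / len(partition_sizes))

    SITES = 8
    local_data = [[random.expovariate(1.0) for _ in range(50_000)]
                  for _ in range(SITES)]

    # stratified sampling: a fixed number of keys drawn at each site
    sample = [k for site in local_data for k in random.sample(site, 110)]
    ordered = sorted(sample)
    splits = [ordered[len(ordered) * i // SITES] for i in range(1, SITES)]

    sizes = [0] * SITES
    for site in local_data:
        for k in site:
            sizes[sum(1 for s in splits if k >= s)] += 1
    print(round(skew(sizes), 3))   # typically well under a skew limit of 1.5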
For allocation of active processing elements, i.e., CPUs and disks, the bandwidth considerations discussed briefly in the section on sorting can be generalized for parallel processes. In principle, all stages of a pipeline should be sized such that they all have bandwidths proportional to their respective data volumes in order to ensure that no stage in the pipeline becomes a bottleneck and slows the other ones down. The latency almost unavoidable in data transfer between pipeline stages should be hidden by the use of buffer memory equal in size to the product of bandwidth and latency.

9.5 Architectures and Architecture Independence

Many database research projects have investigated hardware architectures for parallelism in database systems. Stonebraker [1986b] compared shared-nothing (distributed-memory), shared-disk (distributed-memory with multiported disks), and shared-everything (shared-memory) architectures for database use based on a number of issues including scalability, communication overhead, locking overhead, and load balancing. His conclusion at that time was that shared-everything excels in none of the points considered; shared-disk introduces too many locking and buffer coherency problems; and shared-nothing has the significant benefit of scalability to very high degrees of parallelism. Therefore, he concluded that overall shared-nothing is the preferable architecture for database system implementation. (Much of this section has been derived from Graefe et al. [1992] and Graefe and Davison [1993].)

Bhide [1988] and Bhide and Stonebraker [1988] compared architectural alternatives for transaction processing and concluded that a shared-everything (shared-memory) design achieves the best performance, up to its scalability limit. To achieve higher performance, reliability, and scalability, Bhide suggested considering shared-nothing (distributed-memory) machines with shared-everything parallel nodes. The same idea is mentioned in equally general terms by Pirahesh et al. [1990] and Boral et al. [1990], but none of these authors elaborate on the idea's generality or potential. Kitsuregawa and Ogawa's [1990] new database machine SDC uses multiple shared-memory nodes (plus custom hardware such as the Omega network and a hardware sorter), although the effect of the hardware design on operators other than join is not evaluated in their article.

Customized parallel hardware was investigated but largely abandoned after Boral and DeWitt's [1983] influential analysis that compared CPU and I/O speeds and their trends. Their analysis concluded that I/O, not processing, is the most likely bottleneck in future high-performance query execution. Subsequently, both Boral and DeWitt embarked on new database machine projects, Bubba and Gamma, that executed customized software on standard processors with local disks [Boral et al. 1990; DeWitt et al. 1990]. For scalability and availability, both projects used distributed-memory hardware with single-CPU nodes and investigated scaling questions for very large configurations.

The XPRS system, on the other hand, has been based on shared memory [Hong and Stonebraker 1991; Stonebraker et al. 1988a; 1988b]. Its designers believe that modern bus architectures can handle up to 2,000 transactions per second, and that shared-memory architectures provide automatic load balancing and faster communication than shared-nothing machines and are equally reliable and available for most errors, i.e., media failures, software, and operator errors [Gray 1990]. However, we believe that attaching 250 disks to a single machine as necessary for 2,000 transactions per second [Stonebraker et al. 1988b] requires significant special hardware, e.g., channels or I/O processors, and it is quite likely that the investment for such hardware can have greater impact on overall system performance if spent on general-purpose CPUs or disks. Without such special hardware, the performance limit for shared-memory machines is probably much lower than 2,000 transactions per second. Furthermore, there already are applications that require larger storage and access capacities.
Richardson et al. [1987] performed an analytical study of parallel join algorithms on multiple shared-memory "clusters" of CPUs. They assumed a group of clusters connected by a global bus, with multiple microprocessors and shared memory in each cluster. Disk drives were attached to the busses within clusters. Their analysis suggested that the best performance is obtained by using only one cluster, i.e., a shared-memory architecture. We contend, however, that their results are due to their parameter settings, in particular small relations (typically 100 pages of 32 KB), slow CPUs (e.g., 5 μsec for a comparison, about 2-5 MIPS), a slow global network (a bus with typically 100 Mbit/sec), and a modest number of CPUs in the entire system (128).
It would be very interesting to see the analysis with larger relations (e.g., 1-10 GB), a faster network, e.g., a modern hypercube or mesh with hardware routing, and consideration of bus load and bus contention in each cluster, which might lead to multiple clusters being the better choice. On the other hand, communication between clusters will remain a significant expense. Wong and Katz [1983] developed the concept of "local sufficiency" that might provide guidance in declustering and replication to reduce data movement between nodes. Other work on declustering and limiting declustering includes Copeland et al. [1988], Fang et al. [1986], Ghandeharizadeh and DeWitt [1990], Hsiao and DeWitt [1990], and Hua and Lee [1990].

Finally, there are several hardware designs that attempt to overcome the shared-memory scaling problem, e.g., the DASH project [Anderson et al. 1988], the Wisconsin Multicube [Goodman and Woest 1988], and the Paradigm project [Cheriton et al. 1991]. However, these designs follow the traditional separation of operating system and application program. They rely on page or cache-line faulting and do not provide typical database concepts such as read-ahead and dataflow. Lacking separation of mechanism and policy in these designs almost makes it imperative to implement dataflow and flow control for database query processing within the query execution engine. At this point, none of these hardware designs has been experimentally tested for database query processing.

New software systems designed to exploit parallel hardware should be able to exploit both the advantages of shared memory, namely efficient communication, synchronization, and load balancing, and of distributed memory, namely scalability to very high degrees of parallelism and reliability and availability through independent failures. Figure 29 shows a general hierarchical architecture, which we believe combines these advantages. The important point is the combination of local busses within shared-memory parallel machines and a global interconnection network among machines. The diagram is only a very general outline of such an architecture; many details are deliberately left out and unspecified. The network could be implemented using a bus such as an ethernet, a ring, a hypercube, a mesh, or a set of point-to-point connections. The local busses may or may not be split into code and data or by address range to obtain less contention and higher bus bandwidth and hence higher scalability limits for the use of shared memory. Design and placement of caches, disk controllers, terminal connections, and local- and wide-area network connections are also left open. Tape drives or other backup devices would be connected to local busses.

Modularity is a very important consideration for such an architecture. For example, it should be possible to replace all CPU boards with upgraded models without having to replace memories or disks. Considering that new components will change communication demands, e.g., faster CPUs might require more local bus bandwidth, it is also important that the allocation of boards to local busses can be changed. For example, it should be easy to reconfigure a machine with 4 x 16 CPUs into one with 8 x 8 CPUs.
Beyond the effect of faster communication and synchronization, this architecture can also have a significant effect on control overhead, load balancing, and resulting response time problems. Investigations in the Bubba project at MCC demonstrated that large degrees of parallelism may reduce performance unless load imbalance and overhead for startup, synchronization, and communication can be kept low [Copeland et al. 1988]. For example, when placing 100 CPUs either in 100 nodes or in 10 nodes of 10 CPUs each, it is much faster to distribute query plans to all CPUs and much easier to achieve reasonably balanced loads in the second case than in the first case. Within each shared-memory parallel node, load imbalance can be dealt with either by compensating allocation of resources, e.g., memory for sorting or hashing, or by relatively efficient reassignment of data to processors.
Figure 29. A hierarchical-memory architecture (shared-memory nodes, each with CPUs, memory, and disks on a local bus, connected by a global interconnection network).

Many of today's parallel machines are built as one of the two extreme cases of this hierarchical design: a distributed-memory machine uses single-CPU nodes, while a shared-memory machine consists of a single node. Software designed for this hierarchical architecture will run on either conventional design as well as a genuinely hierarchical machine and will allow the exploration of tradeoffs in the range of alternatives in between. The most recent version of Volcano's exchange operator is designed for hierarchical memory, demonstrating that the operator model of parallelization also offers architecture- and topology-independent parallel query evaluation [Graefe and Davison 1993]. In other words, the parallelism operator is the only operator that needs to "understand" the underlying architecture, while all data manipulation operators can be implemented without concern for parallelism, data distribution, and flow control.

10. PARALLEL ALGORITHMS

In the previous section, mechanisms for parallelizing a database query execution engine were discussed. In this section, individual algorithms and their special cases for parallel execution are considered in more detail. Parallel database query processing algorithms are typically based on partitioning an input using range or hash partitioning. Either form of partitioning can be combined with sort- and hash-based query processing algorithms; in other words, the choices of partitioning scheme and local algorithm are almost always entirely orthogonal.

When building a parallel system, there is sometimes a question whether it is better to parallelize a slower sequential algorithm with better speedup behavior or a fast sequential algorithm with inferior speedup behavior. The answer to this question depends on the design goal and the planned degree of parallelism. In the few single-user database systems in use, the goal has been to minimize response time; for this goal, a slow algorithm with linear speedup implemented on highly parallel hardware might be the right choice. In multi-user systems, the goal typically is to minimize resource consumption in order to maximize throughput. For this goal, only the best sequential algorithms should be parallelized. For example, Boral and DeWitt [1983] concluded that parallelism is no substitute for effective and efficient indices.
For a new parallel algorithm with impressive speedup behavior, the question of whether or not the underlying sequential algorithm is the most efficient choice should always be considered.

10.1 Parallel Selections and Updates

Since disk I/O is a performance bottleneck in many systems, it is natural to parallelize it. Typically, either asynchronous I/O or one process per participating I/O device is used, be it a disk or an array of disks under a single controller. If a selection attribute is also the partitioning attribute, fewer than all disks will contain selection results, and the number of processes and activated disks can be limited. Notice that parallel selection can be combined very effectively with local indices, i.e., indices covering the data of a single disk or node. In general, it is most efficient to maintain indices close to the stored data sets, i.e., on the same node in a parallel database system.

For updates of partitioning attributes in a partitioned data set, items may need to move between disks and sites, just as items may move if a clustering attribute is updated. Thus, updates of partitioning attributes may require setting up data transfers from old to new locations of modified items in order to maintain the consistency of the partitioning. The fact that updating partitioning attributes is more expensive is one reason why immutable (or nearly immutable) identifiers or keys are usually used as partitioning attributes.

10.2 Parallel Sorting

Since sorting is the most expensive operation in many of today's database management systems, much research has been dedicated to parallel sorting [Baugsto and Greipsland 1989; Beck et al. 1988; Bitton and Friedland 1982; Graefe 1990a; Iyer and Dias 1990; Kitsuregawa et al. 1989b; Lorie and Young 1989; Menon 1986; Salzberg et al. 1990]. There are two dimensions along which parallel sorting methods can be classified: the number of their parallel inputs (e.g., scan or subplans executed in parallel) and the number of parallel outputs (consumers) [Graefe 1990a]. As sequential input or output restrict the throughput of parallel sorts, we assume a multiple-input multiple-output parallel sort here, and we further assume that the input items are partitioned randomly with respect to the sort attribute and that the output items should be range-partitioned and sorted within each range.

Considering that data exchange is expensive, both in terms of communication and synchronization delays, each data item should be exchanged only once between processes. Thus, most parallel sort algorithms consist of a local sort and a data exchange step. If the data exchange step is done first, quantiles must be known to ensure load balancing during the local sort step. Such quantiles can be obtained from histograms in the catalogs or by sampling. It is not necessary that the quantiles be precise; a reasonable approximation will suffice.

If the local sort is done first, the final local merging should pass data directly into the data exchange step. On each receiving site, multiple sorted streams must be merged during the data exchange step. One of the possible problems is that all producers of sorted streams first produce low key values, limiting performance by the speed of the first (single!) consumer; then all producers switch to the next consumer, etc. If a different partitioning strategy than range partitioning is used, sorting with subsequent partitioning is not guaranteed to be deadlock free in all situations.
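The following single-process simulation sketches the data-exchange-first strategy: every producer partitions its input by the (estimated) quantiles, and every consumer sorts its range locally. All names are illustrative assumptions; note that this variant needs no merging in the exchange step and is therefore immune to the deadlock discussed next.

    import random

    def parallel_sort(inputs, split_keys):
        """Multiple-input multiple-output sort: exchange first, sort locally."""
        partitions = [[] for _ in range(len(split_keys) + 1)]
        for producer_input in inputs:            # each producer partitions its data
            for key in producer_input:
                partitions[sum(1 for s in split_keys if key >= s)].append(key)
        return [sorted(p) for p in partitions]   # each consumer sorts its range

    inputs = [[random.randrange(1000) for _ in range(100)] for _ in range(3)]
    outputs = parallel_sort(inputs, split_keys=[250, 500, 750])
    flat = [k for part in outputs for k in part]
    assert flat == sorted(flat)   # concatenating the ranges yields a sorted whole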
Deadlock will occur if (1) multiple producers feed multiple consumers, (2) each producer produces a sorted stream, and each consumer merges multiple sorted streams, (3) some key-based partitioning rule is used other than range partitioning, i.e., hash partitioning, (4) flow control is enabled, and (5) the data distribution is particularly unfortunate.
Figure 30 shows a scenario with two producer and two consumer processes, i.e., both the producer operators and the consumer operators are executed with a degree of parallelism of two. The circles in Figure 30 indicate processes, and the arrows indicate data paths. Presume that the left sort produces the stream 1, 3, 5, 7, ..., 999, 1002, 1004, 1006, 1008, ..., 2000 while the right sort produces 2, 4, 6, 8, ..., 1000, 1001, 1003, 1005, 1007, ..., 1999. The merge operations in the consumer processes must receive the first item from each producer process before they can create their first output item and remove additional items from their input buffers. However, the producers will need to produce 500 items each (and insert them into one consumer's input buffer, all 500 for one consumer) before they will send their first item to the other consumer. The data exchange buffer needs to hold 1,000 items at one point of time, 500 on each side of Figure 30. If flow control is enabled and if the exchange buffer (flow control slack) is less than 500 items, deadlock will occur.

The reason deadlock can occur in this situation is that the producer processes need to ship data in the order obtained from their input subplan (the sort in Figure 30) while the consumer processes need to receive data in sorted order as required by the merge. Thus, there are two sides which both require absolute control over the order in which data pass over the process boundary. If the two requirements are incompatible, an unbounded buffer is required to ensure freedom from deadlock.

In order to avoid deadlock, it must be ensured that one of the five conditions listed above is not satisfied. The second condition is the easiest to avoid, and should be focused on. If the receiving processes do not perform a merge, i.e., the individual input streams are not sorted, deadlock cannot occur because the slack given in the flow control must be somewhere, either at some producer or some consumer or several of them, and the process holding the slack can continue to process data, thus preventing deadlock.

Figure 30. Scenario with possible deadlock.

Figure 31. Deadlock-free scenario.

Our recommendation is to avoid the above situation, i.e., to ensure that such query plans are never generated by the optimizer. Consider for which purposes such a query plan would be used. The typical scenario is that multiple processes perform a merge join of two inputs, and each (or at least one) input is sorted by several producer processes. An alternative scenario that avoids the problem is shown in Figure 31. Result data are partitioned and sorted as in the previous scenario. The important difference is that the consumer processes do not merge multiple sorted incoming streams.
One of the conditions for the deadlock problem illustrated in Figure 30 is that there are multiple producers and multiple consumers of a single logical data stream. However, a very similar deadlock situation can occur with single-process producers if the consumer includes an operation that depends on ordering, typically merge-join.

Figure 32. Deadlock danger due to a binary operator in the consumer.

Figure 32 illustrates the problem with a merge-join operation executed in two consumer processes. Notice that the left and right producers in Figure 32 are different inputs of the merge-join, not processes executing the same operators as in Figures 30 and 31. The consumer in Figure 32 is still one operator executed by two processes. Presume that the left sort produces the stream 1, 3, 5, 7, ..., 999, 1002, 1004, 1006, 1008, ..., 2000 while the right sort produces 2, 4, 6, 8, ..., 1000, 1001, 1003, 1005, 1007, ..., 1999. In this case, the merge-join has precisely the same effect as the merging of two parts of one logical data stream in Figure 30. Again, if the data exchange buffer (flow control slack) is too small, deadlock will occur. Similar to the deadlock avoidance tactic in Figure 31, deadlock in Figure 32 can be avoided by placing the sort operations into the consumer processes rather than into the producers. However, there is an additional solution for the scenario in Figure 32, namely, moving only one of the sort operators, not both, into the consumer processes.

If moving a sort operation into the consumer process is not realistic, e.g., because the data already are sorted when they are retrieved from disk as in a B-tree scan, alternative parallel execution strategies must be found that do not require repartitioning and merging of sorted data between the producers and consumers. There are two possible cases. In the first case, if the input data are not only sorted but also already partitioned systematically, i.e., range or hash partitioned, on the attribute(s) considered by the consumer operator, e.g., the by-list of an aggregate function or the join attribute, the process boundary and data exchange could be removed entirely. This implies that the producer operator, e.g., the B-tree scan, and the consumer, e.g., the merge-join, are executed by the same process group and therefore with the same degree of parallelism. In the second case, although sorted on the relevant attribute within each partition, the operator's data could be partitioned either round-robin or on a different attribute. For a join, a fragment-and-replicate matching strategy could be used [Epstein et al. 1978; Epstein and Stonebraker 1980; Lehman et al. 1985], i.e., the join should execute within the same threads as the operator producing sorted output while the second input is replicated to all instances of the join.
Note that fragment-and-replicate methods do not work correctly for semi-join, outer join, difference, and union, i.e., when an item is replicated and is inserted (incorrectly) multiple times into the global output. A second solution that works for all operators, not only joins, is to execute the consumer of the sorted data in a single thread. Recall that multiple consumers are required for a deadlock to occur. A third solution that is correct for all operators is to send dummy items containing the largest key seen so far from a producer to a consumer if no data have been exchanged for a predetermined amount of time (data volume, key range). In the examples above, if a producer must send a key to all consumers at least after every 100 data items processed in the producer, the required buffer space is bounded, and deadlock can be avoided. In some sense, this solution is very simple; however, it requires that not only the data exchange mechanism but also sort-based algorithms such as merge-join must "understand" dummy items. Another solution is to exchange all data without regard to sort order, i.e., to omit merging in the data exchange mechanism, and to sort explicitly after repartitioning is complete. For this sort, replacement selection might be more effective than quicksort for generating initial runs because the runs would probably be much larger than twice the size of memory.

A final remark on deadlock avoidance: Since deadlock can only occur if the consumer process merges, i.e., not only the producer but also the consumer operator try to determine the order in which data cross process boundaries, the deadlock problem only exists in a query execution engine based on sort-based set-processing algorithms. If hash-based algorithms were used for aggregation, duplicate removal, join, semi-join, outer join, intersection, difference, and union, the need for merging and therefore the danger of deadlock would vanish.

An interesting parallel sorting method with balanced communication and without the possibility of deadlock in spite of local sort followed by data exchange (if the data distribution is known a priori) is to sort locally only by the position within the final partition and then exchange data guaranteeing a balanced data flow. This method might be best seen in an example: Consider 10 partitions with key values from 0 to 999 in a uniform distribution. The goal is to have all key values between 0 and 99 sorted on site 0, between 100 and 199 sorted on site 1, etc. First, each partition is sorted locally at its original site, without data exchange, on the last two digits only, ignoring the first digit. Thus, each site has a sequence such as 200, 301, 401, 902, 2, 603, 804, 605, 105, 705, ..., 999, 399. Now each site sends data to its correct final destination. Notice that each site sends data simultaneously to all other sites, creating a balanced data flow among all producers and consumers. While this method seems elegant, its problem is that it requires fairly detailed distribution information to ensure the desired balanced data flow.
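A sketch of this balanced-flow method for the example above (keys 0 to 999 spread over 10 sites): each site sorts locally on the position within the final partition (key mod 100) only, and the sub-stream each sender ships to a given destination is then already sorted, so each receiver merely merges sorted streams. The single-process simulation below is purely illustrative.

    import heapq, random

    SITES = 10
    data = [random.sample(range(1000), 60) for _ in range(SITES)]

    # local sort on the last two digits only, ignoring the destination digit
    runs = [sorted(site, key=lambda k: k % 100) for site in data]

    final = []
    for dest in range(SITES):
        # the sub-stream each sender ships to 'dest' is sorted by k % 100,
        # hence sorted by k within dest's key range [dest*100, dest*100+99]
        streams = [[k for k in run if k // 100 == dest] for run in runs]
        final.append(list(heapq.merge(*streams)))

    assert all(f == sorted(f) for f in final)   # every site ends up sorted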
In shared-memory machines, memory must be divided over all concurrent sort processes. Thus, the more processes are active, the less memory each one can get. The importance of this memory division is the limitation it puts on the size of initial runs and on the fan-in in each merge process. In other words, large degrees of parallelism may impede performance because they increase the number of merge levels.

Figure 33. Merge depth as a function of parallelism (total memory size M = 40; input size R = 125,000).

Figure 33 shows how the number of merge levels grows with increasing degrees of parallelism, i.e., decreasing memory per process and merge fan-in. For input size R, total memory size M, and P parallel processes, the merge depth L is L = log_{M/P-1}((R/P)/(M/P)) = log_{M/P-1}(R/M). The optimal degree of parallelism must be determined considering the tradeoff between parallel processing and large fan-ins, somewhat similar to the tradeoff between fan-in and cluster size. Extending this argument using the duality of sorting and hashing, too much parallelism in hash partitioning on shared-memory machines can also be detrimental, both for aggregation and for binary matching [Hong and Stonebraker 1993].
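Evaluating the merge-depth formula for the parameters of Figure 33 (M = 40 and R = 125,000, e.g., in pages) reproduces the growth of merge levels with the degree of parallelism; rounding up to whole merge levels is an assumption, since partial levels are not meaningful.

    import math

    def merge_depth(R, M, P):
        fan_in = M / P - 1            # each process merges with memory M/P
        runs = (R / P) / (M / P)      # initial runs per process = R/M
        return math.ceil(math.log(runs, fan_in))

    for P in (1, 3, 5, 7, 9, 11, 13):
        print(P, merge_depth(R=125_000, M=40, P=P))
    # the merge depth climbs from 3 at P = 1 past 10 at P = 13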
10.3 Parallel Aggregation and Duplicate Removal

Parallel algorithms for aggregation and duplicate removal are best divided into a local step and a global step. First, duplicates are eliminated locally, and then data are partitioned to detect and remove duplicates from different original sites. For aggregation, local and global aggregate functions may differ. For example, to perform a global count, the local aggregation counts while the global aggregation sums local counts into a global count.

For local hash-based aggregation, a special technique might improve performance. Instead of creating overflow files locally to resolve hash table overflow, items can be moved directly to their final site. Hopefully, this site can aggregate them immediately into the local hash table because a similar item already exists. In many recent distributed-memory machines, it is faster to ship an item to another site than to do a local disk I/O. In fact, some distributed-memory vendors attach disk drives not to the primary processing nodes but to special "I/O nodes" because network delay is negligible compared to I/O time, e.g., in Intel's iPSC/2 and its subsequent parallel architectures. The advantage is that disk I/O is required only when the aggregation output size does not fit into the aggregate memory available on all machines, while the standard local aggregation-exchange-global aggregation scheme requires local disk I/O if any local output size does not fit into a local memory. The difference between the two is determined by the degree to which the original input is already partitioned (usually not at all), making this technique very beneficial.
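A sketch of the local/global scheme for a global COUNT, with hash partitioning as the exchange step; the Counter-based implementation and the use of the row value itself as group key are assumptions for illustration.

    from collections import Counter

    def parallel_count(sites, num_partitions=4):
        """Local counts, hash-partitioned exchange, global SUM of counts."""
        partitions = [Counter() for _ in range(num_partitions)]
        for site_rows in sites:
            local = Counter(site_rows)                 # local aggregation
            for group, cnt in local.items():
                # exchange: ship each (group, count) to the responsible site,
                # where the global aggregate sums the local counts
                partitions[hash(group) % num_partitions][group] += cnt
        total = Counter()
        for p in partitions:
            total.update(p)                            # concatenate local results
        return total

    sites = [["a", "b", "a"], ["b", "b", "c"], ["a", "c"]]
    print(parallel_count(sites))   # counts: a = 3, b = 3, c = 2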
10.4 Parallel Joins and Other Binary Matching Operations

Binary matching operations such as join, semi-join, outer join, intersection, union, and difference are different than the previous operations exactly because they are binary. For bushy parallelism, i.e., a join for which two subplans create the two inputs independently from one another in parallel, we might consider symmetric hash join algorithms. Instead of differentiating between build and probe inputs, the symmetric hash join uses two hash tables, one for each input. When a data item (or packet of items) arrives, the join algorithm first determines which input it came from and then joins the new data item with the hash table built from the other input, as well as inserting the new data item into its hash table such that data items from the other input arriving later can be joined correctly. Such a symmetric hash join algorithm has been used in XPRS, a shared-memory high-performance extensible-relational database system [Hong and Stonebraker 1991; 1993; Stonebraker et al. 1988a; 1988b], as well as in Prisma/DB, a shared-nothing main-memory database system [Wilschut 1993; Wilschut and Apers 1993]. The advantage of symmetric matching algorithms is that they are independent of the data rates of the inputs; their disadvantage is that they require that both inputs fit in memory, although one hash table can be dropped when one input is exhausted.
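A minimal single-process sketch of the symmetric hash join just described; the interface (one arrive call per input item) is an assumption chosen to mimic items arriving from either input in any order.

    from collections import defaultdict

    class SymmetricHashJoin:
        def __init__(self):
            # one hash table per input, keyed on the join attribute
            self.tables = (defaultdict(list), defaultdict(list))

        def arrive(self, side, key, item):
            """Probe the other input's table, then insert into our own."""
            matches = [(item, other) if side == 0 else (other, item)
                       for other in self.tables[1 - side][key]]
            self.tables[side][key].append(item)
            return matches            # result tuples produced by this arrival

    join = SymmetricHashJoin()
    out = []
    out += join.arrive(0, key=1, item="r1")   # no match yet
    out += join.arrive(1, key=1, item="s1")   # matches r1
    out += join.arrive(0, key=1, item="r2")   # matches s1, arrival order free
    print(out)                                # [('r1', 's1'), ('r2', 's1')]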
For parallelizing a single binary matching operation, there are basically two techniques, called here symmetric partitioning and fragment and replicate. In both cases, the global result is the union (concatenation) of all local results. Some algorithms exploit the topology of certain architectures, e.g., ring- or cube-based communication networks [Baru and Frieder 1989; Omiecinski and Lin 1989].

In the symmetric partitioning methods, both inputs are partitioned on the attributes relevant to the operation (i.e., the join attribute for joins or all attributes for set operations), and then the operation is performed at each site. Both the Gamma and the Teradata database machines use this method. Notice that the partitioning method (usually hashed) and the local join method are independent of each other; Gamma and Grace use hash joins while Teradata uses merge-join.

In the fragment-and-replicate methods, one input is partitioned, and the other one is broadcast to all sites. Typically, the larger input is partitioned by not moving it at all, i.e., the existing partitions are processed at their locations prior to the binary matching operation. Fragment-and-replicate methods were considered the join algorithms of choice in early distributed database systems such as R*, SDD-1, and distributed Ingres, because communication costs overshadowed local processing costs and because it was cheaper to send a small input to a small number of sites than to partition both a small and a large input. Note that fragment-and-replicate methods do not work correctly for semi-join, outer join, difference, and union, namely, when an item is replicated and is inserted into the output (incorrectly) multiple times.

A technique for reducing network traffic during join processing in distributed database systems uses redundant semi-joins [Bernstein et al. 1981; Chiu and Ho 1980; Gouda and Dayal 1981], an idea that can also be used in distributed-memory parallel systems. For example, consider the join on a common attribute A of relations R and S stored on two different nodes in a network, say r and s. The semi-join method transfers a duplicate-free projection of R on A to s, performs a semi-join there to determine the items in S that actually participate in the join result, and ships these items to r for the actual join. In other words, based on the relational algebra law that R JOIN S = R JOIN (S SEMIJOIN R), cost savings of not shipping all of S were realized at the expense of projecting and shipping the R.A column and executing the semi-join. Of course, this idea can be used symmetrically to reduce R or S or both, and all operations (projection, duplicate removal, semi-join, and final join) can be executed in parallel on both r and s or on more than two nodes using the parallel join strategies discussed earlier in this section. Furthermore, there are probabilistic variants of this idea that use bit vector filtering instead of semi-joins, discussed later in its own section.

Roussopoulos and Kang [1991] recently showed that symmetric semi-joins are particularly useful. Using the equalities (for a join of relations R and S on attribute A)

    R JOIN S = R JOIN (S SEMIJOIN π_A R)
             = (R SEMIJOIN π_A (S SEMIJOIN π_A R)) JOIN (S SEMIJOIN π_A R)           (a)
             = (R ANTI-SEMIJOIN (π_A R ANTI-SEMIJOIN S)) JOIN (S SEMIJOIN π_A R),    (b)

where ANTI-SEMIJOIN determines those items in the first input that do not have a match in the second input,
they designed a four-step procedure to compute the join of two relations stored at two sites. First, the first relation's join attribute column R.A is sent duplicate free to the other relation's site, s. Second, the first semi-join is computed at s, and either the matching values (term (a) above) or the nonmatching values (term (b) above) of the join column S.A are sent back to the first site, r. The choice between (a) and (b) is made based on the number of matching and nonmatching values of S.A. Third, site r determines which items of R will participate in the join R JOIN S, i.e., R SEMIJOIN S. Fourth, both input sites send exactly those items that will participate in the join R JOIN S to the site that will compute the final result, which may or may not be one of the two input sites. Of course, this two-site algorithm can be used across any number of sites in a parallel query evaluation system.

Typically, each data item is exchanged only once across the interconnection network in a parallel algorithm. However, for parallel systems with small communication overhead, in particular for shared-memory systems, and in parallel processing systems with processors without local disk(s), it may be useful to spread each overflow file over all available nodes and disks in the system. The disadvantage of the scheme may be communication overhead; however, the advantages of load balancing and cumulative bandwidth while reading a partition file have led to the use of this scheme both in the Gamma and SDC database machines, called bucket spreading in the SDC design [DeWitt et al. 1990; Kitsuregawa and Ogawa 1990].

For parallel non-equi-joins, a symmetric fragment-and-replicate method has been proposed by Stamos and Young [1989]. As shown in Figure 34, processors are organized into rows and columns. One input relation is partitioned over rows, and partitions are replicated within each row, while the other input is partitioned and replicated over columns. Each item from one input "meets" each item from the other input at exactly one site, and the global join result is the concatenation of all local joins.

Figure 34. Symmetric fragment-and-replicate join.

Avoiding partitioning as well as broadcasting for many joins can be accomplished with a physical database design that considers frequently performed joins and distributes and replicates data over the nodes of a parallel or distributed system such that many joins already have their input data suitably partitioned. Katz and Wong formalized this notion as local sufficiency [Katz and Wong 1983; Wong and Katz 1983]; more recent research on the issue was performed in the Bubba project [Copeland et al. 1988].

For joins in distributed systems, a third class of algorithms, called fetch-as-needed, was explored. The idea of these algorithms is that one site performs the join by explicitly requesting (fetching) only those items from the other input needed to perform the join [Daniels and Ng 1982; Williams et al. 1982]. If one input is very small, fetching only the necessary items of the larger input might seem advantageous. However, this algorithm is a particularly poor implementation of a semi-join technique discussed above.
Instead of requesting items or values one by one, it seems better to first project all join attribute values, ship (stream) them across the network, perform the semi-join using any local binary matching algorithm, and then stream exactly those items that will be required for the join back to the first site. The difference between the semi-join technique and fetch-as-needed is that the semi-join scans the first input twice, once to extract the join values and once to perform the real join, while fetch-as-needed needs to work on each data item only once.
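A sketch of the semi-join program just described, with the two "sites" simulated inside one function; attribute accessors and relation contents are illustrative assumptions.

    def semijoin_program(R, S, a_of_r, a_of_s):
        # at site r: project the join column, duplicate free, and "ship" it
        r_values = {a_of_r(r) for r in R}
        # at site s: S SEMIJOIN R, i.e., keep only items that will participate
        s_reduced = [s for s in S if a_of_s(s) in r_values]
        # only s_reduced crosses the network back to site r for the final join
        by_value = {}
        for s in s_reduced:
            by_value.setdefault(a_of_s(s), []).append(s)
        return [(r, s) for r in R for s in by_value.get(a_of_r(r), [])]

    R = [("r1", 1), ("r2", 2)]
    S = [("s1", 1), ("s2", 3), ("s3", 1)]
    print(semijoin_program(R, S, a_of_r=lambda t: t[1], a_of_s=lambda t: t[1]))
    # [(('r1', 1), ('s1', 1)), (('r1', 1), ('s3', 1))]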
10.5 Parallel Universal Quantification

In our earlier discussion on sequential universal quantification, we discussed four algorithms for universal quantification or relational division, namely, naive division (a direct, sort-based algorithm), hash-division (direct, hash based), and sort- and hash-based aggregation (indirect) algorithms, which might require semi-joins and duplicate removal in the inputs.

For naive division, pipelining can be used between the two sort operators and the division operator. However, both quotient partitioning and divisor partitioning can be employed as described below for hash-division.

For algorithms based on aggregation, both pipelining and partitioning can be applied immediately using standard techniques for parallel query execution. While partitioning seems to be a promising approach, it has an inherent problem due to the possible need for a semi-join. Recall that in the example for universal quantification using Transcript and Course relations, the join attribute in the semi-join (course-no) is different than the grouping attribute in the subsequent aggregation (student-id). Thus, the Transcript relation has to be partitioned twice, once for the semi-join and once for the aggregation.

For hash-division, pipelining has only limited promise because the entire division is performed within a single operator. However, both partitioning strategies discussed earlier for hash table overflow can be employed for parallel execution, i.e., quotient partitioning and divisor partitioning [Graefe 1989; Graefe and Cole 1993].

For hash-division with quotient partitioning, the divisor table must be replicated in the main memory of all participating processors. After replication, all local hash-division operators work completely independent of each other. Clearly, replication is trivial for shared-memory machines, in particular since a single copy of the divisor table can be shared without synchronization among multiple processes once it is complete.

When using divisor partitioning, the resulting partitions are processed in parallel instead of in phases as discussed for hash table overflow. However, instead of tagging the quotient items with phase numbers, processor network addresses are attached to the data items, and the collection site divides the set of all incoming data items over the set of processor network addresses. In the case that the central collection site is a bottleneck, the collection step can be decentralized using quotient partitioning.
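A sketch of hash-division with quotient partitioning, using the Transcript and Course example mentioned above: the divisor (Course) table is replicated, dividend (Transcript) tuples are hash-partitioned on the quotient attribute (student-id), and each site then decides its quotient candidates independently. Relation contents and names are illustrative assumptions; the bit maps of a real hash-division operator are simplified here to sets.

    def parallel_hash_division(dividend, divisor, num_sites=3):
        divisor_set = set(divisor)              # replicated at every site
        sites = [{} for _ in range(num_sites)]
        for quotient_attr, divisor_attr in dividend:
            # quotient partitioning: route by hash of the quotient attribute
            table = sites[hash(quotient_attr) % num_sites]
            table.setdefault(quotient_attr, set()).add(divisor_attr)
        result = []
        for table in sites:                     # sites work independently
            result += [q for q, seen in table.items() if divisor_set <= seen]
        return result

    transcript = [("jones", "db"), ("jones", "os"), ("smith", "db")]
    print(parallel_hash_division(transcript, divisor=["db", "os"]))  # ['jones']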
11. NONSTANDARD QUERY PROCESSING ALGORITHMS

In this section, we briefly review the query processing needs of data models and database systems for nonstandard applications. In many cases, the logical operators defined for new data models can use existing algorithms, e.g., for intersection. The reason is that for processing, bulk data types such as array, set, bag (multi-set), or list are represented as sequences similar to the streams used in the query processing techniques discussed earlier, and the algorithms to manipulate these bulk types are equal to the ones used for sets of tuples, i.e., relations. However, some algorithms are genuinely different from the algorithms we have surveyed so far. In this section, we review operators for nested relations, temporal and scientific databases, object-oriented databases, and more meta-operators for additional query processing control.

There are several reasons for integrating these operators into an algebraic query-processing system. First, it permits efficient data transfer from the database to the application embodied in these operators.
The interface between database operators is designed to be as efficient as possible; the same efficient interface should also be used for applications. Second, operator implementors can take advantage of the control provided by the meta-operators. For example, an operator for a scientific application can be implemented in a single-process environment and later parallelized with the exchange operator. Third, query optimization based on algebraic transformation rules can cover all operators, including operations that are normally considered database application code. For example, using algebraic optimization tools such as the EXODUS and Volcano optimizer generators [Graefe and DeWitt 1987; Graefe et al. 1992; Graefe and McKenna 1993], optimization rules that can move an unusual database operator in a query plan are easy to implement. For a sampling operator, a rule might permit transforming an algebra expression to query a sample instead of sampling a query result.

11.1 Nested Relations

Nested relations, or Non-First-Normal-Form (NF2) relations, permit relation-valued attributes in addition to atomic values such as integers and strings used in the normal or "flat" relational model. For example, in an order-processing application, the set of individual line items on each order could be represented as a nested relation, i.e., as part of an order tuple. Figure 35 shows an NF2 relation with two tuples with two and three nested tuples and the equivalent normalized relations, which we call the master and detail relations. Nested relations can be used for all one-to-many relationships but are particularly well suited for the representation of "weak entities" in the Entity-Relationship (ER) Model [Chen 1976], i.e., entities whose existence and identification depend on another entity as for order entries in Figure 35. In general, nested subtuples may include relation-valued attributes, with arbitrary nesting depth. The advantages of the NF2 model are that component relationships can be represented more naturally than in the fully normalized model; many frequent join operations can be avoided, and structural information can be used for physical clustering. Its disadvantage is the added complexity, in particular, in storage management and query processing.

Several algebras for nested relations have been defined, e.g., Deshpande and Larson [1991], Ozsoyoglu et al. [1987], Roth et al. [1988], Schek and Scholl [1986], and Tansel and Garnett [1992]. Our discussion here focuses not on the conceptual design of NF2 algebras but on algorithms to manipulate nested relations.

Two operations required in NF2 database systems are operations that transform an NF2 relation into a normalized relation with atomic attributes only, and vice versa. The first operation is frequently called unnest or flatten; the opposite direction is called the nest operation. The unnest operation can be performed in a single scan over the NF2 relation that includes the nested subtuples; both normalized relations in Figure 35 and their join can be derived readily enough from the NF2 relation. The nest operation requires grouping of tuples in the detail relation and a join with the master relation. Grouping and join can be implemented using any of the algorithms for aggregate functions and binary matching discussed earlier, i.e., sort- and hash-based sequential and parallel methods. However, in order to ensure that unnest and nest operations are exact inverses of each other, some structural information might have to be preserved in the unnest operation.
Ozsoyoglu and Wang [1992] present a recent investigation of "keying methods" for this purpose.

All operations defined for flat relations can also be defined for nested relations, in particular: selection, join, and set operations (union, intersection, difference). For selections, additional power is gained with selection conditions on subtuples and sets of subtuples using set comparisons or existential or universal quantification. In principle, since a nested relation is a relation, any relational calculus and algebra expression should be permitted for it.
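A sketch of unnest and nest for the order example of Figure 35 below; dictionary-based tuples and attribute names are assumptions. Nest groups the detail relation on the key and joins the groups back to the master relation, so the two operations invert each other as long as every order has at least one line item.

    def unnest(orders):
        master = [(o["order_no"], o["customer_no"], o["date"]) for o in orders]
        detail = [(o["order_no"], part, count)
                  for o in orders for (part, count) in o["items"]]
        return master, detail

    def nest(master, detail):
        groups = {}                       # grouping step of the nest operation
        for order_no, part, count in detail:
            groups.setdefault(order_no, []).append((part, count))
        return [{"order_no": no, "customer_no": cust, "date": date,
                 "items": groups.get(no, [])}          # join with the master
                for (no, cust, date) in master]

    orders = [{"order_no": 110, "customer_no": 911, "date": 910902,
               "items": [(4711, 8), (2345, 7)]}]
    assert nest(*unnest(orders)) == orders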
    Order-No  Customer-No  Date    Items (Part-No, Count)
    110       911          910902  (4711, 8), (2345, 7)
    112       912          910902  (9876, 3), (2222, 1), (2357, 9)

    Order-No  Part-No  Quantity
    110       4711     8
    110       2345     7
    112       9876     3
    112       2222     1
    112       2357     9

Figure 35. Nested relation and equivalent flat relations.

In the example in Figure 35, there may be a selection of orders in which the ordered quantity of all items is more than 100, which is a universal quantification. The algorithms for selections with quantifier are similar to the ones discussed earlier for flat relations, e.g., relational semi-join and division, but are easier to implement because the grouping process built into the flat-relational algorithms is inherent in the nested tuple structure.

For joins, similar considerations apply. Matching algorithms discussed earlier can be used in principle. They may be more complex if the join predicate involves subrelations, and algorithm combinations may be required that are derived from a flat-relation query over flat relations equivalent to the NF2 query over the nested relations. However, there should be some performance improvements possible if the grouping of values in the nested relations can be exploited, as for example, in the join algorithms described by Rosenthal et al. [1991]. Deshpande and Larson [1992] investigated join algorithms for nested relations because "the purpose of nesting in order to store precomputed joins is defeated if it is unnested every time a join is performed on a subrelation." Their algorithm, (parallel) partitioned nested-hashed-loops, joins one relation's subrelations with a second, flat relation by creating an in-memory hash table with the flat relation. If the flat relation is larger than memory, memory-sized segments are loaded one at a time, and the nested relation is scanned repeatedly. Since an outer tuple of the nested relation might have matches in multiple segments of the flat relation, a final merging pass is required. This join algorithm is reminiscent of hash-division, the flat relation taking the role of the divisor and the nested tuples replacing the quotient table entries with their bit maps.

Sort-based join algorithms for nested relations require either flattening the nested tuples or scanning the sorted flat relation for each nested tuple, somewhat reminiscent of naive division. Neither alternative seems very promising for large inputs.
Sort semantics and appropriate sort algorithms including duplicate removal and grouping have been considered by Saake et al. [1989] and Kuespert et al. [1989]. Other researchers have focused on storage and retrieval methods for nested relations and operations possible with single scans [Dadam et al. 1986; Deppisch et al. 1986; Deshpande and Van Gucht 1988; Hafez and Ozsoyoglu 1988; Ozsoyoglu and Wang 1992; Scholl et al. 1987; Scholl 1988].

11.2 Temporal and Scientific Database Management

For a variety of reasons, management and manipulation of statistical, temporal, and scientific data are gaining interest in the database research community. Most work on temporal databases has focused on semantics and representation in data models and query languages [McKenzie and Snodgrass 1991; Snodgrass 1990]; some work has considered special storage structures, e.g., Ahn and Snodgrass [1988], Lomet and Salzberg [1990b], Rotem and Segev [1987], Severance and Lehman [1976], algebraic operators, e.g., temporal joins [Gunadhi and Segev 1991], and optimization of temporal queries, e.g., Gunadhi and Segev [1990], Leung and Muntz [1990; 1992], Segev and Gunadhi [1989]. While logical query algebras require extensions to accommodate time, only some storage structures and algorithms, e.g., multidimensional indices, differential files, and versioning, and the need for approximate selection and matching (join) predicates are new in the query execution algorithms for temporal databases.

A number of operators can be identified that both add functionality to database systems used to process scientific data and fit into the database query processing paradigm. DeWitt et al. [1991a] considered algorithms for join predicates that express proximity, i.e., join predicates of the form R.A - c1 < S.B < R.A + c2 for some constants c1 and c2. Such join predicates are very different from the usual use of relational join. They do not reestablish relationships based on identifying keys but match data values that express a dimension in which distance can be defined, in particular, time. Traditionally, such join predicates have been considered non-equi-joins and were evaluated by a variant of nested-loops join. However, such "band joins" can be executed much more efficiently by a variant of merge-join that keeps a "window" of inner relation tuples in memory or by a variant of hash join that uses range partitioning and assigns some build tuples to multiple partition files. A similar partitioning model must be used for parallel execution, requiring multi-cast for some tuples. Clearly, these variants of merge-join and hash join will outperform nested loops for large inputs, unless the band is so wide that the join result approaches the Cartesian product.
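A sketch of the merge-join variant for band joins: both inputs sorted on the join column, with a sliding window of inner tuples satisfying r - c1 < s < r + c2. Plain numbers stand in for tuples, and the constants are arbitrary assumptions.

    from collections import deque

    def band_join(R, S, c1, c2):
        R, S = sorted(R), sorted(S)
        window, out, j = deque(), [], 0
        for r in R:
            while j < len(S) and S[j] < r + c2:    # admit inner tuples below r + c2
                window.append(S[j]); j += 1
            while window and window[0] <= r - c1:  # retire tuples at or below r - c1
                window.popleft()
            out += [(r, s) for s in window]        # the whole window is in the band
        return out

    print(band_join([10, 20], [8, 12, 19, 31], c1=3, c2=3))
    # [(10, 8), (10, 12), (20, 19)]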
Interpolation, extrapolation, and digital filtering were implemented in the Volcano system with a single algorithm (physical operator) to verify this fit, including their optimization and parallelization [Graefe and Wolniewicz 1992; Wolniewicz and Graefe 1993]. Another promising candidate is visualization of single-dimensional arrays such as time series.

Problems that do not fit the stream paradigm, e.g., many matrix operations such as transformations used in linear algebra, Laplace or Fast Fourier Transforms, and slab (multidimensional subarray) extraction, are not as easy to integrate into database query processing systems. Some of them seem to fit better into the storage management subsystem than into the algebraic query execution engine. For example, slab extraction has been integrated into the NetCDF storage and access software [Rew and Davis 1990; Unidata 1991]. However, it is interesting to note that sorting is a suitable algorithm for permuting the linear representation of a multidimensional array, e.g., to modify the hierarchy of dimensions in the linearization (row- vs. column-major linearization), as illustrated below. Since the final position of all elements can be predicted from the beginning of the operation, such "sort" algorithms can be based on merging or range partitioning (which is yet another example of the duality of sort- and hash- (partitioning-) based data manipulation algorithms).
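As a concrete illustration of permuting a linearization by sorting on precomputed target positions, here is a minimal sketch; the dense row-major layout and the names are assumptions of this sketch.

    def to_column_major(data, rows, cols):
        # Element at row-major index i sits at row i // cols, column i % cols;
        # its column-major position is therefore col * rows + row.  The keys
        # are known up front, so no data comparisons are needed, and range
        # partitioning or merging could drive the same permutation.
        target = lambda i: (i % cols) * rows + (i // cols)
        return [v for _, v in sorted(enumerate(data), key=lambda p: target(p[0]))]

    # The 2 x 3 array [[1, 2, 3], [4, 5, 6]]:
    print(to_column_major([1, 2, 3, 4, 5, 6], rows=2, cols=3))
    # -> [1, 4, 2, 5, 3, 6]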
11.3 Object-Oriented Database Systems

Research into query processing for extensible and object-oriented systems has been growing rapidly in the last few years. Most proposals or implementations use algebras for query processing, e.g., Albert [1991], Cluet et al. [1989], Graefe and Maier [1988], Guo et al. [1991], Mitschang [1989], Shaw and Zdonik [1989a; 1989b; 1990], Straube and Ozsu [1989], Vandenberg and DeWitt [1991], Yu and Osborn [1991]. These algebras resemble relational algebra in the sense that they focus on bulk data types but are generalized to support operations on arrays, lists, etc., user-defined operations (methods) on instances, heterogeneous bulk types, and inheritance. The use of algebras permits several important conclusions. First, naive execution models that execute programs as if all data were in memory are not the only alternative. Second, data manipulation operators can be designed and implemented that go beyond data retrieval and permit some amount of data reduction, aggregation, and even inference. Third, algebraic execution techniques including the stream paradigm and parallel execution can be used in object-oriented data models and database systems. Fourth, algebraic optimization techniques will continue to be useful.

Associative operations are an important part in all object-oriented algebras because they permit reducing large amounts of data to the interesting subset of the database suitable for further consideration and processing. Thus, set-processing and set-matching algorithms as discussed earlier in this survey will be found in object-oriented systems, implemented in such a way that they can operate on heterogeneous sets. The challenge for query optimization is to map a complex query involving complex behavior and complex object structures to primitives available in a query execution engine. Translating an initial request with abstract data types and encapsulated behavior coded in a computationally complete language into an internal form that both captures the entire query's semantics and allows effective query optimization is still an open research issue [Daniels et al. 1991; Graefe and Maier 1988].

Beyond associative indices discussed earlier, object-oriented systems can also benefit from special relationship indices, i.e., indices that contain condensed information about interobject references. In principle, these index structures are similar to join indices [Valduriez 1987] but can be generalized to support multiple levels of referencing.
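As an illustration only (this is not a design from the literature cited below), a two-level relationship index might precompute condensed reference information along a path such as department -> employee -> address:

    from collections import defaultdict

    class RelationshipIndex:
        # Condensed interobject references: from-OID -> set of to-OIDs.
        def __init__(self):
            self.refs = defaultdict(set)

        def add(self, from_oid, to_oid):
            self.refs[from_oid].add(to_oid)

        def compose(self, second):
            # Precompute a two-level index, generalizing a join index
            # [Valduriez 1987] across two levels of referencing.
            composed = RelationshipIndex()
            for a, bs in self.refs.items():
                for b in bs:
                    composed.refs[a] |= second.refs.get(b, set())
            return composed

    dept_emp, emp_addr = RelationshipIndex(), RelationshipIndex()
    dept_emp.add("dept1", "emp1"); dept_emp.add("dept1", "emp2")
    emp_addr.add("emp1", "addr9"); emp_addr.add("emp2", "addr7")
    dept_addr = dept_emp.compose(emp_addr)
    print(dept_addr.refs["dept1"])   # addresses reachable without pointer chasing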
Examples of indices in object-oriented database systems include the work of Maier and Stein [1986] in the GemStone object-oriented database system product, Bertino [1990; 1991] and Bertino and Kim [1989] in the Orion project, and Kemper et al. [1991] and Kemper and Moerkotte [1990a; 1990b] in the GOM project. At this point, it is too early to decide which index structures will be the most useful because the entire field of query processing in object-oriented systems is still developing rapidly, from query languages to algebra design, algorithm repertoire, and optimization techniques. Other areas of intense current research interest are buffer management and clustering of objects on disk.
One of the big performance penalties in object-oriented database systems is "pointer chasing" (using OID references), which may involve object faults and disk read operations at widely scattered locations, also called "goto's on disk." In order to reduce I/O costs, some systems use what amounts to main-memory databases or map the entire database into virtual memory. For systems with an explicit database on disk and an in-memory buffer, there are various techniques to detect object faults; some commercial object-oriented database systems use hardware mechanisms originally conceived and implemented for virtual-memory systems. While such hardware support makes fault detection faster, it does not address the problem of expensive I/O operations. In order to reduce actual I/O cost, read-ahead and planned buffering must be used. Palmer and Zdonik [1991] recently proposed keeping access patterns or sequences and activating read-ahead if accesses equal or similar to a stored pattern are detected. Another recent proposal for efficient assembly of complex objects uses a window (a small set) of open references and resolves, at any point of time, the most convenient one by fetching this object or component from disk, which has shown dramatic improvements in disk seek times and makes complex object retrieval more efficient and more independent of object clustering [Keller et al. 1991]. Policies and mechanisms for efficient parallel complex object assembly are an important challenge for the developers of next-generation object-oriented database management systems [Maier et al. 1992].

11.4 More Control Operators

The exchange operator used for parallel query processing is not a normal operator in the sense that it does not manipulate, select, or transform data. Instead, the exchange operator provides control of query processing in a way orthogonal to what a query does and what algorithms it uses. Therefore, we call it a meta- or control operator. There are several other control operators that can be used in database query processing, and we survey them briefly in this section.

In situations in which an intermediate result is used repeatedly, e.g., a nested-loops join with a composite inner input, either the intermediate result is derived many times, or it is saved in a temporary file during its first derivation and then retrieved from this file while serving subsequent requests. This situation arises not only with nested-loops join but also with other algorithms, e.g., sort-based universal quantification [Smith and Chang 1975]. Thus, it might be useful to encapsulate this functionality in a new algorithm, which we call the store-and-scan operator (a minimal sketch appears after the three generalizations discussed next).

The store-and-scan operator permits three generalizations. First, if the first consumption of the intermediate result might actually not need it entirely, e.g., a nested-loops semi-join which terminates each inner scan after the first match, the operator should be switched to derive only the necessary data items (which implies leaving the input plan ready to produce more data later) or to save the entire intermediate result in the temporary file right away in order to permit release of all resources in the subplan. Second, subsequent scans might permit starting not at the beginning of the temporary file but at some later point. This version is useful if many duplicates exist in the inputs of one-to-one matching algorithms based on merge-join.
Third, in some execution strategies for correlated SQL subqueries, the plan corresponding to the inner block is executed once for each tuple in the outer block. The tuples of the outer block provide different correlation values, although each value may occur repeatedly. In order to ensure that the inner plan is executed only once for each outer correlation value, the store-and-scan operator could retain information about which part of its temporary file corresponds to which correlation value and restrict each scan appropriately.
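Here is a minimal sketch of the basic store-and-scan iterator, without the three generalizations; an in-memory buffer stands in for the temporary file a real operator would use.

    class StoreAndScan:
        # Materialize a producer's output during its first scan; serve
        # any number of subsequent scans from the materialized copy.
        def __init__(self, producer):
            self._producer = producer   # any iterable: the input subplan
            self._stored = None

        def scan(self):
            if self._stored is None:
                self._stored = []
                for item in self._producer:   # first derivation is saved ...
                    self._stored.append(item)
                    yield item                # ... while being passed on
            else:
                yield from self._stored       # later scans avoid recomputation

    def composite_inner():                    # a stand-in expensive subplan
        print("deriving intermediate result")
        yield from (1, 2, 3)

    inner = StoreAndScan(composite_inner())
    for outer_tuple in ("a", "b"):            # nested-loops join, inner reused
        for inner_tuple in inner.scan():
            print(outer_tuple, inner_tuple)   # subplan body runs only once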
Another use of a temporary file is to support common subexpressions, which can be executed efficiently with an operator that passes the result of a common subexpression to multiple consumers, as mentioned briefly in the section on the architecture of query execution engines. The problem is that multiple consumers, typically demand-driven and demand-driving their inputs, will request items of the common subexpression result at different times or rates. The two standard solutions are either to execute the common subexpression into a temporary file and let each consumer scan this file at will, or to determine which consumer will be the first to require the result of the common subexpression, to execute the common subexpression as part of this consumer, and to create a file with the common subexpression result as a by-product of the first consumer's execution. Instead, we suggest a new meta-operator, which we call the split operator, to be placed at the top of the common subexpression's plan and which can serve multiple consumers at their own paces. It automatically performs buffering to account for different paces, uses temporary disk space if the discrepancies are too wide, and is suitably parameterized to permit both standard solutions described above.

In query processing systems, data flow is usually paced or driven from the top, the consumer. The leftmost diagram of Figure 36 shows the control flow of normal iterators. (Notice that the arrows in Figure 36 show the control flow; the data flow of all diagrams in Figure 36 is assumed to be upward. In data-driven data flow, control and data flows point in the same direction; in demand-driven data flow, their directions oppose each other.) However, in real-time systems that capture data from experiments, this approach may not be realistic because the data source, e.g., a satellite receiver, has to be able to unload data as they arrive. In such systems, data-driven operators, shown in the second diagram of Figure 36, might be more appropriate. To combine the algorithms implemented and used for query processing with such real-time data capture requirements, one could design data flow translation control operators; a sketch of the first one follows this discussion.

The first such operator, which we call the active scheduler, can be used between a demand-driven producer and a data-driven consumer. In this case, neither operator will schedule the other; therefore, an active scheduler that demands items from the producer and forces them onto the consumer will glue these two operators together. An active-scheduler schematic is shown in the third diagram of Figure 36. The opposite case, a data-driven producer and a demand-driven consumer, has two operators, each trying to schedule the other one. A second flow control operator, called the passive scheduler, can be built that accepts procedure calls from either neighbor and resumes the other neighbor in a coroutine fashion to ensure that the resumed neighbor will eventually demand the item the scheduler just received. The final diagram of Figure 36 shows the control flow of a passive scheduler. (Notice that this case is similar to the bracket model of parallel operator implementations discussed earlier in which an operating system or networking software layer had to be placed between data manipulation operators and perform buffering and flow control.)

Figure 36. Operators, schedulers, and control flow (four control-flow diagrams: standard iterator, data-driven operator, active scheduler, passive scheduler; only the caption and panel labels survive from the original figure).
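A minimal sketch of the active scheduler: it demands items from a demand-driven producer (here a Python iterator) and forces them onto a data-driven consumer (here a callback), so that neither neighbor has to schedule the other. Both interfaces are assumptions of this sketch.

    def active_scheduler(producer, consume, close):
        for item in producer:   # demand an item from the producer (pull)
            consume(item)       # force it onto the consumer (push)
        close()                 # propagate end-of-stream downward

    # A data-driven consumer, standing in for a real-time data sink:
    received = []
    active_scheduler(iter(range(3)), received.append,
                     lambda: received.append("eof"))
    print(received)   # [0, 1, 2, 'eof']

The passive scheduler's coroutine behavior would map naturally onto generators but is omitted here.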
Finally, for very complex queries, it might be useful to break the data flow between operators at some point, for two reasons. First, if too many operators run in parallel, contention for memory or temporary disks might be too intense, and none of the operators will run as efficiently as possible. A long series of hybrid hash joins in a right-deep query plan illustrates this situation. Second, due to the inherent error in selectivity estimation during query optimization [Ioannidis and Christodoulakis 1991; Mannino et al. 1988], it might be worthwhile to execute only a subset of a plan, verify the correctness of the estimation, and then resume query processing with another few steps. After a few processing steps have been performed, their result size and other statistical properties such as minimum and maximum and approximate number of duplicate values can be easily determined while saving the result on temporary disk.

In principle, this was done in Ingres' original optimization method called Decomposition, except that Ingres performed only one operation at a time before optimizing the remaining query [Wong and Youssefi 1976; Youssefi and Wong 1979]. We propose alternating more slowly between optimization and execution, i.e., to perform a "reasonable" number of steps between optimizations, where reasonable may be three to ten selections and joins depending on errors and error propagation in selectivity estimation. Stopping the data flow and resuming after additional optimization could very well turn out to be the most reliable technique for very large complex queries. Implementation of this technique could be embodied in another control operator, the choose-plan operator first described in Graefe and Ward [1989]. Its current implementation executes zero or more subplans and then invokes a decision function provided by the optimizer that decides which of multiple equivalent plans to execute depending on intermediate result statistics, current system load, and run-time values of query parameters unknown at optimization time. Unfortunately, further research is needed to develop techniques for placing such operators in very complex query plans. A minimal sketch of this control flow appears below.

One possible purpose of the subplans executed prior to a decision could be to sample the values in the database. A very interesting research direction quantifies the value of sampling by analyzing the resulting improvement in the decision quality [Seppi et al. 1989].
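The following sketch of the choose-plan control flow assumes the optimizer supplies both the decision function and the equivalent alternative plans; all names are illustrative, not from the cited implementation.

    def run_and_measure(subplan):
        result = list(subplan())             # materialize on "temporary disk"
        return {"cardinality": len(result),
                "min": min(result, default=None),
                "max": max(result, default=None)}

    def choose_plan(decision, alternatives, *subplans):
        # Execute zero or more subplans, gather run-time statistics, then
        # let the optimizer-provided decision function pick a plan.
        stats = [run_and_measure(p) for p in subplans]
        return alternatives[decision(stats)]()

    # Prefer an index join only if the subplan turned out to be selective:
    plans = [lambda: "index-nested-loops join", lambda: "hash join"]
    decision = lambda stats: 0 if stats[0]["cardinality"] < 100 else 1
    print(choose_plan(decision, plans, lambda: range(10)))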
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT

In this section, we consider some additional techniques that have been proposed in the literature or used in real systems and that have not been discussed in earlier sections of this survey. In particular, we consider precomputation, data compression, surrogate processing, bit vector filters, and specialized hardware. Recently proposed techniques that have not been fully developed are not discussed here, e.g., "racing" equivalent plans and terminating the ones that seem not competitive after some small amount of time.

12.1 Precomputation and Derived Data

It is trivial to answer a query for which the answer is already known; therefore, precomputation of frequently requested information is an obvious idea. The problem with keeping preprocessed information in addition to base data is that it is redundant and must be invalidated or maintained on updates to the base data.

Precomputation and derived data such as relational views are duals. Thus, concepts and algorithms designed for one will typically work well for the other. The main difference is the database user's view: precomputed data are typically used after a query optimizer has determined that they can be used to answer a user query against the base data, while derived data are known to the user and can be queried without regard to the fact that they actually must be derived at run-time from stored base data.
Not surprisingly, since derived data are likely to be referenced and requested by users and application programs, precomputation of derived data has been investigated both for relational and object-oriented data models.

Indices are the simplest form of precomputed data since they are a redundant and, in a sense, precomputed selection. They represent a compromise between a nonredundant database and one with complex precomputed data because they can be maintained relatively efficiently.

The next more sophisticated form of precomputation are inversions as provided in System R's "0th" prototype [Chamberlin et al. 1981a], view indices as analyzed by Roussopoulos [1991], two-relation join indices as proposed by Valduriez [1987], or domain indices as used in the ANDA project (called VALTREE there) [Deshpande and Van Gucht 1988] in which all occurrences of one domain (e.g., part number) are indexed together, and each index entry contains a relation identification with each record identifier. With join or domain indices, join queries can be answered very fast, typically faster than using multiple single-relation indices. On the other hand, single-clause selections and updates may be slightly slower if there are more entries for each indexed key.

For binary operators, there is a spectrum of possible levels of precomputation (as suggested by J. A. Blakeley), explored predominantly for joins. The simplest form of precomputation in support of binary operations is individual indices, e.g., clustering B-trees that ensure and maintain sorted relations. On the other extreme are completely materialized join results. Intermediate levels are pointer-based joins [Shekita and Carey 1990] (discussed earlier in the section on matching) and join indices [Valduriez 1987]. For each form of precomputed result, the required redundant data structures must be maintained each time the underlying base data are updated, and larger retrieval speedup might be paid for with larger maintenance overhead.

Babb [1982] explored storing only results of outer joins, but not the normalized base relations, in the content-addressable file store (CAFS), and called this encoding join normal form. Blakeley et al. [1989], Blakeley and Martin [1990], Larson and Yang [1985], Medeiros and Tompa [1985], Tompa and Blakeley [1988], and Yang and Larson [1987] investigated storing and maintaining materialized views in relational database systems. Their hope was to speed relational query processing by using derived data, possibly without storing all base data, and ensuring that their maintenance overhead would be less than their benefits in faster query processing. For example, Blakeley and Martin [1990] demonstrated that for a single join there exists a large range of retrieval and update mixes in which materialized views outperform both join indices and hybrid hash join. This investigation should be extended, however, for more complex queries, e.g., joins of three and four inputs, and for queries in object-oriented systems and emerging database applications.

Hanson [1987] compared query modification (i.e., query evaluation from base relations) against the maintenance costs of materialized views and considered in particular the cost of immediate versus deferred updates. His results indicate that for modest update rates, materialized views provide better system performance.
Furthermore, for modest selectivities of the view predicate, deferred-view maintenance using differential files [Severance and Lohman 1976] outperforms immediate maintenance of materialized views. However, Hanson also did not include multi-input joins in his study.

Sellis [1987] analyzed caching of results in a query language called Quel+ (which is a superset of Postquel [Stonebraker et al. 1990b]) over a relational database with procedural (QUEL) fields. He also considered the case of limited space on secondary storage used for caching query results, and replacement algorithms for query results in the cache when the space becomes insufficient.
Links between records (pointers of some sort, e.g., record, tuple, or object identifiers) are another form of precomputation. Links are particularly effective for system performance if they are combined with clustering (assignment of records to pages). Database systems for the hierarchical and network models have used physical links and clustering, but supported basically only queries and operations that were precomputed in this way. Some researchers tried to overcome this restriction by building relational query engines on top of network systems, e.g., Chen and Kuck [1984], Rosenthal and Reiner [1985], Zaniolo [1979]. However, with performance improvements in the relational world, these efforts seem to have been abandoned. With the advent of extensible and object-oriented database management systems, combining links and ad hoc query processing might become a more interesting topic again. A recent effort for an extensible-relational system is Starburst's pointer-based joins discussed earlier [Haas et al. 1990; Shekita and Carey 1990].

In order to ensure good performance for its extensive rule-processing facilities, Postgres uses precomputation and caching of the action parts of production rules [Stonebraker 1987; Stonebraker et al. 1990a; 1990b]. For automatic maintenance of such derived data, persistent "invalidation locks" are stored for detection of invalid data after updates to the base data.

Finally, the Cactis project focused on maintenance of derived data in object-oriented environments [Hudson and King 1989]. The conclusions of this project include that incremental maintenance coupled with a fairly simple adaptive clustering algorithm is an efficient way to propagate updates to derived data.

One issue that many investigations into materialized views ignore is the fact that many queries do not require views in their entirety. For example, if a relational student information system includes a view that computes each student's grade point average from the enrollment data, most queries using this view will select only a single student, not all students at the school. Thus, if the view definition is merged into the query before query optimization, as discussed in the introduction, only one student's grade point average, not the entire view, will be computed for each query. Obviously, the treatment of this difference will affect an analysis of costs and benefits of materialized views.

12.2 Data Compression

A number of researchers have investigated the effect of compression on database systems and their performance [Graefe and Shapiro 1991; Lynch and Brownrigg 1981; Ruth and Keutzer 1972; Severance 1983]. There are two types of compression in database systems. First, the amount of redundancy can be reduced by prefix and suffix truncation, in particular in indices, and by use of encoding tables (e.g., color combination "9" means "red car with black interior"). Second, compression schemes can be applied to attribute values, e.g., adaptive Huffman coding or Ziv-Lempel methods [Bell et al. 1989; Lelewer and Hirschberg 1987]. This type of compression can be exploited most effectively in database query processing if all attributes of the same domain use the same encoding, e.g., the "Part-No" attributes of data sets representing parts, orders, shipments, etc., because common encodings permit comparisons without decompression.

Most obviously, compression can reduce the amount of disk space required for a given data set.
Disk space savings have a number of ramifications on I/O performance. First, the reduced data space fits into a smaller physical disk area; therefore, the seek distances and seek times are reduced. Second, more data fit into each disk page, track, and cylinder, allowing more intelligent clustering of related objects into physically near locations. Third, the unused disk space can be used for disk shadowing to increase reliability, availability, and I/O performance [Bitton and Gray 1988]. Fourth, compressed data can be transferred faster to and from disk.
In other words, data compression is an effective means to increase disk bandwidth (not by increasing physical transfer rates but by increasing the information density of transferred data) and to relieve the I/O bottleneck found in many high-performance database management systems [Boral and DeWitt 1983]. Fifth, in distributed database systems and in client-server situations, compressed data can be transferred faster across the network than uncompressed data. Uncompressed data require either more network time or a separate compression step. Finally, retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os. The last three points are actually more general. They apply to the entire storage hierarchy of tape, disk, controller caches, local and remote main memories, and CPU caches.

For query processing, compression can be exploited far beyond improved I/O performance because decompression can often be delayed until a relatively small data set is presented to the user or an application program. First, exact-match comparisons can be performed on compressed data. Second, projection and duplicate removal can be performed without decompressing data. The situation for aggregation is a little more complex since the attribute on which arithmetic is performed typically must be decompressed. Third, neither the join attributes nor other attributes need to be decompressed for most joins. Since keys and foreign keys are from the same domain, and if compression schemes are fixed for each domain, a join on compressed key values will give the same results as a join on normal uncompressed key values. It might seem unusual to perform a merge-join in the order of compressed values, but it nonetheless is possible and will produce correct results.

There are a number of benefits from processing compressed data. First, materializing output records is faster because records are shorter, i.e., less copying is required. Second, for inputs larger than memory, more records fit into memory. In hybrid hash join and duplicate removal, for instance, the fraction of the file that can be retained in the hash table and thus be joined without any I/O is larger. During sorting, the number of records in memory and thus per run is larger, leading to fewer runs and possibly fewer merge levels. Third, and very interestingly, skew is less likely to be a problem. The goal of compression is to represent the information with as few bits as possible. Therefore, each bit in the output of a good compression scheme has close to maximal information content, and bit columns seen over the entire file are unlikely to be skewed. Furthermore, bit columns will not be correlated. Thus, the compressed key values can be used to create a hash value distribution that is almost guaranteed to be uniform, i.e., optimal for hashing in memory and partitioning to overflow files as well as to multiple processors in parallel join algorithms.

We believe that data compression is undervalued in current query processing research, mostly because it was not realized that many operations can often be performed faster on compressed data than on uncompressed data, and we hope that future database management systems make extensive use of data compression. Considering the current growth rates in CPU and I/O performance, it might even make sense to exploit data compression on the fly for hash table overflow resolution. A small sketch of joining on encoded values follows.
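This minimal sketch of a shared domain encoding shows that an equi-join on encoded key values gives the same result as joining the decoded values; the dictionary layout is an assumption of this sketch.

    class DomainEncoding:
        # One encoding table per domain, shared by every attribute of that
        # domain (e.g., all Part-No columns), so comparisons and joins can
        # run on the compact codes without decompression.
        def __init__(self):
            self.code, self.value = {}, []

        def encode(self, v):
            if v not in self.code:
                self.code[v] = len(self.value)
                self.value.append(v)
            return self.code[v]

        def decode(self, c):
            return self.value[c]

    part_no = DomainEncoding()
    parts  = [(part_no.encode(p), d) for p, d in [(4711, "bolt"), (9876, "washer")]]
    orders = [(part_no.encode(p), q) for p, q in [(4711, 8), (9876, 3), (2222, 1)]]

    # Hash join directly on encoded keys; only the output is decoded.
    lookup = dict(parts)
    print([(part_no.decode(c), q, lookup[c]) for c, q in orders if c in lookup])
    # -> [(4711, 8, 'bolt'), (9876, 3, 'washer')]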
12.3 Surrogate Processing

Another very useful technique in query processing is the use of surrogates for intermediate results. A surrogate is a reference to a data item, be it a logical object identifier (OID) used in object-oriented systems or a physical record identifier (RID) or location. Instead of keeping a complete record in memory, only the fields that are used immediately are kept, and the remainder is replaced by a surrogate, which has in principle the same effect as compression.
While this technique has traditionally been used to reduce main-memory requirements, it can also be employed to improve board- and CPU-level caching [Nyberg et al. 1993].

The simplest case in which surrogate processing can be exploited is in avoiding copying. Consider a relational join; when two items satisfy the join predicate, a new tuple is created from the two original ones. Instead of copying the data fields, it is possible to create only a pair of RIDs or pointers to the original records if they are kept in memory. If a record is 50 times larger than an RID, e.g., 8 vs. 400 bytes, the effort spent on copying bytes is reduced by that factor.

Copying is already a major part of the CPU time spent in many query processing systems, but it is becoming more expensive for two reasons. First, many modern CPU designs and implementations are optimized for an impressive number of instructions per second but do not provide comparable performance improvements in mundane tasks such as moving bytes from one memory location to another [Ousterhout 1990]. Second, many modern computer architectures employ multiple CPUs accessing shared memory over one bus because this design permits fast and inexpensive parallelism. Although alleviated by local caches, bus contention is the major bottleneck and limitation to scalability in shared-memory parallel machines. Therefore, reductions in memory-to-memory copying in database query execution engines permit higher useful degrees of parallelism in shared-memory machines.

A second example of surrogate processing was mentioned earlier in connection with indices. To evaluate a conjunction with multiple clauses, each of which is supported by an index, it might be useful to perform an intersection of RID lists to reduce the number of records needed before actual data are accessed. A sketch of this RID-list style of processing appears below.

A third case is the use of indices and RIDs to evaluate joins, for example, in the query processing techniques used in Ingres [Kooi 1980; Kooi and Frankforth 1982] and IBM's hybrid join [Cheng et al. 1991] discussed in the section on binary matching.

Surrogate processing has also been used in parallel systems, in particular, distributed-memory implementations, to reduce network traffic. For example, Lorie and Young [1989] used RIDs to reduce the communication time in parallel sorting by sending (sort key, RID) pairs to a central site, which determines each record's global rank, and then repartitioning and merging records very quickly by their rank alone without further data comparisons.

Another form of surrogates are encodings with lossy compression, such as superimposed coding used for efficient access methods [Bloom 1970; Faloutsos 1985; Sacks-Davis and Ramamohanarao 1983; Sacks-Davis et al. 1987]. Berra et al. [1987] and Chung and Berra [1988] considered indexing and retrieval organizations for very large (relational) knowledge bases and databases. They employed three techniques: concatenated code words (CCWs), superimposed code words (SCWs), and transformed inverted lists (TILs). TILs are normal index structures for all attributes of a relation that permit answering conjunctive queries by bitwise anding. CCWs and SCWs use hash values of all attributes of a tuple and either concatenate such hash values or bitwise or them together. The resulting code words are then used as keys in indices.
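A minimal sketch of RID-list intersection for a conjunctive predicate; the per-clause index results are illustrative.

    def rid_intersection(first, *rest):
        # Intersect RID lists from several indices; only the surviving
        # RIDs require an actual record fetch.
        result = set(first)
        for rids in rest:
            result &= set(rids)
        return sorted(result)

    # Hypothetical index results for "color = 'red' AND year = 1993":
    rids_color = [3, 7, 12, 19, 42]
    rids_year  = [7, 8, 19, 77]
    print(rid_intersection(rids_color, rids_year))   # fetch only records 7 and 19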
In their particular architecture, Berra et al. and Chung and Berra consider associative memory and optical computing to search efficiently through such indices, although conventional software techniques could be used as well.

12.4 Bit Vector Filtering

In parallel systems, bit vector filters have been used very effectively for what we call here "probabilistic semi-joins." Consider a relational join to be executed on a distributed-memory machine with repartitioning of both input relations on the join attribute.
It is clear that communication effort could be reduced if only the tuples that actually contribute to the join result, i.e., those with a match in the other relation, needed to be shipped across the network. To accomplish this, distributed database systems were designed to make extensive use of semi-joins, e.g., SDD-1 [Bernstein et al. 1981].

A faster alternative to semi-joins, which, as discussed earlier, require basically the same computational effort as natural joins, is the use of bit vector filters [Babb 1979], also called Bloom filters [Bloom 1970]. A bit vector filter with N bits is initialized with zeroes, and all items in the first (preferably the smaller) input are hashed on their join key to 0, ..., N - 1. For each item, one bit in the bit vector filter is set to one; hash collisions are ignored. After the first join input has been exhausted, the bit vector filter is used to filter the second input. Data items of the second input are hashed on their join key value, and only items for which the bit is set to one can possibly participate in the join. There is some chance for false passes in the case of collisions, i.e., items of the second input pass the bit vector filter although they actually do not participate in the join, but if the bit vector filter is sufficiently large, the number of false passes is very small. A minimal sketch of this build-and-filter cycle appears below.

In general, if the number of bits is about twice the number of items in the first input, bit vector filters are very effective. If many more bits are available, the bit vector filter can be split into multiple subvectors, or multiple bits can be set for each item using multiple hash functions, reducing the number of false passes. Babb [1979] analyzed the use of multiple bit vector filters in detail.

The Gamma relational database machine demonstrated the effectiveness of bit vector filtering in relational join processing on distributed-memory hardware [DeWitt et al. 1986; 1988; 1990; Gerber 1986]. When scanning and redistributing the build input of a join, the Gamma machine creates a bit vector filter that is then distributed to the scanning sites of the probe input. Based on the bit vector filter, a large fraction of the probe tuples can often be discarded before incurring network costs. The decision whether to create one bit vector filter for the entire build input or to create a bit vector filter for each of the join sites depends on the space available for bit vector filters and the communication costs for bit arrays.

Mullin [1990] generalized bit vector filtering to sending bit vector filters back and forth between sites. In his words, "the central notion is to send small but optimally information-dense Bloom filters between sites as long as these filters serve to reduce the volume of tuples which need to be transmitted by more than their own size." While this procedure achieves very low communication costs, it ignores the I/O cost at each site if the reduced relations must be scanned from disk in each step. Qadah [1988] discussed a limited form of this idea using only two bit vector filters and augmenting it with bit vector filter compression.
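A minimal sketch of the basic scheme just described, with a single hash function; the filter size and hash choice are illustrative.

    def build_filter(build_keys, n_bits):
        # Set one bit per build key; hash collisions are simply ignored.
        bits = 0
        for k in build_keys:
            bits |= 1 << (hash(k) % n_bits)
        return bits

    def probe_filter(bits, n_bits, probe_keys):
        # Pass only keys whose bit is set: no false drops, but an
        # occasional false pass that the real join must still reject.
        return [k for k in probe_keys if bits >> (hash(k) % n_bits) & 1]

    build = [4711, 9876]
    n_bits = 2 * len(build)            # about two bits per build item
    bits = build_filter(build, n_bits)
    print(probe_filter(bits, n_bits, [4711, 2222, 9876, 2357]))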
While bit vector filtering is typically used only for joins, it is equally applicable to all other one-to-one match operators, including semi-join, outer join, intersection, union, and difference. For operators that include nonmatching items in their output, e.g., outer joins and unions, part of the result can be obtained before network transfer, based solely on the bit vector filter; the algorithm must be modified, however, to ensure that items that do not pass the bit vector filter are properly included in the operation's output stream. For parallel relational division (universal quantification), bit vector filtering can be used on the divisor attributes to eliminate most of the dividend items that do not pertain to any divisor item. Thus, our earlier assessment that universal quantification can be performed as fast as existential quantification (a semi-join of dividend and divisor relations) even extends to the special techniques used to boost join performance.

Bit vector filtering can also be exploited in sequential systems.
Consider a merge-join with sort operations on both inputs. If the bit vector filter is built based on the input of the first sort, i.e., if it is completed when all data have reached the first sort operator, it can then be used to reduce the input into the second sort operator on the (presumably larger) second input. Depending on how the sort operation is organized into phases, it might even be possible to create a second bit vector filter from the second merge-join input and use it to reduce the first join input while it is being merged.

For sequential hash joins, bit vector filters can be used in two ways. First, they can be used to filter items of the probe input using a bit vector filter created from items of the build input. This use of bit vector filters is analogous to bit vector filter usage in parallel systems and for merge-join. In Rdb/VMS and DB2, bit vector filters are used when intersecting large RID lists obtained from multiple indices on the same table [Antoshenkov 1993; Mohan et al. 1990]. Second, new bit vector filters can be created and used for each partition in each recursion level. In the Volcano query-processing system, the operator implementing hash join, intersection, etc. reuses the space that anchors each bucket's linked list for a small bit vector filter after the bucket has been spilled to an overflow file. Only those items from the probe input that pass the bit vector filter are written to the probe overflow file. This technique is used in each recursion level of overflow resolution. Thus, during recursive partitioning, relatively small bit vector filters can be used repeatedly and at increasingly finer granularity to remove items from the probe input that do not contribute to the join result. Bit vectors could also be used to remove items from the build input using bit vector filters created from the probe input; however, since the probe input is presumed to be the larger input, and since hash collisions in the bit vector filter would make the filter less effective, it may or may not be an effective technique.

With some modifications of the standard algorithm, bit vector filters can also be used in hash-based duplicate removal. Since bit vector filters can only determine safely which item has not been seen yet, but not which item has been seen yet (due to possible hash collisions), bit vector filters cannot be used in the most direct way in hash-based duplicate removal. However, hash-based duplicate removal can be modified to become similar to a hash join or actually a hash-based set intersection; a sketch of the modified partitioning step appears at the end of this description. Consider a large file R and a partitioning fan-out F. First, R is partitioned into F/2 partitions. For each partition, two files are created; thus, this step uses the entire fan-out to create a total of F files. Within each partition, a bit vector filter is used to determine whether an item belongs into the first or the second file of that partition. If an item is guaranteed to be unique, i.e., there is no earlier item indicated in the bit vector filter, the item is assigned to the first file, and a bit in the bit vector filter is set. Otherwise, the item is assigned to the partition's second file. At the end of this partitioning step, there are F files, half of them guaranteed to be free of duplicate data items. The possible size of the duplicate-free files is limited by the size of the bit vector filters; therefore, this step should use the largest bit vector filters possible.
After the first partitioning step, each partition's pair of files is intersected using the duplicate-free file as probe input. Recall that duplicate removal for a join's build input can be accomplished easily and inexpensively while building the in-memory hash table. Remaining duplicates with one copy in the duplicate-free (probe) file and another copy in the other file (the build input) in the hash table are found when the probe input is matched against the hash table. This algorithm performs very well if many output items depend on only one input item and if the bit vectors are quite large. In that case, the duplicate-free partition files are very large, and the smaller partition file with duplicates can be processed very efficiently.
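A minimal sketch of the modified partitioning step; the fan-out, hashing, and partition files are simplified to in-memory lists.

    def partition_with_filters(items, fanout, filter_bits=64):
        # F/2 partitions, each split into a guaranteed-duplicate-free
        # file and a "maybe duplicate" file by a per-partition bit vector.
        parts = fanout // 2
        unique_files = [[] for _ in range(parts)]
        maybe_files  = [[] for _ in range(parts)]
        filters = [0] * parts
        for item in items:
            p = hash(item) % parts
            bit = 1 << (hash((item, "filter")) % filter_bits)
            if filters[p] & bit:             # possibly seen before
                maybe_files[p].append(item)
            else:                            # certainly not seen before
                filters[p] |= bit
                unique_files[p].append(item)
        return unique_files, maybe_files

    u, m = partition_with_filters([1, 2, 1, 3, 2, 4], fanout=4)
    print(u, m)
    # Each pair (m[i], u[i]) is then intersected as described above, with
    # the smaller file m[i] as build input, so remaining duplicates are
    # found while building and probing the hash table.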
In order to find and exploit a dual, in the realm of sorting and merge-join, to bit vector filtering in each recursion level of recursive hash join, sorting of multiple inputs must be divided into individual merge levels. In other words, for a merge-join of inputs R and S, the sort activity should switch back and forth between R and S, level by level, creating and using a new bit vector filter in each merge level. Unfortunately, even with a sophisticated sort implementation that supports this use of bit vector filters in each merge level, recursive hybrid hash join will make more effective use of bit vector filters because the inputs are partitioned, thus reducing the number of distinct values in each partition in each recursion level.

12.5 Specialized Hardware

Specialized hardware was considered by a number of researchers, e.g., in the forms of hardware sorters and logic-per-track selection. A relatively recent survey of database machine research is given by Su [1988]. Most of this research was abandoned after Boral and DeWitt's [1983] influential analysis that compared CPU and I/O speeds and their trends. They concluded that I/O is most likely the bottleneck in future high-performance query execution, not processing. Therefore, they recommended moving from research on custom processors to techniques for overcoming the I/O bottleneck, e.g., by use of parallel readout disks, disk caching and read-ahead, and indexing to reduce the amount of data to be read for a query. Other investigations also came to the conclusion that parallelism is no substitute for effective storage structures and query execution algorithms [DeWitt and Hawthorn 1981; Neches 1984]. An additional very strong argument against custom VLSI processors is that microprocessor speed is currently improving so rapidly that it is likely that, by the time a special hardware component has been designed, fabricated, tested, and integrated into a larger hardware and software system, the next generation of general-purpose CPUs will be available and will be able to execute database functions programmed in a high-level language at the same speed as the specialized hardware component. Furthermore, it is not clear what specialized hardware would be most beneficial to design, in particular, in light of today's directions toward extensible database systems and emerging database application domains. Therefore, we do not favor specialized database hardware modules beyond general-purpose processing, storage, and communication hardware dedicated to executing database software.

SUMMARY AND OUTLOOK

Database management systems provide three essential groups of services. First, they maintain both data and associated metadata in order to make databases self-contained and self-explanatory, at least to some extent, and to provide data independence. Second, they support safe data sharing among multiple users as well as prevention and recovery of failures and data loss. Third, they raise the level of abstraction for data manipulation above the primitive access commands provided by file systems with more or less sophisticated matching and inference mechanisms, commonly called the query language or query-processing facility. We have surveyed execution algorithms and software architectures used in providing this third essential service.

Query processing has been explored extensively in the last 20 years in the context of relational database management systems and is slowly gaining interest in the research community for extensible and object-oriented systems.
This is a very encouraging development, because if these new systems have increased modeling power over previous data models and database management systems but cannot execute even simple requests efficiently, they will never gain widespread use and acceptance. Databases will continue to manage massive amounts of data; therefore, efficient query and request execution will continue to represent both an important research direction and an important criterion in investment decisions in the "real world."
In other words, new database management systems should provide greater modeling power (this is widely accepted and intensely pursued), but also competitive or better performance than previous systems. We hope that this survey will contribute to the use of efficient and parallel algorithms for query processing tasks in new database management systems.

A large set of query processing algorithms has been developed for relational systems. Sort- and hash-based techniques have been used for physical-storage design, for associative index structures, for algorithms for unary and binary matching operations such as aggregation, duplicate removal, join, intersection, and division, and for parallel query processing using hash- or range-partitioning. Additional techniques such as precomputation and compression have been shown to provide substantial performance benefits when manipulating large volumes of data. Many of the existing algorithms will continue to be useful for extensible and object-oriented systems, and many can easily be generalized from sets of tuples to more general pattern-matching functions. Some emerging database applications will require new operators, however, both for translation between alternative data representations and for actual data manipulation.

The most promising aspect of current research into database query processing for new application domains is that the concept of a fixed number of parameterized operators, each performing a part of the required data manipulation and each passing an intermediate result to the next operator, is versatile enough to meet the new challenges. This concept permits specification of database queries and requests in a logical algebra as well as concise representation of database programs in a physical algebra. Furthermore, it allows algebraic optimizations of requests, i.e., optimizing transformations of algebra expressions and cost-sensitive translations of logical into physical expressions. Finally, it permits pipelining between operators to exploit parallel computer architectures and partitioning of stored data and intermediate results for most operators, in particular, for operators on sets but also for other bulk types such as arrays, lists, and time series.

We can hope that much of the existing relational technology for query optimization and parallel execution will remain relevant and that research into extensible optimization and parallelization will have a significant impact on future database applications such as scientific data. For database management systems to become acceptable for new application domains, their performance must at least match that of the file systems currently in use. Automatic optimization and parallelization may be crucial contributions to achieving this goal, in addition to the query execution techniques surveyed here.

ACKNOWLEDGMENTS

José A. Blakeley, Cathy Brand, Rick Cole, Diane Davison, David Helman, Ann Linville, Bill McKenna, Gail Mitchell, Shengsong Ni, Barb Peters, Leonard Shapiro, the students of "Readings in Database Systems" at the University of Colorado at Boulder (Fall 1991) and "Database Implementation Techniques" at Portland State University (Winter 1993), David Maier's weekly reading group at the Oregon Graduate Institute (Winter 1992), the anonymous referees, and the Computing Surveys editors Shamkant Navathe and Dick Muntz gave many valuable comments on earlier drafts of this survey, which have improved the paper very much.
This paper is based on research partially supported by the National Science Foundation with grants IRI-8996270, IRI-8912618, IRI-9006348, IRI-9116547, IRI-9119446, and ASC-9217394, ARPA with contract DAAB 07-91-C-Q518, Texas Instruments, Digital Equipment Corp., Intel Supercomputer Systems Division, Sequent Computer Systems, ADP, and the Oregon Advanced Computing Institute (OACIS).

REFERENCES

ADAM, N. R., AND WORTMANN, J. C. 1989. Security-control methods for statistical databases: A comparative study. ACM Comput. Surv. 21, 4 (Dec.), 515.

AHN, I., AND SNODGRASS, R. 1988. Partitioned storage for temporal databases. Inf. Syst. 13, 4, 369.
ALBERT, J. 1991. Algebraic properties of bag data types. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 211.

ANALYTI, A., AND PRAMANIK, S. 1992. Fast search in main memory databases. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 215.

ANDERSON, D. P., TZOU, S. Y., AND GRAHAM, G. S. 1988. The DASH virtual memory system. Tech. Rep. 88/461, Univ. of California—Berkeley, CS Division, Berkeley, Calif.

ANTOSHENKOV, G. 1993. Dynamic query optimization in Rdb/VMS. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.

ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., ESWARAN, K. P., GRAY, J. N., GRIFFITHS, P. P., KING, W. F., LORIE, R. A., MCJONES, P. R., MEHL, J. W., PUTZOLU, G. R., TRAIGER, I. L., WADE, B. W., AND WATSON, V. 1976. System R: A relational approach to database management. ACM Trans. Database Syst. 1, 2 (June), 97.

ASTRAHAN, M. M., SCHKOLNICK, M., AND WHANG, K. Y. 1987. Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12, 1, 11.

ATKINSON, M. P., AND BUNEMAN, O. P. 1987. Types and persistence in database programming languages. ACM Comput. Surv. 19, 2 (June), 105.

BABB, E. 1982. Joined Normal Form: A storage encoding for relational databases. ACM Trans. Database Syst. 7, 4 (Dec.), 588.

BABB, E. 1979. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst. 4, 1 (Mar.), 1.

BAEZA-YATES, R. A., AND LARSON, P. A. 1989. Performance of B+-trees with partial expansions. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 248.

BANCILHON, F., AND RAMAKRISHNAN, R. 1986. An amateur's introduction to recursive query processing strategies. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 16.

BARGHOUTI, N. S., AND KAISER, G. E. 1991. Concurrency control in advanced database applications. ACM Comput. Surv. 23, 3 (Sept.), 269.

BARU, C. K., AND FRIEDER, O. 1989. Database operations in a cube-connected multicomputer system. IEEE Trans. Comput. 38, 6 (June), 920.

BATINI, C., LENZERINI, M., AND NAVATHE, S. B. 1986. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18, 4 (Dec.), 323.

BATORY, D. S., BARNETT, J. R., GARZA, J. F., SMITH, K. P., TSUKUDA, K., TWICHELL, B. C., AND WISE, T. E. 1988a. GENESIS: An extensible database management system. IEEE Trans. Softw. Eng. 14, 11 (Nov.), 1711.

BATORY, D. S., LEUNG, T. Y., AND WISE, T. E. 1988b. Implementation concepts for an extensible data model and data language. ACM Trans. Database Syst. 13, 3 (Sept.), 231.

BAUGSTO, B., AND GREIPSLAND, J. 1989. Parallel sorting methods for large data volumes on a hypercube database computer. In Proceedings of the 6th International Workshop on Database Machines (Deauville, France, June 19-21).

BAYER, R., AND MCCREIGHT, E. 1972. Organization and maintenance of large ordered indices. Acta Informatica 1, 3, 173.

BECK, M., BITTON, D., AND WILKINSON, W. K. 1988. Sorting large files on a backend multiprocessor. IEEE Trans. Comput. 37, 7 (July), 769.

BECKER, B., SIX, H. W., AND WIDMAYER, P. 1991. Spatial priority search: An access technique for scaleless maps. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 128.

BECKMANN, N., KRIEGEL, H. P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 322.

BELL, T., WITTEN, I. H., AND CLEARY, J. G. 1989. Modelling for text compression. ACM Comput. Surv. 21, 4 (Dec.), 557.
BENTLEY, J. L. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept.), 509.

BERNSTEIN, P. A., AND GOODMAN, N. 1981. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June), 185.

BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE, C. L., AND ROTHNIE, J. B. 1981. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec.), 602.

BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass.

BERRA, P. B., CHUNG, S. M., AND HACHEM, N. I. 1987. Computer architecture for a surrogate file to a very large data/knowledge base. IEEE Comput. 20, 3 (Mar.), 25.

BERTINO, E. 1991. An indexing technique for object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 160.

BERTINO, E. 1990. Optimization of queries using nested indices. In Lecture Notes in Computer Science, vol. 416. Springer-Verlag, New York.

BERTINO, E., AND KIM, W. 1989. Indexing techniques for queries on nested objects. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 196.

BHIDE, A. 1988. An analysis of three transaction processing architectures. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 339.
BHIDE, A., AND STONEBRAKER, M. 1988. A performance comparison of two architectures for fast transaction processing. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 536.

BITTON, D., AND DEWITT, D. J. 1983. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8, 2 (June), 255.

BITTON-FRIEDLAND, D. 1982. Design, analysis, and implementation of parallel external sorting algorithms. Ph.D. Thesis, Univ. of Wisconsin—Madison.

BITTON, D., AND GRAY, J. 1988. Disk shadowing. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 331.

BITTON, D., DEWITT, D. J., HSIAO, D. K., AND MENON, J. 1984. A taxonomy of parallel sorting. ACM Comput. Surv. 16, 3 (Sept.), 287.

BITTON, D., HANRAHAN, M. B., AND TURBYFILL, C. 1987. Performance of complex queries in main memory database systems. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.

BLAKELEY, J. A., AND MARTIN, N. L. 1990. Join index, materialized view, and hybrid hash-join: A performance analysis. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.

BLAKELEY, J. A., COBURN, N., AND LARSON, P. A. 1989. Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Trans. Database Syst. 14, 3 (Sept.), 369.

BLASGEN, M., AND ESWARAN, K. 1977. Storage and access in relational databases. IBM Syst. J. 16, 4, 363.

BLASGEN, M., AND ESWARAN, K. 1976. On the evaluation of queries in a relational database system. IBM Res. Rep. RJ 1745, IBM, San Jose, Calif.

BLOOM, B. H. 1970. Space/time tradeoffs in hash coding with allowable errors. Commun. ACM 13, 7 (July), 422.

BORAL, H. 1988. Parallelism in Bubba. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Austin, Tex., Dec.), 68.

BORAL, H., AND DEWITT, D. J. 1983. Database machines: An idea whose time has passed? A critique of the future of database machines. In Proceedings of the International Workshop on Database Machines. Reprinted in Parallel Architectures for Database Systems. IEEE Computer Society Press, Washington, D.C., 1989.

BORAL, H., ALEXANDER, W., CLAY, L., COPELAND, G., DANFORTH, S., FRANKLIN, M., HART, B., SMITH, M., AND VALDURIEZ, P. 1990. Prototyping Bubba, a highly parallel database system. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 4.

BRATBERGSENGEN, K. 1984. Hashing methods and relational algebra operations. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 323.

BROWN, K. P., CAREY, M. J., DEWITT, D. J., MEHTA, M., AND NAUGHTON, J. F. 1992. Scheduling issues for complex database workloads. Computer Science Tech. Rep. 1095, Univ. of Wisconsin—Madison.
Object and file manage- ment in the EXODUS extensible database sys- tem. In proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 91. CARLIS, J. V. 1986. HAS: A relational algebra operator, or divided M not enough to conquer. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 254. CARTER, J. L., AND WEGMAN, M. N. 1979, Univer- sal classes of hash functions. J. Cornput. Syst. Scl. 18, 2, 143. CHAMB~RLIN, D. D., ASTRAHAN, M M., BLASGEN, M. W., GsA-I-, J. N., KING, W. F., LINDSAY, B. G., LORI~, R., MEHL, J. W., PRICE, T. G,, PUTZOLO, F , SELINGER, P. G., SCHKOLNIK, M., SLUTZ, D. R., TKAIGER, I. L , WADE, B. W., AND YOST, R, A. 198 la. A history and evaluation of System R. Cornmun ACM 24, 10 (Oct.), 632. CHAMBERLAIN, D. D,, ASTRAHAN, M. M., KING, W F., LORIE, R. A,, MEHL, J. W., PRICE, T. G., SCHKOLNIK, M., SELINGER, P G., SLUTZ, D. R., WADE, B. W., AND YOST, R. A. 1981b. SUp- port for repetitive transactions and ad hoc queries in System R. ACM Trans. Database Syst. 6, 1 (Mar), 70. CHEN, P. P. 1976. The entity relationship model —Toward a umtied view of data. ACM Trans. Database Syst. 1, 1 (Mar.), 9. C!HEN, H , AND KUCK, S. M. 1984. Combining re- lational and network retrieval methods. In proceedings of ACM SIGMOD Conference, ACM, New York, 131, CHEN, M. S., Lo, M. L., Yu, P. S., AND YOUNG, H C 1992. Using segmented right-deep trees for ACM Computmg Surveys, Vol. 25, No. 2, June 1993
  • 89. Query Evaluation Techniques * 161 the execution of pipelined hash joins. In Pro- ceedings of the International Conference on Very Large Data Bases (Vancouver, BC, Canada). VLDB Endowment, 15. CHEN~, J., HADERLE, D., HEDGES, R., IYER, B. R., MESSINGER, T., MOHAN, C., ANI) WANG, Y. 1991. An efficient hybrid join algorithm: A DB2 prototype. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 171. CHERITON, D. R., GOOSEN, H. A., mn BOYLE, P. D. 1991 Paradigm: A highly scalable shared- memory multicomputer. IEEE Comput. 24, 2 (Feb.), 33. CHIU, D. M., AND Ho, Y, C. 1980. A methodology for interpreting tree queries into optimal semi- join expressions. In Proceedings of ACM SIG- MOD Conference. ACM, New York, 169. CHOU, H, T. 1985. Buffer management of database systems. Ph.D. thesis, Univ. of Wisconsin—Madison. CHOU, H. T., AND DEWITT, D. J. 1985. An evalua- tion of buffer management strategies for rela- tional database systems. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.), VLDB En- dowment, 127. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988. CHRISTODOULAKIS, S. 1984. Implications of cer- tain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June), 163. CHUNG, S. M., AND BERRA, P. B. 1988. A compari- son of concatenated and superimposed code word surrogate files for very large data/knowl- edge bases. In Lecture Notes in Computer Sci- ence, vol. 303. Springer-VerIag, New York, 364. CLUET, S., DIZLOBEL, C., LECLUSF,, C., AND RICHARD, P. 1989. Reloops, an algebra based query language for an object-oriented database sys- tem. In Proceedings of the Ist International Conference on Deductive and Object-Orzented Databases (Kyoto, Japan, Dec. 4-6). COMER, D. 1979. The ubiquitous B-tree. ACM Comput. Suru. 11, 2 (June), 121. COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. 1988. Data placement in Bubba. In Proceedings of ACM SIGMOD Conference. ACM, New York, 99. DADAM, P., KUESPERT, K., ANDERSON, F., BLANKEN, H,, ERBE, R., GUENAUER, J., LUM, V., PISTOR, P., AND WALCH, G. 1986. A database manage- ment prototype to support extended NF 2 rela- tions: An integrated view on flat tables and hierarchies. In Proc.edmgs of ACM SIGMOD Conference. ACM, New York, 356. DMWELS, D., AND NG, P. 1982. Distributed query compilation and processing in R*. IEEE Database Eng. 5, 3 (Sept.). DmIELS, S., GRAEFE, G., KELLER, T., MAIER, D., SCHMIDT, D., AND VANCE, B. 1991. Query op- timization in revelation, an overview. IEEE Database Eng. 14, 2 (June). DAVIDSON, S. B., GARCIA-M• LINA, H., AND SKEEN, D. 1985. Consistency in partitioned networks, ACM Comput. Surv. 17, 3 (Sept.), 341. DAVIS, D. D. 1992. Oracle’s parallel punch for OLTP. Datamation (Aug. 1), 67. DAVISON, W. 1992. Parallel index building in In- formix OnLine 6.0. In Proceedings of ACM SIGMOD Conference. ACM, New York, 103. DEPPISCH, U., PAUL, H. B., AND SCHEK, H. J. 1986. A storage system for complex objects. In Pro- ceedings of the International Workshop on Ob- ject-Or[ented Database Systems (Pacific Grove, Calif., Sept.), 183. DESHPANDE, V., AND LARSON, P. A. 1992. The de- sign and implementation of a parallel join algo- rithm for nested relations on shared-memory multiprocessors. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 68, DESHPANDE, V., AND LARSON, P, A, 1991. An alge- bra for nested relations with support for nulls and aggregates. Computer Science Dept., Univ. 
of Waterloo, Waterloo, Ontario, Canada. DESHPANDE, A.j AND VAN GUCHT, D. 1988. An im- plementation for nested relational databases. In proceedings of the I?lternattonal Conference on Very Large Data Bases (Los Angeles, Calif., Aug. ) VLDB Endowment, 76. DEWITT, D. J. 1991. The Wisconsin benchmark: Past, present, and future. In Database and Transaction Processing System Performance Handbook. Morgan-Kaufman, San Mateo, Calif, DEWITT, D. J., AND GERBER, R. H. 1985, Multi- processor hash-based Join algorithms. In F’ro- ceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 151. DEWITT, D. J., AND GRAY, J. 1992. Parallel database systems: The future of high-perfor- mance database systems. Commun. ACM 35, 6 (June), 85. DEWITT, D. J., AND HAWTHORN, P. B. 1981. A performance evaluation of database machine architectures. In Proceedings of the Interns- tional Conference on Very Large Data Bases (Cannes, France, Sept.). VLDB Endowment, 199. DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS, M. L., KUMAR, K. B., AND MURALI%ISHNA, M. 1986. GAMMA-A high performance dataflow database machine. In Proceedings of the Inter- national Conference on Very Large Data Bases. VLDB Endowment, 228. Reprinted in Read- ings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988. DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEI- DER, D. 1988. A performance analysis of the GAMMA database machine. In Proceedings of ACM Computing Surveys, Vol. 25, No. 2, June 1993
  • 90. 162 * Goetz Graefe ACM SIGMOD Conference. ACM, New York, 350 DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H. I.. AND RASMUSSEN, R. 1990. The Gamma database machine pro- ject. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.). 44. DEWITT, D. J., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Im- plementation techniques for mam memory database systems In ProceecZings of ACM SIG- MOD Conference. ACM, New York, 1. DEWITT, D., NAUGHTON, J., AND BURGER, J. 1993. Nested loops revisited. In proceedings of Paral- lel and Distributed In forrnatlon Systems [San Diego, Calif., Jan.). DEWITT, D. J., NAUGHTON, J. E., AND SCHNEIDER, D. A. 1991a. An evaluation of non-equijoin algorithms. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 443. DEWITT, D., NAUGHTON, J., AND SCHNEIDER, D. 1991b Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the International Conference on Parallel and Dlstnbuted Information Systems (Miami Beach, Fla , Dec.) DOZIER, J. 1992. Access to data in NASAS Earth observing systems. In proceedings of ACM SIGMOD Conference. ACM, New York, 1. EFFELSBERG, W., AND HAERDER, T. 1984. Princi- ples of database buffer management. ACM Trans. Database Syst. 9, 4 (Dee), 560. EN~ODY, R. J., AND Du, H. C. 1988. Dynamic hashing schemes ACM Comput. Suru. 20, 2 (June), 85. ENGLERT, S., GRAY, J., KOCHER, R., AND SHAH, P. 1989. A benchmark of nonstop SQL release 2 demonstrating near-linear speedup and scaleup on large databases. Tandem Computers Tech. Rep. 89,4, Tandem Corp., Cupertino, Calif. EPSTEIN, R. 1979. Techniques for processing of aggregates in relational database systems UCB/ERL Memo. M79/8, Univ. of California, Berkeley, Calif. EPSTEIN, R., AND STON~BRAE~R, M. 1980. Analy- sis of dmtrlbuted data base processing strate- gies. In Proceedings o} the International Con- ference on Very Large Data Bases UWmtreal, Canada, Oct.). VLDB Endowment, 92. EPSTEIN, R , STONE~RAKER, M., AND WONG, E. 1978 Distributed query processing in a relational database system. In Proceedings of ACM SIGMOD Conference. ACM, New York. FAGIN, R,, NUNERGELT, J., PIPPENGER, N., AND STRONG, H. R. 1979. Extendible hashing: A fast access method for dynamic tiles. ACM Trans. Database Syst. 4, 3 (Sept.), 315. FALOUTSOS, C 1985. Access methods for text. ACM Comput. Suru. 17, 1 (Mar.), 49. FALOUTSOS, C,, NG, R., AND SELLIS, T. 1991. Pre- dictive load control for flexible buffer alloca- tion. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 265. FANG, M. T,, LEE, R. C. T., AND CHANG, C, C. 1986. The idea of declustering and its applications. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowmentj 181. FINKEL, R. A., AND BENTLEY, J. L. 1974. Quad trees: A data structure for retrieval on compos- ite keys. Acts Inform atzca 4, 1,1. FREYTAG, J. C., AND GOODMAN, N. 1989 On the translation of relational queries into iteratme programs. ACM Trans. Database Syst. 14, 1 (Mar.), 1. FUSHIMI, S., KITSUREGAWA, M., AND TANAKA, H. 1986. An overview of the system software of a parallel relational database machine GRACE. In Proceedings of the Intern atLona! Conference on Very Large Data Bases Kyoto, Japan, Aug.). ACM, New York, 209. GALLAIRE, H., MINRER, J., AND NICOLAS, J M. 1984. Logic and databases A deductive approach. ACM Comput. Suru. 16, 2 (June), 153 GERBER, R. H. 1986. 
Dataflow query process- ing using multiprocessor hash-partitioned algorithms. Ph.D. thesis, Univ. of Wisconsin—Madison. GHANDEHARIZADEH, S., AND DEWITT, D. J. 1990. Hybrid-range partitioning strategy: A new declustermg strategy for multiprocessor database machines. In Proceedings of the Inter- national Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 481. GOODMAN, J. R., AND WOEST, P. J 1988. The Wisconsin Multicube: A new large-scale cache- coherent multiprocessor. Computer Science Tech Rep. 766, Umv. of Wisconsin—Madison GOUDA, M. G.. AND DAYAL, U. 1981. Optimal semijoin schedules for query processing in local distributed database systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 164, GRAEFE, G. 1993a. Volcano, An extensible and parallel dataflow query processing system. IEEE Trans. Knowledge Data Eng. To be published. GRAEFE, G. 1993b. Performance enhancements for hybrid hash join Available as Computer Science Tech. Rep. 606, Univ. of Colorado, Boulder. GRAEFE, G. 1993c Sort-merge-join: An idea whose time has passed? Revised in Portland State Univ. Computer Science Tech. Rep. 93-4. GRAEFII, G. 1991. Heap-filter merge join A new algorithm for joining medium-size inputs. IEEE Trans. Softw. Eng. 17, 9 (Sept.). 979. GRAEFE, G. 1990a. Parallel external sorting in Volcano. Computer Science Tech. Rep. 459, Umv. of Colorado, Boulder. ACM Computmg Surveys, Vol. 25, No 2, June 1993
  • 91. Query Evaluation Techniques ● 163 GRAEFE, G. 1990b. Encapsulation of parallelism in the Volcano query processing system. In proceedings of ACM SIGMOD Conference, ACM, New York, 102. GRAEFE) G. 1989. Relational division: Four algo- rithms and their performance. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 94. GRAEFE, G., AND COLE, R. L. 1993. Fast algo- rithms for universal quantification in large databases. Portland State Univ. and Univ. of Colorado at Boulder. GRAEFE, G., AND DAVISON, D. L. 1993. Encapsula- tion of parallelism and architecture-indepen- dence in extensible database query processing. IEEE Trans. Softw. Eng. 19, 7 (July). GRAEFE, G., AND DEWITT, D. J. 1987. The EXODUS optimizer generator, In Proceedings of ACM SIGMOD Conference. ACM, New York, 160. GRAEFE, G., .miI) MAIER, D. 1988. Query opti- mization in object-oriented database systems: A prospectus, In Advances in Object-Oriented Database Systems, vol. 334. Springer-Verlag, New York, 358. GRAEFE, G., AND MCKRNNA, W. J. 1993. The Vol- cano optimizer generator: Extensibility and ef- ficient search. In Proceedings of the IEEE Con- ference on Data Engineering. IEEE, New York. GRAEIW, G., AND SHAPIRO, L. D. 1991. Data com- pression and database performance. In Pro- ceedings of the ACM/IEEE-Computer Science Symposium on Applied Computing. ACM IEEE, New York. GRAEFE, G., AND WARD, K. 1989. Dynamic query evaluation plans. In Proceedings of ACM SIGMOD Conference. ACM, New York, 358. GRAEEE,G., ANDWOLNIEWICZ, R. H. 1992. Alge- braic optimization and parallel execution of computations over scientific databases. In Pro- ceedings of the Workshop on Metadata Manage- ment in Sclentrfzc Databases (Salt Lake City, Utah, Nov. 3-5). GRAEFE, G., COLE, R. L., DAVISON, D. L., MCKENNA, W. J., AND WOLNIEWICZ, R. H. 1992. Extensi- ble query optimization and parallel execution in Volcano. In Query Processing for Aduanced Database Appkcatlons. Morgan-Kaufman, San Mateo, Calif. GRAEEE, G., LHVWLLE, A., AND SHAPIRO, L. D. 1993. Sort versus hash revisited. IEEE Trans. Knowledge Data Eng. To be published. GRAY, J. 1990. A census of Tandem system avail- ability between 1985 and 1990, Tandem Computers Tech. Rep. 90.1, Tandem Corp., Cupertino, Calif. GWY, J., AND PUTZOLO, F. 1987. The 5 minute rule for trading memory for disc accesses and the 10 byte rule for trading memory for CPU time. In Proceedings of ACM SIGMOD Confer- ence. ACM, New York, 395. GWY) J., AND REUTER, A. 1991. Transaction Pro- cessing: Concepts and Techniques. Morgan- Kaufman, San Mateo, Calif, GRAY, J., MCJONES, P., BLASGEN, M., LINtEAY, B., LORIE, R., PRICE, T., PUTZOLO, F., AND TRAIGER, I. 1981. The recovery manager of the Sys- tem R database manager. ACM Comput, Sw-u. 13, 2 (June), 223. GRUENWALD, L., AND EICH, M, H. 1991. MMDB reload algorithms. In Proceedings of ACM SIGMOD Conference, ACM, New York, 397. GUENTHER, O., AND BILMES, J. 1991. Tree-based access methods for spatial databases: Imple- mentation and performance evaluation. IEEE Trans. Knowledge Data Eng. 3, 3 (Sept.), 342. GUIBAS, L., AND SEI)GEWICK, R. 1978. A dichro- matic framework for balanced trees. In Pro- ceedings of the 19th SymposL urn on the Founda tions of Computer Science. GUNADHI, H., AND SEGEW, A. 1991. Query pro- cessing algorithms for temporal intersection joins. In Proceedings of the IEEE Conference on Data Engtneermg. IEEE, New York, 336. GITNADHI, H., AND SEGEV) A. 1990. A framework for query optimization in temporal databases. 
In Proceedings of the 5th Zntcrnatzonal Confer- ence on Statistical and Scten tific Database Management. GUNTHER, O. 1989. The design of the cell tree: An object-oriented index structure for geomet- ric databases. In Proceedings of the IEEE Con- ference on Data Engineering. IEEE, New York, 598. GUNTHER, O., AND WONG, E. 1987 A dual space representation for geometric data. In Proceed- ings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 501. Guo, M., SLT, S. Y. W., AND LAM, H. 1991. An association algebra for processing object- oriented databases. In proceedings of the IEEE Conference on Data Engmeermg, IEEE, New York, 23. GUTTMAN, A. 1984. R-Trees: A dynamic index structure for spatial searching. In Proceedings of ACM SIGMOD Conference. ACM, New York, 47. Reprinted in Readings in Database Sys- tems. Morgan-Kaufman, San Mateo, Ccdif., 1988. Hfis, L., CHANG, W., LOHMAN, G., MCPHERSON, J., WILMS, P. F., LAPIS, G., LINDSAY, B., PIRAHESH, H., CAREY, M. J., AND SHEKITA, E. 1990. Starburst mid-flight: As the dust clears. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 143. Hfis, L., FREYTAG, J. C., LOHMAN, G., AND PIRAHESH, H. 1989. Extensible query pro- cessing in Starburst. In Proceedings of ACM SIGMOD Conference. ACM, New York, 377. H.%+s, L. M., SELINGER, P. G., BERTINO, E., DANI~LS, D., LINDSAY, B., LOHMAN, G., MASUNAGA, Y., Mom, C., NG, P., WILMS, P., AND YOST, R. ACM Computing Surveys, Vol. 25, No. 2, June 1993
  • 92. 164 ● Goetz Graefe 1982. R*: A research project on distributed relational database management. IBM Res. Di- vision, San Jose, Calif. HAERDER, T., AND REUTER, A. 1983. Principles of transaction-oriented database recovery. ACM Comput. Suru. 15, 4 (Dec.). HAFEZ, A., AND OZSOYOGLU, G. 1988. Storage structures for nested relations. IEEE Database Eng. 11, 3 (Sept.), 31. HAGMANN, R. B. 1986. An observation on database buffering performance metrics. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 289. HAMMING, R. W. 1977. Digital Filters. Prentice- Hall, Englewood Cliffs, N.J. HANSON, E. N. 1987. A performance analysis of view materialization strategies. In Proceedings of ACM SIGMOD Conference. ACM, New York, 440. HENRICH, A., Stx, H. W., AND WIDMAYER, P. 1989. The LSD tree: Spatial access to multi- dimensional point and nonpoint objects. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 45. HOEL, E. G., AND SAMET, H. 1992. A qualitative comparison study of data structures for large linear segment databases. In %oceedtngs of ACM SIGMOD Conference. ACM. New York, 205. HONG, W., AND STONEBRAKER, M. 1993. Opti- mization of parallel query execution plans in XPRS. Distrib. Parall. Databases 1, 1 (Jan.), 9. HONG, W., AND STONEBRAKRR, M. 1991. Opti- mization of parallel query execution plans in XPRS. In Proceedings of the International Con- ference on Parallel and Distributed Information Systems (Miami Beach, Fla., Dec.). Hou, W. C., AND OZSOYOGLU, G. 1993. Processing time-constrained aggregation queries in CASE- DB. ACM Trans. Database Syst. To be published. Hou, W. C., AND OZSOYOGLU, G. 1991. Statistical estimators for aggregate relational algebra queries. ACM Trans. Database Syst. 16, 4 (Dec.), 600. Hou, W. C., OZSOYOGLU, G., AND DOGDU, E. 1991. Error-constrained COUNT query evaluation in relational databases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 278. HSIAO, H. I., ANDDEWITT, D. J. 1990. Chained declustering: A new availability strategy for multiprocessor database machines. In Proceed- ings of the IEEE Conference on Data Engineer- ing. IEEE, New York, 456. HUA, K. A., m~ LEE, C. 1991. Handling data skew in multicomputer database computers us- ing partition tuning. In Proceedings of the In- ternational Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 525. HUA, K. A., AND LEE, C. 1990. An adaptive data placement scheme for parallel database com- puter systems. In Proceedings of the Interna- tional Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 493, HUDSON, S. E., AND KING, R. 1989. Cactis: A self- adaptive, concurrent implementation of an ob- ject-oriented database management system. ACM Trans. Database Svst. 14, 3 (Sept.), 291, HULL, R., AND KING, R. 1987. Semantic database modeling: Survey, applications, and research issues. ACM Comput. Suru, 19, 3 (Sept.), 201. HUTFLESZ, A., SIX, H. W., AND WIDMAYER) P. 1990. The R-File: An efficient access structure for proximity queries. In proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 372. HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1988a, Twin grid files: Space optimizing access schemes. In Proceedings of ACM SIGMOD Conference. ACM, New York, 183. HUTFLESZ, A., Sm, H, W., AND WIDMAYER, P, 1988b. The twin grid file: A nearly space optimal index structure. In Lecture Notes m Computer Sci- ence, vol. 303, Springer-Verlag, New York, 352. 
IOANNIDIS, Y. E,, AND CHRISTODOULAIiIS, S. 1991, On the propagation of errors in the size of join results. In Proceedl ngs of ACM SIGMOD Con- ference. ACM, New York, 268. IYKR, B. R., AND DIAS, D. M. 1990, System issues in parallel sorting for database systems. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 246. JAGADISH, H. V. 1991, A retrieval technique for similar shapes. In Proceedings of ACM SIGMOD Conference. ACM, New York, 208. JARKE, M., AND KOCH, J. 1984. Query optimiza- tion in database systems. ACM Cornput. Suru. 16, 2 (June), 111. JARKE, M., AND VASSILIOU, Y. 1985. A framework for choosing a database query language. ACM Comput. Sure,. 17, 3 (Sept.), 313. KATz, R. H. 1990. Towards a unified framework for version modeling in engineering databases. ACM Comput. Suru. 22, 3 (Dec.), 375. KATZ, R. H., AND WONG, E. 1983. Resolving con- flicts in global storage design through replica- tion. ACM Trans. Database S.vst. 8, 1 (Mar.), 110. KELLER, T., GRAEFE, G., AND MAIE~, D. 1991. Ef- ficient assembly of complex objects. In Proceed- ings of ACM SIGMOD Conference. ACM. New York, 148. KEMPER, A., AND MOERKOTTE G. 1990a. Access support in object bases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 364. KEMPER, A., AND MOERKOTTE, G. 1990b. Ad- vanced query processing in object bases using access support relations. In Proceedings of the International Conference on Very Large Data ACM Computing Surveys, Vol. 25, No. 2, June 1993
  • 93. Query Evaluation Techniques 9 165 Bases (Brisbane, Australia). VLDB Endow- ment, 290. KEMPER, A.j AND WALI.RATH, M. 1987. An analy- sis of geometric modeling in database systems. ACM Comput. Suru. 19, 1 (Mar.), 47. KEMPER, A., KILGER, C., AND MOERKOTTE, G. 1991. Function materialization in object bases. In Proceedings of ACM SIGMOD Conference. ACM, New York, 258. KRRNIGHAN, B. W., ANDRITCHIE,D. M. 1978. VW C Programming Language. Prentice-Hall, Englewood Cliffs, N.J. KIM, W. 1984. Highly available systems for database applications. ACM Comput. Suru. 16, 1 (Mar.), 71. KIM, W. 1980. A new way to compute the prod- uct and join of relations. In Proceedings of ACM SIGMOD Conference. ACM, New York, 179. KITWREGAWA,M., AND OGAWA,Y, 1990. Bucket spreading parallel hash: A new, robust, paral- lel hash join method for skew in the super database computer (SDC). In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB En- dowment, 210. KI’rSUREGAWA, M., NAKAYAMA, M., AND TAKAGI, M. 1989a. The effect of bucket size tuning in the dynamic hybrid GRACE hash join method. In Proceedings of the International Conference on Very Large Data Bases (Amsterdam, The Netherlands). VLDB Endowment, 257. KITSUREGAWA7 M., TANAKA, H., AND MOTOOKA, T. 1983. Application of hash to data base ma- chine and its architecture. New Gener. Com- p~t. 1, 1, 63. KITSUREGAWA, M., YANG, W., AND FUSHIMI, S. 1989b. Evaluation of 18-stage ~i~eline hard- . . ware sorter. In Proceedings of the 6th In tern a- tional Workshop on Database Machines (Deauville, France, June 19-21). KLUIG, A. 1982, Equivalence of relational algebra and relational calculus query languages having aggregate functions. J. ACM 29, 3 (July), 699. KNAPP, E. 1987. Deadlock detection in dis- tributed databases. ACM Comput. Suru. 19, 4 (Dec.), 303. KNUTH, D, 1973. The Art of Computer Program- ming, Vol. III, Sorting and Searching. Addison-Wesley, Reading, Mass. KOLOVSON, C. P., ANTD STONEBRAKER M. 1991. Segment indexes: Dynamic indexing tech- niques for multi-dimensional interval data. In Proceechngs of ACM SIGMOD Conference. ACM, New York, 138. KC)OI. R. P. 1980. The optimization of queries in relational databases. Ph.D. thesis, Case West- ern Reserve Univ., Cleveland, Ohio. KOOI, R. P., AND FRANKFORTH, D. 1982. Query optimization in Ingres. IEEE Database Eng. 5, 3 (Sept.), 2. KFUEGEL, H. P., AND SEEGER, B. 1988. PLOP- Hashing: A grid file without directory. In Pro- ceedings of the IEEE Conference on Data Engi- neering. IEEE, New York, 369. KRIEGEL, H. P., AND SEEGER, B. 1987. Multidi- mensional dynamic hashing is very efficient for nonuniform record distributions. In Proceed- ings of the IEEE Conference on Data Enginee- ring. IEEE, New York, 10. KRISHNAMURTHY, R., BORAL, H., AND ZANIOLO, C. 1986. Optimization of nonrecursive queries. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 128. KUESPERT, K.j SAAKE, G.j AND WEGNER, L. 1989. Duplicate detection and deletion in the ex- tended NF2 data model. In Proceedings of the 3rd International Conference on the Founda- tions of Data Organization and Algorithms (Paris, France, June). KUMAR, V., AND BURGER, A. 1991. Performance measurement of some main memory database recovery algorithms. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 436. LAKSHMI,M. S., AND Yu, P. S. 1990. Effectiveness of parallel joins. IEEE Trans. Knowledge Data Eng. 2, 4 (Dec.), 410. LAIWHMI, M. S., AND Yu, P. S, 1988. 
Effect of skew on join performance in parallel architec- tures. In Proceedings of the International Sym- posmm on Databases in Parallel and Dis- tributed Systems (Austin, Tex., Dec.), 107. LANKA, S., AND MAYS, E. 1991. Fully persistent B + -trees. In Proceedings of ACM SIGMOD Conference. ACM, New York, 426. LARSON, P. A. 1981. Analysis of index-sequential files with overflow chaining. ACM Trans. Database Syst. 6, 4 (Dec.), 671. LARSON, P., AND YANG, H. 1985. Computing queries from derived relations. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 259. LEHMAN, T. J., AND CAREY, M. J. 1986. Query processing in main memory database systems. In Proceedings of ACM SIGMOD Conference. ACM, New York, 239. LELEWER, D. A., AND HIRSCHBERG, D. S. 1987. Data compression. ACM Comput. Suru. 19, 3 (Sept.), 261. LEUNG, T. Y. C., AND MUNTZ, R. R. 1992. Tempo- ral query processing and optimization in multi- processor database machines. In Proceedings of the International Conference on Very Large Data Bases (Vancouver, BC, Canada). VLDB Endowment, 383. LEUNG, T. Y. C., AND MUNTZ, R. R. 1990. Query processing in temporal databases. In Proceed- ings of the IEEE Conference on Data Englneer- mg. IEEE, New York, 200. ACM Com~utina Survevs, Vol 25, No. 2, June 1993
  • 94. 166 “ Goetz Graefe LI, K., AND NAUGHTON, J. 1988. Multiprocessor main memory transaction processing. In Pro- ceedings of the Intcrnatlonal Symposium on Databases m Parallel and Dlstrlbuted Systems (Austin, Tex., Dec.), 177. LITWIN, W. 1980. Linear hashing: A new tool for file and table addressing. In Proceedings of the International Conference on Ver<y Large Data Bases (Montreal, Canada, Oct.). VLDB Endow- ment, 212. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif LJTWIN, W., MARK L., AND ROUSSOPOULOS, N. 1990. Interoperability of multiple autonomous databases. ACM Comput. Suru. 22, 3 (Sept.). 267. LOHMAN, G., MOHAN, C., HAAs, L., DANIELS, D., LINDSAY, B., SELINGER, P., AND WILMS, P. 1985. Query processing m R’. In Query Processuzg m Database Systems. Springer, Berlin, 31. LOMET, D. 1992. A review of recent work on multi-attribute access methods. ACM SIGMOD Rec. 21, 3 (Sept.), 56. LoM~T, D., AND SALZBER~, B 1990a The perfor- mance of a multiversion access method. In Pro- ceedings of ACM SIGMOD Conference ACM, New York, 353 LOM~T, D. B , AND SALZEER~, B. 1990b. The hB- tree A multlattrlbute mdexing method with good guaranteed performance. ACM Trans. Database Syst. 15, 4 (Dec.), 625. LORIE, R. A., AND NILSSON, J, F, 1979. AU access specification language for a relational database management system IBM J. Res. Deuel. 23, 3 (May), 286 LDRIE, R, A,, AND YOUNG, H. C. 1989. A low com- munication sort algorithm for a parallel database machme. In Proceedings of the Inter- national Conference on Very Large Data Bases (AmAwdam. The Netherlands). VLDB Endow- ment, 125. LYNCH, C A., AND BROWNRIGG, E. B. 1981. Appli- cation of data compression to a large biblio- graphic data base In Proceedings of the Inter- national Conference on Very Large Data Base,~ (Cannes, France, Sept.). VLDB Endowment, 435 L~~INEN, K. 1987. Different perspectives on in- formation systems: Problems and solutions. ACM Comput. Suru. 19, 1 (Mar.), 5. MACVCEVZT,L. F., AND LOHMAN, G. M. 1989 Index scans using a finite LRU buffer: A validated 1/0 model. ACM Trans Database S.vst. 14, 3 (Sept.), 401. MAJER, D, 1983. The Theory of Relational Databases. CS Press, Rockville, Md. MAI~R, D., AND STEIN, J. 1986 Indexing m an object-oriented database management. In Pro- ceedings of the Intern at[on al Workshop on Ob ]ect-(hented Database Systems (Pacific Grove, Calif, Sept ), 171 MAIER, D., GRAEFE, G., SHAFIRO, L., DANIELS, S., KELLER, T., AND VANCE, B. 1992 Issues in distributed complex object assembly In Prc~- ceedings of the Workshop on Distributed Object Management (Edmonton, BC, Canada, Aug.). MANNINO, M. V., CHU, P., ANrI SAGER, T. 1988. Statistical profile estimation m database sys- tems. ACM Comput. Suru. 20, 3 (Sept.). MCKENZIE, L. E., AND SNODGRASS, R. T. 1991. Evaluation of relational algebras incorporating the time dimension m databases. ACM Co~n- put. Suru. 23, 4 (Dec.). MEDEIROS, C., AND TOMPA, F. 1985. Understand- ing the implications of view update pohcles, In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 316. MENON, J. 1986. A study of sort algorithms for multiprocessor database machines. In Proceed- Ings of the International Conference on Very Large Data bases (Kyoto, Japan, Aug ) VLDB Endowment, 197 MISHRA, P., AND EICH, M. H. 1992. Join process- ing in relational databases. ACM Comput. Suru. 24, 1 (Mar.), 63 MITSCHANG, B. 1989. Extending the relational al- gebra to capture complex objects. 
In Proceed- ings of the International Conference on Very Large Data Bases (Amsterdam, The Nether- lands). VLDB Endowment, 297. MOHAN, C., HADERLE, D., WANG, Y., AND CHENG, J. 1990. Single table access using multiple in- dexes: Optimization, execution and concur- rency control techniques. In Lecture Notes m Computer Sc~ence, vol. 416. Springer-Verlag, New York, 29. MOTRO, A. 1989. An access authorization model for relational databases based on algebraic ma- mpulation of view definitions In Proceedl ngs of the IEEE Conferen m on Data Engw-um-mg IEEE, New York, 339 MULLIN, J K. 1990. Optimal semijoins for dis- tributed database systems. IEEE Trans. Softw. Eng. 16, 5 (May), 558. NAKAYAMA, M.. KITSUREGAWA. M., AND TAKAGI, M. 1988. Hash-partitioned jom method using dy- namic dcstaging strategy, In Proeeedmgs of the Imternatronal Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 468. NECHES, P M. 1988. The Ynet: An interconnect structure for a highly concurrent data base computer system. In Proceedings of the 2nd Symposium on the Frontiers of Massiuel.v Par- allel Computatl on (Fairfax, Virginia, Ott.). NECHES, P M. 1984. Hardware support for ad- vanced data management systems. IEEE Com put. 17, 11 (Nov.), 29. NEUGEBAUER, L. 1991 Optimization and evalua- tion of database queries mcludmg embedded interpolation procedures. In Proceedings of ACM Computmg Surveys, Vol 25, No 2. June 1993
  • 95. Query Evaluation Techniques ● 167 ACM SIGMOD Conference. ACM, New York, 118. NG, R., FALOUTSOS, C., ANDSELLIS,T. 1991. Flex- ible buffer allocation based on marginal gains. In Proceedings of ACM SIGMOD Conference. ACM, New York, 387. NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K. C. 1984. The grid file: An adaptable, sym- metric multikey file structure. ACM Trans. Database Syst. 9, 1 (Mar.), 38. NYBERG,C., BERCLAY, T., CVETANOVIC, Z., GRAY.J., ANDLOMET,D. 1993. AlphaSort: A RISC ma- chine sort. Tech. Rep. 93.2. DEC San Francisco Systems Center. Digital Equipment Corp., San Francisco. OMIECINSK1, E. 1991. Performance analysis of a load balancing relational hash-join algorithm for a shared-memory multiprocessor. In Pro- ceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 375. OMIECINSKLE. 1985. Incremental file reorgani- zation schemes. In Proceedings of the Interna- tional Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 346. OMIECINSKI, E., AND LIN, E. 1989. Hash-based and index-based join algorithms for cube and ring connected multicomputers. IEEE Trans. Knowledge Data Eng. 1, 3 (Sept.), 329. ONO, K., AND LOHMAN, G. M. 1990. Measuring the complexity of join enumeration in query optimization. In Proceedings of the Interna- tional Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 314. OUSTERHOUT, J. 1990. Why aren’t operating sys- tems getting faster as fast as hardware. In USENIX Summer Conference (Anaheim, Calif., June). USENIX. O~SOYOGLU, Z. M., AND WANG, J. 1992. A keying method for a nested relational database man- agement system. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 438. OZSOYOGLU, G., OZSOYOGLU, Z. M., ANDMATOS,V. 1987. Extending relational algebra and rela- tional calculus with set-valued attributes and aggregate functions. ACM Trans. Database Syst. 12, 4 (Dec.), 566. Ozsu, M. T., AND VALDURIEZ, P. 1991a. Dis- tributed database systems: Where are we now. IEEE Comput. 24, 8 (Aug.), 68. Ozsu, M. T., AND VALDURIEZ, P. 1991b. Principles of Distributed Database Systems. Prentice-Hall, Englewood Cliffs, N.J. PALMER, M., AND ZDONIK, S. B. 1991. FIDO: A cache that learns to fetch. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 255. PECKHAM, J., AND MARYANSKI, F. 1988. Semantic data models. ACM Comput. Suru. 20, 3 (Sept.), 153. PmAHESH, H., MOHAN, C., CHENG, J., LIU, T. S., AND SELINGER, P. 1990. Parallelism in relational data base systems: Architectural issues and design approaches. In Proceedings of the Inter- national Symposwm on Databases m Parallel and Distributed Systems (Dublin, Ireland, July). QADAH, G. Z. 1988. Filter-based join algorithms on uniprocessor and dmtributed-memory multi- processor database machines. In Lecture Notes m Computer Science, vol. 303. Springer-Verlag, New York, 388. REW, R. K., AND DAVIS, G. P. 1990. The Unidata NetCDF: Software for scientific data access. In the 6th International Conference on Interactwe Information and Processing Swstems for Me- teorology, Ocean ography,- aid Hydrology (Anaheim, Calif.). RICHARDSON, J. E., AND CAREY, M. J. 1987. Pro- gramming constructs for database system im- plementation m EXODUS. In Proceedings of ACM SIGMOD Conference. ACM, New York, 208. RICHAR~SON, J. P., Lu, H., AND MIKKILINENI, K. 1987. Design and evaluation of parallel pipelined join algorithms. In Proceedings of ACM SIGMOD Conference. ACM, New York, 399. 
ROBINSON, J. T. 1981. The K-D-B-Tree: A search structure for large multidimensional dynamic indices. ln proceedings of ACM SIGMOD Con- ference. ACM, New York, 10. ROSENTHAL, A., ANDREINER,D. S. 1985. Query- ing relational views of networks. In Query Pro- cessing in Database Systems. Springer, Berlin, 109. ROSENTHAL, A., RICH, C., AND SCHOLL, M. 1991. Reducing duplicate work in relational join(s): A modular approach using nested relations. ETH Tech. Rep., Zurich, Switzerland. ROTEM, D., AND SEGEV, A. 1987. Physical organi- zation of temporal data. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 547. ROTH, M. A., KORTH, H. F., AND SILBERSCHATZ, A. 1988. Extended algebra and calculus for nested relational databases. ACM Trans. Database Syst. 13, 4 (Dec.), 389. ROTHNIE, J. B., BERNSTEIN, P. A., Fox, S., GOODMAN, N., HAMMER, M., LANDERS, T. A., REEVE, C., SHIPMAN, D. W., AND WONG, E. 1980. Intro- duction to a system for distributed databases (SDD-1). ACM Trans. Database Syst. 5, 1 (Mar.), 1. ROUSSOPOULOS, N. 1991. An incremental access method for ViewCache: Concept, algorithms, and cost analysis. ACM Trans. Database Syst. 16, 3 (Sept.), 535. ROUSSOPOULOS, N., AND KANG, H. 1991. A pipeline N-way join algorithm based on the ACM Computing Surveys, Vol. 25. No. 2, June 1993
  • 96. 168 “ Goetz Graefe 2-way semijoin program. IEEE Trans Knoul- edge Data Eng. 3, 4 (Dec.), 486. RUTH, S. S , AND KEUTZER, P J 1972. Data com- pression for business files. Datamatlon 18 (Sept.), 62. SAAKD, G., LINNEMANN, V., PISTOR, P , AND WEGNER, L. 1989. Sorting, grouping and duplicate elimination in the advanced information man- agement prototype. In Proceedl ngs of th e Inter- national Conference on Very Large Data Bases VLDB Endowment, 307 Extended version in IBM Sci. Ctr. Heidelberg Tech. Rep 8903.008, March 1989. S.4CX:0, G 1987 Index access with a finite buffer. In Procecdmgs of the International Conference on Very Large Data Bases (Brighton, England, Aug.) VLDB Endowment. 301. SACCO, G. M., .4ND SCHKOLNIK, M. 1986, Buffer management m relational database systems. ACM Trans. Database Syst. 11, 4 (Dec.), 473. SACCO, G M , AND SCHKOLNI~, M 1982. A mech- anism for managing the buffer pool m a rela- tional database system using the hot set model. In Proceedings of the International Conference on Very Large Data Bases (Mexico City, Mex- ico, Sept.). VLDB Endowment, 257. SACXS-DAVIS, R., AND RAMMIOHANARAO, K. 1983. A two-level superimposed coding scheme for partial match retrieval. Inf. Syst. 8, 4, 273. S.ACXS-DAVIS, R., KENT, A., ANn RAMAMOHANAR~O, K 1987. Multikey access methods based on su- perimposed coding techniques. ACM Trans Database Syst. 12, 4 (Dec.), 655 SALZBERG, B. 1990 Mergmg sorted runs using large main memory Acts Informatica 27, 195 SALZBERG, B. 1988, Fde Structures: An Analytlc Approach. Prentice-Hall, Englewood Cliffs, NJ. SALZBER~,B , TSU~~RMAN, A., GRAY, J., STEWART, M., UREN, S., ANrJ VAUGHAN, B. 1990 Fast- Sort: A distributed single-input single-output external sort In Proceeduzgs of ACM SIGMOD Conference ACM, New York, 94. SAMET, H. 1984. The quadtree and related hier- archical data structures. ACM Comput. Saru. 16, 2 (June), 187, SCHEK, H. J., ANII SCHOLL, M. H. 1986. The rela- tional model with relation-valued attributes. Inf. Syst. 11, 2, 137. SCHNEIDER, D. A. 1991. Blt filtering and multi- way join query processing. Hewlett-Packard Labs, Palo Alto, Cahf. Unpublished Ms SCHNEIDER, D. A. 1990. Complex query process- ing in multiprocessor database machines. Ph.D. thesis, Univ. of Wmconsin-Madison SCHNF,mER. D. A., AND DEWITT, D. J. 1990. Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. In Proceedings of the Interna- tional Conference on Very Large Data Bases (Brisbane, Austraha). VLDB Endowment, 469 SCHNEIDER, D., AND DEWITT, D. 1989. A perfor- mance evaluation of four parallel join algo- rithms in a shared-nothing multiprocessor environment. In Proceedings of ACM SIGMOD Conference ACM, New York, 110 SCHOLL, M. H, 1988. The nested relational model —Efficient support for a relational database interface. Ph.D. thesis, Technical Umv. Darm- stadt. In German. SCHOLL, M., PAUL, H. B., AND SCHEK, H. J 1987 Supporting flat relations by a nested relational kernel. In Proceedings of the International Conference on Very Large Data Bases (Brigh- ton, England, Aug ) VLDB Endowment. 137. SEEGER, B., ANU LARSON, P A 1991. Multl-disk B-trees. In Proceedings of ACM SIGMOD Con- ference ACM, New York. 436. ,SEG’N, A . AND GtTN.ADHI. H. 1989. Event-Join op- timization in temporal relational databases. In Proceedings of the IrLternatlOnctl Conference on Very Large Data Bases (Amsterdam, The Netherlands), VLDB Endowment, 205. SELINGiZR, P. G., ASTRAHAN, M. M , CHAMBERLAIN. D. D., LORIE, R. A., AND PRWE, T. G. 
1979 Access path selectlon m a relational database management system In Proceedings of AdM SIGMOD Conference. ACM, New York, 23. Reprinted in Readings m Database Sy~tems Morgan-Kaufman, San Mateo, Calif., 1988. SELLIS. T. K. 1987 Efficiently supporting proce- dures in relational database systems In Pro- ceedl ngs of ACM SIGMOD Conference. ACM. New York, 278. SEPPI, K., BARNES, J,, AND MORRIS, C. 1989, A Bayesian approach to query optimization m large scale data bases The Univ. of Texas at Austin ORP 89-19, Austin. SERLIN, 0. 1991. The TPC benchmarks. In Database and Transaction Processing System Performance Handbook. Morgan-Kaufman, San Mateo, Cahf SESHAnRI, S , AND NAUGHTON, J F. 1992 Sam- pling issues in parallel database systems In Proceedings of the International Conference on Extending Database Technology (Vienna, Austria, Mar.). SEVERANCE. D. G. 1983. A practitioner’s guide to data base compression. Inf. S.yst. 8, 1, 51. SErERANCE, D., AND LOHMAN, G 1976. Differen- tial files: Their application to the maintenance of large databases ACM Trans. Database Syst. 1.3 (Sept.). SEWERANCE, C., PRAMANK S., AND WOLBERG, P. 1990. Distributed linear hashing and parallel projection in mam memory databases. In Pro- ceedings of the Internatl onal Conference on Very Large Data Bases (Brmbane, Australia) VLDB Endowment, 674. SHAPIRO, L. D. 1986. Join processing in database systems with large main memories. ACM Trans. Database Syst. 11, 3 (Sept.), 239. SHAW, G. M., AND ZDONIIL S. B. 1990. A query ACM Camputmg Surveys, Vol. 25, No. 2, June 1993
  • 97. Query Evaluation Techniques ● 169 algebra for object-oriented databases. In Pro. ceedings of the IEEE Conference on Data Engl. neering. IEEE, New York, 154. SHAW, G., AND Z~ONIK, S. 1989a. An object- oriented query algebra. IEEE Database Eng. 12, 3 (Sept.), 29. SIIAW, G. M., AND ZDUNIK, S. B. 1989b. AU object-oriented query algebra. In Proceedings of the 2nd International Workshop on Database Programming Languages. Morgan-Kaufmann, San Mateo, Calif., 103. SHEKITA, E. J., AND CAREY, M. J. 1990. A perfor- mance evaluation of pointer-based joins. In Proceedings of ACM SIGMOD (70nferen ce. ACM, New York, 300. SHERMAN, S. W., AND BRICE, R. S. 1976. Perfor- mance of a database manager in a virtual memory system. ACM Trans. Data base Syst. 1, 4 (Dec.), 317. SHETH, A. P., AND LARSON, J. A. 1990, Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22, 3 (Sept.), 183. SHIPMAN, D. W. 1981. The functional data model and the data 1anguage DAPLEX. ACM Trans. Database Syst. 6, 1 (Mar.), 140. SIKELER, A. 1988. VAR-PAGE-LRU: A buffer re- placement algorithm supporting different page sizes. In Lecture Notes in Computer Science, vol. 303. Springer-Verlag, New York, 336. SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN, J. 1991. Database systems: Achievements and opportunities. Commun. ACM 34, 10 (Oct.), 110, SIX, H. W., AND WIDMAYER, P. 1988. Spatial searching in geometric databases. In Proceed- ings of the IEEE Conference on Data Enginee- ring. IEEE, New York, 496. SMITH,J. M., ANDCHANG, P. Y. T. 1975. Optimiz- ing the performance of a relational algebra database interface. Commun. ACM 18, 10 (Oct.), 568. SNODGRASS. R, 1990. Temporal databases: Status and research directions. ACM SIGMOD Rec. 19, 4 (Dec.), 83. SOCKUT, G. H., AND GOLDBERG, R. P. 1979. Database reorganization—Principles and prac- tice. ACM Comput. Suru. 11, 4 (Dec.), 371. SRINIVASAN, V., AND CAREY, M. J. 1992. Perfor- mance of on-line index construction algorithms. In Proceedings of the International Conference on Extending Database Technology (Vienna, Austria, Mar.). SRINWASAN, V., AND CAREY, M. J. 1991. Perfor- mance of B-tree concurrency control algo- rithms. In Proceedings of ACM SIGMOD Conference. ACM, New York, 416. STAMOS, J. W., ANDYOUNG,H. C. 1989. A sym- metric fragment and replicate algorithm for distributed joins. Tech. Rep. RJ7 188, IBM Re- search Labs, San Jose, Calif. STONEBRAKER, M. 1991. Managing persistent ob- jects in a multi-level store. In Proceedings of ACM SIGMOD Conference. ACM, New York, 2. STONEBRAKER, M. 1987. The design of the POST- GRES storage system, In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB En- dowment, 289. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Ma- teo, Calif,, 1988. STONEBRAKER, M. 1986a. The case for shared- nothing. IEEE Database Eng. 9, 1 (Mar.), STONEBRAKER, M, 1986b. The design and imple- mentation of distributed INGRES. In The INGRES Papers. Addison-Wesley, Reading, Mass., 187. STONEBRAKER, M. 1981. Operating system sup- port for database management. Comrnun. ACM 24, 7 (July), 412. STONEBRAKER, M. 1975. Implementation of in- tegrity constraints and views by query modifi- cation. In Proceedings of ACM SIGMOD Con- ference ACM, New York. STONEBRAKER, M., AOKI, P., AND SELTZER, M. 1988a. Parallelism in XPRS. UCB/ERL Memorandum M89 16, Univ. of California, Berkeley. STONEBRAWCR, M., JHINGRAN, A., GOH, J., AND POTAMIANOS, S. 1990a. 
On rules, procedures, caching and views in data base systems In Proceedings of ACM SIGMOD Conference. ACM, New York, 281 STONEBRAKER, M., KATZ, R., PATTERSON, D., AND OUSTERHOUT, J. 1988b. The design of XPRS, In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 318. STONEBBAKER, M., ROWE, L. A., AND HIROHAMA, M. 1990b. The implementation of Postgres. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 125. STRAUBE, D. D., AND OLSU, M. T. 1989. Query transformation rules for an object algebra, Dept. of Computing Sciences Tech, Rep. 89-23, Univ. of Alberta, Alberta, Canada. Su, S. Y. W. 1988. Database Computers: Princ- iples, Archltectur-es and Techniques. McGraw- Hill, New York. TANSEL, A. U., AND GARNETT, L. 1992. On Roth, Korth, and Silberschat,z’s extended algebra and calculus for nested relational databases. ACM Trans. Database Syst. 17, 2 (June), 374. TEOROy, T. J., YANG, D., AND FRY, J. P. 1986. A logical design metb odology for relational databases using the extended entity-relation- ship model. ACM Cornput. Suru. 18, 2 (June). 197. TERADATA. 1983. DBC/1012 Data Base Com- puter, Concepts and Facilities. Teradata Corpo- ration, Los Angeles. THOMAS, G., THOMPSON, G. R., CHUNG, C. W., BARKMEYER, E., CARTER, F., TEMPLETON, M., ACM Computing Surveys, Vol. 25, No. 2, June 1993
  • 98. 170 “ Goetz Graefe Fox, S., AND HARTMAN, B. 1990. Heteroge- neous distributed database systems for produc- tion use, ACM Comput. Surv. 22. 3 (Sept.), 237. TOMPA, F. W., AND BLAKELEY, J A. 1988. Main- taining materialized views without accessing base data. Inf Swt. 13, 4, 393. TRAI~ER, 1. L. 1982. Virtual memory manage- ment for data base systems ACM Oper. S.vst. Reu. 16, 4 (Oct.), 26. TRIAGER, 1. L., GRAY, J., GALTIERI, C A., AND LINDSAY, B, G. 1982. Transactions and con- sistency in distributed database systems. ACM Trans. Database Syst. 7, 3 (Sept.), 323. TSUR, S., AND ZANIOLO, C. 1984 An implementa- tion of GEM—Supporting a semantic data model on relational back-end. In Proceedings of ACM SIGMOD Conference. ACM, New York, 286. TUKEY, J, W, 1977. Exploratory Data Analysls. Addison-Wesley, Reading, Mass, UNTDATA 1991. NetCDF User’s Guide, An Inter- face for Data Access, Verszon III. NCAR Tech Note TS-334 + 1A, Boulder, Colo VALDURIIIZ, P. 1987. Join indices. ACM Trans. Database S.yst. 12, 2 (June). 218. VANDEN~ERC, S. L., AND DEWITT, D. J. 1991. Al- gebraic support for complex objects with ar- rays, identity, and inhentance. In Proceedmg.s of ACM SIGMOD Conference. ACM, New York, 158, WALTON, C B. 1989 Investigating skew and scalabdlty in parallel joins. Computer Science Tech. Rep. 89-39, Umv. of Texas, Austin. WALTON, C, B., DALE, A. G., AND JENEVEIN, R. M. 1991. A taxonomy and performance model of data skew effects in parallel joins. In Proceed- ings of the Interns tlona 1 Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 537. WHANG. K. Y.. AND KRISHNAMURTHY, R. 1990. Query optimization m a memory-resident do- mam relational calculus database system. ACM Trans. Database Syst. 15, 1(Mar.), 67. WHANG, K. Y., WIEDERHOLD G., AND SAGALOWICZ, D. 1985 The property of separabdity and Its ap- plication to physical database design. In Query Processing m Database Systems. Springer, Berlin, 297. WHANG, K. Y., WIEDERHOLD, G., AND SAGLOWICZ, D. 1984. Separability—An approach to physical database design. IEEE Trans. Comput. 33, 3 (Mar.), 209. WILLIANLS,P., DANIELS, D., HAAs, L., LAPIS, G., LINDSAY,B., NG, P., OBERMARC~, R., SELINGER, P., WALKER, A., WILMS, P., AND YOST, R. 1982 R’: An overview of the architecture. In Im- provmg Database Usabdlty and Responslue - ness. Academic Press, New York. Reprinted in Readings m Database Systems. Morgan-Kauf- man, San Mateo, Calif., 1988 WILSCHUT, A. N. 1993. Parallel query execution in a mam memory database system. Ph.D. the- sis, Univ. of Tweuk, The Netherlands. WiLSCHUT, A. N., AND AP~RS, P. M. G. 1993. Dataflow query execution m a parallel main- memory environment. Distrlb. Parall. Databases 1, 1 (Jan.), 103. WOLF, J. L , DIAS, D. M , AND Yu, P. S. 1990. An effective algorithm for parallelizing sort merge in the presence of data skew In Proceedl ngs of the International Syrnpo.?lum on Data base~ 1n Parallel and DLstrlbuted Systems (Dubhn, Ireland, July) WOLF, J. L., DIAS, D M., Yu, P. S . AND TUREK, ,J. 1991. An effective algorithm for parallelizmg hash Joins in the presence of data skew. In Proceedings of the IEEE Conference on Data Engtneermg. IEEE, New York. 200 WOLNIEWWZ, R. H., AND GRAEFE, G. 1993 Al- gebralc optlmlzatlon of computations over scientific databases. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment. WONG, E., ANO KATZ. R. H. 1983. Dlstributmg a database for parallelism. In Proceedings of ACM SIGMOD Conference. ACM, New York, 23, WONG, E., AND Youssmv, K. 
Received January 1992; final revision accepted February 1993.