ChatGPT解决这个技术问题 Extra ChatGPT

When and why are database joins expensive?

I'm doing some research into databases and I'm looking at some limitations of relational DBs.

I'm getting that joins of large tables is very expensive, but I'm not completely sure why. What does the DBMS need to do to execute a join operation, where is the bottleneck? How can denormalization help to overcome this expense? How do other optimization techniques (indexing, for example) help?

Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I know where to find that already.

In relation to this, I'm wondering about the denormalized approaches used by cloud service databases like BigTable and SimpleDB. See this question.

Are you also looking into the benefits? ;)
I'm am looking into an objective (if there is such a thing) comparison. Pro's, con's, what-have-you's.
The pre-rendered approaches of cloud computing are predicated on being able to bet every which way, avoiding the "wrong join" problem. Google has some whitepapers on their own systems. Quite interesting - ways to extend applicability of the special cases.
@PeterWone - care to provide a reference to some of those papers? p.s. to answer the question in your profile, Android is Open Source - well, at least partially, so the geeks jumped on that bandwagon. Seen as technically advanced by the great unwashed, they were followed lemming-like into Google's tight and sweaty embrace! Betamax anyone? Closer to my own heart (and generation), how did MySQL (with no FOREGIN KEYs FFS) become (and remain) the world's most popular "R"DBMS when it had competition from PostgreSQL (no native Windows version) and Firebird (Opensourcing fiasco), or even SQLite?
Needless to say, I regard PostgreSQL and Firebird as vastly superior to MySQL for multi-user systems and SQLite as stellar in the single user sphere. SQLite handles the sqlite.org site (400,00 hits a day!).

2
20 revs, 6 users 88%

Denormalising to improve performance? It sounds convincing, but it doesn't hold water.

Chris Date, who in company with Dr Ted Codd was the original proponent of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.

I think he wrote it up in Relational Database Writings 1988-1991 but this book was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design, in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.

He found that:

Some of them hold for special cases

All of them fail to pay off for general use

All of them are significantly worse for other special cases

It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.

Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.

Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.

It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.

There are fewer than 200 rows in the relation (in this case a scan will be cheaper)

There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)

A type coercion is required before the columns can be compared (WTF?! fix it or go home) SEE END NOTES FOR ADO.NET ISSUE

One of the arguments of the comparison is an expression (no index)

Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.

If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerates the spread of Date's findings.

For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.

You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.

I'd also like to respond to

Joins are just cartesian products with some lipgloss

What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.

The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B

Notes

David Aldridge provides some important additional information.

There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.

A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.

I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.

"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.

The reason that I responded so savagely was that the statement as worded says that

Joins are cartesian products...

This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.

I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.

The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The C term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.

Sometimes denormalisation is a good idea

Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.

You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.

A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).

An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.

Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?

Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.

ADO.NET problem with type mismatches

Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.

VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.

"Count the disk hits" (Rick James)

If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.

If a "normalized" schema causes JOINs to hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.

Comment from original author: Modern database engines are very good at organising access sequencing to minimise cache misses during join operations. The above, while true, might be miscontrued as implying that joins are necessarily problematically expensive on large data. This would lead to cause poor decision-making on the part of inexperienced developers.


Sonme of these statements are specific to a particular DBMS, aren't they? eg. "There are fewer than 200 rows in the relation"
Does the use of surrogate keys (or not) influence all this significantly?
The great E F Codd is solely responsible for the Relational Model. C J Date, and more recently H Darwen, are both idiots, who do not understand the RM, and provide masses of information on "how to improve" the RM, all of which can be dismissed, because one cannot fix what one does not understand. They serve only to damage the relevance of the RM, by suggesting that there is something "missing".
Also, don't forget that many NoSQL databases are essentially the same databases that we discarded 40 years ago. Young people always think they've discovered something new. Fabian Pascal: dbdebunk.com/2014/02/thinking-logically-sql-nosql-and.html
Aggressive. It was a good account, but the aggression's and micro-aggression's don't add to the content, or the value of the content.
D
David Aldridge

What most commenters fail to note is the wide range of join methodologies available in a complex RDBMS, and the denormalisers invariably gloss over the higher cost of maintaining denormalised data. Not every join is based on indexes, and databases have a lot of optimised algotithms and methodologies for joining that are intended to reduce join costs.

In any case, the cost of a join depends on its type and a few other factors. It needn't be expensive at all - some examples.

A hash join, in which bulk data is equijoined, is very cheap indeed, and the cost only become significant if the hash table cannot be cached in memory. No index required. Equi-partitioning between the joined data sets can be a great help.

The cost of a sort-merge join is driven by the cost of the sort rather than the merge -- an index-based access method can virtually eliminate the cost of the sort.

The cost of a nested loop join on an index is driven by the height of the b-tree index and the access of the table block itself. It's fast, but not suitable for bulk joins.

A nested loop join based on a cluster is much cheaper, with fewer logicAL IO'S required per join row -- if the joined tables are both in the same cluster then the join becomes very cheap through the colocation of joined rows.

Databases are designed to join, and they're very flexible in how they do it and generally very performant unless they get the join mechanism wrong.


I think it comes down to "if in doubt, ask your DBA". Modern databases are complex beasts and do require study to understand. I've only been using Oracle since 1996 and it's a full time job keeping up with the new features. SQLserver has also come along hugely way since 2005. It's not a black box!
Hmmm, well in my humble experience there are too many DBA's out there who have never heard of a hash join, or think that they're a Universally Bad Thing.
J
Joel Coehoorn

I think the whole question is based on a false premise. Joins on large tables are not necessarily expensive. In fact, doing joins efficiently is one of the main reasons relational databases exist at all. Joins on large sets often are expensive, but very rarely do you want to join the entire contents of large table A with the entire contents of large table B. Instead, you write the query such that only the important rows of each table are used and the actual set kept by the join remains smaller.

Additionally, you have the efficiencies mentioned by Peter Wone, such that only the important parts of each record need be in memory until the final result set is materialized. Also, in large queries with many joins you typically want to start with the smaller table sets and work your way up to the large ones, so that the set kept in memory remains as small as possible as long as possible.

When done properly, joins are generally the best way to compare, combine, or filter on large amounts of data.


@joel. The converse is also true. Large dataset joins can be expensive and are sometimes required, but you don't want to do it too often unless a) you can handle the IO and RAM needed and b) you're not doing it too often. Consider materialised views, reporting systems, realtime vs CoB reports.
M
Mark Brackett

The bottleneck is pretty much always disk I/O, and even more specifically - random disk I/O (by comparison, sequential reads are fairly fast and can be cached with read ahead strategies).

Joins can increase random seeks - if you're jumping around reading small parts of a large table. But, query optimizers look for that and will turn it into a sequential table scan (discarding the unneeded rows) if it thinks that'd be better.

A single denormalized table has a similar problem - the rows are large, and so less fit on a single data page. If you need rows that are located far from another (and the large row size makes them further apart) then you'll have more random I/O. Again, a table scan may be forced to avoid this. But, this time, your table scan has to read more data because of the large row size. Add to that the fact that you're copying data from a single location to multiple locations, and the RDBMS has that much more to read (and cache).

With 2 tables, you also get 2 clustered indexes - and can generally index more (because of less insert/update overhead) which can get you drastically increased performance (mainly, again, because indexes are (relatively) small, quick to read off disk (or cheap to cache), and lessen the amount of table rows you need to read from disk).

About the only overhead with a join comes from figuring out the matching rows. Sql Server uses 3 different types of joins, mainly based on dataset sizes, to find matching rows. If the optimizer picks the wrong join type (due to inaccurate statistics, inadequate indexes, or just an optimizer bug or edge case) it can drastically affect query times.

A loop join is farily cheap for (at least 1) small dataset.

A merge join requires a sort of both datasets first. If you join on an indexed column, though, then the index is already sorted and no further work needs to be done. Otherwise, there is some CPU and memory overhead in sorting.

The hash join requires both memory (to store the hashtable) and CPU (to build the hash). Again, this is fairly quick in relation to the disk I/O. However, if there's not enough RAM to store the hashtable, Sql Server will use tempdb to store parts of the hashtable and the found rows, and then process only parts of the hashtable at a time. As with all things disk, this is fairly slow.

In the optimal case, these cause no disk I/O - and so are negligible from a performance perspective.

All in all, at worst - it should actually be faster to read the same amount of logical data from x joined tables, as it is from a single denormalized table because of the smaller disk reads. To read the same amount of physical data, there could be some slight overhead.

Since query time is usually dominated by I/O costs, and the size of your data does not change (minus some very miniscule row overhead) with denormalization, there's not a tremendous amount of benefit to be had by just merging tables together. The type of denormalization that tends to increase performance, IME, is caching calculated values instead of reading the 10,000 rows required to calculate them.


Reducing random seeks: good point, although a good RAID controller with a big cache will do elevator read/write.
The best answer in the thread! Covered the most significant aspects and their effect on disk, CPU and RAM. Though, the conclusion about denormalisation is valid only for reading large data. Modern apps usually handle paginated requests with modest output. In this case denormalisation wins.
I
Ilya Kochetov

The order in which you're joining the tables is extremely important. If you have two sets of data try to build the query in a way so the smallest will be used first to reduce the amount of data the query has to work on.

For some databases it does not matter, for example MS SQL does know the proper join order most of the time. For some (like IBM Informix) the order makes all the difference.


In general a decent query optimizer is going to be unaffected by the order that the joins or tables are listed, and will make its own determination of the most efficient way to perform the join.
MySQL, Oracle, SQL Server, Sybase, postgreSQL,etc. care not the order of joins. I've worked with DB2 and it also, to my knowledge, doesn't care what order you put them in. This is not helpful advice in the general case
MySQL clustering using the NDB engine (admittedly an edge case, and only advanced developers are going to go near NDB) doesn't guess the join order correctly, so you have to add "USE INDEX" statements to most joined queries or they'll be horrendously inefficient. MySQL docs cover it.
@iiya, Understanding what the optimiser will choose is more important than generalised statements or "myths" about table ordering. Do not rely on a particular quirk in your SQL as the behaviour often changes when the RDBMS is upgraded. Oracle has changed behaviours several times since v7.
@Matt I've saw Oracle 9i perform very different optimizations and query plans just adjusting the join order. Maybe this has changed from version 10i onwards?
M
MathGladiator

Deciding on whether to denormalize or normalize is fairly a straightforward process when you consider the complexity class of the join. For instance, I tend to design my databases with normalization when the queries are O(k log n) where k is relative to the desired output magnitude.

An easy way to denormalize and optimize performance is to think about how changes to your normalize structure affect your denormalized structure. It can be problematic however as it may require transactional logic to work on a denormalized structured.

The debate for normalization and denormalization isn't going to end since the problems are vast. There are many problems where the natural solution requires both approaches.

As a general rule, I've always stored a normalized structure and denormalized caches that can be reconstructed. Eventually, these caches save my ass to solve the future normalization problems.


q
questzen

Elaborating what others have said,

Joins are just cartesian products with some lipgloss. {1,2,3,4}X{1,2,3} would give us 12 combinations (nXn=n^2). This computed set acts as a reference on which conditions are applied. The DBMS applies the conditions (like where both left and right are 2 or 3) to give us the matching condition(s). Actually it is more optimised but the problem is the same. The changes to size of the sets would increase the result size exponentially. The amount of memory and cpu cycles consumed all are effected in exponential terms.

When we denormalise, we avoid this computation altogether, think of having a colored sticky, attached to every page of your book. You can infer the information with out using a reference. The penalty we pay is that we are compromising the essence of DBMS (optimal organisation of data)


-1: This post is a great example of why you let the DBMS perform the joins -- because the DBMS designers think about these issues all the time and come up with more effective ways to do it than the compsci 101 method.
@David: Agreed. DBMS optimizer programmers are some smart cookies
This answer is incorrect. If your query is executed against a normalized, indexed database and has any kind of filter or join condition, the optimizer will find a way to avoid the Cartesian product and minimize memory usage and CPU cycles. If you actually intend to select a Cartesian product, you'll use the same memory in a normalized or de-normalized db.