ChatGPT解决这个技术问题 Extra ChatGPT

Joins are for lazy people?

I recently had a discussion with another developer who claimed to me that JOINs (SQL) are useless. This is technically true but he added that using joins is less efficient than making several requests and link tables in the code (C# or Java).

For him joins are for lazy people that don't care about performance. Is this true? Should we avoid using joins?

No. Databases are optimized to perform joins, they are extremely fast especially for large datasets. You don't want your application to load tens of thousands rows and merge them together manually.
I'm solidly in the "Wrong!" camp, as are most of the respondants to this question. Which, alas, means that we won't get many (if any) posts from people arguing strongly for the other side. I'd really like to hear that guy's rationale...
Seems simple enough to prove out (he's wrong). It's not a subjective issue. I might answer that the vast majority of the industry disagrees, and if he has some evidence to the contrary then it would be worthwhile to provide a scenario and some metrics. Maybe he has some bizarre scenario he's basing his opinion on. That could be an opportunity for you to help him prove his own hunches and become a better, more curious (and therefore eventually more effective) developer.

M
Michael Borgwardt

No, we should avoid developers who hold such incredibly wrong opinions.

In many cases, a database join is several orders of magnitude faster than anything done via the client, because it avoids DB roundtrips, and the DB can use indexes to perform the join.

Off the top of my head, I can't even imagine a single scenario where a correctly used join would be slower than the equivalent client-side operation.

Edit: There are some rare cases where custom client code can do things more efficiently than a straightforward DB join (see comment by meriton). But this is very much the exception.


What about 3-way joins? Aren't there cases where you'd be better off doing them "in code"?
Joining in the app server can be more efficient if joining on the database causes severe redundancy in the result set sent over the network. Consider tables A and B, where each row in A is associated with 20 rows in B, B has only 100 rows, and we want to fetch the first 1000 rows from A with associated rows from B. Joining in the database will result in 20 * 1000 tuples sent across the network. If the join is done in the app server (first fetching the entire B table into memory), a mere 100 + 1000 rows are sent across the network.
However, you are certainly correct in that joins on the database are much faster in most cases, and therefore not just a matter of convenience, but of necessity.
I have have been lucky enough to speak with some of the developers who work on SQL Server at Microsoft. It will make you dizzy hearing the optimizations they do on queries. Anyone who thinks that they are smarter than that needs to be smacked.
@meriton I'm a little surprised; I'd expect the client library to optimise cross joins.
M
Marc Gravell

It sounds to me like your colleague would do well with a no-sql document-database or key-value store. Which are themselves very good tools and a good fit for many problems.

However, a relational database is heavily optimised for working with sets. There are many, many ways of querying the data based on joins that are vastly more efficient than lots of round trips. This is where the versatilty of a rdbms comes from. You can achieve the same in a nosql store too, but you often end up building a separate structure suited for each different nature of query.

In short: I disagree. In a RDBMS, joins are fundamental. If you aren't using them, you aren't using it as a RDBMS.


s
sehe

Well, he is wrong in the general case.

Databases are able to optimize using a variety of methods, helped by optimizer hints, table indexes, foreign key relationships and possibly other database vendor specific information.


I have to admit when I started working with databases I held the same belief that I could beat the performance of joins. But it didn't take long to realize how amazingly fast joins are done by the DB. In fact I would say in this situation it's better to discuss it with the employee in an open way rather than dismissing him as an idiot.
@LegendLength I'd say that's even true if they're not so smart. No need to assume smartness because they make the same mistakes as we remember making (in fact, for me that might mean they're not so smart...) It's simpler: It rarely helps to be dismissive. It's okay to be wrong, once in a while!
r
richzilla

No, you shouldnt.

Databases are specifically designed to manipulate sets of data (obviously....). Therefore they are incredibly efficient at doing this. By doing what is essentially a manual join in his own code, he is attempting to take over the role of something specifically designed for the job. The chances of his code ever being as efficient as that in the database are very remote.

As an aside, without joins, whats the point in using a database? he may as well just use text files.


Even without joins? Automatic in-memory mapping, automatic query caching, lots of other automagic stuff that doesn't happen at all with most filesystems. Oh, did I mention finely-controllable transactions?
M
MJB

If "lazy" is defined as people who want to write less code, then I agree. If "lazy" is defined as people who want to have tools do what they are good at doing, I agree. So if he is merely agreeing with Larry Wall (regarding the attributes of good programmers), then I agree with him.


I added the precision of lazy : for lazy people that don't care about performances and prefer write less code. I think that joins are for lazy people but in this case joins are also better than several requests.
@Dran Dane: Joins are for lazy people, yes. The fact that they will likely perform well is orthogonal.
G
Giovanni Galbo

Ummm, joins is how relational databases relate tables to each other. I'm not sure what he's getting at.

How can making several calls to the database be more efficient than one call? Plus sql engines are optimized at doing this sort of thing.

Maybe your coworker is too lazy to learn SQL.


P
Piskvor left the building

"This is technicaly true" - similarly, a SQL database is useless: what's the point in using one when you can get the same result by using a bunch of CSV files, and correlating them in code? Heck, any abstraction is for lazy people, let's go back to programming in machine code right on the hardware! ;)

Also, his asssertion is untrue in all but the most convoluted cases: RDBMSs are heavily optimized to make JOINs fast. Relational database management systems, right?


+1 The phrase "... technically true" would have worked better if the OP had used to word unnecessary rather useless in the preceding sentence. Saying that joins are useless is patently untrue with no technicalities needing consideration. In any case, the OP's and the colleague's misunderstanding of the point of RDBMSs is sandly not uncommon: stackoverflow.com/q/5575682/47550
R
RedPain

Yes, You should.

And you should use C++ instead of C# because of performance. C# is for lazy people.

No, no, no. You should use C instead of C++ because of performance. C++ is for lazy people.

No, no, no. You should use assembly instead of C because of performance. C is for lazy people.

Yes, I am joking. you can make faster programs without joins and you can make programs using less memory without joins. BUT in many cases, your development time is more important than CPU time and memory. Give up a little performance and enjoy your life. Don't waste your time for little little performance. And tell him "Why don't you make a straight highway from your place to your office?"


S
Srikanth

The last company I worked for didn't use SQL joins either. Instead they moved this work to application layer which is designed to scale horizontally. The rationale for this design is to avoid work at database layer. It is usually the database that becomes bottleneck. Its easier to replicate application layer than database. There could be other reasons. But this is the one that I can recall now.

Yes I agree that joins done at application layer are inefficient compared to joins done by database. More network communication also.

Please note that I'm not taking a hard stand on avoiding SQL joins.


Well, that sounds like a rational argument against JOINs in your specific case. I remember that FB Engineering posted something similar on their blog - scaling out was also their key priority. Alas, only a small % of programmers will ever need to do this, but many think they do "because OMG Facebook also does that" ;)
okay, in an enterprise solution where you have enough traffic to overload the database server this may be worth considering but its more likely to be that reporting stored procedure or the scheduled backup nailing the performance. Databases are good at joins, especially if there are indecies to help
@Jodrell: Yes they are good at joins; again, there are corner cases where you need to drop the joins' elegance to get more power. I've met one such situation; we tried every possible solution, and indeed a no-join solution was fastest in that one very specific situation. And no, there wasn't anything else at all running at that particular server; stored procedures can't slow you down if you don't have any ;)
J
JonH

Without joins how are you going to relate order items to orders? That is the entire point of a relational database management system. Without joins there is no relational data and you might as well use text files to process data.

Sounds like he doesn't understand the concept so he's trying to make it seem they are useless. He's the same type of person who thinks excel is a database application. Slap him silly and tell him to read more about databases. Making multiple connections and pulling data and merging the data via C# is the wrong way to do things.


J
Jodrell

I don't understand the logic of the statement "joins in SQL are useless". Is it useful to filter and limit the data before working on it? As you're other respondants have stated this is what database engines do, it should be what they are good at.

Perhaps a lazy programmer would stick to technologies with which they were familiar and eschew other possibilities for non technical reasons.

I leave it to you to decide.


p
phoog

Let's consider an example: a table with invoice records, and a related table with invoice line item records. Consider the client pseudo code:

for each (invoice in invoices)
    let invoiceLines = FindLinesFor(invoice)
...

If you have 100,000 invoices with 10 lines each, this code will look up 10 invoice lines from a table of 1 million, and it will do that 100,000 times. As the table size increases, the number of select operations increases, and the cost of each select operation increases.

Becase computers are fast, you may not notice a performance difference between the two approaches if you have several thousand records or fewer. Because the cost increase is more than linear, as the number of records increases (into the millions, say), you'll begin to notice a difference, and the difference will become less tolerable as the size of the data set grows.

The join, however. will use the table's indexes and merge the two data sets. This means that you're effectively scanning the second table once rather than randomly accessing it N times. If there's a foreign key defined, the database already has the links between the related records stored internally.

Imagine doing this yourself. You have an alphabetical list of students and a notebook with all the students' grade reports (one page per class). The notebook is sorted in order by the students' names, in the same order as the list. How would you prefer to proceed?

Read a name from the list. Open the notebook. Find the student's name. Read the student's grades, turning pages until you reach the next student or the last page. Close the notebook. Repeat.

Or:

Open the notebook to the first page. Read a name from the list. Read any grades for that name from the notebook. Repeat steps 2-3 until you get to the end Close the notebook.


j
jcollum

Sounds like a classic case of "I can write it better." In other words, he's seeing something that he sees as kind of a pain in the neck (writing a bunch of joins in SQL) and saying "I'm sure I can write that better and get better performance." You should ask him if he is a) smarter and b) more educated than the typical person that's knee deep in the Oracle or SQL Server optimization code. Odds are he isn't.


M
Mike M.

He is most certainly wrong. While there are definite pros to data manipulation within languages like C# or Java, joins are fastest in the database due to the nature of SQL itself.

SQL keeps detailing statistics regarding the data, and if you have created your indexes correctly, can very quickly find one record in a couple of million. Besides the fact that why would you want to drag all your data into C# to do a join when you can just do it right on the database level?

The pros for using C# come into play when you need to do something iteratively. If you need to do some function for each row, it's likely faster to do so within C#, otherwise, joining data is optimized in the DB.


J
JaCraig

I will say that I have run into a case where it was faster breaking the query down and doing the joins in code. That being said, it was only with one particular version of MySQL that I had to do that. Everything else, the database is probably going to be faster (note that you may have to optimize the queries, but it will still be faster).


P
Peter Lawrey

I suspect he has a limited view on what databases should be used for. One approach to maximise performance is to read the entire database into memory. In this situation, you may get better performance and you may want to perform joins if memory for efficiency. However this is not really using a database, as a database IMHO.


Most database engines will do this for you behind-the-scenes anyway; and e.g. in MySQL you can create a purely in-memory table (MEMORY engine). Re-implementing the database functionality without the database is usually a sign of a severe case of NIH ;)
@phoog: Not Invented Here - in other words, "I didn't think of that, so it doesn't exist". Many square wheels were re-invented because of this. (and yes, sometimes re-inventing the wheel is useful, e.g. if you're making racing cars; re-inventing "just because" is unlikely to get you a better wheel)
In other words, "I didn't make it so it must be rubbish". This has a grain of truth only so far as "I haven't tested it so it might not be suitable for my purposes", so test it before you judge it.
@Piskvor: Not necessarily, the database can only use the memory of the system it runs on, whereas the application can use the application server's memory. Put differently: If the database is on a dedicated host, accessing that cache still requires network bandwith and is subject to network latency, but any cache the application keeps can be queried with the speed a low latency of memory access.
J
Jonas Byström

No, not only are joins better optimized in database code that ad-hoc C#/Java; but usually several filtering techniques can be applied, which yields even better performance.


H
HLGEM

He is wrong, joins are what competent programmers use. There may be a few limited cases where his proposed method is more efficient (and inthose I would probably be using a Documant database) but I can't see it if you have any deceent amount of data. For example take this query:

select t1.field1 
from table1 t1
join table2 t2 
    on t1.id = t2.id
where t1.field2 = 'test'

Assume you have 10 million records in table1 and 1 million records in table2. Assume 9 million of the records in table 1 meet the where clause. Assume only 15 of them are in table2 as well. You can run this sql statement which if properly indexed will take milliseconds and return 15 records across the network with only 1 column of data. Or you can send ten million records with 2 columns of data and separately send another 1 millions records with one column of data across the network and combine them on the web server.

Or of course you could keep the entire contents of the database on the web server at all times which is just plain silly if you have more than a trivial amount of data and data that is continually changing. If you don't need the qualities of a relational database then don't use one. But if you do, then use it correctly.


C
Christian Seifert

I've heard this argument quite often during my career as a software developer. Almost everytime it has been stated, the guy making the claim didn't have much knowledge about relational database systems, the way they work and the way such systems should be used.

Yes, when used incorrectly, joins seem to be useless or even dangerous. But when used in the correct way, there is a lot of potential for database implementation to perform optimizations and to "help" the developer retrieving the correct result most efficiently.

Don't forget that using a JOIN you tell the database about the way you expect the pieces of data to relate to each other and therefore give the database more information about what you are trying to do and therefore making it able to better fit your needs.

So the answer is definitely: No, JOINSaren't useless at all!


f
fredt

This is "technically true" only in one case which is not used often in applications (when all the rows of all the tables in the join(s) are returned by the query). In most queries only a fraction of the rows of each table is returned. The database engine often uses indexes to eliminate the unwanted rows, sometimes even without reading the actual row as it can use the values stored in indexes. The database engine is itself written in C, C++, etc. and is at least as efficient as code written by a developer.


g
gbn

Unless I've seriously misunderstood, the logic in the question is very flawed

If there are 20 rows in B for each A, a 1000 rows in A implies 20k rows in B. There can't be just 100 rows in B unless there is many-many table "AB" with 20k rows with the containing the mapping.

So to get all information about which 20 of the 100 B rows map to each A row you table AB too. So this would be either:

3 result sets of 100, 1000, and 20k rows and a client JOIN

a single JOINed A-AB-B result set with 20k rows

So "JOIN" in the client does add any value when you examine the data. Not that it isn't a bad idea. If I was retrieving one object from the database than maybe it makes more sense to break it down into separate results sets. For a report type call, I'd flatten it out into one almost always.

In any case, I'd say there is almost no use for a cross join of this magnitude. It's a poor example.

You have to JOIN somewhere, and that's what RDBMS are good at. I'd not like to work with any client code monkey who thinks they can do better.

Afterthought:

To join in the client requires persistent objects such as DataTables (in .net). If you have one flattened resultset it can be consumed via something lighter like a DataReader. High volume = lot of client resources used to avoid a database JOIN.