ChatGPT解决这个技术问题 Extra ChatGPT

SQL Server: Difference between PARTITION BY and GROUP BY

I've been using GROUP BY for all types of aggregate queries over the years. Recently, I've been reverse-engineering some code that uses PARTITION BY to perform aggregations. In reading through all the documentation I can find about PARTITION BY, it sounds a lot like GROUP BY, maybe with a little extra functionality added in? Are they two versions of the same general functionality, or are they something different entirely?


A
Al Dass

They're used in different places. group by modifies the entire query, like:

select customerId, count(*) as orderCount
from Orders
group by customerId

But partition by just works on a window function, like row_number:

select row_number() over (partition by customerId order by orderId)
    as OrderNumberForThisCustomer
from Orders

A group by normally reduces the number of rows returned by rolling them up and calculating averages or sums for each row. partition by does not affect the number of rows returned, but it changes how a window function's result is calculated.


nice answer, would you please write a sample of a returned results for each of them?
@AshkanMobayenKhiabani you can run both queries against Northwind, which may or may not be installed by default depending on your sql server version. If not you can search for it on s downloads page.
@AshkanMobayenKhiabani Arunprasanth's answer below shows returned results which can save you time as opposed to jumping through more learning hoops and time to learn Northwind
More on windows functions (in SQL): blog.jooq.org/2013/11/03/…
itcodehub.blogspot.com/2019/03/… - more info and example about differences between group by and partition by in sql
M
Marc.2377

We can take a simple example.

Consider a table named TableA with the following values:

id  firstname                   lastname                    Mark
-------------------------------------------------------------------
1   arun                        prasanth                    40
2   ann                         antony                      45
3   sruthy                      abc                         41
6   new                         abc                         47
1   arun                        prasanth                    45
1   arun                        prasanth                    49
2   ann                         antony                      49

GROUP BY

The SQL GROUP BY clause can be used in a SELECT statement to collect data across multiple records and group the results by one or more columns. In more simple words GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns.

Syntax:

SELECT expression1, expression2, ... expression_n, 
       aggregate_function (aggregate_expression)
FROM tables
WHERE conditions
GROUP BY expression1, expression2, ... expression_n;

We can apply GROUP BY in our table:

select SUM(Mark)marksum,firstname from TableA
group by id,firstName

Results:

marksum  firstname
----------------
94      ann                      
134     arun                     
47      new                      
41      sruthy   

In our real table we have 7 rows and when we apply GROUP BY id, the server group the results based on id:

In simple words:

here GROUP BY normally reduces the number of rows returned by rolling them up and calculating Sum() for each row.

PARTITION BY

Before going to PARTITION BY, let us look at the OVER clause:

According to the MSDN definition:

OVER clause defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.

PARTITION BY will not reduce the number of rows returned.

We can apply PARTITION BY in our example table:

SELECT SUM(Mark) OVER (PARTITION BY id) AS marksum, firstname FROM TableA

Result:

marksum firstname 
-------------------
134     arun                     
134     arun                     
134     arun                     
94      ann                      
94      ann                      
41      sruthy                   
47      new  

Look at the results - it will partition the rows and returns all rows, unlike GROUP BY.


partition by can affect the number of rows, it just won't reduce the number of rows.
What would be the difference if i were to change the SELECT into SELECT DISTINCT to the second query? wouldn't that return the same data-set as the GROUP BY query? What are the reasons of choosing one or the other?
@Erick3E please have a look on this question stackoverflow.com/questions/20375074/…
I like this answer better because it shows how the Aggregate functions Min/Max/Sum etc work on Partitions. The Row_Number() example doesn't make it as clear. Normally I use an aggregate function with GROUP BY but just noticed that PARTITION-OVER has the same methods and wondered the same thing the OP did - which lead me here. Thanks!
@ripvlan I agree. This answer clarifies that doing something like SELECT column1,AVG(column2) OVER (PARTITION BY column1) returns all the rows and the average of column 2 that is calculated with respect to column 1.
C
ConcernedOfTunbridgeWells

partition by doesn't actually roll up the data. It allows you to reset something on a per group basis. For example, you can get an ordinal column within a group by partitioning on the grouping field and using rownum() over the rows within that group. This gives you something that behaves a bit like an identity column that resets at the beginning of each group.


W
Will Marcouiller

PARTITION BY Divides the result set into partitions. The window function is applied to each partition separately and computation restarts for each partition.

Found at this link: OVER Clause


a
adolf garlic

It provides rolled-up data without rolling up

i.e. Suppose I want to return the relative position of sales region

Using PARTITION BY, I can return the sales amount for a given region and the MAX amount across all sales regions in the same row.

This does mean you will have repeating data, but it may suit the end consumer in the sense that data has been aggregated but no data has been lost - as would be the case with GROUP BY.


The best, simplest answer.
R
RLH

PARTITION BY is analytic, while GROUP BY is aggregate. In order to use PARTITION BY, you have to contain it with an OVER clause.


PARTITION BY is analytic this simple statement cleared a lot up for me. +1.
This is actually the simplest and best answer.
y
yoel halb

As of my understanding Partition By is almost identical to Group By, but with the following differences:

That group by actually groups the result set returning one row per group, which results therefore in SQL Server only allowing in the SELECT list aggregate functions or columns that are part of the group by clause (in which case SQL Server can guarantee that there are unique results for each group).

Consider for example MySQL which allows to have in the SELECT list columns that are not defined in the Group By clause, in which case one row is still being returned per group, however if the column doesn't have unique results then there is no guarantee what will be the output!

But with Partition By, although the results of the function are identical to the results of an aggregate function with Group By, still you are getting the normal result set, which means that one is getting one row per underlying row, and not one row per group, and because of this one can have columns that are not unique per group in the SELECT list.

So as a summary, Group By would be best when needs an output of one row per group, and Partition By would be best when one needs all the rows but still wants the aggregate function based on a group.

Of course there might also be performance issues, see http://social.msdn.microsoft.com/Forums/ms-MY/transactsql/thread/0b20c2b5-1607-40bc-b7a7-0c60a2a55fba.


L
Lukas Eder

PARTITION BY semantics

Your question was specifically about SQL Server, which currently only supports a PARTITION BY clause only in window functions, but as I've explained in this blog post about the various meanings of PARTITION BY in SQL, there are also others, including:

Window partitions (window functions are a SQL standard)

Table partitions (vendor specific extensions to organise storage, e.g. in Oracle or PostgreSQL)

MATCH_REGOGNIZE partitions (which is also a SQL standard)

MODEL or SPREADSHEET partitions (an Oracle extension to SQL)

OUTER JOIN partitions (a SQL standard)

Apart from the last one, which re-uses the PARTITION BY syntax to implement some sort of CROSS JOIN logic, all of these PARTITION BY clauses have the same meaning:

A partition separates a data set into subsets, which don’t overlap.

Based on this partitioning, further calculations or storage operations per partition can be implemented. E.g. with window functions, such as COUNT(*) OVER (PARTITION BY criteria), the COUNT(*) value is calculated per partition.

GROUP BY semantics

GROUP BY allows for similar partitioning behaviour, although it also transforms the semantics of your entire query in various weird ways. Most queries using GROUP BY can be rewritten using window functions, instead, although often, the GROUP BY syntax is more concise and possibly also better optimised.

For example, these are the logically the same, but I would expect the GROUP BY clause to perform better:

-- Classic
SELECT a, COUNT(*)
FROM t
GROUP BY a

-- Using window functions
SELECT DISTINCT a, COUNT(*) OVER (PARTITION BY a)
FROM t

The key difference is:

Window functions can also be non-aggregate functions, e.g. ROW_NUMBER()

Each window function can have its own PARTITION BY clause, whereas GROUP BY can only group by one set of expressions per query.


To extend this comprehensive list, PARTITION BY could also be found as a way to divide workload for Table Functions, here example from Snowflake: Table Functions and Partitions
@LukaszSzozda: Thanks for the pointer, nice to know. That's... weird. I mean, any partitioning or ordering can easily be undone by hash joins, hash grouping, or other operators that happen after the table function access. I guess I would have expected some hint syntax instead, for this kind of non-logic influencing optimisation functionality.
M
Mahdi Ben Selimene

When you use GROUP BY, the resulting rows will be usually less then incoming rows.

But, when you use PARTITION BY, the resulting row count should be the same as incoming.


does incoming mean uncommited rows here ?
u
user1785960

Small observation. Automation mechanism to dynamically generate SQL using the 'partition by' it is much simpler to implement in relation to the 'group by'. In the case of 'group by', We must take care of the content of 'select' column.

Sorry for My English.


A
Ambrish Rajput

Suppose we have 14 records of name column in table

in group by

select name,count(*) as totalcount from person where name='Please fill out' group BY name;

it will give count in single row i.e 14

but in partition by

select row_number() over (partition by name) as total from person where name = 'Please fill out';

it will 14 rows of increase in count


A
Alev Ryustemov

It has really different usage scenarios. When you use GROUP BY you merge some of the records for the columns that are same and you have an aggregation of the result set.

However when you use PARTITION BY your result set is same but you just have an aggregation over the window functions and you don't merge the records, you will still have the same count of records.

Here is a rally helpful article explaining the difference: http://alevryustemov.com/sql/sql-partition-by/


P
Peoria Os
-- BELOW IS A SAMPLE WHICH OUTLINES THE SIMPLE DIFFERENCES
-- READ IT AND THEN EXECUTE IT
-- THERE ARE THREE ROWS OF EACH COLOR INSERTED INTO THE TABLE
-- CREATE A database called testDB


-- use testDB
USE [TestDB]
GO


-- create Paints table
CREATE TABLE [dbo].[Paints](
    [Color] [varchar](50) NULL,
    [glossLevel] [varchar](50) NULL
) ON [PRIMARY]

GO


-- Populate Table
insert into paints (color, glossLevel)
select 'red', 'eggshell'
union
select 'red', 'glossy'
union
select 'red', 'flat'
union
select 'blue', 'eggshell'
union
select 'blue', 'glossy'
union
select 'blue', 'flat'
union
select 'orange', 'glossy'
union
select 'orange', 'flat'
union
select 'orange', 'eggshell'
union
select 'green', 'eggshell'
union
select 'green', 'glossy'
union
select 'green', 'flat'
union
select 'black', 'eggshell'
union
select 'black', 'glossy'
union
select 'black', 'flat'
union
select 'purple', 'eggshell'
union
select 'purple', 'glossy'
union
select 'purple', 'flat'
union
select 'salmon', 'eggshell'
union
select 'salmon', 'glossy'
union
select 'salmon', 'flat'


/*   COMPARE 'GROUP BY' color to 'OVER (PARTITION BY Color)'  */

-- GROUP BY Color 
-- row quantity defined by group by
-- aggregate (count(*)) defined by group by
select count(*) from paints
group by color

-- OVER (PARTITION BY... Color 
-- row quantity defined by main query
-- aggregate defined by OVER-PARTITION BY
select color
, glossLevel
, count(*) OVER (Partition by color)
from paints

/* COMPARE 'GROUP BY' color, glossLevel to 'OVER (PARTITION BY Color, GlossLevel)'  */

-- GROUP BY Color, GlossLevel
-- row quantity defined by GROUP BY
-- aggregate (count(*)) defined by GROUP BY
select count(*) from paints
group by color, glossLevel



-- Partition by Color, GlossLevel
-- row quantity defined by main query
-- aggregate (count(*)) defined by OVER-PARTITION BY
select color
, glossLevel
, count(*) OVER (Partition by color, glossLevel)
from paints