ChatGPT解决这个技术问题 Extra ChatGPT

Get top n records for each group of grouped results

The following is the simplest possible example, though any solution should be able to scale to however many n top results are needed:

Given a table like that below, with person, group, and age columns, how would you get the 2 oldest people in each group? (Ties within groups should not yield more results, but give the first 2 in alphabetical order)

+--------+-------+-----+
| Person | Group | Age |
+--------+-------+-----+
| Bob    | 1     | 32  |
| Jill   | 1     | 34  |
| Shawn  | 1     | 42  |
| Jake   | 2     | 29  |
| Paul   | 2     | 36  |
| Laura  | 2     | 39  |
+--------+-------+-----+

Desired result set:

+--------+-------+-----+
| Shawn  | 1     | 42  |
| Jill   | 1     | 34  |
| Laura  | 2     | 39  |
| Paul   | 2     | 36  |
+--------+-------+-----+

NOTE: This question builds on a previous one- Get records with max value for each group of grouped SQL results - for getting a single top row from each group, and which received a great MySQL-specific answer from @Bohemian:

select * 
from (select * from mytable order by `Group`, Age desc, Person) x
group by `Group`

Would love to be able to build off this, though I don't see how.

Check this example. It is pretty much close to what you ask: stackoverflow.com/questions/1537606/…
Using LIMIT within GROUP BY to get N results per group? stackoverflow.com/questions/2129693/…

T
Taryn

Here is one way to do this, using UNION ALL (See SQL Fiddle with Demo). This works with two groups, if you have more than two groups, then you would need to specify the group number and add queries for each group:

(
  select *
  from mytable 
  where `group` = 1
  order by age desc
  LIMIT 2
)
UNION ALL
(
  select *
  from mytable 
  where `group` = 2
  order by age desc
  LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too, it generates a row number for each record. Using an example from the link above this will return only those records with a row number of less than or equal to 2:

select person, `group`, age
from 
(
   select person, `group`, age,
      (@num:=if(@group = `group`, @num +1, if(@group := `group`, 1, 1))) row_number 
  from test t
  CROSS JOIN (select @num:=0, @group:=null) c
  order by `Group`, Age desc, person
) as x 
where x.row_number <= 2;

See Demo


if he have 1 000+ groups, wouldn't that make this a bit scary?
@CharlesForest yes, it would and that Is why I stated that you would have to specify it for more than two groups. It would become ugly.
@CharlesForest I think I found a better solution, see my edit
A note for anyone reading this: The version is the variables is close to being correct. However, MySQL does not guarantee the order of evaluation of expressions in the SELECT (and, in fact, sometimes evaluates them out-of-order). The key to the solution is to put all the variable assignments in a single expression; here is an example: stackoverflow.com/questions/38535020/….
@GordonLinoff Updated my answer, thanks for pointing it out. It also took far too long for me to update it.
M
Mark Byers

In other databases you can do this using ROW_NUMBER. MySQL doesn't support ROW_NUMBER but you can use variables to emulate it:

SELECT
    person,
    groupname,
    age
FROM
(
    SELECT
        person,
        groupname,
        age,
        @rn := IF(@prev = groupname, @rn + 1, 1) AS rn,
        @prev := groupname
    FROM mytable
    JOIN (SELECT @prev := NULL, @rn := 0) AS vars
    ORDER BY groupname, age DESC, person
) AS T1
WHERE rn <= 2

See it working online: sqlfiddle

Edit I just noticed that bluefeet posted a very similar answer: +1 to him. However this answer has two small advantages:

It it is a single query. The variables are initialized inside the SELECT statement. It handles ties as described in the question (alphabetical order by name).

So I'll leave it here in case it can help someone.


Mark- This is working well for us. Thanks for providing another good alternative to compliment @bluefeet's- much appreciated.
+1. This worked for me. Really clean and to the point answer. Can you please explain how exactly this works? Whats the logic behind this?
Nice solution but it seems it is not working in my environment (MySQL 5.6) because the order by clause is applied after select so it doesn't return the top result, see my alternative solution to fix this issue
While running this I was able to delete JOIN (SELECT @prev := NULL, @rn := 0) AS vars. I get the idea is to declare empty variables, but it seems extraneous for MySql.
This works great for me in MySQL 5.7, but it would be awesome if someone could explain how it works
s
snuffn

Try this:

SELECT a.person, a.group, a.age FROM person AS a WHERE 
(SELECT COUNT(*) FROM person AS b 
WHERE b.group = a.group AND b.age >= a.age) <= 2 
ORDER BY a.group ASC, a.age DESC

DEMO


snuffin coming out of nowhere with the simplest solution! Is this more elegant than Ludo's/Bill Karwin's? Can I get some commentary
Hm, not sure if it is more elegant. But judging from the votes, I guess bluefeet might have the better solution.
There's a problem with this. If there's a tie for second place within the group, only one top result is returned. See modified demo
It's not a problem if it's desired. You can set the order of a.person.
no, it is not working in my case, neither does the DEMO work
C
Community

How about using self-joining:

CREATE TABLE mytable (person, groupname, age);
INSERT INTO mytable VALUES('Bob',1,32);
INSERT INTO mytable VALUES('Jill',1,34);
INSERT INTO mytable VALUES('Shawn',1,42);
INSERT INTO mytable VALUES('Jake',2,29);
INSERT INTO mytable VALUES('Paul',2,36);
INSERT INTO mytable VALUES('Laura',2,39);

SELECT a.* FROM mytable AS a
  LEFT JOIN mytable AS a2 
    ON a.groupname = a2.groupname AND a.age <= a2.age
GROUP BY a.person
HAVING COUNT(*) <= 2
ORDER BY a.groupname, a.age DESC;

gives me:

a.person    a.groupname  a.age     
----------  -----------  ----------
Shawn       1            42        
Jill        1            34        
Laura       2            39        
Paul        2            36      

I was strongly inspired by the answer from Bill Karwin to Select top 10 records for each category

Also, I'm using SQLite, but this should work on MySQL.

Another thing: in the above, I replaced the group column with a groupname column for convenience.

Edit:

Following-up on the OP's comment regarding missing tie results, I incremented on snuffin's answer to show all the ties. This means that if the last ones are ties, more than 2 rows can be returned, as shown below:

.headers on
.mode column

CREATE TABLE foo (person, groupname, age);
INSERT INTO foo VALUES('Paul',2,36);
INSERT INTO foo VALUES('Laura',2,39);
INSERT INTO foo VALUES('Joe',2,36);
INSERT INTO foo VALUES('Bob',1,32);
INSERT INTO foo VALUES('Jill',1,34);
INSERT INTO foo VALUES('Shawn',1,42);
INSERT INTO foo VALUES('Jake',2,29);
INSERT INTO foo VALUES('James',2,15);
INSERT INTO foo VALUES('Fred',1,12);
INSERT INTO foo VALUES('Chuck',3,112);


SELECT a.person, a.groupname, a.age 
FROM foo AS a 
WHERE a.age >= (SELECT MIN(b.age)
                FROM foo AS b 
                WHERE (SELECT COUNT(*)
                       FROM foo AS c
                       WHERE c.groupname = b.groupname AND c.age >= b.age) <= 2
                GROUP BY b.groupname)
ORDER BY a.groupname ASC, a.age DESC;

gives me:

person      groupname   age       
----------  ----------  ----------
Shawn       1           42        
Jill        1           34        
Laura       2           39        
Paul        2           36        
Joe         2           36        
Chuck       3           112      

@Ludo- Just saw that answer from Bill Karwin - thanks for applying it here
What do you think of Snuffin's answer? I'm trying to compare the two
There's a problem with this. If there's a tie for second place within the group, only one top result is returned- See demo
@Ludo- the original requirement was that each group return the exact n results, with any ties being resolved alphabetically
The edit to include the ties does not work for me. I get ERROR 1242 (21000): Subquery returns more than 1 row, presumably because of the GROUP BY. When I execute the SELECT MIN subquery alone, it generates three rows: 34, 39, 112 and there it appears the second value should be 36, not 39.
L
Laurent PELE

Snuffin solution seems quite slow to execute when you've got plenty of rows and Mark Byers/Rick James and Bluefeet solutions doesn't work on my environnement (MySQL 5.6) because order by is applied after execution of select, so here is a variant of Marc Byers/Rick James solutions to fix this issue (with an extra imbricated select):

select person, groupname, age
from
(
    select person, groupname, age,
    (@rn:=if(@prev = groupname, @rn +1, 1)) as rownumb,
    @prev:= groupname 
    from 
    (
        select person, groupname, age
        from persons 
        order by groupname ,  age desc, person
    )   as sortedlist
    JOIN (select @prev:=NULL, @rn :=0) as vars
) as groupedlist 
where rownumb<=2
order by groupname ,  age desc, person;

I tried similar query on a table having 5 millions rows and it returns result in less than 3 seconds


This is the only query that has been working in my environment. Thanks!
Add LIMIT 9999999 to any derived table with an ORDER BY. This may prevent the ORDER BY from being ignored.
I ran a a similar query on a table containing a few thousand rows, and it took 60 seconds to return one result, so... thanks for the post, it's a start for me. (ETA: down to 5 seconds. Good!)
This is the query that works perfect with the Order. The answers below dont work well. Thanks
Wow, really nice, I was able to improve the performance significantly using this approach! Btw, had to escape all variable assignments using double backslash \\:= in spring jpa repository code for nativeQuery.
R
Rick James

If the other answers are not fast enough Give this code a try:

SELECT
        province, n, city, population
    FROM
      ( SELECT  @prev := '', @n := 0 ) init
    JOIN
      ( SELECT  @n := if(province != @prev, 1, @n + 1) AS n,
                @prev := province,
                province, city, population
            FROM  Canada
            ORDER BY
                province   ASC,
                population DESC
      ) x
    WHERE  n <= 3
    ORDER BY  province, n;

Output:

+---------------------------+------+------------------+------------+
| province                  | n    | city             | population |
+---------------------------+------+------------------+------------+
| Alberta                   |    1 | Calgary          |     968475 |
| Alberta                   |    2 | Edmonton         |     822319 |
| Alberta                   |    3 | Red Deer         |      73595 |
| British Columbia          |    1 | Vancouver        |    1837970 |
| British Columbia          |    2 | Victoria         |     289625 |
| British Columbia          |    3 | Abbotsford       |     151685 |
| Manitoba                  |    1 | ...

Looked at your site - where would I get the data source for the cities' populations? TIA and rgs.
maxmind.com/en/worldcities -- I find it handy for experimenting with lat/lng searches, queries, partitioning, etc. It is big enough to be interesting, yet readable enough to recognize the answers. The Canadian subset is handy for this kind of question. (Fewer provinces than US cities.)
@RickJames thank you, this worked first time, I have spent 3 hours trying to do this and failed miserably.
@dimButTries - It's tricky code. I have seen a lot of postings that fail to perform well. So I set out to find (or create) the best solutions.
T
Travesty3

Check this out:

SELECT
  p.Person,
  p.`Group`,
  p.Age
FROM
  people p
  INNER JOIN
  (
    SELECT MAX(Age) AS Age, `Group` FROM people GROUP BY `Group`
    UNION
    SELECT MAX(p3.Age) AS Age, p3.`Group` FROM people p3 INNER JOIN (SELECT MAX(Age) AS Age, `Group` FROM people GROUP BY `Group`) p4 ON p3.Age < p4.Age AND p3.`Group` = p4.`Group` GROUP BY `Group`
  ) p2 ON p.Age = p2.Age AND p.`Group` = p2.`Group`
ORDER BY
  `Group`,
  Age DESC,
  Person;

SQL Fiddle: http://sqlfiddle.com/#!2/cdbb6/15


Man, others found much simpler solutions...I just spent like 15 minutes on this and was incredibly proud of myself for coming up with such a complicated solution, too. That sucks.
I had to find an internal version number that was 1 less than the current - this gave me the answer to do this: max(internal_version - 1) - so stress less :)
R
Roman Marusyk

In SQL Server row_numer() is a powerful function that can get result easily as below

select Person,[group],age
from
(
select * ,row_number() over(partition by [group] order by age desc) rn
from mytable
) t
where rn <= 2

With 8.0 and 10.2 being GA, this answer is becoming reasonable.
@RickJames what does 'being GA' means? Window functions (dev.mysql.com/doc/refman/8.0/en/window-functions.html) solved my problem very well.
@iedmrc - "GA" means "Generally Available". It is tech-speak for "ready for prime time", or "released". They are through developing the version and will be focusing on bug that they missed. That link discusses MySQL 8.0's implementation, which might be different than MariaDB 10.2's implementation.
S
Stephen Rauch

I wanted to share this because I spent a long time searching for an easy way to implement this in a java program I'm working on. This doesn't quite give the output you're looking for but its close. The function in mysql called GROUP_CONCAT() worked really well for specifying how many results to return in each group. Using LIMIT or any of the other fancy ways of trying to do this with COUNT didn't work for me. So if you're willing to accept a modified output, its a great solution. Lets say I have a table called 'student' with student ids, their gender, and gpa. Lets say I want to top 5 gpas for each gender. Then I can write the query like this

SELECT sex, SUBSTRING_INDEX(GROUP_CONCAT(cast(gpa AS char ) ORDER BY gpa desc), ',',5) 
AS subcategories FROM student GROUP BY sex;

Note that the parameter '5' tells it how many entries to concatenate into each row

And the output would look something like

+--------+----------------+
| Male   | 4,4,4,4,3.9    |
| Female | 4,4,3.9,3.9,3.8|
+--------+----------------+

You can also change the ORDER BY variable and order them a different way. So if I had the student's age I could replace the 'gpa desc' with 'age desc' and it will work! You can also add variables to the group by statement to get more columns in the output. So this is just a way I found that is pretty flexible and works good if you are ok with just listing results.


k
kovac

There is a really nice answer to this problem at MySQL - How To Get Top N Rows per Each Group

Based on the solution in the referenced link, your query would be like:

SELECT Person, Group, Age
   FROM
     (SELECT Person, Group, Age, 
                  @group_rank := IF(@group = Group, @group_rank + 1, 1) AS group_rank,
                  @current_group := Group 
       FROM `your_table`
       ORDER BY Group, Age DESC
     ) ranked
   WHERE group_rank <= `n`
   ORDER BY Group, Age DESC;

where n is the top n and your_table is the name of your table.

I think the explanation in the reference is really clear. For quick reference I will copy and paste it here:

Currently MySQL does not support ROW_NUMBER() function that can assign a sequence number within a group, but as a workaround we can use MySQL session variables. These variables do not require declaration, and can be used in a query to do calculations and to store intermediate results. @current_country := country This code is executed for each row and stores the value of country column to @current_country variable. @country_rank := IF(@current_country = country, @country_rank + 1, 1) In this code, if @current_country is the same we increment rank, otherwise set it to 1. For the first row @current_country is NULL, so rank is also set to 1. For correct ranking, we need to have ORDER BY country, population DESC


Well, it is the principle used by solutions of Marc Byers, Rick James and mine.
Difficult to say which post (Stack Overflow or SQLlines) was the first
@LaurentPELE - Mine was posted Feb, 2015. I see no timestamp or name on SQLlines. MySQL blogs have been around for long enough that some of them are out of date, and should be removed -- people are quoting mis-information.
H
Hiren
WITH cte_window AS (
SELECT movie_name,director_id,release_date,
ROW_NUMBER() OVER( PARTITION BY director_id ORDER BY release_date DESC) r
FROM movies
)   
SELECT * FROM cte_window WHERE r <= <n>;

Above query will returns latest n movies for each directors.


取一个好的名字
SELECT
p1.Person,
p1.`GROUP`,
p1.Age  
   FROM
person AS p1 
 WHERE
(
SELECT
    COUNT( DISTINCT ( p2.age ) ) 
FROM
    person AS p2 
WHERE
    p2.`GROUP` = p1.`GROUP` 
    AND p2.Age >= p1.Age 
) < 2 
ORDER BY
p1.`GROUP` ASC,
p1.age DESC

reference leetcode