
How can I remove duplicate rows?

What is the best way to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows)?

The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
Quick tip for PostgreSQL users reading this (lots, going by how often it's linked to): Pg doesn't expose CTE terms as updatable views so you can't DELETE FROM a CTE term directly. See stackoverflow.com/q/18439054/398670
@CraigRinger the same is true for Sybase - I have collected the remaining solutions here (should be valid for PG and others, too): stackoverflow.com/q/19544489/1855801 (just replace the ROWID() function by the RowID column, if any)
Just to add a caveat here. When running any de-duplication process, always double check what you are deleting first! This is one of those areas where it is very common to accidentally delete good data.
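The caveat above is worth making concrete: the safest workflow is to run the delete condition as a SELECT first and inspect what comes back. A minimal sketch of that, using SQLite via Python as a stand-in for SQL Server (table and column names here are invented for the demo):

```python
# A minimal sketch of "check before you delete": run the DELETE's
# predicate as a SELECT first, inspect it, then delete with the same predicate.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (row_id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 INTEGER)")
conn.executemany("INSERT INTO mytable (col1, col2, col3) VALUES (?, ?, ?)",
                 [("a", "x", 1), ("a", "x", 1), ("b", "y", 2)])

# Preview: exactly which rows would the de-duplication remove?
predicate = "row_id NOT IN (SELECT MIN(row_id) FROM mytable GROUP BY col1, col2, col3)"
doomed = conn.execute("SELECT row_id FROM mytable WHERE " + predicate).fetchall()
print(doomed)  # [(2,)] -- only the second copy of ('a', 'x', 1)

# Only after inspecting the preview, run the DELETE with the same predicate.
conn.execute("DELETE FROM mytable WHERE " + predicate)
conn.commit()
```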

Srini

Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:

DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3 
   FROM MyTable 
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))

Would this work as well? DELETE FROM MyTable WHERE RowId NOT IN (SELECT MIN(RowId) FROM MyTable GROUP BY Col1, Col2, Col3);
@Andriy - In SQL Server LEFT JOIN is less efficient than NOT EXISTS sqlinthewild.co.za/index.php/2010/03/23/… The same site also compares NOT IN vs NOT EXISTS. sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in Out of the 3 I think NOT EXISTS performs best. All three will generate a plan with a self join though that can be avoided.
@Martin, @Georg: So, I've made a small test. A big table was created and populated as described here: sqlinthewild.co.za/index.php/2010/03/23/… Two SELECTs then were produced, one using the LEFT JOIN + WHERE IS NULL technique, the other using the NOT IN one. Then I proceeded with the execution plans, and guess what? The query costs were 18% for LEFT JOIN against 82% for NOT IN, a big surprise to me. I might have done something I shouldn't have or vice versa, which, if true, I would really like to know.
@GeorgSchölly has provided an elegant answer. I've used it on a table where a PHP bug of mine created duplicate rows.
Sorry, but why is DELETE MyTable FROM MyTable correct syntax? I don't see putting the table name right after DELETE as an option in the documentation here. Sorry if this is obvious to others; I'm a newbie to SQL just trying to learn. More important than why it works: what is the difference between including the name of the table there or not?
Callum Watkins

Another possible way of doing this is

--Ensure that any immediately preceding statement is terminated with a semicolon
;WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;

I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.

To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC

Execution Plans

The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.

https://i.stack.imgur.com/ZJiWF.jpg

This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.

The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.

https://i.stack.imgur.com/iUlWm.jpg

Factors which might favour the hash aggregate approach would be

No useful index on the partitioning columns

Relatively few groups with relatively many duplicates in each group

In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
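The copy-out/TRUNCATE/copy-back strategy described above can be sketched as follows. This is a hedged illustration using SQLite via Python rather than SQL Server (TRUNCATE becomes a plain DELETE here, and the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (col1 TEXT, col2 TEXT)")
# Few groups, many duplicates in each -- the extreme case described above.
conn.executemany("INSERT INTO big VALUES (?, ?)",
                 [("a", "x")] * 5 + [("b", "y")] * 3)

with conn:  # one transaction for the whole swap
    conn.execute("CREATE TABLE keep AS SELECT DISTINCT col1, col2 FROM big")
    conn.execute("DELETE FROM big")            # SQL Server: TRUNCATE TABLE big
    conn.execute("INSERT INTO big SELECT * FROM keep")
    conn.execute("DROP TABLE keep")
```

Since only the rows to keep are written out and copied back, the work done scales with the number of surviving rows rather than the (much larger) number of deletions.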


If I may add: The accepted answer doesn't work with tables that uses uniqueidentifier. This one is much simpler and works perfectly on any table. Thanks Martin.
This is such an awesome answer! It worked even when I had removed the old PK before I realised there were duplicates. +100
I suggest asking and then answering this question (with this answer) on DBA.SE. Then we can add it to our list of canonical answers.
Unlike the accepted answer, this also worked on a table that had no key (RowId) to compare on.
On the other hand, this one doesn't work on all SQL Server versions.
Ivan Yurchenko

There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.

I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField 
AND dupes.secondDupField = fullTable.secondDupField 
AND dupes.uniqueField > fullTable.uniqueField

Perfect! I found this to be the most efficient way to remove duplicate rows on my old MariaDB version 10.1.xx. Thank you!
Much simpler and easier to understand!
I have one doubt: in your SQL query, why are you not using the 'FROM' keyword after 'DELETE'? I have seen FROM in many other solutions.
Martin Smith

The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/
                  HAVING MAX(ID) IS NOT NULL) 

The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC 

MySQL error with the first script 'You can't specify target table 'TableName' for update in FROM clause'
Apart from the error D.Rosado already reported, your first query is also very slow. The corresponding SELECT query took on my setup +- 20 times longer than the accepted answer.
@parvus - The question is tagged SQL Server not MySQL. The syntax is fine in SQL Server. Also MySQL is notoriously bad at optimising sub queries see for example here. This answer is fine in SQL Server. In fact NOT IN often performs better than OUTER JOIN ... NULL. I would add a HAVING MAX(ID) IS NOT NULL to the query though even though semantically it ought not be necessary as that can improve the plan example of that here
Works great in PostgreSQL 8.4.
Chloe
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid

Postgres:

delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid

Why post a Postgres solution on a SQL Server question?
@Lankymart Because postgres users are coming here too. Look at the score of this answer.
I've seen this in some popular SQL questions, as in here, here and here. The OP got his answer and everyone else got some help too. No problem IMHO.
In one query you are using 'FROM' after DELETE and in the other you're not. What's the logic?
Jithin Shaji
DELETE LU 
FROM   (SELECT *, 
               Row_number() 
                 OVER ( 
                   partition BY col1, col2, col3 
                   ORDER BY rowid DESC) [Row] 
        FROM   mytable) LU 
WHERE  [row] > 1 

I get this message on azure SQL DW: A FROM clause is currently not supported in a DELETE statement.
Faisal

This will delete duplicate rows, keeping only the first row of each group:

DELETE
FROM
    Mytable
WHERE
    RowID NOT IN (
        SELECT
            MIN(RowID)
        FROM
            Mytable
        GROUP BY
            Col1,
            Col2,
            Col3
    )

Reference: http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server


For mysql it will give error: Error Code: 1093. You can't specify target table 'Mytable' for update in FROM clause. but this small change will work for mysql: DELETE FROM Mytable WHERE RowID NOT IN ( SELECT ID FROM (SELECT MIN(RowID) AS ID FROM Mytable GROUP BY Col1,Col2,Col3) AS TEMP)
Shamseer K

I would prefer a CTE for deleting duplicate rows from a SQL Server table.

I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

by keeping original

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

without keeping original

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

In one query you are using 'FROM' after DELETE and in another 'FROM' is not there. What is this? I am confused.
Factor Mystic

To Fetch Duplicate Rows:

SELECT
name, email, COUNT(*)
FROM 
users
GROUP BY
name, email
HAVING COUNT(*) > 1

To Delete the Duplicate Rows:

DELETE users 
WHERE rowid NOT IN 
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);      

For MySQL users, note that, first of all, it has to be DELETE FROM; second, it won't work anyway, because you can't SELECT from the same table you're DELETEing from. In MySQL this raises error 1093.
I think this is much more reasonable than the rather esoteric accepted answer using DELETE FROM ... LEFT OUTER JOIN, which also does not work on some systems (e.g. SQL Server). If you run into the limitation stated above, you can always save the results of your selection into a temporary TABLE variable: DECLARE @idsToKeep TABLE(rowid INT); and then INSERT INTO @idsToKeep(rowid) SELECT MIN... GROUP BY ... followed by DELETE users WHERE rowid NOT IN (SELECT rowid FROM @idsToKeep);
JuanJo

Quick and dirty way to delete exact duplicate rows (for small tables):

select  distinct * into t2 from t1;
delete from t1;
insert into t1 select *  from t2;
drop table t2;

Note that the question actually specifies non-exact duplication (due to the RowID identity column).
You also have to deal with identity (key) columns using SET IDENTITY_INSERT t1 ON.
James Errico

I prefer the subquery/HAVING COUNT(*) > 1 solution to the inner join because I found it easier to read, and it was very easy to turn into a SELECT statement to verify what would be deleted before running it.

--DELETE FROM table1 
--WHERE id IN ( 
     SELECT MIN(id) FROM table1 
     GROUP BY col1, col2, col3 
     -- could add a WHERE clause here to further filter
     HAVING count(*) > 1
--)

Doesn't it delete all the records that show up in the inner query? We need to remove only duplicates and preserve the original.
You're only returning the one with the lowest id, based on the min(id) in the select clause.
Uncomment the first, second, and last lines of the query.
This won't clean up all duplicates. If you have 3 rows that are duplicates, it will only select the row with the MIN(id), and delete that one, leaving two rows left that are duplicates.
Nevertheless, I ended up using this statement repeated over & over again, so that it would actually make progress instead of having the connection timing out or the computer go to sleep. I changed it to MAX(id) to eliminate the latter duplicates, and added LIMIT 1000000 to the inner query so it wouldn't have to scan the whole table. This showed progress much quicker than the other answers, which would seem to hang for hours. After the table was pruned to a manageable size, then you can finish with the other queries. Tip: make sure col1/col2/col3 has indices for group by.
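The batching idea in the comment above can be sketched as a loop that removes a bounded chunk of duplicates per statement until none remain. This is an assumption-laden illustration in Python/SQLite (the BATCH size and names are invented), not the commenter's exact MariaDB query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.executemany("INSERT INTO t (col1) VALUES (?)", [("a",)] * 10 + [("b",)] * 5)

BATCH = 4  # cap the rows considered per statement so each DELETE stays short
while True:
    cur = conn.execute(
        "DELETE FROM t WHERE id IN ("
        "  SELECT MAX(id) FROM t GROUP BY col1 HAVING COUNT(*) > 1 LIMIT ?)",
        (BATCH,))
    conn.commit()
    if cur.rowcount == 0:  # no duplicates left to remove
        break
```

Committing between iterations keeps transactions small, which is what avoids the timeouts the commenter describes on very large tables.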
Himanshu
SELECT  DISTINCT *
      INTO tempdb.dbo.tmpTable
FROM myTable

TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable

Truncating won't work if you have foreign key references to myTable.
Ruben Verschueren

I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another DB).

begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2

-- insert distinct values into temp
insert into #temp 
select distinct * 
from  tableName

-- delete from source
delete from tableName 

-- insert into source from temp
insert into tableName 
select * 
from #temp

rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!

PS: when working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...


Faisal

This query showed very good performance for me:

DELETE tbl
FROM
    MyTable tbl
WHERE
    EXISTS (
        SELECT
            *
        FROM
            MyTable tbl2
        WHERE
            tbl2.SameValue = tbl.SameValue
        AND tbl.IdUniqueValue < tbl2.IdUniqueValue
    )

It deleted 1M rows in little more than 30 seconds from a table of 2M rows (50% duplicates).


Ostati

Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:

;with cte as (
    select 
        min(PrimaryKey) as PrimaryKey,
        UniqueColumn1,
        UniqueColumn2
    from dbo.DuplicatesTable 
    group by
        UniqueColumn1, UniqueColumn2
    having count(*) > 1
)
delete d
from dbo.DuplicatesTable d 
inner join cte on 
    d.PrimaryKey > cte.PrimaryKey and
    d.UniqueColumn1 = cte.UniqueColumn1 and 
    d.UniqueColumn2 = cte.UniqueColumn2;

I think you're missing an AND in your JOIN.
Jeff Davis

Yet another easy solution can be found at the link pasted here. This one is easy to grasp and seems to be effective for most similar problems. It is for SQL Server, but the concept used is more than acceptable.

Here are the relevant portions from the linked page:

Consider this data:

EMPLOYEE_ID ATTENDANCE_DATE
A001    2011-01-01
A001    2011-01-01
A002    2011-01-01
A002    2011-01-01
A002    2011-01-01
A003    2011-01-01

So how can we delete those duplicate data?

First, insert an identity column in that table by using the following code:

ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)  

Use the following code to resolve it:

DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
    FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE) 

"Easy to grasp", "seems to be effective", but not a word about what the method consists in. Just imagine that the link becomes invalid, what use would then be to know that the method was easy to grasp and effective? Please consider adding essential parts of the method's description into your post, otherwise this is not an answer.
This method is useful for tables where you don't yet have an identity defined. Often you need to get rid of duplicates in order to define the primary key!
@JeffDavis - The ROW_NUMBER version works fine for that case without needing to go to the lengths of adding a new column before you begin.
Haris N I

Use this

WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
   As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1

Penny Liu

This is the easiest way to delete duplicate records:

 DELETE FROM tblemp WHERE id IN 
 (
  SELECT MIN(id) FROM tblemp
   GROUP BY  title HAVING COUNT(id)>1
 )

Why is anyone upvoting this? If you have more than two of the same id this WON'T work. Instead write: delete from tblemp where id not in (select min(id) from tblemp group by title)
Craig

Here is another good article on removing duplicates.

It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."

It covers the temp table solution and two MySQL examples.

In the future, are you going to prevent duplicates at the database level, or from an application perspective? I would suggest the database level because your database should be responsible for maintaining referential integrity; developers just will cause problems ;)


SQL is based on multi-sets. But even if it were based on sets, these two tuples (1, a) and (2, a) are different.
codegoalie

I had a table where I needed to preserve non-duplicate rows. I'm not sure about the speed or efficiency.

DELETE FROM myTable WHERE RowID IN (
  SELECT MIN(RowID) AS IDNo FROM myTable
  GROUP BY Col1, Col2, Col3
  HAVING COUNT(*) = 2 )

This assumes that there is at most 1 duplicate.
Why not HAVING COUNT(*) > 1?
Jacob Proffitt

Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:

DELETE FROM MyTable WHERE NOT RowID IN
    (SELECT 
        (SELECT TOP 1 RowID FROM MyTable mt2 
        WHERE mt2.Col1 = mt.Col1 
        AND mt2.Col2 = mt.Col2 
        AND mt2.Col3 = mt.Col3) 
    FROM MyTable mt)

Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.


Ismail Yavuz

Another way is to create a new table with the same fields and a unique index, then move all data from the old table to the new table. SQL Server automatically ignores duplicate values (there is also an option for what to do when there is a duplicate value: ignore, interrupt, etc.), so we end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.

Especially for larger tables, you may use DTS (an SSIS package to import/export data) to transfer all data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.
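The unique-index transfer idea can be sketched as follows. This is a hedged Python/SQLite illustration where INSERT OR IGNORE plays the role that a unique index with the IGNORE_DUP_KEY option plays in SQL Server (all table and column names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_t (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO old_t VALUES (?, ?)",
                 [("a", "x"), ("a", "x"), ("b", "y")])

# New table with a unique index over the de-duplication columns.
conn.execute("CREATE TABLE new_t (col1 TEXT, col2 TEXT)")
conn.execute("CREATE UNIQUE INDEX ux_new_t ON new_t (col1, col2)")

# Duplicate rows are silently skipped during the transfer.
conn.execute("INSERT OR IGNORE INTO new_t SELECT * FROM old_t")
conn.commit()
```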


a_m0d

Using the query below, we can delete duplicate records based on a single column or multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno and empname.

DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) 
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno) 
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)

shA.t

Create a new blank table with the same structure. Then execute a query like this:

INSERT INTO tc_category1 SELECT * FROM tc_category GROUP BY category_id, application_id HAVING count(*) > 1

Then execute this query:

INSERT INTO tc_category1 SELECT * FROM tc_category GROUP BY category_id, application_id HAVING count(*) = 1


yuvi

Another way of doing this:

DELETE A
FROM   TABLE A,
       TABLE B
WHERE  A.COL1 = B.COL1
       AND A.COL2 = B.COL2
       AND A.UNIQUEFIELD > B.UNIQUEFIELD 

What's different to this existing answer from Aug 20 2008? - stackoverflow.com/a/18934/692942
Evgueny Sedov

I would mention this approach as well since it can be helpful, and it works in all SQL Server versions. Pretty often there are only one or two duplicates, and the IDs and count of duplicates are known. In this case:

SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0

Gidil

From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.

I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:

-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism 
ON stories 
after INSERT, UPDATE 
AS 
    DECLARE @cnt AS INT 

    SELECT @cnt = Count(*) 
    FROM   stories 
           INNER JOIN inserted 
                   ON ( stories.story = inserted.story 
                        AND stories.story_id != inserted.story_id ) 

    IF @cnt > 0 
      BEGIN 
          RAISERROR('plagiarism detected',16,1) 

          ROLLBACK TRANSACTION 
      END 

Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?


Faisal
DELETE
FROM
    table_name T1
WHERE
    rowid > (
        SELECT
            min(rowid)
        FROM
            table_name T2
        WHERE
            T1.column_name = T2.column_name
    );

Hi Teena, you have missed the table alias name T1 after the DELETE keyword; otherwise it will throw a syntax exception.
AnandPhadke
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)

INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)

--SELECT * FROM car

;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)

DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)

Lauri Lubi

If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/

with MYCTE as (
  SELECT ROW_NUMBER() OVER (
    PARTITION BY DuplicateKey1
                ,DuplicateKey2 -- optional
    ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
  ) RN
  FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

DELETE u1 FROM users u1 JOIN users u2 WHERE u1.id > u2.id AND u1.email=u2.email