How do you handle duplicate records in SQL?

Handling duplicate records in SQL involves three main strategies: preventing duplicates during insertion, identifying existing duplicates, and removing or merging them. The approach depends on whether duplicates are allowed in your data model and how you need to interact with the data.

Prevention is the first line of defense. Use database constraints like UNIQUE or PRIMARY KEY to enforce uniqueness on specific columns. For example, adding UNIQUE (email) to a table definition prevents duplicate email entries. If some repetition is expected, a composite key narrows the rule (e.g., UNIQUE (user_id, date) allows multiple entries per user but prevents the same user from having two entries on the same date). For transactional systems, use INSERT IGNORE (MySQL) or ON CONFLICT DO NOTHING (PostgreSQL) to skip duplicate inserts. If working with existing data, MERGE (e.g., in SQL Server) or upsert syntax such as INSERT ... ON CONFLICT DO UPDATE (PostgreSQL) can update existing records or insert new ones based on uniqueness checks.
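
As a minimal sketch of these options (the table layout and name column are illustrative, and the conflict clauses use PostgreSQL syntax):

-- Enforce uniqueness at the schema level
CREATE TABLE users (
    id    SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    name  VARCHAR(100),
    UNIQUE (email)
);

-- Skip the row silently if the email already exists
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice')
ON CONFLICT (email) DO NOTHING;

-- Or update the existing row instead of inserting a duplicate (upsert)
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice B.')
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;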

Identifying duplicates requires querying for rows with matching values in key columns. A common method uses GROUP BY and HAVING COUNT(*) > 1. For example, SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1 lists emails appearing more than once. Window functions like ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) can also rank duplicates, helping isolate specific instances (e.g., keeping the oldest record). Temporary tables or CTEs (Common Table Expressions) are useful for storing intermediate results during analysis.
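
As a sketch against the same hypothetical users table, a CTE with ROW_NUMBER() flags every copy beyond the oldest row per email:

-- rn = 1 is the oldest row for each email; rn > 1 marks the duplicates
WITH ranked AS (
    SELECT id, email,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
SELECT id, email
FROM ranked
WHERE rn > 1;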

Removing duplicates typically involves deleting redundant rows while preserving at least one instance. For example, a DELETE statement with a subquery can retain the lowest id in each duplicate group:

-- Keep only the row with the smallest id for each email
DELETE FROM users
WHERE id NOT IN (
  SELECT MIN(id) FROM users GROUP BY email
);
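
Note that MySQL does not allow a DELETE to select from the table it is modifying in the same statement; wrapping the subquery in a derived table is a common workaround (sketch, same assumed schema):

DELETE FROM users
WHERE id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(id) AS keep_id
        FROM users
        GROUP BY email
    ) AS keepers
);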

Alternatively, use DISTINCT in a CREATE TABLE AS SELECT statement to rebuild the table without duplicates; note that DISTINCT removes only rows that match on every selected column. For large datasets, batch processing or adding a status column to mark duplicates (e.g., “active” vs. “duplicate”) can minimize downtime. Always back up data before bulk deletions.
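
A sketch of that rebuild (the users_dedup name and the column list are illustrative):

-- Keep one copy of each distinct (email, name) pair
CREATE TABLE users_dedup AS
SELECT DISTINCT email, name
FROM users;

-- After verifying the new table, swap it in (exact syntax varies by database)
-- DROP TABLE users;
-- ALTER TABLE users_dedup RENAME TO users;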
