# Handling Duplicates in SQL

Duplicates in a database can cause data inconsistencies and incorrect analysis. SQL provides several methods to detect, remove, and manage duplicates effectively.
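
The examples below assume a simple `employees` table. Here is a minimal setup to experiment with (the schema and sample data are illustrative):

```sql
CREATE TABLE employees (
    id         INT PRIMARY KEY,
    name       VARCHAR(100),
    age        INT,
    department VARCHAR(100),
    salary     DECIMAL(10, 2)
);

INSERT INTO employees (id, name, age, department, salary) VALUES
    (1, 'Alice', 30, 'Sales', 50000),
    (2, 'Bob',   35, 'IT',    60000),
    (3, 'Alice', 30, 'Sales', 50000),  -- duplicate of row 1
    (4, 'Bob',   35, 'IT',    60000);  -- duplicate of row 2
```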


## 1. Detecting Duplicates

### a. Find All Duplicate Rows

```sql
SELECT e.*
FROM employees e
JOIN (
    SELECT name, age, department, salary
    FROM employees
    GROUP BY name, age, department, salary
    HAVING COUNT(*) > 1
) dups
  ON  e.name = dups.name
  AND e.age = dups.age
  AND e.department = dups.department
  AND e.salary = dups.salary;
```

**Explanation:**

- Grouping by every column except `id` finds the groups that occur more than once; joining back to the table returns each duplicate row in full. (Including `id` in the `GROUP BY` would never flag anything, since `id` is unique.)

### b. Count Duplicates Based on Specific Columns

```sql
SELECT name, COUNT(*) AS count
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;
```
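
Counting on a single column can over-report, since two different employees may share a name. To count true duplicates, group on every column that defines one; a sketch using `name` and `department`:

```sql
SELECT name, department, COUNT(*) AS count
FROM employees
GROUP BY name, department
HAVING COUNT(*) > 1;
```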

### c. View Duplicate Rows with All Details

```sql
SELECT *
FROM employees e1
WHERE EXISTS (
    SELECT 1
    FROM employees e2
    WHERE e1.name = e2.name AND e1.department = e2.department
    AND e1.id > e2.id
);
```

**Explanation:**

- `EXISTS` checks whether a matching duplicate row exists.
- Rows are compared on specific columns (`name` and `department`).
- The condition `e1.id > e2.id` excludes the row with the lowest `id` in each group, so only the later duplicates are returned.

## 2. Removing Duplicates

### a. Delete All Duplicate Rows (Keep One)

```sql
DELETE FROM employees
WHERE id NOT IN (
    SELECT MIN(id)
    FROM employees
    GROUP BY name, age, department, salary
);
```

**Explanation:**

- `GROUP BY` groups rows on the specified columns.
- `MIN(id)` identifies the first occurrence in each group; all other rows are deleted.
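
Note that MySQL rejects a `DELETE` whose subquery reads from the table being deleted (error 1093). A common workaround is to wrap the subquery in a derived table so it is materialized first; a sketch:

```sql
DELETE FROM employees
WHERE id NOT IN (
    SELECT min_id
    FROM (
        SELECT MIN(id) AS min_id
        FROM employees
        GROUP BY name, age, department, salary
    ) keep_ids  -- the derived table forces materialization
);
```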

### b. Delete Duplicate Rows with Self-Join

```sql
DELETE e1
FROM employees e1
JOIN employees e2
ON e1.name = e2.name AND e1.department = e2.department
WHERE e1.id > e2.id;
```

**Explanation:**

- Joins the table with itself to compare rows.
- Deletes the rows with the higher IDs (the duplicates), keeping the lowest `id` in each group.
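
The multi-table `DELETE e1 FROM ... JOIN` form above is MySQL syntax. In PostgreSQL, the same self-join delete is written with `DELETE ... USING`; a minimal sketch:

```sql
DELETE FROM employees e1
USING employees e2
WHERE e1.name = e2.name
  AND e1.department = e2.department
  AND e1.id > e2.id;  -- keep the row with the lowest id
```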

### c. Delete Duplicates Using ROWID

```sql
DELETE FROM employees
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM employees
    GROUP BY name, department
);
```

**Note:** `ROWID` is database-specific and available in engines such as Oracle and SQLite. In other databases, use the primary key (`id`) or `ROW_NUMBER()` (see section 4) instead.


## 3. Preventing Duplicates (Constraints)

### a. Add Unique Constraints

```sql
ALTER TABLE employees
ADD CONSTRAINT unique_employee UNIQUE(name, department);
```

**Explanation:**

- Prevents insertion of rows with the same `name` and `department` combination.
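
With the constraint in place, inserting a duplicate raises an error. If duplicates should instead be skipped silently, both MySQL and PostgreSQL offer dialect-specific syntax; a sketch (the sample values are illustrative):

```sql
-- MySQL: skip rows that would violate the unique constraint
INSERT IGNORE INTO employees (id, name, age, department, salary)
VALUES (5, 'Alice', 30, 'Sales', 50000);

-- PostgreSQL: the same idea with ON CONFLICT
INSERT INTO employees (id, name, age, department, salary)
VALUES (5, 'Alice', 30, 'Sales', 50000)
ON CONFLICT (name, department) DO NOTHING;
```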

## 4. Using Window Functions to Handle Duplicates

### a. Find Duplicates with ROW_NUMBER()

```sql
SELECT *,
       ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
FROM employees;
```

**Explanation:**

- Assigns a sequential number to each row within its duplicate group.
- Rows with `row_num > 1` are duplicates.
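
Because window functions cannot appear in a `WHERE` clause, listing only the duplicates requires wrapping the query in a derived table (or CTE); a minimal sketch:

```sql
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
    FROM employees
) ranked
WHERE row_num > 1;  -- only the second and later rows in each group
```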

### b. Delete Duplicates Using ROW_NUMBER()

```sql
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);
```

**Explanation:**

- Keeps the first occurrence (`row_num = 1`) and deletes the rest.

## 5. Soft Delete (Mark Duplicates Instead of Deleting)

```sql
ALTER TABLE employees ADD COLUMN is_duplicate BOOLEAN DEFAULT FALSE;

UPDATE employees
SET is_duplicate = TRUE
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);
```

**Explanation:**

- Adds a new column (`is_duplicate`) to mark duplicates instead of deleting them.
- Useful for auditing or later cleanup.
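
To make everyday queries ignore the marked rows, a view can filter on the flag; `active_employees` below is a hypothetical name:

```sql
CREATE VIEW active_employees AS
SELECT *
FROM employees
WHERE is_duplicate = FALSE;
```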

## 6. Deduplication with SELECT DISTINCT

### a. Select Unique Rows

```sql
SELECT DISTINCT name, department
FROM employees;
```

### b. Insert Unique Records into a New Table

```sql
CREATE TABLE unique_employees AS
SELECT DISTINCT *
FROM employees;
```
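
To promote the deduplicated copy to the live table, the two tables can then be renamed (a sketch; `CREATE TABLE ... AS` and rename syntax vary by dialect, e.g. SQL Server uses `SELECT ... INTO` and `sp_rename`):

```sql
ALTER TABLE employees RENAME TO employees_old;     -- keep the original as a fallback
ALTER TABLE unique_employees RENAME TO employees;  -- deduplicated copy goes live
```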

## 7. Tips for Handling Duplicates

1. **Backup Before Deletion:** Always create a backup before removing duplicates to avoid accidental data loss (see the sketch after this list).
2. **Check Primary Keys:** Ensure primary keys are properly defined to prevent duplicates during data insertion.
3. **Normalize Data:** Structure data to avoid redundancy by following database normalization rules.
4. **Use Indexes:** Add unique indexes to enforce uniqueness constraints (also shown below).
5. **Audit Data Inserts:** Track inserts and monitor logs to catch accidental duplicate entries.
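
Minimal sketches for tips 1 and 4, assuming the same `employees` table; `employees_backup` and `idx_unique_employee` are hypothetical names:

```sql
-- Tip 1: snapshot the table before deleting anything
CREATE TABLE employees_backup AS
SELECT * FROM employees;

-- Tip 4: a unique index that enforces uniqueness on name + department
CREATE UNIQUE INDEX idx_unique_employee
ON employees (name, department);
```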

*Compiled by UdithaWICK*