Duplicates in a database can cause data inconsistencies and incorrect analysis. SQL provides several methods to detect, remove, and manage duplicates effectively.
SELECT *
FROM employees
WHERE id IN (
SELECT id
FROM employees
GROUP BY id, name, age, department, salary
HAVING COUNT(*) > 1
);
SELECT name, COUNT(*) AS count
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;
SELECT *
FROM employees e1
WHERE EXISTS (
SELECT 1
FROM employees e2
WHERE e1.name = e2.name AND e1.department = e2.department
AND e1.id > e2.id
);
Explanation:
- EXISTS checks if duplicate rows exist.
- Uses comparison on specific columns (name and department).
- Ignores duplicates based on the id.
DELETE FROM employees
WHERE id NOT IN (
SELECT MIN(id)
FROM employees
GROUP BY name, age, department, salary
);
Explanation:
- GROUP BY groups data based on the specified columns.
- MIN(id) keeps the first occurrence and deletes others.
DELETE e1
FROM employees e1
JOIN employees e2
ON e1.name = e2.name AND e1.department = e2.department
WHERE e1.id > e2.id;
Explanation:
- Joins the table with itself to compare rows.
- Deletes rows with higher IDs (duplicates).
DELETE FROM employees
WHERE ROWID NOT IN (
SELECT MIN(ROWID)
FROM employees
GROUP BY name, department
);
Note: ROWID is database-specific and available in some SQL engines like Oracle. For others, use ID or ROW_NUMBER().
ALTER TABLE employees
ADD CONSTRAINT unique_employee UNIQUE(name, department);
Explanation:
- Prevents insertion of duplicate rows with the same name and department.
SELECT *,
ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
FROM employees;
Explanation:
- Assigns a row number for each duplicate group.
- Rows with row_num > 1 are duplicates.
DELETE FROM employees
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
FROM employees
) subquery
WHERE row_num > 1
);
Explanation:
- Keeps the first occurrence (row_num = 1) and deletes the rest.
ALTER TABLE employees ADD COLUMN is_duplicate BOOLEAN DEFAULT FALSE;
UPDATE employees
SET is_duplicate = TRUE
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
FROM employees
) subquery
WHERE row_num > 1
);
Explanation:
- Adds a new column (is_duplicate) to mark duplicates instead of deleting them.
- Useful for auditing or future cleanup.
SELECT DISTINCT name, department
FROM employees;
CREATE TABLE unique_employees AS
SELECT DISTINCT *
FROM employees;
- Backup Before Deletion: Always create a backup before removing duplicates to avoid accidental data loss.
- Check Primary Keys: Ensure primary keys are properly defined to prevent duplicates during data insertion.
- Normalize Data: Structure data to avoid redundancy by following database normalization rules.
- Use Indexes: Add unique indexes to enforce uniqueness constraints.
- Audit Data Inserts: Track inserts and monitor logs to avoid accidental duplicate entries.
CompiledByUdithaWICK