# **Handling Duplicates in SQL**

Duplicate rows in a database can cause **data inconsistencies** and **incorrect analysis**. SQL provides several methods to **detect, remove, and manage duplicates**.

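The queries in this guide run against an `employees` table with the columns `id`, `name`, `age`, `department`, and `salary`. The table definition is not shown in the guide, so the following is only a sketch for trying the queries out: the column types and sample rows are assumptions, and multi-row `INSERT ... VALUES` is widely supported but not by every engine.

```sql
-- Hypothetical schema: column names come from the queries in this guide;
-- the types and the sample rows are assumptions for illustration only.
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    name       VARCHAR(100),
    age        INTEGER,
    department VARCHAR(100),
    salary     DECIMAL(10, 2)
);

INSERT INTO employees (id, name, age, department, salary) VALUES
    (1, 'Alice', 30, 'Sales',       50000.00),
    (2, 'Alice', 30, 'Sales',       50000.00),  -- same data as id 1
    (3, 'Bob',   45, 'Engineering', 80000.00),
    (4, 'Bob',   46, 'Engineering', 82000.00),  -- same name and department as id 3
    (5, 'Carol', 28, 'Marketing',   45000.00);
```
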
---

## **1. Detecting Duplicates**

### **a. Find All Duplicate Rows**
```sql
-- Group by the data columns only. Including id (normally unique) in the
-- GROUP BY would make every group a single row, so nothing would match.
SELECT e.*
FROM employees e
JOIN (
    SELECT name, age, department, salary
    FROM employees
    GROUP BY name, age, department, salary
    HAVING COUNT(*) > 1
) d
  ON  e.name = d.name
  AND e.age = d.age
  AND e.department = d.department
  AND e.salary = d.salary;
```

### **b. Count Duplicates Based on Specific Columns**
```sql
SELECT name, COUNT(*) AS duplicate_count
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;
```

### **c. View Duplicate Rows with All Details**
```sql
SELECT *
FROM employees e1
WHERE EXISTS (
    SELECT 1
    FROM employees e2
    WHERE e1.name = e2.name AND e1.department = e2.department
    AND e1.id > e2.id
);
```
**Explanation:**
- **EXISTS** returns a row when another row with the same **name** and **department** exists.
- The condition **e1.id > e2.id** requires the matching row to have a lower **id**, so the row with the lowest **id** in each group is treated as the original and is not returned.
- The result therefore contains only the extra copies, with all of their columns.

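To list every row in a duplicate group, including the one with the lowest `id`, the comparison can be relaxed from `>` to `<>`. A minimal sketch, assuming the same `employees` table and that `id` is unique:

```sql
-- Returns all rows that share name and department with at least one other row,
-- including the row with the lowest id.
SELECT *
FROM employees e1
WHERE EXISTS (
    SELECT 1
    FROM employees e2
    WHERE e1.name = e2.name
      AND e1.department = e2.department
      AND e1.id <> e2.id
)
ORDER BY e1.name, e1.department, e1.id;
```
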
---

## **2. Removing Duplicates**

### **a. Delete All Duplicate Rows (Keep One)**
```sql
DELETE FROM employees
WHERE id NOT IN (
    SELECT MIN(id)
    FROM employees
    GROUP BY name, age, department, salary
);
```
**Explanation:**
- **GROUP BY** groups rows that have identical values in the specified columns.
- **MIN(id)** selects the lowest **id** in each group; that row is kept and every other row in the group is deleted.
- MySQL rejects this form with error 1093 because the subquery reads from the table being modified; wrapping the subquery in a derived table avoids the error.

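For MySQL, the usual workaround is to wrap the grouping subquery in a derived table so that it is materialized before the delete runs. A sketch of that variant; the aliases `keep_id` and `keepers` are illustrative names, not part of the original query:

```sql
DELETE FROM employees
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        -- One id to keep per group of identical data columns
        SELECT MIN(id) AS keep_id
        FROM employees
        GROUP BY name, age, department, salary
    ) AS keepers
);
```
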
---

### **b. Delete Duplicate Rows with Self-Join**
```sql
DELETE e1
FROM employees e1
JOIN employees e2
ON e1.name = e2.name AND e1.department = e2.department
WHERE e1.id > e2.id;
```
**Explanation:**
- Joins the table with itself to pair rows that share the same **name** and **department**.
- Deletes every row for which a matching row with a lower **id** exists, so the lowest **id** in each group is kept.
- The **DELETE alias FROM ... JOIN** form is MySQL and SQL Server syntax; other engines need a different statement.

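PostgreSQL does not accept `DELETE e1 FROM ... JOIN`. An equivalent there, sketched under the same assumption that `id` is unique, uses the `USING` clause:

```sql
-- PostgreSQL form of the self-join delete
DELETE FROM employees e1
USING employees e2
WHERE e1.name = e2.name
  AND e1.department = e2.department
  AND e1.id > e2.id;
```
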
---

### **c. Delete Duplicates Without a Key Column (ROWID)**
```sql
DELETE FROM employees
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM employees
    GROUP BY name, department
);
```
**Note:** **ROWID** is a database-specific row identifier, available in engines such as **Oracle** and **SQLite**, and is useful when the table has no key column. In other engines, use a unique **id** column or **ROW_NUMBER()** instead.

---

## **3. Preventing Duplicates (Constraints)**

### **a. Add Unique Constraints**
```sql
ALTER TABLE employees
ADD CONSTRAINT unique_employee UNIQUE(name, department);
```
**Explanation:**
- Prevents insertion of rows that repeat an existing **name** and **department** combination.
- The statement fails if the table already contains such duplicates, so remove them first.

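With the constraint in place, a plain `INSERT` that repeats an existing name and department is rejected with an error. Some engines also let the insert skip the conflicting row instead. A sketch using the `ON CONFLICT` clause available in PostgreSQL (9.5+) and SQLite (3.24+); MySQL and SQL Server use different syntax, and the inserted values are made up:

```sql
-- Silently skips the row if (name, department) already exists.
INSERT INTO employees (id, name, age, department, salary)
VALUES (6, 'Alice', 31, 'Sales', 52000.00)
ON CONFLICT (name, department) DO NOTHING;
```
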
---

## **4. Using Window Functions to Handle Duplicates**

### **a. Find Duplicates with ROW_NUMBER()**
```sql
SELECT *,
       ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
FROM employees;
```
**Explanation:**
- Numbers the rows within each **name** and **department** group, starting at 1 in **id** order.
- Rows with **row_num > 1** are duplicates.

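The query above returns every row, and `row_num` cannot be referenced in the `WHERE` clause of the same `SELECT` because window functions are evaluated after filtering. To list only the duplicates, the numbered rows can be wrapped in a common table expression. A sketch; the CTE name `ranked` is an illustrative choice:

```sql
WITH ranked AS (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY name, department ORDER BY id) AS row_num
    FROM employees e
)
SELECT *
FROM ranked
WHERE row_num > 1;  -- keep only the extra copies
```
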
---

### **b. Delete Duplicates Using ROW_NUMBER()**
```sql
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);
```
**Explanation:**
- Keeps the first occurrence (**row_num = 1**) in each group and deletes the rest.

---

## **5. Soft Delete (Mark Duplicates Instead of Deleting)**

```sql
ALTER TABLE employees ADD COLUMN is_duplicate BOOLEAN DEFAULT FALSE;

UPDATE employees
SET is_duplicate = TRUE
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);
```
**Explanation:**
- Adds a new column (**is_duplicate**) to **mark duplicates** instead of deleting them.
- Useful for auditing or future cleanup.
- Queries that need clean data must then filter on **is_duplicate**.
- **BOOLEAN** is not available in every engine; SQL Server, for example, uses **BIT**.

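Once rows are flagged, reads have to exclude them. One way, sketched here with a hypothetical view name `employees_deduplicated`, is to hide the filter behind a view so that callers do not need to remember it:

```sql
-- View that exposes only the rows not flagged as duplicates
CREATE VIEW employees_deduplicated AS
SELECT id, name, age, department, salary
FROM employees
WHERE is_duplicate = FALSE;

-- Flagged rows remain available for auditing
SELECT *
FROM employees
WHERE is_duplicate = TRUE;
```
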
---

## **6. Deduplication with SELECT DISTINCT**

### **a. Select Unique Rows**
```sql
SELECT DISTINCT name, department
FROM employees;
```

### **b. Insert Unique Records into a New Table**
```sql
CREATE TABLE unique_employees AS
SELECT DISTINCT *
FROM employees;
```
**Note:** With all columns selected, **DISTINCT** removes only rows that are identical in every column, including **id**. If **id** is unique, no rows are removed.

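When `id` is unique, a copy without duplicates has to compare only the data columns. A sketch that keeps the lowest-id row of each group; the table name `deduplicated_employees` is hypothetical, and `CREATE TABLE ... AS SELECT` is not available in every engine (SQL Server uses `SELECT ... INTO`):

```sql
CREATE TABLE deduplicated_employees AS
SELECT *
FROM employees
WHERE id IN (
    -- Lowest id per group of identical data columns
    SELECT MIN(id)
    FROM employees
    GROUP BY name, age, department, salary
);
```
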
---

## **7. Tips for Handling Duplicates**

1. **Backup Before Deletion:** Always create a backup before removing duplicates to avoid accidental data loss.
2. **Check Primary Keys:** Ensure every table has a primary key so that each row can be identified and duplicate keys are rejected on insert.
3. **Normalize Data:** Structure data to avoid redundancy by following database normalization rules.
4. **Use Indexes:** Add unique indexes to enforce uniqueness on the columns that define a duplicate.
5. **Audit Data Inserts:** Track inserts and monitor logs to catch accidental duplicate entries early.

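The backup and unique-index tips can be combined into one cleanup sequence, sketched below. The names `employees_backup`, `numbered`, and `idx_employees_name_department` are illustrative, and a copy made with `CREATE TABLE ... AS SELECT` generally holds the data only, not the original table's keys or indexes.

```sql
-- 1. Copy the table before any destructive cleanup
CREATE TABLE employees_backup AS
SELECT *
FROM employees;

-- 2. Remove duplicates, keeping the lowest id per (name, department)
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) numbered
    WHERE row_num > 1
);

-- 3. Enforce uniqueness from now on
CREATE UNIQUE INDEX idx_employees_name_department
ON employees (name, department);
```
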
---