Commit 2cd90b3

Create Handling_Duplicates.md
1 parent 1ab217b commit 2cd90b3

Handling_Duplicates.md

Lines changed: 183 additions & 0 deletions
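# **Handling Duplicates in SQL**

Duplicates in a database can cause **data inconsistencies** and **incorrect analysis**. SQL provides several ways to **detect, remove, and prevent duplicates**. The examples below use an `employees` table with the columns `id`, `name`, `age`, `department`, and `salary`, and treat `id` as a unique key unless noted otherwise.

---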
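
## **1. Detecting Duplicates**

### **a. Find All Duplicate Rows**
```sql
-- Group on the non-key columns only. When id is unique, including it in
-- GROUP BY puts every row in its own group, so nothing counts as a duplicate.
SELECT e.*
FROM employees e
JOIN (
    SELECT name, age, department, salary
    FROM employees
    GROUP BY name, age, department, salary
    HAVING COUNT(*) > 1
) d
  ON e.name = d.name AND e.age = d.age
 AND e.department = d.department AND e.salary = d.salary;
```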

### **b. Count Duplicates Based on Specific Columns**
```sql
SELECT name, COUNT(*) AS duplicate_count
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;
```

### **c. View Duplicate Rows with All Details**
```sql
SELECT *
FROM employees e1
WHERE EXISTS (
    SELECT 1
    FROM employees e2
    WHERE e1.name = e2.name AND e1.department = e2.department
      AND e1.id > e2.id
);
```
**Explanation:**
- **EXISTS** returns a row when another row with the same **name** and **department** exists.
- Only those two columns are compared, so rows may differ in age or salary and still count as duplicates.
- The condition **e1.id > e2.id** excludes the lowest-id row of each group, so the query returns only the later copies, not the first occurrence.

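
The detection queries above can be tried end to end with a small script. This is a minimal sketch, assuming Python's built-in `sqlite3` module and a handful of hypothetical sample rows; the table layout follows the `employees` columns used in this document, and the data is made up for illustration.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not from the original file).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, age INTEGER,
        department TEXT, salary INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Alice", 30, "HR", 5000),
        (2, "Bob", 41, "IT", 7000),
        (3, "Alice", 30, "HR", 5000),   # copy of row 1
        (4, "Alice", 30, "HR", 5000),   # second copy of row 1
        (5, "Bob", 41, "Sales", 7000),  # same name, different department
    ],
)

# Count groups that share the same name and department.
duplicate_groups = conn.execute(
    """
    SELECT name, department, COUNT(*)
    FROM employees
    GROUP BY name, department
    HAVING COUNT(*) > 1
    """
).fetchall()

# Later copies only: a row qualifies when a matching row with a lower id exists.
later_copies = conn.execute(
    """
    SELECT id
    FROM employees e1
    WHERE EXISTS (
        SELECT 1
        FROM employees e2
        WHERE e1.name = e2.name AND e1.department = e2.department
          AND e1.id > e2.id
    )
    ORDER BY id
    """
).fetchall()

print(duplicate_groups)  # [('Alice', 'HR', 3)]
print(later_copies)      # [(3,), (4,)]
```
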
---

## **2. Removing Duplicates**

### **a. Delete All Duplicate Rows (Keep One)**
```sql
DELETE FROM employees
WHERE id NOT IN (
    SELECT MIN(id)
    FROM employees
    GROUP BY name, age, department, salary
);
```
**Explanation:**
- **GROUP BY** groups rows that share the same values in the listed columns.
- **MIN(id)** selects the lowest id in each group; every other row in the group is deleted.
- MySQL typically rejects a subquery that reads from the table being deleted from (error 1093); wrapping the subquery in a derived table avoids this.

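
To see which rows the `MIN(id)` delete keeps, the statement can be run against throwaway data. The following is a minimal sketch, assuming Python's built-in `sqlite3` module and hypothetical sample rows; the `with conn:` block is used here so the delete is rolled back if it raises.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, age INTEGER,
        department TEXT, salary INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Alice", 30, "HR", 5000),
        (2, "Bob", 41, "IT", 7000),
        (3, "Alice", 30, "HR", 5000),  # exact copy of row 1 apart from id
        (4, "Alice", 31, "HR", 5000),  # different age, so not a duplicate here
    ],
)

# The connection as a context manager commits on success, rolls back on error.
with conn:
    cursor = conn.execute(
        """
        DELETE FROM employees
        WHERE id NOT IN (
            SELECT MIN(id)
            FROM employees
            GROUP BY name, age, department, salary
        )
        """
    )
    deleted = cursor.rowcount

remaining = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]

print(deleted)    # 1
print(remaining)  # [1, 2, 4]
```
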
---

### **b. Delete Duplicate Rows with Self-Join**
```sql
DELETE e1
FROM employees e1
JOIN employees e2
  ON e1.name = e2.name AND e1.department = e2.department
WHERE e1.id > e2.id;
```
**Explanation:**
- Joins the table with itself to compare rows on **name** and **department**.
- Deletes the rows with **higher IDs**, keeping the lowest-id row in each group.
- The **DELETE e1 FROM ... JOIN** form is MySQL and SQL Server syntax; PostgreSQL uses **DELETE ... USING** instead.

---

### **c. Delete Duplicates Using ROWID**
```sql
DELETE FROM employees
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM employees
    GROUP BY name, department
);
```
**Note:** This keeps one row per **name** and **department** pair and needs no key column. **ROWID** is database-specific: engines such as **Oracle** and **SQLite** provide it. For others, use **ID** or **ROW_NUMBER()**.

---

## **3. Preventing Duplicates (Constraints)**

### **a. Add Unique Constraints**
```sql
ALTER TABLE employees
ADD CONSTRAINT unique_employee UNIQUE(name, department);
```
**Explanation:**
- Prevents insertion of rows that repeat an existing **name** and **department** pair.
- The statement fails if the table already contains such duplicates, so remove them first.

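
The effect of the constraint can be checked by attempting a duplicate insert. This is a minimal sketch, assuming Python's built-in `sqlite3` module; SQLite does not support `ALTER TABLE ... ADD CONSTRAINT`, so a unique index with the same name stands in for the constraint here, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.execute("INSERT INTO employees (name, department) VALUES ('Alice', 'HR')")

# Stand-in for ALTER TABLE ... ADD CONSTRAINT, which SQLite does not accept.
conn.execute("CREATE UNIQUE INDEX unique_employee ON employees (name, department)")

# A second row with the same name and department should be rejected.
try:
    conn.execute("INSERT INTO employees (name, department) VALUES ('Alice', 'HR')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The same name in a different department is still allowed.
conn.execute("INSERT INTO employees (name, department) VALUES ('Alice', 'IT')")
row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

print(rejected)   # True
print(row_count)  # 2
```
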
---

## **4. Using Window Functions to Handle Duplicates**

### **a. Find Duplicates with ROW_NUMBER()**
```sql
SELECT *,
       ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
FROM employees;
```
**Explanation:**
- Numbers the rows within each **name** and **department** group, starting at 1 and ordered by **id**.
- Rows with **row_num > 1** are duplicates.

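
The numbering is easiest to follow on concrete rows. Below is a minimal sketch, assuming Python's built-in `sqlite3` module linked against SQLite 3.25 or newer (the first release with window functions) and hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, "Alice", "HR"),
        (2, "Bob", "IT"),
        (3, "Alice", "HR"),  # second row in the (Alice, HR) group
        (4, "Alice", "HR"),  # third row in the (Alice, HR) group
    ],
)

# Number rows inside each (name, department) group, lowest id first.
numbered = conn.execute(
    """
    SELECT id, name, department,
           ROW_NUMBER() OVER (PARTITION BY name, department ORDER BY id) AS row_num
    FROM employees
    ORDER BY id
    """
).fetchall()

# Any row numbered above 1 is a later copy within its group.
duplicate_ids = [row[0] for row in numbered if row[3] > 1]

for row in numbered:
    print(row)
# (1, 'Alice', 'HR', 1)
# (2, 'Bob', 'IT', 1)
# (3, 'Alice', 'HR', 2)
# (4, 'Alice', 'HR', 3)
print(duplicate_ids)  # [3, 4]
```
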
---

### **b. Delete Duplicates Using ROW_NUMBER()**
```sql
DELETE FROM employees
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);
```
**Explanation:**
- Keeps the first occurrence (**row_num = 1**) in each group and deletes the rest.

---

## **5. Soft Delete (Mark Duplicates Instead of Deleting)**

```sql
ALTER TABLE employees ADD COLUMN is_duplicate BOOLEAN DEFAULT FALSE;

UPDATE employees
SET is_duplicate = TRUE
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name, department ORDER BY id) AS row_num
        FROM employees
    ) subquery
    WHERE row_num > 1
);
```
**Explanation:**
- Adds a new column (**is_duplicate**) to **mark duplicates** instead of deleting them.
- Useful for auditing or future cleanup.

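
The soft-delete steps above can be sketched as follows, assuming Python's built-in `sqlite3` module with SQLite 3.25 or newer and hypothetical sample rows. The flag is declared as an integer (0 or 1) in this sketch because SQLite stores booleans as integers; that choice is an adaptation for the demo, not part of the original statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Alice", "HR"), (4, "Alice", "HR")],
)

# Integer flag standing in for BOOLEAN DEFAULT FALSE.
conn.execute(
    "ALTER TABLE employees ADD COLUMN is_duplicate INTEGER NOT NULL DEFAULT 0"
)

# Flag every row after the first in each (name, department) group.
conn.execute(
    """
    UPDATE employees
    SET is_duplicate = 1
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY name, department ORDER BY id) AS row_num
            FROM employees
        ) AS numbered
        WHERE row_num > 1
    )
    """
)

flags = conn.execute("SELECT id, is_duplicate FROM employees ORDER BY id").fetchall()
active_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM employees WHERE is_duplicate = 0 ORDER BY id")
]

print(flags)       # [(1, 0), (2, 0), (3, 1), (4, 1)]
print(active_ids)  # [1, 2]
```
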
---

## **6. Deduplication with SELECT DISTINCT**

### **a. Select Unique Rows**
```sql
SELECT DISTINCT name, department
FROM employees;
```
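
### **b. Insert Unique Records into a New Table**
```sql
-- DISTINCT * removes only rows identical in every column, including id.
-- If id is unique, list the columns that define a duplicate instead.
CREATE TABLE unique_employees AS
SELECT DISTINCT *
FROM employees;
```
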
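
The difference between `SELECT DISTINCT *` and a `DISTINCT` over selected columns shows up as soon as the table has a unique `id`. This is a minimal sketch, assuming Python's built-in `sqlite3` module and hypothetical sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Alice", "HR")],
)

# DISTINCT * compares every column, so the unique id keeps all three rows.
distinct_all_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM employees)"
).fetchone()[0]

# Listing only the columns that define a duplicate collapses the copies.
conn.execute(
    "CREATE TABLE unique_employees AS SELECT DISTINCT name, department FROM employees"
)
unique_rows = conn.execute(
    "SELECT name, department FROM unique_employees ORDER BY name"
).fetchall()

print(distinct_all_count)  # 3
print(unique_rows)         # [('Alice', 'HR'), ('Bob', 'IT')]
```
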
---

## **7. Tips for Handling Duplicates**

1. **Backup Before Deletion:** Always create a backup before removing duplicates to avoid accidental data loss.
2. **Check Primary Keys:** Ensure every table has a primary key so that rows with the same key cannot be inserted twice.
3. **Normalize Data:** Structure data to avoid redundancy by following database normalization rules.
4. **Use Indexes:** Add unique indexes on the columns that must not repeat.
5. **Audit Data Inserts:** Track inserts and monitor logs to catch accidental duplicate entries early.

---
