본문 바로가기
Issue Note

Delete Duplicate Emails [SQL]

by jhleeatl 2024. 6. 25.

 

This time, I encountered an SQL question that is supposed to be easy, but it wasn't as easy as expected. It was a problem about deleting duplicate emails. I tried to solve it using SELECT and GROUP BY, but it didn't work.

 

Question

더보기

Table: Person

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

 

Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.

For Pandas users, please note that you are supposed to modify Person in place.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

The result format is in the following example.

 

Example 1:

Input: 
Person table:
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |
+----+------------------+
Output: 
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
+----+------------------+
Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.

 


I looked up how to use the DELETE statement to solve this problem and wrote the following code:

 

DELETE FROM Person
WHERE id NOT IN (
    SELECT MIN(id)
    FROM Person
    GROUP BY email
)

 

But, I got a error message 'You can't specify target table 'Person' for update in FROM clause'

 

in order to solve this issue, I thought I should approach the problem using a subquery.

 

WITH CTE AS (
    SELECT 
        id, 
        email,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) as rn
    FROM Person
)
DELETE FROM Person
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE rn > 1
)

 

After writing it this way, I was able to solve the problem. However, I thought I needed to learn more about the DELETE statement.

 


Overview of DELETE Statement

The DELETE statement is used to remove rows from a table based on specified conditions. The basic syntax is:

sql
 
DELETE FROM table_name 
WHERE condition;

 

Examples

1. Deleting Rows Based on a Condition

Suppose we have a table Employees:

id	name	department
1	Alice	HR
2	Bob	IT
3	Charlie	IT
4	David	HR

 

To delete all employees in the HR department:

 
DELETE FROM Employees 
WHERE department = 'HR'
 

 

After running this query, the Employees table will look like:

id	name	department
2	Bob		IT
3	Charlie		IT

 

2. Deleting All Rows

 

To delete all rows from the Employees table:

DELETE FROM Employees;

 

After running this query, the Employees table will be empty.

 

3. Using Subqueries

You can use subqueries to determine which rows to delete. For example, suppose you have another table Departments:

 

id	department_name
1	HR
2	IT
3	Sales

 

 

To delete employees whose department is not in the Departments table:

DELETE FROM Employees 
WHERE department NOT IN (SELECT department_name FROM Departments)

 

 

'Issue Note' 카테고리의 다른 글

Group sold products by the date - SQL  (0) 2024.06.27
Movie rating - SQL  (0) 2024.06.12
Join And function  (0) 2024.04.19
1934. Confirmation Rate  (0) 2024.04.12
The count of divisors and addition  (0) 2024.04.09