Eliminating Duplicate Emails Efficiently: A Comprehensive Guide to Data Cleanup Using Pandas and SQL
The Problem
With this article, I will be covering the delete duplicate emails Leetcode problem.
Leetcode describes this problem as easy. That's a super reasonable evaluation as the solution requires only basic SQL or Pandas knowledge.
The problem description is as follows:
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest
id
.For SQL users, please note that you are supposed to write a
DELETE
statement and not aSELECT
one.For Pandas users, please note that you are supposed to modify
Person
in place.After running your script, the answer shown is the
Person
table. The driver will first compile and run your piece of code and then show thePerson
table. The final order of thePerson
table does not matter.+-------------+---------+ | Column Name | Type | +-------------+---------+ | id | int | | email | varchar | +-------------+---------+ id is the primary key (column with unique values) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
The Solution
Pandas
Pandas is a great Python tool for data analysis and manipulation. Built into that library is the drop duplicates function. Please note that the problem statement asks us to do this in place.
Using the Pandas library this can be achieved by first in-place sorting by the id
field and then dropping the duplicates from email
. We want to keep at least the first duplicated element.
def delete_duplicate_emails(person: pd.DataFrame) -> None:
person.sort_values(by='id', inplace=True)
person.drop_duplicates(subset='email', keep='first', inplace=True)
Voila! I love these simple questions.
SQL
In SQL we want to run a delete query. We will grab two copies of the person
table and match them based on the email
. To keep at least one of the solutions we only delete the entry with the higher id
value. This keeps the email
associated with the smallest id
.
DELETE p1
FROM person p1, person p2
WHERE p1.email = p2.email AND p1.id > p2.id;
As with many problems, there are multiple ways to solve them. These Pandas and SQL solutions are but one way of approaching the delete duplicate question.
Originally published at https://blog.seancoughlin.me.
Top comments (0)