SQL: How to find duplicates? | 3 Simple Ways
Quick Example: Find Duplicates
A simple SQL query example to find duplicates:
SELECT email, COUNT() FROM customers
GROUP BY email HAVING COUNT() > 1;
SQL: How to find duplicates?
In SQL Server, there are 3 main ways to find duplicates:
1.Use GROUP BY
To find duplicates using the GROUP BY method in SQL:
- Select the columns that you want to check for duplicates.
- Use the GROUP BY clause to group the data by those columns.
- Use the HAVING clause to filter the results to show only the groups that have more than one entry.
- Full example:
sql
SELECT email, COUNT() FROM customers GROUP BY email HAVING COUNT() > 1;
Example
Let's say you have a table called "customers" with columns "name" and "email":
id | name | |
---|---|---|
1 | John Smith | john.smith@example.com |
2 | Jane Doe | jane.doe@example.com |
3 | J Doe | jane.doe@example.com |
4 | Bob Johnson | bob.johnson@example.com |
To find duplicate email addresses, you could use the following SQL query:
SELECT email, COUNT() FROM customers GROUP BY email HAVING COUNT() > 1;
This query will group the customers by email address and count the number of entries in each group. It will then filter the results and show only the groups with more than one entry, which are duplicate email addresses.
Here's the result:
COUNT() | |
---|---|
| 2 |
2.Use ROW_NUMBER() and a Common Table Expression (CTE)
- Partition the data based on the uniqueness of a search field.
- Use the ROW_NUMBER function. Duplicates will have a row number greater than one.
- Use a CTE (common table expression) block to select only the duplicated records.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY name DESC) AS rn
FROM customers
)
SELECT * FROM cte
WHERE rn > 1;
3.Use LAG() and a CASE statement
The third method uses an analytical function:
- Order records by the search field.
- Use the LAG function to get the previous value of a search field.
- Use the OVER keyword to apply lag on the search field.
- Checks if the current value matches the previous value. If it does, it is considered a duplicate.
LAG is an analytical function that accesses the value stored in a row before the current row.
- With LAG, you must specify ORDER BY in the OVER clause.
Example
Query:
SELECT email,
CASE
WHEN email = LAG(email) OVER (ORDER BY email) THEN 'Yes'
ELSE 'No'
END AS duplicate_value
FROM customers
ORDER BY email;
Result:
duplicate_value | |
---|---|
bob.johnson@example.com |
No |
jane.doe@example.com |
No |
jane.doe@example.com |
Yes |
john.smith@example.com |
No |
Josip Miskovic is a software developer at Americaneagle.com. Josip has 10+ years in experience in developing web applications, mobile apps, and games.
Read more posts →I've used these principles to increase my earnings by 63% in two years. So can you.
Dive into my 7 actionable steps to elevate your career.