The majority of duplicate records fall into one of two categories: Duplicate Meaning and Non-unique Keys. The How to Spot and Delete Values with Duplicate Meaning in MySQL blog dealt with Duplicate Meaning; the follow-up addressed how to identify Non-unique Keys. That’s where two records in the same table have the same key, but may or may not have different values and meanings. Today’s blog will cover how to delete rows with duplicated data, but with different keys.
Identifying Duplicates by Type
The last query presented in the How to Identify Duplicates with Non-unique Keys in MySQL blog listed all the duplicates in a format that was easy to visually scan through:
Repetitions | row_data | |
---------------------------------------------------------------- | ||
2 | 22 (DAVIS, JENNIFER) | 22 (DAVIS, JENNIFER) | |
2 | 23 (LOLLOBRIGIDA, JOHNNY) | 23 (GABLE, CHRISTIAN) | |
2 | 41 (WAHLBERG, NICK) | 12 (WAHLBERG, NICK) |
Having identified all of the duplicated keys and values, we can decide how best to deal with the redundant data.
JENNIFER DAVIS appears in two records with the same key of 22, making those rows exact duplicates. Nick Walberg’s name fields are duplicated, but the IDs are not. There is also a duplicated key that is associated with two unrelated actors: #23 for “JOHNNY LOLLOBRIGIDA” and “CHRISTIAN GABLE”. With regards to the duplicated keys of 22 and 23, the first is a true duplicate, whereas the second only needs a new key to be generated for one of the records.
Deleting Rows using DELETE JOIN
In the How to Spot and Delete Values with Duplicate Meaning in MySQL blog, we removed duplicates from SELECT result sets by performing a Search & Replace on values. Here we will permanently delete one of the duplicated rows using the DELETE JOIN statement.
Since we are comparing fields from the same table, we have to join the table to itself. We can choose to keep either the lower or higher id number by comparing the ids in the WHERE clause. The following statement keeps the highest id:
DELETE a FROM wp.amalgamated_actors a INNER JOIN wp.amalgamated_actors a2 WHERE a.id < a2.id AND a.first_name = a2.first_name AND a.last_name = a2.last_name; 1 row(s) affected 0.093 sec
In this case, the affected (deleted) row is NICK WAHLBERG with an id of 12. A quick SELECT confirms the result:
id | first_name | last_name | ------------------------------------- | |
10 | PENELOPE | GUINESS | ||
12 | NICK | WAHLBERG | ||
14 | ED | CHASE | ||
22 | JENNIFER | DAVIS | ||
23 | JOHNNY | LOLLOBRIGIDA | ||
27 | BETTE | NICHOLSON | ||
34 | GRACE | MOSTEL | ||
39 | JOE | SWANK | ||
23 | CHRISTIAN | GABLE | ||
22 | JENNIFER | DAVIS |
If you wanted to keep the lowest id, you would just change the a.id < a2.id expression to a.id > a2.id.
id | first_name | last_name | ------------------------------------- | |
10 | PENELOPE | GUINESS | ||
14 | ED | CHASE | ||
22 | JENNIFER | DAVIS | ||
23 | JOHNNY | LOLLOBRIGIDA | ||
27 | BETTE | NICHOLSON | ||
34 | GRACE | MOSTEL | ||
41 | NICK | WAHLBERG | ||
39 | JOE | SWANK | ||
23 | CHRISTIAN | GABLE | ||
22 | JENNIFER | DAVIS |
Deleting Rows with Non-unique Keys
In the case of JENNIFER DAVIS, who appears twice with the same id of 22, we would need to employ a different approach because running the above statement with a.id = a2.id will target every row in the table! The reason is that we are essentially matching every row against itself! In the next blog, we’ll learn how to delete rows with non-unique keys such as these.