Delete Duplicate Rows From Database
Author: Madhax | Published: 27th January 2010 | RSS | LINK | No Comments yet. Start talking!
I investigated various ways of deleting duplicate rows in a MySQL table. My problem required that the method should be quick, fault tolerant, verbose and be able to handle large datasets. Following is a list of common ways to either select distinct rows or delete duplicate rows from MySQL tables, along with how long it took (or how long it was taking). I would like to preface by saying that having the data column indexed isn’t a viable solution because it significantly reduces the insertion time when the table is large.
select distinct method
1 000 000 rows. Duplicates unlikely. Datalen 12 bytes mysql> SELECT DISTINCT DATA FROM test; 959058 rows IN SET (1 min 37.36 sec) 1 000 000 rows. Duplicates unlikely. Datalen 255 bytes mysql> SHOW processlist; +----+------+----------------+--------+---------+------+------------------------------+--------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+------+----------------+--------+---------+------+------------------------------+--------------------------------+ | 9 | root | localhost:2728 | testdb | Query | 0 | NULL | SHOW processlist | | 17 | root | localhost:3592 | testdb | Query | 1243 | Copying TO tmp TABLE ON disk | SELECT DISTINCT DATA FROM test | +----+------+----------------+--------+---------+------+------------------------------+--------------------------------+ 1 000 000 rows. Duplicates unlikely. Datalen 12 bytes mysql> INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test; Query OK, 959044 rows affected (3 min 4.89 sec) Records: 959044 Duplicates: 0 Warnings: 0 1 000 000 rows. Duplicates unlikely. Datalen 255 bytes mysql> INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test; mysql> SHOW processlist; +----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+ | 8 | root | localhost:2725 | testdb | Query | 7591 | Copying TO tmp TABLE ON disk | INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test | | 9 | root | localhost:2728 | testdb | Query | 0 | NULL | SHOW processlist | +----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+
equal data different key method
10 000 rows. Duplicates unlikely. Datalen 12 bytes. DELETE FROM `test` USING `test`, `test` AS vtable WHERE (NOT `test`.id = `vtable`.id) AND (`test`.DATA=`vtable`.DATA); >5 mins
equal data different key improved method
10 000 rows. Duplicates unlikely. Datalen 12 bytes. mysql>delete FROM test USING test, test AS vtable WHERE (test.id >; vtable.id) AND (test.DATA=vtable.DATA); >5 mins
add unique index method
1 000 000 rows. Duplicates unlikely. Datalen 12 bytes mysql> ALTER IGNORE TABLE `test` ADD UNIQUE INDEX(DATA); Query OK, 1000000 rows affected (1 min 51.14 sec) Records: 1000000 Duplicates: 4 Warnings: 0 1 000 000 rows. Duplicates unlikely. Datalen 255 bytes mysql> ALTER IGNORE TABLE `test` ADD UNIQUE INDEX(DATA); Query OK, 1000000 rows affected (6 hours 37 min 49.35 sec) Records: 1000000 Duplicates: 0 Warnings: 0
group by data method
1 000 000 rows. Duplicates unlikely. Datalen 12 bytes mysql> CREATE TABLE temp_table AS SELECT * FROM test GROUP BY DATA; Query OK, 959175 rows affected (10 min 50.14 sec) Records: 959175 Duplicates: 0 Warnings: 0 1 000 000 rows. Duplicates unlikely. Datalen 255 bytes mysql> SHOW processlist; +----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+ | 9 | root | localhost:2728 | testdb | Query | 5013 | Copying TO tmp TABLE ON disk | CREATE TABLE temp_table AS SELECT * FROM test GROUP BY DATA | | 19 | root | localhost:2185 | testdb | Query | 0 | NULL | SHOW processlist | +----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+ 2 rows IN SET (0.09 sec)
None of the above solutions are suitable for a number of reasons. Adding a unique index creates the problem of long insertion times for subsequent INSERT queries. The growth seems to be exponential. There isn’t a way to see how long the query will take to complete. Should the machine fail while the week> long query is executing then the table is likely to be corrupted, without being partially completed.
A better solution is to create a python script (or use any other language for that matter) that makes use of the MySQL API to delete duplicates. The design is as follows:
-N in memory hash tables that store the indices to a distinct column in a file -N stream formatted (data + \0) files that store distinct columns -N queues which store the hashes and the rows that need to be checked for their respective hash table -Uniformly distributed hashing function to be used on the duplicate column
Multiple files help reduce (or eliminate) the overall seek-time on the hard drive. The result of the hashing function on a column specifies a hash table and an index into the hash table. i.e.
hash_table = fn(data)%num_hash_tables, index = fn(data)%size_of_hash_table
File offsets to a distinct column are stored in the hash table.
As each row is hashed it is placed into a queue for later processing. When a finite number (10 000~) rows has been distributed among the queues, each queue is processed in sequence to reduce hard drive seek time.
Python-like pseudo code with brevity in mind
row_cnt = 0 for row in database: ntable = hash(data)%num_hash_tables index = hash(data)%size_of_hash_table queues[ntable].push([index, data, primary_key]) row_cnt++ if row_cnt == 10000: for each entry in each queue: fileoffset = hash_tables[num][index] if data == read_data(num + “_file”, fileoffset): add_new_entry_to_file_and_hash_table() else: delete_row(primary_key) row_cnt=0
Depending on the complexity of the hash function, size of the table, and size of the column, this method could be slower. For any significantly large non-indexed table it is better than any of the above solutions. It is also easy to determine how long the script will take to complete and it is also easy to make the script fault tolerant (could be safely interrupted).
For large databases – the number of rows exceeds the number of offsets that can be stored in memory – one would use a variation of the algorithm. Store N distinct rows in the hash tables, and keep track of the primary key when N values have been stored. Iterate through remaining rows deleting any duplicates. When finished iterating, empty hash tables and start storing new hashes/deleting duplicates beginning from the previous primary key that was stored.
more python-like pseudo code with brevity in mind
start_value = 0 new_start_value = 0 while True: row_cnt = 0 for row in database starting at primary key start_value: ntable = hash(data)%num_hash_tables index = hash(data)%size_of_hash_table queues[ntable].push([index, data, primary_key]) row_cnt++ if row_cnt == 10000: for each entry in each queue: fileoffset = hash_tables[num][index] if data != read_data(num + “_file”, fileoffset): add_new_entry_to_file_and_hash_table_if_not_full() if hash_table_full() and start_value == new_start_value: new_start_value = primary_key else: delete_row(primary_key) row_cnt=0 empty_hash_tables() if start_value == new_start_value: break else: start_value = new_start_value
To put the performance increase into perspective, for 1 000 000 rows, datalen 255 with duplicates unlikely, it handled 10 000 rows in an average time of 4 seconds. All duplicates were removed in roughly 7 minutes, whereas all other solutions took hours. The speed increase gets better as the dataset gets larger.

