Author: Madhax | Published: 27th January 2010 | RSS | LINK

I investigated various ways of deleting duplicate rows in a MySQL table. My problem required that the method should be quick, fault tolerant, verbose and be able to handle large datasets. Following is a list of common ways to either select distinct rows or delete duplicate rows from MySQL tables, along with how long it took (or how long it was taking). I would like to preface by saying that having the data column indexed isn’t a viable solution because it significantly reduces the insertion time when the table is large.

select distinct method

1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> SELECT DISTINCT DATA FROM test;
959058 rows IN SET (1 min 37.36 sec)
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> SHOW processlist;
+----+------+----------------+--------+---------+------+------------------------------+--------------------------------+
| Id | User | Host           | db     | Command | Time | State                        | Info                           |
+----+------+----------------+--------+---------+------+------------------------------+--------------------------------+
|  9 | root | localhost:2728 | testdb | Query   |    0 | NULL                         | SHOW processlist               |
| 17 | root | localhost:3592 | testdb | Query   | 1243 | Copying TO tmp TABLE ON disk | SELECT DISTINCT DATA FROM test |
+----+------+----------------+--------+---------+------+------------------------------+--------------------------------+
 
1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test;
Query OK, 959044 rows affected (3 min 4.89 sec)
Records: 959044  Duplicates: 0  Warnings: 0
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test;
mysql> SHOW processlist;
+----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+
| Id | User | Host           | db     | Command | Time | State                        | Info                                                 |
+----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+
|  8 | root | localhost:2725 | testdb | Query   | 7591 | Copying TO tmp TABLE ON disk | INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test |
|  9 | root | localhost:2728 | testdb | Query   |    0 | NULL                         | SHOW processlist                                     |
+----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+

equal data different key method

10 000 rows. Duplicates unlikely. Datalen 12 bytes. 
DELETE FROM `test` USING `test`, `test` AS vtable WHERE (NOT `test`.id = `vtable`.id) AND (`test`.DATA=`vtable`.DATA);
>5 mins

equal data different key improved method

10 000 rows. Duplicates unlikely. Datalen 12 bytes. 
mysql>delete FROM test USING test, test AS vtable WHERE (test.id >; vtable.id) AND (test.DATA=vtable.DATA);
>5 mins

add unique index method

1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> ALTER IGNORE TABLE `test` ADD UNIQUE INDEX(DATA);
Query OK, 1000000 rows affected (1 min 51.14 sec)
Records: 1000000  Duplicates: 4  Warnings: 0
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> ALTER IGNORE TABLE `test` ADD UNIQUE INDEX(DATA);
Query OK, 1000000 rows affected (6 hours 37 min 49.35 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

group by data method

1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> CREATE TABLE temp_table AS SELECT * FROM test GROUP BY DATA;
Query OK, 959175 rows affected (10 min 50.14 sec)
Records: 959175  Duplicates: 0  Warnings: 0
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> SHOW processlist;
+----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+
| Id | User | Host           | db     | Command | Time | State                        | Info                                                        |
+----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+
|  9 | root | localhost:2728 | testdb | Query   | 5013 | Copying TO tmp TABLE ON disk | CREATE TABLE temp_table AS SELECT * FROM test GROUP BY DATA |
| 19 | root | localhost:2185 | testdb | Query   |    0 | NULL                         | SHOW processlist                                            |
+----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+
2 rows IN SET (0.09 sec)

None of the above solutions are suitable for a number of reasons. Adding a unique index creates the problem of long insertion times for subsequent INSERT queries. The growth seems to be exponential. There isn’t a way to see how long the query will take to complete. Should the machine fail while the week> long query is executing then the table is likely to be corrupted, without being partially completed.

A better solution is to create a python script (or use any other language for that matter) that makes use of the MySQL API to delete duplicates. The design is as follows:

-N in memory hash tables that store the indices to a distinct column in a file
 
-N stream formatted (data + \0) files that store distinct columns
 
-N queues which store the hashes and the rows that need to be checked for their respective hash table
 
-Uniformly distributed hashing function to be used on the duplicate column

Multiple files help reduce (or eliminate) the overall seek-time on the hard drive. The result of the hashing function on a column specifies a hash table and an index into the hash table. i.e.

hash_table = fn(data)%num_hash_tables, index = fn(data)%size_of_hash_table

File offsets to a distinct column are stored in the hash table.

As each row is hashed it is placed into a queue for later processing. When a finite number (10 000~) rows has been distributed among the queues, each queue is processed in sequence to reduce hard drive seek time.

Python-like pseudo code with brevity in mind

row_cnt = 0
for row in database:
	ntable = hash(data)%num_hash_tables
	index = hash(data)%size_of_hash_table
	queues[ntable].push([index, data, primary_key])
	row_cnt++
	if row_cnt == 10000:
		for each entry in each queue:
			fileoffset  = hash_tables[num][index]
			if data == read_data(num + “_file”, fileoffset):
				add_new_entry_to_file_and_hash_table()
			else:
				delete_row(primary_key)
	row_cnt=0

Depending on the complexity of the hash function, size of the table, and size of the column, this method could be slower. For any significantly large non-indexed table it is better than any of the above solutions. It is also easy to determine how long the script will take to complete and it is also easy to make the script fault tolerant (could be safely interrupted).

For large databases – the number of rows exceeds the number of offsets that can be stored in memory – one would use a variation of the algorithm. Store N distinct rows in the hash tables, and keep track of the primary key when N values have been stored. Iterate through remaining rows deleting any duplicates. When finished iterating, empty hash tables and start storing new hashes/deleting duplicates beginning from the previous primary key that was stored.

more python-like pseudo code with brevity in mind

start_value = 0
new_start_value = 0
while True:
	row_cnt = 0
	for row in database starting at primary key start_value:
		ntable = hash(data)%num_hash_tables
		index = hash(data)%size_of_hash_table
		queues[ntable].push([index, data, primary_key])
		row_cnt++
		if row_cnt == 10000:
			for each entry in each queue:
				fileoffset  = hash_tables[num][index]
				if data != read_data(num + “_file”, fileoffset):
					add_new_entry_to_file_and_hash_table_if_not_full()
					if hash_table_full() and start_value == new_start_value:
						new_start_value = primary_key
				else:
					delete_row(primary_key)
		row_cnt=0
	empty_hash_tables()
	if start_value == new_start_value:
		break
	else:
		start_value = new_start_value

To put the performance increase into perspective, for 1 000 000 rows, datalen 255 with duplicates unlikely, it handled 10 000 rows in an average time of 4 seconds. All duplicates were removed in roughly 7 minutes, whereas all other solutions took hours. The speed increase gets better as the dataset gets larger.

2 Comments. Add yours!

  1. MySQLer
    3:16 AM on May 16th, 2010

    A production table I was working on had 16 million (with 8 million dupes) rows and I tried unique index method. It took around 6-7 hours. I’m trying algorithmic method (since I have to design a script for n other prod databases with same issue), if I am trying to develop similar function in perl as it would suit the requirement. Please let me know if you have a perl version of the above as I was not able to find some of the equivalent functions.

  2. Madhax
    4:00 AM on May 16th, 2010

    Unfortunately I do not have a perl version of the above code. The code I put in my blog isn’t very clear and looking at it now I got sort of confused as the pseudo code is more implicit than explicit. I have attached the full python code that I use in my production databases. It is in the form of a module.

    The only functionality you need to port this to perl is to be able to:
    -create hash tables
    -create queues
    -read/write binary files
    -perform SQL queries

    This code was meant for my eyes only, so I’m offering it as is and make no guarantees. I will answer questions when and if I have time.

    CLICK HERE TO VIEW CODE

Leave a Reply

Some basic HTML is allowed. Please keep all comments constructive, polite and on-topic. Any spam or offensive comments will be deleted.