Author: Madhax | Published: 27th January 2010 | RSS | LINK | No Comments yet. Start talking!

I investigated various ways of deleting duplicate rows in a MySQL table. My problem required that the method should be quick, fault tolerant, verbose and be able to handle large datasets. Following is a list of common ways to either select distinct rows or delete duplicate rows from MySQL tables, along with how long it took (or how long it was taking). I would like to preface by saying that having the data column indexed isn’t a viable solution because it significantly reduces the insertion time when the table is large.

select distinct method

1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> SELECT DISTINCT DATA FROM test;
959058 rows IN SET (1 min 37.36 sec)
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> SHOW processlist;
+----+------+----------------+--------+---------+------+------------------------------+--------------------------------+
| Id | User | Host           | db     | Command | Time | State                        | Info                           |
+----+------+----------------+--------+---------+------+------------------------------+--------------------------------+
|  9 | root | localhost:2728 | testdb | Query   |    0 | NULL                         | SHOW processlist               |
| 17 | root | localhost:3592 | testdb | Query   | 1243 | Copying TO tmp TABLE ON disk | SELECT DISTINCT DATA FROM test |
+----+------+----------------+--------+---------+------+------------------------------+--------------------------------+
 
1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test;
Query OK, 959044 rows affected (3 min 4.89 sec)
Records: 959044  Duplicates: 0  Warnings: 0
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test;
mysql> SHOW processlist;
+----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+
| Id | User | Host           | db     | Command | Time | State                        | Info                                                 |
+----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+
|  8 | root | localhost:2725 | testdb | Query   | 7591 | Copying TO tmp TABLE ON disk | INSERT INTO tmp(DATA) SELECT DISTINCT DATA FROM test |
|  9 | root | localhost:2728 | testdb | Query   |    0 | NULL                         | SHOW processlist                                     |
+----+------+----------------+--------+---------+------+------------------------------+------------------------------------------------------+

equal data different key method

10 000 rows. Duplicates unlikely. Datalen 12 bytes. 
DELETE FROM `test` USING `test`, `test` AS vtable WHERE (NOT `test`.id = `vtable`.id) AND (`test`.DATA=`vtable`.DATA);
>5 mins

equal data different key improved method

10 000 rows. Duplicates unlikely. Datalen 12 bytes. 
mysql>delete FROM test USING test, test AS vtable WHERE (test.id >; vtable.id) AND (test.DATA=vtable.DATA);
>5 mins

add unique index method

1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> ALTER IGNORE TABLE `test` ADD UNIQUE INDEX(DATA);
Query OK, 1000000 rows affected (1 min 51.14 sec)
Records: 1000000  Duplicates: 4  Warnings: 0
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> ALTER IGNORE TABLE `test` ADD UNIQUE INDEX(DATA);
Query OK, 1000000 rows affected (6 hours 37 min 49.35 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

group by data method

1 000 000 rows. Duplicates unlikely. Datalen 12 bytes
mysql> CREATE TABLE temp_table AS SELECT * FROM test GROUP BY DATA;
Query OK, 959175 rows affected (10 min 50.14 sec)
Records: 959175  Duplicates: 0  Warnings: 0
 
1 000 000 rows. Duplicates unlikely. Datalen 255 bytes
mysql> SHOW processlist;
+----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+
| Id | User | Host           | db     | Command | Time | State                        | Info                                                        |
+----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+
|  9 | root | localhost:2728 | testdb | Query   | 5013 | Copying TO tmp TABLE ON disk | CREATE TABLE temp_table AS SELECT * FROM test GROUP BY DATA |
| 19 | root | localhost:2185 | testdb | Query   |    0 | NULL                         | SHOW processlist                                            |
+----+------+----------------+--------+---------+------+------------------------------+-------------------------------------------------------------+
2 rows IN SET (0.09 sec)

None of the above solutions are suitable for a number of reasons. Adding a unique index creates the problem of long insertion times for subsequent INSERT queries. The growth seems to be exponential. There isn’t a way to see how long the query will take to complete. Should the machine fail while the week> long query is executing then the table is likely to be corrupted, without being partially completed.

A better solution is to create a python script (or use any other language for that matter) that makes use of the MySQL API to delete duplicates. The design is as follows:

-N in memory hash tables that store the indices to a distinct column in a file
 
-N stream formatted (data + \0) files that store distinct columns
 
-N queues which store the hashes and the rows that need to be checked for their respective hash table
 
-Uniformly distributed hashing function to be used on the duplicate column

Multiple files help reduce (or eliminate) the overall seek-time on the hard drive. The result of the hashing function on a column specifies a hash table and an index into the hash table. i.e.

hash_table = fn(data)%num_hash_tables, index = fn(data)%size_of_hash_table

File offsets to a distinct column are stored in the hash table.

As each row is hashed it is placed into a queue for later processing. When a finite number (10 000~) rows has been distributed among the queues, each queue is processed in sequence to reduce hard drive seek time.

Python-like pseudo code with brevity in mind

row_cnt = 0
for row in database:
	ntable = hash(data)%num_hash_tables
	index = hash(data)%size_of_hash_table
	queues[ntable].push([index, data, primary_key])
	row_cnt++
	if row_cnt == 10000:
		for each entry in each queue:
			fileoffset  = hash_tables[num][index]
			if data == read_data(num + “_file”, fileoffset):
				add_new_entry_to_file_and_hash_table()
			else:
				delete_row(primary_key)
	row_cnt=0

Depending on the complexity of the hash function, size of the table, and size of the column, this method could be slower. For any significantly large non-indexed table it is better than any of the above solutions. It is also easy to determine how long the script will take to complete and it is also easy to make the script fault tolerant (could be safely interrupted).

For large databases – the number of rows exceeds the number of offsets that can be stored in memory – one would use a variation of the algorithm. Store N distinct rows in the hash tables, and keep track of the primary key when N values have been stored. Iterate through remaining rows deleting any duplicates. When finished iterating, empty hash tables and start storing new hashes/deleting duplicates beginning from the previous primary key that was stored.

more python-like pseudo code with brevity in mind

start_value = 0
new_start_value = 0
while True:
	row_cnt = 0
	for row in database starting at primary key start_value:
		ntable = hash(data)%num_hash_tables
		index = hash(data)%size_of_hash_table
		queues[ntable].push([index, data, primary_key])
		row_cnt++
		if row_cnt == 10000:
			for each entry in each queue:
				fileoffset  = hash_tables[num][index]
				if data != read_data(num + “_file”, fileoffset):
					add_new_entry_to_file_and_hash_table_if_not_full()
					if hash_table_full() and start_value == new_start_value:
						new_start_value = primary_key
				else:
					delete_row(primary_key)
		row_cnt=0
	empty_hash_tables()
	if start_value == new_start_value:
		break
	else:
		start_value = new_start_value

To put the performance increase into perspective, for 1 000 000 rows, datalen 255 with duplicates unlikely, it handled 10 000 rows in an average time of 4 seconds. All duplicates were removed in roughly 7 minutes, whereas all other solutions took hours. The speed increase gets better as the dataset gets larger.

Author: Madhax | Published: 11th November 2009 | RSS | LINK | No Comments yet. Start talking!

A couple weeks ago I had been working on something I thought would be pretty cool. The idea was a video player that could stream videos from external servers. The way I had planned on accomplishing this was via JavaFX and a signed applet. Needless to say (if you read the title) it didn’t really work out. I had managed to stream traditional FLV video files… but a large portion of video sites – at least the more popular ones that I’ve tried – make use of H.264 video files. JavaFX supposedly supports all media that is supported by Windows media player [1]

Problem 1: I couldn’t stream H.264 files.
Fine print: I COULD play a local H.264 file

I tried working around the problem of not being able to stream H264 files by downloading portions of the video – luckily the bleeding edge version of JavaFX (1.2) allowed the output of httprequests to be redirected to file – and playing those portions in sequence.

Problem 2: javafx.io.http.HttpRequest locks file while downloading… which makes using a media file impossible until the download is complete.
Fine Print: I imagine that HttpRequest wasn’t meant to download media files…  but it would be incredibly useful to download and play a media file.

I then tried to build a library in native java that would release the file lock and allow JavaFX to load the media. This worked for the first bit that had been downloaded (and flushed to a file), but then I had to update file. I deleted the media object that made use of the local file and was hoping that it would release the file lock it put on the file while reading it.

Problem 3: It didn’t release the file lock… and there IS NOW WAY to force the release of system resources held by JavaFX.

There is no fine print for this… this is completely inexcusable.  It makes it impossible to update any resource (I am assuming this problem exists beyond video files I tried to play.) Whoever designed the language and omitted an interface – or a means – to delete a language native object, had a very narrow minded view on how the language should be used. I could’ve wrote each bit to a new file and then load the new file, but I would have to keep track of FLV offsets and correctly build a new FLV header for each new file and even then it will end up looking like a series of clips. At this point I had given up and moved onto a new idea that caught my interest. These are some more problems I’ve encountered that relate to JavaFX.

Problem 4: JavaFX had a problem handling  fileoffsets meta tag in the FLV header for some videos for me. To play videos I had to filter them out.  (Yes, it meant correctly rebuilding an FLV stream… yes… I did it =\)

Problem 5: The JavaFX forums were inactive. Not one person had replied to any thread I had started.

Problem 6: No one wants to wait 20 seconds for the JVM to fire-up just to play a quick game or music video.

Problem 7: Video support doesn’t come out of the box. So if you want your users to be able to watch videos… they will not only have to download Java and JavaFX (both are bundled), but also a third party codec pack.

In retrospect, I put a lot of work in hacking away at the problem. I don’t think I failed… I got it to work… it’s just that the effort  I – and a user – would have to go through makes using JavaFX impractical. At least it was fun learning about FLVs [2].

[1] http://www.javafx.com/docs/articles/media/format.jsp
[2] http://www.adobe.com/devnet/flv/pdf/video_file_format_spec_v10.pdf

Author: Madhax | Published: 16th July 2009 | RSS | LINK | 1 Comment. Leave yours!

I recently had an interesting experience with Facebook. I made an application that let a user search for another user’s public albums even if they weren’t on each other’s friends list. This application didn’t violate any clause of the TOS list at the time. The application made use of a design decision on Facebook’s part to allow world readable albums. When you would create an album on Facebook, you can set the privacy setting of the album. One of the choices (the one selected by default) is that the album would be world readable. I made use of the Facebook API (PHP, SQL for caching, FQL for speed, all that jazz) to create the application. I even made a slick interface that looked very facebook-ish. For lols I created a tutorial that centered around viewing Mark Zuckerberg’s photos. In the first couple days after the app had launched, I had gotten several hundred users and several 5-star reviews (all the reviews were 5-star).

I decided to submit the application to the Facebook public application directory (stupid me… I know) – which would make it visible to any user that is specifically looking for a super awesome application that lets them view the public albums of any user on Facebook. The application was submitted for approval and I waited. One day passed, then two – I thought that the approval process would only take one business day, since it just involves a person to add and try an application out. A couple days after submitting the application I noticed that I could no longer view Mark Zuckerberg’s photos. Not that I was stalking Mark Zuckerberg, but I like to run through things I make – daily – to make sure they aren’t broken. Viewing Mark Zuckerberg’s photos had become routine from a testing standpoint. Testing my app on other users showed that my application _still_ worked.

Doing a regex search in my apache logs showed that my application had been added (and removed) by _many_ ips in the Facebook subnet over 250 times. The regex count returned 255 (WHAT A ROUND NUMBER :D ). Each IP interacted with my application in some way, which suggests that it wasn’t a bot that kept on adding and removing my application. I got really excited because I thought that many employees of facebook saw my app, and in Mark Zuckerberg’s embarrassment, he changed the privacy setting of his albums (thought: MARK ZUCKERBERG HIMSELF SAW MY APP, WOW :D).
facebook
(photo from one of Mark Zuckerberg’s albums)

Today I tried my app, it barely returns any results on any search. All the searches I tried previously no longer work. Facebook hurriedly fixed a problem that didn’t exist… and it broke my app. What I mean by a problem that doesn’t exist is that I was taking advantage of a design decision Facebook made. Albums had a privacy setting that allowed everyone to view them. I could view what albums a user had using FQL, now I can’t (it still works for some people… but everyday I’m getting less and less results… so I assume Facebook is fixing each account sequentially or something.) The world readable attribute still exists and it’s still the default choice when creating a new album.

What bothers me most about this experience is not the hours I put in developing the app. Not that I shared this application publicly and won’t be able to use it privately. What bothers me most about this is that Facebook delayed the approval of my application for the public directory listing until it didn’t work. It’s their platform. They can change whatever functionality they want… send no notification to you…and if they break your app there’s nothing you can do about it.

Facebook did what they had to do to accomplish their goals (avoid embarassment, fix an initial design error, w/e.) I now view them as an evil company, but that’s just my opinion. Luckily, I cached (for optimization purposes) the albums IDs of Mark Zuckerberg’s albums :D.

ref:
[1] My application: http://apps.facebook.com/publicphotos/
[2] http://apps.facebook.com/publicphotos/album.php?a=17182073944
[3] http://apps.facebook.com/publicphotos/album.php?a=17181871868
[4] http://apps.facebook.com/publicphotos/album.php?a=17181903264
[5] http://apps.facebook.com/publicphotos/album.php?a=17181916415