Author: Madhax | Published: 16th July 2009 | RSS | LINK | 2 Comments. Leave yours!

I recently had an interesting experience with Facebook. I made an application that let a user search for another user’s public albums even if they weren’t on each other’s friends list. This application didn’t violate any clause of the TOS list at the time. The application made use of a design decision on Facebook’s part to allow world readable albums. When you would create an album on Facebook, you can set the privacy setting of the album. One of the choices (the one selected by default) is that the album would be world readable. I made use of the Facebook API (PHP, SQL for caching, FQL for speed, all that jazz) to create the application. I even made a slick interface that looked very facebook-ish. For lols I created a tutorial that centered around viewing Mark Zuckerberg’s photos. In the first couple days after the app had launched, I had gotten several hundred users and several 5-star reviews (all the reviews were 5-star).

I decided to submit the application to the Facebook public application directory (stupid me… I know) – which would make it visible to any user that is specifically looking for a super awesome application that lets them view the public albums of any user on Facebook. The application was submitted for approval and I waited. One day passed, then two – I thought that the approval process would only take one business day, since it just involves a person to add and try an application out. A couple days after submitting the application I noticed that I could no longer view Mark Zuckerberg’s photos. Not that I was stalking Mark Zuckerberg, but I like to run through things I make – daily – to make sure they aren’t broken. Viewing Mark Zuckerberg’s photos had become routine from a testing standpoint. Testing my app on other users showed that my application _still_ worked.

Doing a regex search in my apache logs showed that my application had been added (and removed) by _many_ ips in the Facebook subnet over 250 times. The regex count returned 255 (WHAT A ROUND NUMBER :D ). Each IP interacted with my application in some way, which suggests that it wasn’t a bot that kept on adding and removing my application. I got really excited because I thought that many employees of facebook saw my app, and in Mark Zuckerberg’s embarrassment, he changed the privacy setting of his albums (thought: MARK ZUCKERBERG HIMSELF SAW MY APP, WOW :D).
facebook
(photo from one of Mark Zuckerberg’s albums)

Today I tried my app, it barely returns any results on any search. All the searches I tried previously no longer work. Facebook hurriedly fixed a problem that didn’t exist… and it broke my app. What I mean by a problem that doesn’t exist is that I was taking advantage of a design decision Facebook made. Albums had a privacy setting that allowed everyone to view them. I could view what albums a user had using FQL, now I can’t (it still works for some people… but everyday I’m getting less and less results… so I assume Facebook is fixing each account sequentially or something.) The world readable attribute still exists and it’s still the default choice when creating a new album.

What bothers me most about this experience is not the hours I put in developing the app. Not that I shared this application publicly and won’t be able to use it privately. What bothers me most about this is that Facebook delayed the approval of my application for the public directory listing until it didn’t work. It’s their platform. They can change whatever functionality they want… send no notification to you…and if they break your app there’s nothing you can do about it.

Facebook did what they had to do to accomplish their goals (avoid embarassment, fix an initial design error, w/e.) I now view them as an evil company, but that’s just my opinion. Luckily, I cached (for optimization purposes) the albums IDs of Mark Zuckerberg’s albums :D.

ref:
[1] My application: http://apps.facebook.com/publicphotos/
[2] http://apps.facebook.com/publicphotos/album.php?a=17182073944
[3] http://apps.facebook.com/publicphotos/album.php?a=17181871868
[4] http://apps.facebook.com/publicphotos/album.php?a=17181903264
[5] http://apps.facebook.com/publicphotos/album.php?a=17181916415

Author: Madhax | Published: 24th June 2009 | RSS | LINK | No Comments yet. Start talking!

Recently I discovered that the recommended way of changing a column type of a table in MySQL is intolerably slow. By recommended I mean that the first result of searching for “how to change column type mysql” (excluding quotes) is a link to the MySQL developer documentation for an ALTER TABLE query.

The circumstances started with me needing a table to store MD5 hashes. The table definition I had come up with at the time was

mysql> DESCRIBE mytable;
+-------+------------+------+-----+------------------+-------+
| FIELD | Type       | NULL | KEY | DEFAULT          | Extra |
+-------+------------+------+-----+------------------+-------+
| HASH  | CHAR  (16) | NO   | PRI |                  |       |
+-------+------------+------+-----+------------------+-------+
1 row IN SET (0.03 sec)

The design error is that I had defined the `HASH` as char(16). A properly escaped MD5 digest would be stored correctly in the table and I was able to fetch any digest – via MySQL command line applications – as it was stored in the table (Data after \0’s would get returned.) There wasn’t any problem until I chose to make optimizations to the project and make use of functionality provided by a MySQL API.

When selecting any char(x) column from a table the API would read up to a \0 (NULL, int(0).) So I would get varying column lengths depending on the digest in the current row. It’s reasonable for the APIs to assume that because I have defined a column as a sequence of chars then a null terminated string would be stored in it … especially since there is another data type that is specifically used for binary data.

This realization came AFTER I had already populated the table with a lot of data. The table had grown to a fairly large size.

mysql> SELECT COUNT(*) FROM `mytable`;
 
+----------+
| COUNT(*) |
+----------+
| 38744395 |
+----------+
1 row IN SET (0.33 sec)

I was looking to fix my mistake. The ideal solution would be to change the column type. Research done via Google and looking through a book (High Performance MySQL) lead me to believe that

ALTER TABLE `mytable` MODIFY HASH BINARY(16) NOT NULL DEFAULT '’;

would be the best way to go about changing a column type. I was wrong. The query ran for days. This is the result of a “SHOW PROCESSLIST” a few days into the query.

mysql> SHOW processlist;
+------+------+----------------+----------+---------+--------+-------------------+----------------------------------------------------------------------+
| Id   | User | Host           | db       | Command | Time   | State             | Info                                                                 |
+------+------+----------------+----------+---------+--------+-------------------+----------------------------------------------------------------------+
|   82 | root | localhost:4442 | NULL     | Sleep   |     52 |                   | NULL                                                                 |
|  912 | root | localhost:3677 | mydb     | Query   | 275084 | copy TO tmp TABLE | ALTER TABLE `mytable` MODIFY HASH BINARY(16) NOT NULL DEFAULT ''     |
| 2672 | root | localhost:4527 | ******** | Query   |      0 | NULL              | SHOW processlist                                                     |
+------+------+----------------+----------+---------+--------+-------------------+----------------------------------------------------------------------+

A day later the query stopped executing because `mytable` was corrupted. It didn’t occur to me that my experimenting may have corrupted the table – so I didn’t bother running a consistency check. The MySQL client should’ve probably verified the tables before executing any query that is expected to run for more than a day. Regardless, I would have thought that the column modification would be near instantaneous. The column data wouldn’t need to be changed or converted (I used a single byte character set). All that would be required is for the particular column to be recognized as a different type. An ALTER TABLE query shouldn’t be this slow.

In contrast, a full table scan – which involves reading all the data from the table off of the hard drive – is completed within 8 minutes.

mysql> SELECT * FROM `mytable` WHERE `HASH` LIKE "%dsdfsdfsdfsd%" LIMIT 1;
Empty SET (7 min 1.45 sec)

So even if it needed to modify all the rows, and assuming that reading from disk takes as much time as writing to disk, then it should finish within 20 minutes. (The columns would have the same width so no data re-arrangement would need to take place.)

Without digging through the source code I can only provide speculation based on behavior and on information provided by third party sources. An excerpt from High Performance MySQL reads:

MySQL performs most alterations by making an empty table with the desired new
structure, inserting all the data from the old table into the new one, and deleting the
old table.
....
mysql> ALTER TABLE sakila.film
-> MODIFY COLUMN rental_duration TINYINT(3) NOT NULL DEFAULT 5;
 
Profiling that statement with SHOW STATUS shows that it does 1,000 handler reads and
1,000 inserts.

My understanding is that it creates a temporary table, fetches one row at a time from the original table and inserts one row at a time into the temporary table. The problem with this method is that it incurs a lot of overhead performing queries one row at a time. This overhead could be avoided by handling a lot of rows in a lot fewer queries. All this occurred to me while the ALTER TABLE had been executing, but I didn’t kill it because I wasn’t completely confident that this was the problem or that I had a better solution.

When the ALTER TABLE failed I tried an alternate solution that involves bulk SELECTs and INSERTs. I did the following steps:

1.	Create an SQL dump of mytable using mysqldump command
2.	Opened the SQL dump in an editor that supports large files (I used UEStudio.)
3.	I did a string-replace of char(16) to binary(16) (The reason I did a string replace is because UEStudio – when opening large files – would write whatever you typed directly to disk.. so for very large files a single keystroke would lag for about a minute.)
4.	Import the edited SQL file via  “mysql –uuser –p mydb < edited.sql”

This entire process was completed within an hour. Hopefully this post will help someone save a few days of their life.

RE: http://oreilly.com/catalog/9780596003067/
RE: http://dev.mysql.com/doc/refman/5.1/en/alter-table.html

Author: Madhax | Published: 16th June 2009 | RSS | LINK | No Comments yet. Start talking!

I’m currently working on a small project that led me to look for the best way to fetch a random row from a MySQL database. Online research showed that a common solution to this problem is something similar to

[1]

SELECT * FROM `TABLE` ORDER BY RAND() LIMIT 1;

The problem with this solution is that it assigns a random value to each row and attempts to sort the entire table by the random value. I list here a variety of solutions I have found on the web.

[2]

$range_result = mysql_query( "SELECT MAX(`id`) AS max_id , MIN(`id`) AS min_id FROM `table` ");
$range_row = mysql_fetch_object( $range_result ); 
$random = mt_rand( $range_row->min_id , $range_row->max_id );
$result = mysql_query( " SELECT * FROM `table` WHERE `id` >= $random LIMIT 0,1 ");

Solution [2] scans the entire table up to $random. Worst case scenario – this requires a full table scan.

mysql> EXPLAIN SELECT * FROM `to_visit` WHERE `ID` >= 250;
 
+----+-------------+----------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | TABLE    | type | possible_keys | KEY  | key_len | ref  | rows     | Extra       |
+----+-------------+----------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | to_visit | ALL  | PRIMARY       | NULL | NULL    | NULL | 20345551 | USING WHERE |
+----+-------------+----------+------+---------------+------+---------+------+----------+-------------+
 
 
1 row IN SET (0.00 sec)

[3]

$offset_result = mysql_query( " SELECT FLOOR(RAND() * COUNT(*)) AS `offset` FROM `table` ");
$offset_row = mysql_fetch_object( $offset_result ); 
$offset = $offset_row->offset;
$result = mysql_query( " SELECT * FROM `table` LIMIT $offset, 1 " );

Solution [3] scans the entire database up to $offset. Worst case scenario – a full table scan is required.

mysql> EXPLAIN SELECT * FROM `to_visit` LIMIT 20, 1;
 
+----+-------------+----------+------+---------------+------+---------+------+----------+-------+
| id | select_type | TABLE    | type | possible_keys | KEY  | key_len | ref  | rows     | Extra |
+----+-------------+----------+------+---------------+------+---------+------+----------+-------+
|  1 | SIMPLE      | to_visit | ALL  | NULL          | NULL | NULL    | NULL | 20345551 |       |
+----+-------------+----------+------+---------------+------+---------+------+----------+-------+
 
 
1 row IN SET (0.00 sec)

[4]

SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;

Solution [4] is equivalent to solution [2]. I’m not sure why the “ORDER BY id” was required at all. Worst case scenario – a full table scan is required.

mysql> EXPLAIN SELECT * FROM `to_visit` WHERE ID >= (SELECT FLOOR(MAX(`ID`) * RAND()) FROM `to_visit`) LIMIT 1;
 
+----+----------------------+----------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type          | TABLE    | type  | possible_keys | KEY     | key_len | ref  | rows     | Extra       |
+----+----------------------+----------+-------+---------------+---------+---------+------+----------+-------------+
|  1 | PRIMARY              | to_visit | ALL   | NULL          | NULL    | NULL    | NULL | 20345551 | USING WHERE |
|  2 | UNCACHEABLE SUBQUERY | to_visit | INDEX | NULL          | PRIMARY | 4       | NULL | 20345551 | USING INDEX |
+----+----------------------+----------+-------+---------------+---------+---------+------+----------+-------------+
 
 
2 rows IN SET (0.00 sec)

[5]

@rand_id := FLOOR(1 + (RAND() * (@max_id - 1)));
SELECT `column` FROM `table` WHERE `id` >= @rand_id ORDER BY `id` ASC LIMIT 1

Solution [5] is equivalent to solution [4]. However, if `column` _is_ a primary key then only the index table will be searched. Worst case scenario (`column` is not a primary key) a full table scan may be done. Best case scenario is that `column` is indexed and it would be significantly faster.

The main problem in all these solutions is the assumption that MySQL will do some magical handling for any reference to a column that has a primary key. MySQL will not use an indexed column if you are fetching anything other than indexed columns (not exactly true – see http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html .)

My solution makes use of a table index. The following code is in Python (which means it should be very easy to read :D .)

cursor.execute("SELECT MAX(`ID`), MIN(`ID`) FROM `to_visit`")
result = cursor.fetchall()
maxid = result[0][0]
minid = result[0][1]
id = random.randint(minid, maxid)
cursor.execute("SELECT * FROM `to_visit` as t1 JOIN (SELECT ID FROM `to_visit` WHERE `ID`>="+str(id)+" limit 1) as t2 where t1.id = t2.id")
result = cursor.fetchall()

Selecting MAX(`ID`) and MIN(`ID`) should be constant time (it is for MyISAM storage engine.) The second query uses a random ID generated in python. Since the inner query fetches the `ID` – which is a primary key – only a scan of the index table is required. The outer join is performed in constant time because you are referencing a specific row by using one specific primary key.

mysql> EXPLAIN SELECT * FROM `to_visit` AS t1 JOIN (SELECT ID FROM `to_visit` WHERE `ID`>=1 LIMIT 1) AS t2 ON t1.id = t2.id;
 
+----+-------------+------------+--------+---------------+---------+---------+-------+----------+--------------------------+
| id | select_type | TABLE      | type   | possible_keys | KEY     | key_len | ref   | rows     | Extra                    |
+----+-------------+------------+--------+---------------+---------+---------+-------+----------+--------------------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL    | NULL    | NULL  |        1 |                          |
|  1 | PRIMARY     | t1         | const  | PRIMARY       | PRIMARY | 4       | const |        1 |                          |
|  2 | DERIVED     | to_visit   | INDEX  | PRIMARY       | PRIMARY | 4       | NULL  | 21254018 | USING WHERE; USING INDEX |
+----+-------------+------------+--------+---------------+---------+---------+-------+----------+--------------------------+
 
 
3 rows IN SET (0.00 sec)

I used this small snippet of code to benchmark my solution.

def fetch_random_row_test():
	global db
	start = time.time()
	cursor = db.cursor()
	cursor.execute("SELECT MAX(`ID`), MIN(`ID`) FROM `to_visit`")
	result = cursor.fetchall()
	maxid = result[0][0]
	minid = result[0][1]
	for x in range(0,1000):
		id = random.randint(minid, maxid)
		cursor.execute("SELECT * FROM `to_visit` as t1 JOIN (SELECT ID FROM `to_visit` WHERE `ID`>="+str(id)+" limit 1) as t2 where t1.id = t2.id")
		result = cursor.fetchall()
 
	print time.time() – start

Even though the code doesn’t use a performance timer it should still give fairly accurate and practical results. On my home computer it completed in 21 seconds (that is it fetched 1000 random rows in 21 seconds.) There are other issues that need to be addressed when selecting random rows from a MySQL table (such as fragmentation – which would make the “randomness” not so random :P .) I will maybe discuss fragmentation in a future blog post.

RE: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
RE: http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html