
Loading half a billion rows into MySQL

Thursday, June 7, 2012
Background

We have a legacy system in our production environment that keeps track of when a user takes an action on Causes.com (joins a Cause, recruits a friend, etc). I say legacy, but I really mean a prematurely-optimized system that I’d like to make less smart. This 500m-record database is split across monthly sharded tables. It seems like a great solution to scaling (and it is), except that we don’t need it. Based on our usage pattern (e.g. to count a user’s total number of actions, we have to query N tables; see the sketch after the goals below), this leads to pretty severe performance degradation. Even with a memcache layer sitting in front of the old month tables, new features keep discovering new N-query performance problems. Noticing that we have another database happily chugging along with 900 million records, I decided to migrate the existing system into a single-table setup. The goals were:

reduce complexity. Querying one table is simpler than N tables.
push as much complexity as possible to the database. The wrappers around the month-sharding logic in Rails are slow and buggy.
increase performance. Again, this comes down to querying one table being simpler than querying N.
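To make the N-query problem concrete, here is a minimal sketch of what counting a user’s actions looks like against month-sharded tables versus a single consolidated table. The table and column names (actions_2012_04, user_id, etc.) are made up for illustration, not our actual schema:

  -- Month-sharded: one lookup per month table the user might appear in
  SELECT
    (SELECT COUNT(*) FROM actions_2012_04 WHERE user_id = 42) +
    (SELECT COUNT(*) FROM actions_2012_05 WHERE user_id = 42) +
    (SELECT COUNT(*) FROM actions_2012_06 WHERE user_id = 42) AS total_actions;

  -- Single table: one query against one index
  SELECT COUNT(*) FROM actions WHERE user_id = 42;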
Alternative Proposed Solutions

MySQL Partitioning: this was the most similar to our existing setup, since MySQL internally stores the data in different tables. We decided against it because it seemed likely that it wouldn’t be much faster than our current solution (although MySQL can internally do some optimizations to make sure you only look at partitions that could possibly have data you want). And it’s still the same complexity we were looking to reduce (and it would also be the only database setup in our system using partitioning).
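For reference, a native range-partitioned version of the actions table might have looked roughly like this; the schema is hypothetical and simplified, not what we actually run:

  -- Hypothetical: partition by month on created_at, mirroring the manual shards
  CREATE TABLE actions (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at),
    KEY idx_user_id (user_id)
  ) ENGINE=InnoDB
  PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p201204 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION p201205 VALUES LESS THAN (TO_DAYS('2012-06-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );

With this layout the optimizer can prune partitions when a query filters on created_at, which is the internal optimization mentioned above.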

Redis: Not really proposed as an alternative because the full dataset won’t fit into memory, but something we’re considering loading a subset of the data into, to answer queries we make a lot that MySQL isn’t particularly good at (e.g. ‘which of my friends have taken an action’ is quick using Redis’s built-in set operations). The new MySQL table might be performant enough that it doesn’t make sense to build a fast Redis version, so we’re avoiding this as possible premature optimization, especially with a technology we’re not as familiar with.
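For what it’s worth, a minimal redis-cli sketch of that friends query would look something like the following; the key names are invented for illustration, and the ‘which of my friends took this action’ lookup is a set intersection (SINTER) between the two sets:

  # one set of friend ids per user, one set of actor ids per action (hypothetical keys)
  SADD friends:42 7 19 23 101
  SADD action:join_cause:555 19 23 999

  # which of user 42's friends took action 555? one set intersection
  SINTER friends:42 action:join_cause:555
  # => 19, 23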

Dumping the old data

MySQL provides the `mysqldump’ utility to allow quick dumping to disk:

  mysqldump -T /var/lib/mysql/database_data database_name
This will produce a TSV data file (plus a .sql schema file) for each table in the database, and TSV is the format that `LOAD DATA INFILE’ will be able to quickly load later on. Note that the data files are written by the MySQL server itself (via SELECT ... INTO OUTFILE), so the target directory has to be writable by the mysql user.
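On the load side, each per-table TSV file can later be pulled into the new table with something along these lines; the path and table name are placeholders:

  -- Run from the mysql client; the file path must be readable by the server
  LOAD DATA INFILE '/var/lib/mysql/database_data/table_name.txt'
    INTO TABLE table_name
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n';

The FIELDS/LINES clauses shown are actually the defaults, which is why mysqldump -T output loads back in without any extra format options.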

Installing Percona 5.5

We’ll be building the new system with the latest-and-greatest in Percona databases on CentOS 6.2:

  yum install Percona-Server-shared-compat Percona-Server-client-55 Percona-Server-server-55 -y
[Note: there is an open bug with the compat package: https://bugs.launchpad.net/percona-server/+bug/908620]
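A quick sanity check after the install, assuming the packages landed cleanly (exact version strings will vary):

  rpm -qa | grep -i percona
  mysql --version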

Specify a directory for the InnoDB data

This isn’t exactly a performance tip, but I had to do some digging to get MySQL to store data on a different partition. The first step is to make sure your my.cnf contains a

datadir = /path/to/data
directive. Make sure /path/to/data is owned by mysql:mysql (chown -R mysql:mysql /path/to/data) and run:

mysql_install_db --user=mysql --datadir=/path/to/data
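Assuming the stock init script (the Percona packages install it as `mysql’ on CentOS), the full sequence looks roughly like this:

  service mysql stop                                        # stop the server if it is already running
  chown -R mysql:mysql /path/to/data                        # data directory must be owned by the mysql user
  mysql_install_db --user=mysql --datadir=/path/to/data     # initialize system tables in the new location
  service mysql start                                       # start the server against the new datadir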

Read more: derwiki

