Anatomy of a query pileup

Anatomy of a query pileup
We have been trying to understand the cause of a bottleneck that causes intermittent query pileups. By pileup I mean that the number of concurrent queries jumps from 10 to 1000 in a few seconds and stays high for 10 or more seconds. Query response time increases significantly during a pileup. Users tend to dislike that.

Debugging, step 1



I think that bug 46459 is the cause. The search for the bug was interesting. We started with monitoring data from a subset of iostat, vmstat, SHOW STATUS and SHOW INNODB STATUS counters using a sample interval of more than 30 seconds. The data from this wasn't sufficient to understand the problem so we deployed mstat.py to sample counters more frequently on selected machines. While it would be great for my debugging efforts to always sample all counters every second, such a change would overwhelm the monitoring infrastructure and the sampled database servers.

Debugging, step 2



We added more SHOW STATUS output for InnoDB IO latency. From that we learned that IO latency was not a problem. Histograms were added in addition to counters for average latency as we want to know whether there are intermittent latency spikes.

With output from mstat.py we were able to determine that:

  • the query arrival rate did not increase prior to the problem

  • most metrics that counted resource consumption (rows read, etc) dropped significantly during the problem

  • with one exception below, most metrics from vmstat and iostat indicated the server became significantly more idle during the problem

  • CPU utilization numbers from vmstat did not change signficantly during the problem



Debugging, step 3



From this knew that the problem was not from an increase in the query arrival rate and it was also not from the arrival of more expensive queries. The next step was to determine what queries were running during the pileup. A shell script was started to collect SHOW PROCESSLIST samples at the same rate as the performance counters. Alas, we did not learn much from that output. Then another shell script was started to collect SHOW INNODB STATUS output at the same rate and that output was much more interesting.

Brief lecture on InnoDB



Some of the places where a query can block with InnoDB are:

  • wait for a synchronous IO request to complete

  • wait for asynchronous IO request to be completed by a background IO thread

  • wait to get the per-table auto increment lock

  • wait to enter the InnoDB kernel without violating innodb_thread_concurrency. Note that InnoDB does not enforce this limit exactly. When a thread enters the kernel it gets a number of tickets and returns 1 ticket per re-entry. Only when it runs out of tickets must it enter a queue to get more tickets and allow another thread to enter.

  • wait to get the commit mutex when the binlog is enabled. Group commit is not used when the binlog is enabled.

  • wait to lock the adaptive hash index read-write lock. Note that this lock is not always released when a session leaves the InnoDB kernel.



When a thread really leaves the kernel it resets its ticket count to 0 to allow other threads to enter the kernel when the innodb_thread_concurrency limit has been reached. The thread also releases the adaptive hash index read-write lock if it cached it. Prior to MySQL 5.0.84 and 5.1.36, MySQL did not release these resources as quickly as needed. But <a "http://bugs.mysql.com/bug.php?id=32149">bug 32149 fixes that.

Debugging, step 4



I knew most of this prior to working on this problem, but that wasn't enough to debug it. The next step was to use gdb to dump stack traces for mysqld during a pileup. I then processed the call stacks using the poor man's profiler and saw two interesting thread stacks. There were 31 threads with this stack trace in common:

__lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,innobase_xa_prepare,ha_commit_trans,ha_autocommit_or_rollback


And 1 thread with this stack trace:

datasync,my_sync,MYSQL_LOG::flush_and_sync,MYSQL_LOG::write,MYSQL_LOG::write,MYSQL_LOG::log_xid


Debugging, step 5



From this we guessed that there was a bottleneck in the InnoDB commit path. That is not surprising given that group commit is not used when the binlog is enabled. A possible cause for the bottleneck is slow handling of the fdatasync() call used to sync the binlog. Alas, there are no counters for that operation or the fsync/fdatasync calls used by InnoDB. Maybe DTrace is a workaround for this, but we really need more counters in mysqld source to improve debugging.

At this point I reviewed the code in innobase_xa_prepare() that is used in the PREPARE phase of 2PC on a commit when the binlog is enabled. I learned that function calls to signal that a thread has exited the InnoDB kernel, release concurrency tickets and the adaptive hash latch are only made after the commit mutex has been obtained. Follow the link for bug 46459 above to get more details. But the results are that when many threads are blocked waiting for the commit mutex, they may be counted as inside the InnoDB kernel and this prevents other queries from running. This should only be an issue for insert, update and delete statements run in autocommit mode and that is explained in the bug report.

Debugging, step 6



At this point I have a bug report with a reproduction case. I also have a kludge to fix the problem, but the kludge needs more effort to become a good fix. However, I need a benchmark test to reproduce the pileup in the current code and show that the kludge fixes the problem.

The test



I started with iibench, the insert benchmark code, and removed the secondary indexes from the test. I then modified it to support a variable number of query and insert sessions for one table. This test did not reproduce the problem. There is no chance of a bottleneck in the comment code when all transactions insert to the same table using insert statements run with autocommit because all but one of the concurrent insert sessions will block trying to get the autocommit lock.

I then added a shell script to run N instances of the test where each instance uses a different table and starts one insert session and one query session. With this test, if enough instances of the test are run, then the insert sessions can block on the commit mutex and prevent the query sessions from running. I ran this test and measured the rate at which inserts and queries were done. The results are interesting:

Test server



I used a server with 4 CPU cores, XFS with the write barrier enabled and 1 7200 RPM SATA disk. With this configuration, fsync and fsyncdata run at disk speed so the numbers below are not overwhelming but the performance benefit from the kludge is. Settings for my.cnf were:

innodb_buffer_pool_size=2000M
innodb_log_file_size=100M
innodb_flush_log_at_trx_commit=1
innodb_max_dirty_pages=80
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=8
log_bin
max_connections=500


Test results



With 4 tables, 4 insert sessions and 4 query sessions I get:

original: ~16 inserts/second and ~10,000 queries/second
kludge: ~16 inserts/second and ~10,000 queries/second


With 8 tables, 8 insert sessions and 8 query session I get:

original: ~16 inserts/second and ~2,400 queries/second
kludge: ~16 inserts/second and ~10,000 queries/second


With 16 tables, 16 insert sessions and 16 query sessions I get:

original: ~16 inserts/second and ~8 queries/second
kludge: ~16 inserts/second and ~10,000 queries/second


For the last test, CPU utilization was ~1% without the kludge and over 90% with the kludge.

The code for the the test is in launchpad (bzr branch lp:mysql-patch/mytools) at mytools/bench/run_ai.

Conclusion



The kludge avoids the worst case. Now I need to make it production quality. Debugging problems like this is a great way to learn a lot about InnoDB.

In this note

No one.