FLIFO scheduling for InnoDB

FLIFO scheduling for InnoDB
At the end of this note I describe how InnoDB can be much faster (2.5X) for high-concurrency workloads. However, what we really did is improve the code to not get 2.5X slower.

InnoDB uses innodb_thread_concurrency to limit the number of threads that run concurrently. Enforcement of this limit is imprecise to improve performance. By imprecise, I mean that there are usually fewer than innodb_thread_concurrency threads running within InnoDB even when there are many threads ready, willing and able to run there.

In what follows assume that there is a 1:1 mapping between thread, session and connection.

Enforcement is implemented using tickets. Each time a thread enters InnoDB it gives up 1 ticket. When a thread has 0 tickets it gets more tickets if the number of threads with non-zero ticket counts is less than innodb_thread_concurrency. Otherwise the thread is put at the end of the queue of waiting threads and the thread at the head of the queue is scheduled to run. This queue provides FIFO scheduling. A thread has zero tickets at the start of statement execution.

There is also the notion of exiting InnoDB. A thread exits InnoDB at the end of a statement, during a row lock wait and when performing long-running work outside of InnoDB. Long-running work includes sending results back to a client and performing a file sort. The cases in which a thread exits InnoDB expanded with fixes for bug 32149 (thanks to Ben for finding these). Note that a thread does not exit InnoDB while blocked on IO. I will experiment with that in the future.

So far, this sounds reasonable, with the exception that threads don't exit InnoDB when blocked on IO. There will be at most 8 pending read requests when innodb_thread_concurrency=8 and this might not utilize a servers IO capacity. An IO bound server needs a higher value for innodb_thread_concurrency than a CPU bound server. As some servers are IO bound part of the time and CPU bound part of the time, it is difficult to find the best value for innodb_thread_concurrency.

But I encountered a new problem. I noticed a few servers with 1000+ concurrent queries and when I looked at the thread stacks all threads were blocked on the per-thread condition variables used to schedule them to be run. This server used innodb_thread_concurrency=8 and eventually I confirmed that 8 threads had been scheduled to run. The problem is that the threads took too long to begin running.

At this point I made that joke that while the scheduler in Linux may be O(1), there is a high constant factor. And then someone told me that we weren't using that yet.

We don't want to modify libc, NPTL or the kernel, so we need to fix this in InnoDB. InnoDB uses FIFO to schedule the threads. We changed it to use a hybrid of FIFO+LIFO, thus the name FLIFO scheduling. This makes performance much better on high-concurrency workloads. The behavior is enabled by the my.cnf parameter innodb_thread_lifo.

Two variables were added as part of the change:

  • fifo-pending - the number of threads that have been selected to run via a pthread condition variable broadcast but have yet to begin running. This is incremented by threads that exit InnoDB when they schedule another thread to run and is decremented by the thread as soon as it begins to run.
  • lifo-running - the number of threads that are currently running courtesy of the LIFO policy.

When a thread would not be allowed to enter InnoDB because of the FIFO policy and the innodb_thread_concurrency limit, it is allowed to enter by the LIFO policy when:

lifo-running < fifo-pending


This allows 2 * innodb_thread_concurrency threads to run concurrently in some cases, so the value should be set with that in mind.

The results are impressive. See the results on a chart. The numbers are the TPS from sysbench readonly for MySQL 5.0.84 using the new FLIFO policy versus the original FIFO policy on an 8-core x86 server. The difference is that throughput is constant with FLIFO at high-concurrency while it degrades signficantly for the original FIFO.

5084-flifo 186.22 342.65 570.79 707.82 718.84 665.60 574.96 590.33 611.15 612.69 637.37
5084-original 180.95 335.37 516.97 645.86 625.14 648.88 658.71 579.59 515.26 431.97 255.04

sysbench command line for the test:

sysbench --test=oltp --mysql-db=test --oltp-table-size=2000000 --max-time=180 \
--max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable \
--mysql-engine-trx=yes --oltp-read-only --oltp-skip-trx --oltp-dist-type=uniform \
--oltp-range-size=1000 --num-threads=1 --seed-rng=1 run


my.cnf for FLIFO:

innodb_buffer_pool_size=2000M
innodb_log_file_size=100M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=16
max_connections=2000
innodb_max_dirty_pages_pct=80
table_cache=2000

innodb_thread_sleep_delay=0
innodb_concurrency_tickets=500
innodb_thread_lifo=1

my.cnf for original FIFO:

innodb_buffer_pool_size=2000M
innodb_log_file_size=100M
innodb_doublewrite=0
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=32
max_connections=2000
innodb_max_dirty_pages_pct=80
innodb_flush_log_at_trx_commit=2
table_cache=2000


This is vmstat output during the test. Note that context switches for InnoDB+FLIFO are about 1/3 of the rate for original InnoDB and CPU times for us/sy/id/wa are much better for FLIFO.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
# Typical output from vmstat for InnoDB+FLIFO
40 0 0 56524420 1255528 5451320 0 0 0 81 1035 66694 81 11 8 0 0

# and from original InnoDB
53 0 0 56494980 1255616 5460592 0 0 0 95 1037 238890 58 23 19 0 0

In this note

No one.