Group commit has an interesting history in MySQL. Peter opened bug 13669 for this many years ago. That bug has been closed and InnoDB announced that the plugin uses group commit. That is only true when the binlog is disabled. Things are more complicated when the binlog is enabled and I am not certain this is clear from the documentation.
There are three log writes during commit when the binlog is enabled. I have written about this before and am repeating myself to help the reader understand how group commit has been fixed in the Facebook patch. The log writes are:
- Prepare InnoDB by forcing changes from the transaction to the InnoDB transaction log. This does an fsync on the transaction log file and that fsync can be shared by several transactions.
- Write the changes from the transaction to the binlog. When sync_binlog=1 this does an fsync on the binlog. An InnoDB mutex, prepare_commit_mutex, is held for the duration of this call so this cannot be shared by several transactions. This can be the bottleneck for high-throughput OLTP and a good workaround is to use a battery-backed RAID card so that fsync is fast but still safe. This step makes commit official. If the server crashes after this step, the transaction will be committed. If the server crashes before this step, the transaction will be rolled back during crash recovery.
- Release innodb_prepare_commit mutex and then force a commit record to the InnoDB log. This does an fsync on the transaction log file and that fsync can be shared by several transactions.
How did InnoDB fix group commit in the 1.0.4 plugin? They changed either step 1 or step 3 to use group commit and that can make performance much better when the binlog is disabled. Alas, most of us run with the binlog enabled. It might not be enabled on a slave, but there is much less opportunity for group commit on a slave as the SQL thread is single-threaded. The peak number of commits per second on a server can be limited by the peak number of fsync calls per second that a disk system can perform.
Vamsi Ponnekanti and Ryan McElroy implemented group commit for the binlog write/fsync and this is now part of the Facebook patch for MySQL. Kristian Nielsen has begun working on a fix for this as part of MariaDB. Mats from the MySQL replication team has begun describing a fix for this as part of official MySQL. As MySQL is reluctant to commit to claiming things will be fixed in future releases, I won't claim that MySQL has work in progress. I will claim that I think this won't be a problem in the future and that bug 49326 will be fixed.
Three solutions have been described and it is possible that three different solutions will be implemented. I don't mind this diversity. At least one of these solutions will be excellent. Work done by my team has different constraints than work done for MariaDB or official MySQL. They can do the right thing even when that requires changing a lot of code and taking a lot of time. Others must solve problems faster.
Group commit in the Facebook patch
Vamsi provided a lot of the content that follows.
Group commit isn't possible today in official MySQL and MariaDB for the binlog write/flush/fsync because prepare_commit_mutex is locked for the duration of that work. The first change we made was to add a dynamic server configuration variable, innodb_prepare_commit_mutex, to disable use of that mutex. The mutex is only required when hot backup is being done and most of the time we are not taking a hot backup. The mutex is not used when innodb_prepare_commit_mutex=OFF so that threads can concurrently run the code that does the write/flush/fsync for the binlog.
The next step is to make sure that the binlog and InnoDB transaction log are written in the same order. This would otherwise be possible when prepare_commit_mutex is not used. We use tickets to force the same order. A thread is assigned a ticket number when it prepares an InnoDB transaction by writing changes for that transaction to the InnoDB log. Threads write transaction changes to the binlog in ticket number order. After writing the changes to the binlog a thread will wait a small amount of time. During that wait another thread might do the binlog fsync thus saving this thread and possibly others from also doing an fsync. A server configuration variable, force_binlog_order, was added to determine whether this feature is enabled.
Unfortunately, all of the changes described thus far do not improve performance when there are concurrent threads trying to modify the same rows. The results below are from a sysbench read-write test with group commit enabled. TPS degrades at high-concurrency:
16 32 64 128 256 384 512 #concurrent clients
203 376 297 156 93 86 71 transactions per second
The problem is that the row locks are not released until the commit step is done (step 3 above). This means that the locks are not released until 2 fsyncs have been done (one for the InnoDB prepare step and one for the binlog write/flush/fsync step). We then added an server configuration variable, innodb_release_locks_early, to determine whether row locks are released during the prepare step. By doing this it is possible for other sessions to read changes that are not committed (if there is a crash after the InnoDB prepare step and before the binlog fsync, then that transaction will be rolled back during crash recovery). You should determine whether this is an issue.
The tests were repeated after the server included the change described above. Results are much better:
16 32 64 128 256 512 #concurrent clients
203 376 297 156 93 71 transactions per second without change
645 621 631 transactions per second with change
The final task was to make sure that this performs as expected for real workloads with lots of concurrency. Initial performance was disappointing. Vamsi discovered that there was too much mutex contention from the code that made the binlog write order match the InnoDB prepare order. The problem was that a single condition variable was used and broadcast was done to wake all threads waiting on it rather than the one thread that should next write the binlog. Vamsi fixed this by using an array of condition variables, making threads wait on (ticket# mod array_size), and then doing a broadcast only for one element of the array. This causes far fewer spurious wakeups.
If you want to read the source code for the changes then check out the change log for the Facebook patch for MySQL.
- The function innobase_xa_prepare was modified to make the use of prepare_commit_mutex optional.
- To implement force_binlog_order transactions get ticket numbers at the time of prepare (innobase_xa_prepare, MSQL_BIN_LOG::enqueue_thread) and they write to binlog in the order of their ticket numberTwo new global variables were added: next_ticket and current_ticket. The variable next_ticket is the next available ticket number. Threads wait until the value of current_ticket equals the ticket number they were assigned in innobase_xa_prepare. See MYSQL_BIN_LOG::dequeue_thread_in_order().
- After writing the binlog the thread does a condition variable broadcast for one element of binlog_commit_cond_array which is an array of pthread condition variables.
- The binlog is forced to disk in the function MYSQL_BIN_LOG::flush_and_sync. Threads do a wait with timeout to allow other threads to write the binlog and possibly for it to disk so that one fsync can force several transactions to disk. The thread waits for up to sync_binlog_timeout_usecs microseconds (yet another server configuration variable we added). Waiting is implemented as pthread_cond_timedwait on a new condition variable (binlog_cond) added for this purpose. Upon wakeup this thread will force the binlog to disk if another thread has not done so. A high value of sync_binlog_timeout_usecs increases commit latency while a low value is less likely to share the fsync across several transactions.
- Row locks are released during the InnoDB prepare step when innodb_release_locks_early is on. This is implemented by calling lock_release_off_kernel from innobase_xa_prepare. However, this change is not sufficient. Innodb has the notion of implicit locks. The function lock_clust_rec_some_has_impl has code that looks at the transaction ID embedded in the row, and if that transaction is active (as determined by the function trx_is_active()), it concludes that the transaction has an implicit lock. When another thread does this it will create an explicit lock is created using the function lock_rec_convert_impl_to_expl. As a result, even though transaction T1 releases locks at prepare time, a subsequent update scan by transaction T2 (see lock_clust_rec_read_check_and_lock) recreates those locks on behalf of T1. This undoes the benefit of early lock release. The function trx_is_active is only used for determining if implicit locks exist for a transaction. We enhanced it to return FALSE if a transaction is already in the prepared state and the option innobase_release_locks_early is set.
There are other changes as part of this feature:
- If group commit hangs detect it and kill the server.
- If current_ticket does not advance for GROUP_COMMIT_HANG_ERROR_SECONDS (defined as 1s) then the server logs messages to the error log and if current_ticket does not advance for GROUP_COMMIT_HANG_KILL_SECONDS (defined as 60s), then it kills the server.
- The server is killed if there is an attempt to rollback after locks are released. This should never happen.
Use caution while turning on group commit
As there are 3 variables controlling the feature, the order in which they are changed becomes important. This will be enabled and disabled dynamically as it should not be enabled when hot backups are done. If they are to be enabled dynamically, the suggested order when enabling group commit is:
- Turn ON force_binlog_order.
- Turn OFF innodb_prepare_commit_mutex.
- Set sync_binlog_timeout_usecs. (Only after this, we will start seeing benefits of group commit.)
Turning OFF innodb_prepare_commit_mutex before turning ON force_binlog_order could potentially cause some transactions to write to binlog in a different order than their commit order in transaction log.
The suggested order for dynamically disabling group commit is:
- Clear sync_binlog_timeout_usecs.
- Turn ON innodb_prepare_commit_mutex.
- Turn OFF force_binlog_order.
We added several tests for group commit:
- force_binlog_order.test: Shows that if force_binlog_order is turned OFF, transactions that modify DIFFERENT rows can write to binlog in a different order than their commit in xact log, and that turning it ON fixes the order of write to binlog. It tests both auto_commit on and off cases.
- release_row_locks_early.test: Shows that if force_binlog_order is turned OFF and innodb_release_locks_early is turned ON, transactions that modify the same rows can write to the binlog in a different order than their commit in the InnoDB transaction log, and that turning it ON fixes the order of the binlog writes. It tests both auto_commit on and off.
- group_commit_crash.test: For five crash points already setup in the server, this test checks that transaction ordering is same in both group committ on/off cases.