
High-Value Transaction Processing

Mark Callaghan
Percona Live
February, 2011
What do I mean by value?
▪  Low price?
▪  High price/performance?
▪  Valuable data
OLTP in the datacenter
▪  Sharding

▪  Availability

▪  Legacy applications
▪  Used by many applications
Sharding
▪  Sharding is easy, resharding is hard
▪  Joins within a shard are still frequent and useful
▪  Some all-shards joins must use Hive
▪  Provides some fault-isolation benefits
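A minimal sketch of hash-based shard routing; the function name and shard counts are hypothetical, not Facebook's scheme. It also illustrates why resharding is hard: changing the shard count remaps most keys, which means moving most rows.

```python
# Hypothetical sharding sketch: route a row to a shard by hashing its key.
from zlib import crc32

def shard_for(user_id, n_shards):
    # Same key always maps to the same shard for a fixed shard count.
    return crc32(str(user_id).encode()) % n_shards

# Resharding is hard: growing from 16 to 24 shards remaps most keys.
moved = sum(shard_for(i, 16) != shard_for(i, 24) for i in range(10_000))
print(moved > 5000)  # most of the 10,000 keys change shards
```

Schemes like consistent hashing reduce how many keys move, but joins within a shard still require related rows to be co-located, which constrains the choice of shard key.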
Availability
▪  Sources of downtime
▪  Schema change (but now we have OSC)
▪  Manual failover
▪  Misbehaving applications
▪  Oops
Used by many applications
If your company is successful then
▪  Your database will be accessed by many different applications
▪  Application authors might not be MySQL experts
▪  Application owners might have different priorities than the DB team
Legacy applications
If your company is successful then you will have
▪  Applications written many years ago by people who are gone
▪  Design decisions that are not good for your current size
▪  Not enough resources or time to rewrite applications
Our busy OLTP deployment
▪  Query response time: 4 ms reads, 5 ms writes
▪  Rows read per second: 450M peak
▪  Network bytes per second: 38GB peak
▪  Rows changed per second: 3.5M peak
▪  Queries per second: 13M peak
▪  InnoDB page IO per second: 5.2M peak
Recent improvements
▪  Joint work by Facebook, Percona and Oracle/MySQL
▪  Prevent InnoDB stalls
▪  Stalls from caches
▪  Stalls from mutexes
▪  IO efficiency
▪  Improve monitoring
▪  Improve XtraBackup
How do you measure performance?
▪  Response time variance leads to bad user experiences
▪  Optimizations that defer work must handle steady-state loads
▪  When designing a server the choices are:
▪  No concurrency (and no mutexes)
▪  One mutex
▪  More than one mutex
This has good average performance
Which metric matters?
Stalls from caches
Caches that defer expensive operations must eventually complete them
at the same rate at which they are deferred.
▪  InnoDB purge
▪  InnoDB insert buffer
▪  Async writes are not async
▪  Fuzzy checkpoint constraint enforcement
InnoDB purge stalls
▪  InnoDB purge removes delete-marked rows
▪  Done by the main background thread in 5.1 plugin
▪  Optionally done by a separate thread in 5.5

▪  Purge is single-threaded and might be stalled by disk reads
▪  The further it gets behind, the less likely it is to catch up
▪  Multiple purge threads are needed, as the main background thread can become the dedicated purge thread and that isn't enough

/* main-thread purge loop: repeat until no work remains */
do {
n_pages_purged = trx_purge();
} while (n_pages_purged);
InnoDB insert buffer stalls
▪  The insert buffer is not drained as fast as it can fill
▪  Drain rate is 5% of innodb_io_capacity
▪  bugs.mysql.com/59214
▪  Fixed in the Facebook patch and XtraDB
▪  Patch pending for MySQL 5.5
Performance drops when ibuf is full
Otherwise, the insert buffer is awesome
Fuzzy checkpoint constraint
▪  TotalLogSize = #log_files X innodb_log_file_size
▪  AsyncLimit = 0.70 X TotalLogSize

▪  SyncLimit = 0.75 X TotalLogSize

▪  OldestDirtyLSN is the smallest oldest_modification LSN of all dirty pages in the


buffer pool

▪  Age = CurrentLSN – OldestDirtyLSN

Fuzzy Checkpoint Constraint


▪  If Age > SyncLimit then flush_dirty_pages_synch()

▪  Else if Age > AsyncLimit then flush_dirty_pages_async()


Async page writes are not async
▪  Async page write requests submitted per the fuzzy checkpoint constraint are not async
▪  User transactions may do this via log_preflush_pool_modified_pages
▪  Caller does a large write for the doublewrite buffer
▪  Caller then submits in-place write requests for background write threads
▪  Caller then waits for background write threads to finish
▪  bugs.mysql.com/55004
▪  Fixed in the Facebook patch
Fuzzy checkpoint constraint enforcement
Prior to InnoDB plugin 5.1.38, page writes done to enforce the fuzzy
checkpoint constraint were not submitted by the main background
thread.
▪  InnoDB plugin added innodb_adaptive_flushing in 5.1.38 plugin
▪  Percona added innodb_adaptive_checkpoint
▪  Facebook patch added innodb_background_checkpoint
Sysbench QPS at 20 second intervals with checkpoint stalls
Stalls from mutexes
▪  Extending InnoDB files
▪  Opening InnoDB tables
▪  Purge/undo lock conflicts
▪  TRUNCATE table and LOCK_open
▪  DROP table and LOCK_open
▪  Buffer pool invalidate
▪  LOCK_open and kernel_mutex
▪  Excessive calls to fcntl
▪  Deadlock detection overhead
▪  innodb_thread_concurrency
Stalls from extending InnoDB files
▪  A global mutex is locked while InnoDB tables are extended, for the duration of the writes that extend the file
▪  All reads on the file are blocked until the writes are done
▪  bugs.mysql.com/56433
▪  To be fixed real soon in the Facebook patch
Stalls from opening InnoDB tables
▪  Opening table handler instances is serialized on LOCK_open. Index cardinality stats might then be computed using random reads
▪  bugs.mysql.com/49463 and bugs.mysql.com/53046
▪  Fixed in the Facebook patch and MySQL 5.5
▪  When stats are recomputed, many uses of that table will stall
▪  Fixed in the Facebook patch
▪  Index stats could be recomputed too frequently
▪  bugs.mysql.com/56340
▪  Fixed in the Facebook patch, MySQL 5.1 and MySQL 5.5
Stalls from purge/undo lock conflicts
▪  Purge and undo are not concurrent on the same InnoDB table
▪  Purge gets a share lock on the table
▪  Undo gets an exclusive lock on the table

▪  REPLACE statements that use insert-then-undo can generate undo
▪  bugs.mysql.com/54538
▪  Fixed in MySQL 5.1.55 and MySQL 5.5
TRUNCATE table and LOCK_open
▪  LOCK_open is held when the truncate is done by InnoDB
▪  When file-per-table is used the file must be removed and that can take too long
▪  The InnoDB buffer pool LRU must be scanned

▪  New queries cannot be started
▪  bugs.mysql.com/41158 and bugs.mysql.com/56696
▪  Fixed in MySQL 5.5 courtesy of meta-data locking
DROP table and LOCK_open
▪  LOCK_open is held when the drop is done by InnoDB
▪  When file-per-table is used the file must be removed and that can take too long
▪  The InnoDB buffer pool LRU must be scanned

▪  New queries cannot be started
▪  bugs.mysql.com/56655
▪  Fixed in the Facebook patch
▪  Do most InnoDB processing in the background drop queue
▪  Fixed in MySQL 5.5 courtesy of meta-data locking
TRUNCATE/DROP table and invalidate
▪  Pages for table removed from buffer pool and adaptive hash
▪  InnoDB buffer pool mutex locked while the LRU is scanned
▪  This is slow with a large buffer pool
▪  Most threads in InnoDB will block waiting for the buffer pool mutex
▪  bugs.mysql.com/51325 and bugs.mysql.com/56332
▪  I hope Yasufumi can fix it
LOCK_open and kernel_mutex conflicts
▪  Thread A
▪  Gather table statistics while holding LOCK_open
▪  Block on kernel_mutex while starting a transaction
▪  Thread B
▪  Hold kernel_mutex while doing deadlock detection
▪  All other threads block on LOCK_open or kernel_mutex
▪  bugs.mysql.com/51557
▪  Fixed in MySQL 5.5
Stalls from excessive calls to fcntl
▪  My Linux kernels get the big kernel lock on fcntl calls
▪  MySQL called fcntl too often
▪  Doubled peak QPS by hacking MySQL to call fcntl less
▪  Almost 200,000 QPS without using HandlerSocket
▪  bugs.mysql.com/54790
▪  Fixed in the Facebook patch, then reverted because it broke SSL tests
▪  Not sure where or when this will be fixed
Sysbench read-only with fcntl fix
Stalls from deadlock detection overhead
▪  InnoDB deadlock detection was very inefficient. The worst case is when all threads wait on the same row lock
▪  Added an option to disable it in the Facebook patch and rely on lock wait timeout
▪  MySQL made it more efficient in MySQL 5.1
▪  bugs.mysql.com/49047
Stalls from innodb_thread_concurrency
▪  When there are 1000+ sleeping threads it can take too long to wake up a specific thread
▪  Change innodb_thread_concurrency to use FIFO scheduling in addition to the existing use of LIFO, and FIFO+LIFO = FLIFO
▪  Fixed in the Facebook patch
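The LIFO-vs-FIFO distinction can be shown in a few lines; this is an illustration of the wakeup policies, not InnoDB code.

```python
# Illustrative sketch: with LIFO wakeup the thread that arrived first is woken
# last, so under sustained load it can wait arbitrarily long; FIFO wakes
# threads in arrival order.
from collections import deque

def wake_order(waiters, policy):
    q = deque(waiters)  # waiters listed in arrival order
    order = []
    while q:
        order.append(q.pop() if policy == "LIFO" else q.popleft())
    return order

threads = ["t1", "t2", "t3", "t4"]  # t1 arrived first
print(wake_order(threads, "LIFO"))  # ['t4', 't3', 't2', 't1'] -- t1 starves
print(wake_order(threads, "FIFO"))  # ['t1', 't2', 't3', 't4']
```

LIFO keeps recently active threads hot in cache, which is why the fix combines both policies rather than switching to pure FIFO.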
Sysbench TPS with FLIFO
IO efficiency
High priority problems for me are:
▪  Reducing IOPs used for my workload
▪  Supporting very large databases

Significant improvements:
▪  Switch from mysqldump to XtraBackup
▪  Run innosim to confirm storage performance
▪  Tune InnoDB
▪  Improve schemas and queries
mysqldump vs XtraBackup
▪  mysqldump is slower for backup
▪  Clustered index is scanned row-at-a-time in key order (lots of random reads)
▪  Backup accounts for half of the disk reads for servers I watch

▪  Single-table restore is easy with mysqldump


▪  Possible with XtraBackup thanks to work by Vamsi from Facebook

▪  Incremental backup
▪  Not possible with mysqldump
▪  XtraBackup has incremental (scan all data, write only the changed blocks)
▪  Vamsi from Facebook added support for really incremental, scan & write only
the changed blocks
innosim storage benchmark
▪  InnoDB IO simulator that models
▪  Doublewrite buffer
▪  Dirty page writes
▪  Transaction log and binlog fsync and IO
▪  User transactions that do read, write and commit

▪  Search for “facebook innosim”
▪  Source code is on Launchpad
Tune InnoDB
▪  It is not easy to support many concurrent disk reads
▪  Innodb_thread_concurrency tickets not released when waiting for a read
▪  If innodb_thread_concurrency is too high then writers suffer
▪  If innodb_thread_concurrency is too low then readers suffer

▪  Smaller pages are better for some but not all tables
▪  A large log file can reduce the dirty page flush rate
▪  A large buffer pool can reduce the page read rate
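A hypothetical my.cnf fragment naming the knobs discussed above; the values are illustrative examples, not recommendations, and must be tuned per workload.

```ini
# Example values only; tune for your hardware and workload.
innodb_thread_concurrency = 32     # too high hurts writers, too low hurts readers
innodb_log_file_size      = 1900M  # larger logs reduce the dirty page flush rate
innodb_buffer_pool_size   = 100G   # larger pool reduces the page read rate
innodb_io_capacity        = 1000   # also bounds insert buffer drain (5% of this)
```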
IOPs is a function of size and concurrency
Smaller pages aren’t always better
Checkpoint IO rate by log file size
Page read rate by buffer pool size
Improve schemas
▪  Make your performance critical queries index only
▪  Primary key columns are included in the secondary index
▪  Understand how the insert buffer makes index maintenance cheaper

▪  Figure out how to do schema changes with minimal downtime
▪  We used the Online Schema Change tool (thanks Vamsi)
▪  You can also do the schema change on a slave first and then promote it
Monitoring
▪  Per table, index, account via information_schema tables
▪  Efficient and always enabled
▪  Easy to use

▪  Enhanced slow query log
▪  Facebook patch added options to do sampling for the slow query log
▪  Sample from all queries and from all queries that have an error
▪  Error is limited to errno; error text must wait for the 5.5 plugin
▪  Aggregate by query text and URL from the query comment
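Rate-based sampling like that described above can be sketched in a few lines; this is an illustration of the idea, not the Facebook patch's implementation.

```python
# Hypothetical sketch: log every Nth query instead of only queries over a time
# threshold, while always logging queries that return an error.
import itertools

def make_sampler(rate):
    counter = itertools.count()
    def should_log(query, has_error=False):
        # Errors are always logged; other queries are sampled 1-in-rate.
        return has_error or next(counter) % rate == 0
    return should_log

should_log = make_sampler(rate=100)
logged = sum(should_log("SELECT 1") for _ in range(1000))
print(logged)  # 10 of 1000 queries sampled
```

Sampling keeps logging overhead fixed and predictable at any QPS, which matters when peak load is in the millions of queries per second.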
Open Problems
▪  Parallel replication apply
▪  Support max concurrent queries
▪  Automate slave failover when a master fails
▪  Use InnoDB compression for OLTP
▪  Multi-master replication with conflict resolution
Parallel replication apply
▪  Replication apply is single-threaded. This causes lag on IO-bound slaves even when the SQL is simple
▪  mk-slave-prefetch can help but something better is needed
▪  Is a thread running BEGIN; replay-slave-sql; ROLLBACK better?
▪  I want:
▪  N replay queues
▪  Binlog events (SBR or RBR) hashed to queues by database names
▪  Each queue replayed in parallel
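The queueing scheme above can be sketched as follows; the names and queue count are illustrative, not an existing MySQL feature.

```python
# Sketch: binlog events hashed to one of N replay queues by database name, so
# events for one database replay in order while databases replay in parallel.
from collections import deque
from zlib import crc32

N_QUEUES = 4
queues = [deque() for _ in range(N_QUEUES)]

def enqueue(event_db, event_sql):
    # Same database always maps to the same queue, preserving per-db order.
    q = crc32(event_db.encode()) % N_QUEUES
    queues[q].append(event_sql)
    return q

q1 = enqueue("users", "UPDATE ...")
q2 = enqueue("users", "INSERT ...")
print(q1 == q2)  # events for one database land on one queue
```

Hashing by database name is coarse: one hot database still replays serially, but it avoids the cross-queue ordering problems of finer-grained schemes.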
Max concurrent queries
▪  Use large values for max concurrent connections per account
▪  Enforce smaller values for max concurrent queries
▪  We have begun testing an implementation.
▪  Enforce at statement entry
▪  Account for threads that block (row lock, disk IO, network IO)
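Admission control at statement entry can be sketched with a semaphore per account; this is an illustration of the idea, not the implementation being tested.

```python
# Illustrative sketch: cap concurrently running queries per account at
# statement entry, while allowing many open connections per account.
import threading
from collections import defaultdict

MAX_CONCURRENT_QUERIES = 2
_slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES))

def run_query(account, query_fn):
    sem = _slots[account]
    if not sem.acquire(blocking=False):
        # Reject rather than queue: one misbehaving app cannot pile up threads.
        raise RuntimeError("too many concurrent queries for " + account)
    try:
        return query_fn()  # the thread may block here on row locks or IO
    finally:
        sem.release()

print(run_query("app1", lambda: "ok"))
```

Counting a query as running even while it blocks on row locks or IO is the hard part the slide alludes to: idle-but-blocked threads still hold a slot.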
Automate slave failover
▪  Global transaction IDs from the Google patch are awesome
▪  But I don’t have the skills to port or support it
▪  A unique ID per binlog group or event might be sufficient
▪  Add an attribute to binlog event metadata
▪  Preserve on the slave similar to server ID
InnoDB compression for OLTP
▪  Change InnoDB to not log page images for compressed pages
▪  Logging them increases the log IO rate
▪  Increasing the log IO rate then increases the checkpoint IO rate
▪  Change InnoDB to use QuickLZ instead of zlib for compression
▪  Add an option to limit compression to the PK index
▪  Add per-table compression statistics
MySQL in the datacenter
▪  Previously dominated the market
▪  Now it must learn to share
▪  PostgreSQL continues to improve for OLTP
▪  HBase, Cassandra, MongoDB are getting traction today
Why NoSQL
▪  Do less, but do it better
▪  Some offer write-optimized data stores
▪  Some don’t require sharding
▪  Interesting HA models
▪  Cassandra doesn’t have the notion of failover
▪  HBase doesn’t require failover when a server dies

▪  Healthy development communities improve code quickly
What comes next
▪  Batch extraction is not the answer for MySQL/NoSQL integration
▪  NoSQL deployments will be reminded that
▪  Some of your problems are independent of technology
▪  You need better monitoring
▪  There is downtime when you need to modify the clustered index
▪  Database ops is hard with legacy apps and multi-user deployments
▪  In a few years someone will document the many stalls in HBase
(c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
