1Anticipatory IO scheduler
   3Nick Piggin <>    13 Sep 2003
   5Attention! Database servers, especially those using "TCQ" disks should
   6investigate performance with the 'deadline' IO scheduler. Any system with high
   7disk performance requirements should do so, in fact.
   9If you see unusual performance characteristics of your disk systems, or you
  10see big performance regressions versus the deadline scheduler, please email
  11me. Database users don't bother unless you're willing to test a lot of patches
  12from me ;) its a known issue.
  14Also, users with hardware RAID controllers, doing striping, may find
  15highly variable performance results with using the as-iosched. The
  16as-iosched anticipatory implementation is based on the notion that a disk
  17device has only one physical seeking head.  A striped RAID controller
  18actually has a head for each physical device in the logical RAID device.
  20However, setting the antic_expire (see tunable parameters below) produces
  21very similar behavior to the deadline IO scheduler.
  23Selecting IO schedulers
  25Refer to Documentation/block/switching-sched.txt for information on
  26selecting an io scheduler on a per-device basis.
  28Anticipatory IO scheduler Policies
  30The as-iosched implementation implements several layers of policies
  31to determine when an IO request is dispatched to the disk controller.
  32Here are the policies outlined, in order of application.
  341. one-way Elevator algorithm.
  36The elevator algorithm is similar to that used in deadline scheduler, with
  37the addition that it allows limited backward movement of the elevator
  38(i.e. seeks backwards).  A seek backwards can occur when choosing between
  39two IO requests where one is behind the elevator's current position, and
  40the other is in front of the elevator's position. If the seek distance to
  41the request in back of the elevator is less than half the seek distance to
  42the request in front of the elevator, then the request in back can be chosen.
  43Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
  44This favors forward movement of the elevator, while allowing opportunistic
  45"short" backward seeks.
  472. FIFO expiration times for reads and for writes.
  49This is again very similar to the deadline IO scheduler.  The expiration
  50times for requests on these lists is tunable using the parameters read_expire
  51and write_expire discussed below.  When a read or a write expires in this way,
  52the IO scheduler will interrupt its current elevator sweep or read anticipation
  53to service the expired request.
  553. Read and write request batching
  57A batch is a collection of read requests or a collection of write
  58requests.  The as scheduler alternates dispatching read and write batches
  59to the driver.  In the case a read batch, the scheduler submits read
  60requests to the driver as long as there are read requests to submit, and
  61the read batch time limit has not been exceeded (read_batch_expire).
  62The read batch time limit begins counting down only when there are
  63competing write requests pending.
  65In the case of a write batch, the scheduler submits write requests to
  66the driver as long as there are write requests available, and the
  67write batch time limit has not been exceeded (write_batch_expire).
  68However, the length of write batches will be gradually shortened
  69when read batches frequently exceed their time limit.
  71When changing between batch types, the scheduler waits for all requests
  72from the previous batch to complete before scheduling requests for the
  73next batch.
  75The read and write fifo expiration times described in policy 2 above
  76are checked only when in scheduling IO of a batch for the corresponding
  77(read/write) type.  So for example, the read FIFO timeout values are
  78tested only during read batches.  Likewise, the write FIFO timeout
  79values are tested only during write batches.  For this reason,
  80it is generally not recommended for the read batch time
  81to be longer than the write expiration time, nor for the write batch
  82time to exceed the read expiration time (see tunable parameters below).
  84When the IO scheduler changes from a read to a write batch,
  85it begins the elevator from the request that is on the head of the
  86write expiration FIFO.  Likewise, when changing from a write batch to
  87a read batch, scheduler begins the elevator from the first entry
  88on the read expiration FIFO.
  904. Read anticipation.
  92Read anticipation occurs only when scheduling a read batch.
  93This implementation of read anticipation allows only one read request
  94to be dispatched to the disk controller at a time.  In
  95contrast, many write requests may be dispatched to the disk controller
  96at a time during a write batch.  It is this characteristic that can make
  97the anticipatory scheduler perform anomalously with controllers supporting
  98TCQ, or with hardware striped RAID devices. Setting the antic_expire
  99queue parameter (see below) to zero disables this behavior, and the 
 100anticipatory scheduler behaves essentially like the deadline scheduler.
 102When read anticipation is enabled (antic_expire is not zero), reads
 103are dispatched to the disk controller one at a time.
 104At the end of each read request, the IO scheduler examines its next
 105candidate read request from its sorted read list.  If that next request
 106is from the same process as the request that just completed,
 107or if the next request in the queue is "very close" to the
 108just completed request, it is dispatched immediately.  Otherwise,
 109statistics (average think time, average seek distance) on the process
 110that submitted the just completed request are examined.  If it seems
 111likely that that process will submit another request soon, and that
 112request is likely to be near the just completed request, then the IO
 113scheduler will stop dispatching more read requests for up to (antic_expire)
 114milliseconds, hoping that process will submit a new request near the one
 115that just completed.  If such a request is made, then it is dispatched
 116immediately.  If the antic_expire wait time expires, then the IO scheduler
 117will dispatch the next read request from the sorted read queue.
 119To decide whether an anticipatory wait is worthwhile, the scheduler
 120maintains statistics for each process that can be used to compute
 121mean "think time" (the time between read requests), and mean seek
 122distance for that process.  One observation is that these statistics
 123are associated with each process, but those statistics are not associated
 124with a specific IO device.  So for example, if a process is doing IO
 125on several file systems on separate devices, the statistics will be
 126a combination of IO behavior from all those devices.
 129Tuning the anticipatory IO scheduler
 131When using 'as', the anticipatory IO scheduler there are 5 parameters under
 132/sys/block/*/queue/iosched/. All are units of milliseconds.
 134The parameters are:
 135* read_expire
 136    Controls how long until a read request becomes "expired". It also controls the
 137    interval between which expired requests are served, so set to 50, a request
 138    might take anywhere < 100ms to be serviced _if_ it is the next on the
 139    expired list. Obviously request expiration strategies won't make the disk
 140    go faster. The result basically equates to the timeslice a single reader
 141    gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
 142    very roughly the % streaming read efficiency your disk should get with
 143    multiple readers.
 145* read_batch_expire
 146    Controls how much time a batch of reads is given before pending writes are
 147    served. A higher value is more efficient. This might be set below read_expire
 148    if writes are to be given higher priority than reads, but reads are to be
 149    as efficient as possible when there are no writes. Generally though, it
 150    should be some multiple of read_expire.
 152* write_expire, and
 153* write_batch_expire are equivalent to the above, for writes.
 155* antic_expire
 156    Controls the maximum amount of time we can anticipate a good read (one
 157    with a short seek distance from the most recently completed request) before
 158    giving up. Many other factors may cause anticipation to be stopped early,
 159    or some processes will not be "anticipated" at all. Should be a bit higher
 160    for big seek time devices though not a linear correspondence - most
 161    processes have only a few ms thinktime.
 163In addition to the tunables above there is a read-only file named est_time
 164which, when read, will show:
 166    - The probability of a task exiting without a cooperating task
 167      submitting an anticipated IO.
 169    - The current mean think time.
 171    - The seek distance used to determine if an incoming IO is better.