I/O Barriers

Tejun Heo <>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, the block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer, see the following random notes section.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, the block layer ensures that the requests
preceding a barrier request finish before issuing the barrier request.
Also, it defers requests following the barrier until the barrier
request is finished.  Older SCSI controllers/drives and SATA drives
fall in this category.

iii. Devices which have queue depth of 1.  This is a degenerate case
of ii.  Just keeping issue order suffices.  Ancient SCSI
controllers/drives and IDE drives are in this category.
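
As a rough illustration of case ii above, the draining rule can be
sketched in plain userspace C.  The struct and function names
(drain_state, drain_may_issue) are made up for this example and do not
exist in the block layer; the sketch only models the decision of
whether the next request may be issued right now.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical dispatcher state; not a kernel structure. */
struct drain_state {
        int  in_flight;          /* requests issued but not yet completed */
        bool barrier_in_flight;  /* true while the barrier is outstanding */
};

/* Return true if the next request may be issued to the device now. */
bool drain_may_issue(const struct drain_state *s, bool next_is_barrier)
{
        assert(s->in_flight >= 0);

        if (s->barrier_in_flight)
                return false;             /* defer requests behind the barrier */
        if (next_is_barrier)
                return s->in_flight == 0; /* wait for preceding requests */
        return true;                      /* ordinary request, nothing pending */
}
```

With a queue depth of 1 (case iii), in_flight is always 0 when the next
request is picked, so the same check degenerates to keeping issue
order.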

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other event abruptly stops the drive from operating and possibly makes
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases:

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  Devices of this kind can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.
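
The four cases above boil down to a small decision, sketched here in
userspace C.  The flush_plan struct and plan_flushes() are hypothetical
names for this illustration only, not block layer interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of what a barrier needs; not a kernel type. */
struct flush_plan {
        bool supported;    /* can the device honour barriers at all? */
        bool preflush;     /* flush the cache before the barrier write */
        bool postflush;    /* flush the cache after the barrier write */
        bool fua_barrier;  /* issue the barrier itself as a FUA write */
};

struct flush_plan plan_flushes(bool wb_cache, bool has_flush, bool has_fua)
{
        struct flush_plan p = { true, false, false, false };

        if (!wb_cache)
                return p;              /* case i: ordering alone is enough */
        if (!has_flush) {
                p.supported = false;   /* case ii: commit order can't be forced */
                return p;
        }
        p.preflush = true;             /* cases iii and iv both need this */
        if (has_fua)
                p.fua_barrier = true;  /* case iv: FUA replaces the postflush */
        else
                p.postflush = true;    /* case iii: second flush required */
        return p;
}

/* Two of the cases from the text, written as checks. */
void plan_flushes_examples(void)
{
        assert(!plan_flushes(true, false, false).supported);
        assert(!plan_flushes(true, true, true).postflush);
}
```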


How to support barrier requests in drivers

All barrier handling is done inside block layer proper.  All a low
level driver has to do is implement its prepare_flush_fn and use the
following function to indicate what barrier type it supports and how
to prepare flush requests.  Note that the term 'ordered' is used to
indicate the whole sequence of performing barrier requests including
draining and flushing.

typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);

int blk_queue_ordered(struct request_queue *q, unsigned ordered,
                      prepare_flush_fn *prepare_flush_fn);

@q                      : the queue in question
@ordered                : the ordered mode the driver/device supports
@prepare_flush_fn       : this function should prepare @rq such that it
                          flushes cache to physical medium when executed

For example, SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
        /* start from an all-zero command block */
        memset(rq->cmd, 0, sizeof(rq->cmd));
        rq->cmd_type = REQ_TYPE_BLOCK_PC;
        rq->timeout = SD_TIMEOUT;
        /* ask the drive to write its cache out to the medium */
        rq->cmd[0] = SYNCHRONIZE_CACHE;
        rq->cmd_len = 10;
}

The following seven ordered modes are supported.  The following table
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

            write-back cache    ordered tag     flush           FUA
-------------------------------------------------------------------
NONE            yes/no          N/A             no              N/A
DRAIN           no              no              N/A             N/A
DRAIN_FLUSH     yes             no              yes             no
DRAIN_FUA       yes             no              yes             yes
TAG             no              yes             N/A             N/A
TAG_FLUSH       yes             yes             yes             no
TAG_FUA         yes             yes             yes             yes
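
Read as code, the table could look something like the sketch below.
This is userspace C for illustration; the enum and pick_ordered_mode()
are hypothetical, and the enum values are not the kernel's
QUEUE_ORDERED_* constants.  One assumption to note: the sketch returns
NONE only when a write-back cache cannot be flushed, although a driver
may of course also choose NONE because it doesn't need barriers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative names only; not the kernel's QUEUE_ORDERED_* values. */
enum ordered_mode {
        MODE_NONE, MODE_DRAIN, MODE_DRAIN_FLUSH, MODE_DRAIN_FUA,
        MODE_TAG, MODE_TAG_FLUSH, MODE_TAG_FUA
};

enum ordered_mode pick_ordered_mode(bool wb_cache, bool ordered_tag,
                                    bool has_flush, bool has_fua)
{
        if (!wb_cache)          /* flush and FUA columns are N/A */
                return ordered_tag ? MODE_TAG : MODE_DRAIN;
        if (!has_flush)         /* cached data can't be committed in order */
                return MODE_NONE;
        if (has_fua)
                return ordered_tag ? MODE_TAG_FUA : MODE_DRAIN_FUA;
        return ordered_tag ? MODE_TAG_FLUSH : MODE_DRAIN_FLUSH;
}

/* A few rows of the table, written as checks. */
void pick_ordered_mode_examples(void)
{
        assert(pick_ordered_mode(false, false, false, false) == MODE_DRAIN);
        assert(pick_ordered_mode(true, false, true, false) == MODE_DRAIN_FLUSH);
        assert(pick_ordered_mode(true, true, true, true) == MODE_TAG_FUA);
}
```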

QUEUE_ORDERED_NONE
        I/O barriers are not needed and/or supported.

        Sequence: N/A

QUEUE_ORDERED_DRAIN
        Requests are ordered by draining the request queue and cache
        flushing isn't needed.

        Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
        Requests are ordered by draining the request queue and both
        pre-barrier and post-barrier cache flushes are needed.

        Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
        Requests are ordered by draining the request queue and a
        pre-barrier cache flush is needed.  By using FUA on the
        barrier request, post-barrier flushing can be skipped.

        Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
        Requests are ordered by ordered tag and cache flushing isn't
        needed.

        Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
        Requests are ordered by ordered tag and both pre-barrier and
        post-barrier cache flushes are needed.

        Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
        Requests are ordered by ordered tag and a pre-barrier cache
        flush is needed.  By using FUA on the barrier request,
        post-barrier flushing can be skipped.

        Sequence: preflush -> barrier
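
The sequences listed above follow a simple pattern, which the
following userspace C sketch tries to capture.  All names here
(seq_mode, seq_step, ordered_sequence) are invented for the example and
are not kernel identifiers.  The sketch only produces the list of
steps; whether each step waits for the previous one to complete ('=>')
or merely to be issued ('->') depends on whether the mode is a DRAIN or
a TAG mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative names only; not kernel constants. */
enum seq_mode {
        SEQ_NONE, SEQ_DRAIN, SEQ_DRAIN_FLUSH, SEQ_DRAIN_FUA,
        SEQ_TAG, SEQ_TAG_FLUSH, SEQ_TAG_FUA
};

enum seq_step { STEP_DRAIN, STEP_PREFLUSH, STEP_BARRIER, STEP_POSTFLUSH };

#define SEQ_MAX_STEPS 4

/* Fill steps[] with the sequence for mode; return the number of steps. */
size_t ordered_sequence(enum seq_mode mode, enum seq_step steps[SEQ_MAX_STEPS])
{
        size_t n = 0;
        bool drain, fua, flush;

        assert(steps != NULL);
        if (mode == SEQ_NONE)
                return 0;       /* no barrier sequence at all */

        drain = mode == SEQ_DRAIN || mode == SEQ_DRAIN_FLUSH ||
                mode == SEQ_DRAIN_FUA;
        fua = mode == SEQ_DRAIN_FUA || mode == SEQ_TAG_FUA;
        /* every *_FLUSH and *_FUA mode needs the pre-barrier flush */
        flush = fua || mode == SEQ_DRAIN_FLUSH || mode == SEQ_TAG_FLUSH;

        if (drain)
                steps[n++] = STEP_DRAIN;
        if (flush)
                steps[n++] = STEP_PREFLUSH;
        steps[n++] = STEP_BARRIER;
        if (flush && !fua)      /* FUA on the barrier replaces the postflush */
                steps[n++] = STEP_POSTFLUSH;
        return n;
}
```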


Random notes/caveats

* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that the SCSI
midlayer request dispatch function is not atomic.  It releases the
queue lock and switches to the SCSI host lock during issue, and it's
possible, and likely to happen in time, that requests change their
relative positions.  Once this problem is solved, TAG ordering can be
enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by the
block layer until the previous I/O barrier is complete.  This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to the low level driver *might* be helpful if they are very frequent.
Well, this certainly is a non-issue.  I'm writing this just to make
clear that no two I/O barriers are ever passed to the low-level driver
at the same time.

* Completion order.  Requests in an ordered sequence are issued in
order but are not required to finish in order.  The barrier
implementation can handle out-of-order completion of an ordered
sequence.  IOW, the requests MUST be processed in order but the
hardware/software completion paths are allowed to reorder completion
notifications - e.g. the current SCSI midlayer doesn't preserve
completion order during error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after they have removed it from the request queue with
blkdev_dequeue_request().  As the barrier sequence should be kept in
order when requeued, generic elevator code takes care of putting
requests in order around the barrier.  See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence.  Currently, no
error checking is done against this.

* Error handling.  Currently, the block layer will report an error to
the upper layer if any of the requests in an ordered sequence fails.
Unfortunately, this doesn't seem to be enough.  Look at the following
request flow.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
                                          still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid.  Consider the following sequence.

 i.     Requests [0] ~ [post] leave the request queue and enter the
        low-level driver.
 ii.    After a while, unfortunately, something goes wrong and the
        drive fails [2].  Note that any of [0], [1] and [3] could have
        completed by this time, but [pre] couldn't have been finished
        as the drive must process it in order and it failed before
        processing that command.
 iii.   Error handling kicks in and determines that the error is
        unrecoverable and fails [2], and resumes operation.
 iv.    [pre] [barrier] [post] get processed.
 v.     *BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on the success of some of the preceding
requests, where only the upper layer (filesystem) knows what 'some'
is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after the
block layer tells them to do so.

As the probability of this happening is very low and the drive would
have to be faulty, implementing the fix is probably overkill.  But,
still, it's there.

* In previous drafts of the barrier implementation, there was a
fallback mechanism such that, if FUA or ordered TAG fails, a less
fancy ordered mode can be selected and the failed barrier request is
retried automatically.  The rationale for this feature was that, as
FUA is pretty new in the ATA world and ordered tag was never used
widely, there could be devices which claim to support those features
but choke when actually given such requests.

This was removed for two reasons: 1. it's overkill; 2. it's impossible
to implement properly when TAG ordering is used, as low level drivers
resume after an error automatically.  If it's ever needed, adding it
back and modifying low level drivers accordingly shouldn't be
difficult.
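
Going back to the completion order note above, one way to picture
"issued in order, completed in any order" is a small bitmask tracker,
sketched below in userspace C.  This is not the block layer's actual
bookkeeping; seq_tracker, seq_complete_step() and the DONE_* bits are
hypothetical names chosen for the example.  The sequence counts as
finished once every required step has reported completion, whatever
the order, and the first error seen is kept so it can be reported to
the upper layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical step bits; illustrative, not kernel constants. */
enum {
        DONE_PREFLUSH  = 1u << 0,
        DONE_BARRIER   = 1u << 1,
        DONE_POSTFLUSH = 1u << 2
};

struct seq_tracker {
        unsigned required;   /* steps this ordered sequence consists of */
        unsigned completed;  /* steps whose completion has been reported */
        int      error;      /* first error seen, 0 if none */
};

/* Record one completion; return true once every required step is done. */
bool seq_complete_step(struct seq_tracker *t, unsigned step, int error)
{
        assert((t->required & step) == step);  /* step belongs to sequence */
        assert((t->completed & step) == 0);    /* and hasn't completed yet */

        if (error && !t->error)
                t->error = error;              /* remember the first failure */
        t->completed |= step;
        return t->completed == t->required;
}
```

For a TAG_FLUSH style sequence, required would be all three bits; the
postflush completion may be reported before the barrier's and the
tracker still only signals the end once all three have arrived.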