+I/O Barriers
+============
+Tejun Heo <htejun@gmail.com>, July 22 2005
+
+I/O barrier requests are used to guarantee ordering around the barrier
+requests. Unless you're crazy enough to use disk drives for
+implementing synchronization constructs (wow, sounds interesting...),
+the ordering is meaningful only for write requests for things like
+journal checkpoints. All requests queued before a barrier request
+must be finished (made it to the physical medium) before the barrier
+request is started, and all requests queued after the barrier request
+must be started only after the barrier request is finished (again,
+made it to the physical medium).
+
+In other words, I/O barrier requests have the following two properties.
+
+1. Request ordering
+
+Requests cannot pass the barrier request. Preceding requests are
+processed before the barrier and following requests after.
+
+Depending on what features a drive supports, this can be done in one
+of the following three ways.
+
+i. For devices which have queue depth greater than 1 (TCQ devices) and
+support ordered tags, block layer can just issue the barrier as an
+ordered request and the lower level driver, controller and drive
+itself are responsible for making sure that the ordering constraint is
+met. Most modern SCSI controllers/drives should support this.
+
+NOTE: SCSI ordered tag isn't currently used due to limitation in the
+      SCSI midlayer, see the following random notes section.
+
+ii. For devices which have queue depth greater than 1 but don't
+support ordered tags, block layer ensures that the requests preceding
+a barrier request finish before issuing the barrier request. Also,
+it defers requests following the barrier until the barrier request is
+finished. Older SCSI controllers/drives and SATA drives fall in this
+category.
+
+iii. Devices which have queue depth of 1. This is a degenerate case
+of ii. Just keeping issue order suffices. Ancient SCSI
+controllers/drives and IDE drives are in this category.
+
+2. Forced flushing to physical medium
+
+Again, if you're not gonna do synchronization with disk drives (dang,
+it sounds even more appealing now!), the reason you use I/O barriers
+is mainly to protect filesystem integrity when power failure or some
+other events abruptly stop the drive from operating and possibly make
+the drive lose data in its cache. So, I/O barriers need to guarantee
+that requests actually get written to non-volatile medium in order.
+
+There are four cases,
+
+i. No write-back cache. Keeping requests ordered is enough.
+
+ii. Write-back cache but no flush operation. There's no way to
+guarantee physical-medium commit order. This kind of device can't do
+I/O barriers.
+
+iii. Write-back cache and flush operation but no FUA (forced unit
+access). We need two cache flushes - before and after the barrier
+request.
+
+iv. Write-back cache, flush operation and FUA. We still need one
+flush to make sure requests preceding a barrier are written to medium,
+but post-barrier flush can be avoided by using FUA write on the
+barrier itself.
+
+
+How to support barrier requests in drivers
+------------------------------------------
+
+All barrier handling is done inside block layer proper. All a low level
+driver has to do is implement its prepare_flush_fn and use one of the
+following two functions to indicate what barrier type it supports and
+how to prepare flush requests. Note that the term 'ordered' is used to
+indicate the whole sequence of performing barrier requests including
+draining and flushing.
+
+typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);
+
+int blk_queue_ordered(request_queue_t *q, unsigned ordered,
+                      prepare_flush_fn *prepare_flush_fn,
+                      unsigned gfp_mask);
+
+int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
+                             prepare_flush_fn *prepare_flush_fn,
+                             unsigned gfp_mask);
+
+The only difference between the two functions is whether or not the
+caller is holding q->queue_lock on entry. The latter expects the
+caller to be holding the lock.
+
+@q                 : the queue in question
+@ordered           : the ordered mode the driver/device supports
+@prepare_flush_fn  : this function should prepare @rq such that it
+                     flushes cache to physical medium when executed
+@gfp_mask          : gfp_mask used when allocating data structures
+                     for ordered processing
+
+For example, SCSI disk driver's prepare_flush_fn looks like the
+following.
+
+static void sd_prepare_flush(request_queue_t *q, struct request *rq)
+{
+	memset(rq->cmd, 0, sizeof(rq->cmd));
+	rq->flags |= REQ_BLOCK_PC;
+	rq->timeout = SD_TIMEOUT;
+	rq->cmd[0] = SYNCHRONIZE_CACHE;
+}
+
+The following seven ordered modes are supported. The following table
+shows which mode should be used depending on what features a
+device/driver supports. In the leftmost column of the table,
+QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
+
+The table is followed by description of each mode. Note that in the
+descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
+used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
+preceding step must be complete before proceeding to the next step.
+'->' indicates that the next step can start as soon as the previous
+step is issued.
+
+              write-back cache    ordered tag    flush    FUA
+-----------------------------------------------------------------------
+NONE          yes/no              N/A            no       N/A
+DRAIN         no                  no             N/A      N/A
+DRAIN_FLUSH   yes                 no             yes      no
+DRAIN_FUA     yes                 no             yes      yes
+TAG           no                  yes            N/A      N/A
+TAG_FLUSH     yes                 yes            yes      no
+TAG_FUA       yes                 yes            yes      yes
+
+
+QUEUE_ORDERED_NONE
+	I/O barriers are not needed and/or supported.
+
+	Sequence: N/A
+
+QUEUE_ORDERED_DRAIN
+	Requests are ordered by draining the request queue and cache
+	flushing isn't needed.
+
+	Sequence: drain => barrier
+
+QUEUE_ORDERED_DRAIN_FLUSH
+	Requests are ordered by draining the request queue and both
+	pre-barrier and post-barrier cache flushings are needed.
+
+	Sequence: drain => preflush => barrier => postflush
+
+QUEUE_ORDERED_DRAIN_FUA
+	Requests are ordered by draining the request queue and
+	pre-barrier cache flushing is needed. By using FUA on barrier
+	request, post-barrier flushing can be skipped.
+
+	Sequence: drain => preflush => barrier
+
+QUEUE_ORDERED_TAG
+	Requests are ordered by ordered tag and cache flushing isn't
+	needed.
+
+	Sequence: barrier
+
+QUEUE_ORDERED_TAG_FLUSH
+	Requests are ordered by ordered tag and both pre-barrier and
+	post-barrier cache flushings are needed.
+
+	Sequence: preflush -> barrier -> postflush
+
+QUEUE_ORDERED_TAG_FUA
+	Requests are ordered by ordered tag and pre-barrier cache
+	flushing is needed. By using FUA on barrier request,
+	post-barrier flushing can be skipped.
+
+	Sequence: preflush -> barrier
+
+
+Random notes/caveats
+--------------------
+
+* SCSI layer currently can't use TAG ordering even if the drive,
+controller and driver support it. The problem is that SCSI midlayer
+request dispatch function is not atomic. It releases queue lock and
+switches to SCSI host lock during issue, and it's possible, and over
+time likely, that requests change their relative positions. Once
+this problem is solved, TAG ordering can be enabled.
+
+* Currently, no matter which ordered mode is used, there can be only
+one barrier request in progress. All I/O barriers are held off by
+block layer until the previous I/O barrier is complete. This doesn't
+make any difference for DRAIN ordered devices, but, for TAG ordered
+devices with very high command latency, passing multiple I/O barriers
+to low level *might* be helpful if they are very frequent. Well, this
+certainly is a non-issue. I'm writing this just to make clear that no
+two I/O barriers are ever passed to low-level driver at the same time.
+
+* Completion order. Requests in ordered sequence are issued in order
+but not required to finish in order. Barrier implementation can
+handle out-of-order completion of ordered sequence. IOW, the requests
+MUST be processed in order but the hardware/software completion paths
+are allowed to reorder completion notifications - e.g. current SCSI
+midlayer doesn't preserve completion order during error handling.
+
+* Requeueing order. Low-level drivers are free to requeue any request
+after they removed it from the request queue with
+blkdev_dequeue_request(). As barrier sequence should be kept in order
+when requeued, generic elevator code takes care of putting requests in
+order around barrier. See blk_ordered_req_seq() and
+ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
+
+Note that block drivers must not requeue preceding requests while
+completing latter requests in an ordered sequence. Currently, no
+error checking is done against this.
+
+* Error handling. Currently, block layer will report error to upper
+layer if any of requests in an ordered sequence fails. Unfortunately,
+this doesn't seem to be enough. Look at the following request flow.
+QUEUE_ORDERED_TAG_FLUSH is in use.
+
+ [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
+					  still in elevator
+
+Let's say requests [2] and [3] are write requests to update file system
+metadata (journal or whatever) and [barrier] is used to mark that
+those updates are valid. Consider the following sequence.
+
+ i.	Requests [0] ~ [post] leave the request queue and enter
+	low-level driver.
+ ii.	After a while, unfortunately, something goes wrong and the
+	drive fails [2]. Note that any of [0], [1] and [3] could have
+	completed by this time, but [pre] couldn't have been finished
+	as the drive must process it in order and it failed before
+	processing that command.
+ iii.	Error handling kicks in and determines that the error is
+	unrecoverable and fails [2], and resumes operation.
+ iv.	[pre] [barrier] [post] get processed.
+ v.	*BOOM* power fails
+
+The problem here is that the barrier request is *supposed* to indicate
+that filesystem update requests [2] and [3] made it safely to the
+physical medium and, if the machine crashes after the barrier is
+written, filesystem recovery code can depend on that. Sadly, that
+isn't true in this case anymore. IOW, the success of an I/O barrier
+should also be dependent on success of some of the preceding requests,
+where only upper layer (filesystem) knows what 'some' is.
+
+This can be solved by implementing a way to tell the block layer which
+requests affect the success of the following barrier request and
+making lower level drivers resume operation on error only after
+block layer tells them to do so.
+
+As the probability of this happening is very low and the drive should
+be faulty, implementing the fix is probably overkill. But, still,
+it's there.
+
+* In previous drafts of barrier implementation, there was fallback
+mechanism such that, if FUA or ordered TAG fails, less fancy ordered
+mode can be selected and the failed barrier request is retried
+automatically. The rationale for this feature was that as FUA is
+pretty new in ATA world and ordered tag was never used widely, there
+could be devices which claim to support those features but choke when
+actually given such requests.
+
+ This was removed for two reasons: 1. it's overkill 2. it's
+impossible to implement properly when TAG ordering is used as low
+level drivers resume after an error automatically. If it's ever
+needed, adding it back and modifying low level drivers accordingly
+shouldn't be difficult.
+2 -2
block/elevator.c
 		strcpy(chosen_elevator, "anticipatory");
 
 	/*
-	 * If the given scheduler is not available, fall back to no-op.
+	 * If the given scheduler is not available, fall back to the default
 	 */
 	if ((e = elevator_find(chosen_elevator)))
 		elevator_put(e);
 	else
-		strcpy(chosen_elevator, "noop");
+		strcpy(chosen_elevator, CONFIG_DEFAULT_IOSCHED);
 }
 
 static int __init elevator_setup(char *str)