CFQ (Complete Fairness Queueing)
================================

The main aim of the CFQ scheduler is to provide a fair allocation of the
disk I/O bandwidth for all the processes which request an I/O operation.

CFQ maintains a per-process queue for the processes which request I/O
operations (synchronous requests). In case of asynchronous requests, all
the requests from all the processes are batched together according to their
process's I/O priority.

CFQ ioscheduler tunables
========================

slice_idle
----------
This specifies how long CFQ should idle for the next request on certain cfq
queues (for sequential workloads) and service trees (for random workloads)
before the queue is expired and CFQ selects the next queue to dispatch from.

By default slice_idle is a non-zero value. That means by default we idle on
queues/service trees. This can be very helpful on highly seeky media like
single spindle SATA/SAS disks where we can cut down on the overall number
of seeks and see improved throughput.

Setting slice_idle to 0 will remove all the idling on queues/service tree
level and one should see an overall improved throughput on faster storage
devices like multiple SATA/SAS disks in a hardware RAID configuration. The
down side is that the isolation provided from WRITES also goes down and the
notion of IO priority becomes weaker.

So depending on storage and workload, it might be useful to set
slice_idle=0. In general, for SATA/SAS disks and software RAID of SATA/SAS
disks, keeping slice_idle enabled should be useful. For any configuration
where there are multiple spindles behind a single LUN (host based hardware
RAID controller or storage arrays), setting slice_idle=0 might end up in
better throughput and acceptable latencies.
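
As a concrete illustration, here is a minimal userspace sketch that sets
slice_idle to 0 through sysfs. This is not part of CFQ itself; the device
name "sda" is an assumption, and the iosched directory is present only
while CFQ is the active scheduler for that device.

	/* Set slice_idle to 0 for one device. Needs root privileges;
	 * adjust the device name for your system. */
	#include <stdio.h>

	int main(void)
	{
		const char *path = "/sys/block/sda/queue/iosched/slice_idle";
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return 1;
		}
		fprintf(f, "0\n");
		return fclose(f) ? 1 : 0;
	}

Reading the same file back shows the value currently in effect.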

back_seek_max
-------------
This specifies, given in Kbytes, the maximum "distance" for backward
seeking. The distance is the amount of space from the current head location
to the sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the
"backward" direction and consider them as being the "next" if they are
within this distance from the current head location.

back_seek_penalty
-----------------
This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty from a "front"
request, then the seeking cost of the two requests is considered
equivalent.

So the scheduler will not bias toward one or the other request (otherwise
the scheduler will bias toward the front request). The default value of
back_seek_penalty is 2. For example, with the default value, a request 1024
Kbytes behind the head is weighted the same as a request 2048 Kbytes in
front of it.

fifo_expire_async
-----------------
This parameter is used to set the timeout of asynchronous requests. The
default value of this is 248ms.

fifo_expire_sync
----------------
This parameter is used to set the timeout of synchronous requests. The
default value of this is 124ms. To favor synchronous requests over
asynchronous ones, this value should be decreased relative to
fifo_expire_async.

slice_async
-----------
This parameter is the same as slice_sync but for the asynchronous queue.
The default value is 40ms.

slice_async_rq
--------------
This parameter is used to limit the dispatching of asynchronous requests to
the device request queue in a queue's slice time. The maximum number of
requests that are allowed to be dispatched also depends upon the io
priority. The default value for this is 2.

slice_sync
----------
When a queue is selected for execution, the queue's IO requests are only
executed for a certain amount of time (time_slice) before switching to
another queue. This parameter is used to calculate the time slice of the
synchronous queue.

time_slice is computed using the below equation:
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
time_slice of the synchronous queue, increase the value of slice_sync. The
default value is 100ms.
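
As a sanity check of the formula above, the sketch below tabulates
time_slice for each priority level, assuming prio is the queue's ioprio
level (0 = highest, 7 = lowest) and the default slice_sync of 100ms. It is
an illustrative calculation, not CFQ's internal code.

	/* Tabulate time_slice per ioprio level using the formula from
	 * the slice_sync section. Illustrative only. */
	#include <stdio.h>

	int main(void)
	{
		int slice_sync = 100;	/* ms, the default */
		int prio;

		for (prio = 0; prio <= 7; prio++)
			printf("prio %d: time_slice = %d ms\n",
			       prio, slice_sync + slice_sync / 5 * (4 - prio));
		return 0;
	}

With the defaults this gives 180ms at prio 0 down to 40ms at prio 7, and
exactly slice_sync (100ms) at the default priority of 4.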

quantum
-------
This specifies the number of requests dispatched to the device queue. In a
queue's time slice, a request will not be dispatched if the number of
requests in the device exceeds this parameter. This parameter is used for
synchronous requests.

In case of storage with several disks, this setting can limit the parallel
processing of requests. Therefore, increasing the value can improve the
performance, although this can cause the latency of some I/O to increase
due to a larger number of outstanding requests.

CFQ IOPS Mode for group scheduling
==================================
Basic CFQ design is to provide priority based time slices. A higher
priority process gets a bigger time slice and a lower priority process gets
a smaller time slice. Measuring time becomes harder if storage is fast and
supports NCQ, in which case it would be better to dispatch multiple
requests from multiple cfq queues into the request queue at a time. In such
a scenario, it is not possible to measure the time consumed by a single
queue accurately.

What is possible though is to measure the number of requests dispatched
from a single queue and also allow dispatch from multiple cfq queues at the
same time. This effectively becomes fairness in terms of IOPS (IO
operations per second).

If one sets slice_idle=0 and the storage supports NCQ, CFQ internally
switches to IOPS mode and starts providing fairness in terms of the number
of requests dispatched. Note that this mode switching takes effect only for
group scheduling. For non-cgroup users nothing should change.

CFQ IO scheduler Idling Theory
==============================
Idling on a queue is primarily about waiting for the next request to come
on the same queue after completion of a request. In this process CFQ will
not dispatch requests from other cfq queues even if requests are pending
there.

The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read will come only after completion of the
previous one), then not dispatching requests from other queues should help,
as we did not move the disk head and kept dispatching sequential IO from
one queue.

CFQ has the following service trees, and various queues are put on these
trees.

	sync-idle	sync-noidle	async

All cfq queues doing synchronous sequential IO go on the sync-idle tree.
On this tree we idle on each queue individually.

All synchronous non-sequential queues go on the sync-noidle tree. Also any
requests which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole
group of queues or the tree. So if there are 4 queues waiting for IO to
dispatch, we will idle only once the last queue has dispatched its IO and
there is no more IO on this service tree.

All async writes go on the async service tree. There is no idling on async
queues.

CFQ has some optimizations for SSDs, and if it detects a non-rotational
medium which can support a higher queue depth (multiple requests in flight
at a time), then it cuts down on idling of individual queues; all the
queues move to the sync-noidle tree and only tree idle remains. This tree
idling provides isolation from buffered write queues on the async tree.

FAQ
===
Q1. Why idle at all on queues marked with REQ_NOIDLE?

A1. We only do tree idle (all queues on the sync-noidle tree) on queues
    marked with REQ_NOIDLE. This helps in providing isolation with all the
    sync-idle queues. Otherwise, in the presence of many sequential
    readers, other synchronous IO might not get a fair share of the disk.

    For example, say there are 10 sequential readers doing IO and they get
    100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
    roughly after 1 second. If after completion of the REQ_NOIDLE request
    we do not idle, and after a couple of milliseconds another REQ_NOIDLE
    request comes in, again it will be scheduled after 1 second. Repeat
    this and notice how a workload can lose its disk share and suffer due
    to multiple sequential readers.

    fsync can generate dependent IO where a bunch of data is written in the
    context of fsync, and later some journaling data is written. Journaling
    data comes in only after fsync has finished its IO (at least for ext4
    that seemed to be the case). Now if one decides not to idle on the
    fsync thread due to REQ_NOIDLE, then the next journaling write will not
    get scheduled for another second. A process doing small fsyncs will
    suffer badly in the presence of multiple sequential readers.

    Hence doing tree idling on threads using the REQ_NOIDLE flag on
    requests provides isolation from multiple sequential readers, and at
    the same time we do not idle on individual threads.

Q2. When to specify REQ_NOIDLE?
A2. I would think whenever one is doing a synchronous write and not
    expecting more writes to be dispatched from the same context soon, one
    should be able to specify REQ_NOIDLE on writes, and that probably
    should work well for most of the cases.
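
For reference, the tunables described in this document are exposed under
/sys/block/<dev>/queue/iosched/ while CFQ is the active scheduler for a
device. The sketch below is a minimal userspace program that dumps them for
one device; the device name "sda" is an assumption.

	/* Dump every CFQ tunable exposed for one device. The iosched
	 * directory exists only while CFQ is the active scheduler. */
	#include <stdio.h>
	#include <dirent.h>

	int main(void)
	{
		const char *dir = "/sys/block/sda/queue/iosched";
		char path[256], val[64];
		struct dirent *de;
		DIR *d = opendir(dir);

		if (!d) {
			perror(dir);
			return 1;
		}
		while ((de = readdir(d)) != NULL) {
			FILE *f;

			if (de->d_name[0] == '.')
				continue;
			snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
			f = fopen(path, "r");
			if (!f)
				continue;
			if (fgets(val, sizeof(val), f))
				printf("%-20s %s", de->d_name, val);
			fclose(f);
		}
		closedir(d);
		return 0;
	}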