linux/block/bfq-iosched.c
   1// SPDX-License-Identifier: GPL-2.0-or-later
   2/*
   3 * Budget Fair Queueing (BFQ) I/O scheduler.
   4 *
   5 * Based on ideas and code from CFQ:
   6 * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
   7 *
   8 * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
   9 *                    Paolo Valente <paolo.valente@unimore.it>
  10 *
  11 * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  12 *                    Arianna Avanzini <avanzini@google.com>
  13 *
  14 * Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org>
  15 *
  16 * BFQ is a proportional-share I/O scheduler, with some extra
  17 * low-latency capabilities. BFQ also supports full hierarchical
   18 * scheduling through cgroups. The next paragraphs provide an
   19 * introduction to BFQ's inner workings. Details on BFQ benefits, usage and
  20 * limitations can be found in Documentation/block/bfq-iosched.rst.
  21 *
  22 * BFQ is a proportional-share storage-I/O scheduling algorithm based
  23 * on the slice-by-slice service scheme of CFQ. But BFQ assigns
  24 * budgets, measured in number of sectors, to processes instead of
  25 * time slices. The device is not granted to the in-service process
  26 * for a given time slice, but until it has exhausted its assigned
  27 * budget. This change from the time to the service domain enables BFQ
  28 * to distribute the device throughput among processes as desired,
  29 * without any distortion due to throughput fluctuations, or to device
  30 * internal queueing. BFQ uses an ad hoc internal scheduler, called
  31 * B-WF2Q+, to schedule processes according to their budgets. More
  32 * precisely, BFQ schedules queues associated with processes. Each
  33 * process/queue is assigned a user-configurable weight, and B-WF2Q+
  34 * guarantees that each queue receives a fraction of the throughput
  35 * proportional to its weight. Thanks to the accurate policy of
  36 * B-WF2Q+, BFQ can afford to assign high budgets to I/O-bound
  37 * processes issuing sequential requests (to boost the throughput),
  38 * and yet guarantee a low latency to interactive and soft real-time
  39 * applications.
  40 *
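 * As a hedged illustration (the numbers are hypothetical, not taken
 * from the code): with three I/O-bound queues holding weights 100,
 * 100 and 200, B-WF2Q+ hands out roughly 100/400 = 25%, 25% and 50%
 * of the device throughput, respectively, regardless of whether the
 * device happens to deliver 100 MB/s or 500 MB/s at that moment,
 * because service is accounted in sectors rather than in time.
 *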
  41 * In particular, to provide these low-latency guarantees, BFQ
  42 * explicitly privileges the I/O of two classes of time-sensitive
  43 * applications: interactive and soft real-time. In more detail, BFQ
  44 * behaves this way if the low_latency parameter is set (default
  45 * configuration). This feature enables BFQ to provide applications in
  46 * these classes with a very low latency.
  47 *
  48 * To implement this feature, BFQ constantly tries to detect whether
  49 * the I/O requests in a bfq_queue come from an interactive or a soft
  50 * real-time application. For brevity, in these cases, the queue is
  51 * said to be interactive or soft real-time. In both cases, BFQ
  52 * privileges the service of the queue, over that of non-interactive
  53 * and non-soft-real-time queues. This privileging is performed,
   54 * mainly, by raising the weight of the queue. So, for brevity, we
   55 * simply call weight-raising periods the time periods during which a
   56 * queue is privileged because it is deemed interactive or soft real-time.
  57 *
  58 * The detection of soft real-time queues/applications is described in
  59 * detail in the comments on the function
  60 * bfq_bfqq_softrt_next_start. On the other hand, the detection of an
  61 * interactive queue works as follows: a queue is deemed interactive
   62 * if it is constantly non-empty only for a limited time interval,
   63 * after which it does become empty. The queue may be deemed
   64 * interactive again (for a limited time) if it becomes constantly
   65 * non-empty once more, provided that this happens only after the
   66 * queue has remained empty for a given minimum idle time.
  67 *
  68 * By default, BFQ computes automatically the above maximum time
  69 * interval, i.e., the time interval after which a constantly
  70 * non-empty queue stops being deemed interactive. Since a queue is
  71 * weight-raised while it is deemed interactive, this maximum time
  72 * interval happens to coincide with the (maximum) duration of the
  73 * weight-raising for interactive queues.
  74 *
   75 * Finally, BFQ also features additional heuristics for
   76 * preserving both a low latency and a high throughput on NCQ-capable,
   77 * rotational or flash-based devices, and for getting the job done
   78 * quickly for applications consisting of many I/O-bound processes.
  79 *
  80 * NOTE: if the main or only goal, with a given device, is to achieve
  81 * the maximum-possible throughput at all times, then do switch off
  82 * all low-latency heuristics for that device, by setting low_latency
  83 * to 0.
  84 *
   85 * BFQ is described in [1], where a reference to the initial, more
   86 * theoretical paper on BFQ can also be found. The interested reader
  87 * can find in the latter paper full details on the main algorithm, as
  88 * well as formulas of the guarantees and formal proofs of all the
  89 * properties.  With respect to the version of BFQ presented in these
  90 * papers, this implementation adds a few more heuristics, such as the
  91 * ones that guarantee a low latency to interactive and soft real-time
  92 * applications, and a hierarchical extension based on H-WF2Q+.
  93 *
  94 * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
  95 * H-WF2Q+, while the augmented tree used here to implement B-WF2Q+
  96 * with O(log N) complexity derives from the one introduced with EEVDF
  97 * in [3].
  98 *
  99 * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 100 *     Scheduler", Proceedings of the First Workshop on Mobile System
 101 *     Technologies (MST-2015), May 2015.
 102 *     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
 103 *
 104 * [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing
 105 *     Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689,
 106 *     Oct 1997.
 107 *
 108 * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
 109 *
 110 * [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
 111 *     First: A Flexible and Accurate Mechanism for Proportional Share
 112 *     Resource Allocation", technical report.
 113 *
 114 * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
 115 */
 116#include <linux/module.h>
 117#include <linux/slab.h>
 118#include <linux/blkdev.h>
 119#include <linux/cgroup.h>
 120#include <linux/ktime.h>
 121#include <linux/rbtree.h>
 122#include <linux/ioprio.h>
 123#include <linux/sbitmap.h>
 124#include <linux/delay.h>
 125#include <linux/backing-dev.h>
 126
 127#include <trace/events/block.h>
 128
 129#include "elevator.h"
 130#include "blk.h"
 131#include "blk-mq.h"
 132#include "blk-mq-sched.h"
 133#include "bfq-iosched.h"
 134#include "blk-wbt.h"
 135
 136#define BFQ_BFQQ_FNS(name)                                              \
 137void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)                       \
 138{                                                                       \
 139        __set_bit(BFQQF_##name, &(bfqq)->flags);                        \
 140}                                                                       \
 141void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)                      \
 142{                                                                       \
 143        __clear_bit(BFQQF_##name, &(bfqq)->flags);              \
 144}                                                                       \
 145int bfq_bfqq_##name(const struct bfq_queue *bfqq)                       \
 146{                                                                       \
 147        return test_bit(BFQQF_##name, &(bfqq)->flags);          \
 148}
 149
 150BFQ_BFQQ_FNS(just_created);
 151BFQ_BFQQ_FNS(busy);
 152BFQ_BFQQ_FNS(wait_request);
 153BFQ_BFQQ_FNS(non_blocking_wait_rq);
 154BFQ_BFQQ_FNS(fifo_expire);
 155BFQ_BFQQ_FNS(has_short_ttime);
 156BFQ_BFQQ_FNS(sync);
 157BFQ_BFQQ_FNS(IO_bound);
 158BFQ_BFQQ_FNS(in_large_burst);
 159BFQ_BFQQ_FNS(coop);
 160BFQ_BFQQ_FNS(split_coop);
 161BFQ_BFQQ_FNS(softrt_update);
  162#undef BFQ_BFQQ_FNS
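
/*
 * For reference, a sketch of what the generator above produces: the
 * invocation BFQ_BFQQ_FNS(busy) expands to roughly
 *
 *   void bfq_mark_bfqq_busy(struct bfq_queue *bfqq)
 *   {
 *           __set_bit(BFQQF_busy, &(bfqq)->flags);
 *   }
 *   void bfq_clear_bfqq_busy(struct bfq_queue *bfqq)
 *   {
 *           __clear_bit(BFQQF_busy, &(bfqq)->flags);
 *   }
 *   int bfq_bfqq_busy(const struct bfq_queue *bfqq)
 *   {
 *           return test_bit(BFQQF_busy, &(bfqq)->flags);
 *   }
 *
 * i.e., one non-atomic setter, one non-atomic clearer and one tester
 * per queue flag, all operating on bfqq->flags.
 */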
 163
 164/* Expiration time of async (0) and sync (1) requests, in ns. */
 165static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
 166
 167/* Maximum backwards seek (magic number lifted from CFQ), in KiB. */
 168static const int bfq_back_max = 16 * 1024;
 169
 170/* Penalty of a backwards seek, in number of sectors. */
 171static const int bfq_back_penalty = 2;
 172
 173/* Idling period duration, in ns. */
 174static u64 bfq_slice_idle = NSEC_PER_SEC / 125;
 175
 176/* Minimum number of assigned budgets for which stats are safe to compute. */
 177static const int bfq_stats_min_budgets = 194;
 178
 179/* Default maximum budget values, in sectors and number of requests. */
 180static const int bfq_default_max_budget = 16 * 1024;
 181
 182/*
 183 * When a sync request is dispatched, the queue that contains that
 184 * request, and all the ancestor entities of that queue, are charged
 185 * with the number of sectors of the request. In contrast, if the
 186 * request is async, then the queue and its ancestor entities are
 187 * charged with the number of sectors of the request, multiplied by
 188 * the factor below. This throttles the bandwidth for async I/O,
  189 * w.r.t. sync I/O, and it is done to counter the tendency of async
  190 * writes to steal I/O throughput from reads.
 191 *
 192 * The current value of this parameter is the result of a tuning with
 193 * several hardware and software configurations. We tried to find the
 194 * lowest value for which writes do not cause noticeable problems to
  195 * reads. In fact, the lower this parameter, the more stable the I/O
  196 * control, in the following respect. The lower this parameter is, the
  197 * less the bandwidth enjoyed by a group decreases
  198 * - when the group does writes, w.r.t. when it does reads;
  199 * - when other groups do reads, w.r.t. when they do writes.
 200 */
 201static const int bfq_async_charge_factor = 3;
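
/*
 * A quick worked example (hypothetical request sizes, baseline case
 * handled by bfq_serv_to_charge() below, i.e. no weight raising and a
 * symmetric scenario): a 512-sector sync read charges its queue and
 * ancestor entities 512 sectors, while a 512-sector async write
 * charges them 512 * 3 = 1536 sectors, so async writers drain their
 * budgets about three times faster for the same amount of data.
 */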
 202
 203/* Default timeout values, in jiffies, approximating CFQ defaults. */
 204const int bfq_timeout = HZ / 8;
 205
 206/*
 207 * Time limit for merging (see comments in bfq_setup_cooperator). Set
 208 * to the slowest value that, in our tests, proved to be effective in
 209 * removing false positives, while not causing true positives to miss
 210 * queue merging.
 211 *
 212 * As can be deduced from the low time limit below, queue merging, if
 213 * successful, happens at the very beginning of the I/O of the involved
 214 * cooperating processes, as a consequence of the arrival of the very
 215 * first requests from each cooperator.  After that, there is very
 216 * little chance to find cooperators.
 217 */
 218static const unsigned long bfq_merge_time_limit = HZ/10;
 219
 220static struct kmem_cache *bfq_pool;
 221
 222/* Below this threshold (in ns), we consider thinktime immediate. */
 223#define BFQ_MIN_TT              (2 * NSEC_PER_MSEC)
 224
 225/* hw_tag detection: parallel requests threshold and min samples needed. */
 226#define BFQ_HW_QUEUE_THRESHOLD  3
 227#define BFQ_HW_QUEUE_SAMPLES    32
 228
 229#define BFQQ_SEEK_THR           (sector_t)(8 * 100)
 230#define BFQQ_SECT_THR_NONROT    (sector_t)(2 * 32)
 231#define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
 232        (get_sdist(last_pos, rq) >                      \
 233         BFQQ_SEEK_THR &&                               \
 234         (!blk_queue_nonrot(bfqd->queue) ||             \
 235          blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
 236#define BFQQ_CLOSE_THR          (sector_t)(8 * 1024)
 237#define BFQQ_SEEKY(bfqq)        (hweight32(bfqq->seek_history) > 19)
 238/*
 239 * Sync random I/O is likely to be confused with soft real-time I/O,
 240 * because it is characterized by limited throughput and apparently
 241 * isochronous arrival pattern. To avoid false positives, queues
 242 * containing only random (seeky) I/O are prevented from being tagged
 243 * as soft real-time.
 244 */
 245#define BFQQ_TOTALLY_SEEKY(bfqq)        (bfqq->seek_history == -1)
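
/*
 * Illustrative note, assuming seek_history is used as a 32-bit shift
 * register in which a set bit records a seeky request (as the update
 * code elsewhere in this file does): BFQQ_SEEKY() holds when more
 * than 19 of the last 32 samples were seeky, i.e., when well over
 * half of the recent I/O was random, while BFQQ_TOTALLY_SEEKY() holds
 * only when all 32 bits are set (seek_history == -1), i.e., when none
 * of the recent requests looked sequential.
 */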
 246
 247/* Min number of samples required to perform peak-rate update */
 248#define BFQ_RATE_MIN_SAMPLES    32
 249/* Min observation time interval required to perform a peak-rate update (ns) */
 250#define BFQ_RATE_MIN_INTERVAL   (300*NSEC_PER_MSEC)
 251/* Target observation time interval for a peak-rate update (ns) */
 252#define BFQ_RATE_REF_INTERVAL   NSEC_PER_SEC
 253
 254/*
 255 * Shift used for peak-rate fixed precision calculations.
 256 * With
 257 * - the current shift: 16 positions
 258 * - the current type used to store rate: u32
 259 * - the current unit of measure for rate: [sectors/usec], or, more precisely,
 260 *   [(sectors/usec) / 2^BFQ_RATE_SHIFT] to take into account the shift,
 261 * the range of rates that can be stored is
 262 * [1 / 2^BFQ_RATE_SHIFT, 2^(32 - BFQ_RATE_SHIFT)] sectors/usec =
 263 * [1 / 2^16, 2^16] sectors/usec = [15e-6, 65536] sectors/usec =
 264 * [15, 65G] sectors/sec
 265 * Which, assuming a sector size of 512B, corresponds to a range of
 266 * [7.5K, 33T] B/sec
 267 */
 268#define BFQ_RATE_SHIFT          16
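
/*
 * A worked example of the fixed-point encoding (hypothetical device):
 * a device sustaining about 200 MB/s moves roughly 400 sectors per
 * millisecond, i.e. 0.4 sectors/usec, which is stored as
 *   0.4 * 2^BFQ_RATE_SHIFT = 0.4 * 65536 ~= 26214,
 * comfortably within the u32 range discussed above.
 */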
 269
 270/*
 271 * When configured for computing the duration of the weight-raising
 272 * for interactive queues automatically (see the comments at the
 273 * beginning of this file), BFQ does it using the following formula:
 274 * duration = (ref_rate / r) * ref_wr_duration,
 275 * where r is the peak rate of the device, and ref_rate and
 276 * ref_wr_duration are two reference parameters.  In particular,
 277 * ref_rate is the peak rate of the reference storage device (see
 278 * below), and ref_wr_duration is about the maximum time needed, with
 279 * BFQ and while reading two files in parallel, to load typical large
 280 * applications on the reference device (see the comments on
 281 * max_service_from_wr below, for more details on how ref_wr_duration
 282 * is obtained).  In practice, the slower/faster the device at hand
 283 * is, the more/less it takes to load applications with respect to the
 284 * reference device.  Accordingly, the longer/shorter BFQ grants
 285 * weight raising to interactive applications.
 286 *
 287 * BFQ uses two different reference pairs (ref_rate, ref_wr_duration),
 288 * depending on whether the device is rotational or non-rotational.
 289 *
 290 * In the following definitions, ref_rate[0] and ref_wr_duration[0]
 291 * are the reference values for a rotational device, whereas
 292 * ref_rate[1] and ref_wr_duration[1] are the reference values for a
 293 * non-rotational device. The reference rates are not the actual peak
 294 * rates of the devices used as a reference, but slightly lower
 295 * values. The reason for using slightly lower values is that the
 296 * peak-rate estimator tends to yield slightly lower values than the
 297 * actual peak rate (it can yield the actual peak rate only if there
 298 * is only one process doing I/O, and the process does sequential
 299 * I/O).
 300 *
 301 * The reference peak rates are measured in sectors/usec, left-shifted
 302 * by BFQ_RATE_SHIFT.
 303 */
 304static int ref_rate[2] = {14000, 33000};
 305/*
 306 * To improve readability, a conversion function is used to initialize
 307 * the following array, which entails that the array can be
 308 * initialized only in a function.
 309 */
 310static int ref_wr_duration[2];
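
/*
 * A hedged example of the duration formula above (the concrete
 * ref_wr_duration values are filled in elsewhere, so only the ratio
 * matters here): if the estimated peak rate r of the device at hand
 * is half the reference rate, then
 *   duration = (ref_rate / r) * ref_wr_duration = 2 * ref_wr_duration,
 * i.e. interactive queues on the slower device are weight-raised for
 * twice as long as on the reference device, subject to the clamping
 * performed in bfq_wr_duration().
 */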
 311
 312/*
 313 * BFQ uses the above-detailed, time-based weight-raising mechanism to
 314 * privilege interactive tasks. This mechanism is vulnerable to the
 315 * following false positives: I/O-bound applications that will go on
 316 * doing I/O for much longer than the duration of weight
 317 * raising. These applications have basically no benefit from being
 318 * weight-raised at the beginning of their I/O. On the opposite end,
 319 * while being weight-raised, these applications
  320 * a) unjustly steal throughput from applications that may actually need
 321 * low latency;
 322 * b) make BFQ uselessly perform device idling; device idling results
 323 * in loss of device throughput with most flash-based storage, and may
 324 * increase latencies when used purposelessly.
 325 *
 326 * BFQ tries to reduce these problems, by adopting the following
 327 * countermeasure. To introduce this countermeasure, we need first to
 328 * finish explaining how the duration of weight-raising for
 329 * interactive tasks is computed.
 330 *
 331 * For a bfq_queue deemed as interactive, the duration of weight
 332 * raising is dynamically adjusted, as a function of the estimated
 333 * peak rate of the device, so as to be equal to the time needed to
 334 * execute the 'largest' interactive task we benchmarked so far. By
 335 * largest task, we mean the task for which each involved process has
 336 * to do more I/O than for any of the other tasks we benchmarked. This
 337 * reference interactive task is the start-up of LibreOffice Writer,
 338 * and in this task each process/bfq_queue needs to have at most ~110K
 339 * sectors transferred.
 340 *
 341 * This last piece of information enables BFQ to reduce the actual
 342 * duration of weight-raising for at least one class of I/O-bound
 343 * applications: those doing sequential or quasi-sequential I/O. An
 344 * example is file copy. In fact, once started, the main I/O-bound
 345 * processes of these applications usually consume the above 110K
 346 * sectors in much less time than the processes of an application that
 347 * is starting, because these I/O-bound processes will greedily devote
 348 * almost all their CPU cycles only to their target,
 349 * throughput-friendly I/O operations. This is even more true if BFQ
 350 * happens to be underestimating the device peak rate, and thus
 351 * overestimating the duration of weight raising. But, according to
 352 * our measurements, once transferred 110K sectors, these processes
 353 * have no right to be weight-raised any longer.
 354 *
  355 * Based on the last consideration, BFQ ends weight-raising for a
 356 * bfq_queue if the latter happens to have received an amount of
 357 * service at least equal to the following constant. The constant is
 358 * set to slightly more than 110K, to have a minimum safety margin.
 359 *
 360 * This early ending of weight-raising reduces the amount of time
 361 * during which interactive false positives cause the two problems
 362 * described at the beginning of these comments.
 363 */
 364static const unsigned long max_service_from_wr = 120000;
 365
 366/*
 367 * Maximum time between the creation of two queues, for stable merge
 368 * to be activated (in ms)
 369 */
 370static const unsigned long bfq_activation_stable_merging = 600;
 371/*
 372 * Minimum time to be waited before evaluating delayed stable merge (in ms)
 373 */
 374static const unsigned long bfq_late_stable_merging = 600;
 375
 376#define RQ_BIC(rq)              ((struct bfq_io_cq *)((rq)->elv.priv[0]))
 377#define RQ_BFQQ(rq)             ((rq)->elv.priv[1])
 378
 379struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync,
 380                              unsigned int actuator_idx)
 381{
 382        if (is_sync)
 383                return bic->bfqq[1][actuator_idx];
 384
 385        return bic->bfqq[0][actuator_idx];
 386}
 387
 388static void bfq_put_stable_ref(struct bfq_queue *bfqq);
 389
 390void bic_set_bfqq(struct bfq_io_cq *bic,
 391                  struct bfq_queue *bfqq,
 392                  bool is_sync,
 393                  unsigned int actuator_idx)
 394{
 395        struct bfq_queue *old_bfqq = bic->bfqq[is_sync][actuator_idx];
 396
 397        /*
 398         * If bfqq != NULL, then a non-stable queue merge between
 399         * bic->bfqq and bfqq is happening here. This causes troubles
 400         * in the following case: bic->bfqq has also been scheduled
 401         * for a possible stable merge with bic->stable_merge_bfqq,
 402         * and bic->stable_merge_bfqq == bfqq happens to
 403         * hold. Troubles occur because bfqq may then undergo a split,
 404         * thereby becoming eligible for a stable merge. Yet, if
 405         * bic->stable_merge_bfqq points exactly to bfqq, then bfqq
 406         * would be stably merged with itself. To avoid this anomaly,
 407         * we cancel the stable merge if
 408         * bic->stable_merge_bfqq == bfqq.
 409         */
 410        struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[actuator_idx];
 411
 412        /* Clear bic pointer if bfqq is detached from this bic */
 413        if (old_bfqq && old_bfqq->bic == bic)
 414                old_bfqq->bic = NULL;
 415
 416        if (is_sync)
 417                bic->bfqq[1][actuator_idx] = bfqq;
 418        else
 419                bic->bfqq[0][actuator_idx] = bfqq;
 420
 421        if (bfqq && bfqq_data->stable_merge_bfqq == bfqq) {
 422                /*
 423                 * Actually, these same instructions are executed also
 424                 * in bfq_setup_cooperator, in case of abort or actual
 425                 * execution of a stable merge. We could avoid
 426                 * repeating these instructions there too, but if we
 427                 * did so, we would nest even more complexity in this
 428                 * function.
 429                 */
 430                bfq_put_stable_ref(bfqq_data->stable_merge_bfqq);
 431
 432                bfqq_data->stable_merge_bfqq = NULL;
 433        }
 434}
 435
 436struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
 437{
 438        return bic->icq.q->elevator->elevator_data;
 439}
 440
 441/**
 442 * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
 443 * @icq: the iocontext queue.
 444 */
 445static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
 446{
 447        /* bic->icq is the first member, %NULL will convert to %NULL */
 448        return container_of(icq, struct bfq_io_cq, icq);
 449}
 450
 451/**
  452 * bfq_bic_lookup - search for the bic of the current task attached to @q.
 453 * @q: the request queue.
 454 */
 455static struct bfq_io_cq *bfq_bic_lookup(struct request_queue *q)
 456{
 457        struct bfq_io_cq *icq;
 458        unsigned long flags;
 459
 460        if (!current->io_context)
 461                return NULL;
 462
 463        spin_lock_irqsave(&q->queue_lock, flags);
 464        icq = icq_to_bic(ioc_lookup_icq(q));
 465        spin_unlock_irqrestore(&q->queue_lock, flags);
 466
 467        return icq;
 468}
 469
 470/*
 471 * Scheduler run of queue, if there are requests pending and no one in the
 472 * driver that will restart queueing.
 473 */
 474void bfq_schedule_dispatch(struct bfq_data *bfqd)
 475{
 476        lockdep_assert_held(&bfqd->lock);
 477
 478        if (bfqd->queued != 0) {
 479                bfq_log(bfqd, "schedule dispatch");
 480                blk_mq_run_hw_queues(bfqd->queue, true);
 481        }
 482}
 483
 484#define bfq_class_idle(bfqq)    ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
 485
 486#define bfq_sample_valid(samples)       ((samples) > 80)
 487
 488/*
  489 * Lifted from AS - choose which of rq1 and rq2 is best served now.
 490 * We choose the request that is closer to the head right now.  Distance
 491 * behind the head is penalized and only allowed to a certain extent.
 492 */
 493static struct request *bfq_choose_req(struct bfq_data *bfqd,
 494                                      struct request *rq1,
 495                                      struct request *rq2,
 496                                      sector_t last)
 497{
 498        sector_t s1, s2, d1 = 0, d2 = 0;
 499        unsigned long back_max;
 500#define BFQ_RQ1_WRAP    0x01 /* request 1 wraps */
 501#define BFQ_RQ2_WRAP    0x02 /* request 2 wraps */
 502        unsigned int wrap = 0; /* bit mask: requests behind the disk head? */
 503
 504        if (!rq1 || rq1 == rq2)
 505                return rq2;
 506        if (!rq2)
 507                return rq1;
 508
 509        if (rq_is_sync(rq1) && !rq_is_sync(rq2))
 510                return rq1;
 511        else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
 512                return rq2;
 513        if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
 514                return rq1;
 515        else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
 516                return rq2;
 517
 518        s1 = blk_rq_pos(rq1);
 519        s2 = blk_rq_pos(rq2);
 520
 521        /*
 522         * By definition, 1KiB is 2 sectors.
 523         */
 524        back_max = bfqd->bfq_back_max * 2;
 525
 526        /*
 527         * Strict one way elevator _except_ in the case where we allow
 528         * short backward seeks which are biased as twice the cost of a
 529         * similar forward seek.
 530         */
 531        if (s1 >= last)
 532                d1 = s1 - last;
 533        else if (s1 + back_max >= last)
 534                d1 = (last - s1) * bfqd->bfq_back_penalty;
 535        else
 536                wrap |= BFQ_RQ1_WRAP;
 537
 538        if (s2 >= last)
 539                d2 = s2 - last;
 540        else if (s2 + back_max >= last)
 541                d2 = (last - s2) * bfqd->bfq_back_penalty;
 542        else
 543                wrap |= BFQ_RQ2_WRAP;
 544
 545        /* Found required data */
 546
 547        /*
 548         * By doing switch() on the bit mask "wrap" we avoid having to
 549         * check two variables for all permutations: --> faster!
 550         */
 551        switch (wrap) {
 552        case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
 553                if (d1 < d2)
 554                        return rq1;
 555                else if (d2 < d1)
 556                        return rq2;
 557
 558                if (s1 >= s2)
 559                        return rq1;
 560                else
 561                        return rq2;
 562
 563        case BFQ_RQ2_WRAP:
 564                return rq1;
 565        case BFQ_RQ1_WRAP:
 566                return rq2;
 567        case BFQ_RQ1_WRAP|BFQ_RQ2_WRAP: /* both rqs wrapped */
 568        default:
 569                /*
 570                 * Since both rqs are wrapped,
 571                 * start with the one that's further behind head
 572                 * (--> only *one* back seek required),
 573                 * since back seek takes more time than forward.
 574                 */
 575                if (s1 <= s2)
 576                        return rq1;
 577                else
 578                        return rq2;
 579        }
 580}
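
/*
 * A hedged numeric sketch of the choice above (hypothetical sector
 * numbers, default tunables bfq_back_max = 16384 KiB and
 * bfq_back_penalty = 2, both requests sync and non-meta): with the
 * head at sector 10000, rq1 at sector 10100 and rq2 at sector 9900,
 *   d1 = 10100 - 10000 = 100,
 *   d2 = (10000 - 9900) * 2 = 200,
 * so rq1 wins even though both requests are equally far from the
 * head, because the backward seek needed to reach rq2 is penalized.
 */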
 581
 582#define BFQ_LIMIT_INLINE_DEPTH 16
 583
 584#ifdef CONFIG_BFQ_GROUP_IOSCHED
 585static bool bfqq_request_over_limit(struct bfq_queue *bfqq, int limit)
 586{
 587        struct bfq_data *bfqd = bfqq->bfqd;
 588        struct bfq_entity *entity = &bfqq->entity;
 589        struct bfq_entity *inline_entities[BFQ_LIMIT_INLINE_DEPTH];
 590        struct bfq_entity **entities = inline_entities;
 591        int depth, level, alloc_depth = BFQ_LIMIT_INLINE_DEPTH;
 592        int class_idx = bfqq->ioprio_class - 1;
 593        struct bfq_sched_data *sched_data;
 594        unsigned long wsum;
 595        bool ret = false;
 596
 597        if (!entity->on_st_or_in_serv)
 598                return false;
 599
 600retry:
 601        spin_lock_irq(&bfqd->lock);
 602        /* +1 for bfqq entity, root cgroup not included */
 603        depth = bfqg_to_blkg(bfqq_group(bfqq))->blkcg->css.cgroup->level + 1;
 604        if (depth > alloc_depth) {
 605                spin_unlock_irq(&bfqd->lock);
 606                if (entities != inline_entities)
 607                        kfree(entities);
 608                entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
 609                if (!entities)
 610                        return false;
 611                alloc_depth = depth;
 612                goto retry;
 613        }
 614
 615        sched_data = entity->sched_data;
 616        /* Gather our ancestors as we need to traverse them in reverse order */
 617        level = 0;
 618        for_each_entity(entity) {
 619                /*
 620                 * If at some level entity is not even active, allow request
 621                 * queueing so that BFQ knows there's work to do and activate
 622                 * entities.
 623                 */
 624                if (!entity->on_st_or_in_serv)
 625                        goto out;
 626                /* Uh, more parents than cgroup subsystem thinks? */
 627                if (WARN_ON_ONCE(level >= depth))
 628                        break;
 629                entities[level++] = entity;
 630        }
 631        WARN_ON_ONCE(level != depth);
 632        for (level--; level >= 0; level--) {
 633                entity = entities[level];
 634                if (level > 0) {
 635                        wsum = bfq_entity_service_tree(entity)->wsum;
 636                } else {
 637                        int i;
 638                        /*
 639                         * For bfqq itself we take into account service trees
 640                         * of all higher priority classes and multiply their
 641                         * weights so that low prio queue from higher class
 642                         * gets more requests than high prio queue from lower
 643                         * class.
 644                         */
 645                        wsum = 0;
 646                        for (i = 0; i <= class_idx; i++) {
 647                                wsum = wsum * IOPRIO_BE_NR +
 648                                        sched_data->service_tree[i].wsum;
 649                        }
 650                }
 651                if (!wsum)
 652                        continue;
 653                limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);
 654                if (entity->allocated >= limit) {
 655                        bfq_log_bfqq(bfqq->bfqd, bfqq,
 656                                "too many requests: allocated %d limit %d level %d",
 657                                entity->allocated, limit, level);
 658                        ret = true;
 659                        break;
 660                }
 661        }
 662out:
 663        spin_unlock_irq(&bfqd->lock);
 664        if (entities != inline_entities)
 665                kfree(entities);
 666        return ret;
 667}
 668#else
 669static bool bfqq_request_over_limit(struct bfq_queue *bfqq, int limit)
 670{
 671        return false;
 672}
 673#endif
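
/*
 * A rough, hedged example of the per-level scaling performed in
 * bfqq_request_over_limit() (hypothetical weights and queue depth):
 * with q->nr_requests = 256 and two sibling cgroups of weights 100
 * and 300, the walk gives the first cgroup
 *   limit = DIV_ROUND_CLOSEST(256 * 100, 400) = 64
 * requests, and a queue holding half of the weight inside that cgroup
 * is scaled down further, to about 32. A queue that already has more
 * requests allocated than its scaled limit is then restricted to a
 * tag depth of 1 by bfq_limit_depth() below.
 */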
 674
 675/*
 676 * Async I/O can easily starve sync I/O (both sync reads and sync
 677 * writes), by consuming all tags. Similarly, storms of sync writes,
 678 * such as those that sync(2) may trigger, can starve sync reads.
 679 * Limit depths of async I/O and sync writes so as to counter both
 680 * problems.
 681 *
 682 * Also if a bfq queue or its parent cgroup consume more tags than would be
 683 * appropriate for their weight, we trim the available tag depth to 1. This
 684 * avoids a situation where one cgroup can starve another cgroup from tags and
 685 * thus block service differentiation among cgroups. Note that because the
 686 * queue / cgroup already has many requests allocated and queued, this does not
 687 * significantly affect service guarantees coming from the BFQ scheduling
 688 * algorithm.
 689 */
 690static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 691{
 692        struct bfq_data *bfqd = data->q->elevator->elevator_data;
 693        struct bfq_io_cq *bic = bfq_bic_lookup(data->q);
 694        int depth;
 695        unsigned limit = data->q->nr_requests;
 696        unsigned int act_idx;
 697
 698        /* Sync reads have full depth available */
 699        if (op_is_sync(opf) && !op_is_write(opf)) {
 700                depth = 0;
 701        } else {
 702                depth = bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
 703                limit = (limit * depth) >> bfqd->full_depth_shift;
 704        }
 705
 706        for (act_idx = 0; bic && act_idx < bfqd->num_actuators; act_idx++) {
 707                struct bfq_queue *bfqq =
 708                        bic_to_bfqq(bic, op_is_sync(opf), act_idx);
 709
 710                /*
 711                 * Does queue (or any parent entity) exceed number of
 712                 * requests that should be available to it? Heavily
 713                 * limit depth so that it cannot consume more
 714                 * available requests and thus starve other entities.
 715                 */
 716                if (bfqq && bfqq_request_over_limit(bfqq, limit)) {
 717                        depth = 1;
 718                        break;
 719                }
 720        }
 721        bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
 722                __func__, bfqd->wr_busy_queues, op_is_sync(opf), depth);
 723        if (depth)
 724                data->shallow_depth = depth;
 725}
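
/*
 * Hedged usage note (the word_depths[][] values themselves are
 * computed elsewhere, so the figures are only indicative): a sync
 * read always keeps the full tag depth, because depth stays 0 and
 * data->shallow_depth is left untouched, whereas async I/O and sync
 * writes are confined to the fraction of q->nr_requests encoded in
 * word_depths[][]. If any of the bfq_queues involved is already over
 * its fair share of tags, the depth collapses to a single request.
 */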
 726
 727static struct bfq_queue *
 728bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
 729                     sector_t sector, struct rb_node **ret_parent,
 730                     struct rb_node ***rb_link)
 731{
 732        struct rb_node **p, *parent;
 733        struct bfq_queue *bfqq = NULL;
 734
 735        parent = NULL;
 736        p = &root->rb_node;
 737        while (*p) {
 738                struct rb_node **n;
 739
 740                parent = *p;
 741                bfqq = rb_entry(parent, struct bfq_queue, pos_node);
 742
 743                /*
 744                 * Sort strictly based on sector. Smallest to the left,
 745                 * largest to the right.
 746                 */
 747                if (sector > blk_rq_pos(bfqq->next_rq))
 748                        n = &(*p)->rb_right;
 749                else if (sector < blk_rq_pos(bfqq->next_rq))
 750                        n = &(*p)->rb_left;
 751                else
 752                        break;
 753                p = n;
 754                bfqq = NULL;
 755        }
 756
 757        *ret_parent = parent;
 758        if (rb_link)
 759                *rb_link = p;
 760
 761        bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
 762                (unsigned long long)sector,
 763                bfqq ? bfqq->pid : 0);
 764
 765        return bfqq;
 766}
 767
 768static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
 769{
 770        return bfqq->service_from_backlogged > 0 &&
 771                time_is_before_jiffies(bfqq->first_IO_time +
 772                                       bfq_merge_time_limit);
 773}
 774
 775/*
 776 * The following function is not marked as __cold because it is
 777 * actually cold, but for the same performance goal described in the
 778 * comments on the likely() at the beginning of
 779 * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
 780 * execution time for the case where this function is not invoked, we
 781 * had to add an unlikely() in each involved if().
 782 */
 783void __cold
 784bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 785{
 786        struct rb_node **p, *parent;
 787        struct bfq_queue *__bfqq;
 788
 789        if (bfqq->pos_root) {
 790                rb_erase(&bfqq->pos_node, bfqq->pos_root);
 791                bfqq->pos_root = NULL;
 792        }
 793
 794        /* oom_bfqq does not participate in queue merging */
 795        if (bfqq == &bfqd->oom_bfqq)
 796                return;
 797
 798        /*
 799         * bfqq cannot be merged any longer (see comments in
 800         * bfq_setup_cooperator): no point in adding bfqq into the
 801         * position tree.
 802         */
 803        if (bfq_too_late_for_merging(bfqq))
 804                return;
 805
 806        if (bfq_class_idle(bfqq))
 807                return;
 808        if (!bfqq->next_rq)
 809                return;
 810
 811        bfqq->pos_root = &bfqq_group(bfqq)->rq_pos_tree;
 812        __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
 813                        blk_rq_pos(bfqq->next_rq), &parent, &p);
 814        if (!__bfqq) {
 815                rb_link_node(&bfqq->pos_node, parent, p);
 816                rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
 817        } else
 818                bfqq->pos_root = NULL;
 819}
 820
 821/*
 822 * The following function returns false either if every active queue
 823 * must receive the same share of the throughput (symmetric scenario),
 824 * or, as a special case, if bfqq must receive a share of the
 825 * throughput lower than or equal to the share that every other active
 826 * queue must receive.  If bfqq does sync I/O, then these are the only
 827 * two cases where bfqq happens to be guaranteed its share of the
 828 * throughput even if I/O dispatching is not plugged when bfqq remains
 829 * temporarily empty (for more details, see the comments in the
 830 * function bfq_better_to_idle()). For this reason, the return value
 831 * of this function is used to check whether I/O-dispatch plugging can
 832 * be avoided.
 833 *
 834 * The above first case (symmetric scenario) occurs when:
 835 * 1) all active queues have the same weight,
 836 * 2) all active queues belong to the same I/O-priority class,
 837 * 3) all active groups at the same level in the groups tree have the same
 838 *    weight,
 839 * 4) all active groups at the same level in the groups tree have the same
 840 *    number of children.
 841 *
 842 * Unfortunately, keeping the necessary state for evaluating exactly
 843 * the last two symmetry sub-conditions above would be quite complex
 844 * and time consuming. Therefore this function evaluates, instead,
 845 * only the following stronger three sub-conditions, for which it is
 846 * much easier to maintain the needed state:
 847 * 1) all active queues have the same weight,
 848 * 2) all active queues belong to the same I/O-priority class,
 849 * 3) there is at most one active group.
 850 * In particular, the last condition is always true if hierarchical
 851 * support or the cgroups interface are not enabled, thus no state
 852 * needs to be maintained in this case.
 853 */
 854static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
 855                                   struct bfq_queue *bfqq)
 856{
 857        bool smallest_weight = bfqq &&
 858                bfqq->weight_counter &&
 859                bfqq->weight_counter ==
 860                container_of(
 861                        rb_first_cached(&bfqd->queue_weights_tree),
 862                        struct bfq_weight_counter,
 863                        weights_node);
 864
 865        /*
 866         * For queue weights to differ, queue_weights_tree must contain
 867         * at least two nodes.
 868         */
 869        bool varied_queue_weights = !smallest_weight &&
 870                !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
 871                (bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
 872                 bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
 873
 874        bool multiple_classes_busy =
 875                (bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
 876                (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
 877                (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
 878
 879        return varied_queue_weights || multiple_classes_busy
 880#ifdef CONFIG_BFQ_GROUP_IOSCHED
 881               || bfqd->num_groups_with_pending_reqs > 1
 882#endif
 883                ;
 884}
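
/*
 * A hedged example of the classification above (hypothetical
 * configuration): if every busy queue has weight 100, belongs to the
 * best-effort class and sits in the root group, the scenario is
 * symmetric and the function returns false for all of them, so idling
 * can be skipped when one of them empties temporarily. If one queue
 * is then raised to weight 300, the function starts returning true
 * for that queue, which now needs I/O-dispatch plugging to actually
 * receive its larger share, while the weight-100 queues keep getting
 * false thanks to the smallest-weight special case.
 */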
 885
 886/*
 887 * If the weight-counter tree passed as input contains no counter for
 888 * the weight of the input queue, then add that counter; otherwise just
 889 * increment the existing counter.
 890 *
 891 * Note that weight-counter trees contain few nodes in mostly symmetric
 892 * scenarios. For example, if all queues have the same weight, then the
 893 * weight-counter tree for the queues may contain at most one node.
 894 * This holds even if low_latency is on, because weight-raised queues
 895 * are not inserted in the tree.
 896 * In most scenarios, the rate at which nodes are created/destroyed
 897 * should be low too.
 898 */
 899void bfq_weights_tree_add(struct bfq_queue *bfqq)
 900{
 901        struct rb_root_cached *root = &bfqq->bfqd->queue_weights_tree;
 902        struct bfq_entity *entity = &bfqq->entity;
 903        struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
 904        bool leftmost = true;
 905
 906        /*
 907         * Do not insert if the queue is already associated with a
 908         * counter, which happens if:
 909         *   1) a request arrival has caused the queue to become both
 910         *      non-weight-raised, and hence change its weight, and
 911         *      backlogged; in this respect, each of the two events
 912         *      causes an invocation of this function,
 913         *   2) this is the invocation of this function caused by the
 914         *      second event. This second invocation is actually useless,
 915         *      and we handle this fact by exiting immediately. More
 916         *      efficient or clearer solutions might possibly be adopted.
 917         */
 918        if (bfqq->weight_counter)
 919                return;
 920
 921        while (*new) {
 922                struct bfq_weight_counter *__counter = container_of(*new,
 923                                                struct bfq_weight_counter,
 924                                                weights_node);
 925                parent = *new;
 926
 927                if (entity->weight == __counter->weight) {
 928                        bfqq->weight_counter = __counter;
 929                        goto inc_counter;
 930                }
 931                if (entity->weight < __counter->weight)
 932                        new = &((*new)->rb_left);
 933                else {
 934                        new = &((*new)->rb_right);
 935                        leftmost = false;
 936                }
 937        }
 938
 939        bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
 940                                       GFP_ATOMIC);
 941
 942        /*
 943         * In the unlucky event of an allocation failure, we just
 944         * exit. This will cause the weight of queue to not be
 945         * considered in bfq_asymmetric_scenario, which, in its turn,
 946         * causes the scenario to be deemed wrongly symmetric in case
 947         * bfqq's weight would have been the only weight making the
 948         * scenario asymmetric.  On the bright side, no unbalance will
 949         * however occur when bfqq becomes inactive again (the
 950         * invocation of this function is triggered by an activation
 951         * of queue).  In fact, bfq_weights_tree_remove does nothing
 952         * if !bfqq->weight_counter.
 953         */
 954        if (unlikely(!bfqq->weight_counter))
 955                return;
 956
 957        bfqq->weight_counter->weight = entity->weight;
 958        rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
 959        rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
 960                                leftmost);
 961
 962inc_counter:
 963        bfqq->weight_counter->num_active++;
 964        bfqq->ref++;
 965}
 966
 967/*
 968 * Decrement the weight counter associated with the queue, and, if the
 969 * counter reaches 0, remove the counter from the tree.
 970 * See the comments to the function bfq_weights_tree_add() for considerations
 971 * about overhead.
 972 */
 973void bfq_weights_tree_remove(struct bfq_queue *bfqq)
 974{
 975        struct rb_root_cached *root;
 976
 977        if (!bfqq->weight_counter)
 978                return;
 979
 980        root = &bfqq->bfqd->queue_weights_tree;
 981        bfqq->weight_counter->num_active--;
 982        if (bfqq->weight_counter->num_active > 0)
 983                goto reset_entity_pointer;
 984
 985        rb_erase_cached(&bfqq->weight_counter->weights_node, root);
 986        kfree(bfqq->weight_counter);
 987
 988reset_entity_pointer:
 989        bfqq->weight_counter = NULL;
 990        bfq_put_queue(bfqq);
 991}
 992
 993/*
 994 * Return expired entry, or NULL to just start from scratch in rbtree.
 995 */
 996static struct request *bfq_check_fifo(struct bfq_queue *bfqq,
 997                                      struct request *last)
 998{
 999        struct request *rq;
1000
1001        if (bfq_bfqq_fifo_expire(bfqq))
1002                return NULL;
1003
1004        bfq_mark_bfqq_fifo_expire(bfqq);
1005
1006        rq = rq_entry_fifo(bfqq->fifo.next);
1007
1008        if (rq == last || ktime_get_ns() < rq->fifo_time)
1009                return NULL;
1010
1011        bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq);
1012        return rq;
1013}
1014
1015static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
1016                                        struct bfq_queue *bfqq,
1017                                        struct request *last)
1018{
1019        struct rb_node *rbnext = rb_next(&last->rb_node);
1020        struct rb_node *rbprev = rb_prev(&last->rb_node);
1021        struct request *next, *prev = NULL;
1022
1023        /* Follow expired path, else get first next available. */
1024        next = bfq_check_fifo(bfqq, last);
1025        if (next)
1026                return next;
1027
1028        if (rbprev)
1029                prev = rb_entry_rq(rbprev);
1030
1031        if (rbnext)
1032                next = rb_entry_rq(rbnext);
1033        else {
1034                rbnext = rb_first(&bfqq->sort_list);
1035                if (rbnext && rbnext != &last->rb_node)
1036                        next = rb_entry_rq(rbnext);
1037        }
1038
1039        return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
1040}
1041
1042/* see the definition of bfq_async_charge_factor for details */
1043static unsigned long bfq_serv_to_charge(struct request *rq,
1044                                        struct bfq_queue *bfqq)
1045{
1046        if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
1047            bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
1048                return blk_rq_sectors(rq);
1049
1050        return blk_rq_sectors(rq) * bfq_async_charge_factor;
1051}
1052
1053/**
1054 * bfq_updated_next_req - update the queue after a new next_rq selection.
1055 * @bfqd: the device data the queue belongs to.
1056 * @bfqq: the queue to update.
1057 *
1058 * If the first request of a queue changes we make sure that the queue
1059 * has enough budget to serve at least its first request (if the
1060 * request has grown).  We do this because if the queue has not enough
1061 * budget for its first request, it has to go through two dispatch
1062 * rounds to actually get it dispatched.
1063 */
1064static void bfq_updated_next_req(struct bfq_data *bfqd,
1065                                 struct bfq_queue *bfqq)
1066{
1067        struct bfq_entity *entity = &bfqq->entity;
1068        struct request *next_rq = bfqq->next_rq;
1069        unsigned long new_budget;
1070
1071        if (!next_rq)
1072                return;
1073
1074        if (bfqq == bfqd->in_service_queue)
1075                /*
1076                 * In order not to break guarantees, budgets cannot be
1077                 * changed after an entity has been selected.
1078                 */
1079                return;
1080
1081        new_budget = max_t(unsigned long,
1082                           max_t(unsigned long, bfqq->max_budget,
1083                                 bfq_serv_to_charge(next_rq, bfqq)),
1084                           entity->service);
1085        if (entity->budget != new_budget) {
1086                entity->budget = new_budget;
1087                bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
1088                                         new_budget);
1089                bfq_requeue_bfqq(bfqd, bfqq, false);
1090        }
1091}
1092
1093static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
1094{
1095        u64 dur;
1096
1097        dur = bfqd->rate_dur_prod;
1098        do_div(dur, bfqd->peak_rate);
1099
1100        /*
1101         * Limit duration between 3 and 25 seconds. The upper limit
1102         * has been conservatively set after the following worst case:
1103         * on a QEMU/KVM virtual machine
1104         * - running in a slow PC
1105         * - with a virtual disk stacked on a slow low-end 5400rpm HDD
1106         * - serving a heavy I/O workload, such as the sequential reading
1107         *   of several files
1108         * mplayer took 23 seconds to start, if constantly weight-raised.
1109         *
1110         * As for higher values than that accommodating the above bad
1111         * scenario, tests show that higher values would often yield
1112         * the opposite of the desired result, i.e., would worsen
1113         * responsiveness by allowing non-interactive applications to
1114         * preserve weight raising for too long.
1115         *
1116         * On the other end, lower values than 3 seconds make it
1117         * difficult for most interactive tasks to complete their jobs
1118         * before weight-raising finishes.
1119         */
1120        return clamp_val(dur, msecs_to_jiffies(3000), msecs_to_jiffies(25000));
1121}
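
/*
 * Hedged arithmetic for the clamping above (hypothetical figures): if
 * rate_dur_prod corresponds to, say, a 7 s reference duration and the
 * measured peak rate of the device at hand is one fifth of the
 * reference rate, the raw result would be 35 s and is clamped down to
 * 25 s; on a device five times faster than the reference, the raw
 * 1.4 s is clamped up to 3 s.
 */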
1122
1123/* switch back from soft real-time to interactive weight raising */
1124static void switch_back_to_interactive_wr(struct bfq_queue *bfqq,
1125                                          struct bfq_data *bfqd)
1126{
1127        bfqq->wr_coeff = bfqd->bfq_wr_coeff;
1128        bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
1129        bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt;
1130}
1131
1132static void
1133bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
1134                      struct bfq_io_cq *bic, bool bfq_already_existing)
1135{
1136        unsigned int old_wr_coeff = 1;
1137        bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
1138        unsigned int a_idx = bfqq->actuator_idx;
1139        struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];
1140
1141        if (bfqq_data->saved_has_short_ttime)
1142                bfq_mark_bfqq_has_short_ttime(bfqq);
1143        else
1144                bfq_clear_bfqq_has_short_ttime(bfqq);
1145
1146        if (bfqq_data->saved_IO_bound)
1147                bfq_mark_bfqq_IO_bound(bfqq);
1148        else
1149                bfq_clear_bfqq_IO_bound(bfqq);
1150
1151        bfqq->last_serv_time_ns = bfqq_data->saved_last_serv_time_ns;
1152        bfqq->inject_limit = bfqq_data->saved_inject_limit;
1153        bfqq->decrease_time_jif = bfqq_data->saved_decrease_time_jif;
1154
1155        bfqq->entity.new_weight = bfqq_data->saved_weight;
1156        bfqq->ttime = bfqq_data->saved_ttime;
1157        bfqq->io_start_time = bfqq_data->saved_io_start_time;
1158        bfqq->tot_idle_time = bfqq_data->saved_tot_idle_time;
1159        /*
1160         * Restore weight coefficient only if low_latency is on
1161         */
1162        if (bfqd->low_latency) {
1163                old_wr_coeff = bfqq->wr_coeff;
1164                bfqq->wr_coeff = bfqq_data->saved_wr_coeff;
1165        }
1166        bfqq->service_from_wr = bfqq_data->saved_service_from_wr;
1167        bfqq->wr_start_at_switch_to_srt =
1168                bfqq_data->saved_wr_start_at_switch_to_srt;
1169        bfqq->last_wr_start_finish = bfqq_data->saved_last_wr_start_finish;
1170        bfqq->wr_cur_max_time = bfqq_data->saved_wr_cur_max_time;
1171
1172        if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) ||
1173            time_is_before_jiffies(bfqq->last_wr_start_finish +
1174                                   bfqq->wr_cur_max_time))) {
1175                if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
1176                    !bfq_bfqq_in_large_burst(bfqq) &&
1177                    time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt +
1178                                             bfq_wr_duration(bfqd))) {
1179                        switch_back_to_interactive_wr(bfqq, bfqd);
1180                } else {
1181                        bfqq->wr_coeff = 1;
1182                        bfq_log_bfqq(bfqq->bfqd, bfqq,
1183                                     "resume state: switching off wr");
1184                }
1185        }
1186
1187        /* make sure weight will be updated, however we got here */
1188        bfqq->entity.prio_changed = 1;
1189
1190        if (likely(!busy))
1191                return;
1192
1193        if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
1194                bfqd->wr_busy_queues++;
1195        else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
1196                bfqd->wr_busy_queues--;
1197}
1198
1199static int bfqq_process_refs(struct bfq_queue *bfqq)
1200{
1201        return bfqq->ref - bfqq->entity.allocated -
1202                bfqq->entity.on_st_or_in_serv -
1203                (bfqq->weight_counter != NULL) - bfqq->stable_ref;
1204}
1205
1206/* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
1207static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq)
1208{
1209        struct bfq_queue *item;
1210        struct hlist_node *n;
1211
1212        hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
1213                hlist_del_init(&item->burst_list_node);
1214
1215        /*
1216         * Start the creation of a new burst list only if there is no
1217         * active queue. See comments on the conditional invocation of
1218         * bfq_handle_burst().
1219         */
1220        if (bfq_tot_busy_queues(bfqd) == 0) {
1221                hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
1222                bfqd->burst_size = 1;
1223        } else
1224                bfqd->burst_size = 0;
1225
1226        bfqd->burst_parent_entity = bfqq->entity.parent;
1227}
1228
1229/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */
1230static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
1231{
1232        /* Increment burst size to take into account also bfqq */
1233        bfqd->burst_size++;
1234
1235        if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) {
1236                struct bfq_queue *pos, *bfqq_item;
1237                struct hlist_node *n;
1238
1239                /*
1240                 * Enough queues have been activated shortly after each
1241                 * other to consider this burst as large.
1242                 */
1243                bfqd->large_burst = true;
1244
1245                /*
1246                 * We can now mark all queues in the burst list as
1247                 * belonging to a large burst.
1248                 */
1249                hlist_for_each_entry(bfqq_item, &bfqd->burst_list,
1250                                     burst_list_node)
1251                        bfq_mark_bfqq_in_large_burst(bfqq_item);
1252                bfq_mark_bfqq_in_large_burst(bfqq);
1253
1254                /*
1255                 * From now on, and until the current burst finishes, any
1256                 * new queue being activated shortly after the last queue
1257                 * was inserted in the burst can be immediately marked as
1258                 * belonging to a large burst. So the burst list is not
1259                 * needed any more. Remove it.
1260                 */
1261                hlist_for_each_entry_safe(pos, n, &bfqd->burst_list,
1262                                          burst_list_node)
1263                        hlist_del_init(&pos->burst_list_node);
1264        } else /*
1265                * Burst not yet large: add bfqq to the burst list. Do
1266                * not increment the ref counter for bfqq, because bfqq
1267                * is removed from the burst list before freeing bfqq
1268                * in put_queue.
1269                */
1270                hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
1271}
1272
1273/*
1274 * If many queues belonging to the same group happen to be created
1275 * shortly after each other, then the processes associated with these
1276 * queues have typically a common goal. In particular, bursts of queue
1277 * creations are usually caused by services or applications that spawn
1278 * many parallel threads/processes. Examples are systemd during boot,
1279 * or git grep. To help these processes get their job done as soon as
1280 * possible, it is usually better to not grant either weight-raising
1281 * or device idling to their queues, unless these queues must be
1282 * protected from the I/O flowing through other active queues.
1283 *
1284 * In this comment we describe, firstly, the reasons why this fact
1285 * holds, and, secondly, the next function, which implements the main
1286 * steps needed to properly mark these queues so that they can then be
1287 * treated in a different way.
1288 *
1289 * The above services or applications benefit mostly from a high
1290 * throughput: the quicker the requests of the activated queues are
1291 * cumulatively served, the sooner the target job of these queues gets
1292 * completed. As a consequence, weight-raising any of these queues,
1293 * which also implies idling the device for it, is almost always
1294 * counterproductive, unless there are other active queues to isolate
1295 * these new queues from. If there are no other active queues, then
1296 * weight-raising these new queues just lowers throughput in most
1297 * cases.
1298 *
1299 * On the other hand, a burst of queue creations may be caused also by
1300 * the start of an application that does not consist of a lot of
1301 * parallel I/O-bound threads. In fact, with a complex application,
1302 * several short processes may need to be executed to start up the
1303 * application. In this respect, to start an application as quickly as
1304 * possible, the best thing to do is in any case to privilege the I/O
1305 * related to the application with respect to all other
1306 * I/O. Therefore, the best strategy for starting, as quickly as
1307 * possible, an application that causes a burst of queue creations is to
1308 * weight-raise all the queues created during the burst. This is the
1309 * exact opposite of the best strategy for the other type of bursts.
1310 *
1311 * In the end, to take the best action for each of the two cases, the
1312 * two types of bursts need to be distinguished. Fortunately, this
1313 * seems relatively easy, by looking at the sizes of the bursts. In
1314 * particular, we found a threshold such that only bursts with a size
1315 * larger than that threshold are apparently caused by
1316 * services or commands such as systemd or git grep. For brevity,
1317 * hereafter we call just 'large' these bursts. BFQ *does not*
1318 * weight-raise queues whose creation occurs in a large burst. In
1319 * addition, for each of these queues BFQ performs or does not perform
1320 * idling depending on which choice boosts the throughput more. The
1321 * exact choice depends on the device and request pattern at
1322 * hand.
1323 *
1324 * Unfortunately, false positives may occur while an interactive task
1325 * is starting (e.g., an application is being started). The
1326 * consequence is that the queues associated with the task do not
1327 * enjoy weight raising as expected. Fortunately these false positives
1328 * are very rare. They typically occur if some service happens to
1329 * start doing I/O exactly when the interactive task starts.
1330 *
1331 * Turning back to the next function, it is invoked only if there are
1332 * no active queues (apart from active queues that would belong to the
1333 * same possible burst that bfqq would belong to), and it implements all
1334 * the steps needed to detect the occurrence of a large burst and to
1335 * properly mark all the queues belonging to it (so that they can then
1336 * be treated in a different way). This goal is achieved by
1337 * maintaining a "burst list" that holds, temporarily, the queues that
1338 * belong to the burst in progress. The list is then used to mark
1339 * these queues as belonging to a large burst if the burst does become
1340 * large. The main steps are the following.
1341 *
1342 * . when the very first queue is created, the queue is inserted into the
1343 *   list (as it could be the first queue in a possible burst)
1344 *
1345 * . if the current burst has not yet become large, and a queue Q that does
1346 *   not yet belong to the burst is activated shortly after the last time
1347 *   at which a new queue entered the burst list, then the function appends
1348 *   Q to the burst list
1349 *
1350 * . if, as a consequence of the previous step, the burst size reaches
1351 *   the large-burst threshold, then
1352 *
1353 *     . all the queues in the burst list are marked as belonging to a
1354 *       large burst
1355 *
1356 *     . the burst list is deleted; in fact, the burst list already served
1357 *       its purpose (keeping temporarily track of the queues in a burst,
1358 *       so as to be able to mark them as belonging to a large burst in the
1359 *       previous sub-step), and now is not needed any more
1360 *
1361 *     . the device enters a large-burst mode
1362 *
1363 * . if a queue Q that does not belong to the burst is created while
1364 *   the device is in large-burst mode and shortly after the last time
1365 *   at which a queue either entered the burst list or was marked as
1366 *   belonging to the current large burst, then Q is immediately marked
1367 *   as belonging to a large burst.
1368 *
1369 * . if a queue Q that does not belong to the burst is created a while
1370 *   later, i.e., not shortly after, the last time at which a queue
1371 *   either entered the burst list or was marked as belonging to the
1372 *   current large burst, then the current burst is deemed as finished and:
1373 *
1374 *        . the large-burst mode is reset if set
1375 *
1376 *        . the burst list is emptied
1377 *
1378 *        . Q is inserted in the burst list, as Q may be the first queue
1379 *          in a possible new burst (then the burst list contains just Q
1380 *          after this step).
1381 */
1382static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
1383{
1384        /*
1385         * If bfqq is already in the burst list or is part of a large
1386         * burst, or finally has just been split, then there is
1387         * nothing else to do.
1388         */
1389        if (!hlist_unhashed(&bfqq->burst_list_node) ||
1390            bfq_bfqq_in_large_burst(bfqq) ||
1391            time_is_after_eq_jiffies(bfqq->split_time +
1392                                     msecs_to_jiffies(10)))
1393                return;
1394
1395        /*
1396         * If bfqq's creation happens late enough, or bfqq belongs to
1397         * a different group than the burst group, then the current
1398         * burst is finished, and related data structures must be
1399         * reset.
1400         *
1401         * In this respect, consider the special case where bfqq is
1402         * the very first queue created after BFQ is selected for this
1403         * device. In this case, last_ins_in_burst and
1404         * burst_parent_entity are not yet significant when we get
1405         * here. But it is easy to verify that, whether or not the
1406         * following condition is true, bfqq will end up being
1407         * inserted into the burst list. In particular the list will
1408         * happen to contain only bfqq. And this is exactly what has
1409         * to happen, as bfqq may be the first queue of the first
1410         * burst.
1411         */
1412        if (time_is_before_jiffies(bfqd->last_ins_in_burst +
1413            bfqd->bfq_burst_interval) ||
1414            bfqq->entity.parent != bfqd->burst_parent_entity) {
1415                bfqd->large_burst = false;
1416                bfq_reset_burst_list(bfqd, bfqq);
1417                goto end;
1418        }
1419
1420        /*
1421         * If we get here, then bfqq is being activated shortly after the
1422         * last queue. So, if the current burst is also large, we can mark
1423         * bfqq as belonging to this large burst immediately.
1424         */
1425        if (bfqd->large_burst) {
1426                bfq_mark_bfqq_in_large_burst(bfqq);
1427                goto end;
1428        }
1429
1430        /*
1431         * If we get here, then a large-burst state has not yet been
1432         * reached, but bfqq is being activated shortly after the last
1433         * queue. Then we add bfqq to the burst.
1434         */
1435        bfq_add_to_burst(bfqd, bfqq);
1436end:
1437        /*
1438         * At this point, bfqq either has been added to the current
1439         * burst or has caused the current burst to terminate and a
1440         * possible new burst to start. In particular, in the second
1441         * case, bfqq has become the first queue in the possible new
1442         * burst.  In both cases last_ins_in_burst needs to be moved
1443         * forward.
1444         */
1445        bfqd->last_ins_in_burst = jiffies;
1446}
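
    /*
     * Illustrative walk-through (added note, not from the original
     * sources), assuming a hypothetical large-burst threshold of 3 and
     * no busy queue when the first queue is created: if queues Q1, Q2
     * and Q3 of the same group are created within bfq_burst_interval of
     * one another, Q1 re-initializes the burst list through
     * bfq_reset_burst_list() and becomes its only element; Q2 is
     * appended by bfq_add_to_burst(); the insertion of Q3 makes the
     * burst size reach the threshold, so Q1, Q2 and Q3 are all marked
     * as belonging to a large burst and the burst list is emptied. A
     * queue created much later than Q3 would instead restart the
     * detection from scratch.
     */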
1447
1448static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
1449{
1450        struct bfq_entity *entity = &bfqq->entity;
1451
1452        return entity->budget - entity->service;
1453}
1454
1455/*
1456 * If enough samples have been computed, return the current max budget
1457 * stored in bfqd, which is dynamically updated according to the
1458 * estimated disk peak rate; otherwise return the default max budget
1459 */
1460static int bfq_max_budget(struct bfq_data *bfqd)
1461{
1462        if (bfqd->budgets_assigned < bfq_stats_min_budgets)
1463                return bfq_default_max_budget;
1464        else
1465                return bfqd->bfq_max_budget;
1466}
1467
1468/*
1469 * Return min budget, which is a fraction of the current or default
1470 * max budget (trying with 1/32)
1471 */
1472static int bfq_min_budget(struct bfq_data *bfqd)
1473{
1474        if (bfqd->budgets_assigned < bfq_stats_min_budgets)
1475                return bfq_default_max_budget / 32;
1476        else
1477                return bfqd->bfq_max_budget / 32;
1478}
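
    /*
     * Illustrative sketch (not part of the original sources): whichever
     * branch is taken above, the two helpers keep the invariant
     * min_budget == max_budget / 32. The hypothetical helper below, if
     * it were compiled in, would just assert that relation.
     */
    static void __maybe_unused bfq_check_budget_invariant(struct bfq_data *bfqd)
    {
            /* both helpers use the same base value, so this never fires */
            WARN_ON_ONCE(bfq_min_budget(bfqd) != bfq_max_budget(bfqd) / 32);
    }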
1479
1480/*
1481 * The next function, invoked after the input queue bfqq switches from
1482 * idle to busy, updates the budget of bfqq. The function also tells
1483 * whether the in-service queue should be expired, by returning
1484 * true. The purpose of expiring the in-service queue is to give bfqq
1485 * the chance to possibly preempt the in-service queue, and the reason
1486 * for preempting the in-service queue is to achieve one of the two
1487 * goals below.
1488 *
1489 * 1. Guarantee to bfqq its reserved bandwidth even if bfqq has
1490 * expired because it has remained idle. In particular, bfqq may have
1491 * expired for one of the following two reasons:
1492 *
1493 * - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling
1494 *   and did not manage to issue a new request before its last
1495 *   request was served;
1496 *
1497 * - BFQQE_TOO_IDLE bfqq did enjoy device idling, but did not issue
1498 *   a new request before the expiration of the idling-time.
1499 *
1500 * Even if bfqq has expired for one of the above reasons, the process
1501 * associated with the queue may however be issuing requests greedily,
1502 * and thus be sensitive to the bandwidth it receives (bfqq may have
1503 * remained idle for other reasons: high CPU load, bfqq not enjoying
1504 * idling, I/O throttling somewhere in the path from the process to
1505 * the I/O scheduler, ...). But if, after every expiration for one of
1506 * the above two reasons, bfqq has to wait for the service of at least
1507 * one full budget of another queue before being served again, then
1508 * bfqq is likely to get a much lower bandwidth or resource time than
1509 * its reserved ones. To address this issue, two countermeasures need
1510 * to be taken.
1511 *
1512 * First, the budget and the timestamps of bfqq need to be updated in
1513 * a special way on bfqq reactivation: they need to be updated as if
1514 * bfqq did not remain idle and did not expire. In fact, if they are
1515 * computed as if bfqq expired and remained idle until reactivation,
1516 * then the process associated with bfqq is treated as if, instead of
1517 * being greedy, it stopped issuing requests when bfqq remained idle,
1518 * and restarts issuing requests only on this reactivation. In other
1519 * words, the scheduler does not help the process recover the "service
1520 * hole" between bfqq expiration and reactivation. As a consequence,
1521 * the process receives a lower bandwidth than its reserved one. In
1522 * contrast, to recover this hole, the budget must be updated as if
1523 * bfqq was not expired at all before this reactivation, i.e., it must
1524 * be set to the value of the remaining budget when bfqq was
1525 * expired. Along the same line, timestamps need to be assigned the
1526 * value they had the last time bfqq was selected for service, i.e.,
1527 * before last expiration. Thus timestamps need to be back-shifted
1528 * with respect to their normal computation (see [1] for more details
1529 * on this tricky aspect).
1530 *
1531 * Secondly, to allow the process to recover the hole, the in-service
1532 * queue must be expired too, to give bfqq the chance to preempt it
1533 * immediately. In fact, if bfqq has to wait for a full budget of the
1534 * in-service queue to be completed, then it may become impossible to
1535 * let the process recover the hole, even if the back-shifted
1536 * timestamps of bfqq are lower than those of the in-service queue. If
1537 * this happens for most or all of the holes, then the process may not
1538 * receive its reserved bandwidth. In this respect, it is worth noting
1539 * that, since the service of outstanding requests is not preemptible,
1540 * a small fraction of the holes may however be unrecoverable, thereby
1541 * causing a small loss of bandwidth.
1542 *
1543 * The last important point is detecting whether bfqq does need this
1544 * bandwidth recovery. In this respect, the next function deems the
1545 * process associated with bfqq greedy, and thus allows it to recover
1546 * the hole, if: 1) the process is waiting for the arrival of a new
1547 * request (which implies that bfqq expired for one of the above two
1548 * reasons), and 2) such a request has arrived soon enough. The first
1549 * condition is controlled through the flag non_blocking_wait_rq,
1550 * while the second through the flag arrived_in_time. If both
1551 * conditions hold, then the function computes the budget in the
1552 * above-described special way, and signals that the in-service queue
1553 * should be expired. Timestamp back-shifting is done later in
1554 * __bfq_activate_entity.
1555 *
1556 * 2. Reduce latency. Even if timestamps are not back-shifted to let
1557 * the process associated with bfqq recover a service hole, bfqq may
1558 * however happen to have, after being (re)activated, a lower finish
1559 * timestamp than the in-service queue.  That is, the next budget of
1560 * bfqq may have to be completed before the one of the in-service
1561 * queue. If this is the case, then preempting the in-service queue
1562 * allows this goal to be achieved, apart from the unpreemptible,
1563 * outstanding requests mentioned above.
1564 *
1565 * Unfortunately, regardless of which of the above two goals one wants
1566 * to achieve, service trees need first to be updated to know whether
1567 * the in-service queue must be preempted. To have service trees
1568 * correctly updated, the in-service queue must be expired and
1569 * rescheduled, and bfqq must be scheduled too. This is one of the
1570 * most costly operations (in future versions, the scheduling
1571 * mechanism may be re-designed in such a way as to make it possible to
1572 * know whether preemption is needed without needing to update service
1573 * trees). In addition, queue preemptions almost always cause random
1574 * I/O, which may in turn cause loss of throughput. Finally, there may
1575 * even be no in-service queue when the next function is invoked (so,
1576 * no queue to compare timestamps with). Because of these facts, the
1577 * next function adopts the following simple scheme to avoid costly
1578 * operations, too frequent preemptions and too many dependencies on
1579 * the state of the scheduler: it requests the expiration of the
1580 * in-service queue (unconditionally) only for queues that need to
1581 * recover a hole. Then it delegates to other parts of the code the
1582 * responsibility of handling the above case 2.
1583 */
1584static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
1585                                                struct bfq_queue *bfqq,
1586                                                bool arrived_in_time)
1587{
1588        struct bfq_entity *entity = &bfqq->entity;
1589
1590        /*
1591         * In the next compound condition, we check also whether there
1592         * is some budget left, because otherwise there is no point in
1593         * trying to go on serving bfqq with this same budget: bfqq
1594         * would be expired immediately after being selected for
1595         * service. This would only cause useless overhead.
1596         */
1597        if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&
1598            bfq_bfqq_budget_left(bfqq) > 0) {
1599                /*
1600                 * We do not clear the flag non_blocking_wait_rq here, as
1601                 * the latter is used in bfq_activate_bfqq to signal
1602                 * that timestamps need to be back-shifted (and is
1603                 * cleared right after).
1604                 */
1605
1606                /*
1607                 * In the next assignment we rely on the fact that
1608                 * neither entity->service nor entity->budget is updated
1609                 * on expiration if bfqq is empty (see
1610                 * __bfq_bfqq_recalc_budget). Thus both quantities
1611                 * remain unchanged after such an expiration, and the
1612                 * following statement therefore assigns to
1613                 * entity->budget the remaining budget on such an
1614                 * expiration.
1615                 */
1616                entity->budget = min_t(unsigned long,
1617                                       bfq_bfqq_budget_left(bfqq),
1618                                       bfqq->max_budget);
1619
1620                /*
1621                 * At this point, we have used entity->service to get
1622                 * the budget left (needed for updating
1623                 * entity->budget). Thus we finally can, and have to,
1624                 * reset entity->service. The latter must be reset
1625                 * because bfqq would otherwise be charged again for
1626                 * the service it has received during its previous
1627                 * service slot(s).
1628                 */
1629                entity->service = 0;
1630
1631                return true;
1632        }
1633
1634        /*
1635         * We can finally complete expiration, by setting service to 0.
1636         */
1637        entity->service = 0;
1638        entity->budget = max_t(unsigned long, bfqq->max_budget,
1639                               bfq_serv_to_charge(bfqq->next_rq, bfqq));
1640        bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
1641        return false;
1642}
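
    /*
     * Summary (added note): the first branch above applies only when bfqq
     * was waiting for new I/O (non_blocking_wait_rq) and that I/O arrived
     * in time, with some budget still left; in that case the leftover
     * budget is retained (capped at bfqq->max_budget) and true is
     * returned, asking the caller to expire the in-service queue so that
     * bfqq may preempt it. In all other cases a fresh budget is assigned
     * and false is returned. entity->service is reset in both branches.
     */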
1643
1644/*
1645 * Return the farthest past time instant according to jiffies
1646 * macros.
1647 */
1648static unsigned long bfq_smallest_from_now(void)
1649{
1650        return jiffies - MAX_JIFFY_OFFSET;
1651}
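
    /*
     * Usage note (added for clarity): the value above acts as a "minus
     * infinity" for jiffies-based comparisons; any reasonable duration
     * added to it still lies far in the past for the time_is_*_jiffies()
     * macros, which is exploited, e.g., when resetting
     * wr_start_at_switch_to_srt below.
     */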
1652
1653static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
1654                                             struct bfq_queue *bfqq,
1655                                             unsigned int old_wr_coeff,
1656                                             bool wr_or_deserves_wr,
1657                                             bool interactive,
1658                                             bool in_burst,
1659                                             bool soft_rt)
1660{
1661        if (old_wr_coeff == 1 && wr_or_deserves_wr) {
1662                /* start a weight-raising period */
1663                if (interactive) {
1664                        bfqq->service_from_wr = 0;
1665                        bfqq->wr_coeff = bfqd->bfq_wr_coeff;
1666                        bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
1667                } else {
1668                        /*
1669                         * No interactive weight raising in progress
1670                         * here: assign minus infinity to
1671                         * wr_start_at_switch_to_srt, to make sure
1672                         * that, at the end of the soft-real-time
1673                         * weight-raising period that is starting
1674                         * now, no interactive weight-raising period
1675                         * may be wrongly considered as still in
1676                         * progress (and thus actually started by
1677                         * mistake).
1678                         */
1679                        bfqq->wr_start_at_switch_to_srt =
1680                                bfq_smallest_from_now();
1681                        bfqq->wr_coeff = bfqd->bfq_wr_coeff *
1682                                BFQ_SOFTRT_WEIGHT_FACTOR;
1683                        bfqq->wr_cur_max_time =
1684                                bfqd->bfq_wr_rt_max_time;
1685                }
1686
1687                /*
1688                 * If needed, further reduce budget to make sure it is
1689                 * close to bfqq's backlog, so as to reduce the
1690                 * scheduling-error component due to a too large
1691                 * budget. Do not care about throughput consequences,
1692                 * but only about latency. Finally, do not assign too
1693                 * small a budget either, to avoid increasing
1694                 * latency by causing too frequent expirations.
1695                 */
1696                bfqq->entity.budget = min_t(unsigned long,
1697                                            bfqq->entity.budget,
1698                                            2 * bfq_min_budget(bfqd));
1699        } else if (old_wr_coeff > 1) {
1700                if (interactive) { /* update wr coeff and duration */
1701                        bfqq->wr_coeff = bfqd->bfq_wr_coeff;
1702                        bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
1703                } else if (in_burst)
1704                        bfqq->wr_coeff = 1;
1705                else if (soft_rt) {
1706                        /*
1707                         * The application is now or still meeting the
1708                         * requirements for being deemed soft rt.  We
1709                         * can then correctly and safely (re)charge
1710                         * the weight-raising duration for the
1711                         * application with the weight-raising
1712                         * duration for soft rt applications.
1713                         *
1714                         * In particular, doing this recharge now, i.e.,
1715                         * before the weight-raising period for the
1716                         * application finishes, reduces the probability
1717                         * of the following negative scenario:
1718                         * 1) the weight of a soft rt application is
1719                         *    raised at startup (as for any newly
1720                         *    created application),
1721                         * 2) since the application is not interactive,
1722                         *    at a certain time weight-raising is
1723                         *    stopped for the application,
1724                         * 3) at that time the application happens to
1725                         *    still have pending requests, and hence
1726                         *    is destined to not have a chance to be
1727                         *    deemed soft rt before these requests are
1728                         *    completed (see the comments to the
1729                         *    function bfq_bfqq_softrt_next_start()
1730                         *    for details on soft rt detection),
1731                         * 4) these pending requests experience a high
1732                         *    latency because the application is not
1733                         *    weight-raised while they are pending.
1734                         */
1735                        if (bfqq->wr_cur_max_time !=
1736                                bfqd->bfq_wr_rt_max_time) {
1737                                bfqq->wr_start_at_switch_to_srt =
1738                                        bfqq->last_wr_start_finish;
1739
1740                                bfqq->wr_cur_max_time =
1741                                        bfqd->bfq_wr_rt_max_time;
1742                                bfqq->wr_coeff = bfqd->bfq_wr_coeff *
1743                                        BFQ_SOFTRT_WEIGHT_FACTOR;
1744                        }
1745                        bfqq->last_wr_start_finish = jiffies;
1746                }
1747        }
1748}
1749
1750static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,
1751                                        struct bfq_queue *bfqq)
1752{
1753        return bfqq->dispatched == 0 &&
1754                time_is_before_jiffies(
1755                        bfqq->budget_timeout +
1756                        bfqd->bfq_wr_min_idle_time);
1757}
1758
1759
1760/*
1761 * Return true if bfqq is in a higher priority class, or has a higher
1762 * weight than the in-service queue.
1763 */
1764static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,
1765                                            struct bfq_queue *in_serv_bfqq)
1766{
1767        int bfqq_weight, in_serv_weight;
1768
1769        if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)
1770                return true;
1771
1772        if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {
1773                bfqq_weight = bfqq->entity.weight;
1774                in_serv_weight = in_serv_bfqq->entity.weight;
1775        } else {
1776                if (bfqq->entity.parent)
1777                        bfqq_weight = bfqq->entity.parent->weight;
1778                else
1779                        bfqq_weight = bfqq->entity.weight;
1780                if (in_serv_bfqq->entity.parent)
1781                        in_serv_weight = in_serv_bfqq->entity.parent->weight;
1782                else
1783                        in_serv_weight = in_serv_bfqq->entity.weight;
1784        }
1785
1786        return bfqq_weight > in_serv_weight;
1787}
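
    /*
     * Example (added note): since lower ioprio_class values denote
     * higher-priority classes, an RT-class bfqq is deemed higher than a
     * BE-class in-service queue here regardless of weights; otherwise the
     * comparison falls back to the queue weights, or to the parent-group
     * weights if the two queues belong to different groups.
     */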
1788
1789/*
1790 * Get the index of the actuator that will serve bio.
1791 */
1792static unsigned int bfq_actuator_index(struct bfq_data *bfqd, struct bio *bio)
1793{
1794        unsigned int i;
1795        sector_t end;
1796
1797        /* no search needed if one or zero ranges present */
1798        if (bfqd->num_actuators == 1)
1799                return 0;
1800
1801        /* bio_end_sector(bio) gives the sector after the last one */
1802        end = bio_end_sector(bio) - 1;
1803
1804        for (i = 0; i < bfqd->num_actuators; i++) {
1805                if (end >= bfqd->sector[i] &&
1806                    end < bfqd->sector[i] + bfqd->nr_sectors[i])
1807                        return i;
1808        }
1809
1810        WARN_ONCE(true,
1811                  "bfq_actuator_index: bio sector out of ranges: end=%llu\n",
1812                  end);
1813        return 0;
1814}
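
    /*
     * Hypothetical example (added note): on a drive exposing two actuator
     * ranges, say sector[0] = 0 with nr_sectors[0] = 1000 and
     * sector[1] = 1000 with nr_sectors[1] = 1000, a bio whose last sector
     * is 1500 falls in the second range, so the function returns 1.
     */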
1815
1816static bool bfq_better_to_idle(struct bfq_queue *bfqq);
1817
1818static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
1819                                             struct bfq_queue *bfqq,
1820                                             int old_wr_coeff,
1821                                             struct request *rq,
1822                                             bool *interactive)
1823{
1824        bool soft_rt, in_burst, wr_or_deserves_wr,
1825                bfqq_wants_to_preempt,
1826                idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
1827                /*
1828                 * See the comments on
1829                 * bfq_bfqq_update_budg_for_activation for
1830                 * details on the usage of the next variable.
1831                 */
1832                arrived_in_time =  ktime_get_ns() <=
1833                        bfqq->ttime.last_end_request +
1834                        bfqd->bfq_slice_idle * 3;
1835        unsigned int act_idx = bfq_actuator_index(bfqd, rq->bio);
1836        bool bfqq_non_merged_or_stably_merged =
1837                bfqq->bic || RQ_BIC(rq)->bfqq_data[act_idx].stably_merged;
1838
1839        /*
1840         * bfqq deserves to be weight-raised if:
1841         * - it is sync,
1842         * - it does not belong to a large burst,
1843         * - it has been idle for enough time or is soft real-time,
1844         * - it is linked to a bfq_io_cq (it is not shared in any sense),
1845         * - it has a default weight (otherwise we assume the user wanted
1846         *   to control its weight explicitly)
1847         */
1848        in_burst = bfq_bfqq_in_large_burst(bfqq);
1849        soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
1850                !BFQQ_TOTALLY_SEEKY(bfqq) &&
1851                !in_burst &&
1852                time_is_before_jiffies(bfqq->soft_rt_next_start) &&
1853                bfqq->dispatched == 0 &&
1854                bfqq->entity.new_weight == 40;
1855        *interactive = !in_burst && idle_for_long_time &&
1856                bfqq->entity.new_weight == 40;
1857        /*
1858         * Merged bfq_queues are kept out of weight-raising
1859         * (low-latency) mechanisms. The reason is that these queues
1860         * are usually created for non-interactive and
1861         * non-soft-real-time tasks. Yet this is not the case for
1862         * stably-merged queues. These queues are merged just because
1863         * they are created shortly after each other. So they may
1864         * easily serve the I/O of an interactive or soft-real time
1865         * application, if the application happens to spawn multiple
1866         * processes. So let stably-merged queues also enjoy weight
1867         * raising.
1868         */
1869        wr_or_deserves_wr = bfqd->low_latency &&
1870                (bfqq->wr_coeff > 1 ||
1871                 (bfq_bfqq_sync(bfqq) && bfqq_non_merged_or_stably_merged &&
1872                  (*interactive || soft_rt)));
1873
1874        /*
1875         * Using the last flag, update budget and check whether bfqq
1876         * may want to preempt the in-service queue.
1877         */
1878        bfqq_wants_to_preempt =
1879                bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
1880                                                    arrived_in_time);
1881
1882        /*
1883         * If bfqq happened to be activated in a burst, but has been
1884         * idle for much longer than an interactive queue, then we
1885         * assume that, in the overall I/O initiated in the burst, the
1886         * I/O associated with bfqq is finished. So bfqq does not need
1887         * to be treated as a queue belonging to a burst
1888         * anymore. Accordingly, we reset bfqq's in_large_burst flag
1889         * if set, and remove bfqq from the burst list if it's
1890         * there. We do not decrement burst_size, because the fact
1891         * that bfqq does not need to belong to the burst list any
1892         * more does not invalidate the fact that bfqq was created in
1893         * a burst.
1894         */
1895        if (likely(!bfq_bfqq_just_created(bfqq)) &&
1896            idle_for_long_time &&
1897            time_is_before_jiffies(
1898                    bfqq->budget_timeout +
1899                    msecs_to_jiffies(10000))) {
1900                hlist_del_init(&bfqq->burst_list_node);
1901                bfq_clear_bfqq_in_large_burst(bfqq);
1902        }
1903
1904        bfq_clear_bfqq_just_created(bfqq);
1905
1906        if (bfqd->low_latency) {
1907                if (unlikely(time_is_after_jiffies(bfqq->split_time)))
1908                        /* wraparound */
1909                        bfqq->split_time =
1910                                jiffies - bfqd->bfq_wr_min_idle_time - 1;
1911
1912                if (time_is_before_jiffies(bfqq->split_time +
1913                                           bfqd->bfq_wr_min_idle_time)) {
1914                        bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
1915                                                         old_wr_coeff,
1916                                                         wr_or_deserves_wr,
1917                                                         *interactive,
1918                                                         in_burst,
1919                                                         soft_rt);
1920
1921                        if (old_wr_coeff != bfqq->wr_coeff)
1922                                bfqq->entity.prio_changed = 1;
1923                }
1924        }
1925
1926        bfqq->last_idle_bklogged = jiffies;
1927        bfqq->service_from_backlogged = 0;
1928        bfq_clear_bfqq_softrt_update(bfqq);
1929
1930        bfq_add_bfqq_busy(bfqq);
1931
1932        /*
1933         * Expire in-service queue if preemption may be needed for
1934         * guarantees or throughput. As for guarantees, we care
1935         * explicitly about two cases. The first is that bfqq has to
1936         * recover a service hole, as explained in the comments on
1937         * bfq_bfqq_update_budg_for_activation(), i.e., that
1938         * bfqq_wants_to_preempt is true. However, if bfqq does not
1939         * carry time-critical I/O, then bfqq's bandwidth is less
1940         * important than that of queues that carry time-critical I/O.
1941         * So, as a further constraint, we consider this case only if
1942         * bfqq is at least as weight-raised, i.e., at least as time
1943         * critical, as the in-service queue.
1944         *
1945         * The second case is that bfqq is in a higher priority class,
1946         * or has a higher weight than the in-service queue. If this
1947         * condition does not hold, we don't care because, even if
1948         * bfqq does not start to be served immediately, the resulting
1949         * delay for bfqq's I/O is however lower or much lower than
1950         * the ideal completion time to be guaranteed to bfqq's I/O.
1951         *
1952         * In both cases, preemption is needed only if, according to
1953         * the timestamps of both bfqq and of the in-service queue,
1954         * bfqq actually is the next queue to serve. So, to reduce
1955         * useless preemptions, the return value of
1956         * next_queue_may_preempt() is considered in the next compound
1957         * condition too. Yet next_queue_may_preempt() just checks a
1958         * simple, necessary condition for bfqq to be the next queue
1959         * to serve. In fact, to evaluate a sufficient condition, the
1960         * timestamps of the in-service queue would need to be
1961         * updated, and this operation is quite costly (see the
1962         * comments on bfq_bfqq_update_budg_for_activation()).
1963         *
1964         * As for throughput, we ask bfq_better_to_idle() whether we
1965         * still need to plug I/O dispatching. If bfq_better_to_idle()
1966         * says no, then plugging is not needed any longer, either to
1967         * boost throughput or to preserve service guarantees. Then
1968         * the best option is to stop plugging I/O, as not doing so
1969         * would certainly lower throughput. We may end up in this
1970         * case if: (1) upon a dispatch attempt, we detected that it
1971         * was better to plug I/O dispatch, and to wait for a new
1972         * request to arrive for the currently in-service queue, but
1973         * (2) this switch of bfqq to busy changes the scenario.
1974         */
1975        if (bfqd->in_service_queue &&
1976            ((bfqq_wants_to_preempt &&
1977              bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||
1978             bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue) ||
1979             !bfq_better_to_idle(bfqd->in_service_queue)) &&
1980            next_queue_may_preempt(bfqd))
1981                bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
1982                                false, BFQQE_PREEMPTED);
1983}
1984
1985static void bfq_reset_inject_limit(struct bfq_data *bfqd,
1986                                   struct bfq_queue *bfqq)
1987{
1988        /* invalidate baseline total service time */
1989        bfqq->last_serv_time_ns = 0;
1990
1991        /*
1992         * Reset pointer in case we are waiting for
1993         * some request completion.
1994         */
1995        bfqd->waited_rq = NULL;
1996
1997        /*
1998         * If bfqq has a short think time, then start by setting the
1999         * inject limit to 0 prudentially, because the service time of
2000         * an injected I/O request may be higher than the think time
2001         * of bfqq, and therefore, if one request is injected while
2002         * bfqq remains empty, this injected request might delay the
2003         * service of the next I/O request for bfqq significantly. In
2004         * case bfqq can actually tolerate some injection, then the
2005         * adaptive update will however raise the limit soon. This
2006         * lucky circumstance holds exactly because bfqq has a short
2007         * think time, and thus, after remaining empty, is likely to
2008         * get new I/O enqueued---and then completed---before being
2009         * expired. This is the very pattern that gives the
2010         * limit-update algorithm the chance to measure the effect of
2011         * injection on request service times, and then to update the
2012         * limit accordingly.
2013         *
2014         * However, in the following special case, the inject limit is
2015         * left to 1 even if the think time is short: bfqq's I/O is
2016         * synchronized with that of some other queue, i.e., bfqq may
2017         * receive new I/O only after the I/O of the other queue is
2018         * completed. Keeping the inject limit to 1 allows the
2019         * blocking I/O to be served while bfqq is in service. And
2020         * this is very convenient both for bfqq and for overall
2021         * throughput, as explained in detail in the comments in
2022         * bfq_update_has_short_ttime().
2023         *
2024         * On the opposite end, if bfqq has a long think time, then
2025         * start directly with 1, because:
2026         * a) on the bright side, keeping at most one request in
2027         * service in the drive is unlikely to cause any harm to the
2028         * latency of bfqq's requests, as the service time of a single
2029         * request is likely to be lower than the think time of bfqq;
2030         * b) on the downside, after becoming empty, bfqq is likely to
2031         * expire before getting its next request. With this request
2032         * arrival pattern, it is very hard to sample total service
2033         * times and update the inject limit accordingly (see comments
2034         * on bfq_update_inject_limit()). So the limit is likely to be
2035         * never, or at least seldom, updated.  As a consequence, by
2036         * setting the limit to 1, we avoid the risk that no injection
2037         * ever occurs with bfqq. On the downside, this proactive step
2038         * further reduces chances to actually compute the baseline
2039         * total service time. Thus it reduces chances to execute the
2040         * limit-update algorithm and possibly raise the limit to more
2041         * than 1.
2042         */
2043        if (bfq_bfqq_has_short_ttime(bfqq))
2044                bfqq->inject_limit = 0;
2045        else
2046                bfqq->inject_limit = 1;
2047
2048        bfqq->decrease_time_jif = jiffies;
2049}
2050
2051static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)
2052{
2053        u64 tot_io_time = now_ns - bfqq->io_start_time;
2054
2055        if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfqq->dispatched == 0)
2056                bfqq->tot_idle_time +=
2057                        now_ns - bfqq->ttime.last_end_request;
2058
2059        if (unlikely(bfq_bfqq_just_created(bfqq)))
2060                return;
2061
2062        /*
2063         * Must be busy for at least about 80% of the time to be
2064         * considered I/O bound.
2065         */
2066        if (bfqq->tot_idle_time * 5 > tot_io_time)
2067                bfq_clear_bfqq_IO_bound(bfqq);
2068        else
2069                bfq_mark_bfqq_IO_bound(bfqq);
2070
2071        /*
2072         * Keep an observation window of at most 200 ms in the past
2073         * from now.
2074         */
2075        if (tot_io_time > 200 * NSEC_PER_MSEC) {
2076                bfqq->io_start_time = now_ns - (tot_io_time>>1);
2077                bfqq->tot_idle_time >>= 1;
2078        }
2079}
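
    /*
     * Worked example (added note): over an observation window of
     * tot_io_time = 100 ms, a bfqq that accumulated 25 ms of idle time
     * satisfies 25 * 5 > 100, so its IO_bound flag is cleared; with 20 ms
     * of idle time or less the flag would be set instead.
     */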
2080
2081/*
2082 * Detect whether bfqq's I/O seems synchronized with that of some
2083 * other queue, i.e., whether bfqq, after remaining empty, happens to
2084 * receive new I/O only right after some I/O request of the other
2085 * queue has been completed. We call waker queue the other queue, and
2086 * we assume, for simplicity, that bfqq may have at most one waker
2087 * queue.
2088 *
2089 * A remarkable throughput boost can be reached by unconditionally
2090 * injecting the I/O of the waker queue, every time a new
2091 * bfq_dispatch_request happens to be invoked while I/O is being
2092 * plugged for bfqq.  In addition to boosting throughput, this
2093 * unblocks bfqq's I/O, thereby improving bandwidth and latency for
2094 * bfqq. Note that these same results may be achieved with the general
2095 * injection mechanism, but less effectively. For details on this
2096 * aspect, see the comments on the choice of the queue for injection
2097 * in bfq_select_queue().
2098 *
2099 * Turning back to the detection of a waker queue, a queue Q is deemed a
2100 * waker queue for bfqq if, for three consecutive times, bfqq happens to become
2101 * non empty right after a request of Q has been completed, within a given
2102 * timeout. In this respect, even if bfqq is empty, we do not check for a waker
2103 * if it still has some in-flight I/O. In fact, in this case bfqq is actually
2104 * still being served by the drive, and may receive new I/O on the completion
2105 * of some of the in-flight requests. In particular, on the first time, Q is
2106 * tentatively set as a candidate waker queue, while on the third consecutive
2107 * time that Q is detected, the field waker_bfqq is set to Q, to confirm that Q
2108 * is a waker queue for bfqq. These detection steps are performed only if bfqq
2109 * has a long think time, so as to make it more likely that bfqq's I/O is
2110 * actually being blocked by a synchronization. This last filter, plus the
2111 * above three-times requirement and time limit for detection, make false
2112 * positives less likely.
2113 *
2114 * NOTE
2115 *
2116 * The sooner a waker queue is detected, the sooner throughput can be
2117 * boosted by injecting I/O from the waker queue. Fortunately,
2118 * detection is likely to be actually fast, for the following
2119 * reasons. While blocked by synchronization, bfqq has a long think
2120 * time. This implies that bfqq's inject limit is at least equal to 1
2121 * (see the comments in bfq_update_inject_limit()). So, thanks to
2122 * injection, the waker queue is likely to be served during the very
2123 * first I/O-plugging time interval for bfqq. This triggers the first
2124 * step of the detection mechanism. Thanks again to injection, the
2125 * candidate waker queue is then likely to be confirmed no later than
2126 * during the next I/O-plugging interval for bfqq.
2127 *
2128 * ISSUE
2129 *
2130 * On queue merging all waker information is lost.
2131 */
2132static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,
2133                            u64 now_ns)
2134{
2135        char waker_name[MAX_BFQQ_NAME_LENGTH];
2136
2137        if (!bfqd->last_completed_rq_bfqq ||
2138            bfqd->last_completed_rq_bfqq == bfqq ||
2139            bfq_bfqq_has_short_ttime(bfqq) ||
2140            now_ns - bfqd->last_completion >= 4 * NSEC_PER_MSEC ||
2141            bfqd->last_completed_rq_bfqq == &bfqd->oom_bfqq ||
2142            bfqq == &bfqd->oom_bfqq)
2143                return;
2144
2145        /*
2146         * We also reset the waker detection logic if too much time has passed
2147         * since the first detection. If wakeups are rare, pointless idling
2148         * doesn't hurt throughput that much. The condition below makes sure
2149         * we do not uselessly idle waiting for a blocking waker in more than 1/64 of cases.
2150         */
2151        if (bfqd->last_completed_rq_bfqq !=
2152            bfqq->tentative_waker_bfqq ||
2153            now_ns > bfqq->waker_detection_started +
2154                                        128 * (u64)bfqd->bfq_slice_idle) {
2155                /*
2156                 * First synchronization detected with a
2157                 * candidate waker queue, or with a different
2158                 * candidate waker queue from the current one.
2159                 */
2160                bfqq->tentative_waker_bfqq =
2161                        bfqd->last_completed_rq_bfqq;
2162                bfqq->num_waker_detections = 1;
2163                bfqq->waker_detection_started = now_ns;
2164                bfq_bfqq_name(bfqq->tentative_waker_bfqq, waker_name,
2165                              MAX_BFQQ_NAME_LENGTH);
2166                bfq_log_bfqq(bfqd, bfqq, "set tentative waker %s", waker_name);
2167        } else /* Same tentative waker queue detected again */
2168                bfqq->num_waker_detections++;
2169
2170        if (bfqq->num_waker_detections == 3) {
2171                bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
2172                bfqq->tentative_waker_bfqq = NULL;
2173                bfq_bfqq_name(bfqq->waker_bfqq, waker_name,
2174                              MAX_BFQQ_NAME_LENGTH);
2175                bfq_log_bfqq(bfqd, bfqq, "set waker %s", waker_name);
2176
2177                /*
2178                 * If the waker queue disappears, then
2179                 * bfqq->waker_bfqq must be reset. To
2180                 * this goal, we maintain in each
2181                 * waker queue a list, woken_list, of
2182                 * all the queues that reference the
2183                 * waker queue through their
2184                 * waker_bfqq pointer. When the waker
2185                 * queue exits, the waker_bfqq pointer
2186                 * of all the queues in the woken_list
2187                 * is reset.
2188                 *
2189                 * In addition, if bfqq is already in
2190                 * the woken_list of a waker queue,
2191                 * then, before being inserted into
2192                 * the woken_list of a new waker
2193                 * queue, bfqq must be removed from
2194                 * the woken_list of the old waker
2195                 * queue.
2196                 */
2197                if (!hlist_unhashed(&bfqq->woken_list_node))
2198                        hlist_del_init(&bfqq->woken_list_node);
2199                hlist_add_head(&bfqq->woken_list_node,
2200                               &bfqd->last_completed_rq_bfqq->woken_list);
2201        }
2202}
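
    /*
     * Illustrative timeline (added note): suppose bfqq has a long think
     * time and, within 4 ms of the completion of a request of some queue
     * W, bfqq receives new I/O; W then becomes bfqq's tentative waker. If
     * the same pattern repeats on the next two qualifying arrivals, not
     * too long after the first detection, the third detection promotes W
     * to bfqq->waker_bfqq and bfqq is linked into W's woken_list.
     */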
2203
2204static void bfq_add_request(struct request *rq)
2205{
2206        struct bfq_queue *bfqq = RQ_BFQQ(rq);
2207        struct bfq_data *bfqd = bfqq->bfqd;
2208        struct request *next_rq, *prev;
2209        unsigned int old_wr_coeff = bfqq->wr_coeff;
2210        bool interactive = false;
2211        u64 now_ns = ktime_get_ns();
2212
2213        bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
2214        bfqq->queued[rq_is_sync(rq)]++;
2215        /*
2216         * Updating of 'bfqd->queued' is protected by 'bfqd->lock', however, it
2217         * may be read without holding the lock in bfq_has_work().
2218         */
2219        WRITE_ONCE(bfqd->queued, bfqd->queued + 1);
2220
2221        if (bfq_bfqq_sync(bfqq) && RQ_BIC(rq)->requests <= 1) {
2222                bfq_check_waker(bfqd, bfqq, now_ns);
2223
2224                /*
2225                 * Periodically reset inject limit, to make sure that
2226                 * the latter eventually drops in case workload
2227                 * changes, see step (3) in the comments on
2228                 * bfq_update_inject_limit().
2229                 */
2230                if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
2231                                             msecs_to_jiffies(1000)))
2232                        bfq_reset_inject_limit(bfqd, bfqq);
2233
2234                /*
2235                 * The following conditions must hold to setup a new
2236                 * sampling of total service time, and then a new
2237                 * update of the inject limit:
2238                 * - bfqq is in service, because the total service
2239                 *   time is evaluated only for the I/O requests of
2240                 *   the queues in service;
2241                 * - this is the right occasion to compute or to
2242                 *   lower the baseline total service time, because
2243                 *   there are actually no requests in the drive,
2244                 *   or
2245                 *   the baseline total service time is available, and
2246                 *   this is the right occasion to compute the other
2247                 *   quantity needed to update the inject limit, i.e.,
2248                 *   the total service time caused by the amount of
2249                 *   injection allowed by the current value of the
2250                 *   limit. It is the right occasion because injection
2251                 *   has actually been performed during the service
2252                 *   hole, and there are still in-flight requests,
2253                 *   which are very likely to be exactly the injected
2254                 *   requests, or part of them;
2255                 * - the minimum interval for sampling the total
2256                 *   service time and updating the inject limit has
2257                 *   elapsed.
2258                 */
2259                if (bfqq == bfqd->in_service_queue &&
2260                    (bfqd->tot_rq_in_driver == 0 ||
2261                     (bfqq->last_serv_time_ns > 0 &&
2262                      bfqd->rqs_injected && bfqd->tot_rq_in_driver > 0)) &&
2263                    time_is_before_eq_jiffies(bfqq->decrease_time_jif +
2264                                              msecs_to_jiffies(10))) {
2265                        bfqd->last_empty_occupied_ns = ktime_get_ns();
2266                        /*
2267                         * Start the state machine for measuring the
2268                         * total service time of rq: setting
2269                         * wait_dispatch will cause bfqd->waited_rq to
2270                         * be set when rq will be dispatched.
2271                         */
2272                        bfqd->wait_dispatch = true;
2273                        /*
2274                         * If there is no I/O in service in the drive,
2275                         * then any injection that occurred before the
2276                         * arrival of rq will not affect the total
2277                         * service time of rq. So the injection limit
2278                         * must not be updated as a function of such
2279                         * total service time, unless new injection
2280                         * occurs before rq is completed. To have the
2281                         * injection limit updated only in the latter
2282                         * case, reset rqs_injected here (rqs_injected
2283                         * will be set in case injection is performed
2284                         * on bfqq before rq is completed).
2285                         */
2286                        if (bfqd->tot_rq_in_driver == 0)
2287                                bfqd->rqs_injected = false;
2288                }
2289        }
2290
2291        if (bfq_bfqq_sync(bfqq))
2292                bfq_update_io_intensity(bfqq, now_ns);
2293
2294        elv_rb_add(&bfqq->sort_list, rq);
2295
2296        /*
2297         * Check if this request is a better next-serve candidate.
2298         */
2299        prev = bfqq->next_rq;
2300        next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
2301        bfqq->next_rq = next_rq;
2302
2303        /*
2304         * Adjust priority tree position, if next_rq changes.
2305         * See comments on bfq_pos_tree_add_move() for the unlikely().
2306         */
2307        if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
2308                bfq_pos_tree_add_move(bfqd, bfqq);
2309
2310        if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
2311                bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,
2312                                                 rq, &interactive);
2313        else {
2314                if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
2315                    time_is_before_jiffies(
2316                                bfqq->last_wr_start_finish +
2317                                bfqd->bfq_wr_min_inter_arr_async)) {
2318                        bfqq->wr_coeff = bfqd->bfq_wr_coeff;
2319                        bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
2320
2321                        bfqd->wr_busy_queues++;
2322                        bfqq->entity.prio_changed = 1;
2323                }
2324                if (prev != bfqq->next_rq)
2325                        bfq_updated_next_req(bfqd, bfqq);
2326        }
2327
2328        /*
2329         * Assign jiffies to last_wr_start_finish in the following
2330         * cases:
2331         *
2332         * . if bfqq is not going to be weight-raised, because, for
2333         *   non weight-raised queues, last_wr_start_finish stores the
2334         *   arrival time of the last request; as of now, this piece
2335         *   of information is used only for deciding whether to
2336         *   weight-raise async queues
2337         *
2338         * . if bfqq is not weight-raised, because, if bfqq is now
2339         *   switching to weight-raised, then last_wr_start_finish
2340         *   stores the time when weight-raising starts
2341         *
2342         * . if bfqq is interactive, because, regardless of whether
2343         *   bfqq is currently weight-raised, the weight-raising
2344         *   period must start or restart (this case is considered
2345         *   separately because it is not detected by the above
2346         *   conditions, if bfqq is already weight-raised)
2347         *
2348         * last_wr_start_finish has to be updated also if bfqq is soft
2349         * real-time, because the weight-raising period is constantly
2350         * restarted on idle-to-busy transitions for these queues, but
2351         * this is already done in bfq_bfqq_handle_idle_busy_switch if
2352         * needed.
2353         */
2354        if (bfqd->low_latency &&
2355                (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
2356                bfqq->last_wr_start_finish = jiffies;
2357}
2358
2359static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
2360                                          struct bio *bio,
2361                                          struct request_queue *q)
2362{
2363        struct bfq_queue *bfqq = bfqd->bio_bfqq;
2364
2365
2366        if (bfqq)
2367                return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
2368
2369        return NULL;
2370}
2371
2372static sector_t get_sdist(sector_t last_pos, struct request *rq)
2373{
2374        if (last_pos)
2375                return abs(blk_rq_pos(rq) - last_pos);
2376
2377        return 0;
2378}
2379
2380static void bfq_remove_request(struct request_queue *q,
2381                               struct request *rq)
2382{
2383        struct bfq_queue *bfqq = RQ_BFQQ(rq);
2384        struct bfq_data *bfqd = bfqq->bfqd;
2385        const int sync = rq_is_sync(rq);
2386
2387        if (bfqq->next_rq == rq) {
2388                bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
2389                bfq_updated_next_req(bfqd, bfqq);
2390        }
2391
2392        if (rq->queuelist.prev != &rq->queuelist)
2393                list_del_init(&rq->queuelist);
2394        bfqq->queued[sync]--;
2395        /*
2396         * Updating of 'bfqd->queued' is protected by 'bfqd->lock', however, it
2397         * may be read without holding the lock in bfq_has_work().
2398         */
2399        WRITE_ONCE(bfqd->queued, bfqd->queued - 1);
2400        elv_rb_del(&bfqq->sort_list, rq);
2401
2402        elv_rqhash_del(q, rq);
2403        if (q->last_merge == rq)
2404                q->last_merge = NULL;
2405
2406        if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
2407                bfqq->next_rq = NULL;
2408
2409                if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {
2410                        bfq_del_bfqq_busy(bfqq, false);
2411                        /*
2412                         * bfqq emptied. In normal operation, when
2413                         * bfqq is empty, bfqq->entity.service and
2414                         * bfqq->entity.budget must contain,
2415                         * respectively, the service received and the
2416                         * budget used last time bfqq emptied. These
2417                         * facts do not hold in this case, as at least
2418                         * this last removal occurred while bfqq is
2419                         * not in service. To avoid inconsistencies,
2420                         * reset both bfqq->entity.service and
2421                         * bfqq->entity.budget, if bfqq has still a
2422                         * process that may issue I/O requests to it.
2423                         */
2424                        bfqq->entity.budget = bfqq->entity.service = 0;
2425                }
2426
2427                /*
2428                 * Remove queue from request-position tree as it is empty.
2429                 */
2430                if (bfqq->pos_root) {
2431                        rb_erase(&bfqq->pos_node, bfqq->pos_root);
2432                        bfqq->pos_root = NULL;
2433                }
2434        } else {
2435                /* see comments on bfq_pos_tree_add_move() for the unlikely() */
2436                if (unlikely(!bfqd->nonrot_with_queueing))
2437                        bfq_pos_tree_add_move(bfqd, bfqq);
2438        }
2439
2440        if (rq->cmd_flags & REQ_META)
2441                bfqq->meta_pending--;
2442
2443}
2444
2445static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
2446                unsigned int nr_segs)
2447{
2448        struct bfq_data *bfqd = q->elevator->elevator_data;
2449        struct request *free = NULL;
2450        /*
2451         * bfq_bic_lookup grabs the queue_lock: invoke it now and
2452         * store its return value for later use, to avoid nesting
2453         * queue_lock inside the bfqd->lock. We assume that the bic
2454         * returned by bfq_bic_lookup does not go away before
2455         * bfqd->lock is taken.
2456         */
2457        struct bfq_io_cq *bic = bfq_bic_lookup(q);
2458        bool ret;
2459
2460        spin_lock_irq(&bfqd->lock);
2461
2462        if (bic) {
2463                /*
2464                 * Make sure cgroup info is uptodate for current process before
2465                 * considering the merge.
2466                 */
2467                bfq_bic_update_cgroup(bic, bio);
2468
2469                bfqd->bio_bfqq = bic_to_bfqq(bic, op_is_sync(bio->bi_opf),
2470                                             bfq_actuator_index(bfqd, bio));
2471        } else {
2472                bfqd->bio_bfqq = NULL;
2473        }
2474        bfqd->bio_bic = bic;
2475
2476        ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
2477
2478        spin_unlock_irq(&bfqd->lock);
2479        if (free)
2480                blk_mq_free_request(free);
2481
2482        return ret;
2483}
2484
2485static int bfq_request_merge(struct request_queue *q, struct request **req,
2486                             struct bio *bio)
2487{
2488        struct bfq_data *bfqd = q->elevator->elevator_data;
2489        struct request *__rq;
2490
2491        __rq = bfq_find_rq_fmerge(bfqd, bio, q);
2492        if (__rq && elv_bio_merge_ok(__rq, bio)) {
2493                *req = __rq;
2494
2495                if (blk_discard_mergable(__rq))
2496                        return ELEVATOR_DISCARD_MERGE;
2497                return ELEVATOR_FRONT_MERGE;
2498        }
2499
2500        return ELEVATOR_NO_MERGE;
2501}
2502
2503static void bfq_request_merged(struct request_queue *q, struct request *req,
2504                               enum elv_merge type)
2505{
2506        if (type == ELEVATOR_FRONT_MERGE &&
2507            rb_prev(&req->rb_node) &&
2508            blk_rq_pos(req) <
2509            blk_rq_pos(container_of(rb_prev(&req->rb_node),
2510                                    struct request, rb_node))) {
2511                struct bfq_queue *bfqq = RQ_BFQQ(req);
2512                struct bfq_data *bfqd;
2513                struct request *prev, *next_rq;
2514
2515                if (!bfqq)
2516                        return;
2517
2518                bfqd = bfqq->bfqd;
2519
2520                /* Reposition request in its sort_list */
2521                elv_rb_del(&bfqq->sort_list, req);
2522                elv_rb_add(&bfqq->sort_list, req);
2523
2524                /* Choose next request to be served for bfqq */
2525                prev = bfqq->next_rq;
2526                next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
2527                                         bfqd->last_position);
2528                bfqq->next_rq = next_rq;
2529                /*
2530                 * If next_rq changes, update both the queue's budget to
2531                 * fit the new request and the queue's position in its
2532                 * rq_pos_tree.
2533                 */
2534                if (prev != bfqq->next_rq) {
2535                        bfq_updated_next_req(bfqd, bfqq);
2536                        /*
2537                         * See comments on bfq_pos_tree_add_move() for
2538                         * the unlikely().
2539                         */
2540                        if (unlikely(!bfqd->nonrot_with_queueing))
2541                                bfq_pos_tree_add_move(bfqd, bfqq);
2542                }
2543        }
2544}
2545
2546/*
2547 * This function is called to notify the scheduler that the requests
2548 * rq and 'next' have been merged, with 'next' going away.  BFQ
2549 * exploits this hook to address the following issue: if 'next' has a
2550 * fifo_time lower than that of rq, then the fifo_time of rq must be set to
2551 * the value of 'next', to not forget the greater age of 'next'.
2552 *
2553 * NOTE: in this function we assume that rq is in a bfq_queue, based on
2554 * the fact that rq is picked from the hash table q->elevator->hash,
2555 * which, in its turn, is filled only with I/O requests present in
2556 * bfq_queues, while BFQ is in use for the request queue q. In fact,
2557 * the function that fills this hash table (elv_rqhash_add) is called
2558 * only by bfq_insert_request.
2559 */
2560static void bfq_requests_merged(struct request_queue *q, struct request *rq,
2561                                struct request *next)
2562{
2563        struct bfq_queue *bfqq = RQ_BFQQ(rq),
2564                *next_bfqq = RQ_BFQQ(next);
2565
2566        if (!bfqq)
2567                goto remove;
2568
2569        /*
2570         * If next and rq belong to the same bfq_queue and next is older
2571         * than rq, then reposition rq in the fifo (by substituting next
2572         * with rq). Otherwise, if next and rq belong to different
2573         * bfq_queues, never reposition rq: in fact, we would have to
2574         * reposition it with respect to next's position in its own fifo,
2575         * which would most certainly be too expensive with respect to
2576         * the benefits.
2577         */
2578        if (bfqq == next_bfqq &&
2579            !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
2580            next->fifo_time < rq->fifo_time) {
2581                list_del_init(&rq->queuelist);
2582                list_replace_init(&next->queuelist, &rq->queuelist);
2583                rq->fifo_time = next->fifo_time;
2584        }
2585
2586        if (bfqq->next_rq == next)
2587                bfqq->next_rq = rq;
2588
2589        bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags);
2590remove:
2591        /* Merged request may be in the IO scheduler. Remove it. */
2592        if (!RB_EMPTY_NODE(&next->rb_node)) {
2593                bfq_remove_request(next->q, next);
2594                if (next_bfqq)
2595                        bfqg_stats_update_io_remove(bfqq_group(next_bfqq),
2596                                                    next->cmd_flags);
2597        }
2598}
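
    /*
     * A small illustration of the fifo_time handling above, with made-up
     * timestamps: if rq and next belong to the same bfq_queue, next was
     * queued at t = 100 ms and rq at t = 104 ms, then, after the merge,
     * rq takes next's place in the fifo and inherits fifo_time = 100 ms,
     * so the surviving request does not appear 4 ms younger than the
     * oldest I/O it now carries.
     */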
2599
2600/* Must be called with bfqq != NULL */
2601static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
2602{
2603        /*
2604         * If bfqq has been enjoying interactive weight-raising, then
2605         * reset soft_rt_next_start. We do it for the following
2606         * reason. bfqq may have been conveying the I/O needed to load
2607         * a soft real-time application. Such an application actually
2608         * exhibits a soft real-time I/O pattern after it finishes
2609         * loading, and finally starts doing its job. But, if bfqq has
2610         * been receiving a lot of bandwidth so far (likely to happen
2611         * on a fast device), then soft_rt_next_start now contains a
2612         * high value. So, without this reset, bfqq would be
2613         * prevented from being possibly considered as soft_rt for a
2614         * very long time.
2615         */
2616
2617        if (bfqq->wr_cur_max_time !=
2618            bfqq->bfqd->bfq_wr_rt_max_time)
2619                bfqq->soft_rt_next_start = jiffies;
2620
2621        if (bfq_bfqq_busy(bfqq))
2622                bfqq->bfqd->wr_busy_queues--;
2623        bfqq->wr_coeff = 1;
2624        bfqq->wr_cur_max_time = 0;
2625        bfqq->last_wr_start_finish = jiffies;
2626        /*
2627         * Trigger a weight change on the next invocation of
2628         * __bfq_entity_update_weight_prio.
2629         */
2630        bfqq->entity.prio_changed = 1;
2631}
2632
2633void bfq_end_wr_async_queues(struct bfq_data *bfqd,
2634                             struct bfq_group *bfqg)
2635{
2636        int i, j, k;
2637
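            /*
             * Presumably, i spans the two non-idle I/O-priority classes
             * (RT and BE), j the priority levels within each class, and
             * k the actuators; idle-class async queues are kept
             * separately in async_idle_bfqq[].
             */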
2638        for (k = 0; k < bfqd->num_actuators; k++) {
2639                for (i = 0; i < 2; i++)
2640                        for (j = 0; j < IOPRIO_NR_LEVELS; j++)
2641                                if (bfqg->async_bfqq[i][j][k])
2642                                        bfq_bfqq_end_wr(bfqg->async_bfqq[i][j][k]);
2643                if (bfqg->async_idle_bfqq[k])
2644                        bfq_bfqq_end_wr(bfqg->async_idle_bfqq[k]);
2645        }
2646}
2647
2648static void bfq_end_wr(struct bfq_data *bfqd)
2649{
2650        struct bfq_queue *bfqq;
2651        int i;
2652
2653        spin_lock_irq(&bfqd->lock);
2654
2655        for (i = 0; i < bfqd->num_actuators; i++) {
2656                list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
2657                        bfq_bfqq_end_wr(bfqq);
2658        }
2659        list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
2660                bfq_bfqq_end_wr(bfqq);
2661        bfq_end_wr_async(bfqd);
2662
2663        spin_unlock_irq(&bfqd->lock);
2664}
2665
2666static sector_t bfq_io_struct_pos(void *io_struct, bool request)
2667{
2668        if (request)
2669                return blk_rq_pos(io_struct);
2670        else
2671                return ((struct bio *)io_struct)->bi_iter.bi_sector;
2672}
2673
2674static int bfq_rq_close_to_sector(void *io_struct, bool request,
2675                                  sector_t sector)
2676{
2677        return abs(bfq_io_struct_pos(io_struct, request) - sector) <=
2678               BFQQ_CLOSE_THR;
2679}
2680
2681static struct bfq_queue *bfqq_find_close(struct bfq_data *bfqd,
2682                                         struct bfq_queue *bfqq,
2683                                         sector_t sector)
2684{
2685        struct rb_root *root = &bfqq_group(bfqq)->rq_pos_tree;
2686        struct rb_node *parent, *node;
2687        struct bfq_queue *__bfqq;
2688
2689        if (RB_EMPTY_ROOT(root))
2690                return NULL;
2691
2692        /*
2693         * First, if we find a request starting at the end of the last
2694         * request, choose it.
2695         */
2696        __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
2697        if (__bfqq)
2698                return __bfqq;
2699
2700        /*
2701         * If the exact sector wasn't found, the parent of the NULL leaf
2702         * will contain the closest sector (rq_pos_tree sorted by
2703         * next_request position).
2704         */
2705        __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
2706        if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
2707                return __bfqq;
2708
2709        if (blk_rq_pos(__bfqq->next_rq) < sector)
2710                node = rb_next(&__bfqq->pos_node);
2711        else
2712                node = rb_prev(&__bfqq->pos_node);
2713        if (!node)
2714                return NULL;
2715
2716        __bfqq = rb_entry(node, struct bfq_queue, pos_node);
2717        if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
2718                return __bfqq;
2719
2720        return NULL;
2721}
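
    /*
     * Sketch of the lookup above, with hypothetical sectors: suppose the
     * last request ended at sector 5000 and the tree holds queues whose
     * next requests start at 3000, 4990 and 9000. The exact-match lookup
     * fails, and the parent of the NULL leaf is one of the two neighbours
     * of 5000 in sector order (4990 or 9000). That neighbour is returned
     * if its next request lies within BFQQ_CLOSE_THR of 5000; otherwise
     * the neighbour on the other side gets the same check, and NULL is
     * returned if that fails too.
     */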
2722
2723static struct bfq_queue *bfq_find_close_cooperator(struct bfq_data *bfqd,
2724                                                   struct bfq_queue *cur_bfqq,
2725                                                   sector_t sector)
2726{
2727        struct bfq_queue *bfqq;
2728
2729        /*
2730         * We shall notice if some of the queues are cooperating,
2731         * e.g., working closely on the same area of the device. In
2732         * that case, we can group them together and: 1) don't waste
2733         * time idling, and 2) serve the union of their requests in
2734         * the best possible order for throughput.
2735         */
2736        bfqq = bfqq_find_close(bfqd, cur_bfqq, sector);
2737        if (!bfqq || bfqq == cur_bfqq)
2738                return NULL;
2739
2740        return bfqq;
2741}
2742
2743static struct bfq_queue *
2744bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
2745{
2746        int process_refs, new_process_refs;
2747        struct bfq_queue *__bfqq;
2748
2749        /*
2750         * If there are no process references on the new_bfqq, then it is
2751         * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
2752         * may have dropped their last reference (not just their last process
2753         * reference).
2754         */
2755        if (!bfqq_process_refs(new_bfqq))
2756                return NULL;
2757
2758        /* Avoid a circular list and skip interim queue merges. */
2759        while ((__bfqq = new_bfqq->new_bfqq)) {
2760                if (__bfqq == bfqq)
2761                        return NULL;
2762                new_bfqq = __bfqq;
2763        }
2764
2765        process_refs = bfqq_process_refs(bfqq);
2766        new_process_refs = bfqq_process_refs(new_bfqq);
2767        /*
2768         * If the process for the bfqq has gone away, there is no
2769         * sense in merging the queues.
2770         */
2771        if (process_refs == 0 || new_process_refs == 0)
2772                return NULL;
2773
2774        /*
2775         * Make sure merged queues belong to the same parent. Parents could
2776         * have changed since the time we decided the two queues are suitable
2777         * for merging.
2778         */
2779        if (new_bfqq->entity.parent != bfqq->entity.parent)
2780                return NULL;
2781
2782        bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
2783                new_bfqq->pid);
2784
2785        /*
2786         * Merging is just a redirection: the requests of the process
2787         * owning one of the two queues are redirected to the other queue.
2788         * The latter queue, in its turn, is set as shared if this is the
2789         * first time that the requests of some process are redirected to
2790         * it.
2791         *
2792         * We redirect bfqq to new_bfqq and not the opposite, because
2793         * we are in the context of the process owning bfqq, thus we
2794         * have the io_cq of this process. So we can immediately
2795         * configure this io_cq to redirect the requests of the
2796         * process to new_bfqq. In contrast, the io_cq of new_bfqq is
2797         * not available any more (new_bfqq->bic == NULL).
2798         *
2799         * Anyway, even in case new_bfqq coincides with the in-service
2800         * queue, redirecting requests to the in-service queue is the
2801         * best option, as we feed the in-service queue with new
2802         * requests close to the last request served and, by doing so,
2803         * are likely to increase the throughput.
2804         */
2805        bfqq->new_bfqq = new_bfqq;
2806        /*
2807         * The above assignment schedules the following redirections:
2808         * each time some I/O for bfqq arrives, the process that
2809         * generated that I/O is disassociated from bfqq and
2810         * associated with new_bfqq. Here we increase new_bfqq->ref
2811         * in advance, adding the number of processes that are
2812         * expected to be associated with new_bfqq as they happen to
2813         * issue I/O.
2814         */
2815        new_bfqq->ref += process_refs;
2816        return new_bfqq;
2817}
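
    /*
     * A short sketch of the chain walk above, with hypothetical queues:
     * if new_bfqq already points, through its new_bfqq field, to Q2, and
     * Q2 in turn points to Q3, then the loop makes bfqq merge directly
     * with Q3, the end of the chain, skipping the interim queues; if the
     * walk ever reaches bfqq itself, the merge is refused, so that no
     * cycle can be created. The final new_bfqq->ref += process_refs
     * accounts, in advance, for the processes of bfqq that will be
     * redirected to the chosen queue as they issue new I/O.
     */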
2818
2819static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
2820                                        struct bfq_queue *new_bfqq)
2821{
2822        if (bfq_too_late_for_merging(new_bfqq))
2823                return false;
2824
2825        if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||
2826            (bfqq->ioprio_class != new_bfqq->ioprio_class))
2827                return false;
2828
2829        /*
2830         * If either of the queues has already been detected as seeky,
2831         * then merging it with the other queue is unlikely to lead to
2832         * sequential I/O.
2833         */
2834        if (BFQQ_SEEKY(bfqq) || BFQQ_SEEKY(new_bfqq))
2835                return false;
2836
2837        /*
2838         * Interleaved I/O is known to be done by (some) applications
2839         * only for reads, so it does not make sense to merge async
2840         * queues.
2841         */
2842        if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))
2843                return false;
2844
2845        return true;
2846}
2847
2848static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
2849                                             struct bfq_queue *bfqq);
2850
2851static struct bfq_queue *
2852bfq_setup_stable_merge(struct bfq_data *bfqd, struct bfq_queue *bfqq,
2853                       struct bfq_queue *stable_merge_bfqq,
2854                       struct bfq_iocq_bfqq_data *bfqq_data)
2855{
2856        int proc_ref = min(bfqq_process_refs(bfqq),
2857                           bfqq_process_refs(stable_merge_bfqq));
2858        struct bfq_queue *new_bfqq = NULL;
2859
2860        bfqq_data->stable_merge_bfqq = NULL;
2861        if (idling_boosts_thr_without_issues(bfqd, bfqq) || proc_ref == 0)
2862                goto out;
2863
2864        /* next function will take at least one ref */
2865        new_bfqq = bfq_setup_merge(bfqq, stable_merge_bfqq);
2866
2867        if (new_bfqq) {
2868                bfqq_data->stably_merged = true;
2869                if (new_bfqq->bic) {
2870                        unsigned int new_a_idx = new_bfqq->actuator_idx;
2871                        struct bfq_iocq_bfqq_data *new_bfqq_data =
2872                                &new_bfqq->bic->bfqq_data[new_a_idx];
2873
2874                        new_bfqq_data->stably_merged = true;
2875                }
2876        }
2877
2878out:
2879        /* deschedule stable merge, because done or aborted here */
2880        bfq_put_stable_ref(stable_merge_bfqq);
2881
2882        return new_bfqq;
2883}
2884
2885/*
2886 * Attempt to schedule a merge of bfqq with the currently in-service
2887 * queue or with a close queue among the scheduled queues.  Return
2888 * NULL if no merge was scheduled, a pointer to the shared bfq_queue
2889 * structure otherwise.
2890 *
2891 * The OOM queue is not allowed to participate in cooperation: in fact, since
2892 * the requests temporarily redirected to the OOM queue could be redirected
2893 * again to dedicated queues at any time, the state needed to correctly
2894 * handle merging with the OOM queue would be quite complex and expensive
2895 * to maintain. Besides, in such a critical condition as an out-of-memory one,
2896 * the benefits of queue merging may be of little relevance, or even negligible.
2897 *
2898 * WARNING: queue merging may impair fairness among non-weight-raised
2899 * queues, for at least two reasons: 1) the original weight of a
2900 * merged queue may change during the merged state, 2) even if the
2901 * weight stays the same, a merged queue may be bloated with many more
2902 * requests than the ones produced by its originally-associated
2903 * process.
2904 */
2905static struct bfq_queue *
2906bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
2907                     void *io_struct, bool request, struct bfq_io_cq *bic)
2908{
2909        struct bfq_queue *in_service_bfqq, *new_bfqq;
2910        unsigned int a_idx = bfqq->actuator_idx;
2911        struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];
2912
2913        /* if a merge has already been setup, then proceed with that first */
2914        if (bfqq->new_bfqq)
2915                return bfqq->new_bfqq;
2916
2917        /*
2918         * Check delayed stable merge for rotational or non-queueing
2919         * devs. For this branch to be executed, bfqq must not be
2920         * currently merged with some other queue (i.e., bfqq->bic
2921         * must be non null). If we considered also merged queues,
2922         * then we should also check whether bfqq has already been
2923         * merged with bic->stable_merge_bfqq. But this would be
2924         * costly and complicated.
2925         */
2926        if (unlikely(!bfqd->nonrot_with_queueing)) {
2927                /*
2928                 * Make sure also that bfqq is sync, because
2929                 * bic->stable_merge_bfqq may point to some queue (for
2930                 * stable merging) even if bic is associated with a
2931                 * sync queue, but this bfqq is async
2932                 */
2933                if (bfq_bfqq_sync(bfqq) && bfqq_data->stable_merge_bfqq &&
2934                    !bfq_bfqq_just_created(bfqq) &&
2935                    time_is_before_jiffies(bfqq->split_time +
2936                                          msecs_to_jiffies(bfq_late_stable_merging)) &&
2937                    time_is_before_jiffies(bfqq->creation_time +
2938                                           msecs_to_jiffies(bfq_late_stable_merging))) {
2939                        struct bfq_queue *stable_merge_bfqq =
2940                                bfqq_data->stable_merge_bfqq;
2941
2942                        return bfq_setup_stable_merge(bfqd, bfqq,
2943                                                      stable_merge_bfqq,
2944                                                      bfqq_data);
2945                }
2946        }
2947
2948        /*
2949         * Do not perform queue merging if the device is non
2950         * rotational and performs internal queueing. In fact, such a
2951         * device reaches a high speed through internal parallelism
2952         * and pipelining. This means that, to reach a high
2953         * throughput, it must have many requests enqueued at the same
2954         * time. But, in this configuration, the internal scheduling
2955         * algorithm of the device does exactly the job of queue
2956         * merging: it reorders requests so as to obtain as much as
2957         * possible a sequential I/O pattern. As a consequence, with
2958         * the workload generated by processes doing interleaved I/O,
2959         * the throughput reached by the device is likely to be the
2960         * same, with and without queue merging.
2961         *
2962         * Disabling merging also provides a remarkable benefit in
2963         * terms of throughput. Merging tends to make many workloads
2964         * artificially more uneven, because of shared queues
2965         * remaining non empty for incomparably more time than
2966         * non-merged queues. This may accentuate workload
2967         * asymmetries. For example, if one of the queues in a set of
2968         * merged queues has a higher weight than a normal queue, then
2969         * the shared queue may inherit such a high weight and, by
2970         * staying almost always active, may force BFQ to perform I/O
2971         * plugging most of the time. This evidently makes it harder
2972         * for BFQ to let the device reach a high throughput.
2973         *
2974         * Finally, the likely() macro below is used not because one
2975         * of the two branches is more likely than the other, but to
2976         * have the code path after the following if() executed as
2977         * fast as possible for the case of a non-rotational device
2978         * with queueing. We want it because this is the fastest kind
2979         * of device. On the opposite end, the likely() may lengthen
2980         * the execution time of BFQ for the case of slower devices
2981         * (rotational or at least without queueing). But in this case
2982         * the execution time of BFQ matters very little, if not at
2983         * all.
2984         */
2985        if (likely(bfqd->nonrot_with_queueing))
2986                return NULL;
2987
2988        /*
2989         * Prevent bfqq from being merged if it has been created too
2990         * long ago. The idea is that true cooperating processes, and
2991         * thus their associated bfq_queues, are supposed to be
2992         * created shortly after each other. This is the case, e.g.,
2993         * for KVM/QEMU and dump I/O threads. Basing on this
2994         * assumption, the following filtering greatly reduces the
2995         * probability that two non-cooperating processes, which just
2996         * happen to do close I/O for some short time interval, have
2997         * their queues merged by mistake.
2998         */
2999        if (bfq_too_late_for_merging(bfqq))
3000                return NULL;
3001
3002        if (!io_struct || unlikely(bfqq == &bfqd->oom_bfqq))
3003                return NULL;
3004
3005        /* If there is only one backlogged queue, don't search. */
3006        if (bfq_tot_busy_queues(bfqd) == 1)
3007                return NULL;
3008
3009        in_service_bfqq = bfqd->in_service_queue;
3010
3011        if (in_service_bfqq && in_service_bfqq != bfqq &&
3012            likely(in_service_bfqq != &bfqd->oom_bfqq) &&
3013            bfq_rq_close_to_sector(io_struct, request,
3014                                   bfqd->in_serv_last_pos) &&
3015            bfqq->entity.parent == in_service_bfqq->entity.parent &&
3016            bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {
3017                new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
3018                if (new_bfqq)
3019                        return new_bfqq;
3020        }
3021        /*
3022         * Check whether there is a cooperator among currently scheduled
3023         * queues. The only thing we need is that the bio/request is not
3024         * NULL, as we need it to establish whether a cooperator exists.
3025         */
3026        new_bfqq = bfq_find_close_cooperator(bfqd, bfqq,
3027                        bfq_io_struct_pos(io_struct, request));
3028
3029        if (new_bfqq && likely(new_bfqq != &bfqd->oom_bfqq) &&
3030            bfq_may_be_close_cooperator(bfqq, new_bfqq))
3031                return bfq_setup_merge(bfqq, new_bfqq);
3032
3033        return NULL;
3034}
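
    /*
     * Informal recap of the flow above: an already scheduled merge wins;
     * then a delayed stable merge may be attempted on rotational or
     * non-queueing devices; non-rotational devices with internal queueing
     * never merge; finally, too-old queues, the OOM queue, and
     * single-busy-queue scenarios are filtered out before looking for a
     * cooperator, first in the in-service queue and then among the queues
     * close to the current I/O position.
     */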
3035
3036static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
3037{
3038        struct bfq_io_cq *bic = bfqq->bic;
3039        unsigned int a_idx = bfqq->actuator_idx;
3040        struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];
3041
3042        /*
3043         * If !bfqq->bic, the queue is already shared or its requests
3044         * have already been redirected to a shared queue; both idle window
3045         * and weight raising state have already been saved. Do nothing.
3046         */
3047        if (!bic)
3048                return;
3049
3050        bfqq_data->saved_last_serv_time_ns = bfqq->last_serv_time_ns;
3051        bfqq_data->saved_inject_limit = bfqq->inject_limit;
3052        bfqq_data->saved_decrease_time_jif = bfqq->decrease_time_jif;
3053
3054        bfqq_data->saved_weight = bfqq->entity.orig_weight;
3055        bfqq_data->saved_ttime = bfqq->ttime;
3056        bfqq_data->saved_has_short_ttime =
3057                bfq_bfqq_has_short_ttime(bfqq);
3058        bfqq_data->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
3059        bfqq_data->saved_io_start_time = bfqq->io_start_time;
3060        bfqq_data->saved_tot_idle_time = bfqq->tot_idle_time;
3061        bfqq_data->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
3062        bfqq_data->was_in_burst_list =
3063                !hlist_unhashed(&bfqq->burst_list_node);
3064
3065        if (unlikely(bfq_bfqq_just_created(bfqq) &&
3066                     !bfq_bfqq_in_large_burst(bfqq) &&
3067                     bfqq->bfqd->low_latency)) {
3068                /*
3069                 * bfqq being merged right after being created: bfqq
3070                 * would have deserved interactive weight raising, but
3071                 * did not make it to be set in a weight-raised state,
3072                 * because of this early merge. Store directly the
3073                 * weight-raising state that would have been assigned
3074                 * to bfqq, so as to avoid that bfqq unjustly fails
3075                 * to enjoy weight raising if split soon.
3076                 */
3077                bfqq_data->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
3078                bfqq_data->saved_wr_start_at_switch_to_srt =
3079                        bfq_smallest_from_now();
3080                bfqq_data->saved_wr_cur_max_time =
3081                        bfq_wr_duration(bfqq->bfqd);
3082                bfqq_data->saved_last_wr_start_finish = jiffies;
3083        } else {
3084                bfqq_data->saved_wr_coeff = bfqq->wr_coeff;
3085                bfqq_data->saved_wr_start_at_switch_to_srt =
3086                        bfqq->wr_start_at_switch_to_srt;
3087                bfqq_data->saved_service_from_wr =
3088                        bfqq->service_from_wr;
3089                bfqq_data->saved_last_wr_start_finish =
3090                        bfqq->last_wr_start_finish;
3091                bfqq_data->saved_wr_cur_max_time = bfqq->wr_cur_max_time;
3092        }
3093}
3094
3095
3096static void
3097bfq_reassign_last_bfqq(struct bfq_queue *cur_bfqq, struct bfq_queue *new_bfqq)
3098{
3099        if (cur_bfqq->entity.parent &&
3100            cur_bfqq->entity.parent->last_bfqq_created == cur_bfqq)
3101                cur_bfqq->entity.parent->last_bfqq_created = new_bfqq;
3102        else if (cur_bfqq->bfqd && cur_bfqq->bfqd->last_bfqq_created == cur_bfqq)
3103                cur_bfqq->bfqd->last_bfqq_created = new_bfqq;
3104}
3105
3106void bfq_release_process_ref(struct bfq_data *bfqd, struct bfq_queue *bfqq)
3107{
3108        /*
3109         * To prevent bfqq's service guarantees from being violated,
3110         * bfqq may be left busy, i.e., queued for service, even if
3111         * empty (see comments in __bfq_bfqq_expire() for
3112         * details). But, if no process will send requests to bfqq any
3113         * longer, then there is no point in keeping bfqq queued for
3114         * service. In addition, keeping bfqq queued for service, but
3115         * with no process ref any longer, may have caused bfqq to be
3116         * freed when dequeued from service. But this is assumed to
3117         * never happen.
3118         */
3119        if (bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list) &&
3120            bfqq != bfqd->in_service_queue)
3121                bfq_del_bfqq_busy(bfqq, false);
3122
3123        bfq_reassign_last_bfqq(bfqq, NULL);
3124
3125        bfq_put_queue(bfqq);
3126}
3127
3128static void
3129bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
3130                struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
3131{
3132        bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
3133                (unsigned long)new_bfqq->pid);
3134        /* Save weight raising and idle window of the merged queues */
3135        bfq_bfqq_save_state(bfqq);
3136        bfq_bfqq_save_state(new_bfqq);
3137        if (bfq_bfqq_IO_bound(bfqq))
3138                bfq_mark_bfqq_IO_bound(new_bfqq);
3139        bfq_clear_bfqq_IO_bound(bfqq);
3140
3141        /*
3142         * The processes associated with bfqq are cooperators of the
3143         * processes associated with new_bfqq. So, if bfqq has a
3144         * waker, then assume that all these processes will be happy
3145         * to let bfqq's waker freely inject I/O when they have no
3146         * I/O.
3147         */
3148        if (bfqq->waker_bfqq && !new_bfqq->waker_bfqq &&
3149            bfqq->waker_bfqq != new_bfqq) {
3150                new_bfqq->waker_bfqq = bfqq->waker_bfqq;
3151                new_bfqq->tentative_waker_bfqq = NULL;
3152
3153                /*
3154                 * If the waker queue disappears, then
3155                 * new_bfqq->waker_bfqq must be reset. So insert
3156                 * new_bfqq into the woken_list of the waker. See
3157                 * bfq_check_waker for details.
3158                 */
3159                hlist_add_head(&new_bfqq->woken_list_node,
3160                               &new_bfqq->waker_bfqq->woken_list);
3161
3162        }
3163
3164        /*
3165         * If bfqq is weight-raised, then let new_bfqq inherit
3166         * weight-raising. To reduce false positives, neglect the case
3167         * where bfqq has just been created, but has not yet made it
3168         * to be weight-raised (which may happen because EQM may merge
3169         * bfqq even before bfq_add_request is executed for the first
3170         * time for bfqq). Handling this case would however be very
3171         * easy, thanks to the flag just_created.
3172         */
3173        if (new_bfqq->wr_coeff == 1 && bfqq->wr_coeff > 1) {
3174                new_bfqq->wr_coeff = bfqq->wr_coeff;
3175                new_bfqq->wr_cur_max_time = bfqq->wr_cur_max_time;
3176                new_bfqq->last_wr_start_finish = bfqq->last_wr_start_finish;
3177                new_bfqq->wr_start_at_switch_to_srt =
3178                        bfqq->wr_start_at_switch_to_srt;
3179                if (bfq_bfqq_busy(new_bfqq))
3180                        bfqd->wr_busy_queues++;
3181                new_bfqq->entity.prio_changed = 1;
3182        }
3183
3184        if (bfqq->wr_coeff > 1) { /* bfqq has given its wr to new_bfqq */
3185                bfqq->wr_coeff = 1;
3186                bfqq->entity.prio_changed = 1;
3187                if (bfq_bfqq_busy(bfqq))
3188                        bfqd->wr_busy_queues--;
3189        }
3190
3191        bfq_log_bfqq(bfqd, new_bfqq, "merge_bfqqs: wr_busy %d",
3192                     bfqd->wr_busy_queues);
3193
3194        /*
3195         * Merge queues (that is, let bic redirect its requests to new_bfqq)
3196         */
3197        bic_set_bfqq(bic, new_bfqq, true, bfqq->actuator_idx);
3198        bfq_mark_bfqq_coop(new_bfqq);
3199        /*
3200         * new_bfqq now belongs to at least two bics (it is a shared queue):
3201         * set new_bfqq->bic to NULL. bfqq either:
3202         * - does not belong to any bic any more, and hence bfqq->bic must
3203         *   be set to NULL, or
3204         * - is a queue whose owning bics have already been redirected to a
3205         *   different queue, hence the queue is destined to not belong to
3206         *   any bic soon and bfqq->bic is already NULL (therefore the next
3207         *   assignment causes no harm).
3208         */
3209        new_bfqq->bic = NULL;
3210        /*
3211         * If the queue is shared, the pid is the pid of one of the associated
3212         * processes. Which pid depends on the exact sequence of merge events
3213         * the queue underwent. So printing such a pid is useless and confusing
3214         * because it reports a random pid among those of the associated
3215         * processes.
3216         * We mark such a queue with a pid -1, and then print SHARED instead of
3217         * a pid in logging messages.
3218         */
3219        new_bfqq->pid = -1;
3220        bfqq->bic = NULL;
3221
3222        bfq_reassign_last_bfqq(bfqq, new_bfqq);
3223
3224        bfq_release_process_ref(bfqd, bfqq);
3225}
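
    /*
     * Note on the weight-raising hand-off above: if bfqq is weight-raised
     * and new_bfqq is not, new_bfqq inherits coefficient, residual raising
     * time and start timestamps, while bfqq falls back to wr_coeff == 1;
     * wr_busy_queues is incremented for new_bfqq and decremented for bfqq
     * only if the respective queue is busy, which keeps the counter equal
     * to the number of busy weight-raised queues.
     */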
3226
3227static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
3228                                struct bio *bio)
3229{
3230        struct bfq_data *bfqd = q->elevator->elevator_data;
3231        bool is_sync = op_is_sync(bio->bi_opf);
3232        struct bfq_queue *bfqq = bfqd->bio_bfqq, *new_bfqq;
3233
3234        /*
3235         * Disallow merge of a sync bio into an async request.
3236         */
3237        if (is_sync && !rq_is_sync(rq))
3238                return false;
3239
3240        /*
3241         * Lookup the bfqq that this bio will be queued with. Allow
3242         * merge only if rq is queued there.
3243         */
3244        if (!bfqq)
3245                return false;
3246
3247        /*
3248         * We take advantage of this function to perform an early merge
3249         * of the queues of possible cooperating processes.
3250         */
3251        new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false, bfqd->bio_bic);
3252        if (new_bfqq) {
3253                /*
3254                 * bic still points to bfqq, then it has not yet been
3255                 * redirected to some other bfq_queue, and a queue
3256                 * merge between bfqq and new_bfqq can be safely
3257                 * fulfilled, i.e., bic can be redirected to new_bfqq
3258                 * and bfqq can be put.
3259                 */
3260                bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq,
3261                                new_bfqq);
3262                /*
3263                 * If we get here, bio will be queued into new_queue,
3264                 * so use new_bfqq to decide whether bio and rq can be
3265                 * merged.
3266                 */
3267                bfqq = new_bfqq;
3268
3269                /*
3270                 * Change also bfqd->bio_bfqq, as
3271                 * bfqd->bio_bic now points to new_bfqq, and
3272                 * this function may be invoked again (and then may
3273                 * use again bfqd->bio_bfqq).
3274                 */
3275                bfqd->bio_bfqq = bfqq;
3276        }
3277
3278        return bfqq == RQ_BFQQ(rq);
3279}
3280
3281/*
3282 * Set the maximum time for the in-service queue to consume its
3283 * budget. This prevents seeky processes from lowering the throughput.
3284 * In practice, a time-slice service scheme is used with seeky
3285 * processes.
3286 */
3287static void bfq_set_budget_timeout(struct bfq_data *bfqd,
3288                                   struct bfq_queue *bfqq)
3289{
3290        unsigned int timeout_coeff;
3291
3292        if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
3293                timeout_coeff = 1;
3294        else
3295                timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
3296
3297        bfqd->last_budget_start = ktime_get();
3298
3299        bfqq->budget_timeout = jiffies +
3300                bfqd->bfq_timeout * timeout_coeff;
3301}
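
    /*
     * Rough numerical example, with hypothetical values: for a soft
     * real-time queue, timeout_coeff is forced to 1, so the queue gets
     * exactly bfq_timeout to consume its budget; for a queue whose weight
     * is currently raised by, say, a factor of 30,
     * entity.weight / entity.orig_weight is 30 and the deadline is pushed
     * 30 times further out, consistently with the larger amount of
     * service the queue is entitled to while weight-raised.
     */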
3302
3303static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
3304                                       struct bfq_queue *bfqq)
3305{
3306        if (bfqq) {
3307                bfq_clear_bfqq_fifo_expire(bfqq);
3308
3309                bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;
3310
3311                if (time_is_before_jiffies(bfqq->last_wr_start_finish) &&
3312                    bfqq->wr_coeff > 1 &&
3313                    bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
3314                    time_is_before_jiffies(bfqq->budget_timeout)) {
3315                        /*
3316                         * For soft real-time queues, move the start
3317                         * of the weight-raising period forward by the
3318                         * time the queue has not received any
3319                         * service. Otherwise, a relatively long
3320                         * service delay is likely to cause the
3321                         * weight-raising period of the queue to end,
3322                         * because of the short duration of the
3323                         * weight-raising period of a soft real-time
3324                         * queue.  It is worth noting that this move
3325                         * is not so dangerous for the other queues,
3326                         * because soft real-time queues are not
3327                         * greedy.
3328                         *
3329                         * To not add a further variable, we use the
3330                         * overloaded field budget_timeout to
3331                         * determine for how long the queue has not
3332                         * received service, i.e., how much time has
3333                         * elapsed since the queue expired. However,
3334                         * this is a little imprecise, because
3335                         * budget_timeout is set to jiffies if bfqq
3336                         * not only expires, but also remains with no
3337                         * request.
3338                         */
3339                        if (time_after(bfqq->budget_timeout,
3340                                       bfqq->last_wr_start_finish))
3341                                bfqq->last_wr_start_finish +=
3342                                        jiffies - bfqq->budget_timeout;
3343                        else
3344                                bfqq->last_wr_start_finish = jiffies;
3345                }
3346
3347                bfq_set_budget_timeout(bfqd, bfqq);
3348                bfq_log_bfqq(bfqd, bfqq,
3349                             "set_in_service_queue, cur-budget = %d",
3350                             bfqq->entity.budget);
3351        }
3352
3353        bfqd->in_service_queue = bfqq;
3354        bfqd->in_serv_last_pos = 0;
3355}
3356
3357/*
3358 * Get and set a new queue for service.
3359 */
3360static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
3361{
3362        struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
3363
3364        __bfq_set_in_service_queue(bfqd, bfqq);
3365        return bfqq;
3366}
3367
3368static void bfq_arm_slice_timer(struct bfq_data *bfqd)
3369{
3370        struct bfq_queue *bfqq = bfqd->in_service_queue;
3371        u32 sl;
3372
3373        bfq_mark_bfqq_wait_request(bfqq);
3374
3375        /*
3376         * We don't want to idle for seeks, but we do want to allow
3377         * fair distribution of slice time for a process doing back-to-back
3378         * seeks. So allow a little bit of time for it to submit a new rq.
3379         */
3380        sl = bfqd->bfq_slice_idle;
3381        /*
3382         * Unless the queue is being weight-raised or the scenario is
3383         * asymmetric, grant only minimum idle time if the queue
3384         * is seeky. A long idling is preserved for a weight-raised
3385         * queue, or, more in general, in an asymmetric scenario,
3386         * because a long idling is needed for guaranteeing to a queue
3387         * its reserved share of the throughput (in particular, it is
3388         * needed if the queue has a higher weight than some other
3389         * queue).
3390         */
3391        if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
3392            !bfq_asymmetric_scenario(bfqd, bfqq))
3393                sl = min_t(u64, sl, BFQ_MIN_TT);
3394        else if (bfqq->wr_coeff > 1)
3395                sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
3396
3397        bfqd->last_idling_start = ktime_get();
3398        bfqd->last_idling_start_jiffies = jiffies;
3399
3400        hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
3401                      HRTIMER_MODE_REL);
3402        bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
3403}
3404
3405/*
3406 * In autotuning mode, max_budget is dynamically recomputed as the
3407 * amount of sectors transferred in timeout at the estimated peak
3408 * rate. This enables BFQ to utilize a full timeslice with a full
3409 * budget, even if the in-service queue is served at peak rate. And
3410 * this maximises throughput with sequential workloads.
3411 */
3412static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
3413{
3414        return (u64)bfqd->peak_rate * USEC_PER_MSEC *
3415                jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
3416}
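
    /*
     * Units sketch: peak_rate is kept in sectors/usec, left-shifted by
     * BFQ_RATE_SHIFT for fixed-point precision, so multiplying it by the
     * timeout expressed in microseconds and shifting back yields a budget
     * in sectors. With hypothetical numbers, a device sustaining about
     * 1 sector/usec (~512 MB/s) and a 125 ms timeout would get a
     * max_budget of roughly 125000 sectors, i.e. about 64 MB per budget.
     */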
3417
3418/*
3419 * Update parameters related to throughput and responsiveness, as a
3420 * function of the estimated peak rate. See comments on
3421 * bfq_calc_max_budget(), and on the ref_wr_duration array.
3422 */
3423static void update_thr_responsiveness_params(struct bfq_data *bfqd)
3424{
3425        if (bfqd->bfq_user_max_budget == 0) {
3426                bfqd->bfq_max_budget =
3427                        bfq_calc_max_budget(bfqd);
3428                bfq_log(bfqd, "new max_budget = %d", bfqd->bfq_max_budget);
3429        }
3430}
3431
3432static void bfq_reset_rate_computation(struct bfq_data *bfqd,
3433                                       struct request *rq)
3434{
3435        if (rq != NULL) { /* new rq dispatch now, reset accordingly */
3436                bfqd->last_dispatch = bfqd->first_dispatch = ktime_get_ns();
3437                bfqd->peak_rate_samples = 1;
3438                bfqd->sequential_samples = 0;
3439                bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =
3440                        blk_rq_sectors(rq);
3441        } else /* no new rq dispatched, just reset the number of samples */
3442                bfqd->peak_rate_samples = 0; /* full re-init on next disp. */
3443
3444        bfq_log(bfqd,
3445                "reset_rate_computation at end, sample %u/%u tot_sects %llu",
3446                bfqd->peak_rate_samples, bfqd->sequential_samples,
3447                bfqd->tot_sectors_dispatched);
3448}
3449
3450static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
3451{
3452        u32 rate, weight, divisor;
3453
3454        /*
3455         * For the convergence property to hold (see comments on
3456         * bfq_update_peak_rate()) and for the assessment to be
3457         * reliable, a minimum number of samples must be present, and
3458         * a minimum amount of time must have elapsed. If not so, do
3459         * not compute new rate. Just reset parameters, to get ready
3460         * for a new evaluation attempt.
3461         */
3462        if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES ||
3463            bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL)
3464                goto reset_computation;
3465
3466        /*
3467         * If a new request completion has occurred after last
3468         * dispatch, then, to approximate the rate at which requests
3469         * have been served by the device, it is more precise to
3470         * extend the observation interval to the last completion.
3471         */
3472        bfqd->delta_from_first =
3473                max_t(u64, bfqd->delta_from_first,
3474                      bfqd->last_completion - bfqd->first_dispatch);
3475
3476        /*
3477         * Rate computed in sects/usec, and not sects/nsec, for
3478         * precision issues.
3479         */
3480        rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT,
3481                        div_u64(bfqd->delta_from_first, NSEC_PER_USEC));
3482
3483        /*
3484         * Peak rate not updated if:
3485         * - the percentage of sequential dispatches is below 3/4 of the
3486         *   total, and rate is below the current estimated peak rate
3487         * - rate is unreasonably high (> 20M sectors/sec)
3488         */
3489        if ((bfqd->sequential_samples < (3 * bfqd->peak_rate_samples)>>2 &&
3490             rate <= bfqd->peak_rate) ||
3491                rate > 20<<BFQ_RATE_SHIFT)
3492                goto reset_computation;
3493
3494        /*
3495         * We have to update the peak rate, at last! To this purpose,
3496         * we use a low-pass filter. We compute the smoothing constant
3497         * of the filter as a function of the 'weight' of the new
3498         * measured rate.
3499         *
3500         * As can be seen in next formulas, we define this weight as a
3501         * quantity proportional to how sequential the workload is,
3502         * and to how long the observation time interval is.
3503         *
3504         * The weight runs from 0 to 8. The maximum value of the
3505         * weight, 8, yields the minimum value for the smoothing
3506         * constant. At this minimum value for the smoothing constant,
3507         * the measured rate contributes for half of the next value of
3508         * the estimated peak rate.
3509         *
3510         * So, the first step is to compute the weight as a function
3511         * of how sequential the workload is. Note that the weight
3512         * cannot reach 9, because bfqd->sequential_samples cannot
3513         * become equal to bfqd->peak_rate_samples, which, in its
3514         * turn, holds true because bfqd->sequential_samples is not
3515         * incremented for the first sample.
3516         */
3517        weight = (9 * bfqd->sequential_samples) / bfqd->peak_rate_samples;
3518
3519        /*
3520         * Second step: further refine the weight as a function of the
3521         * duration of the observation interval.
3522         */
3523        weight = min_t(u32, 8,
3524                       div_u64(weight * bfqd->delta_from_first,
3525                               BFQ_RATE_REF_INTERVAL));
3526
3527        /*
3528         * Divisor ranging from 10, for minimum weight, to 2, for
3529         * maximum weight.
3530         */
3531        divisor = 10 - weight;
3532
3533        /*
3534         * Finally, update peak rate:
3535         *
3536         * peak_rate = peak_rate * (divisor-1) / divisor  +  rate / divisor
3537         */
3538        bfqd->peak_rate *= divisor-1;
3539        bfqd->peak_rate /= divisor;
3540        rate /= divisor; /* smoothing constant alpha = 1/divisor */
3541
3542        bfqd->peak_rate += rate;
3543
3544        /*
3545         * For a very slow device, bfqd->peak_rate can reach 0 (see
3546         * the minimum representable values reported in the comments
3547         * on BFQ_RATE_SHIFT). Push to 1 if this happens, to avoid
3548         * divisions by zero where bfqd->peak_rate is used as a
3549         * divisor.
3550         */
3551        bfqd->peak_rate = max_t(u32, 1, bfqd->peak_rate);
3552
3553        update_thr_responsiveness_params(bfqd);
3554
3555reset_computation:
3556        bfq_reset_rate_computation(bfqd, rq);
3557}
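
    /*
     * Worked example of the filter above, with made-up samples: for a
     * long, almost fully sequential observation interval the weight
     * saturates at 8, the divisor becomes 2, and the new estimate is
     * (old peak_rate + measured rate) / 2, i.e. the new sample counts for
     * half; for a short or mostly random interval the weight stays low,
     * the divisor approaches 10, and the new sample only nudges the
     * estimate by about one tenth of its value.
     */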
3558
3559/*
3560 * Update the read/write peak rate (the main quantity used for
3561 * auto-tuning, see update_thr_responsiveness_params()).
3562 *
3563 * It is not trivial to estimate the peak rate (correctly): because of
3564 * the presence of sw and hw queues between the scheduler and the
3565 * device components that finally serve I/O requests, it is hard to
3566 * say exactly when a given dispatched request is served inside the
3567 * device, and for how long. As a consequence, it is hard to know
3568 * precisely at what rate a given set of requests is actually served
3569 * by the device.
3570 *
3571 * On the opposite end, the dispatch time of any request is trivially
3572 * available, and, from this piece of information, the "dispatch rate"
3573 * of requests can be immediately computed. So, the idea in the next
3574 * function is to use what is known, namely request dispatch times
3575 * (plus, when useful, request completion times), to estimate what is
3576 * unknown, namely in-device request service rate.
3577 *
3578 * The main issue is that, because of the above facts, the rate at
3579 * which a certain set of requests is dispatched over a certain time
3580 * interval can vary greatly with respect to the rate at which the
3581 * same requests are then served. But, since the size of any
3582 * intermediate queue is limited, and the service scheme is lossless
3583 * (no request is silently dropped), the following obvious convergence
3584 * property holds: the number of requests dispatched MUST become
3585 * closer and closer to the number of requests completed as the
3586 * observation interval grows. This is the key property used in
3587 * the next function to estimate the peak service rate as a function
3588 * of the observed dispatch rate. The function assumes to be invoked
3589 * on every request dispatch.
3590 */
3591static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)
3592{
3593        u64 now_ns = ktime_get_ns();
3594
3595        if (bfqd->peak_rate_samples == 0) { /* first dispatch */
3596                bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",
3597                        bfqd->peak_rate_samples);
3598                bfq_reset_rate_computation(bfqd, rq);
3599                goto update_last_values; /* will add one sample */
3600        }
3601
3602        /*
3603         * Device idle for very long: the observation interval lasting
3604         * up to this dispatch cannot be a valid observation interval
3605         * for computing a new peak rate (similarly to the late-
3606         * completion event in bfq_completed_request()). Go to
3607         * update_rate_and_reset to have the following three steps
3608         * taken:
3609         * - close the observation interval at the last (previous)
3610         *   request dispatch or completion
3611         * - compute rate, if possible, for that observation interval
3612         * - start a new observation interval with this dispatch
3613         */
3614        if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&
3615            bfqd->tot_rq_in_driver == 0)
3616                goto update_rate_and_reset;
3617
3618        /* Update sampling information */
3619        bfqd->peak_rate_samples++;
3620
3621        if ((bfqd->tot_rq_in_driver > 0 ||
3622                now_ns - bfqd->last_completion < BFQ_MIN_TT)
3623            && !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
3624                bfqd->sequential_samples++;
3625
3626        bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
3627
3628        /* Reset max observed rq size every 32 dispatches */
3629        if (likely(bfqd->peak_rate_samples % 32))
3630                bfqd->last_rq_max_size =
3631                        max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size);
3632        else
3633                bfqd->last_rq_max_size = blk_rq_sectors(rq);
3634
3635        bfqd->delta_from_first = now_ns - bfqd->first_dispatch;
3636
3637        /* Target observation interval not yet reached, go on sampling */
3638        if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL)
3639                goto update_last_values;
3640
3641update_rate_and_reset:
3642        bfq_update_rate_reset(bfqd, rq);
3643update_last_values:
3644        bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
3645        if (RQ_BFQQ(rq) == bfqd->in_service_queue)
3646                bfqd->in_serv_last_pos = bfqd->last_position;
3647        bfqd->last_dispatch = now_ns;
3648}
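
    /*
     * Informal recap of the paths above: the very first dispatch only
     * (re)initializes the sampling state; a dispatch arriving after a
     * long idle period (no request in the drive for more than 100 ms)
     * closes the previous observation interval and tries to compute a
     * rate from it; any other dispatch just extends the current interval,
     * until BFQ_RATE_REF_INTERVAL is reached and the rate is updated
     * through bfq_update_rate_reset().
     */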
3649
3650/*
3651 * Remove request from internal lists.
3652 */
3653static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)
3654{
3655        struct bfq_queue *bfqq = RQ_BFQQ(rq);
3656
3657        /*
3658         * For consistency, the next instruction should have been
3659         * executed after removing the request from the queue and
3660         * dispatching it.  We execute instead this instruction before
3661         * bfq_remove_request() (and hence introduce a temporary
3662         * inconsistency), for efficiency.  In fact, should this
3663         * dispatch occur for a non in-service bfqq, this anticipated
3664         * increment prevents two counters related to bfqq->dispatched
3665         * from being, first, uselessly decremented, and then
3666         * incremented again when the (new) value of bfqq->dispatched
3667         * happens to be taken into account.
3668         */
3669        bfqq->dispatched++;
3670        bfq_update_peak_rate(q->elevator->elevator_data, rq);
3671
3672        bfq_remove_request(q, rq);
3673}
3674
3675/*
3676 * There is a case where idling does not have to be performed for
3677 * throughput concerns, but to preserve the throughput share of
3678 * the process associated with bfqq.
3679 *
3680 * To introduce this case, we can note that allowing the drive
3681 * to enqueue more than one request at a time, and hence
3682 * delegating de facto final scheduling decisions to the
3683 * drive's internal scheduler, entails loss of control on the
3684 * actual request service order. In particular, the critical
3685 * situation is when requests from different processes happen
3686 * to be present, at the same time, in the internal queue(s)
3687 * of the drive. In such a situation, the drive, by deciding
3688 * the service order of the internally-queued requests, does
3689 * determine also the actual throughput distribution among
3690 * these processes. But the drive typically has no notion or
3691 * concern about per-process throughput distribution, and
3692 * makes its decisions only on a per-request basis. Therefore,
3693 * the service distribution enforced by the drive's internal
3694 * scheduler is likely to coincide with the desired throughput
3695 * distribution only in a completely symmetric, or favorably
3696 * skewed scenario where:
3697 * (i-a) each of these processes must get the same throughput as
3698 *       the others,
3699 * (i-b) in case (i-a) does not hold, it holds that the process
3700 *       associated with bfqq must receive a lower or equal
3701 *       throughput than any of the other processes;
3702 * (ii)  the I/O of each process has the same properties, in
3703 *       terms of locality (sequential or random), direction
3704 *       (reads or writes), request sizes, greediness
3705 *       (from I/O-bound to sporadic), and so on;
3706 *
3707 * In fact, in such a scenario, the drive tends to treat the requests
3708 * of each process in about the same way as the requests of the
3709 * others, and thus to provide each of these processes with about the
3710 * same throughput.  This is exactly the desired throughput
3711 * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
3712 * even more convenient distribution for (the process associated with)
3713 * bfqq.
3714 *
3715 * In contrast, in any asymmetric or unfavorable scenario, device
3716 * idling (I/O-dispatch plugging) is certainly needed to guarantee
3717 * that bfqq receives its assigned fraction of the device throughput
3718 * (see [1] for details).
3719 *
3720 * The problem is that idling may significantly reduce throughput with
3721 * certain combinations of types of I/O and devices. An important
3722 * example is sync random I/O on flash storage with command
3723 * queueing. So, unless bfqq falls in cases where idling also boosts
3724 * throughput, it is important to check conditions (i-a), (i-b) and
3725 * (ii) accurately, so as to avoid idling when not strictly needed for
3726 * service guarantees.
3727 *
3728 * Unfortunately, it is extremely difficult to thoroughly check
3729 * condition (ii). And, in case there are active groups, it becomes
3730 * very difficult to check conditions (i-a) and (i-b) too.  In fact,
3731 * if there are active groups, then, for conditions (i-a) or (i-b) to
3732 * become false 'indirectly', it is enough that an active group
3733 * contains more active processes or sub-groups than some other active
3734 * group. More precisely, for conditions (i-a) or (i-b) to become
3735 * false because of such a group, it is not even necessary that the
3736 * group is (still) active: it is sufficient that, even if the group
3737 * has become inactive, some of its descendant processes still have
3738 * some request already dispatched but still waiting for
3739 * completion. In fact, requests have still to be guaranteed their
3740 * share of the throughput even after being dispatched. In this
3741 * respect, it is easy to show that, if a group frequently becomes
3742 * inactive while still having in-flight requests, and if, when this
3743 * happens, the group is not considered in the calculation of whether
3744 * the scenario is asymmetric, then the group may fail to be
3745 * guaranteed its fair share of the throughput (basically because
3746 * idling may not be performed for the descendant processes of the
3747 * group, but it had to be).  We address this issue with the following
3748 * bi-modal behavior, implemented in the function
3749 * bfq_asymmetric_scenario().
3750 *
3751 * If there are groups with requests waiting for completion
3752 * (as commented above, some of these groups may even be
3753 * already inactive), then the scenario is tagged as
3754 * asymmetric, conservatively, without checking any of the
3755 * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
3756 * This behavior matches also the fact that groups are created
3757 * exactly if controlling I/O is a primary concern (to
3758 * preserve bandwidth and latency guarantees).
3759 *
3760 * On the opposite end, if there are no groups with requests waiting
3761 * for completion, then only conditions (i-a) and (i-b) are actually
3762 * controlled, i.e., provided that condition (i-a) or (i-b) holds,
3763 * idling is not performed, regardless of whether condition (ii)
3764 * holds.  In other words, only if conditions (i-a) and (i-b) do not
3765 * hold, then idling is allowed, and the device tends to be prevented
3766 * from queueing many requests, possibly of several processes. Since
3767 * there are no groups with requests waiting for completion, then, to
3768 * control conditions (i-a) and (i-b) it is enough to check just
3769 * whether all the queues with requests waiting for completion also
3770 * have the same weight.
3771 *
3772 * Not checking condition (ii) evidently exposes bfqq to the
3773 * risk of getting less throughput than its fair share.
3774 * However, for queues with the same weight, a further
3775 * mechanism, preemption, mitigates or even eliminates this
3776 * problem. And it does so without consequences on overall
3777 * throughput. This mechanism and its benefits are explained
3778 * in the next three paragraphs.
3779 *
3780 * Even if a queue, say Q, is expired when it remains idle, Q
3781 * can still preempt the new in-service queue if the next
3782 * request of Q arrives soon (see the comments on
3783 * bfq_bfqq_update_budg_for_activation). If all queues and
3784 * groups have the same weight, this form of preemption,
3785 * combined with the hole-recovery heuristic described in the
3786 * comments on function bfq_bfqq_update_budg_for_activation,
3787 * are enough to preserve a correct bandwidth distribution in
3788 * the mid term, even without idling. In fact, even if not
3789 * idling allows the internal queues of the device to contain
3790 * many requests, and thus to reorder requests, we can rather
3791 * safely assume that the internal scheduler still preserves a
3792 * minimum of mid-term fairness.
3793 *
3794 * More precisely, this preemption-based, idleless approach
3795 * provides fairness in terms of IOPS, and not sectors per
3796 * second. This can be seen with a simple example. Suppose
3797 * that there are two queues with the same weight, but that
3798 * the first queue receives requests of 8 sectors, while the
3799 * second queue receives requests of 1024 sectors. In
3800 * addition, suppose that each of the two queues contains at
3801 * most one request at a time, which implies that each queue
3802 * always remains idle after it is served. Finally, after
3803 * remaining idle, each queue receives very quickly a new
3804 * request. It follows that the two queues are served
3805 * alternatively, preempting each other if needed. This
3806 * implies that, although both queues have the same weight,
3807 * the queue with large requests receives a service that is
3808 * 1024/8 times as high as the service received by the other
3809 * queue.
3810 *
3811 * The motivation for using preemption instead of idling (for
3812 * queues with the same weight) is that, by not idling,
3813 * service guarantees are preserved (completely or at least in
3814 * part) without minimally sacrificing throughput. And, if
3815 * there is no active group, then the primary expectation for
3816 * this device is probably a high throughput.
3817 *
3818 * We are now left only with explaining the two sub-conditions in the
3819 * additional compound condition that is checked below for deciding
3820 * whether the scenario is asymmetric. To explain the first
3821 * sub-condition, we need to add that the function
3822 * bfq_asymmetric_scenario checks the weights of only
3823 * non-weight-raised queues, for efficiency reasons (see comments on
3824 * bfq_weights_tree_add()). Then the fact that bfqq is weight-raised
3825 * is checked explicitly here. More precisely, the compound condition
3826 * below takes into account also the fact that, even if bfqq is being
3827 * weight-raised, the scenario is still symmetric if all queues with
3828 * requests waiting for completion happen to be
3829 * weight-raised. Actually, we should be even more precise here, and
3830 * differentiate between interactive weight raising and soft real-time
3831 * weight raising.
3832 *
3833 * The second sub-condition checked in the compound condition is
3834 * whether there is a fair amount of already in-flight I/O not
3835 * belonging to bfqq. If so, I/O dispatching is to be plugged, for the
3836 * following reason. The drive may decide to serve in-flight
3837 * non-bfqq's I/O requests before bfqq's ones, thereby delaying the
3838 * arrival of new I/O requests for bfqq (recall that bfqq is sync). If
3839 * I/O-dispatching is not plugged, then, while bfqq remains empty, a
3840 * basically uncontrolled amount of I/O from other queues may be
3841 * dispatched too, possibly causing the service of bfqq's I/O to be
3842 * delayed even longer in the drive. This problem gets more and more
3843 * serious as the speed and the queue depth of the drive grow,
3844 * because, as these two quantities grow, the probability to find no
3845 * queue busy but many requests in flight grows too. By contrast,
3846 * plugging I/O dispatching minimizes the delay induced by already
3847 * in-flight I/O, and enables bfqq to recover the bandwidth it may
3848 * lose because of this delay.
3849 *
3850 * As a side note, it is worth considering that the above
3851 * device-idling countermeasures may however fail in the following
3852 * unlucky scenario: if I/O-dispatch plugging is (correctly) disabled
3853 * in a time period during which all symmetry sub-conditions hold, and
3854 * therefore the device is allowed to enqueue many requests, but at
3855 * some later point in time some sub-condition ceases to hold, then it
3856 * may become impossible to make requests be served in the desired
3857 * order until all the requests already queued in the device have been
3858 * served. The last sub-condition commented above somewhat mitigates
3859 * this problem for weight-raised queues.
3860 *
3861 * However, as an additional mitigation for this problem, we preserve
3862 * plugging for a special symmetric case that may suddenly turn into
3863 * asymmetric: the case where only bfqq is busy. In this case, not
3864 * expiring bfqq does not cause any harm to any other queues in terms
3865 * of service guarantees. In contrast, it avoids the following unlucky
3866 * sequence of events: (1) bfqq is expired, (2) a new queue with a
3867 * lower weight than bfqq (or several such queues) becomes busy, (3) the new
3868 * queue is served until a new request arrives for bfqq, (4) when bfqq
3869 * is finally served, there are so many requests of the new queue in
3870 * the drive that the pending requests for bfqq take a lot of time to
3871 * be served. In particular, event (2) may cause even already
3872 * dispatched requests of bfqq to be delayed, inside the drive. So, to
3873 * avoid this series of events, the scenario is preventively declared
3874 * as asymmetric also if bfqq is the only busy queue.
3875 */
3876static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
3877                                                 struct bfq_queue *bfqq)
3878{
3879        int tot_busy_queues = bfq_tot_busy_queues(bfqd);
3880
3881        /* No point in idling for bfqq if it won't get requests any longer */
3882        if (unlikely(!bfqq_process_refs(bfqq)))
3883                return false;
3884
3885        return (bfqq->wr_coeff > 1 &&
3886                (bfqd->wr_busy_queues < tot_busy_queues ||
3887                 bfqd->tot_rq_in_driver >= bfqq->dispatched + 4)) ||
3888                bfq_asymmetric_scenario(bfqd, bfqq) ||
3889                tot_busy_queues == 1;
3890}
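
/*
 * Illustrative sketch (not kernel code) of the bi-modal rule described
 * in the long comment above. The structure and its fields are
 * hypothetical stand-ins for the bookkeeping BFQ keeps in its weights
 * tree and group counters; the real decision is taken by
 * bfq_asymmetric_scenario() together with the compound condition in
 * idling_needed_for_service_guarantees().
 */
struct sketch_io_state {
	int groups_with_pending_reqs;	/* groups with in-flight I/O */
	int distinct_busy_weights;	/* distinct weights among busy queues */
};

static int sketch_scenario_is_asymmetric(const struct sketch_io_state *st)
{
	/*
	 * Some group has requests waiting for completion: be
	 * conservative and tag the scenario as asymmetric without
	 * checking anything else.
	 */
	if (st->groups_with_pending_reqs > 0)
		return 1;

	/*
	 * No such group: only weights are checked. If all busy queues
	 * share the same weight, preemption (rather than idling) is
	 * relied upon to preserve mid-term fairness.
	 */
	return st->distinct_busy_weights > 1;
}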
3891
3892static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
3893                              enum bfqq_expiration reason)
3894{
3895        /*
3896         * If this bfqq is shared between multiple processes, check
3897         * to make sure that those processes are still issuing I/Os
3898         * within the mean seek distance. If not, it may be time to
3899         * break the queues apart again.
3900         */
3901        if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
3902                bfq_mark_bfqq_split_coop(bfqq);
3903
3904        /*
3905         * Consider queues with a higher finish virtual time than
3906         * bfqq. If idling_needed_for_service_guarantees(bfqq) returns
3907         * true, then bfqq's bandwidth would be violated if an
3908         * uncontrolled amount of I/O from these queues were
3909         * dispatched while bfqq is waiting for its new I/O to
3910         * arrive. This is exactly what may happen if this is a forced
3911         * expiration caused by a preemption attempt, and if bfqq is
3912         * not re-scheduled. To prevent this from happening, re-queue
3913         * bfqq if it needs I/O-dispatch plugging, even if it is
3914         * empty. By doing so, bfqq is granted to be served before the
3915         * above queues (provided that bfqq is of course eligible).
3916         */
3917        if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
3918            !(reason == BFQQE_PREEMPTED &&
3919              idling_needed_for_service_guarantees(bfqd, bfqq))) {
3920                if (bfqq->dispatched == 0)
3921                        /*
3922                         * Overloading budget_timeout field to store
3923                         * the time at which the queue remains with no
3924                         * backlog and no outstanding request; used by
3925                         * the weight-raising mechanism.
3926                         */
3927                        bfqq->budget_timeout = jiffies;
3928
3929                bfq_del_bfqq_busy(bfqq, true);
3930        } else {
3931                bfq_requeue_bfqq(bfqd, bfqq, true);
3932                /*
3933                 * Resort priority tree of potential close cooperators.
3934                 * See comments on bfq_pos_tree_add_move() for the unlikely().
3935                 */
3936                if (unlikely(!bfqd->nonrot_with_queueing &&
3937                             !RB_EMPTY_ROOT(&bfqq->sort_list)))
3938                        bfq_pos_tree_add_move(bfqd, bfqq);
3939        }
3940
3941        /*
3942         * All in-service entities must have been properly deactivated
3943         * or requeued before executing the next function, which
3944         * resets all in-service entities as no more in service. This
3945         * may cause bfqq to be freed. If this happens, the next
3946         * function returns true.
3947         */
3948        return __bfq_bfqd_reset_in_service(bfqd);
3949}
3950
3951/**
3952 * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
3953 * @bfqd: device data.
3954 * @bfqq: queue to update.
3955 * @reason: reason for expiration.
3956 *
3957 * Handle the feedback on @bfqq budget at queue expiration.
3958 * See the body for detailed comments.
3959 */
3960static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
3961                                     struct bfq_queue *bfqq,
3962                                     enum bfqq_expiration reason)
3963{
3964        struct request *next_rq;
3965        int budget, min_budget;
3966
3967        min_budget = bfq_min_budget(bfqd);
3968
3969        if (bfqq->wr_coeff == 1)
3970                budget = bfqq->max_budget;
3971        else /*
3972              * Use a constant, low budget for weight-raised queues,
3973              * to help achieve a low latency. Keep it slightly higher
3974              * than the minimum possible budget, to cause a little
3975              * bit fewer expirations.
3976              */
3977                budget = 2 * min_budget;
3978
3979        bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
3980                bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
3981        bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
3982                budget, bfq_min_budget(bfqd));
3983        bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
3984                bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
3985
3986        if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == 1) {
3987                switch (reason) {
3988                /*
3989                 * Caveat: in all the following cases we trade latency
3990                 * for throughput.
3991                 */
3992                case BFQQE_TOO_IDLE:
3993                        /*
3994                         * This is the only case where we may reduce
3995                         * the budget: if there is no request of the
3996                         * process still waiting for completion, then
3997                         * we assume (tentatively) that the timer has
3998                         * expired because the batch of requests of
3999                         * the process could have been served with a
4000                         * smaller budget.  Hence, betting that
4001                         * the process will behave in the same way when it
4002                         * becomes backlogged again, we reduce its
4003                         * next budget.  As long as we guess right,
4004                         * this budget cut reduces the latency
4005                         * experienced by the process.
4006                         *
4007                         * However, if there are still outstanding
4008                         * requests, then the process may have not yet
4009                         * issued its next request just because it is
4010                         * still waiting for the completion of some of
4011                         * the still outstanding ones.  So in this
4012                         * subcase we do not reduce its budget, on the
4013                         * contrary we increase it to possibly boost
4014                         * the throughput, as discussed in the
4015                         * comments to the BUDGET_TIMEOUT case.
4016                         */
4017                        if (bfqq->dispatched > 0) /* still outstanding reqs */
4018                                budget = min(budget * 2, bfqd->bfq_max_budget);
4019                        else {
4020                                if (budget > 5 * min_budget)
4021                                        budget -= 4 * min_budget;
4022                                else
4023                                        budget = min_budget;
4024                        }
4025                        break;
4026                case BFQQE_BUDGET_TIMEOUT:
4027                        /*
4028                         * We double the budget here because it gives
4029                         * the chance to boost the throughput if this
4030                         * is not a seeky process (and has bumped into
4031                         * this timeout because of, e.g., ZBR).
4032                         */
4033                        budget = min(budget * 2, bfqd->bfq_max_budget);
4034                        break;
4035                case BFQQE_BUDGET_EXHAUSTED:
4036                        /*
4037                         * The process still has backlog, and did not
4038                         * let either the budget timeout or the disk
4039                         * idling timeout expire. Hence it is not
4040                         * seeky, has a short thinktime and may be
4041                         * happy with a higher budget too. So
4042                         * definitely increase the budget of this good
4043                         * candidate to boost the disk throughput.
4044                         */
4045                        budget = min(budget * 4, bfqd->bfq_max_budget);
4046                        break;
4047                case BFQQE_NO_MORE_REQUESTS:
4048                        /*
4049                         * For queues that expire for this reason, it
4050                         * is particularly important to keep the
4051                         * budget close to the actual service they
4052                         * need. Doing so reduces the timestamp
4053                         * misalignment problem described in the
4054                         * comments in the body of
4055                         * __bfq_activate_entity. In fact, suppose
4056                         * that a queue systematically expires for
4057                         * BFQQE_NO_MORE_REQUESTS and presents a
4058                         * new request in time to enjoy timestamp
4059                         * back-shifting. The larger the budget of the
4060                         * queue is with respect to the service the
4061                         * queue actually requests in each service
4062                         * slot, the more times the queue can be
4063                         * reactivated with the same virtual finish
4064                         * time. It follows that, even if this finish
4065                         * time is pushed to the system virtual time
4066                         * to reduce the consequent timestamp
4067                         * misalignment, the queue unjustly enjoys for
4068                         * many re-activations a lower finish time
4069                         * than all newly activated queues.
4070                         *
4071                         * The service needed by bfqq is measured
4072                         * quite precisely by bfqq->entity.service.
4073                         * Since bfqq does not enjoy device idling,
4074                         * bfqq->entity.service is equal to the number
4075                         * of sectors that the process associated with
4076                         * bfqq requested to read/write before waiting
4077                         * for request completions, or blocking for
4078                         * other reasons.
4079                         */
4080                        budget = max_t(int, bfqq->entity.service, min_budget);
4081                        break;
4082                default:
4083                        return;
4084                }
4085        } else if (!bfq_bfqq_sync(bfqq)) {
4086                /*
4087                 * Async queues get always the maximum possible
4088                 * budget, as for them we do not care about latency
4089                 * (in addition, their ability to dispatch is limited
4090                 * by the charging factor).
4091                 */
4092                budget = bfqd->bfq_max_budget;
4093        }
4094
4095        bfqq->max_budget = budget;
4096
4097        if (bfqd->budgets_assigned >= bfq_stats_min_budgets &&
4098            !bfqd->bfq_user_max_budget)
4099                bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);
4100
4101        /*
4102         * If there is still backlog, then assign a new budget, making
4103         * sure that it is large enough for the next request.  Since
4104         * the finish time of bfqq must be kept in sync with the
4105         * budget, be sure to call __bfq_bfqq_expire() *after* this
4106         * update.
4107         *
4108         * If there is no backlog, then no need to update the budget;
4109         * it will be updated on the arrival of a new request.
4110         */
4111        next_rq = bfqq->next_rq;
4112        if (next_rq)
4113                bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
4114                                            bfq_serv_to_charge(next_rq, bfqq));
4115
4116        bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d",
4117                        next_rq ? blk_rq_sectors(next_rq) : 0,
4118                        bfqq->entity.budget);
4119}
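
/*
 * Illustrative sketch (not kernel code) of the budget feedback applied
 * by __bfq_bfqq_recalc_budget() above, for a sync, non-weight-raised
 * queue. Enum and parameter names are hypothetical; weight-raised and
 * async queues, and the clamping against user-set limits, are
 * simplified away.
 */
enum sketch_expiration {
	SKETCH_TOO_IDLE,
	SKETCH_BUDGET_TIMEOUT,
	SKETCH_BUDGET_EXHAUSTED,
	SKETCH_NO_MORE_REQUESTS,
};

static int sketch_next_budget(int cur_budget, int min_budget, int max_budget,
			      int measured_service, int outstanding_reqs,
			      enum sketch_expiration reason)
{
	int budget = cur_budget;

	switch (reason) {
	case SKETCH_TOO_IDLE:
		if (outstanding_reqs > 0)	  /* may just be waiting */
			budget = 2 * budget;
		else if (budget > 5 * min_budget) /* bet on a smaller batch */
			budget -= 4 * min_budget;
		else
			budget = min_budget;
		break;
	case SKETCH_BUDGET_TIMEOUT:
		budget = 2 * budget;		/* maybe a slow zone (ZBR) */
		break;
	case SKETCH_BUDGET_EXHAUSTED:
		budget = 4 * budget;		/* greedy, sequential queue */
		break;
	case SKETCH_NO_MORE_REQUESTS:
		/* keep the budget close to what is actually used */
		budget = measured_service > min_budget ?
			 measured_service : min_budget;
		break;
	}

	return budget < max_budget ? budget : max_budget;
}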
4120
4121/*
4122 * Return true if the process associated with bfqq is "slow". The slow
4123 * flag is used, in addition to the budget timeout, to reduce the
4124 * amount of service provided to seeky processes, and thus reduce
4125 * their chances to lower the throughput. More details in the comments
4126 * on the function bfq_bfqq_expire().
4127 *
4128 * An important observation is in order: as discussed in the comments
4129 * on the function bfq_update_peak_rate(), with devices with internal
4130 * queues, it is hard if ever possible to know when and for how long
4131 * an I/O request is processed by the device (apart from the trivial
4132 * I/O pattern where a new request is dispatched only after the
4133 * previous one has been completed). This makes it hard to evaluate
4134 * the real rate at which the I/O requests of each bfq_queue are
4135 * served.  In fact, for an I/O scheduler like BFQ, serving a
4136 * bfq_queue means just dispatching its requests during its service
4137 * slot (i.e., until the budget of the queue is exhausted, or the
4138 * queue remains idle, or, finally, a timeout fires). But, during the
4139 * service slot of a bfq_queue, around 100 ms at most, the device may
4140 * even still be processing requests of bfq_queues served in previous
4141 * service slots. On the opposite end, the requests of the in-service
4142 * bfq_queue may be completed after the service slot of the queue
4143 * finishes.
4144 *
4145 * Anyway, unless more sophisticated solutions are used
4146 * (where possible), the sum of the sizes of the requests dispatched
4147 * during the service slot of a bfq_queue is probably the only
4148 * approximation available for the service received by the bfq_queue
4149 * during its service slot. And this sum is the quantity used in this
4150 * function to evaluate the I/O speed of a process.
4151 */
4152static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
4153                                 bool compensate, unsigned long *delta_ms)
4154{
4155        ktime_t delta_ktime;
4156        u32 delta_usecs;
4157        bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */
4158
4159        if (!bfq_bfqq_sync(bfqq))
4160                return false;
4161
4162        if (compensate)
4163                delta_ktime = bfqd->last_idling_start;
4164        else
4165                delta_ktime = ktime_get();
4166        delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
4167        delta_usecs = ktime_to_us(delta_ktime);
4168
4169        /* don't use too short time intervals */
4170        if (delta_usecs < 1000) {
4171                if (blk_queue_nonrot(bfqd->queue))
4172                         /*
4173                          * give same worst-case guarantees as idling
4174                          * for seeky
4175                          */
4176                        *delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC;
4177                else /* charge at least one seek */
4178                        *delta_ms = bfq_slice_idle / NSEC_PER_MSEC;
4179
4180                return slow;
4181        }
4182
4183        *delta_ms = delta_usecs / USEC_PER_MSEC;
4184
4185        /*
4186         * Use only long (> 20ms) intervals to filter out excessive
4187         * spikes in service rate estimation.
4188         */
4189        if (delta_usecs > 20000) {
4190                /*
4191                 * Caveat for rotational devices: processes doing I/O
4192                 * in the slower disk zones tend to be slow(er) even
4193                 * if not seeky. In this respect, the estimated peak
4194                 * rate is likely to be an average over the disk
4195                 * surface. Accordingly, to not be too harsh with
4196                 * unlucky processes, a process is deemed slow only if
4197                 * its rate has been lower than half of the estimated
4198                 * peak rate.
4199                 */
4200                slow = bfqq->entity.service < bfqd->bfq_max_budget / 2;
4201        }
4202
4203        bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow);
4204
4205        return slow;
4206}
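
/*
 * Illustrative sketch (not kernel code) of the slowness criterion used
 * above: for intervals longer than 20 ms, a sync queue is deemed slow
 * if it consumed less than half of the reference maximum budget;
 * shorter intervals fall back to the seekiness flag. The *delta_ms
 * side effects of the real function are omitted, and the units
 * (microseconds for the interval, sectors for service and budget) are
 * assumptions of this sketch.
 */
static int sketch_queue_is_slow(unsigned int delta_usecs, int service_sectors,
				int max_budget_sectors, int seeky)
{
	if (delta_usecs <= 20000)
		return seeky;	/* too short/noisy: rely on seekiness */

	return service_sectors < max_budget_sectors / 2;
}

/*
 * Example: with max_budget = 16384 sectors, a queue that moved only
 * 6000 sectors during a 30 ms slot is deemed slow (6000 < 8192).
 */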
4207
4208/*
4209 * To be deemed as soft real-time, an application must meet two
4210 * requirements. First, the application must not require an average
4211 * bandwidth higher than the approximate bandwidth required to play back or
4212 * record a compressed high-definition video.
4213 * The next function is invoked on the completion of the last request of a
4214 * batch, to compute the next-start time instant, soft_rt_next_start, such
4215 * that, if the next request of the application does not arrive before
4216 * soft_rt_next_start, then the above requirement on the bandwidth is met.
4217 *
4218 * The second requirement is that the request pattern of the application is
4219 * isochronous, i.e., that, after issuing a request or a batch of requests,
4220 * the application stops issuing new requests until all its pending requests
4221 * have been completed. After that, the application may issue a new batch,
4222 * and so on.
4223 * For this reason the next function is invoked to compute
4224 * soft_rt_next_start only for applications that meet this requirement,
4225 * whereas soft_rt_next_start is set to infinity for applications that do
4226 * not.
4227 *
4228 * Unfortunately, even a greedy (i.e., I/O-bound) application may
4229 * happen to meet, occasionally or systematically, both the above
4230 * bandwidth and isochrony requirements. This may happen at least in
4231 * the following circumstances. First, if the CPU load is high. The
4232 * application may stop issuing requests while the CPUs are busy
4233 * serving other processes, then restart, then stop again for a while,
4234 * and so on. The other circumstances are related to the storage
4235 * device: the storage device is highly loaded or reaches a low-enough
4236 * throughput with the I/O of the application (e.g., because the I/O
4237 * is random and/or the device is slow). In all these cases, the
4238 * I/O of the application may be simply slowed down enough to meet
4239 * the bandwidth and isochrony requirements. To reduce the probability
4240 * that greedy applications are deemed as soft real-time in these
4241 * corner cases, a further rule is used in the computation of
4242 * soft_rt_next_start: the return value of this function is forced to
4243 * be higher than the maximum between the following two quantities.
4244 *
4245 * (a) Current time plus: (1) the maximum time for which the arrival
4246 *     of a request is waited for when a sync queue becomes idle,
4247 *     namely bfqd->bfq_slice_idle, and (2) a few extra jiffies. We
4248 *     postpone for a moment the reason for adding a few extra
4249 *     jiffies; we get back to it after next item (b).  Lower-bounding
4250 *     the return value of this function with the current time plus
4251 *     bfqd->bfq_slice_idle tends to filter out greedy applications,
4252 *     because the latter issue their next request as soon as possible
4253 *     after the last one has been completed. In contrast, a soft
4254 *     real-time application spends some time processing data, after a
4255 *     batch of its requests has been completed.
4256 *
4257 * (b) Current value of bfqq->soft_rt_next_start. As pointed out
4258 *     above, greedy applications may happen to meet both the
4259 *     bandwidth and isochrony requirements under heavy CPU or
4260 *     storage-device load. In more detail, in these scenarios, these
4261 *     applications happen, only for limited time periods, to do I/O
4262 *     slowly enough to meet all the requirements described so far,
4263 *     including the filtering in above item (a). These slow-speed
4264 *     time intervals are usually interspersed between other time
4265 *     intervals during which these applications do I/O at a very high
4266 *     speed. Fortunately, exactly because of the high speed of the
4267 *     I/O in the high-speed intervals, the values returned by this
4268 *     function happen to be so high, near the end of any such
4269 *     high-speed interval, that they are likely to fall *after* the end of
4270 *     the low-speed time interval that follows. These high values are
4271 *     stored in bfqq->soft_rt_next_start after each invocation of
4272 *     this function. As a consequence, if the last value of
4273 *     bfqq->soft_rt_next_start is constantly used to lower-bound the
4274 *     next value that this function may return, then, from the very
4275 *     beginning of a low-speed interval, bfqq->soft_rt_next_start is
4276 *     likely to be constantly kept so high that any I/O request
4277 *     issued during the low-speed interval is considered as arriving
4278 *     too soon for the application to be deemed as soft
4279 *     real-time. Then, in the high-speed interval that follows, the
4280 *     application will not be deemed as soft real-time, just because
4281 *     it will do I/O at a high speed. And so on.
4282 *
4283 * Getting back to the filtering in item (a), in the following two
4284 * cases this filtering might be easily passed by a greedy
4285 * application, if the reference quantity was just
4286 * bfqd->bfq_slice_idle:
4287 * 1) HZ is so low that the duration of a jiffy is comparable to or
4288 *    higher than bfqd->bfq_slice_idle. This happens, e.g., on slow
4289 *    devices with HZ=100. The time granularity may be so coarse
4290 *    that the approximation, in jiffies, of bfqd->bfq_slice_idle
4291 *    is rather lower than the exact value.
4292 * 2) jiffies, instead of increasing at a constant rate, may stop increasing
4293 *    for a while, then suddenly 'jump' by several units to recover the lost
4294 *    increments. This seems to happen, e.g., inside virtual machines.
4295 * To address this issue, in the filtering in (a) we do not use as a
4296 * reference time interval just bfqd->bfq_slice_idle, but
4297 * bfqd->bfq_slice_idle plus a few jiffies. In particular, we add the
4298 * minimum number of jiffies for which the filter seems to be quite
4299 * precise also in embedded systems and KVM/QEMU virtual machines.
4300 */
4301static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
4302                                                struct bfq_queue *bfqq)
4303{
4304        return max3(bfqq->soft_rt_next_start,
4305                    bfqq->last_idle_bklogged +
4306                    HZ * bfqq->service_from_backlogged /
4307                    bfqd->bfq_wr_max_softrt_rate,
4308                    jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
4309}
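
/*
 * Illustrative sketch (not kernel code): the same lower bound as in
 * bfq_bfqq_softrt_next_start() above, written with plain types and
 * example numbers (jiffy wrap-around is ignored here). For instance,
 * with HZ = 250, 2000 sectors served since the last backlog and a
 * reference soft-rt rate of 7000 sectors/s, the bandwidth term adds
 * about 71 jiffies (250 * 2000 / 7000) to the last_idle_bklogged
 * timestamp.
 */
static unsigned long
sketch_softrt_next_start(unsigned long cur_next_start,
			 unsigned long last_idle_bklogged,
			 unsigned long service_from_backlogged,
			 unsigned long max_softrt_rate,
			 unsigned long now_jiffies,
			 unsigned long slice_idle_jiffies,
			 unsigned long hz)
{
	unsigned long bw_bound = last_idle_bklogged +
		hz * service_from_backlogged / max_softrt_rate;
	unsigned long idle_bound = now_jiffies + slice_idle_jiffies + 4;
	unsigned long res = cur_next_start;

	/* take the maximum of the three bounds, as max3() does above */
	if (bw_bound > res)
		res = bw_bound;
	if (idle_bound > res)
		res = idle_bound;
	return res;
}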
4310
4311/**
4312 * bfq_bfqq_expire - expire a queue.
4313 * @bfqd: device owning the queue.
4314 * @bfqq: the queue to expire.
4315 * @compensate: if true, compensate for the time spent idling.
4316 * @reason: the reason causing the expiration.
4317 *
4318 * If the process associated with bfqq does slow I/O (e.g., because it
4319 * issues random requests), we charge bfqq with the time it has been
4320 * in service instead of the service it has received (see
4321 * bfq_bfqq_charge_time for details on how this goal is achieved). As
4322 * a consequence, bfqq will typically get higher timestamps upon
4323 * reactivation, and hence it will be rescheduled as if it had
4324 * received more service than what it has actually received. In the
4325 * end, bfqq receives less service in proportion to how slowly its
4326 * associated process consumes its budgets (and hence how seriously it
4327 * tends to lower the throughput). In addition, this time-charging
4328 * strategy guarantees time fairness among slow processes. In
4329 * contrast, if the process associated with bfqq is not slow, we
4330 * charge bfqq exactly with the service it has received.
4331 *
4332 * Charging time to the first type of queues and the exact service to
4333 * the other has the effect of using the WF2Q+ policy to schedule the
4334 * former on a timeslice basis, without violating service domain
4335 * guarantees among the latter.
4336 */
4337void bfq_bfqq_expire(struct bfq_data *bfqd,
4338                     struct bfq_queue *bfqq,
4339                     bool compensate,
4340                     enum bfqq_expiration reason)
4341{
4342        bool slow;
4343        unsigned long delta = 0;
4344        struct bfq_entity *entity = &bfqq->entity;
4345
4346        /*
4347         * Check whether the process is slow (see bfq_bfqq_is_slow).
4348         */
4349        slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, &delta);
4350
4351        /*
4352         * As above explained, charge slow (typically seeky) and
4353         * timed-out queues with the time and not the service
4354         * received, to favor sequential workloads.
4355         *
4356         * Processes doing I/O in the slower disk zones will tend to
4357         * be slow(er) even if not seeky. Therefore, since the
4358         * estimated peak rate is actually an average over the disk
4359         * surface, these processes may timeout just for bad luck. To
4360         * avoid punishing them, do not charge time to processes that
4361         * succeeded in consuming at least 2/3 of their budget. This
4362         * allows BFQ to preserve enough elasticity to still perform
4363         * bandwidth, and not time, distribution with slightly unlucky
4364         * or quasi-sequential processes.
4365         */
4366        if (bfqq->wr_coeff == 1 &&
4367            (slow ||
4368             (reason == BFQQE_BUDGET_TIMEOUT &&
4369              bfq_bfqq_budget_left(bfqq) >=  entity->budget / 3)))
4370                bfq_bfqq_charge_time(bfqd, bfqq, delta);
4371
4372        if (bfqd->low_latency && bfqq->wr_coeff == 1)
4373                bfqq->last_wr_start_finish = jiffies;
4374
4375        if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
4376            RB_EMPTY_ROOT(&bfqq->sort_list)) {
4377                /*
4378                 * If we get here, and there are no outstanding
4379                 * requests, then the request pattern is isochronous
4380                 * (see the comments on the function
4381                 * bfq_bfqq_softrt_next_start()). Therefore we can
4382                 * compute soft_rt_next_start.
4383                 *
4384                 * If, instead, the queue still has outstanding
4385                 * requests, then we have to wait for the completion
4386                 * of all the outstanding requests to discover whether
4387                 * the request pattern is actually isochronous.
4388                 */
4389                if (bfqq->dispatched == 0)
4390                        bfqq->soft_rt_next_start =
4391                                bfq_bfqq_softrt_next_start(bfqd, bfqq);
4392                else if (bfqq->dispatched > 0) {
4393                        /*
4394                         * Schedule an update of soft_rt_next_start to when
4395                         * the task may be discovered to be isochronous.
4396                         */
4397                        bfq_mark_bfqq_softrt_update(bfqq);
4398                }
4399        }
4400
4401        bfq_log_bfqq(bfqd, bfqq,
4402                "expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,
4403                slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
4404
4405        /*
4406         * bfqq expired, so no total service time needs to be computed
4407         * any longer: reset state machine for measuring total service
4408         * times.
4409         */
4410        bfqd->rqs_injected = bfqd->wait_dispatch = false;
4411        bfqd->waited_rq = NULL;
4412
4413        /*
4414         * Increase, decrease or leave budget unchanged according to
4415         * reason.
4416         */
4417        __bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
4418        if (__bfq_bfqq_expire(bfqd, bfqq, reason))
4419                /* bfqq is gone, no more actions on it */
4420                return;
4421
4422        /* mark bfqq as waiting a request only if a bic still points to it */
4423        if (!bfq_bfqq_busy(bfqq) &&
4424            reason != BFQQE_BUDGET_TIMEOUT &&
4425            reason != BFQQE_BUDGET_EXHAUSTED) {
4426                bfq_mark_bfqq_non_blocking_wait_rq(bfqq);
4427                /*
4428                 * Not setting service to 0, because, if the next rq
4429                 * arrives in time, the queue will go on receiving
4430                 * service with this same budget (as if it never expired)
4431                 */
4432        } else
4433                entity->service = 0;
4434
4435        /*
4436         * Reset the received-service counter for every parent entity.
4437         * Differently from what happens with bfqq->entity.service,
4438         * the resetting of this counter never needs to be postponed
4439         * for parent entities. In fact, in case bfqq may have a
4440         * chance to go on being served using the last, partially
4441         * consumed budget, bfqq->entity.service needs to be kept,
4442         * because if bfqq then actually goes on being served using
4443         * the same budget, the last value of bfqq->entity.service is
4444         * needed to properly decrement bfqq->entity.budget by the
4445         * portion already consumed. In contrast, it is not necessary
4446         * to keep entity->service for parent entities too, because
4447         * the bubble up of the new value of bfqq->entity.budget will
4448         * make sure that the budgets of parent entities are correct,
4449         * even in case bfqq and thus parent entities go on receiving
4450         * service with the same budget.
4451         */
4452        entity = entity->parent;
4453        for_each_entity(entity)
4454                entity->service = 0;
4455}
4456
4457/*
4458 * Budget timeout is not implemented through a dedicated timer, but
4459 * just checked on request arrivals and completions, as well as on
4460 * idle timer expirations.
4461 */
4462static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
4463{
4464        return time_is_before_eq_jiffies(bfqq->budget_timeout);
4465}
4466
4467/*
4468 * If we expire a queue that is actively waiting (i.e., with the
4469 * device idled) for the arrival of a new request, then we may incur
4470 * the timestamp misalignment problem described in the body of the
4471 * function __bfq_activate_entity. Hence we return true only if this
4472 * condition does not hold, or if the queue is slow enough to deserve
4473 * only to be kicked off for preserving a high throughput.
4474 */
4475static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
4476{
4477        bfq_log_bfqq(bfqq->bfqd, bfqq,
4478                "may_budget_timeout: wait_request %d left %d timeout %d",
4479                bfq_bfqq_wait_request(bfqq),
4480                        bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
4481                bfq_bfqq_budget_timeout(bfqq));
4482
4483        return (!bfq_bfqq_wait_request(bfqq) ||
4484                bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
4485                &&
4486                bfq_bfqq_budget_timeout(bfqq);
4487}
4488
4489static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
4490                                             struct bfq_queue *bfqq)
4491{
4492        bool rot_without_queueing =
4493                !blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
4494                bfqq_sequential_and_IO_bound,
4495                idling_boosts_thr;
4496
4497        /* No point in idling for bfqq if it won't get requests any longer */
4498        if (unlikely(!bfqq_process_refs(bfqq)))
4499                return false;
4500
4501        bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
4502                bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
4503
4504        /*
4505         * The next variable takes into account the cases where idling
4506         * boosts the throughput.
4507         *
4508         * The value of the variable is computed considering, first, that
4509         * idling is virtually always beneficial for the throughput if:
4510         * (a) the device is not NCQ-capable and rotational, or
4511         * (b) regardless of the presence of NCQ, the device is rotational and
4512         *     the request pattern for bfqq is I/O-bound and sequential, or
4513         * (c) regardless of whether it is rotational, the device is
4514         *     not NCQ-capable and the request pattern for bfqq is
4515         *     I/O-bound and sequential.
4516         *
4517         * Secondly, and in contrast to the above item (b), idling an
4518         * NCQ-capable flash-based device would not boost the
4519         * throughput even with sequential I/O; rather it would lower
4520         * the throughput in proportion to how fast the device
4521         * is. Accordingly, the next variable is true if any of the
4522         * above conditions (a), (b) or (c) is true, and, in
4523         * particular, happens to be false if bfqd is an NCQ-capable
4524         * flash-based device.
4525         */
4526        idling_boosts_thr = rot_without_queueing ||
4527                ((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&
4528                 bfqq_sequential_and_IO_bound);
4529
4530        /*
4531         * The return value of this function is equal to that of
4532         * idling_boosts_thr, unless a special case holds. In this
4533         * special case, described below, idling may cause problems to
4534         * weight-raised queues.
4535         *
4536         * When the request pool is saturated (e.g., in the presence
4537         * of write hogs), if the processes associated with
4538         * non-weight-raised queues ask for requests at a lower rate,
4539         * then processes associated with weight-raised queues have a
4540         * higher probability to get a request from the pool
4541         * immediately (or at least soon) when they need one. Thus
4542         * they have a higher probability to actually get a fraction
4543         * of the device throughput proportional to their high
4544         * weight. This is especially true with NCQ-capable drives,
4545         * which enqueue several requests in advance, and further
4546         * reorder internally-queued requests.
4547         *
4548         * For this reason, we force the return value to false if
4549         * there are weight-raised busy queues. In this case, and if
4550         * bfqq is not weight-raised, this guarantees that the device
4551         * is not idled for bfqq (if, instead, bfqq is weight-raised,
4552         * then idling will be guaranteed by another variable, see
4553         * below). Combined with the timestamping rules of BFQ (see
4554         * [1] for details), this behavior causes bfqq, and hence any
4555         * sync non-weight-raised queue, to get a lower number of
4556         * requests served, and thus to ask for a lower number of
4557         * requests from the request pool, before the busy
4558         * weight-raised queues get served again. This often mitigates
4559         * starvation problems in the presence of heavy write
4560         * workloads and NCQ, thereby guaranteeing a higher
4561         * application and system responsiveness in these hostile
4562         * scenarios.
4563         */
4564        return idling_boosts_thr &&