                                Block IO Controller
                                ===================
Overview
========
cgroup subsys "blkio" implements the block IO controller. There is a need
for various kinds of IO control policies (like proportional BW, max BW)
both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
The plan is to use the same cgroup based management interface for the blkio
controller and, based on user options, switch IO policies in the background.

Currently two IO control policies are implemented. The first one is
proportional weight time based division of disk policy. It is implemented
in CFQ. Hence this policy takes effect only on leaf nodes when CFQ is being
used. The second one is the throttling policy, which can be used to specify
an upper IO rate limit on devices. This policy is implemented in the generic
block layer and can be used on leaf nodes as well as on higher level logical
devices like device mapper.

HOWTO
=====
Proportional Weight division of bandwidth
-----------------------------------------
You can do a very simple test by running two dd threads in two different
cgroups. Here is what you can do.

- Enable Block IO controller
        CONFIG_BLK_CGROUP=y

- Enable group scheduling in CFQ
        CONFIG_CFQ_GROUP_IOSCHED=y

- Compile and boot into the kernel and mount the IO controller (blkio); see
  cgroups.txt, Why are cgroups needed?.

        mount -t tmpfs cgroup_root /sys/fs/cgroup
        mkdir /sys/fs/cgroup/blkio
        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Create two cgroups
        mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2

- Set weights of group test1 and test2
        echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
        echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight

- Create two files of the same size (say 512MB each) on the same disk
  (file1, file2) and launch two dd threads in different cgroups to read
  those files.

        sync
        echo 3 > /proc/sys/vm/drop_caches

        dd if=/mnt/sdb/zerofile1 of=/dev/null &
        echo $! > /sys/fs/cgroup/blkio/test1/tasks
        cat /sys/fs/cgroup/blkio/test1/tasks

        dd if=/mnt/sdb/zerofile2 of=/dev/null &
        echo $! > /sys/fs/cgroup/blkio/test2/tasks
        cat /sys/fs/cgroup/blkio/test2/tasks

- At a macro level, the first dd should finish first. To get more precise
  data, keep looking (with the help of a script, as sketched below) at the
  blkio.time and blkio.sectors files of both the test1 and test2 groups.
  This will tell how much disk time (in milliseconds) each group got and
  how many sectors each group dispatched to the disk. We provide fairness
  in terms of disk time, so ideally blkio.time of the cgroups should be in
  proportion to the weight.
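
  A very simple watch loop (a sketch; it assumes the cgroup paths created
  in the steps above) could look like this:

        #!/bin/sh
        # Print disk time (ms) and sectors dispatched for both groups
        # once per second. Each stat line is "<major>:<minor> <value>".
        while true; do
                for g in test1 test2; do
                        echo "== $g =="
                        cat /sys/fs/cgroup/blkio/$g/blkio.time
                        cat /sys/fs/cgroup/blkio/$g/blkio.sectors
                done
                sleep 1
        done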

Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller
        CONFIG_BLK_CGROUP=y

- Enable throttling in block layer
        CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on a particular device for the root group. The
  format for the policy is "<major>:<minor>  <bytes_per_second>".

        echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  The above will put a limit of 1MB/second on reads happening for the root
  group on the device having major/minor number 8:16.

- Run dd to read a file and see if the rate is throttled to 1MB/s or not.

        # dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 iflag=direct
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

  Limits for writes can be set using the blkio.throttle.write_bps_device file.
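
  For example, to cap writes to the same device at 2MB/s (an illustrative
  value):

        echo "8:16  2097152" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device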

Hierarchical Cgroups
====================

Both CFQ and throttling implement hierarchy support; however,
throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from cgroup side, which currently is a development option and
not publicly available.

If somebody created a hierarchy like the following,

                        root
                        /  \
                     test1 test2
                        |
                     test3

CFQ by default and throttling with "sane_behavior" will handle the
hierarchy correctly.  For details on CFQ hierarchy support, refer to
Documentation/block/cfq-iosched.txt.  For throttling, all limits apply
to the whole subtree while all statistics are local to the IOs
directly generated by tasks in that cgroup.
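
For example, such a hierarchy could be set up as follows (a sketch; the
weight values are arbitrary):

        mkdir /sys/fs/cgroup/blkio/test1 /sys/fs/cgroup/blkio/test2
        mkdir /sys/fs/cgroup/blkio/test1/test3
        echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
        echo 500 > /sys/fs/cgroup/blkio/test1/test3/blkio.weight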

Throttling without "sane_behavior" enabled from cgroup side will
practically treat all groups as if they were at the same level, as if the
hierarchy looked like the following.

                                pivot
                             /  /   \  \
                        root  test1 test2  test3

Various user visible config options
===================================
CONFIG_BLK_CGROUP
        - Block IO controller.

CONFIG_DEBUG_BLK_CGROUP
        - Debug help. Right now some additional stats files show up in the
          cgroup if this option is enabled.

CONFIG_CFQ_GROUP_IOSCHED
        - Enables group scheduling in CFQ.

CONFIG_BLK_DEV_THROTTLING
        - Enable block device throttling support in block layer.

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
        - Specifies per cgroup weight. This is the default weight of the
          group on all devices until and unless overridden by a per-device
          rule (see blkio.weight_device).
          Currently the allowed range of weights is from 10 to 1000.

- blkio.weight_device
        - One can specify per cgroup per device rules using this interface.
          These rules override the default value of group weight as specified
          by blkio.weight.

          Following is the format.

          # echo dev_maj:dev_minor weight > blkio.weight_device
          Configure weight=300 on /dev/sdb (8:16) in this cgroup
          # echo 8:16 300 > blkio.weight_device
          # cat blkio.weight_device
          dev     weight
          8:16    300

          Configure weight=500 on /dev/sda (8:0) in this cgroup
          # echo 8:0 500 > blkio.weight_device
          # cat blkio.weight_device
          dev     weight
          8:0     500
          8:16    300

          Remove specific weight for /dev/sda in this cgroup
          # echo 8:0 0 > blkio.weight_device
          # cat blkio.weight_device
          dev     weight
          8:16    300

- blkio.leaf_weight[_device]
        - Equivalents of blkio.weight[_device] for the purpose of
          deciding how much weight tasks in the given cgroup have while
          competing with the cgroup's child cgroups. For details,
          please refer to Documentation/block/cfq-iosched.txt.
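
          For example, to make tasks in this cgroup compete with this
          cgroup's children at weight 500 (an arbitrary value):

          # echo 500 > blkio.leaf_weight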

- blkio.time
        - disk time allocated to cgroup per device in milliseconds. First
          two fields specify the major and minor number of the device and
          third field specifies the disk time allocated to group in
          milliseconds.

- blkio.sectors
        - number of sectors transferred to/from disk by the group. First
          two fields specify the major and minor number of the device and
          third field specifies the number of sectors transferred by the
          group to/from the device.

- blkio.io_service_bytes
        - Number of bytes transferred to/from the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. First two fields specify the major and minor number of the
          device, third field specifies the operation type and the fourth field
          specifies the number of bytes.
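
          For illustration, the output might look as follows (made-up
          values; one line per device and operation type, plus totals):

          # cat blkio.io_service_bytes
          8:16 Read 1572864
          8:16 Write 0
          8:16 Sync 1572864
          8:16 Async 0
          8:16 Total 1572864
          Total 1572864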

- blkio.io_serviced
        - Number of IOs completed to/from the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. First two fields specify the major and minor number of the
          device, third field specifies the operation type and the fourth field
          specifies the number of IOs.

- blkio.io_service_time
        - Total amount of time between request dispatch and request completion
          for the IOs done by this cgroup. This is in nanoseconds to make it
          meaningful for flash devices too. For devices with queue depth of 1,
          this time represents the actual service time. When queue_depth > 1,
          that is no longer true as requests may be served out of order. This
          may cause the service time for a given IO to include the service time
          of multiple IOs when served out of order which may result in total
          io_service_time > actual time elapsed. This time is further divided by
          the type of operation - read or write, sync or async. First two fields
          specify the major and minor number of the device, third field
          specifies the operation type and the fourth field specifies the
          io_service_time in ns.

- blkio.io_wait_time
        - Total amount of time the IOs for this cgroup spent waiting in the
          scheduler queues for service. This can be greater than the total time
          elapsed since it is cumulative io_wait_time for all IOs. It is not a
          measure of total time the cgroup spent waiting but rather a measure of
          the wait_time for its individual IOs. For devices with queue_depth > 1
          this metric does not include the time spent waiting for service once
          the IO is dispatched to the device but till it actually gets serviced
          (there might be a time lag here due to re-ordering of requests by the
          device). This is in nanoseconds to make it meaningful for flash
          devices too. This time is further divided by the type of operation -
          read or write, sync or async. First two fields specify the major and
          minor number of the device, third field specifies the operation type
          and the fourth field specifies the io_wait_time in ns.

- blkio.io_merged
        - Total number of bios/requests merged into requests belonging to this
          cgroup. This is further divided by the type of operation - read or
          write, sync or async.

- blkio.io_queued
        - Total number of requests queued up at any given instant for this
          cgroup. This is further divided by the type of operation - read or
          write, sync or async.

- blkio.avg_queue_size
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          The average queue size for this cgroup over the entire time of this
          cgroup's existence. Queue size samples are taken each time one of the
          queues of this cgroup gets a timeslice.

- blkio.group_wait_time
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          This is the amount of time the cgroup had to wait since it became busy
          (i.e., went from 0 to 1 request queued) to get a timeslice for one of
          its queues. This is different from the io_wait_time which is the
          cumulative total of the amount of time spent by each IO in that cgroup
          waiting in the scheduler queue. This is in nanoseconds. If this is
          read when the cgroup is in a waiting (for timeslice) state, the stat
          will only report the group_wait_time accumulated till the last time it
          got a timeslice and will not include the current delta.

- blkio.empty_time
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          This is the amount of time a cgroup spends without any pending
          requests when not being served, i.e., it does not include any time
          spent idling for one of the queues of the cgroup. This is in
          nanoseconds. If this is read when the cgroup is in an empty state,
          the stat will only report the empty_time accumulated till the last
          time it had a pending request and will not include the current delta.

- blkio.idle_time
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
          This is the amount of time spent by the IO scheduler idling for a
          given cgroup in anticipation of a better request than the existing ones
          from other queues/cgroups. This is in nanoseconds. If this is read
          when the cgroup is in an idling state, the stat will only report the
          idle_time accumulated till the last idle period and will not include
          the current delta.

- blkio.dequeue
        - Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
          gives the statistics about how many times a group was dequeued
          from the service tree of the device. First two fields specify the
          major and minor number of the device and third field specifies the
          number of times a group was dequeued from a particular device.

- blkio.*_recursive
        - Recursive version of various stats. These files show the
          same information as their non-recursive counterparts but
          include stats from all the descendant cgroups.
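
          For example, with the hierarchy from the "Hierarchical Cgroups"
          section above, test1's recursive stats also include the IO done
          by its descendant test3:

          # cat /sys/fs/cgroup/blkio/test1/blkio.io_service_bytes_recursive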

Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
        - Specifies upper limit on READ rate from the device. IO rate is
          specified in bytes per second. Rules are per device. Following is
          the format.

  echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

- blkio.throttle.write_bps_device
        - Specifies upper limit on WRITE rate to the device. IO rate is
          specified in bytes per second. Rules are per device. Following is
          the format.

  echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

- blkio.throttle.read_iops_device
        - Specifies upper limit on READ rate from the device. IO rate is
          specified in IOs per second. Rules are per device. Following is
          the format.

  echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

- blkio.throttle.write_iops_device
        - Specifies upper limit on WRITE rate to the device. IO rate is
          specified in IOs per second. Rules are per device. Following is
          the format.

  echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Note: If both BW and IOPS rules are specified for a device, then IO is
      subjected to both the constraints.
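
      For example, to limit reads on device 8:16 to 1MB/s and 100 IOPS at
      the same time (illustrative values):

      echo "8:16  1048576" > /cgrp/blkio.throttle.read_bps_device
      echo "8:16  100" > /cgrp/blkio.throttle.read_iops_device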

- blkio.throttle.io_serviced
        - Number of IOs (bio) completed to/from the disk by the group (as
          seen by throttling policy). These are further divided by the type
          of operation - read or write, sync or async. First two fields specify
          the major and minor number of the device, third field specifies the
          operation type and the fourth field specifies the number of IOs.

          blkio.io_serviced does accounting as seen by CFQ and counts are in
          number of requests (struct request). On the other hand,
          blkio.throttle.io_serviced counts the number of IOs in terms of the
          number of bios as seen by the throttling policy. These bios can
          later be merged by the elevator, so the total number of requests
          completed can be smaller.

- blkio.throttle.io_service_bytes
        - Number of bytes transferred to/from the disk by the group. These
          are further divided by the type of operation - read or write, sync
          or async. First two fields specify the major and minor number of the
          device, third field specifies the operation type and the fourth field
          specifies the number of bytes.

          These numbers should be roughly the same as blkio.io_service_bytes
          as updated by CFQ. The difference between the two is that
          blkio.io_service_bytes will not be updated if CFQ is not operating
          on the request queue.

Common files among various policies
-----------------------------------
- blkio.reset_stats
        - Writing an int to this file will result in resetting all the stats
          for that cgroup.
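
          For example:

          # echo 1 > /sys/fs/cgroup/blkio/test1/blkio.reset_stats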

CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/slice_idle
------------------------------------------
On faster hardware CFQ can be slow, especially with sequential workloads.
This happens because CFQ idles on a single queue, and a single queue might
not drive deep enough request queue depths to keep the storage busy. In such
scenarios one can try setting slice_idle=0; that switches CFQ to IOPS
(IO operations per second) mode on NCQ-supporting hardware.

That means CFQ will not idle between cfq queues of a cfq group and hence be
able to drive higher queue depths and achieve better throughput. That also
means that cfq provides fairness among groups in terms of IOPS and not in
terms of disk time.
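
For example, to switch a disk (say sdb) to IOPS mode:

        echo 0 > /sys/block/sdb/queue/iosched/slice_idle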

/sys/block/<disk>/queue/iosched/group_idle
------------------------------------------
If one disables idling on individual cfq queues and cfq service trees by
setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
on the group in an attempt to provide fairness among groups.

By default group_idle is the same as slice_idle and does not do anything
if slice_idle is enabled.

One can experience an overall throughput drop if one has created multiple
groups and the applications placed in those groups are not driving enough
IO to keep the disk busy. In that case set group_idle=0, and CFQ will not
idle on individual groups and throughput should improve.
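
For example (again assuming the disk is sdb):

        echo 0 > /sys/block/sdb/queue/iosched/group_idle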

What works
==========
- Currently only sync IO queues are supported. All the buffered writes are
  still system wide and not per group. Hence we will not see service
  differentiation for buffered writes between groups.