                                CPUSETS

Copyright (C) 2004 BULL SA.
Written by

Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson
Modified by Christoph Lameter
Modified by Paul Menage
Modified by Hidetoshi Seto

1. Cpusets
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 What are exclusive cpusets ?
  1.5 What is memory_pressure ?
  1.6 What is memory spread ?
  1.7 What is sched_load_balance ?
  1.8 What is sched_relax_domain_level ?
  1.9 How do I use cpusets ?
2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
  2.3 Setting flags
  2.4 Attaching processes
3. Questions
4. Contact

1. Cpusets

1.1 What are cpusets ?

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.   In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in

Requests by a task to include CPUs in its CPU affinity mask, using
the sched_setaffinity(2) system call, and to include Memory Nodes in
its memory policy, using the mbind(2) and set_mempolicy(2) system
calls, are both filtered through that task's cpuset, which removes any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.
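
The filtering described above amounts to a set intersection.  The
following is a minimal Python sketch of that idea only, not the kernel
code; the function name filter_affinity and the example CPU numbers
are made up for illustration.

```python
def filter_affinity(requested: set[int], cpuset_cpus: set[int]) -> set[int]:
    """Return the requested CPUs that the task's cpuset also allows.

    Illustrative model only: the kernel applies the same kind of mask
    intersection, but on cpumask bitmaps rather than Python sets.
    """
    return requested & cpuset_cpus


# Hypothetical example: the task asks for CPUs 0, 1, 8 and 9, but its
# cpuset only contains CPUs 0-3.
allowed = filter_affinity({0, 1, 8, 9}, {0, 1, 2, 3})
print(sorted(allowed))  # [0, 1]
```

If the intersection is empty, the request cannot be satisfied; the
sched_setaffinity(2) manual page documents EINVAL for that case.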

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.

1.2 Why are cpusets needed ?

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.

1.3 How are cpusets implemented ?

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.
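
The exclusivity rule in the list above can be modelled in a few lines.
The following Python sketch only illustrates the rule as stated here
and is not how the kernel checks it; the path-style cpuset names, the
helper names is_related and exclusive_conflicts, and the CPU numbers
are all assumptions made for the example.

```python
def is_related(a: str, b: str) -> bool:
    """True if one cpuset path is an ancestor of, or equal to, the other."""
    a = a.rstrip("/") + "/"
    b = b.rstrip("/") + "/"
    return a.startswith(b) or b.startswith(a)


def exclusive_conflicts(cpusets: dict[str, set[int]], exclusive: str) -> list[str]:
    """List cpusets that would break exclusivity of the named cpuset.

    A conflict is any cpuset that is neither an ancestor nor a
    descendant of the exclusive one and still shares a CPU with it.
    """
    mine = cpusets[exclusive]
    return sorted(
        name
        for name, cpus in cpusets.items()
        if not is_related(name, exclusive) and cpus & mine
    )


# Hypothetical hierarchy: "/" is the root cpuset, "/a/x" is a child of "/a".
cpusets = {"/": {0, 1, 2, 3}, "/a": {0, 1}, "/a/x": {0}, "/b": {1, 2}}
print(exclusive_conflicts(cpusets, "/a"))  # ['/b']
```

Here "/b" shares CPU 1 with "/a" without being its ancestor or
descendant, so "/a" could not be marked exclusive in this layout.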

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63
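
The two formats carry the same information.  As a rough illustration,
the Python sketch below decodes both and checks that they agree for
the example values above.  The helper names parse_mask and parse_list
are made up for this example, and parse_mask assumes every
comma-separated group after the first is exactly eight hex digits
wide, as in the sample output.

```python
def parse_mask(mask: str) -> set[int]:
    """Decode a comma-separated hex mask into a set of CPU or node numbers.

    Groups are 32-bit words with the most significant word first, so
    dropping the commas yields one large hexadecimal number.
    """
    value = int(mask.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text: str) -> set[int]:
    """Decode a list such as "0-3,8" into a set of CPU or node numbers."""
    ids: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        ids.update(range(int(first), int(last or first) + 1))
    return ids


print(parse_mask("ffffffff,ffffffff,ffffffff,ffffffff") == parse_list("0-127"))  # True
print(parse_mask("ffffffff,ffffffff") == parse_list("0-63"))  # True
```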

Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag:  is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:
 - cpuset.memory_pressure_enabled flag: compute memory_pressure?
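
Because each cpuset is just a directory of small text files, its
settings can be inspected with ordinary file reads.  The Python sketch
below is one possible way to do that; the function name read_cpuset is
made up for this example, and it assumes the hierarchy is mounted at
/dev/cpuset as mentioned earlier.  On a system where nothing is
mounted there it simply returns an empty result.

```python
from pathlib import Path


def read_cpuset(directory: str) -> dict[str, str]:
    """Return {file name: contents} for the cpuset.* files in a directory."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    result: dict[str, str] = {}
    for path in sorted(root.glob("cpuset.*")):
        try:
            result[path.name] = path.read_text().strip()
        except OSError:
            # Skip files that cannot be read with the current permissions.
            continue
    return result


# Assumed mount point; adjust to wherever the hierarchy is mounted.
for name, value in read_cpuset("/dev/cpuset").items():
    print(f"{name}: {value}")
```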

New cpusets are created using the mkdir system call or shell
