linux-old/Documentation/smp.tex
<<
>>
Prefs
   1From: michael@Physik.Uni-Dortmund.DE (Michael Dirkmann)
   2
   3thanks for your information. Attached is the tex-code of your
   4SMP-documentation :
   5-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
   6\documentclass[]{article}
   7\parindent0.0cm
   8\parskip0.2cm
   9
  10\begin{document}
  11
  12\begin{center}
  13\LARGE \bf
  14An Implementation Of Multiprocessor Linux
  15\normalsize
  16\end{center}
  17
  18{ \it
  19This document describes the implementation of a simple SMP 
  20Linux kernel extension and how to use this to develop SMP Linux kernels for 
  21architectures other than the Intel MP v1.1 architecture for Pentium and 486 
  22processors.}
  23
  24\hfill Alan Cox, 1995
  25
  26
  27The author wishes to thank Caldera Inc. ( http://www.caldera.com )
  28whose donation of an ASUS dual Pentium board made this project possible, 
  29and Thomas Radke, whose initial work on multiprocessor Linux formed 
  30the backbone of this project.
  31
  32\section{Background: The Intel MP specification.}
  33Most IBM PC style multiprocessor motherboards combine Intel 486 or Pentium 
  34processors and glue chipsets with a hardware/software specification. The 
  35specification places much of the onus for hard work on the chipset and 
  36hardware rather than the operating system.
  37
  38The Intel Pentium processors have a wide variety of inbuilt facilities for 
  39supporting multiprocessing, including hardware cache coherency, built in 
  40interprocessor interrupt handling and a set of atomic test and set, 
  41exchange and similar operations. The cache coherency in particular makes the 
  42operating system's job far easier.
  43
  44The specification defines a detailed configuration structure in ROM that 
  45the boot up processor can read to find the full configuration of the 
  46processors and buses. It also defines a procedure for starting up the 
  47other processors.
  48
  49
  50\section{Mutual Exclusion Within A Single Processor Linux Kernel}
  51For any kernel to function in a sane manner it has to provide internal 
  52locking and protection of its own tables to prevent two processes updating 
  53them at once and for example allocating the same memory block. There are 
  54two strategies for this within current Unix and Unixlike kernels. 
  55Traditional Unix systems from the earliest of days use a scheme of 'Coarse 
  56Grained Locking' where the entire kernel is protected by a small number of 
  57locks only. Some modern systems use fine grained locking. Because fine 
  58grained locking has more overhead it is normally used only on 
  59multiprocessor kernels and real time kernels. In a real time kernel the 
  60fine grained locking reduces the amount of time locks are held and reduces 
  61the critical (to real time programming at least) latency times.
  62
  63Within the Linux kernel certain guarantees are made. No process running in 
  64kernel mode will be pre-empted by another kernel mode process unless it 
  65voluntarily sleeps.  This ensures that blocks of kernel code are 
  66effectively atomic with respect to other processes and greatly simplifies 
  67many operations. Secondly interrupts may pre-empt a kernel running process, 
  68but will always return to that process. A process in kernel mode may 
  69disable interrupts on the processor and guarantee such an interruption will 
  70not occur. The final guarantee is that an interrupt will not be pre-empted 
  71by a kernel task. That is interrupts will run to completion or be 
  72pre-empted by other interrupts only.
  73
  74The SMP kernel chooses to continue these basic guarantees in order to make 
  75initial implementation and deployment easier.  A single lock is maintained 
  76across all processors. This lock is required to access the kernel space. 
  77Any processor may hold it and once it is held may also re-enter the kernel 
  78for interrupts and other services whenever it likes until the lock is 
  79relinquished. This lock ensures that a kernel mode process will not be 
  80pre-empted and ensures that blocking interrupts in kernel mode behaves 
  81correctly. This is guaranteed because only the processor holding the lock 
  82can be in kernel mode, only kernel mode processes can disable interrupts 
  83and only the processor holding the lock may handle an interrupt.
  84
  85Such a choice is however poor for performance. In the longer term it is 
  86necessary to move to finer grained parallelism in order to get the best 
  87system performance. This can be done hierarchically by gradually refining 
  88the locks to cover smaller areas. With the current kernel highly CPU bound 
  89process sets perform well but I/O bound task sets can easily degenerate to 
  90near single processor performance levels. This refinement will be needed to 
  91get the best from Linux/SMP.
  92
  93\subsection{Changes To The Portable Kernel Components}
  94The kernel changes are split into generic SMP support changes and 
  95architecture specific changes necessary to accommodate each different 
  96processor type Linux is ported to.
  97
  98
  99\subsubsection{Initialisation}
 100The first problem with a multiprocessor kernel is starting the other 
 101processors up. Linux/SMP defines that a single processor enters the normal 
 102kernel entry point start\_kernel(). Other processors are assumed not to be 
 103started or to have been captured elsewhere. The first processor begins the 
 104normal Linux initialisation sequences and sets up paging, interrupts and 
 105trap handlers. After it has obtained the processor information about the 
 106boot CPU, the architecture specific function 
 107
 108
 109{\tt \bf{void smp\_store\_cpu\_info(int processor\_id) }}
 110
 111is called to store any information about the processor into a per processor 
 112array. This includes things like the bogomips speed ratings.
 113
 114Having completed the kernel initialisation the architecture specific 
 115function
 116
 117{\tt \bf void smp\_boot\_cpus(void) }
 118
 119is called and is expected to start up each other processor and cause it to 
 120enter start\_kernel() with its paging registers and other control 
 121information correctly loaded. Each other processor skips the setup except 
 122for calling the trap and irq initialisation functions that are needed on 
 123some processors to set each CPU up correctly.  These functions will 
 124probably need to be modified in existing kernels to cope with this.
 125
 126
 127Each additional CPU then calls the architecture specific function
 128
 129{\tt \bf void smp\_callin(void)}
 130
 131which does any final setup and then spins the processor while the boot 
 132up processor forks off enough idle threads for each processor. This is 
 133necessary because the scheduler assumes there is always something to run. 
 134Having generated these threads and forked init the architecture specific 
 135
 136{\tt \bf void smp\_commence(void)}
 137
 138function is invoked. This does any final setup and indicates to the system 
 139that multiprocessor mode is now active. All the processors spinning in the 
 140smp\_callin() function are now released to run the idle processes, which 
 141they will run when they have no real work to process.
 142
 143
 144\subsubsection{Scheduling}
 145The kernel scheduler implements a simple but very effective task 
 146scheduler. The basic structure of this scheduler is unchanged in the 
 147multiprocessor kernel. A processor field is added to each task, and this 
 148maintains the number of the processor executing a given task, or a magic 
 149constant (NO\_PROC\_ID)  indicating the job is not allocated to a processor. 
 150         
 151Each processor executes the scheduler itself and will select the next task 
 152to run from all runnable processes not allocated to a different processor. 
 153The algorithm used by the selection is otherwise unchanged. This is 
 154actually inadequate for the final system because there are advantages to 
 155keeping a process on the same CPU, especially on processor boards with per 
 156processor second level caches.
 157
 158Throughout the kernel the variable 'current' is used as a global for the 
 159current process. In Linux/SMP this becomes a macro which expands to 
 160current\_set[smp\_processor\_id()]. This enables almost the entire kernel to 
 161be unaware of the array of running processors, but still allows the SMP 
 162aware kernel modules to see all of the running processes.
 163
 164The fork system call is modified to generate multiple processes with a 
 165process id of zero until the SMP kernel starts up properly. This is 
 166necessary because process number 1 must be init, and it is desirable that 
 167all the system threads are process 0. 
 168
 169The final area within the scheduling of processes that does cause problems 
 170is the fact the uniprocessor kernel hard codes tests for the idle threads 
 171as task[0] and the init process as task[1]. Because there are multiple idle 
 172threads it is necessary to replace these with tests that the process id is 
 1730 and a search for process ID 1, respectively.
 174
 175\subsubsection{Memory Management}
 176The memory management core of the existing Linux system functions 
 177adequately within the multiprocessor framework providing the locking is 
 178used. Certain processor specific areas do need changing, in particular 
 179invalidate() must invalidate the TLBs of all processors before it returns.
 180
 181
 182\subsubsection{Miscellaneous Functions}
 183The portable SMP code rests on a small set of functions and variables 
 184that are provided by the processor specification functionality. These are
 185
 186{\tt \bf int smp\_processor\_id(void) }
 187
 188which returns the identity of the processor the call is executed upon. This 
 189call is assumed to be valid at all times. This may mean additional tests 
 190are needed during initialisation.
 191
 192
 193{\tt \bf int smp\_num\_cpus;}
 194
 195This is the number of processors in the system. \
 196
 197{\tt \bf void smp\_message\_pass(int target, int msg, unsigned long data,
 198                int wait)}
 199
 200This function passes messages between processors. At the moment it is not 
 201sufficiently defined to sensibly document and needs cleaning up and further 
 202work. Refer to the processor specific code documentation for more details.
 203
 204
 205\subsection{Architecture Specific Code For the Intel MP Port}
 206The architecture specific code for the Intel port splits fairly cleanly 
 207into four sections. Firstly the initialisation code used to boot the 
 208system, secondly the message handling and support code, thirdly the 
 209interrupt and kernel syscall entry function handling and finally the 
 210extensions to standard kernel facilities to cope with multiple processors.
 211
 212\subsubsection{Initialisation}  
 213The Intel MP architecture captures all the processors except for a single 
 214processor known as the 'boot processor' in the BIOS at boot time. Thus a 
 215single processor enters the kernel bootup code. The first processor 
 216executes the bootstrap code, loads and uncompresses the kernel. Having 
 217unpacked the kernel it sets up the paging and control registers then enters 
 218the C kernel startup.
 219
 220The assembler startup code for the kernel is modified so that it can be 
 221used by the other processors to do the processor identification and various 
 222other low level configurations but does not execute those parts of the 
 223startup code that would damage the running system (such as clearing the BSS 
 224segment). 
 225
 226In the initialisation done by the first processor the arch/i386/mm/init 
 227code is modified to scan the low page, top page and BIOS for intel MP 
 228signature blocks. This is necessary because the MP signature blocks must 
 229be read and processed before the kernel is allowed to allocate and destroy 
 230the page at the top of low memory. Having established the number of 
 231processors it reserves a set of pages to provide a stack come boot up area 
 232for each processor in the system. These must be allocated at startup to 
 233ensure they fall below the 1Mb boundary.
 234
 235Further processors are started up in smp\_boot\_cpus() by programming the 
 236APIC controller registers and sending an inter-processor interrupt (IPI) to 
 237the processor. This message causes the target processor to begin executing 
 238code at the start of any page of memory within the lowest 1Mb, in 16bit 
 239real mode. The kernel uses the single page it allocated for each processor 
 240to use as stack. Before booting a given CPU the relocatable code from 
 241trampoline.S and trampoline32.S is copied to the bottom of its stack page 
 242and used as the target for the startup. 
 243
 244The trampoline code calculates the desired stack base from the code 
 245segment (since the code segment on startup is the bottom of the stack), 
 246 enters 32bit mode and jumps to the kernel entry assembler. This as 
 247described above is modified to only execute the parts necessary for each 
 248processor, and then to enter start\_kernel(). On entering the kernel the 
 249processor initialises its trap and interrupt handlers before entering 
 250smp\_callin(), where it reports its status and sets a flag that causes the 
 251boot processor to continue and look for further processors. The processor 
 252then spins until smp\_commence() is invoked.
 253
 254Having started each processor up the smp\_commence( ) function flips a 
 255flag. Each processor spinning in smp\_callin() then loads the task register 
 256with the task state segment (TSS) of its idle thread as is needed for task 
 257switching.
 258
 259\subsubsection{Message Handling and Support Code}
 260The architecture specific code implements the smp\_processor\_id() function 
 261by querying the APIC logical identity register. Because the APIC isn't 
 262mapped into the kernel address space at boot, the initial value returned is 
 263rigged by setting the APIC base pointer to point at a suitable constant. 
 264Once the system starts doing the SMP setup (in smp\_boot\_cpus()), the APIC 
 265is mapped with a vremap() call and the apic pointer is adjusted 
 266appropriately. From then on the real APIC logical identity register is 
 267read.
 268
 269Message passing is accomplished using a pair of IPIs on interrupt 13 
 270(unused by the 80486 FPUs in SMP mode) and interrupt 16. Two are used in 
 271order to separate messages that cannot be processed until the receiver 
 272obtains the kernel spinlock from messages that can be processed 
 273immediately. In effect IRQ 13 is a fast IRQ handler that does not obtain 
 274the locks, and cannot cause a reschedule, while IRQ 16 is a slow IRQ that 
 275must acquire the kernel spinlocks and can cause a reschedule. This 
 276interrupt is used for passing on slave timer messages from the processor 
 277that receives the timer interrupt to the rest of the processors, so that 
 278they can reschedule running tasks.
 279
 280
 281\subsubsection{Entry And Exit Code}
 282A single spinlock protects the entire kernel. The interrupt handlers, the 
 283syscall entry code and the exception handlers all acquire the lock before 
 284entering the kernel proper. When the processor is trying to acquire the 
 285spinlock it spins continually on the lock with interrupts disabled. This 
 286causes a specific deadlock problem. The lock owner may need to send an 
 287invalidate request to the rest of the processors and wait for these to 
 288complete before continuing. A processor spinning on the lock would not be 
 289able to do this. Thus the loop of the spinlock tests and handles invalidate 
 290requests. If the invalidate bit for the spinning CPU is set the processor 
 291invalidates its TLB and atomically clears the bit. When the spinlock is 
 292obtained that processor will take an IPI and in the IPI test the bit and 
 293skip the invalidate as the bit is clear.
 294
 295One complexity of the spinlock is that a process running in kernel mode 
 296can sleep voluntarily and be pre-empted. A switch from such a process to a 
 297process executing in user space may reduce the lock count. To track this 
 298the kernel uses a syscall\_count and a per process lock\_depth parameter to 
 299track the kernel lock state. The switch\_to() function is modified in SMP 
 300mode to adjust the lock appropriately.
 301
 302The final problem is the idle thread. In the single processor kernel the 
 303idle thread executes 'hlt' instructions. This saves power and reduces the 
 304running temperature of the processors when they are idle. However it means 
 305the process spends all its time in kernel mode and would thus hold the 
 306kernel spinlock. The SMP idle thread continually reschedules a new task and 
 307returns to user mode. This is far from ideal and will be modified to use 
 308'hlt' instructions and release the spinlock soon. Using 'hlt' is even more 
 309beneficial on a multiprocessor system as it almost completely takes an idle 
 310processor off the bus.
 311
 312Interrupts are distributed by an i82489 APIC. This chip is set up to work 
 313as an emulation of the traditional PC interrupt controllers when the 
 314machine boots (so that an Intel MP machine boots one CPU and PC 
 315compatible). The kernel has all the relevant locks but does not yet 
 316reprogram the 82489 to deliver interrupts to arbitrary processors as it 
 317should. This requires further modification of the standard Linux interrupt 
 318handling code, and is particularly messy as the interrupt handler behaviour 
 319has to change as soon as the 82489 is switched into SMP mode.
 320
 321
 322\subsubsection{Extensions To Standard Facilities}
 323The kernel maintains a set of per processor control information such as 
 324the speed of the processor for delay loops. These functions on the SMP 
 325kernel look the values up in a per processor array that is set up from the 
 326data generated at boot up by the smp\_store\_cpu\_info() function. This 
 327includes other facts such as whether there is an FPU on the processor. The 
 328current kernel does not handle floating point correctly, this requires some 
 329changes to the techniques the single CPU kernel uses to minimise floating 
 330point processor reloads.
 331
 332The highly useful atomic bit operations are prefixed with the 'lock' 
 333prefix in the SMP kernel to maintain their atomic properties when used 
 334outside of (and by) the spinlock and message code. Amongst other things 
 335this is needed for the invalidate handler, as all  CPU's will invalidate at 
 336the same time without any locks.
 337
 338Interrupt 13 floating point error reporting is removed. This facility is 
 339not usable on a multiprocessor board, nor relevant to the Intel MP 
 340architecture which does not cover the 80386/80387 processor pair. \
 341
 342The /proc filesystem support is changed so that the /proc/cpuinfo file 
 343contains a column for each processor present. This information is extracted 
 344from the data saved by smp\_store\_cpu\_info().
 345
 346\end{document}
 347
lxr.linux.no kindly hosted by Redpill Linpro AS, provider of Linux consulting and operations services since 1995.