This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.

Simple special cases

This section presents two simple special cases, the first being where
there is only one CPU or only one memory location is accessed, and the
second being use of that old concurrency workhorse, locking.

Single CPU or single memory location

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered.  For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule.  That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized.  Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores.
        There are exceptions to this rule, including:

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE().

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().
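
To make item 4 above more concrete, here is a minimal userspace sketch
of what READ_ONCE() and WRITE_ONCE() ask of the compiler.  The two macros
below are simplified stand-ins written for this illustration and are not
the kernel's definitions; shared_flag, poll_flag(), and set_flag() are
hypothetical names.  The volatile cast asks the compiler to emit exactly
one access per use, which is what rules out fusing the loads in the
polling loop into a single load.

```c
#include <assert.h>

/*
 * Simplified stand-ins (an assumption for illustration, not the kernel's
 * definitions).  __typeof__ is a GCC/Clang extension.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

int shared_flag;        /* hypothetical shared variable */

/* Poll shared_flag up to max_spins times; return 1 if it was seen set. */
int poll_flag(int max_spins)
{
        assert(max_spins > 0);
        for (int i = 0; i < max_spins; i++) {
                /*
                 * With a plain load, the compiler would be free to read
                 * shared_flag once and reuse the value on every pass.
                 */
                if (READ_ONCE(shared_flag))
                        return 1;
        }
        return 0;
}

/* Publish the flag with a single full-sized store. */
void set_flag(void)
{
        WRITE_ONCE(shared_flag, 1);
}
```

In this sketch the macros constrain only the compiler.  They provide no
ordering against accesses to other variables, which is the subject of
the recipes that follow.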

Locking

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.
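
For readers who would like to experiment outside the kernel, the
following is a rough userspace analogy of the litmus test above.  It is
a sketch under stated assumptions: a pthread mutex stands in for the
spinlock, relaxed C11 atomics stand in for READ_ONCE() and WRITE_ONCE(),
and the guarantees come from POSIX and C11 rather than from the Linux
kernel memory model.  The function count_mp_lock_violations() is a
hypothetical helper that counts runs in which r0 is 1 but r1 is not.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int x, y;
static int r0, r1;

/* Analog of CPU0(): store to x before the critical section. */
static void *cpu0(void *arg)
{
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        pthread_mutex_lock(&mylock);
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        pthread_mutex_unlock(&mylock);
        return NULL;
}

/* Analog of CPU1(): load from x after the critical section. */
static void *cpu1(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&mylock);
        r0 = atomic_load_explicit(&y, memory_order_relaxed);
        pthread_mutex_unlock(&mylock);
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
}

/* Run the pair of threads `trials` times; count forbidden outcomes. */
int count_mp_lock_violations(int trials)
{
        int violations = 0;

        for (int i = 0; i < trials; i++) {
                pthread_t t0, t1;
                int err;

                atomic_store(&x, 0);
                atomic_store(&y, 0);
                err = pthread_create(&t0, NULL, cpu0, NULL);
                assert(err == 0);
                err = pthread_create(&t1, NULL, cpu1, NULL);
                assert(err == 0);
                pthread_join(t0, NULL);
                pthread_join(t1, NULL);
                /* Seeing y == 1 under the lock implies seeing x == 1. */
                if (r0 == 1 && r1 != 1)
                        violations++;
        }
        return violations;
}
```

A count of zero is the expected result, but passing runs do not prove
the ordering.  That job belongs to the memory model and its tools.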

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.

Taking off the training wheels

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found

Message passing (MP)

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.

Release and acquire

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.
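
The same shape can be tried in userspace with C11 atomics.  The sketch
below assumes that memory_order_release and memory_order_acquire are
acceptable stand-ins for smp_store_release() and smp_load_acquire(), and
that relaxed atomics stand in for WRITE_ONCE() and READ_ONCE().  It is
an analogy only, because the C11 and Linux kernel memory models differ
in their details.  The helper count_mp_violations() is a hypothetical
name.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int r0, r1;

/* Analog of CPU0(): write the data, then release-store the flag. */
static void *cpu0(void *arg)
{
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_store_explicit(&y, 1, memory_order_release);
        return NULL;
}

/* Analog of CPU1(): acquire-load the flag, then read the data. */
static void *cpu1(void *arg)
{
        (void)arg;
        r0 = atomic_load_explicit(&y, memory_order_acquire);
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
}

/* Run the pair of threads `trials` times; count forbidden outcomes. */
int count_mp_violations(int trials)
{
        int violations = 0;

        for (int i = 0; i < trials; i++) {
                pthread_t t0, t1;
                int err;

                atomic_store(&x, 0);
                atomic_store(&y, 0);
                err = pthread_create(&t0, NULL, cpu0, NULL);
                assert(err == 0);
                err = pthread_create(&t1, NULL, cpu1, NULL);
                assert(err == 0);
                pthread_join(t0, NULL);
                pthread_join(t1, NULL);
                /* Forbidden: flag seen set but data not yet visible. */
                if (r0 == 1 && r1 != 1)
                        violations++;
        }
        return violations;
}
```

The outcome r0 == 0 remains perfectly legal and simply means that
cpu1() ran before the flag was published.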

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack.  Working out
the mutual-exclusion design is left as an exercise for the reader.

Assign and dereference

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
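
The three kinds of dependency can be told apart by the shape of the
code.  The following single-threaded sketch shows one function per kind.
The macros are simplified userspace stand-ins rather than the kernel's
definitions, and every variable and function name is hypothetical.  The
sketch illustrates only what each dependency looks like in source code.
It does not demonstrate the ordering that rcu_dereference() provides.

```c
#include <assert.h>

/* Simplified stand-ins (assumption), using a GCC/Clang extension. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

int table[2] = { 10, 20 };
int index_var;          /* hypothetical shared variables */
int src, dst, flag;

/* Address dependency: the loaded value selects the location read next. */
int address_dependency(void)
{
        int i = READ_ONCE(index_var);

        assert(i == 0 || i == 1);
        return READ_ONCE(table[i]);
}

/* Data dependency: the loaded value is the value that gets stored. */
void data_dependency(void)
{
        int v = READ_ONCE(src);

        WRITE_ONCE(dst, v);
}

/* Control dependency: the loaded value decides whether the store runs. */
void control_dependency(void)
{
        if (READ_ONCE(flag))
                WRITE_ONCE(dst, 1);
}
```

Keep in mind that compilers do not know about dependencies, so code that
relies on them needs care to keep the optimizer from removing them.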

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.

Write and read memory barriers

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

         *   kernel                             user
         *
         *   if (LOAD ->data_tail) {            LOAD ->data_head
         *                      (A)             smp_rmb()       (C)
         *      STORE $data                     LOAD $data
         *      smp_wmb()       (B)             smp_mb()        (D)
         *      STORE ->data_head               STORE ->data_tail
         *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.

Load buffering (LB)

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

         *   kernel                             user
         *
         *   if (LOAD ->data_tail) {            LOAD ->data_head
         *                      (A)             smp_rmb()       (C)
         *      STORE $data                     LOAD $data
         *      smp_wmb()       (B)             smp_mb()        (D)
         *      STORE ->data_head               STORE ->data_tail
         *   }
         *
         * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.

Release-acquire chains

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.

Store buffering

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.
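
A userspace analogy of this pattern can be written with C11 atomics.
The sketch below assumes that atomic_thread_fence(memory_order_seq_cst)
is an acceptable stand-in for smp_mb() and that relaxed atomics stand in
for WRITE_ONCE() and READ_ONCE().  The helper count_sb_violations() is a
hypothetical name, and the guarantees are those of C11 rather than those
of the Linux kernel memory model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int r0, r1;

/* Analog of CPU0(): store to x, full fence, load from y. */
static void *cpu0(void *arg)
{
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r0 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
}

/* Analog of CPU1(): store to y, full fence, load from x. */
static void *cpu1(void *arg)
{
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
}

/* Run the pair of threads `trials` times; count forbidden outcomes. */
int count_sb_violations(int trials)
{
        int violations = 0;

        for (int i = 0; i < trials; i++) {
                pthread_t t0, t1;
                int err;

                atomic_store(&x, 0);
                atomic_store(&y, 0);
                err = pthread_create(&t0, NULL, cpu0, NULL);
                assert(err == 0);
                err = pthread_create(&t1, NULL, cpu1, NULL);
                assert(err == 0);
                pthread_join(t0, NULL);
                pthread_join(t1, NULL);
                /* Forbidden: both loads miss the other thread's store. */
                if (r0 == 0 && r1 == 0)
                        violations++;
        }
        return violations;
}
```

Removing either fence from this sketch makes the r0 == 0 and r1 == 0
outcome legal, although a given machine might take many runs to show it.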

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.

Rules of thumb

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote.  The LB litmus-test patterns contain only
        this type of relation.  In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read.  The SB litmus test contains only this type
        of relation.  In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU.  The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2().  In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co".  In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices.  For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices.  Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links.  This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.
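
The three cases above can be written down as a small classifier.  The
sketch below is only a mnemonic: the enum and function names are
hypothetical, and it reports the strongest kind of ordering the rules of
thumb call for rather than where each barrier must be placed.

```c
#include <assert.h>

/* The three kinds of link between successive CPUs. */
enum link {
        LINK_RF,        /* write-to-read ("reads-from") */
        LINK_FR,        /* read-to-write ("from-reads") */
        LINK_CO,        /* write-to-write ("coherence order") */
};

/* Strongest ordering called for by the rules of thumb. */
enum strength {
        ORDER_WEAKEST,          /* for example, a control dependency */
        ORDER_RELEASE_ACQUIRE,  /* a release-acquire chain */
        ORDER_FULL_BARRIER,     /* full barriers between non-rf links */
};

/* Classify a cycle of n links according to the rules of thumb. */
enum strength required_ordering(const enum link *links, int n)
{
        int non_rf = 0;

        assert(n > 0);
        for (int i = 0; i < n; i++) {
                if (links[i] != LINK_RF)
                        non_rf++;
        }
        if (non_rf == 0)
                return ORDER_WEAKEST;
        if (non_rf == 1)
                return ORDER_RELEASE_ACQUIRE;
        return ORDER_FULL_BARRIER;
}
```

For example, the MP pattern has one write-to-read link (the load from y
reads CPU0()'s store) and one read-to-write link (CPU0()'s store to x
overwrites the value that CPU1() read), so the classifier reports that a
release-acquire chain is enough.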

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.