linux/Documentation/trace/ring-buffer-design.txt
<<
>>
Prefs
   1                Lockless Ring Buffer Design
   2                ===========================
   3
   4Copyright 2009 Red Hat Inc.
   5   Author:   Steven Rostedt <srostedt@redhat.com>
   6  License:   The GNU Free Documentation License, Version 1.2
   7               (dual licensed under the GPL v2)
   8Reviewers:   Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
   9             and Frederic Weisbecker.
  10
  11
  12Written for: 2.6.31
  13
  14Terminology used in this Document
  15---------------------------------
  16
  17tail - where new writes happen in the ring buffer.
  18
  19head - where new reads happen in the ring buffer.
  20
  21producer - the task that writes into the ring buffer (same as writer)
  22
  23writer - same as producer
  24
  25consumer - the task that reads from the buffer (same as reader)
  26
  27reader - same as consumer.
  28
  29reader_page - A page outside the ring buffer used solely (for the most part)
  30    by the reader.
  31
  32head_page - a pointer to the page that the reader will use next
  33
  34tail_page - a pointer to the page that will be written to next
  35
  36commit_page - a pointer to the page with the last finished non-nested write.
  37
  38cmpxchg - hardware-assisted atomic transaction that performs the following:
  39
  40   A = B iff previous A == C
  41
  42   R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
  43      current A is equal to C, and we put the old (current) A into R
  44
  45   R gets the previous A regardless if A is updated with B or not.
  46
  47   To see if the update was successful a compare of R == C may be used.
  48
  49The Generic Ring Buffer
  50-----------------------
  51
  52The ring buffer can be used in either an overwrite mode or in
  53producer/consumer mode.
  54
  55Producer/consumer mode is where if the producer were to fill up the
  56buffer before the consumer could free up anything, the producer
  57will stop writing to the buffer. This will lose most recent events.
  58
  59Overwrite mode is where if the producer were to fill up the buffer
  60before the consumer could free up anything, the producer will
  61overwrite the older data. This will lose the oldest events.
  62
  63No two writers can write at the same time (on the same per-cpu buffer),
  64but a writer may interrupt another writer, but it must finish writing
  65before the previous writer may continue. This is very important to the
  66algorithm. The writers act like a "stack". The way interrupts works
  67enforces this behavior.
  68
  69
  70  writer1 start
  71     <preempted> writer2 start
  72         <preempted> writer3 start
  73                     writer3 finishes
  74                 writer2 finishes
  75  writer1 finishes
  76
  77This is very much like a writer being preempted by an interrupt and
  78the interrupt doing a write as well.
  79
  80Readers can happen at any time. But no two readers may run at the
  81same time, nor can a reader preempt/interrupt another reader. A reader
  82cannot preempt/interrupt a writer, but it may read/consume from the
  83buffer at the same time as a writer is writing, but the reader must be
  84on another processor to do so. A reader may read on its own processor
  85and can be preempted by a writer.
  86
  87A writer can preempt a reader, but a reader cannot preempt a writer.
  88But a reader can read the buffer at the same time (on another processor)
  89as a writer.
  90
  91The ring buffer is made up of a list of pages held together by a linked list.
  92
  93At initialization a reader page is allocated for the reader that is not
  94part of the ring buffer.
  95
  96The head_page, tail_page and commit_page are all initialized to point
  97to the same page.
  98
  99The reader page is initialized to have its next pointer pointing to
 100the head page, and its previous pointer pointing to a page before
 101the head page.
 102
 103The reader has its own page to use. At start up time, this page is
 104allocated but is not attached to the list. When the reader wants
 105to read from the buffer, if its page is empty (like it is on start-up),
 106it will swap its page with the head_page. The old reader page will
 107become part of the ring buffer and the head_page will be removed.
 108The page after the inserted page (old reader_page) will become the
 109new head page.
 110
 111Once the new page is given to the reader, the reader could do what
 112it wants with it, as long as a writer has left that page.
 113
 114A sample of how the reader page is swapped: Note this does not
 115show the head page in the buffer, it is for demonstrating a swap
 116only.
 117
 118  +------+
 119  |reader|          RING BUFFER
 120  |page  |
 121  +------+
 122                  +---+   +---+   +---+
 123                  |   |-->|   |-->|   |
 124                  |   |<--|   |<--|   |
 125                  +---+   +---+   +---+
 126                   ^ |             ^ |
 127                   | +-------------+ |
 128                   +-----------------+
 129
 130
 131  +------+
 132  |reader|          RING BUFFER
 133  |page  |-------------------+
 134  +------+                   v
 135    |             +---+   +---+   +---+
 136    |             |   |-->|   |-->|   |
 137    |             |   |<--|   |<--|   |<-+
 138    |             +---+   +---+   +---+  |
 139    |              ^ |             ^ |   |
 140    |              | +-------------+ |   |
 141    |              +-----------------+   |
 142    +------------------------------------+
 143
 144  +------+
 145  |reader|          RING BUFFER
 146  |page  |-------------------+
 147  +------+ <---------------+ v
 148    |  ^          +---+   +---+   +---+
 149    |  |          |   |-->|   |-->|   |
 150    |  |          |   |   |   |<--|   |<-+
 151    |  |          +---+   +---+   +---+  |
 152    |  |             |             ^ |   |
 153    |  |             +-------------+ |   |
 154    |  +-----------------------------+   |
 155    +------------------------------------+
 156
 157  +------+
 158  |buffer|          RING BUFFER
 159  |page  |-------------------+
 160  +------+ <---------------+ v
 161    |  ^          +---+   +---+   +---+
 162    |  |          |   |   |   |-->|   |
 163    |  |  New     |   |   |   |<--|   |<-+
 164    |  | Reader   +---+   +---+   +---+  |
 165    |  |  page ----^                 |   |
 166    |  |                             |   |
 167    |  +-----------------------------+   |
 168    +------------------------------------+
 169
 170
 171
 172It is possible that the page swapped is the commit page and the tail page,
 173if what is in the ring buffer is less than what is held in a buffer page.
 174
 175
 176          reader page    commit page   tail page
 177              |              |             |
 178              v              |             |
 179             +---+           |             |
 180             |   |<----------+             |
 181             |   |<------------------------+
 182             |   |------+
 183             +---+      |
 184                        |
 185                        v
 186    +---+    +---+    +---+    +---+
 187<---|   |--->|   |--->|   |--->|   |--->
 188--->|   |<---|   |<---|   |<---|   |<---
 189    +---+    +---+    +---+    +---+
 190
 191This case is still valid for this algorithm.
 192When the writer leaves the page, it simply goes into the ring buffer
 193since the reader page still points to the next location in the ring
 194buffer.
 195
 196
 197The main pointers:
 198
 199  reader page - The page used solely by the reader and is not part
 200                of the ring buffer (may be swapped in)
 201
 202  head page - the next page in the ring buffer that will be swapped
 203              with the reader page.
 204
 205  tail page - the page where the next write will take place.
 206
 207  commit page - the page that last finished a write.
 208
 209The commit page only is updated by the outermost writer in the
 210writer stack. A writer that preempts another writer will not move the
 211commit page.
 212
 213When data is written into the ring buffer, a position is reserved
 214in the ring buffer and passed back to the writer. When the writer
 215is finished writing data into that position, it commits the write.
 216
 217Another write (or a read) may take place at anytime during this
 218transaction. If another write happens it must finish before continuing
 219with the previous write.
 220
 221
 222   Write reserve:
 223
 224       Buffer page
 225      +---------+
 226      |written  |
 227      +---------+  <--- given back to writer (current commit)
 228      |reserved |
 229      +---------+ <--- tail pointer
 230      | empty   |
 231      +---------+
 232
 233   Write commit:
 234
 235       Buffer page
 236      +---------+
 237      |written  |
 238      +---------+
 239      |written  |
 240      +---------+  <--- next position for write (current commit)
 241      | empty   |
 242      +---------+
 243
 244
 245 If a write happens after the first reserve:
 246
 247       Buffer page
 248      +---------+
 249      |written  |
 250      +---------+  <-- current commit
 251      |reserved |
 252      +---------+  <--- given back to second writer
 253      |reserved |
 254      +---------+ <--- tail pointer
 255
 256  After second writer commits:
 257
 258
 259       Buffer page
 260      +---------+
 261      |written  |
 262      +---------+  <--(last full commit)
 263      |reserved |
 264      +---------+
 265      |pending  |
 266      |commit   |
 267      +---------+ <--- tail pointer
 268
 269  When the first writer commits:
 270
 271       Buffer page
 272      +---------+
 273      |written  |
 274      +---------+
 275      |written  |
 276      +---------+
 277      |written  |
 278      +---------+  <--(last full commit and tail pointer)
 279
 280
 281The commit pointer points to the last write location that was
 282committed without preempting another write. When a write that
 283preempted another write is committed, it only becomes a pending commit
 284and will not be a full commit until all writes have been committed.
 285
 286The commit page points to the page that has the last full commit.
 287The tail page points to the page with the last write (before
 288committing).
 289
 290The tail page is always equal to or after the commit page. It may
 291be several pages ahead. If the tail page catches up to the commit
 292page then no more writes may take place (regardless of the mode
 293of the ring buffer: overwrite and produce/consumer).
 294
 295The order of pages is:
 296
 297 head page
 298 commit page
 299 tail page
 300
 301Possible scenario:
 302                             tail page
 303  head page         commit page  |
 304      |                 |        |
 305      v                 v        v
 306    +---+    +---+    +---+    +---+
 307<---|   |--->|   |--->|   |--->|   |--->
 308--->|   |<---|   |<---|   |<---|   |<---
 309    +---+    +---+    +---+    +---+
 310
 311There is a special case that the head page is after either the commit page
 312and possibly the tail page. That is when the commit (and tail) page has been
 313swapped with the reader page. This is because the head page is always
 314part of the ring buffer, but the reader page is not. Whenever there
 315has been less than a full page that has been committed inside the ring buffer,
 316and a reader swaps out a page, it will be swapping out the commit page.
 317
 318
 319          reader page    commit page   tail page
 320              |              |             |
 321              v              |             |
 322             +---+           |             |
 323             |   |<----------+             |
 324             |   |<------------------------+
 325             |   |------+
 326             +---+      |
 327                        |
 328                        v
 329    +---+    +---+    +---+    +---+
 330<---|   |--->|   |--->|   |--->|   |--->
 331--->|   |<---|   |<---|   |<---|   |<---
 332    +---+    +---+    +---+    +---+
 333                        ^
 334                        |
 335                    head page
 336
 337
 338In this case, the head page will not move when the tail and commit
 339move back into the ring buffer.
 340
 341The reader cannot swap a page into the ring buffer if the commit page
 342is still on that page. If the read meets the last commit (real commit
 343not pending or reserved), then there is nothing more to read.
 344The buffer is considered empty until another full commit finishes.
 345
 346When the tail meets the head page, if the buffer is in overwrite mode,
 347the head page will be pushed ahead one. If the buffer is in producer/consumer
 348mode, the write will fail.
 349
 350Overwrite mode:
 351
 352            tail page
 353               |
 354               v
 355    +---+    +---+    +---+    +---+
 356<---|   |--->|   |--->|   |--->|   |--->
 357--->|   |<---|   |<---|   |<---|   |<---
 358    +---+    +---+    +---+    +---+
 359                        ^
 360                        |
 361                    head page
 362
 363
 364            tail page
 365               |
 366               v
 367    +---+    +---+    +---+    +---+
 368<---|   |--->|   |--->|   |--->|   |--->
 369--->|   |<---|   |<---|   |<---|   |<---
 370    +---+    +---+    +---+    +---+
 371                                 ^
 372                                 |
 373                             head page
 374
 375
 376                    tail page
 377                        |
 378                        v
 379    +---+    +---+    +---+    +---+
 380<---|   |--->|   |--->|   |--->|   |--->
 381--->|   |<---|   |<---|   |<---|   |<---
 382    +---+    +---+    +---+    +---+
 383                                 ^
 384                                 |
 385                             head page
 386
 387Note, the reader page will still point to the previous head page.
 388But when a swap takes place, it will use the most recent head page.
 389
 390
 391Making the Ring Buffer Lockless:
 392--------------------------------
 393
 394The main idea behind the lockless algorithm is to combine the moving
 395of the head_page pointer with the swapping of pages with the reader.
 396State flags are placed inside the pointer to the page. To do this,
 397each page must be aligned in memory by 4 bytes. This will allow the 2
 398least significant bits of the address to be used as flags, since
 399they will always be zero for the address. To get the address,
 400simply mask out the flags.
 401
 402  MASK = ~3
 403
 404  address & MASK
 405
 406Two flags will be kept by these two bits:
 407
 408   HEADER - the page being pointed to is a head page
 409
 410   UPDATE - the page being pointed to is being updated by a writer
 411          and was or is about to be a head page.
 412
 413
 414          reader page
 415              |
 416              v
 417             +---+
 418             |   |------+
 419             +---+      |
 420                        |
 421                        v
 422    +---+    +---+    +---+    +---+
 423<---|   |--->|   |-H->|   |--->|   |--->
 424--->|   |<---|   |<---|   |<---|   |<---
 425    +---+    +---+    +---+    +---+
 426
 427
 428The above pointer "-H->" would have the HEADER flag set. That is
 429the next page is the next page to be swapped out by the reader.
 430This pointer means the next page is the head page.
 431
 432When the tail page meets the head pointer, it will use cmpxchg to
 433change the pointer to the UPDATE state:
 434
 435
 436            tail page
 437               |
 438               v
 439    +---+    +---+    +---+    +---+
 440<---|   |--->|   |-H->|   |--->|   |--->
 441--->|   |<---|   |<---|   |<---|   |<---
 442    +---+    +---+    +---+    +---+
 443
 444            tail page
 445               |
 446               v
 447    +---+    +---+    +---+    +---+
 448<---|   |--->|   |-U->|   |--->|   |--->
 449--->|   |<---|   |<---|   |<---|   |<---
 450    +---+    +---+    +---+    +---+
 451
 452"-U->" represents a pointer in the UPDATE state.
 453
 454Any access to the reader will need to take some sort of lock to serialize
 455the readers. But the writers will never take a lock to write to the
 456ring buffer. This means we only need to worry about a single reader,
 457and writes only preempt in "stack" formation.
 458
 459When the reader tries to swap the page with the ring buffer, it
 460will also use cmpxchg. If the flag bit in the pointer to the
 461head page does not have the HEADER flag set, the compare will fail
 462and the reader will need to look for the new head page and try again.
 463Note, the flags UPDATE and HEADER are never set at the same time.
 464
 465The reader swaps the reader page as follows:
 466
 467  +------+
 468  |reader|          RING BUFFER
 469  |page  |
 470  +------+
 471                  +---+    +---+    +---+
 472                  |   |--->|   |--->|   |
 473                  |   |<---|   |<---|   |
 474                  +---+    +---+    +---+
 475                   ^ |               ^ |
 476                   | +---------------+ |
 477                   +-----H-------------+
 478
 479The reader sets the reader page next pointer as HEADER to the page after
 480the head page.
 481
 482
 483  +------+
 484  |reader|          RING BUFFER
 485  |page  |-------H-----------+
 486  +------+                   v
 487    |             +---+    +---+    +---+
 488    |             |   |--->|   |--->|   |
 489    |             |   |<---|   |<---|   |<-+
 490    |             +---+    +---+    +---+  |
 491    |              ^ |               ^ |   |
 492    |              | +---------------+ |   |
 493    |              +-----H-------------+   |
 494    +--------------------------------------+
 495
 496It does a cmpxchg with the pointer to the previous head page to make it
 497point to the reader page. Note that the new pointer does not have the HEADER
 498flag set.  This action atomically moves the head page forward.
 499
 500  +------+
 501  |reader|          RING BUFFER
 502  |page  |-------H-----------+
 503  +------+                   v
 504    |  ^          +---+   +---+   +---+
 505    |  |          |   |-->|   |-->|   |
 506    |  |          |   |<--|   |<--|   |<-+
 507    |  |          +---+   +---+   +---+  |
 508    |  |             |             ^ |   |
 509    |  |             +-------------+ |   |
 510    |  +-----------------------------+   |
 511    +------------------------------------+
 512
 513After the new head page is set, the previous pointer of the head page is
 514updated to the reader page.
 515
 516  +------+
 517  |reader|          RING BUFFER
 518  |page  |-------H-----------+
 519  +------+ <---------------+ v
 520    |  ^          +---+   +---+   +---+
 521    |  |          |   |-->|   |-->|   |
 522    |  |          |   |   |   |<--|   |<-+
 523    |  |          +---+   +---+   +---+  |
 524    |  |             |             ^ |   |
 525    |  |             +-------------+ |   |
 526    |  +-----------------------------+   |
 527    +------------------------------------+
 528
 529  +------+
 530  |buffer|          RING BUFFER
 531  |page  |-------H-----------+  <--- New head page
 532  +------+ <---------------+ v
 533    |  ^          +---+   +---+   +---+
 534    |  |          |   |   |   |-->|   |
 535    |  |  New     |   |   |   |<--|   |<-+
 536    |  | Reader   +---+   +---+   +---+  |
 537    |  |  page ----^                 |   |
 538    |  |                             |   |
 539    |  +-----------------------------+   |
 540    +------------------------------------+
 541
 542Another important point: The page that the reader page points back to
 543by its previous pointer (the one that now points to the new head page)
 544never points back to the reader page. That is because the reader page is
 545not part of the ring buffer. Traversing the ring buffer via the next pointers
 546will always stay in the ring buffer. Traversing the ring buffer via the
 547prev pointers may not.
 548
 549Note, the way to determine a reader page is simply by examining the previous
 550pointer of the page. If the next pointer of the previous page does not
 551point back to the original page, then the original page is a reader page:
 552
 553
 554             +--------+
 555             | reader |  next   +----+
 556             |  page  |-------->|    |<====== (buffer page)
 557             +--------+         +----+
 558                 |                | ^
 559                 |                v | next
 560            prev |              +----+
 561                 +------------->|    |
 562                                +----+
 563
 564The way the head page moves forward:
 565
 566When the tail page meets the head page and the buffer is in overwrite mode
 567and more writes take place, the head page must be moved forward before the
 568writer may move the tail page. The way this is done is that the writer
 569performs a cmpxchg to convert the pointer to the head page from the HEADER
 570flag to have the UPDATE flag set. Once this is done, the reader will
 571not be able to swap the head page from the buffer, nor will it be able to
 572move the head page, until the writer is finished with the move.
 573
 574This eliminates any races that the reader can have on the writer. The reader
 575must spin, and this is why the reader cannot preempt the writer.
 576
 577            tail page
 578               |
 579               v
 580    +---+    +---+    +---+    +---+
 581<---|   |--->|   |-H->|   |--->|   |--->
 582--->|   |<---|   |<---|   |<---|   |<---
 583    +---+    +---+    +---+    +---+
 584
 585            tail page
 586               |
 587               v
 588    +---+    +---+    +---+    +---+
 589<---|   |--->|   |-U->|   |--->|   |--->
 590--->|   |<---|   |<---|   |<---|   |<---
 591    +---+    +---+    +---+    +---+
 592
 593The following page will be made into the new head page.
 594
 595           tail page
 596               |
 597               v
 598    +---+    +---+    +---+    +---+
 599<---|   |--->|   |-U->|   |-H->|   |--->
 600--->|   |<---|   |<---|   |<---|   |<---
 601    +---+    +---+    +---+    +---+
 602
 603After the new head page has been set, we can set the old head page
 604pointer back to NORMAL.
 605
 606           tail page
 607               |
 608               v
 609    +---+    +---+    +---+    +---+
 610<---|   |--->|   |--->|   |-H->|   |--->
 611--->|   |<---|   |<---|   |<---|   |<---
 612    +---+    +---+    +---+    +---+
 613
 614After the head page has been moved, the tail page may now move forward.
 615
 616                    tail page
 617                        |
 618                        v
 619    +---+    +---+    +---+    +---+
 620<---|   |--->|   |--->|   |-H->|   |--->
 621--->|   |<---|   |<---|   |<---|   |<---
 622    +---+    +---+    +---+    +---+
 623
 624
 625The above are the trivial updates. Now for the more complex scenarios.
 626
 627
 628As stated before, if enough writes preempt the first write, the
 629tail page may make it all the way around the buffer and meet the commit
 630page. At this time, we must start dropping writes (usually with some kind
 631of warning to the user). But what happens if the commit was still on the
 632reader page? The commit page is not part of the ring buffer. The tail page
 633must account for this.
 634
 635
 636          reader page    commit page
 637              |              |
 638              v              |
 639             +---+           |
 640             |   |<----------+
 641             |   |
 642             |   |------+
 643             +---+      |
 644                        |
 645                        v
 646    +---+    +---+    +---+    +---+
 647<---|   |--->|   |-H->|   |--->|   |--->
 648--->|   |<---|   |<---|   |<---|   |<---
 649    +---+    +---+    +---+    +---+
 650               ^
 651               |
 652           tail page
 653
 654If the tail page were to simply push the head page forward, the commit when
 655leaving the reader page would not be pointing to the correct page.
 656
 657The solution to this is to test if the commit page is on the reader page
 658before pushing the head page. If it is, then it can be assumed that the
 659tail page wrapped the buffer, and we must drop new writes.
 660
 661This is not a race condition, because the commit page can only be moved
 662by the outermost writer (the writer that was preempted).
 663This means that the commit will not move while a writer is moving the
 664tail page. The reader cannot swap the reader page if it is also being
 665used as the commit page. The reader can simply check that the commit
 666is off the reader page. Once the commit page leaves the reader page
 667it will never go back on it unless a reader does another swap with the
 668buffer page that is also the commit page.
 669
 670
 671Nested writes
 672-------------
 673
 674In the pushing forward of the tail page we must first push forward
 675the head page if the head page is the next page. If the head page
 676is not the next page, the tail page is simply updated with a cmpxchg.
 677
 678Only writers move the tail page. This must be done atomically to protect
 679against nested writers.
 680
 681  temp_page = tail_page
 682  next_page = temp_page->next
 683  cmpxchg(tail_page, temp_page, next_page)
 684
 685The above will update the tail page if it is still pointing to the expected
 686page. If this fails, a nested write pushed it forward, the the current write
 687does not need to push it.
 688
 689
 690           temp page
 691               |
 692               v
 693            tail page
 694               |
 695               v
 696    +---+    +---+    +---+    +---+
 697<---|   |--->|   |--->|   |--->|   |--->
 698--->|   |<---|   |<---|   |<---|   |<---
 699    +---+    +---+    +---+    +---+
 700
 701Nested write comes in and moves the tail page forward:
 702
 703                    tail page (moved by nested writer)
 704            temp page   |
 705               |        |
 706               v        v
 707    +---+    +---+    +---+    +---+
 708<---|   |--->|   |--->|   |--->|   |--->
 709--->|   |<---|   |<---|   |<---|   |<---
 710    +---+    +---+    +---+    +---+
 711
 712The above would fail the cmpxchg, but since the tail page has already
 713been moved forward, the writer will just try again to reserve storage
 714on the new tail page.
 715
 716But the moving of the head page is a bit more complex.
 717
 718            tail page
 719               |
 720               v
 721    +---+    +---+    +---+    +---+
 722<---|   |--->|   |-H->|   |--->|   |--->
 723--->|   |<---|   |<---|   |<---|   |<---
 724    +---+    +---+    +---+    +---+
 725
 726The write converts the head page pointer to UPDATE.
 727
 728            tail page
 729               |
 730               v
 731    +---+    +---+    +---+    +---+
 732<---|   |--->|   |-U->|   |--->|   |--->
 733--->|   |<---|   |<---|   |<---|   |<---
 734    +---+    +---+    +---+    +---+
 735
 736But if a nested writer preempts here, it will see that the next
 737page is a head page, but it is also nested. It will detect that
 738it is nested and will save that information. The detection is the
 739fact that it sees the UPDATE flag instead of a HEADER or NORMAL
 740pointer.
 741
 742The nested writer will set the new head page pointer.
 743
 744           tail page
 745               |
 746               v
 747    +---+    +---+    +---+    +---+
 748<---|   |--->|   |-U->|   |-H->|   |--->
 749--->|   |<---|   |<---|   |<---|   |<---
 750    +---+    +---+    +---+    +---+
 751
 752But it will not reset the update back to normal. Only the writer
 753that converted a pointer from HEAD to UPDATE will convert it back
 754to NORMAL.
 755
 756                    tail page
 757                        |
 758                        v
 759    +---+    +---+    +---+    +---+
 760<---|   |--->|   |-U->|   |-H->|   |--->
 761--->|   |<---|   |<---|   |<---|   |<---
 762    +---+    +---+    +---+    +---+
 763
 764After the nested writer finishes, the outermost writer will convert
 765the UPDATE pointer to NORMAL.
 766
 767
 768                    tail page
 769                        |
 770                        v
 771    +---+    +---+    +---+    +---+
 772<---|   |--->|   |--->|   |-H->|   |--->
 773--->|   |<---|   |<---|   |<---|   |<---
 774    +---+    +---+    +---+    +---+
 775
 776
 777It can be even more complex if several nested writes came in and moved
 778the tail page ahead several pages:
 779
 780
 781(first writer)
 782
 783            tail page
 784               |
 785               v
 786    +---+    +---+    +---+    +---+
 787<---|   |--->|   |-H->|   |--->|   |--->
 788--->|   |<---|   |<---|   |<---|   |<---
 789    +---+    +---+    +---+    +---+
 790
 791The write converts the head page pointer to UPDATE.
 792
 793            tail page
 794               |
 795               v
 796    +---+    +---+    +---+    +---+
 797<---|   |--->|   |-U->|   |--->|   |--->
 798--->|   |<---|   |<---|   |<---|   |<---
 799    +---+    +---+    +---+    +---+
 800
 801Next writer comes in, and sees the update and sets up the new
 802head page.
 803
 804(second writer)
 805
 806           tail page
 807               |
 808               v
 809    +---+    +---+    +---+    +---+
 810<---|   |--->|   |-U->|   |-H->|   |--->
 811--->|   |<---|   |<---|   |<---|   |<---
 812    +---+    +---+    +---+    +---+
 813
 814The nested writer moves the tail page forward. But does not set the old
 815update page to NORMAL because it is not the outermost writer.
 816
 817                    tail page
 818                        |
 819                        v
 820    +---+    +---+    +---+    +---+
 821<---|   |--->|   |-U->|   |-H->|   |--->
 822--->|   |<---|   |<---|   |<---|   |<---
 823    +---+    +---+    +---+    +---+
 824
 825Another writer preempts and sees the page after the tail page is a head page.
 826It changes it from HEAD to UPDATE.
 827
 828(third writer)
 829
 830                    tail page
 831                        |
 832                        v
 833    +---+    +---+    +---+    +---+
 834<---|   |--->|   |-U->|   |-U->|   |--->
 835--->|   |<---|   |<---|   |<---|   |<---
 836    +---+    +---+    +---+    +---+
 837
 838The writer will move the head page forward:
 839
 840
 841(third writer)
 842
 843                    tail page
 844                        |
 845                        v
 846    +---+    +---+    +---+    +---+
 847<---|   |--->|   |-U->|   |-U->|   |-H->
 848--->|   |<---|   |<---|   |<---|   |<---
 849    +---+    +---+    +---+    +---+
 850
 851But now that the third writer did change the HEAD flag to UPDATE it
 852will convert it to normal:
 853
 854
 855(third writer)
 856
 857                    tail page
 858                        |
 859                        v
 860    +---+    +---+    +---+    +---+
 861<---|   |--->|   |-U->|   |--->|   |-H->
 862--->|   |<---|   |<---|   |<---|   |<---
 863    +---+    +---+    +---+    +---+
 864
 865
 866Then it will move the tail page, and return back to the second writer.
 867
 868
 869(second writer)
 870
 871                             tail page
 872                                 |
 873                                 v
 874    +---+    +---+    +---+    +---+
 875<---|   |--->|   |-U->|   |--->|   |-H->
 876--->|   |<---|   |<---|   |<---|   |<---
 877    +---+    +---+    +---+    +---+
 878
 879
 880The second writer will fail to move the tail page because it was already
 881moved, so it will try again and add its data to the new tail page.
 882It will return to the first writer.
 883
 884
 885(first writer)
 886
 887                             tail page
 888                                 |
 889                                 v
 890    +---+    +---+    +---+    +---+
 891<---|   |--->|   |-U->|   |--->|   |-H->
 892--->|   |<---|   |<---|   |<---|   |<---
 893    +---+    +---+    +---+    +---+
 894
 895The first writer cannot know atomically if the tail page moved
 896while it updates the HEAD page. It will then update the head page to
 897what it thinks is the new head page.
 898
 899
 900(first writer)
 901
 902                             tail page
 903                                 |
 904                                 v
 905    +---+    +---+    +---+    +---+
 906<---|   |--->|   |-U->|   |-H->|   |-H->
 907--->|   |<---|   |<---|   |<---|   |<---
 908    +---+    +---+    +---+    +---+
 909
 910Since the cmpxchg returns the old value of the pointer the first writer
 911will see it succeeded in updating the pointer from NORMAL to HEAD.
 912But as we can see, this is not good enough. It must also check to see
 913if the tail page is either where it use to be or on the next page:
 914
 915
 916(first writer)
 917
 918               A        B    tail page
 919               |        |        |
 920               v        v        v
 921    +---+    +---+    +---+    +---+
 922<---|   |--->|   |-U->|   |-H->|   |-H->
 923--->|   |<---|   |<---|   |<---|   |<---
 924    +---+    +---+    +---+    +---+
 925
 926If tail page != A and tail page != B, then it must reset the pointer
 927back to NORMAL. The fact that it only needs to worry about nested
 928writers means that it only needs to check this after setting the HEAD page.
 929
 930
 931(first writer)
 932
 933               A        B    tail page
 934               |        |        |
 935               v        v        v
 936    +---+    +---+    +---+    +---+
 937<---|   |--->|   |-U->|   |--->|   |-H->
 938--->|   |<---|   |<---|   |<---|   |<---
 939    +---+    +---+    +---+    +---+
 940
 941Now the writer can update the head page. This is also why the head page must
 942remain in UPDATE and only reset by the outermost writer. This prevents
 943the reader from seeing the incorrect head page.
 944
 945
 946(first writer)
 947
 948               A        B    tail page
 949               |        |        |
 950               v        v        v
 951    +---+    +---+    +---+    +---+
 952<---|   |--->|   |--->|   |--->|   |-H->
 953--->|   |<---|   |<---|   |<---|   |<---
 954    +---+    +---+    +---+    +---+
 955
 956
lxr.linux.no kindly hosted by Redpill Linpro AS, provider of Linux consulting and operations services since 1995.