1 Lockless Ring Buffer Design 2 =========================== 3 4Copyright 2009 Red Hat Inc. 5 Author: Steven Rostedt <srostedt@redhat.com> 6 License: The GNU Free Documentation License, Version 1.2 7 (dual licensed under the GPL v2) 8Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, 9 and Frederic Weisbecker. 10 11 12Written for: 2.6.31 13 14Terminology used in this Document 15--------------------------------- 16 17tail - where new writes happen in the ring buffer. 18 19head - where new reads happen in the ring buffer. 20 21producer - the task that writes into the ring buffer (same as writer) 22 23writer - same as producer 24 25consumer - the task that reads from the buffer (same as reader) 26 27reader - same as consumer. 28 29reader_page - A page outside the ring buffer used solely (for the most part) 30 by the reader. 31 32head_page - a pointer to the page that the reader will use next 33 34tail_page - a pointer to the page that will be written to next 35 36commit_page - a pointer to the page with the last finished non-nested write. 37 38cmpxchg - hardware-assisted atomic transaction that performs the following: 39 40 A = B iff previous A == C 41 42 R = cmpxchg(A, C, B) is saying that we replace A with B if and only if 43 current A is equal to C, and we put the old (current) A into R 44 45 R gets the previous A regardless if A is updated with B or not. 46 47 To see if the update was successful a compare of R == C may be used. 48 49The Generic Ring Buffer 50----------------------- 51 52The ring buffer can be used in either an overwrite mode or in 53producer/consumer mode. 54 55Producer/consumer mode is where if the producer were to fill up the 56buffer before the consumer could free up anything, the producer 57will stop writing to the buffer. This will lose most recent events. 58 59Overwrite mode is where if the producer were to fill up the buffer 60before the consumer could free up anything, the producer will 61overwrite the older data. This will lose the oldest events. 62 63No two writers can write at the same time (on the same per-cpu buffer), 64but a writer may interrupt another writer, but it must finish writing 65before the previous writer may continue. This is very important to the 66algorithm. The writers act like a "stack". The way interrupts works 67enforces this behavior. 68 69 70 writer1 start 71 <preempted> writer2 start 72 <preempted> writer3 start 73 writer3 finishes 74 writer2 finishes 75 writer1 finishes 76 77This is very much like a writer being preempted by an interrupt and 78the interrupt doing a write as well. 79 80Readers can happen at any time. But no two readers may run at the 81same time, nor can a reader preempt/interrupt another reader. A reader 82cannot preempt/interrupt a writer, but it may read/consume from the 83buffer at the same time as a writer is writing, but the reader must be 84on another processor to do so. A reader may read on its own processor 85and can be preempted by a writer. 86 87A writer can preempt a reader, but a reader cannot preempt a writer. 88But a reader can read the buffer at the same time (on another processor) 89as a writer. 90 91The ring buffer is made up of a list of pages held together by a linked list. 92 93At initialization a reader page is allocated for the reader that is not 94part of the ring buffer. 95 96The head_page, tail_page and commit_page are all initialized to point 97to the same page. 98 99The reader page is initialized to have its next pointer pointing to 100the head page, and its previous pointer pointing to a page before 101the head page. 102 103The reader has its own page to use. At start up time, this page is 104allocated but is not attached to the list. When the reader wants 105to read from the buffer, if its page is empty (like it is on start-up), 106it will swap its page with the head_page. The old reader page will 107become part of the ring buffer and the head_page will be removed. 108The page after the inserted page (old reader_page) will become the 109new head page. 110 111Once the new page is given to the reader, the reader could do what 112it wants with it, as long as a writer has left that page. 113 114A sample of how the reader page is swapped: Note this does not 115show the head page in the buffer, it is for demonstrating a swap 116only. 117 118 +------+ 119 |reader| RING BUFFER 120 |page | 121 +------+ 122 +---+ +---+ +---+ 123 | |-->| |-->| | 124 | |<--| |<--| | 125 +---+ +---+ +---+ 126 ^ | ^ | 127 | +-------------+ | 128 +-----------------+ 129 130 131 +------+ 132 |reader| RING BUFFER 133 |page |-------------------+ 134 +------+ v 135 | +---+ +---+ +---+ 136 | | |-->| |-->| | 137 | | |<--| |<--| |<-+ 138 | +---+ +---+ +---+ | 139 | ^ | ^ | | 140 | | +-------------+ | | 141 | +-----------------+ | 142 +------------------------------------+ 143 144 +------+ 145 |reader| RING BUFFER 146 |page |-------------------+ 147 +------+ <---------------+ v 148 | ^ +---+ +---+ +---+ 149 | | | |-->| |-->| | 150 | | | | | |<--| |<-+ 151 | | +---+ +---+ +---+ | 152 | | | ^ | | 153 | | +-------------+ | | 154 | +-----------------------------+ | 155 +------------------------------------+ 156 157 +------+ 158 |buffer| RING BUFFER 159 |page |-------------------+ 160 +------+ <---------------+ v 161 | ^ +---+ +---+ +---+ 162 | | | | | |-->| | 163 | | New | | | |<--| |<-+ 164 | | Reader +---+ +---+ +---+ | 165 | | page ----^ | | 166 | | | | 167 | +-----------------------------+ | 168 +------------------------------------+ 169 170 171 172It is possible that the page swapped is the commit page and the tail page, 173if what is in the ring buffer is less than what is held in a buffer page. 174 175 176 reader page commit page tail page 177 | | | 178 v | | 179 +---+ | | 180 | |<----------+ | 181 | |<------------------------+ 182 | |------+ 183 +---+ | 184 | 185 v 186 +---+ +---+ +---+ +---+ 187<---| |--->| |--->| |--->| |---> 188--->| |<---| |<---| |<---| |<--- 189 +---+ +---+ +---+ +---+ 190 191This case is still valid for this algorithm. 192When the writer leaves the page, it simply goes into the ring buffer 193since the reader page still points to the next location in the ring 194buffer. 195 196 197The main pointers: 198 199 reader page - The page used solely by the reader and is not part 200 of the ring buffer (may be swapped in) 201 202 head page - the next page in the ring buffer that will be swapped 203 with the reader page. 204 205 tail page - the page where the next write will take place. 206 207 commit page - the page that last finished a write. 208 209The commit page only is updated by the outermost writer in the 210writer stack. A writer that preempts another writer will not move the 211commit page. 212 213When data is written into the ring buffer, a position is reserved 214in the ring buffer and passed back to the writer. When the writer 215is finished writing data into that position, it commits the write. 216 217Another write (or a read) may take place at anytime during this 218transaction. If another write happens it must finish before continuing 219with the previous write. 220 221 222 Write reserve: 223 224 Buffer page 225 +---------+ 226 |written | 227 +---------+ <--- given back to writer (current commit) 228 |reserved | 229 +---------+ <--- tail pointer 230 | empty | 231 +---------+ 232 233 Write commit: 234 235 Buffer page 236 +---------+ 237 |written | 238 +---------+ 239 |written | 240 +---------+ <--- next positon for write (current commit) 241 | empty | 242 +---------+ 243 244 245 If a write happens after the first reserve: 246 247 Buffer page 248 +---------+ 249 |written | 250 +---------+ <-- current commit 251 |reserved | 252 +---------+ <--- given back to second writer 253 |reserved | 254 +---------+ <--- tail pointer 255 256 After second writer commits: 257 258 259 Buffer page 260 +---------+ 261 |written | 262 +---------+ <--(last full commit) 263 |reserved | 264 +---------+ 265 |pending | 266 |commit | 267 +---------+ <--- tail pointer 268 269 When the first writer commits: 270 271 Buffer page 272 +---------+ 273 |written | 274 +---------+ 275 |written | 276 +---------+ 277 |written | 278 +---------+ <--(last full commit and tail pointer) 279 280 281The commit pointer points to the last write location that was 282committed without preempting another write. When a write that 283preempted another write is committed, it only becomes a pending commit 284and will not be a full commit until all writes have been committed. 285 286The commit page points to the page that has the last full commit. 287The tail page points to the page with the last write (before 288committing). 289 290The tail page is always equal to or after the commit page. It may 291be several pages ahead. If the tail page catches up to the commit 292page then no more writes may take place (regardless of the mode 293of the ring buffer: overwrite and produce/consumer). 294 295The order of pages is: 296 297 head page 298 commit page 299 tail page 300 301Possible scenario: 302 tail page 303 head page commit page | 304 | | | 305 v v v 306 +---+ +---+ +---+ +---+ 307<---| |--->| |--->| |--->| |---> 308--->| |<---| |<---| |<---| |<--- 309 +---+ +---+ +---+ +---+ 310 311There is a special case that the head page is after either the commit page 312and possibly the tail page. That is when the commit (and tail) page has been 313swapped with the reader page. This is because the head page is always 314part of the ring buffer, but the reader page is not. Whenever there 315has been less than a full page that has been committed inside the ring buffer, 316and a reader swaps out a page, it will be swapping out the commit page. 317 318 319 reader page commit page tail page 320 | | | 321 v | | 322 +---+ | | 323 | |<----------+ | 324 | |<------------------------+ 325 | |------+ 326 +---+ | 327 | 328 v 329 +---+ +---+ +---+ +---+ 330<---| |--->| |--->| |--->| |---> 331--->| |<---| |<---| |<---| |<--- 332 +---+ +---+ +---+ +---+ 333 ^ 334 | 335 head page 336 337 338In this case, the head page will not move when the tail and commit 339move back into the ring buffer. 340 341The reader cannot swap a page into the ring buffer if the commit page 342is still on that page. If the read meets the last commit (real commit 343not pending or reserved), then there is nothing more to read. 344The buffer is considered empty until another full commit finishes. 345 346When the tail meets the head page, if the buffer is in overwrite mode, 347the head page will be pushed ahead one. If the buffer is in producer/consumer 348mode, the write will fail. 349 350Overwrite mode: 351 352 tail page 353 | 354 v 355 +---+ +---+ +---+ +---+ 356<---| |--->| |--->| |--->| |---> 357--->| |<---| |<---| |<---| |<--- 358 +---+ +---+ +---+ +---+ 359 ^ 360 | 361 head page 362 363 364 tail page 365 | 366 v 367 +---+ +---+ +---+ +---+ 368<---| |--->| |--->| |--->| |---> 369--->| |<---| |<---| |<---| |<--- 370 +---+ +---+ +---+ +---+ 371 ^ 372 | 373 head page 374 375 376 tail page 377 | 378 v 379 +---+ +---+ +---+ +---+ 380<---| |--->| |--->| |--->| |---> 381--->| |<---| |<---| |<---| |<--- 382 +---+ +---+ +---+ +---+ 383 ^ 384 | 385 head page 386 387Note, the reader page will still point to the previous head page. 388But when a swap takes place, it will use the most recent head page. 389 390 391Making the Ring Buffer Lockless: 392-------------------------------- 393 394The main idea behind the lockless algorithm is to combine the moving 395of the head_page pointer with the swapping of pages with the reader. 396State flags are placed inside the pointer to the page. To do this, 397each page must be aligned in memory by 4 bytes. This will allow the 2 398least significant bits of the address to be used as flags, since 399they will always be zero for the address. To get the address, 400simply mask out the flags. 401 402 MASK = ~3 403 404 address & MASK 405 406Two flags will be kept by these two bits: 407 408 HEADER - the page being pointed to is a head page 409 410 UPDATE - the page being pointed to is being updated by a writer 411 and was or is about to be a head page. 412 413 414 reader page 415 | 416 v 417 +---+ 418 | |------+ 419 +---+ | 420 | 421 v 422 +---+ +---+ +---+ +---+ 423<---| |--->| |-H->| |--->| |---> 424--->| |<---| |<---| |<---| |<--- 425 +---+ +---+ +---+ +---+ 426 427 428The above pointer "-H->" would have the HEADER flag set. That is 429the next page is the next page to be swapped out by the reader. 430This pointer means the next page is the head page. 431 432When the tail page meets the head pointer, it will use cmpxchg to 433change the pointer to the UPDATE state: 434 435 436 tail page 437 | 438 v 439 +---+ +---+ +---+ +---+ 440<---| |--->| |-H->| |--->| |---> 441--->| |<---| |<---| |<---| |<--- 442 +---+ +---+ +---+ +---+ 443 444 tail page 445 | 446 v 447 +---+ +---+ +---+ +---+ 448<---| |--->| |-U->| |--->| |---> 449--->| |<---| |<---| |<---| |<--- 450 +---+ +---+ +---+ +---+ 451 452"-U->" represents a pointer in the UPDATE state. 453 454Any access to the reader will need to take some sort of lock to serialize 455the readers. But the writers will never take a lock to write to the 456ring buffer. This means we only need to worry about a single reader, 457and writes only preempt in "stack" formation. 458 459When the reader tries to swap the page with the ring buffer, it 460will also use cmpxchg. If the flag bit in the pointer to the 461head page does not have the HEADER flag set, the compare will fail 462and the reader will need to look for the new head page and try again. 463Note, the flags UPDATE and HEADER are never set at the same time. 464 465The reader swaps the reader page as follows: 466 467 +------+ 468 |reader| RING BUFFER 469 |page | 470 +------+ 471 +---+ +---+ +---+ 472 | |--->| |--->| | 473 | |<---| |<---| | 474 +---+ +---+ +---+ 475 ^ | ^ | 476 | +---------------+ | 477 +-----H-------------+ 478 479The reader sets the reader page next pointer as HEADER to the page after 480the head page. 481 482 483 +------+ 484 |reader| RING BUFFER 485 |page |-------H-----------+ 486 +------+ v 487 | +---+ +---+ +---+ 488 | | |--->| |--->| | 489 | | |<---| |<---| |<-+ 490 | +---+ +---+ +---+ | 491 | ^ | ^ | | 492 | | +---------------+ | | 493 | +-----H-------------+ | 494 +--------------------------------------+ 495 496It does a cmpxchg with the pointer to the previous head page to make it 497point to the reader page. Note that the new pointer does not have the HEADER 498flag set. This action atomically moves the head page forward. 499 500 +------+ 501 |reader| RING BUFFER 502 |page |-------H-----------+ 503 +------+ v 504 | ^ +---+ +---+ +---+ 505 | | | |-->| |-->| | 506 | | | |<--| |<--| |<-+ 507 | | +---+ +---+ +---+ | 508 | | | ^ | | 509 | | +-------------+ | | 510 | +-----------------------------+ | 511 +------------------------------------+ 512 513After the new head page is set, the previous pointer of the head page is 514updated to the reader page. 515 516 +------+ 517 |reader| RING BUFFER 518 |page |-------H-----------+ 519 +------+ <---------------+ v 520 | ^ +---+ +---+ +---+ 521 | | | |-->| |-->| | 522 | | | | | |<--| |<-+ 523 | | +---+ +---+ +---+ | 524 | | | ^ | | 525 | | +-------------+ | | 526 | +-----------------------------+ | 527 +------------------------------------+ 528 529 +------+ 530 |buffer| RING BUFFER 531 |page |-------H-----------+ <--- New head page 532 +------+ <---------------+ v 533 | ^ +---+ +---+ +---+ 534 | | | | | |-->| | 535 | | New | | | |<--| |<-+ 536 | | Reader +---+ +---+ +---+ | 537 | | page ----^ | | 538 | | | | 539 | +-----------------------------+ | 540 +------------------------------------+ 541 542Another important point: The page that the reader page points back to 543by its previous pointer (the one that now points to the new head page) 544never points back to the reader page. That is because the reader page is 545not part of the ring buffer. Traversing the ring buffer via the next pointers 546will always stay in the ring buffer. Traversing the ring buffer via the 547prev pointers may not. 548 549Note, the way to determine a reader page is simply by examining the previous 550pointer of the page. If the next pointer of the previous page does not 551point back to the original page, then the original page is a reader page: 552 553 554 +--------+ 555 | reader | next +----+ 556 | page |-------->| |<====== (buffer page) 557 +--------+ +----+ 558 | | ^ 559 | v | next 560 prev | +----+ 561 +------------->| | 562 +----+ 563 564The way the head page moves forward: 565 566When the tail page meets the head page and the buffer is in overwrite mode 567and more writes take place, the head page must be moved forward before the 568writer may move the tail page. The way this is done is that the writer 569performs a cmpxchg to convert the pointer to the head page from the HEADER 570flag to have the UPDATE flag set. Once this is done, the reader will 571not be able to swap the head page from the buffer, nor will it be able to 572move the head page, until the writer is finished with the move. 573 574This eliminates any races that the reader can have on the writer. The reader 575must spin, and this is why the reader cannot preempt the writer. 576 577 tail page 578 | 579 v 580 +---+ +---+ +---+ +---+ 581<---| |--->| |-H->| |--->| |---> 582--->| |<---| |<---| |<---| |<--- 583 +---+ +---+ +---+ +---+ 584 585 tail page 586 | 587 v 588 +---+ +---+ +---+ +---+ 589<---| |--->| |-U->| |--->| |---> 590--->| |<---| |<---| |<---| |<--- 591 +---+ +---+ +---+ +---+ 592 593The following page will be made into the new head page. 594 595 tail page 596 | 597 v 598 +---+ +---+ +---+ +---+ 599<---| |--->| |-U->| |-H->| |---> 600--->| |<---| |<---| |<---| |<--- 601 +---+ +---+ +---+ +---+ 602 603After the new head page has been set, we can set the old head page 604pointer back to NORMAL. 605 606 tail page 607 | 608 v 609 +---+ +---+ +---+ +---+ 610<---| |--->| |--->| |-H->| |---> 611--->| |<---| |<---| |<---| |<--- 612 +---+ +---+ +---+ +---+ 613 614After the head page has been moved, the tail page may now move forward. 615 616 tail page 617 | 618 v 619 +---+ +---+ +---+ +---+ 620<---| |--->| |--->| |-H->| |---> 621--->| |<---| |<---| |<---| |<--- 622 +---+ +---+ +---+ +---+ 623 624 625The above are the trivial updates. Now for the more complex scenarios. 626 627 628As stated before, if enough writes preempt the first write, the 629tail page may make it all the way around the buffer and meet the commit 630page. At this time, we must start dropping writes (usually with some kind 631of warning to the user). But what happens if the commit was still on the 632reader page? The commit page is not part of the ring buffer. The tail page 633must account for this. 634 635 636 reader page commit page 637 | | 638 v | 639 +---+ | 640 | |<----------+ 641 | | 642 | |------+ 643 +---+ | 644 | 645 v 646 +---+ +---+ +---+ +---+ 647<---| |--->| |-H->| |--->| |---> 648--->| |<---| |<---| |<---| |<--- 649 +---+ +---+ +---+ +---+ 650 ^ 651 | 652 tail page 653 654If the tail page were to simply push the head page forward, the commit when 655leaving the reader page would not be pointing to the correct page. 656 657The solution to this is to test if the commit page is on the reader page 658before pushing the head page. If it is, then it can be assumed that the 659tail page wrapped the buffer, and we must drop new writes. 660 661This is not a race condition, because the commit page can only be moved 662by the outermost writer (the writer that was preempted). 663This means that the commit will not move while a writer is moving the 664tail page. The reader cannot swap the reader page if it is also being 665used as the commit page. The reader can simply check that the commit 666is off the reader page. Once the commit page leaves the reader page 667it will never go back on it unless a reader does another swap with the 668buffer page that is also the commit page. 669 670 671Nested writes 672------------- 673 674In the pushing forward of the tail page we must first push forward 675the head page if the head page is the next page. If the head page 676is not the next page, the tail page is simply updated with a cmpxchg. 677 678Only writers move the tail page. This must be done atomically to protect 679against nested writers. 680 681 temp_page = tail_page 682 next_page = temp_page->next 683 cmpxchg(tail_page, temp_page, next_page) 684 685The above will update the tail page if it is still pointing to the expected 686page. If this fails, a nested write pushed it forward, the the current write 687does not need to push it. 688 689 690 temp page 691 | 692 v 693 tail page 694 | 695 v 696 +---+ +---+ +---+ +---+ 697<---| |--->| |--->| |--->| |---> 698--->| |<---| |<---| |<---| |<--- 699 +---+ +---+ +---+ +---+ 700 701Nested write comes in and moves the tail page forward: 702 703 tail page (moved by nested writer) 704 temp page | 705 | | 706 v v 707 +---+ +---+ +---+ +---+ 708<---| |--->| |--->| |--->| |---> 709--->| |<---| |<---| |<---| |<--- 710 +---+ +---+ +---+ +---+ 711 712The above would fail the cmpxchg, but since the tail page has already 713been moved forward, the writer will just try again to reserve storage 714on the new tail page. 715 716But the moving of the head page is a bit more complex. 717 718 tail page 719 | 720 v 721 +---+ +---+ +---+ +---+ 722<---| |--->| |-H->| |--->| |---> 723--->| |<---| |<---| |<---| |<--- 724 +---+ +---+ +---+ +---+ 725 726The write converts the head page pointer to UPDATE. 727 728 tail page 729 | 730 v 731 +---+ +---+ +---+ +---+ 732<---| |--->| |-U->| |--->| |---> 733--->| |<---| |<---| |<---| |<--- 734 +---+ +---+ +---+ +---+ 735 736But if a nested writer preempts here, it will see that the next 737page is a head page, but it is also nested. It will detect that 738it is nested and will save that information. The detection is the 739fact that it sees the UPDATE flag instead of a HEADER or NORMAL 740pointer. 741 742The nested writer will set the new head page pointer. 743 744 tail page 745 | 746 v 747 +---+ +---+ +---+ +---+ 748<---| |--->| |-U->| |-H->| |---> 749--->| |<---| |<---| |<---| |<--- 750 +---+ +---+ +---+ +---+ 751 752But it will not reset the update back to normal. Only the writer 753that converted a pointer from HEAD to UPDATE will convert it back 754to NORMAL. 755 756 tail page 757 | 758 v 759 +---+ +---+ +---+ +---+ 760<---| |--->| |-U->| |-H->| |---> 761--->| |<---| |<---| |<---| |<--- 762 +---+ +---+ +---+ +---+ 763 764After the nested writer finishes, the outermost writer will convert 765the UPDATE pointer to NORMAL. 766 767 768 tail page 769 | 770 v 771 +---+ +---+ +---+ +---+ 772<---| |--->| |--->| |-H->| |---> 773--->| |<---| |<---| |<---| |<--- 774 +---+ +---+ +---+ +---+ 775 776 777It can be even more complex if several nested writes came in and moved 778the tail page ahead several pages: 779 780 781(first writer) 782 783 tail page 784 | 785 v 786 +---+ +---+ +---+ +---+ 787<---| |--->| |-H->| |--->| |---> 788--->| |<---| |<---| |<---| |<--- 789 +---+ +---+ +---+ +---+ 790 791The write converts the head page pointer to UPDATE. 792 793 tail page 794 | 795 v 796 +---+ +---+ +---+ +---+ 797<---| |--->| |-U->| |--->| |---> 798--->| |<---| |<---| |<---| |<--- 799 +---+ +---+ +---+ +---+ 800 801Next writer comes in, and sees the update and sets up the new 802head page. 803 804(second writer) 805 806 tail page 807 | 808 v 809 +---+ +---+ +---+ +---+ 810<---| |--->| |-U->| |-H->| |---> 811--->| |<---| |<---| |<---| |<--- 812 +---+ +---+ +---+ +---+ 813 814The nested writer moves the tail page forward. But does not set the old 815update page to NORMAL because it is not the outermost writer. 816 817 tail page 818 | 819 v 820 +---+ +---+ +---+ +---+ 821<---| |--->| |-U->| |-H->| |---> 822--->| |<---| |<---| |<---| |<--- 823 +---+ +---+ +---+ +---+ 824 825Another writer preempts and sees the page after the tail page is a head page. 826It changes it from HEAD to UPDATE. 827 828(third writer) 829 830 tail page 831 | 832 v 833 +---+ +---+ +---+ +---+ 834<---| |--->| |-U->| |-U->| |---> 835--->| |<---| |<---| |<---| |<--- 836 +---+ +---+ +---+ +---+ 837 838The writer will move the head page forward: 839 840 841(third writer) 842 843 tail page 844 | 845 v 846 +---+ +---+ +---+ +---+ 847<---| |--->| |-U->| |-U->| |-H-> 848--->| |<---| |<---| |<---| |<--- 849 +---+ +---+ +---+ +---+ 850 851But now that the third writer did change the HEAD flag to UPDATE it 852will convert it to normal: 853 854 855(third writer) 856 857 tail page 858 | 859 v 860 +---+ +---+ +---+ +---+ 861<---| |--->| |-U->| |--->| |-H-> 862--->| |<---| |<---| |<---| |<--- 863 +---+ +---+ +---+ +---+ 864 865 866Then it will move the tail page, and return back to the second writer. 867 868 869(second writer) 870 871 tail page 872 | 873 v 874 +---+ +---+ +---+ +---+ 875<---| |--->| |-U->| |--->| |-H-> 876--->| |<---| |<---| |<---| |<--- 877 +---+ +---+ +---+ +---+ 878 879 880The second writer will fail to move the tail page because it was already 881moved, so it will try again and add its data to the new tail page. 882It will return to the first writer. 883 884 885(first writer) 886 887 tail page 888 | 889 v 890 +---+ +---+ +---+ +---+ 891<---| |--->| |-U->| |--->| |-H-> 892--->| |<---| |<---| |<---| |<--- 893 +---+ +---+ +---+ +---+ 894 895The first writer cannot know atomically if the tail page moved 896while it updates the HEAD page. It will then update the head page to 897what it thinks is the new head page. 898 899 900(first writer) 901 902 tail page 903 | 904 v 905 +---+ +---+ +---+ +---+ 906<---| |--->| |-U->| |-H->| |-H-> 907--->| |<---| |<---| |<---| |<--- 908 +---+ +---+ +---+ +---+ 909 910Since the cmpxchg returns the old value of the pointer the first writer 911will see it succeeded in updating the pointer from NORMAL to HEAD. 912But as we can see, this is not good enough. It must also check to see 913if the tail page is either where it use to be or on the next page: 914 915 916(first writer) 917 918 A B tail page 919 | | | 920 v v v 921 +---+ +---+ +---+ +---+ 922<---| |--->| |-U->| |-H->| |-H-> 923--->| |<---| |<---| |<---| |<--- 924 +---+ +---+ +---+ +---+ 925 926If tail page != A and tail page != B, then it must reset the pointer 927back to NORMAL. The fact that it only needs to worry about nested 928writers means that it only needs to check this after setting the HEAD page. 929 930 931(first writer) 932 933 A B tail page 934 | | | 935 v v v 936 +---+ +---+ +---+ +---+ 937<---| |--->| |-U->| |--->| |-H-> 938--->| |<---| |<---| |<---| |<--- 939 +---+ +---+ +---+ +---+ 940 941Now the writer can update the head page. This is also why the head page must 942remain in UPDATE and only reset by the outermost writer. This prevents 943the reader from seeing the incorrect head page. 944 945 946(first writer) 947 948 A B tail page 949 | | | 950 v v v 951 +---+ +---+ +---+ +---+ 952<---| |--->| |--->| |--->| |-H-> 953--->| |<---| |<---| |<---| |<--- 954 +---+ +---+ +---+ +---+ 955 956