.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in a fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it, and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.
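
The "no commit, no replay" rule can be illustrated with a toy scan over
already-parsed block headers. This is only a sketch: the function name
``committed_sequences`` and the list-of-tuples input are hypothetical, the
numeric block types are the ones listed in the block type table of this
document, and checksum validation is left out entirely.

```python
# Block type values as listed in the jbd2 block type table.
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5

def committed_sequences(headers):
    """Return the IDs of transactions that are closed by a commit block.

    `headers` is a hypothetical, pre-parsed list of (blocktype, sequence)
    pairs in log order; data blocks carry no header and are left out.
    """
    committed = []
    open_seq = None
    for blocktype, sequence in headers:
        if open_seq is None and blocktype in (DESCRIPTOR, REVOKE):
            open_seq = sequence          # a new transaction starts here
        elif blocktype == COMMIT and sequence == open_seq:
            committed.append(sequence)   # transaction is complete
            open_seq = None
    return committed

log = [(DESCRIPTOR, 5), (COMMIT, 5),
       (REVOKE, 6), (DESCRIPTOR, 6), (COMMIT, 6),
       (DESCRIPTOR, 7)]                  # crash before the commit of 7
print(committed_sequences(log))          # [5, 6]
```

Transaction 7 never received a commit block, so it is not reported and
would be discarded by a replay that follows this rule.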

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - h\_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - \_\_be32
     - h\_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - \_\_be32
     - h\_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.
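
As a rough illustration of the header layout, the following Python sketch
decodes the three big-endian fields and maps the block type to the names in
the table above. The helper name ``parse_block_header`` and the synthetic
sample block are made up for this example.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}

def parse_block_header(block):
    """Return (block type name, transaction ID), or None without the magic."""
    # Three big-endian 32-bit fields: h_magic, h_blocktype, h_sequence.
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence

# Synthetic commit block for transaction 7, padded to 1024 bytes.
sample = struct.pack(">III", JBD2_MAGIC, 2, 7) + bytes(1012)
print(parse_block_header(sample))   # ('commit', 7)
```

A block that does not start with the magic number is reported as ``None``,
which is how journalled data blocks would normally look to such a parser.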

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal\_header\_t (12 bytes)
     - s\_header
     - Common header identifying this as a superblock.
   * - 0xC
     - \_\_be32
     - s\_blocksize
     - Journal device block size.
   * - 0x10
     - \_\_be32
     - s\_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - \_\_be32
     - s\_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - \_\_be32
     - s\_sequence
     - First commit ID expected in log.
   * - 0x1C
     - \_\_be32
     - s\_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - \_\_be32
     - s\_errno
     - Error value, as set by jbd2\_journal\_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - \_\_be32
     - s\_feature\_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - \_\_be32
     - s\_feature\_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - \_\_be32
     - s\_feature\_ro\_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - \_\_u8
     - s\_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - \_\_be32
     - s\_nr\_users
     - Number of file systems sharing this journal.
   * - 0x44
     - \_\_be32
     - s\_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - \_\_be32
     - s\_max\_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - \_\_be32
     - s\_max\_trans\_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - \_\_u8
     - s\_checksum\_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - \_\_u8[3]
     - s\_padding2
     -
   * - 0x54
     - \_\_be32
     - s\_num\_fc\_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - \_\_u32
     - s\_padding[41]
     -
   * - 0xFC
     - \_\_be32
     - s\_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - \_\_u8
     - s\_users[16\*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.
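
The offsets in the table translate directly into a big-endian decoder. The
sketch below reads only the fields up to ``s_nr_users``; the function name
``parse_journal_superblock`` and the synthetic sample values are assumptions
made for illustration, and no checksum verification is attempted.

```python
import struct
import uuid

FIELD_NAMES = (
    "s_blocksize", "s_maxlen", "s_first",          # static information
    "s_sequence", "s_start", "s_errno",            # dynamic information
    "s_feature_compat", "s_feature_incompat", "s_feature_ro_compat",
)

def parse_journal_superblock(sb):
    """Decode selected fields of a 1024-byte journal superblock."""
    # Nine consecutive big-endian 32-bit fields follow the 12-byte header,
    # covering offsets 0xC through 0x2F.
    fields = dict(zip(FIELD_NAMES, struct.unpack_from(">9I", sb, 0xC)))
    fields["s_uuid"] = uuid.UUID(bytes=bytes(sb[0x30:0x40]))
    (fields["s_nr_users"],) = struct.unpack_from(">I", sb, 0x40)
    return fields

# Synthetic superblock for demonstration only.
sample = bytearray(1024)
struct.pack_into(">III", sample, 0, 0xC03B3998, 4, 0)   # header, v2 superblock
struct.pack_into(">9I", sample, 0xC, 4096, 32768, 1, 5, 0, 0, 0, 0x12, 0)
struct.pack_into(">I", sample, 0x40, 1)
info = parse_journal_superblock(sample)
print(info["s_blocksize"], info["s_maxlen"], hex(info["s_feature_incompat"]))
```

For a real journal the bytes would come from the journal inode or the
external journal device rather than from a hand-built buffer.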

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT)
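
A small sketch of how the incompat bitmask might be decoded into the names
from the table above. The helper ``decode_incompat`` is hypothetical; treating
leftover bits as a reason to refuse the journal is the usual meaning of
"incompatible" features, but that policy is an assumption here rather than
something this section spells out.

```python
INCOMPAT_FEATURES = {
    0x01: "JBD2_FEATURE_INCOMPAT_REVOKE",
    0x02: "JBD2_FEATURE_INCOMPAT_64BIT",
    0x04: "JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT",
    0x08: "JBD2_FEATURE_INCOMPAT_CSUM_V2",
    0x10: "JBD2_FEATURE_INCOMPAT_CSUM_V3",
    0x20: "JBD2_FEATURE_INCOMPAT_FAST_COMMIT",
}

def decode_incompat(value):
    """Split s_feature_incompat into known feature names and unknown bits."""
    known_mask = 0
    names = []
    for bit, name in INCOMPAT_FEATURES.items():
        known_mask |= bit
        if value & bit:
            names.append(name)
    return names, value & ~known_mask

names, unknown = decode_incompat(0x12)
print(names, hex(unknown))
```

With ``0x12`` the result lists the 64-bit and CSUM\_V3 features and reports no
unknown bits.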

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following.  crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal\_block\_tag\_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have one of the following formats, depending on which
journal features and block tag flags are set.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be32
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
       not enabled.
   * - 0xC
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be16
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - \_\_be16
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
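
The tag sizes quoted for both formats follow from two journal feature bits
and one tag flag. A minimal sketch, in which the function name ``tag_size``
and its boolean parameters are illustrative rather than kernel identifiers:

```python
def tag_size(csum_v3, has_64bit, same_uuid):
    """Size in bytes of one journal block tag, per the tables above."""
    if csum_v3:
        # t_blocknr, t_flags, t_blocknr_high, t_checksum: always 16 bytes.
        size = 16
    else:
        # t_blocknr (4) + t_checksum (2) + t_flags (2).
        size = 8
        if has_64bit:
            size += 4      # t_blocknr_high
    if not same_uuid:
        size += 16         # trailing uuid[16]
    return size

options = (False, True)
v3_sizes = sorted({tag_size(True, b, s) for b in options for s in options})
old_sizes = sorted({tag_size(False, b, s) for b in options for s in options})
print(v3_sizes, old_sizes)   # [16, 32] [8, 12, 24, 28]
```

Enumerating every combination reproduces the size lists given in the text:
16 or 32 bytes with CSUM\_V3, and 8, 12, 24, or 28 bytes without it.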

If either JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.
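
The escaping rule is easy to get backwards, so here is a short round-trip
sketch. The function names ``escape`` and ``unescape`` are made up for this
example; the flag value 0x1 is the "escaped" bit from the tag flags table.

```python
import struct

JBD2_MAGIC = struct.pack(">I", 0xC03B3998)   # big-endian, as stored on disk
FLAG_ESCAPE = 0x1                            # "escaped" bit in the tag flags

def escape(block, flags=0):
    """Prepare a data block for the journal; returns (block, tag flags)."""
    if block[:4] == JBD2_MAGIC:
        # Zero the colliding bytes and remember that in the tag.
        return bytes(4) + block[4:], flags | FLAG_ESCAPE
    return block, flags

def unescape(block, flags):
    """Undo the escaping when the block is copied to its final location."""
    if flags & FLAG_ESCAPE:
        return JBD2_MAGIC + block[4:]
    return block

original = JBD2_MAGIC + b"payload that happens to start with the magic"
stored, tag_flags = escape(original)
print(stored[:4].hex(), tag_flags, unescape(stored, tag_flags) == original)
```

Without the escape, a recovery scan could mistake such a data block for a
journal metadata block.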

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - r\_header
     - Common block header.
   * - 0xC
     - \_\_be32
     - r\_count
     - Number of bytes used in this block.
   * - 0x10
     - \_\_be32 or \_\_be64
     - blocks[0]
     - Blocks to revoke.

After r\_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
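
A sketch of walking the revocation array. It assumes that ``r_count`` counts
bytes from the start of the block, so that it includes the 16-byte header;
the function name ``revoked_blocks`` and the sample block are illustrative.

```python
import struct

def revoked_blocks(block, has_64bit):
    """Return the block numbers listed in one revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0xC)
    # Entry width depends on 64-bit block number support in the superblock.
    fmt, width = (">Q", 8) if has_64bit else (">I", 4)
    return [struct.unpack_from(fmt, block, offset)[0]
            for offset in range(0x10, r_count, width)]

# Synthetic revocation block with two 32-bit entries.
sample = bytearray(4096)
struct.pack_into(">III", sample, 0, 0xC03B3998, 5, 9)   # header: type 5, tid 9
struct.pack_into(">I", sample, 0xC, 0x10 + 2 * 4)       # r_count
struct.pack_into(">II", sample, 0x10, 100, 200)
print(revoked_blocks(sample, has_64bit=False))          # [100, 200]
```

An ``r_count`` of 16 therefore describes an empty revocation block.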

If either JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - r\_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h\_chksum\_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h\_chksum\_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h\_padding[2]
     -
   * - 0x10
     - \_\_be32
     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
       are set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - \_\_be64
     - h\_commit\_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - \_\_be32
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV) records.
Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the
length of the entire field. It is followed by a variable-length, tag-specific
value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag ends
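
Walking a TLV log is a simple loop. The sketch below assumes a header of two
little-endian 16-bit fields (tag, then length) and assumes that the length
counts only the value bytes; neither detail is specified in this section, so
both should be checked against the ``struct ext4_fc_tl`` definition. The tag
numbers in the sample are arbitrary placeholders, not real tag values.

```python
import struct

HEADER = struct.Struct("<HH")   # assumed layout: 16-bit tag, 16-bit length

def walk_tlvs(area):
    """Yield (tag, value bytes) for each complete TLV in the area."""
    offset = 0
    while offset + HEADER.size <= len(area):
        tag, length = HEADER.unpack_from(area, offset)
        start = offset + HEADER.size
        value = area[start:start + length]
        if len(value) < length:
            break               # truncated record: stop scanning
        yield tag, value
        offset = start + length

# Two placeholder records: tag 1 with a 3-byte value, tag 2 with no value.
area = HEADER.pack(1, 3) + b"abc" + HEADER.pack(2, 0)
print(list(walk_tlvs(area)))    # [(1, b'abc'), (2, b'')]
```

A real reader would additionally dispatch on the tag and stop at the tail
record after verifying its CRC.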

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature, provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when the recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is, then the replay is not
idempotent. Let's say that while in replay, we crash after (2). During the
second replay, file A (which was actually created as a result of the "mv B A"
operation) would get deleted. Thus, the file named A would be absent when we
try to read A. So, this sequence of operations is not idempotent. However, as
mentioned above, instead of storing the procedure, fast commits store the
outcome of each procedure. Thus the fast commit log for the above procedure
would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3), we will have file A linked to inode 11. During the second
replay, we will remove file A (inode 11). But we will create it back and make
it point to inode 11. We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensure idempotence during
the replay.
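
The argument can be checked with a toy model. In the sketch below a
dictionary stands in for the directory and another for inode refcounts; the
record tuples and the ``replay`` helper are invented for this illustration and
do not correspond to the on-disk fast commit format.

```python
def replay(directory, refcounts, log):
    """Enforce outcome records on a toy file system state."""
    for record in log:
        kind = record[0]
        if kind == "unlink":
            directory.pop(record[1], None)      # dirent already gone: skip
        elif kind == "link":
            directory[record[1]] = record[2]    # force dirent -> inode
        elif kind == "inode":
            refcounts[record[1]] = record[2]    # overwrite with recorded count

LOG = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]

def fresh_state():
    # Dirent A -> inode 10 and dirent B -> inode 11 before the replay.
    return {"A": 10, "B": 11}, {11: 1}

# Uninterrupted replay.
clean_dir, clean_refs = fresh_state()
replay(clean_dir, clean_refs, LOG)

# Replay that crashes after step (3) and is then restarted from the top.
crash_dir, crash_refs = fresh_state()
replay(crash_dir, crash_refs, LOG[:3])
replay(crash_dir, crash_refs, LOG)

print(clean_dir, crash_dir)   # {'A': 11} {'A': 11}
```

Both runs end with dirent A pointing at inode 11 and the same refcount,
because every record states a final outcome rather than a step to perform.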

Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input; otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
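
The flag rules can be mirrored in a small userspace pre-check. This is only a
sketch: the numeric flag values below are assumptions for illustration and
must be taken from the uapi header of the target kernel, and the helper name
``checkpoint_flags_ok`` is hypothetical.

```python
# Assumed values, for illustration only; verify against the kernel headers.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
VALID_FLAGS = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN

def checkpoint_flags_ok(flags):
    """Return True if the flag word obeys the rules described above."""
    if flags & ~VALID_FLAGS:
        return False                    # unknown bits are invalid input
    if (flags & FLAG_DISCARD) and (flags & FLAG_ZEROOUT):
        return False                    # discard and zeroout are exclusive
    return True

print(checkpoint_flags_ok(FLAG_DISCARD | FLAG_DRY_RUN))   # True
print(checkpoint_flags_ok(FLAG_DISCARD | FLAG_ZEROOUT))   # False
```

The authoritative check remains the ioctl itself when called with the dry-run
flag, since only the kernel knows which flags it supports.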