linux/Documentation/filesystems/ext4/inodes.rst
.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc.), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with a file, then load the inode to find the metadata for
that file. ext4 appears to cheat (for performance reasons) a little bit
by storing a copy of the file type (normally stored in the inode) in the
directory entry. (Compare all this to FAT, which stores all the file
information directly in the directory entry, but does not support hard
links and is in general more seek-happy than ext4 due to its simpler
block allocator and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1
   :class: longtable

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - i\_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - \_\_le16
     - i\_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - \_\_le32
     - i\_size\_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - \_\_le32
     - i\_atime
     - Last access time, in seconds since the epoch. However, if the EA\_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - \_\_le32
     - i\_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - \_\_le32
     - i\_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - \_\_le32
     - i\_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - \_\_le16
     - i\_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - \_\_le16
     - i\_links\_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR\_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - \_\_le32
     - i\_blocks\_lo
     - Lower 32-bits of “block” count. If the huge\_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on disk. If
       huge\_file is set and EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``,
       then this file consumes ``i_blocks_lo + (i_blocks_hi << 32)``
       filesystem blocks on disk.
   * - 0x20
     - \_\_le32
     - i\_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i\_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i\_block[EXT4\_N\_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i\_block”.
   * - 0x64
     - \_\_le32
     - i\_generation
     - File version (for NFS).
   * - 0x68
     - \_\_le32
     - i\_file\_acl\_lo
     - Lower 32-bits of extended attribute block. ACLs are only one of many
       possible extended attributes; the field name presumably reflects that
       ACLs were the first use of extended attributes.
   * - 0x6C
     - \_\_le32
     - i\_size\_high / i\_dir\_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i\_dir\_acl, though it was usually set to zero and never used.
   * - 0x70
     - \_\_le32
     - i\_obso\_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i\_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - \_\_le16
     - i\_extra\_isize
     - Size of this inode - 128. Alternately, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - \_\_le16
     - i\_checksum\_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - \_\_le32
     - i\_ctime\_extra
     - Extra change time bits. This provides sub-second precision. See the
       Inode Timestamps section.
   * - 0x88
     - \_\_le32
     - i\_mtime\_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - \_\_le32
     - i\_atime\_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - \_\_le32
     - i\_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - \_\_le32
     - i\_crtime\_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - \_\_le32
     - i\_version\_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - \_\_le32
     - i\_projid
     - Project ID.

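As an illustration of the layout above, the following Python sketch unpacks
a handful of fixed-offset fields from a raw inode record and applies the
``i_blocks_lo`` rule to estimate on-disk usage. The function names, the
``has_huge_file`` argument, and the ``block_size`` argument are assumptions
made for this example, and the code assumes a Linux-created filesystem (so
that ``l_i_blocks_high`` sits at offset 0x74); it is not taken from the
kernel.

.. code-block:: python

   import struct

   EXT4_HUGE_FILE_FL = 0x40000  # value from the i_flags table


   def parse_inode_core(raw):
       """Unpack a few little-endian fields from the first 128 bytes."""
       if len(raw) < 128:
           raise ValueError("an inode record is at least 128 bytes")
       (i_mode,) = struct.unpack_from("<H", raw, 0x00)
       (i_size_lo,) = struct.unpack_from("<I", raw, 0x04)
       (i_links_count,) = struct.unpack_from("<H", raw, 0x1A)
       (i_blocks_lo,) = struct.unpack_from("<I", raw, 0x1C)
       (i_flags,) = struct.unpack_from("<I", raw, 0x20)
       (i_size_high,) = struct.unpack_from("<I", raw, 0x6C)
       # First member of the Linux variant of i_osd2.
       (i_blocks_high,) = struct.unpack_from("<H", raw, 0x74)
       return {
           "i_mode": i_mode,
           "i_size": i_size_lo | (i_size_high << 32),
           "i_links_count": i_links_count,
           "i_blocks_lo": i_blocks_lo,
           "i_blocks_high": i_blocks_high,
           "i_flags": i_flags,
       }


   def disk_usage_bytes(inode, has_huge_file, block_size):
       """Apply the three cases described for i_blocks_lo."""
       if not has_huge_file:
           return inode["i_blocks_lo"] * 512
       blocks = inode["i_blocks_lo"] + (inode["i_blocks_high"] << 32)
       if inode["i_flags"] & EXT4_HUGE_FILE_FL:
           return blocks * block_size  # counted in filesystem blocks
       return blocks * 512  # counted in 512-byte units
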
.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S\_IXOTH (Others may execute)
   * - 0x2
     - S\_IWOTH (Others may write)
   * - 0x4
     - S\_IROTH (Others may read)
   * - 0x8
     - S\_IXGRP (Group members may execute)
   * - 0x10
     - S\_IWGRP (Group members may write)
   * - 0x20
     - S\_IRGRP (Group members may read)
   * - 0x40
     - S\_IXUSR (Owner may execute)
   * - 0x80
     - S\_IWUSR (Owner may write)
   * - 0x100
     - S\_IRUSR (Owner may read)
   * - 0x200
     - S\_ISVTX (Sticky bit)
   * - 0x400
     - S\_ISGID (Set GID)
   * - 0x800
     - S\_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S\_IFIFO (FIFO)
   * - 0x2000
     - S\_IFCHR (Character device)
   * - 0x4000
     - S\_IFDIR (Directory)
   * - 0x6000
     - S\_IFBLK (Block device)
   * - 0x8000
     - S\_IFREG (Regular file)
   * - 0xA000
     - S\_IFLNK (Symbolic link)
   * - 0xC000
     - S\_IFSOCK (Socket)

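Because the file types are mutually exclusive, they are compared as a whole
under the top four bits rather than tested bit by bit. A minimal Python
sketch of splitting ``i_mode`` follows; the ``split_mode`` helper and the
``FILE_TYPES`` table are names invented for this example.

.. code-block:: python

   # File type values from the table above, keyed by (i_mode & 0xF000).
   FILE_TYPES = {
       0x1000: "S_IFIFO",
       0x2000: "S_IFCHR",
       0x4000: "S_IFDIR",
       0x6000: "S_IFBLK",
       0x8000: "S_IFREG",
       0xA000: "S_IFLNK",
       0xC000: "S_IFSOCK",
   }


   def split_mode(i_mode):
       """Return (file type name, permission and set-id/sticky bits)."""
       file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")
       permissions = i_mode & 0x0FFF
       return file_type, permissions

For instance, ``split_mode(0x81A4)`` yields ``("S_IFREG", 0o644)``, i.e. a
regular file that is readable by everyone and writable by its owner.
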
.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4\_UNRM\_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
   * - 0x10
     - File is immutable (EXT4\_IMMUTABLE\_FL).
   * - 0x20
     - File can only be appended (EXT4\_APPEND\_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
   * - 0x80
     - Do not update access time (EXT4\_NOATIME\_FL).
   * - 0x100
     - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
       EXT4\_ECOMPR\_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4\_INDEX\_FL).
   * - 0x2000
     - AFS magic directory (EXT4\_IMAGIC\_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4\_JOURNAL\_DATA\_FL).
   * - 0x8000
     - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4\_DIRSYNC\_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
   * - 0x40000
     - This is a huge file (EXT4\_HUGE\_FILE\_FL).
   * - 0x80000
     - Inode uses extents (EXT4\_EXTENTS\_FL).
   * - 0x100000
     - Verity protected file (EXT4\_VERITY\_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4\_EA\_INODE\_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4\_RESERVED\_FL).
   * -
     - Aggregate flags:
   * - 0x705BDFFF
     - User-visible flags.
   * - 0x604BC0FF
     - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
       EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
       EXT4\_FL\_USER\_MODIFIABLE mask: the kernel needs to handle the setting
       of these flags in a special manner, so they are masked out of the set
       of flags that are saved directly to i\_flags.

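Unlike the file type in ``i_mode``, these values are independent bits, so
each one is tested with a bitwise AND. The sketch below names a subset of
the flags from the table; ``INODE_FLAGS`` and ``describe_flags`` are
hypothetical names chosen for this example, and any bit that is not in the
subset is simply reported back as a leftover value.

.. code-block:: python

   # A subset of the i_flags table; extend as needed.
   INODE_FLAGS = {
       0x8: "EXT4_SYNC_FL",
       0x10: "EXT4_IMMUTABLE_FL",
       0x20: "EXT4_APPEND_FL",
       0x80: "EXT4_NOATIME_FL",
       0x1000: "EXT4_INDEX_FL",
       0x40000: "EXT4_HUGE_FILE_FL",
       0x80000: "EXT4_EXTENTS_FL",
       0x200000: "EXT4_EA_INODE_FL",
       0x10000000: "EXT4_INLINE_DATA_FL",
       0x20000000: "EXT4_PROJINHERIT_FL",
   }


   def describe_flags(i_flags):
       """Return (names of recognised flags, bits this table does not cover)."""
       names = []
       leftover = i_flags
       for bit, name in sorted(INODE_FLAGS.items()):
           if i_flags & bit:
               names.append(name)
               leftover &= ~bit
       return names, leftover
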
.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - l\_i\_version
     - Inode version. However, if the EA\_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - h\_i\_translator
     - ??

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - m\_i\_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - l\_i\_blocks\_high
     - Upper 16-bits of the block count. Please see the note attached to
       i\_blocks\_lo.
   * - 0x2
     - \_\_le16
     - l\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - \_\_le16
     - l\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - l\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_le16
     - l\_i\_checksum\_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - \_\_le16
     - l\_i\_reserved
     - Unused.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - h\_i\_mode\_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - \_\_le16
     - h\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - h\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_u32
     - h\_i\_author
     - Author code?

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - m\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - \_\_u32
     - m\_i\_reserved2[2]
     - ??

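Several values are split between the original 128-byte inode and the Linux
variant of ``osd2``. As a rough sketch, assuming a Linux-created filesystem
and a raw record of at least 128 bytes, the halves can be recombined as
shown below. The offsets are the ``osd2`` offsets above added to 0x74, the
position of ``i_osd2`` in the inode; ``linux_owner_and_acl`` is a name made
up for this example.

.. code-block:: python

   import struct

   OSD2_OFFSET = 0x74  # position of i_osd2 inside the inode


   def linux_owner_and_acl(raw):
       """Recombine the split UID, GID and extended attribute block number."""
       (uid_lo,) = struct.unpack_from("<H", raw, 0x02)   # i_uid
       (gid_lo,) = struct.unpack_from("<H", raw, 0x18)   # i_gid
       (acl_lo,) = struct.unpack_from("<I", raw, 0x68)   # i_file_acl_lo
       # l_i_blocks_high, l_i_file_acl_high, l_i_uid_high, l_i_gid_high
       _blocks_high, acl_high, uid_high, gid_high = struct.unpack_from(
           "<4H", raw, OSD2_OFFSET
       )
       return {
           "uid": uid_lo | (uid_high << 16),
           "gid": gid_lo | (gid_high << 16),
           "file_acl": acl_lo | (acl_high << 32),
       }
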
Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by struct ext4\_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows struct ext4\_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.

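The bounds check described above can be sketched as follows. This is only an
illustration of the rule, not the kernel's code: ``field_is_present`` is a
hypothetical helper, intended for fields at offset 0x80 and beyond, and the
rejection of an ``i_extra_isize`` that overruns the record is an extra
sanity check assumed for this example.

.. code-block:: python

   GOOD_OLD_INODE_SIZE = 128  # EXT2_GOOD_OLD_INODE_SIZE


   def field_is_present(inode_size, i_extra_isize, field_offset, field_size):
       """True if an extended field lies within 128 + i_extra_isize bytes."""
       if inode_size <= GOOD_OLD_INODE_SIZE:
           return False  # 128-byte records have no extended fields
       used = GOOD_OLD_INODE_SIZE + i_extra_isize
       if used > inode_size:
           return False  # i_extra_isize claims more than the record holds
       return field_offset + field_size <= used

With the default 256-byte record and ``i_extra_isize = 32``, the last field
in the table, ``i_projid`` at 0x9C, ends exactly at byte 160 and is therefore
present; with ``i_extra_isize = 28`` it is not.
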
Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.

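The three formulas combine into a short helper. The sketch below is a direct
transcription into Python; ``locate_inode`` and its parameter names are
chosen for this example, and the two superblock values are passed in as
plain integers.

.. code-block:: python

   def locate_inode(inode_num, inodes_per_group, inode_size):
       """Return (block group, index in the group's table, byte offset)."""
       if inode_num < 1:
           raise ValueError("there is no inode 0")
       bg = (inode_num - 1) // inodes_per_group
       index = (inode_num - 1) % inodes_per_group
       offset = index * inode_size
       return bg, index, offset

For example, with 8192 inodes per group and 256-byte records, inode 12 is
entry 11 of block group 0, at byte offset 2816 within that group's inode
table.
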
Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. If the filesystem does not have the orphan_file feature, inodes
that are not linked from any directory but are still open (orphan inodes) have
the dtime field overloaded for use with the orphan list. The superblock field
``s_last_orphan`` points to the first inode in the orphan list; dtime is then
the number of the next orphaned inode, or zero if there are no more orphans.

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64-bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv\_sec
     - Decoded 64-bit tv\_sec
     - Valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

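The table can be reproduced with a few lines of arithmetic. The following
sketch decodes one timestamp pair as described above (sign-extend the 32-bit
seconds field, then add the two epoch bits shifted left by 32); it is an
illustration only, and ``decode_timestamp`` is a name invented for this
example. Both arguments are the raw unsigned 32-bit values as read from
disk.

.. code-block:: python

   def decode_timestamp(seconds, extra):
       """Return (64-bit tv_sec, nanoseconds) for a seconds/extra pair."""
       if seconds & 0x80000000:
           seconds -= 1 << 32       # reinterpret as a signed 32-bit value
       epoch_bits = extra & 0x3     # lower two bits extend the seconds
       nanoseconds = extra >> 2     # upper 30 bits hold the nanoseconds
       return seconds + (epoch_bits << 32), nanoseconds

For example, a stored seconds value of ``0x80000000`` with epoch bits ``0 1``
decodes to ``0x080000000``, the first second of the 2038-01-19 to 2106-02-07
row.
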
This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901
and 1970. At some point the kernel will be fixed and e2fsck will fix
this situation, assuming that it is run before 2310.
