linux/Documentation/filesystems/locking.rst
=======
Locking
=======

The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree; don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases at the end of this file.
Don't turn it into a log - maintainers of out-of-tree code are supposed to
be able to use diff(1).

Thing currently missing here: socket operations. Alexey?

dentry_operations
=================

prototypes::

        int (*d_revalidate)(struct dentry *, unsigned int);
        int (*d_weak_revalidate)(struct dentry *, unsigned int);
        int (*d_hash)(const struct dentry *, struct qstr *);
        int (*d_compare)(const struct dentry *,
                        unsigned int, const char *, const struct qstr *);
        int (*d_delete)(struct dentry *);
        int (*d_init)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_prune)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
        struct vfsmount *(*d_automount)(struct path *path);
        int (*d_manage)(const struct path *, bool);
        struct dentry *(*d_real)(struct dentry *, const struct inode *);

locking rules:

================== ===========  ========        ==============  ========
ops                rename_lock  ->d_lock        may block       rcu-walk
================== ===========  ========        ==============  ========
d_revalidate:      no           no              yes (ref-walk)  maybe
d_weak_revalidate: no           no              yes             no
d_hash:            no           no              no              maybe
d_compare:         yes          no              no              maybe
d_delete:          no           yes             no              no
d_init:            no           no              yes             no
d_release:         no           no              yes             no
d_prune:           no           yes             no              no
d_iput:            no           no              yes             no
d_dname:           no           no              no              no
d_automount:       no           no              yes             no
d_manage:          no           no              yes (ref-walk)  maybe
d_real:            no           no              yes             no
================== ===========  ========        ==============  ========
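The rcu-walk column above can be illustrated with a minimal ->d_revalidate()
sketch (kernel-context code, not buildable on its own; example_dentry_is_stale()
is a hypothetical helper).  In rcu-walk mode (LOOKUP_RCU) the method may not
block; returning -ECHILD asks the VFS to retry the lookup in ref-walk mode::

        static int example_d_revalidate(struct dentry *dentry, unsigned int flags)
        {
                if (flags & LOOKUP_RCU)
                        return -ECHILD; /* can't sleep here; retry in ref-walk */

                /* ref-walk mode: blocking (e.g. network revalidation) is fine */
                if (example_dentry_is_stale(dentry))    /* hypothetical helper */
                        return 0;       /* invalid; VFS will invalidate it */
                return 1;               /* still valid */
        }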

inode_operations
================

prototypes::

        int (*create) (struct inode *,struct dentry *,umode_t, bool);
        struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct inode *,struct dentry *,umode_t);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
        int (*rename) (struct inode *, struct dentry *,
                        struct inode *, struct dentry *, unsigned int);
        int (*readlink) (struct dentry *, char __user *,int);
        const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int, unsigned int);
        struct posix_acl * (*get_acl)(struct inode *, int, bool);
        int (*setattr) (struct dentry *, struct iattr *);
        int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
        void (*update_time)(struct inode *, struct timespec64 *, int);
        int (*atomic_open)(struct inode *, struct dentry *,
                                struct file *, unsigned open_flag,
                                umode_t create_mode);
        int (*tmpfile) (struct inode *, struct dentry *, umode_t);
        int (*fileattr_set)(struct user_namespace *mnt_userns,
                            struct dentry *dentry, struct fileattr *fa);
        int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);

locking rules:
        all may block

=============   =============================================
ops             i_rwsem(inode)
=============   =============================================
lookup:         shared
create:         exclusive
link:           exclusive (both)
mknod:          exclusive
symlink:        exclusive
mkdir:          exclusive
unlink:         exclusive (both)
rmdir:          exclusive (both) (see below)
rename:         exclusive (all)  (see below)
readlink:       no
get_link:       no
setattr:        exclusive
permission:     no (may not block if called in rcu-walk mode)
get_acl:        no
getattr:        no
listxattr:      no
fiemap:         no
update_time:    no
atomic_open:    shared (exclusive if O_CREAT is set in open flags)
tmpfile:        no
fileattr_get:   no or exclusive
fileattr_set:   exclusive
=============   =============================================


        Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
        exclusive on victim.
        cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.

See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations.

xattr_handler operations
========================

prototypes::

        bool (*list)(struct dentry *dentry);
        int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
                   struct inode *inode, const char *name, void *buffer,
                   size_t size);
        int (*set)(const struct xattr_handler *handler,
                   struct user_namespace *mnt_userns,
                   struct dentry *dentry, struct inode *inode, const char *name,
                   const void *buffer, size_t size, int flags);

locking rules:
        all may block

=====           ==============
ops             i_rwsem(inode)
=====           ==============
list:           no
get:            no
set:            exclusive
=====           ==============

super_operations
================

prototypes::

        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*free_inode)(struct inode *);
        void (*destroy_inode)(struct inode *);
        void (*dirty_inode) (struct inode *, int flags);
        int (*write_inode) (struct inode *, struct writeback_control *wbc);
        int (*drop_inode) (struct inode *);
        void (*evict_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        int (*freeze_fs) (struct super_block *);
        int (*unfreeze_fs) (struct super_block *);
        int (*statfs) (struct dentry *, struct kstatfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*umount_begin) (struct super_block *);
        int (*show_options)(struct seq_file *, struct dentry *);
        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);

locking rules:
        All may block [not true, see below]

======================  ============    ========================
ops                     s_umount        note
======================  ============    ========================
alloc_inode:
free_inode:                             called from RCU callback
destroy_inode:
dirty_inode:
write_inode:
drop_inode:                             !!!inode->i_lock!!!
evict_inode:
put_super:              write
sync_fs:                read
freeze_fs:              write
unfreeze_fs:            write
statfs:                 maybe(read)     (see below)
remount_fs:             write
umount_begin:           no
show_options:           no              (namespace_sem)
quota_read:             no              (see below)
quota_write:            no              (see below)
======================  ============    ========================

->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin
the superblock down when we only have dev_t given us by userland to
identify the superblock.  Everything else (statfs(), fstatfs(), etc.)
doesn't hold it when calling ->statfs() - the superblock is pinned down
by resolving the pathname passed to the syscall.

->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also the dquot_operations section.

file_system_type
================

prototypes::

        struct dentry *(*mount) (struct file_system_type *, int,
                       const char *, void *);
        void (*kill_sb) (struct super_block *);

locking rules:

=======         =========
ops             may block
=======         =========
mount           yes
kill_sb         yes
=======         =========

->mount() returns ERR_PTR or the root dentry; its superblock should be locked
on return.

->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.

address_space_operations
========================

prototypes::

        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);
        int (*writepages)(struct address_space *, struct writeback_control *);
        int (*set_page_dirty)(struct page *page);
        void (*readahead)(struct readahead_control *);
        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);
        int (*write_begin)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned copied,
                                struct page *page, void *fsdata);
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidatepage) (struct page *, unsigned int, unsigned int);
        int (*releasepage) (struct page *, int);
        void (*freepage)(struct page *);
        int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
        bool (*isolate_page) (struct page *, isolate_mode_t);
        int (*migratepage)(struct address_space *, struct page *, struct page *);
        void (*putback_page) (struct page *);
        int (*launder_page)(struct page *);
        int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
        int (*error_remove_page)(struct address_space *, struct page *);
        int (*swap_activate)(struct file *);
        int (*swap_deactivate)(struct file *);

locking rules:
        All except set_page_dirty and freepage may block

======================  ======================== =========      ===============
ops                     PageLocked(page)         i_rwsem        invalidate_lock
======================  ======================== =========      ===============
writepage:              yes, unlocks (see below)
readpage:               yes, unlocks                            shared
writepages:
set_page_dirty:         no
readahead:              yes, unlocks                            shared
readpages:              no                                      shared
write_begin:            locks the page           exclusive
write_end:              yes, unlocks             exclusive
bmap:
invalidatepage:         yes                                     exclusive
releasepage:            yes
freepage:               yes
direct_IO:
isolate_page:           yes
migratepage:            yes (both)
putback_page:           yes
launder_page:           yes
is_partially_uptodate:  yes
error_remove_page:      yes
swap_activate:          no
swap_deactivate:        no
======================  ======================== =========      ===============

->write_begin(), ->write_end() and ->readpage() may be called from
the request handler (/dev/loop).

->readpage() unlocks the page, either synchronously or via I/O
completion.

->readahead() unlocks the pages that I/O is attempted on like ->readpage().

->readpages() populates the pagecache with the passed pages and starts
I/O against them.  They come unlocked upon I/O completion.

->writepage() is used for two purposes: for "memory cleansing" and for
"sync".  These are quite different operations and the behaviour may differ
depending upon the mode.

If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.

If writepage is called for memory cleansing (sync_mode ==
WB_SYNC_NONE) then its role is to get as much writeout underway as
possible.  So writepage should try to avoid blocking against
currently-in-progress I/O.

If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page, the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.

If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.

The filesystem should unlock the page synchronously, before returning to the
caller, unless ->writepage() returns the special AOP_WRITEPAGE_ACTIVATE
value. AOP_WRITEPAGE_ACTIVATE means that the page cannot really be written out
currently, and the VM should stop calling ->writepage() on this page for some
time. The VM does this by moving the page to the head of the active list, hence
the name.

Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it.  Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete.  If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.
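The rules above can be condensed into a skeleton ->writepage() (an
illustrative sketch only; example_must_block() and example_submit_io() are
hypothetical stand-ins for the filesystem's own logic)::

        static int example_writepage(struct page *page,
                                     struct writeback_control *wbc)
        {
                if (wbc->sync_mode == WB_SYNC_NONE && example_must_block(page)) {
                        /* memory cleansing: don't block, try again later */
                        redirty_page_for_writepage(wbc, page);
                        unlock_page(page);
                        return 0;
                }

                set_page_writeback(page);
                unlock_page(page);      /* pages under writeout are not locked */

                if (example_submit_io(page) < 0) {
                        /* no I/O was submitted; end writeback ourselves */
                        end_page_writeback(page);
                        return -EIO;
                }
                /* on success, the I/O completion handler runs
                 * end_page_writeback() */
                return 0;
        }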

That is: after 2.5.12, pages which are under writeout are *not* locked.  Note,
if the filesystem needs the page to be locked during writeout, that is ok, too,
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().

Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree.  This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.

->writepages() is used for periodic writeback and for syscall-initiated
sync operations.  The address_space should start I/O against at least
``*nr_to_write`` pages.  ``*nr_to_write`` must be decremented for each page
which is written.  The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.

writepages should _only_ write pages which are present on
mapping->io_pages.

->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback.  It may be called
under spinlock (it cannot block) and is sometimes called with the page
not locked.

->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away.  Please,
keep it that way and don't breed new callers.

->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses
block_invalidatepage() instead. The filesystem must exclusively acquire
invalidate_lock before invalidating page cache in the truncate / hole punch
path (and thus calling into ->invalidatepage) to block races between page
cache invalidation and page cache filling functions (fault, read, ...).

->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it.  It returns zero to
indicate that the buffers are (or may be) freeable.  If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.

->freepage() is called when the kernel is done dropping the page
from the page cache.

->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

->swap_activate will be called with a non-zero argument on
files backing (non block device backed) swapfiles. A return value
of zero indicates success, in which case this file can be used for
backing swapspace. The swapspace operations will be proxied to the
address space operations.

->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.

file_lock_operations
====================

prototypes::

        void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
        void (*fl_release_private)(struct file_lock *);


locking rules:

===================     =============   =========
ops                     inode->i_lock   may block
===================     =============   =========
fl_copy_lock:           yes             no
fl_release_private:     maybe           maybe [1]_
===================     =============   =========

.. [1]
   ->fl_release_private for flock or POSIX locks is currently allowed
   to block. Leases however can still be freed while the i_lock is held and
   so fl_release_private called on a lease should not block.

lock_manager_operations
=======================

prototypes::

        void (*lm_notify)(struct file_lock *);  /* unblock callback */
        int (*lm_grant)(struct file_lock *, struct file_lock *, int);
        void (*lm_break)(struct file_lock *); /* break_lease callback */
        int (*lm_change)(struct file_lock **, int);
        bool (*lm_breaker_owns_lease)(struct file_lock *);

locking rules:

======================  =============   =================       =========
ops                     inode->i_lock   blocked_lock_lock       may block
======================  =============   =================       =========
lm_notify:              yes             yes                     no
lm_grant:               no              no                      no
lm_break:               yes             no                      no
lm_change:              yes             no                      no
lm_breaker_owns_lease:  no              no                      no
======================  =============   =================       =========

buffer_head
===========

prototypes::

        void (*b_end_io)(struct buffer_head *bh, int uptodate);

locking rules:

called from interrupts. In other words, extreme care is needed here.
bh is locked, but that's the only guarantee we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c provide these. Block devices
call this method upon I/O completion.

block_device_operations
=======================

prototypes::

        int (*open) (struct block_device *, fmode_t);
        int (*release) (struct gendisk *, fmode_t);
        int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
        int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
        int (*direct_access) (struct block_device *, sector_t, void **,
                                unsigned long *);
        void (*unlock_native_capacity) (struct gendisk *);
        int (*getgeo)(struct block_device *, struct hd_geometry *);
        void (*swap_slot_free_notify) (struct block_device *, unsigned long);

locking rules:

======================= ===================
ops                     open_mutex
======================= ===================
open:                   yes
release:                yes
ioctl:                  no
compat_ioctl:           no
direct_access:          no
unlock_native_capacity: no
getgeo:                 no
swap_slot_free_notify:  no      (see below)
======================= ===================

swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.


file_operations
===============

prototypes::

        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
        int (*iopoll) (struct kiocb *kiocb, bool spin);
        int (*iterate) (struct file *, struct dir_context *);
        int (*iterate_shared) (struct file *, struct dir_context *);
        __poll_t (*poll) (struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
                        loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long,
                        unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
                        size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
                        size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **, void **);
        long (*fallocate)(struct file *, int, loff_t, loff_t);
        void (*show_fdinfo)(struct seq_file *m, struct file *f);
        unsigned (*mmap_capabilities)(struct file *);
        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
                        loff_t, size_t, unsigned int);
        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                        struct file *file_out, loff_t pos_out,
                        loff_t len, unsigned int remap_flags);
        int (*fadvise)(struct file *, loff_t, loff_t, int);

locking rules:
        All may block.

->llseek() locking has moved from the VFS to the individual llseek
implementations.  If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something userspace has to take care of.
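As a sketch of the above, a hand-rolled ->llseek() using i_size_read() for
SEEK_END might look like this (illustrative only; real filesystems should
normally just use generic_file_llseek())::

        static loff_t example_llseek(struct file *file, loff_t offset, int whence)
        {
                struct inode *inode = file->f_mapping->host;

                switch (whence) {
                case SEEK_SET:
                        break;
                case SEEK_CUR:
                        offset += file->f_pos;
                        break;
                case SEEK_END:
                        offset += i_size_read(inode);   /* no i_rwsem needed */
                        break;
                default:
                        return -EINVAL;
                }
                if (offset < 0)
                        return -EINVAL;
                file->f_pos = offset;   /* f_pos races are userspace's problem */
                return offset;
        }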

->iterate() is called with i_rwsem exclusive.

->iterate_shared() is called with i_rwsem at least shared.

->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
not normally something one needs to worry about.  Return values > 0 will be
mapped to zero in the VFS layer.
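In practice the ->fasync() rule above usually reduces to a one-line delegation
to fasync_helper() (sketch; example_fasync_list is a hypothetical per-device
``struct fasync_struct *``)::

        static int example_fasync(int fd, struct file *filp, int on)
        {
                /* fasync_helper() maintains FASYNC in filp->f_flags and may
                 * return > 0 on change; the VFS maps that to zero. */
                return fasync_helper(fd, filp, on, &example_fasync_list);
        }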

->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...

->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.

->setlease operations should call generic_setlease() before or after setting
the lease within the individual filesystem to record the result of the
operation.

->fallocate implementation must be really careful to maintain page cache
consistency when punching holes or performing other operations that invalidate
page cache contents. Usually the filesystem needs to call
truncate_inode_pages_range() to invalidate the relevant range of the page cache.
However the filesystem usually also needs to update its internal (and on disk)
view of the file offset -> disk block mapping. Until this update is finished, the
filesystem needs to block page faults and reads from reloading now-stale page
cache contents from the disk. Since the VFS acquires mapping->invalidate_lock in
shared mode when loading pages from disk (filemap_fault(), filemap_read(),
readahead paths), the fallocate implementation must take the invalidate_lock to
prevent reloading.
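A hole-punch sketch following the rules above: take mapping->invalidate_lock
exclusively so that page faults and reads (which take it shared) cannot reload
stale page cache contents while the block mapping is being updated.
example_remove_blocks() is a hypothetical stand-in for the filesystem's
mapping update::

        static long example_punch_hole(struct inode *inode, loff_t offset,
                                       loff_t len)
        {
                struct address_space *mapping = inode->i_mapping;
                long ret;

                filemap_invalidate_lock(mapping);
                truncate_inode_pages_range(mapping, offset, offset + len - 1);
                /* faults/reads are blocked until the mapping update is done */
                ret = example_remove_blocks(inode, offset, len);
                filemap_invalidate_unlock(mapping);
                return ret;
        }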

->copy_file_range and ->remap_file_range implementations need to serialize
against modifications of file data while the operation is running. For
blocking changes through write(2) and similar operations inode->i_rwsem can be
used. To block changes to file contents via a memory mapping during the
operation, the filesystem must take mapping->invalidate_lock to coordinate
with ->page_mkwrite.

dquot_operations
================

prototypes::

        int (*write_dquot) (struct dquot *);
        int (*acquire_dquot) (struct dquot *);
        int (*release_dquot) (struct dquot *);
        int (*mark_dirty) (struct dquot *);
        int (*write_info) (struct super_block *, int);

These operations are intended to be more or less wrapping functions that ensure
proper locking with respect to the filesystem and call the generic quota
operations.

What the filesystem should expect from the generic quota functions:

==============  ============    =========================
ops             FS recursion    Held locks when called
==============  ============    =========================
write_dquot:    yes             dqonoff_sem or dqptr_sem
acquire_dquot:  yes             dqonoff_sem or dqptr_sem
release_dquot:  yes             dqonoff_sem or dqptr_sem
mark_dirty:     no              -
write_info:     yes             dqonoff_sem
==============  ============    =========================

FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.

More details about quota locking can be found in fs/quota/dquot.c.

vm_operations_struct
====================

prototypes::

        void (*open)(struct vm_area_struct*);
        void (*close)(struct vm_area_struct*);
        vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
        void (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
        vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
        vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
        int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);

locking rules:

=============   =========       ===========================
ops             mmap_lock       PageLocked(page)
=============   =========       ===========================
open:           yes
close:          yes
fault:          yes             can return with page locked
map_pages:      yes
page_mkwrite:   yes             can return with page locked
pfn_mkwrite:    yes
access:         yes
=============   =========       ===========================

->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in
"pgoff" in the vm_fault structure. If it is possible that the page may be
truncated and/or invalidated, then the filesystem must lock invalidate_lock,
then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
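A ->fault() sketch for the rules above, using the current single-argument
prototype (illustrative only; a real handler would also read the page data in
when it is not uptodate): take invalidate_lock shared to block racing
truncation, recheck the size, and return VM_FAULT_LOCKED with the page still
locked so the VM can unlock it::

        static vm_fault_t example_fault(struct vm_fault *vmf)
        {
                struct address_space *mapping = vmf->vma->vm_file->f_mapping;
                struct inode *inode = mapping->host;
                struct page *page;

                filemap_invalidate_lock_shared(mapping);
                /* invalidate_lock blocks truncate/hole punch: recheck size */
                if (vmf->pgoff >= DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)) {
                        filemap_invalidate_unlock_shared(mapping);
                        return VM_FAULT_SIGBUS;
                }
                page = find_or_create_page(mapping, vmf->pgoff, GFP_KERNEL);
                if (!page) {
                        filemap_invalidate_unlock_shared(mapping);
                        return VM_FAULT_OOM;
                }
                /* ... bring the page uptodate if needed ... */
                filemap_invalidate_unlock_shared(mapping);
                vmf->page = page;       /* returned locked */
                return VM_FAULT_LOCKED;
        }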

->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from
"start_pgoff" till "end_pgoff". ->map_pages() is called with the page table
locked and must not block.  If it's not possible to reach a page without
blocking, the filesystem should skip it. The filesystem should use do_set_pte()
to set up the page table entry. A pointer to the entry associated with the page
is passed in the "pte" field in the vm_fault structure. Pointers to entries for
other offsets should be calculated relative to "pte".

->page_mkwrite() is called when a previously read-only pte is about to become
writeable. The filesystem again must ensure that there are no
truncate/invalidate races or races with operations such as ->remap_file_range
or ->copy_file_range, and then return with the page locked. Usually
mapping->invalidate_lock is suitable for proper serialization. If the page has
been truncated, the filesystem should not look up a new page like the ->fault()
handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
retry the fault.

->pfn_mkwrite() is the same as page_mkwrite but when the pte is
VM_PFNMAP or VM_MIXEDMAP with a page-less entry. The expected return is
VM_FAULT_NOPAGE or one of the VM_FAULT_ERROR types. The default behavior
after this call is to make the pte read-write, unless pfn_mkwrite returns
an error.

->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace.  This function is needed only for
VM_IO | VM_PFNMAP VMAs.

--------------------------------------------------------------------------------

                        Dubious stuff

(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)