Shared Subtrees
---------------

Contents:
        1) Overview
        2) Features
        3) Setting mount states
        4) Use cases
        5) Detailed semantics
        6) Quiz
        7) FAQ
        8) Implementation


1) Overview
-----------

Consider the following situation:

A process wants to clone its own namespace, but still wants to access the CD
that got mounted recently.  Shared subtree semantics provide the necessary
mechanism to accomplish the above.

They also provide the necessary building blocks for features like
per-user namespaces and versioned filesystems.

2) Features
-----------

Shared subtrees provide four different flavors of mounts (struct vfsmount, to
be precise):

        a. shared mount
        b. slave mount
        c. private mount
        d. unbindable mount


2a) A shared mount can be replicated to as many mountpoints as desired, and
all the replicas continue to be exactly the same.

        Here is an example:

        Let's say /mnt has a mount that is shared.
        # mount --make-shared /mnt

        Note: the mount(8) command now supports the --make-shared flag,
        so the sample 'smount' program is no longer needed and has been
        removed.

        # mount --bind /mnt /tmp
        The above command replicates the mount at /mnt to the mountpoint /tmp
        and the contents of both the mounts remain identical.

        # ls /mnt
        a b c

        # ls /tmp
        a b c

        Now let's say we mount a device at /tmp/a
        # mount /dev/sd0  /tmp/a

        # ls /tmp/a
        t1 t2 t3

        # ls /mnt/a
        t1 t2 t3

        Note that the mount has propagated to the mount at /mnt as well.

        And the same is true even when /dev/sd0 is mounted on /mnt/a. The
        contents will be visible under /tmp/a too.


2b) A slave mount is like a shared mount except that mount and umount events
        only propagate towards it.

        All slave mounts have a master mount which is shared.

        Here is an example:

        Let's say /mnt has a mount which is shared.
        # mount --make-shared /mnt

        Let's bind mount /mnt to /tmp
        # mount --bind /mnt /tmp

        The new mount at /tmp becomes a shared mount and it is a replica of
        the mount at /mnt.

        Now let's make the mount at /tmp a slave of /mnt
        # mount --make-slave /tmp

        Let's mount /dev/sd0 on /mnt/a
        # mount /dev/sd0 /mnt/a

        # ls /mnt/a
        t1 t2 t3

        # ls /tmp/a
        t1 t2 t3

        Note the mount event has propagated to the mount at /tmp

        However let's see what happens if we mount something on the mount at /tmp

        # mount /dev/sd1 /tmp/b

        # ls /tmp/b
        s1 s2 s3

        # ls /mnt/b

        Note how the mount event has not propagated to the mount at
        /mnt


2c) A private mount does not forward or receive propagation.

        This is the mount we are familiar with. It's the default type.


2d) An unbindable mount is a private mount that cannot be bind mounted

        Let's say we have a mount at /mnt and we make it unbindable

        # mount --make-unbindable /mnt

        Let's try to bind mount this mount somewhere else.
        # mount --bind /mnt /tmp
        mount: wrong fs type, bad option, bad superblock on /mnt,
               or too many mounted file systems

        Binding an unbindable mount is an invalid operation.


3) Setting mount states

        The mount command (util-linux package) can be used to set mount
        states:

        mount --make-shared mountpoint
        mount --make-slave mountpoint
        mount --make-private mountpoint
        mount --make-unbindable mountpoint


4) Use cases
------------

        A) A process wants to clone its own namespace, but still wants to
           access the CD that got mounted recently.

           Solution:

                The system administrator can make the mount at /cdrom shared
                mount --bind /cdrom /cdrom
                mount --make-shared /cdrom

                Now any process that clones off a new namespace will have a
                mount at /cdrom which is a replica of the same mount in the
                parent namespace.

                So when a CD is inserted and mounted at /cdrom that mount gets
                propagated to the other mount at /cdrom in all the other clone
                namespaces.

        B) A process wants its mounts invisible to any other process, but
           still be able to see the other system mounts.

           Solution:

                To begin with, the administrator can mark the entire mount tree
                as shareable.

                mount --make-rshared /

                A new process can clone off a new namespace and mark some part
                of its namespace as slave

                mount --make-rslave /myprivatetree

                Henceforth any mounts within the /myprivatetree done by the
                process will not show up in any other namespace. However mounts
                done in the parent namespace under /myprivatetree still show
                up in the process's namespace.


        Apart from the above semantics this feature provides the
        building blocks to solve the following problems:

        C)  Per-user namespace

                The above semantics allow a way to share mounts across
                namespaces.  But namespaces are associated with processes. If
                namespaces are made first class objects with a user API to
                associate/disassociate a namespace with a userid, then each
                user could have his/her own namespace and tailor it to his/her
                requirements. Of course, this needs support from PAM.

        D)  Versioned files

                If the entire mount tree is visible at multiple locations, then
                an underlying versioning file system can return a different
                version of the file depending on the path used to access that
                file.

                An example is:

                mount --make-shared /
                mount --rbind / /view/v1
                mount --rbind / /view/v2
                mount --rbind / /view/v3
                mount --rbind / /view/v4

                and if /usr has a versioning filesystem mounted, then that
                mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
                /view/v4/usr too

                A user can request the v3 version of the file
                /usr/fs/namespace.c by accessing /view/v3/usr/fs/namespace.c .
                The underlying versioning filesystem can then decipher that
                the v3 version of the filesystem is being requested and return
                the corresponding inode.

5) Detailed semantics:
-------------------
        The section below explains the detailed semantics of
        bind, rbind, move, mount, umount and clone-namespace operations.

        Note: the word 'vfsmount' and the noun 'mount' have been used
        to mean the same thing, throughout this document.

5a) Mount states

        A given mount can be in one of the following states
        1) shared
        2) slave
        3) shared and slave
        4) private
        5) unbindable

        A 'propagation event' is defined as an event generated on a vfsmount
        that leads to mount or unmount actions in other vfsmounts.

        A 'peer group' is defined as a group of vfsmounts that propagate
        events to each other.

        (1) Shared mounts

                A 'shared mount' is defined as a vfsmount that belongs to a
                'peer group'.

                For example:
                        mount --make-shared /mnt
                        mount --bind /mnt /tmp

                The mount at /mnt and that at /tmp are both shared and belong
                to the same peer group. Anything mounted or unmounted under
                /mnt or /tmp is reflected in all the other mounts of its peer
                group.


        (2) Slave mounts

                A 'slave mount' is defined as a vfsmount that receives
                propagation events and does not forward propagation events.

                A slave mount, as the name implies, has a master mount from
                which mount/unmount events are received. Events do not
                propagate from the slave mount to the master.  Only a shared
                mount can be made a slave by executing the following command

                        mount --make-slave mount

                A shared mount that is made a slave is no longer shared unless
                modified to become shared.

        (3) Shared and Slave

                A vfsmount can be both shared as well as slave.  This state
                indicates that the mount is a slave of some vfsmount, and
                has its own peer group too.  This vfsmount receives propagation
                events from its master vfsmount, and also forwards propagation
                events to its 'peer group' and to its slave vfsmounts.

                Strictly speaking, the vfsmount is shared having its own
                peer group, and this peer-group is a slave of some other
                peer group.

                Only a slave vfsmount can be made 'shared and slave', either
                by executing the following command
                        mount --make-shared mount
                or by moving the slave vfsmount under a shared vfsmount.

        (4) Private mount

                A 'private mount' is defined as a vfsmount that does not
                receive or forward any propagation events.

        (5) Unbindable mount

                An 'unbindable mount' is defined as a vfsmount that does not
                receive or forward any propagation events and cannot
                be bind mounted.
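
        On a running system these states can be read back from the optional
        fields of /proc/self/mountinfo ('shared:N', 'master:N' and
        'unbindable'). The following is only a sketch, not part of the kernel
        sources: prop_state is a hypothetical helper that maps one mountinfo
        line to the five state names used above.

```shell
#!/bin/sh
# prop_state LINE
# Hypothetical helper: classify one /proc/self/mountinfo line into the
# state names used in this document. The body runs in a subshell so that
# 'set -f' and the positional parameters do not leak into the caller.
prop_state() (
    set -f                 # no filename expansion while splitting the line
    set -- $1              # split the line into whitespace-separated fields
    [ $# -ge 6 ] || exit 1 # a mountinfo line has at least six fixed fields
    shift 6                # skip: id, parent id, dev, root, mountpoint, options
    shared=0 slave=0 unbindable=0
    # Optional fields run up to the lone "-" separator.
    while [ $# -gt 0 ] && [ "$1" != "-" ]; do
        case "$1" in
            shared:*)   shared=1 ;;
            master:*)   slave=1 ;;
            unbindable) unbindable=1 ;;
        esac
        shift
    done
    if [ "$unbindable" -eq 1 ]; then
        echo "unbindable"
    elif [ "$shared" -eq 1 ] && [ "$slave" -eq 1 ]; then
        echo "shared and slave"
    elif [ "$shared" -eq 1 ]; then
        echo "shared"
    elif [ "$slave" -eq 1 ]; then
        echo "slave"
    else
        echo "private"
    fi
)
```

        Feeding it every line of /proc/self/mountinfo in a 'while read' loop
        prints one state per mount. Where the installed util-linux provides
        it, 'findmnt -o TARGET,PROPAGATION' reports the same information.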


        State diagram:
        The state diagram below explains the state transition of a mount,
        in response to various commands.
        ------------------------------------------------------------------------
        |             |make-shared |  make-slave  | make-private |make-unbindab|
        |-------------|------------|--------------|--------------|-------------|
        |shared       |shared      |*slave/private|   private    | unbindable  |
        |             |            |              |              |             |
        |-------------|------------|--------------|--------------|-------------|
        |slave        |shared      |    **slave   |    private   | unbindable  |
        |             |and slave   |              |              |             |
        |-------------|------------|--------------|--------------|-------------|
        |shared       |shared      |    slave     |    private   | unbindable  |
        |and slave    |and slave   |              |              |             |
        |-------------|------------|--------------|--------------|-------------|
        |private      |shared      |  **private   |    private   | unbindable  |
        |-------------|------------|--------------|--------------|-------------|
        |unbindable   |shared      |**unbindable  |    private   | unbindable  |
        ------------------------------------------------------------------------

        * If the shared mount is the only mount in its peer group, making it
        slave makes it private automatically. Note that there is no master to
        which it can be slaved.

        ** Slaving a non-shared mount has no effect on the mount.
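
        The state diagram can be written down as a small lookup function.
        This is an illustrative sketch only: next_state and its optional
        PEERS argument (the number of mounts in the peer group, assumed to
        be 2 when omitted) are hypothetical names, not kernel or mount(8)
        interfaces.

```shell
#!/bin/sh
# next_state STATE COMMAND [PEERS]
# Hypothetical helper: print the state a mount ends up in, following the
# state diagram. PEERS is the size of the mount's peer group; it only
# matters for "shared" + "make-slave" (footnote *).
next_state() {
    state=$1 cmd=$2 peers=${3:-2}
    case "$cmd" in
        make-private)    echo "private" ;;
        make-unbindable) echo "unbindable" ;;
        make-shared)
            case "$state" in
                slave|"shared and slave") echo "shared and slave" ;;
                *)                        echo "shared" ;;
            esac ;;
        make-slave)
            case "$state" in
                shared)
                    # Alone in its peer group there is no master to slave to.
                    if [ "$peers" -gt 1 ]; then
                        echo "slave"
                    else
                        echo "private"
                    fi ;;
                "shared and slave") echo "slave" ;;
                *)                  echo "$state" ;;  # footnote **: no effect
            esac ;;
        *)
            echo "unknown command: $cmd" >&2
            return 1 ;;
    esac
}
```

        For instance, 'next_state shared make-slave 1' prints "private",
        while 'next_state slave make-shared' prints "shared and slave".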

        Apart from the commands listed above, the 'move' operation also changes
        the state of a mount depending on the type of the destination mount.
        It's explained in section 5d.

5b) Bind semantics

        Consider the following command

        mount --bind A/a  B/b

        where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
        is the destination mount and 'b' is the dentry in the destination mount.

        The outcome depends on the type of mount of 'A' and 'B'. The table
        below contains a quick reference.
   ---------------------------------------------------------------------------
   |         BIND MOUNT OPERATION                                            |
   |**************************************************************************
   |source(A)->| shared       |       private  |       slave    | unbindable |
   | dest(B)  |               |                |                |            |
   |   |      |               |                |                |            |
   |   v      |               |                |                |            |
   |**************************************************************************
   |  shared  | shared        |     shared     | shared & slave |  invalid   |
   |          |               |                |                |            |
   |non-shared| shared        |      private   |      slave     |  invalid   |
   ***************************************************************************

        Details:

        1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
        which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
        mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
        are created and mounted at the dentry 'b' on all mounts where 'B'
        propagates to. A new propagation tree containing 'C1',..,'Cn' is
        created. This propagation tree is identical to the propagation tree of
        'B'.  And finally the peer-group of 'C' is merged with the peer group
        of 'A'.

        2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
        which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
        mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
        are created and mounted at the dentry 'b' on all mounts where 'B'
        propagates to. A new propagation tree is set containing all new mounts
        'C', 'C1', .., 'Cn' with exactly the same configuration as the
        propagation tree for 'B'.

        3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
        mount 'C' which is a clone of 'A', is created. Its root dentry is 'a'.
        'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
        'C3' ... are created and mounted at the dentry 'b' on all mounts where
        'B' propagates to. A new propagation tree containing the new mounts
        'C','C1',..  'Cn' is created. This propagation tree is identical to the
        propagation tree for 'B'. And finally the mount 'C' and its peer group
        is made the slave of mount 'Z'.  In other words, mount 'C' is in the
        state 'shared and slave'.

        4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
        invalid operation.

        5. 'A' is a private mount and 'B' is a non-shared (private or slave or
        unbindable) mount. A new mount 'C' which is a clone of 'A', is created.
        Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.

        6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
        which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
        mounted on mount 'B' at dentry 'b'.  'C' is made a member of the
        peer-group of 'A'.

        7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
        new mount 'C' which is a clone of 'A' is created. Its root dentry is
        'a'.  'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
        slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
        'Z'.  All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
        mount/unmount on 'A' do not propagate anywhere else. Similarly
        mount/unmount on 'C' do not propagate anywhere else.

        8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
        invalid operation. An unbindable mount cannot be bind mounted.
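
        The eight cases reduce to a lookup on the source and destination
        types. A minimal sketch follows, assuming a hypothetical helper
        named bind_result; it only restates the quick-reference table for
        the new mount 'C' and does not model the propagation trees.

```shell
#!/bin/sh
# bind_result SRC DEST
# Hypothetical helper: print the state of the new mount 'C' created by
# "mount --bind A/a B/b". SRC is the state of 'A' (shared, private, slave
# or unbindable); DEST is "shared" or anything else for non-shared.
bind_result() {
    src=$1 dest=$2
    case "$src:$dest" in
        unbindable:*)   echo "invalid"; return 1 ;;   # cases 4 and 8
        shared:*)       echo "shared" ;;              # cases 1 and 6
        private:shared) echo "shared" ;;              # case 2
        private:*)      echo "private" ;;             # case 5
        slave:shared)   echo "shared and slave" ;;    # case 3
        slave:*)        echo "slave" ;;               # case 7
        *)
            echo "unknown source state: $src" >&2
            return 2 ;;
    esac
}
```

        For example, 'bind_result slave shared' prints "shared and slave",
        and 'bind_result unbindable private' prints "invalid" and returns a
        non-zero status.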

5c) Rbind semantics

        Rbind is the same as bind, with one difference: bind replicates only
        the specified mount, whereas rbind replicates all the mounts in the
        tree belonging to the specified mount. An rbind mount is a bind mount
        applied to all the mounts in the tree.

        If the source tree that is rbound has some unbindable mounts,
        then the subtree under the unbindable mount is pruned in the new
        location.

        eg: let's say we have the following mount tree.

                A
              /   \
              B   C
             / \ / \
             D E F G

             Let's say all the mounts in the tree except the mount C are
             of a type other than unbindable.

             If this tree is rbound to say Z

             We will have the following tree at the new location.

                Z
                |
                A'
               /
              B'                Note how the tree under C is pruned
             / \                in the new location.
            D' E'
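
             The pruning above can be sketched as a filter over mount paths.
             This is a toy model, not how the kernel walks the tree:
             rbind_prune is a hypothetical helper and the path spelling
             (A, A/B, A/C, ...) is only an assumed encoding of the diagram.

```shell
#!/bin/sh
# rbind_prune UNBINDABLE
# Hypothetical helper: read mount paths from stdin, one per line, and
# print those that would be replicated by an rbind when UNBINDABLE is the
# only unbindable mount. The unbindable mount and everything below it are
# dropped.
rbind_prune() {
    unb=$1
    while IFS= read -r m; do
        case "$m" in
            "$unb"|"$unb"/*) ;;                 # pruned subtree
            *) printf '%s\n' "$m" ;;
        esac
    done
}

# The tree from the example, with C unbindable.
printf '%s\n' A A/B A/C A/B/D A/B/E A/C/F A/C/G | rbind_prune A/C
```

             The final line prints A, A/B, A/B/D and A/B/E, which matches
             the A', B', D', E' tree shown at Z.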



5d) Move semantics

        Consider the following command

        mount --move A  B/b

        where 'A' is the source mount, 'B' is the destination mount and 'b' is
        the dentry in the destination mount.

        The outcome depends on the type of the mount of 'A' and 'B'. The table
        below is a quick reference.
   ---------------------------------------------------------------------------
   |                    MOVE MOUNT OPERATION                                 |
   |**************************************************************************
   | source(A)->| shared      |       private  |       slave    | unbindable |
   | dest(B)  |               |                |                |            |
   |   |      |               |                |                |            |
   |   v      |               |                |                |            |
   |**************************************************************************
   |  shared  | shared        |     shared     |shared and slave|  invalid   |
   |          |               |                |                |            |
   |non-shared| shared        |      private   |    slave       | unbindable |
   ***************************************************************************
        NOTE: moving a mount residing under a shared mount is invalid.

        Details follow:

        1. 'A' is a shared mount and 'B' is a shared mount.  The mount 'A' is
        mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1', 'A2'...'An'
        are created and mounted at dentry 'b' on all mounts that receive
        propagation from mount 'B'. A new propagation tree is created in the
        exact same configuration as that of 'B'. This new propagation tree
        contains all the new mounts 'A1', 'A2'...  'An'.  And this new
        propagation tree is appended to the already existing propagation tree
        of 'A'.

        2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
        mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
        are created and mounted at dentry 'b' on all mounts that receive
        propagation from mount 'B'. The mount 'A' becomes a shared mount and a
        propagation tree is created which is identical to that of
        'B'. This new propagation tree contains all the new mounts 'A1',
        'A2'...  'An'.

        3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount.  The
        mount 'A' is mounted on mount 'B' at dentry 'b'.  Also new mounts 'A1',
        'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
        receive propagation from mount 'B'. A new propagation tree is created
        in the exact same configuration as that of 'B'. This new propagation
        tree contains all the new mounts 'A1', 'A2'...  'An'.  And this new
        propagation tree is appended to the already existing propagation tree of
        'A'.  Mount 'A' continues to be the slave mount of 'Z' but it also
        becomes 'shared'.

        4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
        is invalid, because mounting anything on the shared mount 'B' can
        create new mounts that get mounted on the mounts that receive
        propagation from 'B'.  And since the mount 'A' is unbindable, cloning
        it to mount at other mountpoints is not possible.

        5. 'A' is a private mount and 'B' is a non-shared (private or slave or
        unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.

        6. 'A' is a shared mount and 'B' is a non-shared mount.  The mount 'A'
        is mounted on mount 'B' at dentry 'b'.  Mount 'A' continues to be a
        shared mount.

        7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
        The mount 'A' is mounted on mount 'B' at dentry 'b'.  Mount 'A'
        continues to be a slave mount of mount 'Z'.

        8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
        'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
        unbindable mount.
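
        The move table differs from the bind table in two places: an
        unbindable source may be moved onto a non-shared destination, and
        any move is refused when the source currently sits under a shared
        mount. A minimal sketch under those rules follows; move_result and
        its optional PARENT argument (assumed non-shared when omitted) are
        hypothetical names used only for illustration.

```shell
#!/bin/sh
# move_result SRC DEST [PARENT]
# Hypothetical helper: print the state of mount 'A' after
# "mount --move A B/b". SRC is the state of 'A', DEST is "shared" or
# anything else for non-shared, PARENT is the state of the mount that 'A'
# currently resides under.
move_result() {
    src=$1 dest=$2 parent=${3:-non-shared}
    # NOTE under the table: moving a mount residing under a shared mount
    # is invalid.
    if [ "$parent" = shared ]; then
        echo "invalid"
        return 1
    fi
    case "$src:$dest" in
        unbindable:shared) echo "invalid"; return 1 ;;  # case 4
        unbindable:*)      echo "unbindable" ;;         # case 8
        shared:*)          echo "shared" ;;             # cases 1 and 6
        private:shared)    echo "shared" ;;             # case 2
        private:*)         echo "private" ;;            # case 5
        slave:shared)      echo "shared and slave" ;;   # case 3
        slave:*)           echo "slave" ;;              # case 7
        *)
            echo "unknown source state: $src" >&2
            return 2 ;;
    esac
}
```

        For example, 'move_result unbindable private' prints "unbindable",
        whereas 'move_result private private shared' prints "invalid".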

5e) Mount semantics

        Consider the following command

        mount device  B/b

        'B' is the destination mount and 'b' is the dentry in the destination
        mount.

        The above operation is the same as bind operation with the exception
        that the source mount is always a private mount.


5f) Unmount semantics

        Consider the following command

        umount A

        where 'A' is a mount mounted on mount 'B' at dentry 'b'.

        If mount 'B' is shared, then all most-recently-mounted mounts at dentry
        'b' on mounts that receive propagation from mount 'B' and do not have
        sub-mounts within them are unmounted.

        Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
        each other.

        Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
        'B1', 'B2' and 'B3' respectively.

        Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
        mount 'B1', 'B2' and 'B3' respectively.

        If 'C1' is unmounted, all the mounts that are most-recently-mounted on
        'B1' and on the mounts that 'B1' propagates-to are unmounted.

        'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
        on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.

        So all of 'C1', 'C2' and 'C3' should be unmounted.

        If any of 'C2' or 'C3' has some child mounts, then that mount is not
        unmounted, but all other mounts are unmounted. However if 'C1' is told
        to be unmounted and 'C1' has some sub-mounts, the umount operation
        fails entirely.
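
        The rule in the last paragraph can be sketched as follows. This is
        a toy model for illustration only: umount_plan is a hypothetical
        helper, and the NAME:N argument format (N being the number of
        sub-mounts) is an assumption made for this sketch.

```shell
#!/bin/sh
# umount_plan TARGET:N [PEER:N ...]
# Hypothetical helper: TARGET is the mount named on the umount command
# line; each PEER is the most-recently-mounted mount at the same dentry on
# a mount that receives propagation. N is the number of sub-mounts.
# Prints the names of the mounts that end up unmounted.
umount_plan() {
    target=$1
    shift
    # A target with sub-mounts makes the whole operation fail.
    if [ "${target#*:}" -gt 0 ]; then
        echo "refused: ${target%%:*} has sub-mounts" >&2
        return 1
    fi
    out=${target%%:*}
    for peer in "$@"; do
        # A propagated mount with sub-mounts is skipped, not an error.
        if [ "${peer#*:}" -eq 0 ]; then
            out="$out ${peer%%:*}"
        fi
    done
    echo "$out"
}
```

        With the example above, 'umount_plan C1:0 C2:0 C3:0' prints
        "C1 C2 C3". If 'C2' had two child mounts, 'umount_plan C1:0 C2:2
        C3:0' would print "C1 C3", and 'umount_plan C1:1 C2:0 C3:0' would
        fail.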

5g) Clone Namespace

        A cloned namespace contains the same mounts as the parent
        namespace.

        Let's say 'A' and 'B' are the corresponding mounts in the parent and the
        child namespace.

        If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
        each other.

        If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
        'Z'.

        If 'A' is a private mount, then 'B' is a private mount too.

        If 'A' is an unbindable mount, then 'B' is an unbindable mount too.


6) Quiz

        A. What is the result of the following command sequence?

                mount --bind /mnt /mnt
                mount --make-shared /mnt
                mount --bind /mnt /tmp
                mount --move /tmp /mnt/1

                What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
                Should they all be identical, or should only /mnt and /mnt/1
                be identical?


        B. What is the result of the following command sequence?

                mount --make-rshared /
                mkdir -p /v/1
                mount --rbind / /v/1

                What should the content of /v/1/v/1 be?


        C. What is the result of the following command sequence?

                mount --bind /mnt /mnt
                mount --make-shared /mnt
                mkdir -p /mnt/1/2/3 /mnt/1/test
                mount --bind /mnt/1 /tmp
                mount --make-slave /mnt
                mount --make-shared /mnt
                mount --bind /mnt/1/2 /tmp1
                mount --make-slave /mnt

                At this point we have the first mount at /tmp and
                its root dentry is 1. Let's call this mount 'A'.
                And then we have a second mount at /tmp1 with root
                dentry 2. Let's call this mount 'B'.
                Next we have a third mount at /mnt with root dentry
                mnt. Let's call this mount 'C'.

                'B' is the slave of 'A' and 'C' is a slave of 'B'
                A -> B -> C

                At this point if we execute the following command

                mount --bind /bin /tmp/test

                The mount is attempted on 'A'.

                Will the mount propagate to 'B' and 'C'?

                What would the contents of /mnt/1/test be?

7) FAQ

        Q1. Why is bind mount needed? How is it different from symbolic links?
                Symbolic links can get stale if the destination mount gets
                unmounted or moved. Bind mounts continue to exist even if the
                other mount is unmounted or moved.

        Q2. Why can't the shared subtree be implemented using exportfs?

                exportfs is a heavyweight way of accomplishing part of what
                shared subtree can do. I cannot imagine a way to implement the
                semantics of slave mount using exportfs.

        Q3. Why is unbindable mount needed?

                Let's say we want to replicate the mount tree at multiple
                locations within the same subtree.

                If one rbind mounts a tree within the same subtree 'n' times
                the number of mounts created grows at least exponentially
                with 'n'. Having unbindable mount can help prune the unneeded
                bind mounts. Here is an example.

                step 1:
                   Let's say the root tree has just two directories with
                   one vfsmount.
                                    root
                                   /    \
                                  tmp    usr

                    And we want to replicate the tree at multiple
                    mountpoints under /root/tmp

                step 2:
                      mount --make-shared /root

                      mkdir -p /tmp/m1

                      mount --rbind /root /tmp/m1

                      the new tree now looks like this:

                                    root
                                   /    \
                                 tmp    usr
                                /
                               m1
                              /  \
                             tmp  usr
                             /
                            m1

                          it has two vfsmounts

                step 3:
                            mkdir -p /tmp/m2
                            mount --rbind /root /tmp/m2

                        the new tree now looks like this:

                                      root
                                     /    \
                                   tmp     usr
                                  /    \
                                m1       m2
                               / \       /  \
                             tmp  usr   tmp  usr
                             / \          /
                            m1  m2      m1
                                / \     /  \
                              tmp usr  tmp   usr
                              /        / \
                             m1       m1  m2
                            /  \
                          tmp   usr
                          /  \
                         m1   m2

                       it has 6 vfsmounts

                step 4:
                          mkdir -p /tmp/m3
                          mount --rbind /root /tmp/m3

                          I won't draw the tree, but it has 24 vfsmounts


                At step i the number of vfsmounts is V[i] = i*V[i-1].
                With V[1] = 1 this is the factorial of i, which grows even
                faster than an exponential function. And this tree has way
                more mounts than what we really needed in the first place.
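
                The recurrence V[i] = i*V[i-1] can be checked with a few
                lines of shell. This is a small sketch; vfsmount_count is a
                hypothetical helper name, and V[1] = 1 is taken from step 1
                above.

```shell
#!/bin/sh
# vfsmount_count N
# Hypothetical helper: print V[N], the number of vfsmounts after step N,
# using V[1] = 1 and V[i] = i * V[i-1].
vfsmount_count() {
    n=$1 v=1 i=2
    while [ "$i" -le "$n" ]; do
        v=$((i * v))
        i=$((i + 1))
    done
    echo "$v"
}

# Counts for steps 1 to 4, one per line.
for step in 1 2 3 4; do
    vfsmount_count "$step"
done
```

                The loop prints 1, 2, 6 and 24, matching the counts quoted
                in the steps above.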
 736
 737                One could use a series of umount at each step to prune
 738                out the unneeded mounts. But there is a better solution.
 739                Unclonable mounts come in handy here.
 740
 741                step 1:
 742                   let's say the root tree has just two directories with
 743                   one vfsmount.
 744                                    root
 745                                   /    \
 746                                  tmp    usr
 747
 748                    How do we set up the same tree at multiple locations under
 749                    /root/tmp
 750
 751                step2:
 752                      mount --bind /root/tmp /root/tmp
 753
 754                      mount --make-rshared /root
 755                      mount --make-unbindable /root/tmp
 756
 757                      mkdir -p /tmp/m1
 758
 759                      mount --rbind /root /tmp/m1
 760
 761                      the new tree now looks like this:
 762
 763                                    root
 764                                   /    \
 765                                 tmp    usr
 766                                /
 767                               m1
 768                              /  \
 769                             tmp  usr
 770
 771                step3:
 772                            mkdir -p /tmp/m2
 773                            mount --rbind /root /tmp/m2
 774
 775                      the new tree now looks like this:
 776
 777                                    root
 778                                   /    \
 779                                 tmp    usr
 780                                /   \
 781                               m1     m2
 782                              /  \     / \
 783                             tmp  usr tmp usr
 784
 785                step4:
 786
 787                            mkdir -p /tmp/m3
 788                            mount --rbind /root /tmp/m3
 789
 790                      the new tree now looks like this:
 791
 792                                          root
 793                                      /           \
 794                                     tmp           usr
 795                                 /    \    \
 796                               m1     m2     m3
 797                              /  \     / \    /  \
 798                             tmp  usr tmp usr tmp usr
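
The effect of the unbindable mount in the steps above can be illustrated with
a toy model. This is a sketch under loud assumptions: the Mount class,
rbind_copy() and build() are hypothetical names, each mount is reduced to a
name plus a dictionary of child mounts, and peer propagation is not modelled
at all. It only shows that a recursive bind which skips unbindable mounts
adds a single new mount per step.

```python
class Mount:
    """A toy vfsmount: a name, an unbindable flag, and child mounts."""

    def __init__(self, name, unbindable=False):
        self.name = name
        self.unbindable = unbindable
        self.children = {}  # mountpoint name -> Mount mounted there

    def count(self):
        """Number of mounts in the tree rooted at this mount."""
        return 1 + sum(child.count() for child in self.children.values())


def rbind_copy(mount):
    """Recursively clone a mount tree, skipping unbindable mounts."""
    clone = Mount(mount.name, mount.unbindable)
    for point, child in mount.children.items():
        if child.unbindable:
            continue  # neither this mount nor anything below it is cloned
        clone.children[point] = rbind_copy(child)
    return clone


def build(rbinds, unbindable=True):
    """Model: bind /root/tmp on itself, then rbind /root to /tmp/m1..m<rbinds>."""
    root = Mount("root")
    tmp = Mount("tmp", unbindable=unbindable)
    root.children["tmp"] = tmp
    for i in range(1, rbinds + 1):
        # The copy is taken before it is attached, as with mount --rbind.
        tmp.children[f"m{i}"] = rbind_copy(root)
    return root


if __name__ == "__main__":
    for n in range(4):
        print(n, build(n).count(), build(n, unbindable=False).count())
```

With the unbindable flag the model holds rbinds + 2 mounts. Without it the
model doubles at every step. That doubling is an artifact of this simplified
model and is not the count given in the text, because propagation between
peers is left out.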
 799
8) Implementation

8A) Datastructure
 803
        Four new fields are introduced to struct vfsmount:
        ->mnt_share
        ->mnt_slave_list
        ->mnt_slave
        ->mnt_master

        ->mnt_share links together all the mounts to/from which this vfsmount
                sends/receives propagation events.

        ->mnt_slave_list links all the mounts to which this vfsmount
                propagates.

        ->mnt_slave links together all the slaves that its master vfsmount
                propagates to.

        ->mnt_master points to the master vfsmount from which this vfsmount
                receives propagation.

        ->mnt_flags takes two more flags to indicate the propagation status of
                the vfsmount.  MNT_SHARE indicates that the vfsmount is a shared
                vfsmount.  MNT_UNCLONABLE indicates that the vfsmount cannot be
                replicated.
 826
 827        All the shared vfsmounts in a peer group form a cyclic list through
 828        ->mnt_share.
 829
        All vfsmounts with the same ->mnt_master form a cyclic list anchored
        in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
 832
 833         ->mnt_master can point to arbitrary (and possibly different) members
 834         of master peer group.  To find all immediate slaves of a peer group
 835         you need to go through _all_ ->mnt_slave_list of its members.
 836         Conceptually it's just a single set - distribution among the
 837         individual lists does not affect propagation or the way propagation
 838         tree is modified by operations.
 839
 840        All vfsmounts in a peer group have the same ->mnt_master.  If it is
 841        non-NULL, they form a contiguous (ordered) segment of slave list.
 842
        An example propagation tree is shown in the figure below.
        [ NOTE: Though it looks like a forest, if we consider all the shared
        mounts as a conceptual entity called 'pnode', it becomes a tree. ]
 846
 847
 848                        A <--> B <--> C <---> D
 849                       /|\            /|      |\
 850                      / F G          J K      H I
 851                     /
 852                    E<-->K
 853                        /|\
 854                       M L N
 855
        In the above figure A, B, C and D are all shared and propagate to each
        other.  'A' has 3 slave mounts 'E', 'F' and 'G'; 'C' has 2 slave
        mounts 'J' and 'K'; and 'D' has 2 slave mounts 'H' and 'I'.
        'E' is also shared with 'K' and they propagate to each other.  And
        'K' has 3 slaves 'M', 'L' and 'N'.
 861
 862        A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
 863
 864        A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
 865
 866        E's ->mnt_share links with ->mnt_share of K
 867        'E', 'K', 'F', 'G' have their ->mnt_master point to struct
 868                                vfsmount of 'A'
 869        'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
 870        K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
 871
 872        C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
 873        J and K's ->mnt_master points to struct vfsmount of C
 874        and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
 875        'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
 876
 877
 878        NOTE: The propagation tree is orthogonal to the mount tree.
 879
8B) Locking:
 881
 882        ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
 883        by namespace_sem (exclusive for modifications, shared for reading).
 884
 885        Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
 886        There are two exceptions: do_add_mount() and clone_mnt().
 887        The former modifies a vfsmount that has not been visible in any shared
 888        data structures yet.
 889        The latter holds namespace_sem and the only references to vfsmount
 890        are in lists that can't be traversed without namespace_sem.
 891
8C) Algorithm:
 893
        The crux of the implementation resides in the rbind/move operation.

        The overall algorithm breaks the operation into 3 phases (see
        attach_recursive_mnt() and propagate_mnt()):

        1. prepare phase.
        2. commit phase.
        3. abort phase.
 902
 903        Prepare phase:
 904
        for each mount in the source tree:
                   a) Create the necessary number of mount trees to
                      be attached to each of the mounts that receive
                      propagation from the destination mount.
                   b) Do not attach any of the trees to its destination.
                      However, note down its ->mnt_parent and ->mnt_mountpoint.
                   c) Link all the new mounts to form a propagation tree that
                      is identical to the propagation tree of the destination
                      mount.

                   If this phase is successful, there should be 'n' new
                   propagation trees, where 'n' is the number of mounts in the
                   source tree, and 'm' new mount trees, where 'm' is the
                   number of mounts to which the destination mount
                   propagates.  Go to the commit phase.

                   If any memory allocation fails, go to the abort phase.
 924
        Commit phase:
                Attach each of the mount trees to its corresponding
                destination mount.

        Abort phase:
                Delete all the newly created trees.
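
The prepare/commit/abort structure can be sketched as follows. This is a
minimal illustration of the all-or-nothing pattern, not the code in
attach_recursive_mnt() or propagate_mnt(): the function name, the
AllocationError class and the 'budget' parameter (a made-up cap that
simulates a failing memory allocation) are assumptions, and mounts are
represented by plain strings.

```python
class AllocationError(Exception):
    """Stands in for a failed memory allocation during the prepare phase."""


def attach_with_propagation(source_mounts, receivers, budget):
    """Attach a copy of the source tree to every receiver, or to none.

    source_mounts: names of the mounts in the source tree
    receivers:     dict mapping each receiving mount to its attached trees
    budget:        hypothetical limit on how many copies can be allocated
    """
    prepared = []   # (receiver, new_tree) pairs, built but not attached
    allocated = 0
    try:
        # Prepare: build one copy of the source tree per receiving mount.
        for receiver in receivers:
            tree = []
            prepared.append((receiver, tree))
            for name in source_mounts:
                if allocated >= budget:
                    raise AllocationError(name)
                tree.append(f"{name}@{receiver}")
                allocated += 1
    except AllocationError:
        # Abort: drop everything built so far; receivers were never touched.
        prepared.clear()
        return False
    # Commit: no step below can fail, so attach every prepared tree.
    for receiver, tree in prepared:
        receivers[receiver].append(tree)
    return True


if __name__ == "__main__":
    receivers = {"A": [], "B": []}
    print(attach_with_propagation(["s1", "s2"], receivers, budget=10), receivers)
    receivers = {"A": [], "B": []}
    print(attach_with_propagation(["s1", "s2"], receivers, budget=3), receivers)
```

Keeping every fallible step in the prepare phase means a failure never leaves
some receivers with the new tree attached and others without it.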
 931
 932        NOTE: all the propagation related functionality resides in the file
 933        pnode.c
 934
 935
------------------------------------------------------------------------

version 0.1  (created the initial document, Ram Pai linuxram@us.ibm.com)
version 0.2  (Incorporated comments from Al Viro)