Linux Container Development
  • * Possible bug: detached mounts difficult to cleanup
    @ 2017-01-11  1:24 Krister Johansen
      0 siblings, 0 replies; 11+ messages in thread
    From: Krister Johansen @ 2017-01-11  1:24 UTC (permalink / raw)
      To: Eric W. Biederman, Al Viro
      Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
    
    Gents,
    This is the follow-up e-mail I referenced in our discussion about the
    put_mountpoint locking problem.
    
    The problem manifested itself as a situation where our container
    provisioner would sometimes fail to restart a container to which it had
    made configuration changes.  The IP address chosen by the provisioner
    was still in use in another container.  This meant that the system had a
    network namespace with an IP address that was still in use, despite the
    provisioner having torn down the container as part of the reconfig
    operation.
    
    In order to keep the network namespace in use while the container is
    alive, the software bind mounts the net and user namespaces out of
    /proc/<pid>/ns/ into a directory that's used as the top level for the
    container instance.
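
    As an illustration only (this is not the provisioner's actual code), a
    minimal C sketch of pinning a namespace with a bind mount follows.  The
    helper names ns_proc_path() and pin_namespace() are hypothetical, and
    the sketch assumes the target already exists as a regular file and that
    the caller has CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/*
 * Format "/proc/<pid>/ns/<name>" into buf.
 * Returns 0 on success, -1 if the result did not fit in buf.
 */
int ns_proc_path(char *buf, size_t len, pid_t pid, const char *name)
{
    assert(buf != NULL && name != NULL);
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Bind mount one namespace file of `pid` (e.g. "net" or "user") onto
 * `target`.  Returns 0 on success, -1 with errno set on failure.
 */
int pin_namespace(pid_t pid, const char *name, const char *target)
{
    char src[64];

    if (ns_proc_path(src, sizeof(src), pid, name) != 0) {
        errno = ENAMETOOLONG;
        return -1;
    }
    /* MS_BIND makes the nsfs file visible at `target`. */
    return mount(src, target, NULL, MS_BIND, NULL);
}
```

    A caller would invoke pin_namespace() once per namespace type, each
    time with a separate target file under the container's top-level
    directory.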
    
    After forcing a crash dump and looking through the results, I was able
    to confirm that the only reference keeping the net namespace alive was
    the one held by the dentry on the mountpoint for the nsfs mount of the
    network namespace.  The problem was that the container software had
    unmounted this mountpoint, so it wasn't even in the host container's
    mount namespace.
    
    Since the software was using shared mounts, the nsfs bind mount was
    getting copied into the mount namespaces of any container that was
    created after the nsfs bind mount was established.  However, this isn't
    obvious because each new namespace executes a pivot_root(2), immediately
    followed by a umount2(MNT_DETACH) on the old part of the root
    filesystem that is no longer in use.  These copies of the nsfs bind
    mount weren't visible in the kernel debugger, because they'd been
    detached from the mount namespace's mount tree.
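
    For readers unfamiliar with the sequence, the pivot_root(2) plus lazy
    unmount pattern described above looks roughly like the sketch below.
    The function name enter_new_root() and its parameters are hypothetical;
    it assumes put_old_rel names an existing directory under new_root and
    that new_root is itself a mount point, as pivot_root(2) requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Switch the root to `new_root`, parking the old root at
 * <new_root>/<put_old_rel>, then lazily detach the old root.
 * Returns 0 on success, -1 with errno set on failure.
 */
int enter_new_root(const char *new_root, const char *put_old_rel)
{
    char old_root[PATH_MAX];
    int n;

    assert(new_root != NULL && put_old_rel != NULL);

    /* Path of the old root as seen after the pivot. */
    n = snprintf(old_root, sizeof(old_root), "/%s", put_old_rel);
    if (n < 0 || (size_t)n >= sizeof(old_root))
        return -1;

    if (chdir(new_root) != 0)
        return -1;
    /* glibc declares no pivot_root() wrapper, so use syscall(2). */
    if (syscall(SYS_pivot_root, ".", put_old_rel) != 0)
        return -1;
    if (chdir("/") != 0)
        return -1;

    /*
     * MNT_DETACH removes the old tree from this namespace's mount tree
     * right away; the mounts under it are only cleaned up once they are
     * no longer busy.
     */
    return umount2(old_root, MNT_DETACH);
}
```

    Any mount that was beneath the old root, including a propagated copy of
    the nsfs bind mount, leaves the visible mount tree at the umount2()
    call.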
    
    After looking at how iproute handles net namespaces, I ran a test where
    every unmount of the net nsfs bind mount was followed by a rm of that
    mountpoint.  That always resulted in the mountpoint getting freed and
    the refcount on the dentry going to zero.  It was enough for me to make
    forward progress on the other tasks at hand.  I was able to verify that
    the nsfs refcount was getting dropped, and we were going through the
    __detach_mounts() cleanup path:
    
    rm 14633 [013] 29947.047071:         probe:nsfs_evict: (ffffffff81254fb0)
                7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
                7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
                7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
                7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
                7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
                7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
                7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
                7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
                7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
                7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
                7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
                7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
                7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
                7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
                7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
                7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
                7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
                       e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
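
    The unmount-then-remove test that produced the trace above can be
    sketched in C as follows.  The function name release_ns_mount() is
    hypothetical; the test described in this message used umount and rm
    rather than this code, and unlink(2) here stands in for the unlinkat()
    call that rm issued.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Unmount the nsfs bind mount at `mountpoint`, then remove the
 * mountpoint file itself.  Returns 0 on success, -1 on failure.
 */
int release_ns_mount(const char *mountpoint)
{
    assert(mountpoint != NULL);

    if (umount2(mountpoint, 0) != 0) {
        fprintf(stderr, "umount2(%s): %s\n", mountpoint, strerror(errno));
        return -1;
    }
    /*
     * Removing the file goes through vfs_unlink(), which is where the
     * trace above shows __detach_mounts() being reached.
     */
    if (unlink(mountpoint) != 0) {
        fprintf(stderr, "unlink(%s): %s\n", mountpoint, strerror(errno));
        return -1;
    }
    return 0;
}
```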
    
    Over the holiday, I had some more time to debug this and was able to
    narrow it down to the following case.
    
    1. The mount namespace that gets a copy of the nsfs bind mount must be
    created in a different user namespace than the host container.  This
    causes MNT_LOCKED to get set on the cloned mounts.
    
    2. In the container, pivot_root(2) and then umount2(MNT_DETACH) the old
    part of the tree from pivot_root.  Ensure that the nsfs mount is beneath
    the root of this tree.
    
    3. Umount the nsfs mount in the host container.  If the mount wasn't
    locked in the other container, you'll see a kprobe on nsfs_evict trigger
    immediately.  If it was MNT_LOCKED, then you'll need to rm the
    mountpoint in the host to trigger the nsfs_evict.
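
    Step 1 of the list above could be set up roughly as in the sketch
    below.  The names REPRO_UNSHARE_FLAGS and unshare_user_and_mount_ns()
    are hypothetical, and the sketch is an assumption about one way to get
    a mount namespace owned by a different user namespace; a real container
    runtime would more likely use clone(2) and also write uid/gid maps,
    which is omitted here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Create the user namespace and the mount namespace in the same call. */
enum { REPRO_UNSHARE_FLAGS = CLONE_NEWUSER | CLONE_NEWNS };

static_assert((REPRO_UNSHARE_FLAGS & CLONE_NEWUSER) != 0,
              "the new mount namespace must be owned by a new user namespace");

/*
 * Move the calling process into a new user namespace and a new mount
 * namespace.  Returns 0 on success, -1 with errno set on failure.
 * Note: unshare(CLONE_NEWUSER) fails in a multithreaded process.
 */
int unshare_user_and_mount_ns(void)
{
    return unshare(REPRO_UNSHARE_FLAGS);
}
```

    Steps 2 and 3 then follow as written: pivot_root(2) and
    umount2(MNT_DETACH) inside the new namespace, and an unmount of the
    nsfs mount in the host container.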
    
    For an nsfs mount, it's not particularly problematic to have to rm the
    mount to clean it up, but the other mounts in the tree that are detached
    and locked are often on mountpoints that can't be easily rm'd from the
    host.  These are harder to clean up, and essentially orphaned until the
    container's mount ns goes away.
    
    It would be ideal if we could release these mounts sooner, but I'm
    unsure of the best approach here.
    
    Debugging further, I was able to see that:
    
    a) The reason the nsfs isn't considered as part of propagate_mount_unlock
    is that the 'mnt' passed to that function is the top of the mount tree
    and it appears to only be considering mounts directly related to 'mnt'.
    
    b) The change_mnt_propagation(MS_PRIVATE) at the end of the while loop
    in umount_tree() is what ends up hiding these mounts from the host
    container.  Once they're no longer slaved or shared, we never again
    consider them as candidates for unlocking.
    
    c) Also note that these detached mounts that aren't freed aren't
    charged against a container's ns->mounts limit, so it may be possible
    for a mount ns to be using more mounts than it has officially accounted
    for.
    
    I wondered if a naive solution could re-walk the list of mounts
    processed in umount_tree() and if all of the detached but locked mounts
    had a refcount that indicated they're unused, they could be unlocked and
    unmounted.  At least in the case of the containers I'm dealing with,
    the container software should be ensuring that nothing in the container
    has a reference on anything that's under the detached portion of the
    tree.  However, there's probably a better way to do this.
    
    Thoughts?
    
    -K
    

    end of thread, other threads:[~2017-01-13 23:28 UTC | newest]
    
    Thread overview: 11+ messages
         [not found] <20170111012454.GB2497@templeofstupid.com>
         [not found] ` <20170111012454.GB2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
    2017-01-11  2:04   ` Possible bug: detached mounts difficult to cleanup Eric W. Biederman
         [not found]     ` <87r34a5p3t.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
    2017-01-11  3:07       ` Krister Johansen
         [not found]         ` <20170111030753.GC2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
    2017-01-13  0:37           ` Andrei Vagin
         [not found]             ` <CANaxB-zMzS-euqR1_LvZSoEsO-Y6q=_qGNTJZCKZTL5WfFF16g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
    2017-01-13 23:28               ` Krister Johansen
    2017-01-11  2:27   ` Eric W. Biederman
         [not found] ` <87fukqwcue.fsf@xmission.com>
         [not found]   ` <87fukqwcue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
    2017-01-11  2:37     ` Eric W. Biederman
         [not found]       ` <87shoqtj7z.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
    2017-01-12  6:15         ` Krister Johansen
         [not found]           ` <20170112061539.GA2345-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
    2017-01-12  8:26             ` Eric W. Biederman
         [not found]               ` <87r348y98z.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
    2017-01-13 23:28                 ` Krister Johansen
    2017-01-11  2:51     ` Al Viro
    2017-01-11  1:24 Krister Johansen
    
