Re: Possible bug: detached mounts difficult to cleanup

Linux Container Development
 help / color / mirror / Atom feed

From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman)
To: Krister Johansen
	<kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
Subject: Re: Possible bug: detached mounts difficult to cleanup
Date: Wed, 11 Jan 2017 15:04:22 +1300	[thread overview]
Message-ID: <87r34a5p3t.fsf@xmission.com> (raw)
In-Reply-To: <20170111012454.GB2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org> (Krister Johansen's message of "Tue, 10 Jan 2017 17:24:54 -0800")

Krister Johansen <kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org> writes:

> Gents,
> This is the follow-up e-mail I referenced in our discussion about the
> put_mountpoint locking problem.
>
> The problem manifested itself as a situation where our container
> provisioner would sometimes fail to re-start a container that it had
> made configuration changes.  The IP address chosen by the provisioner
> was still in use in another container.  This meant that the system had a
> network namespace with an IP address that was still in use, despite the
> provisoner having torn down the container as part of the reconfig
> operation.
>
> In order to keep the network namespace in use while the container is
> alive, the software bind mounts the net and user namespaces out of
> /proc/<pid>/ns/ into a directory that's used as the top level for the
> container instance.
>
> After forcing a crash dump and looking through the results, I was able
> to confirm that the only reference keeping the net namespace alive was
> the one held by the dentry on the mountpoint for the nsfs mount of the
> network namespace.  The problem was that the container software had
> unmounted this mountpoint, so it wasn't even in the host container's
> mount namespace.
>
> Since the software was using shared mounts, the nsfs bind mount was
> getting copied into the mount namespaces of any container that was
> created after the nsfs bind mount was established.  However, this isn't
> obvious because each new namespace executes a pivot_root(2), followed by
> an immediate and subsequent umount2(MNT_DETACH) on the old part of the
> root filesystem that is no longer in use.  These mounts of the nsfs bind
> mount weren't visibile in the kernel debugger, because they'd been
> detached from the mount namespace's mount tree.
>
> After looking at how iproute handles net namespaces, I ran a test where
> every unmount of the net nsfs bind mount was followed by a rm of that
> mountpoint.  That always resulted in the mountpoint getting freed and
> the refcount on the dentry going to zero.  It wsa enough for to make
> forward progress on the other tasks at hand.  I was able to verify that
> the nsfs refcount was getting dropped, and we were going through the
> __detach_mounts() cleanup path:
>
> rm 14633 [013] 29947.047071:         probe:nsfs_evict: (ffffffff81254fb0)
>             7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
>             7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
>             7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
>             7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
>             7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
>             7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
>             7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
>             7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
>             7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
>             7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
>             7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
>             7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
>             7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
>             7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
>             7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
>             7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
>             7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
>                    e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
>
> Over the holiday, I had some more time to debug this and was able to
> narrow it down to the following case.
>
> 1. The mount namespace that gets a copy of the nsfs bind mount must be
> created in a different user namespace than the host container.  This
> causes MNT_LOCKED to get set on the cloned mounts.
>
> 2. In the container, pivot_root(2) and then umount2(MNT_DETACH) the old
> part of the tree from pivot_root.  Ensure that the nsfs mount is beneath
> the root of this tree.
>
> 3. Umount the nsfs mount in the host container.  If the mount wasn't
> locked in the other container, you'll see a kprobe on nsfs_evict trigger
> immediately.  If it was MNT_LOCKED, then you'll need to rm the
> mountpoint in the host to trigger the nsfs_evict.
>
> For a nsfs mount, it's not particularly problematic to have to rm the
> mount to clean it up, but the other mounts in the tree that are detached
> and locked are often on mountpoints that can't be easily rm'd from the
> host.  These are harder to clean up, and essentially orphaned until the
> container's mount ns goes away.
>
> It would be ideal if we could release these mounts sooner, but I'm
> unsure of the best approach here.
>
> Debugging further, I was able to see that:
>
> a) The reason the nsfs isn't considere as part of propagate_mount_unlock
> is that the 'mnt' passed to that function is the top of the mount tree
> and it appears to only be considering mounts directly related to 'mnt'.
>
> b) The change_mnt_propogation(MS_PRIVATE) at the end of the while loop
> in umount_tree() is what ends up hiding these mounts from the host
> container.  Once they're no longer slaved or shared, we never again
> consider them as candiates for unlocking.
>
> c) Also note that these detached mounts that aren't free'd aren't
> charged against a container's ns->mounts limit, so it may be possible
> for a mount ns to be using more mounts than it has officially accounted
> for.
>
> I wondered if a naive solution could re-walk the list of mounts
> processed in umount_tree() and if all of the detached but locked mounts
> had a refcount that indicated they're unused, they could be unlocked and
> unmounted.  At least in the case of the containers I'm dealing with, the
> the container software should be ensuring that nothing in the container
> has a reference on anything that's under the detached portion of the
> tree.  However, there's probably a better way to do this.
>
> Thoughts?

Any chance you have a trivial reproducer script?

From you description I don't quite see the problem.  I know where to
look but if could give a script that reproduces the conditions you
see that would make it easier for me to dig into, and would certainly
would remove ambiguity.   Ideally such a script would be runnable
under unshare -Urm for easy repeated testing.

Eric

next      parent reply	other threads:[~2017-01-11  2:04 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20170111012454.GB2497@templeofstupid.com>
     [not found] ` <20170111012454.GB2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
2017-01-11  2:04   ` Eric W. Biederman [this message]
     [not found]     ` <87r34a5p3t.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-11  3:07       ` Possible bug: detached mounts difficult to cleanup Krister Johansen
     [not found]         ` <20170111030753.GC2497-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
2017-01-13  0:37           ` Andrei Vagin
     [not found]             ` <CANaxB-zMzS-euqR1_LvZSoEsO-Y6q=_qGNTJZCKZTL5WfFF16g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-01-13 23:28               ` Krister Johansen
2017-01-11  2:27   ` Eric W. Biederman
     [not found] ` <87fukqwcue.fsf@xmission.com>
     [not found]   ` <87fukqwcue.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-11  2:37     ` Eric W. Biederman
     [not found]       ` <87shoqtj7z.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-12  6:15         ` Krister Johansen
     [not found]           ` <20170112061539.GA2345-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
2017-01-12  8:26             ` Eric W. Biederman
     [not found]               ` <87r348y98z.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-01-13 23:28                 ` Krister Johansen
2017-01-11  2:51     ` Al Viro
2017-01-11  1:24 Krister Johansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r34a5p3t.fsf@xmission.com \
    --to=ebiederm-as9lmozglivwk0htik3j/w@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org \
    --cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox