From: Krister Johansen <kjlx@templeofstupid.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Krister Johansen <kjlx@templeofstupid.com>,
Al Viro <viro@ZenIV.linux.org.uk>,
linux-fsdevel@vger.kernel.org,
containers@lists.linux-foundation.org
Subject: Re: Possible bug: detached mounts difficult to cleanup
Date: Wed, 11 Jan 2017 22:15:39 -0800
Message-ID: <20170112061539.GA2345@templeofstupid.com>
In-Reply-To: <87shoqtj7z.fsf@xmission.com>
On Wed, Jan 11, 2017 at 03:37:36PM +1300, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> > So if the code is working correctly that should already happen.
> >
> > The design is for the parent mount to hold a reference to the submounts.
> > And when the reference on the parent drops to 0. The references on
> > all of the submounts will also be dropped.
> >
> > I was hoping to read the code and point it out to you quickly, but I am
> > not seeing it now. I am wondering if in all of the refactoring of that
> > code something was dropped/missed :(
> >
> > Somewhere there is supposed to be the equivalent of:
> > pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
> > when we unhash those mounts because the last count has gone away.
> > Either it is very sophisticated or I am missing it. Grr....
>
> Ok. I see the code now, and it should be doing the right thing.
>
> During umount_tree the code calls pin_insert_group(...) with the
> last parameter being NULL. That adds the mount to one or two
> lists. The mnt_pins list of the parent mount and the &unmounted
> hlist.
>
> Then later when the parent's cleanup_mnt is called if the mnt_pins
> still has entries mnt_pin_kill is called. For every mount on the
> mnt_pins list drop_mountpoint is called. Which calls dput and
> mntput.
>
> So that is how your references are supposed to be freed. Which leaves
> the question why aren't your mounts being freed? Is a file descriptor
> perhaps from a mmaped executable holding a mount reference?
Was that test case of any use? I'm afraid that I'm still failing to
communicate the problem. The parent's cleanup_mnt isn't getting called
for the detached and locked mounts, and I can explain why. The only
time I'm seeing them freed is via the __detach_mounts() path, which is
only invoked for d_invalidate, vfs_rmdir, vfs_unlink, and vfs_rename:
rm 14633 [013] 29947.047071: probe:nsfs_evict: (ffffffff81254fb0)
    7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
    7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
    7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
    7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
    7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
    7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
    7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
    7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
    7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
    7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
    7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
    7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
    7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
    7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
    7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
    7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
    7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
    e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
So that's the stack where I see it work, but I never see it go through
the cleanup_mnt() path, and here's why. First, the while loop in
umount_tree():
        while (!list_empty(&tmp_list)) {
                struct mnt_namespace *ns;
                bool disconnect;
                p = list_first_entry(&tmp_list, struct mount, mnt_list);
                list_del_init(&p->mnt_expire);
                list_del_init(&p->mnt_list);
                ns = p->mnt_ns;
                if (ns) {
                        ns->mounts--;
                        __touch_mnt_namespace(ns);
                }
                p->mnt_ns = NULL;
                if (how & UMOUNT_SYNC)
                        p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;

#1 --->         disconnect = disconnect_mount(p, how);

#2 --->         pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
                                 disconnect ? &unmounted : NULL);
                if (mnt_has_parent(p)) {
                        mnt_add_count(p->mnt_parent, -1);
                        if (!disconnect) {
                                /* Don't forget about p */
                                list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
                        } else {
                                umount_mnt(p);
                        }
                }
#3 --->         change_mnt_propagation(p, MS_PRIVATE);
        }
So at #1 disconnect is false if p has MNT_LOCKED set.
At #2 p isn't added to the s_list on 'unmounted' if disconnect is false.
The mount gets hidden from the host container at #3, but that's not
germane to the invocation of pin_kill.
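
To make the effect of #1 and #2 concrete, here is a minimal user-space
sketch. It is not kernel code: struct toy_mount and the toy_* functions
are hypothetical names, and toy_disconnect_mount() models only the
MNT_LOCKED case described above, ignoring anything else the real
disconnect_mount() looks at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mount; not the kernel's layout. */
struct toy_mount {
	const char *name;
	bool locked;          /* models MNT_LOCKED being set */
	bool on_parent_pins;  /* models being on the parent's mnt_pins list */
	bool on_unmounted;    /* models being on the 'unmounted' hlist */
};

/* Simplification (assumption): a locked mount is never disconnected. */
static bool toy_disconnect_mount(const struct toy_mount *m)
{
	return !m->locked;
}

/*
 * Models steps #1 and #2 of the loop: every mount is pinned to its
 * parent, but only mounts being disconnected are also queued on
 * 'unmounted'. Returns how many mounts were queued on 'unmounted'.
 */
static size_t toy_umount_tree(struct toy_mount *mounts, size_t n)
{
	size_t queued = 0;

	assert(mounts != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		bool disconnect = toy_disconnect_mount(&mounts[i]);  /* #1 */

		mounts[i].on_parent_pins = true;                     /* #2 */
		mounts[i].on_unmounted = disconnect;                 /* #2 */
		if (disconnect)
			queued++;
	}
	return queued;
}
```

Running toy_umount_tree() over a mix of locked and unlocked entries
leaves the locked ones pinned to their parent but absent from
'unmounted', which is the state I'm describing.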
This is namespace_unlock:

        hlist_move_list(&unmounted, &head);
        up_write(&namespace_sem);
        if (likely(hlist_empty(&head)))
                return;
        synchronize_rcu();
        group_pin_kill(&head);

So unmounted is moved to head, and group_pin_kill is invoked on that.
Only the mounts we marked for disconnect go through the cleanup_mnt path
that way.
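
The same point as a minimal user-space sketch, again with hypothetical
names (struct toy_mnt, toy_namespace_unlock) and no real locking or RCU.
It models only the fact that the kill path walks what was queued on
'unmounted' and nothing else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount as seen by namespace_unlock(). */
struct toy_mnt {
	const char *name;
	bool on_unmounted;  /* queued on 'unmounted' by umount_tree() */
	bool cleaned;       /* set once the modelled pin_kill path reaches it */
};

/*
 * Models namespace_unlock(): entries queued on 'unmounted' are moved off
 * that list and killed; entries that were never queued are not visited.
 * Returns the number of mounts cleaned up by this call.
 */
static size_t toy_namespace_unlock(struct toy_mnt *mounts, size_t n)
{
	size_t killed = 0;

	assert(mounts != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!mounts[i].on_unmounted)
			continue;                /* never on 'head', never visited */
		mounts[i].on_unmounted = false;  /* models hlist_move_list() */
		mounts[i].cleaned = true;        /* models group_pin_kill() */
		killed++;
	}
	return killed;
}
```

A mount that was left connected at #1 enters this function with
on_unmounted == false and leaves it with cleaned == false, no matter how
many times the function runs.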
So that's the fundamental question I'm trying to ask. If we have a
mount tree that's umount(MNT_DETACH)'d immediately after a pivot_root,
and those mounts never get cleaned up except when their mountpoints get
rm'd or mv'd, is there a better way to clean up this tree?
-K