From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman)
To: Krister Johansen
<kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
Subject: Re: Possible bug: detached mounts difficult to cleanup
Date: Thu, 12 Jan 2017 21:26:20 +1300
Message-ID: <87r348y98z.fsf@xmission.com>
In-Reply-To: <20170112061539.GA2345-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org> (Krister Johansen's message of "Wed, 11 Jan 2017 22:15:39 -0800")
Krister Johansen <kjlx-6woCzk5+qv5TrMCiz+cRkdBPR1lH4CV8@public.gmane.org> writes:
> On Wed, Jan 11, 2017 at 03:37:36PM +1300, Eric W. Biederman wrote:
>> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>> > So if the code is working correctly that should already happen.
>> >
>> > The design is for the parent mount to hold a reference to the submounts.
>> > When the reference count on the parent drops to 0, the references on
>> > all of the submounts will also be dropped.
>> >
>> > I was hoping to read the code and point it out to you quickly, but I am
>> > not seeing it now. I am wondering if in all of the refactoring of that
>> > code something was dropped/missed :(
>> >
>> > Somewhere there is supposed to be the equivalent of:
>> > pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt, &unmounted);
>> > when we unhash those mounts because the last count has gone away.
>> > Either it is very sophisticated or I am missing it. Grr....
>>
>> Ok. I see the code now, and it should be doing the right thing.
>>
>> During umount_tree the code calls pin_insert_group(...) with the
>> last parameter being NULL. That adds the mount to one or two
>> lists: the mnt_pins list of the parent mount and the &unmounted
>> hlist.
>>
>> Then later, when the parent's cleanup_mnt is called, if the mnt_pins
>> list still has entries, mnt_pin_kill is called. For every mount on the
>> mnt_pins list, drop_mountpoint is called, which calls dput and
>> mntput.
>>
>> So that is how your references are supposed to be freed. Which leaves
>> the question: why aren't your mounts being freed? Is a file descriptor,
>> perhaps from an mmapped executable, holding a mount reference?
>
> Was that test case of any use? I'm afraid that I'm still failing to
> communicate the problem.
I apologize. I really haven't had the energy to dig into it, especially
after I read the code and the only way I could see to get the
problem you are having is for something to be retaining a reference to
the mounts.
> The parent's cleanup_mnt isn't getting called
> for the detached and locked mounts, and I can explain why. The only
> time I'm seeing them freed is via the __detach_mounts() path, which is
> only invoked for d_invalidate, vfs_rmdir, vfs_unlink, and vfs_rename:
>
> rm 14633 [013] 29947.047071: probe:nsfs_evict: (ffffffff81254fb0)
>     7fff81256fb1 nsfs_evict+0x80007f002001 ([kernel.kallsyms])
>     7fff8123e4c6 iput+0x80007f002196 ([kernel.kallsyms])
>     7fff8123944c __dentry_kill+0x80007f00219c ([kernel.kallsyms])
>     7fff81239611 dput+0x80007f002151 ([kernel.kallsyms])
>     7fff81241bb6 cleanup_mnt+0x80007f002036 ([kernel.kallsyms])
>     7fff81242beb mntput_no_expire+0x80007f00212b ([kernel.kallsyms])
>     7fff81242c54 mntput+0x80007f002024 ([kernel.kallsyms])
>     7fff81242c9a drop_mountpoint+0x80007f00202a ([kernel.kallsyms])
>     7fff81256df7 pin_kill+0x80007f002077 ([kernel.kallsyms])
>     7fff81256ede group_pin_kill+0x80007f00201e ([kernel.kallsyms])
>     7fff812416e3 namespace_unlock+0x80007f002073 ([kernel.kallsyms])
>     7fff81243e03 __detach_mounts+0x80007f0020d3 ([kernel.kallsyms])
>     7fff8122f0cd vfs_unlink+0x80007f00217d ([kernel.kallsyms])
>     7fff81231ce3 do_unlinkat+0x80007f002263 ([kernel.kallsyms])
>     7fff812327ab sys_unlinkat+0x80007f00201b ([kernel.kallsyms])
>     7fff81005b12 do_syscall_64+0x80007f002062 ([kernel.kallsyms])
>     7fff81735b21 return_from_SYSCALL_64+0x80007f002000 ([kernel.kallsyms])
>     e90ed unlinkat+0xffff012b930e800d (/usr/lib64/libc-2.17.so)
>
> So that's the stack where I see it work, but I never see it go through
> the cleanup_mnt() path, and here's why. First, the loop
> in umount_tree():
>
>         while (!list_empty(&tmp_list)) {
>                 struct mnt_namespace *ns;
>                 bool disconnect;
>                 p = list_first_entry(&tmp_list, struct mount, mnt_list);
>                 list_del_init(&p->mnt_expire);
>                 list_del_init(&p->mnt_list);
>                 ns = p->mnt_ns;
>                 if (ns) {
>                         ns->mounts--;
>                         __touch_mnt_namespace(ns);
>                 }
>                 p->mnt_ns = NULL;
>                 if (how & UMOUNT_SYNC)
>                         p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
>
> #1 --->         disconnect = disconnect_mount(p, how);
>
> #2 --->         pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
>                                  disconnect ? &unmounted : NULL);
>                 if (mnt_has_parent(p)) {
>                         mnt_add_count(p->mnt_parent, -1);
>                         if (!disconnect) {
>                                 /* Don't forget about p */
>                                 list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
>                         } else {
>                                 umount_mnt(p);
>                         }
>                 }
> #3 --->         change_mnt_propagation(p, MS_PRIVATE);
>         }
>
>
> So at #1 disconnect is false if p has MNT_LOCKED set.
> At #2 p isn't added to the s_list on 'unmounted' if disconnect is false.
>
> The mount gets hidden from the host container at #3, but that's not
> germane to the invocation of pin_kill.
>
> This is namespace_unlock:
>
>         hlist_move_list(&unmounted, &head);
>
>         up_write(&namespace_sem);
>
>         if (likely(hlist_empty(&head)))
>                 return;
>
>         synchronize_rcu();
>
>         group_pin_kill(&head);
>
> So unmounted is moved to head, and group_pin_kill is invoked on that.
> Only the mounts we marked for disconnect go through the cleanup_mnt path
> that way.
At which point you have an island of mounts.
In that island each submount is on its parent's mnt_pins list.
When the last reference to a parent is dropped, two things happen
to its children:
  - umount_mnt is called on them directly from mntput_no_expire;
  - drop_mountpoint is called on them from mnt_pin_kill, which is
    called from cleanup_mnt, which is reached indirectly from
    mntput_no_expire.
So all we need is for mntput_no_expire to be called on a mount for the
entire island to be freed.
So the fundamental issue appears to be that nothing is dropping the last
reference to some part of your island of mounts.
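To make the cascade concrete, here is a minimal user-space sketch of the
idea, not kernel code. Every name in it (toy_mount, toy_pin, toy_mntput,
toy_island_demo) is hypothetical, and it assumes the simplified model
described above: a parent holds one reference on each pinned child, and
dropping the last reference on the parent drops those references in turn.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only: names and layout are hypothetical, not the kernel's. */
struct toy_mount {
	const char *name;
	int count;                  /* reference count */
	struct toy_mount *pins[4];  /* children pinned on this mount */
	int npins;
};

static int freed_total;

void toy_mntput(struct toy_mount *m);

/* Parent takes a reference on the child and remembers it. */
void toy_pin(struct toy_mount *parent, struct toy_mount *child)
{
	assert(parent->npins < 4);
	child->count++;
	parent->pins[parent->npins++] = child;
}

/* Runs once the last reference is gone: release every pinned child. */
void toy_cleanup(struct toy_mount *m)
{
	for (int i = 0; i < m->npins; i++)
		toy_mntput(m->pins[i]);
	m->npins = 0;
	freed_total++;
	printf("freed %s\n", m->name);
}

void toy_mntput(struct toy_mount *m)
{
	if (--m->count == 0)
		toy_cleanup(m);
}

/*
 * Build an island root -> a -> b in which some outside holder (say, an
 * open file) keeps one reference on root. Returns the number of mounts
 * freed.
 */
int toy_island_demo(int holder_drops_ref)
{
	struct toy_mount root = { .name = "root" };
	struct toy_mount a = { .name = "a" };
	struct toy_mount b = { .name = "b" };

	freed_total = 0;
	root.count = 1;          /* the outside holder's reference */
	toy_pin(&root, &a);
	toy_pin(&a, &b);

	if (holder_drops_ref)
		toy_mntput(&root);   /* last reference: the whole island goes */
	return freed_total;
}
```

In this model toy_island_demo(0) frees nothing, because the holder's
reference keeps root alive and root keeps everything below it alive,
while toy_island_demo(1) frees all three mounts from a single put.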
> So that's the fundamental question I'm trying to ask. If we have a
> mount tree that's umount(MNT_DETACH)'d immediately after a pivot_root,
> but it's never getting those mounts cleaned up except when their
> mountpoints get rm'd or mv'd, is there a better way to clean up this
> tree?
SIGKILL the process that is holding a reference.
Eric