Re: linux-next: cgroup_mount() falls asleep forever

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Al Viro <viro@ZenIV.linux.org.uk>
To: Tejun Heo <tj@kernel.org>
Cc: Andrey Wagin <avagin@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Li Zefan <lizefan@huawei.com>
Subject: Re: linux-next: cgroup_mount() falls asleep forever
Date: Thu, 25 Sep 2014 03:47:19 +0100	[thread overview]
Message-ID: <20140925024719.GJ7996@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20140924192456.GA1233@mtj.dyndns.org>

On Wed, Sep 24, 2014 at 03:24:56PM -0400, Tejun Heo wrote:

> > Lovely...  First of all, that thing is obviously racy - there's nothing
> > to prevent another mount happening between these two places.  Moreover,
> > kernfs_mount() calling conventions are really atrocious - pointer to
> > bool just to indicate that superblock is new?
> > 
> > Could somebody explain WTF is the whole construction trying to do?  Not
> > to mention anything else, what *does* this pinning a superblock protect
> > from?  Suppose we have a superblock for the same root with non-NULL ns
> > and _that_ gets killed.  We get hit by the same
> > 	percpu_ref_kill(&root->cgrp.self.refcnt);
> > so what's the point of pinned_sb?  Might as well have just bumped the
> > refcount, superblock or no superblock.  And no, delaying that kernfs_kill_sb()
> > does you no good whatsoever - again, pinned_sb might have nothing to do with
> > the superblock we are after.
> 
> Yeah, it's an ugly thing to work around vfs interface not very
> conducive for filesystems which conditionally create or reuse
> superblocks during mount.  There was a thread explaining what's going
> on.  Looking up...
> 
>   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635

Umm...  I still don't get it.  Could you describe the screnario in which
that percpu_ref_tryget_live() would be called and managed to fail?

It smells to me like most of the problems here are simply due to having too
many locks and not being able to decide where should they live relative to
->s_umount.  That cgroup_mutex thing feels like something way to coarse...
You have it grabbed/dropped in
	cgroup_destroy_root(), cgroup_remount(), cgroup_mount(),
	task_cgroup_path(), cgroup_attach_task_all(), cgroup_rename(),
	cgroup_rm_cftypes(), cgroup_add_cftypes(), cgroup_transfer_tasks(),
	cgroupstats_build(),
	css_release_work_fn() [async, queue_work()],
	css_killed_work_fn() [async, queue_work()],
	cgroup_init_subsys(), cgroup_init(), proc_cgroup_show(),
	proc_cgroupstats_show(), cgroup_release_agent(), __cgroup_procs_write(),
	cgroup_release_agent_write(), cgroup_subtree_control_write(),
	cgroup_mkdir(), cgroup_rmdir()
And that's a single system-wide mutex.  Plus there's a slew of workqueues
and really unpleasant abuse of restart_syscall() tossed in for fun - what
happens if some joker triggers that ->mount() _not_ from mount(2)?

Then there's a global rwsem nesting inside that sucker.  And there's another
mutex in fs/kernfs - also a global one.  Are the locking rules documented
anywhere?  Lifetime rules, as well...

Frankly, my first inclination here would be to try using sget() instead of
scanning the list of roots.  How painful would it be to split the allocation
and setup of roots, always allocate a new root and have sget() wait
for fs shutdown in progress and decide whether it wants to reuse the existing
one.  You can easily tell reuse existing vs. set up a new one from each other -
just look at the root associated with the superblock you've got and check
if it's the one you've allocated.  Freeing the damn thing if we'd reused
an existing one and doing setup otherwise.

I realize that it won't do in such form; your for_each_subsys() loop in there
really depends on holding cgroup_mutex all the way through.  But do we really
need it there?  Would just skipping the ones that doomed in rebind_subsystems()
suffice?

Just looking at the size of the damn thing is depressing, TBH - it's quite
a bit bigger than *anything* in fs/*.c.  And callers are all over the tree ;-/

next prev parent reply	other threads:[~2014-09-25  2:47 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin
2014-09-24 14:29 ` Andrey Wagin
2014-09-24 18:31   ` Cong Wang
2014-09-24 18:52   ` Al Viro
2014-09-24 19:24     ` Tejun Heo
2014-09-25  2:47       ` Al Viro [this message]
2014-09-25  3:25         ` Tejun Heo
2014-09-25 10:23           ` Zefan Li
2014-09-25 15:14             ` Tejun Heo
2014-09-24 20:16     ` Al Viro
2014-09-24 16:13 ` Andrey Wagin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140925024719.GJ7996@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=avagin@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox