Re: linux-next: cgroup_mount() falls asleep forever

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Zefan Li <lizefan@huawei.com>
To: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>,
	Andrey Wagin <avagin@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: linux-next: cgroup_mount() falls asleep forever
Date: Thu, 25 Sep 2014 18:23:39 +0800	[thread overview]
Message-ID: <5423ED2B.8010009@huawei.com> (raw)
In-Reply-To: <20140925032502.GB6903@mtj.dyndns.org>

On 2014/9/25 11:25, Tejun Heo wrote:
> Hello, Al.
> 
> On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote:
>>> Yeah, it's an ugly thing to work around vfs interface not very
>>> conducive for filesystems which conditionally create or reuse
>>> superblocks during mount.  There was a thread explaining what's going
>>> on.  Looking up...
>>>
>>>   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635
>>
>> Umm...  I still don't get it.  Could you describe the screnario in which
>> that percpu_ref_tryget_live() would be called and managed to fail?
> 

See this:

https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/kernel/cgroup.c?id=3a32bd72d77058d768dbb38183ad517f720dd1bc

Without the above commit, A scenario like this can happen:

1. Thread A has dropped sb refcnt to 0 but hasn't called percpu_ref_kill(). 
2. At this time Thread B calls cgroup_mount() and percpu_ref_tryget() succeeds.
3. After a while Thread C calls cgroup_mount(), but percpu_ref_tryget() keeps
returning failure, because the ref has been killed by percpu_ref_kill().

So I had to use kernfs_pin_sb() to prevent thread B from getting the percpu refcnt.

Any better idea to fix this?

> That was for the initial fix and Li later added the pinning to fix
> something else.  Let's wait for Li to chime in.  He knows this part
> better.
> 
>> It smells to me like most of the problems here are simply due to having too
>> many locks and not being able to decide where should they live relative to
>> ->s_umount.  That cgroup_mutex thing feels like something way to coarse...
>> You have it grabbed/dropped in
> 
> cgroup_mutex is the outer-most lock as far as cgroup is concerned and
> not expected to nest under anything which is used by individual
> controllers.  Most of what cgroup core does is low-freq managerial
> things which don't benefit from finer grained locking and the mount
> path is one of few surfaces where it interacts with outside in terms
> of locking, so it's better to keep that path special and everything
> else simpler.
> 
>> And that's a single system-wide mutex.  Plus there's a slew of workqueues
>> and really unpleasant abuse of restart_syscall() tossed in for fun - what
>> happens if some joker triggers that ->mount() _not_ from mount(2)?
> 
> For cgroup, mount is the userland-visible init interface.  It gotta be
> called from userland.  It originally had internal retry loop but
> syscall restart is simpler.  Reviving that loop isn't difficult if it
> ever becomes necessary.
> 
>> Then there's a global rwsem nesting inside that sucker.  And there's another
> 
> The rwsem nests inside cgroup_mutex and exists to allow multiple
> reader accesses to a particular data structure.
> 
>> mutex in fs/kernfs - also a global one.  Are the locking rules documented
>> anywhere?  Lifetime rules, as well...
> 
> kernfs one shouldn't interact with anything outside kernfs.  Its
> dependency is terminated within kernfs.
> 
>> Frankly, my first inclination here would be to try using sget() instead of
>> scanning the list of roots.  How painful would it be to split the allocation
>> and setup of roots, always allocate a new root and have sget() wait
>> for fs shutdown in progress and decide whether it wants to reuse the existing
>> one.  You can easily tell reuse existing vs. set up a new one from each other -
>> just look at the root associated with the superblock you've got and check
>> if it's the one you've allocated.  Freeing the damn thing if we'd reused
>> an existing one and doing setup otherwise.
>>
>> I realize that it won't do in such form; your for_each_subsys() loop in there
>> really depends on holding cgroup_mutex all the way through.  But do we really
>> need it there?  Would just skipping the ones that doomed in rebind_subsystems()
>> suffice?
> 
> At this point, cgroup core locking is heavily focused on simplicity -
> cgroup_mutex for the whole thing and css_set_rwsem for css_set reader
> accesses.  It works out pretty well for the rest of the code but the
> mount path does get tricky.  We can definitely relax / separate out
> locking on subsys iteration for mount path but if possible I'd prefer
> to pay isolated complexity there instead of spilling it to other
> places.
> 
> Anyways, let's wait for Li.  At least nobody reported breakage before
> the recent commit, so we can revert the offending commit for the short
> term.
> 

That patch is wrong. We have to use both pinned_sb and new_sb, so
please revert it. :(

I think we need to put some test cases in tools/testing/selftests/, to
prevent this fragile thing from breaking again.

next prev parent reply	other threads:[~2014-09-25 10:23 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin
2014-09-24 14:29 ` Andrey Wagin
2014-09-24 18:31   ` Cong Wang
2014-09-24 18:52   ` Al Viro
2014-09-24 19:24     ` Tejun Heo
2014-09-25  2:47       ` Al Viro
2014-09-25  3:25         ` Tejun Heo
2014-09-25 10:23           ` Zefan Li [this message]
2014-09-25 15:14             ` Tejun Heo
2014-09-24 20:16     ` Al Viro
2014-09-24 16:13 ` Andrey Wagin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5423ED2B.8010009@huawei.com \
    --to=lizefan@huawei.com \
    --cc=avagin@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.