* linux-next: cgroup_mount() falls asleep forever @ 2014-09-24 10:31 Andrey Wagin 2014-09-24 14:29 ` Andrey Wagin 2014-09-24 16:13 ` Andrey Wagin 0 siblings, 2 replies; 11+ messages in thread From: Andrey Wagin @ 2014-09-24 10:31 UTC (permalink / raw) To: LKML, Tejun Heo, Li Zefan Hi All, I execute CRIU tests on linux-next. Today I found that one of tests hangs up forever. [root@linux-next-test linux-next]# git describe HEAD next-20140922 [root@linux-next-test ~]# ps axf ... 698 ? Ss 0:05 \_ sshd: root@notty 700 ? Ss 0:00 | \_ bash -x jenkins-scripts/jenkins-ct.sh jenkins.sh 706 ? S 0:00 | \_ bash -c ( mount --make-rprivate / && umount -l /proc && mount -t proc proc /proc/ && bash -x jenkins.sh) 707 ? S 0:00 | \_ bash -c ( mount --make-rprivate / && umount -l /proc && mount -t proc proc /proc/ && bash -x jenkins.sh) 711 ? S 0:00 | \_ bash -x jenkins.sh 2981 ? S 0:05 | \_ bash -x test/zdtm.sh -C -x .*\(maps01\|maps04\) 6717 ? S 0:00 | \_ /bin/bash zdtm/live/static/cgroup00.hook --clean 6719 ? D 11:13 | \_ mount -t cgroup none cgclean.cWgK71 -o none,name=zdtmtst [root@linux-next-test ~]# cat /proc/6719/stack [<ffffffff8111fcb7>] msleep+0x37/0x50 [<ffffffff8114bde2>] cgroup_mount+0x712/0xa50 [<ffffffff8124d0e9>] mount_fs+0x39/0x1b0 [<ffffffff8126d76b>] vfs_kern_mount+0x6b/0x150 [<ffffffff8127075b>] do_mount+0x22b/0xc20 [<ffffffff81271492>] SyS_mount+0xa2/0x110 [<ffffffff817e3ba9>] system_call_fastpath+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff Let me know which info do you need for investigation. Steps to reproduce: git clone https://github.com/xemul/criu.git cd criu make bash test/zdtm.sh static/cgroup00 Thanks, Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin @ 2014-09-24 14:29 ` Andrey Wagin 2014-09-24 18:31 ` Cong Wang 2014-09-24 18:52 ` Al Viro 2014-09-24 16:13 ` Andrey Wagin 1 sibling, 2 replies; 11+ messages in thread From: Andrey Wagin @ 2014-09-24 14:29 UTC (permalink / raw) To: LKML, Tejun Heo, Li Zefan 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>: > Hi All, The problem is in a following commit: commit 0c7bf3e8cab7900e17ce7f97104c39927d835469 Author: Zefan Li <lizefan@huawei.com> Date: Sat Sep 20 14:49:10 2014 +0800 cgroup: remove redundant variable in cgroup_mount() Both pinned_sb and new_sb indicate if a new superblock is needed, so we can just remove new_sb. Note now we must check if kernfs_tryget_sb() returns NULL, because when it returns NULL, kernfs_mount() may still re-use an existing superblock, which is just allocated by another concurent mount. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> > > I execute CRIU tests on linux-next. Today I found that one of tests > hangs up forever. > > [root@linux-next-test linux-next]# git describe HEAD > next-20140922 > [root@linux-next-test ~]# ps axf > ... > 698 ? Ss 0:05 \_ sshd: root@notty > 700 ? Ss 0:00 | \_ bash -x > jenkins-scripts/jenkins-ct.sh jenkins.sh > 706 ? S 0:00 | \_ bash -c ( mount --make-rprivate > / && umount -l /proc && mount -t proc proc /proc/ && bash -x > jenkins.sh) > 707 ? S 0:00 | \_ bash -c ( mount > --make-rprivate / && umount -l /proc && mount -t proc proc /proc/ && > bash -x jenkins.sh) > 711 ? S 0:00 | \_ bash -x jenkins.sh > 2981 ? S 0:05 | \_ bash -x > test/zdtm.sh -C -x .*\(maps01\|maps04\) > 6717 ? S 0:00 | \_ /bin/bash > zdtm/live/static/cgroup00.hook --clean > 6719 ? D 11:13 | \_ mount -t > cgroup none cgclean.cWgK71 -o none,name=zdtmtst > > [root@linux-next-test ~]# cat /proc/6719/stack > [<ffffffff8111fcb7>] msleep+0x37/0x50 > [<ffffffff8114bde2>] cgroup_mount+0x712/0xa50 > [<ffffffff8124d0e9>] mount_fs+0x39/0x1b0 > [<ffffffff8126d76b>] vfs_kern_mount+0x6b/0x150 > [<ffffffff8127075b>] do_mount+0x22b/0xc20 > [<ffffffff81271492>] SyS_mount+0xa2/0x110 > [<ffffffff817e3ba9>] system_call_fastpath+0x12/0x17 > [<ffffffffffffffff>] 0xffffffffffffffff > > Let me know which info do you need for investigation. > > Steps to reproduce: > > git clone https://github.com/xemul/criu.git > cd criu > make > bash test/zdtm.sh static/cgroup00 > > Thanks, > Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 14:29 ` Andrey Wagin @ 2014-09-24 18:31 ` Cong Wang 2014-09-24 18:52 ` Al Viro 1 sibling, 0 replies; 11+ messages in thread From: Cong Wang @ 2014-09-24 18:31 UTC (permalink / raw) To: Andrey Wagin; +Cc: LKML, Tejun Heo, Li Zefan On Wed, Sep 24, 2014 at 7:29 AM, Andrey Wagin <avagin@gmail.com> wrote: > 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>: >> Hi All, > > The problem is in a following commit: > > commit 0c7bf3e8cab7900e17ce7f97104c39927d835469 > Author: Zefan Li <lizefan@huawei.com> > Date: Sat Sep 20 14:49:10 2014 +0800 > > cgroup: remove redundant variable in cgroup_mount() > > Both pinned_sb and new_sb indicate if a new superblock is needed, > so we can just remove new_sb. > > Note now we must check if kernfs_tryget_sb() returns NULL, because > when it returns NULL, kernfs_mount() may still re-use an existing > superblock, which is just allocated by another concurent mount. > I guess the check for NULL is incorrect, the comment on kernfs_pin_sb() says: Returns NULL if there's no superblock associated to this kernfs_root, ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 14:29 ` Andrey Wagin 2014-09-24 18:31 ` Cong Wang @ 2014-09-24 18:52 ` Al Viro 2014-09-24 19:24 ` Tejun Heo 2014-09-24 20:16 ` Al Viro 1 sibling, 2 replies; 11+ messages in thread From: Al Viro @ 2014-09-24 18:52 UTC (permalink / raw) To: Andrey Wagin; +Cc: LKML, Tejun Heo, Li Zefan On Wed, Sep 24, 2014 at 06:29:27PM +0400, Andrey Wagin wrote: > 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>: > > Hi All, > > The problem is in a following commit: > > commit 0c7bf3e8cab7900e17ce7f97104c39927d835469 > Author: Zefan Li <lizefan@huawei.com> > Date: Sat Sep 20 14:49:10 2014 +0800 > > cgroup: remove redundant variable in cgroup_mount() > > Both pinned_sb and new_sb indicate if a new superblock is needed, > so we can just remove new_sb. > > Note now we must check if kernfs_tryget_sb() returns NULL, because > when it returns NULL, kernfs_mount() may still re-use an existing > superblock, which is just allocated by another concurent mount. > > Suggested-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Zefan Li <lizefan@huawei.com> > Signed-off-by: Tejun Heo <tj@kernel.org> Lovely... First of all, that thing is obviously racy - there's nothing to prevent another mount happening between these two places. Moreover, kernfs_mount() calling conventions are really atrocious - pointer to bool just to indicate that superblock is new? Could somebody explain WTF is the whole construction trying to do? Not to mention anything else, what *does* this pinning a superblock protect from? Suppose we have a superblock for the same root with non-NULL ns and _that_ gets killed. We get hit by the same percpu_ref_kill(&root->cgrp.self.refcnt); so what's the point of pinned_sb? Might as well have just bumped the refcount, superblock or no superblock. And no, delaying that kernfs_kill_sb() does you no good whatsoever - again, pinned_sb might have nothing to do with the superblock we are after. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 18:52 ` Al Viro @ 2014-09-24 19:24 ` Tejun Heo 2014-09-25 2:47 ` Al Viro 2014-09-24 20:16 ` Al Viro 1 sibling, 1 reply; 11+ messages in thread From: Tejun Heo @ 2014-09-24 19:24 UTC (permalink / raw) To: Al Viro; +Cc: Andrey Wagin, LKML, Li Zefan Hey, Al. On Wed, Sep 24, 2014 at 07:52:14PM +0100, Al Viro wrote: > On Wed, Sep 24, 2014 at 06:29:27PM +0400, Andrey Wagin wrote: > > 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>: > > > Hi All, > > > > The problem is in a following commit: > > > > commit 0c7bf3e8cab7900e17ce7f97104c39927d835469 > > Author: Zefan Li <lizefan@huawei.com> > > Date: Sat Sep 20 14:49:10 2014 +0800 > > > > cgroup: remove redundant variable in cgroup_mount() > > > > Both pinned_sb and new_sb indicate if a new superblock is needed, > > so we can just remove new_sb. > > > > Note now we must check if kernfs_tryget_sb() returns NULL, because > > when it returns NULL, kernfs_mount() may still re-use an existing > > superblock, which is just allocated by another concurent mount. > > > > Suggested-by: Tejun Heo <tj@kernel.org> > > Signed-off-by: Zefan Li <lizefan@huawei.com> > > Signed-off-by: Tejun Heo <tj@kernel.org> I'm gonna wait for Li's response for a couple days and then revert it if it can't be fixed differently. > Lovely... First of all, that thing is obviously racy - there's nothing > to prevent another mount happening between these two places. Moreover, > kernfs_mount() calling conventions are really atrocious - pointer to > bool just to indicate that superblock is new? > > Could somebody explain WTF is the whole construction trying to do? Not > to mention anything else, what *does* this pinning a superblock protect > from? Suppose we have a superblock for the same root with non-NULL ns > and _that_ gets killed. We get hit by the same > percpu_ref_kill(&root->cgrp.self.refcnt); > so what's the point of pinned_sb? Might as well have just bumped the > refcount, superblock or no superblock. And no, delaying that kernfs_kill_sb() > does you no good whatsoever - again, pinned_sb might have nothing to do with > the superblock we are after. Yeah, it's an ugly thing to work around vfs interface not very conducive for filesystems which conditionally create or reuse superblocks during mount. There was a thread explaining what's going on. Looking up... http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635 Thanks. -- tejun ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 19:24 ` Tejun Heo @ 2014-09-25 2:47 ` Al Viro 2014-09-25 3:25 ` Tejun Heo 0 siblings, 1 reply; 11+ messages in thread From: Al Viro @ 2014-09-25 2:47 UTC (permalink / raw) To: Tejun Heo; +Cc: Andrey Wagin, LKML, Li Zefan On Wed, Sep 24, 2014 at 03:24:56PM -0400, Tejun Heo wrote: > > Lovely... First of all, that thing is obviously racy - there's nothing > > to prevent another mount happening between these two places. Moreover, > > kernfs_mount() calling conventions are really atrocious - pointer to > > bool just to indicate that superblock is new? > > > > Could somebody explain WTF is the whole construction trying to do? Not > > to mention anything else, what *does* this pinning a superblock protect > > from? Suppose we have a superblock for the same root with non-NULL ns > > and _that_ gets killed. We get hit by the same > > percpu_ref_kill(&root->cgrp.self.refcnt); > > so what's the point of pinned_sb? Might as well have just bumped the > > refcount, superblock or no superblock. And no, delaying that kernfs_kill_sb() > > does you no good whatsoever - again, pinned_sb might have nothing to do with > > the superblock we are after. > > Yeah, it's an ugly thing to work around vfs interface not very > conducive for filesystems which conditionally create or reuse > superblocks during mount. There was a thread explaining what's going > on. Looking up... > > http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635 Umm... I still don't get it. Could you describe the screnario in which that percpu_ref_tryget_live() would be called and managed to fail? It smells to me like most of the problems here are simply due to having too many locks and not being able to decide where should they live relative to ->s_umount. That cgroup_mutex thing feels like something way to coarse... You have it grabbed/dropped in cgroup_destroy_root(), cgroup_remount(), cgroup_mount(), task_cgroup_path(), cgroup_attach_task_all(), cgroup_rename(), cgroup_rm_cftypes(), cgroup_add_cftypes(), cgroup_transfer_tasks(), cgroupstats_build(), css_release_work_fn() [async, queue_work()], css_killed_work_fn() [async, queue_work()], cgroup_init_subsys(), cgroup_init(), proc_cgroup_show(), proc_cgroupstats_show(), cgroup_release_agent(), __cgroup_procs_write(), cgroup_release_agent_write(), cgroup_subtree_control_write(), cgroup_mkdir(), cgroup_rmdir() And that's a single system-wide mutex. Plus there's a slew of workqueues and really unpleasant abuse of restart_syscall() tossed in for fun - what happens if some joker triggers that ->mount() _not_ from mount(2)? Then there's a global rwsem nesting inside that sucker. And there's another mutex in fs/kernfs - also a global one. Are the locking rules documented anywhere? Lifetime rules, as well... Frankly, my first inclination here would be to try using sget() instead of scanning the list of roots. How painful would it be to split the allocation and setup of roots, always allocate a new root and have sget() wait for fs shutdown in progress and decide whether it wants to reuse the existing one. You can easily tell reuse existing vs. set up a new one from each other - just look at the root associated with the superblock you've got and check if it's the one you've allocated. Freeing the damn thing if we'd reused an existing one and doing setup otherwise. I realize that it won't do in such form; your for_each_subsys() loop in there really depends on holding cgroup_mutex all the way through. But do we really need it there? Would just skipping the ones that doomed in rebind_subsystems() suffice? Just looking at the size of the damn thing is depressing, TBH - it's quite a bit bigger than *anything* in fs/*.c. And callers are all over the tree ;-/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-25 2:47 ` Al Viro @ 2014-09-25 3:25 ` Tejun Heo 2014-09-25 10:23 ` Zefan Li 0 siblings, 1 reply; 11+ messages in thread From: Tejun Heo @ 2014-09-25 3:25 UTC (permalink / raw) To: Al Viro; +Cc: Andrey Wagin, LKML, Li Zefan Hello, Al. On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote: > > Yeah, it's an ugly thing to work around vfs interface not very > > conducive for filesystems which conditionally create or reuse > > superblocks during mount. There was a thread explaining what's going > > on. Looking up... > > > > http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635 > > Umm... I still don't get it. Could you describe the screnario in which > that percpu_ref_tryget_live() would be called and managed to fail? That was for the initial fix and Li later added the pinning to fix something else. Let's wait for Li to chime in. He knows this part better. > It smells to me like most of the problems here are simply due to having too > many locks and not being able to decide where should they live relative to > ->s_umount. That cgroup_mutex thing feels like something way to coarse... > You have it grabbed/dropped in cgroup_mutex is the outer-most lock as far as cgroup is concerned and not expected to nest under anything which is used by individual controllers. Most of what cgroup core does is low-freq managerial things which don't benefit from finer grained locking and the mount path is one of few surfaces where it interacts with outside in terms of locking, so it's better to keep that path special and everything else simpler. > And that's a single system-wide mutex. Plus there's a slew of workqueues > and really unpleasant abuse of restart_syscall() tossed in for fun - what > happens if some joker triggers that ->mount() _not_ from mount(2)? For cgroup, mount is the userland-visible init interface. It gotta be called from userland. It originally had internal retry loop but syscall restart is simpler. Reviving that loop isn't difficult if it ever becomes necessary. > Then there's a global rwsem nesting inside that sucker. And there's another The rwsem nests inside cgroup_mutex and exists to allow multiple reader accesses to a particular data structure. > mutex in fs/kernfs - also a global one. Are the locking rules documented > anywhere? Lifetime rules, as well... kernfs one shouldn't interact with anything outside kernfs. Its dependency is terminated within kernfs. > Frankly, my first inclination here would be to try using sget() instead of > scanning the list of roots. How painful would it be to split the allocation > and setup of roots, always allocate a new root and have sget() wait > for fs shutdown in progress and decide whether it wants to reuse the existing > one. You can easily tell reuse existing vs. set up a new one from each other - > just look at the root associated with the superblock you've got and check > if it's the one you've allocated. Freeing the damn thing if we'd reused > an existing one and doing setup otherwise. > > I realize that it won't do in such form; your for_each_subsys() loop in there > really depends on holding cgroup_mutex all the way through. But do we really > need it there? Would just skipping the ones that doomed in rebind_subsystems() > suffice? At this point, cgroup core locking is heavily focused on simplicity - cgroup_mutex for the whole thing and css_set_rwsem for css_set reader accesses. It works out pretty well for the rest of the code but the mount path does get tricky. We can definitely relax / separate out locking on subsys iteration for mount path but if possible I'd prefer to pay isolated complexity there instead of spilling it to other places. Anyways, let's wait for Li. At least nobody reported breakage before the recent commit, so we can revert the offending commit for the short term. Thanks. -- tejun ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-25 3:25 ` Tejun Heo @ 2014-09-25 10:23 ` Zefan Li 2014-09-25 15:14 ` Tejun Heo 0 siblings, 1 reply; 11+ messages in thread From: Zefan Li @ 2014-09-25 10:23 UTC (permalink / raw) To: Tejun Heo; +Cc: Al Viro, Andrey Wagin, LKML On 2014/9/25 11:25, Tejun Heo wrote: > Hello, Al. > > On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote: >>> Yeah, it's an ugly thing to work around vfs interface not very >>> conducive for filesystems which conditionally create or reuse >>> superblocks during mount. There was a thread explaining what's going >>> on. Looking up... >>> >>> http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635 >> >> Umm... I still don't get it. Could you describe the screnario in which >> that percpu_ref_tryget_live() would be called and managed to fail? > See this: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/kernel/cgroup.c?id=3a32bd72d77058d768dbb38183ad517f720dd1bc Without the above commit, A scenario like this can happen: 1. Thread A has dropped sb refcnt to 0 but hasn't called percpu_ref_kill(). 2. At this time Thread B calls cgroup_mount() and percpu_ref_tryget() succeeds. 3. After a while Thread C calls cgroup_mount(), but percpu_ref_tryget() keeps returning failure, because the ref has been killed by percpu_ref_kill(). So I had to use kernfs_pin_sb() to prevent thread B from getting the percpu refcnt. Any better idea to fix this? > That was for the initial fix and Li later added the pinning to fix > something else. Let's wait for Li to chime in. He knows this part > better. > >> It smells to me like most of the problems here are simply due to having too >> many locks and not being able to decide where should they live relative to >> ->s_umount. That cgroup_mutex thing feels like something way to coarse... >> You have it grabbed/dropped in > > cgroup_mutex is the outer-most lock as far as cgroup is concerned and > not expected to nest under anything which is used by individual > controllers. Most of what cgroup core does is low-freq managerial > things which don't benefit from finer grained locking and the mount > path is one of few surfaces where it interacts with outside in terms > of locking, so it's better to keep that path special and everything > else simpler. > >> And that's a single system-wide mutex. Plus there's a slew of workqueues >> and really unpleasant abuse of restart_syscall() tossed in for fun - what >> happens if some joker triggers that ->mount() _not_ from mount(2)? > > For cgroup, mount is the userland-visible init interface. It gotta be > called from userland. It originally had internal retry loop but > syscall restart is simpler. Reviving that loop isn't difficult if it > ever becomes necessary. > >> Then there's a global rwsem nesting inside that sucker. And there's another > > The rwsem nests inside cgroup_mutex and exists to allow multiple > reader accesses to a particular data structure. > >> mutex in fs/kernfs - also a global one. Are the locking rules documented >> anywhere? Lifetime rules, as well... > > kernfs one shouldn't interact with anything outside kernfs. Its > dependency is terminated within kernfs. > >> Frankly, my first inclination here would be to try using sget() instead of >> scanning the list of roots. How painful would it be to split the allocation >> and setup of roots, always allocate a new root and have sget() wait >> for fs shutdown in progress and decide whether it wants to reuse the existing >> one. You can easily tell reuse existing vs. set up a new one from each other - >> just look at the root associated with the superblock you've got and check >> if it's the one you've allocated. Freeing the damn thing if we'd reused >> an existing one and doing setup otherwise. >> >> I realize that it won't do in such form; your for_each_subsys() loop in there >> really depends on holding cgroup_mutex all the way through. But do we really >> need it there? Would just skipping the ones that doomed in rebind_subsystems() >> suffice? > > At this point, cgroup core locking is heavily focused on simplicity - > cgroup_mutex for the whole thing and css_set_rwsem for css_set reader > accesses. It works out pretty well for the rest of the code but the > mount path does get tricky. We can definitely relax / separate out > locking on subsys iteration for mount path but if possible I'd prefer > to pay isolated complexity there instead of spilling it to other > places. > > Anyways, let's wait for Li. At least nobody reported breakage before > the recent commit, so we can revert the offending commit for the short > term. > That patch is wrong. We have to use both pinned_sb and new_sb, so please revert it. :( I think we need to put some test cases in tools/testing/selftests/, to prevent this fragile thing from breaking again. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-25 10:23 ` Zefan Li @ 2014-09-25 15:14 ` Tejun Heo 0 siblings, 0 replies; 11+ messages in thread From: Tejun Heo @ 2014-09-25 15:14 UTC (permalink / raw) To: Zefan Li; +Cc: Al Viro, Andrey Wagin, LKML On Thu, Sep 25, 2014 at 06:23:39PM +0800, Zefan Li wrote: > That patch is wrong. We have to use both pinned_sb and new_sb, so > please revert it. :( > > I think we need to put some test cases in tools/testing/selftests/, to > prevent this fragile thing from breaking again. Can you please send a patch to revert it? There have been a couple changes dependent on it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 18:52 ` Al Viro 2014-09-24 19:24 ` Tejun Heo @ 2014-09-24 20:16 ` Al Viro 1 sibling, 0 replies; 11+ messages in thread From: Al Viro @ 2014-09-24 20:16 UTC (permalink / raw) To: Andrey Wagin; +Cc: LKML, Tejun Heo, Li Zefan On Wed, Sep 24, 2014 at 07:52:14PM +0100, Al Viro wrote: > Could somebody explain WTF is the whole construction trying to do? Not > to mention anything else, what *does* this pinning a superblock protect > from? Suppose we have a superblock for the same root with non-NULL ns > and _that_ gets killed. We get hit by the same > percpu_ref_kill(&root->cgrp.self.refcnt); > so what's the point of pinned_sb? Might as well have just bumped the > refcount, superblock or no superblock. And no, delaying that kernfs_kill_sb() > does you no good whatsoever - again, pinned_sb might have nothing to do with > the superblock we are after. Hrm... Scratch the comments re "different superblock" (we are passing NULL ns in that kernfs_mount() below), but... then WTF is that thing trying to do? OK, you've got your active reference to a superblock from kernfs_pin_sb(). You grab root->cgrp.self.refcnt. Suppose it also worked. Now what? You drop cgroup_mutex and proceed to kernfs_mount(). Which calls sget(), looking for exact same thing as your kernfs_pin_sb(). So it finds the same superblock and grab it for you (with ->s_umount held). At which point you drop root->cgrp.self.refcnt and drop the active reference you've got from kernfs_pin_sb(). Pardon me, but what's the point of that song and dance? Who else can make that attempt at grabbing root->cgrp.self.refcnt fail? BTW, what happens if kernfs_fill_super() fails? You get cgroup_kill_sb() triggered by deactivate_locked_super(), which calls kernfs_kill_sb(), which does kernfs_put(). Balancing the kernfs_get() we'd never got around to in kernfs_fill_super()... ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: linux-next: cgroup_mount() falls asleep forever 2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin 2014-09-24 14:29 ` Andrey Wagin @ 2014-09-24 16:13 ` Andrey Wagin 1 sibling, 0 replies; 11+ messages in thread From: Andrey Wagin @ 2014-09-24 16:13 UTC (permalink / raw) To: LKML, Tejun Heo, Li Zefan 2014-09-24 14:31 GMT+04:00 Andrey Wagin <avagin@gmail.com>: > Hi All, > > I execute CRIU tests on linux-next. Today I found that one of tests > hangs up forever. > > [root@linux-next-test linux-next]# git describe HEAD > next-20140922 > [root@linux-next-test ~]# ps axf > ... > 698 ? Ss 0:05 \_ sshd: root@notty > 700 ? Ss 0:00 | \_ bash -x > jenkins-scripts/jenkins-ct.sh jenkins.sh > 706 ? S 0:00 | \_ bash -c ( mount --make-rprivate > / && umount -l /proc && mount -t proc proc /proc/ && bash -x > jenkins.sh) > 707 ? S 0:00 | \_ bash -c ( mount > --make-rprivate / && umount -l /proc && mount -t proc proc /proc/ && > bash -x jenkins.sh) > 711 ? S 0:00 | \_ bash -x jenkins.sh > 2981 ? S 0:05 | \_ bash -x > test/zdtm.sh -C -x .*\(maps01\|maps04\) > 6717 ? S 0:00 | \_ /bin/bash > zdtm/live/static/cgroup00.hook --clean > 6719 ? D 11:13 | \_ mount -t > cgroup none cgclean.cWgK71 -o none,name=zdtmtst > > [root@linux-next-test ~]# cat /proc/6719/stack > [<ffffffff8111fcb7>] msleep+0x37/0x50 > [<ffffffff8114bde2>] cgroup_mount+0x712/0xa50 > [<ffffffff8124d0e9>] mount_fs+0x39/0x1b0 > [<ffffffff8126d76b>] vfs_kern_mount+0x6b/0x150 > [<ffffffff8127075b>] do_mount+0x22b/0xc20 > [<ffffffff81271492>] SyS_mount+0xa2/0x110 > [<ffffffff817e3ba9>] system_call_fastpath+0x12/0x17 > [<ffffffffffffffff>] 0xffffffffffffffff > > Let me know which info do you need for investigation. > > Steps to reproduce: > > git clone https://github.com/xemul/criu.git > cd criu > make > bash test/zdtm.sh static/cgroup00 Steps to reproduce without CRIU: [root@avagin-fc19-cr ~]# mount -t cgroup -o none,name=zdtmtst xxx /mnt/test/ [root@avagin-fc19-cr ~]# mkdir /mnt/test/xxx [root@avagin-fc19-cr ~]# umount /mnt/test/ [root@avagin-fc19-cr ~]# mount -t cgroup -o none,name=zdtmtst xxx /mnt/test/ [1]+ Stopped mount -t cgroup -o none,name=zdtmtst xxx /mnt/test/ [root@avagin-fc19-cr ~]# bg [root@avagin-fc19-cr ~]# cat /proc/633/stack [<ffffffff810e9377>] msleep+0x37/0x50 [<ffffffff811118a2>] cgroup_mount+0x482/0x670 [<ffffffff811e3769>] mount_fs+0x39/0x1b0 [<ffffffff8120157b>] vfs_kern_mount+0x6b/0x150 [<ffffffff8120422b>] do_mount+0x22b/0xc20 [<ffffffff81204f62>] SyS_mount+0xa2/0x110 [<ffffffff816fcc69>] system_call_fastpath+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff > > Thanks, > Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2014-09-25 15:14 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-09-24 10:31 linux-next: cgroup_mount() falls asleep forever Andrey Wagin 2014-09-24 14:29 ` Andrey Wagin 2014-09-24 18:31 ` Cong Wang 2014-09-24 18:52 ` Al Viro 2014-09-24 19:24 ` Tejun Heo 2014-09-25 2:47 ` Al Viro 2014-09-25 3:25 ` Tejun Heo 2014-09-25 10:23 ` Zefan Li 2014-09-25 15:14 ` Tejun Heo 2014-09-24 20:16 ` Al Viro 2014-09-24 16:13 ` Andrey Wagin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox