* [RFC][PATCH 0/3] CGroups: Hierarchy locking/refcount changes
@ 2008-12-10 23:36 menage
2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: menage @ 2008-12-10 23:36 UTC (permalink / raw)
To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm
These patches present an alternative to some of the other cgroups
locking/refcount patches that have been proposed on LKML recently.
Several of those patches aim to address the race opened by moving the
calls to the pre_destroy() callbacks outside of cgroup_mutex; rather
than continuing to patch up the holes caused by that change, these
patches introduce new locking/refcount rules that ultimately allow
the previous atomicity of cgroup_rmdir() to be restored.
These three patches give:
1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
use to prevent changes to its own cgroup tree
2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
memory controller
3/3 - introduce a css_tryget() function similar to the one proposed by
Kamezawa, but avoiding spurious refcount failures in the event
of a race between a css_tryget() and an unsuccessful cgroup_rmdir()
Future patches will likely involve:
- using hierarchy_mutex in place of cgroup_lock() in more subsystems
where appropriate
- restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()
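The usage model these rules aim at can be sketched as follows. This is
an illustrative kernel-style sketch, not code from the patches:
my_subsys and process_cgroup() are hypothetical stand-ins for a real
subsystem and its per-cgroup work, and the fragment is not compilable
outside a kernel tree.

```c
/* Hedged sketch of the intended usage model: a subsystem takes its
 * own hierarchy_mutex, rather than the global cgroup_lock(), when it
 * only needs its own view of the cgroup tree to stay stable.
 * my_subsys and process_cgroup() are hypothetical. */
static void my_subsys_walk_children(struct cgroup *parent)
{
	struct cgroup *child;

	/* Holds off cgroup_create()/cgroup_rmdir() relinking this
	 * subsystem's tree, without taking the global cgroup_mutex. */
	mutex_lock(&my_subsys.hierarchy_mutex);
	list_for_each_entry(child, &parent->children, sibling)
		process_cgroup(child);	/* child cannot be unlinked here */
	mutex_unlock(&my_subsys.hierarchy_mutex);
}
```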
Signed-off-by: Paul Menage <menage@google.com>
^ permalink raw reply [flat|nested] 24+ messages in thread* [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: CGroups: Hierarchy locking/refcount changes menage @ 2008-12-10 23:36 ` menage 2008-12-11 0:37 ` KAMEZAWA Hiroyuki ` (2 more replies) 2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage 2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage 2 siblings, 3 replies; 24+ messages in thread From: menage @ 2008-12-10 23:36 UTC (permalink / raw) To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm [-- Attachment #1: cgroup_hierarchy_lock.patch --] [-- Type: text/plain, Size: 5137 bytes --] This patch adds a hierarchy_mutex to the cgroup_subsys object that protects changes to the hierarchy observed by that subsystem. It is taken by the cgroup subsystem (in addition to cgroup_mutex) for the following operations: - linking a cgroup into that subsystem's cgroup tree - unlinking a cgroup from that subsystem's cgroup tree - moving the subsystem to/from a hierarchy (including across the bind() callback) Thus if the subsystem holds its own hierarchy_mutex, it can safely traverse its own hierarchy. Signed-off-by: Paul Menage <menage@google.com> --- Documentation/cgroups/cgroups.txt | 2 +- include/linux/cgroup.h | 9 ++++++++- kernel/cgroup.c | 37 +++++++++++++++++++++++++++++++++++-- 3 files changed, 44 insertions(+), 4 deletions(-) Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h =================================================================== --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h @@ -337,8 +337,15 @@ struct cgroup_subsys { #define MAX_CGROUP_TYPE_NAMELEN 32 const char *name; - struct cgroupfs_root *root; + /* + * Protects sibling/children links of cgroups in this + * hierarchy, plus protects which hierarchy (or none) the + * subsystem is a part of (i.e. 
root/sibling) + */ + struct mutex hierarchy_mutex; + /* Protected by this->hierarchy_mutex and cgroup_lock() */ + struct cgroupfs_root *root; struct list_head sibling; }; Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c =================================================================== --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro BUG_ON(cgrp->subsys[i]); BUG_ON(!dummytop->subsys[i]); BUG_ON(dummytop->subsys[i]->cgroup != dummytop); + mutex_lock(&ss->hierarchy_mutex); cgrp->subsys[i] = dummytop->subsys[i]; cgrp->subsys[i]->cgroup = cgrp; list_move(&ss->sibling, &root->subsys_list); ss->root = root; if (ss->bind) ss->bind(ss, cgrp); - + mutex_unlock(&ss->hierarchy_mutex); } else if (bit & removed_bits) { /* We're removing this subsystem */ BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]); BUG_ON(cgrp->subsys[i]->cgroup != cgrp); + mutex_lock(&ss->hierarchy_mutex); if (ss->bind) ss->bind(ss, dummytop); dummytop->subsys[i]->cgroup = dummytop; cgrp->subsys[i] = NULL; subsys[i]->root = &rootnode; list_move(&ss->sibling, &rootnode.subsys_list); + mutex_unlock(&ss->hierarchy_mutex); } else if (bit & final_bits) { /* Subsystem state should already exist */ BUG_ON(!cgrp->subsys[i]); @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou cgrp->subsys[ss->subsys_id] = css; } +static void cgroup_lock_hierarchy(struct cgroupfs_root *root) +{ + /* We need to take each hierarchy_mutex in a consistent order */ + int i; + + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { + struct cgroup_subsys *ss = subsys[i]; + if (ss->root == root) + mutex_lock_nested(&ss->hierarchy_mutex, i); + } +} + +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root) +{ + int i; + + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { + struct cgroup_subsys *ss = subsys[i]; + if (ss->root == root) + mutex_unlock(&ss->hierarchy_mutex); + } +} + /* * cgroup_create - create a 
cgroup * @parent: cgroup that will be parent of the new cgroup @@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup init_cgroup_css(css, ss, cgrp); } + cgroup_lock_hierarchy(root); list_add(&cgrp->sibling, &cgrp->parent->children); + cgroup_unlock_hierarchy(root); root->number_of_cgroups++; err = cgroup_create_dir(cgrp, dentry, mode); @@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un if (!list_empty(&cgrp->release_list)) list_del(&cgrp->release_list); spin_unlock(&release_list_lock); - /* delete my sibling from parent->children */ + + cgroup_lock_hierarchy(cgrp->root); + /* delete this cgroup from parent->children */ list_del(&cgrp->sibling); + cgroup_unlock_hierarchy(cgrp->root); + spin_lock(&cgrp->dentry->d_lock); d = dget(cgrp->dentry); spin_unlock(&d->d_lock); @@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st * need to invoke fork callbacks here. */ BUG_ON(!list_empty(&init_task.tasks)); + mutex_init(&ss->hierarchy_mutex); ss->active = 1; } Index: hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt =================================================================== --- hierarchy_lock-mmotm-2008-12-09.orig/Documentation/cgroups/cgroups.txt +++ hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt @@ -528,7 +528,7 @@ example in cpusets, no task may attach b up. void bind(struct cgroup_subsys *ss, struct cgroup *root) -(cgroup_mutex held by caller) +(cgroup_mutex and ss->hierarchy_mutex held by caller) Called when a cgroup subsystem is rebound to a different hierarchy and root cgroup. Currently this will only involve movement between -- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage @ 2008-12-11 0:37 ` KAMEZAWA Hiroyuki 2008-12-11 0:44 ` Paul Menage 2008-12-11 6:30 ` Balbir Singh 2008-12-11 3:05 ` Li Zefan 2008-12-11 6:29 ` Balbir Singh 2 siblings, 2 replies; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 0:37 UTC (permalink / raw) To: menage; +Cc: balbir, containers, linux-kernel, akpm thank you for this work. On Wed, 10 Dec 2008 15:36:55 -0800 menage@google.com wrote: > This patch adds a hierarchy_mutex to the cgroup_subsys object that > protects changes to the hierarchy observed by that subsystem. It is > taken by the cgroup subsystem (in addition to cgroup_mutex) for the > following operations: > > - linking a cgroup into that subsystem's cgroup tree > - unlinking a cgroup from that subsystem's cgroup tree > - moving the subsystem to/from a hierarchy (including across the > bind() callback) > o.k. usage is very clear. > Thus if the subsystem holds its own hierarchy_mutex, it can safely > traverse its own hierarchy. > > Signed-off-by: Paul Menage <menage@google.com> > > --- > > Documentation/cgroups/cgroups.txt | 2 +- > include/linux/cgroup.h | 9 ++++++++- > kernel/cgroup.c | 37 +++++++++++++++++++++++++++++++++++-- > 3 files changed, 44 insertions(+), 4 deletions(-) > > Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h > +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > @@ -337,8 +337,15 @@ struct cgroup_subsys { > #define MAX_CGROUP_TYPE_NAMELEN 32 > const char *name; > > - struct cgroupfs_root *root; > + /* > + * Protects sibling/children links of cgroups in this > + * hierarchy, plus protects which hierarchy (or none) the > + * subsystem is a part of (i.e. 
root/sibling) > + */ > + struct mutex hierarchy_mutex; > > + /* Protected by this->hierarchy_mutex and cgroup_lock() */ > + struct cgroupfs_root *root; > struct list_head sibling; > }; > > Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c > +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro > BUG_ON(cgrp->subsys[i]); > BUG_ON(!dummytop->subsys[i]); > BUG_ON(dummytop->subsys[i]->cgroup != dummytop); > + mutex_lock(&ss->hierarchy_mutex); > cgrp->subsys[i] = dummytop->subsys[i]; > cgrp->subsys[i]->cgroup = cgrp; > list_move(&ss->sibling, &root->subsys_list); > ss->root = root; > if (ss->bind) > ss->bind(ss, cgrp); > - > + mutex_unlock(&ss->hierarchy_mutex); > } else if (bit & removed_bits) { > /* We're removing this subsystem */ > BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]); > BUG_ON(cgrp->subsys[i]->cgroup != cgrp); > + mutex_lock(&ss->hierarchy_mutex); > if (ss->bind) > ss->bind(ss, dummytop); > dummytop->subsys[i]->cgroup = dummytop; > cgrp->subsys[i] = NULL; > subsys[i]->root = &rootnode; > list_move(&ss->sibling, &rootnode.subsys_list); > + mutex_unlock(&ss->hierarchy_mutex); > } else if (bit & final_bits) { > /* Subsystem state should already exist */ > BUG_ON(!cgrp->subsys[i]); > @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou > cgrp->subsys[ss->subsys_id] = css; > } > > +static void cgroup_lock_hierarchy(struct cgroupfs_root *root) > +{ > + /* We need to take each hierarchy_mutex in a consistent order */ > + int i; > + > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > + struct cgroup_subsys *ss = subsys[i]; > + if (ss->root == root) > + mutex_lock_nested(&ss->hierarchy_mutex, i); > + } > +} > + > +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root) > +{ > + int i; > + > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > + struct 
cgroup_subsys *ss = subsys[i]; > + if (ss->root == root) > + mutex_unlock(&ss->hierarchy_mutex); > + } > +} > + Maybe no problem..but I don't like releasing lock in the order of acquiring lock. for (i = CGROUP_SUBSYS_COUNT - 1; i >=0; i--) ? -Kame ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-11 0:37 ` KAMEZAWA Hiroyuki @ 2008-12-11 0:44 ` Paul Menage 2008-12-11 6:30 ` Balbir Singh 1 sibling, 0 replies; 24+ messages in thread From: Paul Menage @ 2008-12-11 0:44 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm On Wed, Dec 10, 2008 at 4:37 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: >> +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root) >> +{ >> + int i; >> + >> + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { >> + struct cgroup_subsys *ss = subsys[i]; >> + if (ss->root == root) >> + mutex_unlock(&ss->hierarchy_mutex); >> + } >> +} >> + > Maybe no problem..but I don't like releasing lock in the order of acquiring lock. > > for (i = CGROUP_SUBSYS_COUNT - 1; i >=0; i--) ? The order that you release the locks is irrelevant for correctness. In this case, since the only callers of cgroup_lock_hierarchy() also hold cgroup_mutex and hence can't race with one another, the order of locking is irrelevant for correctness too - right now the locking order is just designed to keep lockdep happy. I think that the reverse-ordered loop is less readable for no gain. Paul ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-11 0:37 ` KAMEZAWA Hiroyuki 2008-12-11 0:44 ` Paul Menage @ 2008-12-11 6:30 ` Balbir Singh 1 sibling, 0 replies; 24+ messages in thread From: Balbir Singh @ 2008-12-11 6:30 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: menage, containers, linux-kernel, akpm * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 09:37:33]: > thank you for this work. > > On Wed, 10 Dec 2008 15:36:55 -0800 > menage@google.com wrote: > > > This patch adds a hierarchy_mutex to the cgroup_subsys object that > > protects changes to the hierarchy observed by that subsystem. It is > > taken by the cgroup subsystem (in addition to cgroup_mutex) for the > > following operations: > > > > - linking a cgroup into that subsystem's cgroup tree > > - unlinking a cgroup from that subsystem's cgroup tree > > - moving the subsystem to/from a hierarchy (including across the > > bind() callback) > > > o.k. usage is very clear. > > > Thus if the subsystem holds its own hierarchy_mutex, it can safely > > traverse its own hierarchy. > > > > Signed-off-by: Paul Menage <menage@google.com> > > > > --- > > > > Documentation/cgroups/cgroups.txt | 2 +- > > include/linux/cgroup.h | 9 ++++++++- > > kernel/cgroup.c | 37 +++++++++++++++++++++++++++++++++++-- > > 3 files changed, 44 insertions(+), 4 deletions(-) > > > > Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > > =================================================================== > > --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h > > +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > > @@ -337,8 +337,15 @@ struct cgroup_subsys { > > #define MAX_CGROUP_TYPE_NAMELEN 32 > > const char *name; > > > > - struct cgroupfs_root *root; > > + /* > > + * Protects sibling/children links of cgroups in this > > + * hierarchy, plus protects which hierarchy (or none) the > > + * subsystem is a part of (i.e. 
root/sibling) > > + */ > > + struct mutex hierarchy_mutex; > > > > + /* Protected by this->hierarchy_mutex and cgroup_lock() */ > > + struct cgroupfs_root *root; > > struct list_head sibling; > > }; > > > > Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > > =================================================================== > > --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c > > +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > > @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro > > BUG_ON(cgrp->subsys[i]); > > BUG_ON(!dummytop->subsys[i]); > > BUG_ON(dummytop->subsys[i]->cgroup != dummytop); > > + mutex_lock(&ss->hierarchy_mutex); > > cgrp->subsys[i] = dummytop->subsys[i]; > > cgrp->subsys[i]->cgroup = cgrp; > > list_move(&ss->sibling, &root->subsys_list); > > ss->root = root; > > if (ss->bind) > > ss->bind(ss, cgrp); > > - > > + mutex_unlock(&ss->hierarchy_mutex); > > } else if (bit & removed_bits) { > > /* We're removing this subsystem */ > > BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]); > > BUG_ON(cgrp->subsys[i]->cgroup != cgrp); > > + mutex_lock(&ss->hierarchy_mutex); > > if (ss->bind) > > ss->bind(ss, dummytop); > > dummytop->subsys[i]->cgroup = dummytop; > > cgrp->subsys[i] = NULL; > > subsys[i]->root = &rootnode; > > list_move(&ss->sibling, &rootnode.subsys_list); > > + mutex_unlock(&ss->hierarchy_mutex); > > } else if (bit & final_bits) { > > /* Subsystem state should already exist */ > > BUG_ON(!cgrp->subsys[i]); > > @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou > > cgrp->subsys[ss->subsys_id] = css; > > } > > > > +static void cgroup_lock_hierarchy(struct cgroupfs_root *root) > > +{ > > + /* We need to take each hierarchy_mutex in a consistent order */ > > + int i; > > + > > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > > + struct cgroup_subsys *ss = subsys[i]; > > + if (ss->root == root) > > + mutex_lock_nested(&ss->hierarchy_mutex, i); > > + } > > +} > > + > > +static void 
cgroup_unlock_hierarchy(struct cgroupfs_root *root) > > +{ > > + int i; > > + > > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > > + struct cgroup_subsys *ss = subsys[i]; > > + if (ss->root == root) > > + mutex_unlock(&ss->hierarchy_mutex); > > + } > > +} > > + > Maybe no problem..but I don't like releasing lock in the order of acquiring lock. > > for (i = CGROUP_SUBSYS_COUNT - 1; i >=0; i--) ? > Good point! -- Balbir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage 2008-12-11 0:37 ` KAMEZAWA Hiroyuki @ 2008-12-11 3:05 ` Li Zefan 2008-12-11 17:07 ` Paul Menage 2008-12-11 6:29 ` Balbir Singh 2 siblings, 1 reply; 24+ messages in thread From: Li Zefan @ 2008-12-11 3:05 UTC (permalink / raw) To: menage; +Cc: kamezawa.hiroyu, balbir, containers, linux-kernel, akpm > +static void cgroup_lock_hierarchy(struct cgroupfs_root *root) > +{ > + /* We need to take each hierarchy_mutex in a consistent order */ This comment is not so clear. Do you mean for_each_subsys() can't be used here? But this function is used in cgroup.c internally, and always called with cgroup_lock held, so it's OK to use for_each_subsys(). > + int i; > + > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > + struct cgroup_subsys *ss = subsys[i]; > + if (ss->root == root) > + mutex_lock_nested(&ss->hierarchy_mutex, i); > + } > +} ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-11 3:05 ` Li Zefan @ 2008-12-11 17:07 ` Paul Menage 0 siblings, 0 replies; 24+ messages in thread From: Paul Menage @ 2008-12-11 17:07 UTC (permalink / raw) To: Li Zefan; +Cc: kamezawa.hiroyu, balbir, containers, linux-kernel, akpm On Wed, Dec 10, 2008 at 7:05 PM, Li Zefan <lizf@cn.fujitsu.com> wrote: > > But this function is used in cgroup.c internally, and always called > with cgroup_lock held, so it's OK to use for_each_subsys(). This is true, but it doesn't hurt to be future-proof and not require cgroup_mutex to be held. Paul ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage 2008-12-11 0:37 ` KAMEZAWA Hiroyuki 2008-12-11 3:05 ` Li Zefan @ 2008-12-11 6:29 ` Balbir Singh 2008-12-11 17:09 ` Paul Menage 2 siblings, 1 reply; 24+ messages in thread From: Balbir Singh @ 2008-12-11 6:29 UTC (permalink / raw) To: menage; +Cc: kamezawa.hiroyu, containers, linux-kernel, akpm * menage@google.com <menage@google.com> [2008-12-10 15:36:55]: > This patch adds a hierarchy_mutex to the cgroup_subsys object that > protects changes to the hierarchy observed by that subsystem. It is > taken by the cgroup subsystem (in addition to cgroup_mutex) for the > following operations: > > - linking a cgroup into that subsystem's cgroup tree > - unlinking a cgroup from that subsystem's cgroup tree > - moving the subsystem to/from a hierarchy (including across the > bind() callback) > > Thus if the subsystem holds its own hierarchy_mutex, it can safely > traverse its own hierarchy. > Ths sounds reasonable, a further abstraction in the future could be to provide the visitor pattern. Allow cgroups to do the walking and have callbacks called during the visit. 
> Signed-off-by: Paul Menage <menage@google.com> > > --- > > Documentation/cgroups/cgroups.txt | 2 +- > include/linux/cgroup.h | 9 ++++++++- > kernel/cgroup.c | 37 +++++++++++++++++++++++++++++++++++-- > 3 files changed, 44 insertions(+), 4 deletions(-) > > Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h > +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > @@ -337,8 +337,15 @@ struct cgroup_subsys { > #define MAX_CGROUP_TYPE_NAMELEN 32 > const char *name; > > - struct cgroupfs_root *root; > + /* > + * Protects sibling/children links of cgroups in this > + * hierarchy, plus protects which hierarchy (or none) the > + * subsystem is a part of (i.e. root/sibling) > + */ > + struct mutex hierarchy_mutex; > > + /* Protected by this->hierarchy_mutex and cgroup_lock() */ > + struct cgroupfs_root *root; > struct list_head sibling; > }; > > Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c > +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro > BUG_ON(cgrp->subsys[i]); > BUG_ON(!dummytop->subsys[i]); > BUG_ON(dummytop->subsys[i]->cgroup != dummytop); > + mutex_lock(&ss->hierarchy_mutex); > cgrp->subsys[i] = dummytop->subsys[i]; > cgrp->subsys[i]->cgroup = cgrp; > list_move(&ss->sibling, &root->subsys_list); > ss->root = root; > if (ss->bind) > ss->bind(ss, cgrp); > - > + mutex_unlock(&ss->hierarchy_mutex); > } else if (bit & removed_bits) { > /* We're removing this subsystem */ > BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]); > BUG_ON(cgrp->subsys[i]->cgroup != cgrp); > + mutex_lock(&ss->hierarchy_mutex); > if (ss->bind) > ss->bind(ss, dummytop); > dummytop->subsys[i]->cgroup = dummytop; > cgrp->subsys[i] = NULL; > 
subsys[i]->root = &rootnode; > list_move(&ss->sibling, &rootnode.subsys_list); > + mutex_unlock(&ss->hierarchy_mutex); > } else if (bit & final_bits) { > /* Subsystem state should already exist */ > BUG_ON(!cgrp->subsys[i]); > @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou > cgrp->subsys[ss->subsys_id] = css; > } > > +static void cgroup_lock_hierarchy(struct cgroupfs_root *root) > +{ > + /* We need to take each hierarchy_mutex in a consistent order */ > + int i; > + > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > + struct cgroup_subsys *ss = subsys[i]; > + if (ss->root == root) > + mutex_lock_nested(&ss->hierarchy_mutex, i); > + } > +} > + > +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root) > +{ > + int i; > + > + for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > + struct cgroup_subsys *ss = subsys[i]; > + if (ss->root == root) > + mutex_unlock(&ss->hierarchy_mutex); > + } > +} > + > /* > * cgroup_create - create a cgroup > * @parent: cgroup that will be parent of the new cgroup > @@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup > init_cgroup_css(css, ss, cgrp); > } > > + cgroup_lock_hierarchy(root); > list_add(&cgrp->sibling, &cgrp->parent->children); > + cgroup_unlock_hierarchy(root); > root->number_of_cgroups++; > > err = cgroup_create_dir(cgrp, dentry, mode); > @@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un > if (!list_empty(&cgrp->release_list)) > list_del(&cgrp->release_list); > spin_unlock(&release_list_lock); > - /* delete my sibling from parent->children */ > + > + cgroup_lock_hierarchy(cgrp->root); > + /* delete this cgroup from parent->children */ > list_del(&cgrp->sibling); > + cgroup_unlock_hierarchy(cgrp->root); > + > spin_lock(&cgrp->dentry->d_lock); > d = dget(cgrp->dentry); > spin_unlock(&d->d_lock); > @@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st > * need to invoke fork callbacks here. 
*/ > BUG_ON(!list_empty(&init_task.tasks)); > > + mutex_init(&ss->hierarchy_mutex); > ss->active = 1; > } > > Index: hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/Documentation/cgroups/cgroups.txt > +++ hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt > @@ -528,7 +528,7 @@ example in cpusets, no task may attach b > up. > > void bind(struct cgroup_subsys *ss, struct cgroup *root) > -(cgroup_mutex held by caller) > +(cgroup_mutex and ss->hierarchy_mutex held by caller) > Seems reasonable, I was wondering if instead of acquiring a mutex per subsystem that shares the root, if we can collapse it to a single mutex and prevent cgroup from changing binding. Those are optimizations that we can think of later Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Balbir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex 2008-12-11 6:29 ` Balbir Singh @ 2008-12-11 17:09 ` Paul Menage 0 siblings, 0 replies; 24+ messages in thread From: Paul Menage @ 2008-12-11 17:09 UTC (permalink / raw) To: balbir, menage, kamezawa.hiroyu, containers, linux-kernel, akpm On Wed, Dec 10, 2008 at 10:29 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > Seems reasonable, I was wondering if instead of acquiring a mutex per > subsystem that shares the root, if we can collapse it to a single > mutex and prevent cgroup from changing binding. Those are > optimizations that we can think of later You mean a single hierarchy_mutex per root? But given that one of the things being protected is which root you're attached to, if you're trying to prevent hierarchy changes then which root's hierarchy_mutex would you lock? Paul ^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: CGroups: Hierarchy locking/refcount changes menage 2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage @ 2008-12-10 23:36 ` menage 2008-12-11 0:49 ` KAMEZAWA Hiroyuki 2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage 2 siblings, 1 reply; 24+ messages in thread From: menage @ 2008-12-10 23:36 UTC (permalink / raw) To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm [-- Attachment #1: mm-destroy-fix.patch --] [-- Type: text/plain, Size: 2270 bytes --] This patch updates the memory controller to use its hierarchy_mutex rather than calling cgroup_lock() to protected against cgroup_mkdir()/cgroup_rmdir() from occurring in its hierarchy. Signed-off-by: Paul Menage <menage@google.com> --- mm/memcontrol.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) Index: hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c =================================================================== --- hierarchy_lock-mmotm-2008-12-09.orig/mm/memcontrol.c +++ hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c @@ -154,7 +154,7 @@ struct mem_cgroup { /* * While reclaiming in a hiearchy, we cache the last child we - * reclaimed from. Protected by cgroup_lock() + * reclaimed from. Protected by hierarchy_mutex */ struct mem_cgroup *last_scanned_child; /* @@ -529,7 +529,7 @@ unsigned long mem_cgroup_isolate_pages(u /* * This routine finds the DFS walk successor. 
This routine should be - * called with cgroup_mutex held + * called with hierarchy_mutex held */ static struct mem_cgroup * mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem) @@ -598,7 +598,7 @@ mem_cgroup_get_first_node(struct mem_cgr /* * Scan all children under the mem_cgroup mem */ - cgroup_lock(); + mutex_lock(&mem_cgroup_subsys.hierarchy_mutex); if (list_empty(&root_mem->css.cgroup->children)) { ret = root_mem; goto done; @@ -619,7 +619,7 @@ mem_cgroup_get_first_node(struct mem_cgr done: root_mem->last_scanned_child = ret; - cgroup_unlock(); + mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex); return ret; } @@ -683,18 +683,16 @@ static int mem_cgroup_hierarchical_recla while (next_mem != root_mem) { if (next_mem->obsolete) { mem_cgroup_put(next_mem); - cgroup_lock(); next_mem = mem_cgroup_get_first_node(root_mem); - cgroup_unlock(); continue; } ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap, get_swappiness(next_mem)); if (mem_cgroup_check_under_limit(root_mem)) return 0; - cgroup_lock(); + mutex_lock(&mem_cgroup_subsys.hierarchy_mutex); next_mem = mem_cgroup_get_next_node(next_mem, root_mem); - cgroup_unlock(); + mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex); } return ret; } -- ^ permalink raw reply [flat|nested] 24+ messages in thread
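The locking discipline this patch settles on is worth stating
explicitly: mem_cgroup_subsys.hierarchy_mutex is held only across the
tree-walk step that picks the next child, and is dropped before any
actual reclaim, so reclaim itself never runs under the mutex. A
condensed kernel-style sketch of the loop above (not compilable
standalone; gfp_mask/noswap as in the patch):

```c
/* Sketch of the discipline established above: hold the subsystem's
 * hierarchy_mutex only while stepping through the tree, never across
 * the reclaim work itself. */
while (next_mem != root_mem) {
	ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
					   get_swappiness(next_mem));	/* unlocked */
	if (mem_cgroup_check_under_limit(root_mem))
		return 0;

	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
	next_mem = mem_cgroup_get_next_node(next_mem, root_mem);	/* locked walk */
	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
}
```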
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage @ 2008-12-11 0:49 ` KAMEZAWA Hiroyuki 2008-12-11 0:52 ` Paul Menage 0 siblings, 1 reply; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 0:49 UTC (permalink / raw) To: menage; +Cc: balbir, containers, linux-kernel, akpm On Wed, 10 Dec 2008 15:36:56 -0800 menage@google.com wrote: > This patch updates the memory controller to use its hierarchy_mutex > rather than calling cgroup_lock() to protected against > cgroup_mkdir()/cgroup_rmdir() from occurring in its hierarchy. > > Signed-off-by: Paul Menage <menage@google.com> > Hmm, can we avoid following kind of dead-lock by this ? Assume subsys A and B is on the same hierarchy, A's subsys ID is smaller than B's. (I assume memory and cpuset for A and B, but just an example.) an operation like rmdir() in somewhere. hierarchy_lock for A (acquired) hierarchy_lock for B (waiting) in subsys A. mmap_sem (acquired) hierarchy_lock for A (waiting) in subsys B. hierarchy_lock for B (aquired) mmap_sem (waiting) Thanks, -Kame > --- > mm/memcontrol.c | 14 ++++++-------- > 1 file changed, 6 insertions(+), 8 deletions(-) > > Index: hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/mm/memcontrol.c > +++ hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c > @@ -154,7 +154,7 @@ struct mem_cgroup { > > /* > * While reclaiming in a hiearchy, we cache the last child we > - * reclaimed from. Protected by cgroup_lock() > + * reclaimed from. Protected by hierarchy_mutex > */ > struct mem_cgroup *last_scanned_child; > /* > @@ -529,7 +529,7 @@ unsigned long mem_cgroup_isolate_pages(u > > /* > * This routine finds the DFS walk successor. 
This routine should be > - * called with cgroup_mutex held > + * called with hierarchy_mutex held > */ > static struct mem_cgroup * > mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem) > @@ -598,7 +598,7 @@ mem_cgroup_get_first_node(struct mem_cgr > /* > * Scan all children under the mem_cgroup mem > */ > - cgroup_lock(); > + mutex_lock(&mem_cgroup_subsys.hierarchy_mutex); > if (list_empty(&root_mem->css.cgroup->children)) { > ret = root_mem; > goto done; > @@ -619,7 +619,7 @@ mem_cgroup_get_first_node(struct mem_cgr > > done: > root_mem->last_scanned_child = ret; > - cgroup_unlock(); > + mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex); > return ret; > } > > @@ -683,18 +683,16 @@ static int mem_cgroup_hierarchical_recla > while (next_mem != root_mem) { > if (next_mem->obsolete) { > mem_cgroup_put(next_mem); > - cgroup_lock(); > next_mem = mem_cgroup_get_first_node(root_mem); > - cgroup_unlock(); > continue; > } > ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap, > get_swappiness(next_mem)); > if (mem_cgroup_check_under_limit(root_mem)) > return 0; > - cgroup_lock(); > + mutex_lock(&mem_cgroup_subsys.hierarchy_mutex); > next_mem = mem_cgroup_get_next_node(next_mem, root_mem); > - cgroup_unlock(); > + mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex); > } > return ret; > } > > -- > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 0:49 ` KAMEZAWA Hiroyuki @ 2008-12-11 0:52 ` Paul Menage 2008-12-11 1:05 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 24+ messages in thread From: Paul Menage @ 2008-12-11 0:52 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > an operation like rmdir() in somewhere. > hierarchy_lock for A (acquired) > hierarchy_lock for B (waiting) > > in subsys A. > mmap_sem (acquired) > hierarchy_lock for A (waiting) > in subsys B. > hierarchy_lock for B (aquired) > mmap_sem (waiting) > That's a valid deadlock - you'd need to require the mmap_sem nests either inside all hierarchy_mutexes or else outside all of them. Paul ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 0:52 ` Paul Menage @ 2008-12-11 1:05 ` KAMEZAWA Hiroyuki 2008-12-11 6:33 ` Balbir Singh 0 siblings, 1 reply; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 1:05 UTC (permalink / raw) To: Paul Menage; +Cc: balbir, containers, linux-kernel, akpm On Wed, 10 Dec 2008 16:52:57 -0800 Paul Menage <menage@google.com> wrote: > On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki > <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > an operation like rmdir() in somewhere. > > hierarchy_lock for A (acquired) > > hierarchy_lock for B (waiting) > > > > in subsys A. > > mmap_sem (acquired) > > hierarchy_lock for A (waiting) > > in subsys B. > > hierarchy_lock for B (acquired) > > mmap_sem (waiting) > > > > That's a valid deadlock - you'd need to require the mmap_sem nests > either inside all hierarchy_mutexes or else outside all of them. > This was a deadlock actually found between memcg and cpuset. Another one was: an operation like rmdir() somewhere: hierarchy_lock for memcg (acquired) hierarchy_lock for B (waiting) in subsys B: hierarchy_lock for B (acquired) has to do some memory reclaim -> hierarchy_lock for memcg (waiting) I have no objections to hierarchy_lock itself, but the calling context into memcg is very complicated, and simply replacing these locks will be only a small help. -Kame ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 1:05 ` KAMEZAWA Hiroyuki @ 2008-12-11 6:33 ` Balbir Singh 2008-12-11 6:47 ` Li Zefan 2008-12-11 6:53 ` KAMEZAWA Hiroyuki 0 siblings, 2 replies; 24+ messages in thread From: Balbir Singh @ 2008-12-11 6:33 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: Paul Menage, containers, linux-kernel, akpm * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 10:05:01]: > On Wed, 10 Dec 2008 16:52:57 -0800 > Paul Menage <menage@google.com> wrote: > > > On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki > > <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > > > an operation like rmdir() in somewhere. > > > hierarchy_lock for A (acquired) > > > hierarchy_lock for B (waiting) > > > > > > in subsys A. > > > mmap_sem (acquired) > > > hierarchy_lock for A (waiting) > > > in subsys B. > > > hierarchy_lock for B (aquired) > > > mmap_sem (waiting) > > > > > > > That's a valid deadlock - you'd need to require the mmap_sem nests > > either inside all hierarchy_mutexes or else outside all of them. > > > This was a found dead lock between memcg and cpuset. > > another one was > > an operation like rmdir() in somewhere. > hierarchy_lock for memcg (acquired) > hierarchy_lock for B (waiting) > > in subsys B. > hierarchy_lock for B (aquired) But then the hierarchy_locks acquired will be different right? > have to do some memory reclaim -> hierarchy_lock for memcg (waiting) > > I have no objections to hierarchy_lock itself but calling context to memcg is very > complicated and simple replace of these locks will be just a small help. Could you please explain the race better? -- Balbir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 6:33 ` Balbir Singh @ 2008-12-11 6:47 ` Li Zefan 0 siblings, 0 replies; 24+ messages in thread From: Li Zefan @ 2008-12-11 6:47 UTC (permalink / raw) To: KAMEZAWA Hiroyuki, Paul Menage, containers, linux-kernel, akpm Balbir Singh wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 10:05:01]: > >> On Wed, 10 Dec 2008 16:52:57 -0800 >> Paul Menage <menage@google.com> wrote: >> >>> On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki >>> <kamezawa.hiroyu@jp.fujitsu.com> wrote: >>>> an operation like rmdir() in somewhere. >>>> hierarchy_lock for A (acquired) >>>> hierarchy_lock for B (waiting) >>>> >>>> in subsys A. >>>> mmap_sem (acquired) >>>> hierarchy_lock for A (waiting) >>>> in subsys B. >>>> hierarchy_lock for B (acquired) >>>> mmap_sem (waiting) >>>> >>> That's a valid deadlock - you'd need to require the mmap_sem nests >>> either inside all hierarchy_mutexes or else outside all of them. >>> >> This was a found dead lock between memcg and cpuset. >> >> another one was >> >> an operation like rmdir() in somewhere. >> hierarchy_lock for memcg (acquired) >> hierarchy_lock for B (waiting) >> >> in subsys B. >> hierarchy_lock for B (acquired) > > But then the hierarchy_locks acquired will be different right? > Yes, I'm worried about this too. The lock order taken by cgroup_lock_hierarchy() is: lock A -> lock B -> lock C But a call chain may end up with: ... -> lock B -> ... -> lock A -> ... So although this hierarchy lock proposal can solve the specific deadlock between cpuset and memcg, by making cpuset hold cgroup_lock and memcg hold hierarchy_lock, we'll probably encounter the other deadlocks described above. >> have to do some memory reclaim -> hierarchy_lock for memcg (waiting) >> >> I have no objections to hierarchy_lock itself but calling context to memcg is very complicated and simple replace of these locks will be just a small help.
> > Could you please explain the race better? > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 6:33 ` Balbir Singh 2008-12-11 6:47 ` Li Zefan @ 2008-12-11 6:53 ` KAMEZAWA Hiroyuki 2008-12-11 17:05 ` Paul Menage 1 sibling, 1 reply; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 6:53 UTC (permalink / raw) To: balbir; +Cc: Paul Menage, containers, linux-kernel, akpm On Thu, 11 Dec 2008 12:03:07 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 10:05:01]: > > > On Wed, 10 Dec 2008 16:52:57 -0800 > > Paul Menage <menage@google.com> wrote: > > > > > On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki > > > <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > > > > > an operation like rmdir() in somewhere. > > > > hierarchy_lock for A (acquired) > > > > hierarchy_lock for B (waiting) > > > > > > > > in subsys A. > > > > mmap_sem (acquired) > > > > hierarchy_lock for A (waiting) > > > > in subsys B. > > > > hierarchy_lock for B (aquired) > > > > mmap_sem (waiting) > > > > > > > > > > That's a valid deadlock - you'd need to require the mmap_sem nests > > > either inside all hierarchy_mutexes or else outside all of them. > > > > > This was a found dead lock between memcg and cpuset. > > > > another one was > > > > an operation like rmdir() in somewhere. > > hierarchy_lock for memcg (acquired) > > hierarchy_lock for B (waiting) > > > > in subsys B. > > hierarchy_lock for B (aquired) > > But then the hierarchy_locks acquired will be different right? > yes. > > have to do some memory reclaim -> hierarchy_lock for memcg (waiting) > > > > I have no objections to hierarchy_lock itself but calling context to memcg is very > > complicated and simple replace of these locks will be just a small help. > > Could you please explain the race better? > I explained... One example was mmap_sem. above one is recursive lock in cpuset you tried to fix. (Fortunately, cpuset's subsys ID is smaller than memcg's one. 
you'll not see it in a real environment.) And there are other, as-yet-unknown subsystems planned: bio, checkpoint, etc. Could you write a document explaining what kinds of lock nesting are allowed before merging this? Thanks, -Kame ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 6:53 ` KAMEZAWA Hiroyuki @ 2008-12-11 17:05 ` Paul Menage 2008-12-12 1:12 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 24+ messages in thread From: Paul Menage @ 2008-12-11 17:05 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm On Wed, Dec 10, 2008 at 10:53 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > Could you write a document to explain what kind of nest of locks are allowed > before merging this ? We should probably treat it similarly to the cpuset "callback_mutex". I.e. memory allocations are most likely forbidden when you're holding it. Paul ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller 2008-12-11 17:05 ` Paul Menage @ 2008-12-12 1:12 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-12 1:12 UTC (permalink / raw) To: Paul Menage; +Cc: balbir, containers, linux-kernel, akpm On Thu, 11 Dec 2008 09:05:47 -0800 Paul Menage <menage@google.com> wrote: > On Wed, Dec 10, 2008 at 10:53 PM, KAMEZAWA Hiroyuki > <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > Could you write a document to explain what kind of nest of locks are allowed > > before merging this ? > > We should probably treat it similarly to the cpuset "callback_mutex". > I.e. memory allocations are most likely forbidden when you're holding > it. > Hmm. In any case, the hierarchy mutex should not be acquired deep in the kernel call stack. (And my patch will make memcg's hierarchy walk lockless.) -Kame ^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 3/3] CGroups: Add css_tryget() 2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: CGroups: Hierarchy locking/refcount changes menage 2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage 2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage @ 2008-12-10 23:36 ` menage 2008-12-11 0:52 ` KAMEZAWA Hiroyuki 2008-12-11 5:15 ` KAMEZAWA Hiroyuki 2 siblings, 2 replies; 24+ messages in thread From: menage @ 2008-12-10 23:36 UTC (permalink / raw) To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm [-- Attachment #1: cgroup_refcnt.patch --] [-- Type: text/plain, Size: 6424 bytes --] This patch adds css_tryget(), that obtains a counted reference on a CSS. It is used in situations where the caller has a "weak" reference to the CSS, i.e. one that does not protect the cgroup from removal via a reference count, but would instead be cleaned up by a destroy() callback. css_tryget() will return true on success, or false if the cgroup is being removed. This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but with the difference that in the event of css_tryget() racing with a cgroup_rmdir(), css_tryget() will only return false if the cgroup really does get removed. This implementation is done by biasing css->refcnt, so that a refcnt of 1 means "releasable" and 0 means "released or releasing". In the event of a race, css_tryget() distinguishes between "released" and "releasing" by checking for the CSS_REMOVED flag in css->flags. 
Signed-off-by: Paul Menage <menage@google.com> --- include/linux/cgroup.h | 33 ++++++++++++++++++++++----- kernel/cgroup.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 80 insertions(+), 11 deletions(-) Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h =================================================================== --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h @@ -52,9 +52,9 @@ struct cgroup_subsys_state { * hierarchy structure */ struct cgroup *cgroup; - /* State maintained by the cgroup system to allow - * subsystems to be "busy". Should be accessed via css_get() - * and css_put() */ + /* State maintained by the cgroup system to allow subsystems + * to be "busy". Should be accessed via css_get(), + * css_tryget() and and css_put(). */ atomic_t refcnt; @@ -64,11 +64,14 @@ struct cgroup_subsys_state { /* bits in struct cgroup_subsys_state flags field */ enum { CSS_ROOT, /* This CSS is the root of the subsystem */ + CSS_REMOVED, /* This CSS is dead */ }; /* - * Call css_get() to hold a reference on the cgroup; - * + * Call css_get() to hold a reference on the css; it can be used + * for a reference obtained via: + * - an existing ref-counted reference to the css + * - task->cgroups for a locked task */ static inline void css_get(struct cgroup_subsys_state *css) @@ -77,9 +80,27 @@ static inline void css_get(struct cgroup if (!test_bit(CSS_ROOT, &css->flags)) atomic_inc(&css->refcnt); } + +/* + * Call css_tryget() to take a reference on a css if your existing + * (known-valid) reference isn't already ref-counted. Returns false if + * the css has been destroyed. 
+ */ + +static inline bool css_tryget(struct cgroup_subsys_state *css) +{ + if (test_bit(CSS_ROOT, &css->flags)) + return true; + while (!atomic_inc_not_zero(&css->refcnt)) { + if (test_bit(CSS_REMOVED, &css->flags)) + return false; + } + return true; +} + /* * css_put() should be called to release a reference taken by - * css_get() + * css_get() or css_tryget() */ extern void __css_put(struct cgroup_subsys_state *css); Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c =================================================================== --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c @@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou struct cgroup *cgrp) { css->cgroup = cgrp; - atomic_set(&css->refcnt, 0); + atomic_set(&css->refcnt, 1); css->flags = 0; if (cgrp == dummytop) set_bit(CSS_ROOT, &css->flags); @@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg { /* Check the reference count on each subsystem. Since we * already established that there are no tasks in the - * cgroup, if the css refcount is also 0, then there should + * cgroup, if the css refcount is also 1, then there should * be no outstanding references, so the subsystem is safe to * destroy. We scan across all subsystems rather than using * the per-hierarchy linked list of mounted subsystems since @@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg * matter, since it can only happen if the cgroup * has been deleted and hence no longer needs the * release agent to be called anyway. */ - if (css && atomic_read(&css->refcnt)) + if (css && (atomic_read(&css->refcnt) > 1)) return 1; } return 0; } +/* + * Atomically mark all of the cgroup's CSS objects as + * CSS_REMOVED. Return true on success, or false if the cgroup has + * busy subsystems. 
Call with cgroup_mutex held + */ + +static int cgroup_clear_css_refs(struct cgroup *cgrp) +{ + struct cgroup_subsys *ss; + unsigned long flags; + bool failed = false; + local_irq_save(flags); + for_each_subsys(cgrp->root, ss) { + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; + int refcnt; + do { + /* We can only remove a CSS with a refcnt==1 */ + refcnt = atomic_read(&css->refcnt); + if (refcnt > 1) { + failed = true; + goto done; + } + BUG_ON(!refcnt); + /* + * Drop the refcnt to 0 while we check other + * subsystems. This will cause any racing + * css_tryget() to spin until we set the + * CSS_REMOVED bits or abort + */ + } while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt); + } + done: + for_each_subsys(cgrp->root, ss) { + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; + if (failed) { + /* Restore old refcnt if necessary */ + if (!atomic_read(&css->refcnt)) + atomic_set(&css->refcnt, 1); + } else { + /* Commit the fact that the CSS is removed */ + set_bit(CSS_REMOVED, &css->flags); + } + } + local_irq_restore(flags); + return !failed; +} + static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry) { struct cgroup *cgrp = dentry->d_fsdata; @@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children) - || cgroup_has_css_refs(cgrp)) { + || !cgroup_clear_css_refs(cgrp)) { mutex_unlock(&cgroup_mutex); return -EBUSY; } @@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat { struct cgroup *cgrp = css->cgroup; rcu_read_lock(); - if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) { + if ((atomic_dec_return(&css->refcnt) == 1) && + notify_on_release(cgrp)) { set_bit(CGRP_RELEASABLE, &cgrp->flags); check_for_release(cgrp); } -- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget() 2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage @ 2008-12-11 0:52 ` KAMEZAWA Hiroyuki 2008-12-11 7:02 ` [RFC][PATCH]example: use css_tryget() in memcg " KAMEZAWA Hiroyuki 2008-12-11 5:15 ` KAMEZAWA Hiroyuki 1 sibling, 1 reply; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 0:52 UTC (permalink / raw) To: menage; +Cc: balbir, containers, linux-kernel, akpm On Wed, 10 Dec 2008 15:36:57 -0800 menage@google.com wrote: > This patch adds css_tryget(), that obtains a counted reference on a > CSS. It is used in situations where the caller has a "weak" reference > to the CSS, i.e. one that does not protect the cgroup from removal via > a reference count, but would instead be cleaned up by a destroy() > callback. > > css_tryget() will return true on success, or false if the cgroup is > being removed. > > This is similar to Kamezawa Hiroyuki's patch from a week or two ago, > but with the difference that in the event of css_tryget() racing with > a cgroup_rmdir(), css_tryget() will only return false if the cgroup > really does get removed. > > This implementation is done by biasing css->refcnt, so that a refcnt > of 1 means "releasable" and 0 means "released or releasing". In the > event of a race, css_tryget() distinguishes between "released" and > "releasing" by checking for the CSS_REMOVED flag in css->flags. > > Signed-off-by: Paul Menage <menage@google.com> > thx, I'll try this. 
-Kame > --- > include/linux/cgroup.h | 33 ++++++++++++++++++++++----- > kernel/cgroup.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----- > 2 files changed, 80 insertions(+), 11 deletions(-) > > Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h > +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h > @@ -52,9 +52,9 @@ struct cgroup_subsys_state { > * hierarchy structure */ > struct cgroup *cgroup; > > - /* State maintained by the cgroup system to allow > - * subsystems to be "busy". Should be accessed via css_get() > - * and css_put() */ > + /* State maintained by the cgroup system to allow subsystems > + * to be "busy". Should be accessed via css_get(), > + * css_tryget() and and css_put(). */ > > atomic_t refcnt; > > @@ -64,11 +64,14 @@ struct cgroup_subsys_state { > /* bits in struct cgroup_subsys_state flags field */ > enum { > CSS_ROOT, /* This CSS is the root of the subsystem */ > + CSS_REMOVED, /* This CSS is dead */ > }; > > /* > - * Call css_get() to hold a reference on the cgroup; > - * > + * Call css_get() to hold a reference on the css; it can be used > + * for a reference obtained via: > + * - an existing ref-counted reference to the css > + * - task->cgroups for a locked task > */ > > static inline void css_get(struct cgroup_subsys_state *css) > @@ -77,9 +80,27 @@ static inline void css_get(struct cgroup > if (!test_bit(CSS_ROOT, &css->flags)) > atomic_inc(&css->refcnt); > } > + > +/* > + * Call css_tryget() to take a reference on a css if your existing > + * (known-valid) reference isn't already ref-counted. Returns false if > + * the css has been destroyed. 
> + */ > + > +static inline bool css_tryget(struct cgroup_subsys_state *css) > +{ > + if (test_bit(CSS_ROOT, &css->flags)) > + return true; > + while (!atomic_inc_not_zero(&css->refcnt)) { > + if (test_bit(CSS_REMOVED, &css->flags)) > + return false; > + } > + return true; > +} > + > /* > * css_put() should be called to release a reference taken by > - * css_get() > + * css_get() or css_tryget() > */ > > extern void __css_put(struct cgroup_subsys_state *css); > Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > =================================================================== > --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c > +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c > @@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou > struct cgroup *cgrp) > { > css->cgroup = cgrp; > - atomic_set(&css->refcnt, 0); > + atomic_set(&css->refcnt, 1); > css->flags = 0; > if (cgrp == dummytop) > set_bit(CSS_ROOT, &css->flags); > @@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg > { > /* Check the reference count on each subsystem. Since we > * already established that there are no tasks in the > - * cgroup, if the css refcount is also 0, then there should > + * cgroup, if the css refcount is also 1, then there should > * be no outstanding references, so the subsystem is safe to > * destroy. We scan across all subsystems rather than using > * the per-hierarchy linked list of mounted subsystems since > @@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg > * matter, since it can only happen if the cgroup > * has been deleted and hence no longer needs the > * release agent to be called anyway. */ > - if (css && atomic_read(&css->refcnt)) > + if (css && (atomic_read(&css->refcnt) > 1)) > return 1; > } > return 0; > } > > +/* > + * Atomically mark all of the cgroup's CSS objects as > + * CSS_REMOVED. Return true on success, or false if the cgroup has > + * busy subsystems. 
Call with cgroup_mutex held > + */ > + > +static int cgroup_clear_css_refs(struct cgroup *cgrp) > +{ > + struct cgroup_subsys *ss; > + unsigned long flags; > + bool failed = false; > + local_irq_save(flags); > + for_each_subsys(cgrp->root, ss) { > + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; > + int refcnt; > + do { > + /* We can only remove a CSS with a refcnt==1 */ > + refcnt = atomic_read(&css->refcnt); > + if (refcnt > 1) { > + failed = true; > + goto done; > + } > + BUG_ON(!refcnt); > + /* > + * Drop the refcnt to 0 while we check other > + * subsystems. This will cause any racing > + * css_tryget() to spin until we set the > + * CSS_REMOVED bits or abort > + */ > + } while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt); > + } > + done: > + for_each_subsys(cgrp->root, ss) { > + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; > + if (failed) { > + /* Restore old refcnt if necessary */ > + if (!atomic_read(&css->refcnt)) > + atomic_set(&css->refcnt, 1); > + } else { > + /* Commit the fact that the CSS is removed */ > + set_bit(CSS_REMOVED, &css->flags); > + } > + } > + local_irq_restore(flags); > + return !failed; > +} > + > static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry) > { > struct cgroup *cgrp = dentry->d_fsdata; > @@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un > > if (atomic_read(&cgrp->count) > || !list_empty(&cgrp->children) > - || cgroup_has_css_refs(cgrp)) { > + || !cgroup_clear_css_refs(cgrp)) { > mutex_unlock(&cgroup_mutex); > return -EBUSY; > } > @@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat > { > struct cgroup *cgrp = css->cgroup; > rcu_read_lock(); > - if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) { > + if ((atomic_dec_return(&css->refcnt) == 1) && > + notify_on_release(cgrp)) { > set_bit(CGRP_RELEASABLE, &cgrp->flags); > check_for_release(cgrp); > } > > -- > ^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH]example: use css_tryget() in memcg Re: [RFC][PATCH 3/3] CGroups: Add css_tryget() 2008-12-11 0:52 ` KAMEZAWA Hiroyuki @ 2008-12-11 7:02 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 7:02 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: menage, containers, linux-kernel, akpm, balbir An example; css_is_removed() is added (by another patch): bool css_is_removed(struct cgroup_subsys_state *css) { return test_bit(CSS_REMOVED, &css->flags); } == From:KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Use css_tryget() in memcg. css_tryget() is newly added; with it we can know whether a css is alive and take a refcount on it in a very safe way. ("alive" here means "rmdir/destroy" has not been called.) This patch replaces css_get() with css_tryget() in the places where I cannot explain why css_get() is safe. It also removes the memcg->obsolete flag. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- mm/memcontrol.c | 98 ++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 61 insertions(+), 37 deletions(-) Index: mmotm-2.6.28-Dec10/mm/memcontrol.c =================================================================== --- mmotm-2.6.28-Dec10.orig/mm/memcontrol.c +++ mmotm-2.6.28-Dec10/mm/memcontrol.c @@ -162,7 +162,6 @@ struct mem_cgroup { */ bool use_hierarchy; unsigned long last_oom_jiffies; - int obsolete; atomic_t refcnt; unsigned int swappiness; @@ -286,6 +285,31 @@ struct mem_cgroup *mem_cgroup_from_task( struct mem_cgroup, css); } +static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) +{ + struct mem_cgroup *mem = NULL; + /* + * Because we have no locks, mm->owner may be being moved to another + * cgroup. We use css_tryget() here even if this looks + * pessimistic (rather than adding locks here).
+ */ + rcu_read_lock(); + do { + mem = mem_cgroup_from_task(rcu_dereference(mm->owner)); + if (unlikely(!mem)) + break; + } while (!css_tryget(&mem->css)); + rcu_read_unlock(); + return mem; +} + +static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem) +{ + if (!mem) + return true; + return css_is_removed(&mem->css); +} + /* * Following LRU functions are allowed to be used without PCG_LOCK. * Operations are called by routine of global LRU independently from memcg. @@ -592,8 +616,9 @@ mem_cgroup_get_first_node(struct mem_cgr { struct cgroup *cgroup; struct mem_cgroup *ret; - bool obsolete = (root_mem->last_scanned_child && - root_mem->last_scanned_child->obsolete); + bool obsolete; + + obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child); /* * Scan all children under the mem_cgroup mem @@ -681,7 +706,7 @@ static int mem_cgroup_hierarchical_recla next_mem = mem_cgroup_get_first_node(root_mem); while (next_mem != root_mem) { - if (next_mem->obsolete) { + if (mem_cgroup_is_obsolete(next_mem)) { mem_cgroup_put(next_mem); next_mem = mem_cgroup_get_first_node(root_mem); continue; @@ -737,23 +762,17 @@ static int __mem_cgroup_try_charge(struc * thread group leader migrates. It's possible that mm is not * set, if so charge the init_mm (happens for pagecache usage). */ - if (likely(!*memcg)) { - rcu_read_lock(); - mem = mem_cgroup_from_task(rcu_dereference(mm->owner)); - if (unlikely(!mem)) { - rcu_read_unlock(); - return 0; - } - /* - * For every charge from the cgroup, increment reference count - */ - css_get(&mem->css); + mem = *memcg; + if (likely(!mem)) { + mem = try_get_mem_cgroup_from_mm(mm); *memcg = mem; - rcu_read_unlock(); } else { - mem = *memcg; css_get(&mem->css); } + if (unlikely(!mem)) + return 0; + + VM_BUG_ON(mem_cgroup_is_obsolete(mem)); while (1) { int ret; @@ -1040,12 +1059,19 @@ int mem_cgroup_cache_charge(struct page MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL); } +/* + * While swap-in, try_charge -> commit or cancel, the page is locked. 
+ * And when try_charge() successfully returns, one refcnt to memcg without + * struct page_cgroup is acquired. This refcnt will be consumed by + * "commit()" or removed by "cancel()" + */ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t mask, struct mem_cgroup **ptr) { struct mem_cgroup *mem; swp_entry_t ent; + int ret; if (mem_cgroup_disabled()) return 0; @@ -1064,10 +1090,15 @@ int mem_cgroup_try_charge_swapin(struct ent.val = page_private(page); mem = lookup_swap_cgroup(ent); - if (!mem || mem->obsolete) + if (!mem) + goto charge_cur_mm; + if (!css_tryget(&mem->css)) + goto charge_cur_mm; *ptr = mem; - return __mem_cgroup_try_charge(NULL, mask, ptr, true); + ret = __mem_cgroup_try_charge(NULL, mask, ptr, true); + /* drop extra refcnt from tryget */ + css_put(&mem->css); + return ret; charge_cur_mm: if (unlikely(!mm)) mm = &init_mm; @@ -1098,13 +1129,18 @@ int mem_cgroup_cache_charge_swapin(struc ent.val = page_private(page); if (do_swap_account) { mem = lookup_swap_cgroup(ent); - if (mem && mem->obsolete) - mem = NULL; - if (mem) - mm = NULL; + if (mem) { + if (css_tryget(&mem->css)) + mm = NULL; /* charge to recorded */ + else + mem = NULL; /* charge to current */ + } } ret = mem_cgroup_charge_common(page, mm, mask, MEM_CGROUP_CHARGE_TYPE_SHMEM, mem); + /* drop extra refcnt from tryget */ + if (mem) + css_put(&mem->css); if (!ret && do_swap_account) { /* avoid double counting */ @@ -1145,7 +1181,6 @@ void mem_cgroup_commit_charge_swapin(str struct mem_cgroup *memcg; memcg = swap_cgroup_record(ent, NULL); if (memcg) { - /* If memcg is obsolete, memcg can be != ptr */ res_counter_uncharge(&memcg->memsw, PAGE_SIZE); mem_cgroup_put(memcg); } @@ -1388,14 +1423,9 @@ int mem_cgroup_shrink_usage(struct mm_st if (!mm) return 0; - rcu_read_lock(); - mem = mem_cgroup_from_task(rcu_dereference(mm->owner)); - if (unlikely(!mem)) { - rcu_read_unlock(); + mem = try_get_mem_cgroup_from_mm(mm); + if (unlikely(!mem)) return 0; - } -
css_get(&mem->css); - rcu_read_unlock(); do { progress = mem_cgroup_hierarchical_reclaim(mem, gfp_mask, true); @@ -2081,9 +2111,6 @@ static struct mem_cgroup *mem_cgroup_all * the number of reference from swap_cgroup and free mem_cgroup when * it goes down to 0. * - * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and - * entry which points to this memcg will be ignore at swapin. - * * Removal of cgroup itself succeeds regardless of refs from swap. */ @@ -2112,8 +2139,6 @@ static void mem_cgroup_get(struct mem_cg static void mem_cgroup_put(struct mem_cgroup *mem) { if (atomic_dec_and_test(&mem->refcnt)) { - if (!mem->obsolete) - return; mem_cgroup_free(mem); } } @@ -2179,7 +2204,6 @@ static void mem_cgroup_pre_destroy(struc struct cgroup *cont) { struct mem_cgroup *mem = mem_cgroup_from_cont(cont); - mem->obsolete = 1; mem_cgroup_force_empty(mem, false); } ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget() 2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage 2008-12-11 0:52 ` KAMEZAWA Hiroyuki @ 2008-12-11 5:15 ` KAMEZAWA Hiroyuki 2008-12-11 7:28 ` Paul Menage 1 sibling, 1 reply; 24+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-11 5:15 UTC (permalink / raw) To: menage; +Cc: balbir, containers, linux-kernel, akpm On Wed, 10 Dec 2008 15:36:57 -0800 menage@google.com wrote: > This patch adds css_tryget(), that obtains a counted reference on a > CSS. It is used in situations where the caller has a "weak" reference > to the CSS, i.e. one that does not protect the cgroup from removal via > a reference count, but would instead be cleaned up by a destroy() > callback. > > css_tryget() will return true on success, or false if the cgroup is > being removed. > > This is similar to Kamezawa Hiroyuki's patch from a week or two ago, > but with the difference that in the event of css_tryget() racing with > a cgroup_rmdir(), css_tryget() will only return false if the cgroup > really does get removed. > > This implementation is done by biasing css->refcnt, so that a refcnt > of 1 means "releasable" and 0 means "released or releasing". In the > event of a race, css_tryget() distinguishes between "released" and > "releasing" by checking for the CSS_REMOVED flag in css->flags. 
>
> Signed-off-by: Paul Menage <menage@google.com>
>
> ---
>
>  include/linux/cgroup.h |   33 ++++++++++++++++++++++-----
>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 80 insertions(+), 11 deletions(-)
>
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -52,9 +52,9 @@ struct cgroup_subsys_state {
>  	 * hierarchy structure */
>  	struct cgroup *cgroup;
>
> -	/* State maintained by the cgroup system to allow
> -	 * subsystems to be "busy". Should be accessed via css_get()
> -	 * and css_put() */
> +	/* State maintained by the cgroup system to allow subsystems
> +	 * to be "busy". Should be accessed via css_get(),
> +	 * css_tryget() and css_put(). */
>
>  	atomic_t refcnt;
>
> @@ -64,11 +64,14 @@ struct cgroup_subsys_state {
>  /* bits in struct cgroup_subsys_state flags field */
>  enum {
>  	CSS_ROOT, /* This CSS is the root of the subsystem */
> +	CSS_REMOVED, /* This CSS is dead */
>  };
>

Could you add this?
==
bool css_is_removed(struct cgroup_subsys_state *css)
{
	return test_bit(CSS_REMOVED, &css->flags);
}
==
Then, I'll use this instead of memcg->obsolete flag.

Thanks,
-Kame

>  /*
> - * Call css_get() to hold a reference on the cgroup;
> - *
> + * Call css_get() to hold a reference on the css; it can be used
> + * for a reference obtained via:
> + * - an existing ref-counted reference to the css
> + * - task->cgroups for a locked task
>   */
>
>  static inline void css_get(struct cgroup_subsys_state *css)
> @@ -77,9 +80,27 @@ static inline void css_get(struct cgroup
>  	if (!test_bit(CSS_ROOT, &css->flags))
>  		atomic_inc(&css->refcnt);
>  }
> +
> +/*
> + * Call css_tryget() to take a reference on a css if your existing
> + * (known-valid) reference isn't already ref-counted. Returns false if
> + * the css has been destroyed.
> + */
> +
> +static inline bool css_tryget(struct cgroup_subsys_state *css)
> +{
> +	if (test_bit(CSS_ROOT, &css->flags))
> +		return true;
> +	while (!atomic_inc_not_zero(&css->refcnt)) {
> +		if (test_bit(CSS_REMOVED, &css->flags))
> +			return false;
> +	}
> +	return true;
> +}
> +
>  /*
>   * css_put() should be called to release a reference taken by
> - * css_get()
> + * css_get() or css_tryget()
>   */
>
>  extern void __css_put(struct cgroup_subsys_state *css);
>
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
>  			struct cgroup *cgrp)
>  {
>  	css->cgroup = cgrp;
> -	atomic_set(&css->refcnt, 0);
> +	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
> @@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
>  {
>  	/* Check the reference count on each subsystem. Since we
>  	 * already established that there are no tasks in the
> -	 * cgroup, if the css refcount is also 0, then there should
> +	 * cgroup, if the css refcount is also 1, then there should
>  	 * be no outstanding references, so the subsystem is safe to
>  	 * destroy. We scan across all subsystems rather than using
>  	 * the per-hierarchy linked list of mounted subsystems since
> @@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg
>  		 * matter, since it can only happen if the cgroup
>  		 * has been deleted and hence no longer needs the
>  		 * release agent to be called anyway. */
> -		if (css && atomic_read(&css->refcnt))
> +		if (css && (atomic_read(&css->refcnt) > 1))
>  			return 1;
>  	}
>  	return 0;
>  }
>
> +/*
> + * Atomically mark all of the cgroup's CSS objects as
> + * CSS_REMOVED. Return true on success, or false if the cgroup has
> + * busy subsystems.
> + * Call with cgroup_mutex held.
> + */
> +
> +static int cgroup_clear_css_refs(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	unsigned long flags;
> +	bool failed = false;
> +	local_irq_save(flags);
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		int refcnt;
> +		do {
> +			/* We can only remove a CSS with a refcnt==1 */
> +			refcnt = atomic_read(&css->refcnt);
> +			if (refcnt > 1) {
> +				failed = true;
> +				goto done;
> +			}
> +			BUG_ON(!refcnt);
> +			/*
> +			 * Drop the refcnt to 0 while we check other
> +			 * subsystems. This will cause any racing
> +			 * css_tryget() to spin until we set the
> +			 * CSS_REMOVED bits or abort
> +			 */
> +		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
> +	}
> + done:
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		if (failed) {
> +			/* Restore old refcnt if necessary */
> +			if (!atomic_read(&css->refcnt))
> +				atomic_set(&css->refcnt, 1);
> +		} else {
> +			/* Commit the fact that the CSS is removed */
> +			set_bit(CSS_REMOVED, &css->flags);
> +		}
> +	}
> +	local_irq_restore(flags);
> +	return !failed;
> +}
> +
>  static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  {
>  	struct cgroup *cgrp = dentry->d_fsdata;
> @@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un
>
>  	if (atomic_read(&cgrp->count)
>  	    || !list_empty(&cgrp->children)
> -	    || cgroup_has_css_refs(cgrp)) {
> +	    || !cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat
>  {
>  	struct cgroup *cgrp = css->cgroup;
>  	rcu_read_lock();
> -	if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
> +	if ((atomic_dec_return(&css->refcnt) == 1) &&
> +	    notify_on_release(cgrp)) {
>  		set_bit(CGRP_RELEASABLE, &cgrp->flags);
>  		check_for_release(cgrp);
>  	}
>
> --

^ permalink raw reply	[flat|nested] 24+ messages in thread
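The handshake in the quoted patch has two sides: css_tryget() spins while a refcount is transiently zero and fails only once CSS_REMOVED has been committed, while cgroup_clear_css_refs() drops each counter from exactly 1 to 0, then either commits the removed flag on every css or rolls the drops back. The interplay can be modelled with userspace C11 atomics. This is an illustrative sketch, not the kernel code: the struct, function names, and the `removed` boolean standing in for the CSS_REMOVED flag bit are all invented for the demo, and the kernel's cmpxchg-retry loop is simplified to a single compare-exchange.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the biased refcount: 1 means "releasable",
 * 0 means "released or releasing"; `removed` mimics CSS_REMOVED. */
struct css_sim {
	atomic_int refcnt;
	atomic_bool removed;
};

/* Analogue of the kernel's atomic_inc_not_zero(). */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);
	while (old != 0)
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	return false;
}

/* css_tryget() analogue: spin while the refcnt is transiently 0
 * (an rmdir is deciding); fail only once removal has committed. */
static bool css_tryget_sim(struct css_sim *css)
{
	while (!inc_not_zero(&css->refcnt))
		if (atomic_load(&css->removed))
			return false;
	return true;
}

/* cgroup_clear_css_refs() analogue over an array of counters.
 * Phase 1: drop each refcnt from exactly 1 to 0; any other value
 * means an outside reference exists, so abort.  Phase 2: on abort,
 * restore the counters already dropped; on success, commit the
 * removed flag on every entry. */
static bool clear_refs_sim(struct css_sim *css, int n)
{
	bool failed = false;
	for (int i = 0; i < n && !failed; i++) {
		int expected = 1;
		if (!atomic_compare_exchange_strong(&css[i].refcnt,
						    &expected, 0))
			failed = true;	/* busy: someone holds a ref */
	}
	for (int i = 0; i < n; i++) {
		if (failed) {
			int zero = 0;	/* roll back our own drops */
			atomic_compare_exchange_strong(&css[i].refcnt,
						       &zero, 1);
		} else {
			atomic_store(&css[i].removed, true);
		}
	}
	return !failed;
}
```

This shows why a racing css_tryget() sees no spurious failures: if the rmdir aborts, the refcount is restored to 1 and the spinning tryget eventually succeeds; tryget returns false only when removal really commits.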
* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-11  5:15   ` KAMEZAWA Hiroyuki
@ 2008-12-11  7:28     ` Paul Menage
  2008-12-11  7:29       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Menage @ 2008-12-11  7:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 9:15 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> Could you add this?
> ==
> bool css_is_removed(struct cgroup_subsys_state *css)
> {
> 	return test_bit(CSS_REMOVED, &css->flags);
> }
> ==
> Then, I'll use this instead of memcg->obsolete flag.

I actually had a patch almost ready to send out that nuked memcg->obsolete.

A difference in mine is that I did

#define css_is_removed(__css) test_bit(CSS_REMOVED, &(__css)->css.flags)

so that you can use it on any subsystem state object that contains a
"struct cgroup_subsys_state css;" field, which is currently all of
them I think, without having to add the "->css" every time.

Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread
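The point of the macro form is that it operates on the containing subsystem-state object rather than on the embedded css itself. A userspace mock-up of the idea (CSS_REMOVED_BIT, the `_sim` structs, and test_bit_sim() are invented stand-ins for the kernel's flag bit and test_bit()):

```c
#include <stdbool.h>

#define CSS_REMOVED_BIT 1UL

/* Stand-in for struct cgroup_subsys_state. */
struct cgroup_subsys_state_sim {
	unsigned long flags;
};

/* Stand-in for the kernel's test_bit(). */
static bool test_bit_sim(unsigned long bit, const unsigned long *addr)
{
	return (*addr >> bit) & 1UL;
}

/* Takes the *containing* object, so call sites write
 * css_is_removed(mem) instead of css_is_removed(&mem->css). */
#define css_is_removed(__css) \
	test_bit_sim(CSS_REMOVED_BIT, &(__css)->css.flags)

/* e.g. a subsystem's state embeds the css as a field named "css": */
struct mem_cgroup_sim {
	struct cgroup_subsys_state_sim css;
	int private_data;	/* subsystem-private fields follow */
};
```

The trade-off is the one named in the thread: the macro saves the "->css" at every call site, but it relies on every subsystem naming its embedded field `css`, whereas Kamezawa's plain function works on any bare css pointer.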
* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-11  7:28     ` Paul Menage
@ 2008-12-11  7:29       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  7:29 UTC (permalink / raw)
To: Paul Menage; +Cc: balbir, containers, linux-kernel, akpm

On Wed, 10 Dec 2008 23:28:12 -0800
Paul Menage <menage@google.com> wrote:

> On Wed, Dec 10, 2008 at 9:15 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > Could you add this?
> > ==
> > bool css_is_removed(struct cgroup_subsys_state *css)
> > {
> > 	return test_bit(CSS_REMOVED, &css->flags);
> > }
> > ==
> > Then, I'll use this instead of memcg->obsolete flag.
>
> I actually had a patch almost ready to send out that nuked memcg->obsolete.
>
> A difference in mine is that I did
>
> #define css_is_removed(__css) test_bit(CSS_REMOVED, &(__css)->css.flags)
>
> so that you can use it on any subsystem state object that contains a
> "struct cgroup_subsys_state css;" field, which is currently all of
> them I think, without having to add the "->css" every time.

ok, thx. very useful :)

-Kame

^ permalink raw reply	[flat|nested] 24+ messages in thread
end of thread, other threads: [~2008-12-12  1:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: CGroups: Hierarchy locking/refcount changes menage
2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
2008-12-11  0:37   ` KAMEZAWA Hiroyuki
2008-12-11  0:44     ` Paul Menage
2008-12-11  6:30       ` Balbir Singh
2008-12-11  3:05   ` Li Zefan
2008-12-11 17:07     ` Paul Menage
2008-12-11  6:29   ` Balbir Singh
2008-12-11 17:09     ` Paul Menage
2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
2008-12-11  0:49   ` KAMEZAWA Hiroyuki
2008-12-11  0:52     ` Paul Menage
2008-12-11  1:05       ` KAMEZAWA Hiroyuki
2008-12-11  6:33   ` Balbir Singh
2008-12-11  6:47   ` Li Zefan
2008-12-11  6:53     ` KAMEZAWA Hiroyuki
2008-12-11 17:05       ` Paul Menage
2008-12-12  1:12         ` KAMEZAWA Hiroyuki
2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage
2008-12-11  0:52   ` KAMEZAWA Hiroyuki
2008-12-11  7:02     ` [RFC][PATCH] example: use css_tryget() in memcg KAMEZAWA Hiroyuki
2008-12-11  5:15   ` KAMEZAWA Hiroyuki
2008-12-11  7:28     ` Paul Menage
2008-12-11  7:29       ` KAMEZAWA Hiroyuki