* [-mm][PATCH 0/4] Add memrlimit controller (v4)
@ 2008-05-14 13:09 Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw)
To: linux-mm
Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki
This is the fourth version of the address space control patches. These
patches are against 2.6.26-rc2-mm1 and have been tested using KVM in SMP mode,
both with and without the config enabled, on a powerpc box and using UML.
The patches have also been compile tested with the config disabled on a
powerpc box.
The goal of this patchset is to implement a virtual address space controller
using cgroups. The documentation describes the controller, its goals and
usage in further detail.
The first patch adds the controller documentation. The second patch sets up
the controller and its user interface. The third patch fixes the
cgroup mm_owner_changed callback to pass the task struct, so that
accounting can be adjusted on owner changes. The fourth patch adds accounting
and control.
An earlier post of the patchset can be found at
http://lwn.net/Articles/275143/
This patch is built on top of the mm owner patches and utilizes that feature
to virtually group tasks by mm_struct.
Reviews, Comments?
Changelog
1. Add documentation upfront
2. Refactor the code (error handling and changes to improve the code)
3. Protect reading of total_vm with mmap_sem
4. Port to 2.6.26-rc2
5. Changed the name from rlimit to memrlimit
Series
memrlimit-controller-add-documentation.patch
memrlimit-controller-setup.patch
cgroup-add-task-to-mm-owner-callbacks.patch
memrlimit-controller-address-space-accounting-and-control.patch
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 27+ messages in thread* [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-15 1:20 ` Li Zefan 2008-05-15 18:22 ` Avi Kivity 2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh ` (2 subsequent siblings) 3 siblings, 2 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki Documentation patch - describes the goals and usage of the memrlimit controller. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Documentation/controllers/memrlimit.txt | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff -puN /dev/null Documentation/controllers/memrlimit.txt --- /dev/null 2008-05-14 04:27:30.032276540 +0530 +++ linux-2.6.26-rc2-balbir/Documentation/controllers/memrlimit.txt 2008-05-14 18:35:55.000000000 +0530 @@ -0,0 +1,29 @@ +This controller is enabled by the CONFIG_CGROUP_MEMRLIMIT_CTLR option. Prior +to reading this documentation please read Documentation/cgroups.txt and +Documentation/controllers/memory.txt. Several of the principles of this +controller are similar to the memory resource controller. + +This controller framework is designed to be extensible to control any +memory resource limit with little effort. + +This new controller, controls the address space expansion of the tasks +belonging to a cgroup. Address space control is provided along the same lines as +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2). +The interface for controlling address space is provided through +"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user +interface. 
Please see section 3 of the memory resource controller documentation +for more details on how to use the user interface to get and set values. + +The "memrlimit.usage_in_bytes" file provides information about the total address +space usage of the tasks in the cgroup, in bytes. + +Advantages of providing this feature + +1. Control over virtual address space allows for a cgroup to fail gracefully + i.e., via a malloc or mmap failure as compared to OOM kill when no + pages can be reclaimed. +2. It provides better control over how many pages can be swapped out when + the cgroup goes over its limit. A badly setup cgroup can cause excessive + swapping. Providing control over the address space allocations ensures + that the system administrator has control over the total swapping that + can take place. _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh @ 2008-05-15 1:20 ` Li Zefan 2008-05-15 18:22 ` Avi Kivity 1 sibling, 0 replies; 27+ messages in thread From: Li Zefan @ 2008-05-15 1:20 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki Balbir Singh wrote: > Documentation patch - describes the goals and usage of the memrlimit > controller. > > > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> > --- > > Documentation/controllers/memrlimit.txt | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff -puN /dev/null Documentation/controllers/memrlimit.txt > --- /dev/null 2008-05-14 04:27:30.032276540 +0530 > +++ linux-2.6.26-rc2-balbir/Documentation/controllers/memrlimit.txt 2008-05-14 18:35:55.000000000 +0530 > @@ -0,0 +1,29 @@ > +This controller is enabled by the CONFIG_CGROUP_MEMRLIMIT_CTLR option. Prior > +to reading this documentation please read Documentation/cgroups.txt and > +Documentation/controllers/memory.txt. Several of the principles of this > +controller are similar to the memory resource controller. > + > +This controller framework is designed to be extensible to control any > +memory resource limit with little effort. > + > +This new controller, controls the address space expansion of the tasks > +belonging to a cgroup. Address space control is provided along the same lines as > +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2). > +The interface for controlling address space is provided through > +"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user memrlimit.limit_in_bytes > +interface. Please see section 3 of the memory resource controller documentation > +for more details on how to use the user interface to get and set values. 
> + > +The "memrlimit.usage_in_bytes" file provides information about the total address > +space usage of the tasks in the cgroup, in bytes. > + > +Advantages of providing this feature > + > +1. Control over virtual address space allows for a cgroup to fail gracefully > + i.e., via a malloc or mmap failure as compared to OOM kill when no > + pages can be reclaimed. > +2. It provides better control over how many pages can be swapped out when > + the cgroup goes over its limit. A badly setup cgroup can cause excessive > + swapping. Providing control over the address space allocations ensures > + that the system administrator has control over the total swapping that > + can take place. > _ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh 2008-05-15 1:20 ` Li Zefan @ 2008-05-15 18:22 ` Avi Kivity 2008-05-15 18:39 ` Balbir Singh 1 sibling, 1 reply; 27+ messages in thread From: Avi Kivity @ 2008-05-15 18:22 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki Balbir Singh wrote: > + > +Advantages of providing this feature > + > +1. Control over virtual address space allows for a cgroup to fail gracefully > + i.e., via a malloc or mmap failure as compared to OOM kill when no > + pages can be reclaimed. > Do you mean by this, limiting the number of pagetable pages (that are pinned in memory), this preventing oom by a cgroup that instantiates many pagetables? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-15 18:22 ` Avi Kivity @ 2008-05-15 18:39 ` Balbir Singh 0 siblings, 0 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-15 18:39 UTC (permalink / raw) To: Avi Kivity Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki Avi Kivity wrote: > Balbir Singh wrote: >> + >> +Advantages of providing this feature >> + >> +1. Control over virtual address space allows for a cgroup to fail >> gracefully >> + i.e., via a malloc or mmap failure as compared to OOM kill when no >> + pages can be reclaimed. >> > > Do you mean by this, limiting the number of pagetable pages (that are > pinned in memory), this preventing oom by a cgroup that instantiates > many pagetables? > > This is not for page tables (that is in the long term TODO list). This is more for user space calls to mmap(), malloc() or anything that causes the total virtual memory of the process to go up (in our case cgroups). The motivation is similar to the motivations of RLIMIT_AS. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [-mm][PATCH 2/4] Setup the memrlimit controller (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-14 13:29 ` Pavel Emelyanov 2008-05-14 13:09 ` [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh 3 siblings, 1 reply; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki This patch sets up the rlimit cgroup controller. It adds the basic create, destroy and populate functionality. The user interface provided is very similar to the memory resource controller. The rlimit controller can be enhanced easily in the future to control mlocked pages. Changelog v3->v4 1. Use PAGE_ALIGN() 2. 
Rename rlimit to memrlimit Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/cgroup_subsys.h | 4 + include/linux/memrlimitcgroup.h | 19 +++++ init/Kconfig | 10 ++ mm/Makefile | 1 mm/memrlimitcgroup.c | 144 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 178 insertions(+) diff -puN include/linux/cgroup_subsys.h~memrlimit-controller-setup include/linux/cgroup_subsys.h --- linux-2.6.26-rc2/include/linux/cgroup_subsys.h~memrlimit-controller-setup 2008-05-14 18:36:36.000000000 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/cgroup_subsys.h 2008-05-14 18:36:36.000000000 +0530 @@ -47,4 +47,8 @@ SUBSYS(mem_cgroup) SUBSYS(devices) #endif +#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR +SUBSYS(memrlimit_cgroup) +#endif + /* */ diff -puN /dev/null include/linux/memrlimitcgroup.h --- /dev/null 2008-05-14 04:27:30.032276540 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h 2008-05-14 18:36:36.000000000 +0530 @@ -0,0 +1,19 @@ +/* + * Copyright A(C) International Business Machines Corp., 2008 + * + * Author: Balbir Singh <balbir@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ +#ifndef LINUX_MEMRLIMITCGROUP_H +#define LINUX_MEMRLIMITCGROUP_H + +#endif /* LINUX_MEMRLIMITCGROUP_H */ diff -puN init/Kconfig~memrlimit-controller-setup init/Kconfig --- linux-2.6.26-rc2/init/Kconfig~memrlimit-controller-setup 2008-05-14 18:36:36.000000000 +0530 +++ linux-2.6.26-rc2-balbir/init/Kconfig 2008-05-14 18:36:36.000000000 +0530 @@ -407,6 +407,16 @@ config CGROUP_MEM_RES_CTLR This config option also selects MM_OWNER config option, which could in turn add some fork/exit overhead. +config CGROUP_MEMRLIMIT_CTLR + bool "Memory resource limit controls for cgroups" + depends on CGROUPS && RESOURCE_COUNTERS && MMU + select MM_OWNER + help + Provides resource limits for all the tasks belonging to a + control group. CGROUP_MEM_RES_CTLR provides support for physical + memory RSS and Page Cache control. Virtual address space control + is provided by this controller. + config SYSFS_DEPRECATED bool diff -puN mm/Makefile~memrlimit-controller-setup mm/Makefile --- linux-2.6.26-rc2/mm/Makefile~memrlimit-controller-setup 2008-05-14 18:36:36.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/Makefile 2008-05-14 18:36:36.000000000 +0530 @@ -34,4 +34,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o diff -puN /dev/null mm/memrlimitcgroup.c --- /dev/null 2008-05-14 04:27:30.032276540 +0530 +++ linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c 2008-05-14 18:36:36.000000000 +0530 @@ -0,0 +1,144 @@ +/* + * Copyright A(C) International Business Machines Corp., 2008 + * + * Author: Balbir Singh <balbir@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Provide memory resource limits for tasks in a control group. A lot of code is + * duplicated from the memory controller (this code is common to almost + * all controllers). TODO: Consider writing a tool that can generate this + * code. + */ +#include <linux/cgroup.h> +#include <linux/mm.h> +#include <linux/smp.h> +#include <linux/rcupdate.h> +#include <linux/slab.h> +#include <linux/swap.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/res_counter.h> +#include <linux/memrlimitcgroup.h> + +struct cgroup_subsys memrlimit_cgroup_subsys; + +struct memrlimit_cgroup { + struct cgroup_subsys_state css; + struct res_counter as_res; /* address space counter */ +}; + +static struct memrlimit_cgroup init_memrlimit_cgroup; + +static struct memrlimit_cgroup *memrlimit_cgroup_from_cgrp(struct cgroup *cgrp) +{ + return container_of(cgroup_subsys_state(cgrp, + memrlimit_cgroup_subsys_id), + struct memrlimit_cgroup, css); +} + +static struct cgroup_subsys_state * +memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) +{ + struct memrlimit_cgroup *memrcg; + + if (unlikely(cgrp->parent == NULL)) + memrcg = &init_memrlimit_cgroup; + else { + memrcg = kzalloc(sizeof(*memrcg), GFP_KERNEL); + if (!memrcg) + return ERR_PTR(-ENOMEM); + } + res_counter_init(&memrcg->as_res); + return &memrcg->css; +} + +static void memrlimit_cgroup_destroy(struct cgroup_subsys *ss, + struct cgroup *cgrp) +{ + kfree(memrlimit_cgroup_from_cgrp(cgrp)); +} + +static int memrlimit_cgroup_reset(struct cgroup *cgrp, unsigned int event) +{ + struct memrlimit_cgroup *memrcg; + + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + switch (event) { + case RES_FAILCNT: + res_counter_reset_failcnt(&memrcg->as_res); + break; + } + 
return 0; +} + +static u64 memrlimit_cgroup_read(struct cgroup *cgrp, struct cftype *cft) +{ + return res_counter_read_u64(&memrlimit_cgroup_from_cgrp(cgrp)->as_res, + cft->private); +} + +static int memrlimit_cgroup_write_strategy(char *buf, unsigned long long *tmp) +{ + *tmp = memparse(buf, &buf); + if (*buf != '\0') + return -EINVAL; + + *tmp = PAGE_ALIGN(*tmp); + return 0; +} + +static ssize_t memrlimit_cgroup_write(struct cgroup *cgrp, struct cftype *cft, + struct file *file, + const char __user *userbuf, + size_t nbytes, + loff_t *ppos) +{ + return res_counter_write(&memrlimit_cgroup_from_cgrp(cgrp)->as_res, + cft->private, userbuf, nbytes, ppos, + memrlimit_cgroup_write_strategy); +} + +static struct cftype memrlimit_cgroup_files[] = { + { + .name = "usage_in_bytes", + .private = RES_USAGE, + .read_u64 = memrlimit_cgroup_read, + }, + { + .name = "limit_in_bytes", + .private = RES_LIMIT, + .write = memrlimit_cgroup_write, + .read_u64 = memrlimit_cgroup_read, + }, + { + .name = "failcnt", + .private = RES_FAILCNT, + .trigger = memrlimit_cgroup_reset, + .read_u64 = memrlimit_cgroup_read, + }, +}; + +static int memrlimit_cgroup_populate(struct cgroup_subsys *ss, + struct cgroup *cgrp) +{ + return cgroup_add_files(cgrp, ss, memrlimit_cgroup_files, + ARRAY_SIZE(memrlimit_cgroup_files)); +} + +struct cgroup_subsys memrlimit_cgroup_subsys = { + .name = "memrlimit", + .subsys_id = memrlimit_cgroup_subsys_id, + .create = memrlimit_cgroup_create, + .destroy = memrlimit_cgroup_destroy, + .populate = memrlimit_cgroup_populate, + .early_init = 0, +}; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 2/4] Setup the memrlimit controller (v4) 2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh @ 2008-05-14 13:29 ` Pavel Emelyanov 0 siblings, 0 replies; 27+ messages in thread From: Pavel Emelyanov @ 2008-05-14 13:29 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki Balbir Singh wrote: > This patch sets up the rlimit cgroup controller. It adds the basic create, > destroy and populate functionality. The user interface provided is very > similar to the memory resource controller. The rlimit controller can be > enhanced easily in the future to control mlocked pages. > > Changelog v3->v4 > > 1. Use PAGE_ALIGN() > 2. Rename rlimit to memrlimit > > > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh 3 siblings, 0 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki This patch adds an additional field to the mm_owner callbacks. This field is required to get to the mm that changed. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/cgroup.h | 3 ++- kernel/cgroup.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks include/linux/cgroup.h --- linux-2.6.26-rc2/include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks 2008-05-14 18:36:59.000000000 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/cgroup.h 2008-05-14 18:36:59.000000000 +0530 @@ -310,7 +310,8 @@ struct cgroup_subsys { */ void (*mm_owner_changed)(struct cgroup_subsys *ss, struct cgroup *old, - struct cgroup *new); + struct cgroup *new, + struct task_struct *p); int subsys_id; int active; int disabled; diff -puN kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks kernel/cgroup.c --- linux-2.6.26-rc2/kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks 2008-05-14 18:36:59.000000000 +0530 +++ linux-2.6.26-rc2-balbir/kernel/cgroup.c 2008-05-14 18:36:59.000000000 +0530 @@ -2772,7 +2772,7 @@ void cgroup_mm_owner_callbacks(struct ta if (oldcgrp == newcgrp) continue; if (ss->mm_owner_changed) - 
ss->mm_owner_changed(ss, oldcgrp, newcgrp); + ss->mm_owner_changed(ss, oldcgrp, newcgrp, new); } } } _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh ` (2 preceding siblings ...) 2008-05-14 13:09 ` [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-14 13:25 ` Balbir Singh ` (2 more replies) 3 siblings, 3 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki This patch adds support for accounting and control of virtual address space limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions. The core of the accounting takes place during fork time in copy_process(), may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special cases that are handled here as well (arch/ia64/kernel/perform.c, arch/x86/kernel/ptrace.c, insert_special_mapping()) Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- arch/ia64/kernel/perfmon.c | 6 ++ arch/x86/kernel/ptrace.c | 17 +++++-- fs/exec.c | 5 ++ include/linux/memrlimitcgroup.h | 21 ++++++++ kernel/fork.c | 8 +++ mm/memrlimitcgroup.c | 94 ++++++++++++++++++++++++++++++++++++++++ mm/mmap.c | 11 ++++ 7 files changed, 157 insertions(+), 5 deletions(-) diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c --- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c 2008-05-14 18:09:32.000000000 +0530 @@ -40,6 +40,7 @@ #include <linux/capability.h> #include <linux/rcupdate.h> #include <linux/completion.h> +#include <linux/memrlimitcgroup.h> #include <asm/errno.h> #include 
<asm/intrinsics.h> @@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size)); + if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT)) + return -ENOMEM; + /* * check requested size to avoid Denial-of-service attacks * XXX: may have to refine this test @@ -2313,6 +2317,7 @@ pfm_smpl_buffer_alloc(struct task_struct smpl_buf = pfm_rvmalloc(size); if (smpl_buf == NULL) { DPRINT(("Can't allocate sampling buffer\n")); + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); return -ENOMEM; } @@ -2390,6 +2395,7 @@ pfm_smpl_buffer_alloc(struct task_struct return 0; error: + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); kmem_cache_free(vm_area_cachep, vma); error_kmem: pfm_rvfree(smpl_buf, size); diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c --- linux-2.6.26-rc2/fs/exec.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/fs/exec.c 2008-05-14 18:09:32.000000000 +0530 @@ -52,6 +52,7 @@ #include <linux/tsacct_kern.h> #include <linux/cn_proc.h> #include <linux/audit.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -230,6 +231,9 @@ static int __bprm_mm_init(struct linux_b if (!vma) goto err; + if (memrlimit_cgroup_charge_as(mm, 1)) + goto err; + down_write(&mm->mmap_sem); vma->vm_mm = mm; @@ -247,6 +251,7 @@ static int __bprm_mm_init(struct linux_b err = insert_vm_struct(mm, vma); if (err) { up_write(&mm->mmap_sem); + memrlimit_cgroup_uncharge_as(mm, 1); goto err; } diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h --- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 
+0530 +++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h 2008-05-14 18:09:32.000000000 +0530 @@ -16,4 +16,25 @@ #ifndef LINUX_MEMRLIMITCGROUP_H #define LINUX_MEMRLIMITCGROUP_H +#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages); +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages); + +#else /* !CONFIG_CGROUP_RLIMIT_CTLR */ + +static inline int +memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + return 0; +} + +static inline void +memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ +} + +#endif /* CONFIG_CGROUP_RLIMIT_CTLR */ + + #endif /* LINUX_MEMRLIMITCGROUP_H */ diff -puN kernel/fork.c~memrlimit-controller-address-space-accounting-and-control kernel/fork.c --- linux-2.6.26-rc2/kernel/fork.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/kernel/fork.c 2008-05-14 18:09:32.000000000 +0530 @@ -54,6 +54,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/memrlimitcgroup.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -267,6 +268,7 @@ static int dup_mmap(struct mm_struct *mm mm->total_vm -= pages; vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file, -pages); + memrlimit_cgroup_uncharge_as(mm, pages); continue; } charge = 0; @@ -596,6 +598,12 @@ static int copy_mm(unsigned long clone_f atomic_inc(&oldmm->mm_users); mm = oldmm; goto good_mm; + } else { + down_read(&oldmm->mmap_sem); + retval = memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm); + up_read(&oldmm->mmap_sem); + if (retval) + goto fail_nomem; } retval = -ENOMEM; diff -puN mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control mm/memrlimitcgroup.c --- linux-2.6.26-rc2/mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ 
linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c 2008-05-14 18:09:32.000000000 +0530 @@ -45,6 +45,41 @@ static struct memrlimit_cgroup *memrlimi struct memrlimit_cgroup, css); } +static struct memrlimit_cgroup * +memrlimit_cgroup_from_task(struct task_struct *p) +{ + return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id), + struct memrlimit_cgroup, css); +} + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + int ret; + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + css_get(&memrcg->css); + rcu_read_unlock(); + + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); + return ret; +} + +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + css_get(&memrcg->css); + rcu_read_unlock(); + + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); +} + static struct cgroup_subsys_state * memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str ARRAY_SIZE(memrlimit_cgroup_files)); } +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct mm_struct *mm; + struct memrlimit_cgroup *memrcg, *old_memrcg; + + mm = get_task_mm(p); + if (mm == NULL) + return; + + rcu_read_lock(); + if (p != rcu_dereference(mm->owner)) + goto out; + + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + if (memrcg == old_memrcg) + goto out; + + /* + * Hold mmap_sem, so that total_vm does not change underneath us + */ + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + 
res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + rcu_read_unlock(); + mmput(mm); +} + +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct memrlimit_cgroup *memrcg, *old_memrcg; + struct mm_struct *mm = get_task_mm(p); + + BUG_ON(!mm); + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + + mmput(mm); +} + struct cgroup_subsys memrlimit_cgroup_subsys = { .name = "memrlimit", .subsys_id = memrlimit_cgroup_subsys_id, .create = memrlimit_cgroup_create, .destroy = memrlimit_cgroup_destroy, .populate = memrlimit_cgroup_populate, + .attach = memrlimit_cgroup_move_task, + .mm_owner_changed = memrlimit_cgroup_mm_owner_changed, .early_init = 0, }; diff -puN mm/mmap.c~memrlimit-controller-address-space-accounting-and-control mm/mmap.c --- linux-2.6.26-rc2/mm/mmap.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/mmap.c 2008-05-14 18:09:32.000000000 +0530 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -1730,6 +1731,7 @@ static void remove_vma_list(struct mm_st long nrpages = vma_pages(vma); mm->total_vm -= nrpages; + memrlimit_cgroup_uncharge_as(mm, nrpages); if (vma->vm_flags & VM_LOCKED) mm->locked_vm -= nrpages; vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages); @@ -2056,6 +2058,7 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, 
&nr_accounted, NULL); vm_unacct_memory(nr_accounted); + memrlimit_cgroup_uncharge_as(mm, mm->total_vm); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); @@ -2174,6 +2177,10 @@ int may_expand_vm(struct mm_struct *mm, if (cur + npages > lim) return 0; + + if (memrlimit_cgroup_charge_as(mm, npages)) + return 0; + return 1; } @@ -2236,6 +2243,9 @@ int install_special_mapping(struct mm_st if (unlikely(vma == NULL)) return -ENOMEM; + if (memrlimit_cgroup_charge_as(mm, len >> PAGE_SHIFT)) + return -ENOMEM; + vma->vm_mm = mm; vma->vm_start = addr; vma->vm_end = addr + len; @@ -2248,6 +2258,7 @@ int install_special_mapping(struct mm_st if (unlikely(insert_vm_struct(mm, vma))) { kmem_cache_free(vm_area_cachep, vma); + memrlimit_cgroup_uncharge_as(mm, len >> PAGE_SHIFT); return -ENOMEM; } diff -puN arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ptrace.c --- linux-2.6.26-rc2/arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/x86/kernel/ptrace.c 2008-05-14 18:09:32.000000000 +0530 @@ -20,6 +20,7 @@ #include <linux/audit.h> #include <linux/seccomp.h> #include <linux/signal.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -782,21 +783,25 @@ static int ptrace_bts_realloc(struct tas current->mm->total_vm -= old_size; current->mm->locked_vm -= old_size; + memrlimit_cgroup_uncharge_as(mm, old_size); if (size == 0) goto out; + if (memrlimit_cgroup_charge_as(current->mm, size)) + goto out; + rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT; vm = current->mm->total_vm + size; if (rlim < vm) { ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->total_vm; if (size <= 0) - goto out; + goto out_uncharge; } rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; @@ -805,21 +810,23 @@ static int 
ptrace_bts_realloc(struct tas ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->locked_vm; if (size <= 0) - goto out; + goto out_uncharge; } ret = ds_allocate((void **)&child->thread.ds_area_msr, size << PAGE_SHIFT); if (ret < 0) - goto out; + goto out_uncharge; current->mm->total_vm += size; current->mm->locked_vm += size; +out_uncharge: + memrlimit_cgroup_uncharge_as(mm, size); out: if (child->thread.ds_area_msr) set_tsk_thread_flag(child, TIF_DS_AREA_MSR); _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh @ 2008-05-14 13:25 ` Balbir Singh 2008-05-15 2:25 ` Paul Menage 2008-05-14 13:32 ` Pavel Emelyanov 2008-09-18 20:54 ` Andrew Morton 2 siblings, 1 reply; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:25 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki * Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-14 18:39:51]: Here's a better version of the patch. The ptrace.c changes had a bug where uncharge was called unconditionally; we now check for ret < 0. The two signed-off-by's by the same person did not look good either :) Description This patch adds support for accounting and control of virtual address space limits. The accounting is done via the memrlimit_cgroup_(un)charge_as functions. The core of the accounting takes place at fork time in copy_process(), and in may_expand_vm(), remove_vma_list() and exit_mmap().
There are some special cases that are handled here as well (arch/ia64/kernel/perfmon.c, arch/x86/kernel/ptrace.c, install_special_mapping()) Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- arch/ia64/kernel/perfmon.c | 6 ++ arch/x86/kernel/ptrace.c | 18 +++++-- fs/exec.c | 5 ++ include/linux/memrlimitcgroup.h | 21 ++++++++ kernel/fork.c | 8 +++ mm/memrlimitcgroup.c | 94 ++++++++++++++++++++++++++++++++++++++++ mm/mmap.c | 11 ++++ 7 files changed, 158 insertions(+), 5 deletions(-) diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c --- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c 2008-05-14 18:37:11.000000000 +0530 @@ -40,6 +40,7 @@ #include <linux/capability.h> #include <linux/rcupdate.h> #include <linux/completion.h> +#include <linux/memrlimitcgroup.h> #include <asm/errno.h> #include <asm/intrinsics.h> @@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size)); + if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT)) + return -ENOMEM; + /* * check requested size to avoid Denial-of-service attacks * XXX: may have to refine this test @@ -2313,6 +2317,7 @@ pfm_smpl_buffer_alloc(struct task_struct smpl_buf = pfm_rvmalloc(size); if (smpl_buf == NULL) { DPRINT(("Can't allocate sampling buffer\n")); + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); + return -ENOMEM; } @@ -2390,6 +2395,7 @@ pfm_smpl_buffer_alloc(struct task_struct return 0; error: + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); kmem_cache_free(vm_area_cachep, vma); error_kmem: pfm_rvfree(smpl_buf, size); diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c
--- linux-2.6.26-rc2/fs/exec.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/fs/exec.c 2008-05-14 18:37:11.000000000 +0530 @@ -52,6 +52,7 @@ #include <linux/tsacct_kern.h> #include <linux/cn_proc.h> #include <linux/audit.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -230,6 +231,9 @@ static int __bprm_mm_init(struct linux_b if (!vma) goto err; + if (memrlimit_cgroup_charge_as(mm, 1)) + goto err; + down_write(&mm->mmap_sem); vma->vm_mm = mm; @@ -247,6 +251,7 @@ static int __bprm_mm_init(struct linux_b err = insert_vm_struct(mm, vma); if (err) { up_write(&mm->mmap_sem); + memrlimit_cgroup_uncharge_as(mm, 1); goto err; } diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h --- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h 2008-05-14 18:37:11.000000000 +0530 @@ -16,4 +16,25 @@ #ifndef LINUX_MEMRLIMITCGROUP_H #define LINUX_MEMRLIMITCGROUP_H +#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages); +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages); + +#else /* !CONFIG_CGROUP_MEMRLIMIT_CTLR */ + +static inline int +memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + return 0; +} + +static inline void +memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ +} + +#endif /* CONFIG_CGROUP_MEMRLIMIT_CTLR */ + + #endif /* LINUX_MEMRLIMITCGROUP_H */ diff -puN kernel/fork.c~memrlimit-controller-address-space-accounting-and-control kernel/fork.c --- linux-2.6.26-rc2/kernel/fork.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++
linux-2.6.26-rc2-balbir/kernel/fork.c 2008-05-14 18:37:11.000000000 +0530 @@ -54,6 +54,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/memrlimitcgroup.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -267,6 +268,7 @@ static int dup_mmap(struct mm_struct *mm mm->total_vm -= pages; vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file, -pages); + memrlimit_cgroup_uncharge_as(mm, pages); continue; } charge = 0; @@ -596,6 +598,12 @@ static int copy_mm(unsigned long clone_f atomic_inc(&oldmm->mm_users); mm = oldmm; goto good_mm; + } else { + down_read(&oldmm->mmap_sem); + retval = memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm); + up_read(&oldmm->mmap_sem); + if (retval) + goto fail_nomem; } retval = -ENOMEM; diff -puN mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control mm/memrlimitcgroup.c --- linux-2.6.26-rc2/mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c 2008-05-14 18:37:11.000000000 +0530 @@ -45,6 +45,41 @@ static struct memrlimit_cgroup *memrlimi struct memrlimit_cgroup, css); } +static struct memrlimit_cgroup * +memrlimit_cgroup_from_task(struct task_struct *p) +{ + return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id), + struct memrlimit_cgroup, css); +} + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + int ret; + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + css_get(&memrcg->css); + rcu_read_unlock(); + + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); + return ret; +} + +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + 
css_get(&memrcg->css); + rcu_read_unlock(); + + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); +} + static struct cgroup_subsys_state * memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str ARRAY_SIZE(memrlimit_cgroup_files)); } +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct mm_struct *mm; + struct memrlimit_cgroup *memrcg, *old_memrcg; + + mm = get_task_mm(p); + if (mm == NULL) + return; + + rcu_read_lock(); + if (p != rcu_dereference(mm->owner)) + goto out; + + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + if (memrcg == old_memrcg) + goto out; + + /* + * Hold mmap_sem, so that total_vm does not change underneath us + */ + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + rcu_read_unlock(); + mmput(mm); +} + +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct memrlimit_cgroup *memrcg, *old_memrcg; + struct mm_struct *mm = get_task_mm(p); + + BUG_ON(!mm); + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + + mmput(mm); +} + struct cgroup_subsys memrlimit_cgroup_subsys = { .name = "memrlimit", .subsys_id = memrlimit_cgroup_subsys_id, .create = memrlimit_cgroup_create, .destroy = memrlimit_cgroup_destroy, .populate = memrlimit_cgroup_populate, + 
.attach = memrlimit_cgroup_move_task, + .mm_owner_changed = memrlimit_cgroup_mm_owner_changed, .early_init = 0, }; diff -puN mm/mmap.c~memrlimit-controller-address-space-accounting-and-control mm/mmap.c --- linux-2.6.26-rc2/mm/mmap.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/mmap.c 2008-05-14 18:37:11.000000000 +0530 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -1730,6 +1731,7 @@ static void remove_vma_list(struct mm_st long nrpages = vma_pages(vma); mm->total_vm -= nrpages; + memrlimit_cgroup_uncharge_as(mm, nrpages); if (vma->vm_flags & VM_LOCKED) mm->locked_vm -= nrpages; vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages); @@ -2056,6 +2058,7 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); + memrlimit_cgroup_uncharge_as(mm, mm->total_vm); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); @@ -2174,6 +2177,10 @@ int may_expand_vm(struct mm_struct *mm, if (cur + npages > lim) return 0; + + if (memrlimit_cgroup_charge_as(mm, npages)) + return 0; + return 1; } @@ -2236,6 +2243,9 @@ int install_special_mapping(struct mm_st if (unlikely(vma == NULL)) return -ENOMEM; + if (memrlimit_cgroup_charge_as(mm, len >> PAGE_SHIFT)) + return -ENOMEM; + vma->vm_mm = mm; vma->vm_start = addr; vma->vm_end = addr + len; @@ -2248,6 +2258,7 @@ int install_special_mapping(struct mm_st if (unlikely(insert_vm_struct(mm, vma))) { kmem_cache_free(vm_area_cachep, vma); + memrlimit_cgroup_uncharge_as(mm, len >> PAGE_SHIFT); return -ENOMEM; } diff -puN arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ptrace.c --- 
linux-2.6.26-rc2/arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/x86/kernel/ptrace.c 2008-05-14 18:43:13.000000000 +0530 @@ -20,6 +20,7 @@ #include <linux/audit.h> #include <linux/seccomp.h> #include <linux/signal.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -782,21 +783,25 @@ static int ptrace_bts_realloc(struct tas current->mm->total_vm -= old_size; current->mm->locked_vm -= old_size; + memrlimit_cgroup_uncharge_as(mm, old_size); if (size == 0) goto out; + if (memrlimit_cgroup_charge_as(current->mm, size)) + goto out; + rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT; vm = current->mm->total_vm + size; if (rlim < vm) { ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->total_vm; if (size <= 0) - goto out; + goto out_uncharge; } rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; @@ -805,21 +810,24 @@ static int ptrace_bts_realloc(struct tas ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->locked_vm; if (size <= 0) - goto out; + goto out_uncharge; } ret = ds_allocate((void **)&child->thread.ds_area_msr, size << PAGE_SHIFT); if (ret < 0) - goto out; + goto out_uncharge; current->mm->total_vm += size; current->mm->locked_vm += size; +out_uncharge: + if (ret < 0) + memrlimit_cgroup_uncharge_as(mm, size); out: if (child->thread.ds_area_msr) set_tsk_thread_flag(child, TIF_DS_AREA_MSR); _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-14 13:25 ` Balbir Singh @ 2008-05-15 2:25 ` Paul Menage 2008-05-15 6:17 ` Balbir Singh 0 siblings, 1 reply; 27+ messages in thread From: Paul Menage @ 2008-05-15 2:25 UTC (permalink / raw) To: balbir, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki On Wed, May 14, 2008 at 6:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > + > +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) > +{ > + int ret; > + struct memrlimit_cgroup *memrcg; > + > + rcu_read_lock(); > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > + css_get(&memrcg->css); > + rcu_read_unlock(); > + > + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > + css_put(&memrcg->css); > + return ret; > +} Assuming that we're holding a write lock on mm->mmap_sem here, and we additionally hold mmap_sem for the whole of mm_update_next_owner(), then maybe we don't need any extra synchronization here? Something like simply: int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) { struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner); return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); } Seems good to minimize additional synchronization on the fast path. 
The only thing that's still broken is that the task_struct.cgroups pointer gets updated only under the synchronization of task_lock(), so we've still got the race of: A: attach_task() updates B->cgroups B: memrlimit_cgroup_charge_as() charges the new res counter and updates mm->total_vm A: memrlimit_cgroup_move_task() moves mm->total_vm from the old counter to the new counter Here's one way I see to fix this: We change attach_task() so that rather than updating the task_struct.cgroups pointer once from the original css_set to the final css_set, it goes through a series of intermediate css_set structures, one for each subsystem in the hierarchy, transitioning from the old set to the final set. Then for each subsystem ss, it would do: next_css = <old css with pointer for ss updated> if (ss->attach) { ss->attach(ss, p, next_css); } else { task_lock(p); rcu_assign_ptr(p->cgroups, next_css); task_unlock(p); } i.e. the subsystem would be free to implement any synchronization it desired in the attach() code. The attach() method's responsibility would be to ensure that p->cgroups was updated to point to next_css before returning. This should make it much simpler for a subsystem to handle potential races between attach() and accounting. The current semantics of can_attach()/update/attach() are sufficient for cpusets, but probably not for systems with more complex accounting. I'd still need to figure out a nice way to get the kind of transactional semantics that you want from can_attach(). 
> + > +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) > +{ > + struct memrlimit_cgroup *memrcg; > + > + rcu_read_lock(); > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > + css_get(&memrcg->css); > + rcu_read_unlock(); > + > + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > + css_put(&memrcg->css); > +} > + > static struct cgroup_subsys_state * > memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) > { > @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str > ARRAY_SIZE(memrlimit_cgroup_files)); > } > > +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, > + struct cgroup *cgrp, > + struct cgroup *old_cgrp, > + struct task_struct *p) > +{ > + struct mm_struct *mm; > + struct memrlimit_cgroup *memrcg, *old_memrcg; > + > + mm = get_task_mm(p); > + if (mm == NULL) > + return; > + > + rcu_read_lock(); > + if (p != rcu_dereference(mm->owner)) > + goto out; out: does up_read() on mmap_sem, which you don't currently hold. Can you add more comments about why you're using RCU here? You have a refcounted pointer on mm, so mm can't go away, and you never dereference the result of the rcu_dereference(mm->owner), so you're not protecting the validity of that pointer. The only way this could help would be if anything that changes mm->owner calls synchronize_rcu() before, say, doing accounting changes, and I don't believe that's the case. Would it be simpler to use task_lock(p) here to ensure that it doesn't lose its owner status (by exiting or execing) while we're moving it? 
i.e., something like: [ this assumes the new semantics of attach that I proposed above ] static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *p, struct css_set new_css_set) { struct mm_struct *mm; struct memrlimit_cgroup *memrcg, *old_memrcg; retry: mm = get_task_mm(p); if (mm == NULL) { task_lock(p); rcu_assign_ptr(p->cgroups, new_css_set); task_unlock(p); return; } /* Take mmap_sem to prevent address space changes */ down_read(&mm->mmap_sem); /* task_lock(p) to prevent mm ownership changes */ task_lock(p); if (p->mm != mm) { /* We raced */ task_unlock(p); up_read(&mm->mmap_sem); mmput(mm); goto retry; } if (p != mm->owner) goto out_assign; memrcg = memrlimit_cgroup_from_cgrp(cgrp); old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); if (memrcg == old_memrcg) goto out_assign; if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) goto out_assign; res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); out_assign: rcu_assign_ptr(p->cgroups, new_css_set); task_unlock(p); up_read(&mm->mmap_sem); mmput(mm); } > + > + memrcg = memrlimit_cgroup_from_cgrp(cgrp); > + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); > + > + if (memrcg == old_memrcg) > + goto out; mmap_sem is also not held here. > + > + /* > + * Hold mmap_sem, so that total_vm does not change underneath us > + */ > + down_read(&mm->mmap_sem); You can't block inside rcu_read_lock(). 
> + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) > + goto out; > + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); > +out: > + up_read(&mm->mmap_sem); > + rcu_read_unlock(); > + mmput(mm); > +} > + > +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss, > + struct cgroup *cgrp, > + struct cgroup *old_cgrp, > + struct task_struct *p) > +{ > + struct memrlimit_cgroup *memrcg, *old_memrcg; > + struct mm_struct *mm = get_task_mm(p); > + > + BUG_ON(!mm); > + memrcg = memrlimit_cgroup_from_cgrp(cgrp); > + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); > + > + down_read(&mm->mmap_sem); At this point we're holding p->alloc_lock, so we can't do a blocking down_read(). How about if we down_read(&mm->mmap_sem) in mm_update_next_owner() prior to taking tasklist_lock? That will ensure that mm ownership changes are synchronized against mmap/munmap operations on that mm, and since the cgroups-based owners are likely to be wanting to track the ownership changes for accounting purposes, this seems like an appropriate lock to hold. > + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) > + goto out; > + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); > +out: > + up_read(&mm->mmap_sem); > + > + mmput(mm); > +} > + -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-15 2:25 ` Paul Menage @ 2008-05-15 6:17 ` Balbir Singh 2008-05-15 6:55 ` Paul Menage 0 siblings, 1 reply; 27+ messages in thread From: Balbir Singh @ 2008-05-15 6:17 UTC (permalink / raw) To: Paul Menage Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki * Paul Menage <menage@google.com> [2008-05-14 19:25:07]: > On Wed, May 14, 2008 at 6:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > + > > +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) > > +{ > > + int ret; > > + struct memrlimit_cgroup *memrcg; > > + > > + rcu_read_lock(); > > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > > + css_get(&memrcg->css); > > + rcu_read_unlock(); > > + > > + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > > + css_put(&memrcg->css); > > + return ret; > > +} > > Assuming that we're holding a write lock on mm->mmap_sem here, and we > additionally hold mmap_sem for the whole of mm_update_next_owner(), > then maybe we don't need any extra synchronization here? Something > like simply: > > int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) > { > struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner); > return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > } The charge_as routine is not always called with mmap_sem held, since the undo path gets more complicated under the lock. We already have our own locking mechanism for the counters. We're not really accessing any member of the mm here except the owner. Do we need to be called with mmap_sem held? > > Seems good to minimize additional synchronization on the fast path. 
> > The only thing that's still broken is that the task_struct.cgroups > pointer gets updated only under the synchronization of task_lock(), so > we've still got the race of: > > A: attach_task() updates B->cgroups > > B: memrlimit_cgroup_charge_as() charges the new res counter and > updates mm->total_vm > > A: memrlimit_cgroup_move_task() moves mm->total_vm from the old > counter to the new counter > > Here's one way I see to fix this: > > We change attach_task() so that rather than updating the > task_struct.cgroups pointer once from the original css_set to the > final css_set, it goes through a series of intermediate css_set > structures, one for each subsystem in the hierarchy, transitioning > from the old set to the final set. Then for each subsystem ss, it > would do: > > next_css = <old css with pointer for ss updated> > if (ss->attach) { > ss->attach(ss, p, next_css); > } else { > task_lock(p); > rcu_assign_ptr(p->cgroups, next_css); > task_unlock(p); > } > > i.e. the subsystem would be free to implement any synchronization it > desired in the attach() code. The attach() method's responsibility > would be to ensure that p->cgroups was updated to point to next_css > before returning. This should make it much simpler for a subsystem to > handle potential races between attach() and accounting. The current > semantics of can_attach()/update/attach() are sufficient for cpusets, > but probably not for systems with more complex accounting. I'd still > need to figure out a nice way to get the kind of transactional > semantics that you want from can_attach(). > A transaction manager would be great. 
We do the mm_update_owner_changes under the task_lock(), may be the attach callback should do the same to ensure that > > + > > +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) > > +{ > > + struct memrlimit_cgroup *memrcg; > > + > > + rcu_read_lock(); > > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > > + css_get(&memrcg->css); > > + rcu_read_unlock(); > > + > > + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > > + css_put(&memrcg->css); > > +} > > + > > static struct cgroup_subsys_state * > > memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) > > { > > @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str > > ARRAY_SIZE(memrlimit_cgroup_files)); > > } > > > > +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, > > + struct cgroup *cgrp, > > + struct cgroup *old_cgrp, > > + struct task_struct *p) > > +{ > > + struct mm_struct *mm; > > + struct memrlimit_cgroup *memrcg, *old_memrcg; > > + > > + mm = get_task_mm(p); > > + if (mm == NULL) > > + return; > > + > > + rcu_read_lock(); > > + if (p != rcu_dereference(mm->owner)) > > + goto out; > > out: does up_read() on mmap_sem, which you don't currently hold. > Yes, good catch! > Can you add more comments about why you're using RCU here? > Sure, I will. One of things we want to ensure is that mm->owner does not go away. > You have a refcounted pointer on mm, so mm can't go away, and you > never dereference the result of the rcu_dereference(mm->owner), so > you're not protecting the validity of that pointer. The only way this > could help would be if anything that changes mm->owner calls > synchronize_rcu() before, say, doing accounting changes, and I don't > believe that's the case. > > Would it be simpler to use task_lock(p) here to ensure that it doesn't > lose its owner status (by exiting or execing) while we're moving it? > That's what I've been thinking as well, based on the discussion above. 
> i.e., something like: [ this assumes the new semantics of attach that
> I proposed above ]
>
> static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss,
>                                        struct cgroup *cgrp,
>                                        struct cgroup *old_cgrp,
>                                        struct task_struct *p,
>                                        struct css_set *new_css_set)
> {
>         struct mm_struct *mm;
>         struct memrlimit_cgroup *memrcg, *old_memrcg;
>
> retry:
>         mm = get_task_mm(p);
>         if (mm == NULL) {
>                 task_lock(p);
>                 rcu_assign_ptr(p->cgroups, new_css_set);

Will each callback assign p->cgroups to new_css_set?

>                 task_unlock(p);
>                 return;
>         }
>
>         /* Take mmap_sem to prevent address space changes */
>         down_read(&mm->mmap_sem);
>         /* task_lock(p) to prevent mm ownership changes */
>         task_lock(p);
>         if (p->mm != mm) {
>                 /* We raced */

With exit_mmap() or exec_mmap(), right?

>                 task_unlock(p);
>                 up_read(&mm->mmap_sem);
>                 mmput(mm);
>                 goto retry;
>         }
>         if (p != mm->owner)
>                 goto out_assign;
>
>         memrcg = memrlimit_cgroup_from_cgrp(cgrp);
>         old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
>
>         if (memrcg == old_memrcg)
>                 goto out_assign;
>
>         if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
>                 goto out_assign;
>         res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
> out_assign:
>         rcu_assign_ptr(p->cgroups, new_css_set);
>         task_unlock(p);
>         up_read(&mm->mmap_sem);
>         mmput(mm);
> }
>

Taking the mmap_sem here would mean we would need to document (something
I should have done earlier) that mmap_sem nests under cgroup_mutex.

Looking at the mm->owner patches, I am thinking of writing down a race
scenario card.

Race conditions
R1: mm->owner can change dynamically under task_lock
R2: mm->owner's cgroup can change under cgroup_mutex

                Read                    Write
mm->owner       Prevent                 hold task_lock
                cgroup from changing    Prevent owner from changing

Scenarios requiring protection/consistency
R1: causes no problem, since we expect to make the appropriate
    adjustment in mm_owner_changed()
R2: is handled by the attach() callback

Which leaves us with the following conclusion: we don't have
move_task(), mm_owner_changed() and charge/uncharge() running in
parallel at the same time.

If we agree with the assertion/conclusion above, then a simple lock
might be able to protect us, assuming that it does not create an
intertwined locking hierarchy.

> > +
> > +	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
> > +	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
> > +
> > +	if (memrcg == old_memrcg)
> > +		goto out;
>
> mmap_sem is also not held here.

Will fix.

> > +
> > +	/*
> > +	 * Hold mmap_sem, so that total_vm does not change underneath us
> > +	 */
> > +	down_read(&mm->mmap_sem);
>
> You can't block inside rcu_read_lock().

Good catch!

> > +	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
> > +		goto out;
> > +	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
> > +out:
> > +	up_read(&mm->mmap_sem);
> > +	rcu_read_unlock();
> > +	mmput(mm);
> > +}
> > +
> > +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss,
> > +					      struct cgroup *cgrp,
> > +					      struct cgroup *old_cgrp,
> > +					      struct task_struct *p)
> > +{
> > +	struct memrlimit_cgroup *memrcg, *old_memrcg;
> > +	struct mm_struct *mm = get_task_mm(p);
> > +
> > +	BUG_ON(!mm);
> > +	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
> > +	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
> > +
> > +	down_read(&mm->mmap_sem);
>
> At this point we're holding p->alloc_lock, so we can't do a blocking
> down_read().

Good catch!

> How about if we down_read(&mm->mmap_sem) in mm_update_next_owner()
> prior to taking tasklist_lock? That will ensure that mm ownership
> changes are synchronized against mmap/munmap operations on that mm,
> and since the cgroups-based owners are likely to be wanting to track
> the ownership changes for accounting purposes, this seems like an
> appropriate lock to hold.

Hmmm.. may be worth doing.

> > +	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
> > +		goto out;
> > +	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
> > +out:
> > +	up_read(&mm->mmap_sem);
> > +
> > +	mmput(mm);
> > +}
> > +

Thanks for the detailed review.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  6:17       ` Balbir Singh
@ 2008-05-15  6:55         ` Paul Menage
  2008-05-15  7:03           ` Balbir Singh
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Menage @ 2008-05-15  6:55 UTC (permalink / raw)
To: balbir, Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

On Wed, May 14, 2008 at 11:17 PM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
> >
> > Assuming that we're holding a write lock on mm->mmap_sem here, and we
> > additionally hold mmap_sem for the whole of mm_update_next_owner(),
> > then maybe we don't need any extra synchronization here? Something
> > like simply:
> >
> > int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
> > {
> >         struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner);
> >         return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
> > }
>
> The charge_as routine is not always called with mmap_sem held, since
> the undo path gets more complicated under the lock. We already have
> our own locking mechanism for the counters.

I'm not worried about the counters themselves being inconsistent - I'm
worried about the case where charge_as() is called in the middle of
the attach operation, and we account the charge X to the new cgroup's
res_counter and update mm->total_vm, and then when we do the move, we
charge the whole of mm->total_vm to the new cgroup even though the
last charge was already accounted to the new res_counter, not the old
one.

That's what I'm hoping to address with the idea of splitting the
attach into one update per subsystem, and letting the subsystems
control their own synchronization.

> We're not really accessing
> any member of the mm here except the owner. Do we need to be called
> with mmap_sem held?

Not necessarily mmap_sem, but there needs to be something to ensure
that the update to mm->total_vm and the charge/uncharge against the
res_counter are an atomic pair with respect to the code that shifts an
mm between two cgroups, either due to an mm->owner change or due to an
attach_task(). Since mmap_sem is held for write on almost all the fast
path calls to the rlimit_as charge/uncharge functions, using that for
the synchronization avoids the need for any additional synchronization
in the fast path.

Can you say more about the complications of holding a write lock on
mmap_sem in the cleanup calls to uncharge?

> > retry:
> >         mm = get_task_mm(p);
> >         if (mm == NULL) {
> >                 task_lock(p);
> >                 rcu_assign_ptr(p->cgroups, new_css_set);
>
> Will each callback assign p->cgroups to new_css_set?

Yes - but new_css_set will be slightly different for each callback.
Specifically, it will differ from the existing set pointed to by
p->cgroups in the pointer for this particular subsystem. So the task
will move over in a staggered fashion, and each subsystem will get to
choose its own synchronization.

> >         task_lock(p);
> >         if (p->mm != mm) {
> >                 /* We raced */
>
> With exit_mmap() or exec_mmap(), right?

Yes.

> If we agree with the assertion/conclusion above, then a simple lock
> might be able to protect us, assuming that it does not create an
> intertwined locking hierarchy.

Right - and if we can make that lock be the mmap_sem of the mm in
question, we avoid introducing a new lock into the fast path.

Paul
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  6:55         ` Paul Menage
@ 2008-05-15  7:03           ` Balbir Singh
  2008-05-15  7:39             ` Paul Menage
  0 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2008-05-15  7:03 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Paul Menage <menage@google.com> [2008-05-14 23:55:07]:

> On Wed, May 14, 2008 at 11:17 PM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> >
> > >
> > > Assuming that we're holding a write lock on mm->mmap_sem here, and we
> > > additionally hold mmap_sem for the whole of mm_update_next_owner(),
> > > then maybe we don't need any extra synchronization here? Something
> > > like simply:
> > >
> > > int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
> > > {
> > >         struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner);
> > >         return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
> > > }
> >
> > The charge_as routine is not always called with mmap_sem held, since
> > the undo path gets more complicated under the lock. We already have
> > our own locking mechanism for the counters.
>
> I'm not worried about the counters themselves being inconsistent - I'm
> worried about the case where charge_as() is called in the middle of
> the attach operation, and we account the charge X to the new cgroup's
> res_counter and update mm->total_vm, and then when we do the move, we
> charge the whole of mm->total_vm to the new cgroup even though the
> last charge was already accounted to the new res_counter, not the old
> one.
>
> That's what I'm hoping to address with the idea of splitting the
> attach into one update per subsystem, and letting the subsystems
> control their own synchronization.
>
> > We're not really accessing
> > any member of the mm here except the owner. Do we need to be called
> > with mmap_sem held?
>
> Not necessarily mmap_sem, but there needs to be something to ensure
> that the update to mm->total_vm and the charge/uncharge against the
> res_counter are an atomic pair with respect to the code that shifts an
> mm between two cgroups, either due to an mm->owner change or due to an
> attach_task(). Since mmap_sem is held for write on almost all the fast
> path calls to the rlimit_as charge/uncharge functions, using that for
> the synchronization avoids the need for any additional synchronization
> in the fast path.
>
> Can you say more about the complications of holding a write lock on
> mmap_sem in the cleanup calls to uncharge?
>
> > > retry:
> > >         mm = get_task_mm(p);
> > >         if (mm == NULL) {
> > >                 task_lock(p);
> > >                 rcu_assign_ptr(p->cgroups, new_css_set);
> >
> > Will each callback assign p->cgroups to new_css_set?
>
> Yes - but new_css_set will be slightly different for each callback.
> Specifically, it will differ from the existing set pointed to by
> p->cgroups in the pointer for this particular subsystem. So the task
> will move over in a staggered fashion, and each subsystem will get to
> choose its own synchronization.
>
> > >         task_lock(p);
> > >         if (p->mm != mm) {
> > >                 /* We raced */
> >
> > With exit_mmap() or exec_mmap(), right?
>
> Yes.
>
> > If we agree with the assertion/conclusion above, then a simple lock
> > might be able to protect us, assuming that it does not create an
> > intertwined locking hierarchy.
>
> Right - and if we can make that lock be the mmap_sem of the mm in
> question, we avoid introducing a new lock into the fast path.
>

I want to focus on this conclusion/assertion, since it takes care of
most of the locking related discussion above, unless I missed
something.

My concern with using mmap_sem is that

1. It's highly contended (every page fault, vma change, etc.)
2. It's going to make the locking hierarchy deeper and more complex
3. It's not appropriate to call all the accounting callbacks with
   mmap_sem held, since the undo operations _can get_ complicated
   at the caller.

I would prefer introducing a new lock, so that other subsystems are
not affected.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  7:03           ` Balbir Singh
@ 2008-05-15  7:39             ` Paul Menage
  2008-05-15  8:25               ` Balbir Singh
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Menage @ 2008-05-15  7:39 UTC (permalink / raw)
To: balbir, Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

On Thu, May 15, 2008 at 12:03 AM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
>
> I want to focus on this conclusion/assertion, since it takes care of
> most of the locking related discussion above, unless I missed
> something.
>
> My concern with using mmap_sem is that
>
> 1. It's highly contended (every page fault, vma change, etc.)

But the only *new* cases of taking the mmap_sem that this would
introduce would be:

- on a failed vm limit charge
- when a task exit/exec causes an mm ownership change
- when a task moves between two cgroups in the memrlimit hierarchy.

All of these should be rare events, so I don't think the additional
contention is a worry.

> 2. It's going to make the locking hierarchy deeper and more complex

Yes, potentially. But if the upside of that is that we eliminate a
lock/unlock on a shared lock on every mmap/munmap call, it might well
be worth it.

> 3. It's not appropriate to call all the accounting callbacks with
>    mmap_sem held, since the undo operations _can get_ complicated
>    at the caller.

Can you give an example?

> I would prefer introducing a new lock, so that other subsystems are
> not affected.

For getting the first cut of the memrlimit controller working this may
well make sense. But it would be nice to avoid it longer-term.

Paul
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  7:39             ` Paul Menage
@ 2008-05-15  8:25               ` Balbir Singh
  2008-05-15 15:28                 ` Paul Menage
  0 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2008-05-15  8:25 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Paul Menage <menage@google.com> [2008-05-15 00:39:45]:

> On Thu, May 15, 2008 at 12:03 AM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> >
> > I want to focus on this conclusion/assertion, since it takes care of
> > most of the locking related discussion above, unless I missed
> > something.
> >
> > My concern with using mmap_sem is that
> >
> > 1. It's highly contended (every page fault, vma change, etc.)
>
> But the only *new* cases of taking the mmap_sem that this would
> introduce would be:
>
> - on a failed vm limit charge

Why a failed charge? Aren't we talking of moving all charge/uncharge
under mmap_sem?

> - when a task exit/exec causes an mm ownership change

Yes, in the mm_owner_changed callbacks.

> - when a task moves between two cgroups in the memrlimit hierarchy.

Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
be a bad side-effect.

> All of these should be rare events, so I don't think the additional
> contention is a worry.

We do make several of our charge calls under the mmap_sem, but not all
of them. So the additional contention might not be all that bad.

> > 2. It's going to make the locking hierarchy deeper and more complex
>
> Yes, potentially. But if the upside of that is that we eliminate a
> lock/unlock on a shared lock on every mmap/munmap call, it might well
> be worth it.
>
> > 3. It's not appropriate to call all the accounting callbacks with
> >    mmap_sem held, since the undo operations _can get_ complicated
> >    at the caller.
>
> Can you give an example?

Some paths of the uncharge are not under mmap_sem. Undoing the
operation there seemed complex.

> > I would prefer introducing a new lock, so that other subsystems are
> > not affected.
>
> For getting the first cut of the memrlimit controller working this may
> well make sense. But it would be nice to avoid it longer-term.

OK, so here's what I am going to try and do:

1. Refactor the code to try and use mmap_sem and see what I come up
   with. Basically use mmap_sem for all charge/uncharge operations, as
   well as use mmap_sem in read mode in the move_task() and
   mm_owner_changed() callbacks. That should take care of the race
   conditions discussed, unless I missed something.
2. Try and instrument insert_vm_struct() for charge/uncharge.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  8:25               ` Balbir Singh
@ 2008-05-15 15:28                 ` Paul Menage
  2008-05-15 17:01                   ` Balbir Singh
  2008-05-17 20:15                   ` Balbir Singh
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Menage @ 2008-05-15 15:28 UTC (permalink / raw)
To: balbir, Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > But the only *new* cases of taking the mmap_sem that this would
> > introduce would be:
> >
> > - on a failed vm limit charge
>
> Why a failed charge? Aren't we talking of moving all charge/uncharge
> under mmap_sem?

Sorry, I worded that wrongly - I meant "cleaning up a successful
charge after an expansion fails for other reasons".

I thought that all the charges and most of the uncharges were already
under mmap_sem, and it would just be a few of the cleanup paths that
needed to take it.

> > - when a task moves between two cgroups in the memrlimit hierarchy.
>
> Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
> be a bad side-effect.

I think it's already nested that way - e.g. the cpusets code can call
various migration functions (which take mmap_sem) while holding
cgroup_mutex.

> Refactor the code to try and use mmap_sem and see what I come up
> with. Basically use mmap_sem for all charge/uncharge operations, as
> well as use mmap_sem in read mode in the move_task() and
> mm_owner_changed() callbacks. That should take care of the race
> conditions discussed, unless I missed something.

Sounds good.

Thanks,

Paul
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15 15:28                 ` Paul Menage
@ 2008-05-15 17:01                   ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2008-05-15 17:01 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >
>> > But the only *new* cases of taking the mmap_sem that this would
>> > introduce would be:
>> >
>> > - on a failed vm limit charge
>>
>> Why a failed charge? Aren't we talking of moving all charge/uncharge
>> under mmap_sem?
>>
>
> Sorry, I worded that wrongly - I meant "cleaning up a successful
> charge after an expansion fails for other reasons"
>
> I thought that all the charges and most of the uncharges were already
> under mmap_sem, and it would just be a few of the cleanup paths that
> needed to take it.
>

OK, that's definitely more meaningful. Thanks for clarifying.

>> > - when a task moves between two cgroups in the memrlimit hierarchy.
>> >
>>
>> Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
>> be a bad side-effect.
>>
>
> I think it's already nested that way - e.g. the cpusets code can call
> various migration functions (which take mmap_sem) while holding
> cgroup_mutex.
>
>> Refactor the code to try and use mmap_sem and see what I come up
>> with. Basically use mmap_sem for all charge/uncharge operations, as
>> well as use mmap_sem in read mode in the move_task() and
>> mm_owner_changed() callbacks. That should take care of the race
>> conditions discussed, unless I missed something.
>
> Sounds good.
>

Let me get that done and I'll post the next version.

> Thanks,
>
> Paul

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15 15:28                 ` Paul Menage
  2008-05-15 17:01                   ` Balbir Singh
@ 2008-05-17 20:15                   ` Balbir Singh
  2008-05-17 20:17                     ` Balbir Singh
  1 sibling, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2008-05-17 20:15 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Paul Menage <menage@google.com> [2008-05-15 08:28:46]:

> On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > But the only *new* cases of taking the mmap_sem that this would
> > > introduce would be:
> > >
> > > - on a failed vm limit charge
> >
> > Why a failed charge? Aren't we talking of moving all charge/uncharge
> > under mmap_sem?
> >
>
> Sorry, I worded that wrongly - I meant "cleaning up a successful
> charge after an expansion fails for other reasons"
>
> I thought that all the charges and most of the uncharges were already
> under mmap_sem, and it would just be a few of the cleanup paths that
> needed to take it.
>
> > > - when a task moves between two cgroups in the memrlimit hierarchy.
> > >
> >
> > Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
> > be a bad side-effect.
> >
>
> I think it's already nested that way - e.g. the cpusets code can call
> various migration functions (which take mmap_sem) while holding
> cgroup_mutex.
>
> > Refactor the code to try and use mmap_sem and see what I come up
> > with. Basically use mmap_sem for all charge/uncharge operations, as
> > well as use mmap_sem in read mode in the move_task() and
> > mm_owner_changed() callbacks. That should take care of the race
> > conditions discussed, unless I missed something.
>
> Sounds good.
>
> Thanks,
>

I've revamped the last two patches. Please review.

This patch adds an additional field to the mm_owner callbacks. This
field is required to get to the mm that changed. Hold mmap_sem in
write mode before calling the mm_owner_changed callback.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/cgroup.h |    3 ++-
 kernel/cgroup.c        |    4 +++-
 kernel/exit.c          |    3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks include/linux/cgroup.h
--- linux-2.6.26-rc2/include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/cgroup.h	2008-05-14 18:36:59.000000000 +0530
@@ -310,7 +310,8 @@ struct cgroup_subsys {
 	 */
 	void (*mm_owner_changed)(struct cgroup_subsys *ss,
 			struct cgroup *old,
-			struct cgroup *new);
+			struct cgroup *new,
+			struct task_struct *p);
 	int subsys_id;
 	int active;
 	int disabled;
diff -puN kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks kernel/cgroup.c
--- linux-2.6.26-rc2/kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/cgroup.c	2008-05-17 22:09:57.000000000 +0530
@@ -2758,6 +2758,8 @@ void cgroup_fork_callbacks(struct task_s
  * Called on every change to mm->owner. mm_init_owner() does not
  * invoke this routine, since it assigns the mm->owner the first time
  * and does not change it.
+ *
+ * The callbacks are invoked with mmap_sem held in write mode.
  */
 void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
 {
@@ -2772,7 +2774,7 @@ void cgroup_mm_owner_callbacks(struct ta
 		if (oldcgrp == newcgrp)
 			continue;
 		if (ss->mm_owner_changed)
-			ss->mm_owner_changed(ss, oldcgrp, newcgrp);
+			ss->mm_owner_changed(ss, oldcgrp, newcgrp, new);
 	}
 }
diff -puN kernel/exit.c~cgroup-add-task-to-mm-owner-callbacks kernel/exit.c
--- linux-2.6.26-rc2/kernel/exit.c~cgroup-add-task-to-mm-owner-callbacks	2008-05-17 22:10:00.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/exit.c	2008-05-17 23:14:44.000000000 +0530
@@ -621,6 +621,7 @@ retry:
 assign_new_owner:
 	BUG_ON(c == p);
 	get_task_struct(c);
+	down_write(&mm->mmap_sem);
 	/*
 	 * The task_lock protects c->mm from changing.
 	 * We always want mm->owner->mm == mm
@@ -634,12 +635,14 @@ assign_new_owner:
 	if (c->mm != mm) {
 		task_unlock(c);
 		put_task_struct(c);
+		up_write(&mm->mmap_sem);
 		goto retry;
 	}
 	cgroup_mm_owner_callbacks(mm->owner, c);
 	mm->owner = c;
 	task_unlock(c);
 	put_task_struct(c);
+	up_write(&mm->mmap_sem);
 }
 #endif /* CONFIG_MM_OWNER */
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-17 20:15                   ` Balbir Singh
@ 2008-05-17 20:17                     ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2008-05-17 20:17 UTC (permalink / raw)
To: Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-18 01:45:45]:

> * Paul Menage <menage@google.com> [2008-05-15 08:28:46]:
>
> > On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > >
> > > > But the only *new* cases of taking the mmap_sem that this would
> > > > introduce would be:
> > > >
> > > > - on a failed vm limit charge
> > >
> > > Why a failed charge? Aren't we talking of moving all charge/uncharge
> > > under mmap_sem?
> > >
> >
> > Sorry, I worded that wrongly - I meant "cleaning up a successful
> > charge after an expansion fails for other reasons"
> >
> > I thought that all the charges and most of the uncharges were already
> > under mmap_sem, and it would just be a few of the cleanup paths that
> > needed to take it.
> >
> > > > - when a task moves between two cgroups in the memrlimit hierarchy.
> > > >
> > >
> > > Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
> > > be a bad side-effect.
> > >
> >
> > I think it's already nested that way - e.g. the cpusets code can call
> > various migration functions (which take mmap_sem) while holding
> > cgroup_mutex.
> >
> > > Refactor the code to try and use mmap_sem and see what I come up
> > > with. Basically use mmap_sem for all charge/uncharge operations, as
> > > well as use mmap_sem in read mode in the move_task() and
> > > mm_owner_changed() callbacks. That should take care of the race
> > > conditions discussed, unless I missed something.
> >
> > Sounds good.
> >
> > Thanks,
> >
> I've revamped the last two patches. Please review
>

Here's the last patch for review.

This patch adds support for accounting and control of virtual address
space limits. The accounting is done via the rlimit_cgroup_(un)charge_as
functions. The core of the accounting takes place during fork time in
copy_process(), may_expand_vm(), remove_vma_list() and exit_mmap().

Changelog v5->v4
1. Move specific hooks in code to insert_vm_struct
2. Use mmap_sem to protect mm->owner from changing and mm->owner from
   changing cgroups.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/x86/kernel/ptrace.c        |   18 +++++--
 include/linux/memrlimitcgroup.h |   21 +++++++++
 kernel/fork.c                   |    8 +++
 mm/memrlimitcgroup.c            |   92 ++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                       |   17 ++++++-
 5 files changed, 149 insertions(+), 7 deletions(-)

diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c
diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c
diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h
--- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h	2008-05-17 23:14:53.000000000 +0530
@@ -16,4 +16,25 @@
 #ifndef LINUX_MEMRLIMITCGROUP_H
 #define LINUX_MEMRLIMITCGROUP_H
 
+#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
+
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages);
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages);
+
+#else /* !CONFIG_CGROUP_RLIMIT_CTLR */
+
+static inline int
+memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	return 0;
+}
+
+static inline void
+memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+}
+
+#endif /* CONFIG_CGROUP_RLIMIT_CTLR */
+
+
 #endif /* LINUX_MEMRLIMITCGROUP_H */
diff -puN kernel/fork.c~memrlimit-controller-address-space-accounting-and-control kernel/fork.c
--- linux-2.6.26-rc2/kernel/fork.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/fork.c	2008-05-17 23:15:55.000000000 +0530
@@ -54,6 +54,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -267,6 +268,7 @@ static int dup_mmap(struct mm_struct *mm
 			mm->total_vm -= pages;
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
 								-pages);
+			memrlimit_cgroup_uncharge_as(mm, pages);
 			continue;
 		}
 		charge = 0;
@@ -596,6 +598,12 @@ static int copy_mm(unsigned long clone_f
 		atomic_inc(&oldmm->mm_users);
 		mm = oldmm;
 		goto good_mm;
+	} else {
+		down_write(&oldmm->mmap_sem);
+		retval = memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm);
+		up_write(&oldmm->mmap_sem);
+		if (retval)
+			goto fail_nomem;
 	}
 
 	retval = -ENOMEM;
diff -puN mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control mm/memrlimitcgroup.c
--- linux-2.6.26-rc2/mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c	2008-05-18 00:47:31.000000000 +0530
@@ -45,6 +45,38 @@ static struct memrlimit_cgroup *memrlimi
 			    struct memrlimit_cgroup, css);
 }
 
+static struct memrlimit_cgroup *
+memrlimit_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id),
+				struct memrlimit_cgroup, css);
+}
+
+/*
+ * Charge the cgroup for address space usage - mmap(), malloc() (through
+ * brk(), sbrk()), stack expansion, mremap(), etc - called with
+ * mmap_sem held.
+ */
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
+/*
+ * Uncharge the cgroup, as the address space of one of the tasks is
+ * decreasing - called with mmap_sem held.
+ */
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
 static struct cgroup_subsys_state *
 memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
@@ -134,11 +166,71 @@ static int memrlimit_cgroup_populate(str
 				ARRAY_SIZE(memrlimit_cgroup_files));
 }
 
+static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss,
+					struct cgroup *cgrp,
+					struct cgroup *old_cgrp,
+					struct task_struct *p)
+{
+	struct mm_struct *mm;
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+
+	mm = get_task_mm(p);
+	if (mm == NULL)
+		return;
+
+	/*
+	 * Hold mmap_sem, so that total_vm does not change underneath us
+	 */
+	down_read(&mm->mmap_sem);
+
+	rcu_read_lock();
+	if (p != rcu_dereference(mm->owner))
+		goto out;
+
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (memrcg == old_memrcg)
+		goto out;
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	rcu_read_unlock();
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+}
+
+/*
+ * This callback is called with mmap_sem held
+ */
+static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss,
+						struct cgroup *cgrp,
+						struct cgroup *old_cgrp,
+						struct task_struct *p)
+{
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+	struct mm_struct *mm = get_task_mm(p);
+
+	BUG_ON(!mm);
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	mmput(mm);
+}
+
 struct cgroup_subsys memrlimit_cgroup_subsys = {
 	.name = "memrlimit",
 	.subsys_id = memrlimit_cgroup_subsys_id,
 	.create = memrlimit_cgroup_create,
 	.destroy = memrlimit_cgroup_destroy,
 	.populate = memrlimit_cgroup_populate,
+	.attach = memrlimit_cgroup_move_task,
+	.mm_owner_changed = memrlimit_cgroup_mm_owner_changed,
 	.early_init = 0,
 };
diff -puN mm/mmap.c~memrlimit-controller-address-space-accounting-and-control mm/mmap.c
--- linux-2.6.26-rc2/mm/mmap.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/mm/mmap.c	2008-05-17 23:14:53.000000000 +0530
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1730,6 +1731,7 @@ static void remove_vma_list(struct mm_st
 		long nrpages = vma_pages(vma);
 
 		mm->total_vm -= nrpages;
+		memrlimit_cgroup_uncharge_as(mm, nrpages);
 		if (vma->vm_flags & VM_LOCKED)
 			mm->locked_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
@@ -2056,6 +2058,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	memrlimit_cgroup_uncharge_as(mm, mm->total_vm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 
@@ -2078,6 +2081,9 @@ int insert_vm_struct(struct mm_struct *
 	struct vm_area_struct * __vma, * prev;
 	struct rb_node ** rb_link, * rb_parent;
 
+	if (memrlimit_cgroup_charge_as(mm, vma_pages(vma)))
+		return -ENOMEM;
+
 	/*
 	 * The vm_pgoff of a purely anonymous vma should be irrelevant
 	 * until its first write fault, when page's anon_vma and index
@@ -2096,12 +2102,15 @@ int insert_vm_struct(struct mm_struct *
 	}
 	__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
 	if (__vma && __vma->vm_start < vma->vm_end)
-		return -ENOMEM;
+		goto err;
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
-		return -ENOMEM;
+		goto err;
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
+err:
+	memrlimit_cgroup_uncharge_as(mm, vma_pages(vma));
+	return -ENOMEM;
 }
 
 /*
@@ -2174,6 +2183,10 @@ int may_expand_vm(struct mm_struct *mm,
 
 	if (cur + npages > lim)
 		return 0;
+
+	if (memrlimit_cgroup_charge_as(mm, npages))
+		return 0;
+
 	return 1;
 }
diff -puN arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ptrace.c
--- linux-2.6.26-rc2/arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/arch/x86/kernel/ptrace.c	2008-05-17 23:14:53.000000000 +0530
@@ -20,6 +20,7 @@
 #include <linux/audit.h>
 #include <linux/seccomp.h>
 #include <linux/signal.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -782,21 +783,25 @@ static int ptrace_bts_realloc(struct tas
 
 	current->mm->total_vm  -= old_size;
 	current->mm->locked_vm -= old_size;
+	memrlimit_cgroup_uncharge_as(mm, old_size);
 
 	if (size == 0)
 		goto out;
 
+	if (memrlimit_cgroup_charge_as(current->mm, size))
+		goto out;
+
 	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
 	vm = current->mm->total_vm + size;
 	if (rlim < vm) {
 		ret = -ENOMEM;
 
 		if (!reduce_size)
-			goto out;
+			goto out_uncharge;
 
 		size = rlim - current->mm->total_vm;
 		if (size <= 0)
-			goto out;
+			goto out_uncharge;
 	}
 
 	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
@@ -805,21 +810,24 @@ static int ptrace_bts_realloc(struct tas
 		ret = -ENOMEM;
 
 		if (!reduce_size)
-			goto out;
+			goto out_uncharge;
 
 		size = rlim - current->mm->locked_vm;
 		if (size <= 0)
-
goto out; + goto out_uncharge; } ret = ds_allocate((void **)&child->thread.ds_area_msr, size << PAGE_SHIFT); if (ret < 0) - goto out; + goto out_uncharge; current->mm->total_vm += size; current->mm->locked_vm += size; +out_uncharge: + if (ret < 0) + memrlimit_cgroup_uncharge_as(mm, size); out: if (child->thread.ds_area_msr) set_tsk_thread_flag(child, TIF_DS_AREA_MSR); _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
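The insert_vm_struct() and may_expand_vm() hunks above follow a common kernel error-unwinding idiom: charge the counter before doing anything else, then route every later failure path through a single label that undoes the charge. A minimal userspace sketch of that idiom follows; the `res_counter` here is a toy stand-in for the kernel structure (the names mirror the patch, but the types and limit-in-pages convention are assumptions for illustration, not the kernel API):

```c
#include <assert.h>

/* Toy stand-in for the kernel's res_counter: usage/limit in pages. */
struct res_counter {
	unsigned long usage;
	unsigned long limit;
};

/* Charge nr_pages; fail (non-zero) if the limit would be exceeded. */
static int res_counter_charge(struct res_counter *rc, unsigned long nr_pages)
{
	if (rc->usage + nr_pages > rc->limit)
		return -1;
	rc->usage += nr_pages;
	return 0;
}

static void res_counter_uncharge(struct res_counter *rc, unsigned long nr_pages)
{
	rc->usage -= nr_pages;
}

/*
 * Shaped like the patched insert_vm_struct(): charge up front, and make
 * every subsequent failure path exit through one label that uncharges,
 * so the counter never leaks a charge for a region that was not inserted.
 */
static int insert_region(struct res_counter *rc, unsigned long nr_pages,
			 int later_step_fails)
{
	if (res_counter_charge(rc, nr_pages))
		return -1;		/* -ENOMEM in the kernel */

	if (later_step_fails)
		goto err;		/* e.g. an overlapping VMA was found */

	return 0;			/* success: charge is kept */
err:
	res_counter_uncharge(rc, nr_pages);
	return -1;
}
```

The payoff of the single-label form is exactly what the ptrace.c hunk needed: when new failure exits are added later, they reuse `goto err`/`goto out_uncharge` instead of each remembering to uncharge.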
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh
  2008-05-14 13:25 ` Balbir Singh
@ 2008-05-14 13:32 ` Pavel Emelyanov
  2008-05-14 19:39 ` Balbir Singh
  2008-09-18 20:54 ` Andrew Morton
  2 siblings, 1 reply; 27+ messages in thread

From: Pavel Emelyanov @ 2008-05-14 13:32 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
    linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki

Balbir Singh wrote:
> This patch adds support for accounting and control of virtual address space
> limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
> The core of the accounting takes place during fork time in copy_process(),
> may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special
> cases that are handled here as well (arch/ia64/kernel/perfmon.c,
> arch/x86/kernel/ptrace.c, insert_special_mapping())
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
>  arch/ia64/kernel/perfmon.c      |    6 ++
>  arch/x86/kernel/ptrace.c        |   17 +++++--
>  fs/exec.c                       |    5 ++
>  include/linux/memrlimitcgroup.h |   21 ++++++++
>  kernel/fork.c                   |    8 +++
>  mm/memrlimitcgroup.c            |   94 ++++++++++++++++++++++++++++++++++++++++
>  mm/mmap.c                       |   11 ++++
>  7 files changed, 157 insertions(+), 5 deletions(-)
>
> diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
> --- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control	2008-05-14 18:09:32.000000000 +0530
> +++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c	2008-05-14 18:09:32.000000000 +0530
> @@ -40,6 +40,7 @@
>  #include <linux/capability.h>
>  #include <linux/rcupdate.h>
>  #include <linux/completion.h>
> +#include <linux/memrlimitcgroup.h>
>
>  #include <asm/errno.h>
>  #include <asm/intrinsics.h>
> @@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct
>
>  	DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size));
>
> +	if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT))
> +		return -ENOMEM;
> +

AFAIS you didn't cover all the cases when VM expands. At least all of
arch/ia64/ia32/binfmt_elf32.c is missed.

I'd insert this charge into insert_vm_struct. This would a) cover
all of the missed cases and b) reduce the number of places to patch.

[snip the rest of the patch]
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-14 13:32 ` Pavel Emelyanov
@ 2008-05-14 19:39 ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread

From: Balbir Singh @ 2008-05-14 19:39 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
    linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki

* Pavel Emelyanov <xemul@openvz.org> [2008-05-14 17:32:42]:

> AFAIS you didn't cover all the cases when VM expands. At least all
> the arch/ia64/ia32/binfmt_elf32.c is missed.
>
> I'd insert this charge into insert_vm_struct. This would a) cover
> all of the missed cases and b) reduce the amount of places to patch.

I thought I had those covered. insert_vm_struct() is called from places
that we have covered in this patch. As far as arch/ia64/ia32/binfmt_elf32.c
is concerned, it inserts a GDT and LDT and does not change total_vm. Having
said that, I am not against putting the hooks in insert_vm_struct().

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh
  2008-05-14 13:25 ` Balbir Singh
  2008-05-14 13:32 ` Pavel Emelyanov
@ 2008-09-18 20:54 ` Andrew Morton
  2008-09-18 23:55 ` Balbir Singh
  2008-09-19  6:38 ` Balbir Singh
  2 siblings, 2 replies; 27+ messages in thread

From: Andrew Morton @ 2008-09-18 20:54 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, Ingo Molnar

On Wed, 14 May 2008 18:39:51 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> This patch adds support for accounting and control of virtual address space
> limits.

Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
the memrlimit patches.

I decided to retain the patches because it looks repairable. The
problem is this reject from
memrlimit-add-memrlimit-controller-accounting-and-control.patch:

***************
*** 808,828 ****
  	current->mm->total_vm  -= old_size;
  	current->mm->locked_vm -= old_size;

  	if (size == 0)
  		goto out;

  	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
  	vm = current->mm->total_vm + size;
  	if (rlim < vm) {
  		ret = -ENOMEM;

  		if (!reduce_size)
-			goto out;

  		size = rlim - current->mm->total_vm;
  		if (size <= 0)
-			goto out;
  	}

  	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
--- 809,833 ----
  	current->mm->total_vm  -= old_size;
  	current->mm->locked_vm -= old_size;
+	memrlimit_cgroup_uncharge_as(mm, old_size);

  	if (size == 0)
  		goto out;

+	if (memrlimit_cgroup_charge_as(current->mm, size))
+		goto out;
+
  	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
  	vm = current->mm->total_vm + size;
  	if (rlim < vm) {
  		ret = -ENOMEM;

  		if (!reduce_size)
+			goto out_uncharge;

  		size = rlim - current->mm->total_vm;
  		if (size <= 0)
+			goto out_uncharge;
  	}

  	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
***************
*** 831,851 ****
  		ret = -ENOMEM;

  		if (!reduce_size)
-			goto out;

  		size = rlim - current->mm->locked_vm;
  		if (size <= 0)
-			goto out;
  	}

  	ret = ds_allocate((void **)&child->thread.ds_area_msr,
  			  size << PAGE_SHIFT);
  	if (ret < 0)
-		goto out;

  	current->mm->total_vm  += size;
  	current->mm->locked_vm += size;

 out:
  	if (child->thread.ds_area_msr)
  		set_tsk_thread_flag(child, TIF_DS_AREA_MSR);
--- 836,859 ----
  		ret = -ENOMEM;

  		if (!reduce_size)
+			goto out_uncharge;

  		size = rlim - current->mm->locked_vm;
  		if (size <= 0)
+			goto out_uncharge;
  	}

  	ret = ds_allocate((void **)&child->thread.ds_area_msr,
  			  size << PAGE_SHIFT);
  	if (ret < 0)
+		goto out_uncharge;

  	current->mm->total_vm  += size;
  	current->mm->locked_vm += size;
+out_uncharge:
+	if (ret < 0)
+		memrlimit_cgroup_uncharge_as(mm, size);
 out:
  	if (child->thread.ds_area_msr)
  		set_tsk_thread_flag(child, TIF_DS_AREA_MSR);

Could you please take a look at today's mmotm and see what needs to be
done to salvage it? Most of the code you were altering got moved into
arch/x86/kernel/ds.c and got changed rather a lot.

Thanks.
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-18 20:54 ` Andrew Morton
@ 2008-09-18 23:55 ` Balbir Singh
  2008-09-19  6:38 ` Balbir Singh
  1 sibling, 0 replies; 27+ messages in thread

From: Balbir Singh @ 2008-09-18 23:55 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, Ingo Molnar

Andrew Morton wrote:
> On Wed, 14 May 2008 18:39:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> This patch adds support for accounting and control of virtual address space
>> limits.
>
> [snip]
>
> could you please take a look at today's mmotm and see what needs to be
> done to salvage it? Most of the code you were altering got moved into
> arch/x86/kernel/ds.c and got changed rather a lot.

I'll take a look tonight and see what needs to be done.

--
Balbir
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-18 20:54 ` Andrew Morton
  2008-09-18 23:55 ` Balbir Singh
@ 2008-09-19  6:38 ` Balbir Singh
  2008-09-19 20:14 ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread

From: Balbir Singh @ 2008-09-19 6:38 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, Ingo Molnar

* Andrew Morton <akpm@linux-foundation.org> [2008-09-18 13:54:30]:

> On Wed, 14 May 2008 18:39:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > This patch adds support for accounting and control of virtual address space
> > limits.
>
> Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
> the memrlimit patches.
>
> I decided to retain the patches because it looks repairable. The
> problem is this reject from
> memrlimit-add-memrlimit-controller-accounting-and-control.patch:

Andrew,

I could not apply mmotm to linux-next (both downloaded just now). I
applied the patches one-by-one, resolving differences, starting from #mm
in the series file.

Here is my fixed version of the patch. I compiled it, but could not run
it, since I could not create the full series of applied patches; I did
compile arch/x86/kernel/ds.o and ptrace.o. I've included the patch
below. Please let me know if the code looks OK (via review) and the
patch applies. I'll test it once I can reasonably resolve all conflicts
between linux-next and mmotm.

From: Balbir Singh <balbir@linux.vnet.ibm.com>

This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap().

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/kernel/ptrace.c        |    1
 include/linux/memrlimitcgroup.h |   21 ++++++
 kernel/fork.c                   |   14 ++++
 mm/memrlimitcgroup.c            |   92 ++++++++++++++++++++++++++++++
 mm/mmap.c                       |   17 ++++-
 5 files changed, 142 insertions(+), 3 deletions(-)

Index: linux-next/include/linux/memrlimitcgroup.h
===================================================================
--- linux-next.orig/include/linux/memrlimitcgroup.h	2008-09-18 23:19:26.000000000 -0700
+++ linux-next/include/linux/memrlimitcgroup.h	2008-09-18 23:19:29.000000000 -0700
@@ -16,4 +16,25 @@
 #ifndef LINUX_MEMRLIMITCGROUP_H
 #define LINUX_MEMRLIMITCGROUP_H

+#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
+
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages);
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages);
+
+#else /* !CONFIG_CGROUP_RLIMIT_CTLR */
+
+static inline int
+memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	return 0;
+}
+
+static inline void
+memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+}
+
+#endif /* CONFIG_CGROUP_RLIMIT_CTLR */
+
+
 #endif /* LINUX_MEMRLIMITCGROUP_H */

Index: linux-next/kernel/fork.c
===================================================================
--- linux-next.orig/kernel/fork.c	2008-09-18 21:32:39.000000000 -0700
+++ linux-next/kernel/fork.c	2008-09-18 23:27:21.000000000 -0700
@@ -51,6 +51,7 @@
 #include <linux/acct.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
+#include <linux/memrlimitcgroup.h>
 #include <linux/freezer.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
@@ -263,7 +264,7 @@
 	struct vm_area_struct *mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
-	unsigned long charge;
+	unsigned long charge, uncharged = 0;
 	struct mempolicy *pol;

 	down_write(&oldmm->mmap_sem);
@@ -273,6 +274,15 @@
 	 */
 	down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);

+	/*
+	 * Uncharging as a result of failure is done by mmput()
+	 * in dup_mm()
+	 */
+	if (memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm)) {
+		retval = -ENOMEM;
+		goto out;
+	}
+
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
 	mm->mmap_cache = NULL;
@@ -293,6 +303,8 @@
 			mm->total_vm -= pages;
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
 								-pages);
+			memrlimit_cgroup_uncharge_as(mm, pages);
+			uncharged += pages;
 			continue;
 		}
 		charge = 0;

Index: linux-next/mm/memrlimitcgroup.c
===================================================================
--- linux-next.orig/mm/memrlimitcgroup.c	2008-09-18 23:19:26.000000000 -0700
+++ linux-next/mm/memrlimitcgroup.c	2008-09-18 23:27:36.000000000 -0700
@@ -45,6 +45,38 @@
 			    struct memrlimit_cgroup, css);
 }

+static struct memrlimit_cgroup *
+memrlimit_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id),
+			    struct memrlimit_cgroup, css);
+}
+
+/*
+ * Charge the cgroup for address space usage - mmap(), malloc() (through
+ * brk(), sbrk()), stack expansion, mremap(), etc - called with
+ * mmap_sem held.
+ */
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
+/*
+ * Uncharge the cgroup, as the address space of one of the tasks is
+ * decreasing - called with mmap_sem held.
+ */
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
 static struct cgroup_subsys_state *
 memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
@@ -121,11 +153,71 @@
 				ARRAY_SIZE(memrlimit_cgroup_files));
 }

+static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss,
+				       struct cgroup *cgrp,
+				       struct cgroup *old_cgrp,
+				       struct task_struct *p)
+{
+	struct mm_struct *mm;
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+
+	mm = get_task_mm(p);
+	if (mm == NULL)
+		return;
+
+	/*
+	 * Hold mmap_sem, so that total_vm does not change underneath us
+	 */
+	down_read(&mm->mmap_sem);
+
+	rcu_read_lock();
+	if (p != rcu_dereference(mm->owner))
+		goto out;
+
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (memrcg == old_memrcg)
+		goto out;
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	rcu_read_unlock();
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+}
+
+/*
+ * This callback is called with mmap_sem held
+ */
+static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss,
+					      struct cgroup *cgrp,
+					      struct cgroup *old_cgrp,
+					      struct task_struct *p)
+{
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+	struct mm_struct *mm = get_task_mm(p);
+
+	BUG_ON(!mm);
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	mmput(mm);
+}
+
 struct cgroup_subsys memrlimit_cgroup_subsys = {
 	.name = "memrlimit",
 	.subsys_id = memrlimit_cgroup_subsys_id,
 	.create = memrlimit_cgroup_create,
 	.destroy = memrlimit_cgroup_destroy,
 	.populate = memrlimit_cgroup_populate,
+	.attach = memrlimit_cgroup_move_task,
+	.mm_owner_changed = memrlimit_cgroup_mm_owner_changed,
 	.early_init = 0,
 };

Index: linux-next/mm/mmap.c
===================================================================
--- linux-next.orig/mm/mmap.c	2008-09-18 23:00:18.000000000 -0700
+++ linux-next/mm/mmap.c	2008-09-18 23:27:21.000000000 -0700
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/profile.h>
 #include <linux/module.h>
+#include <linux/memrlimitcgroup.h>
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
@@ -1756,6 +1757,7 @@
 		long nrpages = vma_pages(vma);

 		mm->total_vm -= nrpages;
+		memrlimit_cgroup_uncharge_as(mm, nrpages);
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
 		vma = remove_vma(vma);
 	} while (vma);
@@ -2106,6 +2108,7 @@
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	memrlimit_cgroup_uncharge_as(mm, mm->total_vm);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
@@ -2128,6 +2131,9 @@
 	struct vm_area_struct * __vma, * prev;
 	struct rb_node ** rb_link, * rb_parent;

+	if (memrlimit_cgroup_charge_as(mm, vma_pages(vma)))
+		return -ENOMEM;
+
 	/*
 	 * The vm_pgoff of a purely anonymous vma should be irrelevant
 	 * until its first write fault, when page's anon_vma and index
@@ -2146,12 +2152,15 @@
 	}
 	__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
 	if (__vma && __vma->vm_start < vma->vm_end)
-		return -ENOMEM;
+		goto err;
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
-		return -ENOMEM;
+		goto err;
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
+err:
+	memrlimit_cgroup_uncharge_as(mm, vma_pages(vma));
+	return -ENOMEM;
 }

 /*
@@ -2224,6 +2233,10 @@
 	if (cur + npages > lim)
 		return 0;
+
+	if (memrlimit_cgroup_charge_as(mm, npages))
+		return 0;
+
 	return 1;
 }

Index: linux-next/arch/x86/kernel/ds.c
===================================================================
--- linux-next.orig/arch/x86/kernel/ds.c	2008-09-18 23:28:07.000000000 -0700
+++ linux-next/arch/x86/kernel/ds.c	2008-09-18 23:33:55.000000000 -0700
@@ -30,6 +30,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
+#include <linux/memrlimitcgroup.h>


 /*
@@ -339,19 +340,22 @@
 	pgsz = PAGE_ALIGN(size) >> PAGE_SHIFT;

+	if (memrlimit_cgroup_charge_as(current->mm, pgsz << PAGE_SHIFT))
+		return NULL;
+
 	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
 	vm   = current->mm->total_vm + pgsz;
 	if (rlim < vm)
-		return NULL;
+		goto uncharge;

 	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
 	vm   = current->mm->locked_vm + pgsz;
 	if (rlim < vm)
-		return NULL;
+		goto uncharge;

 	buffer = kzalloc(size, GFP_KERNEL);
 	if (!buffer)
-		return NULL;
+		goto uncharge;

 	current->mm->total_vm  += pgsz;
 	current->mm->locked_vm += pgsz;
@@ -360,6 +364,9 @@
 	*pages = pgsz;

 	return buffer;
+uncharge:
+	memrlimit_cgroup_uncharge_as(current->mm, pgsz << PAGE_SHIFT);
+	return NULL;
 }

 static int ds_request(struct task_struct *task, void *base, size_t size,

--
	Balbir
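A detail worth keeping straight in the interfaces above: memrlimit_cgroup_charge_as() takes a count of *pages* and converts to bytes internally with `<< PAGE_SHIFT` before hitting the res_counter. A toy illustration of that unit contract (hypothetical helper names, and a fixed 4 KiB page size assumed purely for the sketch):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assume 4 KiB pages for this sketch */

/* Toy address-space counter; the real res_counter tracks bytes. */
struct as_counter {
	unsigned long long usage_bytes;
	unsigned long long limit_bytes;
};

/*
 * Mirrors the shape of memrlimit_cgroup_charge_as(): the caller passes
 * pages, the conversion to bytes happens exactly once, in here.
 */
static int charge_as(struct as_counter *c, unsigned long nr_pages)
{
	unsigned long long bytes = (unsigned long long)nr_pages << PAGE_SHIFT;

	if (c->usage_bytes + bytes > c->limit_bytes)
		return -1;	/* over the cgroup's address-space limit */
	c->usage_bytes += bytes;
	return 0;
}

static void uncharge_as(struct as_counter *c, unsigned long nr_pages)
{
	c->usage_bytes -= (unsigned long long)nr_pages << PAGE_SHIFT;
}
```

Because the conversion lives inside the helper, a caller that pre-shifts its argument (passing bytes where pages are expected) would be charged PAGE_SIZE times too much; keeping the unit in the parameter name, as the kernel code does with `nr_pages`, is a cheap defense against that.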
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-19  6:38 ` Balbir Singh
@ 2008-09-19 20:14 ` Andrew Morton
  2008-09-19 21:28 ` Balbir Singh
  0 siblings, 1 reply; 27+ messages in thread

From: Andrew Morton @ 2008-09-19 20:14 UTC (permalink / raw)
To: balbir
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, mingo

On Thu, 18 Sep 2008 23:38:23 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Andrew Morton <akpm@linux-foundation.org> [2008-09-18 13:54:30]:
>
> > On Wed, 14 May 2008 18:39:51 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > This patch adds support for accounting and control of virtual address space
> > > limits.
> >
> > Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
> > the memrlimit patches.
> >
> > I decided to retain the patches because it looks repairable. The
> > problem is this reject from
> > memrlimit-add-memrlimit-controller-accounting-and-control.patch:
>
> Andrew,
>
> I could not apply mmotm to linux-next (both downloaded right now).

mmotm includes linux-next.patch. mmotm is based upon the most recent
2.6.x-rcy. This is the only way to do it - I often have to change
linux-next.patch due to rejects and it's unreasonable to expect people
to base off the same version of linux-next as I did.

> I applied the patches one-by-one resolving differences starting from #mm
> in the series file.
>
> Here is my fixed version of the patch, I compiled the patch, but could
> not run it, since I could not create the full series of applied
> patches. I compiled arch/x86/kernel/ds.o and ptrace.o. I've included
> the patch below, please let me know if the code looks OK (via review)
> and the patch applies. I'll test it once I can reasonably resolve all
> conflicts between linux-next and mmotm.

OK, we'll give it a shot, thanks.
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-19 20:14 ` Andrew Morton
@ 2008-09-19 21:28 ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread

From: Balbir Singh @ 2008-09-19 21:28 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, mingo

Andrew Morton wrote:
> On Thu, 18 Sep 2008 23:38:23 -0700
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> I could not apply mmotm to linux-next (both downloaded right now).
>
> mmotm includes linux-next.patch. mmotm is based upon the most recent
> 2.6.x-rcy.

Thanks for the info.

[snip]

> OK, we'll give it a shot, thanks.

Thanks!

--
Balbir
end of thread, other threads: [~2008-09-19 21:28 UTC | newest]

Thread overview: 27+ messages
2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh
2008-05-15  1:20   ` Li Zefan
2008-05-15 18:22   ` Avi Kivity
2008-05-15 18:39     ` Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh
2008-05-14 13:29   ` Pavel Emelyanov
2008-05-14 13:09 ` [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh
2008-05-14 13:25   ` Balbir Singh
2008-05-15  2:25     ` Paul Menage
2008-05-15  6:17       ` Balbir Singh
2008-05-15  6:55         ` Paul Menage
2008-05-15  7:03           ` Balbir Singh
2008-05-15  7:39             ` Paul Menage
2008-05-15  8:25               ` Balbir Singh
2008-05-15 15:28                 ` Paul Menage
2008-05-15 17:01                   ` Balbir Singh
2008-05-17 20:15                     ` Balbir Singh
2008-05-17 20:17                       ` Balbir Singh
2008-05-14 13:32   ` Pavel Emelyanov
2008-05-14 19:39     ` Balbir Singh
2008-09-18 20:54   ` Andrew Morton
2008-09-18 23:55     ` Balbir Singh
2008-09-19  6:38     ` Balbir Singh
2008-09-19 20:14       ` Andrew Morton
2008-09-19 21:28         ` Balbir Singh