* [-mm][PATCH 0/4] Add memrlimit controller (v4)
@ 2008-05-14 13:09 Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw)
To: linux-mm
Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel,
Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki
This is the fourth version of the address space control patches. These
patches are against 2.6.26-rc2-mm1 and have been tested using KVM in SMP mode,
both with and without the config enabled, on a powerpc box and using UML.
The patches have also been compile tested with the config disabled on a
powerpc box.
The goal of this patchset is to implement a virtual address space controller
using cgroups. The documentation describes the controller, its goals and
usage in further detail.
The first patch adds the controller documentation. The second patch sets up
the controller and its user interface. The third patch fixes the
cgroup mm_owner_changed callback to pass the task struct, so that
accounting can be adjusted on owner changes. The fourth patch adds accounting
and control.
An earlier post of the patchset can be found at
http://lwn.net/Articles/275143/
This patch is built on top of the mm owner patches and utilizes that feature
to virtually group tasks by mm_struct.
Reviews, Comments?
Changelog
1. Add documentation upfront
2. Refactor the code (error handling and changes to improve the code)
3. Protect reading of total_vm with mmap_sem
4. Port to 2.6.26-rc2
5. Changed the name from rlimit to memrlimit
Series
memrlimit-controller-add-documentation.patch
memrlimit-controller-setup.patch
cgroup-add-task-to-mm-owner-callbacks.patch
memrlimit-controller-address-space-accounting-and-control.patch
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 27+ messages in thread* [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-15 1:20 ` Li Zefan 2008-05-15 18:22 ` Avi Kivity 2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh ` (2 subsequent siblings) 3 siblings, 2 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki Documentation patch - describes the goals and usage of the memrlimit controller. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Documentation/controllers/memrlimit.txt | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff -puN /dev/null Documentation/controllers/memrlimit.txt --- /dev/null 2008-05-14 04:27:30.032276540 +0530 +++ linux-2.6.26-rc2-balbir/Documentation/controllers/memrlimit.txt 2008-05-14 18:35:55.000000000 +0530 @@ -0,0 +1,29 @@ +This controller is enabled by the CONFIG_CGROUP_MEMRLIMIT_CTLR option. Prior +to reading this documentation please read Documentation/cgroups.txt and +Documentation/controllers/memory.txt. Several of the principles of this +controller are similar to the memory resource controller. + +This controller framework is designed to be extensible to control any +memory resource limit with little effort. + +This new controller, controls the address space expansion of the tasks +belonging to a cgroup. Address space control is provided along the same lines as +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2). +The interface for controlling address space is provided through +"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user +interface. 
Please see section 3 of the memory resource controller documentation +for more details on how to use the user interface to get and set values. + +The "memrlimit.usage_in_bytes" file provides information about the total address +space usage of the tasks in the cgroup, in bytes. + +Advantages of providing this feature + +1. Control over virtual address space allows for a cgroup to fail gracefully + i.e., via a malloc or mmap failure as compared to OOM kill when no + pages can be reclaimed. +2. It provides better control over how many pages can be swapped out when + the cgroup goes over its limit. A badly setup cgroup can cause excessive + swapping. Providing control over the address space allocations ensures + that the system administrator has control over the total swapping that + can take place. _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh @ 2008-05-15 1:20 ` Li Zefan 2008-05-15 18:22 ` Avi Kivity 1 sibling, 0 replies; 27+ messages in thread From: Li Zefan @ 2008-05-15 1:20 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki Balbir Singh wrote: > Documentation patch - describes the goals and usage of the memrlimit > controller. > > > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> > --- > > Documentation/controllers/memrlimit.txt | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff -puN /dev/null Documentation/controllers/memrlimit.txt > --- /dev/null 2008-05-14 04:27:30.032276540 +0530 > +++ linux-2.6.26-rc2-balbir/Documentation/controllers/memrlimit.txt 2008-05-14 18:35:55.000000000 +0530 > @@ -0,0 +1,29 @@ > +This controller is enabled by the CONFIG_CGROUP_MEMRLIMIT_CTLR option. Prior > +to reading this documentation please read Documentation/cgroups.txt and > +Documentation/controllers/memory.txt. Several of the principles of this > +controller are similar to the memory resource controller. > + > +This controller framework is designed to be extensible to control any > +memory resource limit with little effort. > + > +This new controller, controls the address space expansion of the tasks > +belonging to a cgroup. Address space control is provided along the same lines as > +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2). > +The interface for controlling address space is provided through > +"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user memrlimit.limit_in_bytes > +interface. Please see section 3 of the memory resource controller documentation > +for more details on how to use the user interface to get and set values. 
> + > +The "memrlimit.usage_in_bytes" file provides information about the total address > +space usage of the tasks in the cgroup, in bytes. > + > +Advantages of providing this feature > + > +1. Control over virtual address space allows for a cgroup to fail gracefully > + i.e., via a malloc or mmap failure as compared to OOM kill when no > + pages can be reclaimed. > +2. It provides better control over how many pages can be swapped out when > + the cgroup goes over its limit. A badly setup cgroup can cause excessive > + swapping. Providing control over the address space allocations ensures > + that the system administrator has control over the total swapping that > + can take place. > _ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh 2008-05-15 1:20 ` Li Zefan @ 2008-05-15 18:22 ` Avi Kivity 2008-05-15 18:39 ` Balbir Singh 1 sibling, 1 reply; 27+ messages in thread From: Avi Kivity @ 2008-05-15 18:22 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki Balbir Singh wrote: > + > +Advantages of providing this feature > + > +1. Control over virtual address space allows for a cgroup to fail gracefully > + i.e., via a malloc or mmap failure as compared to OOM kill when no > + pages can be reclaimed. > Do you mean by this, limiting the number of pagetable pages (that are pinned in memory), this preventing oom by a cgroup that instantiates many pagetables? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) 2008-05-15 18:22 ` Avi Kivity @ 2008-05-15 18:39 ` Balbir Singh 0 siblings, 0 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-15 18:39 UTC (permalink / raw) To: Avi Kivity Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki Avi Kivity wrote: > Balbir Singh wrote: >> + >> +Advantages of providing this feature >> + >> +1. Control over virtual address space allows for a cgroup to fail >> gracefully >> + i.e., via a malloc or mmap failure as compared to OOM kill when no >> + pages can be reclaimed. >> > > Do you mean by this, limiting the number of pagetable pages (that are > pinned in memory), this preventing oom by a cgroup that instantiates > many pagetables? > > This is not for page tables (that is in the long term TODO list). This is more for user space calls to mmap(), malloc() or anything that causes the total virtual memory of the process to go up (in our case cgroups). The motivation is similar to the motivations of RLIMIT_AS. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [-mm][PATCH 2/4] Setup the memrlimit controller (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-14 13:29 ` Pavel Emelyanov 2008-05-14 13:09 ` [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh 3 siblings, 1 reply; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki This patch sets up the rlimit cgroup controller. It adds the basic create, destroy and populate functionality. The user interface provided is very similar to the memory resource controller. The rlimit controller can be enhanced easily in the future to control mlocked pages. Changelog v3->v4 1. Use PAGE_ALIGN() 2. 
Rename rlimit to memrlimit Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/cgroup_subsys.h | 4 + include/linux/memrlimitcgroup.h | 19 +++++ init/Kconfig | 10 ++ mm/Makefile | 1 mm/memrlimitcgroup.c | 144 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 178 insertions(+) diff -puN include/linux/cgroup_subsys.h~memrlimit-controller-setup include/linux/cgroup_subsys.h --- linux-2.6.26-rc2/include/linux/cgroup_subsys.h~memrlimit-controller-setup 2008-05-14 18:36:36.000000000 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/cgroup_subsys.h 2008-05-14 18:36:36.000000000 +0530 @@ -47,4 +47,8 @@ SUBSYS(mem_cgroup) SUBSYS(devices) #endif +#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR +SUBSYS(memrlimit_cgroup) +#endif + /* */ diff -puN /dev/null include/linux/memrlimitcgroup.h --- /dev/null 2008-05-14 04:27:30.032276540 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h 2008-05-14 18:36:36.000000000 +0530 @@ -0,0 +1,19 @@ +/* + * Copyright A(C) International Business Machines Corp., 2008 + * + * Author: Balbir Singh <balbir@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ +#ifndef LINUX_MEMRLIMITCGROUP_H +#define LINUX_MEMRLIMITCGROUP_H + +#endif /* LINUX_MEMRLIMITCGROUP_H */ diff -puN init/Kconfig~memrlimit-controller-setup init/Kconfig --- linux-2.6.26-rc2/init/Kconfig~memrlimit-controller-setup 2008-05-14 18:36:36.000000000 +0530 +++ linux-2.6.26-rc2-balbir/init/Kconfig 2008-05-14 18:36:36.000000000 +0530 @@ -407,6 +407,16 @@ config CGROUP_MEM_RES_CTLR This config option also selects MM_OWNER config option, which could in turn add some fork/exit overhead. +config CGROUP_MEMRLIMIT_CTLR + bool "Memory resource limit controls for cgroups" + depends on CGROUPS && RESOURCE_COUNTERS && MMU + select MM_OWNER + help + Provides resource limits for all the tasks belonging to a + control group. CGROUP_MEM_RES_CTLR provides support for physical + memory RSS and Page Cache control. Virtual address space control + is provided by this controller. + config SYSFS_DEPRECATED bool diff -puN mm/Makefile~memrlimit-controller-setup mm/Makefile --- linux-2.6.26-rc2/mm/Makefile~memrlimit-controller-setup 2008-05-14 18:36:36.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/Makefile 2008-05-14 18:36:36.000000000 +0530 @@ -34,4 +34,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o diff -puN /dev/null mm/memrlimitcgroup.c --- /dev/null 2008-05-14 04:27:30.032276540 +0530 +++ linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c 2008-05-14 18:36:36.000000000 +0530 @@ -0,0 +1,144 @@ +/* + * Copyright A(C) International Business Machines Corp., 2008 + * + * Author: Balbir Singh <balbir@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Provide memory resource limits for tasks in a control group. A lot of code is + * duplicated from the memory controller (this code is common to almost + * all controllers). TODO: Consider writing a tool that can generate this + * code. + */ +#include <linux/cgroup.h> +#include <linux/mm.h> +#include <linux/smp.h> +#include <linux/rcupdate.h> +#include <linux/slab.h> +#include <linux/swap.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/res_counter.h> +#include <linux/memrlimitcgroup.h> + +struct cgroup_subsys memrlimit_cgroup_subsys; + +struct memrlimit_cgroup { + struct cgroup_subsys_state css; + struct res_counter as_res; /* address space counter */ +}; + +static struct memrlimit_cgroup init_memrlimit_cgroup; + +static struct memrlimit_cgroup *memrlimit_cgroup_from_cgrp(struct cgroup *cgrp) +{ + return container_of(cgroup_subsys_state(cgrp, + memrlimit_cgroup_subsys_id), + struct memrlimit_cgroup, css); +} + +static struct cgroup_subsys_state * +memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) +{ + struct memrlimit_cgroup *memrcg; + + if (unlikely(cgrp->parent == NULL)) + memrcg = &init_memrlimit_cgroup; + else { + memrcg = kzalloc(sizeof(*memrcg), GFP_KERNEL); + if (!memrcg) + return ERR_PTR(-ENOMEM); + } + res_counter_init(&memrcg->as_res); + return &memrcg->css; +} + +static void memrlimit_cgroup_destroy(struct cgroup_subsys *ss, + struct cgroup *cgrp) +{ + kfree(memrlimit_cgroup_from_cgrp(cgrp)); +} + +static int memrlimit_cgroup_reset(struct cgroup *cgrp, unsigned int event) +{ + struct memrlimit_cgroup *memrcg; + + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + switch (event) { + case RES_FAILCNT: + res_counter_reset_failcnt(&memrcg->as_res); + break; + } + 
return 0; +} + +static u64 memrlimit_cgroup_read(struct cgroup *cgrp, struct cftype *cft) +{ + return res_counter_read_u64(&memrlimit_cgroup_from_cgrp(cgrp)->as_res, + cft->private); +} + +static int memrlimit_cgroup_write_strategy(char *buf, unsigned long long *tmp) +{ + *tmp = memparse(buf, &buf); + if (*buf != '\0') + return -EINVAL; + + *tmp = PAGE_ALIGN(*tmp); + return 0; +} + +static ssize_t memrlimit_cgroup_write(struct cgroup *cgrp, struct cftype *cft, + struct file *file, + const char __user *userbuf, + size_t nbytes, + loff_t *ppos) +{ + return res_counter_write(&memrlimit_cgroup_from_cgrp(cgrp)->as_res, + cft->private, userbuf, nbytes, ppos, + memrlimit_cgroup_write_strategy); +} + +static struct cftype memrlimit_cgroup_files[] = { + { + .name = "usage_in_bytes", + .private = RES_USAGE, + .read_u64 = memrlimit_cgroup_read, + }, + { + .name = "limit_in_bytes", + .private = RES_LIMIT, + .write = memrlimit_cgroup_write, + .read_u64 = memrlimit_cgroup_read, + }, + { + .name = "failcnt", + .private = RES_FAILCNT, + .trigger = memrlimit_cgroup_reset, + .read_u64 = memrlimit_cgroup_read, + }, +}; + +static int memrlimit_cgroup_populate(struct cgroup_subsys *ss, + struct cgroup *cgrp) +{ + return cgroup_add_files(cgrp, ss, memrlimit_cgroup_files, + ARRAY_SIZE(memrlimit_cgroup_files)); +} + +struct cgroup_subsys memrlimit_cgroup_subsys = { + .name = "memrlimit", + .subsys_id = memrlimit_cgroup_subsys_id, + .create = memrlimit_cgroup_create, + .destroy = memrlimit_cgroup_destroy, + .populate = memrlimit_cgroup_populate, + .early_init = 0, +}; _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 2/4] Setup the memrlimit controller (v4) 2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh @ 2008-05-14 13:29 ` Pavel Emelyanov 0 siblings, 0 replies; 27+ messages in thread From: Pavel Emelyanov @ 2008-05-14 13:29 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki Balbir Singh wrote: > This patch sets up the rlimit cgroup controller. It adds the basic create, > destroy and populate functionality. The user interface provided is very > similar to the memory resource controller. The rlimit controller can be > enhanced easily in the future to control mlocked pages. > > Changelog v3->v4 > > 1. Use PAGE_ALIGN() > 2. Rename rlimit to memrlimit > > > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh 3 siblings, 0 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki This patch adds an additional field to the mm_owner callbacks. This field is required to get to the mm that changed. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/cgroup.h | 3 ++- kernel/cgroup.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks include/linux/cgroup.h --- linux-2.6.26-rc2/include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks 2008-05-14 18:36:59.000000000 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/cgroup.h 2008-05-14 18:36:59.000000000 +0530 @@ -310,7 +310,8 @@ struct cgroup_subsys { */ void (*mm_owner_changed)(struct cgroup_subsys *ss, struct cgroup *old, - struct cgroup *new); + struct cgroup *new, + struct task_struct *p); int subsys_id; int active; int disabled; diff -puN kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks kernel/cgroup.c --- linux-2.6.26-rc2/kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks 2008-05-14 18:36:59.000000000 +0530 +++ linux-2.6.26-rc2-balbir/kernel/cgroup.c 2008-05-14 18:36:59.000000000 +0530 @@ -2772,7 +2772,7 @@ void cgroup_mm_owner_callbacks(struct ta if (oldcgrp == newcgrp) continue; if (ss->mm_owner_changed) - 
ss->mm_owner_changed(ss, oldcgrp, newcgrp); + ss->mm_owner_changed(ss, oldcgrp, newcgrp, new); } } } _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh ` (2 preceding siblings ...) 2008-05-14 13:09 ` [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) Balbir Singh @ 2008-05-14 13:09 ` Balbir Singh 2008-05-14 13:25 ` Balbir Singh ` (2 more replies) 3 siblings, 3 replies; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:09 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Balbir Singh, Andrew Morton, KAMEZAWA Hiroyuki This patch adds support for accounting and control of virtual address space limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions. The core of the accounting takes place during fork time in copy_process(), may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special cases that are handled here as well (arch/ia64/kernel/perform.c, arch/x86/kernel/ptrace.c, insert_special_mapping()) Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- arch/ia64/kernel/perfmon.c | 6 ++ arch/x86/kernel/ptrace.c | 17 +++++-- fs/exec.c | 5 ++ include/linux/memrlimitcgroup.h | 21 ++++++++ kernel/fork.c | 8 +++ mm/memrlimitcgroup.c | 94 ++++++++++++++++++++++++++++++++++++++++ mm/mmap.c | 11 ++++ 7 files changed, 157 insertions(+), 5 deletions(-) diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c --- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c 2008-05-14 18:09:32.000000000 +0530 @@ -40,6 +40,7 @@ #include <linux/capability.h> #include <linux/rcupdate.h> #include <linux/completion.h> +#include <linux/memrlimitcgroup.h> #include <asm/errno.h> #include 
<asm/intrinsics.h> @@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size)); + if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT)) + return -ENOMEM; + /* * check requested size to avoid Denial-of-service attacks * XXX: may have to refine this test @@ -2313,6 +2317,7 @@ pfm_smpl_buffer_alloc(struct task_struct smpl_buf = pfm_rvmalloc(size); if (smpl_buf == NULL) { DPRINT(("Can't allocate sampling buffer\n")); + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); return -ENOMEM; } @@ -2390,6 +2395,7 @@ pfm_smpl_buffer_alloc(struct task_struct return 0; error: + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); kmem_cache_free(vm_area_cachep, vma); error_kmem: pfm_rvfree(smpl_buf, size); diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c --- linux-2.6.26-rc2/fs/exec.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/fs/exec.c 2008-05-14 18:09:32.000000000 +0530 @@ -52,6 +52,7 @@ #include <linux/tsacct_kern.h> #include <linux/cn_proc.h> #include <linux/audit.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -230,6 +231,9 @@ static int __bprm_mm_init(struct linux_b if (!vma) goto err; + if (memrlimit_cgroup_charge_as(mm, 1)) + goto err; + down_write(&mm->mmap_sem); vma->vm_mm = mm; @@ -247,6 +251,7 @@ static int __bprm_mm_init(struct linux_b err = insert_vm_struct(mm, vma); if (err) { up_write(&mm->mmap_sem); + memrlimit_cgroup_uncharge_as(mm, 1); goto err; } diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h --- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 
+0530 +++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h 2008-05-14 18:09:32.000000000 +0530 @@ -16,4 +16,25 @@ #ifndef LINUX_MEMRLIMITCGROUP_H #define LINUX_MEMRLIMITCGROUP_H +#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages); +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages); + +#else /* !CONFIG_CGROUP_RLIMIT_CTLR */ + +static inline int +memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + return 0; +} + +static inline void +memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ +} + +#endif /* CONFIG_CGROUP_RLIMIT_CTLR */ + + #endif /* LINUX_MEMRLIMITCGROUP_H */ diff -puN kernel/fork.c~memrlimit-controller-address-space-accounting-and-control kernel/fork.c --- linux-2.6.26-rc2/kernel/fork.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/kernel/fork.c 2008-05-14 18:09:32.000000000 +0530 @@ -54,6 +54,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/memrlimitcgroup.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -267,6 +268,7 @@ static int dup_mmap(struct mm_struct *mm mm->total_vm -= pages; vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file, -pages); + memrlimit_cgroup_uncharge_as(mm, pages); continue; } charge = 0; @@ -596,6 +598,12 @@ static int copy_mm(unsigned long clone_f atomic_inc(&oldmm->mm_users); mm = oldmm; goto good_mm; + } else { + down_read(&oldmm->mmap_sem); + retval = memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm); + up_read(&oldmm->mmap_sem); + if (retval) + goto fail_nomem; } retval = -ENOMEM; diff -puN mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control mm/memrlimitcgroup.c --- linux-2.6.26-rc2/mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ 
linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c 2008-05-14 18:09:32.000000000 +0530 @@ -45,6 +45,41 @@ static struct memrlimit_cgroup *memrlimi struct memrlimit_cgroup, css); } +static struct memrlimit_cgroup * +memrlimit_cgroup_from_task(struct task_struct *p) +{ + return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id), + struct memrlimit_cgroup, css); +} + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + int ret; + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + css_get(&memrcg->css); + rcu_read_unlock(); + + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); + return ret; +} + +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + css_get(&memrcg->css); + rcu_read_unlock(); + + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); +} + static struct cgroup_subsys_state * memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str ARRAY_SIZE(memrlimit_cgroup_files)); } +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct mm_struct *mm; + struct memrlimit_cgroup *memrcg, *old_memrcg; + + mm = get_task_mm(p); + if (mm == NULL) + return; + + rcu_read_lock(); + if (p != rcu_dereference(mm->owner)) + goto out; + + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + if (memrcg == old_memrcg) + goto out; + + /* + * Hold mmap_sem, so that total_vm does not change underneath us + */ + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + 
res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + rcu_read_unlock(); + mmput(mm); +} + +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct memrlimit_cgroup *memrcg, *old_memrcg; + struct mm_struct *mm = get_task_mm(p); + + BUG_ON(!mm); + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + + mmput(mm); +} + struct cgroup_subsys memrlimit_cgroup_subsys = { .name = "memrlimit", .subsys_id = memrlimit_cgroup_subsys_id, .create = memrlimit_cgroup_create, .destroy = memrlimit_cgroup_destroy, .populate = memrlimit_cgroup_populate, + .attach = memrlimit_cgroup_move_task, + .mm_owner_changed = memrlimit_cgroup_mm_owner_changed, .early_init = 0, }; diff -puN mm/mmap.c~memrlimit-controller-address-space-accounting-and-control mm/mmap.c --- linux-2.6.26-rc2/mm/mmap.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/mmap.c 2008-05-14 18:09:32.000000000 +0530 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -1730,6 +1731,7 @@ static void remove_vma_list(struct mm_st long nrpages = vma_pages(vma); mm->total_vm -= nrpages; + memrlimit_cgroup_uncharge_as(mm, nrpages); if (vma->vm_flags & VM_LOCKED) mm->locked_vm -= nrpages; vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages); @@ -2056,6 +2058,7 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, 
&nr_accounted, NULL); vm_unacct_memory(nr_accounted); + memrlimit_cgroup_uncharge_as(mm, mm->total_vm); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); @@ -2174,6 +2177,10 @@ int may_expand_vm(struct mm_struct *mm, if (cur + npages > lim) return 0; + + if (memrlimit_cgroup_charge_as(mm, npages)) + return 0; + return 1; } @@ -2236,6 +2243,9 @@ int install_special_mapping(struct mm_st if (unlikely(vma == NULL)) return -ENOMEM; + if (memrlimit_cgroup_charge_as(mm, len >> PAGE_SHIFT)) + return -ENOMEM; + vma->vm_mm = mm; vma->vm_start = addr; vma->vm_end = addr + len; @@ -2248,6 +2258,7 @@ int install_special_mapping(struct mm_st if (unlikely(insert_vm_struct(mm, vma))) { kmem_cache_free(vm_area_cachep, vma); + memrlimit_cgroup_uncharge_as(mm, len >> PAGE_SHIFT); return -ENOMEM; } diff -puN arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ptrace.c --- linux-2.6.26-rc2/arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:09:32.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/x86/kernel/ptrace.c 2008-05-14 18:09:32.000000000 +0530 @@ -20,6 +20,7 @@ #include <linux/audit.h> #include <linux/seccomp.h> #include <linux/signal.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -782,21 +783,25 @@ static int ptrace_bts_realloc(struct tas current->mm->total_vm -= old_size; current->mm->locked_vm -= old_size; + memrlimit_cgroup_uncharge_as(mm, old_size); if (size == 0) goto out; + if (memrlimit_cgroup_charge_as(current->mm, size)) + goto out; + rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT; vm = current->mm->total_vm + size; if (rlim < vm) { ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->total_vm; if (size <= 0) - goto out; + goto out_uncharge; } rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; @@ -805,21 +810,23 @@ static int 
ptrace_bts_realloc(struct tas ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->locked_vm; if (size <= 0) - goto out; + goto out_uncharge; } ret = ds_allocate((void **)&child->thread.ds_area_msr, size << PAGE_SHIFT); if (ret < 0) - goto out; + goto out_uncharge; current->mm->total_vm += size; current->mm->locked_vm += size; +out_uncharge: + memrlimit_cgroup_uncharge_as(mm, size); out: if (child->thread.ds_area_msr) set_tsk_thread_flag(child, TIF_DS_AREA_MSR); _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh @ 2008-05-14 13:25 ` Balbir Singh 2008-05-15 2:25 ` Paul Menage 2008-05-14 13:32 ` Pavel Emelyanov 2008-09-18 20:54 ` Andrew Morton 2 siblings, 1 reply; 27+ messages in thread From: Balbir Singh @ 2008-05-14 13:25 UTC (permalink / raw) To: linux-mm Cc: Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki * Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-14 18:39:51]: Here's a better version of the patch. The ptrace.c changes had a bug where uncharge was called unconditionally; we now check for ret < 0. The two signed-off-by's by the same person did not look good either :) Description This patch adds support for accounting and control of virtual address space limits. The accounting is done via the memrlimit_cgroup_(un)charge_as functions. The core of the accounting takes place at fork time in copy_process(), and in may_expand_vm(), remove_vma_list() and exit_mmap().
There are some special cases that are handled here as well (arch/ia64/kernel/perfmon.c, arch/x86/kernel/ptrace.c, install_special_mapping()) Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- arch/ia64/kernel/perfmon.c | 6 ++ arch/x86/kernel/ptrace.c | 18 +++++-- fs/exec.c | 5 ++ include/linux/memrlimitcgroup.h | 21 ++++++++ kernel/fork.c | 8 +++ mm/memrlimitcgroup.c | 94 ++++++++++++++++++++++++++++++++++++++++ mm/mmap.c | 11 ++++ 7 files changed, 158 insertions(+), 5 deletions(-) diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c --- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c 2008-05-14 18:37:11.000000000 +0530 @@ -40,6 +40,7 @@ #include <linux/capability.h> #include <linux/rcupdate.h> #include <linux/completion.h> +#include <linux/memrlimitcgroup.h> #include <asm/errno.h> #include <asm/intrinsics.h> @@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size)); + if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT)) + return -ENOMEM; + /* * check requested size to avoid Denial-of-service attacks * XXX: may have to refine this test @@ -2313,6 +2317,7 @@ pfm_smpl_buffer_alloc(struct task_struct smpl_buf = pfm_rvmalloc(size); if (smpl_buf == NULL) { DPRINT(("Can't allocate sampling buffer\n")); + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); + return -ENOMEM; } @@ -2390,6 +2395,7 @@ pfm_smpl_buffer_alloc(struct task_struct return 0; error: + memrlimit_cgroup_uncharge_as(mm, size >> PAGE_SHIFT); kmem_cache_free(vm_area_cachep, vma); error_kmem: pfm_rvfree(smpl_buf, size); diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c
--- linux-2.6.26-rc2/fs/exec.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/fs/exec.c 2008-05-14 18:37:11.000000000 +0530 @@ -52,6 +52,7 @@ #include <linux/tsacct_kern.h> #include <linux/cn_proc.h> #include <linux/audit.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -230,6 +231,9 @@ static int __bprm_mm_init(struct linux_b if (!vma) goto err; + if (memrlimit_cgroup_charge_as(mm, 1)) + goto err; + down_write(&mm->mmap_sem); vma->vm_mm = mm; @@ -247,6 +251,7 @@ static int __bprm_mm_init(struct linux_b err = insert_vm_struct(mm, vma); if (err) { up_write(&mm->mmap_sem); + memrlimit_cgroup_uncharge_as(mm, 1); goto err; } diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h --- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h 2008-05-14 18:37:11.000000000 +0530 @@ -16,4 +16,25 @@ #ifndef LINUX_MEMRLIMITCGROUP_H #define LINUX_MEMRLIMITCGROUP_H +#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages); +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages); + +#else /* !CONFIG_CGROUP_MEMRLIMIT_CTLR */ + +static inline int +memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + return 0; +} + +static inline void +memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ +} + +#endif /* CONFIG_CGROUP_MEMRLIMIT_CTLR */ + + #endif /* LINUX_MEMRLIMITCGROUP_H */ diff -puN kernel/fork.c~memrlimit-controller-address-space-accounting-and-control kernel/fork.c --- linux-2.6.26-rc2/kernel/fork.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++
linux-2.6.26-rc2-balbir/kernel/fork.c 2008-05-14 18:37:11.000000000 +0530 @@ -54,6 +54,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/memrlimitcgroup.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -267,6 +268,7 @@ static int dup_mmap(struct mm_struct *mm mm->total_vm -= pages; vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file, -pages); + memrlimit_cgroup_uncharge_as(mm, pages); continue; } charge = 0; @@ -596,6 +598,12 @@ static int copy_mm(unsigned long clone_f atomic_inc(&oldmm->mm_users); mm = oldmm; goto good_mm; + } else { + down_read(&oldmm->mmap_sem); + retval = memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm); + up_read(&oldmm->mmap_sem); + if (retval) + goto fail_nomem; } retval = -ENOMEM; diff -puN mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control mm/memrlimitcgroup.c --- linux-2.6.26-rc2/mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c 2008-05-14 18:37:11.000000000 +0530 @@ -45,6 +45,41 @@ static struct memrlimit_cgroup *memrlimi struct memrlimit_cgroup, css); } +static struct memrlimit_cgroup * +memrlimit_cgroup_from_task(struct task_struct *p) +{ + return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id), + struct memrlimit_cgroup, css); +} + +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + int ret; + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + css_get(&memrcg->css); + rcu_read_unlock(); + + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); + return ret; +} + +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) +{ + struct memrlimit_cgroup *memrcg; + + rcu_read_lock(); + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); + 
css_get(&memrcg->css); + rcu_read_unlock(); + + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); + css_put(&memrcg->css); +} + static struct cgroup_subsys_state * memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) { @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str ARRAY_SIZE(memrlimit_cgroup_files)); } +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct mm_struct *mm; + struct memrlimit_cgroup *memrcg, *old_memrcg; + + mm = get_task_mm(p); + if (mm == NULL) + return; + + rcu_read_lock(); + if (p != rcu_dereference(mm->owner)) + goto out; + + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + if (memrcg == old_memrcg) + goto out; + + /* + * Hold mmap_sem, so that total_vm does not change underneath us + */ + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + rcu_read_unlock(); + mmput(mm); +} + +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss, + struct cgroup *cgrp, + struct cgroup *old_cgrp, + struct task_struct *p) +{ + struct memrlimit_cgroup *memrcg, *old_memrcg; + struct mm_struct *mm = get_task_mm(p); + + BUG_ON(!mm); + memrcg = memrlimit_cgroup_from_cgrp(cgrp); + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); + + down_read(&mm->mmap_sem); + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) + goto out; + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); +out: + up_read(&mm->mmap_sem); + + mmput(mm); +} + struct cgroup_subsys memrlimit_cgroup_subsys = { .name = "memrlimit", .subsys_id = memrlimit_cgroup_subsys_id, .create = memrlimit_cgroup_create, .destroy = memrlimit_cgroup_destroy, .populate = memrlimit_cgroup_populate, + 
.attach = memrlimit_cgroup_move_task, + .mm_owner_changed = memrlimit_cgroup_mm_owner_changed, .early_init = 0, }; diff -puN mm/mmap.c~memrlimit-controller-address-space-accounting-and-control mm/mmap.c --- linux-2.6.26-rc2/mm/mmap.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/mm/mmap.c 2008-05-14 18:37:11.000000000 +0530 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -1730,6 +1731,7 @@ static void remove_vma_list(struct mm_st long nrpages = vma_pages(vma); mm->total_vm -= nrpages; + memrlimit_cgroup_uncharge_as(mm, nrpages); if (vma->vm_flags & VM_LOCKED) mm->locked_vm -= nrpages; vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages); @@ -2056,6 +2058,7 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); + memrlimit_cgroup_uncharge_as(mm, mm->total_vm); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); @@ -2174,6 +2177,10 @@ int may_expand_vm(struct mm_struct *mm, if (cur + npages > lim) return 0; + + if (memrlimit_cgroup_charge_as(mm, npages)) + return 0; + return 1; } @@ -2236,6 +2243,9 @@ int install_special_mapping(struct mm_st if (unlikely(vma == NULL)) return -ENOMEM; + if (memrlimit_cgroup_charge_as(mm, len >> PAGE_SHIFT)) + return -ENOMEM; + vma->vm_mm = mm; vma->vm_start = addr; vma->vm_end = addr + len; @@ -2248,6 +2258,7 @@ int install_special_mapping(struct mm_st if (unlikely(insert_vm_struct(mm, vma))) { kmem_cache_free(vm_area_cachep, vma); + memrlimit_cgroup_uncharge_as(mm, len >> PAGE_SHIFT); return -ENOMEM; } diff -puN arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ptrace.c --- 
linux-2.6.26-rc2/arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control 2008-05-14 18:37:11.000000000 +0530 +++ linux-2.6.26-rc2-balbir/arch/x86/kernel/ptrace.c 2008-05-14 18:43:13.000000000 +0530 @@ -20,6 +20,7 @@ #include <linux/audit.h> #include <linux/seccomp.h> #include <linux/signal.h> +#include <linux/memrlimitcgroup.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -782,21 +783,25 @@ static int ptrace_bts_realloc(struct tas current->mm->total_vm -= old_size; current->mm->locked_vm -= old_size; + memrlimit_cgroup_uncharge_as(mm, old_size); if (size == 0) goto out; + if (memrlimit_cgroup_charge_as(current->mm, size)) + goto out; + rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT; vm = current->mm->total_vm + size; if (rlim < vm) { ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->total_vm; if (size <= 0) - goto out; + goto out_uncharge; } rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; @@ -805,21 +810,24 @@ static int ptrace_bts_realloc(struct tas ret = -ENOMEM; if (!reduce_size) - goto out; + goto out_uncharge; size = rlim - current->mm->locked_vm; if (size <= 0) - goto out; + goto out_uncharge; } ret = ds_allocate((void **)&child->thread.ds_area_msr, size << PAGE_SHIFT); if (ret < 0) - goto out; + goto out_uncharge; current->mm->total_vm += size; current->mm->locked_vm += size; +out_uncharge: + if (ret < 0) + memrlimit_cgroup_uncharge_as(mm, size); out: if (child->thread.ds_area_msr) set_tsk_thread_flag(child, TIF_DS_AREA_MSR); _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-14 13:25 ` Balbir Singh @ 2008-05-15 2:25 ` Paul Menage 2008-05-15 6:17 ` Balbir Singh 0 siblings, 1 reply; 27+ messages in thread From: Paul Menage @ 2008-05-15 2:25 UTC (permalink / raw) To: balbir, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki On Wed, May 14, 2008 at 6:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > + > +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) > +{ > + int ret; > + struct memrlimit_cgroup *memrcg; > + > + rcu_read_lock(); > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > + css_get(&memrcg->css); > + rcu_read_unlock(); > + > + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > + css_put(&memrcg->css); > + return ret; > +} Assuming that we're holding a write lock on mm->mmap_sem here, and we additionally hold mmap_sem for the whole of mm_update_next_owner(), then maybe we don't need any extra synchronization here? Something like simply: int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) { struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner); return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); } Seems good to minimize additional synchronization on the fast path. 
The only thing that's still broken is that the task_struct.cgroups pointer gets updated only under the synchronization of task_lock(), so we've still got the race of: A: attach_task() updates B->cgroups B: memrlimit_cgroup_charge_as() charges the new res counter and updates mm->total_vm A: memrlimit_cgroup_move_task() moves mm->total_vm from the old counter to the new counter Here's one way I see to fix this: We change attach_task() so that rather than updating the task_struct.cgroups pointer once from the original css_set to the final css_set, it goes through a series of intermediate css_set structures, one for each subsystem in the hierarchy, transitioning from the old set to the final set. Then for each subsystem ss, it would do: next_css = <old css with pointer for ss updated> if (ss->attach) { ss->attach(ss, p, next_css); } else { task_lock(p); rcu_assign_ptr(p->cgroups, next_css); task_unlock(p); } i.e. the subsystem would be free to implement any synchronization it desired in the attach() code. The attach() method's responsibility would be to ensure that p->cgroups was updated to point to next_css before returning. This should make it much simpler for a subsystem to handle potential races between attach() and accounting. The current semantics of can_attach()/update/attach() are sufficient for cpusets, but probably not for systems with more complex accounting. I'd still need to figure out a nice way to get the kind of transactional semantics that you want from can_attach(). 
> + > +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) > +{ > + struct memrlimit_cgroup *memrcg; > + > + rcu_read_lock(); > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > + css_get(&memrcg->css); > + rcu_read_unlock(); > + > + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > + css_put(&memrcg->css); > +} > + > static struct cgroup_subsys_state * > memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) > { > @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str > ARRAY_SIZE(memrlimit_cgroup_files)); > } > > +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, > + struct cgroup *cgrp, > + struct cgroup *old_cgrp, > + struct task_struct *p) > +{ > + struct mm_struct *mm; > + struct memrlimit_cgroup *memrcg, *old_memrcg; > + > + mm = get_task_mm(p); > + if (mm == NULL) > + return; > + > + rcu_read_lock(); > + if (p != rcu_dereference(mm->owner)) > + goto out; out: does up_read() on mmap_sem, which you don't currently hold. Can you add more comments about why you're using RCU here? You have a refcounted pointer on mm, so mm can't go away, and you never dereference the result of the rcu_dereference(mm->owner), so you're not protecting the validity of that pointer. The only way this could help would be if anything that changes mm->owner calls synchronize_rcu() before, say, doing accounting changes, and I don't believe that's the case. Would it be simpler to use task_lock(p) here to ensure that it doesn't lose its owner status (by exiting or execing) while we're moving it? 
i.e., something like: [ this assumes the new semantics of attach that I proposed above ] static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *p, struct css_set new_css_set) { struct mm_struct *mm; struct memrlimit_cgroup *memrcg, *old_memrcg; retry: mm = get_task_mm(p); if (mm == NULL) { task_lock(p); rcu_assign_ptr(p->cgroups, new_css_set); task_unlock(p); return; } /* Take mmap_sem to prevent address space changes */ down_read(&mm->mmap_sem); /* task_lock(p) to prevent mm ownership changes */ task_lock(p); if (p->mm != mm) { /* We raced */ task_unlock(p); up_read(&mm->mmap_sem); mmput(mm); goto retry; } if (p != mm->owner) goto out_assign; memrcg = memrlimit_cgroup_from_cgrp(cgrp); old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); if (memrcg == old_memrcg) goto out_assign; if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) goto out_assign; res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); out_assign: rcu_assign_ptr(p->cgroups, new_css_set); task_unlock(p); up_read(&mm->mmap_sem); mmput(mm); } > + > + memrcg = memrlimit_cgroup_from_cgrp(cgrp); > + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); > + > + if (memrcg == old_memrcg) > + goto out; mmap_sem is also not held here. > + > + /* > + * Hold mmap_sem, so that total_vm does not change underneath us > + */ > + down_read(&mm->mmap_sem); You can't block inside rcu_read_lock(). 
> + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) > + goto out; > + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); > +out: > + up_read(&mm->mmap_sem); > + rcu_read_unlock(); > + mmput(mm); > +} > + > +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss, > + struct cgroup *cgrp, > + struct cgroup *old_cgrp, > + struct task_struct *p) > +{ > + struct memrlimit_cgroup *memrcg, *old_memrcg; > + struct mm_struct *mm = get_task_mm(p); > + > + BUG_ON(!mm); > + memrcg = memrlimit_cgroup_from_cgrp(cgrp); > + old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp); > + > + down_read(&mm->mmap_sem); At this point we're holding p->alloc_lock, so we can't do a blocking down_read(). How about if we down_read(&mm->mmap_sem) in mm_update_next_owner() prior to taking tasklist_lock? That will ensure that mm ownership changes are synchronized against mmap/munmap operations on that mm, and since the cgroups-based owners are likely to be wanting to track the ownership changes for accounting purposes, this seems like an appropriate lock to hold. > + if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT))) > + goto out; > + res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT)); > +out: > + up_read(&mm->mmap_sem); > + > + mmput(mm); > +} > + -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) 2008-05-15 2:25 ` Paul Menage @ 2008-05-15 6:17 ` Balbir Singh 2008-05-15 6:55 ` Paul Menage 0 siblings, 1 reply; 27+ messages in thread From: Balbir Singh @ 2008-05-15 6:17 UTC (permalink / raw) To: Paul Menage Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki * Paul Menage <menage@google.com> [2008-05-14 19:25:07]: > On Wed, May 14, 2008 at 6:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > + > > +int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) > > +{ > > + int ret; > > + struct memrlimit_cgroup *memrcg; > > + > > + rcu_read_lock(); > > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > > + css_get(&memrcg->css); > > + rcu_read_unlock(); > > + > > + ret = res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > > + css_put(&memrcg->css); > > + return ret; > > +} > > Assuming that we're holding a write lock on mm->mmap_sem here, and we > additionally hold mmap_sem for the whole of mm_update_next_owner(), > then maybe we don't need any extra synchronization here? Something > like simply: > > int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages) > { > struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner); > return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > } The charge_as routine is not always called with mmap_sem held, since the undo path gets more complicated under the lock. We already have our own locking mechanism for the counters. We're not really accessing any member of the mm here except the owner. Do we need to be called with mmap_sem held? > > Seems good to minimize additional synchronization on the fast path. 
> > The only thing that's still broken is that the task_struct.cgroups > pointer gets updated only under the synchronization of task_lock(), so > we've still got the race of: > > A: attach_task() updates B->cgroups > > B: memrlimit_cgroup_charge_as() charges the new res counter and > updates mm->total_vm > > A: memrlimit_cgroup_move_task() moves mm->total_vm from the old > counter to the new counter > > Here's one way I see to fix this: > > We change attach_task() so that rather than updating the > task_struct.cgroups pointer once from the original css_set to the > final css_set, it goes through a series of intermediate css_set > structures, one for each subsystem in the hierarchy, transitioning > from the old set to the final set. Then for each subsystem ss, it > would do: > > next_css = <old css with pointer for ss updated> > if (ss->attach) { > ss->attach(ss, p, next_css); > } else { > task_lock(p); > rcu_assign_ptr(p->cgroups, next_css); > task_unlock(p); > } > > i.e. the subsystem would be free to implement any synchronization it > desired in the attach() code. The attach() method's responsibility > would be to ensure that p->cgroups was updated to point to next_css > before returning. This should make it much simpler for a subsystem to > handle potential races between attach() and accounting. The current > semantics of can_attach()/update/attach() are sufficient for cpusets, > but probably not for systems with more complex accounting. I'd still > need to figure out a nice way to get the kind of transactional > semantics that you want from can_attach(). > A transaction manager would be great. 
We do the mm_update_owner_changes under the task_lock(), may be the attach callback should do the same to ensure that > > + > > +void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages) > > +{ > > + struct memrlimit_cgroup *memrcg; > > + > > + rcu_read_lock(); > > + memrcg = memrlimit_cgroup_from_task(rcu_dereference(mm->owner)); > > + css_get(&memrcg->css); > > + rcu_read_unlock(); > > + > > + res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT)); > > + css_put(&memrcg->css); > > +} > > + > > static struct cgroup_subsys_state * > > memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp) > > { > > @@ -134,11 +169,70 @@ static int memrlimit_cgroup_populate(str > > ARRAY_SIZE(memrlimit_cgroup_files)); > > } > > > > +static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss, > > + struct cgroup *cgrp, > > + struct cgroup *old_cgrp, > > + struct task_struct *p) > > +{ > > + struct mm_struct *mm; > > + struct memrlimit_cgroup *memrcg, *old_memrcg; > > + > > + mm = get_task_mm(p); > > + if (mm == NULL) > > + return; > > + > > + rcu_read_lock(); > > + if (p != rcu_dereference(mm->owner)) > > + goto out; > > out: does up_read() on mmap_sem, which you don't currently hold. > Yes, good catch! > Can you add more comments about why you're using RCU here? > Sure, I will. One of things we want to ensure is that mm->owner does not go away. > You have a refcounted pointer on mm, so mm can't go away, and you > never dereference the result of the rcu_dereference(mm->owner), so > you're not protecting the validity of that pointer. The only way this > could help would be if anything that changes mm->owner calls > synchronize_rcu() before, say, doing accounting changes, and I don't > believe that's the case. > > Would it be simpler to use task_lock(p) here to ensure that it doesn't > lose its owner status (by exiting or execing) while we're moving it? > That's what I've been thinking as well, based on the discussion above. 
> i.e., something like: [ this assumes the new semantics of attach that
> I proposed above ]
>
> static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss,
>                                        struct cgroup *cgrp,
>                                        struct cgroup *old_cgrp,
>                                        struct task_struct *p,
>                                        struct css_set *new_css_set)
> {
>         struct mm_struct *mm;
>         struct memrlimit_cgroup *memrcg, *old_memrcg;
>
> retry:
>         mm = get_task_mm(p);
>         if (mm == NULL) {
>                 task_lock(p);
>                 rcu_assign_ptr(p->cgroups, new_css_set);

Will each callback assign p->cgroups to new_css_set?

>                 task_unlock(p);
>                 return;
>         }
>
>         /* Take mmap_sem to prevent address space changes */
>         down_read(&mm->mmap_sem);
>         /* task_lock(p) to prevent mm ownership changes */
>         task_lock(p);
>         if (p->mm != mm) {
>                 /* We raced */

With exit_mmap() or exec_mmap(), right?

>                 task_unlock(p);
>                 up_read(&mm->mmap_sem);
>                 mmput(mm);
>                 goto retry;
>         }
>         if (p != mm->owner)
>                 goto out_assign;
>
>         memrcg = memrlimit_cgroup_from_cgrp(cgrp);
>         old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
>
>         if (memrcg == old_memrcg)
>                 goto out_assign;
>
>         if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
>                 goto out_assign;
>         res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
> out_assign:
>         rcu_assign_ptr(p->cgroups, new_css_set);
>         task_unlock(p);
>         up_read(&mm->mmap_sem);
>         mmput(mm);
> }
>

Taking the mmap_sem here would mean we would need to document (something
I should have done earlier) that mmap_sem nests under cgroup_mutex.

Looking at the mm->owner patches, I am thinking of writing down a race
scenario card.

Race conditions
R1: mm->owner can change dynamically under task_lock
R2: mm->owner's cgroup can change under cgroup_mutex

                Read                    Write
mm->owner       Prevent                 hold task_lock
                cgroup from changing    Prevent owner from changing

Scenarios requiring protection/consistency
R1: causes no problem, since we expect to make the appropriate
    adjustment in mm_owner_changed()
R2: is handled by the attach() callback

Which leaves us with the following conclusion: we don't have
move_task(), mm_owner_changed() and charge/uncharge() running in
parallel at the same time.

If we agree with the assertion/conclusion above, then a simple lock
might be able to protect us, assuming that it does not create an
intertwined locking hierarchy.

> > +
> > +	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
> > +	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
> > +
> > +	if (memrcg == old_memrcg)
> > +		goto out;
>
> mmap_sem is also not held here.

Will fix.

> > +
> > +	/*
> > +	 * Hold mmap_sem, so that total_vm does not change underneath us
> > +	 */
> > +	down_read(&mm->mmap_sem);
>
> You can't block inside rcu_read_lock().

Good catch!

> > +	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
> > +		goto out;
> > +	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
> > +out:
> > +	up_read(&mm->mmap_sem);
> > +	rcu_read_unlock();
> > +	mmput(mm);
> > +}
> > +
> > +static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss,
> > +					      struct cgroup *cgrp,
> > +					      struct cgroup *old_cgrp,
> > +					      struct task_struct *p)
> > +{
> > +	struct memrlimit_cgroup *memrcg, *old_memrcg;
> > +	struct mm_struct *mm = get_task_mm(p);
> > +
> > +	BUG_ON(!mm);
> > +	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
> > +	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
> > +
> > +	down_read(&mm->mmap_sem);
>
> At this point we're holding p->alloc_lock, so we can't do a blocking
> down_read().

Good catch!

> How about if we down_read(&mm->mmap_sem) in mm_update_next_owner()
> prior to taking tasklist_lock? That will ensure that mm ownership
> changes are synchronized against mmap/munmap operations on that mm,
> and since the cgroups-based owners are likely to be wanting to track
> the ownership changes for accounting purposes, this seems like an
> appropriate lock to hold.

Hmmm.. may be worth doing.

> > +	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
> > +		goto out;
> > +	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
> > +out:
> > +	up_read(&mm->mmap_sem);
> > +
> > +	mmput(mm);
> > +}
> > +

Thanks for the detailed review.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  6:17       ` Balbir Singh
@ 2008-05-15  6:55         ` Paul Menage
  2008-05-15  7:03           ` Balbir Singh
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Menage @ 2008-05-15  6:55 UTC (permalink / raw)
To: balbir, Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

On Wed, May 14, 2008 at 11:17 PM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
> >
> > Assuming that we're holding a write lock on mm->mmap_sem here, and we
> > additionally hold mmap_sem for the whole of mm_update_next_owner(),
> > then maybe we don't need any extra synchronization here? Something
> > like simply:
> >
> > int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
> > {
> >         struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner);
> >         return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
> > }
>
> The charge_as routine is not always called with mmap_sem held, since
> the undo path gets more complicated under the lock. We already have
> our own locking mechanism for the counters.

I'm not worried about the counters themselves being inconsistent - I'm
worried about the case where charge_as() is called in the middle of
the attach operation, and we account the charge X to the new cgroup's
res_counter and update mm->total_vm, and then when we do the move, we
charge the whole of mm->total_vm to the new cgroup even though the
last charge was already accounted to the new res_counter, not the old
one.

That's what I'm hoping to address with the idea of splitting the
attach into one update per subsystem, and letting the subsystems
control their own synchronization.

> We're not really accessing
> any member of the mm here except the owner. Do we need to be called
> with mmap_sem held?

Not necessarily mmap_sem, but there needs to be something to ensure
that the update to mm->total_vm and the charge/uncharge against the
res_counter are an atomic pair with respect to the code that shifts an
mm between two cgroups, either due to an mm->owner change or due to an
attach_task(). Since mmap_sem is held for write on almost all the fast
path calls to the rlimit_as charge/uncharge functions, using that for
the synchronization avoids the need for any additional synchronization
in the fast path.

Can you say more about the complications of holding a write lock on
mmap_sem in the cleanup calls to uncharge?

> > retry:
> >         mm = get_task_mm(p);
> >         if (mm == NULL) {
> >                 task_lock(p);
> >                 rcu_assign_ptr(p->cgroups, new_css_set);
>
> Will each callback assign p->cgroups to new_css_set?

Yes - but new_css_set will be slightly different for each callback.
Specifically, it will differ from the existing set pointed to by
p->cgroups in the pointer for this particular subsystem. So the task
will move over in a staggered fashion, and each subsystem will get to
choose its own synchronization.

> >         task_lock(p);
> >         if (p->mm != mm) {
> >                 /* We raced */
>
> With exit_mmap() or exec_mmap(), right?

Yes.

> If we agree with the assertion/conclusion above, then a simple lock
> might be able to protect us, assuming that it does not create an
> intertwined locking hierarchy.

Right - and if we can make that lock be the mmap_sem of the mm in
question, we avoid introducing a new lock into the fast path.

Paul
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  6:55         ` Paul Menage
@ 2008-05-15  7:03           ` Balbir Singh
  2008-05-15  7:39             ` Paul Menage
  0 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2008-05-15  7:03 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Paul Menage <menage@google.com> [2008-05-14 23:55:07]:

> On Wed, May 14, 2008 at 11:17 PM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> >
> > >
> > > Assuming that we're holding a write lock on mm->mmap_sem here, and we
> > > additionally hold mmap_sem for the whole of mm_update_next_owner(),
> > > then maybe we don't need any extra synchronization here? Something
> > > like simply:
> > >
> > > int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
> > > {
> > >         struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner);
> > >         return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
> > > }
> >
> > The charge_as routine is not always called with mmap_sem held, since
> > the undo path gets more complicated under the lock. We already have
> > our own locking mechanism for the counters.
>
> I'm not worried about the counters themselves being inconsistent - I'm
> worried about the case where charge_as() is called in the middle of
> the attach operation, and we account the charge X to the new cgroup's
> res_counter and update mm->total_vm, and then when we do the move, we
> charge the whole of mm->total_vm to the new cgroup even though the
> last charge was already accounted to the new res_counter, not the old
> one.
>
> That's what I'm hoping to address with the idea of splitting the
> attach into one update per subsystem, and letting the subsystems
> control their own synchronization.
>
> > We're not really accessing
> > any member of the mm here except the owner. Do we need to be called
> > with mmap_sem held?
>
> Not necessarily mmap_sem, but there needs to be something to ensure
> that the update to mm->total_vm and the charge/uncharge against the
> res_counter are an atomic pair with respect to the code that shifts an
> mm between two cgroups, either due to an mm->owner change or due to an
> attach_task(). Since mmap_sem is held for write on almost all the fast
> path calls to the rlimit_as charge/uncharge functions, using that for
> the synchronization avoids the need for any additional synchronization
> in the fast path.
>
> Can you say more about the complications of holding a write lock on
> mmap_sem in the cleanup calls to uncharge?
>
> > > retry:
> > >         mm = get_task_mm(p);
> > >         if (mm == NULL) {
> > >                 task_lock(p);
> > >                 rcu_assign_ptr(p->cgroups, new_css_set);
> >
> > Will each callback assign p->cgroups to new_css_set?
>
> Yes - but new_css_set will be slightly different for each callback.
> Specifically, it will differ from the existing set pointed to by
> p->cgroups in the pointer for this particular subsystem. So the task
> will move over in a staggered fashion, and each subsystem will get to
> choose its own synchronization.
>
> > >         task_lock(p);
> > >         if (p->mm != mm) {
> > >                 /* We raced */
> >
> > With exit_mmap() or exec_mmap(), right?
>
> Yes.
>
> > If we agree with the assertion/conclusion above, then a simple lock
> > might be able to protect us, assuming that it does not create an
> > intertwined locking hierarchy.
>
> Right - and if we can make that lock be the mmap_sem of the mm in
> question, we avoid introducing a new lock into the fast path.
>

I want to focus on this conclusion/assertion, since it takes care of
most of the locking related discussion above, unless I missed
something.

My concern with using mmap_sem is that

1. It's highly contended (every page fault, vma change, etc.)
2. It's going to make the locking hierarchy deeper and more complex
3. It's not appropriate to call all the accounting callbacks with
   mmap_sem held, since the undo operations _can get_ complicated
   at the caller.

I would prefer introducing a new lock, so that other subsystems are
not affected.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  7:03           ` Balbir Singh
@ 2008-05-15  7:39             ` Paul Menage
  2008-05-15  8:25               ` Balbir Singh
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Menage @ 2008-05-15  7:39 UTC (permalink / raw)
To: balbir, Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

On Thu, May 15, 2008 at 12:03 AM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
>
> I want to focus on this conclusion/assertion, since it takes care of
> most of the locking related discussion above, unless I missed
> something.
>
> My concern with using mmap_sem is that
>
> 1. It's highly contended (every page fault, vma change, etc.)

But the only *new* cases of taking the mmap_sem that this would
introduce would be:

- on a failed vm limit charge
- when a task exit/exec causes an mm ownership change
- when a task moves between two cgroups in the memrlimit hierarchy.

All of these should be rare events, so I don't think the additional
contention is a worry.

> 2. It's going to make the locking hierarchy deeper and more complex

Yes, potentially. But if the upside of that is that we eliminate a
lock/unlock on a shared lock on every mmap/munmap call, it might well
be worth it.

> 3. It's not appropriate to call all the accounting callbacks with
>    mmap_sem held, since the undo operations _can get_ complicated
>    at the caller.

Can you give an example?

> I would prefer introducing a new lock, so that other subsystems are
> not affected.

For getting the first cut of the memrlimit controller working this may
well make sense. But it would be nice to avoid it longer-term.

Paul
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  7:39             ` Paul Menage
@ 2008-05-15  8:25               ` Balbir Singh
  2008-05-15 15:28                 ` Paul Menage
  0 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2008-05-15  8:25 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Paul Menage <menage@google.com> [2008-05-15 00:39:45]:

> On Thu, May 15, 2008 at 12:03 AM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> >
> > I want to focus on this conclusion/assertion, since it takes care of
> > most of the locking related discussion above, unless I missed
> > something.
> >
> > My concern with using mmap_sem is that
> >
> > 1. It's highly contended (every page fault, vma change, etc.)
>
> But the only *new* cases of taking the mmap_sem that this would
> introduce would be:
>
> - on a failed vm limit charge

Why a failed charge? Aren't we talking of moving all charge/uncharge
under mmap_sem?

> - when a task exit/exec causes an mm ownership change

Yes, in the mm_owner_changed callbacks.

> - when a task moves between two cgroups in the memrlimit hierarchy.

Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
be a bad side-effect.

> All of these should be rare events, so I don't think the additional
> contention is a worry.

We do make several of our charge calls under the mmap_sem, but not all
of them. So the additional contention might not be all that bad.

> > 2. It's going to make the locking hierarchy deeper and more complex
>
> Yes, potentially. But if the upside of that is that we eliminate a
> lock/unlock on a shared lock on every mmap/munmap call, it might well
> be worth it.
>
> > 3. It's not appropriate to call all the accounting callbacks with
> >    mmap_sem held, since the undo operations _can get_ complicated
> >    at the caller.
>
> Can you give an example?

Some paths of the uncharge are not under mmap_sem. Undoing the
operation there seemed complex.

> > I would prefer introducing a new lock, so that other subsystems are
> > not affected.
>
> For getting the first cut of the memrlimit controller working this may
> well make sense. But it would be nice to avoid it longer-term.

OK, so here's what I am going to try and do:

1. Refactor the code to try and use mmap_sem and see what I come up
   with. Basically use mmap_sem for all charge/uncharge operations, as
   well as use mmap_sem in read mode in the move_task() and
   mm_owner_changed() callbacks. That should take care of the race
   conditions discussed, unless I missed something.
2. Try and instrument insert_vm_struct() for charge/uncharge.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15  8:25               ` Balbir Singh
@ 2008-05-15 15:28                 ` Paul Menage
  2008-05-15 17:01                   ` Balbir Singh
  2008-05-17 20:15                   ` Balbir Singh
  0 siblings, 2 replies; 27+ messages in thread
From: Paul Menage @ 2008-05-15 15:28 UTC (permalink / raw)
To: balbir, Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > But the only *new* cases of taking the mmap_sem that this would
> > introduce would be:
> >
> > - on a failed vm limit charge
>
> Why a failed charge? Aren't we talking of moving all charge/uncharge
> under mmap_sem?

Sorry, I worded that wrongly - I meant "cleaning up a successful
charge after an expansion fails for other reasons".

I thought that all the charges and most of the uncharges were already
under mmap_sem, and it would just be a few of the cleanup paths that
needed to take it.

> > - when a task moves between two cgroups in the memrlimit hierarchy.
>
> Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
> be a bad side-effect.

I think it's already nested that way - e.g. the cpusets code can call
various migration functions (which take mmap_sem) while holding
cgroup_mutex.

> Refactor the code to try and use mmap_sem and see what I come up
> with. Basically use mmap_sem for all charge/uncharge operations, as
> well as use mmap_sem in read mode in the move_task() and
> mm_owner_changed() callbacks. That should take care of the race
> conditions discussed, unless I missed something.

Sounds good.

Thanks,

Paul
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15 15:28                 ` Paul Menage
@ 2008-05-15 17:01                   ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2008-05-15 17:01 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

Paul Menage wrote:
> On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> >
>> > But the only *new* cases of taking the mmap_sem that this would
>> > introduce would be:
>> >
>> > - on a failed vm limit charge
>>
>> Why a failed charge? Aren't we talking of moving all charge/uncharge
>> under mmap_sem?
>>
>
> Sorry, I worded that wrongly - I meant "cleaning up a successful
> charge after an expansion fails for other reasons"
>
> I thought that all the charges and most of the uncharges were already
> under mmap_sem, and it would just be a few of the cleanup paths that
> needed to take it.
>

OK, that's definitely more meaningful. Thanks for clarifying.

>> > - when a task moves between two cgroups in the memrlimit hierarchy.
>> >
>>
>> Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
>> be a bad side-effect.
>>
>
> I think it's already nested that way - e.g. the cpusets code can call
> various migration functions (which take mmap_sem) while holding
> cgroup_mutex.
>
>> Refactor the code to try and use mmap_sem and see what I come up
>> with. Basically use mmap_sem for all charge/uncharge operations, as
>> well as use mmap_sem in read mode in the move_task() and
>> mm_owner_changed() callbacks. That should take care of the race
>> conditions discussed, unless I missed something.
>
> Sounds good.
>

Let me get that done and I'll post the next version.

> Thanks,
>
> Paul

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-15 15:28                 ` Paul Menage
  2008-05-15 17:01                   ` Balbir Singh
@ 2008-05-17 20:15                   ` Balbir Singh
  2008-05-17 20:17                     ` Balbir Singh
  1 sibling, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2008-05-17 20:15 UTC (permalink / raw)
To: Paul Menage
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf, linux-kernel,
    Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Paul Menage <menage@google.com> [2008-05-15 08:28:46]:

> On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >
> > > But the only *new* cases of taking the mmap_sem that this would
> > > introduce would be:
> > >
> > > - on a failed vm limit charge
> >
> > Why a failed charge? Aren't we talking of moving all charge/uncharge
> > under mmap_sem?
> >
>
> Sorry, I worded that wrongly - I meant "cleaning up a successful
> charge after an expansion fails for other reasons"
>
> I thought that all the charges and most of the uncharges were already
> under mmap_sem, and it would just be a few of the cleanup paths that
> needed to take it.
>
> > > - when a task moves between two cgroups in the memrlimit hierarchy.
> > >
> >
> > Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
> > be a bad side-effect.
> >
>
> I think it's already nested that way - e.g. the cpusets code can call
> various migration functions (which take mmap_sem) while holding
> cgroup_mutex.
>
> > Refactor the code to try and use mmap_sem and see what I come up
> > with. Basically use mmap_sem for all charge/uncharge operations, as
> > well as use mmap_sem in read mode in the move_task() and
> > mm_owner_changed() callbacks. That should take care of the race
> > conditions discussed, unless I missed something.
>
> Sounds good.
>
> Thanks,
>

I've revamped the last two patches. Please review.

This patch adds an additional field to the mm_owner callbacks. This
field is required to get to the mm that changed. Hold mmap_sem in
write mode before calling the mm_owner_changed callback.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/cgroup.h |    3 ++-
 kernel/cgroup.c        |    4 +++-
 kernel/exit.c          |    3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks include/linux/cgroup.h
--- linux-2.6.26-rc2/include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/cgroup.h	2008-05-14 18:36:59.000000000 +0530
@@ -310,7 +310,8 @@ struct cgroup_subsys {
 	 */
 	void (*mm_owner_changed)(struct cgroup_subsys *ss,
 			struct cgroup *old,
-			struct cgroup *new);
+			struct cgroup *new,
+			struct task_struct *p);
 	int subsys_id;
 	int active;
 	int disabled;
diff -puN kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks kernel/cgroup.c
--- linux-2.6.26-rc2/kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/cgroup.c	2008-05-17 22:09:57.000000000 +0530
@@ -2758,6 +2758,8 @@ void cgroup_fork_callbacks(struct task_s
  * Called on every change to mm->owner. mm_init_owner() does not
  * invoke this routine, since it assigns the mm->owner the first time
  * and does not change it.
+ *
+ * The callbacks are invoked with mmap_sem held in write mode.
  */
 void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
 {
@@ -2772,7 +2774,7 @@ void cgroup_mm_owner_callbacks(struct ta
 		if (oldcgrp == newcgrp)
 			continue;
 		if (ss->mm_owner_changed)
-			ss->mm_owner_changed(ss, oldcgrp, newcgrp);
+			ss->mm_owner_changed(ss, oldcgrp, newcgrp, new);
 	}
 }
diff -puN kernel/exit.c~cgroup-add-task-to-mm-owner-callbacks kernel/exit.c
--- linux-2.6.26-rc2/kernel/exit.c~cgroup-add-task-to-mm-owner-callbacks	2008-05-17 22:10:00.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/exit.c	2008-05-17 23:14:44.000000000 +0530
@@ -621,6 +621,7 @@ retry:
 assign_new_owner:
 	BUG_ON(c == p);
 	get_task_struct(c);
+	down_write(&mm->mmap_sem);
 	/*
 	 * The task_lock protects c->mm from changing.
 	 * We always want mm->owner->mm == mm
@@ -634,12 +635,14 @@ assign_new_owner:
 	if (c->mm != mm) {
 		task_unlock(c);
 		put_task_struct(c);
+		up_write(&mm->mmap_sem);
 		goto retry;
 	}
 	cgroup_mm_owner_callbacks(mm->owner, c);
 	mm->owner = c;
 	task_unlock(c);
 	put_task_struct(c);
+	up_write(&mm->mmap_sem);
 }
 #endif /* CONFIG_MM_OWNER */
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-17 20:15                   ` Balbir Singh
@ 2008-05-17 20:17                     ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2008-05-17 20:17 UTC (permalink / raw)
To: Paul Menage, linux-mm, Sudhir Kumar, YAMAMOTO Takashi, lizf,
    linux-kernel, Pavel Emelianov, Andrew Morton, KAMEZAWA Hiroyuki

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-18 01:45:45]:

> * Paul Menage <menage@google.com> [2008-05-15 08:28:46]:
>
> > On Thu, May 15, 2008 at 1:25 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > >
> > > > But the only *new* cases of taking the mmap_sem that this would
> > > > introduce would be:
> > > >
> > > > - on a failed vm limit charge
> > >
> > > Why a failed charge? Aren't we talking of moving all charge/uncharge
> > > under mmap_sem?
> > >
> >
> > Sorry, I worded that wrongly - I meant "cleaning up a successful
> > charge after an expansion fails for other reasons"
> >
> > I thought that all the charges and most of the uncharges were already
> > under mmap_sem, and it would just be a few of the cleanup paths that
> > needed to take it.
> >
> > > > - when a task moves between two cgroups in the memrlimit hierarchy.
> > > >
> > >
> > > Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would
> > > be a bad side-effect.
> > >
> >
> > I think it's already nested that way - e.g. the cpusets code can call
> > various migration functions (which take mmap_sem) while holding
> > cgroup_mutex.
> >
> > > Refactor the code to try and use mmap_sem and see what I come up
> > > with. Basically use mmap_sem for all charge/uncharge operations, as
> > > well as use mmap_sem in read mode in the move_task() and
> > > mm_owner_changed() callbacks. That should take care of the race
> > > conditions discussed, unless I missed something.
> >
> > Sounds good.
> >
> > Thanks,
> >
> I've revamped the last two patches. Please review
>

Here's the last patch for review.

This patch adds support for accounting and control of virtual address
space limits. The accounting is done via the rlimit_cgroup_(un)charge_as
functions. The core of the accounting takes place during fork time in
copy_process(), may_expand_vm(), remove_vma_list() and exit_mmap().

Changelog v5->v4
1. Move specific hooks in code to insert_vm_struct
2. Use mmap_sem to protect mm->owner from changing and mm->owner from
   changing cgroups.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/x86/kernel/ptrace.c        |   18 +++++--
 include/linux/memrlimitcgroup.h |   21 +++++++++
 kernel/fork.c                   |    8 +++
 mm/memrlimitcgroup.c            |   92 ++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                       |   17 ++++++-
 5 files changed, 149 insertions(+), 7 deletions(-)

diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c
diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c
diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h
--- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h	2008-05-17 23:14:53.000000000 +0530
@@ -16,4 +16,25 @@
 #ifndef LINUX_MEMRLIMITCGROUP_H
 #define LINUX_MEMRLIMITCGROUP_H
 
+#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
+
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages);
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages);
+
+#else /* !CONFIG_CGROUP_RLIMIT_CTLR */
+
+static inline int
+memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	return 0;
+}
+
+static inline void
+memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+}
+
+#endif /* CONFIG_CGROUP_RLIMIT_CTLR */
+
+
 #endif /* LINUX_MEMRLIMITCGROUP_H */
diff -puN kernel/fork.c~memrlimit-controller-address-space-accounting-and-control kernel/fork.c
--- linux-2.6.26-rc2/kernel/fork.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/fork.c	2008-05-17 23:15:55.000000000 +0530
@@ -54,6 +54,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -267,6 +268,7 @@ static int dup_mmap(struct mm_struct *mm
 			mm->total_vm -= pages;
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
 								-pages);
+			memrlimit_cgroup_uncharge_as(mm, pages);
 			continue;
 		}
 		charge = 0;
@@ -596,6 +598,12 @@ static int copy_mm(unsigned long clone_f
 		atomic_inc(&oldmm->mm_users);
 		mm = oldmm;
 		goto good_mm;
+	} else {
+		down_write(&oldmm->mmap_sem);
+		retval = memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm);
+		up_write(&oldmm->mmap_sem);
+		if (retval)
+			goto fail_nomem;
 	}
 
 	retval = -ENOMEM;
diff -puN mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control mm/memrlimitcgroup.c
--- linux-2.6.26-rc2/mm/memrlimitcgroup.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/mm/memrlimitcgroup.c	2008-05-18 00:47:31.000000000 +0530
@@ -45,6 +45,38 @@ static struct memrlimit_cgroup *memrlimi
 			    struct memrlimit_cgroup, css);
 }
 
+static struct memrlimit_cgroup *
+memrlimit_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id),
+				struct memrlimit_cgroup, css);
+}
+
+/*
+ * Charge the cgroup for address space usage - mmap(), malloc() (through
+ * brk(), sbrk()), stack expansion, mremap(), etc - called with
+ * mmap_sem held.
+ */
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
+/*
+ * Uncharge the cgroup, as the address space of one of the tasks is
+ * decreasing - called with mmap_sem held.
+ */
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
 static struct cgroup_subsys_state *
 memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
@@ -134,11 +166,71 @@ static int memrlimit_cgroup_populate(str
 				ARRAY_SIZE(memrlimit_cgroup_files));
 }
 
+static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss,
+					struct cgroup *cgrp,
+					struct cgroup *old_cgrp,
+					struct task_struct *p)
+{
+	struct mm_struct *mm;
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+
+	mm = get_task_mm(p);
+	if (mm == NULL)
+		return;
+
+	/*
+	 * Hold mmap_sem, so that total_vm does not change underneath us
+	 */
+	down_read(&mm->mmap_sem);
+
+	rcu_read_lock();
+	if (p != rcu_dereference(mm->owner))
+		goto out;
+
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (memrcg == old_memrcg)
+		goto out;
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	rcu_read_unlock();
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+}
+
+/*
+ * This callback is called with mmap_sem held
+ */
+static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss,
+						struct cgroup *cgrp,
+						struct cgroup *old_cgrp,
+						struct task_struct *p)
+{
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+	struct mm_struct *mm = get_task_mm(p);
+
+	BUG_ON(!mm);
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	mmput(mm);
+}
+
 struct cgroup_subsys memrlimit_cgroup_subsys = {
 	.name = "memrlimit",
 	.subsys_id = memrlimit_cgroup_subsys_id,
 	.create = memrlimit_cgroup_create,
 	.destroy = memrlimit_cgroup_destroy,
 	.populate = memrlimit_cgroup_populate,
+	.attach = memrlimit_cgroup_move_task,
+	.mm_owner_changed = memrlimit_cgroup_mm_owner_changed,
 	.early_init = 0,
 };
diff -puN mm/mmap.c~memrlimit-controller-address-space-accounting-and-control mm/mmap.c
--- linux-2.6.26-rc2/mm/mmap.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/mm/mmap.c	2008-05-17 23:14:53.000000000 +0530
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1730,6 +1731,7 @@ static void remove_vma_list(struct mm_st
 		long nrpages = vma_pages(vma);
 
 		mm->total_vm -= nrpages;
+		memrlimit_cgroup_uncharge_as(mm, nrpages);
 		if (vma->vm_flags & VM_LOCKED)
 			mm->locked_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
@@ -2056,6 +2058,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	memrlimit_cgroup_uncharge_as(mm, mm->total_vm);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 
@@ -2078,6 +2081,9 @@ int insert_vm_struct(struct mm_struct *
 	struct vm_area_struct * __vma, * prev;
 	struct rb_node ** rb_link, * rb_parent;
 
+	if (memrlimit_cgroup_charge_as(mm, vma_pages(vma)))
+		return -ENOMEM;
+
 	/*
 	 * The vm_pgoff of a purely anonymous vma should be irrelevant
 	 * until its first write fault, when page's anon_vma and index
@@ -2096,12 +2102,15 @@ int insert_vm_struct(struct mm_struct *
 	}
 	__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
 	if (__vma && __vma->vm_start < vma->vm_end)
-		return -ENOMEM;
+		goto err;
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
-		return -ENOMEM;
+		goto err;
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
+err:
+	memrlimit_cgroup_uncharge_as(mm, vma_pages(vma));
+	return -ENOMEM;
 }
 
 /*
@@ -2174,6 +2183,10 @@ int may_expand_vm(struct mm_struct *mm,
 
 	if (cur + npages > lim)
 		return 0;
+
+	if (memrlimit_cgroup_charge_as(mm, npages))
+		return 0;
+
 	return 1;
 }
diff -puN arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ptrace.c
--- linux-2.6.26-rc2/arch/x86/kernel/ptrace.c~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/arch/x86/kernel/ptrace.c	2008-05-17 23:14:53.000000000 +0530
@@ -20,6 +20,7 @@
 #include <linux/audit.h>
 #include <linux/seccomp.h>
 #include <linux/signal.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -782,21 +783,25 @@ static int ptrace_bts_realloc(struct tas
 
 	current->mm->total_vm  -= old_size;
 	current->mm->locked_vm -= old_size;
+	memrlimit_cgroup_uncharge_as(mm, old_size);
 
 	if (size == 0)
 		goto out;
 
+	if (memrlimit_cgroup_charge_as(current->mm, size))
+		goto out;
+
 	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
 	vm = current->mm->total_vm + size;
 	if (rlim < vm) {
 		ret = -ENOMEM;
 
 		if (!reduce_size)
-			goto out;
+			goto out_uncharge;
 
 		size = rlim - current->mm->total_vm;
 		if (size <= 0)
-			goto out;
+			goto out_uncharge;
 	}
 
 	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
@@ -805,21 +810,24 @@ static int ptrace_bts_realloc(struct tas
 		ret = -ENOMEM;
 
 		if (!reduce_size)
-			goto out;
+			goto out_uncharge;
 
 		size = rlim - current->mm->locked_vm;
 		if (size <= 0)
-
goto out; + goto out_uncharge; } ret = ds_allocate((void **)&child->thread.ds_area_msr, size << PAGE_SHIFT); if (ret < 0) - goto out; + goto out_uncharge; current->mm->total_vm += size; current->mm->locked_vm += size; +out_uncharge: + if (ret < 0) + memrlimit_cgroup_uncharge_as(mm, size); out: if (child->thread.ds_area_msr) set_tsk_thread_flag(child, TIF_DS_AREA_MSR); _ -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
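The insert_vm_struct() and may_expand_vm() hunks above follow a common kernel error-unwinding idiom: charge the counter before doing anything else, then route every later failure path through a single label that undoes the charge. A minimal userspace sketch of that idiom follows; the `res_counter` here is a toy stand-in for the kernel structure (the names mirror the patch, but the types and limit-in-pages convention are assumptions for illustration, not the kernel API):

```c
#include <assert.h>

/* Toy stand-in for the kernel's res_counter: usage/limit in pages. */
struct res_counter {
	unsigned long usage;
	unsigned long limit;
};

/* Charge nr_pages; fail (non-zero) if the limit would be exceeded. */
static int res_counter_charge(struct res_counter *rc, unsigned long nr_pages)
{
	if (rc->usage + nr_pages > rc->limit)
		return -1;
	rc->usage += nr_pages;
	return 0;
}

static void res_counter_uncharge(struct res_counter *rc, unsigned long nr_pages)
{
	rc->usage -= nr_pages;
}

/*
 * Shaped like the patched insert_vm_struct(): charge up front, and make
 * every subsequent failure path exit through one label that uncharges,
 * so the counter never leaks a charge for a region that was not inserted.
 */
static int insert_region(struct res_counter *rc, unsigned long nr_pages,
			 int later_step_fails)
{
	if (res_counter_charge(rc, nr_pages))
		return -1;		/* -ENOMEM in the kernel */

	if (later_step_fails)
		goto err;		/* e.g. an overlapping VMA was found */

	return 0;			/* success: charge is kept */
err:
	res_counter_uncharge(rc, nr_pages);
	return -1;
}
```

The payoff of the single-label form is exactly what the ptrace.c hunk needed: when new failure exits are added later, they reuse `goto err`/`goto out_uncharge` instead of each remembering to uncharge.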
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh
  2008-05-14 13:25 ` Balbir Singh
@ 2008-05-14 13:32 ` Pavel Emelyanov
  2008-05-14 19:39 ` Balbir Singh
  2008-09-18 20:54 ` Andrew Morton
  2 siblings, 1 reply; 27+ messages in thread

From: Pavel Emelyanov @ 2008-05-14 13:32 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
    linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki

Balbir Singh wrote:
> This patch adds support for accounting and control of virtual address space
> limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
> The core of the accounting takes place during fork time in copy_process(),
> may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special
> cases that are handled here as well (arch/ia64/kernel/perfmon.c,
> arch/x86/kernel/ptrace.c, insert_special_mapping())
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
>  arch/ia64/kernel/perfmon.c      |    6 ++
>  arch/x86/kernel/ptrace.c        |   17 +++++--
>  fs/exec.c                       |    5 ++
>  include/linux/memrlimitcgroup.h |   21 ++++++++
>  kernel/fork.c                   |    8 +++
>  mm/memrlimitcgroup.c            |   94 ++++++++++++++++++++++++++++++++++++++++
>  mm/mmap.c                       |   11 ++++
>  7 files changed, 157 insertions(+), 5 deletions(-)
>
> diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
> --- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control	2008-05-14 18:09:32.000000000 +0530
> +++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c	2008-05-14 18:09:32.000000000 +0530
> @@ -40,6 +40,7 @@
>  #include <linux/capability.h>
>  #include <linux/rcupdate.h>
>  #include <linux/completion.h>
> +#include <linux/memrlimitcgroup.h>
>
>  #include <asm/errno.h>
>  #include <asm/intrinsics.h>
> @@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct
>
>  	DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size));
>
> +	if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT))
> +		return -ENOMEM;
> +

AFAIS you didn't cover all the cases when VM expands. At least all of
arch/ia64/ia32/binfmt_elf32.c is missed.

I'd insert this charge into insert_vm_struct. This would a) cover
all of the missed cases and b) reduce the number of places to patch.

[snip the rest of the patch]
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-14 13:32 ` Pavel Emelyanov
@ 2008-05-14 19:39 ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread

From: Balbir Singh @ 2008-05-14 19:39 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: linux-mm, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage, lizf,
    linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki

* Pavel Emelyanov <xemul@openvz.org> [2008-05-14 17:32:42]:

> AFAIS you didn't cover all the cases when VM expands. At least all
> the arch/ia64/ia32/binfmt_elf32.c is missed.
>
> I'd insert this charge into insert_vm_struct. This would a) cover
> all of the missed cases and b) reduce the amount of places to patch.

I thought I had those covered. insert_vm_struct() is called from places
that we have covered in this patch. As far as arch/ia64/ia32/binfmt_elf32.c
is concerned, it inserts a GDT and LDT and does not change total_vm. Having
said that, I am not against putting the hooks in insert_vm_struct().

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh
  2008-05-14 13:25 ` Balbir Singh
  2008-05-14 13:32 ` Pavel Emelyanov
@ 2008-09-18 20:54 ` Andrew Morton
  2008-09-18 23:55 ` Balbir Singh
  2008-09-19  6:38 ` Balbir Singh
  2 siblings, 2 replies; 27+ messages in thread

From: Andrew Morton @ 2008-09-18 20:54 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, Ingo Molnar

On Wed, 14 May 2008 18:39:51 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> This patch adds support for accounting and control of virtual address space
> limits.

Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
the memrlimit patches.

I decided to retain the patches because it looks repairable. The
problem is this reject from
memrlimit-add-memrlimit-controller-accounting-and-control.patch:

***************
*** 808,828 ****
  	current->mm->total_vm  -= old_size;
  	current->mm->locked_vm -= old_size;

  	if (size == 0)
  		goto out;

  	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
  	vm = current->mm->total_vm + size;
  	if (rlim < vm) {
  		ret = -ENOMEM;

  		if (!reduce_size)
-			goto out;

  		size = rlim - current->mm->total_vm;
  		if (size <= 0)
-			goto out;
  	}

  	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
--- 809,833 ----
  	current->mm->total_vm  -= old_size;
  	current->mm->locked_vm -= old_size;
+	memrlimit_cgroup_uncharge_as(mm, old_size);

  	if (size == 0)
  		goto out;

+	if (memrlimit_cgroup_charge_as(current->mm, size))
+		goto out;
+
  	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
  	vm = current->mm->total_vm + size;
  	if (rlim < vm) {
  		ret = -ENOMEM;

  		if (!reduce_size)
+			goto out_uncharge;

  		size = rlim - current->mm->total_vm;
  		if (size <= 0)
+			goto out_uncharge;
  	}

  	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
***************
*** 831,851 ****
  		ret = -ENOMEM;

  		if (!reduce_size)
-			goto out;

  		size = rlim - current->mm->locked_vm;
  		if (size <= 0)
-			goto out;
  	}

  	ret = ds_allocate((void **)&child->thread.ds_area_msr,
  			  size << PAGE_SHIFT);
  	if (ret < 0)
-		goto out;

  	current->mm->total_vm  += size;
  	current->mm->locked_vm += size;

 out:
  	if (child->thread.ds_area_msr)
  		set_tsk_thread_flag(child, TIF_DS_AREA_MSR);
--- 836,859 ----
  		ret = -ENOMEM;

  		if (!reduce_size)
+			goto out_uncharge;

  		size = rlim - current->mm->locked_vm;
  		if (size <= 0)
+			goto out_uncharge;
  	}

  	ret = ds_allocate((void **)&child->thread.ds_area_msr,
  			  size << PAGE_SHIFT);
  	if (ret < 0)
+		goto out_uncharge;

  	current->mm->total_vm  += size;
  	current->mm->locked_vm += size;
+out_uncharge:
+	if (ret < 0)
+		memrlimit_cgroup_uncharge_as(mm, size);
 out:
  	if (child->thread.ds_area_msr)
  		set_tsk_thread_flag(child, TIF_DS_AREA_MSR);

Could you please take a look at today's mmotm and see what needs to be
done to salvage it? Most of the code you were altering got moved into
arch/x86/kernel/ds.c and got changed rather a lot.

Thanks.
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-18 20:54 ` Andrew Morton
@ 2008-09-18 23:55 ` Balbir Singh
  2008-09-19  6:38 ` Balbir Singh
  1 sibling, 0 replies; 27+ messages in thread

From: Balbir Singh @ 2008-09-18 23:55 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, Ingo Molnar

Andrew Morton wrote:
> On Wed, 14 May 2008 18:39:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> This patch adds support for accounting and control of virtual address space
>> limits.
>
> [snip]
>
> could you please take a look at today's mmotm and see what needs to be
> done to salvage it? Most of the code you were altering got moved into
> arch/x86/kernel/ds.c and got changed rather a lot.

I'll take a look tonight and see what needs to be done.

--
Balbir
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-18 20:54 ` Andrew Morton
  2008-09-18 23:55 ` Balbir Singh
@ 2008-09-19  6:38 ` Balbir Singh
  2008-09-19 20:14 ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread

From: Balbir Singh @ 2008-09-19 6:38 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, Ingo Molnar

* Andrew Morton <akpm@linux-foundation.org> [2008-09-18 13:54:30]:

> On Wed, 14 May 2008 18:39:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > This patch adds support for accounting and control of virtual address space
> > limits.
>
> Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
> the memrlimit patches.
>
> I decided to retain the patches because it looks repairable. The
> problem is this reject from
> memrlimit-add-memrlimit-controller-accounting-and-control.patch:

Andrew,

I could not apply mmotm to linux-next (both downloaded just now). I
applied the patches one-by-one, resolving differences, starting from #mm
in the series file.

Here is my fixed version of the patch. I compiled it, but could not run
it, since I could not create the full series of applied patches; I did
compile arch/x86/kernel/ds.o and ptrace.o. I've included the patch
below. Please let me know if the code looks OK (via review) and the
patch applies. I'll test it once I can reasonably resolve all conflicts
between linux-next and mmotm.

From: Balbir Singh <balbir@linux.vnet.ibm.com>

This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap().

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/kernel/ptrace.c        |    1
 include/linux/memrlimitcgroup.h |   21 ++++++
 kernel/fork.c                   |   14 ++++
 mm/memrlimitcgroup.c            |   92 ++++++++++++++++++++++++++++++
 mm/mmap.c                       |   17 ++++-
 5 files changed, 142 insertions(+), 3 deletions(-)

Index: linux-next/include/linux/memrlimitcgroup.h
===================================================================
--- linux-next.orig/include/linux/memrlimitcgroup.h	2008-09-18 23:19:26.000000000 -0700
+++ linux-next/include/linux/memrlimitcgroup.h	2008-09-18 23:19:29.000000000 -0700
@@ -16,4 +16,25 @@
 #ifndef LINUX_MEMRLIMITCGROUP_H
 #define LINUX_MEMRLIMITCGROUP_H

+#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
+
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages);
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages);
+
+#else /* !CONFIG_CGROUP_RLIMIT_CTLR */
+
+static inline int
+memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	return 0;
+}
+
+static inline void
+memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+}
+
+#endif /* CONFIG_CGROUP_RLIMIT_CTLR */
+
+
 #endif /* LINUX_MEMRLIMITCGROUP_H */

Index: linux-next/kernel/fork.c
===================================================================
--- linux-next.orig/kernel/fork.c	2008-09-18 21:32:39.000000000 -0700
+++ linux-next/kernel/fork.c	2008-09-18 23:27:21.000000000 -0700
@@ -51,6 +51,7 @@
 #include <linux/acct.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
+#include <linux/memrlimitcgroup.h>
 #include <linux/freezer.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
@@ -263,7 +264,7 @@
 	struct vm_area_struct *mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
-	unsigned long charge;
+	unsigned long charge, uncharged = 0;
 	struct mempolicy *pol;

 	down_write(&oldmm->mmap_sem);
@@ -273,6 +274,15 @@
 	 */
 	down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);

+	/*
+	 * Uncharging as a result of failure is done by mmput()
+	 * in dup_mm()
+	 */
+	if (memrlimit_cgroup_charge_as(oldmm, oldmm->total_vm)) {
+		retval = -ENOMEM;
+		goto out;
+	}
+
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
 	mm->mmap_cache = NULL;
@@ -293,6 +303,8 @@
 			mm->total_vm -= pages;
 			vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
 								-pages);
+			memrlimit_cgroup_uncharge_as(mm, pages);
+			uncharged += pages;
 			continue;
 		}
 		charge = 0;

Index: linux-next/mm/memrlimitcgroup.c
===================================================================
--- linux-next.orig/mm/memrlimitcgroup.c	2008-09-18 23:19:26.000000000 -0700
+++ linux-next/mm/memrlimitcgroup.c	2008-09-18 23:27:36.000000000 -0700
@@ -45,6 +45,38 @@
 			    struct memrlimit_cgroup, css);
 }

+static struct memrlimit_cgroup *
+memrlimit_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, memrlimit_cgroup_subsys_id),
+			    struct memrlimit_cgroup, css);
+}
+
+/*
+ * Charge the cgroup for address space usage - mmap(), malloc() (through
+ * brk(), sbrk()), stack expansion, mremap(), etc - called with
+ * mmap_sem held.
+ */
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
+/*
+ * Uncharge the cgroup, as the address space of one of the tasks is
+ * decreasing - called with mmap_sem held.
+ */
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct memrlimit_cgroup *memrcg;
+
+	memrcg = memrlimit_cgroup_from_task(mm->owner);
+	res_counter_uncharge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
+}
+
 static struct cgroup_subsys_state *
 memrlimit_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
@@ -121,11 +153,71 @@
 				ARRAY_SIZE(memrlimit_cgroup_files));
 }

+static void memrlimit_cgroup_move_task(struct cgroup_subsys *ss,
+				       struct cgroup *cgrp,
+				       struct cgroup *old_cgrp,
+				       struct task_struct *p)
+{
+	struct mm_struct *mm;
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+
+	mm = get_task_mm(p);
+	if (mm == NULL)
+		return;
+
+	/*
+	 * Hold mmap_sem, so that total_vm does not change underneath us
+	 */
+	down_read(&mm->mmap_sem);
+
+	rcu_read_lock();
+	if (p != rcu_dereference(mm->owner))
+		goto out;
+
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (memrcg == old_memrcg)
+		goto out;
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	rcu_read_unlock();
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+}
+
+/*
+ * This callback is called with mmap_sem held
+ */
+static void memrlimit_cgroup_mm_owner_changed(struct cgroup_subsys *ss,
+					      struct cgroup *cgrp,
+					      struct cgroup *old_cgrp,
+					      struct task_struct *p)
+{
+	struct memrlimit_cgroup *memrcg, *old_memrcg;
+	struct mm_struct *mm = get_task_mm(p);
+
+	BUG_ON(!mm);
+	memrcg = memrlimit_cgroup_from_cgrp(cgrp);
+	old_memrcg = memrlimit_cgroup_from_cgrp(old_cgrp);
+
+	if (res_counter_charge(&memrcg->as_res, (mm->total_vm << PAGE_SHIFT)))
+		goto out;
+	res_counter_uncharge(&old_memrcg->as_res, (mm->total_vm << PAGE_SHIFT));
+out:
+	mmput(mm);
+}
+
 struct cgroup_subsys memrlimit_cgroup_subsys = {
 	.name = "memrlimit",
 	.subsys_id = memrlimit_cgroup_subsys_id,
 	.create = memrlimit_cgroup_create,
 	.destroy = memrlimit_cgroup_destroy,
 	.populate = memrlimit_cgroup_populate,
+	.attach = memrlimit_cgroup_move_task,
+	.mm_owner_changed = memrlimit_cgroup_mm_owner_changed,
 	.early_init = 0,
 };

Index: linux-next/mm/mmap.c
===================================================================
--- linux-next.orig/mm/mmap.c	2008-09-18 23:00:18.000000000 -0700
+++ linux-next/mm/mmap.c	2008-09-18 23:27:21.000000000 -0700
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/profile.h>
 #include <linux/module.h>
+#include <linux/memrlimitcgroup.h>
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
@@ -1756,6 +1757,7 @@
 		long nrpages = vma_pages(vma);

 		mm->total_vm -= nrpages;
+		memrlimit_cgroup_uncharge_as(mm, nrpages);
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
 		vma = remove_vma(vma);
 	} while (vma);
@@ -2106,6 +2108,7 @@
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	memrlimit_cgroup_uncharge_as(mm, mm->total_vm);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
@@ -2128,6 +2131,9 @@
 	struct vm_area_struct * __vma, * prev;
 	struct rb_node ** rb_link, * rb_parent;

+	if (memrlimit_cgroup_charge_as(mm, vma_pages(vma)))
+		return -ENOMEM;
+
 	/*
 	 * The vm_pgoff of a purely anonymous vma should be irrelevant
 	 * until its first write fault, when page's anon_vma and index
@@ -2146,12 +2152,15 @@
 	}
 	__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
 	if (__vma && __vma->vm_start < vma->vm_end)
-		return -ENOMEM;
+		goto err;
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
-		return -ENOMEM;
+		goto err;
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
+err:
+	memrlimit_cgroup_uncharge_as(mm, vma_pages(vma));
+	return -ENOMEM;
 }

 /*
@@ -2224,6 +2233,10 @@
 	if (cur + npages > lim)
 		return 0;
+
+	if (memrlimit_cgroup_charge_as(mm, npages))
+		return 0;
+
 	return 1;
 }

Index: linux-next/arch/x86/kernel/ds.c
===================================================================
--- linux-next.orig/arch/x86/kernel/ds.c	2008-09-18 23:28:07.000000000 -0700
+++ linux-next/arch/x86/kernel/ds.c	2008-09-18 23:33:55.000000000 -0700
@@ -30,6 +30,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
+#include <linux/memrlimitcgroup.h>


 /*
@@ -339,19 +340,22 @@
 	pgsz = PAGE_ALIGN(size) >> PAGE_SHIFT;

+	if (memrlimit_cgroup_charge_as(current->mm, pgsz << PAGE_SHIFT))
+		return NULL;
+
 	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
 	vm   = current->mm->total_vm + pgsz;
 	if (rlim < vm)
-		return NULL;
+		goto uncharge;

 	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
 	vm   = current->mm->locked_vm + pgsz;
 	if (rlim < vm)
-		return NULL;
+		goto uncharge;

 	buffer = kzalloc(size, GFP_KERNEL);
 	if (!buffer)
-		return NULL;
+		goto uncharge;

 	current->mm->total_vm  += pgsz;
 	current->mm->locked_vm += pgsz;
@@ -360,6 +364,9 @@
 	*pages = pgsz;

 	return buffer;
+uncharge:
+	memrlimit_cgroup_uncharge_as(current->mm, pgsz << PAGE_SHIFT);
+	return NULL;
 }

 static int ds_request(struct task_struct *task, void *base, size_t size,

--
	Balbir
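A detail worth keeping straight in the interfaces above: memrlimit_cgroup_charge_as() takes a count of *pages* and converts to bytes internally with `<< PAGE_SHIFT` before hitting the res_counter. A toy illustration of that unit contract (hypothetical helper names, and a fixed 4 KiB page size assumed purely for the sketch):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assume 4 KiB pages for this sketch */

/* Toy address-space counter; the real res_counter tracks bytes. */
struct as_counter {
	unsigned long long usage_bytes;
	unsigned long long limit_bytes;
};

/*
 * Mirrors the shape of memrlimit_cgroup_charge_as(): the caller passes
 * pages, the conversion to bytes happens exactly once, in here.
 */
static int charge_as(struct as_counter *c, unsigned long nr_pages)
{
	unsigned long long bytes = (unsigned long long)nr_pages << PAGE_SHIFT;

	if (c->usage_bytes + bytes > c->limit_bytes)
		return -1;	/* over the cgroup's address-space limit */
	c->usage_bytes += bytes;
	return 0;
}

static void uncharge_as(struct as_counter *c, unsigned long nr_pages)
{
	c->usage_bytes -= (unsigned long long)nr_pages << PAGE_SHIFT;
}
```

Because the conversion lives inside the helper, a caller that pre-shifts its argument (passing bytes where pages are expected) would be charged PAGE_SIZE times too much; keeping the unit in the parameter name, as the kernel code does with `nr_pages`, is a cheap defense against that.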
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-19  6:38 ` Balbir Singh
@ 2008-09-19 20:14 ` Andrew Morton
  2008-09-19 21:28 ` Balbir Singh
  0 siblings, 1 reply; 27+ messages in thread

From: Andrew Morton @ 2008-09-19 20:14 UTC (permalink / raw)
To: balbir
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, mingo

On Thu, 18 Sep 2008 23:38:23 -0700
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Andrew Morton <akpm@linux-foundation.org> [2008-09-18 13:54:30]:
>
> > On Wed, 14 May 2008 18:39:51 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> > > This patch adds support for accounting and control of virtual address space
> > > limits.
> >
> > Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
> > the memrlimit patches.
> >
> > I decided to retain the patches because it looks repairable. The
> > problem is this reject from
> > memrlimit-add-memrlimit-controller-accounting-and-control.patch:
>
> Andrew,
>
> I could not apply mmotm to linux-next (both downloaded right now).

mmotm includes linux-next.patch. mmotm is based upon the most recent
2.6.x-rcy. This is the only way to do it - I often have to change
linux-next.patch due to rejects and it's unreasonable to expect people
to base off the same version of linux-next as I did.

> I applied the patches one-by-one resolving differences starting from #mm
> in the series file.
>
> Here is my fixed version of the patch, I compiled the patch, but could
> not run it, since I could not create the full series of applied
> patches. I compiled arch/x86/kernel/ds.o and ptrace.o. I've included
> the patch below, please let me know if the code looks OK (via review)
> and the patch applies. I'll test it once I can reasonably resolve all
> conflicts between linux-next and mmotm.

OK, we'll give it a shot, thanks.
* Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)
  2008-09-19 20:14 ` Andrew Morton
@ 2008-09-19 21:28 ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread

From: Balbir Singh @ 2008-09-19 21:28 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, skumar, yamamoto, menage, lizf, linux-kernel, xemul,
    kamezawa.hiroyu, mingo

Andrew Morton wrote:
> On Thu, 18 Sep 2008 23:38:23 -0700
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> I could not apply mmotm to linux-next (both downloaded right now).
>
> mmotm includes linux-next.patch. mmotm is based upon the most recent
> 2.6.x-rcy.

Thanks for the info.

[snip]

> OK, we'll give it a shot, thanks.

Thanks!

--
Balbir
end of thread, other threads: [~2008-09-19 21:28 UTC | newest]

Thread overview: 27+ messages
2008-05-14 13:09 [-mm][PATCH 0/4] Add memrlimit controller (v4) Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 1/4] Add memrlimit controller documentation (v4) Balbir Singh
2008-05-15  1:20   ` Li Zefan
2008-05-15 18:22   ` Avi Kivity
2008-05-15 18:39     ` Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 2/4] Setup the memrlimit controller (v4) Balbir Singh
2008-05-14 13:29   ` Pavel Emelyanov
2008-05-14 13:09 ` [-mm][PATCH 3/4] cgroup mm owner callback changes to add task info (v4) Balbir Singh
2008-05-14 13:09 ` [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4) Balbir Singh
2008-05-14 13:25   ` Balbir Singh
2008-05-15  2:25     ` Paul Menage
2008-05-15  6:17       ` Balbir Singh
2008-05-15  6:55         ` Paul Menage
2008-05-15  7:03           ` Balbir Singh
2008-05-15  7:39             ` Paul Menage
2008-05-15  8:25               ` Balbir Singh
2008-05-15 15:28                 ` Paul Menage
2008-05-15 17:01                   ` Balbir Singh
2008-05-17 20:15                     ` Balbir Singh
2008-05-17 20:17                       ` Balbir Singh
2008-05-14 13:32   ` Pavel Emelyanov
2008-05-14 19:39     ` Balbir Singh
2008-09-18 20:54   ` Andrew Morton
2008-09-18 23:55     ` Balbir Singh
2008-09-19  6:38     ` Balbir Singh
2008-09-19 20:14       ` Andrew Morton
2008-09-19 21:28         ` Balbir Singh