* [PATCH 0/2] Mempressure cgroup
@ 2013-01-04 8:27 Anton Vorontsov
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-04 8:27 UTC (permalink / raw)
To: David Rientjes
Cc: Pekka Enberg, Mel Gorman, Glauber Costa, Michal Hocko,
Kirill A. Shutemov, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
Hi all,
Here is another round of the mempressure cgroup. This time I dared to
remove the RFC tag. :)
In this revision:
- Addressed most of Kirill Shutemov's comments. I didn't bother
implementing per-level lists, though: it would needlessly complicate the
logic, and the gain would only be visible with lots of watchers (which
we don't have for our use-cases). But it is always an option to add the
feature later;
- I've split the patch into two: 'shrinker' and 'levels' parts. While the
full-fledged userland shrinker is an interesting idea, we don't have any
users ready for it, so I won't advocate for it too much.
And since at least Kirill has some concerns about it, I don't want the
shrinker to block the pressure levels.
So, these are now separate. At some point, I'd like to see both of them
merged, but if anything, let's discuss them separately;
- Rebased onto v3.8-rc2.
RFC v2 (http://lkml.org/lkml/2012/12/10/128):
- Added documentation, describes APIs and the purpose;
- Implemented shrinker interface, this is based on Andrew's idea and
supersedes my "balance" level idea;
- The shrinker interface comes with a stress-test utility, which is what
Andrew was also asking for: a simple app that we can run to see if the
thing works as expected;
- Added reclaimer's target_mem_cgroup handling;
- As promised, added support for multiple listeners, and fixed some other
comments on the previous RFC.
RFC v1 (http://lkml.org/lkml/2012/11/28/109)
--
Documentation/cgroups/mempressure.txt | 97 +++++
Documentation/cgroups/mempressure_test.c | 213 ++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/vmstat.h | 11 +
init/Kconfig | 13 +
mm/Makefile | 1 +
mm/mempressure.c | 487 +++++++++++++++++++++++
mm/vmscan.c | 4 +
8 files changed, 832 insertions(+)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:27 [PATCH 0/2] Mempressure cgroup Anton Vorontsov
@ 2013-01-04 8:29 ` Anton Vorontsov
2013-01-04 15:05 ` Kirill A. Shutemov
` (8 more replies)
2013-01-04 8:29 ` [PATCH 2/2] Add shrinker interface for " Anton Vorontsov
2013-01-11 19:13 ` [PATCH 0/2] Mempressure cgroup Luiz Capitulino
2 siblings, 9 replies; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-04 8:29 UTC (permalink / raw)
To: David Rientjes
Cc: Pekka Enberg, Mel Gorman, Glauber Costa, Michal Hocko,
Kirill A. Shutemov, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
This commit implements David Rientjes' idea of mempressure cgroup.
The main characteristics are the same as what I tried to add to the vmevent
API; internally, it uses Mel Gorman's idea of a scanned/reclaimed ratio for
the pressure index calculation. But we do not expose the index to userland.
Instead, there are three levels of pressure:
o low (just reclaiming, e.g. caches are draining);
o medium (allocation cost becomes high, e.g. swapping);
o oom (about to oom very soon).
The rationale behind exposing levels and not the raw pressure index
described here: http://lkml.org/lkml/2012/11/16/675
A task may be in cpuset, memcg and mempressure cgroups simultaneously, so
by rearranging the tasks it is possible to watch a specific kind of
pressure (i.e. as caused by a cpuset and/or memcg).
Note that while this adds the cgroups support, the code is well separated,
and eventually we might add a lightweight, non-cgroups API, e.g. vmevent.
But that is another story.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
Documentation/cgroups/mempressure.txt | 50 ++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/vmstat.h | 11 ++
init/Kconfig | 12 ++
mm/Makefile | 1 +
mm/mempressure.c | 330 ++++++++++++++++++++++++++++++++++
mm/vmscan.c | 4 +
7 files changed, 414 insertions(+)
create mode 100644 Documentation/cgroups/mempressure.txt
create mode 100644 mm/mempressure.c
diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
new file mode 100644
index 0000000..dbc0aca
--- /dev/null
+++ b/Documentation/cgroups/mempressure.txt
@@ -0,0 +1,50 @@
+ Memory pressure cgroup
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+ Before using the mempressure cgroup, make sure you have it mounted:
+
+ # cd /sys/fs/cgroup/
+ # mkdir mempressure
+ # mount -t cgroup cgroup ./mempressure -o mempressure
+
+ It is possible to combine cgroups, for example you can mount memory
+ (memcg) and mempressure cgroups together:
+
+ # mount -t cgroup cgroup ./mempressure -o memory,mempressure
+
+ That way the reported pressure will honour memory cgroup limits. The
+ same goes for cpusets.
+
+ After the hierarchy is mounted, you can use the following API:
+
+ /sys/fs/cgroup/.../mempressure.level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ To control the interactivity/memory allocation cost, one can use the
+ pressure level notifications. The levels are defined as follows:
+
+ The "low" level means that the system is reclaiming memory for new
+ allocations. Monitoring the reclaiming activity might be useful for
+ maintaining the system's overall cache level. Upon notification, the
+ program (typically an "Activity Manager") might analyze vmstat and act
+ in advance (e.g. prematurely shut down unimportant services).
+
+ The "medium" level means that the system is experiencing medium memory
+ pressure; there is some mild swapping activity. Upon this event,
+ applications may decide to free any resources that can be easily
+ reconstructed or re-read from disk.
+
+ The "oom" level means that the system is actively thrashing: it is
+ about to run out of memory (OOM), or the in-kernel OOM killer is about
+ to trigger. Applications should do whatever they can to help the system.
+
+ Event control:
+ Used to set up an eventfd with a level threshold. The argument to
+ the event control specifies the level threshold.
+ Read:
+ Reads the memory pressure level: low, medium or oom.
+ Write:
+ Not implemented.
+ Test:
+ To set up a notification:
+
+ # cgroup_event_listener ./mempressure.level low
+ ("low", "medium", "oom" are permitted.)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index f204a7a..b9802e2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
/* */
+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
+SUBSYS(mpc_cgroup)
+#endif
+
+/* */
+
#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
SUBSYS(devices)
#endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index a13291f..c1a66c7 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -10,6 +10,17 @@
extern int sysctl_stat_interval;
+struct mem_cgroup;
+#ifdef CONFIG_CGROUP_MEMPRESSURE
+extern void vmpressure(struct mem_cgroup *memcg,
+ ulong scanned, ulong reclaimed);
+extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
+#else
+static inline void vmpressure(struct mem_cgroup *memcg,
+ ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
+#endif
+
#ifdef CONFIG_VM_EVENT_COUNTERS
/*
* Light weight per cpu counter implementation.
diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..d526249 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -891,6 +891,18 @@ config MEMCG_KMEM
the kmem extension can use it to guarantee that no group of processes
will ever exhaust kernel resources alone.
+config CGROUP_MEMPRESSURE
+ bool "Memory pressure monitor for Control Groups"
+ help
+ The memory pressure monitor cgroup provides a facility for
+ userland programs so that they could easily assist the kernel
+ with the memory management. So far the API provides simple,
+ levels-based memory pressure notifications.
+
+ For more information see Documentation/cgroups/mempressure.txt
+
+ If unsure, say N.
+
config CGROUP_HUGETLB
bool "HugeTLB Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..e69bbda 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/mempressure.c b/mm/mempressure.c
new file mode 100644
index 0000000..ea312bb
--- /dev/null
+++ b/mm/mempressure.c
@@ -0,0 +1,330 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov@linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+
+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging the medium/oom levels. A small window size can
+ * cause a lot of false positives, but a too-large window will delay
+ * the notifications.
+ */
+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const uint vmpressure_level_med = 60;
+static const uint vmpressure_level_oom = 99;
+static const uint vmpressure_level_oom_prio = 4;
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_OOM,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_OOM] = "oom",
+};
+
+static enum vmpressure_levels vmpressure_level(uint pressure)
+{
+ if (pressure >= vmpressure_level_oom)
+ return VMPRESSURE_OOM;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static ulong vmpressure_calc_level(uint win, uint s, uint r)
+{
+ ulong p;
+
+ if (!s)
+ return 0;
+
+ /*
+ * We calculate the ratio (in percent) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * the time is in the VM reclaimer's "ticks", i.e. the number of
+ * pages scanned. This makes it possible to set the desired
+ * reaction time and also serves as a rate limit.
+ */
+ p = win - (r * win / s);
+ p = p * 100 / win;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+ return vmpressure_level(p);
+}
+
+void vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
+{
+ if (!scanned)
+ return;
+ mpc_vmpressure(memcg, scanned, reclaimed);
+}
+
+void vmpressure_prio(struct mem_cgroup *memcg, int prio)
+{
+ if (prio > vmpressure_level_oom_prio)
+ return;
+
+ /* OK, the prio is below the threshold, send the pre-OOM event. */
+ vmpressure(memcg, vmpressure_win, 0);
+}
+
+/*
+ * Memory pressure cgroup code
+ */
+
+struct mpc_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ struct list_head node;
+};
+
+struct mpc_state {
+ struct cgroup_subsys_state css;
+
+ uint scanned;
+ uint reclaimed;
+ struct mutex sr_lock;
+
+ struct list_head events;
+ struct mutex events_lock;
+
+ struct work_struct work;
+};
+
+static struct mpc_state *wk2mpc(struct work_struct *wk)
+{
+ return container_of(wk, struct mpc_state, work);
+}
+
+static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct mpc_state, css);
+}
+
+static struct mpc_state *tsk2mpc(struct task_struct *tsk)
+{
+ return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
+}
+
+static struct mpc_state *cg2mpc(struct cgroup *cg)
+{
+ return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
+}
+
+static void mpc_event(struct mpc_state *mpc, ulong s, ulong r)
+{
+ struct mpc_event *ev;
+ int level = vmpressure_calc_level(vmpressure_win, s, r);
+
+ mutex_lock(&mpc->events_lock);
+
+ list_for_each_entry(ev, &mpc->events, node) {
+ if (level >= ev->level)
+ eventfd_signal(ev->efd, 1);
+ }
+
+ mutex_unlock(&mpc->events_lock);
+}
+
+static void mpc_vmpressure_wk_fn(struct work_struct *wk)
+{
+ struct mpc_state *mpc = wk2mpc(wk);
+ ulong s;
+ ulong r;
+
+ mutex_lock(&mpc->sr_lock);
+ s = mpc->scanned;
+ r = mpc->reclaimed;
+ mpc->scanned = 0;
+ mpc->reclaimed = 0;
+ mutex_unlock(&mpc->sr_lock);
+
+ mpc_event(mpc, s, r);
+}
+
+static void __mpc_vmpressure(struct mpc_state *mpc, ulong s, ulong r)
+{
+ mutex_lock(&mpc->sr_lock);
+ mpc->scanned += s;
+ mpc->reclaimed += r;
+ mutex_unlock(&mpc->sr_lock);
+
+ if (s < vmpressure_win || work_pending(&mpc->work))
+ return;
+
+ schedule_work(&mpc->work);
+}
+
+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
+{
+ /*
+ * There are two options for implementing cgroup pressure
+ * notifications:
+ *
+ * - Store the pressure counter atomically in the task struct. Upon
+ * hitting the 'window', wake up a workqueue that walks every
+ * task and sums the per-thread pressure into the cgroup pressure
+ * (to which the task belongs). The cons are obvious: it bloats
+ * the task struct, has to walk all processes, and makes the
+ * pressure less accurate (the window becomes per-thread);
+ *
+ * - Store pressure counters in per-cgroup state. This is easy and
+ * straightforward, and that's how we do things here. But this
+ * requires us to keep the vmpressure hooks out of the hot path,
+ * since we have to grab some locks.
+ */
+
+#ifdef CONFIG_MEMCG
+ if (memcg) {
+ struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
+ struct cgroup *cg = css->cgroup;
+ struct mpc_state *mpc = cg2mpc(cg);
+
+ if (mpc)
+ __mpc_vmpressure(mpc, s, r);
+ return;
+ }
+#endif
+ task_lock(current);
+ __mpc_vmpressure(tsk2mpc(current), s, r);
+ task_unlock(current);
+}
+
+static struct cgroup_subsys_state *mpc_css_alloc(struct cgroup *cg)
+{
+ struct mpc_state *mpc;
+
+ mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
+ if (!mpc)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&mpc->sr_lock);
+ mutex_init(&mpc->events_lock);
+ INIT_LIST_HEAD(&mpc->events);
+ INIT_WORK(&mpc->work, mpc_vmpressure_wk_fn);
+
+ return &mpc->css;
+}
+
+static void mpc_css_free(struct cgroup *cg)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+
+ kfree(mpc);
+}
+
+static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t sz, loff_t *ppos)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ uint level;
+ const char *str;
+
+ mutex_lock(&mpc->sr_lock);
+
+ level = vmpressure_calc_level(vmpressure_win,
+ mpc->scanned, mpc->reclaimed);
+
+ mutex_unlock(&mpc->sr_lock);
+
+ str = vmpressure_str_levels[level];
+ return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
+}
+
+static int mpc_register_level(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd, const char *args)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ struct mpc_event *ev;
+ int lvl;
+
+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+ if (!strcmp(vmpressure_str_levels[lvl], args))
+ break;
+ }
+
+ if (lvl >= VMPRESSURE_NUM_LEVELS)
+ return -EINVAL;
+
+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+ if (!ev)
+ return -ENOMEM;
+
+ ev->efd = eventfd;
+ ev->level = lvl;
+
+ mutex_lock(&mpc->events_lock);
+ list_add(&ev->node, &mpc->events);
+ mutex_unlock(&mpc->events_lock);
+
+ return 0;
+}
+
+static void mpc_unregister_level(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ struct mpc_event *ev;
+
+ mutex_lock(&mpc->events_lock);
+ list_for_each_entry(ev, &mpc->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&mpc->events_lock);
+}
+
+static struct cftype mpc_files[] = {
+ {
+ .name = "level",
+ .read = mpc_read_level,
+ .register_event = mpc_register_level,
+ .unregister_event = mpc_unregister_level,
+ },
+ {},
+};
+
+struct cgroup_subsys mpc_cgroup_subsys = {
+ .name = "mempressure",
+ .subsys_id = mpc_cgroup_subsys_id,
+ .css_alloc = mpc_css_alloc,
+ .css_free = mpc_css_free,
+ .base_cftypes = mpc_files,
+};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 16b42af..fed0e04 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1900,6 +1900,9 @@ restart:
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
+ vmpressure(sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned, nr_reclaimed);
+
/* reclaim/compaction might need reclaim to continue */
if (should_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
@@ -2122,6 +2125,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->target_mem_cgroup, sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.0.2
* [PATCH 2/2] Add shrinker interface for mempressure cgroup
2013-01-04 8:27 [PATCH 0/2] Mempressure cgroup Anton Vorontsov
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
@ 2013-01-04 8:29 ` Anton Vorontsov
2013-01-11 19:13 ` [PATCH 0/2] Mempressure cgroup Luiz Capitulino
2 siblings, 0 replies; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-04 8:29 UTC (permalink / raw)
To: David Rientjes
Cc: Pekka Enberg, Mel Gorman, Glauber Costa, Michal Hocko,
Kirill A. Shutemov, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
This commit implements Andrew Morton's idea of a kernel-controlled userland
reclaimer. This is very similar to the in-kernel shrinker, with one major
difference: it is asynchronous, i.e. it works like kswapd.
Note that the shrinker interface is not a substitute for the levels; the
two interfaces report different kinds of information (i.e. with the
shrinker you don't know the actual system state -- how bad/good the memory
situation is).
The interface is well documented and comes with a stress-test utility.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
Documentation/cgroups/mempressure.txt | 53 +++++++-
Documentation/cgroups/mempressure_test.c | 213 +++++++++++++++++++++++++++++++
init/Kconfig | 5 +-
mm/mempressure.c | 157 +++++++++++++++++++++++
4 files changed, 423 insertions(+), 5 deletions(-)
create mode 100644 Documentation/cgroups/mempressure_test.c
diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
index dbc0aca..5094749 100644
--- a/Documentation/cgroups/mempressure.txt
+++ b/Documentation/cgroups/mempressure.txt
@@ -16,10 +16,55 @@
After the hierarchy is mounted, you can use the following API:
+ /sys/fs/cgroup/.../mempressure.shrinker
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ The file implements userland shrinker (memory reclaimer) interface, so
+ that the kernel can ask userland to help with the memory reclaiming
+ process.
+
+ There are two basic concepts: chunks and chunk size. The program must
+ tell the kernel the granularity of its allocations (the chunk size) and
+ the number of reclaimable chunks. The granularity may not be 100%
+ accurate, but the more accurate it is, the better. E.g. suppose the
+ application has 200 page renders cached (but not displayed), 1 MB each:
+ the chunk size is 1 MB, and the number of chunks is 200.
+
+ The granularity is specified during shrinker registration (i.e. via
+ argument to the event_control cgroup file; and it is OK to register
+ multiple shrinkers for different granularities). The number of
+ reclaimable chunks is specified by writing to the mempressure.shrinker
+ file.
+
+ The notification comes through the eventfd() interface. Upon the
+ notification, a read() from the eventfd returns the number of chunks to
+ reclaim (free).
+
+ It is assumed that the application will free the specified number of
+ chunks before reading from the eventfd again. If that is not the case,
+ i.e. the program was not able to reclaim the chunks, the application
+ should re-add that number of chunks by writing to the
+ mempressure.shrinker file (otherwise the chunks won't be accounted for
+ by the kernel, since it assumes that they were reclaimed).
+
+ Event control:
+ Used to set up shrinker events. There is only one argument for the
+ event control: the chunk size in bytes.
+ Read:
+ Not implemented.
+ Write:
+ Writes must be in "<eventfd> <number of chunks>" format. Positive
+ numbers increment the internal counter, negative numbers decrement it
+ (the kernel prevents the counter from falling below zero).
+ Test:
+ See mempressure_test.c
+
/sys/fs/cgroup/.../mempressure.level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- To maintain the interactivity/memory allocation cost, one can use the
- pressure level notifications, and the levels are defined like this:
+ Instead of working on the bytes level (like shrinkers), one may decide
+ to maintain the interactivity/memory allocation cost.
+
+ For this, the cgroup has memory pressure level notifications, and the
+ levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring reclaiming activity might be useful for
@@ -30,7 +75,9 @@
The "medium" level means that the system is experiencing medium memory
pressure, there is some mild swapping activity. Upon this event
applications may decide to free any resources that can be easily
- reconstructed or re-read from a disk.
+ reconstructed or re-read from a disk. Note that for a fine-grained
+ control, you should probably use the shrinker interface, as described
+ above.
The "oom" level means that the system is actively thrashing, it is about
to out of memory (OOM) or even the in-kernel OOM killer is on its way to
diff --git a/Documentation/cgroups/mempressure_test.c b/Documentation/cgroups/mempressure_test.c
new file mode 100644
index 0000000..a6c770c
--- /dev/null
+++ b/Documentation/cgroups/mempressure_test.c
@@ -0,0 +1,213 @@
+/*
+ * mempressure shrinker test
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov@linaro.org>
+ *
+ * It is pretty simple: we create two threads. The first one constantly
+ * tries to allocate memory (more than we physically have); the second
+ * thread listens for the kernel shrinker notifications and frees the
+ * requested number of chunks. When we allocate more than the available
+ * RAM, the two threads start to fight. Ideally, we should not OOM (but
+ * if we reclaim slower than we allocate, things might OOM). Also,
+ * ideally we should not grow swap too much.
+ *
+ * The test accepts no arguments, so you can just run it and observe the
+ * output and memory usage (e.g. 'watch -n 0.2 free -m'). Upon ctrl+c,
+ * the test prints the total amount of memory we helped to reclaim.
+ *
+ * Compile with -pthread.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <errno.h>
+#include <sys/eventfd.h>
+#include <sys/sysinfo.h>
+
+#define CG "/sys/fs/cgroup/mempressure"
+#define CG_EVENT_CONTROL (CG "/cgroup.event_control")
+#define CG_SHRINKER (CG "/mempressure.shrinker")
+
+#define CHUNK_SIZE (1 * 1024 * 1024)
+
+static size_t num_chunks;
+
+static void **chunks;
+static pthread_mutex_t *locks;
+static int efd;
+static int sfd;
+
+static inline void pabort(bool f, int code, const char *str)
+{
+ if (!f)
+ return;
+ perror(str);
+ printf("(%d)\n", code);
+ abort();
+}
+
+static void init_shrinker(void)
+{
+ int cfd;
+ int ret;
+ char *str;
+
+ cfd = open(CG_EVENT_CONTROL, O_WRONLY);
+ pabort(cfd < 0, cfd, CG_EVENT_CONTROL);
+
+ sfd = open(CG_SHRINKER, O_RDWR);
+ pabort(sfd < 0, sfd, CG_SHRINKER);
+
+ efd = eventfd(0, 0);
+ pabort(efd < 0, efd, "eventfd()");
+
+ ret = asprintf(&str, "%d %d %d\n", efd, sfd, CHUNK_SIZE);
+ pabort(ret == -1, ret, "control string");
+ printf("%s\n", str);
+
+ ret = write(cfd, str, ret + 1);
+ pabort(ret == -1, ret, "write() to event_control");
+
+ free(str);
+}
+
+static void add_reclaimable(int chunks)
+{
+ int ret;
+ char *str;
+
+ ret = asprintf(&str, "%d %d\n", efd, chunks);
+ pabort(ret == -1, ret, "add_reclaimable, asprintf");
+
+ ret = write(sfd, str, ret + 1);
+ pabort(ret <= 0, ret, "add_reclaimable, write");
+
+ free(str);
+}
+
+static int chunks_to_reclaim(void)
+{
+ uint64_t n = 0;
+ int ret;
+
+ ret = read(efd, &n, sizeof(n));
+ pabort(ret <= 0, ret, "read() from eventfd");
+
+ printf("%d chunks to reclaim\n", (int)n);
+
+ return n;
+}
+
+static unsigned int reclaimed;
+
+static void print_stats(int signum)
+{
+ printf("\nTOTAL: helped to reclaim %d chunks (%d MB)\n",
+ reclaimed, reclaimed * CHUNK_SIZE / 1024 / 1024);
+ exit(0);
+}
+
+static void *shrinker_thr_fn(void *arg)
+{
+ puts("shrinker thread started");
+
+ sigaction(SIGINT, &(struct sigaction){.sa_handler = print_stats}, NULL);
+
+ while (1) {
+ unsigned int i = 0;
+ int n;
+
+ n = chunks_to_reclaim();
+
+ reclaimed += n;
+
+ while (n) {
+ pthread_mutex_lock(&locks[i]);
+ if (chunks[i]) {
+ free(chunks[i]);
+ chunks[i] = NULL;
+ n--;
+ }
+ pthread_mutex_unlock(&locks[i]);
+
+ i = (i + 1) % num_chunks;
+ }
+ }
+ return NULL;
+}
+
+static void consume_memory(void)
+{
+ unsigned int i = 0;
+ unsigned int j = 0;
+
+ puts("consuming memory...");
+
+ while (1) {
+ pthread_mutex_lock(&locks[i]);
+ if (!chunks[i]) {
+ chunks[i] = malloc(CHUNK_SIZE);
+ pabort(!chunks[i], 0, "chunks alloc failed");
+ memset(chunks[i], 0, CHUNK_SIZE);
+ j++;
+ }
+ pthread_mutex_unlock(&locks[i]);
+
+ if (j >= num_chunks / 10) {
+ add_reclaimable(num_chunks / 10);
+ printf("added %d reclaimable chunks\n", j);
+ j = 0;
+ }
+
+ i = (i + 1) % num_chunks;
+ }
+}
+
+int main(int argc, char *argv[])
+{
+ int ret;
+ int i;
+ pthread_t shrinker_thr;
+ struct sysinfo si;
+
+ ret = sysinfo(&si);
+ pabort(ret != 0, ret, "sysinfo()");
+
+ num_chunks = (si.totalram + si.totalswap) * si.mem_unit / 1024 / 1024;
+
+ chunks = malloc(sizeof(*chunks) * num_chunks);
+ locks = malloc(sizeof(*locks) * num_chunks);
+ pabort(!chunks || !locks, ENOMEM, NULL);
+
+ init_shrinker();
+
+ for (i = 0; i < num_chunks; i++) {
+ ret = pthread_mutex_init(&locks[i], NULL);
+ pabort(ret != 0, ret, "pthread_mutex_init");
+ }
+
+ ret = pthread_create(&shrinker_thr, NULL, shrinker_thr_fn, NULL);
+ pabort(ret != 0, ret, "pthread_create(shrinker)");
+
+ consume_memory();
+
+ ret = pthread_join(shrinker_thr, NULL);
+ pabort(ret != 0, ret, "pthread_join(shrinker)");
+
+ return 0;
+}
diff --git a/init/Kconfig b/init/Kconfig
index d526249..bdb5ba2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -896,8 +896,9 @@ config CGROUP_MEMPRESSURE
help
The memory pressure monitor cgroup provides a facility for
userland programs so that they could easily assist the kernel
- with the memory management. So far the API provides simple,
- levels-based memory pressure notifications.
+ with the memory management. The API provides simple,
+ levels-based memory pressure notifications and a full-fledged
+ userland reclaimer.
For more information see Documentation/cgroups/mempressure.txt
diff --git a/mm/mempressure.c b/mm/mempressure.c
index ea312bb..5512326 100644
--- a/mm/mempressure.c
+++ b/mm/mempressure.c
@@ -35,6 +35,10 @@ static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
* and for averaging medium/oom levels. Using small window sizes can cause
* lot of false positives, but too big window size will delay the
* notifications.
+ *
+ * The same window size is also used for the shrinker, so be aware. It
+ * might be a good idea to derive the window size from the machine
+ * size, similar to what we do for vmstat.
*/
static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
static const uint vmpressure_level_med = 60;
@@ -111,6 +115,13 @@ struct mpc_event {
struct list_head node;
};
+struct mpc_shrinker {
+ struct eventfd_ctx *efd;
+ size_t chunks;
+ size_t chunk_sz;
+ struct list_head node;
+};
+
struct mpc_state {
struct cgroup_subsys_state css;
@@ -121,6 +132,9 @@ struct mpc_state {
struct list_head events;
struct mutex events_lock;
+ struct list_head shrinkers;
+ struct mutex shrinkers_lock;
+
struct work_struct work;
};
@@ -144,6 +158,54 @@ static struct mpc_state *cg2mpc(struct cgroup *cg)
return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
}
+static void mpc_shrinker(struct mpc_state *mpc, ulong s, ulong r)
+{
+ struct mpc_shrinker *sh;
+ ssize_t to_reclaim_pages = s - r;
+
+ if (!to_reclaim_pages)
+ return;
+
+ mutex_lock(&mpc->shrinkers_lock);
+
+ /*
+ * To make the accounting more precise and to avoid excessive
+ * communication with the kernel, we operate on chunks instead of
+ * bytes. Say, asking to free 8 KB makes little sense if the
+ * granularity of the allocations is 10 MB. Also, knowing the
+ * granularity (chunk size) and the number of reclaimable chunks,
+ * we just ask that N chunks be freed, and we assume that they
+ * will be freed, thus we decrement our internal counter straight
+ * away (i.e. userland does not need to report how much was
+ * reclaimed). But if userland could not free them, it is
+ * responsible for incrementing the counter back.
+ */
+ list_for_each_entry(sh, &mpc->shrinkers, node) {
+ size_t to_reclaim_chunks;
+
+ if (!sh->chunks)
+ continue;
+
+ to_reclaim_chunks = to_reclaim_pages *
+ PAGE_SIZE / sh->chunk_sz;
+ to_reclaim_chunks = min(sh->chunks, to_reclaim_chunks);
+
+ if (!to_reclaim_chunks)
+ continue;
+
+ sh->chunks -= to_reclaim_chunks;
+
+ eventfd_signal(sh->efd, to_reclaim_chunks);
+
+ to_reclaim_pages -= to_reclaim_chunks *
+ sh->chunk_sz / PAGE_SIZE;
+ if (to_reclaim_pages <= 0)
+ break;
+ }
+
+ mutex_unlock(&mpc->shrinkers_lock);
+}
+
static void mpc_event(struct mpc_state *mpc, ulong s, ulong r)
{
struct mpc_event *ev;
@@ -172,6 +234,7 @@ static void mpc_vmpressure_wk_fn(struct work_struct *wk)
mpc->reclaimed = 0;
mutex_unlock(&mpc->sr_lock);
+ mpc_shrinker(mpc, s, r);
mpc_event(mpc, s, r);
}
@@ -233,7 +296,9 @@ static struct cgroup_subsys_state *mpc_css_alloc(struct cgroup *cg)
mutex_init(&mpc->sr_lock);
mutex_init(&mpc->events_lock);
+ mutex_init(&mpc->shrinkers_lock);
INIT_LIST_HEAD(&mpc->events);
+ INIT_LIST_HEAD(&mpc->shrinkers);
INIT_WORK(&mpc->work, mpc_vmpressure_wk_fn);
return &mpc->css;
@@ -311,6 +376,92 @@ static void mpc_unregister_level(struct cgroup *cg, struct cftype *cft,
mutex_unlock(&mpc->events_lock);
}
+static int mpc_register_shrinker(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ struct mpc_shrinker *sh;
+ ulong chunk_sz;
+ int ret;
+
+ ret = kstrtoul(args, 10, &chunk_sz);
+ if (ret)
+ return ret;
+
+ sh = kzalloc(sizeof(*sh), GFP_KERNEL);
+ if (!sh)
+ return -ENOMEM;
+
+ sh->efd = eventfd;
+ sh->chunk_sz = chunk_sz;
+
+ mutex_lock(&mpc->shrinkers_lock);
+ list_add(&sh->node, &mpc->shrinkers);
+ mutex_unlock(&mpc->shrinkers_lock);
+
+ return 0;
+}
+
+static void mpc_unregister_shrinker(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ struct mpc_shrinker *sh;
+
+ mutex_lock(&mpc->shrinkers_lock);
+ list_for_each_entry(sh, &mpc->shrinkers, node) {
+ if (sh->efd != eventfd)
+ continue;
+ list_del(&sh->node);
+ kfree(sh);
+ break;
+ }
+ mutex_unlock(&mpc->shrinkers_lock);
+}
+
+static int mpc_write_shrinker(struct cgroup *cg, struct cftype *cft,
+ const char *str)
+{
+ struct mpc_state *mpc = cg2mpc(cg);
+ struct mpc_shrinker *sh;
+ struct eventfd_ctx *eventfd;
+ struct file *file;
+ ssize_t chunks;
+ int fd;
+ int ret;
+
+ ret = sscanf(str, "%d %zd\n", &fd, &chunks);
+ if (ret != 2)
+ return -EINVAL;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+
+ eventfd = eventfd_ctx_fileget(file);
+
+ mutex_lock(&mpc->shrinkers_lock);
+
+ /* Can avoid the loop once we introduce ->priv for eventfd_ctx. */
+ list_for_each_entry(sh, &mpc->shrinkers, node) {
+ if (sh->efd != eventfd)
+ continue;
+ if (chunks < 0 && abs(chunks) > sh->chunks)
+ sh->chunks = 0;
+ else
+ sh->chunks += chunks;
+ break;
+ }
+
+ mutex_unlock(&mpc->shrinkers_lock);
+
+ eventfd_ctx_put(eventfd);
+ fput(file);
+
+ return 0;
+}
+
static struct cftype mpc_files[] = {
{
.name = "level",
@@ -318,6 +469,12 @@ static struct cftype mpc_files[] = {
.register_event = mpc_register_level,
.unregister_event = mpc_unregister_level,
},
+ {
+ .name = "shrinker",
+ .register_event = mpc_register_shrinker,
+ .unregister_event = mpc_unregister_shrinker,
+ .write_string = mpc_write_shrinker,
+ },
{},
};
--
1.8.0.2
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
@ 2013-01-04 15:05 ` Kirill A. Shutemov
2013-01-07 8:51 ` Kamezawa Hiroyuki
` (7 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: Kirill A. Shutemov @ 2013-01-04 15:05 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>
> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
2013-01-04 15:05 ` Kirill A. Shutemov
@ 2013-01-07 8:51 ` Kamezawa Hiroyuki
2013-01-08 7:29 ` Anton Vorontsov
2013-01-08 8:49 ` Minchan Kim
` (6 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07 8:51 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
(2013/01/04 17:29), Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>
> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
I'm just curious..
> ---
> Documentation/cgroups/mempressure.txt | 50 ++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/vmstat.h | 11 ++
> init/Kconfig | 12 ++
> mm/Makefile | 1 +
> mm/mempressure.c | 330 ++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 4 +
> 7 files changed, 414 insertions(+)
> create mode 100644 Documentation/cgroups/mempressure.txt
> create mode 100644 mm/mempressure.c
>
> diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
> new file mode 100644
> index 0000000..dbc0aca
> --- /dev/null
> +++ b/Documentation/cgroups/mempressure.txt
> @@ -0,0 +1,50 @@
> + Memory pressure cgroup
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> + Before using the mempressure cgroup, make sure you have it mounted:
> +
> + # cd /sys/fs/cgroup/
> + # mkdir mempressure
> + # mount -t cgroup cgroup ./mempressure -o mempressure
> +
> + It is possible to combine cgroups, for example you can mount memory
> + (memcg) and mempressure cgroups together:
> +
> + # mount -t cgroup cgroup ./mempressure -o memory,mempressure
> +
> + That way the reported pressure will honour memory cgroup limits. The
> + same goes for cpusets.
> +
> + After the hierarchy is mounted, you can use the following API:
> +
> + /sys/fs/cgroup/.../mempressure.level
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> + To maintain the interactivity/memory allocation cost, one can use the
> + pressure level notifications, and the levels are defined like this:
> +
> + The "low" level means that the system is reclaiming memory for new
> + allocations. Monitoring reclaiming activity might be useful for
> + maintaining overall system's cache level. Upon notification, the program
> + (typically "Activity Manager") might analyze vmstat and act in advance
> + (i.e. prematurely shutdown unimportant services).
> +
> + The "medium" level means that the system is experiencing medium memory
> + pressure, there is some mild swapping activity. Upon this event
> + applications may decide to free any resources that can be easily
> + reconstructed or re-read from a disk.
> +
> + The "oom" level means that the system is actively thrashing, it is
> + about to run out of memory (OOM), or the in-kernel OOM killer is even
> + on its way to trigger. Applications should do whatever they can to
> + help the system.
> +
> + Event control:
> + Is used to set up an eventfd with a level threshold. The argument to
> + the event control specifies the level threshold.
> + Read:
> + Reads memory pressure levels: low, medium or oom.
> + Write:
> + Not implemented.
> + Test:
> + To set up a notification:
> +
> + # cgroup_event_listener ./mempressure.level low
> + ("low", "medium", "oom" are permitted.)
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index f204a7a..b9802e2 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
> +SUBSYS(mpc_cgroup)
> +#endif
> +
> +/* */
> +
> #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
> SUBSYS(devices)
> #endif
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index a13291f..c1a66c7 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -10,6 +10,17 @@
>
> extern int sysctl_stat_interval;
>
> +struct mem_cgroup;
> +#ifdef CONFIG_CGROUP_MEMPRESSURE
> +extern void vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed);
> +extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
> +#else
> +static inline void vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed) {}
> +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
> +#endif
> +
> #ifdef CONFIG_VM_EVENT_COUNTERS
> /*
> * Light weight per cpu counter implementation.
> diff --git a/init/Kconfig b/init/Kconfig
> index 7d30240..d526249 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -891,6 +891,18 @@ config MEMCG_KMEM
> the kmem extension can use it to guarantee that no group of processes
> will ever exhaust kernel resources alone.
>
> +config CGROUP_MEMPRESSURE
> + bool "Memory pressure monitor for Control Groups"
> + help
> + The memory pressure monitor cgroup provides a facility for
> + userland programs so that they could easily assist the kernel
> + with the memory management. So far the API provides simple,
> + levels-based memory pressure notifications.
> +
> + For more information see Documentation/cgroups/mempressure.txt
> +
> + If unsure, say N.
> +
> config CGROUP_HUGETLB
> bool "HugeTLB Resource Controller for Control Groups"
> depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
> diff --git a/mm/Makefile b/mm/Makefile
> index 3a46287..e69bbda 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> diff --git a/mm/mempressure.c b/mm/mempressure.c
> new file mode 100644
> index 0000000..ea312bb
> --- /dev/null
> +++ b/mm/mempressure.c
> @@ -0,0 +1,330 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + * Anton Vorontsov <anton.vorontsov@linaro.org>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/oom levels. Using small window sizes can cause
> + * lot of false positives, but too big window size will delay the
> + * notifications.
> + */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
Hmm... isn't this window size too small ?
If vmscan cannot find a reclaimable page while scanning 2M of pages in a zone,
oom notify will be returned. Right ?
Thanks,
-Kame
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-07 8:51 ` Kamezawa Hiroyuki
@ 2013-01-08 7:29 ` Anton Vorontsov
2013-01-08 7:57 ` leonid.moiseichuk
2013-01-08 8:24 ` Kamezawa Hiroyuki
0 siblings, 2 replies; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-08 7:29 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Mon, Jan 07, 2013 at 05:51:46PM +0900, Kamezawa Hiroyuki wrote:
[...]
> I'm just curious..
Thanks for taking a look! :)
[...]
> > +/*
> > + * The window size is the number of scanned pages before we try to analyze
> > + * the scanned/reclaimed ratio (or difference).
> > + *
> > + * It is used as a rate-limit tunable for the "low" level notification,
> > + * and for averaging medium/oom levels. Using small window sizes can cause
> > + * lot of false positives, but too big window size will delay the
> > + * notifications.
> > + */
> > +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> > +static const uint vmpressure_level_med = 60;
> > +static const uint vmpressure_level_oom = 99;
> > +static const uint vmpressure_level_oom_prio = 4;
> > +
>
> Hmm... isn't this window size too small ?
> If vmscan cannot find a reclaimable page while scanning 2M of pages in a zone,
> oom notify will be returned. Right ?
Yup, you are right, if we were not able to find anything within the window
size (which is 2M, but see below), then it is effectively the "OOM level".
The thing is, the vmpressure reports... the pressure. :) Or, the
allocation cost, and if the cost becomes high, it is no good.
The 2M is, of course, not ideal. And the "ideal" depends on many factors,
akin to vmstat. Actually, I dream about deriving the window size from
zone->stat_threshold, which would make the window automatically adjustable
for different "machine sizes" (as we do in calculate_normal_threshold(),
in vmstat.c).
But again, this is all "implementation details"; tunable stuff that we can
either adjust ourselves as needed, or try to be smart, i.e. apply some
heuristics, again, as in vmstat.
Thanks,
Anton
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH 1/2] Add mempressure cgroup
2013-01-08 7:29 ` Anton Vorontsov
@ 2013-01-08 7:57 ` leonid.moiseichuk
2013-01-08 8:24 ` Kamezawa Hiroyuki
1 sibling, 0 replies; 33+ messages in thread
From: leonid.moiseichuk @ 2013-01-08 7:57 UTC (permalink / raw)
To: anton.vorontsov, kamezawa.hiroyu
Cc: rientjes, penberg, mgorman, glommer, mhocko, kirill, lcapitulino,
akpm, gthelen, kosaki.motohiro, minchan, b.zolnierkie,
john.stultz, linux-mm, linux-kernel, linaro-kernel, patches,
kernel-team
-----Original Message-----
From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org]
Sent: 08 January, 2013 08:30
...
> > +static const uint vmpressure_level_med = 60;
> > +static const uint vmpressure_level_oom = 99;
> > +static const uint vmpressure_level_oom_prio = 4;
> > +
..
vmpressure_level_oom = 99 seems quite high if I understand it as a global threshold. If I am not wrong, in older kernel versions the kernel-only memory border was set at 1/32 of available memory, meaning no allocations for user space once the amount of free memory reached 1/32. So decreasing this parameter to 95 or 90 would allow the notification to be propagated to user space and handled in time.
Best wishes,
Leonid
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-08 7:29 ` Anton Vorontsov
2013-01-08 7:57 ` leonid.moiseichuk
@ 2013-01-08 8:24 ` Kamezawa Hiroyuki
1 sibling, 0 replies; 33+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-08 8:24 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
(2013/01/08 16:29), Anton Vorontsov wrote:
> On Mon, Jan 07, 2013 at 05:51:46PM +0900, Kamezawa Hiroyuki wrote:
> [...]
>> I'm just curious..
>
> Thanks for taking a look! :)
>
> [...]
>>> +/*
>>> + * The window size is the number of scanned pages before we try to analyze
>>> + * the scanned/reclaimed ratio (or difference).
>>> + *
>>> + * It is used as a rate-limit tunable for the "low" level notification,
>>> + * and for averaging medium/oom levels. Using small window sizes can cause
>>> + * lot of false positives, but too big window size will delay the
>>> + * notifications.
>>> + */
>>> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
>>> +static const uint vmpressure_level_med = 60;
>>> +static const uint vmpressure_level_oom = 99;
>>> +static const uint vmpressure_level_oom_prio = 4;
>>> +
>>
>> Hmm... isn't this window size too small ?
>> If vmscan cannot find a reclaimable page while scanning 2M of pages in a zone,
>> oom notify will be returned. Right ?
>
> Yup, you are right, if we were not able to find anything within the window
> size (which is 2M, but see below), then it is effectively the "OOM level".
> The thing is, the vmpressure reports... the pressure. :) Or, the
> allocation cost, and if the cost becomes high, it is no good.
>
> The 2M is, of course, not ideal. And the "ideal" depends on many factors,
> alike to vmstat. And, actually I dream about deriving the window size from
> zone->stat_threshold, which would make the window automatically adjustable
> for different "machine sizes" (as we do in calculate_normal_threshold(),
> in vmstat.c).
>
> But again, this is all "implementation details"; tunable stuff that we can
> either adjust ourselves as needed, or try to be smart, i.e. apply some
> heuristics, again, as in vmstat.
>
Hmm, I like automatic adjustment for things like this (but it may need to be
tunable by the user). My concern is, for example, that if a qemu-kvm with
pci-passthrough is running on a node and using most of the memory on it, the
interface will say "Hey, it's near OOM" to users. We may need complicated
heuristics ;)
Anyway, your approach seems interesting to me, but it seems peaky to usual
users. Users should know what to check (vmstat, zoneinfo, malloc latency ??)
when they get a notification, before raising a real alarm. (This is not
explained in the doc.) For example, if the user cares about swap usage, he
should check it.
I would be glad if you explained in the doc that this interface just gives a
hint, and notifies about the status of _recent_ vmscans over some window,
that is, the latency of recent memory allocations. Users should confirm the
real status and make the final judgement by themselves.
The point is that this notification is important because it is quick and
related to ongoing memory allocation latency. But the kernel cannot be sure
there is long-standing heavy VM pressure.
I'm sorry if I misunderstand the concept.
Thank you,
-Kame
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
2013-01-04 15:05 ` Kirill A. Shutemov
2013-01-07 8:51 ` Kamezawa Hiroyuki
@ 2013-01-08 8:49 ` Minchan Kim
2013-01-09 22:14 ` Anton Vorontsov
2013-01-08 21:44 ` Andrew Morton
` (5 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2013-01-08 8:49 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
Hi Anton,
On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>
> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Sorry, I still didn't look at the cgroup part of your implementation,
but I have had a question for a long time.
How can we make sure about false positives with respect to zones and NUMA?
I mean, the DMA zone may be short in the system, so the VM notifies the
user, and the user frees memory that is all in the NORMAL zone, because he
cannot know which zones his pages live in. NUMA is ditto.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (2 preceding siblings ...)
2013-01-08 8:49 ` Minchan Kim
@ 2013-01-08 21:44 ` Andrew Morton
2013-01-09 14:10 ` Glauber Costa
2013-01-09 8:56 ` Glauber Costa
` (4 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2013-01-08 21:44 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Fri, 4 Jan 2013 00:29:11 -0800
Anton Vorontsov <anton.vorontsov@linaro.org> wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>
I'd have thought that it's pretty important to offer this feature to
non-cgroups setups. Restricting it to cgroups-only seems a large
limitation.
> diff --git a/mm/mempressure.c b/mm/mempressure.c
> new file mode 100644
> index 0000000..ea312bb
> --- /dev/null
> +++ b/mm/mempressure.c
> @@ -0,0 +1,330 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + * Anton Vorontsov <anton.vorontsov@linaro.org>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
mm/ doesn't use uint or ulong. In fact I can find zero uses of either
in all of mm/.
I don't have a problem with them personally - they're short and clear.
But we just ... don't do that. Perhaps we should start using them.
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/oom levels. Using small window sizes can cause
> + * lot of false positives, but too big window size will delay the
> + * notifications.
> + */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_OOM,
VMPRESSURE_OOM seems an odd-man-out. VMPRESSURE_HIGH would be pleasing.
> + VMPRESSURE_NUM_LEVELS,
> +};
> +
>
> ...
>
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
> +{
> + /*
> + * There are two options for implementing cgroup pressure
> + * notifications:
> + *
> + * - Store pressure counter atomically in the task struct. Upon
> + * hitting 'window' wake up a workqueue that will walk every
> + * task and sum per-thread pressure into cgroup pressure (to
> + * which the task belongs). The cons are obvious: bloats task
> + * struct, have to walk all processes and makes pressue less
> + * accurate (the window becomes per-thread);
> + *
> + * - Store pressure counters in per-cgroup state. This is easy and
> + * straightforward, and that's how we do things here. But this
> + * requires us to not put the vmpressure hooks into hotpath,
> + * since we have to grab some locks.
> + */
> +
> +#ifdef CONFIG_MEMCG
> + if (memcg) {
> + struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
> + struct cgroup *cg = css->cgroup;
> + struct mpc_state *mpc = cg2mpc(cg);
> +
> + if (mpc)
> + __mpc_vmpressure(mpc, s, r);
> + return;
> + }
> +#endif
> + task_lock(current);
> + __mpc_vmpressure(tsk2mpc(current), s, r);
> + task_unlock(current);
> +}
The task_lock() is mysterious. What's it protecting? That's unobvious
and afaict undocumented.
Also it is buggy: __mpc_vmpressure() does mutex_lock().
Documentation/SubmitChecklist section 12 has handy hints!
>
> ...
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (3 preceding siblings ...)
2013-01-08 21:44 ` Andrew Morton
@ 2013-01-09 8:56 ` Glauber Costa
2013-01-09 9:15 ` Andrew Morton
2013-01-09 20:37 ` Tejun Heo
` (3 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Glauber Costa @ 2013-01-09 8:56 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Michal Hocko,
Kirill A. Shutemov, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
Hi.
I have a couple of small questions.
On 01/04/2013 12:29 PM, Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
Andrew already said he would like to see this exposed to non-cgroup
users; I'll just add to that: I'd like the interfaces to be consistent.
We need to make sure that cgroups and non-cgroup users will act on this
in the same way. So it is important that this is included in the
proposition, so we can judge and avoid a future kludge.
> diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
> new file mode 100644
> index 0000000..dbc0aca
> --- /dev/null
> +++ b/Documentation/cgroups/mempressure.txt
> @@ -0,0 +1,50 @@
> + Memory pressure cgroup
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> + Before using the mempressure cgroup, make sure you have it mounted:
> +
> + # cd /sys/fs/cgroup/
> + # mkdir mempressure
> + # mount -t cgroup cgroup ./mempressure -o mempressure
> +
> + It is possible to combine cgroups, for example you can mount memory
> + (memcg) and mempressure cgroups together:
> +
> + # mount -t cgroup cgroup ./mempressure -o memory,mempressure
> +
Most of the time these days, the groups are mounted separately. The
tasks, however, still belong to one or more controllers regardless of
where they are mounted.
Can you describe a bit better (not only in a reply, but also by updating
the docs) what happens when:
1) both cpusets and memcg are present. Which one takes precedence? Will
there be a way to differentiate which kind of pressure is being seen, so
that I as a task can adjust my actions accordingly?
2) the task belongs to a memcg (or cpuset), but the controllers themselves
are mounted separately. Is it equivalent to mounting them jointly? Will
this fact just be ignored by the pressure levels?
I can guess the answer to some of them from the code, but I think it is
quite important to have all this crystal clear.
> + ("low", "medium", "oom" are permitted.)
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index f204a7a..b9802e2 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
> +SUBSYS(mpc_cgroup)
> +#endif
It might be just me, but if one does not know what this is about, "mpc"
immediately brings something communication-related to mind. I would
suggest changing this to just plain "mempressure_cgroup", or something
more descriptive.
> diff --git a/mm/mempressure.c b/mm/mempressure.c
> new file mode 100644
> index 0000000..ea312bb
> --- /dev/null
> +++ b/mm/mempressure.c
> @@ -0,0 +1,330 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + * Anton Vorontsov <anton.vorontsov@linaro.org>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/oom levels. Using small window sizes can cause
> + * lot of false positives, but too big window size will delay the
> + * notifications.
> + */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_OOM,
> + VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char *vmpressure_str_levels[] = {
> + [VMPRESSURE_LOW] = "low",
> + [VMPRESSURE_MEDIUM] = "medium",
> + [VMPRESSURE_OOM] = "oom",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(uint pressure)
> +{
> + if (pressure >= vmpressure_level_oom)
> + return VMPRESSURE_OOM;
> + else if (pressure >= vmpressure_level_med)
> + return VMPRESSURE_MEDIUM;
> + return VMPRESSURE_LOW;
> +}
> +
> +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> +{
> + ulong p;
> +
> + if (!s)
> + return 0;
> +
> + /*
> + * We calculate the ratio (in percents) of how many pages were
> + * scanned vs. reclaimed in a given time frame (window). Note that
> + * time is in VM reclaimer's "ticks", i.e. number of pages
> + * scanned. This makes it possible to set desired reaction time
> + * and serves as a ratelimit.
> + */
> + p = win - (r * win / s);
> + p = p * 100 / win;
> +
> + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> +
> + return vmpressure_level(p);
> +}
> +
> +void vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
> +{
> + if (!scanned)
> + return;
> + mpc_vmpressure(memcg, scanned, reclaimed);
> +}
> +
> +void vmpressure_prio(struct mem_cgroup *memcg, int prio)
> +{
> + if (prio > vmpressure_level_oom_prio)
> + return;
> +
> + /* OK, the prio is below the threshold, send the pre-OOM event. */
> + vmpressure(memcg, vmpressure_win, 0);
> +}
> +
> +/*
> + * Memory pressure cgroup code
> + */
> +
> +struct mpc_event {
> + struct eventfd_ctx *efd;
> + enum vmpressure_levels level;
> + struct list_head node;
> +};
> +
> +struct mpc_state {
> + struct cgroup_subsys_state css;
> +
> + uint scanned;
> + uint reclaimed;
> + struct mutex sr_lock;
> +
> + struct list_head events;
> + struct mutex events_lock;
> +
> + struct work_struct work;
> +};
> +
> +static struct mpc_state *wk2mpc(struct work_struct *wk)
> +{
> + return container_of(wk, struct mpc_state, work);
> +}
> +
> +static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
> +{
> + return container_of(css, struct mpc_state, css);
> +}
> +
> +static struct mpc_state *tsk2mpc(struct task_struct *tsk)
> +{
> + return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
> +}
> +
> +static struct mpc_state *cg2mpc(struct cgroup *cg)
> +{
> + return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
> +}
I think we would be better off with more descriptive names here as well.
Other cgroups follow the convention of using _to_ and _from_ in names
instead of 2.
For instance, task_to_mempressure is a lot more descriptive than
"tsk2mpc". There are no bonus points for manually compressing code.
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
> +{
> + /*
> + * There are two options for implementing cgroup pressure
> + * notifications:
> + *
> + * - Store pressure counter atomically in the task struct. Upon
> + * hitting 'window' wake up a workqueue that will walk every
> + * task and sum per-thread pressure into cgroup pressure (to
> + * which the task belongs). The cons are obvious: bloats task
> + * struct, have to walk all processes and makes pressue less
> + * accurate (the window becomes per-thread);
> + *
> + * - Store pressure counters in per-cgroup state. This is easy and
> + * straightforward, and that's how we do things here. But this
> + * requires us to not put the vmpressure hooks into hotpath,
> + * since we have to grab some locks.
> + */
> +
> +#ifdef CONFIG_MEMCG
> + if (memcg) {
> + struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
> + struct cgroup *cg = css->cgroup;
> + struct mpc_state *mpc = cg2mpc(cg);
> +
> + if (mpc)
> + __mpc_vmpressure(mpc, s, r);
> + return;
> + }
> +#endif
> + task_lock(current);
> + __mpc_vmpressure(tsk2mpc(current), s, r);
> + task_unlock(current);
> +}
How about cpusets?
I still see no significant mention of them, and I would like to understand
how they come into play in practice.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 8:56 ` Glauber Costa
@ 2013-01-09 9:15 ` Andrew Morton
2013-01-09 13:43 ` Glauber Costa
0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2013-01-09 9:15 UTC (permalink / raw)
To: Glauber Costa
Cc: Anton Vorontsov, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Wed, 9 Jan 2013 12:56:46 +0400 Glauber Costa <glommer@parallels.com> wrote:
> > +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
> > +SUBSYS(mpc_cgroup)
> > +#endif
>
> It might be just me, but if one does not know what this is about, "mpc"
> immediately fetches something communication-related to mind. I would
> suggest changing this to just plain "mempressure_cgroup", or something
> more descriptive.
mempressure_cgroup is rather lengthy. "mpcg" would be good - it's short
and rememberable.
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 9:15 ` Andrew Morton
@ 2013-01-09 13:43 ` Glauber Costa
0 siblings, 0 replies; 33+ messages in thread
From: Glauber Costa @ 2013-01-09 13:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Anton Vorontsov, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On 01/09/2013 01:15 PM, Andrew Morton wrote:
> On Wed, 9 Jan 2013 12:56:46 +0400 Glauber Costa <glommer@parallels.com> wrote:
>
>>> +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
>>> +SUBSYS(mpc_cgroup)
>>> +#endif
>>
>> It might be just me, but if one does not know what this is about, "mpc"
>> immediately fetches something communication-related to mind. I would
>> suggest changing this to just plain "mempressure_cgroup", or something
>> more descriptive.
>
> mempressure_cgroup is rather lengthy. "mpcg" would be good - it's short
> and rememberable.
>
Or, since most of the cgroups don't actually use the suffix "cgroup"
(with the exception of cpu and memcg), maybe just mempressure?
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-08 21:44 ` Andrew Morton
@ 2013-01-09 14:10 ` Glauber Costa
2013-01-09 20:28 ` Andrew Morton
0 siblings, 1 reply; 33+ messages in thread
From: Glauber Costa @ 2013-01-09 14:10 UTC (permalink / raw)
To: Andrew Morton
Cc: Anton Vorontsov, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On 01/09/2013 01:44 AM, Andrew Morton wrote:
> On Fri, 4 Jan 2013 00:29:11 -0800
> Anton Vorontsov <anton.vorontsov@linaro.org> wrote:
>
>> This commit implements David Rientjes' idea of mempressure cgroup.
>>
>> The main characteristics are the same to what I've tried to add to vmevent
>> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
>> pressure index calculation. But we don't expose the index to the userland.
>> Instead, there are three levels of the pressure:
>>
>> o low (just reclaiming, e.g. caches are draining);
>> o medium (allocation cost becomes high, e.g. swapping);
>> o oom (about to oom very soon).
>>
>> The rationale behind exposing levels and not the raw pressure index
>> described here: http://lkml.org/lkml/2012/11/16/675
>>
>> For a task it is possible to be in both cpusets, memcg and mempressure
>> cgroups, so by rearranging the tasks it is possible to watch a specific
>> pressure (i.e. caused by cpuset and/or memcg).
>>
>> Note that while this adds the cgroups support, the code is well separated
>> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
>> But this is another story.
>>
>
> I'd have thought that it's pretty important offer this feature to
> non-cgroups setups. Restricting it to cgroups-only seems a large
> limitation.
>
Why is it so, Andrew?
When we talk about "cgroups", we are not necessarily talking about the
whole beast, with all controllers enabled. Much less are we talking
about hierarchies being created, and tasks being put on them.
It's an interface only. And since all controllers will always have a
special "root" cgroup, this applies to the tasks in the system all the
same. At the end of the day, if we have something like
CONFIG_MEMPRESSURE that selects CONFIG_CGROUP, the user needs to do the
same thing to actually turn on the functionality: switch a config
option. It is not more expensive, and it doesn't bring in anything
extra, either.
To actually use it, one needs to mount the filesystem, and write to a
file. Nothing else.
What is it that drives this opposition towards a cgroup-only interface?
Is it about the interface, or the underlying machinery?
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 14:10 ` Glauber Costa
@ 2013-01-09 20:28 ` Andrew Morton
0 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2013-01-09 20:28 UTC (permalink / raw)
To: Glauber Costa
Cc: Anton Vorontsov, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Wed, 9 Jan 2013 18:10:02 +0400
Glauber Costa <glommer@parallels.com> wrote:
> On 01/09/2013 01:44 AM, Andrew Morton wrote:
> > On Fri, 4 Jan 2013 00:29:11 -0800
> > Anton Vorontsov <anton.vorontsov@linaro.org> wrote:
> >
> >> This commit implements David Rientjes' idea of mempressure cgroup.
> >>
> >> The main characteristics are the same to what I've tried to add to vmevent
> >> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> >> pressure index calculation. But we don't expose the index to the userland.
> >> Instead, there are three levels of the pressure:
> >>
> >> o low (just reclaiming, e.g. caches are draining);
> >> o medium (allocation cost becomes high, e.g. swapping);
> >> o oom (about to oom very soon).
> >>
> >> The rationale behind exposing levels and not the raw pressure index
> >> described here: http://lkml.org/lkml/2012/11/16/675
> >>
> >> For a task it is possible to be in both cpusets, memcg and mempressure
> >> cgroups, so by rearranging the tasks it is possible to watch a specific
> >> pressure (i.e. caused by cpuset and/or memcg).
> >>
> >> Note that while this adds the cgroups support, the code is well separated
> >> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> >> But this is another story.
> >>
> >
> > I'd have thought that it's pretty important offer this feature to
> > non-cgroups setups. Restricting it to cgroups-only seems a large
> > limitation.
> >
>
> Why is it so, Andrew?
>
> When we talk about "cgroups", we are not necessarily talking about the
> whole beast, with all controllers enabled. Much less we are talking
> about hierarchies being created, and tasks put on it.
>
> It's an interface only. And since all controllers will always have a
> special "root" cgroup, this applies to the tasks in the system all the
> same. In the end of the day, if we have something like
> CONFIG_MEMPRESSURE that selects CONFIG_CGROUP, the user needs to do the
> same thing to actually turn on the functionality: switch a config
> option. It is not more expensive, and it doesn't bring in anything extra
> as well.
>
> To actually use it, one needs to mount the filesystem, and write to a
> file. Nothing else.
>
Oh, OK, well if the feature can be used in a system-wide fashion in
this manner then I guess that is sufficient. For some reason I was
thinking it was tied to memcg, doh.
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (4 preceding siblings ...)
2013-01-09 8:56 ` Glauber Costa
@ 2013-01-09 20:37 ` Tejun Heo
2013-01-09 20:39 ` Tejun Heo
2013-01-09 21:20 ` Glauber Costa
2013-01-13 8:50 ` Simon Jeons
` (2 subsequent siblings)
8 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2013-01-09 20:37 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
Hello,
Can you please cc me too when posting further patches? I kinda missed
the whole discussion up to this point.
On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
So, cgroup is headed towards a single hierarchy. Dunno how much it
would affect mempressure, but it probably isn't wise to design with a
focus on multiple hierarchies.
Isn't memory reclaim and oom condition tied to memcgs when memcg is in
use? It seems natural to tie mempressure to memcg. Is there some
reason this should be a separate cgroup? I'm kinda worried this is
headed toward the cpuacct / cpu silliness we have. Glauber, what's
your opinion here?
Thanks.
--
tejun
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 20:37 ` Tejun Heo
@ 2013-01-09 20:39 ` Tejun Heo
2013-01-09 21:20 ` Glauber Costa
1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-01-09 20:39 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Wed, Jan 09, 2013 at 12:37:31PM -0800, Tejun Heo wrote:
> Hello,
>
> Can you please cc me too when posting further patches? I kinda missed
> the whole discussion upto this point.
>
> On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
> > This commit implements David Rientjes' idea of mempressure cgroup.
> >
> > The main characteristics are the same to what I've tried to add to vmevent
> > API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> > pressure index calculation. But we don't expose the index to the userland.
> > Instead, there are three levels of the pressure:
> >
> > o low (just reclaiming, e.g. caches are draining);
> > o medium (allocation cost becomes high, e.g. swapping);
> > o oom (about to oom very soon).
> >
> > The rationale behind exposing levels and not the raw pressure index
> > described here: http://lkml.org/lkml/2012/11/16/675
> >
> > For a task it is possible to be in both cpusets, memcg and mempressure
> > cgroups, so by rearranging the tasks it is possible to watch a specific
> > pressure (i.e. caused by cpuset and/or memcg).
>
> So, cgroup is headed towards single hierarchy. Dunno how much it
> would affect mempressure but it probably isn't wise to design with
> focus on multiple hierarchies.
Also, how are you implementing hierarchical behavior? All controllers
should support hierarchy. Can you please explain how the interface
would work in detail?
Thanks.
--
tejun
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 20:37 ` Tejun Heo
2013-01-09 20:39 ` Tejun Heo
@ 2013-01-09 21:20 ` Glauber Costa
2013-01-09 21:36 ` Anton Vorontsov
1 sibling, 1 reply; 33+ messages in thread
From: Glauber Costa @ 2013-01-09 21:20 UTC (permalink / raw)
To: Tejun Heo
Cc: Anton Vorontsov, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On 01/10/2013 12:37 AM, Tejun Heo wrote:
> Hello,
>
> Can you please cc me too when posting further patches? I kinda missed
> the whole discussion upto this point.
>
> On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
>> This commit implements David Rientjes' idea of mempressure cgroup.
>>
>> The main characteristics are the same to what I've tried to add to vmevent
>> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
>> pressure index calculation. But we don't expose the index to the userland.
>> Instead, there are three levels of the pressure:
>>
>> o low (just reclaiming, e.g. caches are draining);
>> o medium (allocation cost becomes high, e.g. swapping);
>> o oom (about to oom very soon).
>>
>> The rationale behind exposing levels and not the raw pressure index
>> described here: http://lkml.org/lkml/2012/11/16/675
>>
>> For a task it is possible to be in both cpusets, memcg and mempressure
>> cgroups, so by rearranging the tasks it is possible to watch a specific
>> pressure (i.e. caused by cpuset and/or memcg).
>
> So, cgroup is headed towards single hierarchy. Dunno how much it
> would affect mempressure but it probably isn't wise to design with
> focus on multiple hierarchies.
>
> Isn't memory reclaim and oom condition tied to memcgs when memcg is in
> use? It seems natural to tie mempressure to memcg. Is there some
> reason this should be a separate cgroup. I'm kinda worried this is
> headed cpuacct / cpu silliness we have. Glauber, what's your opinion
> here?
>
I've already said this in a previous incarnation of this thread. But
I'll summarize my main points:
* I believe this mechanism is superior to memcg notification mechanism.
* I believe the memcg notification mechanism is quite coarse - we actually
define the thresholds prior to flushing the stock, which means we can be
wrong by as much as 32 * ncpus pages.
* Agreeing with you that most of the data will come from memcg, I just
think this should all be part of memcg.
* memcg is indeed expensive even when it is not being used, so global
users would like to avoid it. This is true, but I've already
demonstrated that it is an implementation problem rather than a
conceptual problem, and it can be fixed - although I have not yet had
the time to go back to it (but now I have a lot less on my shoulders
than before)
Given the above, I believe that ideally we should use this pressure
mechanism in memcg replacing the current memcg notification mechanism.
More or less like timer expiration happens: you could still write
numbers for compatibility, but those numbers would be internally mapped
into the levels Anton is proposing, that makes *way* more sense.
If that is not possible, they should coexist as "notification" and a
"pressure" mechanism inside memcg.
The main argument against it centered around cpusets also being able to
participate in the play. I haven't yet understood how that would take
place. In particular, I saw no mention of cpusets in the patches.
I will say again that I fully know memcg is expensive. We all do.
However, it only matters to the global case. For the child cgroup case,
you are *already* paying this anyway. And for the global case, we should
not use the costs of it as an excuse: we should fix it, or otherwise
prove that it is unfixable.
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 21:20 ` Glauber Costa
@ 2013-01-09 21:36 ` Anton Vorontsov
2013-01-09 21:55 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-09 21:36 UTC (permalink / raw)
To: Glauber Costa
Cc: Tejun Heo, David Rientjes, Pekka Enberg, Mel Gorman, Michal Hocko,
Kirill A. Shutemov, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Thu, Jan 10, 2013 at 01:20:30AM +0400, Glauber Costa wrote:
[...]
> Given the above, I believe that ideally we should use this pressure
> mechanism in memcg replacing the current memcg notification mechanism.
Just a quick wonder: why would we need to place it into memcg, when we
don't need any of the memcg stuff for it? I see no benefits, not
design-wise, not implementation-wise or anything-wise. :)
We can use mempressure w/o memcg, and even then it can (or should :) be
useful (for cpuset, for example).
> More or less like timer expiration happens: you could still write
> numbers for compatibility, but those numbers would be internally mapped
> into the levels Anton is proposing, that makes *way* more sense.
>
> If that is not possible, they should coexist as "notification" and a
> "pressure" mechanism inside memcg.
>
> The main argument against it centered around cpusets also being able to
> participate in the play. I haven't yet understood how would it take
> place. In particular, I saw no mention to cpusets in the patches.
I didn't test it, but as I see it, once a process is in a specific cpuset,
the task can only use the allowed zones for reclaim/alloc, i.e. there are
various checks like this in vmscan:
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
So, vmscan simply won't call vmpressure() if the zone is not allowed (and
so we won't account the pressure from that zone).
Thanks,
Anton
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 21:36 ` Anton Vorontsov
@ 2013-01-09 21:55 ` Tejun Heo
2013-01-09 22:04 ` Tejun Heo
2013-01-09 22:06 ` Anton Vorontsov
0 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2013-01-09 21:55 UTC (permalink / raw)
To: Anton Vorontsov
Cc: Glauber Costa, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team, KAMEZAWA Hiroyuki
Hello, Anton.
On Wed, Jan 09, 2013 at 01:36:04PM -0800, Anton Vorontsov wrote:
> On Thu, Jan 10, 2013 at 01:20:30AM +0400, Glauber Costa wrote:
> [...]
> > Given the above, I believe that ideally we should use this pressure
> > mechanism in memcg replacing the current memcg notification mechanism.
>
> Just a quick wonder: why would we need to place it into memcg, when we
> don't need any of the memcg stuff for it? I see no benefits, not
> design-wise, not implementation-wise or anything-wise. :)
Maybe I'm misunderstanding the whole thing but how can memory pressure
exist apart from memcg when memcg is in use? Memory limits, reclaim
and OOM are all per-memcg, how do you even define memory pressure? If
ten tasks belong to a memcg w/ a lot of spare memory and one belongs
to another which is about to hit OOM, is that mempressure cgroup under
pressure?
> We can use mempressure w/o memcg, and even then it can (or should :) be
> useful (for cpuset, for example).
The problem is that you end up with, at the very least, duplicate
hierarchical accounting mechanisms which overlap with each other
while, most likely, being slightly different. About the same thing
happened with cpu and cpuacct controllers and we're now trying to
deprecate the latter.
Please talk with memcg people and fold it into memcg. It can (and
should) be done in a way to not incur overhead when only root memcg is
in use, and how this is done defines the userland-visible interface, so
let's please not repeat past mistakes.
Thanks.
--
tejun
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 21:55 ` Tejun Heo
@ 2013-01-09 22:04 ` Tejun Heo
2013-01-09 22:06 ` Anton Vorontsov
1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-01-09 22:04 UTC (permalink / raw)
To: Anton Vorontsov
Cc: Glauber Costa, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team, KAMEZAWA Hiroyuki,
Johannes Weiner, Li Zefan, cgroups
On Wed, Jan 09, 2013 at 01:55:14PM -0800, Tejun Heo wrote:
> Please talk with memcg people and fold it into memcg. It can (and
> should) be done in a way to not incur overhead when only root memcg is
> in use and how this is done defines userland-visible interface, so
> let's please not repeat past mistakes.
CC'ing KAMEZAWA, Johannes, Li and cgroup mailing list. Please keep
them cc'd for further discussion.
Thanks.
--
tejun
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 21:55 ` Tejun Heo
2013-01-09 22:04 ` Tejun Heo
@ 2013-01-09 22:06 ` Anton Vorontsov
2013-01-09 22:21 ` Tejun Heo
2013-01-10 7:18 ` Glauber Costa
1 sibling, 2 replies; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-09 22:06 UTC (permalink / raw)
To: Tejun Heo
Cc: Glauber Costa, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team, KAMEZAWA Hiroyuki
On Wed, Jan 09, 2013 at 01:55:14PM -0800, Tejun Heo wrote:
[...]
> > We can use mempressure w/o memcg, and even then it can (or should :) be
> > useful (for cpuset, for example).
>
> The problem is that you end with, at the very least, duplicate
> hierarchical accounting mechanisms which overlap with each other
> while, most likely, being slightly different. About the same thing
> happened with cpu and cpuacct controllers and we're now trying to
> deprecate the latter.
Yeah. I started answering your comments about hierarchical accounting,
looked into the memcg code, and realized that *this* is where I need the
memcg stuff. :)
Thus yes, I guess I'll have to integrate it with memcg, or sort of.
I will surely Cc you on the next iterations.
Thanks,
Anton
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-08 8:49 ` Minchan Kim
@ 2013-01-09 22:14 ` Anton Vorontsov
2013-01-11 5:12 ` Minchan Kim
0 siblings, 1 reply; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-09 22:14 UTC (permalink / raw)
To: Minchan Kim
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Tue, Jan 08, 2013 at 05:49:49PM +0900, Minchan Kim wrote:
[...]
> Sorry, I still haven't looked at the cgroup part of your implementation,
> but I have had a question for a long time.
>
> How can we make sure there are no false positives regarding zones and NUMA?
> I mean, the DMA zone is short in the system, so the VM notifies the user,
> and the user frees memory that is all in the NORMAL zone, because he
> can't know which zones his pages live in. NUMA is ditto.
Um, we count scans irrespective of zones or nodes, i.e. we sum all 'number
of scanned' and 'number of reclaimed' stats. So, it should not be a
problem, as I see it.
Thanks,
Anton
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 22:06 ` Anton Vorontsov
@ 2013-01-09 22:21 ` Tejun Heo
2013-01-10 7:18 ` Glauber Costa
1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2013-01-09 22:21 UTC (permalink / raw)
To: Anton Vorontsov
Cc: Glauber Costa, David Rientjes, Pekka Enberg, Mel Gorman,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team, KAMEZAWA Hiroyuki
Hello, Anton.
On Wed, Jan 09, 2013 at 02:06:41PM -0800, Anton Vorontsov wrote:
> Yeah. I started answering your comments about hierarchical accounting,
> looked into the memcg code, and realized that *this* is where I need the
> memcg stuff. :)
Yay, I wasn't completely clueless.
> Thus yes, I guess I'll have to integrate it with memcg, or sort of.
I really don't know much about memcg internals, but I guess the
implementation can be split into two pieces. memcg already has its
own accounting and pressure mechanism, so it should be possible to bolt
the mempressure interface on top of the already existing data. You can
improve / bring some sanity :) to memcg if the proposed mempressure
implementation is better.
Thanks.
--
tejun
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 22:06 ` Anton Vorontsov
2013-01-09 22:21 ` Tejun Heo
@ 2013-01-10 7:18 ` Glauber Costa
1 sibling, 0 replies; 33+ messages in thread
From: Glauber Costa @ 2013-01-10 7:18 UTC (permalink / raw)
To: Anton Vorontsov
Cc: Tejun Heo, David Rientjes, Pekka Enberg, Mel Gorman, Michal Hocko,
Kirill A. Shutemov, Luiz Capitulino, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team, KAMEZAWA Hiroyuki
On 01/10/2013 02:06 AM, Anton Vorontsov wrote:
> On Wed, Jan 09, 2013 at 01:55:14PM -0800, Tejun Heo wrote:
> [...]
>>> We can use mempressure w/o memcg, and even then it can (or should :) be
>>> useful (for cpuset, for example).
>>
>> The problem is that you end with, at the very least, duplicate
>> hierarchical accounting mechanisms which overlap with each other
>> while, most likely, being slightly different. About the same thing
>> happened with cpu and cpuacct controllers and we're now trying to
>> deprecate the latter.
>
> Yeah. I started answering your comments about hierarchical accounting,
> looked into the memcg code, and realized that *this* is where I need the
> memcg stuff. :)
>
> Thus yes, I guess I'll have to integrate it with memcg, or sort of.
>
That has been my point since the beginning. To generate per-memcg
pressure, you need memcg anyway. So you would have to have two different
and orthogonal mechanisms, and therefore, double accounting.
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-09 22:14 ` Anton Vorontsov
@ 2013-01-11 5:12 ` Minchan Kim
2013-01-11 5:38 ` Anton Vorontsov
0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2013-01-11 5:12 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Wed, Jan 09, 2013 at 02:14:49PM -0800, Anton Vorontsov wrote:
> On Tue, Jan 08, 2013 at 05:49:49PM +0900, Minchan Kim wrote:
> [...]
> > Sorry, I still haven't looked at the cgroup part of your implementation,
> > but I have had a question for a long time.
> >
> > How can we make sure there are no false positives regarding zones and NUMA?
> > I mean, the DMA zone is short in the system, so the VM notifies the user,
> > and the user frees memory that is all in the NORMAL zone, because he
> > can't know which zones his pages live in. NUMA is ditto.
>
> Um, we count scans irrespective of zones or nodes, i.e. we sum all 'number
> of scanned' and 'number of reclaimed' stats. So, it should not be a
> problem, as I see it.
Why is it no problem? For example, let's think of normal zone reclaim.
The page allocator tries to allocate pages, falling back from the NORMAL
zone to the DMA zone, and your logic could trigger mpc_shrinker. So
processes A, B and C start to release their freeable memory but,
unfortunately, the freed pages are all HIGHMEM pages. Why should
processes release memory unnecessarily? Is there any method for a process
to detect such a situation in user level before releasing the freeable
memory?
On Android smartphones, until now, there was only one zone (DMA), so the
low memory killer didn't have a problem; but these days smartphones use
2G of DRAM, so we have started seeing the above problem. Your generic
approach should solve that problem, too.
>
> Thanks,
> Anton
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-11 5:12 ` Minchan Kim
@ 2013-01-11 5:38 ` Anton Vorontsov
2013-01-11 5:56 ` Minchan Kim
0 siblings, 1 reply; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-11 5:38 UTC (permalink / raw)
To: Minchan Kim
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Fri, Jan 11, 2013 at 02:12:10PM +0900, Minchan Kim wrote:
> On Wed, Jan 09, 2013 at 02:14:49PM -0800, Anton Vorontsov wrote:
> > On Tue, Jan 08, 2013 at 05:49:49PM +0900, Minchan Kim wrote:
> > [...]
> > > Sorry, I still haven't looked at the cgroup part of your implementation,
> > > but I have had a question for a long time.
> > >
> > > How can we make sure there are no false positives regarding zones and NUMA?
> > > I mean, the DMA zone is short in the system, so the VM notifies the user,
> > > and the user frees memory that is all in the NORMAL zone, because he
> > > can't know which zones his pages live in. NUMA is ditto.
> >
> > Um, we count scans irrespective of zones or nodes, i.e. we sum all 'number
> > of scanned' and 'number of reclaimed' stats. So, it should not be a
> > problem, as I see it.
>
> Why is it no problem? For example, let's think of normal zone reclaim.
> The page allocator tries to allocate pages, falling back from the NORMAL
> zone to the DMA zone, and your logic could trigger mpc_shrinker. So
> processes A, B and C start to release their freeable memory but,
> unfortunately, the freed pages are all HIGHMEM pages. Why should
> processes release memory unnecessarily? Is there any method for a process
> to detect such a situation in user level before releasing the freeable
> memory?
Ahh. You're talking about the shrinker interface. Yes, there is no way to
tell if the freed memory will actually be "released" (and if not, then
yes, we released it unnecessarily).
But that's not a problem with NUMA or zones only. Shared pages are in the
same boat, right? An app might free some memory, but as another process
might still be using it, we don't know whether our action helps or not.
The situation is a little bit easier for the in-kernel shrinkers, since we
have more control over pages; but still, even for the kernel shrinkers, we
don't provide all the information (only the gfp mask, which, judging from
a random user I just looked into, drivers/gpu/drm/ttm, is sometimes not
even used).
So, answering your question: no, I don't know how to solve it for the
userland. But I also don't think it's a big concern (especially if we make
it cgroup-aware -- this would be cgroup's worry then, i.e. we might
isolate task to only some nodes/zones, if we really care about precise
accounting?). But I'm surely open for ideas. :)
Thanks!
Anton
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-11 5:38 ` Anton Vorontsov
@ 2013-01-11 5:56 ` Minchan Kim
2013-01-11 6:09 ` Anton Vorontsov
0 siblings, 1 reply; 33+ messages in thread
From: Minchan Kim @ 2013-01-11 5:56 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Thu, Jan 10, 2013 at 09:38:31PM -0800, Anton Vorontsov wrote:
> On Fri, Jan 11, 2013 at 02:12:10PM +0900, Minchan Kim wrote:
> > On Wed, Jan 09, 2013 at 02:14:49PM -0800, Anton Vorontsov wrote:
> > > On Tue, Jan 08, 2013 at 05:49:49PM +0900, Minchan Kim wrote:
> > > [...]
> > > > Sorry, I still haven't looked at the cgroup part of your implementation,
> > > > but I have had a question for a long time.
> > > >
> > > > How can we make sure there are no false positives regarding zones and NUMA?
> > > > I mean, the DMA zone is short in the system, so the VM notifies the user,
> > > > and the user frees memory that is all in the NORMAL zone, because he
> > > > can't know which zones his pages live in. NUMA is ditto.
> > >
> > > Um, we count scans irrespective of zones or nodes, i.e. we sum all 'number
> > > of scanned' and 'number of reclaimed' stats. So, it should not be a
> > > problem, as I see it.
> >
> > Why is it no problem? For example, let's think of normal zone reclaim.
> > The page allocator tries to allocate pages, falling back from the NORMAL
> > zone to the DMA zone, and your logic could trigger mpc_shrinker. So
> > processes A, B and C start to release their freeable memory but,
> > unfortunately, the freed pages are all HIGHMEM pages. Why should
> > processes release memory unnecessarily? Is there any method for a process
> > to detect such a situation in user level before releasing the freeable
> > memory?
>
> Ahh. You're talking about the shrinker interface. Yes, there is no way to
> tell if the freed memory will actually be "released" (and if not, then
> yes, we released it unnecessarily).
I'm not talking about whether the memory is actually "released" or not.
I assume the application actually releases pages, but the pages could be
in other zones, NOT the zone targeted by the kernel. In that case, the
kernel could keep asking until the target zone has enough free memory.
>
> But that's not a problem with NUMA or zones only. Shared pages are in the
> same boat, right? An app might free some memory, but as another process
> might still be using it, we don't know whether our action helps or not.
That's not what I meant.
>
> The situation is a little bit easier for the in-kernel shrinkers, since we
> have more control over pages; but still, even for the kernel shrinkers, we
> don't provide all the information (only the gfp mask, which, judging from
> a random user I just looked into, drivers/gpu/drm/ttm, is sometimes not
> even used).
>
> So, answering your question: no, I don't know how to solve it for the
> userland. But I also don't think it's a big concern (especially if we make
> it cgroup-aware -- this would be cgroup's worry then, i.e. we might
> isolate task to only some nodes/zones, if we really care about precise
> accounting?). But I'm surely open for ideas. :)
My dumb idea is to notify the user only when reclaim is triggered by
__GFP_HIGHMEM|__GFP_MOVABLE, which is the most common gfp_t for
application memory. :)
>
> Thanks!
>
> Anton
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-11 5:56 ` Minchan Kim
@ 2013-01-11 6:09 ` Anton Vorontsov
0 siblings, 0 replies; 33+ messages in thread
From: Anton Vorontsov @ 2013-01-11 6:09 UTC (permalink / raw)
To: Minchan Kim
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Fri, Jan 11, 2013 at 02:56:15PM +0900, Minchan Kim wrote:
[...]
> > Ahh. You're talking about the shrinker interface. Yes, there is no way to
> > tell if the freed memory will actually be "released" (and if not, then
> > yes, we released it unnecessarily).
>
> I'm not talking about whether the memory is actually "released" or not.
> I assume the application actually releases pages, but the pages could be
> in other zones, NOT the zone targeted by the kernel. In that case, the
> kernel could keep asking until the target zone has enough free memory.
[...]
> > isolate task to only some nodes/zones, if we really care about precise
> > accounting?). But I'm surely open for ideas. :)
>
> My dumb idea is to notify the user only when reclaim is triggered by
> __GFP_HIGHMEM|__GFP_MOVABLE, which is the most common gfp_t for
> application memory. :)
Ah, I see. Sure, that will help a lot. I'll try to incorporate this into
the next iteration. But there are still unresolved accounting issues that
I outlined, and I don't think that they are this easy to solve. :)
Thanks!
Anton
* Re: [PATCH 0/2] Mempressure cgroup
2013-01-04 8:27 [PATCH 0/2] Mempressure cgroup Anton Vorontsov
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
2013-01-04 8:29 ` [PATCH 2/2] Add shrinker interface for " Anton Vorontsov
@ 2013-01-11 19:13 ` Luiz Capitulino
2 siblings, 0 replies; 33+ messages in thread
From: Luiz Capitulino @ 2013-01-11 19:13 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Andrew Morton, Greg Thelen,
Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Fri, 4 Jan 2013 00:27:52 -0800
Anton Vorontsov <anton.vorontsov@linaro.org> wrote:
> - I've split the pach into two: 'shrinker' and 'levels' parts. While the
> full-fledged userland shrinker is an interesting idea, we don't have any
> users ready for it, so I won't advocate for it too much.
For the next version of the automatic balloon prototype I'm planning to give
the user-space shrinker a try. It seems to be a better fit, as the current
prototype has to guess by how much a guest's balloon should be inflated.
Also, I think it would be worth it to list possible use-cases for the two
functionalities in the series' intro email. This might help in choosing
both, or just one or the other.
Looking forward to the next version :)
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (5 preceding siblings ...)
2013-01-09 20:37 ` Tejun Heo
@ 2013-01-13 8:50 ` Simon Jeons
2013-01-13 8:52 ` Wanpeng Li
2013-01-13 8:52 ` Wanpeng Li
8 siblings, 0 replies; 33+ messages in thread
From: Simon Jeons @ 2013-01-13 8:50 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A. Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
On Fri, 2013-01-04 at 00:29 -0800, Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same as what I've tried to add to the
> vmevent API; internally, it uses Mel Gorman's idea of a scanned/reclaimed
> ratio for pressure index calculation. But we don't expose the index to
> the userland.
> Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index is
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in cpuset, memcg and mempressure cgroups
> at the same time, so by rearranging the tasks it is possible to watch a
> specific pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>
> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
> ---
> Documentation/cgroups/mempressure.txt | 50 ++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/vmstat.h | 11 ++
> init/Kconfig | 12 ++
> mm/Makefile | 1 +
> mm/mempressure.c | 330 ++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 4 +
> 7 files changed, 414 insertions(+)
> create mode 100644 Documentation/cgroups/mempressure.txt
> create mode 100644 mm/mempressure.c
>
> diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
> new file mode 100644
> index 0000000..dbc0aca
> --- /dev/null
> +++ b/Documentation/cgroups/mempressure.txt
> @@ -0,0 +1,50 @@
> + Memory pressure cgroup
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> + Before using the mempressure cgroup, make sure you have it mounted:
> +
> + # cd /sys/fs/cgroup/
> + # mkdir mempressure
> + # mount -t cgroup cgroup ./mempressure -o mempressure
> +
> + It is possible to combine cgroups, for example you can mount memory
> + (memcg) and mempressure cgroups together:
> +
> + # mount -t cgroup cgroup ./mempressure -o memory,mempressure
> +
> + That way the reported pressure will honour memory cgroup limits. The
> + same goes for cpusets.
> +
> + After the hierarchy is mounted, you can use the following API:
> +
> + /sys/fs/cgroup/.../mempressure.level
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> + To maintain interactivity and keep the memory allocation cost low, one
> + can use the pressure level notifications; the levels are defined like
> + this:
> +
> + The "low" level means that the system is reclaiming memory for new
> + allocations. Monitoring reclaiming activity might be useful for
> + maintaining the overall system's cache level. Upon notification, the program
> + (typically "Activity Manager") might analyze vmstat and act in advance
> + (i.e. prematurely shut down unimportant services).
> +
> + The "medium" level means that the system is experiencing medium memory
> + pressure, there is some mild swapping activity. Upon this event
> + applications may decide to free any resources that can be easily
> + reconstructed or re-read from a disk.
> +
> + The "oom" level means that the system is actively thrashing, it is about
> + to run out of memory (OOM), or even the in-kernel OOM killer is on its
> + way to trigger. Applications should do whatever they can to help the
> + system.
> +
> + Event control:
> + Used to set up an eventfd with a level threshold. The argument to
> + the event control specifies the level threshold.
> + Read:
> + Reads memory pressure levels: low, medium or oom.
> + Write:
> + Not implemented.
> + Test:
> + To set up a notification:
> +
> + # cgroup_event_listener ./mempressure.level low
> + ("low", "medium", "oom" are permitted.)
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index f204a7a..b9802e2 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
> +SUBSYS(mpc_cgroup)
> +#endif
> +
> +/* */
> +
> #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
> SUBSYS(devices)
> #endif
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index a13291f..c1a66c7 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -10,6 +10,17 @@
>
> extern int sysctl_stat_interval;
>
> +struct mem_cgroup;
> +#ifdef CONFIG_CGROUP_MEMPRESSURE
> +extern void vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed);
> +extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
> +#else
> +static inline void vmpressure(struct mem_cgroup *memcg,
> + ulong scanned, ulong reclaimed) {}
> +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
> +#endif
> +
> #ifdef CONFIG_VM_EVENT_COUNTERS
> /*
> * Light weight per cpu counter implementation.
> diff --git a/init/Kconfig b/init/Kconfig
> index 7d30240..d526249 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -891,6 +891,18 @@ config MEMCG_KMEM
> the kmem extension can use it to guarantee that no group of processes
> will ever exhaust kernel resources alone.
>
> +config CGROUP_MEMPRESSURE
> + bool "Memory pressure monitor for Control Groups"
> + help
> + The memory pressure monitor cgroup provides a facility for
> + userland programs so that they can easily assist the kernel
> + with memory management. So far the API provides simple,
> + levels-based memory pressure notifications.
> +
> + For more information see Documentation/cgroups/mempressure.txt
> +
> + If unsure, say N.
> +
> config CGROUP_HUGETLB
> bool "HugeTLB Resource Controller for Control Groups"
> depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
> diff --git a/mm/Makefile b/mm/Makefile
> index 3a46287..e69bbda 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> diff --git a/mm/mempressure.c b/mm/mempressure.c
> new file mode 100644
> index 0000000..ea312bb
> --- /dev/null
> +++ b/mm/mempressure.c
> @@ -0,0 +1,330 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + * Anton Vorontsov <anton.vorontsov@linaro.org>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/oom levels. Using small window sizes can cause
> + * a lot of false positives, but too big a window size will delay the
> + * notifications.
> + */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> + VMPRESSURE_LOW = 0,
> + VMPRESSURE_MEDIUM,
> + VMPRESSURE_OOM,
> + VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char *vmpressure_str_levels[] = {
> + [VMPRESSURE_LOW] = "low",
> + [VMPRESSURE_MEDIUM] = "medium",
> + [VMPRESSURE_OOM] = "oom",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(uint pressure)
> +{
> + if (pressure >= vmpressure_level_oom)
> + return VMPRESSURE_OOM;
> + else if (pressure >= vmpressure_level_med)
> + return VMPRESSURE_MEDIUM;
> + return VMPRESSURE_LOW;
> +}
> +
> +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> +{
> + ulong p;
> +
> + if (!s)
> + return 0;
> +
> + /*
> + * We calculate the ratio (in percent) of how many pages were
> + * scanned vs. reclaimed in a given time frame (window). Note that
> + * time is in VM reclaimer's "ticks", i.e. number of pages
> + * scanned. This makes it possible to set desired reaction time
> + * and serves as a ratelimit.
> + */
> + p = win - (r * win / s);
> + p = p * 100 / win;
> +
> + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> +
> + return vmpressure_level(p);
> +}
> +
> +void vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
> +{
> + if (!scanned)
> + return;
> + mpc_vmpressure(memcg, scanned, reclaimed);
> +}
> +
> +void vmpressure_prio(struct mem_cgroup *memcg, int prio)
> +{
> + if (prio > vmpressure_level_oom_prio)
> + return;
> +
> + /* OK, the prio is below the threshold, send the pre-OOM event. */
> + vmpressure(memcg, vmpressure_win, 0);
> +}
> +
> +/*
> + * Memory pressure cgroup code
> + */
> +
> +struct mpc_event {
> + struct eventfd_ctx *efd;
> + enum vmpressure_levels level;
> + struct list_head node;
> +};
> +
> +struct mpc_state {
> + struct cgroup_subsys_state css;
> +
> + uint scanned;
> + uint reclaimed;
> + struct mutex sr_lock;
> +
> + struct list_head events;
> + struct mutex events_lock;
> +
> + struct work_struct work;
> +};
> +
> +static struct mpc_state *wk2mpc(struct work_struct *wk)
> +{
> + return container_of(wk, struct mpc_state, work);
> +}
> +
> +static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
> +{
> + return container_of(css, struct mpc_state, css);
> +}
> +
> +static struct mpc_state *tsk2mpc(struct task_struct *tsk)
> +{
> + return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
> +}
> +
> +static struct mpc_state *cg2mpc(struct cgroup *cg)
> +{
> + return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
> +}
> +
> +static void mpc_event(struct mpc_state *mpc, ulong s, ulong r)
> +{
> + struct mpc_event *ev;
> + int level = vmpressure_calc_level(vmpressure_win, s, r);
> +
> + mutex_lock(&mpc->events_lock);
> +
> + list_for_each_entry(ev, &mpc->events, node) {
> + if (level >= ev->level)
> + eventfd_signal(ev->efd, 1);
> + }
> +
> + mutex_unlock(&mpc->events_lock);
> +}
> +
> +static void mpc_vmpressure_wk_fn(struct work_struct *wk)
> +{
> + struct mpc_state *mpc = wk2mpc(wk);
> + ulong s;
> + ulong r;
> +
> + mutex_lock(&mpc->sr_lock);
> + s = mpc->scanned;
> + r = mpc->reclaimed;
> + mpc->scanned = 0;
> + mpc->reclaimed = 0;
> + mutex_unlock(&mpc->sr_lock);
> +
> + mpc_event(mpc, s, r);
> +}
> +
> +static void __mpc_vmpressure(struct mpc_state *mpc, ulong s, ulong r)
> +{
> + mutex_lock(&mpc->sr_lock);
> + mpc->scanned += s;
> + mpc->reclaimed += r;
> + mutex_unlock(&mpc->sr_lock);
> +
> + if (s < vmpressure_win || work_pending(&mpc->work))
> + return;
> +
> + schedule_work(&mpc->work);
> +}
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
> +{
> + /*
> + * There are two options for implementing cgroup pressure
> + * notifications:
> + *
> + * - Store pressure counter atomically in the task struct. Upon
> + * hitting 'window' wake up a workqueue that will walk every
> + * task and sum per-thread pressure into cgroup pressure (to
> + * which the task belongs). The cons are obvious: it bloats the
> + * task struct, we have to walk all processes, and it makes pressure
> + * less accurate (the window becomes per-thread);
> + *
> + * - Store pressure counters in per-cgroup state. This is easy and
> + * straightforward, and that's how we do things here. But this
> + * requires us to not put the vmpressure hooks into hotpath,
> + * since we have to grab some locks.
> + */
> +
> +#ifdef CONFIG_MEMCG
> + if (memcg) {
> + struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
> + struct cgroup *cg = css->cgroup;
> + struct mpc_state *mpc = cg2mpc(cg);
> +
> + if (mpc)
> + __mpc_vmpressure(mpc, s, r);
> + return;
> + }
> +#endif
> + task_lock(current);
> + __mpc_vmpressure(tsk2mpc(current), s, r);
> + task_unlock(current);
> +}
> +
> +static struct cgroup_subsys_state *mpc_css_alloc(struct cgroup *cg)
> +{
> + struct mpc_state *mpc;
> +
> + mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
> + if (!mpc)
> + return ERR_PTR(-ENOMEM);
> +
> + mutex_init(&mpc->sr_lock);
> + mutex_init(&mpc->events_lock);
> + INIT_LIST_HEAD(&mpc->events);
> + INIT_WORK(&mpc->work, mpc_vmpressure_wk_fn);
> +
> + return &mpc->css;
> +}
> +
> +static void mpc_css_free(struct cgroup *cg)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> +
> + kfree(mpc);
> +}
> +
> +static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
> + struct file *file, char __user *buf,
> + size_t sz, loff_t *ppos)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> + uint level;
> + const char *str;
> +
> + mutex_lock(&mpc->sr_lock);
> +
> + level = vmpressure_calc_level(vmpressure_win,
> + mpc->scanned, mpc->reclaimed);
> +
> + mutex_unlock(&mpc->sr_lock);
> +
> + str = vmpressure_str_levels[level];
> + return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
> +}
> +
> +static int mpc_register_level(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd, const char *args)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> + struct mpc_event *ev;
> + int lvl;
> +
> + for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> + if (!strcmp(vmpressure_str_levels[lvl], args))
> + break;
> + }
> +
> + if (lvl >= VMPRESSURE_NUM_LEVELS)
> + return -EINVAL;
> +
> + ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> + if (!ev)
> + return -ENOMEM;
> +
> + ev->efd = eventfd;
> + ev->level = lvl;
> +
> + mutex_lock(&mpc->events_lock);
> + list_add(&ev->node, &mpc->events);
> + mutex_unlock(&mpc->events_lock);
> +
> + return 0;
> +}
> +
> +static void mpc_unregister_level(struct cgroup *cg, struct cftype *cft,
> + struct eventfd_ctx *eventfd)
> +{
> + struct mpc_state *mpc = cg2mpc(cg);
> + struct mpc_event *ev;
> +
> + mutex_lock(&mpc->events_lock);
> + list_for_each_entry(ev, &mpc->events, node) {
> + if (ev->efd != eventfd)
> + continue;
> + list_del(&ev->node);
> + kfree(ev);
> + break;
> + }
> + mutex_unlock(&mpc->events_lock);
> +}
> +
> +static struct cftype mpc_files[] = {
> + {
> + .name = "level",
> + .read = mpc_read_level,
> + .register_event = mpc_register_level,
> + .unregister_event = mpc_unregister_level,
> + },
> + {},
> +};
> +
> +struct cgroup_subsys mpc_cgroup_subsys = {
> + .name = "mempressure",
> + .subsys_id = mpc_cgroup_subsys_id,
> + .css_alloc = mpc_css_alloc,
> + .css_free = mpc_css_free,
> + .base_cftypes = mpc_files,
> +};
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 16b42af..fed0e04 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1900,6 +1900,9 @@ restart:
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
>
> + vmpressure(sc->target_mem_cgroup,
> + sc->nr_scanned - nr_scanned, nr_reclaimed);
> +
> /* reclaim/compaction might need reclaim to continue */
> if (should_continue_reclaim(lruvec, nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc))
> @@ -2122,6 +2125,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> count_vm_event(ALLOCSTALL);
>
> do {
> + vmpressure_prio(sc->target_mem_cgroup, sc->priority);
Why is the function vmpressure_prio needed? It seems redundant.
> sc->nr_scanned = 0;
> aborted_reclaim = shrink_zones(zonelist, sc);
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (7 preceding siblings ...)
2013-01-13 8:52 ` Wanpeng Li
@ 2013-01-13 8:52 ` Wanpeng Li
8 siblings, 0 replies; 33+ messages in thread
From: Wanpeng Li @ 2013-01-13 8:52 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A.Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
Hi Anton,
On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
>This commit implements David Rientjes' idea of mempressure cgroup.
>
>The main characteristics are the same as what I've tried to add to vmevent
>API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
>pressure index calculation. But we don't expose the index to the userland.
>Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
>The rationale behind exposing levels and not the raw pressure index
>is described here: http://lkml.org/lkml/2012/11/16/675
>
>For a task it is possible to be in cpusets, memcg and mempressure
>cgroups, so by rearranging the tasks it is possible to watch a specific
>pressure (i.e. caused by cpuset and/or memcg).
>
>Note that while this adds the cgroups support, the code is well separated
>and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
>But this is another story.
>
>Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
>---
> Documentation/cgroups/mempressure.txt | 50 ++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/vmstat.h | 11 ++
> init/Kconfig | 12 ++
> mm/Makefile | 1 +
> mm/mempressure.c | 330 ++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 4 +
> 7 files changed, 414 insertions(+)
> create mode 100644 Documentation/cgroups/mempressure.txt
> create mode 100644 mm/mempressure.c
>
>diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
>new file mode 100644
>index 0000000..dbc0aca
>--- /dev/null
>+++ b/Documentation/cgroups/mempressure.txt
>@@ -0,0 +1,50 @@
>+ Memory pressure cgroup
>+~~~~~~~~~~~~~~~~~~~~~~~~~~
>+ Before using the mempressure cgroup, make sure you have it mounted:
>+
>+ # cd /sys/fs/cgroup/
>+ # mkdir mempressure
>+ # mount -t cgroup cgroup ./mempressure -o mempressure
>+
>+ It is possible to combine cgroups, for example you can mount memory
>+ (memcg) and mempressure cgroups together:
>+
>+ # mount -t cgroup cgroup ./mempressure -o memory,mempressure
>+
>+ That way the reported pressure will honour memory cgroup limits. The
>+ same goes for cpusets.
>+
>+ After the hierarchy is mounted, you can use the following API:
>+
>+ /sys/fs/cgroup/.../mempressure.level
>+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>+ To balance interactivity against memory allocation cost, one can use
>+ the pressure level notifications. The levels are defined like this:
>+
>+ The "low" level means that the system is reclaiming memory for new
>+ allocations. Monitoring reclaim activity might be useful for
>+ maintaining the system's overall cache level. Upon notification, the
>+ program (typically an "Activity Manager") might analyze vmstat and act
>+ in advance (e.g. prematurely shut down unimportant services).
>+
>+ The "medium" level means that the system is experiencing medium memory
>+ pressure; there is some mild swapping activity. Upon this event,
>+ applications may decide to free any resources that can be easily
>+ reconstructed or re-read from a disk.
>+
>+ The "oom" level means that the system is actively thrashing, it is
>+ about to run out of memory (OOM), and the in-kernel OOM killer may be
>+ about to trigger. Applications should do whatever they can to help the
>+ system.
>+
>+ Event control:
>+ Used to set up an eventfd with a level threshold. The argument to
>+ the event control specifies the level threshold.
>+ Read:
>+ Reads memory pressure levels: low, medium or oom.
>+ Write:
>+ Not implemented.
>+ Test:
>+ To set up a notification:
>+
>+ # cgroup_event_listener ./mempressure.level low
>+ ("low", "medium", "oom" are permitted.)
>diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>index f204a7a..b9802e2 100644
>--- a/include/linux/cgroup_subsys.h
>+++ b/include/linux/cgroup_subsys.h
>@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
>+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
>+SUBSYS(mpc_cgroup)
>+#endif
>+
>+/* */
>+
> #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
> SUBSYS(devices)
> #endif
>diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>index a13291f..c1a66c7 100644
>--- a/include/linux/vmstat.h
>+++ b/include/linux/vmstat.h
>@@ -10,6 +10,17 @@
>
> extern int sysctl_stat_interval;
>
>+struct mem_cgroup;
>+#ifdef CONFIG_CGROUP_MEMPRESSURE
>+extern void vmpressure(struct mem_cgroup *memcg,
>+ ulong scanned, ulong reclaimed);
>+extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
>+#else
>+static inline void vmpressure(struct mem_cgroup *memcg,
>+ ulong scanned, ulong reclaimed) {}
>+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
>+#endif
>+
> #ifdef CONFIG_VM_EVENT_COUNTERS
> /*
> * Light weight per cpu counter implementation.
>diff --git a/init/Kconfig b/init/Kconfig
>index 7d30240..d526249 100644
>--- a/init/Kconfig
>+++ b/init/Kconfig
>@@ -891,6 +891,18 @@ config MEMCG_KMEM
> the kmem extension can use it to guarantee that no group of processes
> will ever exhaust kernel resources alone.
>
>+config CGROUP_MEMPRESSURE
>+ bool "Memory pressure monitor for Control Groups"
>+ help
>+ The memory pressure monitor cgroup provides a facility for
>+ userland programs so that they can easily assist the kernel
>+ with memory management. So far the API provides simple,
>+ levels-based memory pressure notifications.
>+
>+ For more information see Documentation/cgroups/mempressure.txt
>+
>+ If unsure, say N.
>+
> config CGROUP_HUGETLB
> bool "HugeTLB Resource Controller for Control Groups"
> depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
>diff --git a/mm/Makefile b/mm/Makefile
>index 3a46287..e69bbda 100644
>--- a/mm/Makefile
>+++ b/mm/Makefile
>@@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
>+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>diff --git a/mm/mempressure.c b/mm/mempressure.c
>new file mode 100644
>index 0000000..ea312bb
>--- /dev/null
>+++ b/mm/mempressure.c
>@@ -0,0 +1,330 @@
>+/*
>+ * Linux VM pressure
>+ *
>+ * Copyright 2012 Linaro Ltd.
>+ * Anton Vorontsov <anton.vorontsov@linaro.org>
>+ *
>+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
>+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
>+ *
>+ * This program is free software; you can redistribute it and/or modify it
>+ * under the terms of the GNU General Public License version 2 as published
>+ * by the Free Software Foundation.
>+ */
>+
>+#include <linux/cgroup.h>
>+#include <linux/fs.h>
>+#include <linux/sched.h>
>+#include <linux/mm.h>
>+#include <linux/vmstat.h>
>+#include <linux/eventfd.h>
>+#include <linux/swap.h>
>+#include <linux/printk.h>
>+
>+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
>+
>+/*
>+ * Generic VM Pressure routines (no cgroups or any other API details)
>+ */
>+
>+/*
>+ * The window size is the number of scanned pages before we try to analyze
>+ * the scanned/reclaimed ratio (or difference).
>+ *
>+ * It is used as a rate-limit tunable for the "low" level notification,
>+ * and for averaging medium/oom levels. Using small window sizes can cause
>+ * a lot of false positives, but too big a window size will delay the
>+ * notifications.
>+ */
>+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
Since the variable is declared const, how can it be a tunable?
>+static const uint vmpressure_level_med = 60;
>+static const uint vmpressure_level_oom = 99;
>+static const uint vmpressure_level_oom_prio = 4;
>+
>+enum vmpressure_levels {
>+ VMPRESSURE_LOW = 0,
>+ VMPRESSURE_MEDIUM,
>+ VMPRESSURE_OOM,
>+ VMPRESSURE_NUM_LEVELS,
>+};
>+
>+static const char *vmpressure_str_levels[] = {
>+ [VMPRESSURE_LOW] = "low",
>+ [VMPRESSURE_MEDIUM] = "medium",
>+ [VMPRESSURE_OOM] = "oom",
>+};
>+
>+static enum vmpressure_levels vmpressure_level(uint pressure)
>+{
>+ if (pressure >= vmpressure_level_oom)
>+ return VMPRESSURE_OOM;
>+ else if (pressure >= vmpressure_level_med)
>+ return VMPRESSURE_MEDIUM;
>+ return VMPRESSURE_LOW;
>+}
>+
>+static ulong vmpressure_calc_level(uint win, uint s, uint r)
>+{
>+ ulong p;
>+
>+ if (!s)
>+ return 0;
>+
>+ /*
>+ * We calculate the ratio (in percents) of how many pages were
>+ * scanned vs. reclaimed in a given time frame (window). Note that
>+ * time is in VM reclaimer's "ticks", i.e. number of pages
>+ * scanned. This makes it possible to set desired reaction time
>+ * and serves as a ratelimit.
>+ */
>+ p = win - (r * win / s);
>+ p = p * 100 / win;
>+
>+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
>+
>+ return vmpressure_level(p);
>+}
>+
>+void vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
>+{
>+ if (!scanned)
>+ return;
>+ mpc_vmpressure(memcg, scanned, reclaimed);
>+}
>+
>+void vmpressure_prio(struct mem_cgroup *memcg, int prio)
>+{
>+ if (prio > vmpressure_level_oom_prio)
>+ return;
Since the max value of prio (sc->priority) is DEF_PRIORITY (12), why is
this check needed?
>+
>+ /* OK, the prio is below the threshold, send the pre-OOM event. */
>+ vmpressure(memcg, vmpressure_win, 0);
>+}
>+
>+/*
>+ * Memory pressure cgroup code
>+ */
>+
>+struct mpc_event {
>+ struct eventfd_ctx *efd;
>+ enum vmpressure_levels level;
>+ struct list_head node;
>+};
>+
>+struct mpc_state {
>+ struct cgroup_subsys_state css;
>+
>+ uint scanned;
>+ uint reclaimed;
>+ struct mutex sr_lock;
>+
>+ struct list_head events;
>+ struct mutex events_lock;
>+
>+ struct work_struct work;
>+};
>+
>+static struct mpc_state *wk2mpc(struct work_struct *wk)
>+{
>+ return container_of(wk, struct mpc_state, work);
>+}
>+
>+static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
>+{
>+ return container_of(css, struct mpc_state, css);
>+}
>+
>+static struct mpc_state *tsk2mpc(struct task_struct *tsk)
>+{
>+ return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
>+}
>+
>+static struct mpc_state *cg2mpc(struct cgroup *cg)
>+{
>+ return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
>+}
>+
>+static void mpc_event(struct mpc_state *mpc, ulong s, ulong r)
>+{
>+ struct mpc_event *ev;
>+ int level = vmpressure_calc_level(vmpressure_win, s, r);
>+
>+ mutex_lock(&mpc->events_lock);
>+
>+ list_for_each_entry(ev, &mpc->events, node) {
>+ if (level >= ev->level)
>+ eventfd_signal(ev->efd, 1);
>+ }
>+
>+ mutex_unlock(&mpc->events_lock);
>+}
>+
>+static void mpc_vmpressure_wk_fn(struct work_struct *wk)
>+{
>+ struct mpc_state *mpc = wk2mpc(wk);
>+ ulong s;
>+ ulong r;
>+
>+ mutex_lock(&mpc->sr_lock);
>+ s = mpc->scanned;
>+ r = mpc->reclaimed;
>+ mpc->scanned = 0;
>+ mpc->reclaimed = 0;
>+ mutex_unlock(&mpc->sr_lock);
>+
>+ mpc_event(mpc, s, r);
>+}
>+
>+static void __mpc_vmpressure(struct mpc_state *mpc, ulong s, ulong r)
>+{
>+ mutex_lock(&mpc->sr_lock);
>+ mpc->scanned += s;
>+ mpc->reclaimed += r;
>+ mutex_unlock(&mpc->sr_lock);
>+
>+ if (s < vmpressure_win || work_pending(&mpc->work))
>+ return;
>+
>+ schedule_work(&mpc->work);
>+}
>+
>+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
>+{
>+ /*
>+ * There are two options for implementing cgroup pressure
>+ * notifications:
>+ *
>+ * - Store pressure counter atomically in the task struct. Upon
>+ * hitting 'window' wake up a workqueue that will walk every
>+ * task and sum per-thread pressure into cgroup pressure (to
>+ * which the task belongs). The cons are obvious: bloats task
>+ * struct, have to walk all processes and makes pressure less
>+ * accurate (the window becomes per-thread);
>+ *
>+ * - Store pressure counters in per-cgroup state. This is easy and
>+ * straightforward, and that's how we do things here. But this
>+ * requires us to not put the vmpressure hooks into hotpath,
>+ * since we have to grab some locks.
>+ */
>+
>+#ifdef CONFIG_MEMCG
>+ if (memcg) {
>+ struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
>+ struct cgroup *cg = css->cgroup;
>+ struct mpc_state *mpc = cg2mpc(cg);
>+
>+ if (mpc)
>+ __mpc_vmpressure(mpc, s, r);
>+ return;
>+ }
>+#endif
>+ task_lock(current);
>+ __mpc_vmpressure(tsk2mpc(current), s, r);
>+ task_unlock(current);
>+}
>+
>+static struct cgroup_subsys_state *mpc_css_alloc(struct cgroup *cg)
>+{
>+ struct mpc_state *mpc;
>+
>+ mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
>+ if (!mpc)
>+ return ERR_PTR(-ENOMEM);
>+
>+ mutex_init(&mpc->sr_lock);
>+ mutex_init(&mpc->events_lock);
>+ INIT_LIST_HEAD(&mpc->events);
>+ INIT_WORK(&mpc->work, mpc_vmpressure_wk_fn);
>+
>+ return &mpc->css;
>+}
>+
>+static void mpc_css_free(struct cgroup *cg)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+
>+ kfree(mpc);
>+}
>+
>+static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
>+ struct file *file, char __user *buf,
>+ size_t sz, loff_t *ppos)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+ uint level;
>+ const char *str;
>+
>+ mutex_lock(&mpc->sr_lock);
>+
>+ level = vmpressure_calc_level(vmpressure_win,
>+ mpc->scanned, mpc->reclaimed);
>+
>+ mutex_unlock(&mpc->sr_lock);
>+
>+ str = vmpressure_str_levels[level];
>+ return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
You missed a trailing "\n". The output runs into the prompt:
[root@kernel ~]# cat /sys/fs/cgroup/mempressure/mempressure.level
low[root@kernel ~]#
Regards,
Wanpeng Li
>+}
>+
>+static int mpc_register_level(struct cgroup *cg, struct cftype *cft,
>+ struct eventfd_ctx *eventfd, const char *args)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+ struct mpc_event *ev;
>+ int lvl;
>+
>+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
>+ if (!strcmp(vmpressure_str_levels[lvl], args))
>+ break;
>+ }
>+
>+ if (lvl >= VMPRESSURE_NUM_LEVELS)
>+ return -EINVAL;
>+
>+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>+ if (!ev)
>+ return -ENOMEM;
>+
>+ ev->efd = eventfd;
>+ ev->level = lvl;
>+
>+ mutex_lock(&mpc->events_lock);
>+ list_add(&ev->node, &mpc->events);
>+ mutex_unlock(&mpc->events_lock);
>+
>+ return 0;
>+}
>+
>+static void mpc_unregister_level(struct cgroup *cg, struct cftype *cft,
>+ struct eventfd_ctx *eventfd)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+ struct mpc_event *ev;
>+
>+ mutex_lock(&mpc->events_lock);
>+ list_for_each_entry(ev, &mpc->events, node) {
>+ if (ev->efd != eventfd)
>+ continue;
>+ list_del(&ev->node);
>+ kfree(ev);
>+ break;
>+ }
>+ mutex_unlock(&mpc->events_lock);
>+}
>+
>+static struct cftype mpc_files[] = {
>+ {
>+ .name = "level",
>+ .read = mpc_read_level,
>+ .register_event = mpc_register_level,
>+ .unregister_event = mpc_unregister_level,
>+ },
>+ {},
>+};
>+
>+struct cgroup_subsys mpc_cgroup_subsys = {
>+ .name = "mempressure",
>+ .subsys_id = mpc_cgroup_subsys_id,
>+ .css_alloc = mpc_css_alloc,
>+ .css_free = mpc_css_free,
>+ .base_cftypes = mpc_files,
>+};
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 16b42af..fed0e04 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -1900,6 +1900,9 @@ restart:
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
>
>+ vmpressure(sc->target_mem_cgroup,
>+ sc->nr_scanned - nr_scanned, nr_reclaimed);
>+
> /* reclaim/compaction might need reclaim to continue */
> if (should_continue_reclaim(lruvec, nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc))
>@@ -2122,6 +2125,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> count_vm_event(ALLOCSTALL);
>
> do {
>+ vmpressure_prio(sc->target_mem_cgroup, sc->priority);
> sc->nr_scanned = 0;
> aborted_reclaim = shrink_zones(zonelist, sc);
>
>--
>1.8.0.2
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org. For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/2] Add mempressure cgroup
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
` (6 preceding siblings ...)
2013-01-13 8:50 ` Simon Jeons
@ 2013-01-13 8:52 ` Wanpeng Li
2013-01-13 8:52 ` Wanpeng Li
8 siblings, 0 replies; 33+ messages in thread
From: Wanpeng Li @ 2013-01-13 8:52 UTC (permalink / raw)
To: Anton Vorontsov
Cc: David Rientjes, Pekka Enberg, Mel Gorman, Glauber Costa,
Michal Hocko, Kirill A.Shutemov, Luiz Capitulino, Andrew Morton,
Greg Thelen, Leonid Moiseichuk, KOSAKI Motohiro, Minchan Kim,
Bartlomiej Zolnierkiewicz, John Stultz, linux-mm, linux-kernel,
linaro-kernel, patches, kernel-team
Hi Anton,
On Fri, Jan 04, 2013 at 12:29:11AM -0800, Anton Vorontsov wrote:
>This commit implements David Rientjes' idea of mempressure cgroup.
>
>The main characteristics are the same to what I've tried to add to vmevent
>API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
>pressure index calculation. But we don't expose the index to the userland.
>Instead, there are three levels of the pressure:
>
> o low (just reclaiming, e.g. caches are draining);
> o medium (allocation cost becomes high, e.g. swapping);
> o oom (about to oom very soon).
>
>The rationale behind exposing levels and not the raw pressure index
>described here: http://lkml.org/lkml/2012/11/16/675
>
>For a task it is possible to be in both cpusets, memcg and mempressure
>cgroups, so by rearranging the tasks it is possible to watch a specific
>pressure (i.e. caused by cpuset and/or memcg).
>
>Note that while this adds the cgroups support, the code is well separated
>and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
>But this is another story.
>
>Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
>---
> Documentation/cgroups/mempressure.txt | 50 ++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/vmstat.h | 11 ++
> init/Kconfig | 12 ++
> mm/Makefile | 1 +
> mm/mempressure.c | 330 ++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 4 +
> 7 files changed, 414 insertions(+)
> create mode 100644 Documentation/cgroups/mempressure.txt
> create mode 100644 mm/mempressure.c
>
>diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
>new file mode 100644
>index 0000000..dbc0aca
>--- /dev/null
>+++ b/Documentation/cgroups/mempressure.txt
>@@ -0,0 +1,50 @@
>+ Memory pressure cgroup
>+~~~~~~~~~~~~~~~~~~~~~~~~~~
>+ Before using the mempressure cgroup, make sure you have it mounted:
>+
>+ # cd /sys/fs/cgroup/
>+ # mkdir mempressure
>+ # mount -t cgroup cgroup ./mempressure -o mempressure
>+
>+ It is possible to combine cgroups, for example you can mount memory
>+ (memcg) and mempressure cgroups together:
>+
>+ # mount -t cgroup cgroup ./mempressure -o memory,mempressure
>+
>+ That way the reported pressure will honour memory cgroup limits. The
>+ same goes for cpusets.
>+
>+ After the hierarchy is mounted, you can use the following API:
>+
>+ /sys/fs/cgroup/.../mempressure.level
>+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>+ To maintain the interactivity/memory allocation cost, one can use the
>+ pressure level notifications, and the levels are defined like this:
>+
>+ The "low" level means that the system is reclaiming memory for new
>+ allocations. Monitoring reclaiming activity might be useful for
>+ maintaining overall system's cache level. Upon notification, the program
>+ (typically "Activity Manager") might analyze vmstat and act in advance
>+ (i.e. prematurely shutdown unimportant services).
>+
>+ The "medium" level means that the system is experiencing medium memory
>+ pressure, there is some mild swapping activity. Upon this event
>+ applications may decide to free any resources that can be easily
>+ reconstructed or re-read from a disk.
>+
>+ The "oom" level means that the system is actively thrashing, it is about
>+ to out of memory (OOM) or even the in-kernel OOM killer is on its way to
>+ trigger. Applications should do whatever they can to help the system.
>+
>+ Event control:
>+ Is used to setup an eventfd with a level threshold. The argument to
>+ the event control specifies the level threshold.
>+ Read:
>+ Reads mempory presure levels: low, medium or oom.
>+ Write:
>+ Not implemented.
>+ Test:
>+ To set up a notification:
>+
>+ # cgroup_event_listener ./mempressure.level low
>+ ("low", "medium", "oom" are permitted.)
>diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>index f204a7a..b9802e2 100644
>--- a/include/linux/cgroup_subsys.h
>+++ b/include/linux/cgroup_subsys.h
>@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
>+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
>+SUBSYS(mpc_cgroup)
>+#endif
>+
>+/* */
>+
> #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
> SUBSYS(devices)
> #endif
>diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>index a13291f..c1a66c7 100644
>--- a/include/linux/vmstat.h
>+++ b/include/linux/vmstat.h
>@@ -10,6 +10,17 @@
>
> extern int sysctl_stat_interval;
>
>+struct mem_cgroup;
>+#ifdef CONFIG_CGROUP_MEMPRESSURE
>+extern void vmpressure(struct mem_cgroup *memcg,
>+ ulong scanned, ulong reclaimed);
>+extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
>+#else
>+static inline void vmpressure(struct mem_cgroup *memcg,
>+ ulong scanned, ulong reclaimed) {}
>+static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
>+#endif
>+
> #ifdef CONFIG_VM_EVENT_COUNTERS
> /*
> * Light weight per cpu counter implementation.
>diff --git a/init/Kconfig b/init/Kconfig
>index 7d30240..d526249 100644
>--- a/init/Kconfig
>+++ b/init/Kconfig
>@@ -891,6 +891,18 @@ config MEMCG_KMEM
> the kmem extension can use it to guarantee that no group of processes
> will ever exhaust kernel resources alone.
>
>+config CGROUP_MEMPRESSURE
>+ bool "Memory pressure monitor for Control Groups"
>+ help
>+ The memory pressure monitor cgroup provides a facility for
>+ userland programs so that they could easily assist the kernel
>+ with the memory management. So far the API provides simple,
>+ levels-based memory pressure notifications.
>+
>+ For more information see Documentation/cgroups/mempressure.txt
>+
>+ If unsure, say N.
>+
> config CGROUP_HUGETLB
> bool "HugeTLB Resource Controller for Control Groups"
> depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
>diff --git a/mm/Makefile b/mm/Makefile
>index 3a46287..e69bbda 100644
>--- a/mm/Makefile
>+++ b/mm/Makefile
>@@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
>+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>diff --git a/mm/mempressure.c b/mm/mempressure.c
>new file mode 100644
>index 0000000..ea312bb
>--- /dev/null
>+++ b/mm/mempressure.c
>@@ -0,0 +1,330 @@
>+/*
>+ * Linux VM pressure
>+ *
>+ * Copyright 2012 Linaro Ltd.
>+ * Anton Vorontsov <anton.vorontsov@linaro.org>
>+ *
>+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
>+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
>+ *
>+ * This program is free software; you can redistribute it and/or modify it
>+ * under the terms of the GNU General Public License version 2 as published
>+ * by the Free Software Foundation.
>+ */
>+
>+#include <linux/cgroup.h>
>+#include <linux/fs.h>
>+#include <linux/sched.h>
>+#include <linux/mm.h>
>+#include <linux/vmstat.h>
>+#include <linux/eventfd.h>
>+#include <linux/swap.h>
>+#include <linux/printk.h>
>+
>+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
>+
>+/*
>+ * Generic VM Pressure routines (no cgroups or any other API details)
>+ */
>+
>+/*
>+ * The window size is the number of scanned pages before we try to analyze
>+ * the scanned/reclaimed ratio (or difference).
>+ *
>+ * It is used as a rate-limit tunable for the "low" level notification,
>+ * and for averaging medium/oom levels. Using small window sizes can cause
>+ * lot of false positives, but too big window size will delay the
>+ * notifications.
>+ */
>+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
Since the type is const, how can it tunable?
>+static const uint vmpressure_level_med = 60;
>+static const uint vmpressure_level_oom = 99;
>+static const uint vmpressure_level_oom_prio = 4;
>+
>+enum vmpressure_levels {
>+ VMPRESSURE_LOW = 0,
>+ VMPRESSURE_MEDIUM,
>+ VMPRESSURE_OOM,
>+ VMPRESSURE_NUM_LEVELS,
>+};
>+
>+static const char *vmpressure_str_levels[] = {
>+ [VMPRESSURE_LOW] = "low",
>+ [VMPRESSURE_MEDIUM] = "medium",
>+ [VMPRESSURE_OOM] = "oom",
>+};
>+
>+static enum vmpressure_levels vmpressure_level(uint pressure)
>+{
>+ if (pressure >= vmpressure_level_oom)
>+ return VMPRESSURE_OOM;
>+ else if (pressure >= vmpressure_level_med)
>+ return VMPRESSURE_MEDIUM;
>+ return VMPRESSURE_LOW;
>+}
>+
>+static ulong vmpressure_calc_level(uint win, uint s, uint r)
>+{
>+ ulong p;
>+
>+ if (!s)
>+ return 0;
>+
>+ /*
>+ * We calculate the ratio (in percents) of how many pages were
>+ * scanned vs. reclaimed in a given time frame (window). Note that
>+ * time is in VM reclaimer's "ticks", i.e. number of pages
>+ * scanned. This makes it possible to set desired reaction time
>+ * and serves as a ratelimit.
>+ */
>+ p = win - (r * win / s);
>+ p = p * 100 / win;
>+
>+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
>+
>+ return vmpressure_level(p);
>+}
>+
>+void vmpressure(struct mem_cgroup *memcg, ulong scanned, ulong reclaimed)
>+{
>+ if (!scanned)
>+ return;
>+ mpc_vmpressure(memcg, scanned, reclaimed);
>+}
>+
>+void vmpressure_prio(struct mem_cgroup *memcg, int prio)
>+{
>+ if (prio > vmpressure_level_oom_prio)
>+ return;
Since the max value of prio(sc->priority) == DEF_PRIORITY(12), why need
it?
>+
>+ /* OK, the prio is below the threshold, send the pre-OOM event. */
>+ vmpressure(memcg, vmpressure_win, 0);
>+}
>+
>+/*
>+ * Memory pressure cgroup code
>+ */
>+
>+struct mpc_event {
>+ struct eventfd_ctx *efd;
>+ enum vmpressure_levels level;
>+ struct list_head node;
>+};
>+
>+struct mpc_state {
>+ struct cgroup_subsys_state css;
>+
>+ uint scanned;
>+ uint reclaimed;
>+ struct mutex sr_lock;
>+
>+ struct list_head events;
>+ struct mutex events_lock;
>+
>+ struct work_struct work;
>+};
>+
>+static struct mpc_state *wk2mpc(struct work_struct *wk)
>+{
>+ return container_of(wk, struct mpc_state, work);
>+}
>+
>+static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
>+{
>+ return container_of(css, struct mpc_state, css);
>+}
>+
>+static struct mpc_state *tsk2mpc(struct task_struct *tsk)
>+{
>+ return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
>+}
>+
>+static struct mpc_state *cg2mpc(struct cgroup *cg)
>+{
>+ return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
>+}
>+
>+static void mpc_event(struct mpc_state *mpc, ulong s, ulong r)
>+{
>+ struct mpc_event *ev;
>+ int level = vmpressure_calc_level(vmpressure_win, s, r);
>+
>+ mutex_lock(&mpc->events_lock);
>+
>+ list_for_each_entry(ev, &mpc->events, node) {
>+ if (level >= ev->level)
>+ eventfd_signal(ev->efd, 1);
>+ }
>+
>+ mutex_unlock(&mpc->events_lock);
>+}
>+
>+static void mpc_vmpressure_wk_fn(struct work_struct *wk)
>+{
>+ struct mpc_state *mpc = wk2mpc(wk);
>+ ulong s;
>+ ulong r;
>+
>+ mutex_lock(&mpc->sr_lock);
>+ s = mpc->scanned;
>+ r = mpc->reclaimed;
>+ mpc->scanned = 0;
>+ mpc->reclaimed = 0;
>+ mutex_unlock(&mpc->sr_lock);
>+
>+ mpc_event(mpc, s, r);
>+}
>+
>+static void __mpc_vmpressure(struct mpc_state *mpc, ulong s, ulong r)
>+{
>+ mutex_lock(&mpc->sr_lock);
>+ mpc->scanned += s;
>+ mpc->reclaimed += r;
>+ mutex_unlock(&mpc->sr_lock);
>+
>+ if (s < vmpressure_win || work_pending(&mpc->work))
>+ return;
>+
>+ schedule_work(&mpc->work);
>+}
>+
>+static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
>+{
>+ /*
>+ * There are two options for implementing cgroup pressure
>+ * notifications:
>+ *
>+ * - Store pressure counter atomically in the task struct. Upon
>+ * hitting 'window' wake up a workqueue that will walk every
>+ * task and sum per-thread pressure into cgroup pressure (to
>+ * which the task belongs). The cons are obvious: bloats task
>+ * struct, have to walk all processes and makes pressure less
>+ * accurate (the window becomes per-thread);
>+ *
>+ * - Store pressure counters in per-cgroup state. This is easy and
>+ * straightforward, and that's how we do things here. But this
>+ * requires us to not put the vmpressure hooks into hotpath,
>+ * since we have to grab some locks.
>+ */
>+
>+#ifdef CONFIG_MEMCG
>+ if (memcg) {
>+ struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
>+ struct cgroup *cg = css->cgroup;
>+ struct mpc_state *mpc = cg2mpc(cg);
>+
>+ if (mpc)
>+ __mpc_vmpressure(mpc, s, r);
>+ return;
>+ }
>+#endif
>+ task_lock(current);
>+ __mpc_vmpressure(tsk2mpc(current), s, r);
>+ task_unlock(current);
>+}
>+
>+static struct cgroup_subsys_state *mpc_css_alloc(struct cgroup *cg)
>+{
>+ struct mpc_state *mpc;
>+
>+ mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
>+ if (!mpc)
>+ return ERR_PTR(-ENOMEM);
>+
>+ mutex_init(&mpc->sr_lock);
>+ mutex_init(&mpc->events_lock);
>+ INIT_LIST_HEAD(&mpc->events);
>+ INIT_WORK(&mpc->work, mpc_vmpressure_wk_fn);
>+
>+ return &mpc->css;
>+}
>+
>+static void mpc_css_free(struct cgroup *cg)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+
>+ kfree(mpc);
>+}
>+
>+static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
>+ struct file *file, char __user *buf,
>+ size_t sz, loff_t *ppos)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+ uint level;
>+ const char *str;
>+
>+ mutex_lock(&mpc->sr_lock);
>+
>+ level = vmpressure_calc_level(vmpressure_win,
>+ mpc->scanned, mpc->reclaimed);
>+
>+ mutex_unlock(&mpc->sr_lock);
>+
>+ str = vmpressure_str_levels[level];
>+ return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
You're missing a "\n". The printed result:
[root@kernel ~]# cat /sys/fs/cgroup/mempressure/mempressure.level
low[root@kernel ~]#
Regards,
Wanpeng Li
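A minimal fix sketch (untested, against this patch): format the level into
a small stack buffer so the trailing newline is included before handing it
to simple_read_from_buffer():

```
	const char *lvl = vmpressure_str_levels[level];
	char str[16];
	int len = scnprintf(str, sizeof(str), "%s\n", lvl);

	return simple_read_from_buffer(buf, sz, ppos, str, len);
```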
>+}
>+
>+static int mpc_register_level(struct cgroup *cg, struct cftype *cft,
>+ struct eventfd_ctx *eventfd, const char *args)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+ struct mpc_event *ev;
>+ int lvl;
>+
>+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
>+ if (!strcmp(vmpressure_str_levels[lvl], args))
>+ break;
>+ }
>+
>+ if (lvl >= VMPRESSURE_NUM_LEVELS)
>+ return -EINVAL;
>+
>+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>+ if (!ev)
>+ return -ENOMEM;
>+
>+ ev->efd = eventfd;
>+ ev->level = lvl;
>+
>+ mutex_lock(&mpc->events_lock);
>+ list_add(&ev->node, &mpc->events);
>+ mutex_unlock(&mpc->events_lock);
>+
>+ return 0;
>+}
>+
>+static void mpc_unregister_level(struct cgroup *cg, struct cftype *cft,
>+ struct eventfd_ctx *eventfd)
>+{
>+ struct mpc_state *mpc = cg2mpc(cg);
>+ struct mpc_event *ev;
>+
>+ mutex_lock(&mpc->events_lock);
>+ list_for_each_entry(ev, &mpc->events, node) {
>+ if (ev->efd != eventfd)
>+ continue;
>+ list_del(&ev->node);
>+ kfree(ev);
>+ break;
>+ }
>+ mutex_unlock(&mpc->events_lock);
>+}
>+
>+static struct cftype mpc_files[] = {
>+ {
>+ .name = "level",
>+ .read = mpc_read_level,
>+ .register_event = mpc_register_level,
>+ .unregister_event = mpc_unregister_level,
>+ },
>+ {},
>+};
>+
>+struct cgroup_subsys mpc_cgroup_subsys = {
>+ .name = "mempressure",
>+ .subsys_id = mpc_cgroup_subsys_id,
>+ .css_alloc = mpc_css_alloc,
>+ .css_free = mpc_css_free,
>+ .base_cftypes = mpc_files,
>+};
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 16b42af..fed0e04 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -1900,6 +1900,9 @@ restart:
> shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> sc, LRU_ACTIVE_ANON);
>
>+ vmpressure(sc->target_mem_cgroup,
>+ sc->nr_scanned - nr_scanned, nr_reclaimed);
>+
> /* reclaim/compaction might need reclaim to continue */
> if (should_continue_reclaim(lruvec, nr_reclaimed,
> sc->nr_scanned - nr_scanned, sc))
>@@ -2122,6 +2125,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> count_vm_event(ALLOCSTALL);
>
> do {
>+ vmpressure_prio(sc->target_mem_cgroup, sc->priority);
> sc->nr_scanned = 0;
> aborted_reclaim = shrink_zones(zonelist, sc);
>
>--
>1.8.0.2
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org. For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: email@kvack.org
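For testing the notification path from userland, something along the lines
of the sketch below should work, assuming the cgroup v1 eventfd scheme
that this patch's .register_event callback plugs into (the file names
`mempressure.level` and `cgroup.event_control`, the "low" level string,
and the helper names here are assumptions/illustration, not part of the
patch):

```c
/* Hypothetical mempressure listener registration via the cgroup v1
 * eventfd mechanism: create an eventfd, then write
 * "<event_fd> <target_fd> <level>" to cgroup.event_control. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Format the "<event_fd> <target_fd> <level>" control line. */
static int build_ctl_line(char *buf, size_t sz, int efd, int lfd,
			  const char *level)
{
	return snprintf(buf, sz, "%d %d %s", efd, lfd, level);
}

/* Returns an eventfd that becomes readable on pressure events at
 * 'level' or above, or -1 on error. */
static int mempressure_register(const char *cgdir, const char *level)
{
	char path[512], line[64];
	int efd, lfd, cfd, ret = -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/mempressure.level", cgdir);
	lfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgdir);
	cfd = open(path, O_WRONLY);
	if (lfd < 0 || cfd < 0)
		goto out;

	build_ctl_line(line, sizeof(line), efd, lfd, level);
	if (write(cfd, line, strlen(line)) >= 0)
		ret = efd;
out:
	if (lfd >= 0)
		close(lfd);
	if (cfd >= 0)
		close(cfd);
	if (ret < 0)
		close(efd);
	return ret;
}
```

A caller would then block in an 8-byte read() on the returned fd (the
eventfd counter), waking once per eventfd_signal() from mpc_event().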
Thread overview: 33+ messages
2013-01-04 8:27 [PATCH 0/2] Mempressure cgroup Anton Vorontsov
2013-01-04 8:29 ` [PATCH 1/2] Add mempressure cgroup Anton Vorontsov
2013-01-04 15:05 ` Kirill A. Shutemov
2013-01-07 8:51 ` Kamezawa Hiroyuki
2013-01-08 7:29 ` Anton Vorontsov
2013-01-08 7:57 ` leonid.moiseichuk
2013-01-08 8:24 ` Kamezawa Hiroyuki
2013-01-08 8:49 ` Minchan Kim
2013-01-09 22:14 ` Anton Vorontsov
2013-01-11 5:12 ` Minchan Kim
2013-01-11 5:38 ` Anton Vorontsov
2013-01-11 5:56 ` Minchan Kim
2013-01-11 6:09 ` Anton Vorontsov
2013-01-08 21:44 ` Andrew Morton
2013-01-09 14:10 ` Glauber Costa
2013-01-09 20:28 ` Andrew Morton
2013-01-09 8:56 ` Glauber Costa
2013-01-09 9:15 ` Andrew Morton
2013-01-09 13:43 ` Glauber Costa
2013-01-09 20:37 ` Tejun Heo
2013-01-09 20:39 ` Tejun Heo
2013-01-09 21:20 ` Glauber Costa
2013-01-09 21:36 ` Anton Vorontsov
2013-01-09 21:55 ` Tejun Heo
2013-01-09 22:04 ` Tejun Heo
2013-01-09 22:06 ` Anton Vorontsov
2013-01-09 22:21 ` Tejun Heo
2013-01-10 7:18 ` Glauber Costa
2013-01-13 8:50 ` Simon Jeons
2013-01-13 8:52 ` Wanpeng Li
2013-01-13 8:52 ` Wanpeng Li
2013-01-04 8:29 ` [PATCH 2/2] Add shrinker interface for " Anton Vorontsov
2013-01-11 19:13 ` [PATCH 0/2] Mempressure cgroup Luiz Capitulino