* [PATCH v5 01/14] memcg: Make it possible to use the stock for more than one page.
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
[not found] ` <1350382611-20579-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 10:16 ` [PATCH v5 02/14] memcg: Reclaim when more than one page needed Glauber Costa
` (13 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel,
Suleiman Souhlal, Glauber Costa
From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
We currently have a percpu stock cache scheme that charges one page at a
time from memcg->res, the user counter. When the kernel memory
controller comes into play, we'll need to charge more than that.
This is because kernel memory allocations will also draw from the user
counter, and can be bigger than a single page, as it is the case with
the stack (usually 2 pages) or some higher order slabs.
[ glommer@parallels.com: added a changelog ]
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..47cb019 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2028,20 +2028,28 @@ struct memcg_stock_pcp {
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
static DEFINE_MUTEX(percpu_charge_mutex);
-/*
- * Try to consume stocked charge on this cpu. If success, one page is consumed
- * from local stock and true is returned. If the stock is 0 or charges from a
- * cgroup which is not current target, returns false. This stock will be
- * refilled.
+/**
+ * consume_stock: Try to consume stocked charge on this cpu.
+ * @memcg: memcg to consume from.
+ * @nr_pages: how many pages to charge.
+ *
+ * The charges will only happen if @memcg matches the current cpu's memcg
+ * stock, and at least @nr_pages are available in that stock. Failure to
+ * service an allocation will refill the stock.
+ *
+ * returns true if succesfull, false otherwise.
*/
-static bool consume_stock(struct mem_cgroup *memcg)
+static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
{
struct memcg_stock_pcp *stock;
bool ret = true;
+ if (nr_pages > CHARGE_BATCH)
+ return false;
+
stock = &get_cpu_var(memcg_stock);
- if (memcg == stock->cached && stock->nr_pages)
- stock->nr_pages--;
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ stock->nr_pages -= nr_pages;
else /* need to call res_counter_charge */
ret = false;
put_cpu_var(memcg_stock);
@@ -2340,7 +2348,7 @@ again:
VM_BUG_ON(css_is_removed(&memcg->css));
if (mem_cgroup_is_root(memcg))
goto done;
- if (nr_pages == 1 && consume_stock(memcg))
+ if (consume_stock(memcg, nr_pages))
goto done;
css_get(&memcg->css);
} else {
@@ -2365,7 +2373,7 @@ again:
rcu_read_unlock();
goto done;
}
- if (nr_pages == 1 && consume_stock(memcg)) {
+ if (consume_stock(memcg, nr_pages)) {
/*
* It seems dagerous to access memcg without css_get().
* But considering how consume_stok works, it's not
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* [PATCH v5 02/14] memcg: Reclaim when more than one page needed.
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
2012-10-16 10:16 ` [PATCH v5 01/14] memcg: Make it possible to use the stock for more than one page Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
[not found] ` <1350382611-20579-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 10:16 ` [PATCH v5 03/14] memcg: change defines to an enum Glauber Costa
` (12 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel,
Suleiman Souhlal, Glauber Costa
From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
mem_cgroup_do_charge() was written before kmem accounting, and expects
three cases: being called for 1 page, being called for a stock of 32
pages, or being called for a hugepage. If we call for 2 or 3 pages (and
both the stack and several slabs used in process creation are such, at
least with the debug options I had), it assumed it's being called for
stock and just retried without reclaiming.
Fix that by passing down a minsize argument in addition to the csize.
And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably is since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, of what? For now retry up to
COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
__GFP_NORETRY.
[v4: fixed nr pages calculation pointed out by Christoph Lameter ]
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 47cb019..7a9652a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2226,7 +2226,8 @@ enum {
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, unsigned int min_pages,
+ bool oom_check)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2249,18 +2250,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
- * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
- * of regular pages (CHARGE_BATCH), or a single regular page (1).
- *
* Never reclaim on behalf of optional batching, retry with a
* single page instead.
*/
- if (nr_pages == CHARGE_BATCH)
+ if (nr_pages > min_pages)
return CHARGE_RETRY;
if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;
+ if (gfp_mask & __GFP_NORETRY)
+ return CHARGE_NOMEM;
+
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
@@ -2273,7 +2274,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
* unlikely to succeed so close to the limit, and we fall back
* to regular pages anyway in case of failure.
*/
- if (nr_pages == 1 && ret)
+ if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
return CHARGE_RETRY;
/*
@@ -2408,7 +2409,8 @@ again:
nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
}
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
+ oom_check);
switch (ret) {
case CHARGE_OK:
break;
--
1.7.11.7
^ permalink raw reply related [flat|nested] 70+ messages in thread* [PATCH v5 03/14] memcg: change defines to an enum
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
2012-10-16 10:16 ` [PATCH v5 01/14] memcg: Make it possible to use the stock for more than one page Glauber Costa
2012-10-16 10:16 ` [PATCH v5 02/14] memcg: Reclaim when more than one page needed Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 21:50 ` David Rientjes
2012-10-16 10:16 ` [PATCH v5 04/14] kmem accounting basic infrastructure Glauber Costa
` (11 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa
This is just a cleanup patch for clarity of expression. In earlier
submissions, people asked it to be in a separate patch, so here it is.
[ v2: use named enum as type throughout the file as well ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a9652a..71d259e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -386,9 +386,12 @@ enum charge_type {
};
/* for encoding cft->private value on file */
-#define _MEM (0)
-#define _MEMSWAP (1)
-#define _OOM_TYPE (2)
+enum res_type {
+ _MEM,
+ _MEMSWAP,
+ _OOM_TYPE,
+};
+
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
@@ -3915,7 +3918,8 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
char str[64];
u64 val;
- int type, name, len;
+ int name, len;
+ enum res_type type;
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
@@ -3951,7 +3955,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int type, name;
+ enum res_type type;
+ int name;
unsigned long long val;
int ret;
@@ -4027,7 +4032,8 @@ out:
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int type, name;
+ int name;
+ enum res_type type;
type = MEMFILE_TYPE(event);
name = MEMFILE_ATTR(event);
@@ -4363,7 +4369,7 @@ static int mem_cgroup_usage_register_event(struct cgroup *cgrp,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
u64 threshold, usage;
int i, size, ret;
@@ -4446,7 +4452,7 @@ static void mem_cgroup_usage_unregister_event(struct cgroup *cgrp,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
u64 usage;
int i, j, size;
@@ -4524,7 +4530,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_eventfd_list *event;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
BUG_ON(type != _OOM_TYPE);
event = kmalloc(sizeof(*event), GFP_KERNEL);
@@ -4549,7 +4555,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_eventfd_list *ev, *tmp;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
BUG_ON(type != _OOM_TYPE);
--
1.7.11.7
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 03/14] memcg: change defines to an enum
2012-10-16 10:16 ` [PATCH v5 03/14] memcg: change defines to an enum Glauber Costa
@ 2012-10-17 21:50 ` David Rientjes
0 siblings, 0 replies; 70+ messages in thread
From: David Rientjes @ 2012-10-17 21:50 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
Pekka Enberg, devel, linux-kernel
On Tue, 16 Oct 2012, Glauber Costa wrote:
> This is just a cleanup patch for clarity of expression. In earlier
> submissions, people asked it to be in a separate patch, so here it is.
>
> [ v2: use named enum as type throughout the file as well ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> CC: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 04/14] kmem accounting basic infrastructure
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (2 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 03/14] memcg: change defines to an enum Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-16 12:14 ` Michal Hocko
` (2 more replies)
2012-10-16 10:16 ` [PATCH v5 05/14] Add a __GFP_KMEMCG flag Glauber Costa
` (10 subsequent siblings)
14 siblings, 3 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa
This patch adds the basic infrastructure for the accounting of kernel
memory. To control that, the following files are created:
* memory.kmem.usage_in_bytes
* memory.kmem.limit_in_bytes
* memory.kmem.failcnt
* memory.kmem.max_usage_in_bytes
They have the same meaning of their user memory counterparts. They
reflect the state of the "kmem" res_counter.
Per cgroup kmem memory accounting is not enabled until a limit is set
for the group. Once the limit is set the accounting cannot be disabled
for that group. This means that after the patch is applied, no
behavioral changes exists for whoever is still using memcg to control
their memory usage, until memory.kmem.limit_in_bytes is set for the
first time.
We always account to both user and kernel resource_counters. This
effectively means that an independent kernel limit is in place when the
limit is set to a lower value than the user memory. A equal or higher
value means that the user limit will always hit first, meaning that kmem
is effectively unlimited.
People who want to track kernel memory but not limit it, can set this
limit to a very high number (like RESOURCE_MAX - 1page - that no one
will ever hit, or equal to the user memory)
[ v4: make kmem files part of the main array;
do not allow limit to be set for non-empty cgroups ]
[ v5: cosmetic changes ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 115 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 71d259e..30eafeb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -266,6 +266,10 @@ struct mem_cgroup {
};
/*
+ * the counter to account for kernel memory usage.
+ */
+ struct res_counter kmem;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -280,6 +284,7 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
+ unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
bool oom_lock;
atomic_t under_oom;
@@ -332,6 +337,20 @@ struct mem_cgroup {
#endif
};
+/* internal only representation about the status of kmem accounting. */
+enum {
+ KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+};
+
+#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
+
+#ifdef CONFIG_MEMCG_KMEM
+static void memcg_kmem_set_active(struct mem_cgroup *memcg)
+{
+ set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
+}
+#endif
+
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -390,6 +409,7 @@ enum res_type {
_MEM,
_MEMSWAP,
_OOM_TYPE,
+ _KMEM,
};
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
@@ -1433,6 +1453,10 @@ done:
res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+ printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
+ res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
}
/*
@@ -3940,6 +3964,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
else
val = res_counter_read_u64(&memcg->memsw, name);
break;
+ case _KMEM:
+ val = res_counter_read_u64(&memcg->kmem, name);
+ break;
default:
BUG();
}
@@ -3947,6 +3974,57 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
return simple_read_from_buffer(buf, nbytes, ppos, str, len);
}
+
+static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
+{
+ int ret = -EINVAL;
+#ifdef CONFIG_MEMCG_KMEM
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ /*
+ * For simplicity, we won't allow this to be disabled. It also can't
+ * be changed if the cgroup has children already, or if tasks had
+ * already joined.
+ *
+ * If tasks join before we set the limit, a person looking at
+ * kmem.usage_in_bytes will have no way to determine when it took
+ * place, which makes the value quite meaningless.
+ *
+ * After it first became limited, changes in the value of the limit are
+ * of course permitted.
+ *
+ * Taking the cgroup_lock is really offensive, but it is so far the only
+ * way to guarantee that no children will appear. There are plenty of
+ * other offenders, and they should all go away. Fine grained locking
+ * is probably the way to go here. When we are fully hierarchical, we
+ * can also get rid of the use_hierarchy check.
+ */
+ cgroup_lock();
+ mutex_lock(&set_limit_mutex);
+ if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
+ if (cgroup_task_count(cont) || (memcg->use_hierarchy &&
+ !list_empty(&cont->children))) {
+ ret = -EBUSY;
+ goto out;
+ }
+ ret = res_counter_set_limit(&memcg->kmem, val);
+ VM_BUG_ON(ret);
+
+ memcg_kmem_set_active(memcg);
+ } else
+ ret = res_counter_set_limit(&memcg->kmem, val);
+out:
+ mutex_unlock(&set_limit_mutex);
+ cgroup_unlock();
+#endif
+ return ret;
+}
+
+static void memcg_propagate_kmem(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ memcg->kmem_accounted = parent->kmem_accounted;
+}
+
/*
* The user of this function is...
* RES_LIMIT.
@@ -3978,8 +4056,12 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
break;
if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
- else
+ else if (type == _MEMSWAP)
ret = mem_cgroup_resize_memsw_limit(memcg, val);
+ else if (type == _KMEM)
+ ret = memcg_update_kmem_limit(cont, val);
+ else
+ return -EINVAL;
break;
case RES_SOFT_LIMIT:
ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -4045,12 +4127,16 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
case RES_MAX_USAGE:
if (type == _MEM)
res_counter_reset_max(&memcg->res);
+ else if (type == _KMEM)
+ res_counter_reset_max(&memcg->kmem);
else
res_counter_reset_max(&memcg->memsw);
break;
case RES_FAILCNT:
if (type == _MEM)
res_counter_reset_failcnt(&memcg->res);
+ else if (type == _KMEM)
+ res_counter_reset_failcnt(&memcg->kmem);
else
res_counter_reset_failcnt(&memcg->memsw);
break;
@@ -4728,6 +4814,31 @@ static struct cftype mem_cgroup_files[] = {
.read = mem_cgroup_read,
},
#endif
+#ifdef CONFIG_MEMCG_KMEM
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.failcnt",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+#endif
{ }, /* terminate */
};
@@ -4973,6 +5084,7 @@ mem_cgroup_create(struct cgroup *cont)
if (parent && parent->use_hierarchy) {
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
+ res_counter_init(&memcg->kmem, &parent->kmem);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -4980,9 +5092,11 @@ mem_cgroup_create(struct cgroup *cont)
* mem_cgroup(see mem_cgroup_put).
*/
mem_cgroup_get(parent);
+ memcg_propagate_kmem(memcg, parent);
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
+ res_counter_init(&memcg->kmem, NULL);
/*
* Deeper hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 04/14] kmem accounting basic infrastructure
2012-10-16 10:16 ` [PATCH v5 04/14] kmem accounting basic infrastructure Glauber Costa
@ 2012-10-16 12:14 ` Michal Hocko
[not found] ` <1350382611-20579-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-17 22:12 ` Andrew Morton
2 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2012-10-16 12:14 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel
On Tue 16-10-12 14:16:41, Glauber Costa wrote:
> This patch adds the basic infrastructure for the accounting of kernel
> memory. To control that, the following files are created:
>
> * memory.kmem.usage_in_bytes
> * memory.kmem.limit_in_bytes
> * memory.kmem.failcnt
> * memory.kmem.max_usage_in_bytes
>
> They have the same meaning of their user memory counterparts. They
> reflect the state of the "kmem" res_counter.
>
> Per cgroup kmem memory accounting is not enabled until a limit is set
> for the group. Once the limit is set the accounting cannot be disabled
> for that group. This means that after the patch is applied, no
> behavioral changes exists for whoever is still using memcg to control
> their memory usage, until memory.kmem.limit_in_bytes is set for the
> first time.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. A equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
>
> People who want to track kernel memory but not limit it, can set this
> limit to a very high number (like RESOURCE_MAX - 1page - that no one
> will ever hit, or equal to the user memory)
>
> [ v4: make kmem files part of the main array;
> do not allow limit to be set for non-empty cgroups ]
> [ v5: cosmetic changes ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
> ---
> mm/memcontrol.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 115 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71d259e..30eafeb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -266,6 +266,10 @@ struct mem_cgroup {
> };
>
> /*
> + * the counter to account for kernel memory usage.
> + */
> + struct res_counter kmem;
> + /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> */
> @@ -280,6 +284,7 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
>
> bool oom_lock;
> atomic_t under_oom;
> @@ -332,6 +337,20 @@ struct mem_cgroup {
> #endif
> };
>
> +/* internal only representation about the status of kmem accounting. */
> +enum {
> + KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
> +};
> +
> +#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void memcg_kmem_set_active(struct mem_cgroup *memcg)
> +{
> + set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> +}
> +#endif
> +
> /* Stuffs for move charges at task migration. */
> /*
> * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
> @@ -390,6 +409,7 @@ enum res_type {
> _MEM,
> _MEMSWAP,
> _OOM_TYPE,
> + _KMEM,
> };
>
> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> @@ -1433,6 +1453,10 @@ done:
> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> + printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> + res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> }
>
> /*
> @@ -3940,6 +3964,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> else
> val = res_counter_read_u64(&memcg->memsw, name);
> break;
> + case _KMEM:
> + val = res_counter_read_u64(&memcg->kmem, name);
> + break;
> default:
> BUG();
> }
> @@ -3947,6 +3974,57 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> }
> +
> +static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> +{
> + int ret = -EINVAL;
> +#ifdef CONFIG_MEMCG_KMEM
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> + /*
> + * For simplicity, we won't allow this to be disabled. It also can't
> + * be changed if the cgroup has children already, or if tasks had
> + * already joined.
> + *
> + * If tasks join before we set the limit, a person looking at
> + * kmem.usage_in_bytes will have no way to determine when it took
> + * place, which makes the value quite meaningless.
> + *
> + * After it first became limited, changes in the value of the limit are
> + * of course permitted.
> + *
> + * Taking the cgroup_lock is really offensive, but it is so far the only
> + * way to guarantee that no children will appear. There are plenty of
> + * other offenders, and they should all go away. Fine grained locking
> + * is probably the way to go here. When we are fully hierarchical, we
> + * can also get rid of the use_hierarchy check.
> + */
> + cgroup_lock();
> + mutex_lock(&set_limit_mutex);
> + if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
> + if (cgroup_task_count(cont) || (memcg->use_hierarchy &&
> + !list_empty(&cont->children))) {
> + ret = -EBUSY;
> + goto out;
> + }
> + ret = res_counter_set_limit(&memcg->kmem, val);
> + VM_BUG_ON(ret);
> +
> + memcg_kmem_set_active(memcg);
> + } else
> + ret = res_counter_set_limit(&memcg->kmem, val);
> +out:
> + mutex_unlock(&set_limit_mutex);
> + cgroup_unlock();
> +#endif
> + return ret;
> +}
> +
> +static void memcg_propagate_kmem(struct mem_cgroup *memcg,
> + struct mem_cgroup *parent)
> +{
> + memcg->kmem_accounted = parent->kmem_accounted;
> +}
> +
> /*
> * The user of this function is...
> * RES_LIMIT.
> @@ -3978,8 +4056,12 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
> break;
> if (type == _MEM)
> ret = mem_cgroup_resize_limit(memcg, val);
> - else
> + else if (type == _MEMSWAP)
> ret = mem_cgroup_resize_memsw_limit(memcg, val);
> + else if (type == _KMEM)
> + ret = memcg_update_kmem_limit(cont, val);
> + else
> + return -EINVAL;
> break;
> case RES_SOFT_LIMIT:
> ret = res_counter_memparse_write_strategy(buffer, &val);
> @@ -4045,12 +4127,16 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> case RES_MAX_USAGE:
> if (type == _MEM)
> res_counter_reset_max(&memcg->res);
> + else if (type == _KMEM)
> + res_counter_reset_max(&memcg->kmem);
> else
> res_counter_reset_max(&memcg->memsw);
> break;
> case RES_FAILCNT:
> if (type == _MEM)
> res_counter_reset_failcnt(&memcg->res);
> + else if (type == _KMEM)
> + res_counter_reset_failcnt(&memcg->kmem);
> else
> res_counter_reset_failcnt(&memcg->memsw);
> break;
> @@ -4728,6 +4814,31 @@ static struct cftype mem_cgroup_files[] = {
> .read = mem_cgroup_read,
> },
> #endif
> +#ifdef CONFIG_MEMCG_KMEM
> + {
> + .name = "kmem.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> + .write_string = mem_cgroup_write,
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.failcnt",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> + .trigger = mem_cgroup_reset,
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.max_usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> + .trigger = mem_cgroup_reset,
> + .read = mem_cgroup_read,
> + },
> +#endif
> { }, /* terminate */
> };
>
> @@ -4973,6 +5084,7 @@ mem_cgroup_create(struct cgroup *cont)
> if (parent && parent->use_hierarchy) {
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> + res_counter_init(&memcg->kmem, &parent->kmem);
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -4980,9 +5092,11 @@ mem_cgroup_create(struct cgroup *cont)
> * mem_cgroup(see mem_cgroup_put).
> */
> mem_cgroup_get(parent);
> + memcg_propagate_kmem(memcg, parent);
> } else {
> res_counter_init(&memcg->res, NULL);
> res_counter_init(&memcg->memsw, NULL);
> + res_counter_init(&memcg->kmem, NULL);
> /*
> * Deeper hierachy with use_hierarchy == false doesn't make
> * much sense so let cgroup subsystem know about this
> --
> 1.7.11.7
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread[parent not found: <1350382611-20579-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v5 04/14] kmem accounting basic infrastructure
[not found] ` <1350382611-20579-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-10-17 22:08 ` David Rientjes
[not found] ` <alpine.DEB.2.00.1210171455010.20712-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: David Rientjes @ 2012-10-17 22:08 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, Pekka Enberg, devel-GEFAQzZX7r8dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Tue, 16 Oct 2012, Glauber Costa wrote:
> This patch adds the basic infrastructure for the accounting of kernel
> memory. To control that, the following files are created:
>
> * memory.kmem.usage_in_bytes
> * memory.kmem.limit_in_bytes
> * memory.kmem.failcnt
> * memory.kmem.max_usage_in_bytes
>
> They have the same meaning of their user memory counterparts. They
> reflect the state of the "kmem" res_counter.
>
> Per cgroup kmem memory accounting is not enabled until a limit is set
> for the group. Once the limit is set the accounting cannot be disabled
> for that group. This means that after the patch is applied, no
> behavioral changes exists for whoever is still using memcg to control
> their memory usage, until memory.kmem.limit_in_bytes is set for the
> first time.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. A equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
>
> People who want to track kernel memory but not limit it, can set this
> limit to a very high number (like RESOURCE_MAX - 1page - that no one
> will ever hit, or equal to the user memory)
>
> [ v4: make kmem files part of the main array;
> do not allow limit to be set for non-empty cgroups ]
> [ v5: cosmetic changes ]
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
> mm/memcontrol.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 115 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 71d259e..30eafeb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -266,6 +266,10 @@ struct mem_cgroup {
> };
>
> /*
> + * the counter to account for kernel memory usage.
> + */
> + struct res_counter kmem;
> + /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> */
> @@ -280,6 +284,7 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
I think this should be named kmem_account_flags or kmem_flags, otherwise
it appears that this is the actual account.
>
> bool oom_lock;
> atomic_t under_oom;
> @@ -332,6 +337,20 @@ struct mem_cgroup {
> #endif
> };
>
> +/* internal only representation about the status of kmem accounting. */
> +enum {
> + KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
> +};
> +
> +#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
> +
> +#ifdef CONFIG_MEMCG_KMEM
memcg->kmem_accounted isn't only defined for this configuration, so would
it be simpler to define this unconditionally?
> +static void memcg_kmem_set_active(struct mem_cgroup *memcg)
inline?
> +{
> + set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> +}
> +#endif
> +
> /* Stuffs for move charges at task migration. */
> /*
> * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
> @@ -390,6 +409,7 @@ enum res_type {
> _MEM,
> _MEMSWAP,
> _OOM_TYPE,
> + _KMEM,
> };
>
> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> @@ -1433,6 +1453,10 @@ done:
> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> + printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> + res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
> }
>
> /*
> @@ -3940,6 +3964,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> else
> val = res_counter_read_u64(&memcg->memsw, name);
> break;
> + case _KMEM:
> + val = res_counter_read_u64(&memcg->kmem, name);
> + break;
> default:
> BUG();
> }
> @@ -3947,6 +3974,57 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> }
> +
> +static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
> +{
> + int ret = -EINVAL;
> +#ifdef CONFIG_MEMCG_KMEM
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> + /*
> + * For simplicity, we won't allow this to be disabled. It also can't
> + * be changed if the cgroup has children already, or if tasks had
> + * already joined.
> + *
> + * If tasks join before we set the limit, a person looking at
> + * kmem.usage_in_bytes will have no way to determine when it took
> + * place, which makes the value quite meaningless.
> + *
> + * After it first became limited, changes in the value of the limit are
> + * of course permitted.
> + *
> + * Taking the cgroup_lock is really offensive, but it is so far the only
> + * way to guarantee that no children will appear. There are plenty of
> + * other offenders, and they should all go away. Fine grained locking
> + * is probably the way to go here. When we are fully hierarchical, we
> + * can also get rid of the use_hierarchy check.
Not sure it's so offensive, it's a pretty standard way of ensuring that
cont->children doesn't get manipulated in a race.
> + */
> + cgroup_lock();
> + mutex_lock(&set_limit_mutex);
> + if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
> + if (cgroup_task_count(cont) || (memcg->use_hierarchy &&
> + !list_empty(&cont->children))) {
> + ret = -EBUSY;
> + goto out;
> + }
> + ret = res_counter_set_limit(&memcg->kmem, val);
> + VM_BUG_ON(ret);
> +
> + memcg_kmem_set_active(memcg);
> + } else
> + ret = res_counter_set_limit(&memcg->kmem, val);
> +out:
> + mutex_unlock(&set_limit_mutex);
> + cgroup_unlock();
> +#endif
> + return ret;
> +}
> +
> +static void memcg_propagate_kmem(struct mem_cgroup *memcg,
> + struct mem_cgroup *parent)
> +{
> + memcg->kmem_accounted = parent->kmem_accounted;
> +}
> +
> /*
> * The user of this function is...
> * RES_LIMIT.
> @@ -3978,8 +4056,12 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
> break;
> if (type == _MEM)
> ret = mem_cgroup_resize_limit(memcg, val);
> - else
> + else if (type == _MEMSWAP)
> ret = mem_cgroup_resize_memsw_limit(memcg, val);
> + else if (type == _KMEM)
> + ret = memcg_update_kmem_limit(cont, val);
> + else
> + return -EINVAL;
I like how this is done in a maintainable way to ensure no other types can
inadvertently update the memsw limit as it was previously written. All
other returns of -EINVAL just cause the switch statement to break, though,
rather than return directly.
> break;
> case RES_SOFT_LIMIT:
> ret = res_counter_memparse_write_strategy(buffer, &val);
> @@ -4045,12 +4127,16 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> case RES_MAX_USAGE:
> if (type == _MEM)
> res_counter_reset_max(&memcg->res);
> + else if (type == _KMEM)
> + res_counter_reset_max(&memcg->kmem);
Could this be written in the same way above, i.e. check _MEMSWAP to pass
memcg->memsw, _KMEM for memcg->kmem, etc?
> else
> res_counter_reset_max(&memcg->memsw);
> break;
> case RES_FAILCNT:
> if (type == _MEM)
> res_counter_reset_failcnt(&memcg->res);
> + else if (type == _KMEM)
> + res_counter_reset_failcnt(&memcg->kmem);
Same.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 04/14] kmem accounting basic infrastructure
2012-10-16 10:16 ` [PATCH v5 04/14] kmem accounting basic infrastructure Glauber Costa
2012-10-16 12:14 ` Michal Hocko
[not found] ` <1350382611-20579-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-10-17 22:12 ` Andrew Morton
[not found] ` <20121017151207.e8bb3db2.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel
On Tue, 16 Oct 2012 14:16:41 +0400
Glauber Costa <glommer@parallels.com> wrote:
> This patch adds the basic infrastructure for the accounting of kernel
> memory. To control that, the following files are created:
>
> * memory.kmem.usage_in_bytes
> * memory.kmem.limit_in_bytes
> * memory.kmem.failcnt
gargh. "failcnt" is not a word. Who was it who first thought that
omitting voewls from words improves anything?
Sigh. That pooch is already screwed and there's nothing we can do
about it now.
> * memory.kmem.max_usage_in_bytes
>
> They have the same meaning of their user memory counterparts. They
> reflect the state of the "kmem" res_counter.
>
> Per cgroup kmem memory accounting is not enabled until a limit is set
> for the group. Once the limit is set the accounting cannot be disabled
> for that group. This means that after the patch is applied, no
> behavioral changes exists for whoever is still using memcg to control
> their memory usage, until memory.kmem.limit_in_bytes is set for the
> first time.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. A equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
>
> People who want to track kernel memory but not limit it, can set this
> limit to a very high number (like RESOURCE_MAX - 1page - that no one
> will ever hit, or equal to the user memory)
>
>
> ...
>
> +/* internal only representation about the status of kmem accounting. */
> +enum {
> + KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
> +};
> +
> +#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void memcg_kmem_set_active(struct mem_cgroup *memcg)
> +{
> + set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> +}
> +#endif
I don't think memcg_kmem_set_active() really needs to exist. It has a
single caller and is unlikely to get any additional callers, so just
open-code it there?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 05/14] Add a __GFP_KMEMCG flag
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (3 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 04/14] kmem accounting basic infrastructure Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
[not found] ` <1350382611-20579-6-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 10:16 ` [PATCH v5 06/14] memcg: kmem controller infrastructure Glauber Costa
` (9 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
This flag is used to indicate to the callees that this allocation is a
kernel allocation in process context, and should be accounted to
current's memcg. It takes numerical place of the of the recently removed
__GFP_NO_KSWAPD.
[ v4: make flag unconditional, also declare it in trace code ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Michal Hocko <mhocko@suse.cz>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
include/linux/gfp.h | 3 ++-
include/trace/events/gfpflags.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 02c1c97..9289d46 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,6 +31,7 @@ struct vm_area_struct;
#define ___GFP_THISNODE 0x40000u
#define ___GFP_RECLAIMABLE 0x80000u
#define ___GFP_NOTRACK 0x200000u
+#define ___GFP_KMEMCG 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
@@ -87,7 +88,7 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
-
+#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9391706..730df12 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -36,6 +36,7 @@
{(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
{(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"}, \
{(unsigned long)__GFP_NOTRACK, "GFP_NOTRACK"}, \
+ {(unsigned long)__GFP_KMEMCG, "GFP_KMEMCG"}, \
{(unsigned long)__GFP_OTHER_NODE, "GFP_OTHER_NODE"} \
) : "GFP_NOWAIT"
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (4 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 05/14] Add a __GFP_KMEMCG flag Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 6:40 ` Kamezawa Hiroyuki
` (2 more replies)
2012-10-16 10:16 ` [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg Glauber Costa
` (8 subsequent siblings)
14 siblings, 3 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg
This patch introduces infrastructure for tracking kernel memory pages to
a given memcg. This will happen whenever the caller includes the flag
__GFP_KMEMCG flag, and the task belong to a memcg other than the root.
In memcontrol.h those functions are wrapped in inline acessors. The
idea is to later on, patch those with static branches, so we don't incur
any overhead when no mem cgroups with limited kmem are being used.
Users of this functionality shall interact with the memcg core code
through the following functions:
memcg_kmem_newpage_charge: will return true if the group can handle the
allocation. At this point, struct page is not
yet allocated.
memcg_kmem_commit_charge: will either revert the charge, if struct page
allocation failed, or embed memcg information
into page_cgroup.
memcg_kmem_uncharge_page: called at free time, will revert the charge.
[ v2: improved comments and standardized function names ]
[ v3: handle no longer opaque, functions not exported,
even more comments ]
[ v4: reworked Used bit handling and surroundings for more clarity ]
[ v5: simplified code for kmemcg compiled out and core functions in
memcontrol.c, moved kmem code to the middle to avoid forward decls ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Tejun Heo <tj@kernel.org>
---
include/linux/memcontrol.h | 98 ++++++++++++++++++++++++++
mm/memcontrol.c | 169 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 267 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8d9489f..303a456 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
+#include <linux/hardirq.h>
struct mem_cgroup;
struct page_cgroup;
@@ -399,6 +400,88 @@ struct sock;
#ifdef CONFIG_MEMCG_KMEM
void sock_update_memcg(struct sock *sk);
void sock_release_memcg(struct sock *sk);
+
+static inline bool memcg_kmem_enabled(void)
+{
+ return true;
+}
+
+bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
+ int order);
+void __memcg_kmem_commit_charge(struct page *page,
+ struct mem_cgroup *memcg, int order);
+void __memcg_kmem_uncharge_page(struct page *page, int order);
+
+/**
+ * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
+ * @gfp: the gfp allocation flags.
+ * @memcg: a pointer to the memcg this was charged against.
+ * @order: allocation order.
+ *
+ * returns true if the memcg where the current task belongs can hold this
+ * allocation.
+ *
+ * We return true automatically if this allocation is not to be accounted to
+ * any memcg.
+ */
+static __always_inline bool
+memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+{
+ if (!memcg_kmem_enabled())
+ return true;
+
+ /*
+ * __GFP_NOFAIL allocations will move on even if charging is not
+ * possible. Therefore we don't even try, and have this allocation
+ * unaccounted. We could in theory charge it with
+ * res_counter_charge_nofail, but we hope those allocations are rare,
+ * and won't be worth the trouble.
+ */
+ if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
+ return true;
+ if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
+ return true;
+
+ /* If the test is dying, just let it go. */
+ if (unlikely(test_thread_flag(TIF_MEMDIE)
+ || fatal_signal_pending(current)))
+ return true;
+
+ return __memcg_kmem_newpage_charge(gfp, memcg, order);
+}
+
+/**
+ * memcg_kmem_uncharge_page: uncharge pages from memcg
+ * @page: pointer to struct page being freed
+ * @order: allocation order.
+ *
+ * there is no need to specify memcg here, since it is embedded in page_cgroup
+ */
+static __always_inline void
+memcg_kmem_uncharge_page(struct page *page, int order)
+{
+ if (memcg_kmem_enabled())
+ __memcg_kmem_uncharge_page(page, order);
+}
+
+/**
+ * memcg_kmem_commit_charge: embeds correct memcg in a page
+ * @page: pointer to struct page recently allocated
+ * @memcg: the memcg structure we charged against
+ * @order: allocation order.
+ *
+ * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
+ * failure of the allocation. if @page is NULL, this function will revert the
+ * charges. Otherwise, it will commit the memcg given by @memcg to the
+ * corresponding page_cgroup.
+ */
+static __always_inline void
+memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+{
+ if (memcg_kmem_enabled() && memcg)
+ __memcg_kmem_commit_charge(page, memcg, order);
+}
+
#else
static inline void sock_update_memcg(struct sock *sk)
{
@@ -406,6 +489,21 @@ static inline void sock_update_memcg(struct sock *sk)
static inline void sock_release_memcg(struct sock *sk)
{
}
+
+static inline bool
+memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+{
+ return true;
+}
+
+static inline void memcg_kmem_uncharge_page(struct page *page, int order)
+{
+}
+
+static inline void
+memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30eafeb..1182188 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -10,6 +10,10 @@
* Copyright (C) 2009 Nokia Corporation
* Author: Kirill A. Shutemov
*
+ * Kernel Memory Controller
+ * Copyright (C) 2012 Parallels Inc. and Google Inc.
+ * Authors: Glauber Costa and Suleiman Souhlal
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -2630,6 +2634,171 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
memcg_check_events(memcg, page);
}
+#ifdef CONFIG_MEMCG_KMEM
+static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
+{
+ return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
+ (memcg->kmem_accounted & KMEM_ACCOUNTED_MASK);
+}
+
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+ struct res_counter *fail_res;
+ struct mem_cgroup *_memcg;
+ int ret = 0;
+ bool may_oom;
+
+ ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+ if (ret)
+ return ret;
+
+ /*
+ * Conditions under which we can wait for the oom_killer.
+ * We have to be able to wait, but also, if we can't retry,
+ * we obviously shouldn't go mess with oom.
+ */
+ may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
+
+ _memcg = memcg;
+ ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
+ &_memcg, may_oom);
+
+ if (ret == -EINTR) {
+ /*
+ * __mem_cgroup_try_charge() chosed to bypass to root due to
+ * OOM kill or fatal signal. Since our only options are to
+ * either fail the allocation or charge it to this cgroup, do
+ * it as a temporary condition. But we can't fail. From a
+ * kmem/slab perspective, the cache has already been selected,
+ * by mem_cgroup_get_kmem_cache(), so it is too late to change
+ * our minds. This condition will only trigger if the task
+ * entered memcg_charge_kmem in a sane state, but was
+ * OOM-killed. during __mem_cgroup_try_charge. Tasks that are
+ * already dying when the allocation triggers should have been
+ * already directed to the root cgroup.
+ */
+ res_counter_charge_nofail(&memcg->res, size, &fail_res);
+ if (do_swap_account)
+ res_counter_charge_nofail(&memcg->memsw, size,
+ &fail_res);
+ ret = 0;
+ } else if (ret)
+ res_counter_uncharge(&memcg->kmem, size);
+
+ return ret;
+}
+
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
+{
+ res_counter_uncharge(&memcg->kmem, size);
+ res_counter_uncharge(&memcg->res, size);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, size);
+}
+
+/*
+ * We need to verify if the allocation against current->mm->owner's memcg is
+ * possible for the given order. But the page is not allocated yet, so we'll
+ * need a further commit step to do the final arrangements.
+ *
+ * It is possible for the task to switch cgroups in this mean time, so at
+ * commit time, we can't rely on task conversion any longer. We'll then use
+ * the handle argument to return to the caller which cgroup we should commit
+ * against. We could also return the memcg directly and avoid the pointer
+ * passing, but a boolean return value gives better semantics considering
+ * the compiled-out case as well.
+ *
+ * Returning true means the allocation is possible.
+ */
+bool
+__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
+{
+ struct mem_cgroup *memcg;
+ int ret;
+
+ *_memcg = NULL;
+ memcg = try_get_mem_cgroup_from_mm(current->mm);
+
+ /*
+ * very rare case described in mem_cgroup_from_task. Unfortunately there
+ * isn't much we can do without complicating this too much, and it would
+ * be gfp-dependent anyway. Just let it go
+ */
+ if (unlikely(!memcg))
+ return true;
+
+ if (!memcg_can_account_kmem(memcg)) {
+ css_put(&memcg->css);
+ return true;
+ }
+
+ mem_cgroup_get(memcg);
+
+ ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
+ if (!ret)
+ *_memcg = memcg;
+ else
+ mem_cgroup_put(memcg);
+
+ css_put(&memcg->css);
+ return (ret == 0);
+}
+
+void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
+ int order)
+{
+ struct page_cgroup *pc;
+
+ VM_BUG_ON(mem_cgroup_is_root(memcg));
+
+ /* The page allocation failed. Revert */
+ if (!page) {
+ memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ mem_cgroup_put(memcg);
+ return;
+ }
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ pc->mem_cgroup = memcg;
+ SetPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+}
+
+void __memcg_kmem_uncharge_page(struct page *page, int order)
+{
+ struct mem_cgroup *memcg = NULL;
+ struct page_cgroup *pc;
+
+
+ pc = lookup_page_cgroup(page);
+ /*
+ * Fast unlocked return. Theoretically might have changed, have to
+ * check again after locking.
+ */
+ if (!PageCgroupUsed(pc))
+ return;
+
+ lock_page_cgroup(pc);
+ if (PageCgroupUsed(pc)) {
+ memcg = pc->mem_cgroup;
+ ClearPageCgroupUsed(pc);
+ }
+ unlock_page_cgroup(pc);
+
+ /*
+ * We trust that only if there is a memcg associated with the page, it
+ * is a valid allocation
+ */
+ if (!memcg)
+ return;
+
+ VM_BUG_ON(mem_cgroup_is_root(memcg));
+ memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+ mem_cgroup_put(memcg);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-16 10:16 ` [PATCH v5 06/14] memcg: kmem controller infrastructure Glauber Costa
@ 2012-10-17 6:40 ` Kamezawa Hiroyuki
2012-10-17 22:12 ` Andrew Morton
2012-10-17 22:37 ` David Rientjes
2 siblings, 0 replies; 70+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-17 6:40 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, Christoph Lameter, David Rientjes,
Pekka Enberg, devel, linux-kernel, Pekka Enberg
(2012/10/16 19:16), Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline acessors. The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> Users of this functionality shall interact with the memcg core code
> through the following functions:
>
> memcg_kmem_newpage_charge: will return true if the group can handle the
> allocation. At this point, struct page is not
> yet allocated.
>
> memcg_kmem_commit_charge: will either revert the charge, if struct page
> allocation failed, or embed memcg information
> into page_cgroup.
>
> memcg_kmem_uncharge_page: called at free time, will revert the charge.
>
> [ v2: improved comments and standardized function names ]
> [ v3: handle no longer opaque, functions not exported,
> even more comments ]
> [ v4: reworked Used bit handling and surroundings for more clarity ]
> [ v5: simplified code for kmemcg compiled out and core functions in
> memcontrol.c, moved kmem code to the middle to avoid forward decls ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-16 10:16 ` [PATCH v5 06/14] memcg: kmem controller infrastructure Glauber Costa
2012-10-17 6:40 ` Kamezawa Hiroyuki
@ 2012-10-17 22:12 ` Andrew Morton
2012-10-18 9:16 ` Glauber Costa
2012-10-17 22:37 ` David Rientjes
2 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg
On Tue, 16 Oct 2012 14:16:43 +0400
Glauber Costa <glommer@parallels.com> wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline acessors. The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> Users of this functionality shall interact with the memcg core code
> through the following functions:
>
> memcg_kmem_newpage_charge: will return true if the group can handle the
> allocation. At this point, struct page is not
> yet allocated.
>
> memcg_kmem_commit_charge: will either revert the charge, if struct page
> allocation failed, or embed memcg information
> into page_cgroup.
>
> memcg_kmem_uncharge_page: called at free time, will revert the charge.
>
> ...
>
> +static __always_inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> + if (!memcg_kmem_enabled())
> + return true;
> +
> + /*
> + * __GFP_NOFAIL allocations will move on even if charging is not
> + * possible. Therefore we don't even try, and have this allocation
> + * unaccounted. We could in theory charge it with
> + * res_counter_charge_nofail, but we hope those allocations are rare,
> + * and won't be worth the trouble.
> + */
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> +
> + /* If the test is dying, just let it go. */
> + if (unlikely(test_thread_flag(TIF_MEMDIE)
> + || fatal_signal_pending(current)))
> + return true;
> +
> + return __memcg_kmem_newpage_charge(gfp, memcg, order);
> +}
That's a big function! Why was it __always_inline? I'd have thought
it would be better to move the code after memcg_kmem_enabled() out of
line.
Do we actually need to test PF_KTHREAD when current->mm == NULL?
Perhaps because of aio threads whcih temporarily adopt a userspace mm?
> +/**
> + * memcg_kmem_uncharge_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_uncharge_page(struct page *page, int order)
> +{
> + if (memcg_kmem_enabled())
> + __memcg_kmem_uncharge_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_charge: embeds correct memcg in a page
> + * @page: pointer to struct page recently allocated
> + * @memcg: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @memcg to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
> +{
> + if (memcg_kmem_enabled() && memcg)
> + __memcg_kmem_commit_charge(page, memcg, order);
> +}
I suspect the __always_inline's here are to do with static branch
trickery. A code comment is warranted if so?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-17 22:12 ` Andrew Morton
@ 2012-10-18 9:16 ` Glauber Costa
2012-10-18 22:06 ` David Rientjes
0 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-18 9:16 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:43 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> This patch introduces infrastructure for tracking kernel memory pages to
>> a given memcg. This will happen whenever the caller includes the flag
>> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>>
>> In memcontrol.h those functions are wrapped in inline acessors. The
>> idea is to later on, patch those with static branches, so we don't incur
>> any overhead when no mem cgroups with limited kmem are being used.
>>
>> Users of this functionality shall interact with the memcg core code
>> through the following functions:
>>
>> memcg_kmem_newpage_charge: will return true if the group can handle the
>> allocation. At this point, struct page is not
>> yet allocated.
>>
>> memcg_kmem_commit_charge: will either revert the charge, if struct page
>> allocation failed, or embed memcg information
>> into page_cgroup.
>>
>> memcg_kmem_uncharge_page: called at free time, will revert the charge.
>>
>> ...
>>
>> +static __always_inline bool
>> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
>> +{
>> + if (!memcg_kmem_enabled())
>> + return true;
>> +
>> + /*
>> + * __GFP_NOFAIL allocations will move on even if charging is not
>> + * possible. Therefore we don't even try, and have this allocation
>> + * unaccounted. We could in theory charge it with
>> + * res_counter_charge_nofail, but we hope those allocations are rare,
>> + * and won't be worth the trouble.
>> + */
>> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>> + return true;
>> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>> + return true;
>> +
>> + /* If the test is dying, just let it go. */
>> + if (unlikely(test_thread_flag(TIF_MEMDIE)
>> + || fatal_signal_pending(current)))
>> + return true;
>> +
>> + return __memcg_kmem_newpage_charge(gfp, memcg, order);
>> +}
>
> That's a big function! Why was it __always_inline? I'd have thought
> it would be better to move the code after memcg_kmem_enabled() out of
> line.
>
it is big, but it is mostly bit testing. So the goal here is to avoid a
function call at all costs, this being a fast path.
> Do we actually need to test PF_KTHREAD when current->mm == NULL?
> Perhaps because of aio threads whcih temporarily adopt a userspace mm?
I believe so. I remember I discussed this in the past with David
Rientjes and he advised me to test for both.
>
>> +/**
>> + * memcg_kmem_uncharge_page: uncharge pages from memcg
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_uncharge_page(struct page *page, int order)
>> +{
>> + if (memcg_kmem_enabled())
>> + __memcg_kmem_uncharge_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_charge: embeds correct memcg in a page
>> + * @page: pointer to struct page recently allocated
>> + * @memcg: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
>> + * failure of the allocation. if @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @memcg to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
>> +{
>> + if (memcg_kmem_enabled() && memcg)
>> + __memcg_kmem_commit_charge(page, memcg, order);
>> +}
>
> I suspect the __always_inline's here are to do with static branch
> trickery. A code comment is warranted if so?
>
Not necessarily. Same thing as above. We want to avoid function calls in
those sites.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-18 9:16 ` Glauber Costa
@ 2012-10-18 22:06 ` David Rientjes
2012-10-19 9:10 ` Glauber Costa
0 siblings, 1 reply; 70+ messages in thread
From: David Rientjes @ 2012-10-18 22:06 UTC (permalink / raw)
To: Glauber Costa
Cc: Andrew Morton, linux-mm, cgroups, Mel Gorman, Tejun Heo,
Michal Hocko, Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
Pekka Enberg, devel, linux-kernel, Pekka Enberg
On Thu, 18 Oct 2012, Glauber Costa wrote:
> > Do we actually need to test PF_KTHREAD when current->mm == NULL?
> > Perhaps because of aio threads whcih temporarily adopt a userspace mm?
>
> I believe so. I remember I discussed this in the past with David
> Rientjes and he advised me to test for both.
>
PF_KTHREAD can do use_mm() to assume an ->mm but hopefully they aren't
allocating slab while doing so. Have you considered actually charging
current->mm->owner for that memory, though, since the kthread will have
freed the memory before unuse_mm() or otherwise have charged it on behalf
of a user process, i.e. only exempting PF_KTHREAD?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-18 22:06 ` David Rientjes
@ 2012-10-19 9:10 ` Glauber Costa
2012-10-19 9:31 ` David Rientjes
0 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-19 9:10 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, linux-mm, cgroups, Mel Gorman, Tejun Heo,
Michal Hocko, Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
Pekka Enberg, devel, linux-kernel, Pekka Enberg
On 10/19/2012 02:06 AM, David Rientjes wrote:
> On Thu, 18 Oct 2012, Glauber Costa wrote:
>
>>> Do we actually need to test PF_KTHREAD when current->mm == NULL?
>>> Perhaps because of aio threads whcih temporarily adopt a userspace mm?
>>
>> I believe so. I remember I discussed this in the past with David
>> Rientjes and he advised me to test for both.
>>
>
> PF_KTHREAD can do use_mm() to assume an ->mm but hopefully they aren't
> allocating slab while doing so. Have you considered actually charging
> current->mm->owner for that memory, though, since the kthread will have
> freed the memory before unuse_mm() or otherwise have charged it on behalf
> of a user process, i.e. only exempting PF_KTHREAD?
>
I always charge current->mm->owner.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-19 9:10 ` Glauber Costa
@ 2012-10-19 9:31 ` David Rientjes
2012-10-19 10:00 ` Glauber Costa
0 siblings, 1 reply; 70+ messages in thread
From: David Rientjes @ 2012-10-19 9:31 UTC (permalink / raw)
To: Glauber Costa
Cc: Andrew Morton, linux-mm, cgroups, Mel Gorman, Tejun Heo,
Michal Hocko, Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
Pekka Enberg, devel, linux-kernel, Pekka Enberg
On Fri, 19 Oct 2012, Glauber Costa wrote:
> >>> Do we actually need to test PF_KTHREAD when current->mm == NULL?
> >>> Perhaps because of aio threads whcih temporarily adopt a userspace mm?
> >>
> >> I believe so. I remember I discussed this in the past with David
> >> Rientjes and he advised me to test for both.
> >>
> >
> > PF_KTHREAD can do use_mm() to assume an ->mm but hopefully they aren't
> > allocating slab while doing so. Have you considered actually charging
> > current->mm->owner for that memory, though, since the kthread will have
> > freed the memory before unuse_mm() or otherwise have charged it on behalf
> > of a user process, i.e. only exempting PF_KTHREAD?
> >
> I always charge current->mm->owner.
>
Yeah, I'm asking have you considered charging current->mm->owner for the
memory when a kthread (current) assumes the mm of a user process via
use_mm()? It may free the memory before calling unuse_mm(), but it's also
allocating the memory on behalf of a user so this exemption might be
dangerous if use_mm() becomes more popular. I don't think there's
anything that prevents that charge, I'm just wondering if you considered
doing it even for kthreads with an mm.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-19 9:31 ` David Rientjes
@ 2012-10-19 10:00 ` Glauber Costa
0 siblings, 0 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-19 10:00 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, linux-mm, cgroups, Mel Gorman, Tejun Heo,
Michal Hocko, Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
Pekka Enberg, devel, linux-kernel, Pekka Enberg
On 10/19/2012 01:31 PM, David Rientjes wrote:
> On Fri, 19 Oct 2012, Glauber Costa wrote:
>
>>>>> Do we actually need to test PF_KTHREAD when current->mm == NULL?
>>>>> Perhaps because of aio threads whcih temporarily adopt a userspace mm?
>>>>
>>>> I believe so. I remember I discussed this in the past with David
>>>> Rientjes and he advised me to test for both.
>>>>
>>>
>>> PF_KTHREAD can do use_mm() to assume an ->mm but hopefully they aren't
>>> allocating slab while doing so. Have you considered actually charging
>>> current->mm->owner for that memory, though, since the kthread will have
>>> freed the memory before unuse_mm() or otherwise have charged it on behalf
>>> of a user process, i.e. only exempting PF_KTHREAD?
>>>
>> I always charge current->mm->owner.
>>
>
> Yeah, I'm asking have you considered charging current->mm->owner for the
> memory when a kthread (current) assumes the mm of a user process via
> use_mm()? It may free the memory before calling unuse_mm(), but it's also
> allocating the memory on behalf of a user so this exemption might be
> dangerous if use_mm() becomes more popular. I don't think there's
> anything that prevents that charge, I'm just wondering if you considered
> doing it even for kthreads with an mm.
>
Well, I thought about it.
And I personally don't like it. I think all kthreads should be treated
the same. We have control over it, unlike any userspace application. We
never expect its memory consumption to explode.
Specially considering that those allocations are supposed to be
short-lived, we are only paying the res_counters count for no reason.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-16 10:16 ` [PATCH v5 06/14] memcg: kmem controller infrastructure Glauber Costa
2012-10-17 6:40 ` Kamezawa Hiroyuki
2012-10-17 22:12 ` Andrew Morton
@ 2012-10-17 22:37 ` David Rientjes
2012-10-18 9:23 ` Glauber Costa
2 siblings, 1 reply; 70+ messages in thread
From: David Rientjes @ 2012-10-17 22:37 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, devel, linux-kernel,
Pekka Enberg
On Tue, 16 Oct 2012, Glauber Costa wrote:
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..303a456 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
> #define _LINUX_MEMCONTROL_H
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -399,6 +400,88 @@ struct sock;
> #ifdef CONFIG_MEMCG_KMEM
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
> +
> +static inline bool memcg_kmem_enabled(void)
> +{
> + return true;
> +}
> +
> +bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
> + int order);
> +void __memcg_kmem_commit_charge(struct page *page,
> + struct mem_cgroup *memcg, int order);
> +void __memcg_kmem_uncharge_page(struct page *page, int order);
> +
> +/**
> + * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @memcg: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> + if (!memcg_kmem_enabled())
> + return true;
> +
> + /*
> + * __GFP_NOFAIL allocations will move on even if charging is not
> + * possible. Therefore we don't even try, and have this allocation
> + * unaccounted. We could in theory charge it with
> + * res_counter_charge_nofail, but we hope those allocations are rare,
> + * and won't be worth the trouble.
> + */
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> +
> + /* If the test is dying, just let it go. */
> + if (unlikely(test_thread_flag(TIF_MEMDIE)
> + || fatal_signal_pending(current)))
> + return true;
This can be simplified to just check fatal_signal_pending(), all threads
with TIF_MEMDIE also have a pending SIGKILL.
It also has whitespace damage.
> +
> + return __memcg_kmem_newpage_charge(gfp, memcg, order);
> +}
> +
> +/**
> + * memcg_kmem_uncharge_page: uncharge pages from memcg
Should be memcg_kmem_uncharge_pages() since it takes an order argument?
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_uncharge_page(struct page *page, int order)
> +{
> + if (memcg_kmem_enabled())
> + __memcg_kmem_uncharge_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_charge: embeds correct memcg in a page
> + * @page: pointer to struct page recently allocated
> + * @memcg: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @memcg to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
> +{
> + if (memcg_kmem_enabled() && memcg)
> + __memcg_kmem_commit_charge(page, memcg, order);
> +}
> +
> #else
> static inline void sock_update_memcg(struct sock *sk)
> {
> @@ -406,6 +489,21 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +static inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> + return true;
> +}
> +
> +static inline void memcg_kmem_uncharge_page(struct page *page, int order)
Two spaces.
> +{
> +}
> +
> +static inline void
> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
> +{
> +}
> #endif /* CONFIG_MEMCG_KMEM */
> #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 30eafeb..1182188 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
> * Copyright (C) 2009 Nokia Corporation
> * Author: Kirill A. Shutemov
> *
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -2630,6 +2634,171 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
> memcg_check_events(memcg, page);
> }
>
> +#ifdef CONFIG_MEMCG_KMEM
> +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
> +{
> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
> + (memcg->kmem_accounted & KMEM_ACCOUNTED_MASK);
> +}
> +
> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
> +{
> + struct res_counter *fail_res;
> + struct mem_cgroup *_memcg;
> + int ret = 0;
> + bool may_oom;
> +
> + ret = res_counter_charge(&memcg->kmem, size, &fail_res);
> + if (ret)
> + return ret;
> +
> + /*
> + * Conditions under which we can wait for the oom_killer.
> + * We have to be able to wait, but also, if we can't retry,
> + * we obviously shouldn't go mess with oom.
> + */
> + may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
What about gfp & __GFP_FS?
> +
> + _memcg = memcg;
> + ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
> + &_memcg, may_oom);
> +
> + if (ret == -EINTR) {
> + /*
> + * __mem_cgroup_try_charge() chosed to bypass to root due to
> + * OOM kill or fatal signal. Since our only options are to
> + * either fail the allocation or charge it to this cgroup, do
> + * it as a temporary condition. But we can't fail. From a
> + * kmem/slab perspective, the cache has already been selected,
> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
> + * our minds. This condition will only trigger if the task
> + * entered memcg_charge_kmem in a sane state, but was
> + * OOM-killed. during __mem_cgroup_try_charge. Tasks that are
Looks like some copy-and-paste damage.
> + * already dying when the allocation triggers should have been
> + * already directed to the root cgroup.
> + */
> + res_counter_charge_nofail(&memcg->res, size, &fail_res);
> + if (do_swap_account)
> + res_counter_charge_nofail(&memcg->memsw, size,
> + &fail_res);
> + ret = 0;
> + } else if (ret)
> + res_counter_uncharge(&memcg->kmem, size);
> +
> + return ret;
> +}
> +
> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
> +{
> + res_counter_uncharge(&memcg->kmem, size);
> + res_counter_uncharge(&memcg->res, size);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, size);
> +}
> +
> +/*
> + * We need to verify if the allocation against current->mm->owner's memcg is
> + * possible for the given order. But the page is not allocated yet, so we'll
> + * need a further commit step to do the final arrangements.
> + *
> + * It is possible for the task to switch cgroups in this mean time, so at
> + * commit time, we can't rely on task conversion any longer. We'll then use
> + * the handle argument to return to the caller which cgroup we should commit
> + * against. We could also return the memcg directly and avoid the pointer
> + * passing, but a boolean return value gives better semantics considering
> + * the compiled-out case as well.
> + *
> + * Returning true means the allocation is possible.
> + */
> +bool
> +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
> +{
> + struct mem_cgroup *memcg;
> + int ret;
> +
> + *_memcg = NULL;
> + memcg = try_get_mem_cgroup_from_mm(current->mm);
> +
> + /*
> + * very rare case described in mem_cgroup_from_task. Unfortunately there
> + * isn't much we can do without complicating this too much, and it would
> + * be gfp-dependent anyway. Just let it go
> + */
> + if (unlikely(!memcg))
> + return true;
> +
> + if (!memcg_can_account_kmem(memcg)) {
> + css_put(&memcg->css);
> + return true;
> + }
> +
> + mem_cgroup_get(memcg);
> +
> + ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
> + if (!ret)
> + *_memcg = memcg;
> + else
> + mem_cgroup_put(memcg);
> +
> + css_put(&memcg->css);
> + return (ret == 0);
> +}
> +
> +void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
> + int order)
> +{
> + struct page_cgroup *pc;
> +
> + VM_BUG_ON(mem_cgroup_is_root(memcg));
> +
> + /* The page allocation failed. Revert */
> + if (!page) {
> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> + mem_cgroup_put(memcg);
> + return;
> + }
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +}
> +
> +void __memcg_kmem_uncharge_page(struct page *page, int order)
> +{
> + struct mem_cgroup *memcg = NULL;
> + struct page_cgroup *pc;
> +
> +
> + pc = lookup_page_cgroup(page);
> + /*
> + * Fast unlocked return. Theoretically might have changed, have to
> + * check again after locking.
> + */
> + if (!PageCgroupUsed(pc))
> + return;
> +
> + lock_page_cgroup(pc);
> + if (PageCgroupUsed(pc)) {
> + memcg = pc->mem_cgroup;
> + ClearPageCgroupUsed(pc);
> + }
> + unlock_page_cgroup(pc);
> +
> + /*
> + * We trust that only if there is a memcg associated with the page, it
> + * is a valid allocation
> + */
> + if (!memcg)
> + return;
> +
> + VM_BUG_ON(mem_cgroup_is_root(memcg));
> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
> + mem_cgroup_put(memcg);
Should this mem_cgroup_put() be done conditionally on
memcg->kmem_accounted & KMEM_ACCOUNTED_MASK?
The next patch in the series does memcg_kmem_newpage_charge() in the page
allocator which will return true for memcg_can_account_kmem() without
doing mem_cgroup_get().
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> #define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
2012-10-17 22:37 ` David Rientjes
@ 2012-10-18 9:23 ` Glauber Costa
[not found] ` <507FCA90.8060307-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-18 9:23 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, devel, linux-kernel,
Pekka Enberg
On 10/18/2012 02:37 AM, David Rientjes wrote:
> On Tue, 16 Oct 2012, Glauber Costa wrote:
>
>> + /* If the test is dying, just let it go. */
>> + if (unlikely(test_thread_flag(TIF_MEMDIE)
>> + || fatal_signal_pending(current)))
>> + return true;
>
> This can be simplified to just check fatal_signal_pending(), all threads
> with TIF_MEMDIE also have a pending SIGKILL.
Yes, I believe it is better. I will change.
>> +
>> + return __memcg_kmem_newpage_charge(gfp, memcg, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_uncharge_page: uncharge pages from memcg
>
> Should be memcg_kmem_uncharge_pages() since it takes an order argument?
>
I tried to use naming as close as possible to user-memcg. But to be
fair, their are always calling it page-by-page, so pages() won't be a
problem here.
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_uncharge_page(struct page *page, int order)
>> +{
>> + if (memcg_kmem_enabled())
>> + __memcg_kmem_uncharge_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_charge: embeds correct memcg in a page
>> + * @page: pointer to struct page recently allocated
>> + * @memcg: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
>> + * failure of the allocation. if @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @memcg to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
>> +{
>> + if (memcg_kmem_enabled() && memcg)
>> + __memcg_kmem_commit_charge(page, memcg, order);
>> +}
>> +
>> #else
>> static inline void sock_update_memcg(struct sock *sk)
>> {
>> @@ -406,6 +489,21 @@ static inline void sock_update_memcg(struct sock *sk)
>> static inline void sock_release_memcg(struct sock *sk)
>> {
>> }
>> +
>> +static inline bool
>> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
>> +{
>> + return true;
>> +}
>> +
>> +static inline void memcg_kmem_uncharge_page(struct page *page, int order)
>
> Two spaces.
>
Thanks.
>> +{
>> +}
>> +
>> +static inline void
>> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
>> +{
>> +}
>> #endif /* CONFIG_MEMCG_KMEM */
>> #endif /* _LINUX_MEMCONTROL_H */
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 30eafeb..1182188 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -10,6 +10,10 @@
>> * Copyright (C) 2009 Nokia Corporation
>> * Author: Kirill A. Shutemov
>> *
>> + * Kernel Memory Controller
>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>> + * Authors: Glauber Costa and Suleiman Souhlal
>> + *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License as published by
>> * the Free Software Foundation; either version 2 of the License, or
>> @@ -2630,6 +2634,171 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>> memcg_check_events(memcg, page);
>> }
>>
>> +#ifdef CONFIG_MEMCG_KMEM
>> +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>> +{
>> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
>> + (memcg->kmem_accounted & KMEM_ACCOUNTED_MASK);
>> +}
>> +
>> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>> +{
>> + struct res_counter *fail_res;
>> + struct mem_cgroup *_memcg;
>> + int ret = 0;
>> + bool may_oom;
>> +
>> + ret = res_counter_charge(&memcg->kmem, size, &fail_res);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * Conditions under which we can wait for the oom_killer.
>> + * We have to be able to wait, but also, if we can't retry,
>> + * we obviously shouldn't go mess with oom.
>> + */
>> + may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
>
> What about gfp & __GFP_FS?
>
Do you intend to prevent or allow OOM under that flag? I personally
think that anything that accepts to be OOM-killed should have GFP_WAIT
set, so that ought to be enough.
>> +
>> + _memcg = memcg;
>> + ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
>> + &_memcg, may_oom);
>> +
>> + if (ret == -EINTR) {
>> + /*
>> + * __mem_cgroup_try_charge() chosed to bypass to root due to
>> + * OOM kill or fatal signal. Since our only options are to
>> + * either fail the allocation or charge it to this cgroup, do
>> + * it as a temporary condition. But we can't fail. From a
>> + * kmem/slab perspective, the cache has already been selected,
>> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
>> + * our minds. This condition will only trigger if the task
>> + * entered memcg_charge_kmem in a sane state, but was
>> + * OOM-killed. during __mem_cgroup_try_charge. Tasks that are
>
> Looks like some copy-and-paste damage.
>
thanks.
>> +void __memcg_kmem_uncharge_page(struct page *page, int order)
>> +{
>> + struct mem_cgroup *memcg = NULL;
>> + struct page_cgroup *pc;
>> +
>> +
>> + pc = lookup_page_cgroup(page);
>> + /*
>> + * Fast unlocked return. Theoretically might have changed, have to
>> + * check again after locking.
>> + */
>> + if (!PageCgroupUsed(pc))
>> + return;
>> +
>> + lock_page_cgroup(pc);
>> + if (PageCgroupUsed(pc)) {
>> + memcg = pc->mem_cgroup;
>> + ClearPageCgroupUsed(pc);
>> + }
>> + unlock_page_cgroup(pc);
>> +
>> + /*
>> + * We trust that only if there is a memcg associated with the page, it
>> + * is a valid allocation
>> + */
>> + if (!memcg)
>> + return;
>> +
>> + VM_BUG_ON(mem_cgroup_is_root(memcg));
>> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
>> + mem_cgroup_put(memcg);
>
> Should this mem_cgroup_put() be done conditionally on
> memcg->kmem_accounted & KMEM_ACCOUNTED_MASK?
>
> The next patch in the series does memcg_kmem_newpage_charge() in the page
> allocator which will return true for memcg_can_account_kmem() without
> doing mem_cgroup_get().
>
And then this put will go away as well.
I am not testing for memcg_can_account_kmem in here, because having or
not having the PageCgroupUsed bit set (and therefore, a valid memcg) in
page_cgroup should be the most robust test here.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (5 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 06/14] memcg: kmem controller infrastructure Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 22:12 ` Andrew Morton
[not found] ` <1350382611-20579-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 10:16 ` [PATCH v5 08/14] res_counter: return amount of charges after res_counter_uncharge Glauber Costa
` (7 subsequent siblings)
14 siblings, 2 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.
To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no reason,
whoever is marking its allocations with __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().
[ v2: inverted test order to avoid a memcg_get leak,
free_accounted_pages simplification ]
[ v4: test for TIF_MEMDIE at newpage_charge ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
include/linux/gfp.h | 3 +++
mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9289d46..8f6fe34 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -362,6 +362,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, int cold);
extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern void __free_accounted_pages(struct page *page, unsigned int order);
+extern void free_accounted_pages(unsigned long addr, unsigned int order);
+
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index feddc7f..dcf33ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2595,6 +2595,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
int migratetype = allocflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
+ struct mem_cgroup *memcg = NULL;
gfp_mask &= gfp_allowed_mask;
@@ -2613,6 +2614,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
+ /*
+ * Will only have any effect when __GFP_KMEMCG is set. This is
+ * verified in the (always inline) callee
+ */
+ if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
+ return NULL;
+
retry_cpuset:
cpuset_mems_cookie = get_mems_allowed();
@@ -2648,6 +2656,8 @@ out:
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
+ memcg_kmem_commit_charge(page, memcg, order);
+
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2700,6 +2710,31 @@ void free_pages(unsigned long addr, unsigned int order)
EXPORT_SYMBOL(free_pages);
+/*
+ * __free_accounted_pages and free_accounted_pages will free pages allocated
+ * with __GFP_KMEMCG.
+ *
+ * Those pages are accounted to a particular memcg, embedded in the
+ * corresponding page_cgroup. To avoid adding a hit in the allocator to search
+ * for that information only to find out that it is NULL for users who have no
+ * interest in that whatsoever, we provide these functions.
+ *
+ * The caller knows better which flags it relies on.
+ */
+void __free_accounted_pages(struct page *page, unsigned int order)
+{
+ memcg_kmem_uncharge_page(page, order);
+ __free_pages(page, order);
+}
+
+void free_accounted_pages(unsigned long addr, unsigned int order)
+{
+ if (addr != 0) {
+ VM_BUG_ON(!virt_addr_valid((void *)addr));
+ __free_accounted_pages(virt_to_page((void *)addr), order);
+ }
+}
+
static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
{
if (addr) {
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
2012-10-16 10:16 ` [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg Glauber Costa
@ 2012-10-17 22:12 ` Andrew Morton
2012-10-18 9:24 ` Glauber Costa
[not found] ` <20121017151221.4c420e5a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
[not found] ` <1350382611-20579-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
1 sibling, 2 replies; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg,
Suleiman Souhlal
On Tue, 16 Oct 2012 14:16:44 +0400
Glauber Costa <glommer@parallels.com> wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time.
Well, why? Was that the correct decision?
> This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
These are very general-sounding names. I'd expect the identifiers to
contain "memcg" and/or "kmem", to identify what's going on.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
2012-10-17 22:12 ` Andrew Morton
@ 2012-10-18 9:24 ` Glauber Costa
[not found] ` <507FCADF.20109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
[not found] ` <20121017151221.4c420e5a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
1 sibling, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-18 9:24 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg,
Suleiman Souhlal
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:44 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time.
>
> Well, why? Was that the correct decision?
>
I don't fully understand your question. Is this the same question you
posed in patch 0, about marking some versus marking all? If so, I
believe I should have answered it there.
If not, please explain.
>> This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>
> These are very general-sounding names. I'd expect the identifiers to
> contain "memcg" and/or "kmem", to identify what's going on.
>
Changed.
^ permalink raw reply [flat|nested] 70+ messages in thread[parent not found: <20121017151221.4c420e5a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
[not found] ` <20121017151221.4c420e5a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2012-10-18 11:53 ` Glauber Costa
0 siblings, 0 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-18 11:53 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Mel Gorman, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Christoph Lameter,
David Rientjes, Pekka Enberg, devel-GEFAQzZX7r8dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Pekka Enberg,
Suleiman Souhlal
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:44 +0400
> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
>
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time.
>
> Well, why? Was that the correct decision?
>
>> This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>
> These are very general-sounding names. I'd expect the identifiers to
> contain "memcg" and/or "kmem", to identify what's going on.
>
I've just changed to free_memcg_kmem_pages.
Let me know if the name is better.
^ permalink raw reply [flat|nested] 70+ messages in thread
[parent not found: <1350382611-20579-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
[not found] ` <1350382611-20579-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-10-16 15:31 ` Christoph Lameter
[not found] ` <0000013a6a333867-52b0d904-5ca2-4095-8c46-5a4dfc021cde-000000-p/GC64/jrecnJqMo6gzdpkEOCMrvLtNR@public.gmane.org>
2012-10-17 22:43 ` David Rientjes
1 sibling, 1 reply; 70+ messages in thread
From: Christoph Lameter @ 2012-10-16 15:31 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
David Rientjes, Pekka Enberg, devel-GEFAQzZX7r8dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Pekka Enberg,
Suleiman Souhlal
On Tue, 16 Oct 2012, Glauber Costa wrote:
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
Hmmm... The code paths to free pages are often shared between multiple
subsystems. Are you sure that this is actually working and accurately
tracks the MEMCG pages?
> +/*
> + * __free_accounted_pages and free_accounted_pages will free pages allocated
> + * with __GFP_KMEMCG.
> + *
> + * Those pages are accounted to a particular memcg, embedded in the
> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
> + * for that information only to find out that it is NULL for users who have no
> + * interest in that whatsoever, we provide these functions.
> + *
> + * The caller knows better which flags it relies on.
> + */
> +void __free_accounted_pages(struct page *page, unsigned int order)
> +{
> + memcg_kmem_uncharge_page(page, order);
> + __free_pages(page, order);
> +}
If we already are introducing such an API: Could it not be made more
general so that it can also be used in the future to communicate other
characteristics of a page on free?
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
[not found] ` <1350382611-20579-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 15:31 ` Christoph Lameter
@ 2012-10-17 22:43 ` David Rientjes
1 sibling, 0 replies; 70+ messages in thread
From: David Rientjes @ 2012-10-17 22:43 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, Pekka Enberg, devel-GEFAQzZX7r8dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Pekka Enberg,
Suleiman Souhlal
On Tue, 16 Oct 2012, Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> [ v2: inverted test order to avoid a memcg_get leak,
> free_accounted_pages simplification ]
> [ v4: test for TIF_MEMDIE at newpage_charge ]
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 08/14] res_counter: return amount of charges after res_counter_uncharge
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (6 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
[not found] ` <1350382611-20579-9-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 10:16 ` [PATCH v5 09/14] memcg: kmem accounting lifecycle management Glauber Costa
` (6 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Suleiman Souhlal
It is useful to know how many charges are still left after a call to
res_counter_uncharge. While it is possible to issue a res_counter_read
after uncharge, this can be racy.
If we need, for instance, to take some action when the counters drop
down to 0, only one of the callers should see it. This is the same
semantics as the atomic variables in the kernel.
Since the current return value is void, we don't need to worry about
anything breaking due to this change: nobody relied on that, and only
users appearing from now on will be checking this value.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
Documentation/cgroups/resource_counter.txt | 7 ++++---
include/linux/res_counter.h | 12 +++++++-----
kernel/res_counter.c | 20 +++++++++++++-------
3 files changed, 24 insertions(+), 15 deletions(-)
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 0c4a344..c4d99ed 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,16 +83,17 @@ to work with it.
res_counter->lock internally (it must be called with res_counter->lock
held). The force parameter indicates whether we can bypass the limit.
- e. void res_counter_uncharge[_locked]
+ e. u64 res_counter_uncharge[_locked]
(struct res_counter *rc, unsigned long val)
When a resource is released (freed) it should be de-accounted
from the resource counter it was accounted to. This is called
- "uncharging".
+ "uncharging". The return value of this function indicate the amount
+ of charges still present in the counter.
The _locked routines imply that the res_counter->lock is taken.
- f. void res_counter_uncharge_until
+ f. u64 res_counter_uncharge_until
(struct res_counter *rc, struct res_counter *top,
unsinged long val)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 7d7fbe2..4b173b6 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -130,14 +130,16 @@ int res_counter_charge_nofail(struct res_counter *counter,
*
* these calls check for usage underflow and show a warning on the console
* _locked call expects the counter->lock to be taken
+ *
+ * returns the total charges still present in @counter.
*/
-void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val);
+u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
+u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge_until(struct res_counter *counter,
- struct res_counter *top,
- unsigned long val);
+u64 res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val);
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index ad581aa..7b3d6dc 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -86,33 +86,39 @@ int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
return __res_counter_charge(counter, val, limit_fail_at, true);
}
-void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
+u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
if (WARN_ON(counter->usage < val))
val = counter->usage;
counter->usage -= val;
+ return counter->usage;
}
-void res_counter_uncharge_until(struct res_counter *counter,
- struct res_counter *top,
- unsigned long val)
+u64 res_counter_uncharge_until(struct res_counter *counter,
+ struct res_counter *top,
+ unsigned long val)
{
unsigned long flags;
struct res_counter *c;
+ u64 ret = 0;
local_irq_save(flags);
for (c = counter; c != top; c = c->parent) {
+ u64 r;
spin_lock(&c->lock);
- res_counter_uncharge_locked(c, val);
+ r = res_counter_uncharge_locked(c, val);
+ if (c == counter)
+ ret = r;
spin_unlock(&c->lock);
}
local_irq_restore(flags);
+ return ret;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
- res_counter_uncharge_until(counter, NULL, val);
+ return res_counter_uncharge_until(counter, NULL, val);
}
static inline unsigned long long *
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* [PATCH v5 09/14] memcg: kmem accounting lifecycle management
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (7 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 08/14] res_counter: return amount of charges after res_counter_uncharge Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 23:28 ` David Rientjes
2012-10-16 10:16 ` [PATCH v5 10/14] memcg: use static branches when code not in use Glauber Costa
` (5 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
Because kmem charges can outlive the cgroup, we need to make sure that
we won't free the memcg structure while charges are still in flight.
For reviewing simplicity, the charge functions will issue
mem_cgroup_get() at every charge, and mem_cgroup_put() at every
uncharge.
This can get expensive, however, and we can do better. mem_cgroup_get()
only really needs to be issued once: when the first limit is set. In the
same spirit, we only need to issue mem_cgroup_put() when the last charge
is gone.
We'll need an extra bit in kmem_accounted for that: KMEM_ACCOUNTED_DEAD.
it will be set when the cgroup dies, if there are charges in the group.
If there aren't, we can proceed right away.
Our uncharge function will have to test that bit every time the charges
drop to 0. Because that is not the likely output of
res_counter_uncharge, this should not impose a big hit on us: it is
certainly much better than a reference count decrease at every
operation.
[ v3: merged all lifecycle related patches in one ]
[ v5: changed memcg_kmem_dead's name ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1182188..e24b388 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,6 +344,7 @@ struct mem_cgroup {
/* internal only representation about the status of kmem accounting. */
enum {
KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+ KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */
};
#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
@@ -353,6 +354,22 @@ static void memcg_kmem_set_active(struct mem_cgroup *memcg)
{
set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
}
+
+static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+ return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
+}
+
+static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+{
+ if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted))
+ set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted);
+}
+
+static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+{
+ return test_and_clear_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted);
+}
#endif
/* Stuffs for move charges at task migration. */
@@ -2690,10 +2707,16 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
{
- res_counter_uncharge(&memcg->kmem, size);
res_counter_uncharge(&memcg->res, size);
if (do_swap_account)
res_counter_uncharge(&memcg->memsw, size);
+
+ /* Not down to 0 */
+ if (res_counter_uncharge(&memcg->kmem, size))
+ return;
+
+ if (memcg_kmem_test_and_clear_dead(memcg))
+ mem_cgroup_put(memcg);
}
/*
@@ -2732,13 +2755,9 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
return true;
}
- mem_cgroup_get(memcg);
-
ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
if (!ret)
*_memcg = memcg;
- else
- mem_cgroup_put(memcg);
css_put(&memcg->css);
return (ret == 0);
@@ -2754,7 +2773,6 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
/* The page allocation failed. Revert */
if (!page) {
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
- mem_cgroup_put(memcg);
return;
}
@@ -2795,7 +2813,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
VM_BUG_ON(mem_cgroup_is_root(memcg));
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
- mem_cgroup_put(memcg);
}
#endif /* CONFIG_MEMCG_KMEM */
@@ -4179,6 +4196,13 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
VM_BUG_ON(ret);
memcg_kmem_set_active(memcg);
+ /*
+ * kmem charges can outlive the cgroup. In the case of slab
+ * pages, for instance, a page contain objects from various
+ * processes, so it is unfeasible to migrate them away. We
+ * need to reference count the memcg because of that.
+ */
+ mem_cgroup_get(memcg);
} else
ret = res_counter_set_limit(&memcg->kmem, val);
out:
@@ -4192,6 +4216,10 @@ static void memcg_propagate_kmem(struct mem_cgroup *memcg,
struct mem_cgroup *parent)
{
memcg->kmem_accounted = parent->kmem_accounted;
+#ifdef CONFIG_MEMCG_KMEM
+ if (memcg_kmem_is_active(memcg))
+ mem_cgroup_get(memcg);
+#endif
}
/*
@@ -4875,6 +4903,20 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
{
mem_cgroup_sockets_destroy(memcg);
+
+ memcg_kmem_mark_dead(memcg);
+
+ if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
+ return;
+
+ /*
+ * Charges already down to 0, undo mem_cgroup_get() done in the charge
+ * path here, being careful not to race with memcg_uncharge_kmem: it is
+ * possible that the charges went down to 0 between mark_dead and the
+ * res_counter read, so in that case, we don't need the put
+ */
+ if (memcg_kmem_test_and_clear_dead(memcg))
+ mem_cgroup_put(memcg);
}
#else
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
--
1.7.11.7
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 09/14] memcg: kmem accounting lifecycle management
2012-10-16 10:16 ` [PATCH v5 09/14] memcg: kmem accounting lifecycle management Glauber Costa
@ 2012-10-17 23:28 ` David Rientjes
[not found] ` <alpine.DEB.2.00.1210171624540.20813-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: David Rientjes @ 2012-10-17 23:28 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, devel, linux-kernel,
Pekka Enberg, Suleiman Souhlal
On Tue, 16 Oct 2012, Glauber Costa wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1182188..e24b388 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -344,6 +344,7 @@ struct mem_cgroup {
> /* internal only representation about the status of kmem accounting. */
> enum {
> KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
> + KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */
"dead memcg with pending kmem charges" seems better.
> };
>
> #define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
> @@ -353,6 +354,22 @@ static void memcg_kmem_set_active(struct mem_cgroup *memcg)
> {
> set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> }
> +
> +static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> + return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> +}
I think all of these should be inline.
> +
> +static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
> +{
> + if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted))
> + set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted);
> +}
The set_bit() doesn't happen atomically with the test_bit(), what
synchronization is required for this?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 10/14] memcg: use static branches when code not in use
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (8 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 09/14] memcg: kmem accounting lifecycle management Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-16 10:16 ` [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed Glauber Costa
` (4 subsequent siblings)
14 siblings, 0 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
We can use static branches to patch the code in or out when not used.
Because the _ACTIVE bit on kmem_accounted is only set after the
increment is done, we guarantee that the root memcg will always be
selected for kmem charges until all call sites are patched (see
memcg_kmem_enabled). This guarantees that no mischarges are applied.
static branch decrement happens when the last reference count from the
kmem accounting in memcg dies. This will only happen when the charges
drop down to 0.
When that happen, we need to disable the static branch only on those
memcgs that enabled it. To achieve this, we would be forced to
complicate the code by keeping track of which memcgs were the ones
that actually enabled limits, and which ones got it from its parents.
It is a lot simpler just to do static_key_slow_inc() on every child
that is accounted.
[ v4: adapted this patch to the changes in kmem_accounted ]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
include/linux/memcontrol.h | 4 ++-
mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 303a456..34e96cf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
#include <linux/hardirq.h>
+#include <linux/jump_label.h>
struct mem_cgroup;
struct page_cgroup;
@@ -401,9 +402,10 @@ struct sock;
void sock_update_memcg(struct sock *sk);
void sock_release_memcg(struct sock *sk);
+extern struct static_key memcg_kmem_enabled_key;
static inline bool memcg_kmem_enabled(void)
{
- return true;
+ return static_key_false(&memcg_kmem_enabled_key);
}
bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e24b388..1dd31a1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,10 +344,13 @@ struct mem_cgroup {
/* internal only representation about the status of kmem accounting. */
enum {
KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+ KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */
};
-#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
+/* We account when limit is on, but only after call sites are patched */
+#define KMEM_ACCOUNTED_MASK \
+ ((1 << KMEM_ACCOUNTED_ACTIVE) | (1 << KMEM_ACCOUNTED_ACTIVATED))
#ifdef CONFIG_MEMCG_KMEM
static void memcg_kmem_set_active(struct mem_cgroup *memcg)
@@ -360,6 +363,11 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
}
+static void memcg_kmem_set_activated(struct mem_cgroup *memcg)
+{
+ set_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_accounted);
+}
+
static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
{
if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted))
@@ -529,6 +537,26 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
}
#endif
+#ifdef CONFIG_MEMCG_KMEM
+struct static_key memcg_kmem_enabled_key;
+
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+ if (memcg_kmem_is_active(memcg))
+ static_key_slow_dec(&memcg_kmem_enabled_key);
+}
+#else
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
+static void disarm_static_keys(struct mem_cgroup *memcg)
+{
+ disarm_sock_keys(memcg);
+ disarm_kmem_keys(memcg);
+}
+
static void drain_all_stock_async(struct mem_cgroup *memcg);
static struct mem_cgroup_per_zone *
@@ -4165,6 +4193,8 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
{
int ret = -EINVAL;
#ifdef CONFIG_MEMCG_KMEM
+ bool must_inc_static_branch = false;
+
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
/*
* For simplicity, we won't allow this to be disabled. It also can't
@@ -4195,7 +4225,15 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
ret = res_counter_set_limit(&memcg->kmem, val);
VM_BUG_ON(ret);
- memcg_kmem_set_active(memcg);
+ /*
+ * After this point, kmem_accounted (that we test atomically in
+ * the beginning of this conditional), is no longer 0. This
+ * guarantees only one process will set the following boolean
+ * to true. We don't need test_and_set because we're protected
+ * by the set_limit_mutex anyway.
+ */
+ memcg_kmem_set_activated(memcg);
+ must_inc_static_branch = true;
/*
* kmem charges can outlive the cgroup. In the case of slab
* pages, for instance, a page contain objects from various
@@ -4208,6 +4246,27 @@ static int memcg_update_kmem_limit(struct cgroup *cont, u64 val)
out:
mutex_unlock(&set_limit_mutex);
cgroup_unlock();
+
+ /*
+ * We are by now familiar with the fact that we can't inc the static
+ * branch inside cgroup_lock. See disarm functions for details. A
+ * worker here is overkill, but also wrong: After the limit is set, we
+ * must start accounting right away. Since this operation can't fail,
+ * we can safely defer it to here - no rollback will be needed.
+ *
+ * The boolean used to control this is also safe, because
+ * KMEM_ACCOUNTED_ACTIVATED guarantees that only one process will be
+ * able to set it to true;
+ */
+ if (must_inc_static_branch) {
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+ /*
+ * setting the active bit after the inc will guarantee no one
+ * starts accounting before all call sites are patched
+ */
+ memcg_kmem_set_active(memcg);
+ }
+
#endif
return ret;
}
@@ -4217,8 +4276,20 @@ static void memcg_propagate_kmem(struct mem_cgroup *memcg,
{
memcg->kmem_accounted = parent->kmem_accounted;
#ifdef CONFIG_MEMCG_KMEM
- if (memcg_kmem_is_active(memcg))
+ /*
+ * When that happen, we need to disable the static branch only on those
+ * memcgs that enabled it. To achieve this, we would be forced to
+ * complicate the code by keeping track of which memcgs were the ones
+ * that actually enabled limits, and which ones got it from its
+ * parents.
+ *
+ * It is a lot simpler just to do static_key_slow_inc() on every child
+ * that is accounted.
+ */
+ if (memcg_kmem_is_active(memcg)) {
mem_cgroup_get(memcg);
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+ }
#endif
}
@@ -5138,7 +5209,7 @@ static void free_work(struct work_struct *work)
* to move this code around, and make sure it is outside
* the cgroup_lock.
*/
- disarm_sock_keys(memcg);
+ disarm_static_keys(memcg);
if (size < PAGE_SIZE)
kfree(memcg);
else
--
1.7.11.7
^ permalink raw reply related [flat|nested] 70+ messages in thread* [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed.
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (9 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 10/14] memcg: use static branches when code not in use Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 22:12 ` Andrew Morton
2012-10-16 10:16 ` [PATCH v5 12/14] execute the whole memcg freeing in free_worker Glauber Costa
` (3 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
Because the ultimate goal of the kmem tracking in memcg is to track slab
pages as well, we can't guarantee that we'll always be able to point a
page to a particular process, and migrate the charges along with it -
since in the common case, a page will contain data belonging to multiple
processes.
Because of that, when we destroy a memcg, we only make sure the
destruction will succeed by discounting the kmem charges from the user
charges when we try to empty the cgroup.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1dd31a1..9086cf2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -544,6 +544,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
if (memcg_kmem_is_active(memcg))
static_key_slow_dec(&memcg_kmem_enabled_key);
+ /*
+ * This check can't live in kmem destruction function,
+ * since the charges will outlive the cgroup
+ */
+ WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
}
#else
static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -3993,6 +3998,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
int node, zid, shrink;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct cgroup *cgrp = memcg->css.cgroup;
+ u64 usage;
css_get(&memcg->css);
@@ -4026,8 +4032,17 @@ move_account:
mem_cgroup_end_move(memcg);
memcg_oom_recover(memcg);
cond_resched();
+ /*
+ * Kernel memory may not necessarily be trackable to a specific
+ * process. So they are not migrated, and therefore we can't
+ * expect their value to drop to 0 here.
+ *
+ * having res filled up with kmem only is enough
+ */
+ usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
+ res_counter_read_u64(&memcg->kmem, RES_USAGE);
/* "ret" should also be checked to ensure all lists are empty. */
- } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
+ } while (usage > 0 || ret);
out:
css_put(&memcg->css);
return ret;
--
1.7.11.7
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed.
2012-10-16 10:16 ` [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed Glauber Costa
@ 2012-10-17 22:12 ` Andrew Morton
2012-10-18 9:33 ` Glauber Costa
0 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg,
Suleiman Souhlal
On Tue, 16 Oct 2012 14:16:48 +0400
Glauber Costa <glommer@parallels.com> wrote:
> Because the ultimate goal of the kmem tracking in memcg is to track slab
> pages as well,
It is? For a major patchset such as this, it's pretty important to
discuss such long-term plans in the top-level discussion. Covering
things such as expected complexity, expected performance hit, how these
plans affected the current implementation, etc.
The main reason for this is that if the future plans appear to be of
doubtful feasibility and the current implementation isn't sufficiently
useful without the future stuff, we shouldn't merge the current
implementation. It's a big issue!
> we can't guarantee that we'll always be able to point a
> page to a particular process, and migrate the charges along with it -
> since in the common case, a page will contain data belonging to multiple
> processes.
>
> Because of that, when we destroy a memcg, we only make sure the
> destruction will succeed by discounting the kmem charges from the user
> charges when we try to empty the cgroup.
>
> ...
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed.
2012-10-17 22:12 ` Andrew Morton
@ 2012-10-18 9:33 ` Glauber Costa
0 siblings, 0 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-18 9:33 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg,
Suleiman Souhlal
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:48 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> Because the ultimate goal of the kmem tracking in memcg is to track slab
>> pages as well,
>
> It is? For a major patchset such as this, it's pretty important to
> discuss such long-term plans in the top-level discussion. Covering
> things such as expected complexity, expected performance hit, how these
> plans affected the current implementation, etc.
>
> The main reason for this is that if the future plans appear to be of
> doubtful feasibility and the current implementation isn't sufficiently
> useful without the future stuff, we shouldn't merge the current
> implementation. It's a big issue!
>
Not really. I am not talking about plans when it comes to slab. The code
is there, and usually always posted to linux-mm a few days after I post
this series. It also lives in the kmemcg-slab branch in my git tree.
I am trying to logically split it in two to aid reviewers work. I may
have made a mistake by splitting it this way, but so far I think it was
the right decision: it allowed people to focus on a part of the work
first, instead of going all the way in a 30-patch patch series that
would be merged atomically.
I believe they should be merged separately, to allow us to find any
issues easier. But I also believe that this "separate" should ultimately
live in the same merge window.
Pekka, from the slab side, already stated that 3.8 would not be
unreasonable.
As for the perfomance hit, my latest benchmark, quoted in the opening
mail of this series already include results for both patchsets.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 12/14] execute the whole memcg freeing in free_worker
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (10 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 6:56 ` Kamezawa Hiroyuki
2012-10-16 10:16 ` [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
` (2 subsequent siblings)
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa
A lot of the initialization we do in mem_cgroup_create() is done with
softirqs enabled. This include grabbing a css id, which holds
&ss->id_lock->rlock, and the per-zone trees, which holds
rtpz->lock->rlock. All of those signal to the lockdep mechanism that
those locks can be used in SOFTIRQ-ON-W context. This means that the
freeing of memcg structure must happen in a compatible context,
otherwise we'll get a deadlock, like the one bellow, caught by lockdep:
[<ffffffff81103095>] free_accounted_pages+0x47/0x4c
[<ffffffff81047f90>] free_task+0x31/0x5c
[<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
[<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
[<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
[<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
[<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
[<ffffffff8105266d>] __do_softirq+0x122/0x277
This usage pattern could not be triggered before kmem came into play.
With the introduction of kmem stack handling, it is possible that we
call the last mem_cgroup_put() from the task destructor, which is run in
an rcu callback. Such callbacks are run with softirqs disabled, leading
to the offensive usage pattern.
In general, we have little, if any, means to guarantee in which context
the last memcg_put will happen. The best we can do is test it and try to
make sure no invalid context releases are happening. But as we add more
code to memcg, the possible interactions grow in number and expose more
ways to get context conflicts. One thing to keep in mind, is that part
of the freeing process is already deferred to a worker, such as vfree(),
that can only be called from process context.
For the moment, the only two functions we really need moved away are:
* free_css_id(), and
* mem_cgroup_remove_from_trees().
But because the later accesses per-zone info,
free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we
are left with the per_cpu stats only. Better move it all.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Tested-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++----------------------------
1 file changed, 34 insertions(+), 32 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9086cf2..f323c2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5203,16 +5203,29 @@ out_free:
}
/*
- * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
- * but in process context. The work_freeing structure is overlaid
- * on the rcu_freeing structure, which itself is overlaid on memsw.
+ * At destroying mem_cgroup, references from swap_cgroup can remain.
+ * (scanning all at force_empty is too costly...)
+ *
+ * Instead of clearing all references at force_empty, we remember
+ * the number of reference from swap_cgroup and free mem_cgroup when
+ * it goes down to 0.
+ *
+ * Removal of cgroup itself succeeds regardless of refs from swap.
*/
-static void free_work(struct work_struct *work)
+
+static void __mem_cgroup_free(struct mem_cgroup *memcg)
{
- struct mem_cgroup *memcg;
+ int node;
int size = sizeof(struct mem_cgroup);
- memcg = container_of(work, struct mem_cgroup, work_freeing);
+ mem_cgroup_remove_from_trees(memcg);
+ free_css_id(&mem_cgroup_subsys, &memcg->css);
+
+ for_each_node(node)
+ free_mem_cgroup_per_zone_info(memcg, node);
+
+ free_percpu(memcg->stat);
+
/*
* We need to make sure that (at least for now), the jump label
* destruction code runs outside of the cgroup lock. This is because
@@ -5231,38 +5244,27 @@ static void free_work(struct work_struct *work)
vfree(memcg);
}
-static void free_rcu(struct rcu_head *rcu_head)
-{
- struct mem_cgroup *memcg;
-
- memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
- INIT_WORK(&memcg->work_freeing, free_work);
- schedule_work(&memcg->work_freeing);
-}
/*
- * At destroying mem_cgroup, references from swap_cgroup can remain.
- * (scanning all at force_empty is too costly...)
- *
- * Instead of clearing all references at force_empty, we remember
- * the number of reference from swap_cgroup and free mem_cgroup when
- * it goes down to 0.
- *
- * Removal of cgroup itself succeeds regardless of refs from swap.
+ * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
+ * but in process context. The work_freeing structure is overlaid
+ * on the rcu_freeing structure, which itself is overlaid on memsw.
*/
-
-static void __mem_cgroup_free(struct mem_cgroup *memcg)
+static void free_work(struct work_struct *work)
{
- int node;
+ struct mem_cgroup *memcg;
- mem_cgroup_remove_from_trees(memcg);
- free_css_id(&mem_cgroup_subsys, &memcg->css);
+ memcg = container_of(work, struct mem_cgroup, work_freeing);
+ __mem_cgroup_free(memcg);
+}
- for_each_node(node)
- free_mem_cgroup_per_zone_info(memcg, node);
+static void free_rcu(struct rcu_head *rcu_head)
+{
+ struct mem_cgroup *memcg;
- free_percpu(memcg->stat);
- call_rcu(&memcg->rcu_freeing, free_rcu);
+ memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
+ INIT_WORK(&memcg->work_freeing, free_work);
+ schedule_work(&memcg->work_freeing);
}
static void mem_cgroup_get(struct mem_cgroup *memcg)
@@ -5274,7 +5276,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
{
if (atomic_sub_and_test(count, &memcg->refcnt)) {
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
- __mem_cgroup_free(memcg);
+ call_rcu(&memcg->rcu_freeing, free_rcu);
if (parent)
mem_cgroup_put(parent);
}
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 12/14] execute the whole memcg freeing in free_worker
2012-10-16 10:16 ` [PATCH v5 12/14] execute the whole memcg freeing in free_worker Glauber Costa
@ 2012-10-17 6:56 ` Kamezawa Hiroyuki
0 siblings, 0 replies; 70+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-17 6:56 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, Christoph Lameter, David Rientjes,
Pekka Enberg, devel, linux-kernel
(2012/10/16 19:16), Glauber Costa wrote:
> A lot of the initialization we do in mem_cgroup_create() is done with
> softirqs enabled. This include grabbing a css id, which holds
> &ss->id_lock->rlock, and the per-zone trees, which holds
> rtpz->lock->rlock. All of those signal to the lockdep mechanism that
> those locks can be used in SOFTIRQ-ON-W context. This means that the
> freeing of memcg structure must happen in a compatible context,
> otherwise we'll get a deadlock, like the one bellow, caught by lockdep:
>
> [<ffffffff81103095>] free_accounted_pages+0x47/0x4c
> [<ffffffff81047f90>] free_task+0x31/0x5c
> [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
> [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
> [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
> [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
> [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
> [<ffffffff8105266d>] __do_softirq+0x122/0x277
>
> This usage pattern could not be triggered before kmem came into play.
> With the introduction of kmem stack handling, it is possible that we
> call the last mem_cgroup_put() from the task destructor, which is run in
> an rcu callback. Such callbacks are run with softirqs disabled, leading
> to the offensive usage pattern.
>
> In general, we have little, if any, means to guarantee in which context
> the last memcg_put will happen. The best we can do is test it and try to
> make sure no invalid context releases are happening. But as we add more
> code to memcg, the possible interactions grow in number and expose more
> ways to get context conflicts. One thing to keep in mind, is that part
> of the freeing process is already deferred to a worker, such as vfree(),
> that can only be called from process context.
>
> For the moment, the only two functions we really need moved away are:
>
> * free_css_id(), and
> * mem_cgroup_remove_from_trees().
>
> But because the later accesses per-zone info,
> free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we
> are left with the per_cpu stats only. Better move it all.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (11 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 12/14] execute the whole memcg freeing in free_worker Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
2012-10-17 22:12 ` Andrew Morton
2012-10-16 10:16 ` [PATCH v5 14/14] Add documentation about the kmem controller Glauber Costa
[not found] ` <1350382611-20579-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
14 siblings, 1 reply; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
Because those architectures will draw their stacks directly from the
page allocator, rather than the slab cache, we can directly pass
__GFP_KMEMCG flag, and issue the corresponding free_pages.
This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.
This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will have the allocations to fail if
they go over limit.
For the time being, I am defining a new variant of THREADINFO_GFP, not
to mess with the other path. Once the slab is also tracked by memcg, we
can get rid of that flag.
Tested to successfully protect against :(){ :|:& };:
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
include/linux/thread_info.h | 2 ++
kernel/fork.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ccc1899..e7e0473 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
# define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
#endif
+#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
+
/*
* flag set/clear/test wrappers
* - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index 03b86f1..b3f6298 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
int node)
{
- struct page *page = alloc_pages_node(node, THREADINFO_GFP,
+ struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
THREAD_SIZE_ORDER);
return page ? page_address(page) : NULL;
@@ -154,7 +154,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
static inline void free_thread_info(struct thread_info *ti)
{
- free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+ free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
}
# else
static struct kmem_cache *thread_info_cache;
--
1.7.11.7
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 70+ messages in thread* Re: [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
2012-10-16 10:16 ` [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
@ 2012-10-17 22:12 ` Andrew Morton
[not found] ` <20121017151245.f11c4d18.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Pekka Enberg,
Suleiman Souhlal
On Tue, 16 Oct 2012 14:16:50 +0400
Glauber Costa <glommer@parallels.com> wrote:
> @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
> static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
> int node)
> {
> - struct page *page = alloc_pages_node(node, THREADINFO_GFP,
> + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
> THREAD_SIZE_ORDER);
yay, we actually used all this code for something ;)
I don't think we really saw a comprehensive list of what else the kmem
controller will be used for, but I believe that all other envisaged
applications will require slab accounting, yes?
So it appears that all we have at present is a
yet-another-fork-bomb-preventer, but one which requires that the
culprit be in a container? That's reasonable, given your
hosted-environment scenario. It's unclear (to me) that we should merge
all this code for only this feature. Again, it would be good to have a
clear listing of and plan for other applications of this code.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
* [PATCH v5 14/14] Add documentation about the kmem controller
2012-10-16 10:16 [PATCH v5 00/14] kmem controller for memcg Glauber Costa
` (12 preceding siblings ...)
2012-10-16 10:16 ` [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
@ 2012-10-16 10:16 ` Glauber Costa
[not found] ` <1350382611-20579-15-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (2 more replies)
[not found] ` <1350382611-20579-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
14 siblings, 3 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-16 10:16 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, Mel Gorman, Tejun Heo, Andrew Morton, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel, Glauber Costa,
Frederic Weisbecker, Pekka Enberg, Suleiman Souhlal
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Frederic Weisbecker <fweisbec@redhat.com>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Suleiman Souhlal <suleiman@google.com>
CC: Tejun Heo <tj@kernel.org>
---
Documentation/cgroups/memory.txt | 58 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..dd15be8 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -71,6 +71,11 @@ Brief summary of control files.
memory.oom_control # set/show oom controls.
memory.numa_stat # show the number of memory usage per numa node
+ memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
+ memory.kmem.usage_in_bytes # show current kernel memory allocation
+ memory.kmem.failcnt # show the number of kernel memory usage hits limits
+ memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded
+
memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits
@@ -268,20 +273,65 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally
different than user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.
+Kernel memory won't be accounted at all until limit on a group is set. This
+allows for existing setups to continue working without disruption. The limit
+cannot be set if the cgroup have children, or if there are already tasks in the
+cgroup. When use_hierarchy == 1 and a group is accounted, its children will
+automatically be accounted regardless of their limit value.
+
+After a controller is first limited, it will be kept being accounted until it
+is removed. The memory limitation itself, can of course be removed by writing
+-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
+limited.
+
Kernel memory limits are not imposed for the root cgroup. Usage for the root
-cgroup may or may not be accounted.
+cgroup may or may not be accounted. The memory used is accumulated into
+memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
+(currently only for tcp).
+The main "kmem" counter is fed into the main counter, so kmem charges will
+also be visible from the user counter.
Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.
2.7.1 Current Kernel Memory resources accounted
+* stack pages: every process consumes some stack pages. By accounting into
+kernel memory, we prevent new processes from being created when the kernel
+memory usage is too high.
+
* sockets memory pressure: some sockets protocols have memory pressure
thresholds. The Memory Controller allows them to be controlled individually
per cgroup, instead of globally.
* tcp memory pressure: sockets memory pressure for the tcp protocol.
+2.7.3 Common use cases
+
+Because the "kmem" counter is fed to the main user counter, kernel memory can
+never be limited completely independently of user memory. Say "U" is the user
+limit, and "K" the kernel limit. There are three possible ways limits can be
+set:
+
+ U != 0, K = unlimited:
+ This is the standard memcg limitation mechanism already present before kmem
+ accounting. Kernel memory is completely ignored.
+
+ U != 0, K < U:
+ Kernel memory is a subset of the user memory. This setup is useful in
+ deployments where the total amount of memory per-cgroup is overcommited.
+ Overcommiting kernel memory limits is definitely not recommended, since the
+ box can still run out of non-reclaimable memory.
+ In this case, the admin could set up K so that the sum of all groups is
+ never greater than the total memory, and freely set U at the cost of his
+ QoS.
+
+ U != 0, K >= U:
+ Since kmem charges will also be fed to the user counter and reclaim will be
+ triggered for the cgroup for both kinds of memory. This setup gives the
+ admin a unified view of memory, and it is also useful for people who just
+ want to track kernel memory usage.
+
3. User Interface
0. Configuration
@@ -290,6 +340,7 @@ a. Enable CONFIG_CGROUPS
b. Enable CONFIG_RESOURCE_COUNTERS
c. Enable CONFIG_MEMCG
d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
+d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
# mount -t tmpfs none /sys/fs/cgroup
@@ -406,6 +457,11 @@ About use_hierarchy, see Section 6.
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
+ Also, note that when memory.kmem.limit_in_bytes is set the charges due to
+ kernel pages will still be seen. This is not considered a failure and the
+ write will still return success. In this case, it is expected that
+ memory.kmem.usage_in_bytes == memory.usage_in_bytes.
+
About use_hierarchy, see Section 6.
5.2 stat file
--
1.7.11.7
^ permalink raw reply related [flat|nested] 70+ messages in thread[parent not found: <1350382611-20579-15-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v5 14/14] Add documentation about the kmem controller
[not found] ` <1350382611-20579-15-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-10-16 12:23 ` Michal Hocko
0 siblings, 0 replies; 70+ messages in thread
From: Michal Hocko @ 2012-10-16 12:23 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Mel Gorman, Tejun Heo, Andrew Morton, Johannes Weiner,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Christoph Lameter,
David Rientjes, Pekka Enberg, devel-GEFAQzZX7r8dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Frederic Weisbecker,
Pekka Enberg, Suleiman Souhlal
On Tue 16-10-12 14:16:51, Glauber Costa wrote:
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> CC: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org
Just a nit..
> ---
> Documentation/cgroups/memory.txt | 58 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 57 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index c07f7b4..dd15be8 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
[...]
> @@ -268,20 +273,65 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally
> different than user memory, since it can't be swapped out, which makes it
> possible to DoS the system by consuming too much of this precious resource.
>
> +Kernel memory won't be accounted at all until limit on a group is set. This
> +allows for existing setups to continue working without disruption. The limit
> +cannot be set if the cgroup have children, or if there are already tasks in the
> +cgroup. When use_hierarchy == 1 and a group is accounted, its children will
> +automatically be accounted regardless of their limit value.
> +
> +After a controller is first limited, it will be kept being accounted until it
s/controller/group/
> +is removed. The memory limitation itself, can of course be removed by writing
> +-1 to memory.kmem.limit_in_bytes. In this case, kmem will be accounted, but not
> +limited.
> +
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH v5 14/14] Add documentation about the kmem controller
2012-10-16 10:16 ` [PATCH v5 14/14] Add documentation about the kmem controller Glauber Costa
[not found] ` <1350382611-20579-15-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-10-16 18:25 ` Christoph Lameter
[not found] ` <0000013a6ad26c73-d043cf97-c44a-45c1-9cae-0a962e93a005-000000-p/GC64/jrecnJqMo6gzdpkEOCMrvLtNR@public.gmane.org>
2012-10-17 22:12 ` Andrew Morton
2 siblings, 1 reply; 70+ messages in thread
From: Christoph Lameter @ 2012-10-16 18:25 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Andrew Morton,
Michal Hocko, Johannes Weiner, kamezawa.hiroyu, David Rientjes,
Pekka Enberg, devel, linux-kernel, Frederic Weisbecker,
Pekka Enberg, Suleiman Souhlal
On Tue, 16 Oct 2012, Glauber Costa wrote:
>
> + memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
> + memory.kmem.usage_in_bytes # show current kernel memory allocation
> + memory.kmem.failcnt # show the number of kernel memory usage hits limits
> + memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded
Does it actually make sense to limit kernel memory? The user generally has
no idea how much kernel memory a process is using and kernel changes can
change the memory footprint. Given the fuzzy accounting in the kernel a
large cache refill (if someone configures the slab batch count to be
really big f.e.) can account a lot of memory to the wrong cgroup. The
allocation could fail.
Limiting the total memory use of a process (U+K) would make more sense I
guess. Only U is probably sufficient? In what way would a limitation on
kernel memory in use be good?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 14/14] Add documentation about the kmem controller
2012-10-16 10:16 ` [PATCH v5 14/14] Add documentation about the kmem controller Glauber Costa
[not found] ` <1350382611-20579-15-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-10-16 18:25 ` Christoph Lameter
@ 2012-10-17 22:12 ` Andrew Morton
2012-10-18 9:38 ` Glauber Costa
2 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel,
Frederic Weisbecker, Pekka Enberg, Suleiman Souhlal
On Tue, 16 Oct 2012 14:16:51 +0400
Glauber Costa <glommer@parallels.com> wrote:
> +Kernel memory won't be accounted at all until limit on a group is set. This
> +allows for existing setups to continue working without disruption. The limit
> +cannot be set if the cgroup have children, or if there are already tasks in the
> +cgroup.
What behaviour will usersapce see if "The limit cannot be set"?
write() returns -EINVAL, something like that?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread* Re: [PATCH v5 14/14] Add documentation about the kmem controller
2012-10-17 22:12 ` Andrew Morton
@ 2012-10-18 9:38 ` Glauber Costa
0 siblings, 0 replies; 70+ messages in thread
From: Glauber Costa @ 2012-10-18 9:38 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, cgroups, Mel Gorman, Tejun Heo, Michal Hocko,
Johannes Weiner, kamezawa.hiroyu, Christoph Lameter,
David Rientjes, Pekka Enberg, devel, linux-kernel,
Frederic Weisbecker, Pekka Enberg, Suleiman Souhlal
On 10/18/2012 02:12 AM, Andrew Morton wrote:
> On Tue, 16 Oct 2012 14:16:51 +0400
> Glauber Costa <glommer@parallels.com> wrote:
>
>> +Kernel memory won't be accounted at all until limit on a group is set. This
>> +allows for existing setups to continue working without disruption. The limit
>> +cannot be set if the cgroup have children, or if there are already tasks in the
>> +cgroup.
>
> What behaviour will usersapce see if "The limit cannot be set"?
> write() returns -EINVAL, something like that?
>
-EBUSY.
Will update.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 70+ messages in thread
[parent not found: <1350382611-20579-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v5 00/14] kmem controller for memcg.
[not found] ` <1350382611-20579-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-10-17 22:11 ` Andrew Morton
[not found] ` <20121017151142.71e1f3c5.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2012-10-17 22:11 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
Mel Gorman, Tejun Heo, Michal Hocko, Johannes Weiner,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Christoph Lameter,
David Rientjes, Pekka Enberg, devel-GEFAQzZX7r8dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Tue, 16 Oct 2012 14:16:37 +0400
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> ...
>
> A general explanation of what this is all about follows:
>
> The kernel memory limitation mechanism for memcg concerns itself with
> disallowing potentially non-reclaimable allocations to happen in exaggerate
> quantities by a particular set of processes (cgroup). Those allocations could
> create pressure that affects the behavior of a different and unrelated set of
> processes.
>
> Its basic working mechanism is to annotate some allocations with the
> _GFP_KMEMCG flag. When this flag is set, the current process allocating will
> have its memcg identified and charged against. When reaching a specific limit,
> further allocations will be denied.
The need to set _GFP_KMEMCG is rather unpleasing, and makes one wonder
"why didn't it just track all allocations".
Does this mean that over time we can expect more sites to get the
_GFP_KMEMCG tagging? If so, are there any special implications, or do
we just go in, do the one-line patch and expect everything to work? If
so, why don't we go in and do that tagging right now?
And how *accurate* is the proposed code? What percentage of kernel
memory allocations are unaccounted, typical case and worst case?
All sorts of questions come to mind over this decision, but it was
unexplained. It should be, please. A lot!
>
> ...
>
> Limits lower than
> the user limit effectively means there is a separate kernel memory limit that
> may be reached independently than the user limit. Values equal or greater than
> the user limit implies only that kernel memory is tracked. This provides a
> unified vision of "maximum memory", be it kernel or user memory.
>
I'm struggling to understand that text much at all. Reading the
Documentation/cgroups/memory.txt patch helped.
^ permalink raw reply [flat|nested] 70+ messages in thread