* [PATCH v2 01/11] memcg: Make it possible to use the stock for more than one page.
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-09 13:01 ` Glauber Costa
2012-08-10 15:12 ` Michal Hocko
2012-08-09 13:01 ` [PATCH v2 02/11] memcg: Reclaim when more than one page needed Glauber Costa
` (9 subsequent siblings)
10 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Suleiman Souhlal,
Glauber Costa
From: Suleiman Souhlal <ssouhlal-HZy0K5TPuP5AfugRpC6u6w@public.gmane.org>
We currently have a percpu stock cache scheme that charges one page at a
time from memcg->res, the user counter. When the kernel memory
controller comes into play, we'll need to charge more than that.
This is because kernel memory allocations will also draw from the user
counter, and can be bigger than a single page, as is the case with
the stack (usually 2 pages) or some higher order slabs.
[ glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org: added a changelog ]
Signed-off-by: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Acked-by: David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
---
mm/memcontrol.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 95162c9..bc7bfa7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2096,20 +2096,28 @@ struct memcg_stock_pcp {
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
static DEFINE_MUTEX(percpu_charge_mutex);
-/*
- * Try to consume stocked charge on this cpu. If success, one page is consumed
- * from local stock and true is returned. If the stock is 0 or charges from a
- * cgroup which is not current target, returns false. This stock will be
- * refilled.
+/**
+ * consume_stock: Try to consume stocked charge on this cpu.
+ * @memcg: memcg to consume from.
+ * @nr_pages: how many pages to charge.
+ *
+ * The charges will only happen if @memcg matches the current cpu's memcg
+ * stock, and at least @nr_pages are available in that stock. Failure to
+ * service an allocation will refill the stock.
+ *
+ * returns true if successful, false otherwise.
*/
-static bool consume_stock(struct mem_cgroup *memcg)
+static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
{
struct memcg_stock_pcp *stock;
bool ret = true;
+ if (nr_pages > CHARGE_BATCH)
+ return false;
+
stock = &get_cpu_var(memcg_stock);
- if (memcg == stock->cached && stock->nr_pages)
- stock->nr_pages--;
+ if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+ stock->nr_pages -= nr_pages;
else /* need to call res_counter_charge */
ret = false;
put_cpu_var(memcg_stock);
@@ -2408,7 +2416,7 @@ again:
VM_BUG_ON(css_is_removed(&memcg->css));
if (mem_cgroup_is_root(memcg))
goto done;
- if (nr_pages == 1 && consume_stock(memcg))
+ if (consume_stock(memcg, nr_pages))
goto done;
css_get(&memcg->css);
} else {
@@ -2433,7 +2441,7 @@ again:
rcu_read_unlock();
goto done;
}
- if (nr_pages == 1 && consume_stock(memcg)) {
+ if (consume_stock(memcg, nr_pages)) {
/*
* It seems dagerous to access memcg without css_get().
* But considering how consume_stok works, it's not
--
1.7.11.2
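The stock check introduced by this patch can be sketched in userspace as follows. This is only a model under stated assumptions: struct memcg_stock, struct mem_cgroup and consume_stock_model() are hypothetical stand-ins, a single stock replaces the per-cpu variable (so there is no get_cpu_var()/put_cpu_var() pairing), and CHARGE_BATCH is taken to be the 32-page stock size mentioned later in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed batch size: the "stock of 32 pages" mentioned in the series. */
#define CHARGE_BATCH 32

/* Hypothetical stand-in for the kernel's struct mem_cgroup. */
struct mem_cgroup {
	int id;
};

/* Hypothetical stand-in for struct memcg_stock_pcp (one stock, not per-cpu). */
struct memcg_stock {
	struct mem_cgroup *cached;	/* memcg the stock was pre-charged for */
	unsigned int nr_pages;		/* pages still available in the stock */
};

/*
 * Model of the patched consume_stock(): succeed only if the stock belongs
 * to @memcg and holds at least @nr_pages; otherwise leave it untouched.
 */
static bool consume_stock_model(struct memcg_stock *stock,
				struct mem_cgroup *memcg,
				unsigned int nr_pages)
{
	/* A request larger than one batch can never be served from the stock. */
	if (nr_pages > CHARGE_BATCH)
		return false;

	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		return true;
	}

	/* The caller has to fall back to charging the res_counter. */
	return false;
}
```

With a stock of 3 pages cached for one memcg, a 2-page request succeeds and leaves 1 page; a second 2-page request, a request from a different memcg, or a request above CHARGE_BATCH all fail without changing the stock.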
* Re: [PATCH v2 01/11] memcg: Make it possible to use the stock for more than one page.
2012-08-09 13:01 ` [PATCH v2 01/11] memcg: Make it possible to use the stock for more than one page Glauber Costa
@ 2012-08-10 15:12 ` Michal Hocko
0 siblings, 0 replies; 135+ messages in thread
From: Michal Hocko @ 2012-08-10 15:12 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Suleiman Souhlal
On Thu 09-08-12 17:01:09, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> We currently have a percpu stock cache scheme that charges one page at a
> time from memcg->res, the user counter. When the kernel memory
> controller comes into play, we'll need to charge more than that.
>
> This is because kernel memory allocations will also draw from the user
> counter, and can be bigger than a single page, as is the case with
> the stack (usually 2 pages) or some higher order slabs.
>
> [ glommer@parallels.com: added a changelog ]
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
> ---
> mm/memcontrol.c | 28 ++++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 95162c9..bc7bfa7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2096,20 +2096,28 @@ struct memcg_stock_pcp {
> static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> static DEFINE_MUTEX(percpu_charge_mutex);
>
> -/*
> - * Try to consume stocked charge on this cpu. If success, one page is consumed
> - * from local stock and true is returned. If the stock is 0 or charges from a
> - * cgroup which is not current target, returns false. This stock will be
> - * refilled.
> +/**
> + * consume_stock: Try to consume stocked charge on this cpu.
> + * @memcg: memcg to consume from.
> + * @nr_pages: how many pages to charge.
> + *
> + * The charges will only happen if @memcg matches the current cpu's memcg
> + * stock, and at least @nr_pages are available in that stock. Failure to
> + * service an allocation will refill the stock.
> + *
> + * returns true if successful, false otherwise.
> */
> -static bool consume_stock(struct mem_cgroup *memcg)
> +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
> {
> struct memcg_stock_pcp *stock;
> bool ret = true;
>
> + if (nr_pages > CHARGE_BATCH)
> + return false;
> +
> stock = &get_cpu_var(memcg_stock);
> - if (memcg == stock->cached && stock->nr_pages)
> - stock->nr_pages--;
> + if (memcg == stock->cached && stock->nr_pages >= nr_pages)
> + stock->nr_pages -= nr_pages;
> else /* need to call res_counter_charge */
> ret = false;
> put_cpu_var(memcg_stock);
> @@ -2408,7 +2416,7 @@ again:
> VM_BUG_ON(css_is_removed(&memcg->css));
> if (mem_cgroup_is_root(memcg))
> goto done;
> - if (nr_pages == 1 && consume_stock(memcg))
> + if (consume_stock(memcg, nr_pages))
> goto done;
> css_get(&memcg->css);
> } else {
> @@ -2433,7 +2441,7 @@ again:
> rcu_read_unlock();
> goto done;
> }
> - if (nr_pages == 1 && consume_stock(memcg)) {
> + if (consume_stock(memcg, nr_pages)) {
> /*
> * It seems dagerous to access memcg without css_get().
> * But considering how consume_stok works, it's not
> --
> 1.7.11.2
>
--
Michal Hocko
SUSE Labs
* [PATCH v2 02/11] memcg: Reclaim when more than one page needed.
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 13:01 ` [PATCH v2 01/11] memcg: Make it possible to use the stock for more than one page Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 13:01 ` [PATCH v2 03/11] memcg: change defines to an enum Glauber Costa
` (8 subsequent siblings)
10 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Suleiman Souhlal,
Glauber Costa
From: Suleiman Souhlal <ssouhlal-HZy0K5TPuP5AfugRpC6u6w@public.gmane.org>
mem_cgroup_do_charge() was written before kmem accounting, and expects
three cases: being called for 1 page, being called for a stock of 32
pages, or being called for a hugepage. If we call it for 2 or 3 pages (and
both the stack and several slabs used in process creation are such, at
least with the debug options I had), it assumes it's being called for
stock and just retries without reclaiming.
Fix that by passing down a minsize argument in addition to the csize.
And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably it is, since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, and of what? For now, retry for
allocations up to COSTLY_ORDER (as page_alloc.c does) and make sure not
to do it if __GFP_NORETRY is set.
[v4: fixed nr pages calculation pointed out by Christoph Lameter ]
Signed-off-by: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
---
mm/memcontrol.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bc7bfa7..2cef99a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2294,7 +2294,8 @@ enum {
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, unsigned int min_pages,
+ bool oom_check)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2317,18 +2318,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
- * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
- * of regular pages (CHARGE_BATCH), or a single regular page (1).
- *
* Never reclaim on behalf of optional batching, retry with a
* single page instead.
*/
- if (nr_pages == CHARGE_BATCH)
+ if (nr_pages > min_pages)
return CHARGE_RETRY;
if (!(gfp_mask & __GFP_WAIT))
return CHARGE_WOULDBLOCK;
+ if (gfp_mask & __GFP_NORETRY)
+ return CHARGE_NOMEM;
+
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
@@ -2341,7 +2342,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
* unlikely to succeed so close to the limit, and we fall back
* to regular pages anyway in case of failure.
*/
- if (nr_pages == 1 && ret)
+ if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
return CHARGE_RETRY;
/*
@@ -2476,7 +2477,8 @@ again:
nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
}
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
+ oom_check);
switch (ret) {
case CHARGE_OK:
break;
--
1.7.11.2
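The decision sequence that the patched mem_cgroup_do_charge() follows after a failed charge can be modelled as below. This is a sketch, not the kernel function: struct charge_ctx and charge_failed_model() are hypothetical, the reclaim call is replaced by its recorded outcome, CHARGE_FALLTHROUGH is an invented placeholder for the steps that lie beyond the hunks shown in this patch, and PAGE_ALLOC_COSTLY_ORDER is assumed to be 3 as in the page allocator.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value; the kernel defines this for the page allocator. */
#define PAGE_ALLOC_COSTLY_ORDER 3

enum charge_result {
	CHARGE_RETRY,
	CHARGE_NOMEM,
	CHARGE_WOULDBLOCK,
	CHARGE_FALLTHROUGH,	/* hypothetical: code past the hunks shown */
};

/* Hypothetical bundle of the inputs the kernel derives at this point. */
struct charge_ctx {
	bool can_wait;			/* gfp_mask & __GFP_WAIT */
	bool noretry;			/* gfp_mask & __GFP_NORETRY */
	unsigned long reclaimed;	/* result of mem_cgroup_reclaim() */
	unsigned long margin;		/* mem_cgroup_margin() after reclaim */
};

static enum charge_result
charge_failed_model(const struct charge_ctx *ctx,
		    unsigned int nr_pages, unsigned int min_pages)
{
	/* Never reclaim on behalf of optional batching: retry smaller. */
	if (nr_pages > min_pages)
		return CHARGE_RETRY;

	if (!ctx->can_wait)
		return CHARGE_WOULDBLOCK;

	if (ctx->noretry)
		return CHARGE_NOMEM;

	/* Reclaim would run here; ctx carries its outcome. */
	if (ctx->margin >= nr_pages)
		return CHARGE_RETRY;

	/* Retry non-costly sizes as long as reclaim made some progress. */
	if (nr_pages <= (1u << PAGE_ALLOC_COSTLY_ORDER) && ctx->reclaimed)
		return CHARGE_RETRY;

	return CHARGE_FALLTHROUGH;
}
```

The point of the extra min_pages argument shows up in the first test: a 32-page batch requested on behalf of a 2-page allocation is retried at the smaller size, whereas a genuine 2-page request proceeds to the reclaim path instead of being mistaken for optional batching.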
* [PATCH v2 03/11] memcg: change defines to an enum
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 13:01 ` [PATCH v2 01/11] memcg: Make it possible to use the stock for more than one page Glauber Costa
2012-08-09 13:01 ` [PATCH v2 02/11] memcg: Reclaim when more than one page needed Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 13:01 ` [PATCH v2 04/11] kmem accounting basic infrastructure Glauber Costa
` (7 subsequent siblings)
10 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa
This is just a cleanup patch for clarity of expression. In earlier
submissions, people asked for it to be in a separate patch, so here it is.
[ v2: use named enum as type throughout the file as well ]
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
---
mm/memcontrol.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2cef99a..b0e29f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -393,9 +393,12 @@ enum charge_type {
};
/* for encoding cft->private value on file */
-#define _MEM (0)
-#define _MEMSWAP (1)
-#define _OOM_TYPE (2)
+enum res_type {
+ _MEM,
+ _MEMSWAP,
+ _OOM_TYPE,
+};
+
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
#define MEMFILE_TYPE(val) ((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val) ((val) & 0xffff)
@@ -3983,7 +3986,8 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
char str[64];
u64 val;
- int type, name, len;
+ int name, len;
+ enum res_type type;
type = MEMFILE_TYPE(cft->private);
name = MEMFILE_ATTR(cft->private);
@@ -4019,7 +4023,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int type, name;
+ enum res_type type;
+ int name;
unsigned long long val;
int ret;
@@ -4095,7 +4100,8 @@ out:
static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int type, name;
+ int name;
+ enum res_type type;
type = MEMFILE_TYPE(event);
name = MEMFILE_ATTR(event);
@@ -4423,7 +4429,7 @@ static int mem_cgroup_usage_register_event(struct cgroup *cgrp,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
u64 threshold, usage;
int i, size, ret;
@@ -4506,7 +4512,7 @@ static void mem_cgroup_usage_unregister_event(struct cgroup *cgrp,
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
u64 usage;
int i, j, size;
@@ -4584,7 +4590,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_eventfd_list *event;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
BUG_ON(type != _OOM_TYPE);
event = kmalloc(sizeof(*event), GFP_KERNEL);
@@ -4609,7 +4615,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct mem_cgroup_eventfd_list *ev, *tmp;
- int type = MEMFILE_TYPE(cft->private);
+ enum res_type type = MEMFILE_TYPE(cft->private);
BUG_ON(type != _OOM_TYPE);
--
1.7.11.2
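How the enum interacts with the cft->private encoding can be illustrated with a small standalone sketch. The three MEMFILE_* macros and the enum are copied from the patch; the RES_* attribute values and the two helper functions memfile_type() and memfile_attr() are assumptions added for illustration only.

```c
#include <assert.h>

/* From the patch: the resource type stored in the upper 16 bits. */
enum res_type {
	_MEM,
	_MEMSWAP,
	_OOM_TYPE,
};

#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

/* Illustrative attribute values; the kernel takes them from res_counter.h. */
enum {
	RES_USAGE,
	RES_MAX_USAGE,
	RES_LIMIT,
	RES_FAILCNT,
};

/* Hypothetical helper: decode the type with the named enum, as the patch does. */
static enum res_type memfile_type(int private_val)
{
	return (enum res_type)MEMFILE_TYPE(private_val);
}

/* Hypothetical helper: decode the attribute stored in the lower 16 bits. */
static int memfile_attr(int private_val)
{
	return MEMFILE_ATTR(private_val);
}
```

Declaring the decoded value as enum res_type rather than int changes no behavior; it documents which set of values the variable can hold and lets the compiler warn about unhandled cases in a switch.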
* [PATCH v2 04/11] kmem accounting basic infrastructure
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (2 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 03/11] memcg: change defines to an enum Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-14 16:21 ` Michal Hocko
2012-08-09 13:01 ` [PATCH v2 05/11] Add a __GFP_KMEMCG flag Glauber Costa
` (6 subsequent siblings)
10 siblings, 2 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa
This patch adds the basic infrastructure for the accounting of the slab
caches. To control that, the following files are created:
* memory.kmem.usage_in_bytes
* memory.kmem.limit_in_bytes
* memory.kmem.failcnt
* memory.kmem.max_usage_in_bytes
They have the same meaning as their user memory counterparts. They
reflect the state of the "kmem" res_counter.
The code is not enabled until a limit is set. This can be tested by the
flag "kmem_accounted". This means that after the patch is applied, no
behavioral change exists for whoever is still using memcg to control
their memory usage.
We always account to both user and kernel resource_counters. This
effectively means that an independent kernel limit is in place when the
limit is set to a lower value than the user memory. An equal or higher
value means that the user limit will always hit first, meaning that kmem
is effectively unlimited.
People who want to track kernel memory but not limit it can set this
limit to a very high number that no one will ever hit (like
RESOURCE_MAX - 1 page), or to a value equal to the user memory limit.
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
---
mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 68 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b0e29f4..54e93de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -273,6 +273,10 @@ struct mem_cgroup {
};
/*
+ * the counter to account for kernel memory usage.
+ */
+ struct res_counter kmem;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -287,6 +291,7 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
+ bool kmem_accounted;
bool oom_lock;
atomic_t under_oom;
@@ -397,6 +402,7 @@ enum res_type {
_MEM,
_MEMSWAP,
_OOM_TYPE,
+ _KMEM,
};
#define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
@@ -1499,6 +1505,10 @@ done:
res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+ printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
+ res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
+ res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
mem_cgroup_print_oom_stat(memcg);
}
@@ -4008,6 +4018,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
else
val = res_counter_read_u64(&memcg->memsw, name);
break;
+ case _KMEM:
+ val = res_counter_read_u64(&memcg->kmem, name);
+ break;
default:
BUG();
}
@@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
break;
if (type == _MEM)
ret = mem_cgroup_resize_limit(memcg, val);
- else
+ else if (type == _MEMSWAP)
ret = mem_cgroup_resize_memsw_limit(memcg, val);
+ else if (type == _KMEM) {
+ ret = res_counter_set_limit(&memcg->kmem, val);
+ if (ret)
+ break;
+ /*
+ * Once enabled, can't be disabled. We could in theory
+ * disable it if we haven't yet created any caches, or
+ * if we can shrink them all to death.
+ *
+ * But it is not worth the trouble
+ */
+ if (!memcg->kmem_accounted && val != RESOURCE_MAX)
+ memcg->kmem_accounted = true;
+ } else
+ return -EINVAL;
break;
case RES_SOFT_LIMIT:
ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -4113,12 +4141,16 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
case RES_MAX_USAGE:
if (type == _MEM)
res_counter_reset_max(&memcg->res);
+ else if (type == _KMEM)
+ res_counter_reset_max(&memcg->kmem);
else
res_counter_reset_max(&memcg->memsw);
break;
case RES_FAILCNT:
if (type == _MEM)
res_counter_reset_failcnt(&memcg->res);
+ else if (type == _KMEM)
+ res_counter_reset_failcnt(&memcg->kmem);
else
res_counter_reset_failcnt(&memcg->memsw);
break;
@@ -4672,6 +4704,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
}
#ifdef CONFIG_MEMCG_KMEM
+static struct cftype kmem_cgroup_files[] = {
+ {
+ .name = "kmem.limit_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+ .write_string = mem_cgroup_write,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.failcnt",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+ {
+ .name = "kmem.max_usage_in_bytes",
+ .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+ .trigger = mem_cgroup_reset,
+ .read = mem_cgroup_read,
+ },
+ {},
+};
+
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
return mem_cgroup_sockets_init(memcg, ss);
@@ -5015,6 +5074,12 @@ mem_cgroup_create(struct cgroup *cont)
int cpu;
enable_swap_cgroup();
parent = NULL;
+
+#ifdef CONFIG_MEMCG_KMEM
+ WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
+ kmem_cgroup_files));
+#endif
+
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
root_mem_cgroup = memcg;
@@ -5033,6 +5098,7 @@ mem_cgroup_create(struct cgroup *cont)
if (parent && parent->use_hierarchy) {
res_counter_init(&memcg->res, &parent->res);
res_counter_init(&memcg->memsw, &parent->memsw);
+ res_counter_init(&memcg->kmem, &parent->kmem);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -5043,6 +5109,7 @@ mem_cgroup_create(struct cgroup *cont)
} else {
res_counter_init(&memcg->res, NULL);
res_counter_init(&memcg->memsw, NULL);
+ res_counter_init(&memcg->kmem, NULL);
}
memcg->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&memcg->oom_notify);
--
1.7.11.2
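The "account to both counters" rule described in the changelog can be modelled with two plain counters. This is a sketch under stated assumptions: struct counter and all functions here are hypothetical simplifications of res_counter, reclaim is ignored entirely, and the order of the two charges (kmem first, then the user counter) is an assumption, since the charging path itself is not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, much simplified stand-in for struct res_counter. */
struct counter {
	unsigned long usage;
	unsigned long limit;
	unsigned long failcnt;
};

static bool counter_charge(struct counter *c, unsigned long n)
{
	if (c->usage + n > c->limit) {
		c->failcnt++;
		return false;
	}
	c->usage += n;
	return true;
}

static void counter_uncharge(struct counter *c, unsigned long n)
{
	assert(c->usage >= n);
	c->usage -= n;
}

enum kmem_charge_result {
	KMEM_OK,
	KMEM_HIT_KMEM_LIMIT,
	KMEM_HIT_USER_LIMIT,
};

/*
 * Charge a kernel allocation of @n units to both the kmem counter and the
 * user counter (res). Either limit can reject the charge.
 */
static enum kmem_charge_result
kmem_charge_model(struct counter *res, struct counter *kmem, unsigned long n)
{
	if (!counter_charge(kmem, n))
		return KMEM_HIT_KMEM_LIMIT;

	if (!counter_charge(res, n)) {
		counter_uncharge(kmem, n);	/* roll back the first charge */
		return KMEM_HIT_USER_LIMIT;
	}
	return KMEM_OK;
}
```

When the kmem limit is lower than the user limit, kernel allocations are stopped by the kmem counter first; when it is equal or higher, the user counter rejects the charge before the kmem limit can be reached, which is the sense in which the changelog calls kmem "effectively unlimited".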
[parent not found: <1344517279-30646-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 04/11] kmem accounting basic infrastructure
[not found] ` <1344517279-30646-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-10 17:02 ` Kamezawa Hiroyuki
[not found] ` <50253EA8.9080205-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Kamezawa Hiroyuki @ 2012-08-10 17:02 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, Christoph Lameter, David Rientjes, Pekka Enberg
(2012/08/09 22:01), Glauber Costa wrote:
> This patch adds the basic infrastructure for the accounting of the slab
> caches. To control that, the following files are created:
>
> * memory.kmem.usage_in_bytes
> * memory.kmem.limit_in_bytes
> * memory.kmem.failcnt
> * memory.kmem.max_usage_in_bytes
>
> They have the same meaning as their user memory counterparts. They
> reflect the state of the "kmem" res_counter.
>
> The code is not enabled until a limit is set. This can be tested by the
> flag "kmem_accounted". This means that after the patch is applied, no
> behavioral change exists for whoever is still using memcg to control
> their memory usage.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. An equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
>
> People who want to track kernel memory but not limit it can set this
> limit to a very high number that no one will ever hit (like
> RESOURCE_MAX - 1 page), or to a value equal to the user memory limit.
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Could you add a patch documenting this new interface, with text
explaining the behavior of "kmem_accounting"?
Hm, my concern is the difference in behavior between user page accounting and
kmem accounting... but this is how tcp-accounting works.
Once you add Documentation, it's okay to add my Ack.
Thanks,
-Kame
> ---
> mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b0e29f4..54e93de 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -273,6 +273,10 @@ struct mem_cgroup {
> };
>
> /*
> + * the counter to account for kernel memory usage.
> + */
> + struct res_counter kmem;
> + /*
> * Per cgroup active and inactive list, similar to the
> * per zone LRU lists.
> */
> @@ -287,6 +291,7 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> + bool kmem_accounted;
>
> bool oom_lock;
> atomic_t under_oom;
> @@ -397,6 +402,7 @@ enum res_type {
> _MEM,
> _MEMSWAP,
> _OOM_TYPE,
> + _KMEM,
> };
>
> #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
> @@ -1499,6 +1505,10 @@ done:
> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> + printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> + res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
> + res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
>
> mem_cgroup_print_oom_stat(memcg);
> }
> @@ -4008,6 +4018,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> else
> val = res_counter_read_u64(&memcg->memsw, name);
> break;
> + case _KMEM:
> + val = res_counter_read_u64(&memcg->kmem, name);
> + break;
> default:
> BUG();
> }
> @@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
> break;
> if (type == _MEM)
> ret = mem_cgroup_resize_limit(memcg, val);
> - else
> + else if (type == _MEMSWAP)
> ret = mem_cgroup_resize_memsw_limit(memcg, val);
> + else if (type == _KMEM) {
> + ret = res_counter_set_limit(&memcg->kmem, val);
> + if (ret)
> + break;
> + /*
> + * Once enabled, can't be disabled. We could in theory
> + * disable it if we haven't yet created any caches, or
> + * if we can shrink them all to death.
> + *
> + * But it is not worth the trouble
> + */
> + if (!memcg->kmem_accounted && val != RESOURCE_MAX)
> + memcg->kmem_accounted = true;
> + } else
> + return -EINVAL;
> break;
> case RES_SOFT_LIMIT:
> ret = res_counter_memparse_write_strategy(buffer, &val);
> @@ -4113,12 +4141,16 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
> case RES_MAX_USAGE:
> if (type == _MEM)
> res_counter_reset_max(&memcg->res);
> + else if (type == _KMEM)
> + res_counter_reset_max(&memcg->kmem);
> else
> res_counter_reset_max(&memcg->memsw);
> break;
> case RES_FAILCNT:
> if (type == _MEM)
> res_counter_reset_failcnt(&memcg->res);
> + else if (type == _KMEM)
> + res_counter_reset_failcnt(&memcg->kmem);
> else
> res_counter_reset_failcnt(&memcg->memsw);
> break;
> @@ -4672,6 +4704,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> }
>
> #ifdef CONFIG_MEMCG_KMEM
> +static struct cftype kmem_cgroup_files[] = {
> + {
> + .name = "kmem.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> + .write_string = mem_cgroup_write,
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.failcnt",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> + .trigger = mem_cgroup_reset,
> + .read = mem_cgroup_read,
> + },
> + {
> + .name = "kmem.max_usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> + .trigger = mem_cgroup_reset,
> + .read = mem_cgroup_read,
> + },
> + {},
> +};
> +
> static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> {
> return mem_cgroup_sockets_init(memcg, ss);
> @@ -5015,6 +5074,12 @@ mem_cgroup_create(struct cgroup *cont)
> int cpu;
> enable_swap_cgroup();
> parent = NULL;
> +
> +#ifdef CONFIG_MEMCG_KMEM
> + WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> + kmem_cgroup_files));
> +#endif
> +
> if (mem_cgroup_soft_limit_tree_init())
> goto free_out;
> root_mem_cgroup = memcg;
> @@ -5033,6 +5098,7 @@ mem_cgroup_create(struct cgroup *cont)
> if (parent && parent->use_hierarchy) {
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> + res_counter_init(&memcg->kmem, &parent->kmem);
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -5043,6 +5109,7 @@ mem_cgroup_create(struct cgroup *cont)
> } else {
> res_counter_init(&memcg->res, NULL);
> res_counter_init(&memcg->memsw, NULL);
> + res_counter_init(&memcg->kmem, NULL);
> }
> memcg->last_scanned_node = MAX_NUMNODES;
> INIT_LIST_HEAD(&memcg->oom_notify);
>
* Re: [PATCH v2 04/11] kmem accounting basic infrastructure
2012-08-09 13:01 ` [PATCH v2 04/11] kmem accounting basic infrastructure Glauber Costa
[not found] ` <1344517279-30646-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-14 16:21 ` Michal Hocko
2012-08-15 9:33 ` Glauber Costa
2012-08-15 19:50 ` Ying Han
1 sibling, 2 replies; 135+ messages in thread
From: Michal Hocko @ 2012-08-14 16:21 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg
On Thu 09-08-12 17:01:12, Glauber Costa wrote:
> This patch adds the basic infrastructure for the accounting of the slab
> caches. To control that, the following files are created:
>
> * memory.kmem.usage_in_bytes
> * memory.kmem.limit_in_bytes
> * memory.kmem.failcnt
> * memory.kmem.max_usage_in_bytes
>
> They have the same meaning as their user memory counterparts. They
> reflect the state of the "kmem" res_counter.
>
> The code is not enabled until a limit is set. This can be tested by the
> flag "kmem_accounted". This means that after the patch is applied, no
> behavioral change exists for whoever is still using memcg to control
> their memory usage.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. An equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
Well, it contributes to the user limit so it is not unlimited. It just
falls under a different limit and it tends to contribute less. This can
be quite confusing. I am still not sure whether we should mix the two
things together. If somebody wants to limit the kernel memory he has to
touch the other limit anyway. Do you have a strong reason to mix the
user and kernel counters?
My impression was that kernel allocation should simply fail while user
allocations might reclaim as well. Why should we reclaim just because of
the kernel allocation (which is unreclaimable from hard limit reclaim
point of view)?
I also think that the whole thing would get much simpler if those two
are split. Anyway if this is really a must then this should be
documented here.
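To make the point under discussion concrete, here is a toy user-space model of the double accounting described in the changelog. It is only a sketch: struct counter, charge_user() and charge_kernel() are names invented for illustration, and reclaim, memsw and the hierarchy are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-in for a res_counter: just usage and limit, in bytes. */
struct counter {
	uint64_t usage;
	uint64_t limit;
};

enum charge_result { CHARGE_OK, HIT_USER_LIMIT, HIT_KMEM_LIMIT };

static bool counter_charge(struct counter *c, uint64_t bytes)
{
	/* usage never exceeds limit, so the subtraction cannot wrap */
	if (bytes > c->limit - c->usage)
		return false;
	c->usage += bytes;
	return true;
}

static void counter_uncharge(struct counter *c, uint64_t bytes)
{
	assert(bytes <= c->usage);
	c->usage -= bytes;
}

/* User pages are charged to the user counter only. */
static enum charge_result charge_user(struct counter *res, uint64_t bytes)
{
	return counter_charge(res, bytes) ? CHARGE_OK : HIT_USER_LIMIT;
}

/* Kernel pages are charged to both counters; roll back if kmem refuses. */
static enum charge_result charge_kernel(struct counter *res,
					struct counter *kmem, uint64_t bytes)
{
	if (!counter_charge(res, bytes))
		return HIT_USER_LIMIT;
	if (!counter_charge(kmem, bytes)) {
		counter_uncharge(res, bytes);
		return HIT_KMEM_LIMIT;
	}
	return CHARGE_OK;
}
```

In this model a kmem limit below the user limit can trip on its own, while a kmem limit equal to or above the user limit can never be the first to trip, because every kernel charge is also a user charge.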
One nit below.
> People who want to track kernel memory but not limit it, can set this
> limit to a very high number (like RESOURCE_MAX - 1page - that no one
> will ever hit, or equal to the user memory)
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b0e29f4..54e93de 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
> break;
> if (type == _MEM)
> ret = mem_cgroup_resize_limit(memcg, val);
> - else
> + else if (type == _MEMSWAP)
> ret = mem_cgroup_resize_memsw_limit(memcg, val);
> + else if (type == _KMEM) {
> + ret = res_counter_set_limit(&memcg->kmem, val);
> + if (ret)
> + break;
> + /*
> + * Once enabled, can't be disabled. We could in theory
> + * disable it if we haven't yet created any caches, or
> + * if we can shrink them all to death.
> + *
> + * But it is not worth the trouble
> + */
> + if (!memcg->kmem_accounted && val != RESOURCE_MAX)
> + memcg->kmem_accounted = true;
> + } else
> + return -EINVAL;
> break;
This doesn't check the hierarchy, so kmem_accounted might not be in
sync with its parents. mem_cgroup_create (below) needs to copy
kmem_accounted down from the parent, and the code above needs a check
similar to the dance in mem_cgroup_oom_control_write.
[...]
> @@ -5033,6 +5098,7 @@ mem_cgroup_create(struct cgroup *cont)
> if (parent && parent->use_hierarchy) {
> res_counter_init(&memcg->res, &parent->res);
> res_counter_init(&memcg->memsw, &parent->memsw);
> + res_counter_init(&memcg->kmem, &parent->kmem);
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -5043,6 +5109,7 @@ mem_cgroup_create(struct cgroup *cont)
> } else {
> res_counter_init(&memcg->res, NULL);
> res_counter_init(&memcg->memsw, NULL);
> + res_counter_init(&memcg->kmem, NULL);
> }
> memcg->last_scanned_node = MAX_NUMNODES;
> INIT_LIST_HEAD(&memcg->oom_notify);
> --
> 1.7.11.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 04/11] kmem accounting basic infrastructure
2012-08-14 16:21 ` Michal Hocko
@ 2012-08-15 9:33 ` Glauber Costa
[not found] ` <502B6D03.1080804-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-15 19:50 ` Ying Han
1 sibling, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-15 9:33 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg
>> We always account to both user and kernel resource_counters. This
>> effectively means that an independent kernel limit is in place when the
>> limit is set to a lower value than the user memory. An equal or higher
>> value means that the user limit will always hit first, meaning that kmem
>> is effectively unlimited.
>
> Well, it contributes to the user limit so it is not unlimited. It just
> falls under a different limit and it tends to contribute less.
You are right, but this is just wording. I will update it, but what I
really mean here is that an independent limit is not imposed on kmem.
> This can
> be quite confusing. I am still not sure whether we should mix the two
> things together. If somebody wants to limit the kernel memory he has to
> touch the other limit anyway. Do you have a strong reason to mix the
> user and kernel counters?
This is funny, because the first opposition I found to this work was
"Why would anyone want to limit it separately?" =p
It seems that a quite common use case is to have a container with a
unified view of "memory" that it can use the way it likes, be it with
kernel memory, or user memory. I believe those people would be happy to
just silently account kernel memory to user memory, or at the most have
a switch to enable it.
What becomes clear from this back and forth is that there are people
interested in both use cases.
> My impression was that kernel allocation should simply fail while user
> allocations might reclaim as well. Why should we reclaim just because of
> the kernel allocation (which is unreclaimable from hard limit reclaim
> point of view)?
That is not what the kernel does, in general. We assume that if he wants
that memory and we can serve it, we should. Also, not all kernel memory
is unreclaimable. We can shrink the slabs, for instance. Ying Han
claims she has patches for that already...
> I also think that the whole thing would get much simpler if those two
> are split. Anyway if this is really a must then this should be
> documented here.
Well, documentation can't hurt.
>
> This doesn't check the hierarchy, so kmem_accounted might not be in
> sync with its parents. mem_cgroup_create (below) needs to copy
> kmem_accounted down from the parent, and the code above needs a check
> similar to the dance in mem_cgroup_oom_control_write.
>
I don't see why we have to.
I believe in an A/B/C hierarchy, C should be perfectly able to set a
different limit than its parents. Note that this is not a boolean.
Also, right now, C can become completely unlimited (by not setting a
limit) and this is, indeed, not the desired behavior.
A later patch will change kmem_accounted to a bitfield, and we'll use
one of the bits to signal that we should account kmem because our parent
is limited.
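A rough user-space sketch of what such a bitfield could look like, under assumptions: the bit names KMEM_ACCOUNTED_THIS and KMEM_ACCOUNTED_PARENT and struct group are hypothetical, and the later patch may well choose a different layout. The sketch only handles inheritance at creation time; propagating a limit that is set on a parent after its children already exist is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit names, invented for this sketch. */
#define KMEM_ACCOUNTED_THIS	(1u << 0)	/* a limit was set on this group */
#define KMEM_ACCOUNTED_PARENT	(1u << 1)	/* some ancestor accounts kmem */

/* Invented stand-in for struct mem_cgroup. */
struct group {
	struct group *parent;
	unsigned int kmem_accounted;
};

/* A new child accounts kmem if its parent does, for whatever reason. */
static void group_init(struct group *g, struct group *parent)
{
	g->parent = parent;
	g->kmem_accounted = 0;
	if (parent && parent->kmem_accounted)
		g->kmem_accounted |= KMEM_ACCOUNTED_PARENT;
}

/* Setting a limit is sticky: once enabled, it is never cleared here. */
static void group_set_kmem_limit(struct group *g)
{
	g->kmem_accounted |= KMEM_ACCOUNTED_THIS;
}

static bool group_accounts_kmem(const struct group *g)
{
	return g->kmem_accounted != 0;
}
```

With two bits, C in an A/B/C hierarchy can tell "I am accounted because A is limited" apart from "I have my own limit", which a single boolean cannot express.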
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 04/11] kmem accounting basic infrastructure
2012-08-14 16:21 ` Michal Hocko
2012-08-15 9:33 ` Glauber Costa
@ 2012-08-15 19:50 ` Ying Han
[not found] ` <CALWz4iwgnqwq5k_zhpsiiwrj8Y=OkCUg7H96khJWPZScSQE=nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 135+ messages in thread
From: Ying Han @ 2012-08-15 19:50 UTC (permalink / raw)
To: Michal Hocko
Cc: Glauber Costa, linux-kernel, linux-mm, cgroups, devel,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg
On Tue, Aug 14, 2012 at 9:21 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 09-08-12 17:01:12, Glauber Costa wrote:
>> This patch adds the basic infrastructure for the accounting of the slab
>> caches. To control that, the following files are created:
>>
>> * memory.kmem.usage_in_bytes
>> * memory.kmem.limit_in_bytes
>> * memory.kmem.failcnt
>> * memory.kmem.max_usage_in_bytes
>>
>> They have the same meaning as their user memory counterparts. They
>> reflect the state of the "kmem" res_counter.
>>
>> The code is not enabled until a limit is set. This can be tested by the
>> flag "kmem_accounted". This means that after the patch is applied, no
>> behavioral changes exist for whoever is still using memcg to control
>> their memory usage.
>>
>> We always account to both user and kernel resource_counters. This
>> effectively means that an independent kernel limit is in place when the
>> limit is set to a lower value than the user memory. An equal or higher
>> value means that the user limit will always hit first, meaning that kmem
>> is effectively unlimited.
>
> Well, it contributes to the user limit so it is not unlimited. It just
> falls under a different limit and it tends to contribute less. This can
> be quite confusing. I am still not sure whether we should mix the two
> things together. If somebody wants to limit the kernel memory he has to
> touch the other limit anyway. Do you have a strong reason to mix the
> user and kernel counters?
The reason to mix the two together is a compromise between the two use
cases we've heard so far. At Google, we only need one limit which
limits both user and kernel memory, and reclaim kicks in when the total
usage hits the limit.
> My impression was that kernel allocation should simply fail while user
> allocations might reclaim as well. Why should we reclaim just because of
> the kernel allocation (which is unreclaimable from hard limit reclaim
> point of view)?
Some kernel objects are reclaimable if we have a per-memcg shrinker.
> I also think that the whole thing would get much simpler if those two
> are split. Anyway if this is really a must then this should be
> documented here.
What would be the use case you have on your end?
--Ying
> One nit below.
>
>> People who want to track kernel memory but not limit it, can set this
>> limit to a very high number (like RESOURCE_MAX - 1page - that no one
>> will ever hit, or equal to the user memory)
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 68 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b0e29f4..54e93de 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> @@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
>> break;
>> if (type == _MEM)
>> ret = mem_cgroup_resize_limit(memcg, val);
>> - else
>> + else if (type == _MEMSWAP)
>> ret = mem_cgroup_resize_memsw_limit(memcg, val);
>> + else if (type == _KMEM) {
>> + ret = res_counter_set_limit(&memcg->kmem, val);
>> + if (ret)
>> + break;
>> + /*
>> + * Once enabled, can't be disabled. We could in theory
>> + * disable it if we haven't yet created any caches, or
>> + * if we can shrink them all to death.
>> + *
>> + * But it is not worth the trouble
>> + */
>> + if (!memcg->kmem_accounted && val != RESOURCE_MAX)
>> + memcg->kmem_accounted = true;
>> + } else
>> + return -EINVAL;
>> break;
>
> This doesn't check the hierarchy, so kmem_accounted might not be in
> sync with its parents. mem_cgroup_create (below) needs to copy
> kmem_accounted down from the parent, and the code above needs a check
> similar to the dance in mem_cgroup_oom_control_write.
>
> [...]
>
>> @@ -5033,6 +5098,7 @@ mem_cgroup_create(struct cgroup *cont)
>> if (parent && parent->use_hierarchy) {
>> res_counter_init(&memcg->res, &parent->res);
>> res_counter_init(&memcg->memsw, &parent->memsw);
>> + res_counter_init(&memcg->kmem, &parent->kmem);
>> /*
>> * We increment refcnt of the parent to ensure that we can
>> * safely access it on res_counter_charge/uncharge.
>> @@ -5043,6 +5109,7 @@ mem_cgroup_create(struct cgroup *cont)
>> } else {
>> res_counter_init(&memcg->res, NULL);
>> res_counter_init(&memcg->memsw, NULL);
>> + res_counter_init(&memcg->kmem, NULL);
>> }
>> memcg->last_scanned_node = MAX_NUMNODES;
>> INIT_LIST_HEAD(&memcg->oom_notify);
>> --
>> 1.7.11.2
>
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v2 05/11] Add a __GFP_KMEMCG flag
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (3 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 04/11] kmem accounting basic infrastructure Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-6-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 13:01 ` [PATCH v2 06/11] memcg: kmem controller infrastructure Glauber Costa
` (5 subsequent siblings)
10 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg, Suleiman Souhlal, Rik van Riel, Mel Gorman
This flag is used to indicate to the callees that this allocation is a
kernel allocation in process context, and should be accounted to
current's memcg. It takes the numerical place of the recently removed
__GFP_NO_KSWAPD.
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
CC: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Mel Gorman <mel-wPRd99KPJ+uzQB+pC5nmwQ@public.gmane.org>
---
include/linux/gfp.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f9bc873..d8eae4d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,11 @@ struct vm_area_struct;
#else
#define ___GFP_NOTRACK 0
#endif
+#ifdef CONFIG_MEMCG_KMEM
+#define ___GFP_KMEMCG 0x400000u
+#else
+#define ___GFP_KMEMCG 0
+#endif
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
@@ -91,7 +96,7 @@ struct vm_area_struct;
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */
-
+#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
/*
* This may seem redundant, but it's a way of annotating false positives vs.
* allocations that simply cannot be supported (e.g. page tables).
--
1.7.11.2
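The effect of defining the flag as 0 when the option is off can be shown with a small user-space sketch. Everything here is an assumption made for illustration: the DEMO_ names, demo_gfp_t and both helpers are invented stand-ins, not kernel symbols, although the bit values mirror the hunk above.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;	/* stand-in for the kernel's gfp_t */

#define DEMO_GFP_WAIT		0x10u
#ifdef DEMO_CONFIG_MEMCG_KMEM		/* hypothetical build switch */
#define DEMO_GFP_KMEMCG		0x400000u
#else
#define DEMO_GFP_KMEMCG		0u	/* flag vanishes when the option is off */
#endif
#define DEMO_GFP_OTHER_NODE	0x800000u

/* With the flag defined as 0 this is constant false and can be folded away. */
static bool demo_wants_kmem_accounting(demo_gfp_t gfp)
{
	return (gfp & DEMO_GFP_KMEMCG) != 0;
}

/* Callers that must not account can mask the bit out unconditionally. */
static demo_gfp_t demo_strip_kmemcg(demo_gfp_t gfp)
{
	return gfp & ~DEMO_GFP_KMEMCG;
}
```

Callers can therefore OR the flag in without any #ifdef of their own: in a build without the option the test is always false and the mask is a no-op.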
^ permalink raw reply related [flat|nested] 135+ messages in thread
* [PATCH v2 06/11] memcg: kmem controller infrastructure
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (4 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 05/11] Add a __GFP_KMEMCG flag Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (2 more replies)
2012-08-09 13:01 ` [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg Glauber Costa
` (4 subsequent siblings)
10 siblings, 3 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg
This patch introduces infrastructure for tracking kernel memory pages
against a given memcg. This happens whenever the caller includes the
__GFP_KMEMCG flag and the task belongs to a memcg other than the root.
In memcontrol.h those functions are wrapped in inline accessors. The
idea is to later patch those with static branches, so we don't incur
any overhead when no mem cgroups with limited kmem are being used.
[ v2: improved comments and standardized function names ]
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
include/linux/memcontrol.h | 79 +++++++++++++++++++
mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 264 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8d9489f..75b247e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
#define _LINUX_MEMCONTROL_H
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
+#include <linux/hardirq.h>
struct mem_cgroup;
struct page_cgroup;
@@ -399,6 +400,11 @@ struct sock;
#ifdef CONFIG_MEMCG_KMEM
void sock_update_memcg(struct sock *sk);
void sock_release_memcg(struct sock *sk);
+
+#define memcg_kmem_on 1
+bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
+void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
+void __memcg_kmem_free_page(struct page *page, int order);
#else
static inline void sock_update_memcg(struct sock *sk)
{
@@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
static inline void sock_release_memcg(struct sock *sk)
{
}
+
+#define memcg_kmem_on 0
+static inline bool
+__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
+{
+ return false;
+}
+
+static inline void __memcg_kmem_free_page(struct page *page, int order)
+{
+}
+
+static inline void
+__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */
+
+/**
+ * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
+ * @gfp: the gfp allocation flags.
+ * @handle: a pointer to the memcg this was charged against.
+ * @order: allocation order.
+ *
+ * returns true if the memcg where the current task belongs can hold this
+ * allocation.
+ *
+ * We return true automatically if this allocation is not to be accounted to
+ * any memcg.
+ */
+static __always_inline bool
+memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
+{
+ if (!memcg_kmem_on)
+ return true;
+ if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
+ return true;
+ if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
+ return true;
+ return __memcg_kmem_new_page(gfp, handle, order);
+}
+
+/**
+ * memcg_kmem_free_page: uncharge pages from memcg
+ * @page: pointer to struct page being freed
+ * @order: allocation order.
+ *
+ * there is no need to specify memcg here, since it is embedded in page_cgroup
+ */
+static __always_inline void
+memcg_kmem_free_page(struct page *page, int order)
+{
+ if (memcg_kmem_on)
+ __memcg_kmem_free_page(page, order);
+}
+
+/**
+ * memcg_kmem_commit_page: embeds correct memcg in a page
+ * @handle: a pointer to the memcg this was charged against.
+ * @page: pointer to struct page recently allocated
+ * @handle: the memcg structure we charged against
+ * @order: allocation order.
+ *
+ * Needs to be called after memcg_kmem_new_page, regardless of success or
+ * failure of the allocation. if @page is NULL, this function will revert the
+ * charges. Otherwise, it will commit the memcg given by @handle to the
+ * corresponding page_cgroup.
+ */
+static __always_inline void
+memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
+{
+ if (memcg_kmem_on)
+ __memcg_kmem_commit_page(page, handle, order);
+}
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54e93de..e9824c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -10,6 +10,10 @@
* Copyright (C) 2009 Nokia Corporation
* Author: Kirill A. Shutemov
*
+ * Kernel Memory Controller
+ * Copyright (C) 2012 Parallels Inc. and Google Inc.
+ * Authors: Glauber Costa and Suleiman Souhlal
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
#include <net/ip.h>
static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
+
void sock_update_memcg(struct sock *sk)
{
if (mem_cgroup_sockets_enabled) {
@@ -488,6 +495,118 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(tcp_proto_cgroup);
#endif /* CONFIG_INET */
+
+static inline bool memcg_kmem_enabled(struct mem_cgroup *memcg)
+{
+ return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
+ memcg->kmem_accounted;
+}
+
+/*
+ * We need to verify if the allocation against current->mm->owner's memcg is
+ * possible for the given order. But the page is not allocated yet, so we'll
+ * need a further commit step to do the final arrangements.
+ *
+ * It is possible for the task to switch cgroups in this mean time, so at
+ * commit time, we can't rely on task conversion any longer. We'll then use
+ * the handle argument to return to the caller which cgroup we should commit
+ * against
+ *
+ * Returning true means the allocation is possible.
+ */
+bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
+{
+ struct mem_cgroup *memcg;
+ struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
+ bool ret = true;
+ size_t size;
+ struct task_struct *p;
+
+ *handle = NULL;
+ rcu_read_lock();
+ p = rcu_dereference(current->mm->owner);
+ memcg = mem_cgroup_from_task(p);
+ if (!memcg_kmem_enabled(memcg))
+ goto out;
+
+ mem_cgroup_get(memcg);
+
+ size = PAGE_SIZE << order;
+ ret = memcg_charge_kmem(memcg, gfp, size) == 0;
+ if (!ret) {
+ mem_cgroup_put(memcg);
+ goto out;
+ }
+
+ *handle = memcg;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL(__memcg_kmem_new_page);
+
+void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *memcg = handle;
+
+ if (!memcg)
+ return;
+
+ WARN_ON(mem_cgroup_is_root(memcg));
+ /* The page allocation must have failed. Revert */
+ if (!page) {
+ size_t size = PAGE_SIZE << order;
+
+ memcg_uncharge_kmem(memcg, size);
+ mem_cgroup_put(memcg);
+ return;
+ }
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ pc->mem_cgroup = memcg;
+ SetPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+}
+
+void __memcg_kmem_free_page(struct page *page, int order)
+{
+ struct mem_cgroup *memcg;
+ size_t size;
+ struct page_cgroup *pc;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ memcg = pc->mem_cgroup;
+ pc->mem_cgroup = NULL;
+ if (!PageCgroupUsed(pc)) {
+ unlock_page_cgroup(pc);
+ return;
+ }
+ ClearPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+
+ /*
+ * Checking if kmem accounted is enabled won't work for uncharge, since
+ * it is possible that the user enabled kmem tracking, allocated, and
+ * then disabled it again.
+ *
+ * We trust if there is a memcg associated with the page, it is a valid
+ * allocation
+ */
+ if (!memcg)
+ return;
+
+ WARN_ON(mem_cgroup_is_root(memcg));
+ size = (1 << order) << PAGE_SHIFT;
+ memcg_uncharge_kmem(memcg, size);
+ mem_cgroup_put(memcg);
+}
+EXPORT_SYMBOL(__memcg_kmem_free_page);
#endif /* CONFIG_MEMCG_KMEM */
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
@@ -5759,3 +5878,69 @@ static int __init enable_swap_account(char *s)
__setup("swapaccount=", enable_swap_account);
#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
+{
+ struct res_counter *fail_res;
+ struct mem_cgroup *_memcg;
+ int ret;
+ bool may_oom;
+ bool nofail = false;
+
+ may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
+ !(gfp & __GFP_NORETRY);
+
+ ret = 0;
+
+ if (!memcg)
+ return ret;
+
+ _memcg = memcg;
+ ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
+ &_memcg, may_oom);
+
+ if (ret == -EINTR) {
+ nofail = true;
+ /*
+ * __mem_cgroup_try_charge() chose to bypass to root due to
+ * OOM kill or fatal signal. Since our only options are to
+ * either fail the allocation or charge it to this cgroup, do
+ * it as a temporary condition. But we can't fail. From a
+ * kmem/slab perspective, the cache has already been selected,
+ * by mem_cgroup_get_kmem_cache(), so it is too late to change
+ * our minds
+ */
+ res_counter_charge_nofail(&memcg->res, delta, &fail_res);
+ if (do_swap_account)
+ res_counter_charge_nofail(&memcg->memsw, delta,
+ &fail_res);
+ ret = 0;
+ } else if (ret == -ENOMEM)
+ return ret;
+
+ if (nofail)
+ res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
+ else
+ ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
+
+ if (ret) {
+ res_counter_uncharge(&memcg->res, delta);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, delta);
+ }
+
+ return ret;
+}
+
+void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta)
+{
+ if (!memcg)
+ return;
+
+ res_counter_uncharge(&memcg->kmem, delta);
+ res_counter_uncharge(&memcg->res, delta);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, delta);
+}
+#endif /* CONFIG_MEMCG_KMEM */
--
1.7.11.2
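The new/commit/free protocol introduced by this patch can be modeled in user space as follows. This is a deliberately simplified sketch: the toy_ names and structures are invented, a single counter stands in for the res/memsw/kmem trio, and refcounting (mem_cgroup_get/put), RCU and page_cgroup locking are all omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Invented stand-ins for struct mem_cgroup and page_cgroup. */
struct toy_memcg { size_t kmem_usage; size_t kmem_limit; };
struct toy_page { struct toy_memcg *owner; };

/* Step 1: charge before allocating; *handle records who was charged. */
static bool toy_new_page(struct toy_memcg *cur, struct toy_memcg **handle,
			 size_t bytes)
{
	*handle = NULL;
	if (!cur)			/* unaccounted task: always allowed */
		return true;
	if (bytes > cur->kmem_limit - cur->kmem_usage)
		return false;
	cur->kmem_usage += bytes;
	*handle = cur;
	return true;
}

/* Step 2: bind the page to the charged group, or revert on failure. */
static void toy_commit_page(struct toy_page *page, struct toy_memcg *handle,
			    size_t bytes)
{
	if (!handle)
		return;
	if (!page) {			/* the allocation failed: undo charge */
		handle->kmem_usage -= bytes;
		return;
	}
	page->owner = handle;
}

/* Step 3: uncharge whoever the page was committed to, if anyone. */
static void toy_free_page(struct toy_page *page, size_t bytes)
{
	struct toy_memcg *memcg = page->owner;

	page->owner = NULL;
	if (memcg)
		memcg->kmem_usage -= bytes;
}

/* Caller side: commit runs whether or not the allocation succeeded. */
static struct toy_page *toy_alloc(struct toy_memcg *cur, size_t bytes,
				  bool simulate_failure)
{
	struct toy_memcg *handle;
	struct toy_page *page;

	if (!toy_new_page(cur, &handle, bytes))
		return NULL;
	page = simulate_failure ? NULL : calloc(1, sizeof(*page));
	toy_commit_page(page, handle, bytes);
	return page;
}
```

The handle is what lets the commit step ignore which group the task belongs to by then: the charge is reverted from, or the page bound to, the group that was actually charged in step 1.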
^ permalink raw reply related [flat|nested] 135+ messages in thread
[parent not found: <1344517279-30646-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
[not found] ` <1344517279-30646-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-10 17:27 ` Kamezawa Hiroyuki
2012-08-13 8:28 ` Glauber Costa
2012-08-14 11:00 ` Glauber Costa
2012-08-14 17:25 ` Michal Hocko
1 sibling, 2 replies; 135+ messages in thread
From: Kamezawa Hiroyuki @ 2012-08-10 17:27 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, Christoph Lameter, David Rientjes, Pekka Enberg,
Pekka Enberg
(2012/08/09 22:01), Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages
> against a given memcg. This happens whenever the caller includes the
> __GFP_KMEMCG flag and the task belongs to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors. The
> idea is to later patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> [ v2: improved comments and standardized function names ]
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> ---
> include/linux/memcontrol.h | 79 +++++++++++++++++++
> mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 264 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
> #define _LINUX_MEMCONTROL_H
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -399,6 +400,11 @@ struct sock;
> #ifdef CONFIG_MEMCG_KMEM
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
> +
> +#define memcg_kmem_on 1
> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> +void __memcg_kmem_free_page(struct page *page, int order);
> #else
> static inline void sock_update_memcg(struct sock *sk)
> {
> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +#define memcg_kmem_on 0
> +static inline bool
> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + return false;
> +}
> +
> +static inline void __memcg_kmem_free_page(struct page *page, int order)
> +{
> +}
> +
> +static inline void
> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> +}
> #endif /* CONFIG_MEMCG_KMEM */
> +
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> + return __memcg_kmem_new_page(gfp, handle, order);
> +}
> +
> +/**
> + * memcg_kmem_free_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_free_page(struct page *page, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_free_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_page: embeds correct memcg in a page
> + * @handle: a pointer to the memcg this was charged against.
> + * @page: pointer to struct page recently allocated
> + * @handle: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_new_page, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @handle to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_commit_page(page, handle, order);
> +}
Shouldn't these two functions have short-cuts, e.g.
if (memcg_kmem_on && handle) ?
Maybe free() needs to access page_cgroup...
> #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
> * Copyright (C) 2009 Nokia Corporation
> * Author: Kirill A. Shutemov
> *
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> #include <net/ip.h>
>
> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
> +
> void sock_update_memcg(struct sock *sk)
> {
> if (mem_cgroup_sockets_enabled) {
> @@ -488,6 +495,118 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
> }
> EXPORT_SYMBOL(tcp_proto_cgroup);
> #endif /* CONFIG_INET */
> +
> +static inline bool memcg_kmem_enabled(struct mem_cgroup *memcg)
> +{
> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
> + memcg->kmem_accounted;
> +}
> +
> +/*
> + * We need to verify if the allocation against current->mm->owner's memcg is
> + * possible for the given order. But the page is not allocated yet, so we'll
> + * need a further commit step to do the final arrangements.
> + *
> + * It is possible for the task to switch cgroups in this mean time, so at
> + * commit time, we can't rely on task conversion any longer. We'll then use
> + * the handle argument to return to the caller which cgroup we should commit
> + * against
> + *
> + * Returning true means the allocation is possible.
> + */
> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
> +{
> + struct mem_cgroup *memcg;
> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
> + bool ret = true;
> + size_t size;
> + struct task_struct *p;
> +
> + *handle = NULL;
> + rcu_read_lock();
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);
> + if (!memcg_kmem_enabled(memcg))
> + goto out;
> +
> + mem_cgroup_get(memcg);
> +
This mem_cgroup_get() will be a potential performance problem.
Don't you have a good idea for avoiding the atomic counter access here?
I think some kind of percpu counter, or a feature to disable "move task",
would help.
> + size = PAGE_SIZE << order;
> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;
> + if (!ret) {
> + mem_cgroup_put(memcg);
> + goto out;
> + }
> +
> + *handle = memcg;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +EXPORT_SYMBOL(__memcg_kmem_new_page);
> +
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
> +{
> + struct page_cgroup *pc;
> + struct mem_cgroup *memcg = handle;
> +
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + /* The page allocation must have failed. Revert */
> + if (!page) {
> + size_t size = PAGE_SIZE << order;
> +
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> + return;
> + }
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +}
> +
> +void __memcg_kmem_free_page(struct page *page, int order)
> +{
> + struct mem_cgroup *memcg;
> + size_t size;
> + struct page_cgroup *pc;
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + memcg = pc->mem_cgroup;
> + pc->mem_cgroup = NULL;
Shouldn't this happen after checking the "Used" bit?
Ah, BTW, why do you need to clear pc->memcg?
> + if (!PageCgroupUsed(pc)) {
> + unlock_page_cgroup(pc);
> + return;
> + }
> + ClearPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +
> + /*
> + * Checking if kmem accounted is enabled won't work for uncharge, since
> + * it is possible that the user enabled kmem tracking, allocated, and
> + * then disabled it again.
> + *
> + * We trust if there is a memcg associated with the page, it is a valid
> + * allocation
> + */
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + size = (1 << order) << PAGE_SHIFT;
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
Why do we need ref-counting here? Can't the kmem res_counter work as the
reference?
> +}
> +EXPORT_SYMBOL(__memcg_kmem_free_page);
> #endif /* CONFIG_MEMCG_KMEM */
>
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> @@ -5759,3 +5878,69 @@ static int __init enable_swap_account(char *s)
> __setup("swapaccount=", enable_swap_account);
>
> #endif
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
> +{
What does 'delta' mean?
> + struct res_counter *fail_res;
> + struct mem_cgroup *_memcg;
> + int ret;
> + bool may_oom;
> + bool nofail = false;
> +
> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
> + !(gfp & __GFP_NORETRY);
> +
> + ret = 0;
> +
> + if (!memcg)
> + return ret;
> +
> + _memcg = memcg;
> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> + &_memcg, may_oom);
> +
> + if (ret == -EINTR) {
> + nofail = true;
> + /*
> + * __mem_cgroup_try_charge() chosed to bypass to root due to
> + * OOM kill or fatal signal. Since our only options are to
> + * either fail the allocation or charge it to this cgroup, do
> + * it as a temporary condition. But we can't fail. From a
> + * kmem/slab perspective, the cache has already been selected,
> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
> + * our minds
> + */
> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
> + if (do_swap_account)
> + res_counter_charge_nofail(&memcg->memsw, delta,
> + &fail_res);
> + ret = 0;
Hm, you return 0 and this charge may never be uncharged... right?
Thanks,
-Kame
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-10 17:27 ` Kamezawa Hiroyuki
@ 2012-08-13 8:28 ` Glauber Costa
2012-08-14 18:58 ` Greg Thelen
[not found] ` <5028BA9E.7000302-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-14 11:00 ` Glauber Costa
1 sibling, 2 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-13 8:28 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg
>> > + * Needs to be called after memcg_kmem_new_page, regardless of success or
>> > + * failure of the allocation. if @page is NULL, this function will revert the
>> > + * charges. Otherwise, it will commit the memcg given by @handle to the
>> > + * corresponding page_cgroup.
>> > + */
>> > +static __always_inline void
>> > +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> > +{
>> > + if (memcg_kmem_on)
>> > + __memcg_kmem_commit_page(page, handle, order);
>> > +}
> Don't these two functions have any shortcuts?
Sorry kame, what exactly do you mean?
> if (memcg_kmem_on && handle) ?
I guess this can be done to avoid a function call.
> Maybe free() needs to access page_cgroup...
>
Can you also be a bit more specific here?
>> > +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>> > +{
>> > + struct mem_cgroup *memcg;
>> > + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>> > + bool ret = true;
>> > + size_t size;
>> > + struct task_struct *p;
>> > +
>> > + *handle = NULL;
>> > + rcu_read_lock();
>> > + p = rcu_dereference(current->mm->owner);
>> > + memcg = mem_cgroup_from_task(p);
>> > + if (!memcg_kmem_enabled(memcg))
>> > + goto out;
>> > +
>> > + mem_cgroup_get(memcg);
>> > +
> This mem_cgroup_get() will be a potential performance problem.
> Don't you have a good idea for avoiding the atomic counter access here?
> I think some kind of percpu counter, or a feature to disable "move task",
> would help.
>> > + pc = lookup_page_cgroup(page);
>> > + lock_page_cgroup(pc);
>> > + pc->mem_cgroup = memcg;
>> > + SetPageCgroupUsed(pc);
>> > + unlock_page_cgroup(pc);
>> > +}
>> > +
>> > +void __memcg_kmem_free_page(struct page *page, int order)
>> > +{
>> > + struct mem_cgroup *memcg;
>> > + size_t size;
>> > + struct page_cgroup *pc;
>> > +
>> > + if (mem_cgroup_disabled())
>> > + return;
>> > +
>> > + pc = lookup_page_cgroup(page);
>> > + lock_page_cgroup(pc);
>> > + memcg = pc->mem_cgroup;
>> > + pc->mem_cgroup = NULL;
> Shouldn't this happen after checking the "Used" bit?
> Ah, BTW, why do you need to clear pc->memcg?
As for clearing pc->memcg, I think I'm just being overzealous. I can't
foresee any problems due to removing it.
As for the Used bit, what difference does it make when we clear it?
>> > + if (!PageCgroupUsed(pc)) {
>> > + unlock_page_cgroup(pc);
>> > + return;
>> > + }
>> > + ClearPageCgroupUsed(pc);
>> > + unlock_page_cgroup(pc);
>> > +
>> > + /*
>> > + * Checking if kmem accounted is enabled won't work for uncharge, since
>> > + * it is possible that the user enabled kmem tracking, allocated, and
>> > + * then disabled it again.
>> > + *
>> > + * We trust if there is a memcg associated with the page, it is a valid
>> > + * allocation
>> > + */
>> > + if (!memcg)
>> > + return;
>> > +
>> > + WARN_ON(mem_cgroup_is_root(memcg));
>> > + size = (1 << order) << PAGE_SHIFT;
>> > + memcg_uncharge_kmem(memcg, size);
>> > + mem_cgroup_put(memcg);
> Why do we need ref-counting here? Can't the kmem res_counter work as the
> reference?
This is of course the pair of the mem_cgroup_get() you commented on
earlier. If we need one, we need the other. If we don't need one, we
don't need the other =)
The guarantee we're trying to give here is that the memcg structure will
stay around while there are dangling charges to kmem, that we decided
not to move (remember: moving it for the stack is simple, for the slab
is very complicated and ill-defined, and I believe it is better to treat
all kmem equally here)
So maybe we can be clever here, and avoid reference counting at all
times. We call mem_cgroup_get() when the first charge occurs, and then
go for mem_cgroup_put() when our count reaches 0.
What do you think about that?
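
A minimal userspace sketch of the idea proposed above, assuming a
single-threaded model in which a plain byte count stands in for the kmem
res_counter. The names group_model, model_charge and model_uncharge are
hypothetical and are not part of the patch; a real implementation would
also have to deal with concurrent charges, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct group_model {
	int refcnt;        /* stands in for the memcg reference count */
	size_t kmem_usage; /* stands in for the kmem res_counter usage */
};

/* Take one reference only when usage goes from zero to non-zero. */
static void model_charge(struct group_model *g, size_t bytes)
{
	assert(bytes > 0);
	if (g->kmem_usage == 0)
		g->refcnt++;
	g->kmem_usage += bytes;
}

/* Drop that reference only when the last outstanding charge goes away. */
static void model_uncharge(struct group_model *g, size_t bytes)
{
	assert(bytes > 0 && bytes <= g->kmem_usage);
	g->kmem_usage -= bytes;
	if (g->kmem_usage == 0)
		g->refcnt--;
}
```

With this shape, every charge between the first and the last only touches
the usage counter, so the reference count is modified twice per "busy
period" rather than twice per page.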
>> > +#ifdef CONFIG_MEMCG_KMEM
>> > +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
>> > +{
> What does 'delta' mean?
>
I can change it to something like nr_bytes, more informative.
>> > + struct res_counter *fail_res;
>> > + struct mem_cgroup *_memcg;
>> > + int ret;
>> > + bool may_oom;
>> > + bool nofail = false;
>> > +
>> > + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
>> > + !(gfp & __GFP_NORETRY);
>> > +
>> > + ret = 0;
>> > +
>> > + if (!memcg)
>> > + return ret;
>> > +
>> > + _memcg = memcg;
>> > + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>> > + &_memcg, may_oom);
>> > +
>> > + if (ret == -EINTR) {
>> > + nofail = true;
>> > + /*
>> > + * __mem_cgroup_try_charge() chosed to bypass to root due to
>> > + * OOM kill or fatal signal. Since our only options are to
>> > + * either fail the allocation or charge it to this cgroup, do
>> > + * it as a temporary condition. But we can't fail. From a
>> > + * kmem/slab perspective, the cache has already been selected,
>> > + * by mem_cgroup_get_kmem_cache(), so it is too late to change
>> > + * our minds
>> > + */
>> > + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
>> > + if (do_swap_account)
>> > + res_counter_charge_nofail(&memcg->memsw, delta,
>> > + &fail_res);
>> > + ret = 0;
> Hm, you return 0 and this charge may never be uncharged... right?
>
Can't see why. By returning 0 we inform our caller that the allocation
succeeded. It is up to him to undo it later through a call to uncharge.
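
The caller contract described here can be sketched in userspace as
follows. This is only an illustration under stated assumptions: cg_model
and the model_* helpers are hypothetical stand-ins for the
memcg_kmem_new_page / memcg_kmem_commit_page / memcg_kmem_free_page
triple, the page size is assumed to be 4096 bytes, and a single byte
counter replaces the res_counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE ((size_t)4096) /* assumed page size */

/* Hypothetical stand-in for a memcg with a single byte counter. */
struct cg_model {
	size_t usage;
	size_t limit;
};

static size_t model_size(int order)
{
	return MODEL_PAGE_SIZE << order;
}

/* Step 1: charge before allocating. On success *handle names the group. */
static bool model_new_page(struct cg_model *cg, struct cg_model **handle,
			   int order)
{
	size_t size = model_size(order);

	*handle = NULL;
	if (size > cg->limit - cg->usage)
		return false;
	cg->usage += size;
	*handle = cg;
	return true;
}

/*
 * Step 2: always called after the allocation attempt. A NULL page means
 * the allocation failed, so the charge taken in step 1 is reverted.
 */
static void model_commit_page(void *page, struct cg_model *handle, int order)
{
	if (!handle)
		return; /* nothing was charged */
	if (!page) {
		assert(handle->usage >= model_size(order));
		handle->usage -= model_size(order);
	}
}

/* Step 3: the caller undoes a committed charge when it frees the page. */
static void model_free_page(struct cg_model *owner, int order)
{
	assert(owner->usage >= model_size(order));
	owner->usage -= model_size(order);
}
```

In this model a successful charge is never dropped implicitly: it is
either reverted in step 2 or released in step 3, and both of those are
the caller's responsibility.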
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-13 8:28 ` Glauber Costa
@ 2012-08-14 18:58 ` Greg Thelen
[not found] ` <xr93ipcl9u7x.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
[not found] ` <5028BA9E.7000302-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
1 sibling, 1 reply; 135+ messages in thread
From: Greg Thelen @ 2012-08-14 18:58 UTC (permalink / raw)
To: Glauber Costa
Cc: Kamezawa Hiroyuki, linux-kernel, linux-mm, cgroups, devel,
Michal Hocko, Johannes Weiner, Andrew Morton, Christoph Lameter,
David Rientjes, Pekka Enberg, Pekka Enberg
On Mon, Aug 13 2012, Glauber Costa wrote:
>>> > + WARN_ON(mem_cgroup_is_root(memcg));
>>> > + size = (1 << order) << PAGE_SHIFT;
>>> > + memcg_uncharge_kmem(memcg, size);
>>> > + mem_cgroup_put(memcg);
>> Why do we need ref-counting here? Can't the kmem res_counter work as the
>> reference?
> This is of course the pair of the mem_cgroup_get() you commented on
> earlier. If we need one, we need the other. If we don't need one, we
> don't need the other =)
>
> The guarantee we're trying to give here is that the memcg structure will
> stay around while there are dangling charges to kmem, that we decided
> not to move (remember: moving it for the stack is simple, for the slab
> is very complicated and ill-defined, and I believe it is better to treat
> all kmem equally here)
By keeping memcg structures hanging around until the last referring kmem
page is uncharged, does each such zombie memcg consume a css_id and thus
put pressure on the 64k css_id space? I imagine in pathological cases
this would prevent creation of new cgroups until these zombies are
dereferenced.
Is there any way to see how much kmem such zombie memcgs are consuming?
I think we could find these with
for_each_mem_cgroup_tree(root_mem_cgroup). Basically, I'm wanting to
know where kernel memory has been allocated. For live memcg, an admin
can cat memory.kmem.usage_in_bytes. But for zombie memcg, I'm not sure
how to get this info. It looks like the root_mem_cgroup
memory.kmem.usage_in_bytes is not hierarchically charged.
^ permalink raw reply [flat|nested] 135+ messages in thread
[parent not found: <5028BA9E.7000302-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
[not found] ` <5028BA9E.7000302-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-17 2:36 ` Kamezawa Hiroyuki
[not found] ` <502DAE2A.1000404-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Kamezawa Hiroyuki @ 2012-08-17 2:36 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, Christoph Lameter, David Rientjes, Pekka Enberg,
Pekka Enberg
(2012/08/13 17:28), Glauber Costa wrote:
>>>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>>>> + * failure of the allocation. if @page is NULL, this function will revert the
>>>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>>>> + * corresponding page_cgroup.
>>>> + */
>>>> +static __always_inline void
>>>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>>>> +{
>>>> + if (memcg_kmem_on)
>>>> + __memcg_kmem_commit_page(page, handle, order);
>>>> +}
>> Don't these two functions have any shortcuts?
>
> Sorry kame, what exactly do you mean?
>
I meant avoiding a function call. But please ignore this; I had missed the
following patches.
>> if (memcg_kmem_on && handle) ?
> I guess this can be done to avoid a function call.
>
>> Maybe free() needs to access page_cgroup...
>>
> Can you also be a bit more specific here?
>
Please ignore, I misunderstood the usage of free_accounted_pages().
>>>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>>>> +{
>>>> + struct mem_cgroup *memcg;
>>>> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>>>> + bool ret = true;
>>>> + size_t size;
>>>> + struct task_struct *p;
>>>> +
>>>> + *handle = NULL;
>>>> + rcu_read_lock();
>>>> + p = rcu_dereference(current->mm->owner);
>>>> + memcg = mem_cgroup_from_task(p);
>>>> + if (!memcg_kmem_enabled(memcg))
>>>> + goto out;
>>>> +
>>>> + mem_cgroup_get(memcg);
>>>> +
>> This mem_cgroup_get() will be a potential performance problem.
>> Don't you have a good idea for avoiding the atomic counter access here?
>> I think some kind of percpu counter, or a feature to disable "move task",
>> would help.
>
>
>
>
>>>> + pc = lookup_page_cgroup(page);
>>>> + lock_page_cgroup(pc);
>>>> + pc->mem_cgroup = memcg;
>>>> + SetPageCgroupUsed(pc);
>>>> + unlock_page_cgroup(pc);
>>>> +}
>>>> +
>>>> +void __memcg_kmem_free_page(struct page *page, int order)
>>>> +{
>>>> + struct mem_cgroup *memcg;
>>>> + size_t size;
>>>> + struct page_cgroup *pc;
>>>> +
>>>> + if (mem_cgroup_disabled())
>>>> + return;
>>>> +
>>>> + pc = lookup_page_cgroup(page);
>>>> + lock_page_cgroup(pc);
>>>> + memcg = pc->mem_cgroup;
>>>> + pc->mem_cgroup = NULL;
>
>> Shouldn't this happen after checking the "Used" bit?
>> Ah, BTW, why do you need to clear pc->memcg?
>
> As for clearing pc->memcg, I think I'm just being overzealous. I can't
> foresee any problems due to removing it.
>
> As for the Used bit, what difference does it make when we clear it?
>
I just want to see the same logic that is used in
mem_cgroup_uncharge_common().
Hmm, when setting pc->mem_cgroup, things happen in this order:
  set pc->mem_cgroup
  set Used bit
So if you clear pc->mem_cgroup, the reverse order
  unset Used bit
  clear pc->mem_cgroup
seems reasonable.
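
The ordering suggested above can be sketched as a small userspace model.
Everything here is an assumption made for illustration: pc_model is a
hypothetical stand-in for struct page_cgroup, the "lock" is just a flag
that catches unbalanced use in a single thread, and pc_commit/pc_release
are not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_cgroup; not the kernel type. */
struct pc_model {
	bool locked; /* models lock_page_cgroup()/unlock_page_cgroup() */
	bool used;   /* models the Used bit */
	void *owner; /* models pc->mem_cgroup */
};

static void pc_lock(struct pc_model *pc)
{
	assert(!pc->locked);
	pc->locked = true;
}

static void pc_unlock(struct pc_model *pc)
{
	assert(pc->locked);
	pc->locked = false;
}

/* Commit: publish the owner first, then mark the entry as used. */
static void pc_commit(struct pc_model *pc, void *owner)
{
	pc_lock(pc);
	pc->owner = owner;
	pc->used = true;
	pc_unlock(pc);
}

/*
 * Release: mirror image of commit. Check the Used bit first, clear it,
 * and only then clear the owner. Returns the previous owner, or NULL if
 * the entry was never committed (in which case nothing is modified).
 */
static void *pc_release(struct pc_model *pc)
{
	void *owner = NULL;

	pc_lock(pc);
	if (pc->used) {
		pc->used = false;
		owner = pc->owner;
		pc->owner = NULL;
	}
	pc_unlock(pc);
	return owner;
}
```

The point of the sketch is only the symmetry: an entry whose Used bit is
not set is left untouched, instead of having its owner pointer cleared
before the bit is even tested.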
>>>> + if (!PageCgroupUsed(pc)) {
>>>> + unlock_page_cgroup(pc);
>>>> + return;
>>>> + }
>>>> + ClearPageCgroupUsed(pc);
>>>> + unlock_page_cgroup(pc);
>>>> +
>>>> + /*
>>>> + * Checking if kmem accounted is enabled won't work for uncharge, since
>>>> + * it is possible that the user enabled kmem tracking, allocated, and
>>>> + * then disabled it again.
>>>> + *
>>>> + * We trust if there is a memcg associated with the page, it is a valid
>>>> + * allocation
>>>> + */
>>>> + if (!memcg)
>>>> + return;
>>>> +
>>>> + WARN_ON(mem_cgroup_is_root(memcg));
>>>> + size = (1 << order) << PAGE_SHIFT;
>>>> + memcg_uncharge_kmem(memcg, size);
>>>> + mem_cgroup_put(memcg);
>> Why do we need ref-counting here? Can't the kmem res_counter work as the
>> reference?
> This is of course the pair of the mem_cgroup_get() you commented on
> earlier. If we need one, we need the other. If we don't need one, we
> don't need the other =)
>
> The guarantee we're trying to give here is that the memcg structure will
> stay around while there are dangling charges to kmem, that we decided
> not to move (remember: moving it for the stack is simple, for the slab
> is very complicated and ill-defined, and I believe it is better to treat
> all kmem equally here)
>
> So maybe we can be clever here, and avoid reference counting at all
> times. We call mem_cgroup_get() when the first charge occurs, and then
> go for mem_cgroup_put() when our count reaches 0.
>
> What do you think about that?
>
I think that should work. I don't want to add unoptimized atomic counter ops
to this very hot path.
>
>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
>>>> +{
>> What does 'delta' mean?
>>
> I can change it to something like nr_bytes, more informative.
>
>>>> + struct res_counter *fail_res;
>>>> + struct mem_cgroup *_memcg;
>>>> + int ret;
>>>> + bool may_oom;
>>>> + bool nofail = false;
>>>> +
>>>> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
>>>> + !(gfp & __GFP_NORETRY);
>>>> +
>>>> + ret = 0;
>>>> +
>>>> + if (!memcg)
>>>> + return ret;
>>>> +
>>>> + _memcg = memcg;
>>>> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>>>> + &_memcg, may_oom);
>>>> +
>>>> + if (ret == -EINTR) {
>>>> + nofail = true;
>>>> + /*
>>>> + * __mem_cgroup_try_charge() chosed to bypass to root due to
>>>> + * OOM kill or fatal signal. Since our only options are to
>>>> + * either fail the allocation or charge it to this cgroup, do
>>>> + * it as a temporary condition. But we can't fail. From a
>>>> + * kmem/slab perspective, the cache has already been selected,
>>>> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
>>>> + * our minds
>>>> + */
>>>> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
>>>> + if (do_swap_account)
>>>> + res_counter_charge_nofail(&memcg->memsw, delta,
>>>> + &fail_res);
>>>> + ret = 0;
>> Hm, you return 0 and this charge may never be uncharged... right?
>>
>
> Can't see why. By returning 0 we inform our caller that the allocation
> succeeded. It is up to him to undo it later through a call to uncharge.
>
Hmm, okay. You trust callers.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-10 17:27 ` Kamezawa Hiroyuki
2012-08-13 8:28 ` Glauber Costa
@ 2012-08-14 11:00 ` Glauber Costa
1 sibling, 0 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-14 11:00 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg
On 08/10/2012 09:27 PM, Kamezawa Hiroyuki wrote:
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>> > +{
>> > + struct mem_cgroup *memcg;
>> > + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>> > + bool ret = true;
>> > + size_t size;
>> > + struct task_struct *p;
>> > +
>> > + *handle = NULL;
>> > + rcu_read_lock();
>> > + p = rcu_dereference(current->mm->owner);
>> > + memcg = mem_cgroup_from_task(p);
>> > + if (!memcg_kmem_enabled(memcg))
>> > + goto out;
>> > +
>> > + mem_cgroup_get(memcg);
>> > +
> This mem_cgroup_get() will be a potential performance problem.
> Don't you have a good idea for avoiding the atomic counter access here?
> I think some kind of percpu counter, or a feature to disable "move task",
> would help.
>
>
I have just sent out a proposal to deal with this. I tried the trick of
marking only the first charge and last uncharge, and it works quite
alright at the cost of a bit test on most calls to memcg_kmem_charge.
Please let me know what you think.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
[not found] ` <1344517279-30646-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-10 17:27 ` Kamezawa Hiroyuki
@ 2012-08-14 17:25 ` Michal Hocko
2012-08-15 9:42 ` Glauber Costa
1 sibling, 1 reply; 135+ messages in thread
From: Michal Hocko @ 2012-08-14 17:25 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Andrew Morton,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Christoph Lameter,
David Rientjes, Pekka Enberg, Pekka Enberg
On Thu 09-08-12 17:01:14, Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors. The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> [ v2: improved comments and standardized function names ]
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> ---
> include/linux/memcontrol.h | 79 +++++++++++++++++++
> mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 264 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
[...]
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
OK, I see the point behind __GFP_NOFAIL but it would deserve a comment
or a mention in the changelog.
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> +EXPORT_SYMBOL(__memcg_kmem_new_page);
Why is this exported?
> +
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
> +{
> + struct page_cgroup *pc;
> + struct mem_cgroup *memcg = handle;
> +
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + /* The page allocation must have failed. Revert */
> + if (!page) {
> + size_t size = PAGE_SIZE << order;
> +
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> + return;
> + }
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);
Don't we need a write barrier before assigning memcg? Same as
__mem_cgroup_commit_charge. This tests the Used bit always from within
lock_page_cgroup so it should be safe but I am not 100% sure about the
rest of the code.
[...]
> +EXPORT_SYMBOL(__memcg_kmem_free_page);
Why is the symbol exported?
> #endif /* CONFIG_MEMCG_KMEM */
>
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> @@ -5759,3 +5878,69 @@ static int __init enable_swap_account(char *s)
> __setup("swapaccount=", enable_swap_account);
>
> #endif
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
> +{
> + struct res_counter *fail_res;
> + struct mem_cgroup *_memcg;
> + int ret;
> + bool may_oom;
> + bool nofail = false;
> +
> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
> + !(gfp & __GFP_NORETRY);
This deserves a comment.
> +
> + ret = 0;
> +
> + if (!memcg)
> + return ret;
> +
> + _memcg = memcg;
> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> + &_memcg, may_oom);
This is really dangerous: atomic allocations, which seem to be possible
here, could deadlock because of the reclaim. Also, as I mentioned in the
other email in this thread, why should we reclaim just because of a
kernel allocation when we are not reclaiming any kernel memory, given
that shrink_slab is ignored in the memcg reclaim?
> +
> + if (ret == -EINTR) {
> + nofail = true;
> + /*
> + * __mem_cgroup_try_charge() chosed to bypass to root due to
> + * OOM kill or fatal signal. Since our only options are to
> + * either fail the allocation or charge it to this cgroup, do
> + * it as a temporary condition. But we can't fail. From a
> + * kmem/slab perspective, the cache has already been selected,
> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
> + * our minds
> + */
> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
> + if (do_swap_account)
> + res_counter_charge_nofail(&memcg->memsw, delta,
> + &fail_res);
Hmmm, this is kind of ugly but I guess unavoidable with the current
implementation. Oh well...
> + ret = 0;
> + } else if (ret == -ENOMEM)
> + return ret;
> +
> + if (nofail)
> + res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
> + else
> + ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
> +
> + if (ret) {
> + res_counter_uncharge(&memcg->res, delta);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, delta);
> + }
> +
> + return ret;
> +}
> +
[...]
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-14 17:25 ` Michal Hocko
@ 2012-08-15 9:42 ` Glauber Costa
2012-08-15 10:44 ` Glauber Costa
[not found] ` <502B6F00.8040207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 2 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-15 9:42 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg
>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>> + * @gfp: the gfp allocation flags.
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @order: allocation order.
>> + *
>> + * returns true if the memcg where the current task belongs can hold this
>> + * allocation.
>> + *
>> + * We return true automatically if this allocation is not to be accounted to
>> + * any memcg.
>> + */
>> +static __always_inline bool
>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> + if (!memcg_kmem_on)
>> + return true;
>> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>
> OK, I see the point behind __GFP_NOFAIL but it would deserve a comment
> or a mention in the changelog.
documentation can't hurt!
Just added.
> [...]
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 54e93de..e9824c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> +EXPORT_SYMBOL(__memcg_kmem_new_page);
>
> Why is this exported?
>
It shouldn't be. Removed.
>> +
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
>> +{
>> + struct page_cgroup *pc;
>> + struct mem_cgroup *memcg = handle;
>> +
>> + if (!memcg)
>> + return;
>> +
>> + WARN_ON(mem_cgroup_is_root(memcg));
>> + /* The page allocation must have failed. Revert */
>> + if (!page) {
>> + size_t size = PAGE_SIZE << order;
>> +
>> + memcg_uncharge_kmem(memcg, size);
>> + mem_cgroup_put(memcg);
>> + return;
>> + }
>> +
>> + pc = lookup_page_cgroup(page);
>> + lock_page_cgroup(pc);
>> + pc->mem_cgroup = memcg;
>> + SetPageCgroupUsed(pc);
>
> Don't we need a write barrier before assigning memcg? Same as
> __mem_cgroup_commit_charge. This tests the Used bit always from within
> lock_page_cgroup so it should be safe but I am not 100% sure about the
> rest of the code.
>
Well, I don't see the reason, precisely because we'll always grab it
from within the locked region. That should ensure all the necessary
serialization.
>> +#ifdef CONFIG_MEMCG_KMEM
>> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
>> +{
>> + struct res_counter *fail_res;
>> + struct mem_cgroup *_memcg;
>> + int ret;
>> + bool may_oom;
>> + bool nofail = false;
>> +
>> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
>> + !(gfp & __GFP_NORETRY);
>
> This deserves a comment.
>
can't hurt!! =)
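
For illustration, a commented version of that condition might look like
the sketch below. The flag values are hypothetical placeholders (the real
ones are defined by the kernel's gfp headers), model_may_oom is not a
function from the patch, and the wording of the comment is an assumption
about the intent rather than the comment that was actually added.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch only. */
#define MODEL_GFP_WAIT    0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_NORETRY 0x4u
#define MODEL_GFP_ALL     (MODEL_GFP_WAIT | MODEL_GFP_FS | MODEL_GFP_NORETRY)

/*
 * The charge path may invoke the OOM killer only when all of these hold:
 *  - the caller is allowed to sleep (WAIT),
 *  - the caller is allowed to call into the filesystem (FS),
 *  - the caller has not asked to give up without retrying (NORETRY).
 */
static bool model_may_oom(unsigned int gfp)
{
	assert((gfp & ~MODEL_GFP_ALL) == 0); /* only modelled flags allowed */

	return (gfp & MODEL_GFP_WAIT) && (gfp & MODEL_GFP_FS) &&
	       !(gfp & MODEL_GFP_NORETRY);
}
```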
>> +
>> + ret = 0;
>> +
>> + if (!memcg)
>> + return ret;
>> +
>> + _memcg = memcg;
>> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>> + &_memcg, may_oom);
>
> This is really dangerous: atomic allocations, which seem to be possible
> here, could deadlock because of the reclaim.
Can you elaborate on how this would happen?
> Also, as I
> have mentioned in the other email in this thread. Why should we reclaim
> just because of kernel allocation when we are not reclaiming any of it
> because shrink_slab is ignored in the memcg reclaim.
Don't get too distracted by the fact that shrink_slab is ignored. It is
temporary, and while ignoring it for now leads to suboptimal behavior,
it will, first, only affect its users and, second, not be disastrous.
I see this as more or less on par with the soft limit reclaim
problem we had. It is not ideal, but it already provided functionality
>> +
>> + if (ret == -EINTR) {
>> + nofail = true;
>> + /*
>> + * __mem_cgroup_try_charge() chose to bypass to root due to
>> + * OOM kill or fatal signal. Since our only options are to
>> + * either fail the allocation or charge it to this cgroup, do
>> + * it as a temporary condition. But we can't fail. From a
>> + * kmem/slab perspective, the cache has already been selected,
>> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
>> + * our minds
>> + */
>> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
>> + if (do_swap_account)
>> + res_counter_charge_nofail(&memcg->memsw, delta,
>> + &fail_res);
>
> Hmmm, this is kind of ugly but I guess unavoidable with the current
> implementation. Oh well...
>
Oh well...
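
The difference between a regular charge and the "nofail" charge used in the -EINTR path can be modelled in a few lines. This is a simplified userspace sketch under stated assumptions: demo_counter, demo_charge() and demo_charge_nofail() are hypothetical names, and the model ignores hierarchy and locking, which the real res_counter code has to handle.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a res_counter. */
struct demo_counter {
	unsigned long long usage;
	unsigned long long limit;
};

/* Regular charge: refuses to exceed the limit and leaves usage untouched. */
static bool demo_charge(struct demo_counter *c, unsigned long long val)
{
	if (c->usage + val > c->limit)
		return false;
	c->usage += val;
	return true;
}

/*
 * "nofail" charge: always accounts the memory and only reports whether the
 * limit was exceeded as a result.
 */
static bool demo_charge_nofail(struct demo_counter *c, unsigned long long val)
{
	c->usage += val;
	return c->usage <= c->limit;
}
```

In this model the nofail variant can push usage above the limit, which matches the "temporary condition" described in the quoted comment.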
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-15 9:42 ` Glauber Costa
@ 2012-08-15 10:44 ` Glauber Costa
[not found] ` <502B6F00.8040207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
1 sibling, 0 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-15 10:44 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg
On 08/15/2012 01:42 PM, Glauber Costa wrote:
>> Also, as I have mentioned in the other email in this thread. Why
>> should we reclaim just because of kernel allocation when we are not
>> reclaiming any of it because shrink_slab is ignored in the memcg
>> reclaim.
>
> Don't get too distracted by the fact that shrink_slab is ignored. It is
> temporary, and while this being ignored now leads to suboptimal
> behavior, it will 1st, only affect its users, and 2nd, not be disastrous.
>
> I see it this as more or less on pair with the soft limit reclaim
> problem we had. It is not ideal, but it already provided functionality
>
Okay, I sent the e-mail before finishing it... duh
What I meant in that last sentence is that, until the memcg-aware
shrinkers land in the kernel, the situation is more or less the same
(obviously not exactly) as with the soft limit reclaim work. It is an
evolutionary approach that provides some functionality that is not yet
perfect but already solves lots of problems for people willing to live
with its temporary drawbacks.
^ permalink raw reply [flat|nested] 135+ messages in thread
[parent not found: <502B6F00.8040207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
[not found] ` <502B6F00.8040207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-15 13:09 ` Michal Hocko
2012-08-15 14:01 ` Glauber Costa
0 siblings, 1 reply; 135+ messages in thread
From: Michal Hocko @ 2012-08-15 13:09 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Andrew Morton,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Christoph Lameter,
David Rientjes, Pekka Enberg, Pekka Enberg
On Wed 15-08-12 13:42:24, Glauber Costa wrote:
[...]
> >> +
> >> + ret = 0;
> >> +
> >> + if (!memcg)
> >> + return ret;
> >> +
> >> + _memcg = memcg;
> >> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> >> + &_memcg, may_oom);
> >
> > This is really dangerous because atomic allocation which seem to be
> > possible could result in deadlocks because of the reclaim.
>
> Can you elaborate on how this would happen?
Say you have an atomic allocation and we hit the limit. We then end up
either in reclaim, which can sleep, or in the OOM path, which can sleep
as well (depending on oom_control).
> > Also, as I have mentioned in the other email in this thread. Why
> > should we reclaim just because of kernel allocation when we are not
> > reclaiming any of it because shrink_slab is ignored in the memcg
> > reclaim.
>
> Don't get too distracted by the fact that shrink_slab is ignored. It is
> temporary, and while this being ignored now leads to suboptimal
> behavior, it will 1st, only affect its users, and 2nd, not be disastrous.
It's not just about shrink_slab; it is also about triggering memcg-oom,
which doesn't consider kmem-accounted memory, so the wrong tasks could
be killed. It is true that the impact is confined to the group
(hierarchy), so you are right that it won't be disastrous.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-15 13:09 ` Michal Hocko
@ 2012-08-15 14:01 ` Glauber Costa
[not found] ` <502BABCF.7020608-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-15 14:01 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg
On 08/15/2012 05:09 PM, Michal Hocko wrote:
> On Wed 15-08-12 13:42:24, Glauber Costa wrote:
> [...]
>>>> +
>>>> + ret = 0;
>>>> +
>>>> + if (!memcg)
>>>> + return ret;
>>>> +
>>>> + _memcg = memcg;
>>>> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>>>> + &_memcg, may_oom);
>>>
>>> This is really dangerous because atomic allocation which seem to be
>>> possible could result in deadlocks because of the reclaim.
>>
>> Can you elaborate on how this would happen?
>
> Say you have an atomic allocation and we hit the limit so we get either
> to reclaim which can sleep or to oom which can sleep as well (depending
> on the oom_control).
>
I see now, you seem to be right.
How about we change the following code in mem_cgroup_do_charge:

	if (gfp_mask & __GFP_NORETRY)
		return CHARGE_NOMEM;

to:

	if ((gfp_mask & __GFP_NORETRY) || (gfp_mask & __GFP_ATOMIC))
		return CHARGE_NOMEM;

Would this take care of the issue?
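
The idea of bailing out before any path that may sleep can be sketched in userspace as follows. This is an illustration under assumptions, not the change that was merged: demo_res, demo_do_charge() and the DEMO_* names are hypothetical, and the sketch keys off a "may sleep" bit (modelled on the __GFP_WAIT test already used for may_oom in the patch) rather than an "atomic" bit, a choice the thread itself does not settle.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
typedef unsigned int demo_gfp_t;
#define DEMO_GFP_WAIT    0x01u	/* caller may sleep */
#define DEMO_GFP_NORETRY 0x04u	/* caller prefers a quick failure */

enum demo_charge_result {
	DEMO_CHARGE_OK,		/* charge succeeded */
	DEMO_CHARGE_RETRY,	/* reclaim ran, caller should retry */
	DEMO_CHARGE_NOMEM,	/* over limit, caller asked not to retry */
	DEMO_CHARGE_WOULDBLOCK,	/* over limit, caller cannot sleep */
};

/* Hypothetical page-granular counter. */
struct demo_res {
	unsigned long usage;
	unsigned long limit;
};

static enum demo_charge_result
demo_do_charge(struct demo_res *c, demo_gfp_t gfp, unsigned long nr_pages)
{
	if (c->usage + nr_pages <= c->limit) {
		c->usage += nr_pages;
		return DEMO_CHARGE_OK;
	}

	/* Over the limit: everything below this point may need to sleep. */
	if (!(gfp & DEMO_GFP_WAIT))
		return DEMO_CHARGE_WOULDBLOCK;

	if (gfp & DEMO_GFP_NORETRY)
		return DEMO_CHARGE_NOMEM;

	/* Reclaim would run here, in a context that is allowed to sleep. */
	return DEMO_CHARGE_RETRY;
}
```

The ordering of the checks is the point of the sketch: the non-sleeping case is rejected before reclaim or OOM handling is ever considered.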
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-09 13:01 ` [PATCH v2 06/11] memcg: kmem controller infrastructure Glauber Costa
[not found] ` <1344517279-30646-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-11 5:11 ` Greg Thelen
2012-08-13 8:07 ` Glauber Costa
2012-08-13 9:59 ` Glauber Costa
2012-08-21 21:50 ` Greg Thelen
2 siblings, 2 replies; 135+ messages in thread
From: Greg Thelen @ 2012-08-11 5:11 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg
On Thu, Aug 09 2012, Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors. The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> [ v2: improved comments and standardized function names ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> ---
> include/linux/memcontrol.h | 79 +++++++++++++++++++
> mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 264 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
> #define _LINUX_MEMCONTROL_H
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -399,6 +400,11 @@ struct sock;
> #ifdef CONFIG_MEMCG_KMEM
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
> +
> +#define memcg_kmem_on 1
> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> +void __memcg_kmem_free_page(struct page *page, int order);
> #else
> static inline void sock_update_memcg(struct sock *sk)
> {
> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +#define memcg_kmem_on 0
> +static inline bool
> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + return false;
> +}
> +
> +static inline void __memcg_kmem_free_page(struct page *page, int order)
> +{
> +}
> +
> +static inline void
> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> +}
> #endif /* CONFIG_MEMCG_KMEM */
> +
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> + return __memcg_kmem_new_page(gfp, handle, order);
> +}
> +
> +/**
> + * memcg_kmem_free_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_free_page(struct page *page, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_free_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_page: embeds correct memcg in a page
> + * @handle: a pointer to the memcg this was charged against.
> + * @page: pointer to struct page recently allocated
> + * @handle: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_new_page, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @handle to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_commit_page(page, handle, order);
> +}
> #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
> * Copyright (C) 2009 Nokia Corporation
> * Author: Kirill A. Shutemov
> *
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> #include <net/ip.h>
>
> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
> +
> void sock_update_memcg(struct sock *sk)
> {
> if (mem_cgroup_sockets_enabled) {
> @@ -488,6 +495,118 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
> }
> EXPORT_SYMBOL(tcp_proto_cgroup);
> #endif /* CONFIG_INET */
> +
> +static inline bool memcg_kmem_enabled(struct mem_cgroup *memcg)
> +{
> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
> + memcg->kmem_accounted;
> +}
> +
> +/*
> + * We need to verify if the allocation against current->mm->owner's memcg is
> + * possible for the given order. But the page is not allocated yet, so we'll
> + * need a further commit step to do the final arrangements.
> + *
> + * It is possible for the task to switch cgroups in this mean time, so at
> + * commit time, we can't rely on task conversion any longer. We'll then use
> + * the handle argument to return to the caller which cgroup we should commit
> + * against
> + *
> + * Returning true means the allocation is possible.
> + */
> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
> +{
> + struct mem_cgroup *memcg;
> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
> + bool ret = true;
> + size_t size;
> + struct task_struct *p;
> +
> + *handle = NULL;
> + rcu_read_lock();
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);
> + if (!memcg_kmem_enabled(memcg))
> + goto out;
> +
> + mem_cgroup_get(memcg);
> +
> + size = PAGE_SIZE << order;
> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;
> + if (!ret) {
> + mem_cgroup_put(memcg);
> + goto out;
> + }
> +
> + *handle = memcg;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +EXPORT_SYMBOL(__memcg_kmem_new_page);
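
The charge/commit protocol described in the quoted comments (charge first, allocate, then either commit the page to the charged group or revert) can be modelled in userspace. This is a minimal sketch with hypothetical names (demo_group, demo_page, demo_new_page(), demo_commit_page()); it leaves out RCU, the task-to-memcg lookup and the page_cgroup lock, and only shows how the handle carries the charged group from one step to the next.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mem_cgroup and struct page. */
struct demo_group {
	long usage;	/* bytes currently charged */
	long limit;	/* maximum bytes allowed */
	int refs;	/* references held by in-flight or committed charges */
};

struct demo_page {
	struct demo_group *owner;
};

/*
 * Step 1: charge before the page exists. On success *handle names the group
 * that was charged (or NULL if the allocation is not accounted at all).
 */
static bool demo_new_page(struct demo_group *g, long size,
			  struct demo_group **handle)
{
	*handle = NULL;
	if (!g)
		return true;		/* unaccounted: always allowed */
	if (g->usage + size > g->limit)
		return false;		/* group cannot hold the allocation */
	g->usage += size;
	g->refs++;
	*handle = g;
	return true;
}

/*
 * Step 2: called after the allocation attempt, whether it succeeded or not.
 * A NULL page means the allocation failed, so the charge is reverted.
 */
static void demo_commit_page(struct demo_page *page,
			     struct demo_group *handle, long size)
{
	if (!handle)
		return;
	if (!page) {
		handle->usage -= size;
		handle->refs--;
		return;
	}
	page->owner = handle;
}
```

Passing the group through the handle, rather than looking it up again at commit time, is what makes the model robust against the task changing groups between the two steps.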
While running f853d89 from git://github.com/glommer/linux.git, I hit a
lockdep issue. To reproduce it, I allocated and held a reference to some
kmem in the context of a kmem-limited memcg. Then I moved the
allocating process out of the memcg and deleted the memcg. Due to the
kmem reference, the struct mem_cgroup is still active but invisible in
the cgroupfs namespace. No problems yet. Then I killed the user process,
which freed the kmem from the now unlinked memcg. Dropping the kmem
caused the memcg ref to hit zero. The memcg is then freed, but that
acquires a non-irqsafe spinlock in softirq context, which annoys lockdep.
I think the lock in question is the mctz lock below:
mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
			   struct mem_cgroup_per_zone *mz,
			   struct mem_cgroup_tree_per_zone *mctz)
{
	spin_lock(&mctz->lock);
	__mem_cgroup_remove_exceeded(memcg, mz, mctz);
	spin_unlock(&mctz->lock);
}
Perhaps your patches expose this problem by being the first to call
__mem_cgroup_free() from softirq context (this is just an educated
guess). I'm not sure how this would interact with Ying's soft limit
rework: https://lwn.net/Articles/501338/
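
One generic way to keep a final put from running teardown code in softirq context is to defer the teardown to a context that may take ordinary locks. The sketch below is a userspace model of that pattern only; it is not the fix chosen in this thread, and demo_memcg, demo_put() and demo_run_deferred() are hypothetical names. The in_softirq argument stands in for whatever context check the real code would use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted struct mem_cgroup. */
struct demo_memcg {
	int refcnt;
	bool freed;
	struct demo_memcg *next_deferred;
};

/* Objects whose last reference was dropped in "softirq" context. */
static struct demo_memcg *demo_deferred_list;

/* Teardown that, in the real code, would take non-irqsafe locks. */
static void demo_free_now(struct demo_memcg *m)
{
	m->freed = true;
}

/* Drop one reference; defer teardown if the caller is in softirq. */
static void demo_put(struct demo_memcg *m, bool in_softirq)
{
	if (--m->refcnt > 0)
		return;
	if (in_softirq) {
		m->next_deferred = demo_deferred_list;
		demo_deferred_list = m;
		return;
	}
	demo_free_now(m);
}

/* Runs later from a sleepable context; returns how many objects it freed. */
static int demo_run_deferred(void)
{
	int n = 0;

	while (demo_deferred_list) {
		struct demo_memcg *m = demo_deferred_list;

		demo_deferred_list = m->next_deferred;
		m->next_deferred = NULL;
		demo_free_now(m);
		n++;
	}
	return n;
}
```

The alternative would be to make every acquisition of the affected lock irq-safe, which avoids the deferral machinery at the cost of disabling interrupts on all users of that lock.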
Here's the dmesg splat.
[ 335.550398] =================================
[ 335.554739] [ INFO: inconsistent lock state ]
[ 335.559091] 3.5.0-dbg-DEV #3 Tainted: G W
[ 335.563946] ---------------------------------
[ 335.568290] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 335.574286] swapper/10/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 335.579508] (&(&rtpz->lock)->rlock){+.?...}, at: [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
[ 335.588525] {SOFTIRQ-ON-W} state was registered at:
[ 335.593389] [<ffffffff810cb073>] __lock_acquire+0x623/0x1a50
[ 335.599200] [<ffffffff810cca55>] lock_acquire+0x95/0x150
[ 335.604670] [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
[ 335.610232] [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
[ 335.616135] [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
[ 335.621696] [<ffffffff81182302>] mem_cgroup_destroy+0x22/0x30
[ 335.627592] [<ffffffff810e093f>] cgroup_diput+0xbf/0x160
[ 335.633062] [<ffffffff811a07ef>] d_delete+0x12f/0x1a0
[ 335.638276] [<ffffffff8119671e>] vfs_rmdir+0x11e/0x140
[ 335.643565] [<ffffffff81199173>] do_rmdir+0x113/0x130
[ 335.648773] [<ffffffff8119a5e6>] sys_rmdir+0x16/0x20
[ 335.653900] [<ffffffff8158c74f>] cstar_dispatch+0x7/0x1f
[ 335.659370] irq event stamp: 399732
[ 335.662846] hardirqs last enabled at (399732): [<ffffffff810e8e08>] res_counter_uncharge_until+0x68/0xa0
[ 335.672383] hardirqs last disabled at (399731): [<ffffffff810e8dc8>] res_counter_uncharge_until+0x28/0xa0
[ 335.681916] softirqs last enabled at (399710): [<ffffffff81085dd3>] _local_bh_enable+0x13/0x20
[ 335.690590] softirqs last disabled at (399711): [<ffffffff8158c48c>] call_softirq+0x1c/0x30
[ 335.698914]
[ 335.698914] other info that might help us debug this:
[ 335.705415] Possible unsafe locking scenario:
[ 335.705415]
[ 335.711317] CPU0
[ 335.713757] ----
[ 335.716198] lock(&(&rtpz->lock)->rlock);
[ 335.720282] <Interrupt>
[ 335.722896] lock(&(&rtpz->lock)->rlock);
[ 335.727153]
[ 335.727153] *** DEADLOCK ***
[ 335.727153]
[ 335.733055] no locks held by swapper/10/0.
[ 335.737141]
[ 335.737141] stack backtrace:
[ 335.741483] Pid: 0, comm: swapper/10 Tainted: G W 3.5.0-dbg-DEV #3
[ 335.748510] Call Trace:
[ 335.750952] <IRQ> [<ffffffff81579a27>] print_usage_bug+0x1fc/0x20d
[ 335.757286] [<ffffffff81058a9f>] ? save_stack_trace+0x2f/0x50
[ 335.763098] [<ffffffff810ca9ed>] mark_lock+0x29d/0x300
[ 335.768309] [<ffffffff810c9e10>] ? print_irq_inversion_bug.part.36+0x1f0/0x1f0
[ 335.775599] [<ffffffff810caffc>] __lock_acquire+0x5ac/0x1a50
[ 335.781323] [<ffffffff810cad34>] ? __lock_acquire+0x2e4/0x1a50
[ 335.787224] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
[ 335.793212] [<ffffffff810cca55>] lock_acquire+0x95/0x150
[ 335.798594] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
[ 335.804581] [<ffffffff810e8ddd>] ? res_counter_uncharge_until+0x3d/0xa0
[ 335.811263] [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
[ 335.816731] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
[ 335.822724] [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
[ 335.828538] [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
[ 335.834002] [<ffffffff811828a6>] __memcg_kmem_free_page+0xa6/0x110
[ 335.840256] [<ffffffff81138109>] free_accounted_pages+0x99/0xa0
[ 335.846243] [<ffffffff8107b09f>] free_task+0x3f/0x70
[ 335.851278] [<ffffffff8107b18c>] __put_task_struct+0xbc/0x130
[ 335.857094] [<ffffffff81081524>] delayed_put_task_struct+0x54/0xd0
[ 335.863338] [<ffffffff810fd354>] __rcu_process_callbacks+0x1e4/0x490
[ 335.869757] [<ffffffff810fd62f>] rcu_process_callbacks+0x2f/0x80
[ 335.875835] [<ffffffff810862f5>] __do_softirq+0xc5/0x270
[ 335.881218] [<ffffffff810c49b4>] ? clockevents_program_event+0x74/0x100
[ 335.887895] [<ffffffff810c5d94>] ? tick_program_event+0x24/0x30
[ 335.893882] [<ffffffff8158c48c>] call_softirq+0x1c/0x30
[ 335.899179] [<ffffffff8104cefd>] do_softirq+0x8d/0xc0
[ 335.904301] [<ffffffff810867de>] irq_exit+0xae/0xe0
[ 335.909251] [<ffffffff8158cc3e>] smp_apic_timer_interrupt+0x6e/0x99
[ 335.915591] [<ffffffff8158ba9c>] apic_timer_interrupt+0x6c/0x80
[ 335.921583] <EOI> [<ffffffff810530e7>] ? default_idle+0x67/0x270
[ 335.927741] [<ffffffff810530e5>] ? default_idle+0x65/0x270
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-11 5:11 ` Greg Thelen
@ 2012-08-13 8:07 ` Glauber Costa
2012-08-13 9:59 ` Glauber Costa
1 sibling, 0 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-13 8:07 UTC (permalink / raw)
To: Greg Thelen
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg
On 08/11/2012 09:11 AM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
>
>> This patch introduces infrastructure for tracking kernel memory pages to
>> a given memcg. This will happen whenever the caller includes the flag
>> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>>
>> In memcontrol.h those functions are wrapped in inline accessors. The
>> idea is to later on, patch those with static branches, so we don't incur
>> any overhead when no mem cgroups with limited kmem are being used.
>>
>> [ v2: improved comments and standardized function names ]
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Christoph Lameter <cl@linux.com>
>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>> include/linux/memcontrol.h | 79 +++++++++++++++++++
>> mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 264 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 8d9489f..75b247e 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -21,6 +21,7 @@
>> #define _LINUX_MEMCONTROL_H
>> #include <linux/cgroup.h>
>> #include <linux/vm_event_item.h>
>> +#include <linux/hardirq.h>
>>
>> struct mem_cgroup;
>> struct page_cgroup;
>> @@ -399,6 +400,11 @@ struct sock;
>> #ifdef CONFIG_MEMCG_KMEM
>> void sock_update_memcg(struct sock *sk);
>> void sock_release_memcg(struct sock *sk);
>> +
>> +#define memcg_kmem_on 1
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
>> +void __memcg_kmem_free_page(struct page *page, int order);
>> #else
>> static inline void sock_update_memcg(struct sock *sk)
>> {
>> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>> static inline void sock_release_memcg(struct sock *sk)
>> {
>> }
>> +
>> +#define memcg_kmem_on 0
>> +static inline bool
>> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> + return false;
>> +}
>> +
>> +static inline void __memcg_kmem_free_page(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void
>> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> +{
>> +}
>> #endif /* CONFIG_MEMCG_KMEM */
>> +
>> +/**
>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>> + * @gfp: the gfp allocation flags.
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @order: allocation order.
>> + *
>> + * returns true if the memcg where the current task belongs can hold this
>> + * allocation.
>> + *
>> + * We return true automatically if this allocation is not to be accounted to
>> + * any memcg.
>> + */
>> +static __always_inline bool
>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> + if (!memcg_kmem_on)
>> + return true;
>> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>> + return true;
>> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>> + return true;
>> + return __memcg_kmem_new_page(gfp, handle, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_free_page: uncharge pages from memcg
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_free_page(struct page *page, int order)
>> +{
>> + if (memcg_kmem_on)
>> + __memcg_kmem_free_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_page: embeds correct memcg in a page
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @page: pointer to struct page recently allocated
>> + * @handle: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>> + * failure of the allocation. if @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> +{
>> + if (memcg_kmem_on)
>> + __memcg_kmem_commit_page(page, handle, order);
>> +}
>> #endif /* _LINUX_MEMCONTROL_H */
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 54e93de..e9824c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -10,6 +10,10 @@
>> * Copyright (C) 2009 Nokia Corporation
>> * Author: Kirill A. Shutemov
>> *
>> + * Kernel Memory Controller
>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>> + * Authors: Glauber Costa and Suleiman Souhlal
>> + *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License as published by
>> * the Free Software Foundation; either version 2 of the License, or
>> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
>> #include <net/ip.h>
>>
>> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
>> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
>> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
>> +
>> void sock_update_memcg(struct sock *sk)
>> {
>> if (mem_cgroup_sockets_enabled) {
>> @@ -488,6 +495,118 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
>> }
>> EXPORT_SYMBOL(tcp_proto_cgroup);
>> #endif /* CONFIG_INET */
>> +
>> +static inline bool memcg_kmem_enabled(struct mem_cgroup *memcg)
>> +{
>> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
>> + memcg->kmem_accounted;
>> +}
>> +
>> +/*
>> + * We need to verify if the allocation against current->mm->owner's memcg is
>> + * possible for the given order. But the page is not allocated yet, so we'll
>> + * need a further commit step to do the final arrangements.
>> + *
>> + * It is possible for the task to switch cgroups in this mean time, so at
>> + * commit time, we can't rely on task conversion any longer. We'll then use
>> + * the handle argument to return to the caller which cgroup we should commit
>> + * against
>> + *
>> + * Returning true means the allocation is possible.
>> + */
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>> +{
>> + struct mem_cgroup *memcg;
>> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>> + bool ret = true;
>> + size_t size;
>> + struct task_struct *p;
>> +
>> + *handle = NULL;
>> + rcu_read_lock();
>> + p = rcu_dereference(current->mm->owner);
>> + memcg = mem_cgroup_from_task(p);
>> + if (!memcg_kmem_enabled(memcg))
>> + goto out;
>> +
>> + mem_cgroup_get(memcg);
>> +
>> + size = PAGE_SIZE << order;
>> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;
>> + if (!ret) {
>> + mem_cgroup_put(memcg);
>> + goto out;
>> + }
>> +
>> + *handle = memcg;
>> +out:
>> + rcu_read_unlock();
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(__memcg_kmem_new_page);
>
> While running f853d89 from git://github.com/glommer/linux.git , I hit a
> lockdep issue. To create this I allocated and held reference to some
> kmem in the context of a kmem limited memcg. Then I moved the
> allocating process out of memcg and then deleted the memcg. Due to the
> kmem reference the struct mem_cgroup is still active but invisible in
> cgroupfs namespace. No problems yet. Then I killed the user process
> which freed the kmem from the now unlinked memcg. Dropping the kmem
> caused the memcg ref to hit zero. Then the memcg is deleted but that
> acquires a non-irqsafe spinlock in softirq which annoys lockdep. I
> think the lock in question is the mctz below:
>
> mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
> struct mem_cgroup_per_zone *mz,
> struct mem_cgroup_tree_per_zone *mctz)
> {
> spin_lock(&mctz->lock);
> __mem_cgroup_remove_exceeded(memcg, mz, mctz);
> spin_unlock(&mctz->lock);
> }
>
> Perhaps your patches expose this problem by being the first time we call
> __mem_cgroup_free() from softirq (this is just an educated guess). I'm
> not sure how this would interact with Ying's soft limit rework:
> https://lwn.net/Articles/501338/
>
Thanks for letting me know, Greg.
I'll try to reproduce this today and see how it goes.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-11 5:11 ` Greg Thelen
2012-08-13 8:07 ` Glauber Costa
@ 2012-08-13 9:59 ` Glauber Costa
2012-08-13 21:21 ` Greg Thelen
1 sibling, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-13 9:59 UTC (permalink / raw)
To: Greg Thelen
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg
>
> Here's the dmesg splat.
>
Do you always get this report in the same way?
I managed to get a softirq inconsistency like yours, but the complaint
is about a different lock.
> [ 335.550398] =================================
> [ 335.554739] [ INFO: inconsistent lock state ]
> [ 335.559091] 3.5.0-dbg-DEV #3 Tainted: G W
> [ 335.563946] ---------------------------------
> [ 335.568290] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> [ 335.574286] swapper/10/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
> [ 335.579508] (&(&rtpz->lock)->rlock){+.?...}, at: [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [ 335.588525] {SOFTIRQ-ON-W} state was registered at:
> [ 335.593389] [<ffffffff810cb073>] __lock_acquire+0x623/0x1a50
> [ 335.599200] [<ffffffff810cca55>] lock_acquire+0x95/0x150
> [ 335.604670] [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
> [ 335.610232] [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [ 335.616135] [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
> [ 335.621696] [<ffffffff81182302>] mem_cgroup_destroy+0x22/0x30
> [ 335.627592] [<ffffffff810e093f>] cgroup_diput+0xbf/0x160
> [ 335.633062] [<ffffffff811a07ef>] d_delete+0x12f/0x1a0
> [ 335.638276] [<ffffffff8119671e>] vfs_rmdir+0x11e/0x140
> [ 335.643565] [<ffffffff81199173>] do_rmdir+0x113/0x130
> [ 335.648773] [<ffffffff8119a5e6>] sys_rmdir+0x16/0x20
> [ 335.653900] [<ffffffff8158c74f>] cstar_dispatch+0x7/0x1f
> [ 335.659370] irq event stamp: 399732
> [ 335.662846] hardirqs last enabled at (399732): [<ffffffff810e8e08>] res_counter_uncharge_until+0x68/0xa0
> [ 335.672383] hardirqs last disabled at (399731): [<ffffffff810e8dc8>] res_counter_uncharge_until+0x28/0xa0
> [ 335.681916] softirqs last enabled at (399710): [<ffffffff81085dd3>] _local_bh_enable+0x13/0x20
> [ 335.690590] softirqs last disabled at (399711): [<ffffffff8158c48c>] call_softirq+0x1c/0x30
> [ 335.698914]
> [ 335.698914] other info that might help us debug this:
> [ 335.705415] Possible unsafe locking scenario:
> [ 335.705415]
> [ 335.711317] CPU0
> [ 335.713757] ----
> [ 335.716198] lock(&(&rtpz->lock)->rlock);
> [ 335.720282] <Interrupt>
> [ 335.722896] lock(&(&rtpz->lock)->rlock);
> [ 335.727153]
> [ 335.727153] *** DEADLOCK ***
> [ 335.727153]
> [ 335.733055] no locks held by swapper/10/0.
> [ 335.737141]
> [ 335.737141] stack backtrace:
> [ 335.741483] Pid: 0, comm: swapper/10 Tainted: G W 3.5.0-dbg-DEV #3
> [ 335.748510] Call Trace:
> [ 335.750952] <IRQ> [<ffffffff81579a27>] print_usage_bug+0x1fc/0x20d
> [ 335.757286] [<ffffffff81058a9f>] ? save_stack_trace+0x2f/0x50
> [ 335.763098] [<ffffffff810ca9ed>] mark_lock+0x29d/0x300
> [ 335.768309] [<ffffffff810c9e10>] ? print_irq_inversion_bug.part.36+0x1f0/0x1f0
> [ 335.775599] [<ffffffff810caffc>] __lock_acquire+0x5ac/0x1a50
> [ 335.781323] [<ffffffff810cad34>] ? __lock_acquire+0x2e4/0x1a50
> [ 335.787224] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [ 335.793212] [<ffffffff810cca55>] lock_acquire+0x95/0x150
> [ 335.798594] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [ 335.804581] [<ffffffff810e8ddd>] ? res_counter_uncharge_until+0x3d/0xa0
> [ 335.811263] [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
> [ 335.816731] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [ 335.822724] [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [ 335.828538] [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
> [ 335.834002] [<ffffffff811828a6>] __memcg_kmem_free_page+0xa6/0x110
> [ 335.840256] [<ffffffff81138109>] free_accounted_pages+0x99/0xa0
> [ 335.846243] [<ffffffff8107b09f>] free_task+0x3f/0x70
> [ 335.851278] [<ffffffff8107b18c>] __put_task_struct+0xbc/0x130
> [ 335.857094] [<ffffffff81081524>] delayed_put_task_struct+0x54/0xd0
> [ 335.863338] [<ffffffff810fd354>] __rcu_process_callbacks+0x1e4/0x490
> [ 335.869757] [<ffffffff810fd62f>] rcu_process_callbacks+0x2f/0x80
> [ 335.875835] [<ffffffff810862f5>] __do_softirq+0xc5/0x270
> [ 335.881218] [<ffffffff810c49b4>] ? clockevents_program_event+0x74/0x100
> [ 335.887895] [<ffffffff810c5d94>] ? tick_program_event+0x24/0x30
> [ 335.893882] [<ffffffff8158c48c>] call_softirq+0x1c/0x30
> [ 335.899179] [<ffffffff8104cefd>] do_softirq+0x8d/0xc0
> [ 335.904301] [<ffffffff810867de>] irq_exit+0xae/0xe0
> [ 335.909251] [<ffffffff8158cc3e>] smp_apic_timer_interrupt+0x6e/0x99
> [ 335.915591] [<ffffffff8158ba9c>] apic_timer_interrupt+0x6c/0x80
> [ 335.921583] <EOI> [<ffffffff810530e7>] ? default_idle+0x67/0x270
> [ 335.927741] [<ffffffff810530e5>] ? default_idle+0x65/0x270
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-13 9:59 ` Glauber Costa
@ 2012-08-13 21:21 ` Greg Thelen
0 siblings, 0 replies; 135+ messages in thread
From: Greg Thelen @ 2012-08-13 21:21 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg
On Mon, Aug 13 2012, Glauber Costa wrote:
>>
>> Here's the dmesg splat.
>>
>
> Do you always get this report in the same way?
> I managed to get a softirq inconsistency like yours, but the complaint
> goes for a different lock.
Yes, I repeatedly get the same dmesg splat below.
Once I apply your 'execute the whole memcg freeing in rcu callback' patch,
the warnings are no longer printed. I'll take a closer look at the patch
soon.
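For anyone reading the splat without lockdep background, here is a rough
user-space sketch of the rule it enforces. It is not lockdep's code; the
enum, the struct and model_acquire() are hypothetical names made up for
illustration. The rmdir path takes the lock with softirqs enabled
({SOFTIRQ-ON-W}) and the RCU callback takes the same lock class from
softirq context ({IN-SOFTIRQ-W}); once both have been seen, the class is
reported as inconsistent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of two usage bits tracked per lock class. */
enum lock_usage_model {
	USED_IN_SOFTIRQ = 1 << 0,	/* acquired while running a softirq */
	SOFTIRQ_ON      = 1 << 1,	/* acquired with softirqs enabled */
};

struct lock_class_model {
	unsigned int usage;
};

/*
 * Record one acquisition and return true while the class is consistent.
 * A lock taken inside a softirq must never also be taken with softirqs
 * enabled: the softirq could interrupt the holder on the same CPU and
 * spin on the lock forever.
 */
bool model_acquire(struct lock_class_model *c, bool in_softirq,
		   bool softirqs_enabled)
{
	const unsigned int both = USED_IN_SOFTIRQ | SOFTIRQ_ON;

	if (in_softirq)
		c->usage |= USED_IN_SOFTIRQ;
	else if (softirqs_enabled)
		c->usage |= SOFTIRQ_ON;

	return (c->usage & both) != both;
}
```

In this model, taking the lock with softirqs disabled in process context,
or never taking it from softirq context at all, keeps the class
consistent.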
>> [ 335.550398] =================================
>> [ 335.554739] [ INFO: inconsistent lock state ]
>> [ 335.559091] 3.5.0-dbg-DEV #3 Tainted: G W
>> [ 335.563946] ---------------------------------
>> [ 335.568290] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
>> [ 335.574286] swapper/10/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
>> [ 335.579508] (&(&rtpz->lock)->rlock){+.?...}, at: [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
>> [ 335.588525] {SOFTIRQ-ON-W} state was registered at:
>> [ 335.593389] [<ffffffff810cb073>] __lock_acquire+0x623/0x1a50
>> [ 335.599200] [<ffffffff810cca55>] lock_acquire+0x95/0x150
>> [ 335.604670] [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
>> [ 335.610232] [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
>> [ 335.616135] [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
>> [ 335.621696] [<ffffffff81182302>] mem_cgroup_destroy+0x22/0x30
>> [ 335.627592] [<ffffffff810e093f>] cgroup_diput+0xbf/0x160
>> [ 335.633062] [<ffffffff811a07ef>] d_delete+0x12f/0x1a0
>> [ 335.638276] [<ffffffff8119671e>] vfs_rmdir+0x11e/0x140
>> [ 335.643565] [<ffffffff81199173>] do_rmdir+0x113/0x130
>> [ 335.648773] [<ffffffff8119a5e6>] sys_rmdir+0x16/0x20
>> [ 335.653900] [<ffffffff8158c74f>] cstar_dispatch+0x7/0x1f
>> [ 335.659370] irq event stamp: 399732
>> [ 335.662846] hardirqs last enabled at (399732): [<ffffffff810e8e08>] res_counter_uncharge_until+0x68/0xa0
>> [ 335.672383] hardirqs last disabled at (399731): [<ffffffff810e8dc8>] res_counter_uncharge_until+0x28/0xa0
>> [ 335.681916] softirqs last enabled at (399710): [<ffffffff81085dd3>] _local_bh_enable+0x13/0x20
>> [ 335.690590] softirqs last disabled at (399711): [<ffffffff8158c48c>] call_softirq+0x1c/0x30
>> [ 335.698914]
>> [ 335.698914] other info that might help us debug this:
>> [ 335.705415] Possible unsafe locking scenario:
>> [ 335.705415]
>> [ 335.711317] CPU0
>> [ 335.713757] ----
>> [ 335.716198] lock(&(&rtpz->lock)->rlock);
>> [ 335.720282] <Interrupt>
>> [ 335.722896] lock(&(&rtpz->lock)->rlock);
>> [ 335.727153]
>> [ 335.727153] *** DEADLOCK ***
>> [ 335.727153]
>> [ 335.733055] no locks held by swapper/10/0.
>> [ 335.737141]
>> [ 335.737141] stack backtrace:
>> [ 335.741483] Pid: 0, comm: swapper/10 Tainted: G W 3.5.0-dbg-DEV #3
>> [ 335.748510] Call Trace:
>> [ 335.750952] <IRQ> [<ffffffff81579a27>] print_usage_bug+0x1fc/0x20d
>> [ 335.757286] [<ffffffff81058a9f>] ? save_stack_trace+0x2f/0x50
>> [ 335.763098] [<ffffffff810ca9ed>] mark_lock+0x29d/0x300
>> [ 335.768309] [<ffffffff810c9e10>] ? print_irq_inversion_bug.part.36+0x1f0/0x1f0
>> [ 335.775599] [<ffffffff810caffc>] __lock_acquire+0x5ac/0x1a50
>> [ 335.781323] [<ffffffff810cad34>] ? __lock_acquire+0x2e4/0x1a50
>> [ 335.787224] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
>> [ 335.793212] [<ffffffff810cca55>] lock_acquire+0x95/0x150
>> [ 335.798594] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
>> [ 335.804581] [<ffffffff810e8ddd>] ? res_counter_uncharge_until+0x3d/0xa0
>> [ 335.811263] [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
>> [ 335.816731] [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
>> [ 335.822724] [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
>> [ 335.828538] [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
>> [ 335.834002] [<ffffffff811828a6>] __memcg_kmem_free_page+0xa6/0x110
>> [ 335.840256] [<ffffffff81138109>] free_accounted_pages+0x99/0xa0
>> [ 335.846243] [<ffffffff8107b09f>] free_task+0x3f/0x70
>> [ 335.851278] [<ffffffff8107b18c>] __put_task_struct+0xbc/0x130
>> [ 335.857094] [<ffffffff81081524>] delayed_put_task_struct+0x54/0xd0
>> [ 335.863338] [<ffffffff810fd354>] __rcu_process_callbacks+0x1e4/0x490
>> [ 335.869757] [<ffffffff810fd62f>] rcu_process_callbacks+0x2f/0x80
>> [ 335.875835] [<ffffffff810862f5>] __do_softirq+0xc5/0x270
>> [ 335.881218] [<ffffffff810c49b4>] ? clockevents_program_event+0x74/0x100
>> [ 335.887895] [<ffffffff810c5d94>] ? tick_program_event+0x24/0x30
>> [ 335.893882] [<ffffffff8158c48c>] call_softirq+0x1c/0x30
>> [ 335.899179] [<ffffffff8104cefd>] do_softirq+0x8d/0xc0
>> [ 335.904301] [<ffffffff810867de>] irq_exit+0xae/0xe0
>> [ 335.909251] [<ffffffff8158cc3e>] smp_apic_timer_interrupt+0x6e/0x99
>> [ 335.915591] [<ffffffff8158ba9c>] apic_timer_interrupt+0x6c/0x80
>> [ 335.921583] <EOI> [<ffffffff810530e7>] ? default_idle+0x67/0x270
>> [ 335.927741] [<ffffffff810530e5>] ? default_idle+0x65/0x270
>>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-09 13:01 ` [PATCH v2 06/11] memcg: kmem controller infrastructure Glauber Costa
[not found] ` <1344517279-30646-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-11 5:11 ` Greg Thelen
@ 2012-08-21 21:50 ` Greg Thelen
2012-08-22 8:35 ` Glauber Costa
2 siblings, 1 reply; 135+ messages in thread
From: Greg Thelen @ 2012-08-21 21:50 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg
On Thu, Aug 09 2012, Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors. The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> [ v2: improved comments and standardized function names ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> ---
> include/linux/memcontrol.h | 79 +++++++++++++++++++
> mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 264 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
> #define _LINUX_MEMCONTROL_H
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -399,6 +400,11 @@ struct sock;
> #ifdef CONFIG_MEMCG_KMEM
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
> +
> +#define memcg_kmem_on 1
> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> +void __memcg_kmem_free_page(struct page *page, int order);
> #else
> static inline void sock_update_memcg(struct sock *sk)
> {
> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
> static inline void sock_release_memcg(struct sock *sk)
> {
> }
> +
> +#define memcg_kmem_on 0
> +static inline bool
> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + return false;
> +}
> +
> +static inline void __memcg_kmem_free_page(struct page *page, int order)
> +{
> +}
> +
> +static inline void
> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> +}
> #endif /* CONFIG_MEMCG_KMEM */
> +
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> + return __memcg_kmem_new_page(gfp, handle, order);
> +}
> +
> +/**
> + * memcg_kmem_free_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_free_page(struct page *page, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_free_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_page: embeds correct memcg in a page
> + * @handle: a pointer to the memcg this was charged against.
> + * @page: pointer to struct page recently allocated
> + * @handle: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_new_page, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @handle to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_commit_page(page, handle, order);
> +}
> #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
> * Copyright (C) 2009 Nokia Corporation
> * Author: Kirill A. Shutemov
> *
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> #include <net/ip.h>
>
> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
> +
> void sock_update_memcg(struct sock *sk)
> {
> if (mem_cgroup_sockets_enabled) {
> @@ -488,6 +495,118 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
> }
> EXPORT_SYMBOL(tcp_proto_cgroup);
> #endif /* CONFIG_INET */
> +
> +static inline bool memcg_kmem_enabled(struct mem_cgroup *memcg)
> +{
> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
> + memcg->kmem_accounted;
> +}
> +
> +/*
> + * We need to verify if the allocation against current->mm->owner's memcg is
> + * possible for the given order. But the page is not allocated yet, so we'll
> + * need a further commit step to do the final arrangements.
> + *
> + * It is possible for the task to switch cgroups in this mean time, so at
> + * commit time, we can't rely on task conversion any longer. We'll then use
> + * the handle argument to return to the caller which cgroup we should commit
> + * against
> + *
> + * Returning true means the allocation is possible.
> + */
> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
> +{
> + struct mem_cgroup *memcg;
> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
> + bool ret = true;
> + size_t size;
> + struct task_struct *p;
> +
> + *handle = NULL;
> + rcu_read_lock();
> + p = rcu_dereference(current->mm->owner);
> + memcg = mem_cgroup_from_task(p);
> + if (!memcg_kmem_enabled(memcg))
> + goto out;
> +
> + mem_cgroup_get(memcg);
> +
> + size = PAGE_SIZE << order;
> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;
> + if (!ret) {
> + mem_cgroup_put(memcg);
> + goto out;
> + }
> +
> + *handle = memcg;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +EXPORT_SYMBOL(__memcg_kmem_new_page);
> +
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
> +{
> + struct page_cgroup *pc;
> + struct mem_cgroup *memcg = handle;
> +
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + /* The page allocation must have failed. Revert */
> + if (!page) {
> + size_t size = PAGE_SIZE << order;
> +
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> + return;
> + }
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
I have no problem with the code here. But, out of curiosity, why do we
need to lock the pc here and below in __memcg_kmem_free_page()?
For the allocating side, I don't think that migration or reclaim will be
manipulating this page. But is there something else that we need the
locking for?
For the freeing side, it seems that anyone calling
__memcg_kmem_free_page() is going to be freeing a previously accounted
page.
I imagine that if we did not need the locking we would still need some
memory barriers to make sure that modifications to PG_lru are
serialized wrt. kmem modifying PageCgroupUsed here.
Perhaps we're just trying to take a conservative initial implementation
which is consistent with user visible pages.
> +}
> +
> +void __memcg_kmem_free_page(struct page *page, int order)
> +{
> + struct mem_cgroup *memcg;
> + size_t size;
> + struct page_cgroup *pc;
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + memcg = pc->mem_cgroup;
> + pc->mem_cgroup = NULL;
> + if (!PageCgroupUsed(pc)) {
When do we expect to find PageCgroupUsed() unset in this routine? Is
this just to handle the race of someone enabling kmem accounting after
allocating a page and then later freeing that page?
> + unlock_page_cgroup(pc);
> + return;
> + }
> + ClearPageCgroupUsed(pc);
> + unlock_page_cgroup(pc);
> +
> + /*
> + * Checking if kmem accounted is enabled won't work for uncharge, since
> + * it is possible that the user enabled kmem tracking, allocated, and
> + * then disabled it again.
> + *
> + * We trust if there is a memcg associated with the page, it is a valid
> + * allocation
> + */
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + size = (1 << order) << PAGE_SHIFT;
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> +}
> +EXPORT_SYMBOL(__memcg_kmem_free_page);
> #endif /* CONFIG_MEMCG_KMEM */
>
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> @@ -5759,3 +5878,69 @@ static int __init enable_swap_account(char *s)
> __setup("swapaccount=", enable_swap_account);
>
> #endif
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
> +{
> + struct res_counter *fail_res;
> + struct mem_cgroup *_memcg;
> + int ret;
> + bool may_oom;
> + bool nofail = false;
> +
> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
> + !(gfp & __GFP_NORETRY);
> +
> + ret = 0;
> +
> + if (!memcg)
> + return ret;
> +
> + _memcg = memcg;
> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> + &_memcg, may_oom);
> +
> + if (ret == -EINTR) {
> + nofail = true;
> + /*
> + * __mem_cgroup_try_charge() chosed to bypass to root due to
> + * OOM kill or fatal signal. Since our only options are to
> + * either fail the allocation or charge it to this cgroup, do
> + * it as a temporary condition. But we can't fail. From a
> + * kmem/slab perspective, the cache has already been selected,
> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
> + * our minds
> + */
> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
> + if (do_swap_account)
> + res_counter_charge_nofail(&memcg->memsw, delta,
> + &fail_res);
> + ret = 0;
> + } else if (ret == -ENOMEM)
> + return ret;
> +
> + if (nofail)
> + res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
> + else
> + ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
> +
> + if (ret) {
> + res_counter_uncharge(&memcg->res, delta);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, delta);
> + }
> +
> + return ret;
> +}
> +
> +void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta)
> +{
> + if (!memcg)
> + return;
> +
> + res_counter_uncharge(&memcg->kmem, delta);
> + res_counter_uncharge(&memcg->res, delta);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, delta);
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 06/11] memcg: kmem controller infrastructure
2012-08-21 21:50 ` Greg Thelen
@ 2012-08-22 8:35 ` Glauber Costa
[not found] ` <503499CC.7070704-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-22 8:35 UTC (permalink / raw)
To: Greg Thelen
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg
On 08/22/2012 01:50 AM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
>
>> This patch introduces infrastructure for tracking kernel memory pages to
>> a given memcg. This will happen whenever the caller includes the flag
>> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>>
>> In memcontrol.h those functions are wrapped in inline accessors. The
>> idea is to later on, patch those with static branches, so we don't incur
>> any overhead when no mem cgroups with limited kmem are being used.
>>
>> [ v2: improved comments and standardized function names ]
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Christoph Lameter <cl@linux.com>
>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>> include/linux/memcontrol.h | 79 +++++++++++++++++++
>> mm/memcontrol.c | 185 +++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 264 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 8d9489f..75b247e 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -21,6 +21,7 @@
>> #define _LINUX_MEMCONTROL_H
>> #include <linux/cgroup.h>
>> #include <linux/vm_event_item.h>
>> +#include <linux/hardirq.h>
>>
>> struct mem_cgroup;
>> struct page_cgroup;
>> @@ -399,6 +400,11 @@ struct sock;
>> #ifdef CONFIG_MEMCG_KMEM
>> void sock_update_memcg(struct sock *sk);
>> void sock_release_memcg(struct sock *sk);
>> +
>> +#define memcg_kmem_on 1
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
>> +void __memcg_kmem_free_page(struct page *page, int order);
>> #else
>> static inline void sock_update_memcg(struct sock *sk)
>> {
>> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>> static inline void sock_release_memcg(struct sock *sk)
>> {
>> }
>> +
>> +#define memcg_kmem_on 0
>> +static inline bool
>> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> + return false;
>> +}
>> +
>> +static inline void __memcg_kmem_free_page(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void
>> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> +{
>> +}
>> #endif /* CONFIG_MEMCG_KMEM */
>> +
>> +/**
>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>> + * @gfp: the gfp allocation flags.
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @order: allocation order.
>> + *
>> + * returns true if the memcg where the current task belongs can hold this
>> + * allocation.
>> + *
>> + * We return true automatically if this allocation is not to be accounted to
>> + * any memcg.
>> + */
>> +static __always_inline bool
>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> + if (!memcg_kmem_on)
>> + return true;
>> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>> + return true;
>> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>> + return true;
>> + return __memcg_kmem_new_page(gfp, handle, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_free_page: uncharge pages from memcg
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_free_page(struct page *page, int order)
>> +{
>> + if (memcg_kmem_on)
>> + __memcg_kmem_free_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_page: embeds correct memcg in a page
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @page: pointer to struct page recently allocated
>> + * @handle: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>> + * failure of the allocation. if @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> +{
>> + if (memcg_kmem_on)
>> + __memcg_kmem_commit_page(page, handle, order);
>> +}
>> #endif /* _LINUX_MEMCONTROL_H */
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 54e93de..e9824c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -10,6 +10,10 @@
>> * Copyright (C) 2009 Nokia Corporation
>> * Author: Kirill A. Shutemov
>> *
>> + * Kernel Memory Controller
>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>> + * Authors: Glauber Costa and Suleiman Souhlal
>> + *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License as published by
>> * the Free Software Foundation; either version 2 of the License, or
>> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
>> #include <net/ip.h>
>>
>> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
>> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
>> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
>> +
>> void sock_update_memcg(struct sock *sk)
>> {
>> if (mem_cgroup_sockets_enabled) {
>> @@ -488,6 +495,118 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
>> }
>> EXPORT_SYMBOL(tcp_proto_cgroup);
>> #endif /* CONFIG_INET */
>> +
>> +static inline bool memcg_kmem_enabled(struct mem_cgroup *memcg)
>> +{
>> + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
>> + memcg->kmem_accounted;
>> +}
>> +
>> +/*
>> + * We need to verify if the allocation against current->mm->owner's memcg is
>> + * possible for the given order. But the page is not allocated yet, so we'll
>> + * need a further commit step to do the final arrangements.
>> + *
>> + * It is possible for the task to switch cgroups in this mean time, so at
>> + * commit time, we can't rely on task conversion any longer. We'll then use
>> + * the handle argument to return to the caller which cgroup we should commit
>> + * against
>> + *
>> + * Returning true means the allocation is possible.
>> + */
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>> +{
>> + struct mem_cgroup *memcg;
>> + struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>> + bool ret = true;
>> + size_t size;
>> + struct task_struct *p;
>> +
>> + *handle = NULL;
>> + rcu_read_lock();
>> + p = rcu_dereference(current->mm->owner);
>> + memcg = mem_cgroup_from_task(p);
>> + if (!memcg_kmem_enabled(memcg))
>> + goto out;
>> +
>> + mem_cgroup_get(memcg);
>> +
>> + size = PAGE_SIZE << order;
>> + ret = memcg_charge_kmem(memcg, gfp, size) == 0;
>> + if (!ret) {
>> + mem_cgroup_put(memcg);
>> + goto out;
>> + }
>> +
>> + *handle = memcg;
>> +out:
>> + rcu_read_unlock();
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(__memcg_kmem_new_page);
>> +
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
>> +{
>> + struct page_cgroup *pc;
>> + struct mem_cgroup *memcg = handle;
>> +
>> + if (!memcg)
>> + return;
>> +
>> + WARN_ON(mem_cgroup_is_root(memcg));
>> + /* The page allocation must have failed. Revert */
>> + if (!page) {
>> + size_t size = PAGE_SIZE << order;
>> +
>> + memcg_uncharge_kmem(memcg, size);
>> + mem_cgroup_put(memcg);
>> + return;
>> + }
>
>> +
>> + pc = lookup_page_cgroup(page);
>> + lock_page_cgroup(pc);
>> + pc->mem_cgroup = memcg;
>> + SetPageCgroupUsed(pc);
>> + unlock_page_cgroup(pc);
>
> I have no problem with the code here. But, out of curiosity, why do we
> need to lock the pc here and below in __memcg_kmem_free_page()?
>
> For the allocating side, I don't think that migration or reclaim will be
> manipulating this page. But is there something else that we need the
> locking for?
>
> For the freeing side, it seems that anyone calling
> __memcg_kmem_free_page() is going to be freeing a previously accounted
> page.
>
> I imagine that if we did not need the locking we would still need some
> memory barriers to make sure that modifications to the PG_lru are
> serialized wrt. to kmem modifying PageCgroupUsed here.
>
Unlocking should do that, no?
> Perhaps we're just trying to take a conservative initial implementation
> which is consistent with user visible pages.
>
The way I see it, it is not about being conservative, but rather about my
physical safety. It is quite easy and natural to assume that "all
modifications to page cgroup are done under lock". So someone modifying
this later will likely find out about this exception in a rather
unpleasant way. They know where I live, and guns for hire are everywhere.
Note that it is not unreasonable to believe that we can modify this
later. This can be a way out, for example, for the memcg lifecycle problem.
I agree with your analysis and we can ultimately remove it, but if we
cannot pinpoint any performance problems here, maybe consistency
wins. Also, the locking operation itself is a bit expensive, but the
biggest price is the actual contention. If nobody is contending
for the same page_cgroup, the problem - if it exists - shouldn't be that
bad. And if we ever do have contention, the lock is needed.
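To make the "everything under the lock" convention concrete, here is a
minimal user-space sketch, assuming C11 atomics as a stand-in for the
kernel's bit spinlock. The names pc_model, memcg_model and the flag bits
are hypothetical and do not match the real page_cgroup layout. The
release ordering on unlock is what provides the barrier discussed above,
and an uncontended lock costs one atomic read-modify-write on each side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's page_cgroup flags differ. */
#define PCG_LOCK_BIT 0x1u
#define PCG_USED_BIT 0x2u

struct memcg_model { long kmem_bytes; };

struct pc_model {
	atomic_uint flags;
	struct memcg_model *memcg;
};

void pc_model_lock(struct pc_model *pc)
{
	/* Acquire: spin until we are the one who set the lock bit. */
	while (atomic_fetch_or_explicit(&pc->flags, PCG_LOCK_BIT,
					memory_order_acquire) & PCG_LOCK_BIT)
		;
}

void pc_model_unlock(struct pc_model *pc)
{
	/* Release: stores made under the lock are visible to the next holder. */
	atomic_fetch_and_explicit(&pc->flags, ~PCG_LOCK_BIT,
				  memory_order_release);
}

/* Commit step: both the owner pointer and the Used bit change under the lock. */
void pc_model_commit(struct pc_model *pc, struct memcg_model *memcg)
{
	pc_model_lock(pc);
	pc->memcg = memcg;
	atomic_fetch_or_explicit(&pc->flags, PCG_USED_BIT,
				 memory_order_relaxed);
	pc_model_unlock(pc);
}
```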
>> +}
>> +
>> +void __memcg_kmem_free_page(struct page *page, int order)
>> +{
>> + struct mem_cgroup *memcg;
>> + size_t size;
>> + struct page_cgroup *pc;
>> +
>> + if (mem_cgroup_disabled())
>> + return;
>> +
>> + pc = lookup_page_cgroup(page);
>> + lock_page_cgroup(pc);
>> + memcg = pc->mem_cgroup;
>> + pc->mem_cgroup = NULL;
>> + if (!PageCgroupUsed(pc)) {
>
> When do we expect to find PageCgroupUsed() unset in this routine? Is
> this just to handle the race of someone enabling kmem accounting after
> allocating a page and then later freeing that page?
>
Whenever we have a valid memcg, the page is marked Used at charge time,
so this is how we differentiate between a tracked page and a non-tracked
page. Note that even though we explicitly mark the freeing call sites with
free_accounted_pages, etc, not all pc->memcg will be valid. There are
unlimited memcgs, bypassed charges, GFP_NOFAIL allocations, etc.
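A simplified model of that free path, as a sketch only: locking is left
out, the page size is assumed to be 4096 bytes, and pc_model, memcg_model
and model_kmem_free_page() are hypothetical names. It shows the Used flag
deciding whether anything is uncharged at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the example */

struct memcg_model { unsigned long kmem_bytes; };

struct pc_model {
	bool used;			/* stands in for PageCgroupUsed */
	struct memcg_model *memcg;
};

/* Returns the number of bytes uncharged; 0 means the page was not tracked. */
unsigned long model_kmem_free_page(struct pc_model *pc, int order)
{
	struct memcg_model *memcg = pc->memcg;
	unsigned long size = MODEL_PAGE_SIZE << order;

	pc->memcg = NULL;
	if (!pc->used)
		return 0;	/* never charged: unlimited memcg, bypass, ... */
	pc->used = false;

	if (!memcg)
		return 0;

	memcg->kmem_bytes -= size;
	return size;
}
```

Freeing the same model page twice uncharges only once, because the first
call clears the flag.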
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (5 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 06/11] memcg: kmem controller infrastructure Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-10 17:36 ` Greg Thelen
2012-08-09 13:01 ` [PATCH v2 08/11] memcg: disable kmem code when not in use Glauber Costa
` (3 subsequent siblings)
10 siblings, 2 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.
To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no reason,
whoever is marking its allocations with __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/gfp.h | 3 +++
mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index d8eae4d..029570f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
extern void free_hot_cold_page(struct page *page, int cold);
extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern void __free_accounted_pages(struct page *page, unsigned int order);
+extern void free_accounted_pages(unsigned long addr, unsigned int order);
+
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b956cec..da341dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
int migratetype = allocflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
+ void *handle = NULL;
gfp_mask &= gfp_allowed_mask;
@@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
return NULL;
/*
+ * Will only have any effect when __GFP_KMEMCG is set.
+ * This is verified in the (always inline) callee
+ */
+ if (!memcg_kmem_new_page(gfp_mask, &handle, order))
+ return NULL;
+
+ /*
* Check the zones suitable for the gfp_mask contain at least one
* valid zone. It's possible to have an empty zonelist as a result
* of GFP_THISNODE and a memoryless node
@@ -2583,6 +2591,8 @@ out:
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
+ memcg_kmem_commit_page(page, handle, order);
+
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
EXPORT_SYMBOL(free_pages);
+/*
+ * __free_accounted_pages and free_accounted_pages will free pages allocated
+ * with __GFP_KMEMCG.
+ *
+ * Those pages are accounted to a particular memcg, embedded in the
+ * corresponding page_cgroup. To avoid adding a hit in the allocator to search
+ * for that information only to find out that it is NULL for users who have no
+ * interest in that whatsoever, we provide these functions.
+ *
+ * The caller knows better which flags it relies on.
+ */
+void __free_accounted_pages(struct page *page, unsigned int order)
+{
+ memcg_kmem_free_page(page, order);
+ __free_pages(page, order);
+}
+EXPORT_SYMBOL(__free_accounted_pages);
+
+void free_accounted_pages(unsigned long addr, unsigned int order)
+{
+ if (addr != 0) {
+ VM_BUG_ON(!virt_addr_valid((void *)addr));
+ memcg_kmem_free_page(virt_to_page((void *)addr), order);
+ __free_pages(virt_to_page((void *)addr), order);
+ }
+}
+EXPORT_SYMBOL(free_accounted_pages);
+
static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
{
if (addr) {
--
1.7.11.2
^ permalink raw reply related [flat|nested] 135+ messages in thread
[parent not found: <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
[not found] ` <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-09 16:33 ` Greg Thelen
[not found] ` <xr93boikgh4w.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2012-08-10 17:33 ` Kamezawa Hiroyuki
` (2 subsequent siblings)
3 siblings, 1 reply; 135+ messages in thread
From: Greg Thelen @ 2012-08-09 16:33 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg,
Suleiman Souhlal
On Thu, Aug 09 2012, Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/gfp.h | 3 +++
> mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
> extern void free_hot_cold_page(struct page *page, int cold);
> extern void free_hot_cold_page_list(struct list_head *list, int cold);
>
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
> #define __free_page(page) __free_pages((page), 0)
> #define free_page(addr) free_pages((addr), 0)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> int migratetype = allocflags_to_migratetype(gfp_mask);
> unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>
> gfp_mask &= gfp_allowed_mask;
>
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> return NULL;
>
> /*
> + * Will only have any effect when __GFP_KMEMCG is set.
> + * This is verified in the (always inline) callee
> + */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
> +
> + /*
> * Check the zones suitable for the gfp_mask contain at least one
> * valid zone. It's possible to have an empty zonelist as a result
> * of GFP_THISNODE and a memoryless node
If memcg_kmem_new_page() succeeds then it may have obtained a memcg
reference with mem_cgroup_get(). I think this reference is leaked when
returning below:
/*
* Check the zones suitable for the gfp_mask contain at least one
* valid zone. It's possible to have an empty zonelist as a result
* of GFP_THISNODE and a memoryless node
*/
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
I suspect the easiest fix is to swap the call to memcg_kmem_new_page()
and the (!zonelist->_zonerefs->zone) check.
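The ordering concern can be illustrated with a small user-space sketch. It assumes, as suspected above, that a successful charge takes one reference that only a later commit or uncharge would drop. All names (mock_memcg, mock_charge, mock_alloc_charge_first, mock_alloc_check_first) are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memcg reduced to a bare reference count. */
struct mock_memcg {
	int refcnt;
};

/* Stand-in for the charge step: a successful charge pins the memcg. */
static bool mock_charge(struct mock_memcg *memcg)
{
	assert(memcg != NULL);
	memcg->refcnt++;
	return true;
}

/* Order used in the patch: charge first, then bail out on an empty zonelist. */
static void *mock_alloc_charge_first(struct mock_memcg *memcg, bool zonelist_empty)
{
	static char page[1];

	if (!mock_charge(memcg))
		return NULL;
	if (zonelist_empty)
		return NULL;	/* the reference taken above is never dropped */
	return page;
}

/* Suggested order: check the zonelist first, so nothing needs undoing. */
static void *mock_alloc_check_first(struct mock_memcg *memcg, bool zonelist_empty)
{
	static char page[1];

	if (zonelist_empty)
		return NULL;
	if (!mock_charge(memcg))
		return NULL;
	return page;
}
```

With an empty zonelist, the first variant returns NULL with the count still raised, while the second returns NULL with the count untouched.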
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
[not found] ` <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 16:33 ` Greg Thelen
@ 2012-08-10 17:33 ` Kamezawa Hiroyuki
[not found] ` <502545D2.80708-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2012-08-14 15:16 ` Mel Gorman
2012-08-15 9:24 ` Michal Hocko
3 siblings, 1 reply; 135+ messages in thread
From: Kamezawa Hiroyuki @ 2012-08-10 17:33 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, Christoph Lameter, David Rientjes, Pekka Enberg,
Pekka Enberg, Suleiman Souhlal
(2012/08/09 22:01), Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Ah, ok. free_accounted_page() seems good.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
I myself am okay with this. But...
Because you add a new hook to alloc_pages(), please get Ack from Mel
before requesting merge.
Thanks,
-Kame
> ---
> include/linux/gfp.h | 3 +++
> mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
> extern void free_hot_cold_page(struct page *page, int cold);
> extern void free_hot_cold_page_list(struct list_head *list, int cold);
>
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
> #define __free_page(page) __free_pages((page), 0)
> #define free_page(addr) free_pages((addr), 0)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> int migratetype = allocflags_to_migratetype(gfp_mask);
> unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>
> gfp_mask &= gfp_allowed_mask;
>
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> return NULL;
>
> /*
> + * Will only have any effect when __GFP_KMEMCG is set.
> + * This is verified in the (always inline) callee
> + */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
> +
> + /*
> * Check the zones suitable for the gfp_mask contain at least one
> * valid zone. It's possible to have an empty zonelist as a result
> * of GFP_THISNODE and a memoryless node
> @@ -2583,6 +2591,8 @@ out:
> if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> goto retry_cpuset;
>
> + memcg_kmem_commit_page(page, handle, order);
> +
> return page;
> }
> EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>
> EXPORT_SYMBOL(free_pages);
>
> +/*
> + * __free_accounted_pages and free_accounted_pages will free pages allocated
> + * with __GFP_KMEMCG.
> + *
> + * Those pages are accounted to a particular memcg, embedded in the
> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
> + * for that information only to find out that it is NULL for users who have no
> + * interest in that whatsoever, we provide these functions.
> + *
> + * The caller knows better which flags it relies on.
> + */
> +void __free_accounted_pages(struct page *page, unsigned int order)
> +{
> + memcg_kmem_free_page(page, order);
> + __free_pages(page, order);
> +}
> +EXPORT_SYMBOL(__free_accounted_pages);
> +
> +void free_accounted_pages(unsigned long addr, unsigned int order)
> +{
> + if (addr != 0) {
> + VM_BUG_ON(!virt_addr_valid((void *)addr));
> + memcg_kmem_free_page(virt_to_page((void *)addr), order);
> + __free_pages(virt_to_page((void *)addr), order);
> + }
> +}
> +EXPORT_SYMBOL(free_accounted_pages);
> +
> static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
> {
> if (addr) {
>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
[not found] ` <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 16:33 ` Greg Thelen
2012-08-10 17:33 ` Kamezawa Hiroyuki
@ 2012-08-14 15:16 ` Mel Gorman
2012-08-15 9:08 ` Glauber Costa
2012-08-15 9:24 ` Michal Hocko
3 siblings, 1 reply; 135+ messages in thread
From: Mel Gorman @ 2012-08-14 15:16 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg,
Suleiman Souhlal
On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
As you already guessed, doing a page_cgroup lookup in the page allocator
free path would be a no-go.
This is my first time glancing at the series and I'm only paying close
attention to this patch so pardon me if my observations have been made
already.
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/gfp.h | 3 +++
> mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
> extern void free_hot_cold_page(struct page *page, int cold);
> extern void free_hot_cold_page_list(struct list_head *list, int cold);
>
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
> #define __free_page(page) __free_pages((page), 0)
> #define free_page(addr) free_pages((addr), 0)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> int migratetype = allocflags_to_migratetype(gfp_mask);
> unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>
> gfp_mask &= gfp_allowed_mask;
>
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> return NULL;
>
> /*
> + * Will only have any effect when __GFP_KMEMCG is set.
> + * This is verified in the (always inline) callee
> + */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
memcg_kmem_new_page takes a void * parameter already but here you are
passing in a void **. This probably happens to work because you do this
struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
but that appears to defeat the purpose of having an opaque type as a
"handle". You then have to treat it differently when passing it into the
commit function because that expects a void *. The motivation for an
opaque type is completely unclear to me and how it is managed with a mix
of void * and void ** is very confusing.
On a similar note I spotted #define memcg_kmem_on 1. That is also
different just for the sake of it. The convention is to do something
like this
/* This helps us to avoid #ifdef CONFIG_NUMA */
#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif
memcg_kmem_on was difficult to guess based on its name. I thought initially
that it would only be active if a memcg existed or at least something like
mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.
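For illustration, the 0/1 build-constant convention can be exercised in a stand-alone file. This is a sketch only: MOCK_CONFIG_KMEM, KMEM_BUILD and mock_should_account are hypothetical names invented here, with MOCK_CONFIG_KMEM standing in for a Kconfig symbol.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * MOCK_CONFIG_KMEM stands in for a Kconfig symbol; pass -DMOCK_CONFIG_KMEM
 * to the compiler to "enable" the feature.
 */
#ifdef MOCK_CONFIG_KMEM
#define KMEM_BUILD 1
#else
#define KMEM_BUILD 0
#endif

static_assert(KMEM_BUILD == 0 || KMEM_BUILD == 1,
	      "KMEM_BUILD must be a 0/1 constant");

/* Callers test the constant in plain C; the dead branch is compiled out. */
static bool mock_should_account(bool flag_set)
{
	if (!KMEM_BUILD)
		return false;
	return flag_set;
}
```

The point of the convention is that the name says "this was built in", and call sites need no #ifdef of their own.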
I also find it *very* strange to have a function named as if it is an
allocation-style function when in fact it's looking up a mem_cgroup
and charging it (and uncharging it in the error path if necessary). If
it was called memcg_kmem_newpage_charge I might have found it a little
better. While I believe you have to take care to avoid confusion with
mem_cgroup_newpage_charge, it would be preferable if the APIs were similar.
memcg is hard enough as it is to understand without having different APIs.
This whole operation also looks very expensive (cgroup lookups, RCU locks
taken etc) but I guess you're willing to take that cost in the name of
isolating containers from each other. However, I strongly suggest that
this overhead is measured in advance. It should not stop the series being
merged as such but it should be understood because if the cost is high
then this feature will be avoided like the plague. I am skeptical that
distributions would enable this by default, at least not without support
for cgroup_disable=kmem.
As this thing is called from within the allocator, it's not clear why
__memcg_kmem_new_page is exported. I can't imagine why a module would call
it directly although maybe you cover that somewhere else in the series.
From the point of view of a hook, that is acceptable but just barely. I have
slammed other hooks because it was possible for a subsystem to override them
meaning the runtime cost could be anything. I did not spot a similar issue
here but if I missed it, it's still unacceptable. At least here the cost
is sort of predictable and only affects memcg because of the __GFP_KMEMCG
check in memcg_kmem_new_page.
> + return NULL;
> +
> + /*
> * Check the zones suitable for the gfp_mask contain at least one
> * valid zone. It's possible to have an empty zonelist as a result
> * of GFP_THISNODE and a memoryless node
> @@ -2583,6 +2591,8 @@ out:
> if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> goto retry_cpuset;
>
> + memcg_kmem_commit_page(page, handle, order);
> +
As a side note, I'm not keen on how you shortcut these functions. They
are all function calls because memcg_kmem_commit_page() will always call
__memcg_kmem_commit_page() to check the handle once it's compiled in.
The handle==NULL check should have happened in the inline function to save
a few cycles.
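A minimal sketch of what moving the check into the inline wrapper could look like, written as user-space C. The names mock_commit_page, mock_commit_slow and mock_slow_calls are hypothetical, and the "slow" function only counts calls instead of doing real commit work.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the out-of-line slow path is actually entered. */
static int mock_slow_calls;

/* Out-of-line part: stands in for the real commit work. */
static void mock_commit_slow(void *page, void *handle)
{
	assert(handle != NULL);	/* the inline wrapper filters out NULL */
	(void)page;
	mock_slow_calls++;
}

/*
 * Inline wrapper: the cheap handle test happens at the call site, so the
 * common "nothing was charged" case never pays for a function call.
 */
static inline void mock_commit_page(void *page, void *handle)
{
	if (handle)
		mock_commit_slow(page, handle);
}
```

Callers that never charged anything pass a NULL handle and never leave the inlined code.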
This also has the feel that the call of memcg_kmem_commit_page belongs in
prep_new_page() but I recognise that requires passing the opaque handle
around which would be very ugly.
> return page;
> }
> EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>
> EXPORT_SYMBOL(free_pages);
>
> +/*
> + * __free_accounted_pages and free_accounted_pages will free pages allocated
> + * with __GFP_KMEMCG.
> + *
> + * Those pages are accounted to a particular memcg, embedded in the
> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
> + * for that information only to find out that it is NULL for users who have no
> + * interest in that whatsoever, we provide these functions.
> + *
> + * The caller knows better which flags it relies on.
> + */
> +void __free_accounted_pages(struct page *page, unsigned int order)
> +{
> + memcg_kmem_free_page(page, order);
> + __free_pages(page, order);
> +}
> +EXPORT_SYMBOL(__free_accounted_pages);
memcg_kmem_new_page makes the following check
+ if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
+ return true;
so if the allocation had __GFP_NOFAIL, it does not get charged but can
still be freed. I didn't check if this is really the case but it looks
very suspicious.
Again, this is a fairly heavy operation.
> +
> +void free_accounted_pages(unsigned long addr, unsigned int order)
> +{
> + if (addr != 0) {
> + VM_BUG_ON(!virt_addr_valid((void *)addr));
> + memcg_kmem_free_page(virt_to_page((void *)addr), order);
> + __free_pages(virt_to_page((void *)addr), order);
> + }
> +}
> +EXPORT_SYMBOL(free_accounted_pages);
> +
> static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
> {
> if (addr) {
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
2012-08-14 15:16 ` Mel Gorman
@ 2012-08-15 9:08 ` Glauber Costa
[not found] ` <502B66F8.30909-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-15 9:08 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg,
Suleiman Souhlal
On 08/14/2012 07:16 PM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>
> As you already guessed, doing a page_cgroup in the page allocator free
> path would be a no-go.
Specifically yes, but in general, you will see that I am taking every
possible measure to make sure existing paths are disturbed as little as
possible.
Thanks for your review here.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b956cec..da341dc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> struct page *page = NULL;
>> int migratetype = allocflags_to_migratetype(gfp_mask);
>> unsigned int cpuset_mems_cookie;
>> + void *handle = NULL;
>>
>> gfp_mask &= gfp_allowed_mask;
>>
>> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> return NULL;
>>
>> /*
>> + * Will only have any effect when __GFP_KMEMCG is set.
>> + * This is verified in the (always inline) callee
>> + */
>> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
>
> memcg_kmem_new_page takes a void * parameter already but here you are
> passing in a void **. This probably happens to work because you do this
>
> struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>
> but that appears to defeat the purpose of having an opaque type as a
> "handle". You have to treat it different then passing it into the commit
> function because it expects a void *. The motivation for an opaque type
> is completely unclear to me and how it is managed with a mix of void *
> and void ** is very confusing.
okay.
The opaque handle exists because I am doing speculative charging. I
believe it to be a better and less complicated approach than letting a
page appear and then charging it. Besides being consistent with the rest
of memcg, it won't create unnecessary disturbance in the page allocator
when the allocation is to fail.
Now, tasks can move between memcgs, so we can't rely on grabbing it from
current in commit_page, so we pass it around as a handle. Also, even if
the task could not move, we already got it once from the task, and that
is not for free. Better save it.
Aside from the handle needed, the cost is more or less the same compared
to doing it in one pass. All we do by using speculative charging is
split the cost in two and pay it from two places.
We'd have to charge + update page_cgroup anyway.
As for the type, do you think using struct mem_cgroup would be less
confusing?
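To make the two-step scheme concrete, here is a user-space sketch of speculative charging with a typed handle rather than a void *. It rests on several assumptions: the names (mock_memcg, mock_page, mock_newpage_charge, mock_commit_charge) are hypothetical, the limit check is a plain comparison, and the commit step is assumed to undo the charge when the allocation failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_memcg {
	long usage;	/* pages charged so far */
	long limit;	/* maximum pages allowed */
};

struct mock_page {
	struct mock_memcg *owner;	/* filled in at commit time */
};

/*
 * Step 1: charge before the page exists. On success *memcgp remembers which
 * group paid, so the commit step does not have to look it up again (the
 * task may have moved to another group in the meantime).
 */
static bool mock_newpage_charge(struct mock_memcg *current_memcg,
				struct mock_memcg **memcgp, unsigned int order)
{
	long nr = 1L << order;

	assert(memcgp != NULL);
	*memcgp = NULL;
	if (current_memcg->usage + nr > current_memcg->limit)
		return false;
	current_memcg->usage += nr;
	*memcgp = current_memcg;
	return true;
}

/* Step 2: bind the charge to the page, or undo it if allocation failed. */
static void mock_commit_charge(struct mock_page *page,
			       struct mock_memcg *memcg, unsigned int order)
{
	if (!memcg)
		return;		/* nothing was charged */
	if (!page) {
		memcg->usage -= 1L << order;
		return;
	}
	page->owner = memcg;
}
```

The allocator itself only carries the pointer from step 1 to step 2; a failed charge returns before any allocation work is done.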
> On a similar note I spotted #define memcg_kmem_on 1 . That is also
> different just for the sake of it. The convension is to do something
> like this
>
> /* This helps us to avoid #ifdef CONFIG_NUMA */
> #ifdef CONFIG_NUMA
> #define NUMA_BUILD 1
> #else
> #define NUMA_BUILD 0
> #endif
For simple defines, yes. But a later patch will turn this into a static
branch test. memcg_kmem_on will always be 0 when compile-disabled, but
when enabled will expand to static_branch(&...).
> memcg_kmem_on was difficult to guess based on its name. I thought initially
> that it would only be active if a memcg existed or at least something like
> mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.
For now. And I thought that adding the static branch in this patch would
only confuse matters. The placeholder is there, but it is later patched
to the final thing.
With that explained, if you want me to change it to something else, I
can do it. Should I?
> I also find it *very* strange to have a function named as if it is an
> allocation-style function when it in fact it's looking up a mem_cgroup
> and charging it (and uncharging it in the error path if necessary). If
> it was called memcg_kmem_newpage_charge I might have found it a little
> better.
I don't feel strongly about names in general. I can change it.
Will update to memcg_kmem_newpage_charge() and memcg_kmem_page_uncharge().
> This whole operation also looks very expensive (cgroup lookups, RCU locks
> taken etc) but I guess you're willing to take that cost in the same of
> isolating containers from each other. However, I strongly suggest that
> this overhead is measured in advance. It should not stop the series being
> merged as such but it should be understood because if the cost is high
> then this feature will be avoided like the plague. I am skeptical that
> distributions would enable this by default, at least not without support
> for cgroup_disable=kmem
Merely enabling this feature costs you nothing, hence no (or little)
overhead. None of this will be patched in until the first memcg gets
kmem limited. The mere fact of moving tasks to memcgs won't trigger any
of this.
I haven't measured this series in particular, but I did measure the slab
series (which builds on top of this). I found the per-allocation cost to
be on the order of 2-3% for tasks living in limited memcgs, and hard to
observe for tasks living in the root memcg (compared, of course, to the
case of a task running in the root memcg without those patches).
I believe the folks from Google also measured this. They may be
able to spit out numbers grabbed from a system bigger than mine =p
> As this thing is called from within the allocator, it's not clear why
> __memcg_kmem_new_page is exported. I can't imagine why a module would call
> it directly although maybe you cover that somewhere else in the series.
Okay, more people commented on this, so let me clarify: they shouldn't
be. They were initially exported when this was about the slab only,
because they could be called from inlined functions from the allocators.
Now that the charge/uncharge has moved to the page allocator - which
already gave me the big benefit of separating this into two pieces -
none of this needs to be exported.
Sorry for not noticing this myself, but thanks for the eyes =)
> From the point of view of a hook, that is acceptable but just barely. I have
> slammed other hooks because it was possible for a subsystem to override them
> meaning the runtime cost could be anything. I did not spot a similar issue
> here but if I missed it, it's still unacceptable. At least here the cost
> is sortof predictable and only affects memcg because of the __GFP_KMEMCG
> check in memcg_kmem_new_page.
Yes, that is the idea. And I don't think anyone should override those,
so I don't see them as hooks in this sense.
>> + return NULL;
>> +
>> + /*
>> * Check the zones suitable for the gfp_mask contain at least one
>> * valid zone. It's possible to have an empty zonelist as a result
>> * of GFP_THISNODE and a memoryless node
>> @@ -2583,6 +2591,8 @@ out:
>> if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
>> goto retry_cpuset;
>>
>> + memcg_kmem_commit_page(page, handle, order);
>> +
>
> As a side note, I'm not keen on how you shortcut these functions. They
> are all function calls because memcg_kmem_commit_page() will always call
> __memcg_kmem_commit_page() to check the handle once it's compiled in.
> The handle==NULL check should have happened in the inline function to save
> a few cycles.
>
This is already done in my updated series, after a comment from Kame
pointed it out.
> This also has the feel that the call of memcg_kmem_commit_page belongs in
> prep_new_page() but I recognise that requires passing the opaque handler
> around which would be very ugly.
Indeed, and that is the reason why I kept everything local.
>> return page;
>
> memcg_kmem_new_page makes the following check
>
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
>
> so if the allocation had __GFP_NOFAIL, it does not get charged but can
> still be freed. I didn't check if this is really the case but it looks
> very suspicious.
No, it can't be freed (uncharged), because in that case we won't fill
in the memcg information in page_cgroup.
>
> Again, this is a fairly heavy operation.
Mel, once I address all the issues you pointed out here, do you think
this would be in an acceptable state for merging? Do you still have any
fundamental opposition to this?
thanks again
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
[not found] ` <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (2 preceding siblings ...)
2012-08-14 15:16 ` Mel Gorman
@ 2012-08-15 9:24 ` Michal Hocko
3 siblings, 0 replies; 135+ messages in thread
From: Michal Hocko @ 2012-08-15 9:24 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Johannes Weiner, Andrew Morton,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Christoph Lameter,
David Rientjes, Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On Thu 09-08-12 17:01:15, Glauber Costa wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> int migratetype = allocflags_to_migratetype(gfp_mask);
> unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>
> gfp_mask &= gfp_allowed_mask;
>
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> return NULL;
>
> /*
> + * Will only have any effect when __GFP_KMEMCG is set.
> + * This is verified in the (always inline) callee
> + */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
When the previous patch introduced this function I thought the handle
obfuscation was meant to keep struct mem_cgroup from spreading inside the
page allocator, but memcg_kmem_commit_page uses the type directly. So why
the obfuscation? Even handle as a name sounds unnecessarily confusing.
I would go with struct mem_cgroup **memcgp or even return the pointer on
success or NULL otherwise.
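
A rough sketch of the typed variant, to make the suggestion concrete. The
names here (sketch_new_page(), SKETCH_GFP_KMEMCG, the two counters in the
stand-in struct mem_cgroup) are made up for illustration and are not taken
from the series; the current memcg is passed in explicitly only so the
example is self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real struct mem_cgroup: a page counter and a limit. */
struct mem_cgroup {
	long charged_pages;
	long limit_pages;
};

#define SKETCH_GFP_KMEMCG 0x1u	/* hypothetical flag value */

/*
 * Typed out-parameter instead of "void *handle". Returns false only when
 * the charge fails; *memcgp is non-NULL only when a charge was taken.
 */
static bool sketch_new_page(unsigned int gfp, struct mem_cgroup *current_memcg,
			    struct mem_cgroup **memcgp, int order)
{
	long nr_pages = 1L << order;

	*memcgp = NULL;
	if (!(gfp & SKETCH_GFP_KMEMCG) || current_memcg == NULL)
		return true;	/* not an accounted allocation, proceed */

	if (current_memcg->charged_pages + nr_pages > current_memcg->limit_pages)
		return false;	/* over the limit, fail the allocation */

	current_memcg->charged_pages += nr_pages;
	*memcgp = current_memcg;
	return true;
}
```

The caller then hands *memcgp to the commit step, and the type documents
what is being passed around.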
[...]
> +EXPORT_SYMBOL(__free_accounted_pages);
Why exported?
Btw. this is called from call_rcu context but it itself calls call_rcu
down the chain in mem_cgroup_put. Is it safe?
[...]
> +EXPORT_SYMBOL(free_accounted_pages);
here again
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
2012-08-09 13:01 ` [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg Glauber Costa
[not found] ` <1344517279-30646-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-10 17:36 ` Greg Thelen
2012-08-13 8:02 ` Glauber Costa
1 sibling, 1 reply; 135+ messages in thread
From: Greg Thelen @ 2012-08-10 17:36 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg,
Suleiman Souhlal
On Thu, Aug 09 2012, Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/gfp.h | 3 +++
> mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
> extern void free_hot_cold_page(struct page *page, int cold);
> extern void free_hot_cold_page_list(struct list_head *list, int cold);
>
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
> #define __free_page(page) __free_pages((page), 0)
> #define free_page(addr) free_pages((addr), 0)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> struct page *page = NULL;
> int migratetype = allocflags_to_migratetype(gfp_mask);
> unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>
> gfp_mask &= gfp_allowed_mask;
>
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> return NULL;
>
> /*
> + * Will only have any effect when __GFP_KMEMCG is set.
> + * This is verified in the (always inline) callee
> + */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
> +
> + /*
> * Check the zones suitable for the gfp_mask contain at least one
> * valid zone. It's possible to have an empty zonelist as a result
> * of GFP_THISNODE and a memoryless node
> @@ -2583,6 +2591,8 @@ out:
> if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> goto retry_cpuset;
>
> + memcg_kmem_commit_page(page, handle, order);
> +
> return page;
> }
> EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>
> EXPORT_SYMBOL(free_pages);
>
> +/*
> + * __free_accounted_pages and free_accounted_pages will free pages allocated
> + * with __GFP_KMEMCG.
> + *
> + * Those pages are accounted to a particular memcg, embedded in the
> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
> + * for that information only to find out that it is NULL for users who have no
> + * interest in that whatsoever, we provide these functions.
> + *
> + * The caller knows better which flags it relies on.
> + */
> +void __free_accounted_pages(struct page *page, unsigned int order)
> +{
> + memcg_kmem_free_page(page, order);
> + __free_pages(page, order);
> +}
> +EXPORT_SYMBOL(__free_accounted_pages);
> +
> +void free_accounted_pages(unsigned long addr, unsigned int order)
> +{
> + if (addr != 0) {
> + VM_BUG_ON(!virt_addr_valid((void *)addr));
> + memcg_kmem_free_page(virt_to_page((void *)addr), order);
> + __free_pages(virt_to_page((void *)addr), order);
Nit. Is there any reason not to replace the above two lines with:
__free_accounted_pages(virt_to_page((void *)addr), order);
> + }
> +}
> +EXPORT_SYMBOL(free_accounted_pages);
> +
> static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
> {
> if (addr) {
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg
2012-08-10 17:36 ` Greg Thelen
@ 2012-08-13 8:02 ` Glauber Costa
0 siblings, 0 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-13 8:02 UTC (permalink / raw)
To: Greg Thelen
Cc: linux-kernel, linux-mm, cgroups, devel, Michal Hocko,
Johannes Weiner, Andrew Morton, kamezawa.hiroyu,
Christoph Lameter, David Rientjes, Pekka Enberg, Pekka Enberg,
Suleiman Souhlal
On 08/10/2012 09:36 PM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
>
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time. This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Christoph Lameter <cl@linux.com>
>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Suleiman Souhlal <suleiman@google.com>
>> ---
>> include/linux/gfp.h | 3 +++
>> mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 41 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index d8eae4d..029570f 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>> extern void free_hot_cold_page(struct page *page, int cold);
>> extern void free_hot_cold_page_list(struct list_head *list, int cold);
>>
>> +extern void __free_accounted_pages(struct page *page, unsigned int order);
>> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
>> +
>> #define __free_page(page) __free_pages((page), 0)
>> #define free_page(addr) free_pages((addr), 0)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b956cec..da341dc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> struct page *page = NULL;
>> int migratetype = allocflags_to_migratetype(gfp_mask);
>> unsigned int cpuset_mems_cookie;
>> + void *handle = NULL;
>>
>> gfp_mask &= gfp_allowed_mask;
>>
>> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> return NULL;
>>
>> /*
>> + * Will only have any effect when __GFP_KMEMCG is set.
>> + * This is verified in the (always inline) callee
>> + */
>> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
>> + return NULL;
>> +
>> + /*
>> * Check the zones suitable for the gfp_mask contain at least one
>> * valid zone. It's possible to have an empty zonelist as a result
>> * of GFP_THISNODE and a memoryless node
>> @@ -2583,6 +2591,8 @@ out:
>> if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
>> goto retry_cpuset;
>>
>> + memcg_kmem_commit_page(page, handle, order);
>> +
>> return page;
>> }
>> EXPORT_SYMBOL(__alloc_pages_nodemask);
>> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>>
>> EXPORT_SYMBOL(free_pages);
>>
>> +/*
>> + * __free_accounted_pages and free_accounted_pages will free pages allocated
>> + * with __GFP_KMEMCG.
>> + *
>> + * Those pages are accounted to a particular memcg, embedded in the
>> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
>> + * for that information only to find out that it is NULL for users who have no
>> + * interest in that whatsoever, we provide these functions.
>> + *
>> + * The caller knows better which flags it relies on.
>> + */
>> +void __free_accounted_pages(struct page *page, unsigned int order)
>> +{
>> + memcg_kmem_free_page(page, order);
>> + __free_pages(page, order);
>> +}
>> +EXPORT_SYMBOL(__free_accounted_pages);
>> +
>> +void free_accounted_pages(unsigned long addr, unsigned int order)
>> +{
>> + if (addr != 0) {
>> + VM_BUG_ON(!virt_addr_valid((void *)addr));
>> + memcg_kmem_free_page(virt_to_page((void *)addr), order);
>> + __free_pages(virt_to_page((void *)addr), order);
>
> Nit. Is there any reason not to replace the above two lines with:
> __free_accounted_pages(virt_to_page((void *)addr), order);
>
Not any particular reason. If people prefer it this way, I can do that
with no problems.
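
For illustration, the delegation Greg suggests would look roughly like the
sketch below. Everything prefixed with sketch_ is a hypothetical user-space
stand-in (including the toy address-to-page mapping); the real function would
keep its VM_BUG_ON() check and call the kernel helpers instead of bumping
counters.

```c
#include <stddef.h>

/* Stand-in page: records that the uncharge and free steps ran. */
struct page {
	int uncharged;
	int freed;
};

#define SKETCH_PAGE_SIZE 4096UL

static struct page sketch_pages[4];

/* Toy virt_to_page(): address N * 4096 maps to sketch_pages[N - 1]. */
static struct page *sketch_virt_to_page(unsigned long addr)
{
	return &sketch_pages[addr / SKETCH_PAGE_SIZE - 1];
}

/* Page-based variant: uncharge first, then free. */
static void sketch_free_accounted_page_struct(struct page *page,
					      unsigned int order)
{
	(void)order;
	page->uncharged++;	/* stands in for memcg_kmem_free_page() */
	page->freed++;		/* stands in for __free_pages() */
}

/* Address-based variant: just translate and delegate. */
static void sketch_free_accounted_pages(unsigned long addr, unsigned int order)
{
	if (addr != 0)
		sketch_free_accounted_page_struct(sketch_virt_to_page(addr),
						  order);
}
```

This keeps the uncharge-then-free ordering in a single place.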
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v2 08/11] memcg: disable kmem code when not in use.
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (6 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
2012-08-17 7:02 ` Michal Hocko
2012-08-09 13:01 ` [PATCH v2 09/11] memcg: propagate kmem limiting information to children Glauber Costa
` (2 subsequent siblings)
10 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
We can use jump labels to patch the code in or out when not used.
Because the assignment: memcg->kmem_accounted = true is done after the
jump labels increment, we guarantee that the root memcg will always be
selected until all call sites are patched (see memcg_kmem_enabled).
This guarantees that no mischarges are applied.
Jump label decrement happens when the last reference count from the
memcg dies. This will only happen when the caches are all dead.
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/memcontrol.h | 5 ++++-
mm/memcontrol.c | 50 ++++++++++++++++++++++++++++++++++++----------
2 files changed, 44 insertions(+), 11 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b247e..f39d933 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -22,6 +22,7 @@
#include <linux/cgroup.h>
#include <linux/vm_event_item.h>
#include <linux/hardirq.h>
+#include <linux/jump_label.h>
struct mem_cgroup;
struct page_cgroup;
@@ -401,7 +402,9 @@ struct sock;
void sock_update_memcg(struct sock *sk);
void sock_release_memcg(struct sock *sk);
-#define memcg_kmem_on 1
+extern struct static_key memcg_kmem_enabled_key;
+#define memcg_kmem_on static_key_false(&memcg_kmem_enabled_key)
+
bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
void __memcg_kmem_free_page(struct page *page, int order);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9824c1..3216292 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -437,6 +437,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
#include <net/sock.h>
#include <net/ip.h>
+struct static_key memcg_kmem_enabled_key;
+/* so modules can inline the checks */
+EXPORT_SYMBOL(memcg_kmem_enabled_key);
+
static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
@@ -607,6 +611,16 @@ void __memcg_kmem_free_page(struct page *page, int order)
mem_cgroup_put(memcg);
}
EXPORT_SYMBOL(__memcg_kmem_free_page);
+
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+ if (memcg->kmem_accounted)
+ static_key_slow_dec(&memcg_kmem_enabled_key);
+}
+#else
+static void disarm_kmem_keys(struct mem_cgroup *memcg)
+{
+}
#endif /* CONFIG_MEMCG_KMEM */
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
@@ -622,6 +636,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
}
#endif
+static void disarm_static_keys(struct mem_cgroup *memcg)
+{
+ disarm_sock_keys(memcg);
+ disarm_kmem_keys(memcg);
+}
+
static void drain_all_stock_async(struct mem_cgroup *memcg);
static struct mem_cgroup_per_zone *
@@ -4147,6 +4167,24 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
return simple_read_from_buffer(buf, nbytes, ppos, str, len);
}
+
+static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
+{
+#ifdef CONFIG_MEMCG_KMEM
+ /*
+ * Once enabled, can't be disabled. We could in theory disable it if we
+ * haven't yet created any caches, or if we can shrink them all to
+ * death. But it is not worth the trouble.
+ */
+ mutex_lock(&set_limit_mutex);
+ if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
+ static_key_slow_inc(&memcg_kmem_enabled_key);
+ memcg->kmem_accounted = true;
+ }
+ mutex_unlock(&set_limit_mutex);
+#endif
+}
+
/*
* The user of this function is...
* RES_LIMIT.
@@ -4184,15 +4222,7 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
ret = res_counter_set_limit(&memcg->kmem, val);
if (ret)
break;
- /*
- * Once enabled, can't be disabled. We could in theory
- * disable it if we haven't yet created any caches, or
- * if we can shrink them all to death.
- *
- * But it is not worth the trouble
- */
- if (!memcg->kmem_accounted && val != RESOURCE_MAX)
- memcg->kmem_accounted = true;
+ memcg_update_kmem_limit(memcg, val);
} else
return -EINVAL;
break;
@@ -5054,7 +5084,7 @@ static void free_work(struct work_struct *work)
* to move this code around, and make sure it is outside
* the cgroup_lock.
*/
- disarm_sock_keys(memcg);
+ disarm_static_keys(memcg);
if (size < PAGE_SIZE)
kfree(memcg);
else
--
1.7.11.2
^ permalink raw reply related [flat|nested] 135+ messages in thread
* Re: [PATCH v2 08/11] memcg: disable kmem code when not in use.
2012-08-09 13:01 ` [PATCH v2 08/11] memcg: disable kmem code when not in use Glauber Costa
@ 2012-08-17 7:02 ` Michal Hocko
2012-08-17 7:01 ` Glauber Costa
0 siblings, 1 reply; 135+ messages in thread
From: Michal Hocko @ 2012-08-17 7:02 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On Thu 09-08-12 17:01:16, Glauber Costa wrote:
> We can use jump labels to patch the code in or out when not used.
>
> Because the assignment: memcg->kmem_accounted = true is done after the
> jump labels increment, we guarantee that the root memcg will always be
> selected until all call sites are patched (see memcg_kmem_enabled).
Not that it would be really important because kmem_accounted goes away
in a subsequent patch but I think the wording is a bit misleading here.
First of all there is no guarantee that kmem_accounted=true is seen
before atomic_inc(&key->enabled) because there is no memory barrier and
the lock serves just a leave barrier. But I do not think this is
important at all because key->enabled is what matters here. Even if
memcg_kmem_enabled is true we do not consider it if the key is disabled,
right?
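
A small user-space sketch of that gating, using C11 atomics. It is not the
kernel mechanism: the real check is code patched in and out by the jump
label, and the default sequentially consistent atomics used here are
stronger than anything the patch relies on. All sketch_ names are invented
for the example, and the enable path is shown without the mutex because the
sketch is single-threaded.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stands in for the enabled count behind memcg_kmem_enabled_key. */
static atomic_int sketch_key_enabled;

struct sketch_memcg {
	atomic_bool kmem_accounted;
};

/* Enable path: bump the key first, then publish the per-memcg flag. */
static void sketch_enable(struct sketch_memcg *memcg)
{
	if (!atomic_load(&memcg->kmem_accounted)) {
		atomic_fetch_add(&sketch_key_enabled, 1);
		atomic_store(&memcg->kmem_accounted, true);
	}
}

/* Charge path: the per-memcg flag is consulted only if the key is on. */
static bool sketch_should_charge(struct sketch_memcg *memcg)
{
	if (atomic_load(&sketch_key_enabled) == 0)
		return false;
	return atomic_load(&memcg->kmem_accounted);
}
```

Even if a reader somehow observed the flag before the key, the first test
would still send it down the unaccounted path.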
> This guarantees that no mischarges are applied.
>
> Jump label decrement happens when the last reference count from the
> memcg dies. This will only happen when the caches are all dead.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
Anyway the code looks correct.
Reviewed-by: Michal Hocko <mhocko@suse.cz>
> ---
> include/linux/memcontrol.h | 5 ++++-
> mm/memcontrol.c | 50 ++++++++++++++++++++++++++++++++++++----------
> 2 files changed, 44 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 75b247e..f39d933 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -22,6 +22,7 @@
> #include <linux/cgroup.h>
> #include <linux/vm_event_item.h>
> #include <linux/hardirq.h>
> +#include <linux/jump_label.h>
>
> struct mem_cgroup;
> struct page_cgroup;
> @@ -401,7 +402,9 @@ struct sock;
> void sock_update_memcg(struct sock *sk);
> void sock_release_memcg(struct sock *sk);
>
> -#define memcg_kmem_on 1
> +extern struct static_key memcg_kmem_enabled_key;
> +#define memcg_kmem_on static_key_false(&memcg_kmem_enabled_key)
> +
> bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> void __memcg_kmem_free_page(struct page *page, int order);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e9824c1..3216292 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -437,6 +437,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
> #include <net/sock.h>
> #include <net/ip.h>
>
> +struct static_key memcg_kmem_enabled_key;
> +/* so modules can inline the checks */
> +EXPORT_SYMBOL(memcg_kmem_enabled_key);
> +
> static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
> static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
> static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
> @@ -607,6 +611,16 @@ void __memcg_kmem_free_page(struct page *page, int order)
> mem_cgroup_put(memcg);
> }
> EXPORT_SYMBOL(__memcg_kmem_free_page);
> +
> +static void disarm_kmem_keys(struct mem_cgroup *memcg)
> +{
> + if (memcg->kmem_accounted)
> + static_key_slow_dec(&memcg_kmem_enabled_key);
> +}
> +#else
> +static void disarm_kmem_keys(struct mem_cgroup *memcg)
> +{
> +}
> #endif /* CONFIG_MEMCG_KMEM */
>
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> @@ -622,6 +636,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
> }
> #endif
>
> +static void disarm_static_keys(struct mem_cgroup *memcg)
> +{
> + disarm_sock_keys(memcg);
> + disarm_kmem_keys(memcg);
> +}
> +
> static void drain_all_stock_async(struct mem_cgroup *memcg);
>
> static struct mem_cgroup_per_zone *
> @@ -4147,6 +4167,24 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> }
> +
> +static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
> +{
> +#ifdef CONFIG_MEMCG_KMEM
> + /*
> + * Once enabled, can't be disabled. We could in theory disable it if we
> + * haven't yet created any caches, or if we can shrink them all to
> + * death. But it is not worth the trouble.
> + */
> + mutex_lock(&set_limit_mutex);
> + if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
> + static_key_slow_inc(&memcg_kmem_enabled_key);
> + memcg->kmem_accounted = true;
> + }
> + mutex_unlock(&set_limit_mutex);
> +#endif
> +}
> +
> /*
> * The user of this function is...
> * RES_LIMIT.
> @@ -4184,15 +4222,7 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
> ret = res_counter_set_limit(&memcg->kmem, val);
> if (ret)
> break;
> - /*
> - * Once enabled, can't be disabled. We could in theory
> - * disable it if we haven't yet created any caches, or
> - * if we can shrink them all to death.
> - *
> - * But it is not worth the trouble
> - */
> - if (!memcg->kmem_accounted && val != RESOURCE_MAX)
> - memcg->kmem_accounted = true;
> + memcg_update_kmem_limit(memcg, val);
> } else
> return -EINVAL;
> break;
> @@ -5054,7 +5084,7 @@ static void free_work(struct work_struct *work)
> * to move this code around, and make sure it is outside
> * the cgroup_lock.
> */
> - disarm_sock_keys(memcg);
> + disarm_static_keys(memcg);
> if (size < PAGE_SIZE)
> kfree(memcg);
> else
> --
> 1.7.11.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 08/11] memcg: disable kmem code when not in use.
2012-08-17 7:02 ` Michal Hocko
@ 2012-08-17 7:01 ` Glauber Costa
2012-08-17 8:04 ` Michal Hocko
0 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-17 7:01 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On 08/17/2012 11:02 AM, Michal Hocko wrote:
> On Thu 09-08-12 17:01:16, Glauber Costa wrote:
>> We can use jump labels to patch the code in or out when not used.
>>
>> Because the assignment: memcg->kmem_accounted = true is done after the
>> jump labels increment, we guarantee that the root memcg will always be
>> selected until all call sites are patched (see memcg_kmem_enabled).
>
> Not that it would be really important because kmem_accounted goes away
> in a subsequent patch but I think the wording is a bit misleading here.
> First of all there is no guarantee that kmem_accounted=true is seen
> before atomic_inc(&key->enabled) because there is no memory barrier and
> the lock serves just a leave barrier. But I do not think this is
> important at all because key->enabled is what matters here. Even if
> memcg_kmem_enabled is true we do not consider it if the key is disabled,
> right?
>
Right.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 08/11] memcg: disable kmem code when not in use.
2012-08-17 7:01 ` Glauber Costa
@ 2012-08-17 8:04 ` Michal Hocko
0 siblings, 0 replies; 135+ messages in thread
From: Michal Hocko @ 2012-08-17 8:04 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On Fri 17-08-12 11:01:06, Glauber Costa wrote:
> On 08/17/2012 11:02 AM, Michal Hocko wrote:
> > On Thu 09-08-12 17:01:16, Glauber Costa wrote:
> >> We can use jump labels to patch the code in or out when not used.
> >>
> >> Because the assignment: memcg->kmem_accounted = true is done after the
> >> jump labels increment, we guarantee that the root memcg will always be
> >> selected until all call sites are patched (see memcg_kmem_enabled).
> >
> > Not that it would be really important because kmem_accounted goes away
And I just found out it doesn't go away completely, it just transforms
from bool to unsigned long (with flags). The rest still holds...
> > in a subsequent patch but I think the wording is a bit misleading here.
> > First of all there is no guarantee that kmem_accounted=true is seen
> > before atomic_inc(&key->enabled) because there is no memory barrier and
> > the lock serves just a leave barrier. But I do not think this is
> > important at all because key->enabled is what matters here. Even if
> > memcg_kmem_enabled is true we do not consider it if the key is disabled,
> > right?
> >
>
> Right.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v2 09/11] memcg: propagate kmem limiting information to children
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (7 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 08/11] memcg: disable kmem code when not in use Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-17 9:00 ` Michal Hocko
2012-08-09 13:01 ` [PATCH v2 10/11] memcg: allow a memcg with kmem charges to be destructed Glauber Costa
2012-08-09 13:01 ` [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
10 siblings, 2 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
The current memcg slab cache management fails to present satisfactory
hierarchical behavior in the following scenario:
-> /cgroups/memory/A/B/C
* kmem limit set at A,
* A and B have no tasks,
* spawn a new task in C.
Because kmem_accounted is a boolean that was not set for C, no
accounting would be done. This is, however, not what we expect.
The basic idea is that when a cgroup is limited, we walk the tree
upwards (something Kame and I already thought about doing for other
purposes), and make sure that we store the information about the parent
being limited in kmem_accounted (that is turned into a bitmap: two
booleans would not be space efficient). The code for that is taken from
sched/core.c. My reason for not putting it into a common place is to
dodge the type issues that would arise from a common implementation
between memcg and the scheduler - but I think that it should ultimately
happen, so if you want me to do it now, let me know.
We do the reverse operation when a formerly limited cgroup becomes
unlimited.
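
The propagation can be pictured with the following stand-alone sketch. It
is an illustration under stated assumptions, not the code in this patch: it
assumes use_hierarchy is on, models the cgroup tree as a small parent-index
array, and implements the reverse operation by re-checking each descendant's
ancestors. All sketch_ names are hypothetical.

```c
#include <stdbool.h>

enum {
	SKETCH_ACCOUNTED_THIS   = 1u << 0,	/* limited by this group itself */
	SKETCH_ACCOUNTED_PARENT = 1u << 1,	/* limited by some ancestor */
};

#define SKETCH_NR_GROUPS 4

/* sketch_parent[i] is the parent of group i, or -1 for the top group. */
static const int sketch_parent[SKETCH_NR_GROUPS] = { -1, 0, 1, 0 };
static unsigned int sketch_flags[SKETCH_NR_GROUPS];

static bool sketch_is_descendant(int group, int root)
{
	for (int p = sketch_parent[group]; p != -1; p = sketch_parent[p])
		if (p == root)
			return true;
	return false;
}

static bool sketch_has_limited_ancestor(int group)
{
	for (int p = sketch_parent[group]; p != -1; p = sketch_parent[p])
		if (sketch_flags[p] & SKETCH_ACCOUNTED_THIS)
			return true;
	return false;
}

/* Set or clear the limit on one group and refresh its whole subtree. */
static void sketch_set_limited(int group, bool limited)
{
	if (limited)
		sketch_flags[group] |= SKETCH_ACCOUNTED_THIS;
	else
		sketch_flags[group] &= ~SKETCH_ACCOUNTED_THIS;

	for (int i = 0; i < SKETCH_NR_GROUPS; i++) {
		if (!sketch_is_descendant(i, group))
			continue;
		if (sketch_has_limited_ancestor(i))
			sketch_flags[i] |= SKETCH_ACCOUNTED_PARENT;
		else
			sketch_flags[i] &= ~SKETCH_ACCOUNTED_PARENT;
	}
}

/* A group is accounted if either bit is set. */
static bool sketch_should_account(int group)
{
	return sketch_flags[group] != 0;
}
```

Here group 0 plays the role of A, group 1 of B and group 2 of C, so limiting
group 0 makes sketch_should_account(2) true even though C was never limited
directly.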
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
mm/memcontrol.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 79 insertions(+), 9 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3216292..3d30b79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -295,7 +295,8 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
bool use_hierarchy;
- bool kmem_accounted;
+
+ unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
bool oom_lock;
atomic_t under_oom;
@@ -348,6 +349,38 @@ struct mem_cgroup {
#endif
};
+enum {
+ KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
+ KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
+};
+
+#ifdef CONFIG_MEMCG_KMEM
+static bool memcg_kmem_account(struct mem_cgroup *memcg)
+{
+ return !test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
+}
+
+static bool memcg_kmem_clear_account(struct mem_cgroup *memcg)
+{
+ return test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
+}
+
+static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
+{
+ return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
+}
+
+static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
+{
+ set_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted);
+}
+
+static void memcg_kmem_clear_account_parent(struct mem_cgroup *memcg)
+{
+ clear_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
- if (memcg->kmem_accounted)
+ if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
static_key_slow_dec(&memcg_kmem_enabled_key);
}
#else
@@ -4171,17 +4204,54 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
{
#ifdef CONFIG_MEMCG_KMEM
- /*
- * Once enabled, can't be disabled. We could in theory disable it if we
- * haven't yet created any caches, or if we can shrink them all to
- * death. But it is not worth the trouble.
- */
+ struct mem_cgroup *iter;
+
mutex_lock(&set_limit_mutex);
- if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
+ if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
+
+ /*
+ * Once enabled, can't be disabled. We could in theory disable
+ * it if we haven't yet created any caches, or if we can shrink
+ * them all to death. But it is not worth the trouble
+ */
static_key_slow_inc(&memcg_kmem_enabled_key);
- memcg->kmem_accounted = true;
+
+ if (!memcg->use_hierarchy)
+ goto out;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter == memcg)
+ continue;
+ memcg_kmem_account_parent(iter);
+ }
+ } else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
+
+ if (!memcg->use_hierarchy)
+ goto out;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct mem_cgroup *parent;
+
+ if (iter == memcg)
+ continue;
+ /*
+ * We should only have our parent bit cleared if none
+ * of our parents are accounted. The transversal order
+ * of our iter function forces us to always look at the
+ * parents.
+ */
+ parent = parent_mem_cgroup(iter);
+ for (; parent != memcg; parent = parent_mem_cgroup(parent))
+ if (memcg_kmem_is_accounted(parent))
+ goto noclear;
+ memcg_kmem_clear_account_parent(iter);
+noclear:
+ continue;
+ }
}
+out:
mutex_unlock(&set_limit_mutex);
+
#endif
}
--
1.7.11.2
^ permalink raw reply related [flat|nested] 135+ messages in thread
[parent not found: <1344517279-30646-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children
[not found] ` <1344517279-30646-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-10 17:51 ` Kamezawa Hiroyuki
[not found] ` <50254A0A.3080805-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Kamezawa Hiroyuki @ 2012-08-10 17:51 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, Christoph Lameter, David Rientjes, Pekka Enberg,
Pekka Enberg, Suleiman Souhlal
(2012/08/09 22:01), Glauber Costa wrote:
> The current memcg slab cache management fails to present satisfatory
> hierarchical behavior in the following scenario:
>
> -> /cgroups/memory/A/B/C
>
> * kmem limit set at A,
> * A and B have no tasks,
> * span a new task in in C.
>
> Because kmem_accounted is a boolean that was not set for C, no
> accounting would be done. This is, however, not what we expect.
>
> The basic idea, is that when a cgroup is limited, we walk the tree
> upwards (something Kame and I already thought about doing for other
> purposes), and make sure that we store the information about the parent
> being limited in kmem_accounted (that is turned into a bitmap: two
> booleans would not be space efficient). The code for that is taken from
> sched/core.c. My reasons for not putting it into a common place is to
> dodge the type issues that would arise from a common implementation
> between memcg and the scheduler - but I think that it should ultimately
> happen, so if you want me to do it now, let me know.
>
> We do the reverse operation when a formerly limited cgroup becomes
> unlimited.
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> mm/memcontrol.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 79 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3216292..3d30b79 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -295,7 +295,8 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> - bool kmem_accounted;
> +
> + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
>
> bool oom_lock;
> atomic_t under_oom;
> @@ -348,6 +349,38 @@ struct mem_cgroup {
> #endif
> };
>
> +enum {
> + KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
> + KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
> +};
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
> +{
> + return !test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
> +}
> +
> +static bool memcg_kmem_clear_account(struct mem_cgroup *memcg)
> +{
> + return test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
> +}
> +
> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
> +{
> + return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
> +}
> +
> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
> +{
> + set_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted);
> +}
> +
> +static void memcg_kmem_clear_account_parent(struct mem_cgroup *memcg)
> +{
> + clear_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted);
> +}
> +#endif /* CONFIG_MEMCG_KMEM */
> +
> /* Stuffs for move charges at task migration. */
> /*
> * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
>
> static void disarm_kmem_keys(struct mem_cgroup *memcg)
> {
> - if (memcg->kmem_accounted)
> + if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
> static_key_slow_dec(&memcg_kmem_enabled_key);
> }
> #else
> @@ -4171,17 +4204,54 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
> {
> #ifdef CONFIG_MEMCG_KMEM
> - /*
> - * Once enabled, can't be disabled. We could in theory disable it if we
> - * haven't yet created any caches, or if we can shrink them all to
> - * death. But it is not worth the trouble.
> - */
> + struct mem_cgroup *iter;
> +
> mutex_lock(&set_limit_mutex);
> - if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
> + if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
> +
> + /*
> + * Once enabled, can't be disabled. We could in theory disable
> + * it if we haven't yet created any caches, or if we can shrink
> + * them all to death. But it is not worth the trouble
> + */
> static_key_slow_inc(&memcg_kmem_enabled_key);
> - memcg->kmem_accounted = true;
> +
> + if (!memcg->use_hierarchy)
> + goto out;
> +
> + for_each_mem_cgroup_tree(iter, memcg) {
> + if (iter == memcg)
> + continue;
> + memcg_kmem_account_parent(iter);
> + }
Could you add an explanatory comment?
> + } else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
> +
> + if (!memcg->use_hierarchy)
> + goto out;
> +
ditto.
> + for_each_mem_cgroup_tree(iter, memcg) {
> + struct mem_cgroup *parent;
> +
> + if (iter == memcg)
> + continue;
> + /*
> + * We should only have our parent bit cleared if none
> + * of our parents are accounted. The transversal order
> + * of our iter function forces us to always look at the
> + * parents.
> + */
> + parent = parent_mem_cgroup(iter);
> + for (; parent != memcg; parent = parent_mem_cgroup(iter))
> + if (memcg_kmem_is_accounted(parent))
> + goto noclear;
> + memcg_kmem_clear_account_parent(iter);
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children
2012-08-09 13:01 ` [PATCH v2 09/11] memcg: propagate kmem limiting information to children Glauber Costa
[not found] ` <1344517279-30646-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-17 9:00 ` Michal Hocko
2012-08-17 9:15 ` Glauber Costa
1 sibling, 1 reply; 135+ messages in thread
From: Michal Hocko @ 2012-08-17 9:00 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On Thu 09-08-12 17:01:17, Glauber Costa wrote:
> The current memcg slab cache management fails to present satisfatory
> hierarchical behavior in the following scenario:
>
> -> /cgroups/memory/A/B/C
>
> * kmem limit set at A,
> * A and B have no tasks,
> * span a new task in in C.
>
> Because kmem_accounted is a boolean that was not set for C, no
> accounting would be done. This is, however, not what we expect.
>
> The basic idea, is that when a cgroup is limited, we walk the tree
> upwards
Isn't it rather downwards? We start at A and then mark all children, so
we go down the tree. Moreover, the walk is atomic neither wrt. parallel
charges nor wrt. new child creation. The first seems acceptable,
as the charges go to the root. The second requires cgroup_lock.
It also seems that you are missing memcg_kmem_account_parent in
mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
Some further "wording" comments below. Other than that the patch looks
correct.
> (something Kame and I already thought about doing for other
> purposes), and make sure that we store the information about the parent
> being limited in kmem_accounted (that is turned into a bitmap: two
> booleans would not be space efficient).
Two booleans wouldn't even serve the purpose, because you want to test this
atomically, right?
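A minimal userspace sketch of the point about atomicity, assuming C11
atomics as a stand-in for the kernel bit operations. The names struct
group, flag_mask(), flag_test_and_set(), flag_test_and_clear() and
flag_test() are hypothetical and not taken from the patch. Each helper
reads and updates its flag in one atomic step, so two racing callers
cannot both observe the bit as previously clear:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the flags used in the patch. */
enum { ACCOUNTED_THIS, ACCOUNTED_PARENT };

struct group {
    atomic_ulong kmem_accounted;    /* both flags live in one word */
};

static unsigned long flag_mask(int nr)
{
    assert(nr == ACCOUNTED_THIS || nr == ACCOUNTED_PARENT);
    return 1UL << nr;
}

/* Set the flag and report whether it was already set. */
static bool flag_test_and_set(struct group *g, int nr)
{
    unsigned long mask = flag_mask(nr);

    return (atomic_fetch_or(&g->kmem_accounted, mask) & mask) != 0;
}

/* Clear the flag and report whether it was set before. */
static bool flag_test_and_clear(struct group *g, int nr)
{
    unsigned long mask = flag_mask(nr);

    return (atomic_fetch_and(&g->kmem_accounted, ~mask) & mask) != 0;
}

/* Read the flag without modifying it. */
static bool flag_test(struct group *g, int nr)
{
    return (atomic_load(&g->kmem_accounted) & flag_mask(nr)) != 0;
}
```

With two plain booleans, the "was it set before?" answer and the update
would be separate steps, which is the gap the single flag word closes.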
> The code for that is taken from sched/core.c. My reasons for not
> putting it into a common place is to dodge the type issues that would
> arise from a common implementation between memcg and the scheduler -
> but I think that it should ultimately happen, so if you want me to do
> it now, let me know.
Is this really relevant for the patch?
> We do the reverse operation when a formerly limited cgroup becomes
> unlimited.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> ---
> mm/memcontrol.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 79 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3216292..3d30b79 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -295,7 +295,8 @@ struct mem_cgroup {
> * Should the accounting and control be hierarchical, per subtree?
> */
> bool use_hierarchy;
> - bool kmem_accounted;
> +
> + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
>
> bool oom_lock;
> atomic_t under_oom;
> @@ -348,6 +349,38 @@ struct mem_cgroup {
> #endif
> };
>
> +enum {
> + KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
> + KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
How can it be accounted by its parent? The charge doesn't go downwards.
Shouldn't it rather be /* a parent is accounted */?
> +};
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
memcg_kmem_set_account? It matches the _clear_ counterpart and makes it
obvious that the value is actually changed.
[...]
> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
> +{
> + return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
> +}
> +
> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
same here _set_parent
[...]
> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
>
> static void disarm_kmem_keys(struct mem_cgroup *memcg)
> {
> - if (memcg->kmem_accounted)
> + if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
memcg_kmem_is_accounted. I do not see any reason to open code this.
> static_key_slow_dec(&memcg_kmem_enabled_key);
> }
> #else
> @@ -4171,17 +4204,54 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
> static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
> {
> #ifdef CONFIG_MEMCG_KMEM
> - /*
> - * Once enabled, can't be disabled. We could in theory disable it if we
> - * haven't yet created any caches, or if we can shrink them all to
> - * death. But it is not worth the trouble.
> - */
> + struct mem_cgroup *iter;
> +
> mutex_lock(&set_limit_mutex);
> - if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
> + if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
> +
> + /*
> + * Once enabled, can't be disabled. We could in theory disable
> + * it if we haven't yet created any caches, or if we can shrink
> + * them all to death. But it is not worth the trouble
> + */
> static_key_slow_inc(&memcg_kmem_enabled_key);
> - memcg->kmem_accounted = true;
> +
> + if (!memcg->use_hierarchy)
> + goto out;
> +
> + for_each_mem_cgroup_tree(iter, memcg) {
for_each_mem_cgroup_tree does respect use_hierarchy so the above
shortcut is not necessary. Dunno but IMHO we should get rid of explicit
tests as much as possible. This doesn't look like a hot path anyway.
> + if (iter == memcg)
> + continue;
> + memcg_kmem_account_parent(iter);
> + }
> + } else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
Above you said "Once enabled, can't be disabled." and now you can
disable it? Say you are a leaf group with non-accounted parents. This
will clear the flag, so no further accounting is done. Shouldn't
unlimited mean that we will never reach the limit? Or am I missing
something?
> +
> + if (!memcg->use_hierarchy)
> + goto out;
> +
> + for_each_mem_cgroup_tree(iter, memcg) {
> + struct mem_cgroup *parent;
> +
> + if (iter == memcg)
> + continue;
> + /*
> + * We should only have our parent bit cleared if none
> + * of our parents are accounted. The transversal order
> + * of our iter function forces us to always look at the
> + * parents.
> + */
> + parent = parent_mem_cgroup(iter);
> + for (; parent != memcg; parent = parent_mem_cgroup(iter))
> + if (memcg_kmem_is_accounted(parent))
> + goto noclear;
> + memcg_kmem_clear_account_parent(iter);
Brain hurts...
Yes, we are iterating in creation order, so we cannot rely on the
first accounted memcg we encounter:
A(a) - B - D
- C (a) - E
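A minimal userspace sketch of the clearing rule for the tree drawn above,
under stated assumptions: struct node, has_accounted_ancestor() and
unlimit() are hypothetical names, the descendants are passed in as a plain
array instead of coming from the memcg iterator, and locking is left out.
The walk here also goes to the top of the tree rather than stopping at the
group being unlimited, which is an assumption about the intended semantics
and not what the quoted loop does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a cgroup node; not the kernel's struct. */
struct node {
    struct node *parent;      /* NULL at the top of the tree */
    bool accounted_this;      /* limit set on this node */
    bool accounted_parent;    /* some ancestor has a limit set */
};

/* True if any ancestor of n has its own limit set. */
static bool has_accounted_ancestor(const struct node *n)
{
    assert(n != NULL);
    /* Advance from p, not from n, or the walk never moves upward. */
    for (const struct node *p = n->parent; p != NULL; p = p->parent)
        if (p->accounted_this)
            return true;
    return false;
}

/*
 * Drop the limit on top and refresh the "ancestor is limited" flag on
 * each listed descendant. The order of desc does not matter, because
 * every node re-checks its own ancestor chain.
 */
static void unlimit(struct node *top, struct node **desc, size_t n)
{
    top->accounted_this = false;
    for (size_t i = 0; i < n; i++)
        desc[i]->accounted_parent = has_accounted_ancestor(desc[i]);
}
```

For the tree above, unlimiting A clears the parent flag on B, C and D, but
E keeps it because C is still limited.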
> +noclear:
> + continue;
> + }
> }
> +out:
> mutex_unlock(&set_limit_mutex);
> +
> #endif
> }
>
> --
> 1.7.11.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children
2012-08-17 9:00 ` Michal Hocko
@ 2012-08-17 9:15 ` Glauber Costa
2012-08-17 9:35 ` Michal Hocko
0 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-17 9:15 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On 08/17/2012 01:00 PM, Michal Hocko wrote:
> On Thu 09-08-12 17:01:17, Glauber Costa wrote:
>> The current memcg slab cache management fails to present satisfatory
>> hierarchical behavior in the following scenario:
>>
>> -> /cgroups/memory/A/B/C
>>
>> * kmem limit set at A,
>> * A and B have no tasks,
>> * span a new task in in C.
>>
>> Because kmem_accounted is a boolean that was not set for C, no
>> accounting would be done. This is, however, not what we expect.
>>
>> The basic idea, is that when a cgroup is limited, we walk the tree
>> upwards
>
> Isn't it rather downwards? We start at A and then mark all children so
> we go down the tree. Moreover the walk is not atomic wrt. parallel
> charges nor to a new child creation. First one seems to be acceptable
> as the charges go to the root. The second one requires cgroup_lock.
>
Yes, it is downwards. I already noticed that yesterday and updated it
in my tree.
As for the lock, can't we take set_limit lock in cgroup creation just
around the place that updates that field in the child? It is a lot more
fine grained - everything except the dead bkl is - and what we're
actually protecting is the limit.
If you prefer, I can use cgroup lock just fine. But then I won't sleep
at night and will probably pee my pants, which is something I haven't done
for at least two decades now.
> It also seems that you are missing memcg_kmem_account_parent in
> mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
>
You mean when we create a cgroup on top of an already limited parent?
Humm, you are very right.
> Some further "wording" comments below. Other than that the patch looks
> correct.
>
>> (something Kame and I already thought about doing for other
>> purposes), and make sure that we store the information about the parent
>> being limited in kmem_accounted (that is turned into a bitmap: two
>> booleans would not be space efficient).
>
> Two booleans even don't serve the purpose because you want to test this
> atomically, right?
>
Well, yes, we have that extra problem as well.
>> The code for that is taken from sched/core.c. My reasons for not
>> putting it into a common place is to dodge the type issues that would
>> arise from a common implementation between memcg and the scheduler -
>> but I think that it should ultimately happen, so if you want me to do
>> it now, let me know.
>
> Is this really relevant for the patch?
>
Not at all. Besides not being relevant, it is also not true, since I now
use the memcg iterator. I would prefer the tree walk to having
to cope with the order imposed by the memcg iterator, but we add
less code this way...
Again, I already modified that in yesterday's update.
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3216292..3d30b79 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -295,7 +295,8 @@ struct mem_cgroup {
>> * Should the accounting and control be hierarchical, per subtree?
>> */
>> bool use_hierarchy;
>> - bool kmem_accounted;
>> +
>> + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
>>
>> bool oom_lock;
>> atomic_t under_oom;
>> @@ -348,6 +349,38 @@ struct mem_cgroup {
>> #endif
>> };
>>
>> +enum {
>> + KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
>> + KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
>
> How it can be accounted by its parent, the charge doesn't go downwards.
> Shouldn't it rather be /* a parent is accounted */
>
indeed.
>> +};
>> +
>> +#ifdef CONFIG_MEMCG_KMEM
>> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
>
> memcg_kmem_set_account? It matches _clear_ counterpart and it makes
> obvious that the value is changed actually.
>
Ok.
> [...]
>> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
>> +{
>> + return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
>> +}
>> +
>> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
>
> same here _set_parent
>
Ok, agreed.
> [...]
>> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
>>
>> static void disarm_kmem_keys(struct mem_cgroup *memcg)
>> {
>> - if (memcg->kmem_accounted)
>> + if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
>
> memcg_kmem_is_accounted. I do not see any reason to open code this.
>
ok.
>> #ifdef CONFIG_MEMCG_KMEM
>> - /*
>> - * Once enabled, can't be disabled. We could in theory disable it if we
>> - * haven't yet created any caches, or if we can shrink them all to
>> - * death. But it is not worth the trouble.
>> - */
>> + struct mem_cgroup *iter;
>> +
>> mutex_lock(&set_limit_mutex);
>> - if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
>> + if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
>> +
>> + /*
>> + * Once enabled, can't be disabled. We could in theory disable
>> + * it if we haven't yet created any caches, or if we can shrink
>> + * them all to death. But it is not worth the trouble
>> + */
>> static_key_slow_inc(&memcg_kmem_enabled_key);
>> - memcg->kmem_accounted = true;
>> +
>> + if (!memcg->use_hierarchy)
>> + goto out;
>> +
>> + for_each_mem_cgroup_tree(iter, memcg) {
>
> for_each_mem_cgroup_tree does respect use_hierarchy so the above
> shortcut is not necessary. Dunno but IMHO we should get rid of explicit
> tests as much as possible. This doesn't look like a hot path anyway.
>
I can't remember any reason for doing so other than gaining some time.
I will remove it.
>> + if (iter == memcg)
>> + continue;
>> + memcg_kmem_account_parent(iter);
>> + }
>> + } else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
>
> Above you said "Once enabled, can't be disabled." and now you can
> disable it? Say you are a leaf group with non accounted parents. This
> will clear the flag and so no further accounting is done. Shouldn't
> unlimited mean that we will never reach the limit? Or am I missing
> something?
>
You are missing something, and maybe I should be clearer about that.
The static branches can't be disabled (it is only safe to disable them
from disarm_static_branches(), when all references are gone). Note that
when unlimited, we flip bits and do a traversal, but there is no mention
of the static branch.
The limiting can come and go at will.
>> +
>> + if (!memcg->use_hierarchy)
>> + goto out;
>> +
>> + for_each_mem_cgroup_tree(iter, memcg) {
>> + struct mem_cgroup *parent;
>> +
>> + if (iter == memcg)
>> + continue;
>> + /*
>> + * We should only have our parent bit cleared if none
>> + * of our parents are accounted. The transversal order
>> + * of our iter function forces us to always look at the
>> + * parents.
>> + */
>> + parent = parent_mem_cgroup(iter);
>> + for (; parent != memcg; parent = parent_mem_cgroup(iter))
>> + if (memcg_kmem_is_accounted(parent))
>> + goto noclear;
>> + memcg_kmem_clear_account_parent(iter);
>
> Brain hurts...
> Yes we are iterating in the creation ordering so we cannot rely on the
> first encountered accounted memcg
> A(a) - B - D
> - C (a) - E
>
>
That's why I said I preferred the iterator the scheduler uses. The
actual traversal code was much simpler, because it will stop at an
unlimited parent. But this is the only drawback I see in the memcg
iterator, so I decided that just documenting this "interesting" piece of
code well would do...
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children
2012-08-17 9:15 ` Glauber Costa
@ 2012-08-17 9:35 ` Michal Hocko
[not found] ` <20120817093504.GE18600-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 135+ messages in thread
From: Michal Hocko @ 2012-08-17 9:35 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On Fri 17-08-12 13:15:47, Glauber Costa wrote:
> On 08/17/2012 01:00 PM, Michal Hocko wrote:
> > On Thu 09-08-12 17:01:17, Glauber Costa wrote:
> >> The current memcg slab cache management fails to present satisfatory
> >> hierarchical behavior in the following scenario:
> >>
> >> -> /cgroups/memory/A/B/C
> >>
> >> * kmem limit set at A,
> >> * A and B have no tasks,
> >> * span a new task in in C.
> >>
> >> Because kmem_accounted is a boolean that was not set for C, no
> >> accounting would be done. This is, however, not what we expect.
> >>
> >> The basic idea, is that when a cgroup is limited, we walk the tree
> >> upwards
> >
> > Isn't it rather downwards? We start at A and then mark all children so
> > we go down the tree. Moreover the walk is not atomic wrt. parallel
> > charges nor to a new child creation. First one seems to be acceptable
> > as the charges go to the root. The second one requires cgroup_lock.
> >
>
> Yes, it is downwards. I've already noticed that yesterday and updated
> in my tree.
>
> As for the lock, can't we take set_limit lock in cgroup creation just
> around the place that updates that field in the child? It is a lot more
> fine grained - everything except the dead bkl is - and what we're
> actually protecting is the limit.
That should work as well. It is less obvious because we are not
considering the parent limit (maybe we should rename the lock but that
is just a detail).
> If you prefer, I can use cgroup lock just fine. But then I won't sleep
> at night and probably pee my pants, which is something I don't do for at
> least two decades now.
Heh, please no, I would feel terrible then
> > It also seems that you are missing memcg_kmem_account_parent in
> > mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
> >
>
> You mean when we create a cgroup ontop of an already limited parent?
I would prefer below but yes
A (a) - B (a, pa)
- C (new)
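A minimal userspace sketch of what the missing creation-time step could
look like for the picture above, assuming a simplified node type. The
names struct node and node_init_child() are hypothetical. Locking is left
out; in the kernel this would have to run under whichever lock serialises
it against limit updates, and only when the hierarchy is in use:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a cgroup node; not the kernel's struct. */
struct node {
    struct node *parent;
    bool accounted_this;      /* "a": limit set on this node */
    bool accounted_parent;    /* "pa": some ancestor is limited */
};

/*
 * Initialise a freshly created child. It starts with no limit of its
 * own, but inherits "pa" if the parent is limited or already carries
 * "pa" itself, so no walk up the tree is needed.
 */
static void node_init_child(struct node *child, struct node *parent)
{
    assert(child != NULL && parent != NULL);
    child->parent = parent;
    child->accounted_this = false;
    child->accounted_parent =
        parent->accounted_this || parent->accounted_parent;
}
```

Checking both flags on the parent is what lets C pick up "pa" even if only
A, and not B, had been limited.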
> Humm, you are very right.
>
> > Some further "wording" comments below. Other than that the patch looks
> > correct.
> >
> >> (something Kame and I already thought about doing for other
> >> purposes), and make sure that we store the information about the parent
> >> being limited in kmem_accounted (that is turned into a bitmap: two
> >> booleans would not be space efficient).
> >
> > Two booleans even don't serve the purpose because you want to test this
> > atomically, right?
> >
>
> Well, yes, we have that extra problem as well.
> >> The code for that is taken from sched/core.c. My reasons for not
> >> putting it into a common place is to dodge the type issues that would
> >> arise from a common implementation between memcg and the scheduler -
> >> but I think that it should ultimately happen, so if you want me to do
> >> it now, let me know.
> >
> > Is this really relevant for the patch?
> >
>
> Not at all. Besides not being relevant, it is also not true, since I now
> use the memcg iterator. I would prefer the tree walk instead of having
> to cope with the order imposed by the memcg iterator, but we add
> less code this way...
>
> Again, already modified that in my yesterday's update.
OK
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 3216292..3d30b79 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -295,7 +295,8 @@ struct mem_cgroup {
> >> * Should the accounting and control be hierarchical, per subtree?
> >> */
> >> bool use_hierarchy;
> >> - bool kmem_accounted;
> >> +
> >> + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
> >>
> >> bool oom_lock;
> >> atomic_t under_oom;
> >> @@ -348,6 +349,38 @@ struct mem_cgroup {
> >> #endif
> >> };
> >>
> >> +enum {
> >> + KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
> >> + KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
> >
> > How it can be accounted by its parent, the charge doesn't go downwards.
> > Shouldn't it rather be /* a parent is accounted */
> >
> indeed.
>
> >> +};
> >> +
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
> >
> > memcg_kmem_set_account? It matches _clear_ counterpart and it makes
> > obvious that the value is changed actually.
> >
>
> Ok.
>
> > [...]
> >> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
> >> +{
> >> + return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
> >> +}
> >> +
> >> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
> >
> > same here _set_parent
> >
>
> Ok, agreed.
Thanks
>
> > [...]
> >> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
> >>
> >> static void disarm_kmem_keys(struct mem_cgroup *memcg)
> >> {
> >> - if (memcg->kmem_accounted)
> >> + if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
> >
> > memcg_kmem_is_accounted. I do not see any reason to open code this.
> >
>
> ok.
>
> >> #ifdef CONFIG_MEMCG_KMEM
> >> - /*
> >> - * Once enabled, can't be disabled. We could in theory disable it if we
> >> - * haven't yet created any caches, or if we can shrink them all to
> >> - * death. But it is not worth the trouble.
> >> - */
> >> + struct mem_cgroup *iter;
> >> +
> >> mutex_lock(&set_limit_mutex);
> >> - if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
> >> + if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
> >> +
> >> + /*
> >> + * Once enabled, can't be disabled. We could in theory disable
> >> + * it if we haven't yet created any caches, or if we can shrink
> >> + * them all to death. But it is not worth the trouble
> >> + */
> >> static_key_slow_inc(&memcg_kmem_enabled_key);
> >> - memcg->kmem_accounted = true;
> >> +
> >> + if (!memcg->use_hierarchy)
> >> + goto out;
> >> +
> >> + for_each_mem_cgroup_tree(iter, memcg) {
> >
> > for_each_mem_cgroup_tree does respect use_hierarchy so the above
> > shortcut is not necessary. Dunno but IMHO we should get rid of explicit
> > tests as much as possible. This doesn't look like a hot path anyway.
> >
>
> I can't remember any reason for doing so other than gaining some time.
> I will remove it.
Well, it involves a bit more work, because the macro would basically expand
to a loop that does one iteration (continue) and terminates, and it would
also take and drop the reference on the group. That all seems unnecessary,
but as I said, this is not a hot path and we had better get rid of direct
checks. I am not insisting on this, so use your good taste...
>
> >> + if (iter == memcg)
> >> + continue;
> >> + memcg_kmem_account_parent(iter);
> >> + }
> >> + } else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
> >
> > Above you said "Once enabled, can't be disabled." and now you can
> > disable it? Say you are a leaf group with non accounted parents. This
> > will clear the flag and so no further accounting is done. Shouldn't
> > unlimited mean that we will never reach the limit? Or am I missing
> > something?
> >
>
> You are missing something, and maybe I should be more clear about that.
> The static branches can't be disabled (it is only safe to disable them
> from disarm_static_branches(), when all references are gone). Note that
> when unlimited, we flip bits, do a transversal, but there is no mention
> to the static branch.
My little brain still doesn't get this. I wasn't concerned about static
branches. I was worried about memcg_can_account_kmem, which will return
false now, won't it?
>
> The limiting can come and go at will.
>
> >> +
> >> + if (!memcg->use_hierarchy)
> >> + goto out;
> >> +
> >> + for_each_mem_cgroup_tree(iter, memcg) {
> >> + struct mem_cgroup *parent;
> >> +
> >> + if (iter == memcg)
> >> + continue;
> >> + /*
> >> + * We should only have our parent bit cleared if none
> >> + * of our parents are accounted. The transversal order
> >> + * of our iter function forces us to always look at the
> >> + * parents.
> >> + */
> >> + parent = parent_mem_cgroup(iter);
> >> + for (; parent != memcg; parent = parent_mem_cgroup(iter))
> >> + if (memcg_kmem_is_accounted(parent))
> >> + goto noclear;
> >> + memcg_kmem_clear_account_parent(iter);
> >
> > Brain hurts...
> > Yes we are iterating in the creation ordering so we cannot rely on the
> > first encountered accounted memcg
> > A(a) - B - D
> > - C (a) - E
>
> That's why I said I preferred the iterator the scheduler uses. The
> actual traversal code was much simpler, because it will stop at an
> unlimited parent. But this is the only drawback I see in the memcg
> iterator, so I decided that just documenting this "interesting" piece of
> code well would do...
I was just complaining that a more specific comment would be much more
helpful... The ordering might be non-trivial for those who are not
familiar with cgroup internals because id doesn't tell you much.
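
To make the ordering concern concrete, here is a minimal sketch of the
ancestor walk in plain userspace C. It is not the memcg code: struct
sk_group and sk_has_accounted_ancestor() are hypothetical names, and the
tree in the usage note below is only one reading of the diagram quoted
above (A accounted with children B and C, C accounted, D under B, E
under C). Note that the walk advances from the current ancestor, not
from iter; the quoted loop re-evaluates parent_mem_cgroup(iter) on every
step, so it would presumably need the same change in order to terminate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; only what the walk needs. */
struct sk_group {
	struct sk_group *parent;	/* NULL for the root of the tree */
	bool accounted;			/* group has its own kmem limit set */
};

/*
 * Return true if any group strictly between 'iter' and 'top' is itself
 * accounted. 'top' is expected to be an ancestor of 'iter'; the NULL
 * check only keeps the sketch from running off the root if it is not.
 */
static bool sk_has_accounted_ancestor(const struct sk_group *iter,
				      const struct sk_group *top)
{
	const struct sk_group *p;

	assert(iter != NULL && top != NULL);

	/* Step from the current ancestor, so the walk always makes progress. */
	for (p = iter->parent; p != NULL && p != top; p = p->parent)
		if (p->accounted)
			return true;

	return false;
}
```

With the tree above and A going back to unlimited, the walk returns
false for B and D (their parent bit can be cleared) and true for E
(C is still accounted, so E must keep its bit). The result depends only
on the ancestors, not on the order in which the iterator visits groups.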
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v2 10/11] memcg: allow a memcg with kmem charges to be destructed.
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (8 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 09/11] memcg: propagate kmem limiting information to children Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-11-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-09 13:01 ` [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
10 siblings, 1 reply; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
Because the ultimate goal of the kmem tracking in memcg is to track slab
pages as well, we can't guarantee that we'll always be able to point a
page to a particular process, and migrate the charges along with it -
since in the common case, a page will contain data belonging to multiple
processes.
Because of that, when we destroy a memcg, we only make sure the
destruction will succeed by discounting the kmem charges from the user
charges when we try to empty the cgroup.
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
mm/memcontrol.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3d30b79..7c1ea49 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -649,6 +649,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
{
if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
static_key_slow_dec(&memcg_kmem_enabled_key);
+ /*
+ * This check can't live in kmem destruction function,
+ * since the charges will outlive the cgroup
+ */
+ WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
}
#else
static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -4005,6 +4010,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
int node, zid, shrink;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct cgroup *cgrp = memcg->css.cgroup;
+ u64 usage;
css_get(&memcg->css);
@@ -4038,8 +4044,17 @@ move_account:
mem_cgroup_end_move(memcg);
memcg_oom_recover(memcg);
cond_resched();
+ /*
+ * Kernel memory may not necessarily be trackable to a specific
+ * process. So they are not migrated, and therefore we can't
+ * expect their value to drop to 0 here.
+ *
+ * having res filled up with kmem only is enough
+ */
+ usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
+ res_counter_read_u64(&memcg->kmem, RES_USAGE);
/* "ret" should also be checked to ensure all lists are empty. */
- } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
+ } while (usage > 0 || ret);
out:
css_put(&memcg->css);
return ret;
--
1.7.11.2
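
The exit condition this patch introduces can be sketched outside the
kernel as follows. This is only an illustration under stated
assumptions: struct sk_usage and both helpers are hypothetical names,
the two fields stand in for the RES_USAGE values of memcg->res and
memcg->kmem, and the clamp to zero is an addition of the sketch (the
patch itself does a plain unsigned subtraction, relying on kmem charges
also being charged to res).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the two counters of one memcg, in bytes. */
struct sk_usage {
	uint64_t res;	/* user counter; kmem charges are included here */
	uint64_t kmem;	/* kernel memory charged to the group */
};

/* Usage that force_empty can still hope to reclaim or move away. */
static uint64_t sk_user_only_usage(const struct sk_usage *u)
{
	assert(u != NULL);
	/* Assumption of this sketch: clamp instead of wrapping around. */
	return u->res > u->kmem ? u->res - u->kmem : 0;
}

/* Mirrors the loop condition "usage > 0 || ret" from the patch. */
static bool sk_must_keep_emptying(const struct sk_usage *u, int ret)
{
	return sk_user_only_usage(u) > 0 || ret != 0;
}
```

A group whose res usage consists of kmem charges only (res == kmem) is
treated as empty, which is what lets the destruction proceed even
though the kmem charges themselves are not migrated.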
^ permalink raw reply related [flat|nested] 135+ messages in thread
* [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
[not found] ` <1344517279-30646-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (9 preceding siblings ...)
2012-08-09 13:01 ` [PATCH v2 10/11] memcg: allow a memcg with kmem charges to be destructed Glauber Costa
@ 2012-08-09 13:01 ` Glauber Costa
[not found] ` <1344517279-30646-12-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-21 9:35 ` Michal Hocko
10 siblings, 2 replies; 135+ messages in thread
From: Glauber Costa @ 2012-08-09 13:01 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
Christoph Lameter, David Rientjes, Pekka Enberg, Glauber Costa,
Pekka Enberg, Suleiman Souhlal
Because those architectures will draw their stacks directly from the
page allocator, rather than the slab cache, we can directly pass
__GFP_KMEMCG flag, and issue the corresponding free_pages.
This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
architectures fall in this category.
This will guarantee that every stack page is accounted to the memcg the
process currently lives in, and that the allocations fail if they go
over the limit.
For the time being, I am defining a new variant of THREADINFO_GFP, not
to mess with the other path. Once the slab is also tracked by memcg, we
can get rid of that flag.
Tested to successfully protect against :(){ :|:& };:
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Acked-by: Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
include/linux/thread_info.h | 2 ++
kernel/fork.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ccc1899..e7e0473 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
# define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
#endif
+#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
+
/*
* flag set/clear/test wrappers
* - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index dc3ff16..b0b90c3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) { }
static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
int node)
{
- struct page *page = alloc_pages_node(node, THREADINFO_GFP,
+ struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
THREAD_SIZE_ORDER);
return page ? page_address(page) : NULL;
@@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
static inline void free_thread_info(struct thread_info *ti)
{
arch_release_thread_info(ti);
- free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+ free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
}
# else
static struct kmem_cache *thread_info_cache;
--
1.7.11.2
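
The pairing this patch relies on (an allocation flagged for accounting
is charged, and the matching free uncharges it) can be modelled in a
few lines of userspace C. Everything here is a sketch under stated
assumptions: the SK_ prefixed flag values are placeholders rather than
the kernel's real bit assignments, struct sk_memcg is a hypothetical
page counter, and no memory is actually allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int sk_gfp_t;

/* Placeholder bit values chosen for this sketch only. */
#define SK_GFP_KERNEL	0x01u
#define SK_GFP_NOTRACK	0x02u
#define SK_GFP_KMEMCG	0x04u

#define SK_THREADINFO_GFP		(SK_GFP_KERNEL | SK_GFP_NOTRACK)
#define SK_THREADINFO_GFP_ACCOUNTED	(SK_THREADINFO_GFP | SK_GFP_KMEMCG)

/* Hypothetical per-group counter, in pages. */
struct sk_memcg {
	unsigned long usage;
	unsigned long limit;
};

/* Returns true if a stack of 2^order pages may be allocated. */
static bool sk_alloc_stack(struct sk_memcg *cg, sk_gfp_t flags,
			   unsigned int order)
{
	unsigned long nr = 1UL << order;

	assert(cg != NULL);
	if (!(flags & SK_GFP_KMEMCG))
		return true;		/* unaccounted path: never charged */
	if (cg->usage + nr > cg->limit)
		return false;		/* over limit: the fork would fail */
	cg->usage += nr;
	return true;
}

/* Counterpart of an accounted allocation: gives the pages back. */
static void sk_free_accounted_stack(struct sk_memcg *cg, unsigned int order)
{
	unsigned long nr = 1UL << order;

	assert(cg != NULL && cg->usage >= nr);
	cg->usage -= nr;
}
```

With a limit of four pages and order-1 stacks, two accounted
allocations succeed and the third is refused, which is the behaviour
that bounds a fork bomb inside the group. An allocation made without
the accounting flag never touches the counter, which is why the free
side has to use the accounted variant only for accounted pages.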
^ permalink raw reply related [flat|nested] 135+ messages in thread
[parent not found: <1344517279-30646-12-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
[not found] ` <1344517279-30646-12-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-10 17:54 ` Kamezawa Hiroyuki
0 siblings, 0 replies; 135+ messages in thread
From: Kamezawa Hiroyuki @ 2012-08-10 17:54 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
devel-GEFAQzZX7r8dnm+yROfE0A, Michal Hocko, Johannes Weiner,
Andrew Morton, Christoph Lameter, David Rientjes, Pekka Enberg,
Pekka Enberg, Suleiman Souhlal
(2012/08/09 22:01), Glauber Costa wrote:
> Because those architectures will draw their stacks directly from the
> page allocator, rather than the slab cache, we can directly pass
> __GFP_KMEMCG flag, and issue the corresponding free_pages.
>
> This code path is taken when the architecture doesn't define
> CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
> architectures fall in this category.
>
> This will guarantee that every stack page is accounted to the memcg the
> process currently lives in, and that the allocations fail if they go
> over the limit.
>
> For the time being, I am defining a new variant of THREADINFO_GFP, not
> to mess with the other path. Once the slab is also tracked by memcg, we
> can get rid of that flag.
>
> Tested to successfully protect against :(){ :|:& };:
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Acked-by: Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> ---
> include/linux/thread_info.h | 2 ++
> kernel/fork.c | 4 ++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index ccc1899..e7e0473 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
> # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
> #endif
>
> +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
> +
> /*
> * flag set/clear/test wrappers
> * - pass TIF_xxxx constants to these functions
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dc3ff16..b0b90c3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) { }
> static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
> int node)
> {
> - struct page *page = alloc_pages_node(node, THREADINFO_GFP,
> + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
> THREAD_SIZE_ORDER);
>
> return page ? page_address(page) : NULL;
> @@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
> static inline void free_thread_info(struct thread_info *ti)
> {
> arch_release_thread_info(ti);
> - free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> + free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> }
> # else
> static struct kmem_cache *thread_info_cache;
>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
2012-08-09 13:01 ` [PATCH v2 11/11] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
[not found] ` <1344517279-30646-12-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-21 9:35 ` Michal Hocko
[not found] ` <20120821093513.GD19797-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
1 sibling, 1 reply; 135+ messages in thread
From: Michal Hocko @ 2012-08-21 9:35 UTC (permalink / raw)
To: Glauber Costa
Cc: linux-kernel, linux-mm, cgroups, devel, Johannes Weiner,
Andrew Morton, kamezawa.hiroyu, Christoph Lameter, David Rientjes,
Pekka Enberg, Pekka Enberg, Suleiman Souhlal
On Thu 09-08-12 17:01:19, Glauber Costa wrote:
> Because those architectures will draw their stacks directly from the
> page allocator, rather than the slab cache, we can directly pass
> __GFP_KMEMCG flag, and issue the corresponding free_pages.
>
> This code path is taken when the architecture doesn't define
> CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
> THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
> architectures fall in this category.
quick git grep "define *THREAD_SIZE\>" arch says that there is no such
architecture.
> This will guarantee that every stack page is accounted to the memcg the
> process currently lives in, and that the allocations fail if they go
> over the limit.
>
> For the time being, I am defining a new variant of THREADINFO_GFP, not
> to mess with the other path. Once the slab is also tracked by memcg, we
> can get rid of that flag.
>
> Tested to successfully protect against :(){ :|:& };:
I guess there were no other tasks in the same group (except for the
parent shell), right? I am asking because this should trigger memcg-oom
but that one will usually pick up something else than the fork bomb
which would have a small memory footprint. But that needs to be handled
on the oom level obviously.
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
> ---
> include/linux/thread_info.h | 2 ++
> kernel/fork.c | 4 ++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index ccc1899..e7e0473 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm);
> # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
> #endif
>
> +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
> +
> /*
> * flag set/clear/test wrappers
> * - pass TIF_xxxx constants to these functions
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dc3ff16..b0b90c3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) { }
> static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
> int node)
> {
> - struct page *page = alloc_pages_node(node, THREADINFO_GFP,
> + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
> THREAD_SIZE_ORDER);
>
> return page ? page_address(page) : NULL;
> @@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
> static inline void free_thread_info(struct thread_info *ti)
> {
> arch_release_thread_info(ti);
> - free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> + free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> }
> # else
> static struct kmem_cache *thread_info_cache;
> --
> 1.7.11.2
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 135+ messages in thread