* [PATCH v3 00/13] kmem controller for memcg.

From: Glauber Costa @ 2012-09-18 14:03 UTC
To: linux-kernel
Cc: cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal,
    Frederic Weisbecker, Mel Gorman, David Rientjes

Hi,

This is the first part of the kernel memory controller for memcg. It has
been discussed many times, and I consider it stable enough to be on tree.
A follow-up to this series is the set of patches that also track slab
memory. They are not included here because I believe we can benefit from
merging them separately, for better testing coverage. If there are any
issues preventing this from being merged, let me know. I'll be happy to
address them.

*v3:
- Changed function names to match memcg's
- avoid doing get/put in charge/uncharge path
- revert back to keeping the accounting enabled after it is first activated

The slab patches are also mature in my own evaluation and could be merged
not long after this. For reference, the last discussion about them
happened at http://lwn.net/Articles/508087/. Patches for that will be
sent shortly, and will include the documentation for this.

Numbers can be found at https://lkml.org/lkml/2012/9/13/239

A (throwaway) git tree with them is placed at:

  git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg.git kmemcg-stack

A general explanation of what this is all about follows:

The kernel memory limitation mechanism for memcg concerns itself with
preventing potentially non-reclaimable allocations from happening in
exaggerated quantities by a particular set of processes (cgroup). Those
allocations could create pressure that affects the behavior of a
different and unrelated set of processes.

Its basic working mechanism is to annotate some allocations with the
__GFP_KMEMCG flag. When this flag is set, the current allocating process
will have its memcg identified and charged. When a specific limit is
reached, further allocations will be denied.

One example of problematic pressure that can be prevented by this work is
a fork bomb conducted in a shell. We prevent it by noting that processes
use a limited amount of stack pages. Seen this way, a fork bomb is just a
special case of resource abuse. If the offender is unable to grab more
pages for the stack, no new processes can be created.

There are also other things the general mechanism protects against. For
example, using too much of pinned dentry and inode cache, by touching
files and leaving them in memory forever. In fact, a simple:

  while true; do mkdir x; cd x; done

can halt your system easily, because the file system limits are hard to
reach (big disks), but the kernel memory is not. Those are examples, but
the list certainly doesn't stop here.

An important use case for all of this is people offering hosting services
through containers. On a physical box we can put a limit on some
resources, like the total number of processes or threads. But in an
environment where each independent user gets its own piece of the
machine, we don't want a potentially malicious user to destroy good
users' services. This might be true for systemd as well, which now groups
services inside cgroups. They generally want to put forward a set of
guarantees that limits the running services in a variety of ways, so that
if a service becomes badly behaved, it won't interfere with the rest of
the system.

There is, of course, a cost for that. To mitigate it, static branches are
used to make sure that, even if the feature is compiled in with
potentially a lot of memory cgroups deployed, this code will only be
enabled after the first user of this service configures any limit.

Limits lower than the user limit effectively mean there is a separate
kernel memory limit that may be reached independently of the user limit.
Values equal to or greater than the user limit imply only that kernel
memory is tracked. This provides a unified vision of "maximum memory", be
it kernel or user memory. Because this is all default-off, existing
deployments will see no change in behavior.

Glauber Costa (11):
  memcg: change defines to an enum
  kmem accounting basic infrastructure
  Add a __GFP_KMEMCG flag
  memcg: kmem controller infrastructure
  mm: Allocate kernel pages to the right memcg
  res_counter: return amount of charges after res_counter_uncharge
  memcg: kmem accounting lifecycle management
  memcg: use static branches when code not in use
  memcg: allow a memcg with kmem charges to be destructed
  execute the whole memcg freeing in rcu callback
  protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs

Suleiman Souhlal (2):
  memcg: Make it possible to use the stock for more than one page.
  memcg: Reclaim when more than one page needed.

 Documentation/cgroups/resource_counter.txt |   7 +-
 include/linux/gfp.h                        |  10 +-
 include/linux/memcontrol.h                 |  99 +++++
 include/linux/res_counter.h                |  12 +-
 include/linux/thread_info.h                |   2 +
 kernel/fork.c                              |   4 +-
 kernel/res_counter.c                       |  20 +-
 mm/memcontrol.c                            | 519 ++++++++++++++++++++++++----
 mm/page_alloc.c                            |  35 ++
 9 files changed, 628 insertions(+), 80 deletions(-)

-- 
1.7.11.4
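To make the flag concrete: under the scheme described above, an allocation
site opts into kmem accounting by adding __GFP_KMEMCG to its gfp mask, and
the page allocator then charges the allocating task's memcg. The sketch
below mirrors what the fork-bomb patch in this series is expected to do
for kernel stacks; the THREADINFO_GFP define and the call site are
illustrative, not a literal hunk from the series.

	/*
	 * Illustrative only -- not a hunk from this series.  An allocation
	 * site opts into kmem accounting via __GFP_KMEMCG in its gfp mask.
	 */
	#define THREADINFO_GFP	(GFP_KERNEL | __GFP_KMEMCG)

	static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
							  int node)
	{
		/* charged to current's memcg; fails once the kmem limit is hit */
		struct page *page = alloc_pages_node(node, THREADINFO_GFP,
						     THREAD_SIZE_ORDER);

		return page ? page_address(page) : NULL;
	}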
* [PATCH v3 01/13] memcg: Make it possible to use the stock for more than one page.

From: Glauber Costa @ 2012-09-18 14:03 UTC
To: linux-kernel
Cc: cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal,
    Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa

From: Suleiman Souhlal <ssouhlal@FreeBSD.org>

We currently have a percpu stock cache scheme that charges one page at a
time from memcg->res, the user counter. When the kernel memory controller
comes into play, we'll need to charge more than that.

This is because kernel memory allocations will also draw from the user
counter, and can be bigger than a single page, as is the case with the
stack (usually 2 pages) or some higher order slabs.

[ glommer@parallels.com: added a changelog ]

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 795e525..9d3bc72 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2034,20 +2034,28 @@ struct memcg_stock_pcp {
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
-/*
- * Try to consume stocked charge on this cpu. If success, one page is consumed
- * from local stock and true is returned. If the stock is 0 or charges from a
- * cgroup which is not current target, returns false. This stock will be
- * refilled.
+/**
+ * consume_stock: Try to consume stocked charge on this cpu.
+ * @memcg: memcg to consume from.
+ * @nr_pages: how many pages to charge.
+ *
+ * The charges will only happen if @memcg matches the current cpu's memcg
+ * stock, and at least @nr_pages are available in that stock. Failure to
+ * service an allocation will refill the stock.
+ *
+ * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg)
+static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
 	bool ret = true;
 
+	if (nr_pages > CHARGE_BATCH)
+		return false;
+
 	stock = &get_cpu_var(memcg_stock);
-	if (memcg == stock->cached && stock->nr_pages)
-		stock->nr_pages--;
+	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+		stock->nr_pages -= nr_pages;
 	else /* need to call res_counter_charge */
 		ret = false;
 	put_cpu_var(memcg_stock);
@@ -2346,7 +2354,7 @@ again:
 		VM_BUG_ON(css_is_removed(&memcg->css));
 		if (mem_cgroup_is_root(memcg))
 			goto done;
-		if (nr_pages == 1 && consume_stock(memcg))
+		if (consume_stock(memcg, nr_pages))
 			goto done;
 		css_get(&memcg->css);
 	} else {
@@ -2371,7 +2379,7 @@ again:
 			rcu_read_unlock();
 			goto done;
 		}
-		if (nr_pages == 1 && consume_stock(memcg)) {
+		if (consume_stock(memcg, nr_pages)) {
 			/*
 			 * It seems dangerous to access memcg without css_get().
 			 * But considering how consume_stock works, it's not
-- 
1.7.11.4
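For readers who don't want to page through memcontrol.c, the following is
a condensed model of the charge fast path this patch enables. It is a
sketch, not kernel code: charge_slow_path() is a hypothetical stand-in
for the res_counter-based slow path, and locking and css refcounting are
elided.

	/* Condensed model of the charge fast path after this patch. */
	static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
			      unsigned int nr_pages)
	{
		/*
		 * Multi-page charges (stacks, higher-order slabs) can now
		 * be served from the per-cpu stock, not only nr_pages == 1.
		 */
		if (consume_stock(memcg, nr_pages))
			return 0;

		/* charge a full batch so the leftover refills the stock */
		return charge_slow_path(memcg, gfp_mask,
					max_t(unsigned int, nr_pages,
					      CHARGE_BATCH));
	}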
* Re: [PATCH v3 01/13] memcg: Make it possible to use the stock for more than one page.

From: Johannes Weiner @ 2012-10-01 18:48 UTC
To: Glauber Costa
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes

On Tue, Sep 18, 2012 at 06:03:58PM +0400, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> We currently have a percpu stock cache scheme that charges one page at a
> time from memcg->res, the user counter. When the kernel memory
> controller comes into play, we'll need to charge more than that.
>
> This is because kernel memory allocations will also draw from the user
> counter, and can be bigger than a single page, as is the case with
> the stack (usually 2 pages) or some higher order slabs.
>
> [ glommer@parallels.com: added a changelog ]
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Independent of how the per-subtree enable-through-setting-limit
discussion pans out, we're going to need the charge cache, so:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH v3 02/13] memcg: Reclaim when more than one page needed.

From: Glauber Costa @ 2012-09-18 14:03 UTC
To: linux-kernel
Cc: cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal,
    Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa

From: Suleiman Souhlal <ssouhlal@FreeBSD.org>

mem_cgroup_do_charge() was written before kmem accounting, and expects
three cases: being called for 1 page, being called for a stock of 32
pages, or being called for a hugepage. If we call it for 2 or 3 pages
(and both the stack and several slabs used in process creation are such,
at least with the debug options I had), it assumes it's being called for
stock and just retries without reclaiming.

Fix that by passing down a minsize argument in addition to the csize.

And what to do about that (csize == PAGE_SIZE && ret) retry? If it's
needed at all (and presumably it is, since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE. Yet how far?
And should there be a retry count limit, and of what? For now, retry up
to COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
__GFP_NORETRY.

[v4: fixed nr pages calculation pointed out by Christoph Lameter ]

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9d3bc72..b12121b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2232,7 +2232,8 @@ enum {
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, bool oom_check)
+				unsigned int nr_pages, unsigned int min_pages,
+				bool oom_check)
 {
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
@@ -2255,18 +2256,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 	/*
-	 * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
-	 * of regular pages (CHARGE_BATCH), or a single regular page (1).
-	 *
 	 * Never reclaim on behalf of optional batching, retry with a
 	 * single page instead.
 	 */
-	if (nr_pages == CHARGE_BATCH)
+	if (nr_pages > min_pages)
 		return CHARGE_RETRY;
 
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
+	if (gfp_mask & __GFP_NORETRY)
+		return CHARGE_NOMEM;
+
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
@@ -2279,7 +2280,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * unlikely to succeed so close to the limit, and we fall back
 	 * to regular pages anyway in case of failure.
 	 */
-	if (nr_pages == 1 && ret)
+	if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
 		return CHARGE_RETRY;
 
 	/*
@@ -2414,7 +2415,8 @@ again:
 		nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	}
 
-	ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+	ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
+				   oom_check);
 	switch (ret) {
 	case CHARGE_OK:
 		break;
-- 
1.7.11.4
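Pulling the hunks together, the post-patch retry policy of
mem_cgroup_do_charge() reads roughly as below. This is a condensed
restatement of the diff for readability, with the surrounding plumbing
elided and the charge_policy name invented for the sketch; the helper
calls are the ones visible in the diff above.

	static int charge_policy(struct mem_cgroup *mem_over_limit,
				 gfp_t gfp_mask, unsigned int nr_pages,
				 unsigned int min_pages, unsigned long flags)
	{
		unsigned long ret;

		/* Optional batching only: retry with the real size, no reclaim. */
		if (nr_pages > min_pages)
			return CHARGE_RETRY;

		/* Atomic context: we cannot sleep, so we cannot reclaim. */
		if (!(gfp_mask & __GFP_WAIT))
			return CHARGE_WOULDBLOCK;

		/* New with this patch: honor callers that opted out of retries. */
		if (gfp_mask & __GFP_NORETRY)
			return CHARGE_NOMEM;

		ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
		if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
			return CHARGE_RETRY;	/* reclaim opened enough room */

		/*
		 * Up to PAGE_ALLOC_COSTLY_ORDER (as page_alloc.c does),
		 * retry even if reclaim produced no visible margin.
		 */
		if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
			return CHARGE_RETRY;

		return CHARGE_NOMEM;
	}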
* Re: [PATCH v3 02/13] memcg: Reclaim when more than one page needed.

From: Johannes Weiner @ 2012-10-01 19:00 UTC
To: Glauber Costa
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes

On Tue, Sep 18, 2012 at 06:03:59PM +0400, Glauber Costa wrote:
> From: Suleiman Souhlal <ssouhlal@FreeBSD.org>
>
> mem_cgroup_do_charge() was written before kmem accounting, and expects
> three cases: being called for 1 page, being called for a stock of 32
> pages, or being called for a hugepage. If we call it for 2 or 3 pages
> (and both the stack and several slabs used in process creation are such,
> at least with the debug options I had), it assumes it's being called for
> stock and just retries without reclaiming.
>
> Fix that by passing down a minsize argument in addition to the csize.
>
> And what to do about that (csize == PAGE_SIZE && ret) retry? If it's

Wow, that patch set has been around for a while. It's been
nr_pages == 1 for a while now :-)

> needed at all (and presumably it is, since it's there, perhaps to handle
> races), then it should be extended to more than PAGE_SIZE. Yet how far?
> And should there be a retry count limit, and of what? For now, retry up
> to COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
> __GFP_NORETRY.
>
> [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
[...]
>  static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -				unsigned int nr_pages, bool oom_check)
> +				unsigned int nr_pages, unsigned int min_pages,
> +				bool oom_check)

I'm not a big fan of the parameter names. Can we make this function
officially aware of batching and name the parameters like the arguments
that are passed in? I.e. @batch and @nr_pages?

[...]
>  	/*
> -	 * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
> -	 * of regular pages (CHARGE_BATCH), or a single regular page (1).
> -	 *
>  	 * Never reclaim on behalf of optional batching, retry with a
>  	 * single page instead.

"[...] with the amount of actually required pages instead."

>  	 */
> -	if (nr_pages == CHARGE_BATCH)
> +	if (nr_pages > min_pages)
>  		return CHARGE_RETRY;

	if (batch > nr_pages)
		return CHARGE_RETRY;

But that is all just nitpicking. Functionally, it looks sane, so:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH v3 03/13] memcg: change defines to an enum

From: Glauber Costa @ 2012-09-18 14:04 UTC
To: linux-kernel
Cc: cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal,
    Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa,
    Johannes Weiner

This is just a cleanup patch for clarity of expression. In earlier
submissions, people asked for it to be in a separate patch, so here it is.

[ v2: use named enum as type throughout the file as well ]

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b12121b..d6ad138 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -385,9 +385,12 @@ enum charge_type {
 };
 
 /* for encoding cft->private value on file */
-#define _MEM			(0)
-#define _MEMSWAP		(1)
-#define _OOM_TYPE		(2)
+enum res_type {
+	_MEM,
+	_MEMSWAP,
+	_OOM_TYPE,
+};
+
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
 #define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
@@ -3921,7 +3924,8 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 	char str[64];
 	u64 val;
-	int type, name, len;
+	int name, len;
+	enum res_type type;
 
 	type = MEMFILE_TYPE(cft->private);
 	name = MEMFILE_ATTR(cft->private);
@@ -3957,7 +3961,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			    const char *buffer)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-	int type, name;
+	enum res_type type;
+	int name;
 	unsigned long long val;
 	int ret;
 
@@ -4033,7 +4038,8 @@ out:
 static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-	int type, name;
+	int name;
+	enum res_type type;
 
 	type = MEMFILE_TYPE(event);
 	name = MEMFILE_ATTR(event);
@@ -4369,7 +4375,7 @@ static int mem_cgroup_usage_register_event(struct cgroup *cgrp,
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	struct mem_cgroup_thresholds *thresholds;
 	struct mem_cgroup_threshold_ary *new;
-	int type = MEMFILE_TYPE(cft->private);
+	enum res_type type = MEMFILE_TYPE(cft->private);
 	u64 threshold, usage;
 	int i, size, ret;
 
@@ -4452,7 +4458,7 @@ static void mem_cgroup_usage_unregister_event(struct cgroup *cgrp,
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	struct mem_cgroup_thresholds *thresholds;
 	struct mem_cgroup_threshold_ary *new;
-	int type = MEMFILE_TYPE(cft->private);
+	enum res_type type = MEMFILE_TYPE(cft->private);
 	u64 usage;
 	int i, j, size;
 
@@ -4530,7 +4536,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	struct mem_cgroup_eventfd_list *event;
-	int type = MEMFILE_TYPE(cft->private);
+	enum res_type type = MEMFILE_TYPE(cft->private);
 
 	BUG_ON(type != _OOM_TYPE);
 	event = kmalloc(sizeof(*event), GFP_KERNEL);
@@ -4555,7 +4561,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 	struct mem_cgroup_eventfd_list *ev, *tmp;
-	int type = MEMFILE_TYPE(cft->private);
+	enum res_type type = MEMFILE_TYPE(cft->private);
 
 	BUG_ON(type != _OOM_TYPE);
 
-- 
1.7.11.4
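As a worked example of the encoding these enum values feed into: the
three MEMFILE_* macros below are copied from the diff, while the decode
lines are the only addition. RES_LIMIT comes from the res_counter member
enum in include/linux/res_counter.h. The point of the patch is that
`type` can now be declared as enum res_type instead of a bare int.

	#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
	#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
	#define MEMFILE_ATTR(val)	((val) & 0xffff)

	enum res_type { _MEM, _MEMSWAP, _OOM_TYPE };

	/* cft->private for memory.memsw.limit_in_bytes packs both halves: */
	int priv = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT);

	enum res_type type = MEMFILE_TYPE(priv);	/* == _MEMSWAP */
	int name = MEMFILE_ATTR(priv);			/* == RES_LIMIT */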
* Re: [PATCH v3 03/13] memcg: change defines to an enum

From: Johannes Weiner @ 2012-10-01 19:06 UTC
To: Glauber Costa
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes

On Tue, Sep 18, 2012 at 06:04:00PM +0400, Glauber Costa wrote:
> This is just a cleanup patch for clarity of expression. In earlier
> submissions, people asked for it to be in a separate patch, so here it is.
>
> [ v2: use named enum as type throughout the file as well ]
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Should probably be the first in the series to get the cleanups out of
the way :-)

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH v3 03/13] memcg: change defines to an enum

From: Glauber Costa @ 2012-10-02 9:10 UTC
To: Johannes Weiner
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes

On 10/01/2012 11:06 PM, Johannes Weiner wrote:
> On Tue, Sep 18, 2012 at 06:04:00PM +0400, Glauber Costa wrote:
[...]
> Should probably be the first in the series to get the cleanups out of
> the way :-)
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

If you guys want to merge this separately, be my guest =)
* [PATCH v3 04/13] kmem accounting basic infrastructure

From: Glauber Costa @ 2012-09-18 14:04 UTC
To: linux-kernel
Cc: cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal,
    Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa,
    Michal Hocko, Johannes Weiner

This patch adds the basic infrastructure for the accounting of the slab
caches. To control that, the following files are created:

 * memory.kmem.usage_in_bytes
 * memory.kmem.limit_in_bytes
 * memory.kmem.failcnt
 * memory.kmem.max_usage_in_bytes

They have the same meaning as their user memory counterparts. They
reflect the state of the "kmem" res_counter.

The code is not enabled until a limit is set. This can be tested by the
flag "kmem_accounted". This means that after the patch is applied, no
behavioral changes exist for whoever is still using memcg to control
their memory usage.

We always account to both user and kernel resource_counters. This
effectively means that an independent kernel limit is in place when the
limit is set to a lower value than the user memory limit. An equal or
higher value means that the user limit will always hit first, meaning
that kmem is effectively unlimited.

People who want to track kernel memory but not limit it can set this
limit to a very high number (like RESOURCE_MAX - 1 page - that no one
will ever hit, or equal to the user memory limit).

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6ad138..f3fd354 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -265,6 +265,10 @@ struct mem_cgroup {
 	};
 
 	/*
+	 * the counter to account for kernel memory usage.
+	 */
+	struct res_counter kmem;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -279,6 +283,7 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
+	bool kmem_accounted;
 
 	bool oom_lock;
 	atomic_t under_oom;
@@ -389,6 +394,7 @@ enum res_type {
 	_MEM,
 	_MEMSWAP,
 	_OOM_TYPE,
+	_KMEM,
 };
 
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
@@ -1439,6 +1445,10 @@ done:
 		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
 		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+	printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
+		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
+		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
+		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
 }
 
 /*
@@ -3946,6 +3956,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 		else
 			val = res_counter_read_u64(&memcg->memsw, name);
 		break;
+	case _KMEM:
+		val = res_counter_read_u64(&memcg->kmem, name);
+		break;
 	default:
 		BUG();
 	}
@@ -3984,8 +3997,18 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			break;
 		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
-		else
+		else if (type == _MEMSWAP)
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
+		else if (type == _KMEM) {
+			ret = res_counter_set_limit(&memcg->kmem, val);
+			if (ret)
+				break;
+
+			/* For simplicity, we won't allow this to be disabled */
+			if (!memcg->kmem_accounted && val != RESOURCE_MAX)
+				memcg->kmem_accounted = true;
+		} else
+			return -EINVAL;
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -4051,12 +4074,16 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	case RES_MAX_USAGE:
 		if (type == _MEM)
 			res_counter_reset_max(&memcg->res);
+		else if (type == _KMEM)
+			res_counter_reset_max(&memcg->kmem);
 		else
 			res_counter_reset_max(&memcg->memsw);
 		break;
 	case RES_FAILCNT:
 		if (type == _MEM)
 			res_counter_reset_failcnt(&memcg->res);
+		else if (type == _KMEM)
+			res_counter_reset_failcnt(&memcg->kmem);
 		else
 			res_counter_reset_failcnt(&memcg->memsw);
 		break;
@@ -4618,6 +4645,33 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+static struct cftype kmem_cgroup_files[] = {
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.failcnt",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+		.trigger = mem_cgroup_reset,
+		.read = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+		.trigger = mem_cgroup_reset,
+		.read = mem_cgroup_read,
+	},
+	{},
+};
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	return mem_cgroup_sockets_init(memcg, ss);
@@ -4961,6 +5015,12 @@ mem_cgroup_create(struct cgroup *cont)
 		int cpu;
 		enable_swap_cgroup();
 		parent = NULL;
+
+#ifdef CONFIG_MEMCG_KMEM
+		WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
+					   kmem_cgroup_files));
+#endif
+
 		if (mem_cgroup_soft_limit_tree_init())
 			goto free_out;
 		root_mem_cgroup = memcg;
@@ -4979,6 +5039,7 @@ mem_cgroup_create(struct cgroup *cont)
 	if (parent && parent->use_hierarchy) {
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
+		res_counter_init(&memcg->kmem, &parent->kmem);
 		/*
 		 * We increment refcnt of the parent to ensure that we can
 		 * safely access it on res_counter_charge/uncharge.
 		 */
@@ -4989,6 +5050,7 @@ mem_cgroup_create(struct cgroup *cont)
 	} else {
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
+		res_counter_init(&memcg->kmem, NULL);
 	}
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
-- 
1.7.11.4
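From userspace, the new files are used like any other memcg knob. The
following minimal demo assumes a cgroup v1 memory hierarchy mounted at
/sys/fs/cgroup/memory and a pre-created group named "box"; both are
assumptions of the sketch, not something the patch mandates. Writing a
limit is what flips kmem_accounted on for the group.

	#include <stdio.h>
	#include <stdlib.h>

	/* hypothetical group path */
	#define BOX "/sys/fs/cgroup/memory/box"

	static void write_file(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			exit(1);
		}
		fputs(val, f);
		fclose(f);
	}

	int main(void)
	{
		char buf[64];
		FILE *f;

		/*
		 * Setting any limit other than RESOURCE_MAX enables kmem
		 * accounting; per the patch, it cannot be disabled again.
		 */
		write_file(BOX "/memory.kmem.limit_in_bytes", "67108864");

		f = fopen(BOX "/memory.kmem.usage_in_bytes", "r");
		if (f && fgets(buf, sizeof(buf), f))
			printf("kmem usage: %s", buf);
		if (f)
			fclose(f);
		return 0;
	}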
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Tejun Heo @ 2012-09-21 16:34 UTC
To: Glauber Costa
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
    Michal Hocko, Johannes Weiner

On Tue, Sep 18, 2012 at 06:04:01PM +0400, Glauber Costa wrote:
>  #ifdef CONFIG_MEMCG_KMEM
> +static struct cftype kmem_cgroup_files[] = {
[...]
> +	{},
> +};
[...]
> +#ifdef CONFIG_MEMCG_KMEM
> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
> +				   kmem_cgroup_files));
> +#endif
> +

Why not just make it part of mem_cgroup_files[]?

Thanks.

-- 
tejun
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Glauber Costa @ 2012-09-24 8:09 UTC
To: Tejun Heo
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
    Michal Hocko, Johannes Weiner

>> +#ifdef CONFIG_MEMCG_KMEM
>> +	WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
>> +				   kmem_cgroup_files));
>> +#endif
>> +
>
> Why not just make it part of mem_cgroup_files[]?
>
> Thanks.

Done.
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Michal Hocko @ 2012-09-26 14:03 UTC
To: Glauber Costa
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
    Johannes Weiner

On Tue 18-09-12 18:04:01, Glauber Costa wrote:
> This patch adds the basic infrastructure for the accounting of the slab
> caches. To control that, the following files are created:
>
>  * memory.kmem.usage_in_bytes
>  * memory.kmem.limit_in_bytes
>  * memory.kmem.failcnt
>  * memory.kmem.max_usage_in_bytes
>
> They have the same meaning as their user memory counterparts. They
> reflect the state of the "kmem" res_counter.

> The code is not enabled until a limit is set.

"Per cgroup slab memory accounting is not enabled until a limit is set
for the group. Once the limit is set, the accounting cannot be disabled
for such a group."

Better?

> This can be tested by the flag "kmem_accounted".

Sounds as if it could be done from userspace (because you were talking
about a user interface), which it cannot, and we do not see it in this
patch because it is not used anywhere. So please be more specific.

> This means that after the patch is applied, no behavioral changes
> exist for whoever is still using memcg to control their memory usage.
>
> We always account to both user and kernel resource_counters.

This is in contradiction with your claim that there is no behavioral
change for memcg users. Please clarify when we use u and when u+k
accounting.

"There is no behavioral change if kmem accounting is turned off, but
when a kmem.limit_in_bytes is set, memory.usage_in_bytes will include
both user and kmem memory."

[... rest of the changelog and most of the patch trimmed ...]

>  #ifdef CONFIG_MEMCG_KMEM

Some things are guarded by CONFIG_MEMCG_KMEM but some are not (e.g.
struct mem_cgroup.kmem). I do understand you want to keep ifdefs on the
leash, but we should clean this up one day.

[...]
> @@ -4979,6 +5039,7 @@ mem_cgroup_create(struct cgroup *cont)
>  	if (parent && parent->use_hierarchy) {
>  		res_counter_init(&memcg->res, &parent->res);
>  		res_counter_init(&memcg->memsw, &parent->memsw);
> +		res_counter_init(&memcg->kmem, &parent->kmem);

Haven't we already discussed that a new memcg should inherit
kmem_accounted from its parent for use_hierarchy? Say we have

	root
	|
	A (kmem_accounted = 1, use_hierarchy = 1)
	 \
	  B (kmem_accounted = 0)
	   \
	    C (kmem_accounted = 1)

B finds itself in an awkward situation because it doesn't want to
account u+k but it ends up doing so because of C.

-- 
Michal Hocko
SUSE Labs
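Concretely, the inheritance being asked for could look like the following
in mem_cgroup_create(). This is a sketch of the reviewer's request under
the hierarchy semantics discussed above, not a hunk from the posted
series (the author later moves this logic into the lifecycle patch):

	if (parent && parent->use_hierarchy) {
		res_counter_init(&memcg->res, &parent->res);
		res_counter_init(&memcg->memsw, &parent->memsw);
		res_counter_init(&memcg->kmem, &parent->kmem);
		/*
		 * Sketch: propagate the flag so a child created under an
		 * accounted hierarchy is accounted as well, avoiding the
		 * A(1) -> B(0) -> C(1) situation described above.
		 */
		memcg->kmem_accounted = parent->kmem_accounted;
	}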
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Glauber Costa @ 2012-09-26 14:33 UTC
To: Michal Hocko
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
    Johannes Weiner

On 09/26/2012 06:03 PM, Michal Hocko wrote:
> On Tue 18-09-12 18:04:01, Glauber Costa wrote:
[... full quote of the patch and review comments trimmed ...]
>> @@ -4979,6 +5039,7 @@ mem_cgroup_create(struct cgroup *cont)
>>  	if (parent && parent->use_hierarchy) {
>>  		res_counter_init(&memcg->res, &parent->res);
>>  		res_counter_init(&memcg->memsw, &parent->memsw);
>> +		res_counter_init(&memcg->kmem, &parent->kmem);
>
> Haven't we already discussed that a new memcg should inherit
> kmem_accounted from its parent for use_hierarchy? Say we have
>
> 	root
> 	|
> 	A (kmem_accounted = 1, use_hierarchy = 1)
> 	 \
> 	  B (kmem_accounted = 0)
> 	   \
> 	    C (kmem_accounted = 1)
>
> B finds itself in an awkward situation because it doesn't want to
> account u+k but it ends up doing so because of C.

Ok, I haven't updated it here. But that should be taken care of in the
lifecycle patch.
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Michal Hocko @ 2012-09-26 16:01 UTC
To: Glauber Costa
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
    Johannes Weiner

On Wed 26-09-12 18:33:10, Glauber Costa wrote:
> On 09/26/2012 06:03 PM, Michal Hocko wrote:
>> On Tue 18-09-12 18:04:01, Glauber Costa wrote:
[...]
>>> @@ -4979,6 +5039,7 @@ mem_cgroup_create(struct cgroup *cont)
>>>  	if (parent && parent->use_hierarchy) {
>>>  		res_counter_init(&memcg->res, &parent->res);
>>>  		res_counter_init(&memcg->memsw, &parent->memsw);
>>> +		res_counter_init(&memcg->kmem, &parent->kmem);
>>
>> Haven't we already discussed that a new memcg should inherit
>> kmem_accounted from its parent for use_hierarchy?
[...]
>
> Ok, I haven't updated it here. But that should be taken care of in the
> lifecycle patch.

I am not sure which patch you are thinking about, but I would prefer to
have it here because it is safe wrt. races and it is more obvious as
well.

-- 
Michal Hocko
SUSE Labs
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Glauber Costa @ 2012-09-26 17:34 UTC
To: Michal Hocko
Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
    Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
    Johannes Weiner

On 09/26/2012 08:01 PM, Michal Hocko wrote:
> On Wed 26-09-12 18:33:10, Glauber Costa wrote:
[...]
>> Ok, I haven't updated it here. But that should be taken care of in the
>> lifecycle patch.
>
> I am not sure which patch you are thinking about, but I would prefer to
> have it here because it is safe wrt. races and it is more obvious as
> well.

The patch where I make kmem_accounted into a bitfield, so any code here
will eventually disappear. But BTW, I am not saying I won't update the
patch - I like that all patches work and make sense on their own; I am
just saying that I forgot to update this patch, because I added the code
in its final version to the end and then squashed it.
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure

From: Tejun Heo @ 2012-09-26 16:36 UTC
To: Michal Hocko
Cc: Glauber Costa, linux-kernel, cgroups, kamezawa.hiroyu, devel,
    linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman,
    David Rientjes, Johannes Weiner

Hello, Michal, Glauber.

On Wed, Sep 26, 2012 at 04:03:47PM +0200, Michal Hocko wrote:
> Haven't we already discussed that a new memcg should inherit
> kmem_accounted from its parent for use_hierarchy? Say we have
>
> 	root
> 	|
> 	A (kmem_accounted = 1, use_hierarchy = 1)
> 	 \
> 	  B (kmem_accounted = 0)
> 	   \
> 	    C (kmem_accounted = 1)
>
> B finds itself in an awkward situation because it doesn't want to
> account u+k but it ends up doing so because of C.

Do we really want this level of flexibility? What's wrong with a
global switch at the root? I'm not even sure we want this to be
optional at all. The only reason I can think of is that it might
screw up some configurations in use which are carefully crafted to
suit userland-only usage, but for that, isn't what we need a transition
plan rather than another ultra-flexible config option that not many
really understand the implications of?

In the same vein, do we really need both .kmem_accounted and the config
option? If someone is turning on MEMCG, just include kmem accounting.

Thanks.

-- 
tejun
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 16:36 ` Tejun Heo @ 2012-09-26 17:36 ` Glauber Costa [not found] ` <50633D24.6020002-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 17:36 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/26/2012 08:36 PM, Tejun Heo wrote: > Hello, Michal, Glauber. > > On Wed, Sep 26, 2012 at 04:03:47PM +0200, Michal Hocko wrote: >> Haven't we already discussed that a new memcg should inherit kmem_accounted >> from its parent for use_hierarchy? >> Say we have >> root >> | >> A (kmem_accounted = 1, use_hierarchy = 1) >> \ >> B (kmem_accounted = 0) >> \ >> C (kmem_accounted = 1) >> >> B finds itself in an awkward situation because it doesn't want to >> account u+k but it ends up doing so because of C. > > Do we really want this level of flexibility? What's wrong with a > global switch at the root? I'm not even sure we want this to be > optional at all. The only reason I can think of is that it might > screw up some configurations in use which are carefully crafted to > suit userland-only usage; but for that, isn't what we need a transition > plan rather than another ultra-flexible config option that not many > really understand the implications of? > > In the same vein, do we really need both .kmem_accounted and a config > option? If someone is turning on MEMCG, just include kmem accounting. > Yes, we do. This was discussed multiple times. Our interest is to preserve existing deployed setups, which were tuned in a world where kmem didn't exist. Because we also feed kmem to the user counter, this may very well disrupt their setup. User memory, unlike kernel memory, may very well be totally in control of the userspace application, so it is not unreasonable to believe that extra pages appearing in a new kernel version may break them. It is actually a much worse compatibility problem than flipping hierarchy, in comparison. ^ permalink raw reply [flat|nested] 127+ messages in thread
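The phrase "we also feed kmem to the user counter" is the crux of the compatibility concern, and a rough sketch may help. Assuming the res_counter API of this era, a kmem charge would hit both the kmem counter and the user-memory counter, roughly along these lines (memcg_charge_kmem() is illustrative, not necessarily the exact function in the series):

/*
 * Illustrative u+k charging: kernel memory is charged to the kmem
 * counter and also folded into the user counter, which is why
 * enabling it can disturb setups tuned for user pages only.
 */
static int memcg_charge_kmem(struct mem_cgroup *memcg, unsigned long size)
{
	struct res_counter *fail_res;
	int ret;

	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
	if (ret)
		return ret;

	ret = res_counter_charge(&memcg->res, size, &fail_res);
	if (ret)
		res_counter_uncharge(&memcg->kmem, size);
	return ret;
}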
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50633D24.6020002-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 17:44 ` Tejun Heo 2012-09-26 17:53 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 17:44 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, Glauber. On Wed, Sep 26, 2012 at 10:36 AM, Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote: > This was discussed multiple times. Our interest is to preserve existing > deployed setups, which were tuned in a world where kmem didn't exist. > Because we also feed kmem to the user counter, this may very well > disrupt their setup. So, that can be served by .kmem_accounted at root, no? > User memory, unlike kernel memory, may very well be totally in control > of the userspace application, so it is not unreasonable to believe that > extra pages appearing in a new kernel version may break them. > > It is actually a much worse compatibility problem than flipping > hierarchy, in comparison. Again, what's wrong with one switch at the root? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 17:44 ` Tejun Heo @ 2012-09-26 17:53 ` Glauber Costa 2012-09-26 18:01 ` Tejun Heo 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 17:53 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/26/2012 09:44 PM, Tejun Heo wrote: > Hello, Glauber. > > On Wed, Sep 26, 2012 at 10:36 AM, Glauber Costa <glommer@parallels.com> wrote: >> This was discussed multiple times. Our interest is to preserve existing >> deployed setups, which were tuned in a world where kmem didn't exist. >> Because we also feed kmem to the user counter, this may very well >> disrupt their setup. > > So, that can be served by .kmem_accounted at root, no? > >> User memory, unlike kernel memory, may very well be totally in control >> of the userspace application, so it is not unreasonable to believe that >> extra pages appearing in a new kernel version may break them. >> >> It is actually a much worse compatibility problem than flipping >> hierarchy, in comparison. > > Again, what's wrong with one switch at the root? > I understand your trauma about over-flexibility, and you know I share it. But I don't think there is any need to cap it here. Given that kmem accounting is perfectly hierarchical, and there seem to be plenty of people who only care about user memory, I see no reason to disallow a mixed use case here. I must say that for my particular use case, enabling it unconditionally would just work, so it is not as if this is what I have in mind. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 17:53 ` Glauber Costa @ 2012-09-26 18:01 ` Tejun Heo [not found] ` <20120926180124.GA12544-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 18:01 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, On Wed, Sep 26, 2012 at 09:53:09PM +0400, Glauber Costa wrote: > I understand your trauma about over-flexibility, and you know I share it. > But I don't think there is any need to cap it here. Given that kmem > accounting is perfectly hierarchical, and there seem to be plenty of > people who only care about user memory, I see no reason to disallow a > mixed use case here. > > I must say that for my particular use case, enabling it unconditionally > would just work, so it is not as if this is what I have in mind. So, I'm not gonna go as far as pushing for enabling it unconditionally, but I would really like to hear why it's necessary to make it per node (per cgroup) instead of one global switch. Maybe it has already been discussed to hell and back. Care to summarize / point me to it? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20120926180124.GA12544-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-26 18:56 ` Glauber Costa [not found] ` <50634FC9.4090609-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 18:56 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/26/2012 10:01 PM, Tejun Heo wrote: > Hello, > > On Wed, Sep 26, 2012 at 09:53:09PM +0400, Glauber Costa wrote: >> I understand your trauma about over-flexibility, and you know I share it. >> But I don't think there is any need to cap it here. Given that kmem >> accounting is perfectly hierarchical, and there seem to be plenty of >> people who only care about user memory, I see no reason to disallow a >> mixed use case here. >> >> I must say that for my particular use case, enabling it unconditionally >> would just work, so it is not as if this is what I have in mind. > > So, I'm not gonna go as far as pushing for enabling it unconditionally, > but I would really like to hear why it's necessary to make it per node > (per cgroup) instead of one global switch. Maybe it has already been > discussed to hell and back. Care to summarize / point me to it? > For me, it is the other way around: it makes perfect sense to have a per-subtree selection of features where it doesn't hurt us, provided it is hierarchical, for the mere fact that not every application is interested in this (Michal has so far been the most vocal about it not being needed in some use cases), and it is perfectly valid to imagine such applications coexisting. So given the flexibility it brings, the real question is, as I said, backwards: why is it necessary to make it a global switch? ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50634FC9.4090609-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 19:34 ` Tejun Heo 2012-09-26 19:46 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 19:34 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, On Wed, Sep 26, 2012 at 10:56:09PM +0400, Glauber Costa wrote: > For me, it is the other way around: it makes perfect sense to have a > per-subtree selection of features where it doesn't hurt us, provided it > is hierarchical, for the mere fact that not every application is > interested in this (Michal has so far been the most vocal about it not > being needed in some use cases), and it is perfectly valid to imagine > such applications coexisting. > > So given the flexibility it brings, the real question is, as I said, > backwards: why is it necessary to make it a global switch? Because it hurts my head and it's better to keep things simple. We're planning to retire .use_hierarchy in sub-hierarchies and I'd really like to prevent another fiasco like that unless there absolutely is no way around it. Flexibility where necessary is fine, but let's please try our best to avoid over-designing things. We've been far too good at getting lost in the flexibility maze. Michal, care to chime in? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 19:34 ` Tejun Heo @ 2012-09-26 19:46 ` Glauber Costa [not found] ` <50635B9D.8020205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 19:46 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/26/2012 11:34 PM, Tejun Heo wrote: > Hello, > > On Wed, Sep 26, 2012 at 10:56:09PM +0400, Glauber Costa wrote: >> For me, it is the other way around: it makes perfect sense to have a >> per-subtree selection of features where it doesn't hurt us, provided it >> is hierarchical, for the mere fact that not every application is >> interested in this (Michal has so far been the most vocal about it not >> being needed in some use cases), and it is perfectly valid to imagine >> such applications coexisting. >> >> So given the flexibility it brings, the real question is, as I said, >> backwards: why is it necessary to make it a global switch? > > Because it hurts my head and it's better to keep things simple. We're > planning to retire .use_hierarchy in sub-hierarchies and I'd really > like to prevent another fiasco like that unless there absolutely is no > way around it. Flexibility where necessary is fine, but let's please > try our best to avoid over-designing things. We've been far too good > at getting lost in the flexibility maze. Michal, care to chime in? > I would very much like to hear Michal here as well, sure. But as I said at the very beginning of this, you pretty much know that I am heavily involved in trying to get rid of use_hierarchy, and by no means do I consider this on par with that. use_hierarchy is a hack around a core property of cgroups, the fact that they are hierarchical. Its mere existence came to be to overcome a performance limitation. It puts you in a contradictory situation where you have cgroups organized as directories, and then, not satisfied with having this hierarchical representation gravely ignored, forces you to use nonsensical terms like "flat hierarchy", giving us a taste of what it is to be a politician once in our lifetimes. Besides not being part of cgroup core, and respecting both cgroups' properties and basic sanity, kmem is an actual feature that some people want, and some people don't. There is no reason to believe that applications that want it will live in the same environment as ones that don't. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50635B9D.8020205-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 19:56 ` Tejun Heo 2012-09-26 20:02 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 19:56 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, On Wed, Sep 26, 2012 at 11:46:37PM +0400, Glauber Costa wrote: > Besides not being part of cgroup core, and respecting both cgroups' > properties and basic sanity, kmem is an actual feature that some people > want, and some people don't. There is no reason to believe that > applications that want it will live in the same environment as ones > that don't. I don't know. It definitely is less crazy than .use_hierarchy but I wouldn't say it's an inherently different thing. I mean, what does it even mean to have a u+k limit on one subtree and not on another branch? And we worry about things like what happens if the parent doesn't enable it but its children do. This is a feature which adds complexity. If the feature is necessary and justified, sure. If not, let's please not, and let's err on the side of conservativeness. We can always add it later but the other direction is much harder. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 19:56 ` Tejun Heo @ 2012-09-26 20:02 ` Glauber Costa [not found] ` <50635F46.7000700-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 20:02 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/26/2012 11:56 PM, Tejun Heo wrote: > Hello, > > On Wed, Sep 26, 2012 at 11:46:37PM +0400, Glauber Costa wrote: >> Besides not being part of cgroup core, and respecting both cgroups' >> properties and basic sanity, kmem is an actual feature that some people >> want, and some people don't. There is no reason to believe that >> applications that want it will live in the same environment as ones >> that don't. > > I don't know. It definitely is less crazy than .use_hierarchy but I > wouldn't say it's an inherently different thing. I mean, what does it > even mean to have a u+k limit on one subtree and not on another branch? > And we worry about things like what happens if the parent doesn't > enable it but its children do. > It is inherently different. To begin with, it actually contemplates two use cases. It is not a workaround. The meaning is also very well defined. The meaning of having this enabled in one subtree and not in another is: subtree A wants to track kernel memory, subtree B does not. It's that, and never more than that. There are no maybes and no buts, no magic knobs that make it behave in a crazy way. If a child enables it but the parent does not, this does what every tree does: enables it from that point downwards (see the sketch below). > This is a feature which adds complexity. If the feature is necessary > and justified, sure. If not, let's please not, and let's err on the > side of conservativeness. We can always add it later but the other > direction is much harder. > I disagree. Having kmem tracking adds complexity. Having to support the use case where we turn it on dynamically to cope with the "user pages only" case adds complexity. But I see no significant complexity being added by having it per subtree. Really. You have the use_hierarchy fiasco in mind, and I do understand that you are raising the flag and all that. But think in terms of functionality: this thing here is a lot more similar to swap than to use_hierarchy. Would you argue that memsw should be per-root? The reason why it shouldn't: some people want to limit memory consumption all the way to swap, some people don't. Same with kmem. ^ permalink raw reply [flat|nested] 127+ messages in thread
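A sketch of the downward-only semantics described above, assuming a kmem_accounted flag and the mem_cgroup_iter() helper that walks a subtree; the propagation code in the actual series may look different:

/*
 * Enabling kmem accounting on a memcg affects that point of the
 * tree and everything below it; ancestors that did not opt in are
 * untouched. mem_cgroup_iter(root, prev, NULL) iterates the
 * subtree rooted at 'root', including 'root' itself.
 */
static void memcg_kmem_enable_subtree(struct mem_cgroup *memcg)
{
	struct mem_cgroup *iter;

	for (iter = mem_cgroup_iter(memcg, NULL, NULL); iter;
	     iter = mem_cgroup_iter(memcg, iter, NULL))
		iter->kmem_accounted = true;
}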
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50635F46.7000700-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 20:16 ` Tejun Heo 2012-09-26 21:24 ` Glauber Costa 2012-09-26 22:11 ` Johannes Weiner 1 sibling, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 20:16 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Thu, Sep 27, 2012 at 12:02:14AM +0400, Glauber Costa wrote: > But think in terms of functionality: this thing here is a lot more > similar to swap than to use_hierarchy. Would you argue that memsw should > be per-root? I'm fairly sure you can make about the same argument about use_hierarchy. There is a choice to make here and one is simpler than the other. I want the additional complexity justified by actual use cases, which isn't too much to ask for, especially when the complexity is something visible to userland. So let's please stop arguing semantics. If this is definitely necessary for some use cases, sure, let's have it. If not, let's consider it later. I'll stop responding on "inherent differences." I don't think we'll get anywhere with that. Michal, Johannes, Kamezawa, what are your thoughts? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 20:16 ` Tejun Heo @ 2012-09-26 21:24 ` Glauber Costa [not found] ` <50637298.2090904-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 21:24 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/27/2012 12:16 AM, Tejun Heo wrote: > On Thu, Sep 27, 2012 at 12:02:14AM +0400, Glauber Costa wrote: >> But think in terms of functionality: this thing here is a lot more >> similar to swap than to use_hierarchy. Would you argue that memsw should >> be per-root? > > I'm fairly sure you can make about the same argument about > use_hierarchy. There is a choice to make here and one is simpler than > the other. I want the additional complexity justified by actual use > cases, which isn't too much to ask for, especially when the complexity > is something visible to userland. > > So let's please stop arguing semantics. If this is definitely > necessary for some use cases, sure, let's have it. If not, let's > consider it later. I'll stop responding on "inherent differences." I > don't think we'll get anywhere with that. > If you stop responding, we are for sure not getting anywhere. I agree with you here. Let me point out one issue that you seem to be missing, and you respond or not, your call. "kmem_accounted" is not a switch. It is an internal representation only. The semantics, which we discussed exhaustively in San Diego, are that a group that is not limited is not accounted. This is simple and consistent. Since the limits are still per-cgroup, you are actually proposing more user-visible complexity than me, since you are adding yet another file, with its own semantics. About use cases, I've already responded: my containers use case is kmem limited. There are people like Michal who specifically asked for user-only semantics to be preserved. So your question of global vs local switch (which, again, doesn't exist; only a local *limit* exists) should really be posed in the following way: "Can two different use cases with different needs be hosted in the same box?" > Michal, Johannes, Kamezawa, what are your thoughts? > waiting! =) ^ permalink raw reply [flat|nested] 127+ messages in thread
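The "not limited means not accounted" semantic means activation is a side effect of writing the limit rather than of a separate knob. A minimal sketch under that assumption (the function name is illustrative, not necessarily the series' exact code):

/*
 * Writing to kmem.limit_in_bytes sets the limit and, the first time
 * around, flips the internal accounting flag. There is no separate
 * user-visible on/off file; an unlimited group simply is not
 * accounted.
 */
static int memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
{
	int ret;

	ret = res_counter_set_limit(&memcg->kmem, val);
	if (!ret)
		memcg->kmem_accounted = true;
	return ret;
}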
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50637298.2090904-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 22:10 ` Tejun Heo 2012-09-26 22:29 ` Glauber Costa 2012-09-27 12:08 ` Michal Hocko 1 sibling, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 22:10 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, Glauber. On Thu, Sep 27, 2012 at 01:24:40AM +0400, Glauber Costa wrote: > "kmem_accounted" is not a switch. It is an internal representation only. > The semantics, which we discussed exhaustively in San Diego, are that a > group that is not limited is not accounted. This is simple and consistent. > > Since the limits are still per-cgroup, you are actually proposing more > user-visible complexity than me, since you are adding yet another file, > with its own semantics. I was confused. I thought it was exposed as a switch to userland (it being right below .use_hierarchy tripped a red alert). This is an internal flag dependent upon the kernel limit being set. My apologies. So, the proposed behavior is to allow enabling kmemcg anytime but to ignore what happened in between? Where the knob is changes, but the weirdness seems all the same. What prevents us from having a single switch at root which can only be flipped when there are no children? Backward compatibility is covered with a single switch, and I really don't think "you can enable limits for kernel memory anytime but we don't keep track of whatever happened before it was flipped the first time, because the first time is always special" is a sane thing to expose to userland. Or am I misunderstanding the proposed behavior again? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 22:10 ` Tejun Heo @ 2012-09-26 22:29 ` Glauber Costa [not found] ` <506381B2.2060806-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 22:29 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/27/2012 02:10 AM, Tejun Heo wrote: > Hello, Glauber. > > On Thu, Sep 27, 2012 at 01:24:40AM +0400, Glauber Costa wrote: >> "kmem_accounted" is not a switch. It is an internal representation only. >> The semantics, which we discussed exhaustively in San Diego, are that a >> group that is not limited is not accounted. This is simple and consistent. >> >> Since the limits are still per-cgroup, you are actually proposing more >> user-visible complexity than me, since you are adding yet another file, >> with its own semantics. > > I was confused. I thought it was exposed as a switch to userland (it > being right below .use_hierarchy tripped a red alert). Remember, I have so far been the one most vocally and radically trying to get rid of use_hierarchy. I should have been more clear - and I was, as soon as I better understood the nature of your opposition - but this is precisely what I meant by "inherently different". > > So, the proposed behavior is to allow enabling kmemcg anytime but > to ignore what happened in between? Where the knob is changes, but the > weirdness seems all the same. What prevents us from having a single > switch at root which can only be flipped when there are no children? So I view this very differently from you. We have no root-only switches in memcg. This would be a first, and this is the kind of thing that adds complexity, in my view. You have someone like libvirt or a systemd service using memcg. It probably starts at boot. Once it is started, it will pretty much prevent flipping any global switch like this. And then what? If you want a different behavior, you need to go kill all your services that are using memcg so you can get the behavior you want? And if they happen to be making a specific flag choice by design, you just say "you really can't run A + B together"? I myself think global switches are an unnecessary complication. And let us not talk about use_hierarchy, please. If it becomes global, it is going to be as part of a phase-out plan anyway. The problem with that is not that it is global; it is that it shouldn't even exist. > > Backward compatibility is covered with a single switch, and I really > don't think "you can enable limits for kernel memory anytime but we > don't keep track of whatever happened before it was flipped the first > time, because the first time is always special" is a sane thing to > expose to userland. Or am I misunderstanding the proposed behavior > again? > You do keep track. Before you switch it for the first time, it all belongs to the root memcg. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <506381B2.2060806-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 22:42 ` Tejun Heo 2012-09-26 22:54 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 22:42 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, Glauber. On Thu, Sep 27, 2012 at 02:29:06AM +0400, Glauber Costa wrote: > And then what? If you want a different behavior, you need to go kill all > your services that are using memcg so you can get the behavior you want? > And if they happen to be making a specific flag choice by design, you > just say "you really can't run A + B together"? > > I myself think global switches are an unnecessary complication. And let > us not talk about use_hierarchy, please. If it becomes global, it is > going to be as part of a phase-out plan anyway. The problem with that is > not that it is global; it is that it shouldn't even exist. I would consider it more of a compatibility thing which is set during boot and configurable by the sysadmin. Let the newer systems enable it by default on boot and old configs / special ones disable it as necessary. > > Backward compatibility is covered with a single switch, and I really > > don't think "you can enable limits for kernel memory anytime but we > > don't keep track of whatever happened before it was flipped the first > > time, because the first time is always special" is a sane thing to > > expose to userland. Or am I misunderstanding the proposed behavior > > again? > > You do keep track. Before you switch it for the first time, it all > belongs to the root memcg. Well, that's really playing with words. The limit is per cgroup, and before the limit is set for the first time, everything is accounted to something else. How is that keeping track? The proposed behavior seems really crazy to me. Do people really think this is a good idea? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 22:42 ` Tejun Heo @ 2012-09-26 22:54 ` Glauber Costa [not found] ` <50638793.7060806-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 22:54 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/27/2012 02:42 AM, Tejun Heo wrote: > Hello, Glauber. > > On Thu, Sep 27, 2012 at 02:29:06AM +0400, Glauber Costa wrote: >> And then what? If you want a different behavior, you need to go kill all >> your services that are using memcg so you can get the behavior you want? >> And if they happen to be making a specific flag choice by design, you >> just say "you really can't run A + B together"? >> >> I myself think global switches are an unnecessary complication. And let >> us not talk about use_hierarchy, please. If it becomes global, it is >> going to be as part of a phase-out plan anyway. The problem with that is >> not that it is global; it is that it shouldn't even exist. > > I would consider it more of a compatibility thing which is set during > boot and configurable by the sysadmin. Let the newer systems enable it by > default on boot and old configs / special ones disable it as > necessary. > I don't. Much has been said in the past about the problem of sharing. A lot of kernel objects are shared by nature; this is pretty much unavoidable. The answer we have been giving to this inquiry is that the workloads (us) interested in kmem accounting tend to be quite local in their file accesses (and other kernel objects as well). It should be obvious that not all workloads are like this, and some of them would actually prefer to have their umem limited only. There is nothing unreasonable in tracking user memory only. If we have a global switch for "tracking all kernel memory", to whom would you account the objects that are heavily shared? I solve this by not tracking kernel memory for cgroups in such workloads. What do you propose? >>> Well, that's really playing with words. The limit is per cgroup, and >>> before the limit is set for the first time, everything is accounted to >>> something else. How is that keeping track? >>> >> >> Even after the limit is set, it is set only by workloads that want kmem >> to be tracked. If you want to track it during the whole lifetime of the >> cgroup, you switch it before you put tasks in it. What is so crazy about it? > > The fact that the numbers don't really mean what they apparently > should mean. > This is vague. The usage file in the cgroup shows how much kernel memory was used by that cgroup. If it really bothers you that this may not be set through the whole group's lifetime, that is also easily solvable. > So, this seems properly crazy to me, at a similar level to the > use_hierarchy fiasco. I'm gonna NACK on this. > It is really sad that you lost the opportunity to say that in a room full of mm developers who could add to this discussion in real time, when, after an explanation of this was given, Mel asked if anyone had any objections to it. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50638793.7060806-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-26 23:08 ` Tejun Heo 2012-09-26 23:20 ` Glauber Costa [not found] ` <20120926230807.GC10453-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 2 replies; 127+ messages in thread From: Tejun Heo @ 2012-09-26 23:08 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, Glauber. On Thu, Sep 27, 2012 at 02:54:11AM +0400, Glauber Costa wrote: > I don't. Much has been said in the past about the problem of sharing. A > lot of kernel objects are shared by nature; this is pretty much > unavoidable. The answer we have been giving to this inquiry is that the > workloads (us) interested in kmem accounting tend to be quite local in > their file accesses (and other kernel objects as well). > > It should be obvious that not all workloads are like this, and some of > them would actually prefer to have their umem limited only. > > There is nothing unreasonable in tracking user memory only. > > If we have a global switch for "tracking all kernel memory", to whom > would you account the objects that are heavily shared? I solve this by not > tracking kernel memory for cgroups in such workloads. What do you propose? One of the things wrong with that is that it exposes the limitation of the current implementation as an interface to userland, which is never a good idea. In addition, how is userland supposed to know which workload is shared-kmem heavy or not? Details like that are not even inherent to workloads. They are highly dependent on kernel implementation, which may change any day. If we hit workloads like that, the right thing to do is to improve kmemcg so that such problems don't occur, not to expose another switch. If we can't make that work in a reasonable (doesn't have to be perfect) way, we might as well just give up on the kmem controller. If userland has to second-guess kernel implementation details to make use of it, it's useless. > > Well, that's really playing with words. The limit is per cgroup, and > > before the limit is set for the first time, everything is accounted to > > something else. How is that keeping track? > > > > Even after the limit is set, it is set only by workloads that want kmem > to be tracked. If you want to track it during the whole lifetime of the > cgroup, you switch it before you put tasks in it. What is so crazy about it? The fact that the numbers don't really mean what they apparently should mean. > > The proposed behavior seems really crazy to me. Do people really > > think this is a good idea? > > It is really sad that you lost the opportunity to say that in a room > full of mm developers who could add to this discussion in real time, > when, after an explanation of this was given, Mel asked if anyone > had any objections to it. Sure, conferences are useful for building consensus but that's the extent of it. Sorry that I didn't realize the implications then, but conferences don't really add any finality to decisions. So, this seems properly crazy to me, at a similar level to the use_hierarchy fiasco. I'm gonna NACK on this. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 23:08 ` Tejun Heo @ 2012-09-26 23:20 ` Glauber Costa 2012-09-26 23:33 ` Tejun Heo 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-26 23:20 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/27/2012 03:08 AM, Tejun Heo wrote: > Hello, Glauber. > > On Thu, Sep 27, 2012 at 02:54:11AM +0400, Glauber Costa wrote: >> I don't. Much has been said in the past about the problem of sharing. A >> lot of kernel objects are shared by nature; this is pretty much >> unavoidable. The answer we have been giving to this inquiry is that the >> workloads (us) interested in kmem accounting tend to be quite local in >> their file accesses (and other kernel objects as well). >> >> It should be obvious that not all workloads are like this, and some of >> them would actually prefer to have their umem limited only. >> >> There is nothing unreasonable in tracking user memory only. >> >> If we have a global switch for "tracking all kernel memory", to whom >> would you account the objects that are heavily shared? I solve this by not >> tracking kernel memory for cgroups in such workloads. What do you propose? > > One of the things wrong with that is that it exposes the limitation of > the current implementation as an interface to userland, which is never a > good idea. In addition, how is userland supposed to know which > workload is shared-kmem heavy or not? Details like that are not even > inherent to workloads. They are highly dependent on kernel implementation, > which may change any day. If we hit workloads like that, the right > thing to do is to improve kmemcg so that such problems don't occur, not > to expose another switch. > Sorry, there is nothing implementation-dependent in here. One of the biggest consumers of all this are dentries. Dentries are related to the paths you touch. If you touch files in a self-contained directory that you don't expect anyone else to touch, this can safely be considered local. If you touch files all around, this can safely be considered not local. Where is the implementation-dependent part? > If we can't make that work in a reasonable (doesn't have to be perfect) > way, we might as well just give up on the kmem controller. If userland > has to second-guess kernel implementation details to make use of it, > it's useless. > As I said above, it shouldn't. >>> Well, that's really playing with words. The limit is per cgroup, and >>> before the limit is set for the first time, everything is accounted to >>> something else. How is that keeping track? >>> >> >> Even after the limit is set, it is set only by workloads that want kmem >> to be tracked. If you want to track it during the whole lifetime of the >> cgroup, you switch it before you put tasks in it. What is so crazy about it? > > The fact that the numbers don't really mean what they apparently > should mean. > This is vague. The usage file in the cgroup shows how much kernel memory was used by that cgroup. If it really bothers you that this may not be set through the whole group's lifetime, that is also easily solvable. > So, this seems properly crazy to me, at a similar level to the > use_hierarchy fiasco. I'm gonna NACK on this. > As I said: all the use cases I particularly care about are covered by a global switch.
I am laying down my views because I really believe they make more sense. But at some point, of course, I'll shut up if I believe I am a lone voice. I believe it would still be good to hear from mhocko and kame, but from your point of view, would all the rest, plus the introduction of a global switch, make it acceptable to you? ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 23:20 ` Glauber Costa @ 2012-09-26 23:33 ` Tejun Heo [not found] ` <20120926233334.GD10453-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-26 23:33 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, On Thu, Sep 27, 2012 at 03:20:27AM +0400, Glauber Costa wrote: > > One of the things wrong with that is that it exposes the limitation of > > the current implementation as an interface to userland, which is never a > > good idea. In addition, how is userland supposed to know which > > workload is shared-kmem heavy or not? Details like that are not even > > inherent to workloads. They are highly dependent on kernel implementation, > > which may change any day. If we hit workloads like that, the right > > thing to do is to improve kmemcg so that such problems don't occur, not > > to expose another switch. > > Sorry, there is nothing implementation-dependent in here. One of the > biggest consumers of all this are dentries. Dentries are related to > the paths you touch. If you touch files in a self-contained directory > that you don't expect anyone else to touch, this can safely be > considered local. If you touch files all around, this can safely be > considered not local. Where is the implementation-dependent part? For things like dentries and inodes, if that really matters, we should be able to account for the usage better, no? And frankly I'm not even sold on that use case. Unless there's a way to detect and inform about these, userland isn't gonna know that they're doing something which consumes a lot of shared memory even if that activity is just filesystem walking. You're still asking userland to tune something depending on parameters not easily visible from userland. It's a lose-lose situation. > >> Even after the limit is set, it is set only by workloads that want kmem > >> to be tracked. If you want to track it during the whole lifetime of the > >> cgroup, you switch it before you put tasks in it. What is so crazy about it? > > > > The fact that the numbers don't really mean what they apparently > > should mean. > > > > This is vague. The usage file in the cgroup shows how much kernel memory > was used by that cgroup. If it really bothers you that this may not be > set through the whole group's lifetime, that is also easily solvable. Yes, easily, by having a global switch which can be manipulated when there are no children. It really seems like a no-brainer to me. > > So, this seems properly crazy to me, at a similar level to the > > use_hierarchy fiasco. I'm gonna NACK on this. > > As I said: all the use cases I particularly care about are covered by a > global switch. > > I am laying down my views because I really believe they make more sense. > But at some point, of course, I'll shut up if I believe I am a lone voice. > > I believe it would still be good to hear from mhocko and kame, but from > your point of view, would all the rest, plus the introduction of a > global switch, make it acceptable to you? The only thing I'm whining about is the per-node switch + silently ignoring past accounting, so if those two are solved, I think I'm pretty happy with the rest. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
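For contrast, a sketch of the root-only switch Tejun is arguing for, under the assumption that it may only change while the root memcg has no children; the children check and locking are simplified, and this is not code from the series:

/*
 * Illustrative root-level switch: it can be flipped only while no
 * child cgroups exist, so no charge can predate the setting. Real
 * code would need cgroup_mutex or an equivalent to make the
 * children check race-free.
 */
static int memcg_set_kmem_global(struct mem_cgroup *root_memcg, bool enable)
{
	struct cgroup *cgrp = root_memcg->css.cgroup;

	if (!list_empty(&cgrp->children))
		return -EBUSY;

	root_memcg->kmem_accounted = enable;
	return 0;
}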
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20120926233334.GD10453-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-09-27 12:15 ` Michal Hocko 2012-09-27 12:20 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-09-27 12:15 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Wed 26-09-12 16:33:34, Tejun Heo wrote: [...] > > > So, this seems properly crazy to me, at a similar level to the > > > use_hierarchy fiasco. I'm gonna NACK on this. > > > > As I said: all the use cases I particularly care about are covered by a > > global switch. > > > > I am laying down my views because I really believe they make more sense. > > But at some point, of course, I'll shut up if I believe I am a lone voice. > > > > I believe it would still be good to hear from mhocko and kame, but from > > your point of view, would all the rest, plus the introduction of a > > global switch, make it acceptable to you? > > The only thing I'm whining about is the per-node switch + silently > ignoring past accounting, so if those two are solved, I think I'm > pretty happy with the rest. I think that a per-group "switch" is not nice either, but if we make it hierarchy-specific (which I have been proposing for quite some time) and do not allow enabling accounting for a group with tasks, then we get both flexibility and a reasonable semantic. A global switch sounds too coarse to me, and it is really not necessary. Would this work for you? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 12:15 ` Michal Hocko @ 2012-09-27 12:20 ` Glauber Costa [not found] ` <506444A7.5060303-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-27 12:20 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/27/2012 04:15 PM, Michal Hocko wrote: > On Wed 26-09-12 16:33:34, Tejun Heo wrote: > [...] >>>> So, this seems properly crazy to me, at a similar level to the >>>> use_hierarchy fiasco. I'm gonna NACK on this. >>> >>> As I said: all the use cases I particularly care about are covered by a >>> global switch. >>> >>> I am laying down my views because I really believe they make more sense. >>> But at some point, of course, I'll shut up if I believe I am a lone voice. >>> >>> I believe it would still be good to hear from mhocko and kame, but from >>> your point of view, would all the rest, plus the introduction of a >>> global switch, make it acceptable to you? >> >> The only thing I'm whining about is the per-node switch + silently >> ignoring past accounting, so if those two are solved, I think I'm >> pretty happy with the rest. > > I think that a per-group "switch" is not nice either, but if we make it > hierarchy-specific (which I have been proposing for quite some time) and do not > allow enabling accounting for a group with tasks, then we get both > flexibility and a reasonable semantic. A global switch sounds too coarse to > me, and it is really not necessary. > > Would this work for you? > How exactly would that work? AFAIK, we have a single memcg root; we can't have multiple memcg hierarchies in a system. Am I missing something? ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <506444A7.5060303-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-27 12:40 ` Michal Hocko [not found] ` <20120927124031.GC29104-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-09-27 12:40 UTC (permalink / raw) To: Glauber Costa Cc: Tejun Heo, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Thu 27-09-12 16:20:55, Glauber Costa wrote: > On 09/27/2012 04:15 PM, Michal Hocko wrote: > > On Wed 26-09-12 16:33:34, Tejun Heo wrote: > > [...] > >>>> So, this seems properly crazy to me, at a similar level to the > >>>> use_hierarchy fiasco. I'm gonna NACK on this. > >>> > >>> As I said: all the use cases I particularly care about are covered by a > >>> global switch. > >>> > >>> I am laying down my views because I really believe they make more sense. > >>> But at some point, of course, I'll shut up if I believe I am a lone voice. > >>> > >>> I believe it would still be good to hear from mhocko and kame, but from > >>> your point of view, would all the rest, plus the introduction of a > >>> global switch, make it acceptable to you? > >> > >> The only thing I'm whining about is the per-node switch + silently > >> ignoring past accounting, so if those two are solved, I think I'm > >> pretty happy with the rest. > > > > I think that a per-group "switch" is not nice either, but if we make it > > hierarchy-specific (which I have been proposing for quite some time) and do not > > allow enabling accounting for a group with tasks, then we get both > > flexibility and a reasonable semantic. A global switch sounds too coarse to > > me, and it is really not necessary. > > > > Would this work for you? > > > > How exactly would that work? AFAIK, we have a single memcg root; we > can't have multiple memcg hierarchies in a system. Am I missing something? Well, root is so different that we could consider the first-level groups as the real roots of hierarchies. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20120927124031.GC29104-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-27 12:40 ` Glauber Costa 2012-09-27 12:54 ` Michal Hocko 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-27 12:40 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 09/27/2012 04:40 PM, Michal Hocko wrote: > On Thu 27-09-12 16:20:55, Glauber Costa wrote: >> On 09/27/2012 04:15 PM, Michal Hocko wrote: >>> On Wed 26-09-12 16:33:34, Tejun Heo wrote: >>> [...] >>>> The only thing I'm whining about is the per-node switch + silently >>>> ignoring past accounting, so if those two are solved, I think I'm >>>> pretty happy with the rest. >>> >>> I think that a per-group "switch" is not nice either, but if we make it >>> hierarchy-specific (which I have been proposing for quite some time) and do not >>> allow enabling accounting for a group with tasks, then we get both >>> flexibility and a reasonable semantic. A global switch sounds too coarse to >>> me, and it is really not necessary. >>> >>> Would this work for you? >>> >> >> How exactly would that work? AFAIK, we have a single memcg root; we >> can't have multiple memcg hierarchies in a system. Am I missing something? > > Well, root is so different that we could consider the first-level groups as the > real roots of hierarchies. > So let's favor clarity: what you are proposing is that the first level can have a switch for that, and the first level only. Is that right? At first, I just want to understand what exactly your proposal is. This is not an endorsement or lack thereof. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 12:40 ` Glauber Costa @ 2012-09-27 12:54 ` Michal Hocko 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-09-27 12:54 UTC (permalink / raw) To: Glauber Costa Cc: Tejun Heo, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Thu 27-09-12 16:40:03, Glauber Costa wrote: > On 09/27/2012 04:40 PM, Michal Hocko wrote: > > On Thu 27-09-12 16:20:55, Glauber Costa wrote: > >> On 09/27/2012 04:15 PM, Michal Hocko wrote: > >>> On Wed 26-09-12 16:33:34, Tejun Heo wrote: > >>> [...] > >>>> The only thing I'm whining about is the per-node switch + silently > >>>> ignoring past accounting, so if those two are solved, I think I'm > >>>> pretty happy with the rest. > >>> > >>> I think that a per-group "switch" is not nice either, but if we make it > >>> hierarchy-specific (which I have been proposing for quite some time) and do not > >>> allow enabling accounting for a group with tasks, then we get both > >>> flexibility and a reasonable semantic. A global switch sounds too coarse to > >>> me, and it is really not necessary. > >>> > >>> Would this work for you? > >>> > >> > >> How exactly would that work? AFAIK, we have a single memcg root; we > >> can't have multiple memcg hierarchies in a system. Am I missing something? > > > > Well, root is so different that we could consider the first-level groups as the > > real roots of hierarchies. > > > So let's favor clarity: what you are proposing is that the first level > can have a switch for that, and the first level only. Is that right? I do not want any more switches. I am fine with your "set the limit and start accounting" approach, and then inheriting the _internal_ flag down the hierarchy. If you are in a child and want to set the limit, then you can do that only if your parent is accounted already (so that you can have your own limit). We will need the same thing for oom_control and swappiness. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
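A sketch of the rule Michal proposes, assuming the cgroup API of this era (cgroup_task_count() and parent_mem_cgroup() are real helpers of the time; memcg_can_set_kmem_limit() itself is illustrative):

/*
 * Michal's proposal, roughly: a kmem limit may be set (activating
 * accounting) only while the group has no tasks, and a child may
 * set its own limit only if its parent already accounts kmem, so
 * the internal flag inherits cleanly down the hierarchy. Children
 * of root (the "real roots" above) may always enable it.
 */
static int memcg_can_set_kmem_limit(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

	if (cgroup_task_count(memcg->css.cgroup))
		return -EBUSY;
	if (parent && !mem_cgroup_is_root(parent) &&
	    !parent->kmem_accounted)
		return -EINVAL;
	return 0;
}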
[parent not found: <20120926230807.GC10453-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20120926230807.GC10453-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-09-27 14:28 ` Mel Gorman 2012-09-27 14:49 ` Tejun Heo 0 siblings, 1 reply; 127+ messages in thread From: Mel Gorman @ 2012-09-27 14:28 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On Wed, Sep 26, 2012 at 04:08:07PM -0700, Tejun Heo wrote: > Hello, Glauber. > > On Thu, Sep 27, 2012 at 02:54:11AM +0400, Glauber Costa wrote: > > I don't. Much has been said in the past about the problem of sharing. A > > lot of the kernel objects are shared by nature, this is pretty much > > unavoidable. The answer we have been giving to this inquiry, is that the > > workloads (us) interested in kmem accounted tend to be quite local in > > their file accesses (and other kernel objects as well). > > > > It should be obvious that not all workloads are like this, and some of > > them would actually prefer to have their umem limited only. > > > > There is nothing unreasonable in tracking user memory only. > > > > If we have a global switch for "tracking all kernel memory", who would > > you account the objects that are heavily shared to? I solve this by not > > tracking kernel memory for cgroups in such workloads. What do you propose? > > One of the things wrong with that is that it exposes the limitation of > the current implementation as interface to userland, which is never a > good idea. I think the limitations have been fairly clearly explained and any admin using the interface is going to have *some* familiarity with the limitations. > In addition, how is userland supposed to know which > workload is shared kmem heavy or not? By using a bit of common sense. An application may not be able to figure this out but the administrator is going to be able to make a very educated guess. If processes running within two containers are not sharing a filesystem hierarchy for example then it'll be clear they are not sharing dentries. If there was a suspicion they were then it could be analysed with something like SystemTap probing when files are opened and see if files are being opened that are shared between containers. It's not super-easy but it's not impossible either and I fail to see why it's such a big deal for you. >Details like that are not even > inherent to workloads. It's highly dependent on kernel implementation > which may change any day. If we hit workloads like that the right > thing to do is improving kmemcg so that such problems don't occur, not > exposing another switch. > > If we can't make that work in reasonable (doesn't have to be perfect) > way, we might as well just give up on kmem controller. If userland > has to second-guess kernel implementation details to make use of it, > it's useless. > > > > Well, that's really playing with words. Limit is per cgroup and > > > before the limit is set for the first time, everything is accounted to > > > something else. How is that keeping track? > > > > > > > Even after the limit is set, it is set only by workloads that want kmem > > to be tracked. If you want to track it during the whole lifetime of the > > cgroup, you switch it before you put tasks to it. What is so crazy about it? 
> > The fact that the numbers don't really mean what they apparently > should mean. > I think it is a reasonable limitation that only some kernel allocations are accounted for although I'll freely admit I'm not a cgroup or memcg user either. My understanding is that this comes down to cost -- accounting for the kernel memory usage is expensive so it is limited only to the allocations that are easy to abuse by an unprivileged process. Hence this is initially concerned with stack pages, with dentries and TCP usage to follow in later patches. Further I would expect that an administrator would be aware of these limitations and set kmem_accounting at cgroup creation time before any processes start. Maybe that should be enforced but it's not a fundamental problem. Due to the cost of accounting, I can see why it would be desirable to enable kmem_accounting for some cgroup trees and not others. It is not unreasonable to expect that an administrator might want to account within one cgroup where processes are accessing millions of files without impairing the performance of another cgroup that is mostly using anonymous memory. > > > The proposed behavior seems really crazy to me. Do people really > > > think this is a good idea? > > > > It is really sad that you lost the opportunity to say that in a room > > full of mm developers that could add to this discussion in real time, > > when after an explanation about this was given, Mel asked if anyone > > would have any objections to this. > > Sure, conferences are useful for building consensus but that's the > extent of it. Sorry that I didn't realize the implications then but > conferences don't really add any finality to decisions. > > So, this seems properly crazy to me at the similar level of > use_hierarchy fiasco. I'm gonna NACK on this. > I think you're over-reacting to say the very least :| -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 14:28 ` Mel Gorman @ 2012-09-27 14:49 ` Tejun Heo 2012-09-27 14:57 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-27 14:49 UTC (permalink / raw) To: Mel Gorman Cc: Glauber Costa, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Mel. On Thu, Sep 27, 2012 at 03:28:22PM +0100, Mel Gorman wrote: > > In addition, how is userland supposed to know which > > workload is shared kmem heavy or not? > > By using a bit of common sense. > > An application may not be able to figure this out but the administrator > is going to be able to make a very educated guess. If processes running > within two containers are not sharing a filesystem hierarchy for example > then it'll be clear they are not sharing dentries. > > If there was a suspicion they were then it could be analysed with > something like SystemTap probing when files are opened and see if files > are being opened that are shared between containers. > > It's not super-easy but it's not impossible either and I fail to see why > it's such a big deal for you. Because we're not even trying to actually solve the problem but just dumping it to userland. If dentry/inode usage is the only case we're being worried about, there can be better ways to solve it or at least we should strive for that. Also, the problem is not that it is impossible if you know and carefully plan for things beforehand (that would be one extremely competent admin) but that the problem is undiscoverable. With kmemcg accounting disabled, there's no way to tell: a cgroup the admin thinks is running something which doesn't tax kmem much could be generating a ton without the admin ever noticing. > > The fact that the numbers don't really mean what they apparently > > should mean. > > I think it is a reasonable limitation that only some kernel allocations are > accounted for although I'll freely admit I'm not a cgroup or memcg user > either. > > My understanding is that this comes down to cost -- accounting for the > kernel memory usage is expensive so it is limited only to the allocations > that are easy to abuse by an unprivileged process. Hence this is > initially concerned with stack pages, with dentries and TCP usage to > follow in later patches. I think the cost isn't too prohibitive considering it's already using memcg. Charging / uncharging happens only as pages enter and leave slab caches and the hot path overhead is essentially single indirection. Glauber's benchmark seemed pretty reasonable to me and I don't yet think that warrants exposing this subtle tree of configuration. > > Sure, conferences are useful for building consensus but that's the > > extent of it. Sorry that I didn't realize the implications then but > > conferences don't really add any finality to decisions. > > > > So, this seems properly crazy to me at the similar level of > > use_hierarchy fiasco. I'm gonna NACK on this. > > I think you're over-reacting to say the very least :| The part I nacked is enabling kmemcg on a populated cgroup and then starting accounting from then without any apparent indication that any past allocation hasn't been considered. You end up with numbers which nobody can tell what they really mean and there's no mechanism to guarantee any kind of ordering between populating the cgroup and configuring it and there's *no* way to find out what happened afterwards either.
This is properly crazy and definitely deserves a nack. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 14:49 ` Tejun Heo @ 2012-09-27 14:57 ` Glauber Costa [not found] ` <50646977.40300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-27 14:57 UTC (permalink / raw) To: Tejun Heo Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 09/27/2012 06:49 PM, Tejun Heo wrote: > Hello, Mel. > > On Thu, Sep 27, 2012 at 03:28:22PM +0100, Mel Gorman wrote: >>> In addition, how is userland supposed to know which >>> workload is shared kmem heavy or not? >> >> By using a bit of common sense. >> >> An application may not be able to figure this out but the administrator >> is going to be able to make a very educated guess. If processes running >> within two containers are not sharing a filesystem hierarchy for example >> then it'll be clear they are not sharing dentries. >> >> If there was a suspicion they were then it could be analysed with >> something like SystemTap probing when files are opened and see if files >> are being opened that are shared between containers. >> >> It's not super-easy but it's not impossible either and I fail to see why >> it's such a big deal for you. > > Because we're not even trying to actually solve the problem but just > dumping it to userland. If dentry/inode usage is the only case we're > being worried about, there can be better ways to solve it or at least > we should strive for that. > Not only is it not the only case we care about, it is not even touched in this series. (It is only touched in the next one). This one, for instance, cares about the stack. The reason everything is being dumped into "kmem" is precisely to make things simpler. I argue that at some point it makes sense to draw a line, and "kmem" is a much better line than any fine-grained control - precisely because it is conceptually easier to grasp. > Also, the problem is not that it is impossible if you know and > carefully plan for things beforehand (that would be one extremely > competent admin) but that the problem is undiscoverable. With kmemcg > accounting disabled, there's no way to tell: a cgroup the admin > thinks is running something which doesn't tax kmem much could be > generating a ton without the admin ever noticing. > >>> The fact that the numbers don't really mean what they apparently >>> should mean. >> >> I think it is a reasonable limitation that only some kernel allocations are >> accounted for although I'll freely admit I'm not a cgroup or memcg user >> either. >> >> My understanding is that this comes down to cost -- accounting for the >> kernel memory usage is expensive so it is limited only to the allocations >> that are easy to abuse by an unprivileged process. Hence this is >> initially concerned with stack pages, with dentries and TCP usage to >> follow in later patches. > > I think the cost isn't too prohibitive considering it's already using > memcg. Charging / uncharging happens only as pages enter and leave > slab caches and the hot path overhead is essentially single > indirection. Glauber's benchmark seemed pretty reasonable to me and I > don't yet think that warrants exposing this subtle tree of > configuration. > Only so we can get some numbers: the cost is really minor if this is all disabled. If this is fully enabled, it can get to some 2 or 3%, which may or may not be acceptable to an application.
But for me this is not even about cost, and that's why I haven't brought it up so far. >>> Sure, conferences are useful for building consensus but that's the >>> extent of it. Sorry that I didn't realize the implications then but >>> conferences don't really add any finality to decisions. >>> >>> So, this seems properly crazy to me at the similar level of >>> use_hierarchy fiasco. I'm gonna NACK on this. >> >> I think you're over-reacting to say the very least :| > > The part I nacked is enabling kmemcg on a populated cgroup and then > starting accounting from then without any apparent indication that any > past allocation hasn't been considered. You end up with numbers which > nobody can tell what they really mean and there's no mechanism to > guarantee any kind of ordering between populating the cgroup and > configuring it and there's *no* way to find out what happened > afterwards either. This is properly crazy and definitely deserves a > nack. > Mel's suggestion of not allowing this to happen once the cgroup has tasks takes care of this, and is something I thought of myself. This would remove this particular objection and maintain the per-subtree control. ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <50646977.40300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50646977.40300-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-27 17:46 ` Tejun Heo 2012-09-27 17:56 ` Michal Hocko 2012-09-27 18:45 ` Glauber Costa 0 siblings, 2 replies; 127+ messages in thread From: Tejun Heo @ 2012-09-27 17:46 UTC (permalink / raw) To: Glauber Costa Cc: Mel Gorman, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, On Thu, Sep 27, 2012 at 06:57:59PM +0400, Glauber Costa wrote: > > Because we're not even trying to actually solve the problem but just > > dumping it to userland. If dentry/inode usage is the only case we're > > being worried about, there can be better ways to solve it or at least > > we should strive for that. > > Not only is it not the only case we care about, it is not even touched > in this series. (It is only touched in the next one). This one, for > instance, cares about the stack. The reason everything is being dumped > into "kmem" is precisely to make things simpler. I argue that at some > point it makes sense to draw a line, and "kmem" is a much better line > than any fine-grained control - precisely because it is conceptually > easier to grasp. Can you please give other examples of cases where this type of issue exists (plenty of shared kernel data structure which is inherent to the workload at hand)? Until now, this has been the only example for this type of issues. > > I think the cost isn't too prohibitive considering it's already using > > memcg. Charging / uncharging happens only as pages enter and leave > > slab caches and the hot path overhead is essentially single > > indirection. Glauber's benchmark seemed pretty reasonable to me and I > > don't yet think that warrants exposing this subtle tree of > > configuration. > > Only so we can get some numbers: the cost is really minor if this is all > disabled. If this is fully enabled, it can get to some 2 or 3%, which > may or may not be acceptable to an application. But for me this is not > even about cost, and that's why I haven't brought it up so far. It seems like Mel's concern is mostly based on performance overhead concerns tho. > > The part I nacked is enabling kmemcg on a populated cgroup and then > > starting accounting from then without any apparent indication that any > > past allocation hasn't been considered. You end up with numbers which > > nobody can tell what they really mean and there's no mechanism to > > guarantee any kind of ordering between populating the cgroup and > > configuring it and there's *no* way to find out what happened > > afterwards either. This is properly crazy and definitely deserves a > > nack. > > > > Mel's suggestion of not allowing this to happen once the cgroup has tasks > takes care of this, and is something I thought of myself. You mean Michal's? It should also disallow switching if there are children cgroups, right? > This would remove this particular objection and maintain the > per-subtree control. Yeah, I don't see anything broken with that although I'll try to argue for the simpler one a bit more. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 17:46 ` Tejun Heo @ 2012-09-27 17:56 ` Michal Hocko 2012-09-27 18:45 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-09-27 17:56 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, Mel Gorman, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On Thu 27-09-12 10:46:05, Tejun Heo wrote: [...] > > > The part I nacked is enabling kmemcg on a populated cgroup and then > > > starting accounting from then without any apparent indication that any > > > past allocation hasn't been considered. You end up with numbers which > > > nobody can tell what they really mean and there's no mechanism to > > > guarantee any kind of ordering between populating the cgroup and > > > configuring it and there's *no* way to find out what happened > > > afterwards either. This is properly crazy and definitely deserves a > > > nack. > > > > > > > Mel's suggestion of not allowing this to happen once the cgroup has tasks > > takes care of this, and is something I thought of myself. > > You mean Michal's? It should also disallow switching if there are > children cgroups, right? Right. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 17:46 ` Tejun Heo 2012-09-27 17:56 ` Michal Hocko @ 2012-09-27 18:45 ` Glauber Costa [not found] ` <50649EAD.2050306-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 1 sibling, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-27 18:45 UTC (permalink / raw) To: Tejun Heo Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 09/27/2012 09:46 PM, Tejun Heo wrote: > Hello, > > On Thu, Sep 27, 2012 at 06:57:59PM +0400, Glauber Costa wrote: >>> Because we're not even trying to actually solve the problem but just >>> dumping it to userland. If dentry/inode usage is the only case we're >>> being worried about, there can be better ways to solve it or at least >>> we should strive for that. >> >> Not only is it not the only case we care about, it is not even touched >> in this series. (It is only touched in the next one). This one, for >> instance, cares about the stack. The reason everything is being dumped >> into "kmem" is precisely to make things simpler. I argue that at some >> point it makes sense to draw a line, and "kmem" is a much better line >> than any fine-grained control - precisely because it is conceptually >> easier to grasp. > > Can you please give other examples of cases where this type of issue > exists (plenty of shared kernel data structure which is inherent to > the workload at hand)? Until now, this has been the only example for > this type of issues. > Yes. the namespace related caches (*), all kinds of sockets and network structures, other file system structures like file struct, vm areas, and pretty much everything a full container does. (*) we run full userspace, so we have namespaces + cgroups combination. >>> I think the cost isn't too prohibitive considering it's already using >>> memcg. Charging / uncharging happens only as pages enter and leave >>> slab caches and the hot path overhead is essentially single >>> indirection. Glauber's benchmark seemed pretty reasonable to me and I >>> don't yet think that warrants exposing this subtle tree of >>> configuration. >> >> Only so we can get some numbers: the cost is really minor if this is all >> disabled. If this is fully enabled, it can get to some 2 or 3%, which >> may or may not be acceptable to an application. But for me this is not >> even about cost, and that's why I haven't brought it up so far. > > It seems like Mel's concern is mostly based on performance overhead > concerns tho. > >>> The part I nacked is enabling kmemcg on a populated cgroup and then >>> starting accounting from then without any apparent indication that any >>> past allocation hasn't been considered. You end up with numbers which >>> nobody can tell what they really mean and there's no mechanism to >>> guarantee any kind of ordering between populating the cgroup and >>> configuring it and there's *no* way to find out what happened >>> afterwards either. This is properly crazy and definitely deserves a >>> nack. >>> >> >> Mel's suggestion of not allowing this to happen once the cgroup has tasks >> takes care of this, and is something I thought of myself. > > You mean Michal's? It should also disallow switching if there are > children cgroups, right? > No, I meant Mel, quoting this: "Further I would expect that an administrator would be aware of these limitations and set kmem_accounting at cgroup creation time before any processes start.
Maybe that should be enforced but it's not a fundamental problem." But I guess it is pretty much the same thing Michal proposes, in essence. Or IOW, if your concern is with the fact that charges may have happened in the past before this is enabled, we can make sure this cannot happen by disallowing the limit to be set if currently unset (value changes are obviously fine) if you have children or any tasks already in the group. ^ permalink raw reply [flat|nested] 127+ messages in thread
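Concretely, the guard being converged on here could look something like the following sketch; cgroup_task_count() and the children list are real cgroup internals of this era, while the surrounding function and the kmem_accounted field are illustrative:

	/* Illustrative sketch of the "no tasks, no children" guard. */
	static int memcg_kmem_account_enable(struct mem_cgroup *memcg)
	{
		struct cgroup *cgrp = memcg->css.cgroup;

		/*
		 * Refuse to start accounting on a group that already has
		 * tasks or children: allocations made before the switch
		 * were never charged, so the resulting numbers would be
		 * meaningless.
		 */
		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
			return -EBUSY;

		memcg->kmem_accounted = true;
		return 0;
	}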
[parent not found: <50649EAD.2050306-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50649EAD.2050306-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-30 7:57 ` Tejun Heo 2012-09-30 8:02 ` Tejun Heo 2012-10-01 8:36 ` Glauber Costa 0 siblings, 2 replies; 127+ messages in thread From: Tejun Heo @ 2012-09-30 7:57 UTC (permalink / raw) To: Glauber Costa Cc: Mel Gorman, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Glauber. On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote: > > Can you please give other examples of cases where this type of issue > > exists (plenty of shared kernel data structure which is inherent to > > the workload at hand)? Until now, this has been the only example for > > this type of issues. > > Yes. the namespace related caches (*), all kinds of sockets and network > structures, other file system structures like file struct, vm areas, and > pretty much everything a full container does. > > (*) we run full userspace, so we have namespaces + cgroups combination. This is probably me being dumb but wouldn't resources used by full namespaces be mostly independent? Which parts get shared? Also, if you do full namespace, isn't it more likely that you would want fuller resource isolation too? > >> Mel suggestion of not allowing this to happen once the cgroup has tasks > >> takes care of this, and is something I thought of myself. > > > > You mean Michal's? It should also disallow switching if there are > > children cgroups, right? > > No, I meant Mel, quoting this: > > "Further I would expect that an administrator would be aware of these > limitations and set kmem_accounting at cgroup creation time before any > processes start. Maybe that should be enforced but it's not a > fundamental problem." > > But I guess it is pretty much the same thing Michal proposes, in essence. > > Or IOW, if your concern is with the fact that charges may have happened > in the past before this is enabled, we can make sure this cannot happen > by disallowing the limit to be set if currently unset (value changes are > obviously fine) if you have children or any tasks already in the group. Yeah, please do that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-30 7:57 ` Tejun Heo @ 2012-09-30 8:02 ` Tejun Heo [not found] ` <20120930080249.GF10383-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2012-10-01 8:36 ` Glauber Costa 1 sibling, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-30 8:02 UTC (permalink / raw) To: Glauber Costa Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On Sun, Sep 30, 2012 at 04:57:00PM +0900, Tejun Heo wrote: > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote: > > > Can you please give other examples of cases where this type of issue > > > exists (plenty of shared kernel data structure which is inherent to > > > the workload at hand)? Until now, this has been the only example for > > > this type of issues. > > > > Yes. the namespace related caches (*), all kinds of sockets and network > > structures, other file system structures like file struct, vm areas, and > > pretty much everything a full container does. > > > > (*) we run full userspace, so we have namespaces + cgroups combination. > > This is probably me being dumb but wouldn't resources used by full > namespaces be mostly independent? Which parts get shared? Also, if > you do full namespace, isn't it more likely that you would want fuller > resource isolation too? Just a thought about dentry/inode. Would it make sense to count total number of references per cgroup and charge the total amount according to that? Reference counts are how the shared ownership is represented after all. Counting total per cgroup isn't accurate and pathological cases could be weird tho. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <20120930080249.GF10383-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20120930080249.GF10383-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-09-30 8:56 ` James Bottomley [not found] ` <1348995388.2458.8.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: James Bottomley @ 2012-09-30 8:56 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, Mel Gorman, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On Sun, 2012-09-30 at 17:02 +0900, Tejun Heo wrote: > On Sun, Sep 30, 2012 at 04:57:00PM +0900, Tejun Heo wrote: > > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote: > > > > Can you please give other examples of cases where this type of issue > > > > exists (plenty of shared kernel data structure which is inherent to > > > > the workload at hand)? Until now, this has been the only example for > > > > this type of issues. > > > > > > Yes. the namespace related caches (*), all kinds of sockets and network > > > structures, other file system structures like file struct, vm areas, and > > > pretty much everything a full container does. > > > > > > (*) we run full userspace, so we have namespaces + cgroups combination. > > > > This is probably me being dumb but wouldn't resources used by full > > namespaces be mostly independent? Which parts get shared? Also, if > > you do full namespace, isn't it more likely that you would want fuller > > resource isolation too? > > Just a thought about dentry/inode. Would it make sense to count total > number of references per cgroup and charge the total amount according > to that? Reference counts are how the shared ownership is represented > after all. Counting total per cgroup isn't accurate and pathological > cases could be weird tho. The beancounter approach originally used by OpenVZ does exactly this. There are two specific problems, though, firstly you can't count references in generic code, so now you have to extend the cgroup tentacles into every object, an invasiveness which people didn't really like. Secondly split accounting causes oddities too, like your total kernel memory usage can appear to go down even though you do nothing just because someone else added a share. Worse, if someone drops the reference, your usage can go up, even though you did nothing, and push you over your limit, at which point action gets taken against the container. This leads to nasty system unpredictability (The whole point of cgroup isolation is supposed to be preventing resource usage in one cgroup from affecting that in another). We discussed this pretty heavily at the Containers Mini Summit in Santa Rosa. The emergent consensus was that no-one really likes first use accounting, but it does solve all the problems and it has the fewest unexpected side effects. If you have an alternative that wasn't considered then, I'm sure everyone would be interested, but it isn't split accounting. James ^ permalink raw reply [flat|nested] 127+ messages in thread
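To make the oddity James describes concrete, here is a toy illustration of naive split accounting, where each sharer of an object is charged size / nr_refs; the numbers and the helper are purely illustrative:

	/*
	 * A 300 kB dentry subtree shared by containers A, B and C under
	 * naive split accounting:
	 *
	 *   A alone holds it:          A is charged 300 kB
	 *   B and C start sharing it:  everyone drops to 100 kB -- A's
	 *                              usage went down though A did nothing
	 *   C drops its reference:     A and B jump to 150 kB -- usage went
	 *                              up though they did nothing, possibly
	 *                              pushing one of them over its limit
	 */
	static unsigned long split_charge(unsigned long size, unsigned int nr_refs)
	{
		return size / nr_refs;	/* recomputed whenever nr_refs changes */
	}

Under first-use accounting, by contrast, A would simply keep the whole 300 kB charge until the objects are reclaimed, which is less fair but entirely predictable.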
[parent not found: <1348995388.2458.8.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <1348995388.2458.8.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2012-09-30 10:37 ` Tejun Heo 2012-09-30 11:25 ` James Bottomley 2012-10-01 8:46 ` Glauber Costa 0 siblings, 2 replies; 127+ messages in thread From: Tejun Heo @ 2012-09-30 10:37 UTC (permalink / raw) To: James Bottomley Cc: Glauber Costa, Mel Gorman, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, James. On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote: > The beancounter approach originally used by OpenVZ does exactly this. > There are two specific problems, though, firstly you can't count > references in generic code, so now you have to extend the cgroup > tentacles into every object, an invasiveness which people didn't really > like. Yeah, it will need some hooks. For dentry and inode, I think it would be pretty well isolated tho. Wasn't it? > Secondly split accounting causes oddities too, like your total > kernel memory usage can appear to go down even though you do nothing > just because someone else added a share. Worse, if someone drops the > reference, your usage can go up, even though you did nothing, and push > you over your limit, at which point action gets taken against the > container. This leads to nasty system unpredictability (The whole point > of cgroup isolation is supposed to be preventing resource usage in one > cgroup from affecting that in another). In a sense, the fluctuating amount is the actual resource burden the cgroup is putting on the system, so maybe it just needs to be handled better or maybe we should charge fixed amount per refcnt? I don't know. > We discussed this pretty heavily at the Containers Mini Summit in Santa > Rosa. The emergent consensus was that no-one really likes first use > accounting, but it does solve all the problems and it has the fewest > unexpected side effects. But that's like fitting the problem to the mechanism. Maybe that is the best which can be done, but the side effect there is way-off accounting under pretty common workload, which sounds pretty nasty to me. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-30 10:37 ` Tejun Heo @ 2012-09-30 11:25 ` James Bottomley [not found] ` <1349004352.2458.34.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2012-10-01 8:46 ` Glauber Costa 1 sibling, 1 reply; 127+ messages in thread From: James Bottomley @ 2012-09-30 11:25 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On Sun, 2012-09-30 at 19:37 +0900, Tejun Heo wrote: > Hello, James. > > On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote: > > The beancounter approach originally used by OpenVZ does exactly this. > > There are two specific problems, though, firstly you can't count > > references in generic code, so now you have to extend the cgroup > > tentacles into every object, an invasiveness which people didn't really > > like. > > Yeah, it will need some hooks. For dentry and inode, I think it would > be pretty well isolated tho. Wasn't it? But you've got to ask yourself who cares about accurate accounting per container of dentry and inode objects? They're not objects that any administrator is used to limiting. What we at Parallels care about isn't accurately accounting them, it's that one container can't DoS another by exhausting system resources. That's achieved equally well by first charge slab accounting, so we don't really have an interest in pushing object accounting code for which there's no use case. > > Secondly split accounting causes oddities too, like your total > > kernel memory usage can appear to go down even though you do nothing > > just because someone else added a share. Worse, if someone drops the > > reference, your usage can go up, even though you did nothing, and push > > you over your limit, at which point action gets taken against the > > container. This leads to nasty system unpredictability (The whole point > > of cgroup isolation is supposed to be preventing resource usage in one > > cgroup from affecting that in another). > > In a sense, the fluctuating amount is the actual resource burden the > cgroup is putting on the system, so maybe it just needs to be handled > better or maybe we should charge fixed amount per refcnt? I don't > know. Yes, we considered single charge per reference accounting as well (although I don't believe anyone went as far as producing an implementation). The problem here is now that the sum of the container resources no longer bears any relation to the host consumption. This makes it very difficult for virtualisation orchestration systems to make accurate decisions when doing dynamic resource scheduling (DRS). Conversely, as ugly as you think it is, first use accounting is actually pretty good at identifying problem containers (at least with regard to memory usage) for DRS because containers which are stretching the memory tend to accumulate the greatest number of first charges over the system lifetime. > > We discussed this pretty heavily at the Containers Mini Summit in Santa > > Rosa. The emergent consensus was that no-one really likes first use > > accounting, but it does solve all the problems and it has the fewest > > unexpected side effects. > > But that's like fitting the problem to the mechanism. Maybe that is > the best which can be done, but the side effect there is way-off > accounting under pretty common workload, which sounds pretty nasty to > me.
All we need kernel memory accounting and limiting for is DoS prevention. There aren't really any system administrators who care about kernel memory accounting (at least until the system goes oom) because there are no absolute knobs for it (all there is is a set of weird and wonderful heuristics, like dirty limit ratio and drop caches). Kernel memory usage has a whole set of regulatory infrastructure for trying to make it transparent to the user. Don't get me wrong: if there were some easy way to get proper memory accounting for free, we'd be happy but, because it has no practical application for any of our customers, there's a limited price we're willing to pay to get it. James ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <1349004352.2458.34.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <1349004352.2458.34.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2012-10-01 0:57 ` Tejun Heo 2012-10-01 8:43 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-10-01 0:57 UTC (permalink / raw) To: James Bottomley Cc: Glauber Costa, Mel Gorman, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, James. On Sun, Sep 30, 2012 at 12:25:52PM +0100, James Bottomley wrote: > But you've got to ask yourself who cares about accurate accounting per > container of dentry and inode objects? They're not objects that any > administrator is used to limiting. What we at Parallels care about > isn't accurately accounting them, it's that one container can't DoS > another by exhausting system resources. That's achieved equally well by > first charge slab accounting, so we don't really have an interest in > pushing object accounting code for which there's no use case. Isn't it more because the use cases you have in mind don't share dentries/inodes too much? Wildly incorrect accounting definitely degrades container isolation and can lead to unexpected behaviors. > All we need kernel memory accounting and limiting for is DoS prevention. > There aren't really any system administrators who care about kernel > memory accounting (at least until the system goes oom) because there are > no absolute knobs for it (all there is is a set of weird and wonderful > heuristics, like dirty limit ratio and drop caches). Kernel memory I think that's because the mechanism currently doesn't exist. If one wants to control how memory is distributed across different cgroups, it's logical to control kernel memory too. The resource in question is the actual memory after all. I think at least google would be interested in it, so, no, I don't agree that nobody wants it. If that is the case, we're working towards the wrong direction. > usage has a whole set of regulatory infrastructure for trying to make it > transparent to the user. > > Don't get me wrong: if there were some easy way to get proper memory > accounting for free, we'd be happy but, because it has no practical > application for any of our customers, there's a limited price we're > willing to pay to get it. Even on purely technical grounds, it could be that first-use is the right trade off if other more accurate approaches are too difficult and most workloads are happy with such an approach. I'm still a bit wary to base userland interface decisions on that tho. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-10-01 0:57 ` Tejun Heo @ 2012-10-01 8:43 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-01 8:43 UTC (permalink / raw) To: Tejun Heo Cc: James Bottomley, Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner, Greg Thelen On 10/01/2012 04:57 AM, Tejun Heo wrote: > Hello, James. > > On Sun, Sep 30, 2012 at 12:25:52PM +0100, James Bottomley wrote: >> But you've got to ask yourself who cares about accurate accounting per >> container of dentry and inode objects? They're not objects that any >> administrator is used to limiting. What we at Parallels care about >> isn't accurately accounting them, it's that one container can't DoS >> another by exhausting system resources. That's achieved equally well by >> first charge slab accounting, so we don't really have an interest in >> pushing object accounting code for which there's no use case. > > Isn't it more because the use cases you have in mind don't share > dentries/inodes too much? Wildly incorrect accounting definitely > degrades container isolation and can lead to unexpected behaviors. > >> All we need kernel memory accounting and limiting for is DoS prevention. >> There aren't really any system administrators who care about kernel >> memory accounting (at least until the system goes oom) because there are >> no absolute knobs for it (all there is is a set of weird and wonderful >> heuristics, like dirty limit ratio and drop caches). Kernel memory > > I think that's because the mechanism currently doesn't exist. If one > wants to control how memory is distributed across different cgroups, > it's logical to control kernel memory too. The resource in question > is the actual memory after all. I think at least google would be > interested in it, so, no, I don't agree that nobody wants it. If that > is the case, we're working towards the wrong direction. > >> usage has a whole set of regulatory infrastructure for trying to make it >> transparent to the user. >> >> Don't get me wrong: if there were some easy way to get proper memory >> accounting for free, we'd be happy but, because it has no practical >> application for any of our customers, there's a limited price we're >> willing to pay to get it. > > Even on purely technical grounds, it could be that first-use is the > right trade off if other more accurate approaches are too difficult > and most workloads are happy with such an approach. I'm still a bit > wary to base userland interface decisions on that tho. > For the record, user memory also suffers a bit from being always constrained to first-touch accounting. Greg Thelen is working on alternatives to first-touch accounting being the default, in a configurable way, as he explained at the kernel summit. When that happens, kernel memory can take advantage of it for free. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-30 10:37 ` Tejun Heo 2012-09-30 11:25 ` James Bottomley @ 2012-10-01 8:46 ` Glauber Costa 2012-10-03 22:59 ` Tejun Heo 1 sibling, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-01 8:46 UTC (permalink / raw) To: Tejun Heo Cc: James Bottomley, Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 09/30/2012 02:37 PM, Tejun Heo wrote: > Hello, James. > > On Sun, Sep 30, 2012 at 09:56:28AM +0100, James Bottomley wrote: >> The beancounter approach originally used by OpenVZ does exactly this. >> There are two specific problems, though, firstly you can't count >> references in generic code, so now you have to extend the cgroup >> tentacles into every object, an invasiveness which people didn't really >> like. > > Yeah, it will need some hooks. For dentry and inode, I think it would > be pretty well isolated tho. Wasn't it? > We would still need something for the stack. For open files, and for everything that becomes a potential problem. We then end up with 35 different knobs instead of one. One of the perceived advantages of this approach is that it condenses as much data into a single knob as possible, reducing complexity and over-flexibility. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-10-01 8:46 ` Glauber Costa @ 2012-10-03 22:59 ` Tejun Heo 0 siblings, 0 replies; 127+ messages in thread From: Tejun Heo @ 2012-10-03 22:59 UTC (permalink / raw) To: Glauber Costa Cc: James Bottomley, Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Glauber. On Mon, Oct 01, 2012 at 12:46:02PM +0400, Glauber Costa wrote: > > Yeah, it will need some hooks. For dentry and inode, I think it would > > be pretty well isolated tho. Wasn't it? > > We would still need something for the stack. For open files, and for > everything that becomes a potential problem. We then end up with 35 > different knobs instead of one. One of the perceived advantages of this > approach is that it condenses as much data into a single knob as > possible, reducing complexity and over-flexibility. Oh, I didn't mean to use object-specific counting for all of them. Most resources don't have such a common misaccounting problem. I mean, for stack, it doesn't exist by definition (other than cgroup migration). There's no reason to use anything other than first-use kmem-based accounting for them. My point was that for particularly problematic ones like dentry/inode, it might be better to treat them differently. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-30 7:57 ` Tejun Heo 2012-09-30 8:02 ` Tejun Heo @ 2012-10-01 8:36 ` Glauber Costa 1 sibling, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-01 8:36 UTC (permalink / raw) To: Tejun Heo Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 09/30/2012 11:57 AM, Tejun Heo wrote: > Hello, Glauber. > > On Thu, Sep 27, 2012 at 10:45:01PM +0400, Glauber Costa wrote: >>> Can you please give other examples of cases where this type of issue >>> exists (plenty of shared kernel data structure which is inherent to >>> the workload at hand)? Until now, this has been the only example for >>> this type of issues. >> >> Yes. the namespace related caches (*), all kinds of sockets and network >> structures, other file system structures like file struct, vm areas, and >> pretty much everything a full container does. >> >> (*) we run full userspace, so we have namespaces + cgroups combination. > > This is probably me being dumb but wouldn't resources used by full > namespaces be mostly independent? Which parts get shared? Also, if > you do full namespace, isn't it more likely that you would want fuller > resource isolation too? > Not necessarily. Namespaces are pretty flexible. If you are using the network namespace, for instance, you can create interfaces, routes, addresses, etc. But because this deals with the network only, there is nothing unreasonable in saying that your webserver and database live in the same network (which is a different network namespace) but are entitled to different memory limits - which is cgroups' realm. With application-only containers being championed these days by multiple users, I would expect this situation to become non-negligible. The full container scenario, of course, is very different. Most of the accesses tend to be local. I will second what Michal said, since I believe this is also very important: user memory is completely under the control of the application, while kernel memory is not, and never will be. It is perfectly fine to imagine applications that want their memory to be limited by a very predictable amount, and there are no reasons to believe that those cannot live in the same box as full containers - the biggest example of kmem-interested folks. It could be, for instance, that a management tool for containers lives in there, and that application wants to be umem limited but not kmem limited. (If it goes touching files and data inside each container, for instance, it is obviously not local) >>>> Mel's suggestion of not allowing this to happen once the cgroup has tasks >>>> takes care of this, and is something I thought of myself. >>> >>> You mean Michal's? It should also disallow switching if there are >>> children cgroups, right? >> >> No, I meant Mel, quoting this: >> >> "Further I would expect that an administrator would be aware of these >> limitations and set kmem_accounting at cgroup creation time before any >> processes start. Maybe that should be enforced but it's not a >> fundamental problem." >> >> But I guess it is pretty much the same thing Michal proposes, in essence. >> >> Or IOW, if your concern is with the fact that charges may have happened >> in the past before this is enabled, we can make sure this cannot happen >> by disallowing the limit to be set if currently unset (value changes are >> obviously fine) if you have children or any tasks already in the group.
> Yeah, please do that. > I did already, patches soon! =) ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50637298.2090904-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-26 22:10 ` Tejun Heo @ 2012-09-27 12:08 ` Michal Hocko 2012-09-27 12:11 ` Glauber Costa 2012-09-27 14:33 ` Tejun Heo 1 sibling, 2 replies; 127+ messages in thread From: Michal Hocko @ 2012-09-27 12:08 UTC (permalink / raw) To: Glauber Costa Cc: Tejun Heo, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Thu 27-09-12 01:24:40, Glauber Costa wrote: [...] > About use cases, I've already responded: my containers use case is kmem > limited. There are people like Michal that specifically asked for > user-only semantics to be preserved. Yes, because we have many users (basically almost all) who care only about the user memory because that's what occupies the vast majority of the memory. They usually want to isolate workload which would disrupt the global memory otherwise (e.g. backup process vs. database). You really do not want to pay an additional overhead for kmem accounting here. > So your question for global vs local switch (that again, doesn't > exist; only a local *limit* exists) should really be posed in the > following way: "Can two different use cases with different needs be > hosted in the same box?" I think this is a good and a relevant question. I think this boils down to whether you want to have trusted and untrusted workloads on the same machine. Trusted loads usually only need user memory accounting because kmem consumption should be really negligible (unless the kernel is doing something really stupid and no kmem limit will help here). On the other hand, untrusted workloads can do nasty things that an administrator has a hard time mitigating and setting a kmem limit can help significantly. IMHO such different loads exist on a single machine quite often (a web server and a backup process being the most simplistic example). The per-hierarchy accounting, therefore, sounds like a good idea without too much added complexity (actually the only added complexity is in the proper kmem.limit_in_bytes handling which is a single place). So I would rather go with the per-hierarchy thing. > > Michal, Johannes, Kamezawa, what are your thoughts? > > > waiting! =) Well, you guys generated a lot of discussion that one has to read through, didn't you :P -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 12:08 ` Michal Hocko @ 2012-09-27 12:11 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-09-27 12:11 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner >>> Michal, Johannes, Kamezawa, what are your thoughts? >>> >> waiting! =) > > Well, you guys generated a lot of discussion that one has to read > through, didn't you :P > We're quite good at that! ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 12:08 ` Michal Hocko 2012-09-27 12:11 ` Glauber Costa @ 2012-09-27 14:33 ` Tejun Heo [not found] ` <20120927143300.GA4251-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2012-09-27 15:09 ` Michal Hocko 1 sibling, 2 replies; 127+ messages in thread From: Tejun Heo @ 2012-09-27 14:33 UTC (permalink / raw) To: Michal Hocko Cc: Glauber Costa, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, Michal. On Thu, Sep 27, 2012 at 02:08:06PM +0200, Michal Hocko wrote: > Yes, because we have many users (basically almost all) who care only > about the user memory because that's what occupies the vast majority of > the memory. They usually want to isolate workload which would disrupt > the global memory otherwise (e.g. backup process vs. database). You > really do not want to pay an additional overhead for kmem accounting > here. I'm not too convinced. First of all, the overhead added by kmemcg isn't big. The hot path overhead is quite minimal - it doesn't do much more than indirecting one more time. In terms of memory usage, it sure could lead to a bit more fragmentation but even if it gets to several megs per cgroup, I don't think that's something excessive. So, there is overhead but I don't believe it to be prohibitive. > > So your question for global vs local switch (that again, doesn't > > exist; only a local *limit* exists) should really be posed in the > > following way: "Can two different use cases with different needs be > > hosted in the same box?" > > I think this is a good and a relevant question. I think this boils down > to whether you want to have trusted and untrusted workloads on the same > machine. > Trusted loads usually only need user memory accounting because kmem > consumption should be really negligible (unless the kernel is doing > something really stupid and no kmem limit will help here). > On the other hand, untrusted workloads can do nasty things that an > administrator has a hard time mitigating and setting a kmem limit can > help significantly. > > IMHO such different loads exist on a single machine quite often (a web > server and a backup process being the most simplistic example). The > per-hierarchy accounting, therefore, sounds like a good idea without too > much added complexity (actually the only added complexity is in the > proper kmem.limit_in_bytes handling which is a single place). The distinction between "trusted" and "untrusted" is something artificially created due to the assumed deficiency of the kmemcg implementation. Making things like this visible to userland is a bad idea because it locks us into a place where we can't or don't need to improve the said deficiencies and end up pushing the difficult problems to somewhere else where they will likely be implemented in a shabbier way. There sure are cases when such an approach simply cannot be avoided, but I really don't think that's the case here - the overhead already seems to be at an acceptable level and we're not taking away the escape switch. This is a userland-visible API. We had better err on the side of being conservative than go overboard with flexibility. Even if we eventually need to make this switching fully hierarchical, we really should be doing: 1. Implement simple global switching and look for problem cases. 2. Analyze them and see whether the problem case can't be solved in a better, more intelligent way. 3.
If the problem is something structurally inherent or reasonably too difficult to solve any other way, consider dumping the problem as config parameters to userland. We can always expand the flexibility. Let's do the simple thing first. As an added bonus, it would enable using static_keys for accounting branches too. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
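A minimal sketch of the static_keys idea Tejun mentions, assuming illustrative names (the key and helper identifiers here are not necessarily the series' own):

#include <linux/jump_label.h>

/* false until the first kmem limit is configured */
struct static_key memcg_kmem_enabled_key = STATIC_KEY_INIT_FALSE;

static inline bool memcg_kmem_enabled(void)
{
        return static_key_false(&memcg_kmem_enabled_key);
}

/* hot path: a patched-out no-op until the key is flipped by the first user */
static inline bool memcg_kmem_charge_page(gfp_t gfp, int order)
{
        if (!memcg_kmem_enabled() || !(gfp & __GFP_KMEMCG))
                return true;                            /* nothing to account */
        return __memcg_kmem_newpage_charge(gfp, order); /* slow path */
}

The key would be flipped with static_key_slow_inc() from the limit-setting path, so groups that never configure a kmem limit never take the accounting branch at all.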
[parent not found: <20120927143300.GA4251-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20120927143300.GA4251-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-09-27 14:43 ` Mel Gorman 2012-09-27 14:58 ` Tejun Heo 0 siblings, 1 reply; 127+ messages in thread From: Mel Gorman @ 2012-09-27 14:43 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Glauber Costa, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On Thu, Sep 27, 2012 at 07:33:00AM -0700, Tejun Heo wrote: > Hello, Michal. > > On Thu, Sep 27, 2012 at 02:08:06PM +0200, Michal Hocko wrote: > > Yes, because we have many users (basically almost all) who care only > > about the user memory because that's what occupies the vast majority of > > the memory. They usually want to isolate workload which would disrupt > > the global memory otherwise (e.g. backup process vs. database). You > > really do not want to pay an additional overhead for kmem accounting > > here. > > I'm not too convinced. First of all, the overhead added by kmemcg > isn't big. Really? If kmemcg was globally accounted then every __GFP_KMEMCG allocation in the page allocator potentially ends up down in __memcg_kmem_newpage_charge which 1. takes RCU read lock 2. looks up cgroup from task 3. takes a reference count 4. memcg_charge_kmem -> __mem_cgroup_try_charge 5. release reference count That's a *LOT* of work to incur for cgroups that do not care about kernel accounting. This is why I thought it was reasonable that the kmem accounting not be global. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
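Mel's five steps map onto code roughly as follows. This is a hedged sketch with simplified signatures; the series' real function takes a struct mem_cgroup ** and handles root-group and GFP details this sketch omits:

static bool __memcg_kmem_newpage_charge(gfp_t gfp, int order)
{
        struct mem_cgroup *memcg;
        int ret;

        rcu_read_lock();                          /* 1. takes RCU read lock */
        memcg = mem_cgroup_from_task(current);    /* 2. looks up cgroup from task */
        if (!memcg || !css_tryget(&memcg->css)) { /* 3. takes a reference count */
                rcu_read_unlock();
                return true;                      /* no group to charge: allow */
        }
        rcu_read_unlock();

        /* 4. memcg_charge_kmem -> __mem_cgroup_try_charge */
        ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);

        css_put(&memcg->css);                     /* 5. release reference count */
        return ret == 0;
}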
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 14:43 ` Mel Gorman @ 2012-09-27 14:58 ` Tejun Heo 2012-09-27 18:30 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-27 14:58 UTC (permalink / raw) To: Mel Gorman Cc: Michal Hocko, Glauber Costa, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Mel. On Thu, Sep 27, 2012 at 03:43:07PM +0100, Mel Gorman wrote: > > I'm not too convinced. First of all, the overhead added by kmemcg > > isn't big. > > Really? > > If kmemcg was globally accounted then every __GFP_KMEMCG allocation in > the page allocator potentially ends up down in > __memcg_kmem_newpage_charge which > > 1. takes RCU read lock > 2. looks up cgroup from task > 3. takes a reference count > 4. memcg_charge_kmem -> __mem_cgroup_try_charge > 5. release reference count > > That's a *LOT* of work to incur for cgroups that do not care about kernel > accounting. This is why I thought it was reasonable that the kmem accounting > not be global. But that happens only when pages enter and leave slab and if it still is significant, we can try to further optimize charging. Given that this is only for cases where memcg is already in use and we provide a switch to disable it globally, I really don't think this warrants implementing fully hierarchical configuration. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 14:58 ` Tejun Heo @ 2012-09-27 18:30 ` Glauber Costa 2012-09-30 8:23 ` Tejun Heo 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-27 18:30 UTC (permalink / raw) To: Tejun Heo Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 09/27/2012 06:58 PM, Tejun Heo wrote: > Hello, Mel. > > On Thu, Sep 27, 2012 at 03:43:07PM +0100, Mel Gorman wrote: >>> I'm not too convinced. First of all, the overhead added by kmemcg >>> isn't big. >> >> Really? >> >> If kmemcg was globally accounted then every __GFP_KMEMCG allocation in >> the page allocator potentially ends up down in >> __memcg_kmem_newpage_charge which >> >> 1. takes RCU read lock >> 2. looks up cgroup from task >> 3. takes a reference count >> 4. memcg_charge_kmem -> __mem_cgroup_try_charge >> 5. release reference count >> >> That's a *LOT* of work to incur for cgroups that do not care about kernel >> accounting. This is why I thought it was reasonable that the kmem accounting >> not be global. > > But that happens only when pages enter and leave slab and if it still > is significant, we can try to further optimize charging. Given that > this is only for cases where memcg is already in use and we provide a > switch to disable it globally, I really don't think this warrants > implementing fully hierarchical configuration. > Not totally true. We still have to match every allocation to the right cache, and that is actually our heaviest hit, responsible for the 2-3% we're seeing when this is enabled. It is the kind of path so hot that people frown upon branches being added, so I don't think we'll ever get this close to being free. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 18:30 ` Glauber Costa @ 2012-09-30 8:23 ` Tejun Heo 2012-10-01 8:45 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-30 8:23 UTC (permalink / raw) To: Glauber Costa Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Glauber. On Thu, Sep 27, 2012 at 10:30:36PM +0400, Glauber Costa wrote: > > But that happens only when pages enter and leave slab and if it still > > is significant, we can try to further optimize charging. Given that > > this is only for cases where memcg is already in use and we provide a > > switch to disable it globally, I really don't think this warrants > > implementing fully hierarchical configuration. > > Not totally true. We still have to match every allocation to the right > cache, and that is actually our heaviest hit, responsible for the 2-3% > we're seeing when this is enabled. It is the kind of path so hot that > people frown upon branches being added, so I don't think we'll ever get > this close to being free. Sure, depending on workload, any addition to alloc/free could be noticeable. I don't know. I'll write more about it when replying to Michal's message. BTW, __memcg_kmem_get_cache() does seem a bit heavy. I wonder whether indexing from the cache side would make it cheaper? e.g. something like the following. kmem_cache *__memcg_kmem_get_cache(cachep, gfp) { struct kmem_cache *c; c = cachep->memcg_params->caches[percpu_read(kmemcg_slab_idx)]; if (likely(c)) return c; /* try to create and then fall back to cachep */ } where kmemcg_slab_idx is updated from sched notifier (or maybe add and use current->kmemcg_slab_idx?). You would still need __GFP_* and in_interrupt() tests but current->mm and PF_KTHREAD tests can be rolled into index selection. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-30 8:23 ` Tejun Heo @ 2012-10-01 8:45 ` Glauber Costa 2012-10-03 22:54 ` Tejun Heo 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-01 8:45 UTC (permalink / raw) To: Tejun Heo Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 09/30/2012 12:23 PM, Tejun Heo wrote: > Hello, Glauber. > > On Thu, Sep 27, 2012 at 10:30:36PM +0400, Glauber Costa wrote: >>> But that happens only when pages enter and leave slab and if it still >>> is significant, we can try to further optimize charging. Given that >>> this is only for cases where memcg is already in use and we provide a >>> switch to disable it globally, I really don't think this warrants >>> implementing fully hierarchical configuration. >> >> Not totally true. We still have to match every allocation to the right >> cache, and that is actually our heaviest hit, responsible for the 2-3% >> we're seeing when this is enabled. It is the kind of path so hot that >> people frown upon branches being added, so I don't think we'll ever get >> this close to being free. > > Sure, depending on workload, any addition to alloc/free could be > noticeable. I don't know. I'll write more about it when replying to > Michal's message. BTW, __memcg_kmem_get_cache() does seem a bit > heavy. I wonder whether indexing from the cache side would make it > cheaper? e.g. something like the following. > > kmem_cache *__memcg_kmem_get_cache(cachep, gfp) > { > struct kmem_cache *c; > > c = cachep->memcg_params->caches[percpu_read(kmemcg_slab_idx)]; > if (likely(c)) > return c; > /* try to create and then fall back to cachep */ > } > > where kmemcg_slab_idx is updated from sched notifier (or maybe add and > use current->kmemcg_slab_idx?). You would still need __GFP_* and > in_interrupt() tests but current->mm and PF_KTHREAD tests can be > rolled into index selection. > How big would this array be? There can be a lot more kmem_caches than there are memcgs. That is why it is done from the memcg side. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-10-01 8:45 ` Glauber Costa @ 2012-10-03 22:54 ` Tejun Heo 2012-10-04 11:55 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-10-03 22:54 UTC (permalink / raw) To: Glauber Costa Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Glauber. On Mon, Oct 01, 2012 at 12:45:11PM +0400, Glauber Costa wrote: > > where kmemcg_slab_idx is updated from sched notifier (or maybe add and > > use current->kmemcg_slab_idx?). You would still need __GFP_* and > > in_interrupt() tests but current->mm and PF_KTHREAD tests can be > > rolled into index selection. > > How big would this array be? There can be a lot more kmem_caches than > there are memcgs. That is why it is done from the memcg side. The total number of memcgs is pretty limited due to the ID thing, right? And kmemcg is only applied to a subset of caches. I don't think the array size would be a problem in terms of memory overhead, would it? If so, RCU synchronize and dynamically grow them? Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-10-03 22:54 ` Tejun Heo @ 2012-10-04 11:55 ` Glauber Costa [not found] ` <506D7922.1050108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-04 11:55 UTC (permalink / raw) To: Tejun Heo Cc: Mel Gorman, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner On 10/04/2012 02:54 AM, Tejun Heo wrote: > Hello, Glauber. > > On Mon, Oct 01, 2012 at 12:45:11PM +0400, Glauber Costa wrote: >>> where kmemcg_slab_idx is updated from sched notifier (or maybe add and >>> use current->kmemcg_slab_idx?). You would still need __GFP_* and >>> in_interrupt() tests but current->mm and PF_KTHREAD tests can be >>> rolled into index selection. >> >> How big would this array be? There can be a lot more kmem_caches than >> there are memcgs. That is why it is done from the memcg side. > > The total number of memcgs is pretty limited due to the ID thing, > right? And kmemcg is only applied to a subset of caches. I don't think > the array size would be a problem in terms of memory overhead, would > it? If so, RCU synchronize and dynamically grow them? > > Thanks. > I don't want to assume the number of memcgs will always be that limited. Sure, the ID limitation sounds like a pretty big one, but people doing VMs usually want to stack as many VMs as they possibly can in an environment, and the fewer things preventing that from happening, the better. That said, now that I've experimented with this a bit, indexing from the cache may have some advantages: it can get too complicated to propagate new caches appearing to all memcgs that are already in flight. We don't have this problem from the cache side, because instances of it are guaranteed not to exist at this point by definition. I don't want to bloat unrelated kmem_cache structures, so I can't embed a memcg array in there: I would have to have a pointer to a memcg array that gets assigned at first use. But if we don't want to have a static number, as you and Christoph already frowned upon heavily, we may have to do that on the memcg side as well. The array gets bigger, though, because it pretty much has to be enough to accommodate all css_ids. Even now, they are more than the 400 I used in this patchset. Not allocating all of them at once will lead to more complication and pointer chasing in here. I'll take a look at the alternatives today and tomorrow. ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <506D7922.1050108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <506D7922.1050108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-06 2:19 ` Tejun Heo 0 siblings, 0 replies; 127+ messages in thread From: Tejun Heo @ 2012-10-06 2:19 UTC (permalink / raw) To: Glauber Costa Cc: Mel Gorman, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Johannes Weiner Hello, Glauber. On Thu, Oct 04, 2012 at 03:55:14PM +0400, Glauber Costa wrote: > I don't want to bloat unrelated kmem_cache structures, so I can't embed > a memcg array in there: I would have to have a pointer to a memcg array > that gets assigned at first use. But if we don't want to have a static > number, as you and Christoph already frowned upon heavily, we may have > to do that on the memcg side as well. > > The array gets bigger, though, because it pretty much has to be enough > to accommodate all css_ids. Even now, they are more than the 400 I used > in this patchset. Not allocating all of them at once will lead to more > complication and pointer chasing in here. I don't think it would require more pointer chasing. At the simplest, we can just compare the array size each time. If you wanna be more efficient, all arrays can be kept at the same size and resized when the number of memcgs crosses the current size. The only runtime overhead would be one pointer deref which I don't think can be avoided regardless of the indexing direction. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
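A hedged sketch of the cache-side lookup being discussed, assuming an illustrative memcg_cache_array type (neither the struct nor the function names here are the series' code):

struct memcg_cache_array {
        int size;                       /* grows as memcgs are created */
        struct kmem_cache *caches[];    /* indexed by memcg/css ID */
};

static struct kmem_cache *memcg_cache_lookup(struct kmem_cache *cachep, int id)
{
        struct memcg_cache_array *arr;
        struct kmem_cache *c = NULL;

        rcu_read_lock();
        arr = rcu_dereference(cachep->memcg_params);
        if (arr && id < arr->size)      /* "compare the array size each time" */
                c = arr->caches[id];
        rcu_read_unlock();
        return c;   /* NULL: create the per-memcg cache, fall back to cachep */
}

Growing the array would then be the classic RCU pattern: allocate a larger copy, publish it with rcu_assign_pointer(), and free the old one after synchronize_rcu() - the "RCU synchronize and dynamically grow them" part above.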
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 14:33 ` Tejun Heo [not found] ` <20120927143300.GA4251-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-09-27 15:09 ` Michal Hocko 2012-09-30 8:47 ` Tejun Heo 1 sibling, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-09-27 15:09 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Thu 27-09-12 07:33:00, Tejun Heo wrote: > Hello, Michal. > > On Thu, Sep 27, 2012 at 02:08:06PM +0200, Michal Hocko wrote: > > Yes, because we have many users (basically almost all) who care only > > about the user memory because that's what occupies the vast majority of > > the memory. They usually want to isolate workload which would disrupt > > the global memory otherwise (e.g. backup process vs. database). You > > really do not want to pay an additional overhead for kmem accounting > > here. > > I'm not too convinced. First of all, the overhead added by kmemcg > isn't big. You are probably talking about memory overhead which is indeed not that big (except for possible side effects such as fragmentation, which you mention below). But the runtime overhead is present, as far as I understand from what Glauber said. But, on the other hand, it is fair to say that those who _want_ to use the feature should pay for it. > The hot path overhead is quite minimal - it doesn't do much more than > indirecting one more time. In terms of memory usage, it sure could > lead to a bit more fragmentation but even if it gets to several megs > per cgroup, I don't think that's something excessive. So, there is > overhead but I don't believe it to be prohibitive. Remember that users do not want to pay even "something minimal" when the feature is not needed. > > > So your question for global vs local switch (that again, doesn't > > > exist; only a local *limit* exists) should really be posed in the > > > following way: "Can two different use cases with different needs be > > > hosted in the same box?" > > > > I think this is a good and a relevant question. I think this boils down > > to whether you want to have trusted and untrusted workloads at the same > > machine. > > Trusted loads usually only need user memory accounting because kmem > > consumption should be really negligible (unless kernel is doing > > something really stupid and no kmem limit will help here). > > On the other hand, untrusted workloads can do nasty things that > > administrator has hard time to mitigate and setting a kmem limit can > > help significantly. > > > > IMHO such different loads exist on a single machine quite often (Web > > server and a back up process as the most simplistic one). The per > > hierarchy accounting, therefore, sounds like a good idea without too > > much added complexity (actually the only added complexity is in the > > proper kmem.limit_in_bytes handling which is a single place). > > The distinction between "trusted" and "untrusted" is something > artificially created due to the assumed deficiency of kmemcg > implementation. Not really. It doesn't have anything to do with the overhead (be it memory or runtime). It really boils down to "do I need/want it at all". Why would I want to think about how much kernel memory is in use in the first place? Or do you think that user memory accounting should be deprecated?
> Making things like this visible to userland is a bad > idea because it locks us into a place where we can't or don't need to > improve the said deficiencies and end up pushing the difficult > problems to somewhere else where it will likely be implemented in a > shabbier way. There sure are cases when such an approach simply cannot > be avoided, but I really don't think that's the case here - the > overhead already seems to be at an acceptable level and we're not > taking away the escape switch. > > This is userland-visible API. I am not sure which API-visible part you have in mind but kmem.limit_in_bytes will be there whether we go with a global knob or a "no limit, no accounting" approach. > We better err on the side of being conservative than going overboard > with flexibility. Even if we eventually need to make this switching > fully hierarchical, we really should be doing, > > 1. Implement simple global switching and look for problem cases. > > 2. Analyze them and see whether the problem case can't be solved in a > better, more intelligent way. > > 3. If the problem is something structurally inherent or reasonably too > difficult to solve any other way, consider dumping the problem as > config parameters to userland. > > We can always expand the flexibility. Let's do the simple thing > first. As an added bonus, it would enable using static_keys for > accounting branches too. While I do agree with you in general and being careful is in order in this area, as time has shown several times, this seems to be too restrictive in this particular case. We would save almost no code with the global knob so I am not sure what we are actually saving here. A global knob will just give us all or nothing semantics without making the whole thing simpler. You will stick with static branches and checks whether the group is accountable anyway, right? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
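The "single place" Michal refers to is the kmem.limit_in_bytes write handler. A hedged sketch of the set-accounted activation it would contain, approximating the series and its keep-enabled-once-activated behavior (the helper and lock names are assumptions):

static int memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
{
        int ret;

        mutex_lock(&set_limit_mutex);
        ret = res_counter_set_limit(&memcg->kmem, val);
        if (!ret && !memcg_kmem_is_active(memcg)) {
                /* first limit ever set: flip accounting on, permanently */
                static_key_slow_inc(&memcg_kmem_enabled_key);
                memcg_kmem_set_active(memcg);
        }
        mutex_unlock(&set_limit_mutex);
        return ret;
}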
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-27 15:09 ` Michal Hocko @ 2012-09-30 8:47 ` Tejun Heo 2012-10-01 9:27 ` Michal Hocko 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-30 8:47 UTC (permalink / raw) To: Michal Hocko Cc: Glauber Costa, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hello, Michal. On Thu, Sep 27, 2012 at 05:09:50PM +0200, Michal Hocko wrote: > On Thu 27-09-12 07:33:00, Tejun Heo wrote: > > I'm not too convinced. First of all, the overhead added by kmemcg > > isn't big. > > You are probably talking about memory overhead which is indeed not that > big (except for possible side effects such as fragmentation, which you mention > below). But the runtime overhead is present, as far as I understand from > what Glauber said. But, on the other hand, it is fair to say that those > who _want_ to use the feature should pay for it. Yeah, as long as the overhead is reasonable and it doesn't affect non-users, I think we should put more emphasis on simplicity. cgroup is pretty hairy (from both implementation and interface POVs) to begin with. Unfortunately, what's reasonable or how much more emphasis varies widely depending on who one asks. > > The hot path overhead is quite minimal - it doesn't do much more than > > indirecting one more time. In terms of memory usage, it sure could > > lead to a bit more fragmentation but even if it gets to several megs > > per cgroup, I don't think that's something excessive. So, there is > > overhead but I don't believe it to be prohibitive. > > Remember that users do not want to pay even "something minimal" when the > feature is not needed. Yeah but, if we can get it down to, say, around 1% under most workloads for memcg users, it is quite questionable to introduce full hierarchical configuration to allow avoiding that, isn't it? > > The distinction between "trusted" and "untrusted" is something > > artificially created due to the assumed deficiency of kmemcg > > implementation. > > Not really. It doesn't have anything to do with the overhead (be it > memory or runtime). It really boils down to "do I need/want it at all". > Why would I want to think about how much kernel memory is in use in the > first place? Or do you think that user memory accounting should be > deprecated? But you can apply the same "do I need/want it at all" question to the configuration parameter too. I can see your point but the decision seems muddy to me, and if muddy, I prefer to err on the side of being too conservative. > > This is userland-visible API. > > I am not sure which API-visible part you have in mind but > kmem.limit_in_bytes will be there whether we go with a global knob or a "no > limit, no accounting" approach. I mean full hierarchical configuration of it. It becomes something which each memcg user cares about instead of something which the base system / admin flips on system boot. > > We can always expand the flexibility. Let's do the simple thing > > first. As an added bonus, it would enable using static_keys for > > accounting branches too. > > While I do agree with you in general and being careful is in order in > this area, as time has shown several times, this seems to be too restrictive > in this particular case. > We would save almost no code with the global knob so I am not sure > what we are actually saving here. A global knob will just give us all or > nothing semantics without making the whole thing simpler.
> You will stick > with static branches and checks whether the group is accountable anyway, > right? The thing is, about the same argument can be made about .use_hierarchy too. It doesn't necessarily make the code much hairier. Especially because the code is structured with that feature in mind, removing .use_hierarchy might not remove a whole lot of code; however, the wider range of behavior which got exposed through that poses a much larger problem when we try to make modifications on related behaviors. We get a lot more locked down by seemingly not too much code and our long term maintainability / sustainability suffers as a result. I'm not trying to say this is as bad as .use_hierarchy but want to point out that memcg and cgroup in general have had a pretty strong tendency to choose overly flexible and complex designs and interfaces and it's probably about time we become more careful especially about stuff which is visible to userland. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-30 8:47 ` Tejun Heo @ 2012-10-01 9:27 ` Michal Hocko [not found] ` <20121001092756.GA8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 9:27 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hi, On Sun 30-09-12 17:47:50, Tejun Heo wrote: [...] > > > The hot path overhead is quite minimal - it doesn't do much more than > > > indirecting one more time. In terms of memory usage, it sure could > > > lead to a bit more fragmentation but even if it gets to several megs > > > per cgroup, I don't think that's something excessive. So, there is > > > overhead but I don't believe it to be prohibitive. > > > > Remember that users do not want to pay even "something minimal" when the > > feature is not needed. > > Yeah but, if we can get it down to, say, around 1% under most > workloads for memcg users, it is quite questionable to introduce full > hierarchical configuration to allow avoiding that, isn't it? Remember that the kmem memory is still accounted to u+k if it is enabled, which could be a no-go because some workloads (I have provided an example that those which are trusted are generally safe to ignore kernel memory overhead) simply don't want to consider additional memory which is mostly invisible to them. > > > The distinction between "trusted" and "untrusted" is something > > > artificially created due to the assumed deficiency of kmemcg > > > implementation. > > > > Not really. It doesn't have anything to do with the overhead (be it > > memory or runtime). It really boils down to "do I need/want it at all". > > Why would I want to think about how much kernel memory is in use in the > > first place? Or do you think that user memory accounting should be > > deprecated? > > But you can apply the same "do I need/want it at all" question to the > configuration parameter too. Yes but, as I've said, the global configuration parameter is too coarse. You can have a mix of trusted and untrusted workloads on the same machine (e.g. a web server, which is inherently untrusted, and trusted local batch jobs which just need special LRU aging). > I can see your point but the decision seems muddy to me, and if muddy, > I prefer to err on the side of being too conservative. > > > > This is userland-visible API. > > > > I am not sure which API-visible part you have in mind but > > kmem.limit_in_bytes will be there whether we go with a global knob or a "no > > limit, no accounting" approach. > > I mean full hierarchical configuration of it. It becomes something > which each memcg user cares about instead of something which the base > system / admin flips on system boot. > > > > We can always expand the flexibility. Let's do the simple thing > > > first. As an added bonus, it would enable using static_keys for > > > accounting branches too. > > > > While I do agree with you in general and being careful is in order in > > this area, as time has shown several times, this seems to be too restrictive > > in this particular case. > > We would save almost no code with the global knob so I am not sure > > what we are actually saving here. A global knob will just give us all or > > nothing semantics without making the whole thing simpler. You will stick > > with static branches and checks whether the group is accountable anyway, > > right?
> > The thing is, about the same argument can be made about .use_hierarchy > too. It doesn't necessarily make the code much hairier. Especially > because the code is structured with that feature in mind, removing > .use_hierarchy might not remove a whole lot of code; however, the wider > range of behavior which got exposed through that poses a much larger > problem when we try to make modifications on related behaviors. We > get a lot more locked down by seemingly not too much code and our long > term maintainability / sustainability suffers as a result. I think that comparing kmem accounting with use_hierarchy is not fair. Glauber tried to explain why already so I will not repeat it here. I will just mention one thing. use_hierarchy was introduced because hierarchies were expensive at the time. kmem accounting is about whether we should do u or u+k accounting. So there is a crucial difference. > I'm not trying to say this is as bad as .use_hierarchy but want to > point out that memcg and cgroup in general have had a pretty strong > tendency to choose overly flexible and complex designs and interfaces > and it's probably about time we become more careful especially about > stuff which is visible to userland. That is right but I think that the current discussion shows that mixed (kmem disabled and kmem enabled hierarchies) workloads are far from being theoretical and a global knob is just too coarse. I am afraid we will see "we want that per hierarchy" requests shortly and that would just add a new confusion where a global knob would complicate it considerably (do we really want an on/off/per_hierarchy global knob?). -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <20121001092756.GA8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <20121001092756.GA8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-10-03 22:43 ` Tejun Heo 2012-10-05 13:47 ` Michal Hocko 0 siblings, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-10-03 22:43 UTC (permalink / raw) To: Michal Hocko Cc: Glauber Costa, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner Hey, Michal. On Mon, Oct 01, 2012 at 11:27:56AM +0200, Michal Hocko wrote: > > Yeah but, if we can get it down to, say, around 1% under most > > workloads for memcg users, it is quite questionable to introduce full > > hierarchical configuration to allow avoiding that, isn't it? > > Remember that the kmem memory is still accounted to u+k if it is enabled, > which could be a no-go because some workloads (I have provided an > example that those which are trusted are generally safe to ignore kernel > memory overhead) simply don't want to consider additional memory which > is mostly invisible to them. Maybe it's because my exposure to cgroup usage is different from yours but the argument that not accounting kernel memory is something inherently beneficial is lost on me. For compatibility, overhead and/or implementation complexity issues, yeah, sure, we can't always (well more like usually) have all we want but I don't get how not accounting kernel memory is something inherently necessary or beneficial. This is all about provisioning physical memory to different groups of users and memory is memory (that's why u+k makes sense, right?). Without kmemcg enabled, we not only lack a way to control kernel memory usage but also a way to even watch per-group usages. > > But you can apply the same "do I need/want it at all" question to the > > configuration parameter too. > > Yes but, as I've said, the global configuration parameter is too > coarse. You can have a mix of trusted and untrusted workloads on the > same machine (e.g. a web server, which is inherently untrusted, and > trusted local batch jobs which just need special LRU aging). This too stems from the same difference stated above. You think there's an inherent distinction between trusted and untrusted workloads and they need different features from the kernel while I think why trust anyone if you can untrust everyone and consider the knob as a compatibility thing. > I think that comparing kmem accounting with use_hierarchy is not fair. > Glauber tried to explain why already so I will not repeat it here. > I will just mention one thing. use_hierarchy was introduced because > hierarchies were expensive at the time. kmem accounting is about whether > we should do u or u+k accounting. So there is a crucial difference. It may be less crazy but I think there are enough commonalities to at least make a comparison. Mel seems to think it's mostly about performance overhead. You think that not accounting kmem is something inherently necessary. > That is right but I think that the current discussion shows that mixed > (kmem disabled and kmem enabled hierarchies) workloads are far from > being theoretical and a global knob is just too coarse. I am afraid we I'm not sure there's much evidence in this thread. The strongest up to this point seems to be performance overhead / difficulty of general enough implementation.
As for "trusted" workload, what are the inherent benefits of trusting if you don't have to? > will see "we want that per hierarchy" requests shortly and that would > just add a new confusion where global knob would complicate it > considerably (do we really want on/off/per_hierarchy global knob?). Hmmm? The global knob is just the same per_hierarchy knob at the root. It's hierarchical after all. Anyways, as long as the "we silently ignore what happened before being enabled" is gone, I won't fight this anymore. It isn't broken after all. But, please think about making things simpler in general, cgroup is riddled with mis-designed complexities and memcg seems to be leading the charge at times. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-10-03 22:43 ` Tejun Heo @ 2012-10-05 13:47 ` Michal Hocko 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-10-05 13:47 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Thu 04-10-12 07:43:16, Tejun Heo wrote: [...] > > That is right but I think that the current discussion shows that mixed > > (kmem disabled and kmem enabled hierarchies) workloads are far from > > being theoretical and a global knob is just too coarse. I am afraid we > > I'm not sure there's much evidence in this thread. The strongest up to > this point seems to be performance overhead / difficulty of general > enough implementation. As for "trusted" workloads, what are the > inherent benefits of trusting if you don't have to? One advantage is that you do _not have_ to consider kernel memory allocations (which are inherently bound to the kernel version) so the sizing is much easier and version independent. If you set a limit to XY because you know what you are doing, you certainly do not want to regress (e.g. because of unnecessary reclaim) just because a certain kernel allocation got bigger, right? > > will see "we want that per hierarchy" requests shortly and that would > > just add a new confusion where a global knob would complicate it > > considerably (do we really want an on/off/per_hierarchy global knob?). > > Hmmm? The global knob is just the same per_hierarchy knob at the > root. It's hierarchical after all. When you said global knob, I imagined a mount or boot option. If you want to have yet another memory.enable_kmem then IMHO it is much easier to use the set-accounted semantic (which is hierarchical as well). > Anyways, as long as the "we silently ignore what happened before being > enabled" behavior is gone, I won't fight this anymore. It isn't broken after > all. OK, it is good that we settled this. > But, please think about making things simpler in general; cgroup > is riddled with mis-designed complexities and memcg seems to be > leading the charge at times. Yes, this is an evolution and it seems that we are slowly getting there. > > Thanks. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure [not found] ` <50635F46.7000700-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-26 20:16 ` Tejun Heo @ 2012-09-26 22:11 ` Johannes Weiner 2012-09-26 22:45 ` Glauber Costa 1 sibling, 1 reply; 127+ messages in thread From: Johannes Weiner @ 2012-09-26 22:11 UTC (permalink / raw) To: Glauber Costa Cc: Tejun Heo, Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes On Thu, Sep 27, 2012 at 12:02:14AM +0400, Glauber Costa wrote: > On 09/26/2012 11:56 PM, Tejun Heo wrote: > > Hello, > > > > On Wed, Sep 26, 2012 at 11:46:37PM +0400, Glauber Costa wrote: > >> Besides not being part of cgroup core, and respecting very much both > >> cgroups' and basic sanity properties, kmem is an actual feature that > >> some people want, and some people don't. There is no reason to believe > >> that applications that want will live in the same environment with ones > >> that don't want. > > > > I don't know. It definitely is less crazy than .use_hierarchy but I > > wouldn't say it's an inherently different thing. I mean, what does it > > even mean to have a u+k limit on one subtree and not on another branch? > > And we worry about things like what if parent doesn't enable it but > > its children do. > > > > It is inherently different. To begin with, it actually contemplates two > use cases. It is not a workaround. > > The meaning is also very well defined. The meaning of having this > enabled in one subtree and not in another is: Subtree A wants to track > kernel memory. Subtree B does not. It's that, and never more than that. > There are no maybes and no buts, no magic knobs that make it behave in a > crazy way. > > If a child enables it but the parent does not, this does what every > tree does: enable it from that point downwards. > > > This is a feature which adds complexity. If the feature is necessary > > and justified, sure. If not, let's please not and let's err on the > > side of conservativeness. We can always add it later but the other > > direction is much harder. > > I disagree. Having kmem tracking adds complexity. Having to cope with > the use case where we turn it on dynamically to cope with the "user page > only" use case adds complexity. But I see no significant complexity > being added by having it per subtree. Really. Maybe not in code, but you are adding an extra variable into the system. "One switch per subtree" is more complex than "one switch." Yes, the toggle is hidden behind setting the limit, but it's still a toggle. The use_hierarchy complexity comes not from the file that enables it, but from the resulting semantics. kmem accounting is expensive and we definitely want to allow enabling it separately from traditional user memory accounting. But I think there is no good reason to not demand an all-or-nothing answer from the admin; either he wants kmem tracking on a machine or not. At least you haven't presented a convincing case, IMO. I don't think there is strong/any demand for per-node toggles, but once we add this behavior, people will rely on it and expect kmem tracking to stay local and we are stuck with it. Adding it for the reason that people will use it is a self-fulfilling prophecy. > You have the use_hierarchy fiasco in mind, and I do understand that you > are raising the flag and all that.
> > But think in terms of functionality: This thing here is a lot more > similar to swap than use_hierarchy. Would you argue that memsw should be > per-root? We actually do have a per-root flag that controls accounting for swap. > The reason why it shouldn't: Some people want to limit memory > consumption all the way to the swap, some people don't. Same with kmem. That lies in the nature of the interface: we chose k & u+k rather than u & u+k, so our memory.limit_in_bytes will necessarily include kmem, while swap is not included there. But I really doubt that there is a strong case for turning on swap accounting intentionally and then limiting memory+swap only on certain subtrees. Where would the sense be in that? ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 04/13] kmem accounting basic infrastructure 2012-09-26 22:11 ` Johannes Weiner @ 2012-09-26 22:45 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-09-26 22:45 UTC (permalink / raw) To: Johannes Weiner Cc: Tejun Heo, Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes On 09/27/2012 02:11 AM, Johannes Weiner wrote: > On Thu, Sep 27, 2012 at 12:02:14AM +0400, Glauber Costa wrote: >> On 09/26/2012 11:56 PM, Tejun Heo wrote: >>> Hello, >>> >>> On Wed, Sep 26, 2012 at 11:46:37PM +0400, Glauber Costa wrote: >>>> Besides not being part of cgroup core, and respecting very much both >>>> cgroups' and basic sanity properties, kmem is an actual feature that >>>> some people want, and some people don't. There is no reason to believe >>>> that applications that want will live in the same environment with ones >>>> that don't want. >>> >>> I don't know. It definitely is less crazy than .use_hierarchy but I >>> wouldn't say it's an inherently different thing. I mean, what does it >>> even mean to have a u+k limit on one subtree and not on another branch? >>> And we worry about things like what if parent doesn't enable it but >>> its children do. >>> >> >> It is inherently different. To begin with, it actually contemplates two >> use cases. It is not a workaround. >> >> The meaning is also very well defined. The meaning of having this >> enabled in one subtree and not in another is: Subtree A wants to track >> kernel memory. Subtree B does not. It's that, and never more than that. >> There are no maybes and no buts, no magic knobs that make it behave in a >> crazy way. >> >> If a child enables it but the parent does not, this does what every >> tree does: enable it from that point downwards. >> >>> This is a feature which adds complexity. If the feature is necessary >>> and justified, sure. If not, let's please not and let's err on the >>> side of conservativeness. We can always add it later but the other >>> direction is much harder. >> >> I disagree. Having kmem tracking adds complexity. Having to cope with >> the use case where we turn it on dynamically to cope with the "user page >> only" use case adds complexity. But I see no significant complexity >> being added by having it per subtree. Really. > > Maybe not in code, but you are adding an extra variable into the > system. "One switch per subtree" is more complex than "one switch." > Yes, the toggle is hidden behind setting the limit, but it's still a > toggle. The use_hierarchy complexity comes not from the file that > enables it, but from the resulting semantics. > I didn't claim the complexity was in the code. I actually think the other way around from you, and claim that a global switch is more complex than a per-subtree one. All properties we have so far apply to subtrees, due to cgroup's hierarchical nature. We have no global switches like this so far, and adding one would just add a new concept that wasn't here before. > kmem accounting is expensive and we definitely want to allow enabling > it separately from traditional user memory accounting. But I think > there is no good reason to not demand an all-or-nothing answer from > the admin; either he wants kmem tracking on a machine or not. At > least you haven't presented a convincing case, IMO.
> > I don't think there is strong/any demand for per-node toggles, but > once we add this behavior, people will rely on it and expect kmem > tracking to stay local and we are stuck with it. Adding it for the > reason that people will use it is a self-fulfilling prophecy. I don't think this is a compatibility-only switch. Much has been said in the past about the problem of sharing. A lot of the kernel objects are shared by nature; this is pretty much unavoidable. The answer we have been giving to this inquiry is that the workloads (ours) interested in kmem accounting tend to be quite local in their file accesses (and other kernel objects as well). It should be obvious that not all workloads are like this, and some of them would actually prefer to have their umem limited only. I really don't think, and correct me if I am wrong, that the problem lies in "is there a use case for umem?", but rather whether they should be allowed to coexist in a box. And honestly, it seems to me totally reasonable to avoid restricting people to run as many workloads as they think they can in the same box. >> You have the use_hierarchy fiasco in mind, and I do understand that you >> are raising the flag and all that. >> >> But think in terms of functionality: This thing here is a lot more >> similar to swap than use_hierarchy. Would you argue that memsw should be >> per-root? > > We actually do have a per-root flag that controls accounting for swap. > >> The reason why it shouldn't: Some people want to limit memory >> consumption all the way to the swap, some people don't. Same with kmem. > > That lies in the nature of the interface: we chose k & u+k rather than > u & u+k, so our memory.limit_in_bytes will necessarily include kmem, > while swap is not included there. But I really doubt that there is a > strong case for turning on swap accounting intentionally and then > limiting memory+swap only on certain subtrees. Where would the > sense be in that? It makes absolute sense, because until I go set memory.memsw.limit_in_bytes, my subtree is not limited, which is precisely what kmem does. And the use cases for that are: 1) I, application A, want to use 2G of mem, and I can never swap 2) I, application B, want to use 2G of mem, but I am fine using an extra 1G in swap. There are plenty of workloads in both the "can swap" and "can't swap" categories around. ^ permalink raw reply [flat|nested] 127+ messages in thread
* [PATCH v3 05/13] Add a __GFP_KMEMCG flag [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (3 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 04/13] kmem accounting basic infrastructure Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa 2012-09-18 14:15 ` Rik van Riel ` (3 more replies) 2012-09-18 14:04 ` [PATCH v3 06/13] memcg: kmem controller infrastructure Glauber Costa ` (7 subsequent siblings) 12 siblings, 4 replies; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner, Rik van Riel, Mel Gorman This flag is used to indicate to the callees that this allocation is a kernel allocation in process context, and should be accounted to current's memcg. It takes the numerical place of the recently removed __GFP_NO_KSWAPD. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> CC: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> CC: Mel Gorman <mel-wPRd99KPJ+uzQB+pC5nmwQ@public.gmane.org> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> --- include/linux/gfp.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index f9bc873..d8eae4d 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,11 @@ struct vm_area_struct; #else #define ___GFP_NOTRACK 0 #endif +#ifdef CONFIG_MEMCG_KMEM +#define ___GFP_KMEMCG 0x400000u +#else +#define ___GFP_KMEMCG 0 +#endif #define ___GFP_OTHER_NODE 0x800000u #define ___GFP_WRITE 0x1000000u @@ -91,7 +96,7 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ - +#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ /* * This may seem redundant, but it's a way of annotating false positives vs. * allocations that simply cannot be supported (e.g. page tables). -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
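As a usage illustration (hedged: this only loosely mirrors the series' later fork-bomb patch, and the call site shown is a simplification of kernel/fork.c of the time), a caller tags an allocation with __GFP_KMEMCG so the pages get charged to the current task's memcg; the matching free must then go through the accounted free path the series adds elsewhere:

static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                  int node)
{
        /* stack pages become a countable resource: a fork bomb hits the limit */
        struct page *page = alloc_pages_node(node,
                                             THREADINFO_GFP | __GFP_KMEMCG,
                                             THREAD_SIZE_ORDER);

        return page ? page_address(page) : NULL;
}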
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag 2012-09-18 14:04 ` [PATCH v3 05/13] Add a __GFP_KMEMCG flag Glauber Costa @ 2012-09-18 14:15 ` Rik van Riel [not found] ` <1347977050-29476-6-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (2 subsequent siblings) 3 siblings, 0 replies; 127+ messages in thread From: Rik van Riel @ 2012-09-18 14:15 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner, Mel Gorman On 09/18/2012 10:04 AM, Glauber Costa wrote: > This flag is used to indicate to the callees that this allocation is a > kernel allocation in process context, and should be accounted to > current's memcg. It takes the numerical place of the recently removed > __GFP_NO_KSWAPD. > > Signed-off-by: Glauber Costa <glommer@parallels.com> > CC: Christoph Lameter <cl@linux.com> > CC: Pekka Enberg <penberg@cs.helsinki.fi> > CC: Michal Hocko <mhocko@suse.cz> > CC: Johannes Weiner <hannes@cmpxchg.org> > CC: Suleiman Souhlal <suleiman@google.com> > CC: Rik van Riel <riel@redhat.com> > CC: Mel Gorman <mel@csn.ul.ie> > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <1347977050-29476-6-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag [not found] ` <1347977050-29476-6-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-18 15:06 ` Christoph Lameter [not found] ` <00000139d9ea69c6-109249c2-5176-4a1e-b000-4c076d05844d-000000-p/GC64/jrecnJqMo6gzdpkEOCMrvLtNR@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Christoph Lameter @ 2012-09-18 15:06 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Pekka Enberg, Michal Hocko, Johannes Weiner, Rik van Riel, Mel Gorman On Tue, 18 Sep 2012, Glauber Costa wrote: > +++ b/include/linux/gfp.h > @@ -35,6 +35,11 @@ struct vm_area_struct; > #else > #define ___GFP_NOTRACK 0 > #endif > +#ifdef CONFIG_MEMCG_KMEM > +#define ___GFP_KMEMCG 0x400000u > +#else > +#define ___GFP_KMEMCG 0 > +#endif Could you leave __GFP_MEMCG a simple definition and then define GFP_MEMCG to be zero if !MEMCG_KMEM? I think that would be cleaner and the __GFP_KMEMCHECK conditional is another case that would be good to fix up. ^ permalink raw reply [flat|nested] 127+ messages in thread
[parent not found: <00000139d9ea69c6-109249c2-5176-4a1e-b000-4c076d05844d-000000-p/GC64/jrecnJqMo6gzdpkEOCMrvLtNR@public.gmane.org>]
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag [not found] ` <00000139d9ea69c6-109249c2-5176-4a1e-b000-4c076d05844d-000000-p/GC64/jrecnJqMo6gzdpkEOCMrvLtNR@public.gmane.org> @ 2012-09-19 7:39 ` Glauber Costa [not found] ` <505976B5.6090801-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-19 7:39 UTC (permalink / raw) To: Christoph Lameter Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Pekka Enberg, Michal Hocko, Johannes Weiner, Rik van Riel, Mel Gorman On 09/18/2012 07:06 PM, Christoph Lameter wrote: > On Tue, 18 Sep 2012, Glauber Costa wrote: > >> +++ b/include/linux/gfp.h >> @@ -35,6 +35,11 @@ struct vm_area_struct; >> #else >> #define ___GFP_NOTRACK 0 >> #endif >> +#ifdef CONFIG_MEMCG_KMEM >> +#define ___GFP_KMEMCG 0x400000u >> +#else >> +#define ___GFP_KMEMCG 0 >> +#endif > > Could you leave __GFP_MEMCG a simple definition and then define GFP_MEMCG > to be zero if !MEMCG_KMEM? I think that would be cleaner and the > __GFP_KMEMCHECK conditional is another case that would be good to fix up. > > > I can, but what does this buy us? Also, in any case, this can be done incrementally, and for the other flag as well, as you describe. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag
  [not found] ` <505976B5.6090801-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-09-19 14:07 ` Christoph Lameter
  0 siblings, 0 replies; 127+ messages in thread
From: Christoph Lameter @ 2012-09-19 14:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A,
	Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal,
	Frederic Weisbecker, Mel Gorman, David Rientjes, Pekka Enberg,
	Michal Hocko, Johannes Weiner, Rik van Riel, Mel Gorman

On Wed, 19 Sep 2012, Glauber Costa wrote:

> On 09/18/2012 07:06 PM, Christoph Lameter wrote:
> > On Tue, 18 Sep 2012, Glauber Costa wrote:
> >
> >> +++ b/include/linux/gfp.h
> >> @@ -35,6 +35,11 @@ struct vm_area_struct;
> >>  #else
> >>  #define ___GFP_NOTRACK		0
> >>  #endif
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +#define ___GFP_KMEMCG		0x400000u
> >> +#else
> >> +#define ___GFP_KMEMCG		0
> >> +#endif
> >
> > Could you leave ___GFP_KMEMCG a simple definition and then define
> > __GFP_KMEMCG to be zero if !MEMCG_KMEM? I think that would be cleaner,
> > and the kmemcheck flag (___GFP_NOTRACK) is another case that would be
> > good to fix up.
>
> I can, but what does this buy us?

All the numeric ___GFP_* values should be defined unconditionally so
that they can be used in future contexts. Note the comment above the
___GFP_* definitions, which says "Do not use this directly".

> Also, in any case, this can be done incrementally, and for the other
> flag as well, as you describe.

There is only one other flag that does not follow the scheme. I'd
appreciate it if you could submit a patch to fix up the __GFP_NOTRACK
conditional there. There is no need to do this incrementally. Do it
the right way immediately.

^ permalink raw reply	[flat|nested] 127+ messages in thread
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag
  2012-09-18 14:04 ` [PATCH v3 05/13] Add a __GFP_KMEMCG flag Glauber Costa
  2012-09-18 14:15 ` Rik van Riel
  [not found] ` <1347977050-29476-6-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-09-27 13:34 ` Mel Gorman
  2012-09-27 13:41 ` Glauber Costa
  2012-10-01 19:09 ` Johannes Weiner
  3 siblings, 1 reply; 127+ messages in thread
From: Mel Gorman @ 2012-09-27 13:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
	Suleiman Souhlal, Frederic Weisbecker, David Rientjes,
	Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner,
	Rik van Riel, Mel Gorman

On Tue, Sep 18, 2012 at 06:04:02PM +0400, Glauber Costa wrote:
> This flag is used to indicate to the callees that this allocation is a
> kernel allocation in process context, and should be accounted to
> current's memcg. It takes the numerical place of the recently removed
> __GFP_NO_KSWAPD.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I agree with Christoph's recommendation that this flag always exist
instead of being 0 in the !MEMCG_KMEM case. If __GFP_KMEMCG is ever used
in another part of the VM (which would be unexpected, but still) then the
behaviour might differ too much between the MEMCG_KMEM and !MEMCG_KMEM
cases. As unlikely as the case is, it's not impossible.

For tracing, __GFP_KMEMCG should have an entry in
include/trace/events/gfpflags.h.

Get rid of the CONFIG_MEMCG_KMEM check, update
include/trace/events/gfpflags.h, and then feel free to stick my Acked-by
on it.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 127+ messages in thread
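The tracing hookup Mel asks for would amount to one more entry in the
show_gfp_flags() table of include/trace/events/gfpflags.h, alongside the
existing flags. A sketch following that file's style at the time (not
part of the posted patch):

	{(unsigned long)__GFP_KMEMCG,		"GFP_KMEMCG"},		\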
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag
  2012-09-27 13:34 ` Mel Gorman
@ 2012-09-27 13:41 ` Glauber Costa
  0 siblings, 0 replies; 127+ messages in thread
From: Glauber Costa @ 2012-09-27 13:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
	Suleiman Souhlal, Frederic Weisbecker, David Rientjes,
	Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner,
	Rik van Riel, Mel Gorman

On 09/27/2012 05:34 PM, Mel Gorman wrote:
> On Tue, Sep 18, 2012 at 06:04:02PM +0400, Glauber Costa wrote:
>> This flag is used to indicate to the callees that this allocation is a
>> kernel allocation in process context, and should be accounted to
>> current's memcg. It takes the numerical place of the recently removed
>> __GFP_NO_KSWAPD.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Christoph Lameter <cl@linux.com>
>> CC: Pekka Enberg <penberg@cs.helsinki.fi>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Suleiman Souhlal <suleiman@google.com>
>> CC: Rik van Riel <riel@redhat.com>
>> CC: Mel Gorman <mel@csn.ul.ie>
>> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> I agree with Christoph's recommendation that this flag always exist
> instead of being 0 in the !MEMCG_KMEM case. If __GFP_KMEMCG is ever used
> in another part of the VM (which would be unexpected, but still) then the
> behaviour might differ too much between the MEMCG_KMEM and !MEMCG_KMEM
> cases. As unlikely as the case is, it's not impossible.
>
> For tracing, __GFP_KMEMCG should have an entry in
> include/trace/events/gfpflags.h.
>
> Get rid of the CONFIG_MEMCG_KMEM check, update
> include/trace/events/gfpflags.h, and then feel free to stick my Acked-by
> on it.

Thanks, that is certainly doable.

^ permalink raw reply	[flat|nested] 127+ messages in thread
* Re: [PATCH v3 05/13] Add a __GFP_KMEMCG flag
  2012-09-18 14:04 ` [PATCH v3 05/13] Add a __GFP_KMEMCG flag Glauber Costa
  ` (2 preceding siblings ...)
  2012-09-27 13:34 ` Mel Gorman
@ 2012-10-01 19:09 ` Johannes Weiner
  3 siblings, 0 replies; 127+ messages in thread
From: Johannes Weiner @ 2012-10-01 19:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
	Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
	Christoph Lameter, Pekka Enberg, Michal Hocko, Rik van Riel,
	Mel Gorman

On Tue, Sep 18, 2012 at 06:04:02PM +0400, Glauber Costa wrote:
> This flag is used to indicate to the callees that this allocation is a
> kernel allocation in process context, and should be accounted to
> current's memcg. It takes the numerical place of the recently removed
> __GFP_NO_KSWAPD.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Suleiman Souhlal <suleiman@google.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

With the feedback from Christoph and Mel incorporated:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 127+ messages in thread
* [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (4 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 05/13] Add a __GFP_KMEMCG flag Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa [not found] ` <1347977050-29476-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-18 14:04 ` [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg Glauber Costa ` (6 subsequent siblings) 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner This patch introduces infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the flag __GFP_KMEMCG flag, and the task belong to a memcg other than the root. In memcontrol.h those functions are wrapped in inline acessors. The idea is to later on, patch those with static branches, so we don't incur any overhead when no mem cgroups with limited kmem are being used. [ v2: improved comments and standardized function names ] [ v3: handle no longer opaque, functions not exported, even more comments ] [ v4: reworked Used bit handling and surroundings for more clarity ] Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> --- include/linux/memcontrol.h | 97 +++++++++++++++++++++++++ mm/memcontrol.c | 177 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 274 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 8d9489f..82ede9a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -21,6 +21,7 @@ #define _LINUX_MEMCONTROL_H #include <linux/cgroup.h> #include <linux/vm_event_item.h> +#include <linux/hardirq.h> struct mem_cgroup; struct page_cgroup; @@ -399,6 +400,17 @@ struct sock; #ifdef CONFIG_MEMCG_KMEM void sock_update_memcg(struct sock *sk); void sock_release_memcg(struct sock *sk); + +static inline bool memcg_kmem_enabled(void) +{ + return true; +} + +extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, + int order); +extern void __memcg_kmem_commit_charge(struct page *page, + struct mem_cgroup *memcg, int order); +extern void __memcg_kmem_uncharge_page(struct page *page, int order); #else static inline void sock_update_memcg(struct sock *sk) { @@ -406,6 +418,91 @@ static inline void sock_update_memcg(struct sock *sk) static inline void sock_release_memcg(struct sock *sk) { } + +static inline bool memcg_kmem_enabled(void) +{ + return false; +} + +static inline bool +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) +{ + return false; +} + +static inline void __memcg_kmem_uncharge_page(struct page *page, int order) +{ +} + +static inline void +__memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) +{ 
+} #endif /* CONFIG_MEMCG_KMEM */ + +/** + * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed. + * @gfp: the gfp allocation flags. + * @memcg: a pointer to the memcg this was charged against. + * @order: allocation order. + * + * returns true if the memcg where the current task belongs can hold this + * allocation. + * + * We return true automatically if this allocation is not to be accounted to + * any memcg. + */ +static __always_inline bool +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) +{ + if (!memcg_kmem_enabled()) + return true; + + /* + * __GFP_NOFAIL allocations will move on even if charging is not + * possible. Therefore we don't even try, and have this allocation + * unaccounted. We could in theory charge it with + * res_counter_charge_nofail, but we hope those allocations are rare, + * and won't be worth the trouble. + */ + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL)) + return true; + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) + return true; + return __memcg_kmem_newpage_charge(gfp, memcg, order); +} + +/** + * memcg_kmem_uncharge_page: uncharge pages from memcg + * @page: pointer to struct page being freed + * @order: allocation order. + * + * there is no need to specify memcg here, since it is embedded in page_cgroup + */ +static __always_inline void +memcg_kmem_uncharge_page(struct page *page, int order) +{ + if (memcg_kmem_enabled()) + __memcg_kmem_uncharge_page(page, order); +} + +/** + * memcg_kmem_commit_charge: embeds correct memcg in a page + * @memcg: a pointer to the memcg this was charged against. + * @page: pointer to struct page recently allocated + * @memcg: the memcg structure we charged against + * @order: allocation order. + * + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or + * failure of the allocation. if @page is NULL, this function will revert the + * charges. Otherwise, it will commit the memcg given by @memcg to the + * corresponding page_cgroup. + */ +static __always_inline void +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) +{ + if (memcg_kmem_enabled() && memcg) + __memcg_kmem_commit_charge(page, memcg, order); +} #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f3fd354..0f36a01 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -10,6 +10,10 @@ * Copyright (C) 2009 Nokia Corporation * Author: Kirill A. Shutemov * + * Kernel Memory Controller + * Copyright (C) 2012 Parallels Inc. and Google Inc. 
+ * Authors: Glauber Costa and Suleiman Souhlal + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -426,6 +430,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) #include <net/ip.h> static bool mem_cgroup_is_root(struct mem_cgroup *memcg); +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size); +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size); + void sock_update_memcg(struct sock *sk) { if (mem_cgroup_sockets_enabled) { @@ -480,6 +487,110 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) } EXPORT_SYMBOL(tcp_proto_cgroup); #endif /* CONFIG_INET */ + +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg) +{ + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) && + memcg->kmem_accounted; +} + +/* + * We need to verify if the allocation against current->mm->owner's memcg is + * possible for the given order. But the page is not allocated yet, so we'll + * need a further commit step to do the final arrangements. + * + * It is possible for the task to switch cgroups in this mean time, so at + * commit time, we can't rely on task conversion any longer. We'll then use + * the handle argument to return to the caller which cgroup we should commit + * against. We could also return the memcg directly and avoid the pointer + * passing, but a boolean return value gives better semantics considering + * the compiled-out case as well. + * + * Returning true means the allocation is possible. + */ +bool +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) +{ + struct mem_cgroup *memcg; + bool ret; + struct task_struct *p; + + *_memcg = NULL; + rcu_read_lock(); + p = rcu_dereference(current->mm->owner); + memcg = mem_cgroup_from_task(p); + rcu_read_unlock(); + + if (!memcg_can_account_kmem(memcg)) + return true; + + mem_cgroup_get(memcg); + + ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0; + if (ret) + *_memcg = memcg; + else + mem_cgroup_put(memcg); + + return ret; +} + +void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, + int order) +{ + struct page_cgroup *pc; + + WARN_ON(mem_cgroup_is_root(memcg)); + + /* The page allocation failed. Revert */ + if (!page) { + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); + return; + } + + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + pc->mem_cgroup = memcg; + SetPageCgroupUsed(pc); + unlock_page_cgroup(pc); +} + +void __memcg_kmem_uncharge_page(struct page *page, int order) +{ + struct mem_cgroup *memcg = NULL; + struct page_cgroup *pc; + + + pc = lookup_page_cgroup(page); + /* + * Fast unlocked return. Theoretically might have changed, have to + * check again after locking. + */ + if (!PageCgroupUsed(pc)) + return; + + lock_page_cgroup(pc); + if (PageCgroupUsed(pc)) { + memcg = pc->mem_cgroup; + ClearPageCgroupUsed(pc); + } + unlock_page_cgroup(pc); + + /* + * Checking if kmem accounted is enabled won't work for uncharge, since + * it is possible that the user enabled kmem tracking, allocated, and + * then disabled it again. 
+ * + * We trust if there is a memcg associated with the page, it is a valid + * allocation + */ + if (!memcg) + return; + + WARN_ON(mem_cgroup_is_root(memcg)); + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); + mem_cgroup_put(memcg); +} #endif /* CONFIG_MEMCG_KMEM */ #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) @@ -5700,3 +5811,69 @@ static int __init enable_swap_account(char *s) __setup("swapaccount=", enable_swap_account); #endif + +#ifdef CONFIG_MEMCG_KMEM +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) +{ + struct res_counter *fail_res; + struct mem_cgroup *_memcg; + int ret; + bool may_oom; + bool nofail = false; + + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) && + !(gfp & __GFP_NORETRY); + + ret = 0; + + if (!memcg) + return ret; + + _memcg = memcg; + ret = __mem_cgroup_try_charge(NULL, gfp, size / PAGE_SIZE, + &_memcg, may_oom); + + if (ret == -EINTR) { + nofail = true; + /* + * __mem_cgroup_try_charge() chosed to bypass to root due to + * OOM kill or fatal signal. Since our only options are to + * either fail the allocation or charge it to this cgroup, do + * it as a temporary condition. But we can't fail. From a + * kmem/slab perspective, the cache has already been selected, + * by mem_cgroup_get_kmem_cache(), so it is too late to change + * our minds + */ + res_counter_charge_nofail(&memcg->res, size, &fail_res); + if (do_swap_account) + res_counter_charge_nofail(&memcg->memsw, size, + &fail_res); + ret = 0; + } else if (ret == -ENOMEM) + return ret; + + if (nofail) + res_counter_charge_nofail(&memcg->kmem, size, &fail_res); + else + ret = res_counter_charge(&memcg->kmem, size, &fail_res); + + if (ret) { + res_counter_uncharge(&memcg->res, size); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, size); + } + + return ret; +} + +void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) +{ + if (!memcg) + return; + + res_counter_uncharge(&memcg->kmem, size); + res_counter_uncharge(&memcg->res, size); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, size); +} +#endif /* CONFIG_MEMCG_KMEM */ -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
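To make the calling convention of the three entry points introduced by
the patch above concrete, a hypothetical caller would look like the
following. The real hook lives in the page allocator (patch 07/13); the
kmemcg_alloc_pages/kmemcg_free_pages wrapper names here are made up for
illustration:

	/* sketch: charge against the current memcg, allocate, then
	 * commit (or revert, if the allocation failed) */
	static struct page *kmemcg_alloc_pages(gfp_t gfp, unsigned int order)
	{
		struct mem_cgroup *memcg = NULL;
		struct page *page;

		gfp |= __GFP_KMEMCG;
		/* returns false only when the memcg is over its kmem limit */
		if (!memcg_kmem_newpage_charge(gfp, &memcg, order))
			return NULL;

		page = alloc_pages(gfp, order);
		/* binds the page_cgroup to memcg, or reverts the charge if !page */
		memcg_kmem_commit_charge(page, memcg, order);
		return page;
	}

	/* ... and the matching free side: */
	static void kmemcg_free_pages(struct page *page, unsigned int order)
	{
		memcg_kmem_uncharge_page(page, order);
		__free_pages(page, order);
	}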
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <1347977050-29476-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-20 16:05 ` JoonSoo Kim [not found] ` <CAAmzW4ONnc7n3kZbYnE6n2Cg0ZyPXW0QU2NMr0uRkyTxnGpNqQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2012-09-26 15:51 ` Michal Hocko 1 sibling, 1 reply; 127+ messages in thread From: JoonSoo Kim @ 2012-09-20 16:05 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner Hi, Glauber. 2012/9/18 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>: > +/* > + * We need to verify if the allocation against current->mm->owner's memcg is > + * possible for the given order. But the page is not allocated yet, so we'll > + * need a further commit step to do the final arrangements. > + * > + * It is possible for the task to switch cgroups in this mean time, so at > + * commit time, we can't rely on task conversion any longer. We'll then use > + * the handle argument to return to the caller which cgroup we should commit > + * against. We could also return the memcg directly and avoid the pointer > + * passing, but a boolean return value gives better semantics considering > + * the compiled-out case as well. > + * > + * Returning true means the allocation is possible. > + */ > +bool > +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) > +{ > + struct mem_cgroup *memcg; > + bool ret; > + struct task_struct *p; > + > + *_memcg = NULL; > + rcu_read_lock(); > + p = rcu_dereference(current->mm->owner); > + memcg = mem_cgroup_from_task(p); > + rcu_read_unlock(); > + > + if (!memcg_can_account_kmem(memcg)) > + return true; > + > + mem_cgroup_get(memcg); > + > + ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0; > + if (ret) > + *_memcg = memcg; > + else > + mem_cgroup_put(memcg); > + > + return ret; > +} "*_memcg = memcg" should be executed when "memcg_charge_kmem" is success. "memcg_charge_kmem" return 0 if success in charging. Therefore, I think this code is wrong. If I am right, it is a serious bug that affect behavior of all the patchset. > +void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, > + int order) > +{ > + struct page_cgroup *pc; > + > + WARN_ON(mem_cgroup_is_root(memcg)); > + > + /* The page allocation failed. Revert */ > + if (!page) { > + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); > + return; > + } In case of "!page ", mem_cgroup_put(memcg) is needed, because we already call "mem_cgroup_get(memcg)" in __memcg_kmem_newpage_charge(). I know that mem_cgroup_put()/get() will be removed in later patch, but it is important that every patch works fine. Thanks. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <CAAmzW4ONnc7n3kZbYnE6n2Cg0ZyPXW0QU2NMr0uRkyTxnGpNqQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2012-09-21 8:41 ` Glauber Costa 2012-09-21 9:14 ` JoonSoo Kim 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-21 8:41 UTC (permalink / raw) To: JoonSoo Kim Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner On 09/20/2012 08:05 PM, JoonSoo Kim wrote: > Hi, Glauber. > > 2012/9/18 Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>: >> +/* >> + * We need to verify if the allocation against current->mm->owner's memcg is >> + * possible for the given order. But the page is not allocated yet, so we'll >> + * need a further commit step to do the final arrangements. >> + * >> + * It is possible for the task to switch cgroups in this mean time, so at >> + * commit time, we can't rely on task conversion any longer. We'll then use >> + * the handle argument to return to the caller which cgroup we should commit >> + * against. We could also return the memcg directly and avoid the pointer >> + * passing, but a boolean return value gives better semantics considering >> + * the compiled-out case as well. >> + * >> + * Returning true means the allocation is possible. >> + */ >> +bool >> +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) >> +{ >> + struct mem_cgroup *memcg; >> + bool ret; >> + struct task_struct *p; >> + >> + *_memcg = NULL; >> + rcu_read_lock(); >> + p = rcu_dereference(current->mm->owner); >> + memcg = mem_cgroup_from_task(p); >> + rcu_read_unlock(); >> + >> + if (!memcg_can_account_kmem(memcg)) >> + return true; >> + >> + mem_cgroup_get(memcg); >> + >> + ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0; >> + if (ret) >> + *_memcg = memcg; >> + else >> + mem_cgroup_put(memcg); >> + >> + return ret; >> +} > > "*_memcg = memcg" should be executed when "memcg_charge_kmem" is success. > "memcg_charge_kmem" return 0 if success in charging. > Therefore, I think this code is wrong. > If I am right, it is a serious bug that affect behavior of all the patchset. Which is precisely what it does. ret is a boolean, that will be true when charge succeeded (== 0 test) > >> +void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, >> + int order) >> +{ >> + struct page_cgroup *pc; >> + >> + WARN_ON(mem_cgroup_is_root(memcg)); >> + >> + /* The page allocation failed. Revert */ >> + if (!page) { >> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); >> + return; >> + } > > In case of "!page ", mem_cgroup_put(memcg) is needed, > because we already call "mem_cgroup_get(memcg)" in > __memcg_kmem_newpage_charge(). > I know that mem_cgroup_put()/get() will be removed in later patch, but > it is important that every patch works fine. Okay, I'll add the put here. It is indeed missing. ^ permalink raw reply [flat|nested] 127+ messages in thread
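The parse JoonSoo missed, spelled out: the comparison binds tighter than
the assignment, so the line in question is equivalent to

	/* ret is true exactly when memcg_charge_kmem() returned 0 (success) */
	ret = (memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0);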
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure
  2012-09-21  8:41 ` Glauber Costa
@ 2012-09-21  9:14 ` JoonSoo Kim
  0 siblings, 0 replies; 127+ messages in thread
From: JoonSoo Kim @ 2012-09-21 9:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm,
	Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes,
	Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner

>> "*_memcg = memcg" should be executed when "memcg_charge_kmem" is success.
>> "memcg_charge_kmem" return 0 if success in charging.
>> Therefore, I think this code is wrong.
>> If I am right, it is a serious bug that affect behavior of all the patchset.
>
> Which is precisely what it does. ret is a boolean, that will be true
> when charge succeeded (== 0 test)

Ahh... Okay! I didn't see the (== 0) test.

^ permalink raw reply	[flat|nested] 127+ messages in thread
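The missing put that Glauber agrees to add above would make the revert
path of __memcg_kmem_commit_charge pair up with the mem_cgroup_get()
taken at charge time. Roughly (a sketch of the agreed fix, not a posted
patch):

	/* The page allocation failed. Revert */
	if (!page) {
		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
		/* pairs with mem_cgroup_get() in __memcg_kmem_newpage_charge() */
		mem_cgroup_put(memcg);
		return;
	}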
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <1347977050-29476-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-20 16:05 ` JoonSoo Kim @ 2012-09-26 15:51 ` Michal Hocko 2012-09-27 11:31 ` Glauber Costa 1 sibling, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-09-26 15:51 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Tue 18-09-12 18:04:03, Glauber Costa wrote: > This patch introduces infrastructure for tracking kernel memory pages to > a given memcg. This will happen whenever the caller includes the flag > __GFP_KMEMCG flag, and the task belong to a memcg other than the root. > > In memcontrol.h those functions are wrapped in inline acessors. The > idea is to later on, patch those with static branches, so we don't incur > any overhead when no mem cgroups with limited kmem are being used. Could you describe the API a bit here, please? I guess the kernel user is supposed to call memcg_kmem_newpage_charge and memcg_kmem_commit_charge resp. memcg_kmem_uncharge_page. All other kmem functions here are just helpers, right? > > [ v2: improved comments and standardized function names ] > [ v3: handle no longer opaque, functions not exported, > even more comments ] > [ v4: reworked Used bit handling and surroundings for more clarity ] > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> > CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> > CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> > CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > --- > include/linux/memcontrol.h | 97 +++++++++++++++++++++++++ > mm/memcontrol.c | 177 +++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 274 insertions(+) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 8d9489f..82ede9a 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -21,6 +21,7 @@ > #define _LINUX_MEMCONTROL_H > #include <linux/cgroup.h> > #include <linux/vm_event_item.h> > +#include <linux/hardirq.h> > > struct mem_cgroup; > struct page_cgroup; > @@ -399,6 +400,17 @@ struct sock; > #ifdef CONFIG_MEMCG_KMEM > void sock_update_memcg(struct sock *sk); > void sock_release_memcg(struct sock *sk); > + > +static inline bool memcg_kmem_enabled(void) > +{ > + return true; > +} > + > +extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, > + int order); > +extern void __memcg_kmem_commit_charge(struct page *page, > + struct mem_cgroup *memcg, int order); > +extern void __memcg_kmem_uncharge_page(struct page *page, int order); > #else > static inline void sock_update_memcg(struct sock *sk) > { > @@ -406,6 +418,91 @@ static inline void sock_update_memcg(struct sock *sk) > static inline void sock_release_memcg(struct sock *sk) > { > } > + > +static inline bool memcg_kmem_enabled(void) > +{ > + return false; > +} > + > +static inline bool > +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) > +{ > + return false; > +} > + > +static inline void 
__memcg_kmem_uncharge_page(struct page *page, int order) > +{ > +} > + > +static inline void > +__memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) > +{ > +} I think we shouldn't care about these for !MEMCG_KMEM. It should be sufficient to define the main three functions bellow as return true resp. NOOP. This would reduce the code churn a bit and also make it better maintainable. > #endif /* CONFIG_MEMCG_KMEM */ > + > +/** > + * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed. > + * @gfp: the gfp allocation flags. > + * @memcg: a pointer to the memcg this was charged against. > + * @order: allocation order. > + * > + * returns true if the memcg where the current task belongs can hold this > + * allocation. > + * > + * We return true automatically if this allocation is not to be accounted to > + * any memcg. > + */ > +static __always_inline bool > +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) > +{ > + if (!memcg_kmem_enabled()) > + return true; > + > + /* > + * __GFP_NOFAIL allocations will move on even if charging is not > + * possible. Therefore we don't even try, and have this allocation > + * unaccounted. We could in theory charge it with > + * res_counter_charge_nofail, but we hope those allocations are rare, > + * and won't be worth the trouble. > + */ > + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL)) > + return true; > + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) > + return true; > + return __memcg_kmem_newpage_charge(gfp, memcg, order); > +} > + > +/** > + * memcg_kmem_uncharge_page: uncharge pages from memcg > + * @page: pointer to struct page being freed > + * @order: allocation order. > + * > + * there is no need to specify memcg here, since it is embedded in page_cgroup > + */ > +static __always_inline void > +memcg_kmem_uncharge_page(struct page *page, int order) > +{ > + if (memcg_kmem_enabled()) > + __memcg_kmem_uncharge_page(page, order); > +} > + > +/** > + * memcg_kmem_commit_charge: embeds correct memcg in a page > + * @memcg: a pointer to the memcg this was charged against. ^^^^^^^ remove this one? > + * @page: pointer to struct page recently allocated > + * @memcg: the memcg structure we charged against > + * @order: allocation order. > + * > + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or > + * failure of the allocation. if @page is NULL, this function will revert the > + * charges. Otherwise, it will commit the memcg given by @memcg to the > + * corresponding page_cgroup. > + */ > +static __always_inline void > +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) > +{ > + if (memcg_kmem_enabled() && memcg) > + __memcg_kmem_commit_charge(page, memcg, order); > +} > #endif /* _LINUX_MEMCONTROL_H */ > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f3fd354..0f36a01 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -10,6 +10,10 @@ > * Copyright (C) 2009 Nokia Corporation > * Author: Kirill A. Shutemov > * > + * Kernel Memory Controller > + * Copyright (C) 2012 Parallels Inc. and Google Inc. 
> + * Authors: Glauber Costa and Suleiman Souhlal > + * > * This program is free software; you can redistribute it and/or modify > * it under the terms of the GNU General Public License as published by > * the Free Software Foundation; either version 2 of the License, or > @@ -426,6 +430,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) > #include <net/ip.h> > > static bool mem_cgroup_is_root(struct mem_cgroup *memcg); > +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size); > +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size); > + Why the forward declarations here? We can simply move definitions up before they are used for the first time, can't we? Besides that they are never used/defined from outside of KMEM_MEMCG. > void sock_update_memcg(struct sock *sk) > { > if (mem_cgroup_sockets_enabled) { > @@ -480,6 +487,110 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) > } > EXPORT_SYMBOL(tcp_proto_cgroup); > #endif /* CONFIG_INET */ > + > +static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg) > +{ > + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) && > + memcg->kmem_accounted; > +} > + > +/* > + * We need to verify if the allocation against current->mm->owner's memcg is > + * possible for the given order. But the page is not allocated yet, so we'll > + * need a further commit step to do the final arrangements. > + * > + * It is possible for the task to switch cgroups in this mean time, so at > + * commit time, we can't rely on task conversion any longer. We'll then use > + * the handle argument to return to the caller which cgroup we should commit > + * against. We could also return the memcg directly and avoid the pointer > + * passing, but a boolean return value gives better semantics considering > + * the compiled-out case as well. > + * > + * Returning true means the allocation is possible. > + */ > +bool > +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) > +{ > + struct mem_cgroup *memcg; > + bool ret; > + struct task_struct *p; Johannes likes christmas trees ;) and /me would like to remove `p' and use mem_cgroup_from_task(rcu_dereference(current->mm->owner)) same as we do at other places (I guess it will be checkpatch safe). > + > + *_memcg = NULL; > + rcu_read_lock(); > + p = rcu_dereference(current->mm->owner); > + memcg = mem_cgroup_from_task(p); mem_cgroup_from_task says it can return NULL. Do we care here? If not then please put VM_BUG_ON(!memcg) here. > + rcu_read_unlock(); > + > + if (!memcg_can_account_kmem(memcg)) > + return true; > + > + mem_cgroup_get(memcg); I am confused. Why do we take a reference to memcg rather than css_get here? Ahh it is because we keep the reference while the page is allocated, right? Comment please. I am still not sure whether we need css_get here as well. How do you know that the current is not moved in parallel and it is a last task in a group which then can go away? > + > + ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0; > + if (ret) > + *_memcg = memcg; > + else > + mem_cgroup_put(memcg); > + > + return ret; > +} > + > +void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, > + int order) > +{ > + struct page_cgroup *pc; > + > + WARN_ON(mem_cgroup_is_root(memcg)); Why the warn? Nobody should use this directly and memcg_kmem_commit_charge takes care of the root same as __memcg_kmem_newpage_charge does. If it is for correctness then it should be VM_BUG_ON. 
> + > + /* The page allocation failed. Revert */ > + if (!page) { > + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); > + return; > + } > + > + pc = lookup_page_cgroup(page); > + lock_page_cgroup(pc); > + pc->mem_cgroup = memcg; > + SetPageCgroupUsed(pc); > + unlock_page_cgroup(pc); > +} > + > +void __memcg_kmem_uncharge_page(struct page *page, int order) > +{ > + struct mem_cgroup *memcg = NULL; > + struct page_cgroup *pc; > + > + > + pc = lookup_page_cgroup(page); > + /* > + * Fast unlocked return. Theoretically might have changed, have to > + * check again after locking. > + */ > + if (!PageCgroupUsed(pc)) > + return; > + > + lock_page_cgroup(pc); > + if (PageCgroupUsed(pc)) { > + memcg = pc->mem_cgroup; > + ClearPageCgroupUsed(pc); > + } > + unlock_page_cgroup(pc); > + > + /* > + * Checking if kmem accounted is enabled won't work for uncharge, since > + * it is possible that the user enabled kmem tracking, allocated, and > + * then disabled it again. disabling cannot happen, right? > + * > + * We trust if there is a memcg associated with the page, it is a valid > + * allocation > + */ > + if (!memcg) > + return; > + > + WARN_ON(mem_cgroup_is_root(memcg)); Same as above I do not see a reason for warn here. It just adds a code and if you want it for debugging then VM_BUG_ON sounds more appropriate. /me thinks > + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); > + mem_cgroup_put(memcg); > +} > #endif /* CONFIG_MEMCG_KMEM */ > > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > @@ -5700,3 +5811,69 @@ static int __init enable_swap_account(char *s) > __setup("swapaccount=", enable_swap_account); > > #endif > + > +#ifdef CONFIG_MEMCG_KMEM > +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) > +{ > + struct res_counter *fail_res; > + struct mem_cgroup *_memcg; > + int ret; > + bool may_oom; > + bool nofail = false; > + > + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) && > + !(gfp & __GFP_NORETRY); A comment please? Why __GFP_IO is not considered for example? > + > + ret = 0; > + > + if (!memcg) > + return ret; How can we get a NULL memcg here without blowing in __memcg_kmem_newpage_charge? > + > + _memcg = memcg; > + ret = __mem_cgroup_try_charge(NULL, gfp, size / PAGE_SIZE, me likes >> PAGE_SHIFT more. > + &_memcg, may_oom); > + > + if (ret == -EINTR) { > + nofail = true; > + /* > + * __mem_cgroup_try_charge() chosed to bypass to root due to > + * OOM kill or fatal signal. Since our only options are to > + * either fail the allocation or charge it to this cgroup, do > + * it as a temporary condition. But we can't fail. From a > + * kmem/slab perspective, the cache has already been selected, > + * by mem_cgroup_get_kmem_cache(), so it is too late to change > + * our minds > + */ > + res_counter_charge_nofail(&memcg->res, size, &fail_res); > + if (do_swap_account) > + res_counter_charge_nofail(&memcg->memsw, size, > + &fail_res); > + ret = 0; > + } else if (ret == -ENOMEM) > + return ret; > + > + if (nofail) > + res_counter_charge_nofail(&memcg->kmem, size, &fail_res); > + else > + ret = res_counter_charge(&memcg->kmem, size, &fail_res); > + > + if (ret) { > + res_counter_uncharge(&memcg->res, size); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, size); > + } You could save few lines and get rid of the strange nofail by: [...] 
+ res_counter_charge_nofail(&memcg->res, size, &fail_res); + if (do_swap_account) + res_counter_charge_nofail(&memcg->memsw, size, + &fail_res); + res_counter_charge_nofail(&memcg->kmem, size, &fail_res); + return 0; + } else if (ret == -ENOMEM) + return ret; + else + ret = res_counter_charge(&memcg->kmem, size, &fail_res); + + if (ret) { + res_counter_uncharge(&memcg->res, size); + if (do_swap_account) + res_counter_uncharge(&memcg->memsw, size); + } > + > + return ret; > +} > + > +void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) > +{ > + if (!memcg) > + return; > + > + res_counter_uncharge(&memcg->kmem, size); > + res_counter_uncharge(&memcg->res, size); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, size); > +} > +#endif /* CONFIG_MEMCG_KMEM */ > -- > 1.7.11.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
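Michal's point about the !MEMCG_KMEM side, sketched: only the three
public helpers need stubs, and the __memcg_kmem_* variants can disappear
entirely. A hypothetical shape, not the posted code:

	#else /* !CONFIG_MEMCG_KMEM */

	static inline bool
	memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
	{
		return true;	/* nothing accounted; allocation always allowed */
	}

	static inline void
	memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
				 int order)
	{
	}

	static inline void memcg_kmem_uncharge_page(struct page *page, int order)
	{
	}

	#endif /* CONFIG_MEMCG_KMEM */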
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure 2012-09-26 15:51 ` Michal Hocko @ 2012-09-27 11:31 ` Glauber Costa [not found] ` <5064392D.5040707-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-27 11:31 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 09/26/2012 07:51 PM, Michal Hocko wrote: > On Tue 18-09-12 18:04:03, Glauber Costa wrote: >> This patch introduces infrastructure for tracking kernel memory pages to >> a given memcg. This will happen whenever the caller includes the flag >> __GFP_KMEMCG flag, and the task belong to a memcg other than the root. >> >> In memcontrol.h those functions are wrapped in inline acessors. The >> idea is to later on, patch those with static branches, so we don't incur >> any overhead when no mem cgroups with limited kmem are being used. > > Could you describe the API a bit here, please? I guess the > kernel user is supposed to call memcg_kmem_newpage_charge and > memcg_kmem_commit_charge resp. memcg_kmem_uncharge_page. > All other kmem functions here are just helpers, right? Yes, sir. >> >> [ v2: improved comments and standardized function names ] >> [ v3: handle no longer opaque, functions not exported, >> even more comments ] >> [ v4: reworked Used bit handling and surroundings for more clarity ] >> >> Signed-off-by: Glauber Costa <glommer@parallels.com> >> CC: Christoph Lameter <cl@linux.com> >> CC: Pekka Enberg <penberg@cs.helsinki.fi> >> CC: Michal Hocko <mhocko@suse.cz> >> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >> CC: Johannes Weiner <hannes@cmpxchg.org> >> --- >> include/linux/memcontrol.h | 97 +++++++++++++++++++++++++ >> mm/memcontrol.c | 177 +++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 274 insertions(+) >> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >> index 8d9489f..82ede9a 100644 >> --- a/include/linux/memcontrol.h >> +++ b/include/linux/memcontrol.h >> @@ -21,6 +21,7 @@ >> #define _LINUX_MEMCONTROL_H >> #include <linux/cgroup.h> >> #include <linux/vm_event_item.h> >> +#include <linux/hardirq.h> >> >> struct mem_cgroup; >> struct page_cgroup; >> @@ -399,6 +400,17 @@ struct sock; >> #ifdef CONFIG_MEMCG_KMEM >> void sock_update_memcg(struct sock *sk); >> void sock_release_memcg(struct sock *sk); >> + >> +static inline bool memcg_kmem_enabled(void) >> +{ >> + return true; >> +} >> + >> +extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, >> + int order); >> +extern void __memcg_kmem_commit_charge(struct page *page, >> + struct mem_cgroup *memcg, int order); >> +extern void __memcg_kmem_uncharge_page(struct page *page, int order); >> #else >> static inline void sock_update_memcg(struct sock *sk) >> { >> @@ -406,6 +418,91 @@ static inline void sock_update_memcg(struct sock *sk) >> static inline void sock_release_memcg(struct sock *sk) >> { >> } >> + >> +static inline bool memcg_kmem_enabled(void) >> +{ >> + return false; >> +} >> + >> +static inline bool >> +__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) >> +{ >> + return false; >> +} >> + >> +static inline void __memcg_kmem_uncharge_page(struct page *page, int order) >> +{ >> +} >> + >> +static inline void >> +__memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) >> +{ >> +} > > I think we 
shouldn't care about these for !MEMCG_KMEM. It should be > sufficient to define the main three functions bellow as return true > resp. NOOP. This would reduce the code churn a bit and also make it > better maintainable. > Ok. >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index f3fd354..0f36a01 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -10,6 +10,10 @@ >> * Copyright (C) 2009 Nokia Corporation >> * Author: Kirill A. Shutemov >> * >> + * Kernel Memory Controller >> + * Copyright (C) 2012 Parallels Inc. and Google Inc. >> + * Authors: Glauber Costa and Suleiman Souhlal >> + * >> * This program is free software; you can redistribute it and/or modify >> * it under the terms of the GNU General Public License as published by >> * the Free Software Foundation; either version 2 of the License, or >> @@ -426,6 +430,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) >> #include <net/ip.h> >> >> static bool mem_cgroup_is_root(struct mem_cgroup *memcg); >> +static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size); >> +static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size); >> + > > Why the forward declarations here? We can simply move definitions up > before they are used for the first time, can't we? Besides that they are > never used/defined from outside of KMEM_MEMCG. > I see your point, given the recent patch about gcc complaining about those things. Will change. >> + >> + *_memcg = NULL; >> + rcu_read_lock(); >> + p = rcu_dereference(current->mm->owner); >> + memcg = mem_cgroup_from_task(p); > > mem_cgroup_from_task says it can return NULL. Do we care here? If not > then please put VM_BUG_ON(!memcg) here. > >> + rcu_read_unlock(); >> + >> + if (!memcg_can_account_kmem(memcg)) >> + return true; >> + >> + mem_cgroup_get(memcg); > > I am confused. Why do we take a reference to memcg rather than css_get > here? Ahh it is because we keep the reference while the page is > allocated, right? Comment please. ok. > > I am still not sure whether we need css_get here as well. How do you > know that the current is not moved in parallel and it is a last task in > a group which then can go away? the reference count aquired by mem_cgroup_get will still prevent the memcg from going away, no? >> + >> + /* The page allocation failed. Revert */ >> + if (!page) { >> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); >> + return; >> + } >> + >> + pc = lookup_page_cgroup(page); >> + lock_page_cgroup(pc); >> + pc->mem_cgroup = memcg; >> + SetPageCgroupUsed(pc); >> + unlock_page_cgroup(pc); >> +} >> + >> +void __memcg_kmem_uncharge_page(struct page *page, int order) >> +{ >> + struct mem_cgroup *memcg = NULL; >> + struct page_cgroup *pc; >> + >> + >> + pc = lookup_page_cgroup(page); >> + /* >> + * Fast unlocked return. Theoretically might have changed, have to >> + * check again after locking. >> + */ >> + if (!PageCgroupUsed(pc)) >> + return; >> + >> + lock_page_cgroup(pc); >> + if (PageCgroupUsed(pc)) { >> + memcg = pc->mem_cgroup; >> + ClearPageCgroupUsed(pc); >> + } >> + unlock_page_cgroup(pc); >> + >> + /* >> + * Checking if kmem accounted is enabled won't work for uncharge, since >> + * it is possible that the user enabled kmem tracking, allocated, and >> + * then disabled it again. > > disabling cannot happen, right? > not anymore, right. I can update the comment, but I still believe it is a lot saner to trust information in page_cgroup. 
>> +#ifdef CONFIG_MEMCG_KMEM >> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) >> +{ >> + struct res_counter *fail_res; >> + struct mem_cgroup *_memcg; >> + int ret; >> + bool may_oom; >> + bool nofail = false; >> + >> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) && >> + !(gfp & __GFP_NORETRY); > > A comment please? Why __GFP_IO is not considered for example? > > Actually, I believe testing for GFP_WAIT and !GFP_NORETRY would be enough. The rationale here is, of course, under which circumstance would it be valid to call the oom killer? Which is, if the allocation can wait, and can retry. > > You could save few lines and get rid of the strange nofail by: > [...] > + res_counter_charge_nofail(&memcg->res, size, &fail_res); > + if (do_swap_account) > + res_counter_charge_nofail(&memcg->memsw, size, > + &fail_res); > + res_counter_charge_nofail(&memcg->kmem, size, &fail_res); > + return 0; > + } else if (ret == -ENOMEM) > + return ret; > + else > + ret = res_counter_charge(&memcg->kmem, size, &fail_res); > + > + if (ret) { > + res_counter_uncharge(&memcg->res, size); > + if (do_swap_account) > + res_counter_uncharge(&memcg->memsw, size); > + } > indeed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
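The simplified OOM test Glauber settles on in the exchange above would
read (a sketch; the rationale is that invoking the memcg OOM killer only
makes sense if the allocation is allowed both to sleep and to retry):

	may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);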
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <5064392D.5040707-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-27 13:44 ` Michal Hocko [not found] ` <20120927134432.GE29104-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-09-27 13:44 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Thu 27-09-12 15:31:57, Glauber Costa wrote: > On 09/26/2012 07:51 PM, Michal Hocko wrote: > > On Tue 18-09-12 18:04:03, Glauber Costa wrote: [...] > >> + *_memcg = NULL; > >> + rcu_read_lock(); > >> + p = rcu_dereference(current->mm->owner); > >> + memcg = mem_cgroup_from_task(p); > > > > mem_cgroup_from_task says it can return NULL. Do we care here? If not > > then please put VM_BUG_ON(!memcg) here. > > > >> + rcu_read_unlock(); > >> + > >> + if (!memcg_can_account_kmem(memcg)) > >> + return true; > >> + > >> + mem_cgroup_get(memcg); > > > > I am confused. Why do we take a reference to memcg rather than css_get > > here? Ahh it is because we keep the reference while the page is > > allocated, right? Comment please. > ok. > > > > > I am still not sure whether we need css_get here as well. How do you > > know that the current is not moved in parallel and it is a last task in > > a group which then can go away? > > the reference count aquired by mem_cgroup_get will still prevent the > memcg from going away, no? Yes but you are outside of the rcu now and we usually do css_get before we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get deallocated but it could be gone before you call it. Or I am just confused - these 2 levels of ref counting is really not nice. Anyway, I have just noticed that __mem_cgroup_try_charge does VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should keep css ref count up as well. > >> + /* The page allocation failed. Revert */ > >> + if (!page) { > >> + memcg_uncharge_kmem(memcg, PAGE_SIZE << order); > >> + return; > >> + } > >> + > >> + pc = lookup_page_cgroup(page); > >> + lock_page_cgroup(pc); > >> + pc->mem_cgroup = memcg; > >> + SetPageCgroupUsed(pc); > >> + unlock_page_cgroup(pc); > >> +} > >> + > >> +void __memcg_kmem_uncharge_page(struct page *page, int order) > >> +{ > >> + struct mem_cgroup *memcg = NULL; > >> + struct page_cgroup *pc; > >> + > >> + > >> + pc = lookup_page_cgroup(page); > >> + /* > >> + * Fast unlocked return. Theoretically might have changed, have to > >> + * check again after locking. > >> + */ > >> + if (!PageCgroupUsed(pc)) > >> + return; > >> + > >> + lock_page_cgroup(pc); > >> + if (PageCgroupUsed(pc)) { > >> + memcg = pc->mem_cgroup; > >> + ClearPageCgroupUsed(pc); > >> + } > >> + unlock_page_cgroup(pc); > >> + > >> + /* > >> + * Checking if kmem accounted is enabled won't work for uncharge, since > >> + * it is possible that the user enabled kmem tracking, allocated, and > >> + * then disabled it again. > > > > disabling cannot happen, right? > > > not anymore, right. I can update the comment, yes, it is confusing > but I still believe it is a lot saner to trust information in > page_cgroup. I have no objections against that. PageCgroupUsed test and using pc->mem_cgroup is fine. 
> >> +#ifdef CONFIG_MEMCG_KMEM > >> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) > >> +{ > >> + struct res_counter *fail_res; > >> + struct mem_cgroup *_memcg; > >> + int ret; > >> + bool may_oom; > >> + bool nofail = false; > >> + > >> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) && > >> + !(gfp & __GFP_NORETRY); > > > > A comment please? Why __GFP_IO is not considered for example? > > > > > > Actually, I believe testing for GFP_WAIT and !GFP_NORETRY would be enough. > > The rationale here is, of course, under which circumstance would it be > valid to call the oom killer? Which is, if the allocation can wait, and > can retry. Yes __GFP_WAIT is clear because memcg OOM can wait for arbitrary amount of time (wait for userspace action on oom_control). __GFP_NORETRY couldn't get to oom before because oom was excluded explicitely for THP and migration didn't go through the charging path to reach the oom. But I do agree that __GFP_NORETRY allocations shouldn't cause the OOM because we should rather fail the allocation from kernel rather than shoot something. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
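The "pin the css before rcu_read_unlock" pattern Michal refers to, for
reference (a sketch using css_tryget; whether this is actually required
here, on top of mem_cgroup_get, is exactly what the rest of the thread
debates):

	rcu_read_lock();
	memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
	/* pin the css while still under rcu, so the group cannot be removed */
	if (!css_tryget(&memcg->css)) {
		rcu_read_unlock();
		return true;	/* group is going away; skip accounting */
	}
	rcu_read_unlock();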
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <20120927134432.GE29104-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-28 11:34 ` Glauber Costa 2012-09-30 8:25 ` Tejun Heo [not found] ` <50658B3B.9020303-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 2 replies; 127+ messages in thread From: Glauber Costa @ 2012-09-28 11:34 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 09/27/2012 05:44 PM, Michal Hocko wrote: >> > the reference count aquired by mem_cgroup_get will still prevent the >> > memcg from going away, no? > Yes but you are outside of the rcu now and we usually do css_get before > we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get > deallocated but it could be gone before you call it. Or I am just > confused - these 2 levels of ref counting is really not nice. > > Anyway, I have just noticed that __mem_cgroup_try_charge does > VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should > keep css ref count up as well. > IIRC, css_get will prevent the cgroup directory from being removed. Because some allocations are expected to outlive the cgroup, we specifically don't want that. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure
  2012-09-28 11:34 ` Glauber Costa
@ 2012-09-30  8:25 ` Tejun Heo
  2012-10-01  8:28 ` Glauber Costa
  [not found] ` <20120930082542.GH10383-9pTldWuhBndy/B6EtB590w@public.gmane.org>
  1 sibling, 2 replies; 127+ messages in thread
From: Tejun Heo @ 2012-09-30 8:25 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel,
	linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman,
	David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner

On Fri, Sep 28, 2012 at 03:34:19PM +0400, Glauber Costa wrote:
> On 09/27/2012 05:44 PM, Michal Hocko wrote:
> > Anyway, I have just noticed that __mem_cgroup_try_charge does
> > VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should
> > keep css ref count up as well.
>
> IIRC, css_get will prevent the cgroup directory from being removed.
> Because some allocations are expected to outlive the cgroup, we
> specifically don't want that.

That synchronous ref draining is going away. Maybe we can do that
before kmemcg? Michal, do you have some timeframe in mind?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure 2012-09-30 8:25 ` Tejun Heo @ 2012-10-01 8:28 ` Glauber Costa [not found] ` <5069542C.2020103-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> [not found] ` <20120930082542.GH10383-9pTldWuhBndy/B6EtB590w@public.gmane.org> 1 sibling, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-01 8:28 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 09/30/2012 12:25 PM, Tejun Heo wrote: > On Fri, Sep 28, 2012 at 03:34:19PM +0400, Glauber Costa wrote: >> On 09/27/2012 05:44 PM, Michal Hocko wrote: >>> Anyway, I have just noticed that __mem_cgroup_try_charge does >>> VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should >>> keep css ref count up as well. >> >> IIRC, css_get will prevent the cgroup directory from being removed. >> Because some allocations are expected to outlive the cgroup, we >> specifically don't want that. > > That synchronous ref draining is going away. Maybe we can do that > before kmemcg? Michal, do you have some timeframe on mind? > Since you said yourself in other points in this thread that you are fine with some page references outliving the cgroup in the case of slab, this is a situation that comes with the code, not a situation that was incidentally there, and we're making use of. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <5069542C.2020103-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-03 22:11 ` Tejun Heo 0 siblings, 0 replies; 127+ messages in thread From: Tejun Heo @ 2012-10-03 22:11 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner Hello, Glauber. Sorry about the late replies. I've been traveling for the Korean thanksgiving holidays. On Mon, Oct 01, 2012 at 12:28:28PM +0400, Glauber Costa wrote: > > That synchronous ref draining is going away. Maybe we can do that > > before kmemcg? Michal, do you have some timeframe in mind? > > Since you said yourself in other points in this thread that you are fine > with some page references outliving the cgroup in the case of slab, this > is a situation that comes with the code, not a situation that was > incidentally there, and we're making use of. Hmmm? I am not sure what you're trying to say, but I wanted to say that this should be okay once the scheduled memcg pre_destroy change happens, and to nudge Michal once more. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <20120930082542.GH10383-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-10-01 9:44 ` Michal Hocko 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-10-01 9:44 UTC (permalink / raw) To: Tejun Heo Cc: Glauber Costa, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Sun 30-09-12 17:25:42, Tejun Heo wrote: > On Fri, Sep 28, 2012 at 03:34:19PM +0400, Glauber Costa wrote: > > On 09/27/2012 05:44 PM, Michal Hocko wrote: > > > Anyway, I have just noticed that __mem_cgroup_try_charge does > > > VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should > > > keep css ref count up as well. > > > > IIRC, css_get will prevent the cgroup directory from being removed. > > Because some allocations are expected to outlive the cgroup, we > > specifically don't want that. > > That synchronous ref draining is going away. Maybe we can do that > before kmemcg? Michal, do you have some timeframe on mind? It is on my todo list but I didn't get to it yet. I am not sure we can get rid of css_get though - will have to think about that. > > Thanks. > > -- > tejun -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <50658B3B.9020303-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-01 9:48 ` Michal Hocko 2012-10-01 10:09 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 9:48 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Fri 28-09-12 15:34:19, Glauber Costa wrote: > On 09/27/2012 05:44 PM, Michal Hocko wrote: > >> > the reference count aquired by mem_cgroup_get will still prevent the > >> > memcg from going away, no? > > Yes but you are outside of the rcu now and we usually do css_get before > > we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get > > deallocated but it could be gone before you call it. Or I am just > > confused - these 2 levels of ref counting is really not nice. > > > > Anyway, I have just noticed that __mem_cgroup_try_charge does > > VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should > > keep css ref count up as well. > > > > IIRC, css_get will prevent the cgroup directory from being removed. > Because some allocations are expected to outlive the cgroup, we > specifically don't want that. Yes, but how do you guarantee that the above VM_BUG_ON doesn't trigger? Task could have been moved to another group between mem_cgroup_from_task and mem_cgroup_get, no? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure 2012-10-01 9:48 ` Michal Hocko @ 2012-10-01 10:09 ` Glauber Costa 2012-10-01 11:51 ` Michal Hocko 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-01 10:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 10/01/2012 01:48 PM, Michal Hocko wrote: > On Fri 28-09-12 15:34:19, Glauber Costa wrote: >> On 09/27/2012 05:44 PM, Michal Hocko wrote: >>>>> the reference count acquired by mem_cgroup_get will still prevent the >>>>> memcg from going away, no? >>> Yes but you are outside of the rcu now and we usually do css_get before >>> we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get >>> deallocated but it could be gone before you call it. Or I am just >>> confused - these 2 levels of ref counting is really not nice. >>> >>> Anyway, I have just noticed that __mem_cgroup_try_charge does >>> VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should >>> keep css ref count up as well. >>> >> >> IIRC, css_get will prevent the cgroup directory from being removed. >> Because some allocations are expected to outlive the cgroup, we >> specifically don't want that. > > Yes, but how do you guarantee that the above VM_BUG_ON doesn't trigger? > Task could have been moved to another group between mem_cgroup_from_task > and mem_cgroup_get, no? > Ok, after reading this again (and again), you seem to be right. It concerns me, however, that simply getting the css would lead us to a double get/put pair, since try_charge will have to do it anyway. I considered just letting try_charge select the memcg, but that is not really what we want, since if that memcg will fail kmem allocations, we simply won't issue try charge, but return early. Any immediate suggestions on how to handle this? ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure 2012-10-01 10:09 ` Glauber Costa @ 2012-10-01 11:51 ` Michal Hocko [not found] ` <20121001115157.GE8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 11:51 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Mon 01-10-12 14:09:09, Glauber Costa wrote: > On 10/01/2012 01:48 PM, Michal Hocko wrote: > > On Fri 28-09-12 15:34:19, Glauber Costa wrote: > >> On 09/27/2012 05:44 PM, Michal Hocko wrote: > >>>>> the reference count acquired by mem_cgroup_get will still prevent the > >>>>> memcg from going away, no? > >>> Yes but you are outside of the rcu now and we usually do css_get before > >>> we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get > >>> deallocated but it could be gone before you call it. Or I am just > >>> confused - these 2 levels of ref counting is really not nice. > >>> > >>> Anyway, I have just noticed that __mem_cgroup_try_charge does > >>> VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should > >>> keep css ref count up as well. > >>> > >> > >> IIRC, css_get will prevent the cgroup directory from being removed. > >> Because some allocations are expected to outlive the cgroup, we > >> specifically don't want that. > > > > Yes, but how do you guarantee that the above VM_BUG_ON doesn't trigger? > > Task could have been moved to another group between mem_cgroup_from_task > > and mem_cgroup_get, no? > > > > Ok, after reading this again (and again), you seem to be right. It > concerns me, however, that simply getting the css would lead us to a > double get/put pair, since try_charge will have to do it anyway. That happens only for the !*ptr case, and you provide a memcg here, don't you? > I considered just letting try_charge select the memcg, but that is > not really what we want, since if that memcg will fail kmem allocations, > we simply won't issue try charge, but return early. > > Any immediate suggestions on how to handle this? I would do the same thing __mem_cgroup_try_charge does: retry: rcu_read_lock(); p = rcu_dereference(mm->owner); memcg = mem_cgroup_from_task(p); if (!css_tryget(&memcg->css)) { rcu_read_unlock(); goto retry; } rcu_read_unlock(); -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <20121001115157.GE8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-10-01 11:51 ` Glauber Costa 2012-10-01 11:58 ` Michal Hocko 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-01 11:51 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 10/01/2012 03:51 PM, Michal Hocko wrote: > On Mon 01-10-12 14:09:09, Glauber Costa wrote: >> On 10/01/2012 01:48 PM, Michal Hocko wrote: >>> On Fri 28-09-12 15:34:19, Glauber Costa wrote: >>>> On 09/27/2012 05:44 PM, Michal Hocko wrote: >>>>>>> the reference count acquired by mem_cgroup_get will still prevent the >>>>>>> memcg from going away, no? >>>>> Yes but you are outside of the rcu now and we usually do css_get before >>>>> we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get >>>>> deallocated but it could be gone before you call it. Or I am just >>>>> confused - these 2 levels of ref counting is really not nice. >>>>> >>>>> Anyway, I have just noticed that __mem_cgroup_try_charge does >>>>> VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should >>>>> keep css ref count up as well. >>>>> >>>> >>>> IIRC, css_get will prevent the cgroup directory from being removed. >>>> Because some allocations are expected to outlive the cgroup, we >>>> specifically don't want that. >>> >>> Yes, but how do you guarantee that the above VM_BUG_ON doesn't trigger? >>> Task could have been moved to another group between mem_cgroup_from_task >>> and mem_cgroup_get, no? >>> >> >> Ok, after reading this again (and again), you seem to be right. It >> concerns me, however, that simply getting the css would lead us to a >> double get/put pair, since try_charge will have to do it anyway. > > That happens only for the !*ptr case, and you provide a memcg here, don't > you? > if (*ptr) { /* css should be a valid one */ memcg = *ptr; VM_BUG_ON(css_is_removed(&memcg->css)); if (mem_cgroup_is_root(memcg)) goto done; if (consume_stock(memcg, nr_pages)) goto done; css_get(&memcg->css); The way I read this, this will still issue a css_get here, unless consume_stock succeeds (assuming non-root). So we'd still have to have a wrapping get/put pair outside the charge. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure 2012-10-01 11:51 ` Glauber Costa @ 2012-10-01 11:58 ` Michal Hocko [not found] ` <20121001115847.GF8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 11:58 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Mon 01-10-12 15:51:20, Glauber Costa wrote: > On 10/01/2012 03:51 PM, Michal Hocko wrote: > > On Mon 01-10-12 14:09:09, Glauber Costa wrote: > >> On 10/01/2012 01:48 PM, Michal Hocko wrote: > >>> On Fri 28-09-12 15:34:19, Glauber Costa wrote: > >>>> On 09/27/2012 05:44 PM, Michal Hocko wrote: > >>>>>>> the reference count aquired by mem_cgroup_get will still prevent the > >>>>>>> memcg from going away, no? > >>>>> Yes but you are outside of the rcu now and we usually do css_get before > >>>>> we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get > >>>>> deallocated but it could be gone before you call it. Or I am just > >>>>> confused - these 2 levels of ref counting is really not nice. > >>>>> > >>>>> Anyway, I have just noticed that __mem_cgroup_try_charge does > >>>>> VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should > >>>>> keep css ref count up as well. > >>>>> > >>>> > >>>> IIRC, css_get will prevent the cgroup directory from being removed. > >>>> Because some allocations are expected to outlive the cgroup, we > >>>> specifically don't want that. > >>> > >>> Yes, but how do you guarantee that the above VM_BUG_ON doesn't trigger? > >>> Task could have been moved to another group between mem_cgroup_from_task > >>> and mem_cgroup_get, no? > >>> > >> > >> Ok, after reading this again (and again), you seem to be right. It > >> concerns me, however, that simply getting the css would lead us to a > >> double get/put pair, since try_charge will have to do it anyway. > > > > That happens only for !*ptr case and you provide a memcg here, don't > > you. > > > > if (*ptr) { /* css should be a valid one */ > memcg = *ptr; > VM_BUG_ON(css_is_removed(&memcg->css)); > if (mem_cgroup_is_root(memcg)) > goto done; > if (consume_stock(memcg, nr_pages)) > goto done; > css_get(&memcg->css); > > > The way I read this, this will still issue a css_get here, unless > consume_stock suceeds (assuming non-root) > > So we'd still have to have a wrapping get/put pair outside the charge. That is correct but it assumes that the css is valid so somebody upwards made sure css will not go away. This would suggest css_get is not necessary here but I guess the primary intention here is to make the code easier so that we do not have to check whether we took css reference on the return path. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 06/13] memcg: kmem controller infrastructure [not found] ` <20121001115847.GF8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-10-01 12:04 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-01 12:04 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 10/01/2012 03:58 PM, Michal Hocko wrote: > On Mon 01-10-12 15:51:20, Glauber Costa wrote: >> On 10/01/2012 03:51 PM, Michal Hocko wrote: >>> On Mon 01-10-12 14:09:09, Glauber Costa wrote: >>>> On 10/01/2012 01:48 PM, Michal Hocko wrote: >>>>> On Fri 28-09-12 15:34:19, Glauber Costa wrote: >>>>>> On 09/27/2012 05:44 PM, Michal Hocko wrote: >>>>>>>>> the reference count aquired by mem_cgroup_get will still prevent the >>>>>>>>> memcg from going away, no? >>>>>>> Yes but you are outside of the rcu now and we usually do css_get before >>>>>>> we rcu_unlock. mem_cgroup_get just makes sure the group doesn't get >>>>>>> deallocated but it could be gone before you call it. Or I am just >>>>>>> confused - these 2 levels of ref counting is really not nice. >>>>>>> >>>>>>> Anyway, I have just noticed that __mem_cgroup_try_charge does >>>>>>> VM_BUG_ON(css_is_removed(&memcg->css)) on a given memcg so you should >>>>>>> keep css ref count up as well. >>>>>>> >>>>>> >>>>>> IIRC, css_get will prevent the cgroup directory from being removed. >>>>>> Because some allocations are expected to outlive the cgroup, we >>>>>> specifically don't want that. >>>>> >>>>> Yes, but how do you guarantee that the above VM_BUG_ON doesn't trigger? >>>>> Task could have been moved to another group between mem_cgroup_from_task >>>>> and mem_cgroup_get, no? >>>>> >>>> >>>> Ok, after reading this again (and again), you seem to be right. It >>>> concerns me, however, that simply getting the css would lead us to a >>>> double get/put pair, since try_charge will have to do it anyway. >>> >>> That happens only for !*ptr case and you provide a memcg here, don't >>> you. >>> >> >> if (*ptr) { /* css should be a valid one */ >> memcg = *ptr; >> VM_BUG_ON(css_is_removed(&memcg->css)); >> if (mem_cgroup_is_root(memcg)) >> goto done; >> if (consume_stock(memcg, nr_pages)) >> goto done; >> css_get(&memcg->css); >> >> >> The way I read this, this will still issue a css_get here, unless >> consume_stock suceeds (assuming non-root) >> >> So we'd still have to have a wrapping get/put pair outside the charge. > > That is correct but it assumes that the css is valid so somebody upwards > made sure css will not go away. This would suggest css_get is not > necessary here but I guess the primary intention here is to make the > code easier so that we do not have to check whether we took css > reference on the return path. > In any case, umem would also suffer from double reference, so I'm fine taking it here as well, since a solution for that is orthogonal. I still need mem_cgroup_get() to make sure the data structure stays around, but we only need to do it once at first charge. ^ permalink raw reply [flat|nested] 127+ messages in thread
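[Editor's sketch: one way to reconcile the requirements settled on above, with a hypothetical helper name. The css is pinned only across memcg selection, so the VM_BUG_ON in __mem_cgroup_try_charge cannot trigger; the long-lived mem_cgroup_get() moves out of the per-charge path and is taken once, when kmem accounting is first activated, which is what the lifecycle patch below implements.]

    static struct mem_cgroup *memcg_kmem_lookup(struct mm_struct *mm)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
    again:
            memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
            if (!css_tryget(&memcg->css))
                    goto again;     /* the task moved groups; look it up again */
            rcu_read_unlock();
            return memcg;           /* the caller pairs this with css_put() */
    }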
* [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (5 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 06/13] memcg: kmem controller infrastructure Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa 2012-09-27 13:50 ` Mel Gorman [not found] ` <1347977050-29476-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-18 14:04 ` [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge Glauber Costa ` (5 subsequent siblings) 12 siblings, 2 replies; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner When a process tries to allocate a page with the __GFP_KMEMCG flag, the page allocator will call the corresponding memcg functions to validate the allocation. Tasks in the root memcg can always proceed. To avoid adding markers to the page - and a kmem flag that would necessarily follow, as much as doing page_cgroup lookups for no reason, whoever is marking its allocations with __GFP_KMEMCG flag is responsible for telling the page allocator that this is such an allocation at free_pages() time. This is done by the invocation of __free_accounted_pages() and free_accounted_pages(). [ v2: inverted test order to avoid a memcg_get leak, free_accounted_pages simplification ] Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> --- include/linux/gfp.h | 3 +++ mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index d8eae4d..029570f 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order); extern void free_hot_cold_page(struct page *page, int cold); extern void free_hot_cold_page_list(struct list_head *list, int cold); +extern void __free_accounted_pages(struct page *page, unsigned int order); +extern void free_accounted_pages(unsigned long addr, unsigned int order); + #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b0c5a52..897d8e2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2573,6 +2573,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, struct page *page = NULL; int migratetype = allocflags_to_migratetype(gfp_mask); unsigned int cpuset_mems_cookie; + struct mem_cgroup *memcg = NULL; gfp_mask &= gfp_allowed_mask; @@ -2591,6 +2592,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (unlikely(!zonelist->_zonerefs->zone)) return NULL; + /* + * Will only have any effect when __GFP_KMEMCG is 
set. This is + * verified in the (always inline) callee + */ + if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order)) + return NULL; + retry_cpuset: cpuset_mems_cookie = get_mems_allowed(); @@ -2624,6 +2632,8 @@ out: if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; + memcg_kmem_commit_charge(page, memcg, order); + return page; } EXPORT_SYMBOL(__alloc_pages_nodemask); @@ -2676,6 +2686,31 @@ void free_pages(unsigned long addr, unsigned int order) EXPORT_SYMBOL(free_pages); +/* + * __free_accounted_pages and free_accounted_pages will free pages allocated + * with __GFP_KMEMCG. + * + * Those pages are accounted to a particular memcg, embedded in the + * corresponding page_cgroup. To avoid adding a hit in the allocator to search + * for that information only to find out that it is NULL for users who have no + * interest in that whatsoever, we provide these functions. + * + * The caller knows better which flags it relies on. + */ +void __free_accounted_pages(struct page *page, unsigned int order) +{ + memcg_kmem_uncharge_page(page, order); + __free_pages(page, order); +} + +void free_accounted_pages(unsigned long addr, unsigned int order) +{ + if (addr != 0) { + VM_BUG_ON(!virt_addr_valid((void *)addr)); + __free_accounted_pages(virt_to_page((void *)addr), order); + } +} + static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size) { if (addr) { -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
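[Editor's illustration: the calling convention the patch establishes. Whoever allocates with __GFP_KMEMCG must free through the accounted variants, so that callers who never opt in pay no page_cgroup lookup. Loosely modeled on the thread-stack patch later in this series; the function names are hypothetical.]

    static unsigned long alloc_task_stack(void)
    {
            /* charged to the allocating task's memcg once a kmem limit is set */
            return __get_free_pages(GFP_KERNEL | __GFP_KMEMCG, THREAD_SIZE_ORDER);
    }

    static void free_task_stack(unsigned long addr)
    {
            /* the accounted variant uncharges the owning memcg before freeing */
            free_accounted_pages(addr, THREAD_SIZE_ORDER);
    }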
* Re: [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg 2012-09-18 14:04 ` [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg Glauber Costa @ 2012-09-27 13:50 ` Mel Gorman [not found] ` <20120927135053.GF3429-l3A5Bk7waGM@public.gmane.org> [not found] ` <1347977050-29476-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 1 sibling, 1 reply; 127+ messages in thread From: Mel Gorman @ 2012-09-27 13:50 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner On Tue, Sep 18, 2012 at 06:04:04PM +0400, Glauber Costa wrote: > When a process tries to allocate a page with the __GFP_KMEMCG flag, the > page allocator will call the corresponding memcg functions to validate > the allocation. Tasks in the root memcg can always proceed. > > To avoid adding markers to the page - and a kmem flag that would > necessarily follow, as much as doing page_cgroup lookups for no reason, > whoever is marking its allocations with __GFP_KMEMCG flag is responsible > for telling the page allocator that this is such an allocation at > free_pages() time. This is done by the invocation of > __free_accounted_pages() and free_accounted_pages(). > > [ v2: inverted test order to avoid a memcg_get leak, > free_accounted_pages simplification ] > > Signed-off-by: Glauber Costa <glommer@parallels.com> > CC: Christoph Lameter <cl@linux.com> > CC: Pekka Enberg <penberg@cs.helsinki.fi> > CC: Michal Hocko <mhocko@suse.cz> > CC: Johannes Weiner <hannes@cmpxchg.org> > CC: Suleiman Souhlal <suleiman@google.com> > CC: Mel Gorman <mgorman@suse.de> > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > --- > include/linux/gfp.h | 3 +++ > mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++ > 2 files changed, 38 insertions(+) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index d8eae4d..029570f 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order); > extern void free_hot_cold_page(struct page *page, int cold); > extern void free_hot_cold_page_list(struct list_head *list, int cold); > > +extern void __free_accounted_pages(struct page *page, unsigned int order); > +extern void free_accounted_pages(unsigned long addr, unsigned int order); > + > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b0c5a52..897d8e2 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2573,6 +2573,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, > struct page *page = NULL; > int migratetype = allocflags_to_migratetype(gfp_mask); > unsigned int cpuset_mems_cookie; > + struct mem_cgroup *memcg = NULL; > > gfp_mask &= gfp_allowed_mask; > > @@ -2591,6 +2592,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, > if (unlikely(!zonelist->_zonerefs->zone)) > return NULL; > > + /* > + * Will only have any effect when __GFP_KMEMCG is set. This is > + * verified in the (always inline) callee > + */ > + if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order)) > + return NULL; > + 1. returns quickly if memcg disabled 2. returns quickly if !__GFP_KMEMCG 3. 
returns quickly for kernel threads and interrupts I'm expecting that these paths are completely dead when kmem accounting is off so Acked-by: Mel Gorman <mgorman@suse.de> That said, it's not directly related to this patch but I would suggest that you also check for TIF_MEMDIE in memcg_kmem_newpage_charge. It would be very silly if a process failed to exit because it couldn't allocate a page it needed. I expect that such a case is impossible today but it might change in the future. If you're doing another revision, an extra check would not hurt. It's difficult to predict if it should be making all the checks that gfp_to_alloc_flags() does but you might need to in the future so keep it in mind. > retry_cpuset: > cpuset_mems_cookie = get_mems_allowed(); > > @@ -2624,6 +2632,8 @@ out: > if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) > goto retry_cpuset; > > + memcg_kmem_commit_charge(page, memcg, order); > + > return page; Ok. > } > EXPORT_SYMBOL(__alloc_pages_nodemask); > @@ -2676,6 +2686,31 @@ void free_pages(unsigned long addr, unsigned int order) > > EXPORT_SYMBOL(free_pages); > > +/* > + * __free_accounted_pages and free_accounted_pages will free pages allocated > + * with __GFP_KMEMCG. > + * > + * Those pages are accounted to a particular memcg, embedded in the > + * corresponding page_cgroup. To avoid adding a hit in the allocator to search > + * for that information only to find out that it is NULL for users who have no > + * interest in that whatsoever, we provide these functions. > + * > + * The caller knows better which flags it relies on. > + */ > +void __free_accounted_pages(struct page *page, unsigned int order) > +{ > + memcg_kmem_uncharge_page(page, order); > + __free_pages(page, order); > +} > + > +void free_accounted_pages(unsigned long addr, unsigned int order) > +{ > + if (addr != 0) { > + VM_BUG_ON(!virt_addr_valid((void *)addr)); This is probably overkill. If it's invalid, the next line is likely to blow up anyway. It's no biggie. > + __free_accounted_pages(virt_to_page((void *)addr), order); > + } > +} > + > static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size) > { > if (addr) { > -- > 1.7.11.4 > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
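[Editor's sketch of Mel's TIF_MEMDIE suggestion, which is not part of the posted series: a bypass that could sit next to the existing early returns in memcg_kmem_newpage_charge().]

    /*
     * A task already selected by the OOM killer must not be blocked on a
     * kmem charge while it is trying to exit.
     */
    if (unlikely(test_thread_flag(TIF_MEMDIE)))
            return true;    /* allow the allocation, skip the charge */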
* Re: [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg [not found] ` <20120927135053.GF3429-l3A5Bk7waGM@public.gmane.org> @ 2012-09-28 9:43 ` Glauber Costa 2012-09-28 13:28 ` Mel Gorman 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-28 9:43 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner On 09/27/2012 05:50 PM, Mel Gorman wrote: >> +void __free_accounted_pages(struct page *page, unsigned int order) >> > +{ >> > + memcg_kmem_uncharge_page(page, order); >> > + __free_pages(page, order); >> > +} >> > + >> > +void free_accounted_pages(unsigned long addr, unsigned int order) >> > +{ >> > + if (addr != 0) { >> > + VM_BUG_ON(!virt_addr_valid((void *)addr)); > This is probably overkill. If it's invalid, the next line is likely to > blow up anyway. It's no biggie. > So this is here because it is in free_pages() as well. If it blows up, at least we know precisely why (if debugging), and VM_BUG_ON() is only compiled in when CONFIG_DEBUG_VM is set. But I'm fine with either. Should it stay or should it go? ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg 2012-09-28 9:43 ` Glauber Costa @ 2012-09-28 13:28 ` Mel Gorman 0 siblings, 0 replies; 127+ messages in thread From: Mel Gorman @ 2012-09-28 13:28 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, David Rientjes, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner On Fri, Sep 28, 2012 at 01:43:47PM +0400, Glauber Costa wrote: > On 09/27/2012 05:50 PM, Mel Gorman wrote: > >> +void __free_accounted_pages(struct page *page, unsigned int order) > >> > +{ > >> > + memcg_kmem_uncharge_page(page, order); > >> > + __free_pages(page, order); > >> > +} > >> > + > >> > +void free_accounted_pages(unsigned long addr, unsigned int order) > >> > +{ > >> > + if (addr != 0) { > >> > + VM_BUG_ON(!virt_addr_valid((void *)addr)); > > This is probably overkill. If it's invalid, the next line is likely to > > blow up anyway. It's no biggie. > > > > So this is here because it is in free_pages() as well. If it blows, at > least we know precisely why (if debugging), and VM_BUG_ON() is only > compiled in when CONFIG_DEBUG_VM. > Ah, I see. > But I'm fine with either. > Should it stay or should it go ? > It can stay. It makes sense that it look similar to free_pages() and as you say, it makes debugging marginally easier. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg [not found] ` <1347977050-29476-8-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-27 13:52 ` Michal Hocko 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-09-27 13:52 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Tue 18-09-12 18:04:04, Glauber Costa wrote: > When a process tries to allocate a page with the __GFP_KMEMCG flag, the > page allocator will call the corresponding memcg functions to validate > the allocation. Tasks in the root memcg can always proceed. > > To avoid adding markers to the page - and a kmem flag that would > necessarily follow, as much as doing page_cgroup lookups for no reason, > whoever is marking its allocations with __GFP_KMEMCG flag is responsible > for telling the page allocator that this is such an allocation at > free_pages() time. This is done by the invocation of > __free_accounted_pages() and free_accounted_pages(). > > [ v2: inverted test order to avoid a memcg_get leak, > free_accounted_pages simplification ] > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> > CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> > CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org> > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Thanks! > --- > include/linux/gfp.h | 3 +++ > mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++ > 2 files changed, 38 insertions(+) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index d8eae4d..029570f 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order); > extern void free_hot_cold_page(struct page *page, int cold); > extern void free_hot_cold_page_list(struct list_head *list, int cold); > > +extern void __free_accounted_pages(struct page *page, unsigned int order); > +extern void free_accounted_pages(unsigned long addr, unsigned int order); > + > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b0c5a52..897d8e2 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2573,6 +2573,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, > struct page *page = NULL; > int migratetype = allocflags_to_migratetype(gfp_mask); > unsigned int cpuset_mems_cookie; > + struct mem_cgroup *memcg = NULL; > > gfp_mask &= gfp_allowed_mask; > > @@ -2591,6 +2592,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, > if (unlikely(!zonelist->_zonerefs->zone)) > return NULL; > > + /* > + * Will only have any effect when __GFP_KMEMCG is set. 
This is > + * verified in the (always inline) callee > + */ > + if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order)) > + return NULL; > + > retry_cpuset: > cpuset_mems_cookie = get_mems_allowed(); > > @@ -2624,6 +2632,8 @@ out: > if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) > goto retry_cpuset; > > + memcg_kmem_commit_charge(page, memcg, order); > + > return page; > } > EXPORT_SYMBOL(__alloc_pages_nodemask); > @@ -2676,6 +2686,31 @@ void free_pages(unsigned long addr, unsigned int order) > > EXPORT_SYMBOL(free_pages); > > +/* > + * __free_accounted_pages and free_accounted_pages will free pages allocated > + * with __GFP_KMEMCG. > + * > + * Those pages are accounted to a particular memcg, embedded in the > + * corresponding page_cgroup. To avoid adding a hit in the allocator to search > + * for that information only to find out that it is NULL for users who have no > + * interest in that whatsoever, we provide these functions. > + * > + * The caller knows better which flags it relies on. > + */ > +void __free_accounted_pages(struct page *page, unsigned int order) > +{ > + memcg_kmem_uncharge_page(page, order); > + __free_pages(page, order); > +} > + > +void free_accounted_pages(unsigned long addr, unsigned int order) > +{ > + if (addr != 0) { > + VM_BUG_ON(!virt_addr_valid((void *)addr)); > + __free_accounted_pages(virt_to_page((void *)addr), order); > + } > +} > + > static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size) > { > if (addr) { > -- > 1.7.11.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (6 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa 2012-10-01 10:00 ` Michal Hocko 2012-09-18 14:04 ` [PATCH v3 09/13] memcg: kmem accounting lifecycle management Glauber Costa ` (4 subsequent siblings) 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Michal Hocko, Johannes Weiner It is useful to know how many charges are still left after a call to res_counter_uncharge. While it is possible to issue a res_counter_read after uncharge, this is racy. It would be better if uncharge itself would tell us what the current status is. Since the current return value is void, we don't need to worry about anything breaking due to this change: nobody relied on that, and only users appearing from now on will be checking this value. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> --- Documentation/cgroups/resource_counter.txt | 7 ++++--- include/linux/res_counter.h | 12 +++++++----- kernel/res_counter.c | 20 +++++++++++++------- 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt index 0c4a344..c4d99ed 100644 --- a/Documentation/cgroups/resource_counter.txt +++ b/Documentation/cgroups/resource_counter.txt @@ -83,16 +83,17 @@ to work with it. res_counter->lock internally (it must be called with res_counter->lock held). The force parameter indicates whether we can bypass the limit. - e. void res_counter_uncharge[_locked] + e. u64 res_counter_uncharge[_locked] (struct res_counter *rc, unsigned long val) When a resource is released (freed) it should be de-accounted from the resource counter it was accounted to. This is called - "uncharging". + "uncharging". The return value of this function indicate the amount + of charges still present in the counter. The _locked routines imply that the res_counter->lock is taken. - f. void res_counter_uncharge_until + f. u64 res_counter_uncharge_until (struct res_counter *rc, struct res_counter *top, unsinged long val) diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index 7d7fbe2..4b173b6 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -130,14 +130,16 @@ int res_counter_charge_nofail(struct res_counter *counter, * * these calls check for usage underflow and show a warning on the console * _locked call expects the counter->lock to be taken + * + * returns the total charges still present in @counter. 
*/ -void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); -void res_counter_uncharge(struct res_counter *counter, unsigned long val); +u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); +u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); -void res_counter_uncharge_until(struct res_counter *counter, - struct res_counter *top, - unsigned long val); +u64 res_counter_uncharge_until(struct res_counter *counter, + struct res_counter *top, + unsigned long val); /** * res_counter_margin - calculate chargeable space of a counter * @cnt: the counter diff --git a/kernel/res_counter.c b/kernel/res_counter.c index ad581aa..7b3d6dc 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -86,33 +86,39 @@ int res_counter_charge_nofail(struct res_counter *counter, unsigned long val, return __res_counter_charge(counter, val, limit_fail_at, true); } -void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) +u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) { if (WARN_ON(counter->usage < val)) val = counter->usage; counter->usage -= val; + return counter->usage; } -void res_counter_uncharge_until(struct res_counter *counter, - struct res_counter *top, - unsigned long val) +u64 res_counter_uncharge_until(struct res_counter *counter, + struct res_counter *top, + unsigned long val) { unsigned long flags; struct res_counter *c; + u64 ret = 0; local_irq_save(flags); for (c = counter; c != top; c = c->parent) { + u64 r; spin_lock(&c->lock); - res_counter_uncharge_locked(c, val); + r = res_counter_uncharge_locked(c, val); + if (c == counter) + ret = r; spin_unlock(&c->lock); } local_irq_restore(flags); + return ret; } -void res_counter_uncharge(struct res_counter *counter, unsigned long val) +u64 res_counter_uncharge(struct res_counter *counter, unsigned long val) { - res_counter_uncharge_until(counter, NULL, val); + return res_counter_uncharge_until(counter, NULL, val); } static inline unsigned long long * -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
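[Editor's illustration of why the return value matters, using the kmem res_counter and the memcg_kmem_dead()/mem_cgroup_put() helpers from the lifecycle patch below: "did this uncharge take usage to zero?" becomes atomic with the uncharge itself.]

    static void memcg_uncharge_kmem_sketch(struct mem_cgroup *memcg, u64 size)
    {
            /*
             * A separate res_counter_read() here could race with a concurrent
             * charge or uncharge; the value returned by the uncharge cannot.
             */
            if (res_counter_uncharge(&memcg->kmem, size) == 0 &&
                memcg_kmem_dead(memcg))
                    mem_cgroup_put(memcg); /* last pending charge of a dead memcg */
    }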
* Re: [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge 2012-09-18 14:04 ` [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge Glauber Costa @ 2012-10-01 10:00 ` Michal Hocko 2012-10-01 10:01 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 10:00 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Tue 18-09-12 18:04:05, Glauber Costa wrote: > It is useful to know how many charges are still left after a call to > res_counter_uncharge. > While it is possible to issue a res_counter_read > after uncharge, this is racy. It would be better if uncharge itself > would tell us what the current status is. Well, I am not sure how much less racy it would be if you return the old value. It could be out of date when you read it, right? (this is even more visible with res_counter_uncharge_until) res_counter_read_u64 uses locks only for 32b when your change could help to reduce lock contention. Other than that it is just res_counter_member which is one cmp and a dereference. Sure, you save something, but it is barely noticeable, I guess. I am not saying I do not like this change; I just think that the above part of the changelog doesn't fit. So it would be much better if you tell us why this is needed for your patchset because the usage is not part of the patch. > Since the current return value is void, we don't need to worry about > anything breaking due to this change: nobody relied on that, and only > users appearing from now on will be checking this value. > > Signed-off-by: Glauber Costa <glommer@parallels.com> > CC: Michal Hocko <mhocko@suse.cz> > CC: Johannes Weiner <hannes@cmpxchg.org> > CC: Suleiman Souhlal <suleiman@google.com> > CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > --- > Documentation/cgroups/resource_counter.txt | 7 ++++--- > include/linux/res_counter.h | 12 +++++++----- > kernel/res_counter.c | 20 +++++++++++++------- > 3 files changed, 24 insertions(+), 15 deletions(-) > > diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt > index 0c4a344..c4d99ed 100644 > --- a/Documentation/cgroups/resource_counter.txt > +++ b/Documentation/cgroups/resource_counter.txt > @@ -83,16 +83,17 @@ to work with it. > res_counter->lock internally (it must be called with res_counter->lock > held). The force parameter indicates whether we can bypass the limit. > > - e. void res_counter_uncharge[_locked] > + e. u64 res_counter_uncharge[_locked] > (struct res_counter *rc, unsigned long val) > > When a resource is released (freed) it should be de-accounted > from the resource counter it was accounted to. This is called > - "uncharging". > + "uncharging". The return value of this function indicate the amount > + of charges still present in the counter. > > The _locked routines imply that the res_counter->lock is taken. > > - f. void res_counter_uncharge_until > + f. 
u64 res_counter_uncharge_until > (struct res_counter *rc, struct res_counter *top, > unsinged long val) > > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h > index 7d7fbe2..4b173b6 100644 > --- a/include/linux/res_counter.h > +++ b/include/linux/res_counter.h > @@ -130,14 +130,16 @@ int res_counter_charge_nofail(struct res_counter *counter, > * > * these calls check for usage underflow and show a warning on the console > * _locked call expects the counter->lock to be taken > + * > + * returns the total charges still present in @counter. > */ > > -void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); > -void res_counter_uncharge(struct res_counter *counter, unsigned long val); > +u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); > +u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); > > -void res_counter_uncharge_until(struct res_counter *counter, > - struct res_counter *top, > - unsigned long val); > +u64 res_counter_uncharge_until(struct res_counter *counter, > + struct res_counter *top, > + unsigned long val); > /** > * res_counter_margin - calculate chargeable space of a counter > * @cnt: the counter > diff --git a/kernel/res_counter.c b/kernel/res_counter.c > index ad581aa..7b3d6dc 100644 > --- a/kernel/res_counter.c > +++ b/kernel/res_counter.c > @@ -86,33 +86,39 @@ int res_counter_charge_nofail(struct res_counter *counter, unsigned long val, > return __res_counter_charge(counter, val, limit_fail_at, true); > } > > -void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) > +u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) > { > if (WARN_ON(counter->usage < val)) > val = counter->usage; > > counter->usage -= val; > + return counter->usage; > } > > -void res_counter_uncharge_until(struct res_counter *counter, > - struct res_counter *top, > - unsigned long val) > +u64 res_counter_uncharge_until(struct res_counter *counter, > + struct res_counter *top, > + unsigned long val) > { > unsigned long flags; > struct res_counter *c; > + u64 ret = 0; > > local_irq_save(flags); > for (c = counter; c != top; c = c->parent) { > + u64 r; > spin_lock(&c->lock); > - res_counter_uncharge_locked(c, val); > + r = res_counter_uncharge_locked(c, val); > + if (c == counter) > + ret = r; > spin_unlock(&c->lock); > } > local_irq_restore(flags); > + return ret; > } > > -void res_counter_uncharge(struct res_counter *counter, unsigned long val) > +u64 res_counter_uncharge(struct res_counter *counter, unsigned long val) > { > - res_counter_uncharge_until(counter, NULL, val); > + return res_counter_uncharge_until(counter, NULL, val); > } > > static inline unsigned long long * > -- > 1.7.11.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge 2012-10-01 10:00 ` Michal Hocko @ 2012-10-01 10:01 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-01 10:01 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 10/01/2012 02:00 PM, Michal Hocko wrote: > On Tue 18-09-12 18:04:05, Glauber Costa wrote: >> It is useful to know how many charges are still left after a call to >> res_counter_uncharge. > >> While it is possible to issue a res_counter_read >> after uncharge, this is racy. It would be better if uncharge itself >> would tell us what the current status is. > > Well, I am not sure how much less racy it would be if you return the old > value. It could be out of date when you read it, right? (this is even > more visible with res_counter_uncharge_until) > res_counter_read_u64 uses locks only for 32b when your change could help > to reduce lock contention. Other than that it is just res_counter_member > which is one cmp and a dereference. Sure, you save something, but it is > barely noticeable, I guess. > Sure it can. But this is the same semantics as the atomic updates, which were always considered to be good enough for cases like this. > I am not saying I do not like this change; I just think that the > above part of the changelog doesn't fit. So it would be much better if > you tell us why this is needed for your patchset because the usage is > not part of the patch. > I can update the changelog, for sure. But for the record, this has the goal of taking the get/put pair out of the charge/uncharge path. If I have 8k of data, and two threads decrement 4k each, an update + read may return 0 for both. With this patch, only one of them will see 0, and will proceed with the reference drop. Again, this is the same semantics as all of the atomic variables in the kernel. >> Since the current return value is void, we don't need to worry about >> anything breaking due to this change: nobody relied on that, and only >> users appearing from now on will be checking this value. >> >> Signed-off-by: Glauber Costa <glommer@parallels.com> >> CC: Michal Hocko <mhocko@suse.cz> >> CC: Johannes Weiner <hannes@cmpxchg.org> >> CC: Suleiman Souhlal <suleiman@google.com> >> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >> --- >> Documentation/cgroups/resource_counter.txt | 7 ++++--- >> include/linux/res_counter.h | 12 +++++++----- >> kernel/res_counter.c | 20 +++++++++++++------- >> 3 files changed, 24 insertions(+), 15 deletions(-) >> >> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt >> index 0c4a344..c4d99ed 100644 >> --- a/Documentation/cgroups/resource_counter.txt >> +++ b/Documentation/cgroups/resource_counter.txt >> @@ -83,16 +83,17 @@ to work with it. >> res_counter->lock internally (it must be called with res_counter->lock >> held). The force parameter indicates whether we can bypass the limit. >> >> - e. void res_counter_uncharge[_locked] >> + e. u64 res_counter_uncharge[_locked] >> (struct res_counter *rc, unsigned long val) >> >> When a resource is released (freed) it should be de-accounted >> from the resource counter it was accounted to. This is called >> - "uncharging". >> + "uncharging". The return value of this function indicate the amount >> + of charges still present in the counter. 
>> >> The _locked routines imply that the res_counter->lock is taken. >> >> - f. void res_counter_uncharge_until >> + f. u64 res_counter_uncharge_until >> (struct res_counter *rc, struct res_counter *top, >> unsinged long val) >> >> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h >> index 7d7fbe2..4b173b6 100644 >> --- a/include/linux/res_counter.h >> +++ b/include/linux/res_counter.h >> @@ -130,14 +130,16 @@ int res_counter_charge_nofail(struct res_counter *counter, >> * >> * these calls check for usage underflow and show a warning on the console >> * _locked call expects the counter->lock to be taken >> + * >> + * returns the total charges still present in @counter. >> */ >> >> -void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); >> -void res_counter_uncharge(struct res_counter *counter, unsigned long val); >> +u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val); >> +u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); >> >> -void res_counter_uncharge_until(struct res_counter *counter, >> - struct res_counter *top, >> - unsigned long val); >> +u64 res_counter_uncharge_until(struct res_counter *counter, >> + struct res_counter *top, >> + unsigned long val); >> /** >> * res_counter_margin - calculate chargeable space of a counter >> * @cnt: the counter >> diff --git a/kernel/res_counter.c b/kernel/res_counter.c >> index ad581aa..7b3d6dc 100644 >> --- a/kernel/res_counter.c >> +++ b/kernel/res_counter.c >> @@ -86,33 +86,39 @@ int res_counter_charge_nofail(struct res_counter *counter, unsigned long val, >> return __res_counter_charge(counter, val, limit_fail_at, true); >> } >> >> -void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) >> +u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) >> { >> if (WARN_ON(counter->usage < val)) >> val = counter->usage; >> >> counter->usage -= val; >> + return counter->usage; >> } >> >> -void res_counter_uncharge_until(struct res_counter *counter, >> - struct res_counter *top, >> - unsigned long val) >> +u64 res_counter_uncharge_until(struct res_counter *counter, >> + struct res_counter *top, >> + unsigned long val) >> { >> unsigned long flags; >> struct res_counter *c; >> + u64 ret = 0; >> >> local_irq_save(flags); >> for (c = counter; c != top; c = c->parent) { >> + u64 r; >> spin_lock(&c->lock); >> - res_counter_uncharge_locked(c, val); >> + r = res_counter_uncharge_locked(c, val); >> + if (c == counter) >> + ret = r; >> spin_unlock(&c->lock); >> } >> local_irq_restore(flags); >> + return ret; >> } >> >> -void res_counter_uncharge(struct res_counter *counter, unsigned long val) >> +u64 res_counter_uncharge(struct res_counter *counter, unsigned long val) >> { >> - res_counter_uncharge_until(counter, NULL, val); >> + return res_counter_uncharge_until(counter, NULL, val); >> } >> >> static inline unsigned long long * >> -- >> 1.7.11.4 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe cgroups" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
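[Editor's illustration of the race Glauber describes: start with 8k charged, two threads uncharge 4k each.]

    separate uncharge + read (racy)      uncharge returns residual usage
    -------------------------------      -------------------------------
    T1: uncharge(4k)                     T1: uncharge(4k) -> 4k
    T2: uncharge(4k)                     T2: uncharge(4k) -> 0
    T1: read() -> 0
    T2: read() -> 0

In the first scheme both threads can observe zero and both would drop the final reference; with the value coming from the uncharge itself, exactly one thread sees zero and performs the put.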
* [PATCH v3 09/13] memcg: kmem accounting lifecycle management [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (7 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa [not found] ` <1347977050-29476-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-18 14:04 ` [PATCH v3 10/13] memcg: use static branches when code not in use Glauber Costa ` (3 subsequent siblings) 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner Because the assignment: memcg->kmem_accounted = true is done after the jump labels increment, we guarantee that the root memcg will always be selected until all call sites are patched (see memcg_kmem_enabled). This guarantees that no mischarges are applied. Jump label decrement happens when the last reference count from the memcg dies. This will only happen when the caches are all dead. -> /cgroups/memory/A/B/C * kmem limit set at A, * A and B have no tasks, * spawn a new task in C. Because kmem_accounted is a boolean that was not set for C, no accounting would be done. This is, however, not what we expect. The basic idea is that when a cgroup is limited, we walk the tree downwards and make sure that we store the information about the parent being limited in kmem_accounted. We do the reverse operation when a formerly limited cgroup becomes unlimited. Since kmem charges may outlive the cgroup existence, we need to be extra careful to guarantee the memcg object will stay around for as long as needed. Up to now, we were using a mem_cgroup_get()/put() pair in charge and uncharge operations. Although this guarantees that the object will be around until the last call to uncharge, this means an atomic update in every charge. We can do better than that if we only issue get() in the first charge, and then put() when the last charge finally goes away. [ v3: merged all lifecycle related patches in one ] Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- mm/memcontrol.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 112 insertions(+), 11 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0f36a01..720e4bb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -287,7 +287,8 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree?
*/ bool use_hierarchy; - bool kmem_accounted; + + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; atomic_t under_oom; @@ -340,6 +341,43 @@ struct mem_cgroup { #endif }; +enum { + KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */ + KMEM_ACCOUNTED_PARENT, /* one of its parents is active */ + KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */ +}; + +/* bits 0 and 1 */ +#define KMEM_ACCOUNTED_MASK 0x3 + +#ifdef CONFIG_MEMCG_KMEM +static bool memcg_kmem_set_active(struct mem_cgroup *memcg) +{ + return !test_and_set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted); +} + +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg) +{ + return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted); +} + +static void memcg_kmem_set_active_parent(struct mem_cgroup *memcg) +{ + set_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted); +} + +static void memcg_kmem_mark_dead(struct mem_cgroup *memcg) +{ + if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted)) + set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted); +} + +static bool memcg_kmem_dead(struct mem_cgroup *memcg) +{ + return test_and_clear_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted); +} +#endif /* CONFIG_MEMCG_KMEM */ + /* Stuffs for move charges at task migration. */ /* * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a @@ -491,7 +529,7 @@ EXPORT_SYMBOL(tcp_proto_cgroup); static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg) { return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) && - memcg->kmem_accounted; + (memcg->kmem_accounted & (KMEM_ACCOUNTED_MASK)); } /* @@ -524,13 +562,9 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) if (!memcg_can_account_kmem(memcg)) return true; - mem_cgroup_get(memcg); - ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0; if (ret) *_memcg = memcg; - else - mem_cgroup_put(memcg); return ret; } @@ -589,7 +623,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) WARN_ON(mem_cgroup_is_root(memcg)); memcg_uncharge_kmem(memcg, PAGE_SIZE << order); - mem_cgroup_put(memcg); } #endif /* CONFIG_MEMCG_KMEM */ @@ -4077,6 +4110,40 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft, len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val); return simple_read_from_buffer(buf, nbytes, ppos, str, len); } + +static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val) +{ +#ifdef CONFIG_MEMCG_KMEM + struct mem_cgroup *iter; + + /* + * When we are doing hierarchical accounting, with an hierarchy like + * A/B/C, we need to start accounting kernel memory all the way up to C + * in case A start being accounted. + * + * So when we the cgroup first gets to be unlimited, we walk all the + * children of the current memcg and enable kmem accounting for them. + * Note that a separate bit is used there to indicate that the + * accounting happens due to the parent being accounted. + * + * note that memcg_kmem_set_active is a test-and-set routine, so we only + * arrive here once (since we never disable it) + */ + mutex_lock(&set_limit_mutex); + if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) { + + mem_cgroup_get(memcg); + + for_each_mem_cgroup_tree(iter, memcg) { + if (iter == memcg) + continue; + memcg_kmem_set_active_parent(iter); + } + } + mutex_unlock(&set_limit_mutex); +#endif +} + /* * The user of this function is... * RES_LIMIT. 
@@ -4115,9 +4182,7 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, if (ret) break; - /* For simplicity, we won't allow this to be disabled */ - if (!memcg->kmem_accounted && val != RESOURCE_MAX) - memcg->kmem_accounted = true; + memcg_update_kmem_limit(memcg, val); } else return -EINVAL; break; @@ -4791,6 +4856,20 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) static void kmem_cgroup_destroy(struct mem_cgroup *memcg) { mem_cgroup_sockets_destroy(memcg); + + memcg_kmem_mark_dead(memcg); + + if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0) + return; + + /* + * Charges already down to 0, undo mem_cgroup_get() done in the charge + * path here, being careful not to race with memcg_uncharge_kmem: it is + * possible that the charges went down to 0 between mark_dead and the + * res_counter read, so in that case, we don't need the put + */ + if (memcg_kmem_dead(memcg)) + mem_cgroup_put(memcg); } #else static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) @@ -5148,6 +5227,8 @@ mem_cgroup_create(struct cgroup *cont) } if (parent && parent->use_hierarchy) { + struct mem_cgroup __maybe_unused *p; + res_counter_init(&memcg->res, &parent->res); res_counter_init(&memcg->memsw, &parent->memsw); res_counter_init(&memcg->kmem, &parent->kmem); @@ -5158,6 +5239,20 @@ mem_cgroup_create(struct cgroup *cont) * mem_cgroup(see mem_cgroup_put). */ mem_cgroup_get(parent); +#ifdef CONFIG_MEMCG_KMEM + /* + * In case a parent is already limited when we create this, we + * need him to propagate it now so we become limited as well. + */ + mutex_lock(&set_limit_mutex); + for (p = parent; p != NULL; p = parent_mem_cgroup(p)) { + if (memcg_kmem_is_accounted(p)) { + memcg_kmem_set_active_parent(memcg); + break; + } + } + mutex_unlock(&set_limit_mutex); +#endif } else { res_counter_init(&memcg->res, NULL); res_counter_init(&memcg->memsw, NULL); @@ -5871,9 +5966,15 @@ void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) if (!memcg) return; - res_counter_uncharge(&memcg->kmem, size); res_counter_uncharge(&memcg->res, size); if (do_swap_account) res_counter_uncharge(&memcg->memsw, size); + + /* Not down to 0 */ + if (res_counter_uncharge(&memcg->kmem, size)) + return; + + if (memcg_kmem_dead(memcg)) + mem_cgroup_put(memcg); } #endif /* CONFIG_MEMCG_KMEM */ -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
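The lifecycle protocol of this patch is spread over several hunks; the sketch below compresses it into one place. It models the kernel's bitops and reference counting with C11 atomics, leaves out the cgroup-level synchronization the real code gets from set_limit_mutex and the destroy path, and uses illustrative names: it is a sketch of the protocol, not the patch's code.

#include <stdatomic.h>

enum {  /* analogue of the KMEM_ACCOUNTED_* bits */
        ACTIVE = 1 << 0,        /* accounted by this cgroup itself */
        PARENT = 1 << 1,        /* one of its parents is active (not exercised below) */
        DEAD   = 1 << 2,        /* cgroup gone, kmem charges still pending */
};

struct memcg {
        atomic_uint  state;
        atomic_uint  refcnt;
        atomic_ulong kmem_usage;
};

/* first limit ever set: take the single long-lived reference */
static void activate(struct memcg *m)
{
        if (!(atomic_fetch_or(&m->state, ACTIVE) & ACTIVE))
                atomic_fetch_add(&m->refcnt, 1);        /* mem_cgroup_get() */
}

/* cgroup destruction: charges may still be in flight */
static void mark_dead(struct memcg *m)
{
        if (atomic_load(&m->state) & ACTIVE)
                atomic_fetch_or(&m->state, DEAD);
}

/* the uncharge that brings usage down to 0 settles the reference */
static void uncharge(struct memcg *m, unsigned long val)
{
        if (atomic_fetch_sub(&m->kmem_usage, val) - val != 0)
                return;                                 /* not down to 0 yet */
        if (atomic_fetch_and(&m->state, ~DEAD) & DEAD)  /* memcg_kmem_dead() */
                atomic_fetch_sub(&m->refcnt, 1);        /* mem_cgroup_put() */
}

The point of the design is visible here: one reference acquisition per memcg lifetime instead of a get/put pair around every single charge.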
* Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [not found] ` <1347977050-29476-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-01 12:15 ` Michal Hocko [not found] ` <20121001121553.GG8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 12:15 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner Based on the previous discussions I guess this one will get reworked, right? On Tue 18-09-12 18:04:06, Glauber Costa wrote: > Because the assignment: memcg->kmem_accounted = true is done after the > jump labels increment, we guarantee that the root memcg will always be > selected until all call sites are patched (see memcg_kmem_enabled). > This guarantees that no mischarges are applied. > > Jump label decrement happens when the last reference count from the > memcg dies. This will only happen when the caches are all dead. > > -> /cgroups/memory/A/B/C > > * kmem limit set at A, > * A and B have no tasks, > * span a new task in in C. > > Because kmem_accounted is a boolean that was not set for C, no > accounting would be done. This is, however, not what we expect. > > The basic idea, is that when a cgroup is limited, we walk the tree > downwards and make sure that we store the information about the parent > being limited in kmem_accounted. > > We do the reverse operation when a formerly limited cgroup becomes > unlimited. > > Since kmem charges may outlive the cgroup existance, we need to be extra > careful to guarantee the memcg object will stay around for as long as > needed. Up to now, we were using a mem_cgroup_get()/put() pair in charge > and uncharge operations. > > Although this guarantees that the object will be around until the last > call to unchage, this means an atomic update in every charge. We can do > better than that if we only issue get() in the first charge, and then > put() when the last charge finally goes away. > > [ v3: merged all lifecycle related patches in one ] > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> > CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> > CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> > CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > --- > mm/memcontrol.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 112 insertions(+), 11 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 0f36a01..720e4bb 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -287,7 +287,8 @@ struct mem_cgroup { > * Should the accounting and control be hierarchical, per subtree? 
> */ > bool use_hierarchy; > - bool kmem_accounted; > + > + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */ > > bool oom_lock; > atomic_t under_oom; > @@ -340,6 +341,43 @@ struct mem_cgroup { > #endif > }; > > +enum { > + KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */ > + KMEM_ACCOUNTED_PARENT, /* one of its parents is active */ > + KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */ > +}; > + > +/* bits 0 and 1 */ > +#define KMEM_ACCOUNTED_MASK 0x3 > + > +#ifdef CONFIG_MEMCG_KMEM > +static bool memcg_kmem_set_active(struct mem_cgroup *memcg) > +{ > + return !test_and_set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted); > +} > + > +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg) > +{ > + return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted); > +} > + > +static void memcg_kmem_set_active_parent(struct mem_cgroup *memcg) > +{ > + set_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted); > +} > + > +static void memcg_kmem_mark_dead(struct mem_cgroup *memcg) > +{ > + if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted)) > + set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted); > +} > + > +static bool memcg_kmem_dead(struct mem_cgroup *memcg) > +{ > + return test_and_clear_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_accounted); > +} > +#endif /* CONFIG_MEMCG_KMEM */ > + > /* Stuffs for move charges at task migration. */ > /* > * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a > @@ -491,7 +529,7 @@ EXPORT_SYMBOL(tcp_proto_cgroup); > static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg) > { > return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) && > - memcg->kmem_accounted; > + (memcg->kmem_accounted & (KMEM_ACCOUNTED_MASK)); > } > > /* > @@ -524,13 +562,9 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) > if (!memcg_can_account_kmem(memcg)) > return true; > > - mem_cgroup_get(memcg); > - > ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order) == 0; > if (ret) > *_memcg = memcg; > - else > - mem_cgroup_put(memcg); > > return ret; > } > @@ -589,7 +623,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) > > WARN_ON(mem_cgroup_is_root(memcg)); > memcg_uncharge_kmem(memcg, PAGE_SIZE << order); > - mem_cgroup_put(memcg); > } > #endif /* CONFIG_MEMCG_KMEM */ > > @@ -4077,6 +4110,40 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft, > len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val); > return simple_read_from_buffer(buf, nbytes, ppos, str, len); > } > + > +static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val) > +{ > +#ifdef CONFIG_MEMCG_KMEM > + struct mem_cgroup *iter; > + > + /* > + * When we are doing hierarchical accounting, with an hierarchy like > + * A/B/C, we need to start accounting kernel memory all the way up to C > + * in case A start being accounted. > + * > + * So when we the cgroup first gets to be unlimited, we walk all the > + * children of the current memcg and enable kmem accounting for them. > + * Note that a separate bit is used there to indicate that the > + * accounting happens due to the parent being accounted. 
> + * > + * note that memcg_kmem_set_active is a test-and-set routine, so we only > + * arrive here once (since we never disable it) > + */ > + mutex_lock(&set_limit_mutex); > + if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) { > + > + mem_cgroup_get(memcg); > + > + for_each_mem_cgroup_tree(iter, memcg) { > + if (iter == memcg) > + continue; > + memcg_kmem_set_active_parent(iter); > + } > + } > + mutex_unlock(&set_limit_mutex); > +#endif > +} > + > /* > * The user of this function is... > * RES_LIMIT. > @@ -4115,9 +4182,7 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft, > if (ret) > break; > > - /* For simplicity, we won't allow this to be disabled */ > - if (!memcg->kmem_accounted && val != RESOURCE_MAX) > - memcg->kmem_accounted = true; > + memcg_update_kmem_limit(memcg, val); > } else > return -EINVAL; > break; > @@ -4791,6 +4856,20 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > static void kmem_cgroup_destroy(struct mem_cgroup *memcg) > { > mem_cgroup_sockets_destroy(memcg); > + > + memcg_kmem_mark_dead(memcg); > + > + if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0) > + return; > + > + /* > + * Charges already down to 0, undo mem_cgroup_get() done in the charge > + * path here, being careful not to race with memcg_uncharge_kmem: it is > + * possible that the charges went down to 0 between mark_dead and the > + * res_counter read, so in that case, we don't need the put > + */ > + if (memcg_kmem_dead(memcg)) > + mem_cgroup_put(memcg); > } > #else > static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) > @@ -5148,6 +5227,8 @@ mem_cgroup_create(struct cgroup *cont) > } > > if (parent && parent->use_hierarchy) { > + struct mem_cgroup __maybe_unused *p; > + > res_counter_init(&memcg->res, &parent->res); > res_counter_init(&memcg->memsw, &parent->memsw); > res_counter_init(&memcg->kmem, &parent->kmem); > @@ -5158,6 +5239,20 @@ mem_cgroup_create(struct cgroup *cont) > * mem_cgroup(see mem_cgroup_put). > */ > mem_cgroup_get(parent); > +#ifdef CONFIG_MEMCG_KMEM > + /* > + * In case a parent is already limited when we create this, we > + * need him to propagate it now so we become limited as well. > + */ > + mutex_lock(&set_limit_mutex); > + for (p = parent; p != NULL; p = parent_mem_cgroup(p)) { > + if (memcg_kmem_is_accounted(p)) { > + memcg_kmem_set_active_parent(memcg); > + break; > + } > + } > + mutex_unlock(&set_limit_mutex); > +#endif > } else { > res_counter_init(&memcg->res, NULL); > res_counter_init(&memcg->memsw, NULL); > @@ -5871,9 +5966,15 @@ void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) > if (!memcg) > return; > > - res_counter_uncharge(&memcg->kmem, size); > res_counter_uncharge(&memcg->res, size); > if (do_swap_account) > res_counter_uncharge(&memcg->memsw, size); > + > + /* Not down to 0 */ > + if (res_counter_uncharge(&memcg->kmem, size)) > + return; > + > + if (memcg_kmem_dead(memcg)) > + mem_cgroup_put(memcg); > } > #endif /* CONFIG_MEMCG_KMEM */ > -- > 1.7.11.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [not found] ` <20121001121553.GG8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-10-01 12:29 ` Glauber Costa [not found] ` <50698C97.70703-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-10-01 12:29 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 10/01/2012 04:15 PM, Michal Hocko wrote: > Based on the previous discussions I guess this one will get reworked, > right? > Yes, but most of it stayed. The hierarchy part is gone, but because we will still have kmem pages floating around (potentially), I am still using the mark_dead() trick with the corresponding get when kmem_accounted. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [not found] ` <50698C97.70703-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-01 12:36 ` Michal Hocko [not found] ` <20121001123654.GJ8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 12:36 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Mon 01-10-12 16:29:11, Glauber Costa wrote: > On 10/01/2012 04:15 PM, Michal Hocko wrote: > > Based on the previous discussions I guess this one will get reworked, > > right? > > > > Yes, but most of it stayed. The hierarchy part is gone, but because we > will still have kmem pages floating around (potentially), I am still > using the mark_dead() trick with the corresponding get when kmem_accounted. Is it OK if I hold on with the review of this one until the next version? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 09/13] memcg: kmem accounting lifecycle management [not found] ` <20121001123654.GJ8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-10-01 12:43 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-01 12:43 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 10/01/2012 04:36 PM, Michal Hocko wrote: > On Mon 01-10-12 16:29:11, Glauber Costa wrote: >> On 10/01/2012 04:15 PM, Michal Hocko wrote: >>> Based on the previous discussions I guess this one will get reworked, >>> right? >>> >> >> Yes, but most of it stayed. The hierarchy part is gone, but because we >> will still have kmem pages floating around (potentially), I am still >> using the mark_dead() trick with the corresponding get when kmem_accounted. > > Is it OK if I hold off on the review of this one until the next > version? > Of course. I haven't sent it yet because I also received a lot more feedback for the slab part (which is expected), and I want to get at least part of that going before I send it again. ^ permalink raw reply [flat|nested] 127+ messages in thread
* [PATCH v3 10/13] memcg: use static branches when code not in use [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (8 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 09/13] memcg: kmem accounting lifecycle management Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa 2012-10-01 12:25 ` Michal Hocko 2012-09-18 14:04 ` [PATCH v3 11/13] memcg: allow a memcg with kmem charges to be destructed Glauber Costa ` (2 subsequent siblings) 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner We can use static branches to patch the code in or out when not used. Because the _ACTIVATED bit on kmem_accounted is only set once, we guarantee that the root memcg will always be selected until all call sites are patched (see memcg_kmem_enabled). This guarantees that no mischarges are applied. static branch decrement happens when the last reference count from the kmem accounting in memcg dies. This will only happen when the charges drop down to 0. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- include/linux/memcontrol.h | 4 +++- mm/memcontrol.c | 26 ++++++++++++++++++++++++-- 2 files changed, 27 insertions(+), 3 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 82ede9a..4ec9fd5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -22,6 +22,7 @@ #include <linux/cgroup.h> #include <linux/vm_event_item.h> #include <linux/hardirq.h> +#include <linux/jump_label.h> struct mem_cgroup; struct page_cgroup; @@ -401,9 +402,10 @@ struct sock; void sock_update_memcg(struct sock *sk); void sock_release_memcg(struct sock *sk); +extern struct static_key memcg_kmem_enabled_key; static inline bool memcg_kmem_enabled(void) { - return true; + return static_key_false(&memcg_kmem_enabled_key); } extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 720e4bb..aada601 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -467,6 +467,8 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) #include <net/sock.h> #include <net/ip.h> +struct static_key memcg_kmem_enabled_key; + static bool mem_cgroup_is_root(struct mem_cgroup *memcg); static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size); static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size); @@ -624,6 +626,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) WARN_ON(mem_cgroup_is_root(memcg)); memcg_uncharge_kmem(memcg, PAGE_SIZE << order); } + +static void disarm_kmem_keys(struct mem_cgroup *memcg) +{ + if (memcg_kmem_is_accounted(memcg)) + static_key_slow_dec(&memcg_kmem_enabled_key); +} +#else +static void 
disarm_kmem_keys(struct mem_cgroup *memcg) +{ +} #endif /* CONFIG_MEMCG_KMEM */ #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) @@ -639,6 +651,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) } #endif +static void disarm_static_keys(struct mem_cgroup *memcg) +{ + disarm_sock_keys(memcg); + disarm_kmem_keys(memcg); +} + static void drain_all_stock_async(struct mem_cgroup *memcg); static struct mem_cgroup_per_zone * @@ -4131,7 +4149,11 @@ static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val) */ mutex_lock(&set_limit_mutex); if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) { - + /* + * Once the static branch is enabled it will only be + * disabled when the last reference to memcg is gone. + */ + static_key_slow_inc(&memcg_kmem_enabled_key); mem_cgroup_get(memcg); for_each_mem_cgroup_tree(iter, memcg) { @@ -5066,7 +5088,7 @@ static void free_work(struct work_struct *work) * to move this code around, and make sure it is outside * the cgroup_lock. */ - disarm_sock_keys(memcg); + disarm_static_keys(memcg); if (size < PAGE_SIZE) kfree(memcg); else -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
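For readers unfamiliar with jump labels, the pattern this patch relies on is small enough to show in isolation. Below is a sketch against the static_key API of this kernel generation; the feature name and the slow-path function are invented for illustration.

#include <linux/jump_label.h>

static struct static_key my_feature_key = STATIC_KEY_INIT_FALSE;

extern void my_feature_slow_path(void);         /* hypothetical accounting work */

static inline bool my_feature_enabled(void)
{
        /* compiles down to a patchable no-op branch on the fast path */
        return static_key_false(&my_feature_key);
}

void my_hot_path(void)
{
        if (my_feature_enabled())
                my_feature_slow_path();
}

void my_feature_first_user(void)
{
        static_key_slow_inc(&my_feature_key);   /* patches all call sites in */
}

void my_feature_last_user(void)
{
        static_key_slow_dec(&my_feature_key);   /* last user patches them out */
}

This is also why it matters that kmem accounting, once activated, is never disabled: static_key_slow_inc() runs at most once per memcg, and the matching static_key_slow_dec() is deferred to disarm_kmem_keys() when the memcg is finally freed.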
* Re: [PATCH v3 10/13] memcg: use static branches when code not in use 2012-09-18 14:04 ` [PATCH v3 10/13] memcg: use static branches when code not in use Glauber Costa @ 2012-10-01 12:25 ` Michal Hocko [not found] ` <20121001122516.GH8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 12:25 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Tue 18-09-12 18:04:07, Glauber Costa wrote: [...] > include/linux/memcontrol.h | 4 +++- > mm/memcontrol.c | 26 ++++++++++++++++++++++++-- > 2 files changed, 27 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 82ede9a..4ec9fd5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -22,6 +22,7 @@ > #include <linux/cgroup.h> > #include <linux/vm_event_item.h> > #include <linux/hardirq.h> > +#include <linux/jump_label.h> > > struct mem_cgroup; > struct page_cgroup; > @@ -401,9 +402,10 @@ struct sock; > void sock_update_memcg(struct sock *sk); > void sock_release_memcg(struct sock *sk); > > +extern struct static_key memcg_kmem_enabled_key; > static inline bool memcg_kmem_enabled(void) > { > - return true; > + return static_key_false(&memcg_kmem_enabled_key); > } > > extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 720e4bb..aada601 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -467,6 +467,8 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) > #include <net/sock.h> > #include <net/ip.h> > > +struct static_key memcg_kmem_enabled_key; > + > static bool mem_cgroup_is_root(struct mem_cgroup *memcg); > static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size); > static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size); > @@ -624,6 +626,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) > WARN_ON(mem_cgroup_is_root(memcg)); > memcg_uncharge_kmem(memcg, PAGE_SIZE << order); > } > + > +static void disarm_kmem_keys(struct mem_cgroup *memcg) > +{ > + if (memcg_kmem_is_accounted(memcg)) > + static_key_slow_dec(&memcg_kmem_enabled_key); > +} > +#else > +static void disarm_kmem_keys(struct mem_cgroup *memcg) > +{ > +} > #endif /* CONFIG_MEMCG_KMEM */ > > #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) > @@ -639,6 +651,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) > } > #endif > > +static void disarm_static_keys(struct mem_cgroup *memcg) > +{ > + disarm_sock_keys(memcg); > + disarm_kmem_keys(memcg); > +} > + > static void drain_all_stock_async(struct mem_cgroup *memcg); > > static struct mem_cgroup_per_zone * > @@ -4131,7 +4149,11 @@ static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val) > */ > mutex_lock(&set_limit_mutex); > if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) { > - > + /* > + * Once the static branch is enabled it will only be > + * disabled when the last reference to memcg is gone. > + */ > + static_key_slow_inc(&memcg_kmem_enabled_key); I guess the reason why we do not need to inc also for children is that we do not inherit kmem_accounted, right? 
> mem_cgroup_get(memcg); > > for_each_mem_cgroup_tree(iter, memcg) { > @@ -5066,7 +5088,7 @@ static void free_work(struct work_struct *work) > * to move this code around, and make sure it is outside > * the cgroup_lock. > */ > - disarm_sock_keys(memcg); > + disarm_static_keys(memcg); > if (size < PAGE_SIZE) > kfree(memcg); > else > -- > 1.7.11.4 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 10/13] memcg: use static branches when code not in use [not found] ` <20121001122516.GH8622-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-10-01 12:27 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-01 12:27 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On 10/01/2012 04:25 PM, Michal Hocko wrote: > On Tue 18-09-12 18:04:07, Glauber Costa wrote: > [...] >> include/linux/memcontrol.h | 4 +++- >> mm/memcontrol.c | 26 ++++++++++++++++++++++++-- >> 2 files changed, 27 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >> index 82ede9a..4ec9fd5 100644 >> --- a/include/linux/memcontrol.h >> +++ b/include/linux/memcontrol.h >> @@ -22,6 +22,7 @@ >> #include <linux/cgroup.h> >> #include <linux/vm_event_item.h> >> #include <linux/hardirq.h> >> +#include <linux/jump_label.h> >> >> struct mem_cgroup; >> struct page_cgroup; >> @@ -401,9 +402,10 @@ struct sock; >> void sock_update_memcg(struct sock *sk); >> void sock_release_memcg(struct sock *sk); >> >> +extern struct static_key memcg_kmem_enabled_key; >> static inline bool memcg_kmem_enabled(void) >> { >> - return true; >> + return static_key_false(&memcg_kmem_enabled_key); >> } >> >> extern bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 720e4bb..aada601 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -467,6 +467,8 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) >> #include <net/sock.h> >> #include <net/ip.h> >> >> +struct static_key memcg_kmem_enabled_key; >> + >> static bool mem_cgroup_is_root(struct mem_cgroup *memcg); >> static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size); >> static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size); >> @@ -624,6 +626,16 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) >> WARN_ON(mem_cgroup_is_root(memcg)); >> memcg_uncharge_kmem(memcg, PAGE_SIZE << order); >> } >> + >> +static void disarm_kmem_keys(struct mem_cgroup *memcg) >> +{ >> + if (memcg_kmem_is_accounted(memcg)) >> + static_key_slow_dec(&memcg_kmem_enabled_key); >> +} >> +#else >> +static void disarm_kmem_keys(struct mem_cgroup *memcg) >> +{ >> +} >> #endif /* CONFIG_MEMCG_KMEM */ >> >> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) >> @@ -639,6 +651,12 @@ static void disarm_sock_keys(struct mem_cgroup *memcg) >> } >> #endif >> >> +static void disarm_static_keys(struct mem_cgroup *memcg) >> +{ >> + disarm_sock_keys(memcg); >> + disarm_kmem_keys(memcg); >> +} >> + >> static void drain_all_stock_async(struct mem_cgroup *memcg); >> >> static struct mem_cgroup_per_zone * >> @@ -4131,7 +4149,11 @@ static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val) >> */ >> mutex_lock(&set_limit_mutex); >> if ((val != RESOURCE_MAX) && memcg_kmem_set_active(memcg)) { >> - >> + /* >> + * Once the static branch is enabled it will only be >> + * disabled when the last reference to memcg is gone. >> + */ >> + static_key_slow_inc(&memcg_kmem_enabled_key); > > I guess the reason why we do not need to inc also for children is that > we do not inherit kmem_accounted, right? 
> Yes, but I of course changed that in the upcoming version of the patch. We now inherit the value every time, and the static branches are updated accordingly. ^ permalink raw reply [flat|nested] 127+ messages in thread
* [PATCH v3 11/13] memcg: allow a memcg with kmem charges to be destructed. [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (9 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 10/13] memcg: use static branches when code not in use Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa [not found] ` <1347977050-29476-12-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-18 14:04 ` [PATCH v3 12/13] execute the whole memcg freeing in rcu callback Glauber Costa 2012-09-18 14:04 ` [PATCH v3 13/13] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, we can't guarantee that we'll always be able to point a page to a particular process, and migrate the charges along with it - since in the common case, a page will contain data belonging to multiple processes. Because of that, when we destroy a memcg, we only make sure the destruction will succeed by discounting the kmem charges from the user charges when we try to empty the cgroup. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- mm/memcontrol.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aada601..b05ecac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -631,6 +631,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg) { if (memcg_kmem_is_accounted(memcg)) static_key_slow_dec(&memcg_kmem_enabled_key); + /* + * This check can't live in kmem destruction function, + * since the charges will outlive the cgroup + */ + WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0); } #else static void disarm_kmem_keys(struct mem_cgroup *memcg) @@ -3933,6 +3938,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all) int node, zid, shrink; int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; struct cgroup *cgrp = memcg->css.cgroup; + u64 usage; css_get(&memcg->css); @@ -3966,8 +3972,17 @@ move_account: mem_cgroup_end_move(memcg); memcg_oom_recover(memcg); cond_resched(); + /* + * Kernel memory may not necessarily be trackable to a specific + * process. So they are not migrated, and therefore we can't + * expect their value to drop to 0 here. + * + * having res filled up with kmem only is enough + */ + usage = res_counter_read_u64(&memcg->res, RES_USAGE) - + res_counter_read_u64(&memcg->kmem, RES_USAGE); /* "ret" should also be checked to ensure all lists are empty. 
*/ - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); + } while (usage > 0 || ret); out: css_put(&memcg->css); return ret; -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
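The new loop condition in force_empty is easier to check against a concrete snapshot; the numbers below are invented for illustration.

/* Suppose a dying memcg ends up with:
 *
 *      res.usage  == 12288     (user + kernel, total)
 *      kmem.usage == 12288     (all of it kernel memory, e.g. pinned slab)
 *
 * Old condition:  res.usage > 0                 -> force_empty loops forever
 * New condition:  res.usage - kmem.usage == 0   -> destruction proceeds; the
 *                 remaining kmem charges are settled later by the
 *                 mark_dead()/put handshake of patch 09/13.
 */
usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
        res_counter_read_u64(&memcg->kmem, RES_USAGE);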
* Re: [PATCH v3 11/13] memcg: allow a memcg with kmem charges to be destructed. [not found] ` <1347977050-29476-12-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-01 12:30 ` Michal Hocko 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-10-01 12:30 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Tue 18-09-12 18:04:08, Glauber Costa wrote: > Because the ultimate goal of the kmem tracking in memcg is to track slab > pages as well, we can't guarantee that we'll always be able to point a > page to a particular process, and migrate the charges along with it - > since in the common case, a page will contain data belonging to multiple > processes. > > Because of that, when we destroy a memcg, we only make sure the > destruction will succeed by discounting the kmem charges from the user > charges when we try to empty the cgroup. > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> > CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> > CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> > CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Looks good. Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > --- > mm/memcontrol.c | 17 ++++++++++++++++- > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index aada601..b05ecac 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -631,6 +631,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg) > { > if (memcg_kmem_is_accounted(memcg)) > static_key_slow_dec(&memcg_kmem_enabled_key); > + /* > + * This check can't live in kmem destruction function, > + * since the charges will outlive the cgroup > + */ > + WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0); > } > #else > static void disarm_kmem_keys(struct mem_cgroup *memcg) > @@ -3933,6 +3938,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all) > int node, zid, shrink; > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > struct cgroup *cgrp = memcg->css.cgroup; > + u64 usage; > > css_get(&memcg->css); > > @@ -3966,8 +3972,17 @@ move_account: > mem_cgroup_end_move(memcg); > memcg_oom_recover(memcg); > cond_resched(); > + /* > + * Kernel memory may not necessarily be trackable to a specific > + * process. So they are not migrated, and therefore we can't > + * expect their value to drop to 0 here. > + * > + * having res filled up with kmem only is enough > + */ > + usage = res_counter_read_u64(&memcg->res, RES_USAGE) - > + res_counter_read_u64(&memcg->kmem, RES_USAGE); > /* "ret" should also be checked to ensure all lists are empty. 
*/ > - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); > + } while (usage > 0 || ret); > out: > css_put(&memcg->css); > return ret; > -- > 1.7.11.4 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
* [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (10 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 11/13] memcg: allow a memcg with kmem charges to be destructed Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa [not found] ` <1347977050-29476-13-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-18 14:04 ` [PATCH v3 13/13] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Michal Hocko, Johannes Weiner A lot of the initialization we do in mem_cgroup_create() is done with softirqs enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This means that the freeing of memcg structure must happen in a compatible context, otherwise we'll get a deadlock. The reference counting mechanism we use allows the memcg structure to be freed later and outlive the actual memcg destruction from the filesystem. However, we have little, if any, means to guarantee in which context the last memcg_put will happen. The best we can do is test it and try to make sure no invalid context releases are happening. But as we add more code to memcg, the possible interactions grow in number and expose more ways to get context conflicts. We already moved a part of the freeing to a worker thread to be context-safe for the static branches disabling. I see no reason not to do it for the whole freeing action. I consider this to be the safe choice. Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Tested-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> --- mm/memcontrol.c | 66 +++++++++++++++++++++++++++++---------------------------- 1 file changed, 34 insertions(+), 32 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b05ecac..74654f0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5082,16 +5082,29 @@ out_free: } /* - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU, - * but in process context. The work_freeing structure is overlaid - * on the rcu_freeing structure, which itself is overlaid on memsw. + * At destroying mem_cgroup, references from swap_cgroup can remain. + * (scanning all at force_empty is too costly...) + * + * Instead of clearing all references at force_empty, we remember + * the number of reference from swap_cgroup and free mem_cgroup when + * it goes down to 0. + * + * Removal of cgroup itself succeeds regardless of refs from swap. 
*/ -static void free_work(struct work_struct *work) + +static void __mem_cgroup_free(struct mem_cgroup *memcg) { - struct mem_cgroup *memcg; + int node; int size = sizeof(struct mem_cgroup); - memcg = container_of(work, struct mem_cgroup, work_freeing); + mem_cgroup_remove_from_trees(memcg); + free_css_id(&mem_cgroup_subsys, &memcg->css); + + for_each_node(node) + free_mem_cgroup_per_zone_info(memcg, node); + + free_percpu(memcg->stat); + /* * We need to make sure that (at least for now), the jump label * destruction code runs outside of the cgroup lock. This is because @@ -5110,38 +5123,27 @@ static void free_work(struct work_struct *work) vfree(memcg); } -static void free_rcu(struct rcu_head *rcu_head) -{ - struct mem_cgroup *memcg; - - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); - INIT_WORK(&memcg->work_freeing, free_work); - schedule_work(&memcg->work_freeing); -} /* - * At destroying mem_cgroup, references from swap_cgroup can remain. - * (scanning all at force_empty is too costly...) - * - * Instead of clearing all references at force_empty, we remember - * the number of reference from swap_cgroup and free mem_cgroup when - * it goes down to 0. - * - * Removal of cgroup itself succeeds regardless of refs from swap. + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU, + * but in process context. The work_freeing structure is overlaid + * on the rcu_freeing structure, which itself is overlaid on memsw. */ - -static void __mem_cgroup_free(struct mem_cgroup *memcg) +static void free_work(struct work_struct *work) { - int node; + struct mem_cgroup *memcg; - mem_cgroup_remove_from_trees(memcg); - free_css_id(&mem_cgroup_subsys, &memcg->css); + memcg = container_of(work, struct mem_cgroup, work_freeing); + __mem_cgroup_free(memcg); +} - for_each_node(node) - free_mem_cgroup_per_zone_info(memcg, node); +static void free_rcu(struct rcu_head *rcu_head) +{ + struct mem_cgroup *memcg; - free_percpu(memcg->stat); - call_rcu(&memcg->rcu_freeing, free_rcu); + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); + INIT_WORK(&memcg->work_freeing, free_work); + schedule_work(&memcg->work_freeing); } static void mem_cgroup_get(struct mem_cgroup *memcg) @@ -5153,7 +5155,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) { if (atomic_sub_and_test(count, &memcg->refcnt)) { struct mem_cgroup *parent = parent_mem_cgroup(memcg); - __mem_cgroup_free(memcg); + call_rcu(&memcg->rcu_freeing, free_rcu); if (parent) mem_cgroup_put(parent); } -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
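Stripped of the memcg specifics, what this patch generalizes is a common two-stage deferral idiom: wait out an RCU grace period (the callback runs in softirq context), then push the actual teardown to a worker, where process context allows sleeping locks and the jump-label code. A generic sketch with invented type and function names:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct obj {
        struct rcu_head rcu;
        struct work_struct work;
        /* ... payload ... */
};

static void obj_free_work(struct work_struct *work)
{
        struct obj *o = container_of(work, struct obj, work);

        /* process context: mutexes, cgroup_lock, jump labels are all fine */
        kfree(o);
}

static void obj_free_rcu(struct rcu_head *head)
{
        struct obj *o = container_of(head, struct obj, rcu);

        /* softirq context: do nothing here but queue the real work */
        INIT_WORK(&o->work, obj_free_work);
        schedule_work(&o->work);
}

static void obj_put_last(struct obj *o)
{
        call_rcu(&o->rcu, obj_free_rcu);        /* grace period first */
}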
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [not found] ` <1347977050-29476-13-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-21 17:23 ` Tejun Heo [not found] ` <20120921172355.GD7264-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-10-01 13:27 ` Michal Hocko 1 sibling, 1 reply; 127+ messages in thread From: Tejun Heo @ 2012-09-21 17:23 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Michal Hocko, Johannes Weiner Hello, Glauber. On Tue, Sep 18, 2012 at 06:04:09PM +0400, Glauber Costa wrote: > A lot of the initialization we do in mem_cgroup_create() is done with softirqs > enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and > the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the > lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This > means that the freeing of memcg structure must happen in a compatible context, > otherwise we'll get a deadlock. Lockdep requires lock to be softirq or irq safe iff the lock is actually acquired from the said context. Merely using a lock with bh / irq disabled doesn't signal that to lockdep; otherwise, we'll end up with enormous number of spurious warnings. > The reference counting mechanism we use allows the memcg structure to be freed > later and outlive the actual memcg destruction from the filesystem. However, we > have little, if any, means to guarantee in which context the last memcg_put > will happen. The best we can do is test it and try to make sure no invalid > context releases are happening. But as we add more code to memcg, the possible > interactions grow in number and expose more ways to get context conflicts. > > We already moved a part of the freeing to a worker thread to be context-safe > for the static branches disabling. I see no reason not to do it for the whole > freeing action. I consider this to be the safe choice. And the above description too makes me scratch my head quite a bit. I can see what the patch is doing but can't understand the why. * Why was it punting the freeing to workqueue anyway? ISTR something about static_keys but my memory fails. What changed? Why don't we need it anymore? * As for locking context, the above description seems a bit misleading to me. Synchronization constructs involved there currently doesn't require softirq or irq safe context. If that needs to change, that's fine but that's a completely different reason than given above. Thanks. -- tejun ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [not found] ` <20120921172355.GD7264-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-24 8:48 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-09-24 8:48 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Michal Hocko, Johannes Weiner > And the above description too makes me scratch my head quite a bit. I > can see what the patch is doing but can't understand the why. > > * Why was it punting the freeing to workqueue anyway? ISTR something > about static_keys but my memory fails. What changed? Why don't we > need it anymore? > > * As for locking context, the above description seems a bit misleading > to me. Synchronization constructs involved there currently doesn't > require softirq or irq safe context. If that needs to change, > that's fine but that's a completely different reason than given > above. > > Thanks. > I just suck at changelogs =( The problem here is very much like the one we had with static branches. In that case, the problem was that the cgroup_lock() was held, so the jump label lock could not be taken. Here, after the kmem patches are in, the destruction function can be called directly from memcg_kmem_uncharge_page() when the last put is done. And that can happen from the page allocator, in an incompatible softirq context. So it is not just that it could be called from that context: at this point it actually is called from that context. ^ permalink raw reply [flat|nested] 127+ messages in thread
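The deadlock Glauber describes is the classic lockdep inconsistency between SOFTIRQ-ON-W and IN-SOFTIRQ-W usage of one lock. Below is a minimal sketch of the two usages that must not coexist; the lock name is borrowed from the splat quoted in a later message, everything else is invented for illustration.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(id_lock);

/* process context, softirqs enabled: lockdep records SOFTIRQ-ON-W */
void create_path(void)
{
        spin_lock(&id_lock);
        /* a softirq firing on this CPU right now would spin on id_lock
         * forever, since we cannot run to release it: deadlock */
        spin_unlock(&id_lock);
}

/* softirq context, e.g. an RCU callback: lockdep records IN-SOFTIRQ-W
 * and reports the inconsistency */
void free_path_softirq(void)
{
        spin_lock(&id_lock);
        spin_unlock(&id_lock);
}

/* either make every user softirq-safe ... */
void create_path_fixed(void)
{
        spin_lock_bh(&id_lock);
        spin_unlock_bh(&id_lock);
}
/* ... or, as this patch does, move the softirq-side user into a worker
 * so the lock is only ever taken from process context. */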
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [not found] ` <1347977050-29476-13-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-21 17:23 ` Tejun Heo @ 2012-10-01 13:27 ` Michal Hocko 2012-10-04 10:53 ` Glauber Costa 1 sibling, 1 reply; 127+ messages in thread From: Michal Hocko @ 2012-10-01 13:27 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On Tue 18-09-12 18:04:09, Glauber Costa wrote: > A lot of the initialization we do in mem_cgroup_create() is done with softirqs > enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and > the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the > lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This > means that the freeing of memcg structure must happen in a compatible context, > otherwise we'll get a deadlock. Maybe I am missing something obvious but why cannot we simply disble (soft)irqs in mem_cgroup_create rather than make the free path much more complicated. It really feels strange to defer everything (e.g. soft reclaim tree cleanup which should be a no-op at the time because there shouldn't be any user pages in the group). > The reference counting mechanism we use allows the memcg structure to be freed > later and outlive the actual memcg destruction from the filesystem. However, we > have little, if any, means to guarantee in which context the last memcg_put > will happen. The best we can do is test it and try to make sure no invalid > context releases are happening. But as we add more code to memcg, the possible > interactions grow in number and expose more ways to get context conflicts. > > We already moved a part of the freeing to a worker thread to be context-safe > for the static branches disabling. I see no reason not to do it for the whole > freeing action. I consider this to be the safe choice. > > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > Tested-by: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> > CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > --- > mm/memcontrol.c | 66 +++++++++++++++++++++++++++++---------------------------- > 1 file changed, 34 insertions(+), 32 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b05ecac..74654f0 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5082,16 +5082,29 @@ out_free: > } > > /* > - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU, > - * but in process context. The work_freeing structure is overlaid > - * on the rcu_freeing structure, which itself is overlaid on memsw. > + * At destroying mem_cgroup, references from swap_cgroup can remain. > + * (scanning all at force_empty is too costly...) > + * > + * Instead of clearing all references at force_empty, we remember > + * the number of reference from swap_cgroup and free mem_cgroup when > + * it goes down to 0. > + * > + * Removal of cgroup itself succeeds regardless of refs from swap. 
> */ > -static void free_work(struct work_struct *work) > + > +static void __mem_cgroup_free(struct mem_cgroup *memcg) > { > - struct mem_cgroup *memcg; > + int node; > int size = sizeof(struct mem_cgroup); > > - memcg = container_of(work, struct mem_cgroup, work_freeing); > + mem_cgroup_remove_from_trees(memcg); > + free_css_id(&mem_cgroup_subsys, &memcg->css); > + > + for_each_node(node) > + free_mem_cgroup_per_zone_info(memcg, node); > + > + free_percpu(memcg->stat); > + > /* > * We need to make sure that (at least for now), the jump label > * destruction code runs outside of the cgroup lock. This is because > @@ -5110,38 +5123,27 @@ static void free_work(struct work_struct *work) > vfree(memcg); > } > > -static void free_rcu(struct rcu_head *rcu_head) > -{ > - struct mem_cgroup *memcg; > - > - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); > - INIT_WORK(&memcg->work_freeing, free_work); > - schedule_work(&memcg->work_freeing); > -} > > /* > - * At destroying mem_cgroup, references from swap_cgroup can remain. > - * (scanning all at force_empty is too costly...) > - * > - * Instead of clearing all references at force_empty, we remember > - * the number of reference from swap_cgroup and free mem_cgroup when > - * it goes down to 0. > - * > - * Removal of cgroup itself succeeds regardless of refs from swap. > + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU, > + * but in process context. The work_freeing structure is overlaid > + * on the rcu_freeing structure, which itself is overlaid on memsw. > */ > - > -static void __mem_cgroup_free(struct mem_cgroup *memcg) > +static void free_work(struct work_struct *work) > { > - int node; > + struct mem_cgroup *memcg; > > - mem_cgroup_remove_from_trees(memcg); > - free_css_id(&mem_cgroup_subsys, &memcg->css); > + memcg = container_of(work, struct mem_cgroup, work_freeing); > + __mem_cgroup_free(memcg); > +} > > - for_each_node(node) > - free_mem_cgroup_per_zone_info(memcg, node); > +static void free_rcu(struct rcu_head *rcu_head) > +{ > + struct mem_cgroup *memcg; > > - free_percpu(memcg->stat); > - call_rcu(&memcg->rcu_freeing, free_rcu); > + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing); > + INIT_WORK(&memcg->work_freeing, free_work); > + schedule_work(&memcg->work_freeing); > } > > static void mem_cgroup_get(struct mem_cgroup *memcg) > @@ -5153,7 +5155,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count) > { > if (atomic_sub_and_test(count, &memcg->refcnt)) { > struct mem_cgroup *parent = parent_mem_cgroup(memcg); > - __mem_cgroup_free(memcg); > + call_rcu(&memcg->rcu_freeing, free_rcu); > if (parent) > mem_cgroup_put(parent); > } > -- > 1.7.11.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread
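The net effect of the patch above, summarized as a call chain (derived directly from the diff; no new code):

	/*
	 * Teardown path after this patch:
	 *
	 *   mem_cgroup_put()                  - last put may run in softirq
	 *     -> call_rcu(&memcg->rcu_freeing, free_rcu)
	 *          -> free_rcu()              - RCU callback, still softirq
	 *               -> schedule_work(&memcg->work_freeing)
	 *                    -> free_work()   - process context at last
	 *                         -> __mem_cgroup_free()
	 *                              - mem_cgroup_remove_from_trees()
	 *                              - free_css_id()
	 *                              - free_percpu(memcg->stat)
	 *                              - vfree()/kfree() of the memcg itself
	 */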
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback 2012-10-01 13:27 ` Michal Hocko @ 2012-10-04 10:53 ` Glauber Costa 2012-10-04 14:20 ` Glauber Costa [not found] ` <506D6A99.7070800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 2 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-04 10:53 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 10/01/2012 05:27 PM, Michal Hocko wrote: > On Tue 18-09-12 18:04:09, Glauber Costa wrote: >> A lot of the initialization we do in mem_cgroup_create() is done with softirqs >> enabled. This includes grabbing a css id, which holds &ss->id_lock->rlock, and >> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the >> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This >> means that the freeing of memcg structure must happen in a compatible context, >> otherwise we'll get a deadlock. > > Maybe I am missing something obvious but why cannot we simply disable > (soft)irqs in mem_cgroup_create rather than make the free path much more > complicated. It really feels strange to defer everything (e.g. soft > reclaim tree cleanup which should be a no-op at the time because there > shouldn't be any user pages in the group). > Ok. I was just able to come back to this today - I was mostly working on the slab feedback over the past few days. I will answer yours and Tejun's concerns at once: Here is the situation: the backtrace I get is this one: [ 124.956725] ================================= [ 124.957217] [ INFO: inconsistent lock state ] [ 124.957217] 3.5.0+ #99 Not tainted [ 124.957217] --------------------------------- [ 124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 124.957217] (&(&ss->id_lock)->rlock){+.?...}, at: [<ffffffff810aa7b2>] spin_lock+0x9/0xb [ 124.957217] {SOFTIRQ-ON-W} state was registered at: [ 124.957217] [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68 [ 124.957217] [<ffffffff8109a660>] lock_acquire+0x108/0x15c [ 124.957217] [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f [ 124.957217] [<ffffffff810aa7b2>] spin_lock+0x9/0xb [ 124.957217] [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3 [ 124.957217] [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60 [ 124.957217] [<ffffffff81b20e04>] cgroup_init+0x50/0x100 [ 124.957217] [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee [ 124.957217] [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5 [ 124.957217] [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b So what we learn from it is this: we are acquiring a specific lock (the css id one) from softirq context, while it was previously taken in a softirq-enabled context, which seems to come directly from get_new_cssid. Tejun pointed out that we should never acquire that lock from softirq context, and he is right. But the situation changes slightly with kmem.
Now, the following excerpt of a backtrace is possible: [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22 [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98 [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277 So as you can see, free_accounted_pages() (which will trigger a memcg_put() -> mem_cgroup_free()) can now be called from softirq context, namely an rcu callback (and I just realized I wrote the exact opposite in the subject line: man, I really suck at that!!). As a matter of fact, we cannot just move the freeing to an rcu callback of our own: we need to move it to a worker thread with the rest. We already have a worker thread; the reason we have it is not static_branches. The reason is vfree(), which will BUG_ON(in_interrupt()) and therefore cannot be called from an rcu callback either. We moved the static branch disabling in there as well for a similar problem, but that is not why the worker was introduced. Could we move just part of it to the worker thread? Absolutely yes. Moving just free_css_id() is enough to make it work. But since this is not the first context-related problem we have had, I thought: "to hell with that, let's move everything and be safe". I am fine with moving only free_css_id() if you would prefer. Can we disable softirqs when we initialize css_id? Maybe. My machine seems to boot fine and survive the simple workload that would trigger that bug if I use irqsave spinlocks instead of normal spinlocks. But this has to be done from cgroup core: we have no control over css creation in memcg. How would you guys like me to handle this? ^ permalink raw reply [flat|nested] 127+ messages in thread
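For comparison, the alternative floated at the end of the mail above (making the css-id lock softirq safe in cgroup core) would amount to something like the following sketch. It only illustrates the spin_lock -> spin_lock_irqsave conversion that was boot-tested; the real get_new_cssid() has more to it, and the elided parts are assumptions:

	static int get_new_cssid(struct cgroup_subsys *ss, int depth)
	{
		struct css_id *newid;	/* allocated in the elided part */
		unsigned long flags;
		int myid, error;

		/* ... allocation and idr_pre_get() retry loop elided ... */
		spin_lock_irqsave(&ss->id_lock, flags);		/* was: spin_lock() */
		error = idr_get_new_above(&ss->idr, newid, 1, &myid);
		spin_unlock_irqrestore(&ss->id_lock, flags);	/* was: spin_unlock() */
		/* ... error handling and id setup elided ... */
		return myid;
	}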
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback 2012-10-04 10:53 ` Glauber Costa @ 2012-10-04 14:20 ` Glauber Costa [not found] ` <506D6A99.7070800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 1 sibling, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-04 14:20 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Johannes Weiner On 10/04/2012 02:53 PM, Glauber Costa wrote: > On 10/01/2012 05:27 PM, Michal Hocko wrote: >> On Tue 18-09-12 18:04:09, Glauber Costa wrote: >>> A lot of the initialization we do in mem_cgroup_create() is done with softirqs >>> enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and >>> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the >>> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This >>> means that the freeing of memcg structure must happen in a compatible context, >>> otherwise we'll get a deadlock. >> >> Maybe I am missing something obvious but why cannot we simply disble >> (soft)irqs in mem_cgroup_create rather than make the free path much more >> complicated. It really feels strange to defer everything (e.g. soft >> reclaim tree cleanup which should be a no-op at the time because there >> shouldn't be any user pages in the group). >> > > Ok. > > I was just able to come back to this today - I was mostly working on the > slab feedback over the past few days. I will answer yours and Tejun's > concerns at once: > > Here is the situation: the backtrace I get is this one: > > [ 124.956725] ================================= > [ 124.957217] [ INFO: inconsistent lock state ] > [ 124.957217] 3.5.0+ #99 Not tainted > [ 124.957217] --------------------------------- > [ 124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. > [ 124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes: > [ 124.957217] (&(&ss->id_lock)->rlock){+.?...}, at: > [<ffffffff810aa7b2>] spin_lock+0x9/0xb > [ 124.957217] {SOFTIRQ-ON-W} state was registered at: > [ 124.957217] [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68 > [ 124.957217] [<ffffffff8109a660>] lock_acquire+0x108/0x15c > [ 124.957217] [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f > [ 124.957217] [<ffffffff810aa7b2>] spin_lock+0x9/0xb > [ 124.957217] [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3 > [ 124.957217] [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60 > [ 124.957217] [<ffffffff81b20e04>] cgroup_init+0x50/0x100 > [ 124.957217] [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee > [ 124.957217] [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5 > [ 124.957217] [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b > > > So what we learn from it, is: we are acquiring a specific lock (the css > id one) from softirq context. It was previously taken in a > softirq-enabled context, that seems to be coming directly from > get_new_cssid. > > Tejun correctly pointed out that we should never acquire that lock from > a softirq context, in which he is right. > > But the situation changes slightly with kmem. 
Now, the following excerpt > of a backtrace is possible: > > [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c > [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c > [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb > [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22 > [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98 > [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df > [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b > [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277 > > So as you can see, free_accounted_pages (that will trigger a memcg_put() > -> mem_cgroup_free()) can now be called from softirq context, which is, > an rcu callback (and I just realized I wrote the exact opposite in the > subj line: man, I really suck at that!!) > As a matter of fact, we could not move to our rcu callback as well: > > we need to move it to a worker thread with the rest. > > We already have a worker thread: he reason we have it is not > static_branches: The reason is vfree(), that will BUG_ON(in_interrupt()) > and could not be called from rcu callback as well. We moved static > branches in there as well for a similar problem, but haven't introduced it. > > Could we move just part of it to the worker thread? Absolutely yes. > Moving just free_css_id() is enough to make it work. But since it is not > the first context related problem we had, I thought: "to hell with that, > let's move everything and be safe". > > I am fine moving free_css_id() only if you would prefer. > > Can we disable softirqs when we initialize css_id? Maybe. My machine > seems to boot fine and survive the simple workload that would trigger > that bug if I use irqsave spinlocks instead of normal spinlocks. But > this has to be done from cgroup core: We have no control over css > creation in memcg. > > How would you guys like me to handle this ? One more thing: As I mentioned in the Changelog, mem_cgroup_remove_exceeded(), called from mem_cgroup_remove_from_trees() will lead to the same usage pattern. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
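The second offender mentioned above is easy to see in the source. A sketch, simplified from mainline of that era (treat the details as approximate): mem_cgroup_remove_from_trees() walks every node/zone and calls mem_cgroup_remove_exceeded(), which takes the per-zone soft-limit tree lock, the rtpz->lock->rlock from the changelog, without disabling softirqs, so reaching it from the softirq free path trips the same SOFTIRQ-ON-W -> IN-SOFTIRQ-W splat as ss->id_lock:

	static void mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
					       struct mem_cgroup_per_zone *mz,
					       struct mem_cgroup_tree_per_zone *mctz)
	{
		spin_lock(&mctz->lock);		/* registered SOFTIRQ-ON-W at charge time */
		__mem_cgroup_remove_exceeded(memcg, mz, mctz);
		spin_unlock(&mctz->lock);
	}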
[parent not found: <506D6A99.7070800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback [not found] ` <506D6A99.7070800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-05 15:31 ` Johannes Weiner 2012-10-08 9:45 ` Glauber Costa 0 siblings, 1 reply; 127+ messages in thread From: Johannes Weiner @ 2012-10-05 15:31 UTC (permalink / raw) To: Glauber Costa Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes On Thu, Oct 04, 2012 at 02:53:13PM +0400, Glauber Costa wrote: > On 10/01/2012 05:27 PM, Michal Hocko wrote: > > On Tue 18-09-12 18:04:09, Glauber Costa wrote: > >> A lot of the initialization we do in mem_cgroup_create() is done with softirqs > >> enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and > >> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the > >> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This > >> means that the freeing of memcg structure must happen in a compatible context, > >> otherwise we'll get a deadlock. > > > > Maybe I am missing something obvious but why cannot we simply disble > > (soft)irqs in mem_cgroup_create rather than make the free path much more > > complicated. It really feels strange to defer everything (e.g. soft > > reclaim tree cleanup which should be a no-op at the time because there > > shouldn't be any user pages in the group). > > > > Ok. > > I was just able to come back to this today - I was mostly working on the > slab feedback over the past few days. I will answer yours and Tejun's > concerns at once: > > Here is the situation: the backtrace I get is this one: > > [ 124.956725] ================================= > [ 124.957217] [ INFO: inconsistent lock state ] > [ 124.957217] 3.5.0+ #99 Not tainted > [ 124.957217] --------------------------------- > [ 124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. > [ 124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes: > [ 124.957217] (&(&ss->id_lock)->rlock){+.?...}, at: > [<ffffffff810aa7b2>] spin_lock+0x9/0xb > [ 124.957217] {SOFTIRQ-ON-W} state was registered at: > [ 124.957217] [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68 > [ 124.957217] [<ffffffff8109a660>] lock_acquire+0x108/0x15c > [ 124.957217] [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f > [ 124.957217] [<ffffffff810aa7b2>] spin_lock+0x9/0xb > [ 124.957217] [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3 > [ 124.957217] [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60 > [ 124.957217] [<ffffffff81b20e04>] cgroup_init+0x50/0x100 > [ 124.957217] [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee > [ 124.957217] [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5 > [ 124.957217] [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b > > > So what we learn from it, is: we are acquiring a specific lock (the css > id one) from softirq context. It was previously taken in a > softirq-enabled context, that seems to be coming directly from > get_new_cssid. > > Tejun correctly pointed out that we should never acquire that lock from > a softirq context, in which he is right. > > But the situation changes slightly with kmem. 
Now, the following excerpt > of a backtrace is possible: > > [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c > [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c > [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb > [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22 > [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98 > [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df > [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b > [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277 > > So as you can see, free_accounted_pages (that will trigger a memcg_put() > -> mem_cgroup_free()) can now be called from softirq context, which is, > an rcu callback (and I just realized I wrote the exact opposite in the > subj line: man, I really suck at that!!) > As a matter of fact, we could not move to our rcu callback as well: > > we need to move it to a worker thread with the rest. > > We already have a worker thread: he reason we have it is not > static_branches: The reason is vfree(), that will BUG_ON(in_interrupt()) > and could not be called from rcu callback as well. We moved static > branches in there as well for a similar problem, but haven't introduced it. > > Could we move just part of it to the worker thread? Absolutely yes. > Moving just free_css_id() is enough to make it work. But since it is not > the first context related problem we had, I thought: "to hell with that, > let's move everything and be safe". > > I am fine moving free_css_id() only if you would prefer. > > Can we disable softirqs when we initialize css_id? Maybe. My machine > seems to boot fine and survive the simple workload that would trigger > that bug if I use irqsave spinlocks instead of normal spinlocks. But > this has to be done from cgroup core: We have no control over css > creation in memcg. > > How would you guys like me to handle this ? Without the vfree callback, I would have preferred just making the id_lock softirq safe. But since we have to defer (parts of) freeing anyway, I like your approach of just deferring the rest as well better. But please add comments why the stuff in there is actually deferred. Just simple notes like: "this can be called from atomic contexts, <examples>", "vfree must run from process context" and "css_id locking is not soft irq safe", "to hell with that, let's just do everything from the workqueue and be safe and simple". (And this may be personal preference, but why have free_work call __mem_cgroup_free()? Does anyone else need to call that code? There are too many layers already, why not just keep it all in free_work() and have one less stack frame on your mind? :)) As for the changelog, here is my attempt: --- mm: memcg: defer whole memcg tear-down to workqueue The final memcg put can already happen in atomic context and so the freeing is deferred to a workqueue because it needs to use vfree(). 
Kmem tracking will add freeing from softirq context, but the id_lock acquired when destroying the cgroup object is not softirq safe, e.g.: > [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c > [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c > [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb > [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22 > [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98 > [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df > [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b > [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277 To avoid making tear-down too complicated - making locks softirq safe, having half the cleanup in one function and the other half somewhere else - just defer everything to the workqueue. ^ permalink raw reply [flat|nested] 127+ messages in thread
* Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback 2012-10-05 15:31 ` Johannes Weiner @ 2012-10-08 9:45 ` Glauber Costa 0 siblings, 0 replies; 127+ messages in thread From: Glauber Costa @ 2012-10-08 9:45 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-kernel, cgroups, kamezawa.hiroyu, devel, Tejun Heo, linux-mm, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes On 10/05/2012 07:31 PM, Johannes Weiner wrote: > On Thu, Oct 04, 2012 at 02:53:13PM +0400, Glauber Costa wrote: >> On 10/01/2012 05:27 PM, Michal Hocko wrote: >>> On Tue 18-09-12 18:04:09, Glauber Costa wrote: >>>> A lot of the initialization we do in mem_cgroup_create() is done with softirqs >>>> enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and >>>> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the >>>> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This >>>> means that the freeing of memcg structure must happen in a compatible context, >>>> otherwise we'll get a deadlock. >>> >>> Maybe I am missing something obvious but why cannot we simply disble >>> (soft)irqs in mem_cgroup_create rather than make the free path much more >>> complicated. It really feels strange to defer everything (e.g. soft >>> reclaim tree cleanup which should be a no-op at the time because there >>> shouldn't be any user pages in the group). >>> >> >> Ok. >> >> I was just able to come back to this today - I was mostly working on the >> slab feedback over the past few days. I will answer yours and Tejun's >> concerns at once: >> >> Here is the situation: the backtrace I get is this one: >> >> [ 124.956725] ================================= >> [ 124.957217] [ INFO: inconsistent lock state ] >> [ 124.957217] 3.5.0+ #99 Not tainted >> [ 124.957217] --------------------------------- >> [ 124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. >> [ 124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes: >> [ 124.957217] (&(&ss->id_lock)->rlock){+.?...}, at: >> [<ffffffff810aa7b2>] spin_lock+0x9/0xb >> [ 124.957217] {SOFTIRQ-ON-W} state was registered at: >> [ 124.957217] [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68 >> [ 124.957217] [<ffffffff8109a660>] lock_acquire+0x108/0x15c >> [ 124.957217] [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f >> [ 124.957217] [<ffffffff810aa7b2>] spin_lock+0x9/0xb >> [ 124.957217] [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3 >> [ 124.957217] [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60 >> [ 124.957217] [<ffffffff81b20e04>] cgroup_init+0x50/0x100 >> [ 124.957217] [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee >> [ 124.957217] [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5 >> [ 124.957217] [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b >> >> >> So what we learn from it, is: we are acquiring a specific lock (the css >> id one) from softirq context. It was previously taken in a >> softirq-enabled context, that seems to be coming directly from >> get_new_cssid. >> >> Tejun correctly pointed out that we should never acquire that lock from >> a softirq context, in which he is right. >> >> But the situation changes slightly with kmem. 
Now, the following excerpt >> of a backtrace is possible: >> >> [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c >> [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c >> [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb >> [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22 >> [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98 >> [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df >> [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b >> [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277 >> >> So as you can see, free_accounted_pages (that will trigger a memcg_put() >> -> mem_cgroup_free()) can now be called from softirq context, which is, >> an rcu callback (and I just realized I wrote the exact opposite in the >> subj line: man, I really suck at that!!) >> As a matter of fact, we could not move to our rcu callback as well: >> >> we need to move it to a worker thread with the rest. >> >> We already have a worker thread: he reason we have it is not >> static_branches: The reason is vfree(), that will BUG_ON(in_interrupt()) >> and could not be called from rcu callback as well. We moved static >> branches in there as well for a similar problem, but haven't introduced it. >> >> Could we move just part of it to the worker thread? Absolutely yes. >> Moving just free_css_id() is enough to make it work. But since it is not >> the first context related problem we had, I thought: "to hell with that, >> let's move everything and be safe". >> >> I am fine moving free_css_id() only if you would prefer. >> >> Can we disable softirqs when we initialize css_id? Maybe. My machine >> seems to boot fine and survive the simple workload that would trigger >> that bug if I use irqsave spinlocks instead of normal spinlocks. But >> this has to be done from cgroup core: We have no control over css >> creation in memcg. >> >> How would you guys like me to handle this ? > > Without the vfree callback, I would have preferred just making the > id_lock softirq safe. But since we have to defer (parts of) freeing > anyway, I like your approach of just deferring the rest as well > better. > > But please add comments why the stuff in there is actually deferred. > Just simple notes like: > > "this can be called from atomic contexts, <examples>", > > "vfree must run from process context" and "css_id locking is not soft > irq safe", > > "to hell with that, let's just do everything from the workqueue and be > safe and simple". > > (And this may be personal preference, but why have free_work call > __mem_cgroup_free()? Does anyone else need to call that code? There > are too many layers already, why not just keep it all in free_work() > and have one less stack frame on your mind? :)) > It is used when create fails. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 127+ messages in thread
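Concretely, the create-failure path referred to in that last answer would look roughly like the sketch below (labels, names and ordering are assumptions), which is why __mem_cgroup_free() survives as a separate helper instead of being folded into free_work(): the create path can free directly, since it runs in process context with no RCU/workqueue detour needed.

	static struct cgroup_subsys_state *mem_cgroup_create(struct cgroup *cont)
	{
		struct mem_cgroup *memcg;
		int node;

		memcg = mem_cgroup_alloc();
		if (!memcg)
			return ERR_PTR(-ENOMEM);

		for_each_node(node)
			if (alloc_mem_cgroup_per_zone_info(memcg, node))
				goto free_out;

		/* ... the rest of the initialization ... */
		return &memcg->css;

	free_out:
		__mem_cgroup_free(memcg);	/* process context: direct call is fine */
		return ERR_PTR(-ENOMEM);
	}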
* [PATCH v3 13/13] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [not found] ` <1347977050-29476-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> ` (11 preceding siblings ...) 2012-09-18 14:04 ` [PATCH v3 12/13] execute the whole memcg freeing in rcu callback Glauber Costa @ 2012-09-18 14:04 ` Glauber Costa [not found] ` <1347977050-29476-14-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 12 siblings, 1 reply; 127+ messages in thread From: Glauber Costa @ 2012-09-18 14:04 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Glauber Costa, Christoph Lameter, Pekka Enberg, Michal Hocko, Johannes Weiner Because those architectures will draw their stacks directly from the page allocator, rather than the slab cache, we can directly pass the __GFP_KMEMCG flag and issue the corresponding free_pages call. This code path is taken when the architecture doesn't define CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining architectures fall in this category. This will guarantee that every stack page is accounted to the memcg the process currently lives on, and will have the allocations fail if they go over the limit. For the time being, I am defining a new variant of THREADINFO_GFP so as not to mess with the other path. Once the slab is also tracked by memcg, we can get rid of that flag. Tested to successfully protect against :(){ :|:& };: Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> Acked-by: Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- include/linux/thread_info.h | 2 ++ kernel/fork.c | 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h index ccc1899..e7e0473 100644 --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm); # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK) #endif +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG) + /* * flag set/clear/test wrappers * - pass TIF_xxxx constants to these functions diff --git a/kernel/fork.c b/kernel/fork.c index 0ff2bf7..897e89c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { - struct page *page = alloc_pages_node(node, THREADINFO_GFP, + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED, THREAD_SIZE_ORDER); return page ?
page_address(page) : NULL; @@ -154,7 +154,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, static inline void free_thread_info(struct thread_info *ti) { - free_pages((unsigned long)ti, THREAD_SIZE_ORDER); + free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER); } # else static struct kmem_cache *thread_info_cache; -- 1.7.11.4 ^ permalink raw reply related [flat|nested] 127+ messages in thread
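The allocator-side counterpart that makes this patch work lands in patches 06-07 of the series. A sketch of what allocating with __GFP_KMEMCG boils down to: the function name alloc_accounted_stack() is hypothetical, the memcg_kmem_* helper names follow the series' naming convention but should be treated as assumptions, and in the real series both helper calls live inside the page allocator itself rather than around it as shown flattened here:

	static void *alloc_accounted_stack(int node)
	{
		struct mem_cgroup *memcg = NULL;
		struct page *page;

		/* charge first: if the group is over its kmem limit,
		 * the allocation - and hence this fork() - fails */
		if (!memcg_kmem_newpage_charge(THREADINFO_GFP_ACCOUNTED, &memcg,
					       THREAD_SIZE_ORDER))
			return NULL;

		page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
					THREAD_SIZE_ORDER);
		memcg_kmem_commit_charge(page, memcg, THREAD_SIZE_ORDER);
		return page ? page_address(page) : NULL;
	}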
[parent not found: <1347977050-29476-14-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [PATCH v3 13/13] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs [not found] ` <1347977050-29476-14-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-10-01 13:17 ` Michal Hocko 0 siblings, 0 replies; 127+ messages in thread From: Michal Hocko @ 2012-10-01 13:17 UTC (permalink / raw) To: Glauber Costa Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, devel-GEFAQzZX7r8dnm+yROfE0A, Tejun Heo, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Frederic Weisbecker, Mel Gorman, David Rientjes, Christoph Lameter, Pekka Enberg, Johannes Weiner On Tue 18-09-12 18:04:10, Glauber Costa wrote: > Because those architectures will draw their stacks directly from the > page allocator, rather than the slab cache, we can directly pass the > __GFP_KMEMCG flag and issue the corresponding free_pages call. > > This code path is taken when the architecture doesn't define > CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has > THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining > architectures fall in this category. > > This will guarantee that every stack page is accounted to the memcg the > process currently lives on, and will have the allocations fail if > they go over the limit. > > For the time being, I am defining a new variant of THREADINFO_GFP so as > not to mess with the other path. Once the slab is also tracked by memcg, we > can get rid of that flag. > > Tested to successfully protect against :(){ :|:& };: OK. Although I was complaining that this is not the full truth the last time, I do not insist on gory details about the slaughter this will cause to the rest of the group, or on the fact that whoever can fork in the group can easily DoS the whole hierarchy. It has some interesting side effects as well, but let's leave those to the careful reader ;) The patch, as is, is still useful and an improvement because it reduces the impact.
> > Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> > Acked-by: Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> > CC: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> > CC: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org> > CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> > CC: Suleiman Souhlal <suleiman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> > --- > include/linux/thread_info.h | 2 ++ > kernel/fork.c | 4 ++-- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h > index ccc1899..e7e0473 100644 > --- a/include/linux/thread_info.h > +++ b/include/linux/thread_info.h > @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block *parm); > # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK) > #endif > > +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG) > + > /* > * flag set/clear/test wrappers > * - pass TIF_xxxx constants to these functions > diff --git a/kernel/fork.c b/kernel/fork.c > index 0ff2bf7..897e89c 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) > static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, > int node) > { > - struct page *page = alloc_pages_node(node, THREADINFO_GFP, > + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED, > THREAD_SIZE_ORDER); > > return page ? page_address(page) : NULL; > @@ -154,7 +154,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, > > static inline void free_thread_info(struct thread_info *ti) > { > - free_pages((unsigned long)ti, THREAD_SIZE_ORDER); > + free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER); > } > # else > static struct kmem_cache *thread_info_cache; > -- > 1.7.11.4 > > -- > To unsubscribe from this list: send the line "unsubscribe cgroups" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 127+ messages in thread