* [RFC] Shared page accounting for memory cgroup
@ 2009-12-29 18:27 Balbir Singh
  2010-01-03 23:51 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread

From: Balbir Singh @ 2009-12-29 18:27 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: Andrew Morton, linux-kernel@vger.kernel.org, KAMEZAWA Hiroyuki,
      nishimura@mxp.nes.nec.co.jp

Hi, Everyone,

I've been working on heuristics for shared page accounting for the
memory cgroup. I've tested the patches by creating multiple cgroups,
running programs that share memory, and observing the output.

Comments?


Add shared accounting to memcg

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Currently there is no accurate way of estimating how many pages are
shared in a memory cgroup. The accurate way of accounting shared
memory is to either:

1. Follow every page's rmap and track the number of users, or
2. Iterate through the pages and use _mapcount.

We take an intermediate approach (suggested by Kamezawa): we sum up
the file and anon rss of the mm's belonging to the cgroup and then
subtract the cgroup's own counts of anon rss and file mapped pages.
This should give us a good estimate of the pages being shared.

The shared statistic is called memory.shared_usage_in_bytes and
does not support hierarchical information, just the information
for the current cgroup.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |    6 +++++
 mm/memcontrol.c                  |   43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 0 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..c2c70c9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -341,6 +341,12 @@ Note:
 - a cgroup which uses hierarchy and it has child cgroup.
 - a cgroup which uses hierarchy and not the root of hierarchy.

+5.4 shared_usage_in_bytes
+	This file reports an estimate of the number of shared bytes. The
+	estimate is an approximation based on the anon and file rss counts
+	of all the mm's belonging to the cgroup. The rss and file mapped
+	counts maintained within the memory cgroup statistics (see section
+	5.2) are subtracted from this sum.

 6. Hierarchy support

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 488b644..8e296be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3052,6 +3052,45 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }

+static u64 mem_cgroup_shared_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct cgroup_iter it;
+	struct task_struct *tsk;
+	u64 total_rss = 0, shared;
+	struct mm_struct *mm;
+	s64 val;
+
+	cgroup_iter_start(cgrp, &it);
+	val = mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_RSS);
+	val += mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_FILE_MAPPED);
+	while ((tsk = cgroup_iter_next(cgrp, &it))) {
+		if (!thread_group_leader(tsk))
+			continue;
+		mm = tsk->mm;
+		/*
+		 * We can't use get_task_mm(), since its counterpart
+		 * mmput() can sleep. We know that mm can't become invalid,
+		 * since we hold the css_set_lock (see cgroup_iter_start()).
+		 */
+		if (tsk->flags & PF_KTHREAD || !mm)
+			continue;
+		total_rss += get_mm_counter(mm, file_rss) +
+				get_mm_counter(mm, anon_rss);
+	}
+	cgroup_iter_end(cgrp, &it);
+
+	/*
+	 * We need to tolerate negative values due to the difference in
+	 * time between calculating total_rss and val, but the shared value
+	 * converges to the correct value quite soon, depending on the
+	 * changing memory usage of the workload running in the memory cgroup.
+	 */
+	shared = total_rss - val;
+	shared = max_t(s64, 0, shared);
+	shared <<= PAGE_SHIFT;
+	return shared;
+}

 static struct cftype mem_cgroup_files[] = {
 	{
@@ -3101,6 +3140,10 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
+	{
+		.name = "shared_usage_in_bytes",
+		.read_u64 = mem_cgroup_shared_read,
+	},
 };

 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

--
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup
  2009-12-29 18:27 [RFC] Shared page accounting for memory cgroup Balbir Singh
@ 2010-01-03 23:51 ` KAMEZAWA Hiroyuki
  2010-01-04  0:07   ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread

From: KAMEZAWA Hiroyuki @ 2010-01-03 23:51 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
      nishimura@mxp.nes.nec.co.jp

On Tue, 29 Dec 2009 23:57:43 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Hi, Everyone,
>
> I've been working on heuristics for shared page accounting for the
> memory cgroup. I've tested the patches by creating multiple cgroups
> and running programs that share memory and observed the output.
>
> Comments?

Hmm? Why do we have to do this in the kernel?

Thanks,
-Kame

> Add shared accounting to memcg
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Currently there is no accurate way of estimating how many pages are
> shared in a memory cgroup. The accurate way of accounting shared memory
> is to
>
> 1. Either follow every page rmap and track number of users
> 2. Iterate through the pages and use _mapcount
>
> We take an intermediate approach (suggested by Kamezawa), we sum up
> the file and anon rss of the mm's belonging to the cgroup and then
> subtract the values of anon rss and file mapped. This should give
> us a good estimate of the pages being shared.
>
> The shared statistic is called memory.shared_usage_in_bytes and
> does not support hierarchical information, just the information
> for the current cgroup.
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> > --- > > Documentation/cgroups/memory.txt | 6 +++++ > mm/memcontrol.c | 43 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 49 insertions(+), 0 deletions(-) > > > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt > index b871f25..c2c70c9 100644 > --- a/Documentation/cgroups/memory.txt > +++ b/Documentation/cgroups/memory.txt > @@ -341,6 +341,12 @@ Note: > - a cgroup which uses hierarchy and it has child cgroup. > - a cgroup which uses hierarchy and not the root of hierarchy. > > +5.4 shared_usage_in_bytes > + This data lists the number of shared bytes. The data provided > + provides an approximation based on the anon and file rss counts > + of all the mm's belonging to the cgroup. The sum above is subtracted > + from the count of rss and file mapped count maintained within the > + memory cgroup statistics (see section 5.2). > > 6. Hierarchy support > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 488b644..8e296be 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3052,6 +3052,45 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft, > return 0; > } > > +static u64 mem_cgroup_shared_read(struct cgroup *cgrp, struct cftype *cft) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); > + struct cgroup_iter it; > + struct task_struct *tsk; > + u64 total_rss = 0, shared; > + struct mm_struct *mm; > + s64 val; > + > + cgroup_iter_start(cgrp, &it); > + val = mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_RSS); > + val += mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_FILE_MAPPED); > + while ((tsk = cgroup_iter_next(cgrp, &it))) { > + if (!thread_group_leader(tsk)) > + continue; > + mm = tsk->mm; > + /* > + * We can't use get_task_mm(), since mmput() its counterpart > + * can sleep. We know that mm can't become invalid since > + * we hold the css_set_lock (see cgroup_iter_start()). 
> + */ > + if (tsk->flags & PF_KTHREAD || !mm) > + continue; > + total_rss += get_mm_counter(mm, file_rss) + > + get_mm_counter(mm, anon_rss); > + } > + cgroup_iter_end(cgrp, &it); > + > + /* > + * We need to tolerate negative values due to the difference in > + * time of calculating total_rss and val, but the shared value > + * converges to the correct value quite soon depending on the changing > + * memory usage of the workload running in the memory cgroup. > + */ > + shared = total_rss - val; > + shared = max_t(s64, 0, shared); > + shared <<= PAGE_SHIFT; > + return shared; > +} > > static struct cftype mem_cgroup_files[] = { > { > @@ -3101,6 +3140,10 @@ static struct cftype mem_cgroup_files[] = { > .read_u64 = mem_cgroup_swappiness_read, > .write_u64 = mem_cgroup_swappiness_write, > }, > + { > + .name = "shared_usage_in_bytes", > + .read_u64 = mem_cgroup_shared_read, > + }, > }; > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP > > -- > Balbir > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-03 23:51 ` KAMEZAWA Hiroyuki
@ 2010-01-04  0:07   ` Balbir Singh
  2010-01-04  0:35     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread

From: Balbir Singh @ 2010-01-04 0:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
      nishimura@mxp.nes.nec.co.jp

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:

> On Tue, 29 Dec 2009 23:57:43 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > Hi, Everyone,
> >
> > I've been working on heuristics for shared page accounting for the
> > memory cgroup. I've tested the patches by creating multiple cgroups
> > and running programs that share memory and observed the output.
> >
> > Comments?
>
> Hmm? Why we have to do this in the kernel ?

For several reasons that I can think of:

1. With the task migration changes coming in, getting consistent data
   free of races is going to be hard.
2. The cost of doing it in the kernel is not high; it does not impact
   the memcg runtime. It is a request-response sort of cost.
3. The cost in user space is going to be high, and the implementation
   cumbersome to get right.

--
	Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-04  0:07 ` Balbir Singh
@ 2010-01-04  0:35   ` KAMEZAWA Hiroyuki
  2010-01-04  0:50     ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread

From: KAMEZAWA Hiroyuki @ 2010-01-04 0:35 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
      nishimura@mxp.nes.nec.co.jp

On Mon, 4 Jan 2010 05:37:52 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:
>
> > Hmm? Why we have to do this in the kernel ?
>
> For several reasons that I can think of
>
> 1. With task migration changes coming in, getting consistent data free
>    of races is going to be hard.

Hmm, let's look at the real world's "ps" or "top" command. Even when
there is no guarantee of the error range of the data, it's still useful.

> 2. The cost of doing it in the kernel is not high, it does not impact
>    the memcg runtime, it is a request-response sort of cost.
>
> 3. The cost in user space is going to be high and the implementation
>    cumbersome to get right.

I don't like moving a cost from userland into the kernel. Considering a
real-time or fully preemptive kernel, this very long read_lock() in the
kernel is not good, IMHO. (I think css_set_lock should be a
mutex/rw-sem...)
cgroup_iter_xxx can block cgroup_post_fork(), and this may cause a
critical system delay of milliseconds.

BTW, if you really want to calculate something atomically, I think the
following interface may be welcome for freezing:

cgroup.lock
	# echo 1 > /...../cgroup.lock
	All task move, mkdir, rmdir on this cgroup will be blocked by a mutex.
	(But fork/exit will not be blocked.)

	# echo 0 > /...../cgroup.lock
	Unlock.

	# cat /...../cgroup.lock
	Show lock status and lock history (for debug).

Maybe good for some kinds of middleware.
But this may be difficult if we have to consider hierarchy.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-04 0:35 ` KAMEZAWA Hiroyuki @ 2010-01-04 0:50 ` Balbir Singh 2010-01-06 4:02 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-04 0:50 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 09:35:28]: > On Mon, 4 Jan 2010 05:37:52 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]: > > > > > On Tue, 29 Dec 2009 23:57:43 +0530 > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > > > Hi, Everyone, > > > > > > > > I've been working on heuristics for shared page accounting for the > > > > memory cgroup. I've tested the patches by creating multiple cgroups > > > > and running programs that share memory and observed the output. > > > > > > > > Comments? > > > > > > Hmm? Why we have to do this in the kernel ? > > > > > > > For several reasons that I can think of > > > > 1. With task migration changes coming in, getting consistent data free of races > > is going to be hard. > > Hmm, Let's see real-worlds's "ps" or "top" command. Even when there are no guarantee > of error range of data, it's still useful. Yes, my concern is this 1. I iterate through tasks and calculate RSS 2. I look at memory.usage_in_bytes If the time in user space between 1 and 2 is large I get very wrong results, specifically if the workload is changing its memory usage drastically.. no? > > > 2. The cost of doing it in the kernel is not high, it does not impact > > the memcg runtime, it is a request-response sort of cost. > > > > 3. The cost in user space is going to be high and the implementation > > cumbersome to get right. > > > I don't like moving a cost in the userland to the kernel. Me neither, but I don't think it is a fixed overhead. 
> Considering real-time kernel or full-preemptive kernel, this very long
> read_lock() in the kernel is not good, IMHO. (I think css_set_lock
> should be mutex/rw-sem...)

I agree, we should discuss converting the lock to a mutex or a
semaphore, but there might be a good reason for keeping it as a
spin_lock.

> cgroup_iter_xxx can block cgroup_post_fork() and this may cause critical
> system delay of milli-seconds.

Agreed, but that can also happen while attaching a task or reading the
cgroup tasks file (list of tasks).

> BTW, if you really want to calculate something in atomic, I think following
> interface may be welcomed for freezing.
>
> cgroup.lock
> 	# echo 1 > /...../cgroup.lock
> 	All task move, mkdir, rmdir to this cgroup will be blocked by mutex.
> 	(But fork/exit will not be blocked.)
>
> 	# echo 0 > /...../cgroup.lock
> 	Unlock.
>
> 	# cat /...../cgroup.lock
> 	show lock status and lock history (for debug).
>
> Maybe good for some kinds of middleware.
> But this may be difficult if we have to consider hierarchy.

I don't like the idea of providing an interface that can control kernel
locks from user space; user space can tangle up and get it wrong.

--
	Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-04 0:50 ` Balbir Singh @ 2010-01-06 4:02 ` KAMEZAWA Hiroyuki 2010-01-06 7:01 ` Balbir Singh 0 siblings, 1 reply; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-06 4:02 UTC (permalink / raw) To: balbir Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Mon, 4 Jan 2010 06:20:31 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 09:35:28]: > > > On Mon, 4 Jan 2010 05:37:52 +0530 > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]: > > > > > > > On Tue, 29 Dec 2009 23:57:43 +0530 > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > > > > > Hi, Everyone, > > > > > > > > > > I've been working on heuristics for shared page accounting for the > > > > > memory cgroup. I've tested the patches by creating multiple cgroups > > > > > and running programs that share memory and observed the output. > > > > > > > > > > Comments? > > > > > > > > Hmm? Why we have to do this in the kernel ? > > > > > > > > > > For several reasons that I can think of > > > > > > 1. With task migration changes coming in, getting consistent data free of races > > > is going to be hard. > > > > Hmm, Let's see real-worlds's "ps" or "top" command. Even when there are no guarantee > > of error range of data, it's still useful. > > Yes, my concern is this > > 1. I iterate through tasks and calculate RSS > 2. I look at memory.usage_in_bytes > > If the time in user space between 1 and 2 is large I get very wrong > results, specifically if the workload is changing its memory usage > drastically.. no? > No. If it takes long time, locking fork()/exit() for such long time is the bigger issue. I recommend you to add memacct subsystem to sum up RSS of all processes's RSS counting under a cgroup. 
Although it may add huge costs in the page fault path, the
implementation will be very simple and will not hurt realtime ops.
There will be no terrible race, I guess.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-06 4:02 ` KAMEZAWA Hiroyuki @ 2010-01-06 7:01 ` Balbir Singh 2010-01-06 7:12 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-06 7:01 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 13:02:58]: > On Mon, 4 Jan 2010 06:20:31 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 09:35:28]: > > > > > On Mon, 4 Jan 2010 05:37:52 +0530 > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]: > > > > > > > > > On Tue, 29 Dec 2009 23:57:43 +0530 > > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > > > > > > > Hi, Everyone, > > > > > > > > > > > > I've been working on heuristics for shared page accounting for the > > > > > > memory cgroup. I've tested the patches by creating multiple cgroups > > > > > > and running programs that share memory and observed the output. > > > > > > > > > > > > Comments? > > > > > > > > > > Hmm? Why we have to do this in the kernel ? > > > > > > > > > > > > > For several reasons that I can think of > > > > > > > > 1. With task migration changes coming in, getting consistent data free of races > > > > is going to be hard. > > > > > > Hmm, Let's see real-worlds's "ps" or "top" command. Even when there are no guarantee > > > of error range of data, it's still useful. > > > > Yes, my concern is this > > > > 1. I iterate through tasks and calculate RSS > > 2. I look at memory.usage_in_bytes > > > > If the time in user space between 1 and 2 is large I get very wrong > > results, specifically if the workload is changing its memory usage > > drastically.. no? > > > No. 
> If it takes long time, locking fork()/exit() for such long time is the
> bigger issue.
> I recommend you to add a memacct subsystem to sum up the RSS of all
> processes under a cgroup. Although it may add huge costs in the page
> fault path, the implementation will be very simple and will not hurt
> realtime ops.
> There will be no terrible race, I guess.

But others hold that lock as well, for simple things like listing tasks,
moving tasks, etc. I expect the usage of shared to be in the same range.

--
	Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-06  7:01 ` Balbir Singh
@ 2010-01-06  7:12   ` KAMEZAWA Hiroyuki
  2010-01-07  7:15     ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread

From: KAMEZAWA Hiroyuki @ 2010-01-06 7:12 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
      nishimura@mxp.nes.nec.co.jp

On Wed, 6 Jan 2010 12:31:50 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > No. If it takes long time, locking fork()/exit() for such long time
> > is the bigger issue.
> > I recommend you to add memacct subsystem to sum up RSS of all
> > processes under a cgroup. Although it may add huge costs in the page
> > fault path, the implementation will be very simple and will not hurt
> > realtime ops.
> > There will be no terrible race, I guess.
>
> But others hold that lock as well, simple things like listing tasks and
> moving tasks, etc. I expect the usage of shared to be in the same
> range.

And it piles up costs? I think cgroup guys should pay attention to
fork/exit costs more. Now, it gets slower and slower.
On that point, I never liked the migrate-at-task-move work in cpuset
and memcg.

My 1st objection to this patch is that this "shared" doesn't mean
"shared between cgroups" but "shared between processes".
I think it's of no use and no help to users.

And implementation is the 2nd thing.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-06  7:12 ` KAMEZAWA Hiroyuki
@ 2010-01-07  7:15   ` Balbir Singh
  2010-01-07  7:36     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread

From: Balbir Singh @ 2010-01-07 7:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
      nishimura@mxp.nes.nec.co.jp

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]:

> On Wed, 6 Jan 2010 12:31:50 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > But others hold that lock as well, simple thing like listing tasks and
> > moving tasks, etc. I expect the usage of shared to be in the same
> > range.
>
> And piles up costs ? I think cgroup guys should pay attention to fork/exit
> costs more. Now, it gets slower and slower.
> In that point, I never like migrate-at-task-move work in cpuset and memcg.
>
> My 1st objection to this patch is this "shared" doesn't mean "shared between
> cgroup" but means "shared between processes".
> I think it's of no use and no help to users.

So what in your opinion would help end users? My concern is that as
we make progress with memcg, we account only for privately used pages
with no hint/data about the real usage (shared within or with other
cgroups). How do we decide if one cgroup is really heavy?

> And implementation is 2nd thing.

More details on your concern, please!

--
	Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 7:15 ` Balbir Singh @ 2010-01-07 7:36 ` KAMEZAWA Hiroyuki 2010-01-07 8:34 ` Balbir Singh 0 siblings, 1 reply; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-07 7:36 UTC (permalink / raw) To: balbir Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Thu, 7 Jan 2010 12:45:54 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]: > > And piles up costs ? I think cgroup guys should pay attention to fork/exit > > costs more. Now, it gets slower and slower. > > In that point, I never like migrate-at-task-move work in cpuset and memcg. > > > > My 1st objection to this patch is this "shared" doesn't mean "shared between > > cgroup" but means "shared between processes". > > I think it's of no use and no help to users. > > > > So what in your opinion would help end users? My concern is that as > we make progress with memcg, we account only for privately used pages > with no hint/data about the real usage (shared within or with other > cgroups). The real usage is already shown as [root@bluextal ref-mmotm]# cat /cgroups/memory.stat cache 7706181632 rss 120905728 mapped_file 32239616 This is real. And "sum of rss - rss+mapped" doesn't show anything. > How do we decide if one cgroup is really heavy? > What "heavy" means ? "Hard to page out ?" Historically, it's caught by pagein/pageout _speed_. "How heavy memory system is ?" can only be measured by "speed". If you add latency-stat for memcg, I'm glad to use it. Anyway, "How memory reclaim can go successfully" is generic problem rather than memcg. Maybe no good answers from VM guys.... I think you should add codes to global VM rather than cgroup. "How pages are shared" doesn't show good hints. I don't hear such parameter is used in production's resource monitoring software. > > And implementation is 2nd thing. 
>
> More details on your concern, please!

I already wrote.... Why do you want to make fork()/exit() slow for a
thing which is not necessary to be done atomically?

There are many hosts which have thousands of processes, and a cgroup
may contain thousands of processes on a production server.
In that situation, how much can "make kernel" slow down with the
following?
==
while true; do cat /cgroup/memory.shared > /dev/null; done
==

In a word, the implementation problem is:
 - An operation against a container can cause a generic system slowdown.
Then, I don't like heavy task move under cgroup.

Yes, this can happen in other places (we have to do some improvements).
But this is not good for the concept of isolation by containers, anyway.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 7:36 ` KAMEZAWA Hiroyuki @ 2010-01-07 8:34 ` Balbir Singh 2010-01-07 8:48 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-07 8:34 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 16:36:10]: > On Thu, 7 Jan 2010 12:45:54 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]: > > > And piles up costs ? I think cgroup guys should pay attention to fork/exit > > > costs more. Now, it gets slower and slower. > > > In that point, I never like migrate-at-task-move work in cpuset and memcg. > > > > > > My 1st objection to this patch is this "shared" doesn't mean "shared between > > > cgroup" but means "shared between processes". > > > I think it's of no use and no help to users. > > > > > > > So what in your opinion would help end users? My concern is that as > > we make progress with memcg, we account only for privately used pages > > with no hint/data about the real usage (shared within or with other > > cgroups). > > The real usage is already shown as > > [root@bluextal ref-mmotm]# cat /cgroups/memory.stat > cache 7706181632 > rss 120905728 > mapped_file 32239616 > > This is real. And "sum of rss - rss+mapped" doesn't show anything. > > > How do we decide if one cgroup is really heavy? > > > > What "heavy" means ? "Hard to page out ?" > Heavy can also indicate, should we OOM kill in this cgroup or kill the entire cgroup? Should we add or remove resources from this cgroup? > Historically, it's caught by pagein/pageout _speed_. > "How heavy memory system is ?" can only be measured by "speed". Not really... A cgroup might be very large with a large number of its pages shared and frequently used. 
How do we detect if this cgroup needs its resources or if it's taking
too many of them?

> If you add latency-stat for memcg, I'm glad to use it.
>
> Anyway, "How memory reclaim can go successfully" is generic problem rather
> than memcg. Maybe no good answers from VM guys....
> I think you should add codes to global VM rather than cgroup.

No.. this is not for reclaim.

> "How pages are shared" doesn't show good hints. I don't hear such parameter
> is used in production's resource monitoring software.

You mean "how many pages are shared" is not a good hint? Please see my
justification above. With virtualization (look at KSM, for example),
shared pages are going to be an increasingly important part of the
accounting.

> > > And implementation is 2nd thing.
> >
> > More details on your concern, please!
>
> I already wrote....why do you want to make fork()/exit() slow for a thing
> which is not necessary to be done in atomic ?

So your concern is about iterating through the tasks in the cgroup; I
can think of an alternative low-cost implementation if possible.

> There are many hosts which has thousands of process and a cgrop may contain
> thousands of process in production server.
> In that situation, How the "make kernel" can slow down with following ?
> ==
> while true; do cat /cgroup/memory.shared > /dev/null; done
> ==

This is the worst-case usage scenario, and it would be affected even if
memory.shared were replaced by tasks.

> In a word, the implementation problem is
>  - An operation against a container can cause generic system slow down.
> Then, I don't like heavy task move under cgroup.
>
> Yes, this can happen in other places (we have to do some improvements).
> But this is not good for a concept of isolation by container, anyway.

Thanks for the review!

--
	Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
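The quantity being argued over above ("sum of rss - rss+mapped") is the estimate the RFC patch computes: sum the anon and file RSS of every mm in the cgroup, then subtract the memcg's own RSS and FILE_MAPPED counters. A minimal sketch of that arithmetic follows — in Python rather than the patch's kernel C, with made-up page counts, and clamping transient negative values to zero as the patch's comment describes:

```python
def estimate_shared_bytes(per_mm_rss, memcg_rss, memcg_file_mapped):
    """Approximate 'shared' usage the way the RFC proposes: the sum of
    (anon_rss, file_rss) pairs over every mm in the cgroup, minus what
    the memcg itself charged (RSS + FILE_MAPPED).  A page mapped by N
    tasks is counted N times in the first sum but only once in the
    memcg counters, so the difference estimates sharing."""
    total_rss = sum(anon + filerss for anon, filerss in per_mm_rss)
    shared = total_rss - (memcg_rss + memcg_file_mapped)
    # The two sides are sampled at different times, so tolerate
    # transient negative values, as the patch comment notes.
    return max(shared, 0)

# Hypothetical numbers: two processes each map the same 10-page file
# and have 5 and 3 private anon pages.  The memcg charged each page
# once: 8 anon + 10 file-mapped.
print(estimate_shared_bytes([(5, 10), (3, 10)], 8, 10))
```

Values here are page counts for simplicity; the actual file reports bytes, and the real patch reads the counters with get_mm_counter() under the cgroup iterator lock.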
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 8:34 ` Balbir Singh @ 2010-01-07 8:48 ` KAMEZAWA Hiroyuki 2010-01-07 9:08 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-07 8:48 UTC (permalink / raw) To: balbir Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Thu, 7 Jan 2010 14:04:40 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 16:36:10]: > > > On Thu, 7 Jan 2010 12:45:54 +0530 > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]: > > > > And piles up costs ? I think cgroup guys should pay attention to fork/exit > > > > costs more. Now, it gets slower and slower. > > > > In that point, I never like migrate-at-task-move work in cpuset and memcg. > > > > > > > > My 1st objection to this patch is this "shared" doesn't mean "shared between > > > > cgroup" but means "shared between processes". > > > > I think it's of no use and no help to users. > > > > > > > > > > So what in your opinion would help end users? My concern is that as > > > we make progress with memcg, we account only for privately used pages > > > with no hint/data about the real usage (shared within or with other > > > cgroups). > > > > The real usage is already shown as > > > > [root@bluextal ref-mmotm]# cat /cgroups/memory.stat > > cache 7706181632 > > rss 120905728 > > mapped_file 32239616 > > > > This is real. And "sum of rss - rss+mapped" doesn't show anything. > > > > > How do we decide if one cgroup is really heavy? > > > > > > > What "heavy" means ? "Hard to page out ?" > > > > Heavy can also indicate, should we OOM kill in this cgroup or kill the > entire cgroup? Should we add or remove resources from this cgroup? > That can be shown by usage... > > Historically, it's caught by pagein/pageout _speed_.
> > "How heavy memory system is ?" can only be measured by "speed". > > Not really... A cgroup might be very large with a large number of its > pages shared and frequently used. How do we detect if this cgroup > needs its resources or its taking too many of them. > I don't know. If we have a good parameter to know "resource is in short" in the kernel, please add it to the global VM before memcg, as "/dev/mem_notify" was proposed in the past. memcg will use similar logic which is guaranteed by VM guys. > > "How pages are shared" doesn't show good hints. I don't hear such parameter > > is used in production's resource monitoring software. > > > > You mean "How many pages are shared" are not good hints, please see my > justification above. With Virtualization (look at KSM for example), > shared pages are going to be increasingly important part of the > accounting. > Considering KSM, your counting style is too bad. You should add - MEM_CGROUP_STAT_SHARED_BY_KSM - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM counters to memcg rather than scanning. I can help with tests. I have no objections to having the above 2 counters. It's informative. But, memory reclaim can page-out pages even if pages are shared. So, "how heavy memcg is" is an independent problem from the above counters. Thanks, -Kame
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 8:48 ` KAMEZAWA Hiroyuki @ 2010-01-07 9:08 ` KAMEZAWA Hiroyuki 2010-01-07 9:27 ` Balbir Singh 0 siblings, 1 reply; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-07 9:08 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: balbir, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Thu, 7 Jan 2010 17:48:14 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > "How pages are shared" doesn't show good hints. I don't hear such parameter > > > is used in production's resource monitoring software. > > > > > > > You mean "How many pages are shared" are not good hints, please see my > > justification above. With Virtualization (look at KSM for example), > > shared pages are going to be increasingly important part of the > > accounting. > > > > Considering KSM, your cuounting style is tooo bad. > > You should add > > - MEM_CGROUP_STAT_SHARED_BY_KSM > > - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM > > counters to memcg rather than scanning. I can help tests. > > I have no objections to have above 2 counters. It's informative. > > But, memory reclaim can page-out pages even if pages are shared. > So, "how heavy memcg is" is an independent problem from above coutners. > In other words, the above counters can show "what role the memcg plays in the system" to some extent. But I wouldn't express it as "heavy"....."importance or influence of the cgroup"? Thanks, -Kame
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 9:08 ` KAMEZAWA Hiroyuki @ 2010-01-07 9:27 ` Balbir Singh 2010-01-07 23:47 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-07 9:27 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: > On Thu, 7 Jan 2010 17:48:14 +0900 > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter > > > > is used in production's resource monitoring software. > > > > > > > > > > You mean "How many pages are shared" are not good hints, please see my > > > justification above. With Virtualization (look at KSM for example), > > > shared pages are going to be increasingly important part of the > > > accounting. > > > > > > > Considering KSM, your cuounting style is tooo bad. > > > > You should add > > > > - MEM_CGROUP_STAT_SHARED_BY_KSM > > - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM > > No.. I am just talking about shared memory being important and shared accounting being useful, no counters for KSM in particular (in the memcg context). > > counters to memcg rather than scanning. I can help tests. > > > > I have no objections to have above 2 counters. It's informative. > > Apart from those two, I want to provide what PSS provides today or an approximation of it. > > But, memory reclaim can page-out pages even if pages are shared. > > So, "how heavy memcg is" is an independent problem from above coutners. > > > > In other words, above counters can show > "What role the memcg play in the system" to some extent. > > But I don't express it as "heavy" ....."importance or influence of cgroup" ? > > Thanks, > -Kame > > -- Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 9:27 ` Balbir Singh @ 2010-01-07 23:47 ` KAMEZAWA Hiroyuki 2010-01-17 19:30 ` Balbir Singh 0 siblings, 1 reply; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-07 23:47 UTC (permalink / raw) To: balbir Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Thu, 7 Jan 2010 14:57:36 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: > > > On Thu, 7 Jan 2010 17:48:14 +0900 > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter > > > > > is used in production's resource monitoring software. > > > > > > > > > > > > > You mean "How many pages are shared" are not good hints, please see my > > > > justification above. With Virtualization (look at KSM for example), > > > > shared pages are going to be increasingly important part of the > > > > accounting. > > > > > > > > > > Considering KSM, your cuounting style is tooo bad. > > > > > > You should add > > > > > > - MEM_CGROUP_STAT_SHARED_BY_KSM > > > - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM > > > > > No.. I am just talking about shared memory being important and shared > accounting being useful, no counters for KSM in particular (in the > memcg context). > Think so ? The number of memcg-private pages is of interest from my point of view. Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated in the kernel. If you want to provide that in memcg, please add it to the global VM as /proc/meminfo. IIUC, KSM/SHMEM has some official method in global VM. Thanks, -Kame
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-07 23:47 ` KAMEZAWA Hiroyuki @ 2010-01-17 19:30 ` Balbir Singh 2010-01-18 0:05 ` KAMEZAWA Hiroyuki 2010-01-18 0:49 ` Daisuke Nishimura 0 siblings, 2 replies; 31+ messages in thread From: Balbir Singh @ 2010-01-17 19:30 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > On Thu, 7 Jan 2010 14:57:36 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: >> >> > On Thu, 7 Jan 2010 17:48:14 +0900 >> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: >> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter >> > > > > is used in production's resource monitoring software. >> > > > > >> > > > >> > > > You mean "How many pages are shared" are not good hints, please see my >> > > > justification above. With Virtualization (look at KSM for example), >> > > > shared pages are going to be increasingly important part of the >> > > > accounting. >> > > > >> > > >> > > Considering KSM, your cuounting style is tooo bad. >> > > >> > > You should add >> > > >> > > - MEM_CGROUP_STAT_SHARED_BY_KSM >> > > - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM >> > > >> >> No.. I am just talking about shared memory being important and shared >> accounting being useful, no counters for KSM in particular (in the >> memcg context). >> > Think so ? The number of memcg-private pages is in interest in my point of view. > > Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated > in the kernel. > If you want to provide that in memcg, please add it to global VM as /proc/meminfo. > > IIUC, KSM/SHMEM has some official method in global VM. 
> Kamezawa-San, I implemented the same in user space and I get really bad results, and here is why: 1. I need to hold and walk the tasks list in cgroups and extract RSS through /proc (results in worse hold times for the fork() scenario you mentioned) 2. The data is highly inconsistent due to the higher margin of error in accumulating data which is changing as we run. By the time we total and look at the memcg data, the data is stale Would you be OK with the patch if I renamed "shared_usage_in_bytes" to "non_private_usage_in_bytes"? Given that the stat is user initiated, I don't see your concern w.r.t. overhead. Many subsystems like KSM do pay the overhead cost if the user really wants the feature or the data. I would be really interested in other opinions as well (if people feel strongly for or against the feature) Balbir Singh
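The user-space variant Balbir describes — walking the cgroup's tasks file and summing RSS through /proc — can be sketched as below. This is a hedged illustration, not the code he actually ran: the cgroup mount path would vary, and only the VmRSS parsing is shown, because that is exactly where the staleness he complains about creeps in (each /proc/<pid>/status file is read at a different instant, so the sum never reflects a single point in time):

```python
import re

def vm_rss_kb(status_text):
    """Extract VmRSS (in kB) from the contents of a /proc/<pid>/status
    snapshot.  Kernel threads have no VmRSS line; report 0 for them."""
    m = re.search(r"^VmRSS:\s*(\d+)\s*kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def sum_cgroup_rss_kb(status_texts):
    """Sum RSS over already-read status snapshots.  In a real tool the
    pids would come from e.g. /cgroup/memory/<grp>/tasks (path is an
    assumption) and each file would be read while tasks fork, exit and
    fault, so the total is not a consistent snapshot."""
    return sum(vm_rss_kb(t) for t in status_texts)

# Hypothetical snapshots of two tasks' status files:
a = "Name:\tworker\nVmRSS:\t  1024 kB\n"
b = "Name:\tworker2\nVmRSS:\t   512 kB\n"
print(sum_cgroup_rss_kb([a, b]))
```

The in-kernel version avoids half of this gap by reading the mm counters under the cgroup iterator lock, which is the consistency argument Balbir is making.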
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-17 19:30 ` Balbir Singh @ 2010-01-18 0:05 ` KAMEZAWA Hiroyuki 2010-01-18 0:22 ` KAMEZAWA Hiroyuki 2010-01-18 0:49 ` Daisuke Nishimura 0 siblings, 1 reply; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-18 0:05 UTC (permalink / raw) To: Balbir Singh Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Mon, 18 Jan 2010 01:00:44 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki > <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > On Thu, 7 Jan 2010 14:57:36 +0530 > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > >> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: > >> > >> > On Thu, 7 Jan 2010 17:48:14 +0900 > >> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > >> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter > >> > > > > is used in production's resource monitoring software. > >> > > > > > >> > > > > >> > > > You mean "How many pages are shared" are not good hints, please see my > >> > > > justification above. With Virtualization (look at KSM for example), > >> > > > shared pages are going to be increasingly important part of the > >> > > > accounting. > >> > > > > >> > > > >> > > Considering KSM, your cuounting style is tooo bad. > >> > > > >> > > You should add > >> > > > >> > > - MEM_CGROUP_STAT_SHARED_BY_KSM > >> > > - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM > >> > > > >> > >> No.. I am just talking about shared memory being important and shared > >> accounting being useful, no counters for KSM in particular (in the > >> memcg context). > >> > > Think so ? The number of memcg-private pages is in interest in my point of view. > > > > Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
> > If you want to provide that in memcg, please add it to global VM as /proc/meminfo. > > > > IIUC, KSM/SHMEM has some official method in global VM. > > > > Kamezawa-San, > > I implemented the same in user space and I get really bad results, here is why > > 1. I need to hold and walk the tasks list in cgroups and extract RSS > > through /proc (results in worse hold times for the fork() scenario you > > menioned) > > 2. The data is highly inconsistent due to the higher margin of error > > in accumulating data which is changing as we run. By the time we total > > and look at the memcg data, the data is stale > > > > Would you be OK with the patch, if I renamed "shared_usage_in_bytes" > > to "non_private_usage_in_bytes"? > > > > Given that the stat is user initiated, I don't see your concern w.r.t. > > overhead. Many subsystems like KSM do pay the overhead cost if the > > user really wants the feature or the data. I would be really > > interested in other opinions as well (if people do feel strongly > > against or for the feature) > > Please add that feature to the global VM before memcg. If the VM guys admit it's good, I have no more objections. Thanks, -Kame
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-18 0:05 ` KAMEZAWA Hiroyuki @ 2010-01-18 0:22 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-18 0:22 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Balbir Singh, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp On Mon, 18 Jan 2010 09:05:49 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > > > Kamezawa-San, > > > > I implemented the same in user space and I get really bad results, here is why > > > > 1. I need to hold and walk the tasks list in cgroups and extract RSS > > through /proc (results in worse hold times for the fork() scenario you > > menioned) > > 2. The data is highly inconsistent due to the higher margin of error > > in accumulating data which is changing as we run. By the time we total > > and look at the memcg data, the data is stale > > > > Would you be OK with the patch, if I renamed "shared_usage_in_bytes" > > to "non_private_usage_in_bytes"? > > > > Given that the stat is user initiated, I don't see your concern w.r.t. > > overhead. Many subsystems like KSM do pay the overhead cost if the > > user really wants the feature or the data. I would be really > > interested in other opinions as well (if people do feel strongly > > against or for the feature) > > > > Please add that featuter to global VM before memcg. > If VM guyes admits its good, I have no objections more. > I don't want to say any more, but...one point. If the status of memory changes so frequently that the user-land check program can't calculate stable data, what can the management daemon do against it...the stale data? So, I think it's nonsense anyway. Thanks, -Kame
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-17 19:30 ` Balbir Singh 2010-01-18 0:05 ` KAMEZAWA Hiroyuki @ 2010-01-18 0:49 ` Daisuke Nishimura 2010-01-18 8:26 ` Balbir Singh 1 sibling, 1 reply; 31+ messages in thread From: Daisuke Nishimura @ 2010-01-18 0:49 UTC (permalink / raw) To: Balbir Singh Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Daisuke Nishimura On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki > <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > On Thu, 7 Jan 2010 14:57:36 +0530 > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > >> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: > >> > >> > On Thu, 7 Jan 2010 17:48:14 +0900 > >> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > >> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter > >> > > > > is used in production's resource monitoring software. > >> > > > > > >> > > > > >> > > > You mean "How many pages are shared" are not good hints, please see my > >> > > > justification above. With Virtualization (look at KSM for example), > >> > > > shared pages are going to be increasingly important part of the > >> > > > accounting. > >> > > > > >> > > > >> > > Considering KSM, your cuounting style is tooo bad. > >> > > > >> > > You should add > >> > > > >> > > - MEM_CGROUP_STAT_SHARED_BY_KSM > >> > > - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM > >> > > > >> > >> No.. I am just talking about shared memory being important and shared > >> accounting being useful, no counters for KSM in particular (in the > >> memcg context). > >> > > Think so ? The number of memcg-private pages is in interest in my point of view. > > > > Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
> > If you want to provide that in memcg, please add it to global VM as /proc/meminfo. > > > > IIUC, KSM/SHMEM has some official method in global VM. > > > > Kamezawa-San, > > I implemented the same in user space and I get really bad results, here is why > > 1. I need to hold and walk the tasks list in cgroups and extract RSS > through /proc (results in worse hold times for the fork() scenario you > menioned) > 2. The data is highly inconsistent due to the higher margin of error > in accumulating data which is changing as we run. By the time we total > and look at the memcg data, the data is stale > > Would you be OK with the patch, if I renamed "shared_usage_in_bytes" > to "non_private_usage_in_bytes"? > I think the name is still ambiguous. For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02, both processes have 10MB anonymous pages and 10MB file caches of the same pages, and all of the file caches are charged to 01. In this case, the value in 01 is 0MB (= 20MB - 20MB) and in 02 it is 10MB (= 20MB - 10MB), right? I don't think "non private usage" is appropriate for this value. Why don't you just show "sum_of_each_process_rss"? I think it would be easier to understand for users. But, hmm, I don't see any strong reason to do this in the kernel, then :( Thanks, Daisuke Nishimura. > Given that the stat is user initiated, I don't see your concern w.r.t. > overhead. Many subsystems like KSM do pay the overhead cost if the > user really wants the feature or the data. I would be really > interested in other opinions as well (if people do feel strongly > against or for the feature) > > Balbir Singh
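Nishimura's two-cgroup example can be checked numerically. Assuming the patch's formula (sum of per-task RSS minus the memcg's charged rss + file_mapped, clamped at zero; the function name below is made up for illustration), cgroup 01, which was charged all 20MB, reports 0MB of "shared" usage, while cgroup 02, which was charged only its 10MB of anon pages, reports 10MB:

```python
def shared_estimate_mb(sum_task_rss, charged_rss, charged_file_mapped):
    # Sum of per-task RSS minus what the memcg itself charged,
    # clamped at zero.  All values in MB for this example.
    return max(sum_task_rss - (charged_rss + charged_file_mapped), 0)

# cgroup 01: process A has 10MB anon + 10MB mapped file cache;
# every file page was charged to 01, so it charged 10 + 10.
print(shared_estimate_mb(20, 10, 10))

# cgroup 02: process B maps the same totals, but its file pages
# were already charged to 01, so 02 only charged 10MB of anon.
print(shared_estimate_mb(20, 10, 0))
```

This reproduces the "0MB (= 20MB - 20MB)" and "10MB (= 20MB - 10MB)" figures, and makes Nishimura's naming objection concrete: 02's 10MB is file cache it shares but never got charged for, which is not obviously "non-private usage" of 02.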
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-18 0:49 ` Daisuke Nishimura @ 2010-01-18 8:26 ` Balbir Singh 2010-01-19 1:22 ` Daisuke Nishimura 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-18 8:26 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Monday 18 January 2010 06:19 AM, Daisuke Nishimura wrote: > On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki >> <kamezawa.hiroyu@jp.fujitsu.com> wrote: >>> On Thu, 7 Jan 2010 14:57:36 +0530 >>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >>> >>>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: >>>> >>>>> On Thu, 7 Jan 2010 17:48:14 +0900 >>>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: >>>>>>>> "How pages are shared" doesn't show good hints. I don't hear such parameter >>>>>>>> is used in production's resource monitoring software. >>>>>>>> >>>>>>> >>>>>>> You mean "How many pages are shared" are not good hints, please see my >>>>>>> justification above. With Virtualization (look at KSM for example), >>>>>>> shared pages are going to be increasingly important part of the >>>>>>> accounting. >>>>>>> >>>>>> >>>>>> Considering KSM, your cuounting style is tooo bad. >>>>>> >>>>>> You should add >>>>>> >>>>>> - MEM_CGROUP_STAT_SHARED_BY_KSM >>>>>> - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM >>>>>> >>>> >>>> No.. I am just talking about shared memory being important and shared >>>> accounting being useful, no counters for KSM in particular (in the >>>> memcg context). >>>> >>> Think so ? The number of memcg-private pages is in interest in my point of view. >>> >>> Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated >>> in the kernel. >>> If you want to provide that in memcg, please add it to global VM as /proc/meminfo. 
>>> >>> IIUC, KSM/SHMEM has some official method in global VM. >>> >> >> Kamezawa-San, >> >> I implemented the same in user space and I get really bad results, here is why >> >> 1. I need to hold and walk the tasks list in cgroups and extract RSS >> through /proc (results in worse hold times for the fork() scenario you >> menioned) >> 2. The data is highly inconsistent due to the higher margin of error >> in accumulating data which is changing as we run. By the time we total >> and look at the memcg data, the data is stale >> >> Would you be OK with the patch, if I renamed "shared_usage_in_bytes" >> to "non_private_usage_in_bytes"? >> > I think the name is still ambiguous. > > For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02, > both process have 10MB anonymous pages and 10MB file caches of the same pages, and all of the > file caches are charged to 01. > In this case, the value in 01 is 0MB(=20MB - 20MB) and 10MB(20MB - 10MB), right? > Correct, file cache is almost always considered shared, so it has 1. non-private or shared usage of 10MB 2. 10 MB of file cache > I don't think "non private usage" is appropriate to this value. > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > to understand for users. Here is my concern: 1. The gap between looking at memcg stat and sum of all RSS is way higher in user space 2. Summing up all rss without walking the tasks atomically can and will lead to consistency issues. Data can be stale as long as it represents a consistent snapshot of data. We need to differentiate between 1. Data snapshot (taken at a time, but valid at that point) 2. Data taken from different sources that does not form a uniform snapshot, because the timestamping of each of the collected data items is different > But, hmm, I don't see any strong reason to do this in kernel, then :( Please see my reason above for doing it in the kernel.
Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-18 8:26 ` Balbir Singh @ 2010-01-19 1:22 ` Daisuke Nishimura 2010-01-19 1:49 ` Balbir Singh 0 siblings, 1 reply; 31+ messages in thread From: Daisuke Nishimura @ 2010-01-19 1:22 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Daisuke Nishimura On Mon, 18 Jan 2010 13:56:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Monday 18 January 2010 06:19 AM, Daisuke Nishimura wrote: > > On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki > >> <kamezawa.hiroyu@jp.fujitsu.com> wrote: > >>> On Thu, 7 Jan 2010 14:57:36 +0530 > >>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >>> > >>>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]: > >>>> > >>>>> On Thu, 7 Jan 2010 17:48:14 +0900 > >>>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > >>>>>>>> "How pages are shared" doesn't show good hints. I don't hear such parameter > >>>>>>>> is used in production's resource monitoring software. > >>>>>>>> > >>>>>>> > >>>>>>> You mean "How many pages are shared" are not good hints, please see my > >>>>>>> justification above. With Virtualization (look at KSM for example), > >>>>>>> shared pages are going to be increasingly important part of the > >>>>>>> accounting. > >>>>>>> > >>>>>> > >>>>>> Considering KSM, your cuounting style is tooo bad. > >>>>>> > >>>>>> You should add > >>>>>> > >>>>>> - MEM_CGROUP_STAT_SHARED_BY_KSM > >>>>>> - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM > >>>>>> > >>>> > >>>> No.. I am just talking about shared memory being important and shared > >>>> accounting being useful, no counters for KSM in particular (in the > >>>> memcg context). > >>>> > >>> Think so ? The number of memcg-private pages is in interest in my point of view. 
> >>> > >>> Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated > >>> in the kernel. > >>> If you want to provide that in memcg, please add it to global VM as /proc/meminfo. > >>> > >>> IIUC, KSM/SHMEM has some official method in global VM. > >>> > >> > >> Kamezawa-San, > >> > >> I implemented the same in user space and I get really bad results, here is why > >> > >> 1. I need to hold and walk the tasks list in cgroups and extract RSS > >> through /proc (results in worse hold times for the fork() scenario you > >> menioned) > >> 2. The data is highly inconsistent due to the higher margin of error > >> in accumulating data which is changing as we run. By the time we total > >> and look at the memcg data, the data is stale > >> > >> Would you be OK with the patch, if I renamed "shared_usage_in_bytes" > >> to "non_private_usage_in_bytes"? > >> > > I think the name is still ambiguous. > > > > For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02, > > both process have 10MB anonymous pages and 10MB file caches of the same pages, and all of the > > file caches are charged to 01. > > In this case, the value in 01 is 0MB(=20MB - 20MB) and 10MB(20MB - 10MB), right? > > > > Correct, file cache is almost always considered shared, so it has > > 1. non-private or shared usage of 10MB > 2. 10 MB of file cache > > > I don't think "non private usage" is appropriate to this value. > > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > > to understand for users. > > Here is my concern > > 1. The gap between looking at memcg stat and sum of all RSS is way > higher in user space > 2. Summing up all rss without walking the tasks atomically can and > will lead to consistency issues. Data can be stale as long as it > represents a consistent snapshot of data > > We need to differentiate between > > 1. Data snapshot (taken at a time, but valid at that point) > 2. 
Data taken from different sources that does not form a uniform > snapshot, because the timestamping of the each of the collected data > items is different > Hmm, I'm sorry I can't understand why you need "difference". IOW, what can users or middlewares know by the value in the above case (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understand this point... Why would this value mean some of the groups are "heavy"? > > > But, hmm, I don't see any strong reason to do this in kernel, then :( > > Please see my reason above for doing it in the kernel. > > Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-19 1:22 ` Daisuke Nishimura @ 2010-01-19 1:49 ` Balbir Singh 2010-01-19 2:34 ` Daisuke Nishimura 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-19 1:49 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: [snip] >> Correct, file cache is almost always considered shared, so it has >> >> 1. non-private or shared usage of 10MB >> 2. 10 MB of file cache >> >> > I don't think "non private usage" is appropriate to this value. >> > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier >> > to understand for users. >> >> Here is my concern >> >> 1. The gap between looking at memcg stat and sum of all RSS is way >> higher in user space >> 2. Summing up all rss without walking the tasks atomically can and >> will lead to consistency issues. Data can be stale as long as it >> represents a consistent snapshot of data >> >> We need to differentiate between >> >> 1. Data snapshot (taken at a time, but valid at that point) >> 2. Data taken from different sources that does not form a uniform >> snapshot, because the timestamping of the each of the collected data >> items is different >> > Hmm, I'm sorry I can't understand why you need "difference". > IOW, what can users or middlewares know by the value in the above case > (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about > this point... Why can this value mean some of the groups are "heavy" ? > Consider a default cgroup that is not root and assume all applications move there initially. Now with a lot of shared memory, the default cgroup will be the first one to page in a lot of the memory and its usage will be very high. 
Without the concept of showing how much is non-private, how does one decide whether the default cgroup is using a lot of memory or merely sharing it? How do we decide on limits for a cgroup without knowing its actual usage - a PSS equivalent for a region of memory for a task. Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-19 1:49 ` Balbir Singh @ 2010-01-19 2:34 ` Daisuke Nishimura 2010-01-19 3:52 ` Balbir Singh 0 siblings, 1 reply; 31+ messages in thread From: Daisuke Nishimura @ 2010-01-19 2:34 UTC (permalink / raw) To: Balbir Singh Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Daisuke Nishimura On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura > <nishimura@mxp.nes.nec.co.jp> wrote: > [snip] > >> Correct, file cache is almost always considered shared, so it has > >> > >> 1. non-private or shared usage of 10MB > >> 2. 10 MB of file cache > >> > >> > I don't think "non private usage" is appropriate to this value. > >> > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > >> > to understand for users. > >> > >> Here is my concern > >> > >> 1. The gap between looking at memcg stat and sum of all RSS is way > >> higher in user space > >> 2. Summing up all rss without walking the tasks atomically can and > >> will lead to consistency issues. Data can be stale as long as it > >> represents a consistent snapshot of data > >> > >> We need to differentiate between > >> > >> 1. Data snapshot (taken at a time, but valid at that point) > >> 2. Data taken from different sources that does not form a uniform > >> snapshot, because the timestamping of the each of the collected data > >> items is different > >> > > Hmm, I'm sorry I can't understand why you need "difference". > > IOW, what can users or middlewares know by the value in the above case > > (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about > > this point... Why can this value mean some of the groups are "heavy" ? > > > > Consider a default cgroup that is not root and assume all applications > move there initially. 
Now with a lot of shared memory, > the default cgroup will be the first one to page in a lot of the > memory and its usage will be very high. Without the concept of > showing how much is non-private, how does one decide if the default > cgroup is using a lot of memory or sharing it? How > do we decide on limits of a cgroup without knowing its actual usage - > PSS equivalent for a region of memory for a task. > As for the limit, I think we should decide it based on the actual usage, because we account and limit the actual usage. Why should we take account of the sum of rss? I agree that we'd better not ignore the sum of rss completely, but could you show me in detail how the value 0MB/10MB can be used to calculate the limit in 01/02? I wouldn't argue against you if I could understand how the value would be useful, but I can't understand how it can be used, so I'm asking :) Thanks, Daisuke Nishimura.
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-19 2:34 ` Daisuke Nishimura @ 2010-01-19 3:52 ` Balbir Singh 2010-01-20 4:09 ` Daisuke Nishimura 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-19 3:52 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: > On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura >> <nishimura@mxp.nes.nec.co.jp> wrote: >> [snip] >>>> Correct, file cache is almost always considered shared, so it has >>>> >>>> 1. non-private or shared usage of 10MB >>>> 2. 10 MB of file cache >>>> >>>>> I don't think "non private usage" is appropriate to this value. >>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier >>>>> to understand for users. >>>> >>>> Here is my concern >>>> >>>> 1. The gap between looking at memcg stat and sum of all RSS is way >>>> higher in user space >>>> 2. Summing up all rss without walking the tasks atomically can and >>>> will lead to consistency issues. Data can be stale as long as it >>>> represents a consistent snapshot of data >>>> >>>> We need to differentiate between >>>> >>>> 1. Data snapshot (taken at a time, but valid at that point) >>>> 2. Data taken from different sources that does not form a uniform >>>> snapshot, because the timestamping of the each of the collected data >>>> items is different >>>> >>> Hmm, I'm sorry I can't understand why you need "difference". >>> IOW, what can users or middlewares know by the value in the above case >>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about >>> this point... Why can this value mean some of the groups are "heavy" ? >>> >> >> Consider a default cgroup that is not root and assume all applications >> move there initially. 
Now with a lot of shared memory, >> the default cgroup will be the first one to page in a lot of the >> memory and its usage will be very high. Without the concept of >> showing how much is non-private, how does one decide if the default >> cgroup is using a lot of memory or sharing it? How >> do we decide on limits of a cgroup without knowing its actual usage - >> PSS equivalent for a region of memory for a task. >> > As for limit, I think we should decide it based on the actual usage because > we account and limit the accual usage. Why we should take account of the sum of rss ? I am talking of non-private pages, or potentially shared pages, which is derived as follows: sum_of_all_rss - (rss + file_mapped) (from the .stat file). File cache is always considered shared. > I agree that we'd better not to ignore the sum of rss completely, but could you show me > how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ? In your example, usage shows that the real usage of the cgroup is 20 MB for 01 and 10 MB for 02. Today we show that we are using 40MB instead of 30MB (when summed). If an administrator has to make a decision to, say, add more resources, the one with 20MB would be the right place w.r.t. memory. > I wouldn't argue against you if I could understand the value would be useful, > but I can't understand how the value can be used, so I'm asking :) I understand; I am not completely closed to suggestions from you and Kamezawa-San, just trying to find a way to get useful information about shared memory usage back to user space. Remember that walking the LRU or even the VMA's to find shared pages is expensive. We could do it lazily at rmap time; that works well for charging, but not so well for uncharging, since we'll need to keep track of the mm's so that the charge can be properly marked as private or shared in the correct memcg. It will require more invasive work.
Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-19 3:52 ` Balbir Singh @ 2010-01-20 4:09 ` Daisuke Nishimura 2010-01-20 7:15 ` Daisuke Nishimura 2010-01-20 8:17 ` Balbir Singh 0 siblings, 2 replies; 31+ messages in thread From: Daisuke Nishimura @ 2010-01-20 4:09 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Daisuke Nishimura On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: > > On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura > >> <nishimura@mxp.nes.nec.co.jp> wrote: > >> [snip] > >>>> Correct, file cache is almost always considered shared, so it has > >>>> > >>>> 1. non-private or shared usage of 10MB > >>>> 2. 10 MB of file cache > >>>> > >>>>> I don't think "non private usage" is appropriate to this value. > >>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > >>>>> to understand for users. > >>>> > >>>> Here is my concern > >>>> > >>>> 1. The gap between looking at memcg stat and sum of all RSS is way > >>>> higher in user space > >>>> 2. Summing up all rss without walking the tasks atomically can and > >>>> will lead to consistency issues. Data can be stale as long as it > >>>> represents a consistent snapshot of data > >>>> > >>>> We need to differentiate between > >>>> > >>>> 1. Data snapshot (taken at a time, but valid at that point) > >>>> 2. Data taken from different sources that does not form a uniform > >>>> snapshot, because the timestamping of the each of the collected data > >>>> items is different > >>>> > >>> Hmm, I'm sorry I can't understand why you need "difference". > >>> IOW, what can users or middlewares know by the value in the above case > >>> (0MB in 01 and 10MB in 02)? 
I've read this thread, but I can't understande about > >>> this point... Why can this value mean some of the groups are "heavy" ? > >>> > >> > >> Consider a default cgroup that is not root and assume all applications > >> move there initially. Now with a lot of shared memory, > >> the default cgroup will be the first one to page in a lot of the > >> memory and its usage will be very high. Without the concept of > >> showing how much is non-private, how does one decide if the default > >> cgroup is using a lot of memory or sharing it? How > >> do we decide on limits of a cgroup without knowing its actual usage - > >> PSS equivalent for a region of memory for a task. > >> > > As for limit, I think we should decide it based on the actual usage because > > we account and limit the accual usage. Why we should take account of the sum of rss ? > > I am talking of non-private pages or potentially shared pages - which is > derived as follows > > sum_of_all_rss - (rss + file_mapped) (from .stat file) > > file cache is considered to be shared always > > > > I agree that we'd better not to ignore the sum of rss completely, but could you show me > > how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ? > > In your example, usage shows that the real usage of the cgroup is 20 MB > for 01 and 10 MB for 02. right. > Today we show that we are using 40MB instead of > 30MB (when summed). Sorry, I can't understand here. If we sum usage_in_bytes in both groups, it would be 30MB. If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M. If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups, it would be 40MB. > If an administrator has to make a decision to say > add more resources, the one with 20MB would be the right place w.r.t. > memory. > You mean he would add the additional resource to 00, right? 
Then, the smaller "shared_usage_in_bytes" is, the more likely an administrator should add additional resources to the group? But when both /cgroup/memory/aa and /cgroup/memory/bb have 20MB of actual usage, and aa has 10MB of "shared" usage (used by multiple processes *in aa*) while bb has none, "shared_usage_in_bytes" is 10MB in aa and 0MB in bb (please assume there is no "shared" usage between aa and bb). Should an administrator consider bb heavier than aa? I don't think so. IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which group is unfairly "heavy". The problem here is that "shared_usage_in_bytes" shows neither any one of, nor the sum of, the following values (*IFF* we had only one cgroup, "shared_usage_in_bytes" would mean a), but that has no use in a real case):

a) memory usage by multiple processes inside this group.
b) memory usage by processes both inside this group and in another group.
c) memory usage not used by any process inside this group, but used by processes in another group.

IMHO, we should take account of all the above values to determine which group is unfairly "heavy". I agree that the bigger a) is, the bigger "shared_usage_in_bytes" of the group would be, but we cannot learn anything about the size of b) from it, because those usages are included in both the actual usage (rss via stat) and the sum of rss (via mm_counter). To make matters worse, "shared_usage_in_bytes" has the opposite meaning for b), i.e., the more a process in some group (foo) has actual charges in *another* group (baa), the bigger "shared_usage_in_bytes" in "foo" would be (as with 00 and 01 in my example). I would agree with you if you added interfaces to show users some hints about the above values, but "shared_usage_in_bytes" doesn't meet that at all. Thanks, Daisuke Nishimura.
> > I wouldn't argue against you if I could understand the value would be useful, > > but I can't understand how the value can be used, so I'm asking :) > > I understand, I am not completely closed to suggestions from you and > Kamezawa-San, just trying to find a way to get useful information about > shared memory usage back to user space. Remember walking the LRU or even > VMA's to find shared pages is expensive. We could do it lazily at rmap > time, it works well for charging, but not too good for uncharging, since > we'll need to keep track of the mm's, so that if the mm that charge can > be properly marked as private or shared in the correct memcg. It will > require more invasive work. > > Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-20 4:09 ` Daisuke Nishimura @ 2010-01-20 7:15 ` Daisuke Nishimura 2010-01-20 7:43 ` KAMEZAWA Hiroyuki 2010-01-20 8:18 ` Balbir Singh 2010-01-20 8:17 ` Balbir Singh 1 sibling, 2 replies; 31+ messages in thread From: Daisuke Nishimura @ 2010-01-20 7:15 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Daisuke Nishimura On Wed, 20 Jan 2010 13:09:02 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: > On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: > > > On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > >> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura > > >> <nishimura@mxp.nes.nec.co.jp> wrote: > > >> [snip] > > >>>> Correct, file cache is almost always considered shared, so it has > > >>>> > > >>>> 1. non-private or shared usage of 10MB > > >>>> 2. 10 MB of file cache > > >>>> > > >>>>> I don't think "non private usage" is appropriate to this value. > > >>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > > >>>>> to understand for users. > > >>>> > > >>>> Here is my concern > > >>>> > > >>>> 1. The gap between looking at memcg stat and sum of all RSS is way > > >>>> higher in user space > > >>>> 2. Summing up all rss without walking the tasks atomically can and > > >>>> will lead to consistency issues. Data can be stale as long as it > > >>>> represents a consistent snapshot of data > > >>>> > > >>>> We need to differentiate between > > >>>> > > >>>> 1. Data snapshot (taken at a time, but valid at that point) > > >>>> 2. 
Data taken from different sources that does not form a uniform > > >>>> snapshot, because the timestamping of the each of the collected data > > >>>> items is different > > >>>> > > >>> Hmm, I'm sorry I can't understand why you need "difference". > > >>> IOW, what can users or middlewares know by the value in the above case > > >>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about > > >>> this point... Why can this value mean some of the groups are "heavy" ? > > >>> > > >> > > >> Consider a default cgroup that is not root and assume all applications > > >> move there initially. Now with a lot of shared memory, > > >> the default cgroup will be the first one to page in a lot of the > > >> memory and its usage will be very high. Without the concept of > > >> showing how much is non-private, how does one decide if the default > > >> cgroup is using a lot of memory or sharing it? How > > >> do we decide on limits of a cgroup without knowing its actual usage - > > >> PSS equivalent for a region of memory for a task. > > >> > > > As for limit, I think we should decide it based on the actual usage because > > > we account and limit the accual usage. Why we should take account of the sum of rss ? > > > > I am talking of non-private pages or potentially shared pages - which is > > derived as follows > > > > sum_of_all_rss - (rss + file_mapped) (from .stat file) > > > > file cache is considered to be shared always > > > > > > > I agree that we'd better not to ignore the sum of rss completely, but could you show me > > > how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ? > > > > In your example, usage shows that the real usage of the cgroup is 20 MB > > for 01 and 10 MB for 02. > right. > > > Today we show that we are using 40MB instead of > > 30MB (when summed). > Sorry, I can't understand here. > If we sum usage_in_bytes in both groups, it would be 30MB. 
> If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M. > If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups, > it would be 40MB. > > > If an administrator has to make a decision to say > > add more resources, the one with 20MB would be the right place w.r.t. > > memory. > > > You mean he would add the additional resource to 00, right? Then, > the smaller "shared_usage_in_bytes" is, the more likely an administrator should > add additional resources to the group ? > > But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage, > and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none, > "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is > no "shared" usage between aa and bb). > Should an administrator consider bb is heavier than aa ? I don't think so. > > IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which > group is unfairly "heavy". > > The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum > of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would > mean a), but it has no use in real case). > > a) memory usage used by multiple processes inside this group. > b) memory usage used by both processes inside this and another group. > c) memory usage not used by any processes inside this group, but used by > that of in another group. > > IMHO, we should take account of all the above values to determine which group > is unfairly "heavy". I agree that the bigger the size of a) is, the bigger > "shared_usage_in_bytes" of the group would be, but we cannot know any information about > the size of b) by it, becase those usages are included in both actual usage(rss via stat) > and sum of rss(via mm_counter). 
To make matters warse, "shared_usage_in_bytes" has > the opposite meaning about b), i.e., the more a processe in some group(foo) has actual > charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be > (as 00 and 01 in my example). > > I would agree with you if you add interfaces to show some hints to users about above values, > but "shared_usage_in_bytes" doesn't meet it at all. > This is just an idea(At least, we need interfaces to read and reset them). diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 385e29b..bf601f2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -83,6 +83,8 @@ enum mem_cgroup_stat_index { used by soft limit implementation */ MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out. used by threshold implementation */ + MEM_CGROUP_STAT_SHARED_IN_GROUP, + MEM_CGROUP_STAT_SHARED_FROM_OTHERS, MEM_CGROUP_STAT_NSTATS, }; @@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, lock_page_cgroup(pc); if (unlikely(PageCgroupUsed(pc))) { + struct mem_cgroup *charged = pc->mem_cgroup; + struct mem_cgroup_stat *stat; + struct mem_cgroup_stat_cpu *cpustat; + int cpu; + int shared_type; + unlock_page_cgroup(pc); mem_cgroup_cancel_charge(mem); + + stat = &charged->stat; + cpu = get_cpu(); + cpustat = &stat->cpustat[cpu]; + if (charged == mem) + shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP; + else + shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS; + __mem_cgroup_stat_add_safe(cpustat, shared_type, 1); + put_cpu(); + return; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-20 7:15 ` Daisuke Nishimura @ 2010-01-20 7:43 ` KAMEZAWA Hiroyuki 2010-01-20 8:18 ` Balbir Singh 1 sibling, 0 replies; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-20 7:43 UTC (permalink / raw) To: Daisuke Nishimura Cc: balbir, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Wed, 20 Jan 2010 16:15:33 +0900 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: > > I would agree with you if you add interfaces to show some hints to users about above values, > > but "shared_usage_in_bytes" doesn't meet it at all. > > > This is just an idea(At least, we need interfaces to read and reset them). > Seems attractive, but there is no way to decrement this counter in a _scalable_ way. We need some innovation to go this way. But I doubt how useful this would be. In general, we can assume - file is shared (because of its nature). - rss is private (because of its nature). Then, the problem is how rss (private anon) is shared. Except for a crazy program like AIM7, rss is private in most cases. Even if highly shared, in most cases the shared rss can be estimated by the size of the parent process's rss. And processes' parent-child relationship is apparent. Measurement is easy. If COW is troublesome, counting the # of COWs per process is a reasonable way. (But you have to fight with the cost of adding that.) I tend not to disagree with adding a counter to show "shared with other cgroup", but I disagree with "shared between processes". Thanks, -Kame > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 385e29b..bf601f2 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -83,6 +83,8 @@ enum mem_cgroup_stat_index { > used by soft limit implementation */ > MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
> used by threshold implementation */ > + MEM_CGROUP_STAT_SHARED_IN_GROUP, > + MEM_CGROUP_STAT_SHARED_FROM_OTHERS, > > MEM_CGROUP_STAT_NSTATS, > }; > @@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, > > lock_page_cgroup(pc); > if (unlikely(PageCgroupUsed(pc))) { > + struct mem_cgroup *charged = pc->mem_cgroup; > + struct mem_cgroup_stat *stat; > + struct mem_cgroup_stat_cpu *cpustat; > + int cpu; > + int shared_type; > + > unlock_page_cgroup(pc); > mem_cgroup_cancel_charge(mem); > + > + stat = &charged->stat; > + cpu = get_cpu(); > + cpustat = &stat->cpustat[cpu]; > + if (charged == mem) > + shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP; > + else > + shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS; > + __mem_cgroup_stat_add_safe(cpustat, shared_type, 1); > + put_cpu(); > + > return; > } >
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-20 7:15 ` Daisuke Nishimura 2010-01-20 7:43 ` KAMEZAWA Hiroyuki @ 2010-01-20 8:18 ` Balbir Singh 1 sibling, 0 replies; 31+ messages in thread From: Balbir Singh @ 2010-01-20 8:18 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Wednesday 20 January 2010 12:45 PM, Daisuke Nishimura wrote: > On Wed, 20 Jan 2010 13:09:02 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: >> On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >>> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: >>>> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >>>>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura >>>>> <nishimura@mxp.nes.nec.co.jp> wrote: >>>>> [snip] >>>>>>> Correct, file cache is almost always considered shared, so it has >>>>>>> >>>>>>> 1. non-private or shared usage of 10MB >>>>>>> 2. 10 MB of file cache >>>>>>> >>>>>>>> I don't think "non private usage" is appropriate to this value. >>>>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier >>>>>>>> to understand for users. >>>>>>> >>>>>>> Here is my concern >>>>>>> >>>>>>> 1. The gap between looking at memcg stat and sum of all RSS is way >>>>>>> higher in user space >>>>>>> 2. Summing up all rss without walking the tasks atomically can and >>>>>>> will lead to consistency issues. Data can be stale as long as it >>>>>>> represents a consistent snapshot of data >>>>>>> >>>>>>> We need to differentiate between >>>>>>> >>>>>>> 1. Data snapshot (taken at a time, but valid at that point) >>>>>>> 2. Data taken from different sources that does not form a uniform >>>>>>> snapshot, because the timestamping of the each of the collected data >>>>>>> items is different >>>>>>> >>>>>> Hmm, I'm sorry I can't understand why you need "difference". 
>>>>>> IOW, what can users or middlewares know by the value in the above case >>>>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about >>>>>> this point... Why can this value mean some of the groups are "heavy" ? >>>>>> >>>>> >>>>> Consider a default cgroup that is not root and assume all applications >>>>> move there initially. Now with a lot of shared memory, >>>>> the default cgroup will be the first one to page in a lot of the >>>>> memory and its usage will be very high. Without the concept of >>>>> showing how much is non-private, how does one decide if the default >>>>> cgroup is using a lot of memory or sharing it? How >>>>> do we decide on limits of a cgroup without knowing its actual usage - >>>>> PSS equivalent for a region of memory for a task. >>>>> >>>> As for limit, I think we should decide it based on the actual usage because >>>> we account and limit the accual usage. Why we should take account of the sum of rss ? >>> >>> I am talking of non-private pages or potentially shared pages - which is >>> derived as follows >>> >>> sum_of_all_rss - (rss + file_mapped) (from .stat file) >>> >>> file cache is considered to be shared always >>> >>> >>>> I agree that we'd better not to ignore the sum of rss completely, but could you show me >>>> how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ? >>> >>> In your example, usage shows that the real usage of the cgroup is 20 MB >>> for 01 and 10 MB for 02. >> right. >> >>> Today we show that we are using 40MB instead of >>> 30MB (when summed). >> Sorry, I can't understand here. >> If we sum usage_in_bytes in both groups, it would be 30MB. >> If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M. >> If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups, >> it would be 40MB. 
>> >>> If an administrator has to make a decision to say >>> add more resources, the one with 20MB would be the right place w.r.t. >>> memory. >>> >> You mean he would add the additional resource to 00, right? Then, >> the smaller "shared_usage_in_bytes" is, the more likely an administrator should >> add additional resources to the group ? >> >> But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage, >> and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none, >> "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is >> no "shared" usage between aa and bb). >> Should an administrator consider bb is heavier than aa ? I don't think so. >> >> IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which >> group is unfairly "heavy". >> >> The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum >> of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would >> mean a), but it has no use in real case). >> >> a) memory usage used by multiple processes inside this group. >> b) memory usage used by both processes inside this and another group. >> c) memory usage not used by any processes inside this group, but used by >> that of in another group. >> >> IMHO, we should take account of all the above values to determine which group >> is unfairly "heavy". I agree that the bigger the size of a) is, the bigger >> "shared_usage_in_bytes" of the group would be, but we cannot know any information about >> the size of b) by it, becase those usages are included in both actual usage(rss via stat) >> and sum of rss(via mm_counter). To make matters warse, "shared_usage_in_bytes" has >> the opposite meaning about b), i.e., the more a processe in some group(foo) has actual >> charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be >> (as 00 and 01 in my example). 
>> >> I would agree with you if you add interfaces to show some hints to users about above values, >> but "shared_usage_in_bytes" doesn't meet it at all. >> > This is just an idea(At least, we need interfaces to read and reset them). > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 385e29b..bf601f2 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -83,6 +83,8 @@ enum mem_cgroup_stat_index { > used by soft limit implementation */ > MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out. > used by threshold implementation */ > + MEM_CGROUP_STAT_SHARED_IN_GROUP, > + MEM_CGROUP_STAT_SHARED_FROM_OTHERS, > > MEM_CGROUP_STAT_NSTATS, > }; > @@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, > > lock_page_cgroup(pc); > if (unlikely(PageCgroupUsed(pc))) { > + struct mem_cgroup *charged = pc->mem_cgroup; > + struct mem_cgroup_stat *stat; > + struct mem_cgroup_stat_cpu *cpustat; > + int cpu; > + int shared_type; > + > unlock_page_cgroup(pc); > mem_cgroup_cancel_charge(mem); > + > + stat = &charged->stat; > + cpu = get_cpu(); > + cpustat = &stat->cpustat[cpu]; > + if (charged == mem) > + shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP; > + else > + shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS; > + __mem_cgroup_stat_add_safe(cpustat, shared_type, 1); > + put_cpu(); > + > return; How will this work during uncharge, if the original cgroup that owns the pages has unmapped them? Balbir Singh.
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-20 4:09 ` Daisuke Nishimura 2010-01-20 7:15 ` Daisuke Nishimura @ 2010-01-20 8:17 ` Balbir Singh 2010-01-21 1:04 ` Daisuke Nishimura 1 sibling, 1 reply; 31+ messages in thread From: Balbir Singh @ 2010-01-20 8:17 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Wednesday 20 January 2010 09:39 AM, Daisuke Nishimura wrote: > On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: >>> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >>>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura >>>> <nishimura@mxp.nes.nec.co.jp> wrote: >>>> [snip] >>>>>> Correct, file cache is almost always considered shared, so it has >>>>>> >>>>>> 1. non-private or shared usage of 10MB >>>>>> 2. 10 MB of file cache >>>>>> >>>>>>> I don't think "non private usage" is appropriate to this value. >>>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier >>>>>>> to understand for users. >>>>>> >>>>>> Here is my concern >>>>>> >>>>>> 1. The gap between looking at memcg stat and sum of all RSS is way >>>>>> higher in user space >>>>>> 2. Summing up all rss without walking the tasks atomically can and >>>>>> will lead to consistency issues. Data can be stale as long as it >>>>>> represents a consistent snapshot of data >>>>>> >>>>>> We need to differentiate between >>>>>> >>>>>> 1. Data snapshot (taken at a time, but valid at that point) >>>>>> 2. Data taken from different sources that does not form a uniform >>>>>> snapshot, because the timestamping of the each of the collected data >>>>>> items is different >>>>>> >>>>> Hmm, I'm sorry I can't understand why you need "difference". >>>>> IOW, what can users or middlewares know by the value in the above case >>>>> (0MB in 01 and 10MB in 02)? 
I've read this thread, but I can't understand >>>>> this point... Why can this value mean some of the groups are "heavy" ? >>>>> >>>> >>>> Consider a default cgroup that is not root and assume all applications >>>> move there initially. Now with a lot of shared memory, >>>> the default cgroup will be the first one to page in a lot of the >>>> memory and its usage will be very high. Without the concept of >>>> showing how much is non-private, how does one decide if the default >>>> cgroup is using a lot of memory or sharing it? How >>>> do we decide on limits of a cgroup without knowing its actual usage - >>>> PSS equivalent for a region of memory for a task. >>>> >>> As for limit, I think we should decide it based on the actual usage because >>> we account and limit the actual usage. Why should we take account of the sum of rss ? >> >> I am talking of non-private pages or potentially shared pages - which is >> derived as follows >> >> sum_of_all_rss - (rss + file_mapped) (from .stat file) >> >> file cache is considered to be shared always >> >> >>> I agree that we'd better not ignore the sum of rss completely, but could you show me >>> how the value 0MB/10MB can be used to calculate the limit in 01/02 in detail ? >> >> In your example, usage shows that the real usage of the cgroup is 20 MB >> for 01 and 10 MB for 02. > right. > >> Today we show that we are using 40MB instead of >> 30MB (when summed). > Sorry, I can't understand here. > If we sum usage_in_bytes in both groups, it would be 30MB. Right > If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30MB. > If we sum "total rss(rss_file, rss_anon) of all processes via mm_counter" in both groups, > it would be 40MB. > mm_counter would show 40MB, the memcg would show 30MB; you are right. But of the 30MB, do we say the one using 20MB is consuming more resources?
>> If an administrator has to make a decision to say >> add more resources, the one with 20MB would be the right place w.r.t. >> memory. >> > You mean he would add the additional resource to 00, right? Then, > the smaller "shared_usage_in_bytes" is, the more likely an administrator should > add additional resources to the group ? > > But when both /cgroup/memory/aa and /cgroup/memory/bb have 20MB as actual usage, > and aa has 10MB "shared" (used by multiple processes *in aa*) usage while bb has none, > "shared_usage_in_bytes" is 10MB in aa and 0MB in bb (please consider there is > no "shared" usage between aa and bb). > Should an administrator consider bb heavier than aa ? I don't think so. > No, but before OOM killing the cgroup "aa", or considering moving it in a virtual environment, the real usage should be considered, or at least the fact that moving "bb" would require 20MB. > IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which > group is unfairly "heavy". > No, but it gives an idea of the sharing, which can be important for making decisions and estimating the real usage. In the case of aa, one can estimate private usage to be 20MB - 10MB (10MB), which is one correct way of looking at the heaviness of the cgroup. > The problem here is, "shared_usage_in_bytes" shows neither any one of, nor the sum > of, the following values (*IFF* we had only one cgroup, "shared_usage_in_bytes" would > mean a), but that has no use in a real case). > > a) memory usage used by multiple processes inside this group. > b) memory usage used by both processes inside this and another group. > c) memory usage not used by any processes inside this group, but used by > those of another group. > > IMHO, we should take account of all the above values to determine which group > is unfairly "heavy".
I agree that the bigger the size of a) is, the bigger > "shared_usage_in_bytes" of the group would be, but we cannot know any information about > the size of b) from it, because those usages are included in both the actual usage (rss via stat) (b) IMHO is a longer-term goal and can be estimated from the PSS of the processes within the cgroup > and the sum of rss (via mm_counter). To make matters worse, "shared_usage_in_bytes" has > the opposite meaning about b), i.e., the more a process in some group (foo) has actual > charges in *another* group (baa), the bigger "shared_usage_in_bytes" in "foo" would be > (as 00 and 01 in my example). > > I would agree with you if you add interfaces to show some hints to users about the above values, > but "shared_usage_in_bytes" doesn't meet it at all. > Not sure I follow your suggestion here. Balbir
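The estimate Balbir derives above, sum_of_all_rss - (rss + file_mapped), can be sketched numerically; this is an editorial illustration reusing the aa/bb figures from the discussion, not code from the patch:

```python
MB = 1 << 20

def shared_usage(sum_of_all_rss, memcg_rss, memcg_file_mapped):
    # Balbir's estimate: per-mm counters minus the memcg's own charge
    # counters. A page mapped by N processes in the group contributes N
    # times to sum_of_all_rss but only once to the memcg stat counters,
    # so each extra mapping shows up in the difference.
    return sum_of_all_rss - (memcg_rss + memcg_file_mapped)

# Nishimura's aa/bb example: both groups charge 20 MB of anon memory, but
# in aa a 10 MB region is mapped by two processes (so aa's per-mm sum is
# 10 MB private + 2 * 10 MB shared = 30 MB).
aa = shared_usage(sum_of_all_rss=30 * MB, memcg_rss=20 * MB, memcg_file_mapped=0)
bb = shared_usage(sum_of_all_rss=20 * MB, memcg_rss=20 * MB, memcg_file_mapped=0)
print(aa // MB, bb // MB)  # 10 0
```

Note that a process mapping pages charged to a different group also inflates sum_of_all_rss here without touching this group's memcg counters, so the same difference mixes intra-group sharing with cross-group charges, which is exactly the ambiguity Nishimura objects to.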
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-20 8:17 ` Balbir Singh @ 2010-01-21 1:04 ` Daisuke Nishimura 2010-01-21 1:30 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 31+ messages in thread From: Daisuke Nishimura @ 2010-01-21 1:04 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org, Daisuke Nishimura On Wed, 20 Jan 2010 13:47:13 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > On Wednesday 20 January 2010 09:39 AM, Daisuke Nishimura wrote: > > On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote: > >>> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >>>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura > >>>> <nishimura@mxp.nes.nec.co.jp> wrote: > >>>> [snip] > >>>>>> Correct, file cache is almost always considered shared, so it has > >>>>>> > >>>>>> 1. non-private or shared usage of 10MB > >>>>>> 2. 10 MB of file cache > >>>>>> > >>>>>>> I don't think "non private usage" is appropriate to this value. > >>>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier > >>>>>>> to understand for users. > >>>>>> > >>>>>> Here is my concern > >>>>>> > >>>>>> 1. The gap between looking at memcg stat and sum of all RSS is way > >>>>>> higher in user space > >>>>>> 2. Summing up all rss without walking the tasks atomically can and > >>>>>> will lead to consistency issues. Data can be stale as long as it > >>>>>> represents a consistent snapshot of data > >>>>>> > >>>>>> We need to differentiate between > >>>>>> > >>>>>> 1. Data snapshot (taken at a time, but valid at that point) > >>>>>> 2. 
Data taken from different sources that does not form a uniform > >>>>>> snapshot, because the timestamping of the each of the collected data > >>>>>> items is different > >>>>>> > >>>>> Hmm, I'm sorry I can't understand why you need "difference". > >>>>> IOW, what can users or middlewares know by the value in the above case > >>>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about > >>>>> this point... Why can this value mean some of the groups are "heavy" ? > >>>>> > >>>> > >>>> Consider a default cgroup that is not root and assume all applications > >>>> move there initially. Now with a lot of shared memory, > >>>> the default cgroup will be the first one to page in a lot of the > >>>> memory and its usage will be very high. Without the concept of > >>>> showing how much is non-private, how does one decide if the default > >>>> cgroup is using a lot of memory or sharing it? How > >>>> do we decide on limits of a cgroup without knowing its actual usage - > >>>> PSS equivalent for a region of memory for a task. > >>>> > >>> As for limit, I think we should decide it based on the actual usage because > >>> we account and limit the accual usage. Why we should take account of the sum of rss ? > >> > >> I am talking of non-private pages or potentially shared pages - which is > >> derived as follows > >> > >> sum_of_all_rss - (rss + file_mapped) (from .stat file) > >> > >> file cache is considered to be shared always > >> > >> > >>> I agree that we'd better not to ignore the sum of rss completely, but could you show me > >>> how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ? > >> > >> In your example, usage shows that the real usage of the cgroup is 20 MB > >> for 01 and 10 MB for 02. > > right. > > > >> Today we show that we are using 40MB instead of > >> 30MB (when summed). > > Sorry, I can't understand here. > > If we sum usage_in_bytes in both groups, it would be 30MB. 
> > Right > > > If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30MB. > > If we sum "total rss(rss_file, rss_anon) of all processes via mm_counter" in both groups, > > it would be 40MB. > > > > mm_counter would show 40MB, the memcg would show 30MB; you are right. But > of the 30MB, do we say the one using 20MB is consuming more resources? > > >> If an administrator has to make a decision to say > >> add more resources, the one with 20MB would be the right place w.r.t. > >> memory. > >> > > You mean he would add the additional resource to 00, right? Then, > > the smaller "shared_usage_in_bytes" is, the more likely an administrator should > > add additional resources to the group ? > > > > But when both /cgroup/memory/aa and /cgroup/memory/bb have 20MB as actual usage, > > and aa has 10MB "shared" (used by multiple processes *in aa*) usage while bb has none, > > "shared_usage_in_bytes" is 10MB in aa and 0MB in bb (please consider there is > > no "shared" usage between aa and bb). > > Should an administrator consider bb heavier than aa ? I don't think so. > > > > No, but before OOM killing the cgroup "aa", or considering moving it in a virtual > environment, the real usage should be considered, or > at least the fact that moving "bb" would require 20MB. > > > IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which > > group is unfairly "heavy". > > > > No, but it gives an idea of the sharing, which can be important for > making decisions and estimating the real usage. In the case of aa, one > can estimate private usage to be 20MB - 10MB (10MB), which is one correct > way of looking at the heaviness of the cgroup. > This inconsistency is the problem I worry about the most. The bigger "shared_usage_in_bytes" is, the more likely the group is "heavy", or the opposite ? It only confuses users. The "shared_usage_in_bytes" of A can be used to roughly estimate a sum of i) memory usage used by multiple processes in A.
ii) memory usage processes in A charge to OTHER GROUPS. ^^^^^^^^^^^^^^^^^^^^^^^ I would say "yes, it might be useful to decide the weight of A" if it could be used to estimate a sum of i) and iii) memory usage processes in OTHER GROUPS charge to A. ^^^^^^^^^^^^ Anyway, I won't say any more about the usefulness of "shared_usage_in_bytes". But if you dare to add this interface to the kernel, please, please write documentation stating that it can be used to roughly estimate a sum of i) and ii), not the sum of i) and iii), and can be used to decide the weight of the group only when few pages are shared between groups, so that users don't misunderstand or misuse the interface. And I think you should answer what Kamezawa-san pointed out in http://lkml.org/lkml/2010/1/17/186. Thanks, Daisuke Nishimura. > > The problem here is, "shared_usage_in_bytes" shows neither any one of, nor the sum > > of, the following values (*IFF* we had only one cgroup, "shared_usage_in_bytes" would > > mean a), but that has no use in a real case). > > > > a) memory usage used by multiple processes inside this group. > > b) memory usage used by both processes inside this and another group. > > c) memory usage not used by any processes inside this group, but used by > > those of another group. > > > > IMHO, we should take account of all the above values to determine which group > > is unfairly "heavy". I agree that the bigger the size of a) is, the bigger > > "shared_usage_in_bytes" of the group would be, but we cannot know any information about > > the size of b) from it, because those usages are included in both the actual usage (rss via stat) > > (b) IMHO is a longer-term goal and can be estimated from the PSS of the > processes within the cgroup > > > and the sum of rss (via mm_counter).
To make matters worse, "shared_usage_in_bytes" has > > the opposite meaning about b), i.e., the more a process in some group (foo) has actual > > charges in *another* group (baa), the bigger "shared_usage_in_bytes" in "foo" would be > > (as 00 and 01 in my example). > > > > I would agree with you if you add interfaces to show some hints to users about the above values, > > but "shared_usage_in_bytes" doesn't meet it at all. > > > > Not sure I follow your suggestion here. > > Balbir
* Re: [RFC] Shared page accounting for memory cgroup 2010-01-21 1:04 ` Daisuke Nishimura @ 2010-01-21 1:30 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 31+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-01-21 1:30 UTC (permalink / raw) To: Daisuke Nishimura Cc: balbir, linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org On Thu, 21 Jan 2010 10:04:16 +0900 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: > Anyway, I won't say any more about the usefulness of "shared_usage_in_bytes". > > But if you dare to add this interface to the kernel, please, please write documentation > stating that it can be used to roughly estimate a sum of i) and ii), not the sum of i) and iii), and > can be used to decide the weight of the group only when few pages are shared between groups, > so that users don't misunderstand or misuse the interface. > > And I think you should answer what Kamezawa-san pointed out in http://lkml.org/lkml/2010/1/17/186. > > I wouldn't like to say anything other than "please add the stat to the global VM before memcg if it's really important", because it seems I couldn't persuade him, and he can't persuade me. I myself never think the sum of rss is important. An additional claim I can easily think of is fork()->exit(). Assume there is a program with 1GB RSS which invokes a helper program by fork()->exec(). This is a common situation. Then, the sum of RSS can easily jump up or down by 1GB. Even if the data is gathered atomically, it can be distorted very easily, and users have to remove the noise by themselves. So there is not much difference between calculating RSS in userland and in the kernel. Users have to measure the status and estimate the stable value with statistical techniques. Thanks, -Kame
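Kamezawa's fork()/exec() noise argument can be sketched with rough numbers (editorial illustration; the 1GB figure is his, the helper size and timeline are invented). After fork(), parent and child share pages copy-on-write, yet each mm reports the full mapping in its rss counters, so the per-group sum briefly doubles until the child exec()s or exits:

```python
GB = 1 << 30
MB = 1 << 20

timeline = []
parent_rss = 1 * GB

timeline.append(("before fork", parent_rss))             # 1 GB
child_rss = parent_rss                                    # COW pages counted again in the child's mm
timeline.append(("after fork", parent_rss + child_rss))   # momentary 2 GB spike
child_rss = 4 * MB                                        # child exec()s a small helper
timeline.append(("after exec", parent_rss + child_rss))   # back near 1 GB

for label, total in timeline:
    print(f"{label:12s} {total / GB:.2f} GB")
```

Any sampler reading the per-mm sum during the spike sees a value off by the parent's whole RSS, which is why he argues users must smooth such readings statistically whether the sum is computed in the kernel or in userland.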
end of thread, other threads:[~2010-01-21 1:34 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page --)
2009-12-29 18:27 [RFC] Shared page accounting for memory cgroup Balbir Singh
2010-01-03 23:51 ` KAMEZAWA Hiroyuki
2010-01-04 0:07 ` Balbir Singh
2010-01-04 0:35 ` KAMEZAWA Hiroyuki
2010-01-04 0:50 ` Balbir Singh
2010-01-06 4:02 ` KAMEZAWA Hiroyuki
2010-01-06 7:01 ` Balbir Singh
2010-01-06 7:12 ` KAMEZAWA Hiroyuki
2010-01-07 7:15 ` Balbir Singh
2010-01-07 7:36 ` KAMEZAWA Hiroyuki
2010-01-07 8:34 ` Balbir Singh
2010-01-07 8:48 ` KAMEZAWA Hiroyuki
2010-01-07 9:08 ` KAMEZAWA Hiroyuki
2010-01-07 9:27 ` Balbir Singh
2010-01-07 23:47 ` KAMEZAWA Hiroyuki
2010-01-17 19:30 ` Balbir Singh
2010-01-18 0:05 ` KAMEZAWA Hiroyuki
2010-01-18 0:22 ` KAMEZAWA Hiroyuki
2010-01-18 0:49 ` Daisuke Nishimura
2010-01-18 8:26 ` Balbir Singh
2010-01-19 1:22 ` Daisuke Nishimura
2010-01-19 1:49 ` Balbir Singh
2010-01-19 2:34 ` Daisuke Nishimura
2010-01-19 3:52 ` Balbir Singh
2010-01-20 4:09 ` Daisuke Nishimura
2010-01-20 7:15 ` Daisuke Nishimura
2010-01-20 7:43 ` KAMEZAWA Hiroyuki
2010-01-20 8:18 ` Balbir Singh
2010-01-20 8:17 ` Balbir Singh
2010-01-21 1:04 ` Daisuke Nishimura
2010-01-21 1:30 ` KAMEZAWA Hiroyuki