* [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
  0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2012-11-13 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
  Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Note: The scan period is much larger than it was in the original patch.
	The reason is that system CPU usage went through the roof with a
	sample period of 500ms, but it was also unacceptable for a large
	process to stall for an excessively long time while updating
	pte_numa. This may need to be tuned again if a placement policy
	converges too slowly.

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (proportional to the CPU
cycles the task executes). If the offset reaches the last
mapped address of the mm, it starts over at the first
address.

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 2 seconds up to just once per 32 seconds.  The current
sampling volume is 256 MB per interval.

As tasks mature and their working set converges, the
sampling rate slows down to just a trickle: 256 MB per 32
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
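
[ For a rough sense of the resulting scan throughput, derived purely from
  the numbers above (not part of the patch):

	256 MB per  2 sec of CPU time  ~= 128 MB/sec   (scan_period_min)
	256 MB per 32 sec of CPU time  ~=   8 MB/sec   (scan_period_max)

  so a task with, say, a 4 GB address space is covered after 16 intervals,
  i.e. between 32 and 512 seconds of executed CPU time depending on where
  the adaptive interval has settled. ]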

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mm_types.h |    3 +++
 include/linux/sched.h    |    1 +
 kernel/sched/fair.c      |   61 ++++++++++++++++++++++++++++++++++++----------
 kernel/sysctl.c          |    7 ++++++
 4 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
 	 */
 	unsigned long numa_next_scan;
 
+	/* Restart point for scanning and setting pte_numa */
+	unsigned long numa_scan_offset;
+
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 241e4f7..6b8a14f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
 extern unsigned int sysctl_balance_numa_scan_period_min;
 extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
 extern unsigned int sysctl_balance_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ea13e9..6df5620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_BALANCE_NUMA
 /*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
  */
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 2000;
+unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;
 
 static void task_numa_placement(struct task_struct *p)
 {
@@ -822,6 +825,9 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -851,18 +857,47 @@ void task_numa_work(struct callback_head *work)
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
+	offset = mm->numa_scan_offset;
+	length = sysctl_balance_numa_scan_size;
+	length <<= 20;
 
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_prot_numa(vma, vma->vm_start, vma->vm_end);
-		}
-		up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		ACCESS_ONCE(mm->numa_scan_seq)++;
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		/* Skip small VMAs. They are not likely to be of relevance */
+		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_numa(vma, offset, end);
+
+		offset = end;
+	}
+
+	/*
+	 * It is possible to reach the end of the VMA list but the last few VMAs are
+	 * not guaranteed to be vma_migratable. If they are not, we would find the
+	 * !migratable VMA on the next scan but not reset the scanner to the start,
+	 * so check it now.
+	 */
+	if (!vma) {
+		ACCESS_ONCE(mm->numa_scan_seq)++;
+		offset = 0;
+		vma = mm->mmap;
 	}
+	mm->numa_scan_offset = offset;
+	up_read(&mm->mmap_sem);
 }
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "balance_numa_scan_size_mb",
+		.data		= &sysctl_balance_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif /* CONFIG_BALANCE_NUMA */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.9.2



* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
@ 2012-11-14 17:24 Andrew Theurer
  2012-11-14 18:28 ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Theurer @ 2012-11-14 17:24 UTC (permalink / raw)
  To: Mel Gorman; +Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel


> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Note: The scan period is much larger than it was in the original patch.
> 	The reason is that system CPU usage went through the roof with a
> 	sample period of 500ms, but it was also unacceptable for a large
> 	process to stall for an excessively long time while updating
> 	pte_numa. This may need to be tuned again if a placement policy
> 	converges too slowly.
> 
> Previously, to probe the working set of a task, we'd use
> a very simple and crude method: mark all of its address
> space PROT_NONE.
> 
> That method has various (obvious) disadvantages:
> 
>  - it samples the working set at dissimilar rates,
>    giving some tasks a sampling quality advantage
>    over others.
> 
>  - creates performance problems for tasks with very
>    large working sets
> 
>  - over-samples processes with large address spaces but
>    which only very rarely execute
> 
> Improve that method by keeping a rotating offset into the
> address space that marks the current position of the scan,
> and advance it at a constant rate (proportional to the CPU
> cycles the task executes). If the offset reaches the last
> mapped address of the mm, it starts over at the first
> address.

I believe we will have problems with this. For example, running a large
KVM VM with 512GB memory, using the new defaults in this patch, and
assuming we never go longer per scan than the scan_period_min, it would
take over an hour to scan the entire VM just once.  The defaults could
be changed, but ideally there should be no knobs like this in the final
version, as it should just work well under all conditions.
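
(For reference, the arithmetic behind that estimate: 512 GB / 256 MB =
2048 scan intervals, and at one interval per scan_period_min of 2 seconds
that is roughly 4096 seconds of executed CPU time, a little over an hour.)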

Also, if such a method is kept, would it be possible to base it on a fixed
number of pages successfully marked instead of a MB range?  The reason I
bring it up is that we can often have VMs which are large in their
memory definition, but might not actually have a lot of pages faulted
in.  We could be "scanning" sections of a VMA which are not even
present yet.

> The per-task nature of the working set sampling functionality in this tree
> allows such constant rate, per task, execution-weight proportional sampling
> of the working set, with an adaptive sampling interval/frequency that
> goes from once per 2 seconds up to just once per 32 seconds.  The current
> sampling volume is 256 MB per interval.

Once a new section is marked, is the previous section automatically
reverted?  If not, I wonder if there's risk of building up a ton of
potential page faults?

> As tasks mature and their working set converges, the
> sampling rate slows down to just a trickle: 256 MB per 32
> seconds of CPU time executed.
> 
> This, beyond being adaptive, also rate-limits rarely
> executing systems and does not over-sample on overloaded
> systems.

I am wondering if it would be better to shrink the scan period back to a
much smaller fixed value, and instead of picking 256MB ranges of memory
to mark completely, go back to using all of the address space, but mark
only every Nth page.  N is adjusted each period to target a rolling
average of X faults per MB per execution time period.  This per task N
would also be an interesting value to rank memory access frequency among
tasks and help prioritize scheduling decisions.
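
(To make that concrete, a rough and purely illustrative sketch of the
per-period adjustment of N; every identifier below is invented, nothing
like this exists in the patch, and it uses a simple doubling/halving
controller rather than the rolling average mentioned above:

	/*
	 * Illustrative only: adjust the marking stride N for the next
	 * period so the observed hinting-fault rate moves toward a target.
	 */
	static unsigned int numa_adjust_stride(struct task_struct *p,
					       unsigned long faults,
					       unsigned long mb_scanned)
	{
		unsigned long rate = faults / max(mb_scanned, 1UL);

		if (rate > TARGET_FAULTS_PER_MB && p->numa_stride < STRIDE_MAX)
			p->numa_stride <<= 1;	/* too many faults: mark fewer pages */
		else if (rate < TARGET_FAULTS_PER_MB / 2 && p->numa_stride > 1)
			p->numa_stride >>= 1;	/* too few faults: mark more pages */

		return p->numa_stride;
	}

where p->numa_stride would be the per-task N, and TARGET_FAULTS_PER_MB and
STRIDE_MAX its target and upper bound.)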

-Andrew Theurer

> <SNIP>



* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-14 17:24 [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Andrew Theurer
@ 2012-11-14 18:28 ` Mel Gorman
  2012-11-14 19:39   ` Andrew Theurer
  0 siblings, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2012-11-14 18:28 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel

On Wed, Nov 14, 2012 at 11:24:42AM -0600, Andrew Theurer wrote:
> 
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > 
> > Note: The scan period is much larger than it was in the original patch.
> > 	The reason is that system CPU usage went through the roof with a
> > 	sample period of 500ms, but it was also unacceptable for a large
> > 	process to stall for an excessively long time while updating
> > 	pte_numa. This may need to be tuned again if a placement policy
> > 	converges too slowly.
> > 
> > Previously, to probe the working set of a task, we'd use
> > a very simple and crude method: mark all of its address
> > space PROT_NONE.
> > 
> > That method has various (obvious) disadvantages:
> > 
> >  - it samples the working set at dissimilar rates,
> >    giving some tasks a sampling quality advantage
> >    over others.
> > 
> >  - creates performance problems for tasks with very
> >    large working sets
> > 
> >  - over-samples processes with large address spaces but
> >    which only very rarely execute
> > 
> > Improve that method by keeping a rotating offset into the
> > address space that marks the current position of the scan,
> > and advance it at a constant rate (proportional to the CPU
> > cycles the task executes). If the offset reaches the last
> > mapped address of the mm, it starts over at the first
> > address.
> 
> I believe we will have problems with this. For example, running a large
> KVM VM with 512GB memory, using the new defaults in this patch, and
> assuming we never go longer per scan than the scan_period_min, it would
> take over an hour to scan the entire VM just once.  The defaults could
> be changed, but ideally there should be no knobs like this in the final
> version, as it should just work well under all conditions.
> 

Good point. I'll switch to the old defaults. The system CPU usage will
be high but that has to be coped with anyway. Ideally the tunables would
go away but for now they are handy for debugging.

> Also, if such a method is kept, would it be possible to base it on a fixed
> number of pages successfully marked instead of a MB range? 

I see a patch for that in the -tip tree. I'm still debating this with
myself. On the one hand, it'll update the PTEs faster. On the other
hand, the time spent scanning is now variable because it depends on the
number of PTE updates. It's no longer constant in terms of scanning time,
although it would still be constant in terms of PTEs updated. Hmm..

> The reason I
> bring it up is that we can often have VMs which are large in their
> memory definition, but might not actually have a lot of pages faulted
> in.  We could be "scanning" sections of a VMA which are not even
> present yet.
> 

Ok, thanks for that. That would push me towards accepting it and being
ok with the variable amount of scanning.

> > The per-task nature of the working set sampling functionality in this tree
> > allows such constant rate, per task, execution-weight proportional sampling
> > of the working set, with an adaptive sampling interval/frequency that
> > goes from once per 2 seconds up to just once per 32 seconds.  The current
> > sampling volume is 256 MB per interval.
> 
> Once a new section is marked, is the previous section automatically
> reverted? 

No.

> If not, I wonder if there's risk of building up a ton of
> potential page faults?
> 

Yes, if the full address space is suddenly referenced.

> > As tasks mature and their working set converges, the
> > sampling rate slows down to just a trickle: 256 MB per 32
> > seconds of CPU time executed.
> > 
> > This, beyond being adaptive, also rate-limits rarely
> > executing systems and does not over-sample on overloaded
> > systems.
> 
> I am wondering if it would be better to shrink the scan period back to a
> much smaller fixed value,

I'll do that anyway.

> and instead of picking 256MB ranges of memory
> to mark completely, go back to using all of the address space, but mark
> only every Nth page. 

It'll still be necessary to do the full walk and I wonder if we'd lose on
the larger number of PTE locks that will have to be taken to do a scan if
we are only updating every 128 pages for example. It could be very expensive.

> N is adjusted each period to target a rolling
> average of X faults per MB per execution time period.  This per task N
> would also be an interesting value to rank memory access frequency among
> tasks and help prioritize scheduling decisions.
> 

It's an interesting idea. I'll think on it more but my initial reaction
is that the cost could be really high.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-14 18:28 ` Mel Gorman
@ 2012-11-14 19:39   ` Andrew Theurer
  2012-11-15 10:27     ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Theurer @ 2012-11-14 19:39 UTC (permalink / raw)
  To: Mel Gorman; +Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel

On Wed, 2012-11-14 at 18:28 +0000, Mel Gorman wrote:
> On Wed, Nov 14, 2012 at 11:24:42AM -0600, Andrew Theurer wrote:
> > 
> > > <SNIP>
> > 
> > I am wondering if it would be better to shrink the scan period back to a
> > much smaller fixed value,
> 
> I'll do that anyway.
> 
> > and instead of picking 256MB ranges of memory
> > to mark completely, go back to using all of the address space, but mark
> > only every Nth page. 
> 
> It'll still be necessary to do the full walk and I wonder if we'd lose on
> the larger number of PTE locks that will have to be taken to do a scan if
> we are only updating every 128 pages for example. It could be very expensive.

Yes, good point.  My other inclination was not doing a mass marking of
pages at all (except just one time at some point after task init) and
conditionally setting or clearing the prot_numa in the fault path itself
to control the fault rate.  The problem I see is I am not sure how we
"back-off" the fault rate per page.  You could choose to not leave the
page marked, but then you never get a fault on that page again, so
there's no good way to mark it again in the fault path for that page
unless you have the periodic marker.  However, maybe a certain number of
pages are considered clustered together, and a fault from any page is
considered a fault for the cluster of pages.  When handling the fault,
the number of pages which are marked in the cluster is varied to achieve
a target, reasonable fault rate.  Might be able to treat page migrations
in clusters as well...  I probably need to think about this a bit
more....

> 
> > N is adjusted each period to target a rolling
> > average of X faults per MB per execution time period.  This per task N
> > would also be an interesting value to rank memory access frequency among
> > tasks and help prioritize scheduling decisions.
> > 
> 
> It's an interesting idea. I'll think on it more but my initial reaction
> is that the cost could be really high.

-Andrew Theurer




* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
  2012-11-14 19:39   ` Andrew Theurer
@ 2012-11-15 10:27     ` Mel Gorman
  0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2012-11-15 10:27 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel

On Wed, Nov 14, 2012 at 01:39:53PM -0600, Andrew Theurer wrote:
> > > <SNIP>
> > >
> > > I am wondering if it would be better to shrink the scan period back to a
> > > much smaller fixed value,
> > 
> > I'll do that anyway.
> > 
> > > and instead of picking 256MB ranges of memory
> > > to mark completely, go back to using all of the address space, but mark
> > > only every Nth page. 
> > 
> > It'll still be necessary to do the full walk and I wonder if we'd lose on
> > the larger number of PTE locks that will have to be taken to do a scan if
> > we are only updating every 128 pages for example. It could be very expensive.
> 
> Yes, good point.  My other inclination was not doing a mass marking of
> pages at all (except just one time at some point after task init) and
> conditionally setting or clearing the prot_numa in the fault path itself
> to control the fault rate. 

That's a bit of a catch-22. You need faults to control the scan rate
which determines the fault rate.

One thing that could be done is to rate-limit the PTE scanning-and-updating
if there is an excessive number of migrations due to NUMA
hinting faults within a given window. I've prototyped something along
these lines. The problem is that it'll disrupt the accuracy of the
statistics gathered by the hinting faults.

> The problem I see is I am not sure how we
> "back-off" the fault rate per page. 

I went for a straight cutoff. If a node has migrated too much recently,
no PTEs are marked for update if the PTE points to a page on that node. I
know it's a big heavy hammer but it'll indicate if it's worthwhile.
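
A purely illustrative sketch of the kind of cutoff described above (this
is not the actual prototype; every identifier below is invented):

	static unsigned long numa_migrate_window_end[MAX_NUMNODES];
	static atomic_t numa_migrated_pages[MAX_NUMNODES];

	/*
	 * Skip marking PTEs that point at pages on @nid if that node has
	 * already seen too many NUMA hinting migrations in the current
	 * window. Hypothetical helper, for illustration only.
	 */
	static bool numa_migrate_throttled(int nid)
	{
		if (time_after(jiffies, numa_migrate_window_end[nid])) {
			/* window expired: reset and resume normal marking */
			numa_migrate_window_end[nid] = jiffies +
				msecs_to_jiffies(NUMA_THROTTLE_WINDOW_MS);
			atomic_set(&numa_migrated_pages[nid], 0);
			return false;
		}
		return atomic_read(&numa_migrated_pages[nid]) >
				NUMA_THROTTLE_LIMIT_PAGES;
	}

The scan side would check something like this before calling
change_prot_numa() on a range, and the hinting fault path would bump the
counter whenever it actually migrates a page off that node.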

> You could choose to not leave the
> page marked, but then you never get a fault on that page again, so
> there's no good way to mark it again in the fault path for that page
> unless you have the periodic marker. 

In my case, the throttle window expires and it goes back to scanning at
the normal rate. I've changed the details of how the scanning rate
increases and decreases but how exactly is not that important right now.

> However, maybe a certain number of
> pages are considered clustered together, and a fault from any page is
> considered a fault for the cluster of pages.  When handling the fault,
> the number of pages which are marked in the cluster is varied to achieve
> a target, reasonable fault rate.  Might be able to treat page migrations
> in clusters as well...  I probably need to think about this a bit
> more....
> 

FWIW, I'm wary of putting too many smarts into how the scanning rates are
adapted. It'll be too specific to workloads and machine sizes.

-- 
Mel Gorman
SUSE Labs

