* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
@ 2012-11-14 17:24 Andrew Theurer
2012-11-14 18:28 ` Mel Gorman
From: Andrew Theurer @ 2012-11-14 17:24 UTC
To: Mel Gorman; +Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
> Note: The scan period is much larger than it was in the original patch.
> The reason was because the system CPU usage went through the roof
> with a sample period of 500ms but it was unsuitable to have a
> situation where a large process could stall for excessively long
> updating pte_numa. This may need to be tuned again if a placement
> policy converges too slowly.
>
> Previously, to probe the working set of a task, we'd use
> a very simple and crude method: mark all of its address
> space PROT_NONE.
>
> That method has various (obvious) disadvantages:
>
> - it samples the working set at dissimilar rates,
> giving some tasks a sampling quality advantage
> over others.
>
> - creates performance problems for tasks with very
> large working sets
>
> - over-samples processes with large address spaces but
> which only very rarely execute
>
> Improve that method by keeping a rotating offset into the
> address space that marks the current position of the scan,
> and advance it by a constant rate (in a CPU cycles execution
> proportional manner). If the offset reaches the last mapped
> address of the mm then it starts over at the first
> address.
I believe we will have problems with this. For example, running a large
KVM VM with 512GB memory, using the new defaults in this patch, and
assuming we never go longer per scan than the scan_period_min, it would
take over an hour to scan the entire VM just once. The defaults could
be changed, but ideally there should be no knobs like this in the final
version, as it should just work well under all conditions.
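(Back-of-envelope with the defaults in this patch: 512GB / 256MB scanned per
pass = 2048 passes, and at one pass per scan_period_min of 2000ms that is
2048 * 2s = 4096s, roughly 68 minutes to cover the address space once.)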
Also, if such a method is kept, would it be possible to base it on a fixed
number of pages successfully marked instead of an MB range? Reason I
bring it up is that we often can have VMs which are large in their
memory definition, but might not actually have a lot of pages faulted
in. We could be "scanning" sections of vma which are not even actually
present yet.
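Roughly the sort of thing I have in mind, as an untested sketch only -- it
assumes change_prot_numa() could be taught to return the number of ptes it
actually updated, and the sysctl name is made up:

	/* budget the scan in pages actually marked, not bytes walked */
	long pages = sysctl_balance_numa_scan_size_pages;

	for (; vma && pages > 0; vma = vma->vm_next) {
		if (!vma_migratable(vma))
			continue;
		offset = max(offset, vma->vm_start);
		/* assumed to return the number of ptes it updated */
		pages -= change_prot_numa(vma, offset, vma->vm_end);
		offset = vma->vm_end;
	}
	mm->numa_scan_offset = offset;

A real version would still want to walk each VMA in chunks so one huge,
fully populated VMA cannot blow far past the budget, but sparse regions
would no longer eat into it.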
> The per-task nature of the working set sampling functionality in this tree
> allows such constant rate, per task, execution-weight proportional sampling
> of the working set, with an adaptive sampling interval/frequency that
> goes from once per 2 seconds up to just once per 32 seconds. The current
> sampling volume is 256 MB per interval.
Once a new section is marked, is the previous section automatically
reverted? If not, I wonder if there's risk of building up a ton of
potential page faults?
> As tasks mature and converge their working set, so does the
> sampling rate slow down to just a trickle, 256 MB per 32
> seconds of CPU time executed.
>
> This, beyond being adaptive, also rate-limits rarely
> executing systems and does not over-sample on overloaded
> systems.
I am wondering if it would be better to shrink the scan period back to a
much smaller fixed value, and instead of picking 256MB ranges of memory
to mark completely, go back to using all of the address space, but mark
only every Nth page. N is adjusted each period to target a rolling
average of X faults per MB per execution time period. This per task N
would also be an interesting value to rank memory access frequency among
tasks and help prioritize scheduling decisions.
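To make that concrete, the pte walk would do something like this (pure
hand-waving sketch; p->numa_scan_stride is the per-task N, adjusted each
period, and does not exist today):

	for (addr = start; addr < end; addr += PAGE_SIZE, pte++) {
		if (!pte_present(*pte))
			continue;
		/* only mark every Nth present page */
		if ((nr++ % p->numa_scan_stride) != 0)
			continue;
		set_pte_at(mm, addr, pte, pte_mknuma(*pte));
	}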
-Andrew Theurer
>
> [ In AutoNUMA speak, this patch deals with the effective sampling
> rate of the 'hinting page fault'. AutoNUMA's scanning is
> currently rate-limited, but it is also fundamentally
> single-threaded, executing in the knuma_scand kernel thread,
> so the limit in AutoNUMA is global and does not scale up with
> the number of CPUs, nor does it scan tasks in an execution
> proportional manner.
>
> So the idea of rate-limiting the scanning was first implemented
> in the AutoNUMA tree via a global rate limit. This patch goes
> beyond that by implementing an execution rate proportional
> working set sampling rate that is not implemented via a single
> global scanning daemon. ]
>
> [ Dan Carpenter pointed out a possible NULL pointer dereference in the
> first version of this patch. ]
>
> Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> [ Wrote changelog and fixed bug. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/mm_types.h | 3 +++
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++----------
> kernel/sysctl.c | 7 ++++++
> 4 files changed, 59 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d82accb..b40f4ef 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -406,6 +406,9 @@ struct mm_struct {
> */
> unsigned long numa_next_scan;
>
> + /* Restart point for scanning and setting pte_numa */
> + unsigned long numa_scan_offset;
> +
> /* numa_scan_seq prevents two threads setting pte_numa */
> int numa_scan_seq;
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 241e4f7..6b8a14f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>
> extern unsigned int sysctl_balance_numa_scan_period_min;
> extern unsigned int sysctl_balance_numa_scan_period_max;
> +extern unsigned int sysctl_balance_numa_scan_size;
> extern unsigned int sysctl_balance_numa_settle_count;
>
> #ifdef CONFIG_SCHED_DEBUG
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9ea13e9..6df5620 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> #ifdef CONFIG_BALANCE_NUMA
> /*
> - * numa task sample period in ms: 5s
> + * numa task sample period in ms
> */
> -unsigned int sysctl_balance_numa_scan_period_min = 5000;
> -unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
> +unsigned int sysctl_balance_numa_scan_period_min = 2000;
> +unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
> +
> +/* Portion of address space to scan in MB */
> +unsigned int sysctl_balance_numa_scan_size = 256;
>
> static void task_numa_placement(struct task_struct *p)
> {
> @@ -822,6 +825,9 @@ void task_numa_work(struct callback_head *work)
> unsigned long migrate, next_scan, now = jiffies;
> struct task_struct *p = current;
> struct mm_struct *mm = p->mm;
> + struct vm_area_struct *vma;
> + unsigned long offset, end;
> + long length;
>
> WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>
> @@ -851,18 +857,47 @@ void task_numa_work(struct callback_head *work)
> if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
> return;
>
> - ACCESS_ONCE(mm->numa_scan_seq)++;
> - {
> - struct vm_area_struct *vma;
> + offset = mm->numa_scan_offset;
> + length = sysctl_balance_numa_scan_size;
> + length <<= 20;
>
> - down_read(&mm->mmap_sem);
> - for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - if (!vma_migratable(vma))
> - continue;
> - change_prot_numa(vma, vma->vm_start, vma->vm_end);
> - }
> - up_read(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, offset);
> + if (!vma) {
> + ACCESS_ONCE(mm->numa_scan_seq)++;
> + offset = 0;
> + vma = mm->mmap;
> + }
> + for (; vma && length > 0; vma = vma->vm_next) {
> + if (!vma_migratable(vma))
> + continue;
> +
> + /* Skip small VMAs. They are not likely to be of relevance */
> + if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
> + continue;
> +
> + offset = max(offset, vma->vm_start);
> + end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
> + length -= end - offset;
> +
> + change_prot_numa(vma, offset, end);
> +
> + offset = end;
> + }
> +
> + /*
> + * It is possible to reach the end of the VMA list but the last few VMAs are
> + * not guaranteed to be vma_migratable. If they are not, we would find the
> + * !migratable VMA on the next scan but not reset the scanner to the start
> + * so check it now.
> + */
> + if (!vma) {
> + ACCESS_ONCE(mm->numa_scan_seq)++;
> + offset = 0;
> + vma = mm->mmap;
> }
> + mm->numa_scan_offset = offset;
> + up_read(&mm->mmap_sem);
> }
>
> /*
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 1359f51..d191203 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> + {
> + .procname = "balance_numa_scan_size_mb",
> + .data = &sysctl_balance_numa_scan_size,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> #endif /* CONFIG_BALANCE_NUMA */
> #endif /* CONFIG_SCHED_DEBUG */
> {
> --
> 1.7.9.2
>
> --
* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
2012-11-14 17:24 [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Andrew Theurer
@ 2012-11-14 18:28 ` Mel Gorman
2012-11-14 19:39 ` Andrew Theurer
From: Mel Gorman @ 2012-11-14 18:28 UTC
To: Andrew Theurer
Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel
On Wed, Nov 14, 2012 at 11:24:42AM -0600, Andrew Theurer wrote:
>
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >
> > Note: The scan period is much larger than it was in the original patch.
> > The reason was because the system CPU usage went through the roof
> > with a sample period of 500ms but it was unsuitable to have a
> > situation where a large process could stall for excessively long
> > updating pte_numa. This may need to be tuned again if a placement
> > policy converges too slowly.
> >
> > Previously, to probe the working set of a task, we'd use
> > a very simple and crude method: mark all of its address
> > space PROT_NONE.
> >
> > That method has various (obvious) disadvantages:
> >
> > - it samples the working set at dissimilar rates,
> > giving some tasks a sampling quality advantage
> > over others.
> >
> > - creates performance problems for tasks with very
> > large working sets
> >
> > - over-samples processes with large address spaces but
> > which only very rarely execute
> >
> > Improve that method by keeping a rotating offset into the
> > address space that marks the current position of the scan,
> > and advance it by a constant rate (in a CPU cycles execution
> > proportional manner). If the offset reaches the last mapped
> > address of the mm then it starts over at the first
> > address.
>
> I believe we will have problems with this. For example, running a large
> KVM VM with 512GB memory, using the new defaults in this patch, and
> assuming we never go longer per scan than the scan_period_min, it would
> take over an hour to scan the entire VM just once. The defaults could
> be changed, but ideally there should be no knobs like this in the final
> version, as it should just work well under all conditions.
>
Good point. I'll switch to the old defaults. The system CPU usage will
be high but that has to be coped with anyway. Ideally the tunables would
go away but for now they are handy for debugging.
> Also, if such a method is kept, would it be possible to base it on a fixed
> number of pages successfully marked instead of an MB range?
I see a patch for that in the -tip tree. I'm still debating this with
myself. On the one hand, it'll update the PTEs faster. On the other
hand, the time spent scanning is now variable because it depends on the
number of PTE updates. It's no longer a constant in terms of scanning
although it would still be constant in terms of PTEs updated. Hmm..
> Reason I
> bring it up is that we often can have VMs which are large in their
> memory definition, but might not actually have a lot of pages faulted
> in. We could be "scanning" sections of vma which are not even actually
> present yet.
>
Ok, thanks for that. That would push me towards accepting it and being
ok with the variable amount of scanning.
> > The per-task nature of the working set sampling functionality in this tree
> > allows such constant rate, per task, execution-weight proportional sampling
> > of the working set, with an adaptive sampling interval/frequency that
> > goes from once per 2 seconds up to just once per 32 seconds. The current
> > sampling volume is 256 MB per interval.
>
> Once a new section is marked, is the previous section automatically
> reverted?
No.
> If not, I wonder if there's risk of building up a ton of
> potential page faults?
>
Yes, if the full address space is suddenly referenced.
> > As tasks mature and converge their working set, so does the
> > sampling rate slow down to just a trickle, 256 MB per 32
> > seconds of CPU time executed.
> >
> > This, beyond being adaptive, also rate-limits rarely
> > executing systems and does not over-sample on overloaded
> > systems.
>
> I am wondering if it would be better to shrink the scan period back to a
> much smaller fixed value,
I'll do that anyway.
> and instead of picking 256MB ranges of memory
> to mark completely, go back to using all of the address space, but mark
> only every Nth page.
It'll still be necessary to do the full walk and I wonder if we'd lose on
the larger number of PTE locks that will have to be taken to do a scan if
we are only updating every 128 pages for example. It could be very expensive.
> N is adjusted each period to target a rolling
> average of X faults per MB per execution time period. This per task N
> would also be an interesting value to rank memory access frequency among
> tasks and help prioritize scheduling decisions.
>
It's an interesting idea. I'll think on it more but my initial reaction
is that the cost could be really high.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
2012-11-14 18:28 ` Mel Gorman
@ 2012-11-14 19:39 ` Andrew Theurer
2012-11-15 10:27 ` Mel Gorman
From: Andrew Theurer @ 2012-11-14 19:39 UTC
To: Mel Gorman; +Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel
On Wed, 2012-11-14 at 18:28 +0000, Mel Gorman wrote:
> On Wed, Nov 14, 2012 at 11:24:42AM -0600, Andrew Theurer wrote:
> >
> > > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > >
> > > Note: The scan period is much larger than it was in the original patch.
> > > The reason was because the system CPU usage went through the roof
> > > with a sample period of 500ms but it was unsuitable to have a
> > > situation where a large process could stall for excessively long
> > > updating pte_numa. This may need to be tuned again if a placement
> > > policy converges too slowly.
> > >
> > > Previously, to probe the working set of a task, we'd use
> > > a very simple and crude method: mark all of its address
> > > space PROT_NONE.
> > >
> > > That method has various (obvious) disadvantages:
> > >
> > > - it samples the working set at dissimilar rates,
> > > giving some tasks a sampling quality advantage
> > > over others.
> > >
> > > - creates performance problems for tasks with very
> > > large working sets
> > >
> > > - over-samples processes with large address spaces but
> > > which only very rarely execute
> > >
> > > Improve that method by keeping a rotating offset into the
> > > address space that marks the current position of the scan,
> > > and advance it by a constant rate (in a CPU cycles execution
> > > proportional manner). If the offset reaches the last mapped
> > > address of the mm then it starts over at the first
> > > address.
> >
> > I believe we will have problems with this. For example, running a large
> > KVM VM with 512GB memory, using the new defaults in this patch, and
> > assuming we never go longer per scan than the scan_period_min, it would
> > take over an hour to scan the entire VM just once. The defaults could
> > be changed, but ideally there should be no knobs like this in the final
> > version, as it should just work well under all conditions.
> >
>
> Good point. I'll switch to the old defaults. The system CPU usage will
> be high but that has to be coped with anyway. Ideally the tunables would
> go away but for now they are handy for debugging.
>
> > Also, if such a method is kept, would it be possible to base it on a fixed
> > number of pages successfully marked instead of an MB range?
>
> I see a patch for that in the -tip tree. I'm still debating this with
> myself. On the one hand, it'll update the PTEs faster. On the other
> hand, the time spent scanning is now variable because it depends on the
> number of PTE updates. It's no longer a constant in terms of scanning
> although it would still be constant in terms of PTEs updated. Hmm..
>
> > Reason I
> > bring it up is that we often can have VMs which are large in their
> > memory definition, but might not actually have a lot of pages faulted
> > in. We could be "scanning" sections of vma which are not even actually
> > present yet.
> >
>
> Ok, thanks for that. That would push me towards accepting it and being
> ok with the variable amount of scanning.
>
> > > The per-task nature of the working set sampling functionality in this tree
> > > allows such constant rate, per task, execution-weight proportional sampling
> > > of the working set, with an adaptive sampling interval/frequency that
> > > goes from once per 2 seconds up to just once per 32 seconds. The current
> > > sampling volume is 256 MB per interval.
> >
> > Once a new section is marked, is the previous section automatically
> > reverted?
>
> No.
>
> > If not, I wonder if there's risk of building up a ton of
> > potential page faults?
> >
>
> Yes, if the full address space is suddenly referenced.
>
> > > As tasks mature and converge their working set, so does the
> > > sampling rate slow down to just a trickle, 256 MB per 32
> > > seconds of CPU time executed.
> > >
> > > This, beyond being adaptive, also rate-limits rarely
> > > executing systems and does not over-sample on overloaded
> > > systems.
> >
> > I am wondering if it would be better to shrink the scan period back to a
> > much smaller fixed value,
>
> I'll do that anyway.
>
> > and instead of picking 256MB ranges of memory
> > to mark completely, go back to using all of the address space, but mark
> > only every Nth page.
>
> It'll still be necessary to do the full walk and I wonder if we'd lose on
> the larger number of PTE locks that will have to be taken to do a scan if
> we are only updating every 128 pages for example. It could be very expensive.
Yes, good point. My other inclination was not doing a mass marking of
pages at all (except just one time at some point after task init) and
conditionally setting or clearing the prot_numa in the fault path itself
to control the fault rate. The problem I see is I am not sure how we
"back-off" the fault rate per page. You could choose to not leave the
page marked, but then you never get a fault on that page again, so
there's no good way to mark it again in the fault path for that page
unless you have the periodic marker. However, maybe a certain number of
pages are considered clustered together, and a fault from any page is
considered a fault for the cluster of pages. When handling the fault,
the number of pages which are marked in the cluster is varied to achieve
a target, reasonable fault rate. Might be able to treat page migrations
in clusters as well... I probably need to think about this a bit
more....
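Very roughly, and purely as a sketch (the cluster size, counters, target
rate and the mark_ptes_numa() helper below are all made up), the fault path
could do something like:

	/* treat CLUSTER_PAGES pages of address space as one unit (power of
	 * two assumed); on a hinting fault anywhere in it, decide how many
	 * of its pages to (re)mark based on how hot it has recently been */
	unsigned long cluster = addr & ~((CLUSTER_PAGES << PAGE_SHIFT) - 1);
	unsigned int nr_mark;

	nr_mark = clamp(target_faults / max(1U, recent_cluster_faults),
			1U, (unsigned int)CLUSTER_PAGES);
	mark_ptes_numa(vma, cluster, cluster + (nr_mark << PAGE_SHIFT));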
>
> > N is adjusted each period to target a rolling
> > average of X faults per MB per execution time period. This per task N
> > would also be an interesting value to rank memory access frequency among
> > tasks and help prioritize scheduling decisions.
> >
>
> It's an interesting idea. I'll think on it more but my initial reaction
> is that the cost could be really high.
-Andrew Theurer
* Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
2012-11-14 19:39 ` Andrew Theurer
@ 2012-11-15 10:27 ` Mel Gorman
From: Mel Gorman @ 2012-11-15 10:27 UTC
To: Andrew Theurer
Cc: a.p.zijlstra, riel, aarcange, lee.schermerhorn, linux-kernel
On Wed, Nov 14, 2012 at 01:39:53PM -0600, Andrew Theurer wrote:
> > > <SNIP>
> > >
> > > I am wondering if it would be better to shrink the scan period back to a
> > > much smaller fixed value,
> >
> > I'll do that anyway.
> >
> > > and instead of picking 256MB ranges of memory
> > > to mark completely, go back to using all of the address space, but mark
> > > only every Nth page.
> >
> > It'll still be necessary to do the full walk and I wonder if we'd lose on
> > the larger number of PTE locks that will have to be taken to do a scan if
> > we are only updating every 128 pages for example. It could be very expensive.
>
> Yes, good point. My other inclination was not doing a mass marking of
> pages at all (except just one time at some point after task init) and
> conditionally setting or clearing the prot_numa in the fault path itself
> to control the fault rate.
That's a bit of a catch-22. You need faults to control the scan rate
which determines the fault rate.
One thing that could be done is that the PTE scanning-and-updating is
rate limited if there is an excessive number of migrations due to NUMA
hinting faults within a given window. I've prototyped something along
these lines. The problem is that it'll disrupt the accuracy of the
statistics gathered by the hinting faults.
> The problem I see is I am not sure how we
> "back-off" the fault rate per page.
I went for a straight cutoff. If a node has migrated too much recently,
no PTEs are marked for update if the PTE points to a page on that node. I
know it's a big heavy hammer but it'll indicate if it's worthwhile.
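Conceptually it is no more than this (illustrative sketch, not the actual
prototype; the per-node window and counter fields are placeholders):

	static bool numa_mark_allowed(struct page *page)
	{
		pg_data_t *pgdat = NODE_DATA(page_to_nid(page));

		/* reset the per-node counter when the window expires */
		if (time_after(jiffies, pgdat->numa_throttle_next_window)) {
			pgdat->numa_throttle_next_window = jiffies + HZ;
			pgdat->numa_throttle_nr_migrated = 0;
		}

		/* node migrated too much recently? then skip marking */
		return pgdat->numa_throttle_nr_migrated <= numa_throttle_max;
	}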
> You could choose to not leave the
> page marked, but then you never get a fault on that page again, so
> there's no good way to mark it again in the fault path for that page
> unless you have the periodic marker.
In my case, the throttle window expires and it goes back to scanning at
the normal rate. I've changed the details of how the scanning rate
increases and decreases but how exactly is not that important right now.
> However, maybe a certain number of
> pages are considered clustered together, and a fault from any page is
> considered a fault for the cluster of pages. When handling the fault,
> the number of pages which are marked in the cluster is varied to achieve
> a target, reasonable fault rate. Might be able to treat page migrations
> in clusters as well... I probably need to think about this a bit
> more....
>
FWIW, I'm wary of putting too many smarts into how the scanning rates are
adapted. It'll be too specific to workloads and machine sizes.
--
Mel Gorman
SUSE Labs
* [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2
@ 2012-11-13 11:12 Mel Gorman
2012-11-13 11:12 ` [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate Mel Gorman
From: Mel Gorman @ 2012-11-13 11:12 UTC
To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman
(Since I wrote this changelog there has been another release of schednuma.
I had delayed releasing this series long enough and decided not to delay
further. Of course, I plan to dig into that new revision and see what
has changed.)
This is V2 of the series which attempts to layer parts of autonuma's
placement policy on top of the balancenuma foundation. Unfortunately a few
bugs were discovered very late in the foundation. This forced me to discard
all test results and a number of patches which I could no longer depend on
as a result of the bugs. I'll have to redo and resend later but decided to
send this series as-is as it had been delayed enough already. This series
is still very much a WIP but I wanted to show where things currently stand
in terms of pulling material from both schednuma and autonuma.
Changelog since V1
o Account for faults on the correct node after migration
o Do not account for THP splits as faults.
o Account THP faults on the node they occurred
o Ensure preferred_node_policy is initialised before use
o Mitigate double faults
o Add home-node logic
o Add some tlb-flush mitigation patches
o Add variation of CPU follows memory algorithm
o Add last_nid and use it as a two-stage filter before migrating pages
o Restart the PTE scanner when it reaches the end of the address space
o Lots of stuff I did not note properly
There are currently two competing approaches to implement support for
automatically migrating pages to optimise NUMA locality. Performance results
are available for both but review highlighted different problems in both.
They are not compatible with each other even though some fundamental
mechanics should have been the same.
This series addresses part of the integration and sharing problem
by implementing a foundation that either the policy for schednuma or
autonuma can be rebased on. The initial policy it implements is a very
basic greedy policy called "Migrate On Reference Of pte_numa Node (MORON)"
and is later replaced by a variation of the home-node policy and renamed.
I expect to build upon this revised policy and rename it to something
more sensible that reflects what it means.
Patches 1-3 move some vmstat counters so that migrated pages get accounted
for. In the past the primary user of migration was compaction but
if pages are to migrate for NUMA optimisation then the counters
need to be generally useful.
Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is used
to trigger faults later in the series. A placement policy is expected
to use these faults to determine if a page should migrate. On x86,
the bit is the same as _PAGE_PROTNONE but other architectures
may differ.
Patches 5-7 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
friends. They implement them for x86, handle GUP and preserve
the _PAGE_NUMA bit across THP splits.
Patch 8 creates the fault handler for p[te|md]_numa PTEs and just clears
them again.
Patches 9-11 add a migrate-on-fault mode that applications can specifically
ask for. Applications can take advantage of this if they wish. It
also means that if automatic balancing was broken for some workload,
the application could disable the automatic stuff but still
get some advantage.
Patch 12 adds migrate_misplaced_page which is responsible for migrating
a page to a new location.
Patch 13 migrates the page on fault if mpol_misplaced() says to do so.
Patch 14 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
On the next reference the memory should be migrated to the node that
references the memory.
Patch 15 sets pte_numa within the context of the scheduler.
Patch 16 avoids calling task_numa_placement if the page is not misplaced as later
in the series that becomes a very heavy function.
Patch 17 tries to avoid double faulting after migrating a page
Patches 18-19 note that the marking of pte_numa has a number of disadvantages and
instead incrementally update a limited range of the address space
each tick.
Patch 20 adds some vmstats that can be used to approximate the cost of the
scheduling policy in a more fine-grained fashion than looking at
the system CPU usage.
Patch 21 implements the MORON policy.
Patches 22-24 bring in some TLB flush reduction patches. It was pointed
out that try_to_unmap_one still incurs a TLB flush and this is true.
An initial patch to cover this looked promising but was suspected
of a stability issue. It was likely triggered by another corruption
bug that has since been fixed and needs to be revisited.
Patches 25-28 introduce the concept of a home-node that the scheduler tries
to keep processes on. It's advisory only and not particularly strict.
There may be a problem with this whereby the load balancer is not
pushing processes back to their home node because there are no
idle CPUs available. It might need to be more aggressive about
swapping two tasks that are both running off their home node.
Patch 29 implements a CPU follow memory policy. It builds statistics
on faults on a per-task and per-mm basis and decides if a task's
home node should be updated on that basis.
Patches 30-31 introduce last_nid and use it to build a two-stage filter
that delays when a page gets migrated to avoid a situation where
a task running temporarily off its home node forces a migration.
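To illustrate the two-stage filter in patches 30-31 (sketch only, not the
exact code in the series): a misplaced page is only migrated once it has
been referenced from the same remote node twice in a row.

	/* two-stage filter using the new last_nid field */
	int this_nid = numa_node_id();

	if (page->last_nid != this_nid) {
		/* first reference from this node: record it, don't migrate */
		page->last_nid = this_nid;
		return false;
	}
	return true;	/* second reference in a row: migrate the page */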
Some notes.
The MPOL_LAZY policy is still exposed to userspace. It has been asked that
this be dropped until the series has solidified. I'm happy to do this but kept
it in this release. If I hear no objections I'll drop it in the next release.
This is still missing a mechanism for disabling it from the command line.
Documentation is sorely missing at this point.
Although the results the observation is based on are unusable, one
interesting thing I noticed in the profiles is how the behaviour of
mutex_spin_on_owner(), which is ordinarily a sensible heuristic, changes. On autonumabench
NUMA01_THREADLOCAL, the patches spend more time spinning in there and more
time in intel_idle implying that other users are waiting for the pte_numa
updates to complete. In the autonumabenchmark cases, the other contender
could be khugepaged. In the specjbb case there is also a lot of spinning
and it could be due to the JVM calling mprotect(). One way or the other,
it needs to be pinned down if the pte_numa updates are the problem and
if so how we might work around the requirement to hold mmap_sem while the
pte_numa update takes place.
arch/sh/mm/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 65 ++-
arch/x86/include/asm/pgtable_types.h | 20 +
arch/x86/mm/gup.c | 13 +-
arch/x86/mm/pgtable.c | 8 +-
include/asm-generic/pgtable.h | 12 +
include/linux/huge_mm.h | 10 +
include/linux/init_task.h | 8 +
include/linux/mempolicy.h | 8 +
include/linux/migrate.h | 21 +-
include/linux/mm.h | 33 ++
include/linux/mm_types.h | 44 ++
include/linux/sched.h | 52 +++
include/linux/vm_event_item.h | 12 +-
include/trace/events/migrate.h | 51 +++
include/uapi/linux/mempolicy.h | 24 +-
init/Kconfig | 14 +
kernel/fork.c | 18 +
kernel/sched/core.c | 60 ++-
kernel/sched/debug.c | 3 +
kernel/sched/fair.c | 743 ++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 25 ++
kernel/sched/sched.h | 36 ++
kernel/sysctl.c | 38 +-
mm/compaction.c | 15 +-
mm/huge_memory.c | 53 +++
mm/memory-failure.c | 3 +-
mm/memory.c | 167 +++++++-
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 360 ++++++++++++++--
mm/migrate.c | 130 +++++-
mm/page_alloc.c | 5 +-
mm/pgtable-generic.c | 6 +-
mm/vmstat.c | 16 +-
34 files changed, 1985 insertions(+), 92 deletions(-)
create mode 100644 include/trace/events/migrate.h
--
1.7.9.2
* [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
2012-11-13 11:12 [RFC PATCH 00/31] Foundation for automatic NUMA balancing V2 Mel Gorman
@ 2012-11-13 11:12 ` Mel Gorman
From: Mel Gorman @ 2012-11-13 11:12 UTC
To: Peter Zijlstra, Andrea Arcangeli, Ingo Molnar
Cc: Rik van Riel, Johannes Weiner, Hugh Dickins, Thomas Gleixner,
Linus Torvalds, Andrew Morton, Linux-MM, LKML, Mel Gorman
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Note: The scan period is much larger than it was in the original patch.
The reason was because the system CPU usage went through the roof
with a sample period of 500ms but it was unsuitable to have a
situation where a large process could stall for excessively long
updating pte_numa. This may need to be tuned again if a placement
policy converges too slowly.
Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.
That method has various (obvious) disadvantages:
- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.
- creates performance problems for tasks with very
large working sets
- over-samples processes with large address spaces but
which only very rarely execute
Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm then it starts over at the first
address.
The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 2 seconds up to just once per 32 seconds. The current
sampling volume is 256 MB per interval.
As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 32
seconds of CPU time executed.
This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.
So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]
[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]
Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
include/linux/mm_types.h | 3 +++
include/linux/sched.h | 1 +
kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++----------
kernel/sysctl.c | 7 ++++++
4 files changed, 59 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
*/
unsigned long numa_next_scan;
+ /* Restart point for scanning and setting pte_numa */
+ unsigned long numa_scan_offset;
+
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 241e4f7..6b8a14f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
extern unsigned int sysctl_balance_numa_scan_period_min;
extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
extern unsigned int sysctl_balance_numa_settle_count;
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ea13e9..6df5620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
#ifdef CONFIG_BALANCE_NUMA
/*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
*/
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 2000;
+unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;
static void task_numa_placement(struct task_struct *p)
{
@@ -822,6 +825,9 @@ void task_numa_work(struct callback_head *work)
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -851,18 +857,47 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;
- ACCESS_ONCE(mm->numa_scan_seq)++;
- {
- struct vm_area_struct *vma;
+ offset = mm->numa_scan_offset;
+ length = sysctl_balance_numa_scan_size;
+ length <<= 20;
- down_read(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
- continue;
- change_prot_numa(vma, vma->vm_start, vma->vm_end);
- }
- up_read(&mm->mmap_sem);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ for (; vma && length > 0; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+
+ /* Skip small VMAs. They are not likely to be of relevance */
+ if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
+ continue;
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_numa(vma, offset, end);
+
+ offset = end;
+ }
+
+ /*
+ * It is possible to reach the end of the VMA list but the last few VMAs are
+ * not guaranteed to be vma_migratable. If they are not, we would find the
+ * !migratable VMA on the next scan but not reset the scanner to the start
+ * so check it now.
+ */
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
}
+ mm->numa_scan_offset = offset;
+ up_read(&mm->mmap_sem);
}
/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "balance_numa_scan_size_mb",
+ .data = &sysctl_balance_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_BALANCE_NUMA */
#endif /* CONFIG_SCHED_DEBUG */
{
--
1.7.9.2