* [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
@ 2010-02-18 22:29 Anton Blanchard
  2010-02-19  0:07 ` Anton Blanchard
  2010-02-19 15:43 ` Balbir Singh
  0 siblings, 2 replies; 14+ messages in thread

From: Anton Blanchard @ 2010-02-18 22:29 UTC (permalink / raw)
To: mel, benh, cl; +Cc: linuxppc-dev

I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:

	/*
	 * If another node is sufficiently far away then it is better
	 * to reclaim pages in a zone before going off node.
	 */
	if (distance > RECLAIM_DISTANCE)
		zone_reclaim_mode = 1;

Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.

The local to remote bandwidth ratios can be quite large on System p
machines so it makes sense for us to reclaim clean pagecache locally
before going off node.

The patch below sets a smaller value for RECLAIM_DISTANCE and thus
enables zone reclaim.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/include/asm/topology.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/topology.h	2010-02-18 14:26:45.736821967 +1100
+++ powerpc.git/arch/powerpc/include/asm/topology.h	2010-02-18 14:51:24.793071748 +1100
@@ -8,6 +8,16 @@ struct device_node;
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Before going off node we want the VM to try and reclaim from the local
+ * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
+ * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
+ * 20, we never reclaim and go off node straight away.
+ *
+ * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ */
+#define RECLAIM_DISTANCE 10
+
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)
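To make the arithmetic behind the patch concrete, here is a minimal model of the quoted decision (the function name is illustrative, not the kernel's): with the default REMOTE_DISTANCE and RECLAIM_DISTANCE both 20 the comparison can never be true, while the patched value of 10 enables reclaim for any remote node at the default distance.

```c
#include <stdbool.h>

/* Minimal model of the quoted check: zone reclaim is enabled only when
 * the inter-node distance exceeds RECLAIM_DISTANCE. Illustrative only;
 * the real kernel does this once at boot per node pair. */
static bool zone_reclaim_enabled(int distance, int reclaim_distance)
{
    return distance > reclaim_distance;
}
```

Here `zone_reclaim_enabled(20, 20)` is false (the unpatched ppc64 case), while `zone_reclaim_enabled(20, 10)` is true, which is the whole effect of the patch.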
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-18 22:29 [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim Anton Blanchard
@ 2010-02-19  0:07 ` Anton Blanchard
  2010-02-19 14:55   ` Mel Gorman
  2010-02-19 15:43 ` Balbir Singh
  1 sibling, 1 reply; 14+ messages in thread

From: Anton Blanchard @ 2010-02-19 0:07 UTC (permalink / raw)
To: mel, benh, cl; +Cc: linuxppc-dev

Hi,

> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.

FYI even with this enabled I could trip it up pretty easily with a multi
threaded application. I tried running stream across all threads in node 0.
The machine looks like:

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 free: 30254 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 free: 31832 MB

Now create some clean pagecache on node 0:

# taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
# sync

node 0 free: 12880 MB
node 1 free: 31830 MB

I built stream to use about 25GB of memory. I then ran stream across all
threads in node 0:

# OMP_NUM_THREADS=16 taskset -c 0-15 ./stream

We exhaust all memory on node 0, and start using memory on node 1:

node 0 free: 0 MB
node 1 free: 20795 MB

ie about 10GB of node 1. Now if we run the same test with one thread:

# OMP_NUM_THREADS=1 taskset -c 0 ./stream

things are much better:

node 0 free: 11 MB
node 1 free: 31552 MB

Interestingly enough it takes two goes to get completely onto node 0, even
with one thread. The second run looks like:

node 0 free: 14 MB
node 1 free: 31811 MB

I had a quick look at the page allocation logic and I think I understand why
we would have issues with multiple threads all trying to allocate at once.

- The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
  a time, and whatever thread is in zone reclaim probably only frees a small
  amount of memory. Certainly not enough to satisfy all 16 threads.

- We seem to end up racing between zone_watermark_ok, zone_reclaim and
  buffered_rmqueue. Since everyone is in here the memory one thread reclaims
  may be stolen by another thread.

I'm not sure if there is an easy way to fix this without penalising other
workloads though.

Anton
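The two problems above can be reduced to a toy arithmetic model (not kernel code; all names and figures are illustrative): with ZONE_RECLAIM_LOCKED serializing reclaim, one pass frees only SWAP_CLUSTER_MAX (32) pages per round, while every thread's allocation draws from the same free list, so most of the demand spills off node.

```c
/* Toy model of the serialized-reclaim problem: only the holder of
 * ZONE_RECLAIM_LOCKED reclaims, freeing `batch` pages per round, while
 * all threads allocate from the same pool. Whatever the batch cannot
 * cover is "stolen" demand that falls back to the remote node. */
static long offnode_pages(long nthreads, long pages_per_thread, long batch)
{
    long demand = nthreads * pages_per_thread; /* all threads allocate  */
    long freed  = batch;                       /* one reclaimer's pass  */

    return demand > freed ? demand - freed : 0;
}
```

With 16 threads each wanting 32 pages against a 32-page batch, 480 of the 512 requested pages per round must come from the remote node; a single thread, or a much larger batch, spills nothing, which matches the single-thread and large-batch results later in the thread.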
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19  0:07 ` Anton Blanchard
@ 2010-02-19 14:55   ` Mel Gorman
  2010-02-19 15:12     ` Christoph Lameter
  2010-02-23  1:55     ` Anton Blanchard
  0 siblings, 2 replies; 14+ messages in thread

From: Mel Gorman @ 2010-02-19 14:55 UTC (permalink / raw)
To: Anton Blanchard; +Cc: cl, linuxppc-dev

On Fri, Feb 19, 2010 at 11:07:30AM +1100, Anton Blanchard wrote:
>
> Hi,
>
> > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > zone reclaim.
>

I've no problem with the patch anyway.

> FYI even with this enabled I could trip it up pretty easily with a multi
> threaded application. I tried running stream across all threads in node 0.
> The machine looks like:
>
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 free: 30254 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 free: 31832 MB
>
> Now create some clean pagecache on node 0:
>
> # taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
> # sync
>
> node 0 free: 12880 MB
> node 1 free: 31830 MB
>
> I built stream to use about 25GB of memory. I then ran stream across all
> threads in node 0:
>
> # OMP_NUM_THREADS=16 taskset -c 0-15 ./stream
>
> We exhaust all memory on node 0, and start using memory on node 1:
>
> node 0 free: 0 MB
> node 1 free: 20795 MB
>
> ie about 10GB of node 1. Now if we run the same test with one thread:
>
> # OMP_NUM_THREADS=1 taskset -c 0 ./stream
>
> things are much better:
>
> node 0 free: 11 MB
> node 1 free: 31552 MB
>
> Interestingly enough it takes two goes to get completely onto node 0, even
> with one thread. The second run looks like:
>
> node 0 free: 14 MB
> node 1 free: 31811 MB
>
> I had a quick look at the page allocation logic and I think I understand why
> we would have issues with multiple threads all trying to allocate at once.
>
> - The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
>   a time, and whatever thread is in zone reclaim probably only frees a small
>   amount of memory. Certainly not enough to satisfy all 16 threads.
>
> - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>   may be stolen by another thread.
>

You're pretty much on the button here. Only one thread at a time enters
zone_reclaim. The others back off and try the next zone in the zonelist
instead. I'm not sure what the original intention was but most likely it
was to prevent too many parallel reclaimers in the same zone potentially
dumping out way more data than necessary.

> I'm not sure if there is an easy way to fix this without penalising other
> workloads though.
>

You could experiment with waiting on the bit if the GFP flags allow it? The
expectation would be that the reclaim operation does not take long. Wait
on the bit, and if you are making forward progress, recheck the
watermarks before continuing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 14:55 ` Mel Gorman
@ 2010-02-19 15:12   ` Christoph Lameter
  2010-02-19 15:41     ` Balbir Singh
  1 sibling, 1 reply; 14+ messages in thread

From: Christoph Lameter @ 2010-02-19 15:12 UTC (permalink / raw)
To: Mel Gorman; +Cc: linuxppc-dev, Anton Blanchard

On Fri, 19 Feb 2010, Mel Gorman wrote:

> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > > zone reclaim.
> >
>
> I've no problem with the patch anyway.

Nor do I.

> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and
> >   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
> >   may be stolen by another thread.
> >
>
> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.

Yes it was to prevent concurrency slowing down reclaim. At that time the
number of processors per NUMA node was 2 or so. The number of pages that
are reclaimed is limited to avoid tossing too many page cache pages.

> You could experiment with waiting on the bit if the GFP flags allow it? The
> expectation would be that the reclaim operation does not take long. Wait
> on the bit, and if you are making forward progress, recheck the
> watermarks before continuing.

You could reclaim more pages during a zone reclaim pass? Increase the
nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim
pass should reclaim enough local pages to keep the processors on a node
happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
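Christoph's suggestion of sizing the batch as a fraction of the zone could be sketched as follows (a hypothetical helper; the 1/16 divisor comes from his example and the names are illustrative, not kernel code):

```c
/* Hypothetical batch-size helper following Christoph's suggestion:
 * reclaim at least the pages the caller asked for, but aim for a
 * fraction (1/16th here) of the zone rather than the fixed 32-page
 * SWAP_CLUSTER_MAX. */
static unsigned long reclaim_batch(unsigned long nr_pages,
                                   unsigned long zone_pages)
{
    unsigned long frac = zone_pages / 16;

    return nr_pages > frac ? nr_pages : frac; /* i.e. a max_t() */
}
```

For a 4GB zone of 4KB pages (about one million pages) this asks for roughly 65,000 pages per pass instead of 32, which is the same direction as the 4096-page experiment Anton reports later in the thread, only scaled to the zone size.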
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:12 ` Christoph Lameter
@ 2010-02-19 15:41   ` Balbir Singh
  2010-02-19 15:51     ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread

From: Balbir Singh @ 2010-02-19 15:41 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, linuxppc-dev, Anton Blanchard

On Fri, Feb 19, 2010 at 8:42 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> On Fri, 19 Feb 2010, Mel Gorman wrote:
>
>> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
>> > > zone reclaim.
>> >
>>
>> I've no problem with the patch anyway.
>
> Nor do I.
>
>> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>> >   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>> >   may be stolen by another thread.
>> >
>>
>> You're pretty much on the button here. Only one thread at a time enters
>> zone_reclaim. The others back off and try the next zone in the zonelist
>> instead. I'm not sure what the original intention was but most likely it
>> was to prevent too many parallel reclaimers in the same zone potentially
>> dumping out way more data than necessary.
>
> Yes it was to prevent concurrency slowing down reclaim. At that time the
> number of processors per NUMA node was 2 or so. The number of pages that
> are reclaimed is limited to avoid tossing too many page cache pages.
>

That is interesting, I always thought it was to try and free page
cache first. For example with zone->min_unmapped_pages, if
zone_pagecache_reclaimable is greater than unmapped pages, we start
reclaiming the cached pages first. The min_unmapped_pages almost sounds
like the higher level watermark - or am I misreading the code.

Balbir Singh
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:41 ` Balbir Singh
@ 2010-02-19 15:51   ` Christoph Lameter
  2010-02-19 17:39     ` Balbir Singh
  0 siblings, 1 reply; 14+ messages in thread

From: Christoph Lameter @ 2010-02-19 15:51 UTC (permalink / raw)
To: Balbir Singh; +Cc: Mel Gorman, linuxppc-dev, Anton Blanchard

On Fri, 19 Feb 2010, Balbir Singh wrote:

> >> zone_reclaim. The others back off and try the next zone in the zonelist
> >> instead. I'm not sure what the original intention was but most likely it
> >> was to prevent too many parallel reclaimers in the same zone potentially
> >> dumping out way more data than necessary.
> >
> > Yes it was to prevent concurrency slowing down reclaim. At that time the
> > number of processors per NUMA node was 2 or so. The number of pages that
> > are reclaimed is limited to avoid tossing too many page cache pages.
> >
>
> That is interesting, I always thought it was to try and free page
> cache first. For example with zone->min_unmapped_pages, if
> zone_pagecache_reclaimable is greater than unmapped pages, we start
> reclaiming the cached pages first. The min_unmapped_pages almost sounds
> like the higher level watermark - or am I misreading the code.

Indeed the purpose is to free *old* page cache pages.

The min_unmapped_pages is to protect a minimum of the page cache pages /
fs metadata from zone reclaim so that ongoing file I/O is not impacted.
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:51 ` Christoph Lameter
@ 2010-02-19 17:39   ` Balbir Singh
  0 siblings, 0 replies; 14+ messages in thread

From: Balbir Singh @ 2010-02-19 17:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, linuxppc-dev, Anton Blanchard

* Christoph Lameter <cl@linux-foundation.org> [2010-02-19 09:51:12]:

> On Fri, 19 Feb 2010, Balbir Singh wrote:
>
> > >> zone_reclaim. The others back off and try the next zone in the zonelist
> > >> instead. I'm not sure what the original intention was but most likely it
> > >> was to prevent too many parallel reclaimers in the same zone potentially
> > >> dumping out way more data than necessary.
> > >
> > > Yes it was to prevent concurrency slowing down reclaim. At that time the
> > > number of processors per NUMA node was 2 or so. The number of pages that
> > > are reclaimed is limited to avoid tossing too many page cache pages.
> > >
> >
> > That is interesting, I always thought it was to try and free page
> > cache first. For example with zone->min_unmapped_pages, if
> > zone_pagecache_reclaimable is greater than unmapped pages, we start
> > reclaiming the cached pages first. The min_unmapped_pages almost sounds
> > like the higher level watermark - or am I misreading the code.
>
> Indeed the purpose is to free *old* page cache pages.
>
> The min_unmapped_pages is to protect a minimum of the page cache pages /
> fs metadata from zone reclaim so that ongoing file I/O is not impacted.

Thanks for the explanation!

-- 
	Three Cheers,
	Balbir
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 14:55 ` Mel Gorman
  2010-02-19 15:12   ` Christoph Lameter
@ 2010-02-23  1:55   ` Anton Blanchard
  2010-02-23 16:23     ` Mel Gorman
                       ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread

From: Anton Blanchard @ 2010-02-23 1:55 UTC (permalink / raw)
To: Mel Gorman; +Cc: cl, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1755 bytes --]

Hi Mel,

> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.
>
> > I'm not sure if there is an easy way to fix this without penalising other
> > workloads though.
> >
>
> You could experiment with waiting on the bit if the GFP flags allow it? The
> expectation would be that the reclaim operation does not take long. Wait
> on the bit, and if you are making forward progress, recheck the
> watermarks before continuing.

Thanks to you and Christoph for some suggestions to try. Attached is a
chart showing the results of the following tests:

baseline.txt
The current ppc64 default of zone_reclaim_mode = 0. As expected we see
no change in remote node memory usage even after 10 iterations.

zone_reclaim_mode.txt
Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
but even after 10 runs of stream we have > 10% remote node memory usage.

reclaim_4096_pages.txt
Instead of reclaiming 32 pages at a time, we try for a much larger batch
of 4096. The slope is much steeper but it still takes around 6 iterations
to get almost all local node memory.

wait_on_busy_flag.txt
Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
we would need to check the GFP flags etc, but so far it looks the most
promising. We only get a few percent of remote node memory on the first
iteration and get all local node by the second.

Perhaps a combination of larger batch size and waiting on the busy
flag is the way to go?

Anton

[-- Attachment #2: stream_test:_percentage_off_node_memory.png --]
[-- Type: image/png, Size: 34767 bytes --]

[-- Attachment #3: reclaim_4096_pages.patch --]
[-- Type: text/x-diff, Size: 376 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,

[-- Attachment #4: wait_on_ZONE_RECLAIM_LOCKED.patch --]
[-- Type: text/x-diff, Size: 482 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-23  1:55 ` Anton Blanchard
@ 2010-02-23 16:23   ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread

From: Mel Gorman @ 2010-02-23 16:23 UTC (permalink / raw)
To: Anton Blanchard; +Cc: cl, linuxppc-dev

On Tue, Feb 23, 2010 at 12:55:51PM +1100, Anton Blanchard wrote:
>
> Hi Mel,
>

I'm afraid I'm on vacation at the moment. This mail is costing me shots
with penalties every minute it's open. It'll be early next week before I
can look at this closely. Sorry.

> > You're pretty much on the button here. Only one thread at a time enters
> > zone_reclaim. The others back off and try the next zone in the zonelist
> > instead. I'm not sure what the original intention was but most likely it
> > was to prevent too many parallel reclaimers in the same zone potentially
> > dumping out way more data than necessary.
> >
> > > I'm not sure if there is an easy way to fix this without penalising other
> > > workloads though.
> > >
> >
> > You could experiment with waiting on the bit if the GFP flags allow it? The
> > expectation would be that the reclaim operation does not take long. Wait
> > on the bit, and if you are making forward progress, recheck the
> > watermarks before continuing.
>
> Thanks to you and Christoph for some suggestions to try. Attached is a
> chart showing the results of the following tests:
>
> baseline.txt
> The current ppc64 default of zone_reclaim_mode = 0. As expected we see
> no change in remote node memory usage even after 10 iterations.
>
> zone_reclaim_mode.txt
> Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
> but even after 10 runs of stream we have > 10% remote node memory usage.
>
> reclaim_4096_pages.txt
> Instead of reclaiming 32 pages at a time, we try for a much larger batch
> of 4096. The slope is much steeper but it still takes around 6 iterations
> to get almost all local node memory.
>
> wait_on_busy_flag.txt
> Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
> we would need to check the GFP flags etc, but so far it looks the most
> promising. We only get a few percent of remote node memory on the first
> iteration and get all local node by the second.
>
> Perhaps a combination of larger batch size and waiting on the busy
> flag is the way to go?
>
> Anton

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
> @@ -2534,7 +2534,7 @@
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -				       SWAP_CLUSTER_MAX),
> +				       4096),
>  		.gfp_mask = gfp_mask,
>  		.swappiness = vm_swappiness,
>  		.order = order,

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
> @@ -2634,8 +2634,8 @@
>  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
>  		return ZONE_RECLAIM_NOSCAN;
>  
> -	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> -		return ZONE_RECLAIM_NOSCAN;
> +	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> +		cpu_relax();
>  
> 	ret = __zone_reclaim(zone, gfp_mask, order);
> 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-23  1:55 ` Anton Blanchard
  2010-02-23 16:23   ` Mel Gorman
@ 2010-02-24 15:43   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread

From: Christoph Lameter @ 2010-02-24 15:43 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Mel Gorman, linuxppc-dev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1244 bytes --]

On Tue, 23 Feb 2010, Anton Blanchard wrote:

> zone_reclaim_mode.txt
> Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
> but even after 10 runs of stream we have > 10% remote node memory usage.

The intent of zone reclaim was never to allocate all memory from one node.
You should not expect all memory to come from the node even if zone
reclaim works.

> reclaim_4096_pages.txt
> Instead of reclaiming 32 pages at a time, we try for a much larger batch
> of 4096. The slope is much steeper but it still takes around 6 iterations
> to get almost all local node memory.

"almost all"? How much do you want?

> wait_on_busy_flag.txt
> Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
> we would need to check the GFP flags etc, but so far it looks the most
> promising. We only get a few percent of remote node memory on the first
> iteration and get all local node by the second.

This would significantly impact performance. Zone reclaim should reclaim
with minimal overhead. If zone reclaim is running on another processor
then the OS already takes measures against the shortage of node local
memory. The right thing to do is to take what is currently available,
which may be off node memory.

[-- Attachment #2: Type: IMAGE/PNG, Size: 34767 bytes --]

[-- Attachment #3: Type: TEXT/X-DIFF, Size: 387 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,

[-- Attachment #4: Type: TEXT/X-DIFF, Size: 495 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-23  1:55 ` Anton Blanchard
  2010-02-23 16:23   ` Mel Gorman
  2010-02-24 15:43   ` Christoph Lameter
@ 2010-03-01 12:06   ` Mel Gorman
  2010-03-01 15:19     ` Christoph Lameter
  2 siblings, 1 reply; 14+ messages in thread

From: Mel Gorman @ 2010-03-01 12:06 UTC (permalink / raw)
To: Anton Blanchard; +Cc: cl, linuxppc-dev

On Tue, Feb 23, 2010 at 12:55:51PM +1100, Anton Blanchard wrote:
>
> Hi Mel,
>

I'm back but a bit vague. Am on painkillers for the bashing I gave myself
down the hills.

> > You're pretty much on the button here. Only one thread at a time enters
> > zone_reclaim. The others back off and try the next zone in the zonelist
> > instead. I'm not sure what the original intention was but most likely it
> > was to prevent too many parallel reclaimers in the same zone potentially
> > dumping out way more data than necessary.
> >
> > > I'm not sure if there is an easy way to fix this without penalising other
> > > workloads though.
> > >
> >
> > You could experiment with waiting on the bit if the GFP flags allow it? The
> > expectation would be that the reclaim operation does not take long. Wait
> > on the bit, and if you are making forward progress, recheck the
> > watermarks before continuing.
>
> Thanks to you and Christoph for some suggestions to try. Attached is a
> chart showing the results of the following tests:
>
> baseline.txt
> The current ppc64 default of zone_reclaim_mode = 0. As expected we see
> no change in remote node memory usage even after 10 iterations.
>
> zone_reclaim_mode.txt
> Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
> but even after 10 runs of stream we have > 10% remote node memory usage.
>

Ok, so how reasonable would it be to expect that the rate of "improvement"
is related to the ratio between "available free node memory at start -
how many pages the benchmark requires" and the number of pages zone_reclaim
reclaims on the local node? The exact rate of improvement is complicated
by multiple threads so it won't be exact.

> reclaim_4096_pages.txt
> Instead of reclaiming 32 pages at a time, we try for a much larger batch
> of 4096. The slope is much steeper but it still takes around 6 iterations
> to get almost all local node memory.
>
> wait_on_busy_flag.txt
> Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
> we would need to check the GFP flags etc, but so far it looks the most
> promising. We only get a few percent of remote node memory on the first
> iteration and get all local node by the second.
>

If the above expectation is reasonable, a better alternative may be to
adapt the number of pages reclaimed to the number of callers to
__zone_reclaim() and allow parallel reclaimers. e.g.

1 thread	128
2 threads	64
3 threads	32
4 threads	16
etc

The exact starting batch count needs more careful thinking than what I'm
giving it currently, and maybe the decay ratio too, to work out what the
worst-case scenario for dumping node-local memory is, but you get the idea.
The downside is that this requires a per-zone counter to count the number
of parallel reclaimers.

> Perhaps a combination of larger batch size and waiting on the busy
> flag is the way to go?
>

I think a static increase on the batch size runs three risks. The first
is of parallel reclaimers dumping too much of local memory, although it
could be mitigated by checking the watermarks after waiting on the bit
lock. The second is that the thread doing the reclaiming is penalised
with higher reclaim costs while other CPUs remain idle. The third is that
there could be latency snags with a thread spinning that would previously
have gone off-node. I'm not sure what the impact of the third risk is,
but it might be noticeable on latency-sensitive machines where the
off-node cost is not significant enough to justify a delay.

Christoph, how feasible would it be to allow parallel reclaimers in
__zone_reclaim() that back off at a rate depending on the number of
reclaimers?

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
> @@ -2534,7 +2534,7 @@
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -				       SWAP_CLUSTER_MAX),
> +				       4096),
>  		.gfp_mask = gfp_mask,
>  		.swappiness = vm_swappiness,
>  		.order = order,

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
> @@ -2634,8 +2634,8 @@
>  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
>  		return ZONE_RECLAIM_NOSCAN;
>  
> -	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> -		return ZONE_RECLAIM_NOSCAN;
> +	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> +		cpu_relax();
> 
> 	ret = __zone_reclaim(zone, gfp_mask, order);
> 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
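Mel's decay table reads as halving a base batch per additional concurrent reclaimer. A minimal sketch of that idea (the base of 128 and the halving decay are just his example figures, and the per-zone reclaimer count is assumed to exist elsewhere):

```c
/* Sketch of the adaptive batch Mel proposes: each concurrent reclaimer
 * gets half the batch of the previous one (128, 64, 32, 16, ...). In
 * practice `reclaimers` would come from a per-zone atomic counter, and
 * both the base and the decay would need tuning against the worst case
 * for dumping node-local memory. */
static unsigned long adaptive_batch(unsigned int reclaimers)
{
    if (reclaimers == 0)
        reclaimers = 1;
    if (reclaimers >= 8)    /* floor: never shrink the batch below one page */
        return 1;
    return 128UL >> (reclaimers - 1);
}
```

With this shape, total reclaim per round stays bounded: one reclaimer frees 128 pages, two free 128 between them, and so on, which is how parallel reclaim avoids the "dumping out way more data than necessary" problem the serialization was guarding against.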
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-03-01 12:06 ` Mel Gorman
@ 2010-03-01 15:19   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread

From: Christoph Lameter @ 2010-03-01 15:19 UTC (permalink / raw)
To: Mel Gorman; +Cc: linuxppc-dev, Anton Blanchard

On Mon, 1 Mar 2010, Mel Gorman wrote:

> Christoph, how feasible would it be to allow parallel reclaimers in
> __zone_reclaim() that back off at a rate depending on the number of
> reclaimers?

Not too hard. Zone locking is there but there may be a lot of bouncing
cachelines if you run it concurrently.
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-18 22:29 [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim Anton Blanchard
  2010-02-19  0:07 ` Anton Blanchard
@ 2010-02-19 15:43 ` Balbir Singh
  2010-02-23  1:38   ` Anton Blanchard
  1 sibling, 1 reply; 14+ messages in thread

From: Balbir Singh @ 2010-02-19 15:43 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mel, cl, linuxppc-dev

On Fri, Feb 19, 2010 at 3:59 AM, Anton Blanchard <anton@samba.org> wrote:
>
> I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
> enabled via this:
>
>        /*
>         * If another node is sufficiently far away then it is better
>         * to reclaim pages in a zone before going off node.
>         */
>        if (distance > RECLAIM_DISTANCE)
>                zone_reclaim_mode = 1;
>
> Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
> RECLAIM_DISTANCE it never kicks in.
>
> The local to remote bandwidth ratios can be quite large on System p
> machines so it makes sense for us to reclaim clean pagecache locally before
> going off node.
>
> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.
>

A reclaim distance of 10 implies a ratio of 1, that means we'll always
do zone_reclaim() to free page cache and slab cache before moving on
to another node?

Balbir Singh.
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:43 ` Balbir Singh
@ 2010-02-23  1:38   ` Anton Blanchard
  0 siblings, 0 replies; 14+ messages in thread

From: Anton Blanchard @ 2010-02-23 1:38 UTC (permalink / raw)
To: Balbir Singh; +Cc: mel, cl, linuxppc-dev

Hi Balbir,

> A reclaim distance of 10 implies a ratio of 1, that means we'll always
> do zone_reclaim() to free page cache and slab cache before moving on
> to another node?

I want to make an effort to reclaim local pagecache before ever going off
node. As an example, a completely off node stream result is almost 3x
slower than on node on my test box.

Anton