* [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
@ 2010-02-18 22:29 Anton Blanchard
  2010-02-19  0:07 ` Anton Blanchard
  2010-02-19 15:43 ` Balbir Singh
  0 siblings, 2 replies; 14+ messages in thread

From: Anton Blanchard @ 2010-02-18 22:29 UTC (permalink / raw)
To: mel, benh, cl; +Cc: linuxppc-dev

I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:

	/*
	 * If another node is sufficiently far away then it is better
	 * to reclaim pages in a zone before going off node.
	 */
	if (distance > RECLAIM_DISTANCE)
		zone_reclaim_mode = 1;

Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.

The local to remote bandwidth ratios can be quite large on System p
machines so it makes sense for us to reclaim clean pagecache locally
before going off node.

The patch below sets a smaller value for RECLAIM_DISTANCE and thus
enables zone reclaim.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/include/asm/topology.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/topology.h	2010-02-18 14:26:45.736821967 +1100
+++ powerpc.git/arch/powerpc/include/asm/topology.h	2010-02-18 14:51:24.793071748 +1100
@@ -8,6 +8,16 @@ struct device_node;
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Before going off node we want the VM to try and reclaim from the local
+ * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
+ * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
+ * 20, we never reclaim and go off node straight away.
+ *
+ * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ */
+#define RECLAIM_DISTANCE 10
+
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)
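To make the arithmetic behind the patch concrete, here is a minimal model of the quoted decision (the function name is illustrative, not the kernel's): with the default REMOTE_DISTANCE and RECLAIM_DISTANCE both 20 the comparison can never be true, while the patched value of 10 enables reclaim for any remote node at the default distance.

```c
#include <stdbool.h>

/* Minimal model of the quoted check: zone reclaim is enabled only when
 * the inter-node distance exceeds RECLAIM_DISTANCE. Illustrative only;
 * the real kernel does this once at boot per node pair. */
static bool zone_reclaim_enabled(int distance, int reclaim_distance)
{
    return distance > reclaim_distance;
}
```

Here `zone_reclaim_enabled(20, 20)` is false (the unpatched ppc64 case), while `zone_reclaim_enabled(20, 10)` is true, which is the whole effect of the patch.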
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-18 22:29 [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim Anton Blanchard
@ 2010-02-19  0:07 ` Anton Blanchard
  2010-02-19 14:55   ` Mel Gorman
  2010-02-19 15:43 ` Balbir Singh
  1 sibling, 1 reply; 14+ messages in thread

From: Anton Blanchard @ 2010-02-19 0:07 UTC (permalink / raw)
To: mel, benh, cl; +Cc: linuxppc-dev

Hi,

> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.

FYI even with this enabled I could trip it up pretty easily with a multi
threaded application. I tried running stream across all threads in node 0.
The machine looks like:

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 free: 30254 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 free: 31832 MB

Now create some clean pagecache on node 0:

# taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
# sync

node 0 free: 12880 MB
node 1 free: 31830 MB

I built stream to use about 25GB of memory. I then ran stream across all
threads in node 0:

# OMP_NUM_THREADS=16 taskset -c 0-15 ./stream

We exhaust all memory on node 0, and start using memory on node 1:

node 0 free: 0 MB
node 1 free: 20795 MB

ie about 10GB of node 1. Now if we run the same test with one thread:

# OMP_NUM_THREADS=1 taskset -c 0 ./stream

things are much better:

node 0 free: 11 MB
node 1 free: 31552 MB

Interestingly enough it takes two goes to get completely onto node 0, even
with one thread. The second run looks like:

node 0 free: 14 MB
node 1 free: 31811 MB

I had a quick look at the page allocation logic and I think I understand why
we would have issues with multiple threads all trying to allocate at once.

- The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
  a time, and whatever thread is in zone reclaim probably only frees a small
  amount of memory. Certainly not enough to satisfy all 16 threads.

- We seem to end up racing between zone_watermark_ok, zone_reclaim and
  buffered_rmqueue. Since everyone is in here the memory one thread reclaims
  may be stolen by another thread.

I'm not sure if there is an easy way to fix this without penalising other
workloads though.

Anton
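The two problems above can be reduced to a toy arithmetic model (not kernel code; all names and figures are illustrative): with ZONE_RECLAIM_LOCKED serializing reclaim, one pass frees only SWAP_CLUSTER_MAX (32) pages per round, while every thread's allocation draws from the same free list, so most of the demand spills off node.

```c
/* Toy model of the serialized-reclaim problem: only the holder of
 * ZONE_RECLAIM_LOCKED reclaims, freeing `batch` pages per round, while
 * all threads allocate from the same pool. Whatever the batch cannot
 * cover is "stolen" demand that falls back to the remote node. */
static long offnode_pages(long nthreads, long pages_per_thread, long batch)
{
    long demand = nthreads * pages_per_thread; /* all threads allocate  */
    long freed  = batch;                       /* one reclaimer's pass  */

    return demand > freed ? demand - freed : 0;
}
```

With 16 threads each wanting 32 pages against a 32-page batch, 480 of the 512 requested pages per round must come from the remote node; a single thread, or a much larger batch, spills nothing, which matches the single-thread and large-batch results later in the thread.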
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19  0:07 ` Anton Blanchard
@ 2010-02-19 14:55   ` Mel Gorman
  2010-02-19 15:12     ` Christoph Lameter
  2010-02-23  1:55     ` Anton Blanchard
  0 siblings, 2 replies; 14+ messages in thread

From: Mel Gorman @ 2010-02-19 14:55 UTC (permalink / raw)
To: Anton Blanchard; +Cc: cl, linuxppc-dev

On Fri, Feb 19, 2010 at 11:07:30AM +1100, Anton Blanchard wrote:
>
> Hi,
>
> > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > zone reclaim.
>

I've no problem with the patch anyway.

> FYI even with this enabled I could trip it up pretty easily with a multi
> threaded application. I tried running stream across all threads in node 0.
> The machine looks like:
>
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 free: 30254 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 free: 31832 MB
>
> Now create some clean pagecache on node 0:
>
> # taskset -c 0 dd if=/dev/zero of=/tmp/bigfile bs=1G count=16
> # sync
>
> node 0 free: 12880 MB
> node 1 free: 31830 MB
>
> I built stream to use about 25GB of memory. I then ran stream across all
> threads in node 0:
>
> # OMP_NUM_THREADS=16 taskset -c 0-15 ./stream
>
> We exhaust all memory on node 0, and start using memory on node 1:
>
> node 0 free: 0 MB
> node 1 free: 20795 MB
>
> ie about 10GB of node 1. Now if we run the same test with one thread:
>
> # OMP_NUM_THREADS=1 taskset -c 0 ./stream
>
> things are much better:
>
> node 0 free: 11 MB
> node 1 free: 31552 MB
>
> Interestingly enough it takes two goes to get completely onto node 0, even
> with one thread. The second run looks like:
>
> node 0 free: 14 MB
> node 1 free: 31811 MB
>
> I had a quick look at the page allocation logic and I think I understand why
> we would have issues with multiple threads all trying to allocate at once.
>
> - The ZONE_RECLAIM_LOCKED flag allows only one thread into zone reclaim at
>   a time, and whatever thread is in zone reclaim probably only frees a small
>   amount of memory. Certainly not enough to satisfy all 16 threads.
>
> - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>   may be stolen by another thread.
>

You're pretty much on the button here. Only one thread at a time enters
zone_reclaim. The others back off and try the next zone in the zonelist
instead. I'm not sure what the original intention was but most likely it
was to prevent too many parallel reclaimers in the same zone potentially
dumping out way more data than necessary.

> I'm not sure if there is an easy way to fix this without penalising other
> workloads though.
>

You could experiment with waiting on the bit if the GFP flags allow it? The
expectation would be that the reclaim operation does not take long. Wait
on the bit, and if you are making forward progress, recheck the
watermarks before continuing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 14:55 ` Mel Gorman
@ 2010-02-19 15:12   ` Christoph Lameter
  2010-02-19 15:41     ` Balbir Singh
  1 sibling, 1 reply; 14+ messages in thread

From: Christoph Lameter @ 2010-02-19 15:12 UTC (permalink / raw)
To: Mel Gorman; +Cc: linuxppc-dev, Anton Blanchard

On Fri, 19 Feb 2010, Mel Gorman wrote:

> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> > > zone reclaim.
> >
>
> I've no problem with the patch anyway.

Nor do I.

> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and
> >   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
> >   may be stolen by another thread.
> >
>
> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.

Yes it was to prevent concurrency slowing down reclaim. At that time the
number of processors per NUMA node was 2 or so. The number of pages that
are reclaimed is limited to avoid tossing too many page cache pages.

> You could experiment with waiting on the bit if the GFP flags allow it? The
> expectation would be that the reclaim operation does not take long. Wait
> on the bit, and if you are making forward progress, recheck the
> watermarks before continuing.

You could reclaim more pages during a zone reclaim pass? Increase the
nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim
pass should reclaim enough local pages to keep the processors on a node
happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
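Christoph's suggestion of sizing the batch as a fraction of the zone could be sketched as follows (a hypothetical helper; the 1/16 divisor comes from his example and the names are illustrative, not kernel code):

```c
/* Hypothetical batch-size helper following Christoph's suggestion:
 * reclaim at least the pages the caller asked for, but aim for a
 * fraction (1/16th here) of the zone rather than the fixed 32-page
 * SWAP_CLUSTER_MAX. */
static unsigned long reclaim_batch(unsigned long nr_pages,
                                   unsigned long zone_pages)
{
    unsigned long frac = zone_pages / 16;

    return nr_pages > frac ? nr_pages : frac; /* i.e. a max_t() */
}
```

For a 4GB zone of 4KB pages (about one million pages) this asks for roughly 65,000 pages per pass instead of 32, which is the same direction as the 4096-page experiment Anton reports later in the thread, only scaled to the zone size.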
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:12 ` Christoph Lameter
@ 2010-02-19 15:41   ` Balbir Singh
  2010-02-19 15:51     ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread

From: Balbir Singh @ 2010-02-19 15:41 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, linuxppc-dev, Anton Blanchard

On Fri, Feb 19, 2010 at 8:42 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> On Fri, 19 Feb 2010, Mel Gorman wrote:
>
>> > > The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
>> > > zone reclaim.
>> >
>>
>> I've no problem with the patch anyway.
>
> Nor do I.
>
>> > - We seem to end up racing between zone_watermark_ok, zone_reclaim and
>> >   buffered_rmqueue. Since everyone is in here the memory one thread reclaims
>> >   may be stolen by another thread.
>> >
>>
>> You're pretty much on the button here. Only one thread at a time enters
>> zone_reclaim. The others back off and try the next zone in the zonelist
>> instead. I'm not sure what the original intention was but most likely it
>> was to prevent too many parallel reclaimers in the same zone potentially
>> dumping out way more data than necessary.
>
> Yes it was to prevent concurrency slowing down reclaim. At that time the
> number of processors per NUMA node was 2 or so. The number of pages that
> are reclaimed is limited to avoid tossing too many page cache pages.
>

That is interesting, I always thought it was to try and free page
cache first. For example with zone->min_unmapped_pages, if
zone_pagecache_reclaimable is greater than unmapped pages, we start
reclaiming the cached pages first. The min_unmapped_pages almost sounds
like the higher level watermark - or am I misreading the code.

Balbir Singh
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:41 ` Balbir Singh
@ 2010-02-19 15:51   ` Christoph Lameter
  2010-02-19 17:39     ` Balbir Singh
  0 siblings, 1 reply; 14+ messages in thread

From: Christoph Lameter @ 2010-02-19 15:51 UTC (permalink / raw)
To: Balbir Singh; +Cc: Mel Gorman, linuxppc-dev, Anton Blanchard

On Fri, 19 Feb 2010, Balbir Singh wrote:

> >> zone_reclaim. The others back off and try the next zone in the zonelist
> >> instead. I'm not sure what the original intention was but most likely it
> >> was to prevent too many parallel reclaimers in the same zone potentially
> >> dumping out way more data than necessary.
> >
> > Yes it was to prevent concurrency slowing down reclaim. At that time the
> > number of processors per NUMA node was 2 or so. The number of pages that
> > are reclaimed is limited to avoid tossing too many page cache pages.
> >
>
> That is interesting, I always thought it was to try and free page
> cache first. For example with zone->min_unmapped_pages, if
> zone_pagecache_reclaimable is greater than unmapped pages, we start
> reclaiming the cached pages first. The min_unmapped_pages almost sounds
> like the higher level watermark - or am I misreading the code.

Indeed the purpose is to free *old* page cache pages.

The min_unmapped_pages is to protect a minimum of the page cache pages /
fs metadata from zone reclaim so that ongoing file I/O is not impacted.
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:51 ` Christoph Lameter
@ 2010-02-19 17:39   ` Balbir Singh
  0 siblings, 0 replies; 14+ messages in thread

From: Balbir Singh @ 2010-02-19 17:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Mel Gorman, linuxppc-dev, Anton Blanchard

* Christoph Lameter <cl@linux-foundation.org> [2010-02-19 09:51:12]:

> On Fri, 19 Feb 2010, Balbir Singh wrote:
>
> > >> zone_reclaim. The others back off and try the next zone in the zonelist
> > >> instead. I'm not sure what the original intention was but most likely it
> > >> was to prevent too many parallel reclaimers in the same zone potentially
> > >> dumping out way more data than necessary.
> > >
> > > Yes it was to prevent concurrency slowing down reclaim. At that time the
> > > number of processors per NUMA node was 2 or so. The number of pages that
> > > are reclaimed is limited to avoid tossing too many page cache pages.
> > >
> >
> > That is interesting, I always thought it was to try and free page
> > cache first. For example with zone->min_unmapped_pages, if
> > zone_pagecache_reclaimable is greater than unmapped pages, we start
> > reclaiming the cached pages first. The min_unmapped_pages almost sounds
> > like the higher level watermark - or am I misreading the code.
>
> Indeed the purpose is to free *old* page cache pages.
>
> The min_unmapped_pages is to protect a minimum of the page cache pages /
> fs metadata from zone reclaim so that ongoing file I/O is not impacted.

Thanks for the explanation!

-- 
	Three Cheers,
	Balbir
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 14:55 ` Mel Gorman
  2010-02-19 15:12   ` Christoph Lameter
@ 2010-02-23  1:55   ` Anton Blanchard
  2010-02-23 16:23     ` Mel Gorman
                       ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread

From: Anton Blanchard @ 2010-02-23 1:55 UTC (permalink / raw)
To: Mel Gorman; +Cc: cl, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1755 bytes --]

Hi Mel,

> You're pretty much on the button here. Only one thread at a time enters
> zone_reclaim. The others back off and try the next zone in the zonelist
> instead. I'm not sure what the original intention was but most likely it
> was to prevent too many parallel reclaimers in the same zone potentially
> dumping out way more data than necessary.
>
> > I'm not sure if there is an easy way to fix this without penalising other
> > workloads though.
> >
>
> You could experiment with waiting on the bit if the GFP flags allow it? The
> expectation would be that the reclaim operation does not take long. Wait
> on the bit, and if you are making forward progress, recheck the
> watermarks before continuing.

Thanks to you and Christoph for some suggestions to try. Attached is a
chart showing the results of the following tests:

baseline.txt
The current ppc64 default of zone_reclaim_mode = 0. As expected we see
no change in remote node memory usage even after 10 iterations.

zone_reclaim_mode.txt
Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
but even after 10 runs of stream we have > 10% remote node memory usage.

reclaim_4096_pages.txt
Instead of reclaiming 32 pages at a time, we try for a much larger batch
of 4096. The slope is much steeper but it still takes around 6 iterations
to get almost all local node memory.

wait_on_busy_flag.txt
Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
we would need to check the GFP flags etc, but so far it looks the most
promising. We only get a few percent of remote node memory on the first
iteration and get all local node by the second.

Perhaps a combination of larger batch size and waiting on the busy
flag is the way to go?

Anton

[-- Attachment #2: stream_test:_percentage_off_node_memory.png --]
[-- Type: image/png, Size: 34767 bytes --]

[-- Attachment #3: reclaim_4096_pages.patch --]
[-- Type: text/x-diff, Size: 376 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,

[-- Attachment #4: wait_on_ZONE_RECLAIM_LOCKED.patch --]
[-- Type: text/x-diff, Size: 482 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-23  1:55 ` Anton Blanchard
@ 2010-02-23 16:23   ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread

From: Mel Gorman @ 2010-02-23 16:23 UTC (permalink / raw)
To: Anton Blanchard; +Cc: cl, linuxppc-dev

On Tue, Feb 23, 2010 at 12:55:51PM +1100, Anton Blanchard wrote:
>
> Hi Mel,
>

I'm afraid I'm on vacation at the moment. This mail is costing me shots
with penalties every minute it's open. It'll be early next week before I
can look at this closely. Sorry.

> > You're pretty much on the button here. Only one thread at a time enters
> > zone_reclaim. The others back off and try the next zone in the zonelist
> > instead. I'm not sure what the original intention was but most likely it
> > was to prevent too many parallel reclaimers in the same zone potentially
> > dumping out way more data than necessary.
> >
> > > I'm not sure if there is an easy way to fix this without penalising other
> > > workloads though.
> > >
> >
> > You could experiment with waiting on the bit if the GFP flags allow it? The
> > expectation would be that the reclaim operation does not take long. Wait
> > on the bit, and if you are making forward progress, recheck the
> > watermarks before continuing.
>
> Thanks to you and Christoph for some suggestions to try. Attached is a
> chart showing the results of the following tests:
>
> baseline.txt
> The current ppc64 default of zone_reclaim_mode = 0. As expected we see
> no change in remote node memory usage even after 10 iterations.
>
> zone_reclaim_mode.txt
> Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
> but even after 10 runs of stream we have > 10% remote node memory usage.
>
> reclaim_4096_pages.txt
> Instead of reclaiming 32 pages at a time, we try for a much larger batch
> of 4096. The slope is much steeper but it still takes around 6 iterations
> to get almost all local node memory.
>
> wait_on_busy_flag.txt
> Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
> we would need to check the GFP flags etc, but so far it looks the most
> promising. We only get a few percent of remote node memory on the first
> iteration and get all local node by the second.
>
> Perhaps a combination of larger batch size and waiting on the busy
> flag is the way to go?
>
> Anton

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
> @@ -2534,7 +2534,7 @@
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -				       SWAP_CLUSTER_MAX),
> +				       4096),
>  		.gfp_mask = gfp_mask,
>  		.swappiness = vm_swappiness,
>  		.order = order,

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
> @@ -2634,8 +2634,8 @@
>  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
>  		return ZONE_RECLAIM_NOSCAN;
>  
> -	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> -		return ZONE_RECLAIM_NOSCAN;
> +	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> +		cpu_relax();
>  
> 	ret = __zone_reclaim(zone, gfp_mask, order);
> 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-23  1:55 ` Anton Blanchard
  2010-02-23 16:23   ` Mel Gorman
@ 2010-02-24 15:43   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread

From: Christoph Lameter @ 2010-02-24 15:43 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Mel Gorman, linuxppc-dev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1244 bytes --]

On Tue, 23 Feb 2010, Anton Blanchard wrote:

> zone_reclaim_mode.txt
> Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
> but even after 10 runs of stream we have > 10% remote node memory usage.

The intent of zone reclaim was never to allocate all memory from one node.
You should not expect all memory to come from the node even if zone
reclaim works.

> reclaim_4096_pages.txt
> Instead of reclaiming 32 pages at a time, we try for a much larger batch
> of 4096. The slope is much steeper but it still takes around 6 iterations
> to get almost all local node memory.

"almost all"? How much do you want?

> wait_on_busy_flag.txt
> Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
> we would need to check the GFP flags etc, but so far it looks the most
> promising. We only get a few percent of remote node memory on the first
> iteration and get all local node by the second.

This would significantly impact performance. Zone reclaim should reclaim
with minimal overhead. If zone reclaim is running on another processor
then the OS already takes measures against the shortage of node local
memory. The right thing to do is to take what is currently available,
which may be off node memory.

[-- Attachment #2: Type: IMAGE/PNG, Size: 34767 bytes --]

[-- Attachment #3: Type: TEXT/X-DIFF, Size: 387 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,

[-- Attachment #4: Type: TEXT/X-DIFF, Size: 495 bytes --]

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-23  1:55 ` Anton Blanchard
  2010-02-23 16:23   ` Mel Gorman
  2010-02-24 15:43   ` Christoph Lameter
@ 2010-03-01 12:06   ` Mel Gorman
  2010-03-01 15:19     ` Christoph Lameter
  2 siblings, 1 reply; 14+ messages in thread

From: Mel Gorman @ 2010-03-01 12:06 UTC (permalink / raw)
To: Anton Blanchard; +Cc: cl, linuxppc-dev

On Tue, Feb 23, 2010 at 12:55:51PM +1100, Anton Blanchard wrote:
>
> Hi Mel,
>

I'm back but a bit vague. Am on painkillers for the bashing I gave myself
down the hills.

> > You're pretty much on the button here. Only one thread at a time enters
> > zone_reclaim. The others back off and try the next zone in the zonelist
> > instead. I'm not sure what the original intention was but most likely it
> > was to prevent too many parallel reclaimers in the same zone potentially
> > dumping out way more data than necessary.
> >
> > > I'm not sure if there is an easy way to fix this without penalising other
> > > workloads though.
> > >
> >
> > You could experiment with waiting on the bit if the GFP flags allow it? The
> > expectation would be that the reclaim operation does not take long. Wait
> > on the bit, and if you are making forward progress, recheck the
> > watermarks before continuing.
>
> Thanks to you and Christoph for some suggestions to try. Attached is a
> chart showing the results of the following tests:
>
> baseline.txt
> The current ppc64 default of zone_reclaim_mode = 0. As expected we see
> no change in remote node memory usage even after 10 iterations.
>
> zone_reclaim_mode.txt
> Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
> but even after 10 runs of stream we have > 10% remote node memory usage.
>

Ok, so how reasonable would it be to expect that the rate of "improvement"
is related to the ratio between "available free node memory at start -
how many pages the benchmark requires" and the number of pages zone_reclaim
reclaims on the local node? The exact rate of improvement is complicated
by multiple threads so it won't be exact.

> reclaim_4096_pages.txt
> Instead of reclaiming 32 pages at a time, we try for a much larger batch
> of 4096. The slope is much steeper but it still takes around 6 iterations
> to get almost all local node memory.
>
> wait_on_busy_flag.txt
> Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
> we would need to check the GFP flags etc, but so far it looks the most
> promising. We only get a few percent of remote node memory on the first
> iteration and get all local node by the second.
>

If the above expectation is reasonable, a better alternative may be to
adapt the number of pages reclaimed to the number of callers to
__zone_reclaim() and allow parallel reclaimers. e.g.

1 thread	128
2 threads	64
3 threads	32
4 threads	16
etc

The exact starting batch count needs more careful thinking than what I'm
giving it currently, and maybe the decay ratio too, to work out what the
worst-case scenario for dumping node-local memory is, but you get the idea.
The downside is that this requires a per-zone counter to count the number
of parallel reclaimers.

> Perhaps a combination of larger batch size and waiting on the busy
> flag is the way to go?
>

I think a static increase on the batch size runs three risks. The first
is of parallel reclaimers dumping too much of local memory, although it
could be mitigated by checking the watermarks after waiting on the bit
lock. The second is that the thread doing the reclaiming is penalised
with higher reclaim costs while other CPUs remain idle. The third is that
there could be latency snags with a thread spinning that would previously
have gone off-node. I'm not sure what the impact of the third risk is,
but it might be noticeable on latency-sensitive machines where the
off-node cost is not significant enough to justify a delay.

Christoph, how feasible would it be to allow parallel reclaimers in
__zone_reclaim() that back off at a rate depending on the number of
reclaimers?

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
> @@ -2534,7 +2534,7 @@
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -				       SWAP_CLUSTER_MAX),
> +				       4096),
>  		.gfp_mask = gfp_mask,
>  		.swappiness = vm_swappiness,
>  		.order = order,

> --- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
> +++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
> @@ -2634,8 +2634,8 @@
>  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
>  		return ZONE_RECLAIM_NOSCAN;
>  
> -	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> -		return ZONE_RECLAIM_NOSCAN;
> +	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> +		cpu_relax();
> 
> 	ret = __zone_reclaim(zone, gfp_mask, order);
> 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
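Mel's decay table reads as halving a base batch per additional concurrent reclaimer. A minimal sketch of that idea (the base of 128 and the halving decay are just his example figures, and the per-zone reclaimer count is assumed to exist elsewhere):

```c
/* Sketch of the adaptive batch Mel proposes: each concurrent reclaimer
 * gets half the batch of the previous one (128, 64, 32, 16, ...). In
 * practice `reclaimers` would come from a per-zone atomic counter, and
 * both the base and the decay would need tuning against the worst case
 * for dumping node-local memory. */
static unsigned long adaptive_batch(unsigned int reclaimers)
{
    if (reclaimers == 0)
        reclaimers = 1;
    if (reclaimers >= 8)    /* floor: never shrink the batch below one page */
        return 1;
    return 128UL >> (reclaimers - 1);
}
```

With this shape, total reclaim per round stays bounded: one reclaimer frees 128 pages, two free 128 between them, and so on, which is how parallel reclaim avoids the "dumping out way more data than necessary" problem the serialization was guarding against.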
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-03-01 12:06 ` Mel Gorman
@ 2010-03-01 15:19   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread

From: Christoph Lameter @ 2010-03-01 15:19 UTC (permalink / raw)
To: Mel Gorman; +Cc: linuxppc-dev, Anton Blanchard

On Mon, 1 Mar 2010, Mel Gorman wrote:

> Christoph, how feasible would it be to allow parallel reclaimers in
> __zone_reclaim() that back off at a rate depending on the number of
> reclaimers?

Not too hard. Zone locking is there but there may be a lot of bouncing
cachelines if you run it concurrently.
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-18 22:29 [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim Anton Blanchard
  2010-02-19  0:07 ` Anton Blanchard
@ 2010-02-19 15:43 ` Balbir Singh
  2010-02-23  1:38   ` Anton Blanchard
  1 sibling, 1 reply; 14+ messages in thread

From: Balbir Singh @ 2010-02-19 15:43 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mel, cl, linuxppc-dev

On Fri, Feb 19, 2010 at 3:59 AM, Anton Blanchard <anton@samba.org> wrote:
>
> I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
> enabled via this:
>
>        /*
>         * If another node is sufficiently far away then it is better
>         * to reclaim pages in a zone before going off node.
>         */
>        if (distance > RECLAIM_DISTANCE)
>                zone_reclaim_mode = 1;
>
> Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
> RECLAIM_DISTANCE it never kicks in.
>
> The local to remote bandwidth ratios can be quite large on System p
> machines so it makes sense for us to reclaim clean pagecache locally before
> going off node.
>
> The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
> zone reclaim.
>

A reclaim distance of 10 implies a ratio of 1, that means we'll always
do zone_reclaim() to free page cache and slab cache before moving on
to another node?

Balbir Singh.
* Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
  2010-02-19 15:43 ` Balbir Singh
@ 2010-02-23  1:38   ` Anton Blanchard
  0 siblings, 0 replies; 14+ messages in thread

From: Anton Blanchard @ 2010-02-23 1:38 UTC (permalink / raw)
To: Balbir Singh; +Cc: mel, cl, linuxppc-dev

Hi Balbir,

> A reclaim distance of 10 implies a ratio of 1, that means we'll always
> do zone_reclaim() to free page cache and slab cache before moving on
> to another node?

I want to make an effort to reclaim local pagecache before ever going off
node. As an example, a completely off node stream result is almost 3x
slower than on node on my test box.

Anton