* [PATCH 0/3] better zone and watermark balancing
@ 2005-11-01 5:18 Nick Piggin
2005-11-01 5:19 ` [PATCH 1/3] vm: kswapd incmin Nick Piggin
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2005-11-01 5:18 UTC (permalink / raw)
To: linux-kernel
I have had this patchset around for a long time; it improves zone and
watermark balancing by making the calculations more logical.
When reading 128GB through the pagecache, in 4 concurrent
streams, the final page residency and total reclaim ratios
look like this (no highmem, ~900MB RAM):
2.6.14-git3
DMA pages= 2214, scan= 124146
NRM pages=215966, scan=3990129
Pages Scan
DMA 01.01 03.01
NRM 98.99 96.99
2.6.14-git3-vm
DMA pages= 2220, scan= 99264
NRM pages=216373, scan=4011975
Pages Scan
DMA 01.01 02.41
NRM 98.99 97.59
So in this case, DMA is still getting a beating, but things have
improved nicely. Here are the results with highmem and ~4GB RAM:
2.6.14-git3
DMA pages=0, scan=0
NRM pages=177241, scan=1607991
HIG pages=817122, scan=1607166
Pages Scan
DMA 00.00 00.00
NRM 17.83 50.01
HIG 82.17 49.99
2.6.14-git3-vm
DMA pages=0, scan=0
NRM pages=178215, scan=553311
HIG pages=815771, scan=2757744
Pages Scan
DMA 00.00 00.00
NRM 17.92 16.71
HIG 82.07 83.28
Current kernels are abysmal, while the patches bring scanning to
an almost perfect ratio.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* [PATCH 1/3] vm: kswapd incmin
  2005-11-01  5:18 [PATCH 0/3] better zone and watermark balancing Nick Piggin
@ 2005-11-01  5:19 ` Nick Piggin
  2005-11-01  5:20   ` [PATCH 2/3] vm: highmem watermarks Nick Piggin
  2005-11-07 15:28   ` [PATCH 1/3] vm: kswapd incmin Marcelo Tosatti
  0 siblings, 2 replies; 10+ messages in thread
From: Nick Piggin @ 2005-11-01  5:19 UTC (permalink / raw)
To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

1/3

--
SUSE Labs, Novell Inc.

[-- Attachment #2: vm-kswapd-incmin.patch --]
[-- Type: text/plain, Size: 4705 bytes --]

Explicitly teach kswapd about the incremental min logic instead of just
scanning all zones under the first low zone. This should keep more even
pressure applied on the zones.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2005-11-01 13:42:33.000000000 +1100
+++ linux-2.6/mm/vmscan.c	2005-11-01 14:27:16.000000000 +1100
@@ -1051,97 +1051,63 @@ loop_again:
 	}

 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
+		int first_low_zone = 0;

 		all_zones_ok = 1;
+		sc.nr_scanned = 0;
+		sc.nr_reclaimed = 0;
+		sc.priority = priority;
+		sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;

-		if (nr_pages == 0) {
-			/*
-			 * Scan in the highmem->dma direction for the highest
-			 * zone which needs scanning
-			 */
-			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
-				struct zone *zone = pgdat->node_zones + i;
+		/* Scan in the highmem->dma direction */
+		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+			struct zone *zone = pgdat->node_zones + i;

-				if (zone->present_pages == 0)
-					continue;
+			if (zone->present_pages == 0)
+				continue;

-				if (zone->all_unreclaimable &&
-						priority != DEF_PRIORITY)
+			if (nr_pages == 0) { /* Not software suspend */
+				if (zone_watermark_ok(zone, order,
+					zone->pages_high, first_low_zone, 0, 0))
 					continue;

-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, 0, 0, 0)) {
-					end_zone = i;
-					goto scan;
-				}
+				all_zones_ok = 0;
+				if (first_low_zone < i)
+					first_low_zone = i;
 			}
-			goto out;
-		} else {
-			end_zone = pgdat->nr_zones - 1;
-		}
-scan:
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			lru_pages += zone->nr_active + zone->nr_inactive;
-		}
-
-		/*
-		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
-		 */
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
-
-			if (zone->present_pages == 0)
-				continue;

 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;

-			if (nr_pages == 0) {	/* Not software suspend */
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, end_zone, 0, 0))
-					all_zones_ok = 0;
-			}
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
-			sc.nr_scanned = 0;
-			sc.nr_reclaimed = 0;
-			sc.priority = priority;
-			sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
+			lru_pages += zone->nr_active + zone->nr_inactive;
+
 			atomic_inc(&zone->reclaim_in_progress);
 			shrink_zone(zone, &sc);
 			atomic_dec(&zone->reclaim_in_progress);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_reclaimed += sc.nr_reclaimed;
-			total_scanned += sc.nr_scanned;
-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 && zone->pages_scanned >=
+
+			if (zone->pages_scanned >=
 				    (zone->nr_active + zone->nr_inactive) * 4)
 				zone->all_unreclaimable = 1;
-			/*
-			 * If we've done a decent amount of scanning and
-			 * the reclaim ratio is low, start doing writepage
-			 * even in laptop mode
-			 */
-			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
-			    total_scanned > total_reclaimed+total_reclaimed/2)
-				sc.may_writepage = 1;
 		}
+		reclaim_state->reclaimed_slab = 0;
+		shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+		total_reclaimed += sc.nr_reclaimed;
+		total_scanned += sc.nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > total_reclaimed+total_reclaimed/2)
+			sc.may_writepage = 1;
+
 		if (nr_pages && to_free > total_reclaimed)
 			continue;	/* swsusp: need to do more work */
 		if (all_zones_ok)
@@ -1162,7 +1128,6 @@ scan:
 		if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
 			break;
 	}
-out:
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
* [PATCH 2/3] vm: highmem watermarks
  2005-11-01  5:19 ` [PATCH 1/3] vm: kswapd incmin Nick Piggin
@ 2005-11-01  5:20 ` Nick Piggin
  2005-11-01  5:21   ` [PATCH 3/3] vm: writeout watermarks Nick Piggin
  2005-11-07 15:28   ` [PATCH 1/3] vm: kswapd incmin Marcelo Tosatti
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2005-11-01  5:20 UTC (permalink / raw)
To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

2/3

--
SUSE Labs, Novell Inc.

[-- Attachment #2: vm-highmem-watermarks.patch --]
[-- Type: text/plain, Size: 2366 bytes --]

The pages_high - pages_low and pages_low - pages_min deltas are the asynch
reclaim watermarks. As such, they should be in the same ratios for highmem
zones as for any other zone. It is the pages_min - 0 delta which is the
PF_MEMALLOC reserve, and this is the region that isn't very useful for
highmem.

This patch ensures highmem systems have similar characteristics as non
highmem ones with the same amount of memory, and also that highmem zones
get similar reclaim pressures to other zones.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2005-11-01 13:42:35.000000000 +1100
+++ linux-2.6/mm/page_alloc.c	2005-11-01 14:29:07.000000000 +1100
@@ -2374,13 +2374,18 @@ static void setup_per_zone_pages_min(voi
 	}

 	for_each_zone(zone) {
+		unsigned long tmp;
 		spin_lock_irqsave(&zone->lru_lock, flags);
+		tmp = (pages_min * zone->present_pages) / lowmem_pages;
 		if (is_highmem(zone)) {
 			/*
-			 * Often, highmem doesn't need to reserve any pages.
-			 * But the pages_min/low/high values are also used for
-			 * batching up page reclaim activity so we need a
-			 * decent value here.
+			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
+			 * need highmem pages, so cap pages_min to a small
+			 * value here.
+			 *
+			 * The (pages_high-pages_low) and (pages_low-pages_min)
+			 * deltas controls asynch page reclaim, and so should
+			 * not be capped for highmem.
 			 */
 			int min_pages;

@@ -2391,19 +2396,15 @@ static void setup_per_zone_pages_min(voi
 				min_pages = 128;
 			zone->pages_min = min_pages;
 		} else {
-			/* if it's a lowmem zone, reserve a number of pages
+			/*
+			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->pages_min = (pages_min * zone->present_pages) /
-					   lowmem_pages;
+			zone->pages_min = tmp;
 		}

-		/*
-		 * When interpreting these watermarks, just keep in mind that:
-		 * zone->pages_min == (zone->pages_min * 4) / 4;
-		 */
-		zone->pages_low   = (zone->pages_min * 5) / 4;
-		zone->pages_high  = (zone->pages_min * 6) / 4;
+		zone->pages_low   = zone->pages_min + tmp / 4;
+		zone->pages_high  = zone->pages_min + tmp / 2;
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
 }
* [PATCH 3/3] vm: writeout watermarks
  2005-11-01  5:20 ` [PATCH 2/3] vm: highmem watermarks Nick Piggin
@ 2005-11-01  5:21 ` Nick Piggin
  2005-11-07 15:33   ` Marcelo Tosatti
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2005-11-01  5:21 UTC (permalink / raw)
To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

3/3

--
SUSE Labs, Novell Inc.

[-- Attachment #2: vm-tune-writeout.patch --]
[-- Type: text/plain, Size: 1110 bytes --]

Slightly change the writeout watermark calculations so we keep background
and synchronous writeout watermarks in the same ratios after adjusting them.
This ensures we should always attempt to start background writeout before
synchronous writeout.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2005-11-01 13:41:39.000000000 +1100
+++ linux-2.6/mm/page-writeback.c	2005-11-01 14:29:27.000000000 +1100
@@ -165,9 +165,11 @@ get_dirty_limits(struct writeback_state
 	if (dirty_ratio < 5)
 		dirty_ratio = 5;

-	background_ratio = dirty_background_ratio;
-	if (background_ratio >= dirty_ratio)
-		background_ratio = dirty_ratio / 2;
+	/*
+	 * Keep the ratio between dirty_ratio and background_ratio roughly
+	 * what the sysctls are after dirty_ratio has been scaled (above).
+	 */
+	background_ratio = dirty_background_ratio * dirty_ratio/vm_dirty_ratio;

 	background = (background_ratio * available_memory) / 100;
 	dirty = (dirty_ratio * available_memory) / 100;
* Re: [PATCH 3/3] vm: writeout watermarks
  2005-11-01  5:21 ` [PATCH 3/3] vm: writeout watermarks Nick Piggin
@ 2005-11-07 15:33 ` Marcelo Tosatti
  2005-11-07 21:13   ` Nikita Danilov
  2005-11-07 23:12   ` Nick Piggin
  0 siblings, 2 replies; 10+ messages in thread
From: Marcelo Tosatti @ 2005-11-07 15:33 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, Nikita Danilov

Nikita has a customer using a large percentage of RAM for a kernel
module, which results in get_dirty_limits() misbehaviour, since

	unsigned long available_memory = total_pages;

It should work on the amount of cacheable pages instead.

He's got a patch but I don't remember the URL. Nikita?

On Tue, Nov 01, 2005 at 04:21:15PM +1100, Nick Piggin wrote:
> 3/3
>
> --
> SUSE Labs, Novell Inc.
>
> Slightly change the writeout watermark calculations so we keep background
> and synchronous writeout watermarks in the same ratios after adjusting them.
> This ensures we should always attempt to start background writeout before
> synchronous writeout.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
>
>
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c	2005-11-01 13:41:39.000000000 +1100
> +++ linux-2.6/mm/page-writeback.c	2005-11-01 14:29:27.000000000 +1100
> @@ -165,9 +165,11 @@ get_dirty_limits(struct writeback_state
>  	if (dirty_ratio < 5)
>  		dirty_ratio = 5;
>
> -	background_ratio = dirty_background_ratio;
> -	if (background_ratio >= dirty_ratio)
> -		background_ratio = dirty_ratio / 2;
> +	/*
> +	 * Keep the ratio between dirty_ratio and background_ratio roughly
> +	 * what the sysctls are after dirty_ratio has been scaled (above).
> +	 */
> +	background_ratio = dirty_background_ratio * dirty_ratio/vm_dirty_ratio;
>
>  	background = (background_ratio * available_memory) / 100;
>  	dirty = (dirty_ratio * available_memory) / 100;
* Re: [PATCH 3/3] vm: writeout watermarks
  2005-11-07 15:33 ` Marcelo Tosatti
@ 2005-11-07 21:13 ` Nikita Danilov
  2005-11-07 23:12 ` Nick Piggin
  1 sibling, 0 replies; 10+ messages in thread
From: Nikita Danilov @ 2005-11-07 21:13 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Nick Piggin, linux-kernel

Marcelo Tosatti writes:
>
> Nikita has a customer using a large percentage of RAM for a kernel
> module, which results in get_dirty_limits() misbehaviour, since
>
>	unsigned long available_memory = total_pages;
>
> It should work on the amount of cacheable pages instead.
>
> He's got a patch but I don't remember the URL. Nikita?

http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/09-throttle-against-free-memory.patch

It changes balance_dirty_pages() to calculate the threshold not from the
total amount of physical pages, but from the maximal amount of pages that
can be consumed by the file system cache. This amount is approximated by
the total size of the LRU list plus free memory (across all zones).

This has a downside of starting write-out earlier, so the patch should
probably be accompanied by some tuning of the default thresholds.

Nikita.

> On Tue, Nov 01, 2005 at 04:21:15PM +1100, Nick Piggin wrote:
> > 3/3
> >
> > --
> > SUSE Labs, Novell Inc.
> >
> > Slightly change the writeout watermark calculations so we keep background
> > and synchronous writeout watermarks in the same ratios after adjusting them.
> > This ensures we should always attempt to start background writeout before
> > synchronous writeout.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> >
> >
> > Index: linux-2.6/mm/page-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page-writeback.c	2005-11-01 13:41:39.000000000 +1100
> > +++ linux-2.6/mm/page-writeback.c	2005-11-01 14:29:27.000000000 +1100
> > @@ -165,9 +165,11 @@ get_dirty_limits(struct writeback_state
> >  	if (dirty_ratio < 5)
> >  		dirty_ratio = 5;
> >
> > -	background_ratio = dirty_background_ratio;
> > -	if (background_ratio >= dirty_ratio)
> > -		background_ratio = dirty_ratio / 2;
> > +	/*
> > +	 * Keep the ratio between dirty_ratio and background_ratio roughly
> > +	 * what the sysctls are after dirty_ratio has been scaled (above).
> > +	 */
> > +	background_ratio = dirty_background_ratio * dirty_ratio/vm_dirty_ratio;
> >
> >  	background = (background_ratio * available_memory) / 100;
> >  	dirty = (dirty_ratio * available_memory) / 100;
* Re: [PATCH 3/3] vm: writeout watermarks
  2005-11-07 15:33 ` Marcelo Tosatti
  2005-11-07 21:13 ` Nikita Danilov
@ 2005-11-07 23:12 ` Nick Piggin
  1 sibling, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2005-11-07 23:12 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel, Nikita Danilov

Marcelo Tosatti wrote:

> Nikita has a customer using a large percentage of RAM for a kernel
> module, which results in get_dirty_limits() misbehaviour, since
>
>	unsigned long available_memory = total_pages;
>
> It should work on the amount of cacheable pages instead.
>
> He's got a patch but I don't remember the URL. Nikita?
>

Indeed. This patch has a couple of little problems anyway, and probably
does not logically belong as part of this series. I'll work on the
previous two, more important patches first. My patch should probably go
on top of more fundamental work like Nikita's patch.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
* Re: [PATCH 1/3] vm: kswapd incmin
  2005-11-01  5:19 ` [PATCH 1/3] vm: kswapd incmin Nick Piggin
  2005-11-01  5:20   ` [PATCH 2/3] vm: highmem watermarks Nick Piggin
@ 2005-11-07 15:28 ` Marcelo Tosatti
  2005-11-07 23:08   ` Nick Piggin
  1 sibling, 1 reply; 10+ messages in thread
From: Marcelo Tosatti @ 2005-11-07 15:28 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel

Hi Nick,

Looks nice, much easier to read than before.

One comment: you change the pagecache/slab scanning ratio by moving
shrink_slab() outside of the zone loop.

This means that each kswapd iteration will scan "lru_pages" SLAB
entries, instead of "lru_pages*NR_ZONES" entries.

Can you comment on that?

On Tue, Nov 01, 2005 at 04:19:49PM +1100, Nick Piggin wrote:
> 1/3
>
> --
> SUSE Labs, Novell Inc.
>
> Explicitly teach kswapd about the incremental min logic instead of just
> scanning all zones under the first low zone. This should keep more even
> pressure applied on the zones.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
>
>
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c	2005-11-01 13:42:33.000000000 +1100
> +++ linux-2.6/mm/vmscan.c	2005-11-01 14:27:16.000000000 +1100
> @@ -1051,97 +1051,63 @@ loop_again:
>  	}
>
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> -		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
>  		unsigned long lru_pages = 0;
> +		int first_low_zone = 0;
>
>  		all_zones_ok = 1;
> +		sc.nr_scanned = 0;
> +		sc.nr_reclaimed = 0;
> +		sc.priority = priority;
> +		sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
>
> -		if (nr_pages == 0) {
> -			/*
> -			 * Scan in the highmem->dma direction for the highest
> -			 * zone which needs scanning
> -			 */
> -			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> -				struct zone *zone = pgdat->node_zones + i;
> +		/* Scan in the highmem->dma direction */
> +		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> +			struct zone *zone = pgdat->node_zones + i;
>
> -				if (zone->present_pages == 0)
> -					continue;
> +			if (zone->present_pages == 0)
> +				continue;
>
> -				if (zone->all_unreclaimable &&
> -						priority != DEF_PRIORITY)
> +			if (nr_pages == 0) { /* Not software suspend */
> +				if (zone_watermark_ok(zone, order,
> +					zone->pages_high, first_low_zone, 0, 0))
>  					continue;
>
> -				if (!zone_watermark_ok(zone, order,
> -						zone->pages_high, 0, 0, 0)) {
> -					end_zone = i;
> -					goto scan;
> -				}
> +				all_zones_ok = 0;
> +				if (first_low_zone < i)
> +					first_low_zone = i;
>  			}
> -			goto out;
> -		} else {
> -			end_zone = pgdat->nr_zones - 1;
> -		}
> -scan:
> -		for (i = 0; i <= end_zone; i++) {
> -			struct zone *zone = pgdat->node_zones + i;
> -
> -			lru_pages += zone->nr_active + zone->nr_inactive;
> -		}
> -
> -		/*
> -		 * Now scan the zone in the dma->highmem direction, stopping
> -		 * at the last zone which needs scanning.
> -		 *
> -		 * We do this because the page allocator works in the opposite
> -		 * direction.  This prevents the page allocator from allocating
> -		 * pages behind kswapd's direction of progress, which would
> -		 * cause too much scanning of the lower zones.
> -		 */
> -		for (i = 0; i <= end_zone; i++) {
> -			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab;
> -
> -			if (zone->present_pages == 0)
> -				continue;
>
>  			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>  				continue;
>
> -			if (nr_pages == 0) {	/* Not software suspend */
> -				if (!zone_watermark_ok(zone, order,
> -						zone->pages_high, end_zone, 0, 0))
> -					all_zones_ok = 0;
> -			}
>  			zone->temp_priority = priority;
>  			if (zone->prev_priority > priority)
>  				zone->prev_priority = priority;
> -			sc.nr_scanned = 0;
> -			sc.nr_reclaimed = 0;
> -			sc.priority = priority;
> -			sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
> +			lru_pages += zone->nr_active + zone->nr_inactive;
> +
>  			atomic_inc(&zone->reclaim_in_progress);
>  			shrink_zone(zone, &sc);
>  			atomic_dec(&zone->reclaim_in_progress);
> -			reclaim_state->reclaimed_slab = 0;
> -			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
> -						lru_pages);
> -			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> -			total_reclaimed += sc.nr_reclaimed;
> -			total_scanned += sc.nr_scanned;
> -			if (zone->all_unreclaimable)
> -				continue;
> -			if (nr_slab == 0 && zone->pages_scanned >=
> +
> +			if (zone->pages_scanned >=
>  				    (zone->nr_active + zone->nr_inactive) * 4)
>  				zone->all_unreclaimable = 1;
> -			/*
> -			 * If we've done a decent amount of scanning and
> -			 * the reclaim ratio is low, start doing writepage
> -			 * even in laptop mode
> -			 */
> -			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> -			    total_scanned > total_reclaimed+total_reclaimed/2)
> -				sc.may_writepage = 1;
>  		}
> +		reclaim_state->reclaimed_slab = 0;
> +		shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
> +		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> +		total_reclaimed += sc.nr_reclaimed;
> +		total_scanned += sc.nr_scanned;
> +
> +		/*
> +		 * If we've done a decent amount of scanning and
> +		 * the reclaim ratio is low, start doing writepage
> +		 * even in laptop mode
> +		 */
> +		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> +		    total_scanned > total_reclaimed+total_reclaimed/2)
> +			sc.may_writepage = 1;
> +
>  		if (nr_pages && to_free > total_reclaimed)
>  			continue;	/* swsusp: need to do more work */
>  		if (all_zones_ok)
> @@ -1162,7 +1128,6 @@ scan:
>  		if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
>  			break;
>  	}
> -out:
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
* Re: [PATCH 1/3] vm: kswapd incmin
  2005-11-07 15:28 ` [PATCH 1/3] vm: kswapd incmin Marcelo Tosatti
@ 2005-11-07 23:08 ` Nick Piggin
  2005-11-07 18:43   ` Marcelo Tosatti
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2005-11-07 23:08 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: linux-kernel

Marcelo Tosatti wrote:
> Hi Nick,
>
> Looks nice, much easier to read than before.
>

Hi Marcelo,

Thanks! That was one of the main aims.

> One comment: you change the pagecache/slab scanning ratio by moving
> shrink_slab() outside of the zone loop.
>
> This means that each kswapd iteration will scan "lru_pages"
> SLAB entries, instead of "lru_pages*NR_ZONES" entries.
>
> Can you comment on that?
>

I believe I have tried to get it right, let me explain. lru_pages
is just used as the divisor for the ratio between lru scanning
and slab scanning. So long as it is kept constant across calls to
shrink_slab, there should be no change in behaviour.

The nr_scanned variable is the other half of the equation that
controls slab shrinking. I have changed from:

	lru_pages = total_node_lru_pages;
	for each zone in node {
		shrink_zone();
		shrink_slab(zone_scanned, lru_pages);
	}

To:

	lru_pages = 0;
	for each zone in node {
		shrink_zone();
		lru_pages += zone_lru_pages;
	}
	shrink_slab(total_zone_scanned, lru_pages);

So the ratio remains basically the same
[eg. 10/100 + 20/100 + 30/100 = (10+20+30)/100]

2 reasons for doing this. The first is just efficiency and better
rounding of the divisions.

The second is that within the for_each_zone loop, we are able to
set all_unreclaimable without worrying about slab, because the
final shrink_slab at the end will clear all_unreclaimable if any
zones have had slab pages freed up.

I believe it generally should result in more consistent reclaim
across zones, and also matches direct reclaim better.

Hope this made sense,
Nick

--
SUSE Labs, Novell Inc.
* Re: [PATCH 1/3] vm: kswapd incmin
  2005-11-07 23:08 ` Nick Piggin
@ 2005-11-07 18:43 ` Marcelo Tosatti
  0 siblings, 0 replies; 10+ messages in thread
From: Marcelo Tosatti @ 2005-11-07 18:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel

On Tue, Nov 08, 2005 at 10:08:53AM +1100, Nick Piggin wrote:
> Marcelo Tosatti wrote:
> > Hi Nick,
> >
> > Looks nice, much easier to read than before.
> >
>
> Hi Marcelo,
>
> Thanks! That was one of the main aims.
>
> > One comment: you change the pagecache/slab scanning ratio by moving
> > shrink_slab() outside of the zone loop.
> >
> > This means that each kswapd iteration will scan "lru_pages"
> > SLAB entries, instead of "lru_pages*NR_ZONES" entries.
> >
> > Can you comment on that?
> >
>
> I believe I have tried to get it right, let me explain. lru_pages
> is just used as the divisor for the ratio between lru scanning
> and slab scanning. So long as it is kept constant across calls to
> shrink_slab, there should be no change in behaviour.
>
> The nr_scanned variable is the other half of the equation that
> controls slab shrinking. I have changed from:
>
> 	lru_pages = total_node_lru_pages;
> 	for each zone in node {
> 		shrink_zone();
> 		shrink_slab(zone_scanned, lru_pages);
> 	}
>
> To:
>
> 	lru_pages = 0;
> 	for each zone in node {
> 		shrink_zone();
> 		lru_pages += zone_lru_pages;
> 	}
> 	shrink_slab(total_zone_scanned, lru_pages);
>
> So the ratio remains basically the same
> [eg. 10/100 + 20/100 + 30/100 = (10+20+30)/100]
>
> 2 reasons for doing this. The first is just efficiency and better
> rounding of the divisions.
>
> The second is that within the for_each_zone loop, we are able to
> set all_unreclaimable without worrying about slab, because the
> final shrink_slab at the end will clear all_unreclaimable if any
> zones have had slab pages freed up.
>
> I believe it generally should result in more consistent reclaim
> across zones, and also matches direct reclaim better.
>
> Hope this made sense,

Yes, makes sense. My reading was not correct.

Sounds great.
end of thread, other threads: [~2005-11-07 23:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-11-01  5:18 [PATCH 0/3] better zone and watermark balancing Nick Piggin
2005-11-01  5:19 ` [PATCH 1/3] vm: kswapd incmin Nick Piggin
2005-11-01  5:20   ` [PATCH 2/3] vm: highmem watermarks Nick Piggin
2005-11-01  5:21     ` [PATCH 3/3] vm: writeout watermarks Nick Piggin
2005-11-07 15:33       ` Marcelo Tosatti
2005-11-07 21:13         ` Nikita Danilov
2005-11-07 23:12         ` Nick Piggin
2005-11-07 15:28   ` [PATCH 1/3] vm: kswapd incmin Marcelo Tosatti
2005-11-07 23:08     ` Nick Piggin
2005-11-07 18:43       ` Marcelo Tosatti