From: Wu Fengguang <wfg@mail.ustc.edu.cn>
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@osdl.org>,
Christoph Lameter <christoph@lameter.com>,
Rik van Riel <riel@redhat.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Marcelo Tosatti <marcelo.tosatti@cyclades.com>,
Magnus Damm <magnus.damm@gmail.com>,
Nick Piggin <npiggin@suse.de>, Andrea Arcangeli <andrea@suse.de>
Subject: [PATCH 07/16] mm: balance active/inactive list scan rates
Date: Wed, 07 Dec 2005 18:48:02 +0800
Message-ID: <20051207105035.516042000@localhost.localdomain>
In-Reply-To: 20051207104755.177435000@localhost.localdomain
[-- Attachment #1: mm-balance-active-inactive-list-aging.patch --]
[-- Type: text/plain, Size: 9397 bytes --]
shrink_zone() has two major design goals:
1) let active/inactive lists have equal scan rates
2) do the scans in small chunks
But the current implementation has some problems:
- it is reluctant to scan small zones
  the callers often have to drop to a lower priority value (a more
  aggressive scan) before any memory is freed.
- the balance is quite rough
  the break statement in the loop discards whatever scan work is still
  pending for either list.
- it may scan only a few pages in one batch
  refill_inactive_zone() can be called twice to scan 32 pages and then 1 page.
The new design:
1) keep perfect balance
   let the active_list scan rate follow that of the inactive_list
2) always scan in SWAP_CLUSTER_MAX sized chunks
   simple and efficient
3) always scan at least one chunk
   this is the behavior the callers expect
The perfect balance may or may not yield better performance, but it
a) is a more understandable and dependable behavior
b) together with inter-zone balancing, makes page aging consistent across
   the zoned memories

The atomic reclaim_in_progress counter is there to prevent most concurrent
reclaims of the same zone. If concurrent reclaims do happen, they cause no
fatal errors.
I tested the patch with the following commands:
dd if=/dev/zero of=hot bs=1M seek=800 count=1
dd if=/dev/zero of=cold bs=1M seek=50000 count=1
./test-aging.sh; ./active-inactive-aging-rate.sh
Before the patch:
-----------------------------------------------------------------------------
active/inactive sizes on 2.6.14-2-686-smp:
0/1000 = 0 / 1241
563/1000 = 73343 / 130108
887/1000 = 137348 / 154816
active/inactive scan rates:
dma 38/1000 = 7731 / (198924 + 0)
normal 465/1000 = 2979780 / (6394740 + 0)
high 680/1000 = 4354230 / (6396786 + 0)
total used free shared buffers cached
Mem: 2027 1978 49 0 4 1923
-/+ buffers/cache: 49 1977
Swap: 0 0 0
-----------------------------------------------------------------------------
After the patch, the active/inactive scan rate of each zone closely follows
its active/inactive size ratio:
-----------------------------------------------------------------------------
active/inactive sizes on 2.6.15-rc3-mm1:
0/1000 = 0 / 961
236/1000 = 38385 / 162429
319/1000 = 70607 / 221101
active/inactive scan rates:
dma 0/1000 = 0 / (42176 + 0)
normal 234/1000 = 1714688 / (7303456 + 1088)
high 317/1000 = 3151936 / (9933792 + 96)
total used free shared buffers cached
Mem: 2020 1969 50 0 5 1908
-/+ buffers/cache: 54 1965
Swap: 0 0 0
-----------------------------------------------------------------------------
script test-aging.sh:
------------------------------
#!/bin/zsh
cp cold /dev/null&
while {pidof cp > /dev/null};
do
cp hot /dev/null
done
------------------------------
script active-inactive-aging-rate.sh:
-----------------------------------------------------------------------------
#!/bin/bash
# needs bash: uses [[ ]] and echo -e

echo active/inactive sizes on `uname -r`:
# print the active/inactive size ratio (per mille) of each zone
egrep '(active|inactive)' /proc/zoneinfo |
while true
do
	read name value
	[[ -z $name ]] && break
	eval $name=$value
	[[ $name = "inactive" ]] && echo -e "$((active * 1000 / (1 + inactive)))/1000 \t= $active / $inactive"
done

# load every /proc/vmstat counter into a shell variable of the same name
while true
do
	read name value
	[[ -z $name ]] && break
	eval $name=$value
done < /proc/vmstat

echo
echo active/inactive scan rates:
echo -e "dma \t $((pgrefill_dma * 1000 / (1 + pgscan_kswapd_dma + pgscan_direct_dma)))/1000 \t= $pgrefill_dma / ($pgscan_kswapd_dma + $pgscan_direct_dma)"
echo -e "normal \t $((pgrefill_normal * 1000 / (1 + pgscan_kswapd_normal + pgscan_direct_normal)))/1000 \t= $pgrefill_normal / ($pgscan_kswapd_normal + $pgscan_direct_normal)"
echo -e "high \t $((pgrefill_high * 1000 / (1 + pgscan_kswapd_high + pgscan_direct_high)))/1000 \t= $pgrefill_high / ($pgscan_kswapd_high + $pgscan_direct_high)"

echo
free -m
-----------------------------------------------------------------------------
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mmzone.h | 3 --
include/linux/swap.h | 2 -
mm/page_alloc.c | 5 +---
mm/vmscan.c | 52 +++++++++++++++++++++++++++----------------------
4 files changed, 33 insertions(+), 29 deletions(-)
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -912,7 +912,7 @@ static void shrink_cache(struct zone *zo
int nr_scan;
int nr_freed;
- nr_taken = isolate_lru_pages(sc->swap_cluster_max,
+ nr_taken = isolate_lru_pages(sc->nr_to_scan,
&zone->inactive_list,
&page_list, &nr_scan);
zone->nr_inactive -= nr_taken;
@@ -1106,56 +1106,56 @@ refill_inactive_zone(struct zone *zone,
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ * The reclaim process:
+ * a) scan always in batch of SWAP_CLUSTER_MAX pages
+ * b) scan inactive list at least one batch
+ * c) balance the scan rate of active/inactive list
+ * d) finish on either scanned or reclaimed enough pages
*/
static void
shrink_zone(struct zone *zone, struct scan_control *sc)
{
+ unsigned long long next_scan_active;
unsigned long nr_active;
unsigned long nr_inactive;
atomic_inc(&zone->reclaim_in_progress);
+ next_scan_active = sc->nr_scanned;
+
/*
* Add one to `nr_to_scan' just to make sure that the kernel will
* slowly sift through the active list.
*/
- zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
- nr_active = zone->nr_scan_active;
- if (nr_active >= sc->swap_cluster_max)
- zone->nr_scan_active = 0;
- else
- nr_active = 0;
-
- zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
- nr_inactive = zone->nr_scan_inactive;
- if (nr_inactive >= sc->swap_cluster_max)
- zone->nr_scan_inactive = 0;
- else
- nr_inactive = 0;
+ nr_active = zone->nr_scan_active + 1;
+ nr_inactive = (zone->nr_inactive >> sc->priority) + SWAP_CLUSTER_MAX;
+ nr_inactive &= ~(SWAP_CLUSTER_MAX - 1);
+ sc->nr_to_scan = SWAP_CLUSTER_MAX;
sc->nr_to_reclaim = sc->swap_cluster_max;
- while (nr_active || nr_inactive) {
- if (nr_active) {
- sc->nr_to_scan = min(nr_active,
- (unsigned long)sc->swap_cluster_max);
- nr_active -= sc->nr_to_scan;
+ while (nr_active >= SWAP_CLUSTER_MAX * 1024 || nr_inactive) {
+ if (nr_active >= SWAP_CLUSTER_MAX * 1024) {
+ nr_active -= SWAP_CLUSTER_MAX * 1024;
refill_inactive_zone(zone, sc);
}
if (nr_inactive) {
- sc->nr_to_scan = min(nr_inactive,
- (unsigned long)sc->swap_cluster_max);
- nr_inactive -= sc->nr_to_scan;
+ nr_inactive -= SWAP_CLUSTER_MAX;
shrink_cache(zone, sc);
if (sc->nr_to_reclaim <= 0)
break;
}
}
- throttle_vm_writeout();
+ next_scan_active = (sc->nr_scanned - next_scan_active) * 1024ULL *
+ (unsigned long long)zone->nr_active;
+ do_div(next_scan_active, zone->nr_inactive | 1);
+ zone->nr_scan_active = nr_active + (unsigned long)next_scan_active;
atomic_dec(&zone->reclaim_in_progress);
+
+ throttle_vm_writeout();
}
/*
@@ -1196,6 +1196,9 @@ shrink_caches(struct zone **zones, struc
if (zone->all_unreclaimable && sc->priority < DEF_PRIORITY)
continue; /* Let kswapd poll it */
+ if (atomic_read(&zone->reclaim_in_progress))
+ continue;
+
/*
* Balance page aging in local zones and following headless
* zones.
@@ -1425,6 +1428,9 @@ scan_swspd:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
+ if (atomic_read(&zone->reclaim_in_progress))
+ continue;
+
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -2145,7 +2145,6 @@ static void __init free_area_init_core(s
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
zone->nr_active = 0;
zone->nr_inactive = 0;
zone->aging_total = 0;
@@ -2301,7 +2300,7 @@ static int zoneinfo_show(struct seq_file
"\n inactive %lu"
"\n aging %lu"
"\n age %lu"
- "\n scanned %lu (a: %lu i: %lu)"
+ "\n scanned %lu (a: %lu)"
"\n spanned %lu"
"\n present %lu",
zone->free_pages,
@@ -2313,7 +2312,7 @@ static int zoneinfo_show(struct seq_file
zone->aging_total,
zone->page_age,
zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
+ zone->nr_scan_active / 1024,
zone->spanned_pages,
zone->present_pages);
seq_printf(m,
--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -111,7 +111,7 @@ enum {
SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
};
-#define SWAP_CLUSTER_MAX 32
+#define SWAP_CLUSTER_MAX 32 /* must be power of 2 */
#define SWAP_MAP_MAX 0x7fff
#define SWAP_MAP_BAD 0x8000
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -142,8 +142,7 @@ struct zone {
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
- unsigned long nr_scan_active;
- unsigned long nr_scan_inactive;
+ unsigned long nr_scan_active; /* x1024 to be more precise */
unsigned long nr_active;
unsigned long nr_inactive;
unsigned long pages_scanned; /* since last reclaim */
--