From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
Zwane Mwaikambo <zwane@arm.linux.org.uk>,
"Theodore Ts'o" <tytso@mit.edu>,
Randy Dunlap <rdunlap@xenotime.net>,
Dave Jones <davej@redhat.com>,
Chuck Wolber <chuckw@quantumlinux.com>,
Chris Wedgwood <reviews@ml.cw.f00f.org>,
Michael Krufky <mkrufky@linuxtv.org>,
torvalds@osdl.org, akpm@osdl.org, alan@lxorguk.ukuu.org.uk,
Christoph Lameter <clameter@sgi.com>,
Greg Kroah-Hartman <gregkh@suse.de>
Subject: [patch 18/67] zone_reclaim: dynamic slab reclaim
Date: Wed, 11 Oct 2006 14:04:53 -0700
Message-ID: <20061011210453.GS16627@kroah.com>
In-Reply-To: <20061011210310.GA16627@kroah.com>
[-- Attachment #1: zone_reclaim-dynamic-slab-reclaim.patch --]
[-- Type: text/plain, Size: 11401 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Christoph Lameter <clameter@sgi.com>
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0ff38490c836dc379ff7ec45b10a15a662f4e5f6
Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a last
resort if freeing unmapped file backed pages does not release enough
pages to allow a local allocation.
However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs. We have had a case where a machine with
46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
dealing with pagecache pages. However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).
This patch implements slab reclaim during zone reclaim. Zone reclaim
occurs if there is a danger of an off-node allocation. At that point we:
1. Shrink the per node page cache if the number of unmapped pagecache
pages is more than min_unmapped_ratio percent of pages in a zone.
2. Shrink the slab cache if the number of the node's reclaimable slab
pages (patch depends on an earlier one that implements that counter)
is more than min_slab_ratio percent of pages in a zone (a new
/proc/sys/vm tunable).
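The two checks above can be sketched in plain userspace C. This is only
an illustration, not the kernel code: struct zone_sim and all helper
names are hypothetical stand-ins for the per-zone counters
(NR_FILE_PAGES, NR_FILE_MAPPED, NR_SLAB) and for the page thresholds
that the patch derives from the two percentage tunables.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone state the patch consults. */
struct zone_sim {
	unsigned long present_pages;      /* pages in the zone */
	unsigned long file_pages;         /* models NR_FILE_PAGES */
	unsigned long file_mapped;        /* models NR_FILE_MAPPED */
	unsigned long slab_pages;         /* models NR_SLAB */
	unsigned long min_unmapped_pages; /* threshold in pages */
	unsigned long min_slab_pages;     /* threshold in pages */
};

/* Convert a percentage tunable into a page count for one zone. */
static unsigned long ratio_to_pages(unsigned long present_pages,
				    unsigned int percent)
{
	assert(percent <= 100);
	return present_pages * percent / 100;
}

/* Step 1: more unmapped pagecache pages than the threshold allows? */
static bool over_unmapped_limit(const struct zone_sim *z)
{
	return z->file_pages - z->file_mapped > z->min_unmapped_pages;
}

/* Step 2: more reclaimable slab pages than the threshold allows? */
static bool over_slab_limit(const struct zone_sim *z)
{
	return z->slab_pages > z->min_slab_pages;
}

/* Zone reclaim is skipped only when neither limit is exceeded. */
static bool zone_reclaim_worthwhile(const struct zone_sim *z)
{
	return over_unmapped_limit(z) || over_slab_limit(z);
}
```

With the defaults (1% and 5%) a zone of 1000 pages gets thresholds of 10
unmapped pagecache pages and 50 slab pages, so a zone holding 60 slab
pages would trigger slab reclaim even with very little page cache.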
The shrinking of the slab cache is a bit problematic since it is not node
specific. So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that need to be
allocated) and then repeatedly run the global reclaim until that is
unsuccessful or we have reached the limit. I hope we will have zone based
slab reclaim at some point which will make that easier.
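A rough userspace model of that loop, assuming a global shrinker that
frees pages from every node on each pass. The names slab_sim,
global_shrink and shrink_node_slab are hypothetical and exist only for
this sketch; the clamp on the limit is an addition of the sketch, not
something the patch does.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2

/* Hypothetical model: reclaimable slab pages held on each node. */
struct slab_sim {
	unsigned long slab_pages[NR_NODES];
};

/*
 * Stand-in for a global shrink pass: frees up to `batch` pages from
 * every node and returns the total freed. 0 means nothing was freed.
 */
static unsigned long global_shrink(struct slab_sim *s, unsigned long batch)
{
	unsigned long freed = 0;

	for (size_t n = 0; n < NR_NODES; n++) {
		unsigned long take = s->slab_pages[n] < batch ?
				     s->slab_pages[n] : batch;

		s->slab_pages[n] -= take;
		freed += take;
	}
	return freed;
}

/*
 * Shake the global slab until the given node has dropped by nr_pages
 * or a pass frees nothing. Returns the pages freed on that node.
 */
static unsigned long shrink_node_slab(struct slab_sim *s, size_t node,
				      unsigned long nr_pages,
				      unsigned long batch)
{
	unsigned long start = s->slab_pages[node];
	/* Sketch-only guard: clamp so the subtraction cannot wrap. */
	unsigned long limit = start > nr_pages ? start - nr_pages : 0;

	while (global_shrink(s, batch) && s->slab_pages[node] > limit)
		;
	return start - s->slab_pages[node];
}
```

Note how the other node loses slab pages as a side effect of every pass.
That is the cost of a shrinker that is not node specific, and the reason
zone based slab reclaim would make this easier.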
The default for min_slab_ratio is 5%.
Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
Documentation/sysctl/vm.txt | 27 +++++++++++++++-----
include/linux/mmzone.h | 3 ++
include/linux/swap.h | 1
include/linux/sysctl.h | 1
kernel/sysctl.c | 11 ++++++++
mm/page_alloc.c | 17 ++++++++++++
mm/vmscan.c | 58 ++++++++++++++++++++++++++++----------------
7 files changed, 90 insertions(+), 28 deletions(-)
--- linux-2.6.18.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.18/Documentation/sysctl/vm.txt
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/
- drop-caches
- zone_reclaim_mode
- min_unmapped_ratio
+- min_slab_ratio
- panic_on_oom
==============================================================
@@ -138,7 +139,6 @@ This is value ORed together of
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
-8 = Also do a global slab reclaim pass
zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
@@ -162,18 +162,13 @@ Allowing regular swap effectively restri
node unless explicitly overridden by memory policies or cpuset
configurations.
-It may be advisable to allow slab reclaim if the system makes heavy
-use of files and builds up large slab caches. However, the slab
-shrink operation is global, may take a long time and free slabs
-in all nodes of the system.
-
=============================================================
min_unmapped_ratio:
This is available only on NUMA kernels.
-A percentage of the file backed pages in each zone. Zone reclaim will only
+A percentage of the total pages in each zone. Zone reclaim will only
occur if more than this percentage of pages are file backed and unmapped.
This is to insure that a minimal amount of local pages is still available for
file I/O even if the node is overallocated.
@@ -182,6 +177,24 @@ The default is 1 percent.
=============================================================
+min_slab_ratio:
+
+This is available only on NUMA kernels.
+
+A percentage of the total pages in each zone. On Zone reclaim
+(fallback from the local zone occurs) slabs will be reclaimed if more
+than this percentage of pages in a zone are reclaimable slab pages.
+This insures that the slab growth stays under control even in NUMA
+systems that rarely perform global reclaim.
+
+The default is 5 percent.
+
+Note that slab reclaim is triggered in a per zone / node fashion.
+The process of reclaiming slab memory is currently not node specific
+and may not be fast.
+
+=============================================================
+
panic_on_oom
This enables or disables panic on out-of-memory feature. If this is set to 1,
--- linux-2.6.18.orig/include/linux/mmzone.h
+++ linux-2.6.18/include/linux/mmzone.h
@@ -155,6 +155,7 @@ struct zone {
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_ratio;
+ unsigned long min_slab_pages;
struct per_cpu_pageset *pageset[NR_CPUS];
#else
struct per_cpu_pageset pageset[NR_CPUS];
@@ -421,6 +422,8 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
+ struct file *, void __user *, size_t *, loff_t *);
#include <linux/topology.h>
/* Returns the number of the current Node. */
--- linux-2.6.18.orig/include/linux/swap.h
+++ linux-2.6.18/include/linux/swap.h
@@ -190,6 +190,7 @@ extern long vm_total_pages;
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
+extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#else
#define zone_reclaim_mode 0
--- linux-2.6.18.orig/include/linux/sysctl.h
+++ linux-2.6.18/include/linux/sysctl.h
@@ -191,6 +191,7 @@ enum
VM_MIN_UNMAPPED=32, /* Set min percent of unmapped pages */
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
+ VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
};
--- linux-2.6.18.orig/kernel/sysctl.c
+++ linux-2.6.18/kernel/sysctl.c
@@ -943,6 +943,17 @@ static ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+ {
+ .ctl_name = VM_MIN_SLAB,
+ .procname = "min_slab_ratio",
+ .data = &sysctl_min_slab_ratio,
+ .maxlen = sizeof(sysctl_min_slab_ratio),
+ .mode = 0644,
+ .proc_handler = &sysctl_min_slab_ratio_sysctl_handler,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#endif
#ifdef CONFIG_X86_32
{
--- linux-2.6.18.orig/mm/page_alloc.c
+++ linux-2.6.18/mm/page_alloc.c
@@ -2008,6 +2008,7 @@ static void __meminit free_area_init_cor
#ifdef CONFIG_NUMA
zone->min_unmapped_ratio = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+ zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
spin_lock_init(&zone->lock);
@@ -2318,6 +2319,22 @@ int sysctl_min_unmapped_ratio_sysctl_han
sysctl_min_unmapped_ratio) / 100;
return 0;
}
+
+int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ for_each_zone(zone)
+ zone->min_slab_pages = (zone->present_pages *
+ sysctl_min_slab_ratio) / 100;
+ return 0;
+}
#endif
/*
--- linux-2.6.18.orig/mm/vmscan.c
+++ linux-2.6.18/mm/vmscan.c
@@ -1510,7 +1510,6 @@ int zone_reclaim_mode __read_mostly;
#define RECLAIM_ZONE (1<<0) /* Run shrink_cache on the zone */
#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
-#define RECLAIM_SLAB (1<<3) /* Do a global slab shrink if the zone is out of memory */
/*
* Priority for ZONE_RECLAIM. This determines the fraction of pages
@@ -1526,6 +1525,12 @@ int zone_reclaim_mode __read_mostly;
int sysctl_min_unmapped_ratio = 1;
/*
+ * If the number of slab pages in a zone grows beyond this percentage then
+ * slab reclaim needs to occur.
+ */
+int sysctl_min_slab_ratio = 5;
+
+/*
* Try to free up some pages from this zone through reclaim.
*/
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -1556,29 +1561,37 @@ static int __zone_reclaim(struct zone *z
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- /*
- * Free memory by calling shrink zone with increasing priorities
- * until we have enough memory freed.
- */
- priority = ZONE_RECLAIM_PRIORITY;
- do {
- nr_reclaimed += shrink_zone(priority, zone, &sc);
- priority--;
- } while (priority >= 0 && nr_reclaimed < nr_pages);
+ if (zone_page_state(zone, NR_FILE_PAGES) -
+ zone_page_state(zone, NR_FILE_MAPPED) >
+ zone->min_unmapped_ratio) {
+ /*
+ * Free memory by calling shrink zone with increasing
+ * priorities until we have enough memory freed.
+ */
+ priority = ZONE_RECLAIM_PRIORITY;
+ do {
+ nr_reclaimed += shrink_zone(priority, zone, &sc);
+ priority--;
+ } while (priority >= 0 && nr_reclaimed < nr_pages);
+ }
- if (nr_reclaimed < nr_pages && (zone_reclaim_mode & RECLAIM_SLAB)) {
+ if (zone_page_state(zone, NR_SLAB) > zone->min_slab_pages) {
/*
* shrink_slab() does not currently allow us to determine how
- * many pages were freed in this zone. So we just shake the slab
- * a bit and then go off node for this particular allocation
- * despite possibly having freed enough memory to allocate in
- * this zone. If we freed local memory then the next
- * allocations will be local again.
+ * many pages were freed in this zone. So we take the current
+ * number of slab pages and shake the slab until it is reduced
+ * by the same nr_pages that we used for reclaiming unmapped
+ * pages.
*
- * shrink_slab will free memory on all zones and may take
- * a long time.
+ * Note that shrink_slab will free memory on all zones and may
+ * take a long time.
*/
- shrink_slab(sc.nr_scanned, gfp_mask, order);
+ unsigned long limit = zone_page_state(zone,
+ NR_SLAB) - nr_pages;
+
+ while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
+ zone_page_state(zone, NR_SLAB) > limit)
+ ;
}
p->reclaim_state = NULL;
@@ -1592,7 +1605,8 @@ int zone_reclaim(struct zone *zone, gfp_
int node_id;
/*
- * Zone reclaim reclaims unmapped file backed pages.
+ * Zone reclaim reclaims unmapped file backed pages and
+ * slab pages if we are over the defined limits.
*
* A small portion of unmapped file backed pages is needed for
* file I/O otherwise pages read by file I/O will be immediately
@@ -1601,7 +1615,9 @@ int zone_reclaim(struct zone *zone, gfp_
* unmapped file backed pages.
*/
if (zone_page_state(zone, NR_FILE_PAGES) -
- zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_ratio)
+ zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_ratio
+ && zone_page_state(zone, NR_SLAB)
+ <= zone->min_slab_pages)
return 0;
/*
--