From: Chris Wright <chrisw@sous-sol.org>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
Zwane Mwaikambo <zwane@arm.linux.org.uk>,
"Theodore Ts'o" <tytso@mit.edu>,
Randy Dunlap <rdunlap@xenotime.net>,
Dave Jones <davej@redhat.com>,
Chuck Wolber <chuckw@quantumlinux.com>,
Chris Wedgwood <reviews@ml.cw.f00f.org>,
Michael Krufky <mkrufky@linuxtv.org>,
torvalds@osdl.org, akpm@osdl.org, alan@lxorguk.ukuu.org.uk,
Martin Bligh <mbligh@mbligh.org>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Christoph Lameter <clameter@engr.sgi.com>
Subject: [PATCH 54/61] vmscan: Fix temp_priority race
Date: Tue, 31 Oct 2006 21:34:34 -0800
Message-ID: <20061101054507.162773000@sous-sol.org>
In-Reply-To: 20061101053340.305569000@sous-sol.org
[-- Attachment #1: vmscan-fix-temp_priority-race.patch --]
[-- Type: text/plain, Size: 7852 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Andrew Morton <akpm@osdl.org>
The temp_priority field in zone is racy, as we can walk through a reclaim
path, and just before we copy it into prev_priority, it can be overwritten
(say with DEF_PRIORITY) by another reclaimer.
The same bug is contained in both try_to_free_pages and balance_pgdat, but
it is fixed slightly differently. In balance_pgdat, we keep a separate
priority record per zone in a local array. In try_to_free_pages there is
no need to do this, as the priority level is the same for all zones that we
reclaim from.
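
As a rough illustration of the race described above, the following user-space sketch replays one bad interleaving deterministically, without threads. The names mock_zone, old_scheme and new_scheme are hypothetical and exist only for this example, and DEF_PRIORITY is assumed to be 12. Reclaimer A scans at an urgent priority, reclaimer B starts a fresh pass and resets the shared field, and A then copies B's value back.

```c
#include <assert.h>

#define DEF_PRIORITY 12	/* least urgent level; value assumed for this sketch */

/* Hypothetical stand-in for the two struct zone fields involved. */
struct mock_zone {
	int temp_priority;	/* shared scratch field, removed by the patch */
	int prev_priority;
};

/* Old scheme: A's result travels through the shared zone field. */
static int old_scheme(int a_priority)
{
	struct mock_zone zone = { DEF_PRIORITY, DEF_PRIORITY };

	assert(a_priority >= 0 && a_priority <= DEF_PRIORITY);
	zone.temp_priority = a_priority;		/* A: scanning urgently */
	zone.temp_priority = DEF_PRIORITY;		/* B: begins a new pass */
	zone.prev_priority = zone.temp_priority;	/* A: copies B's value */
	return zone.prev_priority;
}

/* New scheme: A remembers its priority in a local variable instead. */
static int new_scheme(int a_priority)
{
	struct mock_zone zone = { DEF_PRIORITY, DEF_PRIORITY };
	int local_priority;

	assert(a_priority >= 0 && a_priority <= DEF_PRIORITY);
	local_priority = a_priority;		/* A: private to this reclaimer */
	/* B beginning a new pass has no shared field left to overwrite. */
	zone.prev_priority = local_priority;	/* A: records its own value */
	return zone.prev_priority;
}
```

With a_priority of 2, old_scheme() leaves prev_priority at DEF_PRIORITY while new_scheme() leaves it at 2. The real code differs in detail (balance_pgdat keeps one local entry per zone), so this only models the ordering problem.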
Impact of this bug is that temp_priority is copied into prev_priority, and
setting this artificially high causes reclaimers to set distress
artificially low. They then fail to reclaim mapped pages, when they are,
in fact, under severe memory pressure (their priority may be as low as 0).
This causes the OOM killer to fire incorrectly.
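
To make the impact concrete, here is a small sketch of how prev_priority might feed the reclaim-mapped decision. The formulas are an assumption based on how shrink_active_list() in kernels of this era is commonly described (distress = 100 >> prev_priority, compared against a threshold of 100 together with mapped_ratio and swappiness); they are not quoted from this patch, and the helper names are hypothetical.

```c
#include <assert.h>

#define DEF_PRIORITY 12	/* least urgent level; value assumed for this sketch */

/* Assumed formula: distress grows as prev_priority approaches 0. */
static int distress(int prev_priority)
{
	assert(prev_priority >= 0 && prev_priority <= DEF_PRIORITY);
	return 100 >> prev_priority;
}

/*
 * Assumed decision rule: mapped pages are reclaimed once the combined
 * "swap tendency" reaches 100. mapped_ratio is a percentage.
 */
static int reclaim_mapped(int prev_priority, int mapped_ratio, int swappiness)
{
	int swap_tendency = mapped_ratio / 2 + distress(prev_priority) + swappiness;

	return swap_tendency >= 100;
}
```

Under these assumptions, a zone whose prev_priority was wrongly reset to DEF_PRIORITY reports a distress of 0, so with mapped_ratio 40 and swappiness 60 the tendency is 80 and mapped pages are left alone. The same zone at prev_priority 0 reports a distress of 100 and mapped pages become eligible.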
From: Andrew Morton <akpm@osdl.org>
__zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
is used in the decision whether or not to bring mapped pages onto the inactive
list. Hence there's a risk here that __zone_reclaim() will fail because
zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
stuck on the active list.
Fix that up by decreasing (ie making more urgent) zone->prev_priority as
__zone_reclaim() scans the zone's pages.
This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
possible to remove that now, and to just start out at DEF_PRIORITY?
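
The effect of the __zone_reclaim() change can be sketched in user space as follows. note_zone_scanning_priority() mirrors the helper added by this patch; everything else is hypothetical scaffolding: mock_zone stands in for struct zone, shrink_zone() is replaced by a fixed pages_per_pass count, and ZONE_RECLAIM_PRIORITY is assumed to be 4.

```c
#include <assert.h>

#define DEF_PRIORITY		12	/* assumed value */
#define ZONE_RECLAIM_PRIORITY	4	/* assumed value */

/* Hypothetical stand-in for struct zone. */
struct mock_zone {
	int prev_priority;
};

/* Same logic as the helper in the patch: only ever make it more urgent. */
static void note_zone_scanning_priority(struct mock_zone *zone, int priority)
{
	if (priority < zone->prev_priority)
		zone->prev_priority = priority;
}

/* Models the patched loop; each pass pretends to free pages_per_pass pages. */
static int mock_zone_reclaim(struct mock_zone *zone, int nr_pages,
			     int pages_per_pass)
{
	int priority = ZONE_RECLAIM_PRIORITY;
	int nr_reclaimed = 0;

	assert(pages_per_pass >= 0);
	do {
		note_zone_scanning_priority(zone, priority);
		nr_reclaimed += pages_per_pass;	/* stands in for shrink_zone() */
		priority--;
	} while (priority >= 0 && nr_reclaimed < nr_pages);
	return nr_reclaimed;
}

/* Convenience wrapper: prev_priority left behind by one reclaim attempt. */
static int prev_priority_after(int nr_pages, int pages_per_pass)
{
	struct mock_zone zone = { DEF_PRIORITY };

	mock_zone_reclaim(&zone, nr_pages, pages_per_pass);
	return zone.prev_priority;
}
```

If the first pass already frees enough, prev_priority only drops to ZONE_RECLAIM_PRIORITY. If nothing can be freed, the loop walks down to priority 0 and prev_priority follows it, which is the point of the fix: the zone's recorded urgency now tracks how hard __zone_reclaim() had to work.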
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
[chrisw: minor wiggle to fit -stable]
---
include/linux/mmzone.h | 6 -----
mm/page_alloc.c | 2 -
mm/vmscan.c | 55 ++++++++++++++++++++++++++++++++++++-------------
mm/vmstat.c | 2 -
4 files changed, 43 insertions(+), 22 deletions(-)
--- linux-2.6.18.1.orig/include/linux/mmzone.h
+++ linux-2.6.18.1/include/linux/mmzone.h
@@ -200,13 +200,9 @@ struct zone {
* under - it drives the swappiness decision: whether to unmap mapped
* pages.
*
- * temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to free_pages == pages_high.
- *
- * Access to both these fields is quite racy even on uniprocessor. But
+ * Access to both this field is quite racy even on uniprocessor. But
* it is expected to average out OK.
*/
- int temp_priority;
int prev_priority;
--- linux-2.6.18.1.orig/mm/page_alloc.c
+++ linux-2.6.18.1/mm/page_alloc.c
@@ -2021,7 +2021,7 @@ static void __meminit free_area_init_cor
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
- zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+ zone->prev_priority = DEF_PRIORITY;
zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
--- linux-2.6.18.1.orig/mm/vmscan.c
+++ linux-2.6.18.1/mm/vmscan.c
@@ -696,6 +696,20 @@ done:
}
/*
+ * We are about to scan this zone at a certain priority level. If that priority
+ * level is smaller (ie: more urgent) than the previous priority, then note
+ * that priority level within the zone. This is done so that when the next
+ * process comes in to scan this zone, it will immediately start out at this
+ * priority level rather than having to build up its own scanning priority.
+ * Here, this priority affects only the reclaim-mapped threshold.
+ */
+static inline void note_zone_scanning_priority(struct zone *zone, int priority)
+{
+ if (priority < zone->prev_priority)
+ zone->prev_priority = priority;
+}
+
+/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -934,9 +948,7 @@ static unsigned long shrink_zones(int pr
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;
- zone->temp_priority = priority;
- if (zone->prev_priority > priority)
- zone->prev_priority = priority;
+ note_zone_scanning_priority(zone, priority);
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
@@ -984,7 +996,6 @@ unsigned long try_to_free_pages(struct z
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;
- zone->temp_priority = DEF_PRIORITY;
lru_pages += zone->nr_active + zone->nr_inactive;
}
@@ -1022,13 +1033,22 @@ unsigned long try_to_free_pages(struct z
blk_congestion_wait(WRITE, HZ/10);
}
out:
+ /*
+ * Now that we've scanned all the zones at this priority level, note
+ * that level within the zone so that the next thread which performs
+ * scanning of this zone will immediately start out at this priority
+ * level. This affects only the decision whether or not to bring
+ * mapped pages onto the inactive list.
+ */
+ if (priority < 0)
+ priority = 0;
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;
- zone->prev_priority = zone->temp_priority;
+ zone->prev_priority = priority;
}
return ret;
}
@@ -1068,6 +1088,11 @@ static unsigned long balance_pgdat(pg_da
.swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = vm_swappiness,
};
+ /*
+ * temp_priority is used to remember the scanning priority at which
+ * this zone was successfully refilled to free_pages == pages_high.
+ */
+ int temp_priority[MAX_NR_ZONES];
loop_again:
total_scanned = 0;
@@ -1075,11 +1100,8 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- zone->temp_priority = DEF_PRIORITY;
- }
+ for (i = 0; i < pgdat->nr_zones; i++)
+ temp_priority[i] = DEF_PRIORITY;
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
@@ -1140,10 +1162,9 @@ scan:
if (!zone_watermark_ok(zone, order, zone->pages_high,
end_zone, 0))
all_zones_ok = 0;
- zone->temp_priority = priority;
- if (zone->prev_priority > priority)
- zone->prev_priority = priority;
+ temp_priority[i] = priority;
sc.nr_scanned = 0;
+ note_zone_scanning_priority(zone, priority);
nr_reclaimed += shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -1183,10 +1204,15 @@ scan:
break;
}
out:
+ /*
+ * Note within each zone the priority level at which this zone was
+ * brought into a happy state. So that the next thread which scans this
+ * zone will start out at that priority level.
+ */
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
- zone->prev_priority = zone->temp_priority;
+ zone->prev_priority = temp_priority[i];
}
if (!all_zones_ok) {
cond_resched();
@@ -1570,6 +1596,7 @@ static int __zone_reclaim(struct zone *z
*/
priority = ZONE_RECLAIM_PRIORITY;
do {
+ note_zone_scanning_priority(zone, priority);
nr_reclaimed += shrink_zone(priority, zone, &sc);
priority--;
} while (priority >= 0 && nr_reclaimed < nr_pages);
--- linux-2.6.18.1.orig/mm/vmstat.c
+++ linux-2.6.18.1/mm/vmstat.c
@@ -586,11 +586,9 @@ static int zoneinfo_show(struct seq_file
seq_printf(m,
"\n all_unreclaimable: %u"
"\n prev_priority: %i"
- "\n temp_priority: %i"
"\n start_pfn: %lu",
zone->all_unreclaimable,
zone->prev_priority,
- zone->temp_priority,
zone->zone_start_pfn);
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
--