From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: stable-review@kernel.org, torvalds@linux-foundation.org,
akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,
Christoph Lameter <cl@linux.com>, Mel Gorman <mel@csn.ul.ie>
Subject: [47/68] mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
Date: Fri, 24 Sep 2010 09:32:11 -0700
Message-ID: <20100924163348.197319640@clark.site>
In-Reply-To: <20100924163357.GA15741@kroah.com>
2.6.32-stable review patch. If anyone has any objections, please let us know.
------------------
From: Christoph Lameter <cl@linux.com>
commit aa45484031ddee09b06350ab8528bfe5b2c76d1c upstream.
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold. On systems
with many CPUs, the difference between the estimated and real value of
NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
number of real free pages in the buddy allocator, the VM can allocate
pages below the min watermark, at worst reducing the real number of free
pages to zero. Even if the OOM killer kills a victim to free memory, that
may not free any memory if the exit path requires a new page, resulting
in livelock.
This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken. The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
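Note (illustration only, not part of the patch): the counter drift
described in the changelog can be reproduced with a small userspace model.
The sketch below assumes a fixed CPU count and single-threaded execution.
Every name in it (model_zone, model_mod_state, model_page_state,
model_page_state_snapshot, MODEL_NR_CPUS) is hypothetical and only mirrors
the roles of vm_stat[], vm_stat_diff[] and stat_threshold in the diff
below.

```c
#include <assert.h>

/* Hypothetical userspace model; these names are not kernel APIs. */
#define MODEL_NR_CPUS 4

struct model_zone {
	long global_count;              /* plays the role of zone->vm_stat[item] */
	long cpu_delta[MODEL_NR_CPUS];  /* plays the role of per-cpu vm_stat_diff[item] */
	long threshold;                 /* plays the role of stat_threshold */
};

/*
 * Apply a change on one CPU. The delta is folded into the global
 * counter only once its magnitude exceeds the threshold.
 */
static void model_mod_state(struct model_zone *z, int cpu, long change)
{
	long d;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	d = z->cpu_delta[cpu] + change;
	if (d > z->threshold || d < -z->threshold) {
		z->global_count += d;
		d = 0;
	}
	z->cpu_delta[cpu] = d;
}

/* Cheap read: global counter only, clamped at zero. */
static unsigned long model_page_state(const struct model_zone *z)
{
	long x = z->global_count;

	return x < 0 ? 0 : (unsigned long)x;
}

/* Snapshot read: global counter plus every pending per-cpu delta. */
static unsigned long model_page_state_snapshot(const struct model_zone *z)
{
	long x = z->global_count;
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		x += z->cpu_delta[cpu];

	return x < 0 ? 0 : (unsigned long)x;
}
```

With a global count of 100 and a threshold of 10, letting each of the four
modelled CPUs take 10 pages leaves the cheap read at 100 while the snapshot
read reports 60. The cheap read can overestimate by up to (number of CPUs *
threshold) pages, which is the window the patch tries to narrow.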
include/linux/mmzone.h | 13 +++++++++++++
include/linux/vmstat.h | 22 ++++++++++++++++++++++
mm/mmzone.c | 21 +++++++++++++++++++++
mm/page_alloc.c | 4 ++--
mm/vmstat.c | 15 ++++++++++++++-
5 files changed, 72 insertions(+), 3 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -290,6 +290,13 @@ struct zone {
unsigned long watermark[NR_WMARK];
/*
+ * When free pages are below this point, additional steps are taken
+ * when reading the number of free pages to avoid per-cpu counter
+ * drift allowing watermarks to be breached
+ */
+ unsigned long percpu_drift_mark;
+
+ /*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -460,6 +467,12 @@ static inline int zone_is_oom_locked(con
return test_bit(ZONE_OOM_LOCKED, &zone->flags);
}
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
* go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -166,6 +166,28 @@ static inline unsigned long zone_page_st
return x;
}
+/*
+ * More accurate version that also considers the currently pending
+ * deltas. For that we need to loop over all cpus to find the current
+ * deltas. There is no synchronization so the result cannot be
+ * exactly accurate either.
+ */
+static inline unsigned long zone_page_state_snapshot(struct zone *zone,
+ enum zone_stat_item item)
+{
+ long x = atomic_long_read(&zone->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+ int cpu;
+ for_each_online_cpu(cpu)
+ x += zone_pcp(zone, cpu)->vm_stat_diff[item];
+
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
extern unsigned long global_reclaimable_pages(void);
extern unsigned long zone_reclaimable_pages(struct zone *zone);
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pf
return 1;
}
#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+ unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+ /*
+ * While kswapd is awake, it is considered the zone is under some
+ * memory pressure. Under pressure, there is a risk that
+ * per-cpu-counter-drift will allow the min watermark to be breached
+ * potentially causing a live-lock. While kswapd is awake and
+ * free pages are low, get a better estimate for free pages
+ */
+ if (nr_free_pages < zone->percpu_drift_mark &&
+ !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
+ return zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1365,7 +1365,7 @@ int zone_watermark_ok(struct zone *z, in
{
/* free_pages my go negative - that's OK */
long min = mark;
- long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+ long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
int o;
if (alloc_flags & ALLOC_HIGH)
@@ -2250,7 +2250,7 @@ void show_free_areas(void)
" all_unreclaimable? %s"
"\n",
zone->name,
- K(zone_page_state(zone, NR_FREE_PAGES)),
+ K(zone_nr_free_pages(zone)),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
K(high_wmark_pages(zone)),
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -136,10 +136,23 @@ static void refresh_zone_stat_thresholds
int threshold;
for_each_populated_zone(zone) {
+ unsigned long max_drift, tolerate_drift;
+
threshold = calculate_threshold(zone);
for_each_online_cpu(cpu)
zone_pcp(zone, cpu)->stat_threshold = threshold;
+
+ /*
+ * Only set percpu_drift_mark if there is a danger that
+ * NR_FREE_PAGES reports the low watermark is ok when in fact
+ * the min watermark could be breached by an allocation
+ */
+ tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+ max_drift = num_online_cpus() * threshold;
+ if (max_drift > tolerate_drift)
+ zone->percpu_drift_mark = high_wmark_pages(zone) +
+ max_drift;
}
}
@@ -715,7 +728,7 @@ static void zoneinfo_show_print(struct s
"\n scanned %lu"
"\n spanned %lu"
"\n present %lu",
- zone_page_state(zone, NR_FREE_PAGES),
+ zone_nr_free_pages(zone),
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
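Note (illustration only, not part of the patch): the percpu_drift_mark
arithmetic in refresh_zone_stat_thresholds() and the check in
zone_nr_free_pages() can be worked through in isolation. The sketch below
is a userspace model under stated assumptions: model_wmarks,
model_drift_mark and model_needs_snapshot are hypothetical names, the
kswapd state is passed in as a plain boolean, and a return value of 0
stands for a mark that is never set (assuming the field starts out zeroed).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the drift-mark decision; not kernel code. */
struct model_wmarks {
	unsigned long min, low, high;
};

/*
 * Return the drift mark, or 0 when the worst-case per-cpu drift fits
 * within the gap between the low and min watermarks.
 */
static unsigned long model_drift_mark(const struct model_wmarks *w,
				      unsigned long online_cpus,
				      unsigned long threshold)
{
	unsigned long tolerate_drift, max_drift;

	assert(w->min <= w->low && w->low <= w->high);
	tolerate_drift = w->low - w->min;
	max_drift = online_cpus * threshold;
	if (max_drift > tolerate_drift)
		return w->high + max_drift;

	return 0;
}

/* The more expensive snapshot read is only used below the mark while kswapd is awake. */
static bool model_needs_snapshot(unsigned long nr_free,
				 unsigned long drift_mark,
				 bool kswapd_awake)
{
	return nr_free < drift_mark && kswapd_awake;
}
```

With made-up watermarks of min=1000, low=1250 and high=1500 pages, 64 CPUs
at a threshold of 125 give a worst-case drift of 8000 pages, well above the
250 pages of slack, so the mark becomes 1500 + 8000 = 9500. With 4 CPUs at
a threshold of 32 the worst case is 128 pages, which fits in the slack, so
no mark is set and the cheap read is always used.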