From: Mel Gorman <mel@csn.ul.ie>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux Kernel List <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org, Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Minchan Kim <minchan.kim@gmail.com>,
Christoph Lameter <cl@linux-foundation.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Mel Gorman <mel@csn.ul.ie>
Subject: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
Date: Fri, 3 Sep 2010 10:08:45 +0100
Message-ID: <1283504926-2120-3-git-send-email-mel@csn.ul.ie>
In-Reply-To: <1283504926-2120-1-git-send-email-mel@csn.ul.ie>
From: Christoph Lameter <cl@linux.com>
Ordinarily, watermark checks are based on the vmstat NR_FREE_PAGES counter as
reading it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when a delta exceeds a threshold. On systems with many CPUs,
the difference between the estimated and real value of NR_FREE_PAGES can
be very high. If NR_FREE_PAGES is much higher than the real number of free
pages in buddy, the VM can allocate pages below the min watermark, in the
worst case reducing the real number of free pages to zero. Even if the OOM
killer then kills a victim to free memory, no memory may actually be freed
if the victim's exit path requires a new page, resulting in livelock.
This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate reading of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the
watermark being accidentally broken. The estimate is not perfect and may
cause cache line bouncing, but it is expected to be lighter than the IPIs
that would be necessary to continually drain the per-cpu counters while
kswapd is awake.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   13 +++++++++++++
 include/linux/vmstat.h |   22 ++++++++++++++++++++++
 mm/mmzone.c            |   21 +++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 5 files changed, 72 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
unsigned long watermark[NR_WMARK];
/*
+ * When free pages are below this point, additional steps are taken
+ * when reading the number of free pages to avoid per-cpu counter
+ * drift allowing watermarks to be breached
+ */
+ unsigned long percpu_drift_mark;
+
+ /*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
return test_bit(ZONE_OOM_LOCKED, &zone->flags);
}
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
* go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 7f43ccd..eaaea37 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -170,6 +170,28 @@ static inline unsigned long zone_page_state(struct zone *zone,
return x;
}
+/*
+ * More accurate version that also considers the currently pending
+ * deltas. For that we need to loop over all cpus to find the current
+ * deltas. There is no synchronization so the result cannot be
+ * exactly accurate either.
+ */
+static inline unsigned long zone_page_state_snapshot(struct zone *zone,
+ enum zone_stat_item item)
+{
+ long x = atomic_long_read(&zone->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+ int cpu;
+ for_each_online_cpu(cpu)
+ x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
+
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
extern unsigned long global_reclaimable_pages(void);
extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..e35bfb8 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
return 1;
}
#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+ unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+ /*
+ * While kswapd is awake, the zone is considered to be under some
+ * memory pressure. Under pressure, there is a risk that per-cpu
+ * counter drift will allow the min watermark to be breached,
+ * potentially causing a livelock. While kswapd is awake and free
+ * pages are low, get a better estimate of the free pages.
+ */
+ if (nr_free_pages < zone->percpu_drift_mark &&
+ !waitqueue_active(&zone->zone_pgdat->kswapd_wait))
+ return zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
{
/* free_pages my go negative - that's OK */
long min = mark;
- long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+ long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
int o;
if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
" all_unreclaimable? %s"
"\n",
zone->name,
- K(zone_page_state(zone, NR_FREE_PAGES)),
+ K(zone_nr_free_pages(zone)),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
int threshold;
for_each_populated_zone(zone) {
+ unsigned long max_drift, tolerate_drift;
+
threshold = calculate_threshold(zone);
for_each_online_cpu(cpu)
per_cpu_ptr(zone->pageset, cpu)->stat_threshold
= threshold;
+
+ /*
+ * Only set percpu_drift_mark if there is a danger that
+ * NR_FREE_PAGES reports the low watermark is ok when in fact
+ * the min watermark could be breached by an allocation
+ */
+ tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+ max_drift = num_online_cpus() * threshold;
+ if (max_drift > tolerate_drift)
+ zone->percpu_drift_mark = high_wmark_pages(zone) +
+ max_drift;
}
}
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n scanned %lu"
"\n spanned %lu"
"\n present %lu",
- zone_page_state(zone, NR_FREE_PAGES),
+ zone_nr_free_pages(zone),
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
--
1.7.1