From: Andrew Morton <akpm@linux-foundation.org>
To: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Christoph Lameter <cl@linux.com>,
David Rientjes <rientjes@google.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
LKML <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
Date: Thu, 28 Oct 2010 15:04:33 -0700 [thread overview]
Message-ID: <20101028150433.fe4f2d77.akpm@linux-foundation.org> (raw)
In-Reply-To: <1288278816-32667-2-git-send-email-mel@csn.ul.ie>
On Thu, 28 Oct 2010 16:13:35 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
> memory is low] noted that watermarks were based on the vmstat
> NR_FREE_PAGES. To avoid synchronization overhead, these counters are
> maintained on a per-cpu basis and drained both periodically and when a
> threshold is above a threshold. On large CPU systems, the difference
> between the estimate and real value of NR_FREE_PAGES can be very high.
> The system can get into a case where pages are allocated far below the
> min watermark potentially causing livelock issues. The commit solved the
> problem by taking a better reading of NR_FREE_PAGES when memory was low.
>
> Unfortately, as reported by Shaohua Li this accurate reading can consume
> a large amount of CPU time on systems with many sockets due to cache
> line bouncing. This patch takes a different approach. For large machines
> where counter drift might be unsafe and while kswapd is awake, the per-cpu
> thresholds for the target pgdat are reduced to limit the level of drift
> to what should be a safe level. This incurs a performance penalty in heavy
> memory pressure by a factor that depends on the workload and the machine but
> the machine should function correctly without accidentally exhausting all
> memory on a node. There is an additional cost when kswapd wakes and sleeps
> but the event is not expected to be frequent - in Shaohua's test case,
> there was one recorded sleep and wake event at least.
>
> To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> is introduced that takes a more accurate reading of NR_FREE_PAGES when
> called from wakeup_kswapd, when deciding whether it is really safe to go
> back to sleep in sleeping_prematurely() and when deciding if a zone is
> really balanced or not in balance_pgdat(). We are still using an expensive
> function but limiting how often it is called.
Here I go again. I have a feeling that I already said this, but I
can't find versions 2 or 3 in the archives..
Did you evaluate using plain on percpu_counters for this? They won't
solve the performance problem as they're basically the same thing as
these open-coded counters. But they'd reduce the amount of noise and
custom-coded boilerplate in mm/.
> When the test case is reproduced, the time spent in the watermark functions
> is reduced. The following report is on the percentage of time spent
> cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(),
> __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(),
> zone_page_state().
So how did you decide which callsites needed to use the
fast-but-inaccurate zone_watermark_ok() and which needed to use the
slow-but-more-accurate zone_watermark_ok_safe()? (Those functions need
comments explaining the difference btw)
I have a feeling this problem will bite us again perhaps due to those
other callsites, but we haven't found the workload yet.
I don't undestand why restore/reduce_pgdat_percpu_threshold() were
called around that particular sleep in kswapd and nowhere else.
> vanilla 11.6615%
> disable-threshold 0.2584%
Wow. That's 12% of all CPUs? How many CPUs and what workload?
>
> ...
>
> if (!sleeping_prematurely(pgdat, order, remaining)) {
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> + restore_pgdat_percpu_threshold(pgdat);
> schedule();
> + reduce_pgdat_percpu_threshold(pgdat);
We could do with some code comments here explaining what's going on.
> } else {
> if (remaining)
> count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>
> ...
>
> +static int calculate_pressure_threshold(struct zone *zone)
> +{
> + int threshold;
> + int watermark_distance;
> +
> + /*
> + * As vmstats are not up to date, there is drift between the estimated
> + * and real values. For high thresholds and a high number of CPUs, it
> + * is possible for the min watermark to be breached while the estimated
> + * value looks fine. The pressure threshold is a reduced value such
> + * that even the maximum amount of drift will not accidentally breach
> + * the min watermark
> + */
> + watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
> + threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> +
> + /*
> + * Maximum threshold is 125
Reasoning?
> + */
> + threshold = min(125, threshold);
> +
> + return threshold;
> +}
> +
> static int calculate_threshold(struct zone *zone)
> {
> int threshold;
>
> ...
>
> +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> +{
> + struct zone *zone;
> + int cpu;
> + int threshold;
> + int i;
> +
> + get_online_cpus();
> + for (i = 0; i < pgdat->nr_zones; i++) {
> + zone = &pgdat->node_zones[i];
> + if (!zone->percpu_drift_mark)
> + continue;
> +
> + threshold = calculate_pressure_threshold(zone);
> + for_each_online_cpu(cpu)
> + per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> + = threshold;
> + }
> + put_online_cpus();
> +}
> +
> +void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
> +{
> + struct zone *zone;
> + int cpu;
> + int threshold;
> + int i;
> +
> + get_online_cpus();
> + for (i = 0; i < pgdat->nr_zones; i++) {
> + zone = &pgdat->node_zones[i];
> + if (!zone->percpu_drift_mark)
> + continue;
> +
> + threshold = calculate_threshold(zone);
> + for_each_online_cpu(cpu)
> + per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> + = threshold;
> + }
> + put_online_cpus();
> +}
Given that ->stat_threshold is the same for each CPU, why store it for
each CPU at all? Why not put it in the zone and eliminate the inner
loop?
next prev parent reply other threads:[~2010-10-28 22:05 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-10-28 15:13 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions V4 Mel Gorman
2010-10-28 15:13 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman
2010-10-28 22:04 ` Andrew Morton [this message]
2010-10-29 10:12 ` Mel Gorman
2010-10-29 19:40 ` Andrew Morton
2010-11-02 0:53 ` Shaohua Li
2010-11-09 11:33 ` Mel Gorman
2010-11-09 16:48 ` Christoph Lameter
2010-10-29 14:58 ` Christoph Lameter
2010-10-29 18:25 ` Andrew Morton
2010-10-29 19:33 ` Christoph Lameter
2010-10-28 15:13 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman
2010-10-28 22:09 ` Andrew Morton
2010-10-29 10:17 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2010-10-27 8:47 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions Mel Gorman
2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman
2010-10-27 20:16 ` Christoph Lameter
2010-10-28 1:09 ` KAMEZAWA Hiroyuki
2010-10-28 9:49 ` Mel Gorman
2010-10-28 9:58 ` KAMEZAWA Hiroyuki
2010-11-01 7:06 ` KOSAKI Motohiro
2010-11-26 16:06 ` Kyle McMartin
2010-11-29 9:56 ` Mel Gorman
2010-11-29 13:16 ` Kyle McMartin
2010-11-29 15:08 ` Mel Gorman
2010-11-29 15:22 ` Kyle McMartin
2010-11-29 15:26 ` Kyle McMartin
2010-11-29 15:58 ` Mel Gorman
2010-12-23 22:18 ` David Rientjes
2010-12-23 22:35 ` Andrew Morton
2010-12-23 23:00 ` Kyle McMartin
2010-12-23 23:07 ` David Rientjes
2010-12-23 23:17 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101028150433.fe4f2d77.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=cl@linux.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mel@csn.ul.ie \
--cc=rientjes@google.com \
--cc=shaohua.li@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox