From: Mel Gorman <mel@csn.ul.ie>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org, Rik van Riel <riel@redhat.com>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
Date: Tue, 17 Aug 2010 11:16:55 +0100
Message-ID: <20100817101655.GN19797@csn.ul.ie>
In-Reply-To: <20100816160623.GB15103@cmpxchg.org>

On Mon, Aug 16, 2010 at 06:06:23PM +0200, Johannes Weiner wrote:
> [npiggin@suse.de bounces, switched to yahoo address]
> 
> On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> > On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > > it is cheaper than scanning a number of lists. To avoid synchronization
> > > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > > periodically and when the delta is above a threshold. On large CPU systems,
> > > the difference between the estimated and real value of NR_FREE_PAGES can be
> > > very high. If the system is under both load and low memory, it's possible
> > > for watermarks to be breached. In extreme cases, the number of free pages
> > > can drop to 0 leading to the possibility of system livelock.
> > > 
> > > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > > and may result in cache line bounces but is expected to be lighter than the
> > > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > > is awake.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > And the second I sent this, I realised I had sent a slightly old version
> > that missed a compile-fix :(
> > 
> > ==== CUT HERE ====
> > mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> > 
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
> > 
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> [...]
> 
> > --- a/mm/mmzone.c
> > +++ b/mm/mmzone.c
> > @@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
> >  	return 1;
> >  }
> >  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> > +
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +	/*
> > +	 * While kswapd is awake, it is considered the zone is under some
> > +	 * memory pressure. Under pressure, there is a risk that
> > +	 * er-cpu-counter-drift will allow the min watermark to be breached
> 
> Missing `p'.
> 

D'oh. Fixed

> > +	 * potentially causing a live-lock. While kswapd is awake and
> > +	 * free pages are low, get a better estimate for free pages
> > +	 */
> > +	if (nr_free_pages < zone->percpu_drift_mark &&
> > +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > +		int cpu;
> > +
> > +		for_each_online_cpu(cpu) {
> > +			struct per_cpu_pageset *pset;
> > +
> > +			pset = per_cpu_ptr(zone->pageset, cpu);
> > +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> > +		}
> > +	}
> > +
> > +	return nr_free_pages;
> > +}
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c2407a4..67a2ed0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >  {
> >  	/* free_pages my go negative - that's OK */
> >  	long min = mark;
> > -	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
> > +	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
> >  	int o;
> >  
> >  	if (alloc_flags & ALLOC_HIGH)
> > @@ -2413,7 +2413,7 @@ void show_free_areas(void)
> >  			" all_unreclaimable? %s"
> >  			"\n",
> >  			zone->name,
> > -			K(zone_page_state(zone, NR_FREE_PAGES)),
> > +			K(zone_nr_free_pages(zone)),
> >  			K(min_wmark_pages(zone)),
> >  			K(low_wmark_pages(zone)),
> >  			K(high_wmark_pages(zone)),
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 7759941..c95a159 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> >  		for_each_online_cpu(cpu)
> >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >  							= threshold;
> > +
> > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > +					num_online_cpus() * threshold;
> >  	}
> >  }
> 
> Hm, this one I don't quite get (might be the jetlag, though): we have
> _at least_ NR_FREE_PAGES free pages, there may just be more lurking in
> the pcp counters.
> 

Well, the drift can be in either direction because it can be caused by pages
being either freed or allocated. For example, it could be something like

NR_FREE_PAGES    CPU 0    CPU 1    Actual Free
128              -32      +64      160

That is because CPU 0 was allocating pages while CPU 1 was freeing them, but
that is not what is important here. At any given time, NR_FREE_PAGES can be
wrong by as much as

num_online_cpus * (threshold - 1)
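
To make the arithmetic concrete, here is a minimal userspace sketch. It is
not kernel code: the names actual_free() and max_drift() are made up for
illustration, and the per-CPU deltas are modelled as a plain array rather
than the real per_cpu_pageset structures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: the true number of free pages is the global
 * counter plus every per-CPU delta that has not been folded in yet.
 */
long actual_free(long global_count, const long *cpu_delta, size_t ncpus)
{
	long total = global_count;
	size_t i;

	assert(cpu_delta != NULL || ncpus == 0);

	for (i = 0; i < ncpus; i++)
		total += cpu_delta[i];

	return total;
}

/*
 * Worst-case error of the global counter, using the bound quoted
 * above: each CPU may hold back up to (threshold - 1) pages.
 */
long max_drift(long ncpus, long threshold)
{
	return ncpus * (threshold - 1);
}
```

With the numbers from the table, actual_free(128, {-32, +64}, 2) gives 160.
As an assumed example of scale, 4 CPUs with a threshold of 125 could hide up
to 496 pages from the global counter.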

As kswapd goes back to sleep when the high watermark is reached, it's important
that it has actually reached the watermark before sleeping.  Similarly,
if an allocator is checking the low watermark, it needs an accurate count.
Hence a more careful accounting for NR_FREE_PAGES should happen when the
number of free pages is below

high_watermark + (num_online_cpus * (threshold - 1))

Only checking when kswapd is awake still leaves a window between the low
and min watermark in which we could breach the watermark, but I expect that
can happen for at most one allocation. After that, kswapd wakes
and the count becomes accurate again.
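
As a rough illustration of the logic described here, the following is a
userspace model of the check, not the patch itself. struct zone_model and
its field names are hypothetical, and kswapd_awake is a plain flag standing
in for the !waitqueue_active() test on kswapd_wait. The drift mark follows
the patch, which uses num_online_cpus() * threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct zone used by the patch */
struct zone_model {
	long nr_free_pages;	/* global counter, may lag behind reality */
	long high_wmark;	/* high watermark, in pages */
	long stat_threshold;	/* per-CPU delta threshold */
	bool kswapd_awake;	/* models !waitqueue_active(kswapd_wait) */
	size_t ncpus;		/* number of online CPUs */
	const long *cpu_delta;	/* per-CPU deltas not yet folded in */
};

/* Below this mark the global counter is too coarse to be trusted */
long drift_mark(const struct zone_model *z)
{
	return z->high_wmark + (long)z->ncpus * z->stat_threshold;
}

/* Cheap global read normally; sum the deltas only under pressure */
long model_nr_free_pages(const struct zone_model *z)
{
	long nr = z->nr_free_pages;
	size_t i;

	assert(z->cpu_delta != NULL || z->ncpus == 0);

	if (nr < drift_mark(z) && z->kswapd_awake) {
		for (i = 0; i < z->ncpus; i++)
			nr += z->cpu_delta[i];
	}

	return nr;
}
```

Taking the numbers from the table above with an assumed high watermark of
100 and threshold of 32, the drift mark is 164. The model returns 160 while
kswapd is awake and the raw 128 when it is asleep. Once the global counter
is at or above the mark, the per-CPU walk is skipped entirely.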

> So shouldn't we only collect the pcp deltas in case the high watermark
> is breached?  Above this point, we should be fine or better, no?
> 

Is that not what is happening in zone_nr_free_pages with this check?

        /*
         * While kswapd is awake, it is considered the zone is under some
         * memory pressure. Under pressure, there is a risk that
         * per-cpu-counter-drift will allow the min watermark to be breached
         * potentially causing a live-lock. While kswapd is awake and
         * free pages are low, get a better estimate for free pages
         */
        if (nr_free_pages < zone->percpu_drift_mark &&
                        !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {

Maybe I'm misunderstanding your question.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

