From: Mel Gorman <mel@csn.ul.ie>
To: Nick Piggin <npiggin@suse.de>
Cc: linux-mm@kvack.org,
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>,
Chris Mason <chris.mason@oracle.com>,
Jens Axboe <jens.axboe@oracle.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
Date: Tue, 9 Mar 2010 17:35:36 +0000
Message-ID: <20100309173535.GI4883@csn.ul.ie>
In-Reply-To: <20100309150332.GP8653@laptop>

On Wed, Mar 10, 2010 at 02:03:32AM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> > On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > > to wait for IO congestion to clear or a timeout. This is not as sensible
> > > > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > > > even congested as the pressure could have been due to a large number of
> > > > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > > >
> > > > At the point of congestion_wait(), the allocator is struggling to get the
> > > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > > kswapd brings the zone over the low watermark, whichever happens first.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > > include/linux/mmzone.h | 3 ++
> > > > mm/internal.h | 4 +++
> > > > mm/mmzone.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
> > > > mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++++++++++++----
> > > > mm/vmscan.c | 2 +
> > > > 5 files changed, 101 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 30fe668..72465c1 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -398,6 +398,9 @@ struct zone {
> > > > unsigned long wait_table_hash_nr_entries;
> > > > unsigned long wait_table_bits;
> > > >
> > > > + /* queue for processes waiting for pressure to relieve */
> > > > + wait_queue_head_t *pressure_wq;
> > >
> > > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > > have them only waiting for one. I wonder if we shouldn't wait for
> > > more zones?
> > >
> >
> > It's waiting for the zone that is most desirable. If that zone's watermarks
> > are met, why would it wait on any other zone?
>
> I mean the other way around. If that zone's watermarks are not met, then
> why shouldn't it be woken up by other zones reaching their watermarks?
>
Doing it requires moving to a per-node structure or a global queue. I'd rather
not add hot lines to the node structure (and the associated lookup cost in
the free path) if I can help it. A global queue would work on smaller machines
but I'd be worried about thundering herd problems on larger machines. I know
congestion_wait is already a global queue but IO is a relatively slow event.
Potentially the wakeups from this queue are a lot faster.

Should I just move to a global queue as a starting point and see what
problems are caused later?
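
To make the discussion concrete, a global variant would look something along
these lines. This is an untested sketch rather than what the posted patch
does, and the names (watermark_wq, check_watermark_wq) are illustrative only:

/* Untested sketch: one global queue instead of a per-zone pressure_wq */
static DECLARE_WAIT_QUEUE_HEAD(watermark_wq);

/*
 * Called from the free path for a zone that has just had pages freed.
 * Waking when any zone reaches its low watermark means the woken tasks
 * must recheck their own preferred zones, hence the thundering herd
 * concern above.
 */
static void check_watermark_wq(struct zone *zone)
{
	if (!waitqueue_active(&watermark_wq))
		return;

	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
		wake_up_interruptible(&watermark_wq);
}
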
> > If you mean that it would
> > wait for any of the eligible zones to meet their watermark, it might have
> > an impact on NUMA locality but it could be managed. It might make sense to
> > wait on a node-based queue rather than a zone if this behaviour was desirable.
> >
> > > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > > cause a big scalability problem. Would it be better to have a global
> > > waitqueue for this too?
> > >
> >
> > Considering that the congestion wait queue is for a relatively slow operation,
> > it would be surprising if lock scalability was noticeable. Potentially the
> > pressure_wq involves no IO so scalability may be noticeable there.
> >
> > What would the advantages of a global waitqueue be? Obviously, a smaller
> > memory footprint. A second potential advantage is that on wakeup, it
> > could check the watermarks on multiple zones which might reduce
> > latencies in some cases. Can you think of more compelling reasons?
>
> Your 2nd advantage is what I mean above.
>
>
> > >
> > > > +void check_zone_pressure(struct zone *zone)
> > >
> > > I don't really like the name pressure. We use that term for the reclaim
> > > pressure wheras we're just checking watermarks here (actual pressure
> > > could be anything).
> > >
> >
> > pressure_wq => watermark_wq
> > check_zone_pressure => check_watermark_wq
> >
> > ?
>
> Thanks.
>
> >
> > >
> > > > +{
> > > > + /* If no process is waiting, nothing to do */
> > > > + if (!waitqueue_active(zone->pressure_wq))
> > > > + return;
> > > > +
> > > > + /* Check if the high watermark is ok for order 0 */
> > > > + if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > > + wake_up_interruptible(zone->pressure_wq);
> > > > +}
> > >
> > > If you were to do this under the zone lock (in your subsequent patch),
> > > then it could avoid races. I would suggest doing it all as a single
> > > patch and not doing the pressure checks in reclaim at all.
> > >
> >
> > That is reasonable. I've already dropped the checks in reclaim because as you
> > say, if the free path check is cheap enough, it's also sufficient. Checking
> > in the reclaim paths as well is redundant.
> >
> > I'll move the call to check_zone_pressure() within the zone lock to avoid
> > races.
> >
> > > If you are missing anything, then that needs to be explained and fixed
> > > rather than just adding extra checks.
> > >
> > > > +
> > > > +/**
> > > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > > + * @zone: The zone that is expected to be under pressure
> > > > + * @order: The order the caller is waiting on pages for
> > > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > > + *
> > > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > > + * the zone above the high watermark
> > > > + */
> > > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > > +{
> > > > + long ret;
> > > > + DEFINE_WAIT(wait);
> > > > +
> > > > +wait_again:
> > > > + prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> > >
> > > I guess to do it without races you need to check watermark here.
> > > And possibly some barriers if it is done without zone->lock.
> > >
> >
> > As watermark checks are already done without the zone->lock and without
> > barriers, why are they needed here? Yes, there are small races. For
> > example, it's possible to hit a window where pages were freed between the
> > watermarks being checked and going to sleep here, but that is similar to
> > the current behaviour.
>
> Well with the check in free_pages_bulk then doing another check here
> before the wait should be able to close all lost-wakeup races. I agree
> it is pretty fuzzy heuristics anyway, so these races don't *really*
> matter a lot. But it seems easy to close the races, so I don't see
> why not.
>
I agree that the window is unnecessarily large. I'll tighten it.
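
Something along these lines is what I have in mind (untested sketch against
the patch above; it also anticipates dropping the post-wakeup retry that is
discussed further down):

long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);

	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);

	/*
	 * Recheck the watermark now that we are on the waitqueue. A wakeup
	 * from the free path after this point is not lost; it is caught by
	 * io_schedule_timeout()
	 */
	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
		finish_wait(zone->pressure_wq, &wait);
		return timeout;
	}

	/* Accounted as IO wait so it is visible in vmstat */
	ret = io_schedule_timeout(timeout);
	finish_wait(zone->pressure_wq, &wait);

	return ret;
}
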
>
> > > > +
> > > > + /*
> > > > + * The use of io_schedule_timeout() here means that it gets
> > > > + * accounted for as IO waiting. This may or may not be the case
> > > > + * but at least this way it gets picked up by vmstat
> > > > + */
> > > > + ret = io_schedule_timeout(timeout);
> > > > + finish_wait(zone->pressure_wq, &wait);
> > > > +
> > > > + /* If woken early, check watermarks before continuing */
> > > > + if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > > + timeout = ret;
> > > > + goto wait_again;
> > > > + }
> > >
> > > And then I don't know if we'd really need the extra check here. Might as
> > > well just let the allocator try again and avoid the code?
> > >
> >
> > I was considering multiple processes being woken up and racing with each
> > other. I can drop this check though. The worst that happens is multiple
> > processes wake and walk the full zonelist. Some will succeed and others
> > will go back to sleep.
>
> Yep. And it doesn't really solve that race either because the zone
> might subsequently go below the watermark.
>
True. In theory, the same sort of races currently apply with
congestion_wait() but that's just an excuse. There is a strong
possibility we could behave better with respect to watermarks.
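
For completeness, the caller in the allocator slow path is intended to be a
drop-in for congestion_wait(); roughly the following, written from memory
rather than copied from the exact hunk in the posted diff:

	/* Check if we should retry the allocation */
	pages_reclaimed += did_some_progress;
	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		/* Back off until watermarks recover, not until IO clears */
		zonepressure_wait(preferred_zone, order, HZ/50);
		goto rebalance;
	}
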
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab