public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: vmscan: Do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL
Date: Wed, 23 Apr 2014 14:52:10 +0100	[thread overview]
Message-ID: <20140423135210.GK23991@suse.de> (raw)
In-Reply-To: <20140422123149.d406e5cbef5c01eb6dc5c89b@linux-foundation.org>

On Tue, Apr 22, 2014 at 12:31:49PM -0700, Andrew Morton wrote:
> On Tue, 22 Apr 2014 09:38:52 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > throttle_direct_reclaim() is meant to trigger during swap-over-network
> > during which the min watermark is treated as a pfmemalloc reserve. It
> > throttes on the first node in the zonelist but this is flawed.
> > 
> > On a NUMA machine running a 32-bit kernel (I know) allocation requests
> > freom CPUs on node 1 would detect no pfmemalloc reserves and the process
> > gets throttled. This patch adjusts throttling of direct reclaim to throttle
> > based on the first node in the zonelist that has a usable ZONE_NORMAL or
> > lower zone.
> 
> I'm unable to determine from the above whether we should backport this
> fix.  Please don't forget to describe the end-user visible effects of
> a bug when that isn't obvious.  
> 

The user-visible impact is that a process running on CPU whose local
memory node has no ZONE_NORMAL will stall for prolonged periods of time,
possibly indefintely. This is due to throttle_direct_reclaim thinking the
pfmemalloc reserves are depleted when in fact they don't exist on that node.

Strictly speaking this is stable material. I should have flagged it as
such but hadn't as I was treating 32-bit kernels running on NUMA hardware
as being a poor choice.

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2507,10 +2507,17 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >  
> >  	for (i = 0; i <= ZONE_NORMAL; i++) {
> >  		zone = &pgdat->node_zones[i];
> > +		if (!populated_zone(zone))
> > +			continue;
> 
> What's this?  Performance tweak?  Or does min_wmark_pages() return
> non-zero for an unpopulated zone, which seems odd.
> 

Minor performance tweak. It's a force of habit to skip populated zones
when doing a zone walk like this.

> >  		pfmemalloc_reserve += min_wmark_pages(zone);
> >  		free_pages += zone_page_state(zone, NR_FREE_PAGES);
> >  	}
> >  
> > +	/* If there are no reserves (unexpected config) then do not throttle */
> > +	if (!pfmemalloc_reserve)
> > +		return true;
> > +
> >  	wmark_ok = free_pages > pfmemalloc_reserve / 2;
> >  
> >  	/* kswapd must be awake if processes are being throttled */
> > @@ -2535,9 +2542,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >  static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  					nodemask_t *nodemask)
> >  {
> > +	struct zoneref *z;
> >  	struct zone *zone;
> > -	int high_zoneidx = gfp_zone(gfp_mask);
> > -	pg_data_t *pgdat;
> > +	pg_data_t *pgdat = NULL;
> >  
> >  	/*
> >  	 * Kernel threads should not be throttled as they may be indirectly
> > @@ -2556,10 +2563,24 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> >  	if (fatal_signal_pending(current))
> >  		goto out;
> >  
> > -	/* Check if the pfmemalloc reserves are ok */
> > -	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> > -	pgdat = zone->zone_pgdat;
> > -	if (pfmemalloc_watermark_ok(pgdat))
> > +	/*
> > +	 * Check if the pfmemalloc reserves are ok by finding the first node
> > +	 * with a usable ZONE_NORMAL or lower zone
> > +	 */
> 
> That comment tells us what the code does but not why it does it.
> 
> - Why do we ignore zones >= ZONE_NORMAL?
> 
> - Why do we throttle when there may be as-yet-unexamined nodes which
>   have reclaimable pages?
> 

/*
 * Check if the pfmemalloc reserves are ok by finding the first node
 * with a usable ZONE_NORMAL or lower zone. The expectation is that
 * GFP_KERNEL will be required for allocating network buffers when
 * swapping over the network so ZONE_HIGHMEM is unusable.
 * 
 * Throttling is based on the first usable node and throttled processes
 * wait on a queue until kswapd makes progress and wakes them. There
 * is an affinity then between processes waking up and where reclaim
 * progress has been made assuming the process wakes on the same node.
 * More importantly, processes running on remote nodes will not compete
 * for remote pfmemalloc reserves and processes on different nodes
 * should make reasonable progress.
 */

?

> 
> > +        for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > +                                        gfp_mask, nodemask) {
> 
> Those two lines have spaces-instead-of-tabs.
> 

Sorry, that was careless.

-- 
Mel Gorman
SUSE Labs

      reply	other threads:[~2014-04-23 13:52 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-22  8:38 [PATCH] mm: vmscan: Do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL Mel Gorman
2014-04-22 19:31 ` Andrew Morton
2014-04-23 13:52   ` Mel Gorman [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140423135210.GK23991@suse.de \
    --to=mgorman@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox