From: Mel Gorman <mgorman@suse.de>
To: NeilBrown <neilb@suse.de>
Cc: Linux-MM <linux-mm@kvack.org>,
Linux-Netdev <netdev@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
David Miller <davem@davemloft.net>,
Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH 02/13] mm: sl[au]b: Add knowledge of PFMEMALLOC reserve pages
Date: Thu, 28 Apr 2011 10:46:13 +0100
Message-ID: <20110428094613.GN4658@suse.de>
In-Reply-To: <20110428092110.608eb354@notabene.brown>

On Thu, Apr 28, 2011 at 09:21:10AM +1000, NeilBrown wrote:
> On Tue, 26 Apr 2011 14:59:40 +0100 Mel Gorman <mgorman@suse.de> wrote:
>
> > On Tue, Apr 26, 2011 at 09:37:58PM +1000, NeilBrown wrote:
> > > On Tue, 26 Apr 2011 08:36:43 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > > + /*
> > > > + * If there are full empty slabs and we were not forced to
> > > > + * allocate a slab, mark this one !pfmemalloc
> > > > + */
> > > > + l3 = cachep->nodelists[numa_mem_id()];
> > > > + if (!list_empty(&l3->slabs_free) && force_refill) {
> > > > + struct slab *slabp = virt_to_slab(objp);
> > > > + slabp->pfmemalloc = false;
> > > > + clear_obj_pfmemalloc(&objp);
> > > > + check_ac_pfmemalloc(cachep, ac);
> > > > + return objp;
> > > > + }
> > >
> > > The comment doesn't match the code. I think you need to remove the words
> > > "full" and "not", assuming the code is correct, which it probably is...
> > >
> >
> > I'll fix up the comment, you're right, it's confusing.
> >
> > > But the code seems to be much more complex than Peter's original, and I don't
> > > see the gain.
> > >
> >
> > You're right, it is more complex.
> >
> > > Peter's code had only one 'reserved' flag for each kmem_cache.
> >
> > The reserve was set in a per-cpu structure so there was a "lag" time
> > before that information was available to other CPUs. Fine on smaller
> > machines but a bit more of a problem today.
> >
> > > You seem to
> > > have one for every slab. I don't see the point.
> > > It is true that yours is in some sense more fair - but I'm not sure the
> > > complexity is worth it.
> > >
> >
> > More fairness was one of the objectives.
> >
> > > Was there some particular reason you made the change?
> > >
> >
> > This version survives under considerably more stress than Peter's
> > original version did without requiring the additional complexity of
> > memory reserves.
> >
>
> That is certainly a very compelling argument .... but I still don't get why.
> I'm sorry if I'm being dense, but I still don't see why the complexity buys
> us better stability and I really would like to understand.
>
> You don't seem to need the same complexity for SLUB with the justification
> of "SLUB generally maintaining smaller lists than SLAB".
>
It is an educated guess that the length of the lists was what was
relevant. Even without these patches, SLUB is harder to lock up (minutes
rather than seconds to halt the machine) than SLAB, and I assumed that
was because fewer pages are pinned on per-CPU lists with SLUB.

> Presumably these are per-CPU lists of free objects or slabs?

The per-cpu lists are of objects (the entry[] array in struct
array_cache). The slab management structure is looked up much less
frequently (when a block of objects is being freed, for example).
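
For reference, here is a rough sketch of that structure as it looks in
mm/slab.c, with the flag this series adds. This is from memory rather
than copied from the patch, so the exact layout may differ:

/* Per-cpu cache of free objects in mm/slab.c (sketch, not patch code) */
struct array_cache {
	unsigned int avail;		/* objects currently in entry[] */
	unsigned int limit;		/* maximum objects entry[] may hold */
	unsigned int batchcount;	/* objects moved per refill/flush */
	unsigned int touched;		/* recently used, for shrinking */
	bool pfmemalloc;		/* added: may hold reserve objects */
	spinlock_t lock;
	void *entry[];			/* the per-cpu list of free objects */
};
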
> If the things
> on those lists could be used by anyone long lists couldn't hurt.

Long lists can hurt in a few ways, but I believe the two reasons relevant
to this series are:

1. A remote CPU could be holding the object on its free list
2. Multiple unnecessary caches could be pinning free memory, with the
   shrinkers not triggering because everything is waiting on IO to complete

> So the problem must be that the lists get long while the array_cache is still
> marked as 'pfmemalloc'

Marking the whole array_cache pfmemalloc means the list can hold a mix
of pfmemalloc and !pfmemalloc objects with only very coarse control over
who is accessing them.

For example, in Peter's patches, CPU A could allocate from pfmemalloc
reserves and mark its array_cache appropriately. CPU B could then be
freeing the objects without having its array_cache marked. PFMEMALLOC
objects are now available for !PFMEMALLOC uses on that CPU and we dip
further into our reserves. This was managed by the memory reservation
patches, which meant that dipping further into the reserves was not
that much of a problem.
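
This series avoids that by tagging the object pointer itself at free
time, so a reserve-backed object stays identifiable no matter whose
array_cache it lands on. Roughly like the following sketch; this is
not the patch code, and ac_put_obj() is a made-up name for the
free-side hook, but it assumes the clear_obj_pfmemalloc() counterpart
visible in the hunks above encodes the state in the pointer's low bit:

/* Assumption: bit 0 of an object pointer is free due to alignment
 * and can carry the pfmemalloc tag. */
#define SLAB_OBJ_PFMEMALLOC	1UL

static inline void set_obj_pfmemalloc(void **objp)
{
	*objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC);
}

/* Hypothetical free-side hook: tag the object if its slab came from
 * the pfmemalloc reserves before parking it on the per-cpu list, so
 * another CPU cannot hand it out for an ordinary allocation. */
static void ac_put_obj(struct array_cache *ac, void *objp)
{
	if (unlikely(virt_to_slab(objp)->pfmemalloc))
		set_obj_pfmemalloc(&objp);
	ac->entry[ac->avail++] = objp;
}
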
> (or 'reserve' in Peter's patches).
>
> Is that the problem? That reserve memory gets locked up in SLAB freelists?
> If so - would that be more easily addressed by effectively reducing the
> 'batching' when the array_cache had dipped into reserves, so slabs are
> returned to the VM more promptly?
>
Pinning objects on long lists is one problem, but I don't think it's the
most important one. Indications were that the big problem was
insufficient control over who was accessing objects belonging to
slab pages allocated from the pfmemalloc reserves.

> Probably related, now that you've fixed the comment here (thanks):
>
> + /*
> + * If there are empty slabs on the slabs_free list and we are
> + * being forced to refill the cache, mark this one !pfmemalloc.
> + */
> + l3 = cachep->nodelists[numa_mem_id()];
> + if (!list_empty(&l3->slabs_free) && force_refill) {
> + struct slab *slabp = virt_to_slab(objp);
> + slabp->pfmemalloc = false;
> + clear_obj_pfmemalloc(&objp);
> + check_ac_pfmemalloc(cachep, ac);
> + return objp;
> + }
>
> I'm trying to understand it...
Thanks :)
> The context is that a non-MEMALLOC allocation is happening and everything on
> the free lists is reserved for MEMALLOC allocations.
Yep.
> So if in that case there is a completely free slab on the free list we decide
> that it is OK to mark the current slab as non-MEMALLOC.
Yep.
> The logic seems to be that we could just release that free slab to the VM,
> then an alloc_page would be able to get it back.

We could, but it would be slower and would require some sort of retry
path in a hot code path to catch the situation.

> But if we are still well
> below the reserve watermark, then there might be some other allocation that
> is more deserving of the page, and we shouldn't just assume we can take
> it without actually calling into alloc_pages to check that we are no longer
> running on reserves.
>
> So this looks like an optimisation that is wrong.

The problem being addressed here is that pfmemalloc slabs have to be
made available for general use at some point, or slabs can grow
artificially large and waste memory. I didn't do a free-and-retry
path because it would be more expensive and I wanted to avoid hurting
the common slab paths.

The assumption is that if there are free slab pages while there are
pfmemalloc slabs in use, then we cannot be under that much pressure
and the free slab page is safe to use. If we go well below the reserve
watermark, as you are concerned about, the throttle logic will trigger
and we'll at least identify that the situation occurred without the
system crashing.
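
To illustrate the downgrade path, check_ac_pfmemalloc() from the hunk
above could be as simple as the following sketch. This is the idea
rather than the patch code, and is_obj_pfmemalloc() is an assumed
helper that tests the tag bit on an object pointer:

static void check_ac_pfmemalloc(struct kmem_cache *cachep,
				struct array_cache *ac)
{
	int i;

	if (!ac->pfmemalloc)
		return;

	for (i = 0; i < ac->avail; i++)
		if (is_obj_pfmemalloc(ac->entry[i]))
			return;	/* still holding reserve objects */

	/* no reserve objects left; !pfmemalloc users may allocate again */
	ac->pfmemalloc = false;
}
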
> BTW,
>
>
> + /* Record if ALLOC_PFMEMALLOC was set when allocating the slab */
> + if (pfmemalloc) {
> + struct array_cache *ac = cpu_cache_get(cachep);
> + slabp->pfmemalloc = true;
> + ac->pfmemalloc = 1;
> + }
> +
>
> I think that "= 1" should be "= true". :-)
>
/me slaps self
Thanks
--
Mel Gorman
SUSE Labs