From: Vladimir Davydov <vdavydov@parallels.com>
To: Tejun Heo <tj@kernel.org>, Andrew Morton <akpm@linux-foundation.org>
Cc: hannes@cmpxchg.org, mhocko@kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@fb.com,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Christoph Lameter <cl@linux.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
Date: Fri, 28 Aug 2015 23:32:31 +0300
Message-ID: <20150828203231.GL9610@esperanza>
In-Reply-To: <20150828164819.GL26785@mtj.duckdns.org>

On Fri, Aug 28, 2015 at 12:48:19PM -0400, Tejun Heo wrote:
...
> > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> > >   skipped.  If a process performs only speculative allocations, it can
> > >   blow way past the high limit.  This is actually easily reproducible
> > >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> > >   allocations first, so as long as there's memory which can be
> > >   consumed without blocking, it can keep allocating memory regardless
> > >   of the high limit.
> > 
> > I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> > in a row - they should still alternate with normal __GFP_WAIT
> > allocations. Yes, that means we can breach memory.high threshold for a
> > short period of time, but it isn't a hard limit, so it looks perfectly
> > fine to me.
> > 
> > I tried to run `find /` over ext4 in a cgroup with memory.high set to
> > 32M and kmem accounting enabled. With such a setup memory.current never
> > got higher than 33152K, which is only 384K greater than the memory.high.
> > Which FS did you use?
> 
> ext4.  Here, it goes onto happily consuming hundreds of megabytes with
> limit set at 32M.  We have quite a few places where !__GFP_WAIT
> allocations are performed speculatively in hot paths with fallback
> slow paths, so this is bound to happen somewhere.

What kind of workload should it be then? `find` will constantly invoke
d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
to perform reclaim...

OK, I tried to reproduce the issue on the latest mainline kernel and ...
succeeded - memory.current did occasionally jump up to ~55M although
memory.high was set to 32M. Hmm, strange... I started to investigate,
printed stack traces, and found that we don't invoke memcg reclaim on
normal GFP_KERNEL allocations! How is that? The thing is, there was a
commit that made SLUB (not VFS or any other kmem user, but core SLUB)
try to allocate high-order slab pages w/o __GFP_WAIT for performance
reasons. That broke the kmemcg case. Here it is:

commit 6af3142bed1f520b90f4cdb6cd10bbd16906ce9a
Author: Joonsoo Kim <js1304@gmail.com>
Date:   Tue Aug 25 00:03:52 2015 +0000

    mm/slub: don't wait for high-order page allocation

I suspect your kernel has this commit included, because w/o it I haven't
managed to catch anything nearly as bad as you describe: the memory.high
excess reached 1-2 MB at most, but never "hundreds of megabytes". If so,
we'd better fix that instead. Actually, it's worth fixing anyway. What
about the patch below?
---
From: Vladimir Davydov <vdavydov@parallels.com>
Date: Fri, 28 Aug 2015 23:17:19 +0300
Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page
 allocation

Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high-order slab pages
w/o __GFP_WAIT in order to avoid invoking reclaim/compaction when we can
fall back on low-order pages. However, it broke the kmemcg/memory.high
logic. The latter works as a soft limit: an allocation won't fail if it
is breached, but we call direct reclaim to compensate for the excess.
W/o __GFP_WAIT we can't invoke the reclaimer and therefore we will just
go on, exceeding memory.high more and more until a normal __GFP_WAIT
allocation is issued.

Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high-order page allocations w/o any
performance impact. So let's fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.

Fixes: 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation")
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..1b9dbad40272 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,9 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	if (memcg_charge_slab(s, flags, order))
 		return NULL;
 
+	if ((flags & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
+		flags = (flags | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
+
 	if (node == NUMA_NO_NODE)
 		page = alloc_pages(flags, order);
 	else
@@ -1364,8 +1367,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * so we fall-back to the minimum order allocation.
 	 */
 	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
-	if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
-		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
 
 	page = alloc_slab_page(s, alloc_gfp, node, oo);
 	if (unlikely(!page)) {
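
For illustration only, the effect of moving the flag masking can be
sketched as a small userspace model. This is not kernel code: the
SIM_* flag values, the sim_* helpers, the 4K page size and the
"reclaim trims the whole excess" rule are all assumptions made up for
the sketch. It only shows why the charge has to see the wait bit for a
soft limit such as memory.high to be enforced.

```c
#include <assert.h>

/* Hypothetical stand-in flag bits for this model only; the real
 * __GFP_WAIT and __GFP_NOMEMALLOC live in include/linux/gfp.h. */
#define SIM_GFP_WAIT        0x1u
#define SIM_GFP_NOMEMALLOC  0x2u

/* memory.high of 32M expressed in 4K pages (assumed page size). */
#define SIM_HIGH_PAGES 8192L

static long sim_usage_pages;

/*
 * Toy soft limit: the charge never fails. If usage is above the limit
 * and the caller may block, "reclaim" trims the whole excess (an
 * idealization; real reclaim gives no such guarantee).
 */
void sim_charge(unsigned int flags, int order)
{
	assert(order >= 0 && order < 16);
	sim_usage_pages += 1L << order;
	if (sim_usage_pages > SIM_HIGH_PAGES && (flags & SIM_GFP_WAIT))
		sim_usage_pages = SIM_HIGH_PAGES;
}

/*
 * Charge nr_allocs high-order slab pages and return the final usage.
 * strip_before_charge != 0 models the behaviour before the patch: the
 * wait bit is cleared by the caller, so the charge never sees it.
 * strip_before_charge == 0 models the patch: the bit is cleared only
 * for the page allocator, after the charge.
 */
long sim_final_usage(int strip_before_charge, int nr_allocs, int order)
{
	int i;

	sim_usage_pages = 0;
	for (i = 0; i < nr_allocs; i++) {
		unsigned int flags = SIM_GFP_WAIT;

		if (strip_before_charge)
			flags = (flags | SIM_GFP_NOMEMALLOC) & ~SIM_GFP_WAIT;

		sim_charge(flags, order);

		/* The page allocator gets the stripped flags either way. */
		flags = (flags | SIM_GFP_NOMEMALLOC) & ~SIM_GFP_WAIT;
		(void)flags;
	}
	return sim_usage_pages;
}
```

In this model, sim_final_usage(1, 10000, 3) ends at 80000 pages, far
above the 8192-page limit, while sim_final_usage(0, 10000, 3) ends at
8192 pages.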
