From: Alex Thorlton <athorlton@sgi.com>
To: David Rientjes <rientjes@google.com>
Cc: Alex Thorlton <athorlton@sgi.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org, mgorman@suse.de, riel@redhat.com,
kirill.shutemov@linux.intel.com, mingo@kernel.org,
hughd@google.com, lliubbo@gmail.com, hannes@cmpxchg.org,
srivatsa.bhat@linux.vnet.ibm.com, dave.hansen@linux.intel.com,
dfults@sgi.com, hedi@sgi.com
Subject: Re: [BUG] THP allocations escape cpuset when defrag is off
Date: Wed, 23 Jul 2014 17:57:42 -0500
Message-ID: <20140723225742.GU8578@sgi.com>
In-Reply-To: <alpine.DEB.2.02.1407231516570.23495@chino.kir.corp.google.com>
On Wed, Jul 23, 2014 at 03:28:09PM -0700, David Rientjes wrote:
> > My debug code shows that certain code paths are still allowing
> > ALLOC_CPUSET to get pulled off the alloc_flags with the patch, but
> > monitoring the memory usage shows that we're staying on node, aside from
> > some very small allocations, which may be other types of allocations that
> > are not necessarily confined to a cpuset. Need a bit more research to
> > confirm that.
> >
>
> ALLOC_CPUSET should get stripped for the cases outlined in
> __cpuset_node_allowed_softwall(), specifically for GFP_ATOMIC which does
> not have __GFP_WAIT set.
Makes sense. I knew my patch was probably the wrong way to fix this,
but it did serve my purpose :)
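For readers following the flag logic, here is a simplified user-space model of the rule David describes: an allocation that cannot sleep (no __GFP_WAIT) is treated like GFP_ATOMIC, gets to allocate harder, and has ALLOC_CPUSET cleared. This is a sketch under stated assumptions, not kernel code. All MODEL_* constants and model_gfp_to_alloc_flags() are hypothetical stand-ins with made-up bit values, and the real gfp_to_alloc_flags() and __cpuset_node_allowed_softwall() handle many more cases.

```c
/*
 * Simplified user-space model (hypothetical, for illustration only) of how
 * the page allocator slowpath turns a gfp mask into internal alloc flags.
 * Bit values are invented; they are NOT the kernel's real definitions.
 */
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical stand-ins for gfp bits. */
#define MODEL_GFP_WAIT        0x01u /* caller may sleep: reclaim/compaction allowed */
#define MODEL_GFP_NOMEMALLOC  0x02u /* caller must not dip into reserves */

/* Hypothetical stand-ins for internal alloc flags. */
#define MODEL_ALLOC_CPUSET    0x01u /* honour the task's cpuset mems */
#define MODEL_ALLOC_HARDER    0x02u /* allowed further below the watermarks */

static unsigned int model_gfp_to_alloc_flags(gfp_t gfp_mask)
{
    /* A normal allocation starts out restricted to the cpuset. */
    unsigned int alloc_flags = MODEL_ALLOC_CPUSET;
    bool can_sleep = gfp_mask & MODEL_GFP_WAIT;

    if (!can_sleep) {
        /* Cannot sleep: treated like GFP_ATOMIC and allowed to try harder. */
        if (!(gfp_mask & MODEL_GFP_NOMEMALLOC))
            alloc_flags |= MODEL_ALLOC_HARDER;
        /* Atomic callers ignore cpuset mems rather than fail outright. */
        alloc_flags &= ~MODEL_ALLOC_CPUSET;
    }
    return alloc_flags;
}
```

In this model, model_gfp_to_alloc_flags(MODEL_GFP_WAIT) keeps MODEL_ALLOC_CPUSET, while a mask of 0 comes back with only MODEL_ALLOC_HARDER set, i.e. no cpuset restriction at all.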
> > So, my question ends up being, why do we wipe out ___GFP_WAIT when
> > defrag is off? I'll trust that there is good reason to do that, but, if
> > so, is the behavior that I'm seeing expected?
> >
>
> The intention is to avoid memory compaction (and direct reclaim),
> obviously, which does not run when __GFP_WAIT is not set. But you're
> exactly right that this abuses the allocflags conversion that allows
> ALLOC_CPUSET to get cleared because it is using the aforementioned
> GFP_ATOMIC exception for cpuset allocation.
>
> We can't use PF_MEMALLOC or TIF_MEMDIE for hugepage allocation because it
> affects the allowed watermarks and nothing else prevents memory compaction
> or direct reclaim from running in the page allocator slowpath.
>
> So it looks like a modification to the page allocator is needed, see
> below.
Looks good to me. It fixes the problem without affecting any of the other
intended functionality.
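A similarly hedged sketch of why turning defrag off lets THP faults escape the cpuset: when defrag is disabled the THP gfp mask has its sleep bit cleared (to avoid compaction and direct reclaim), so a "cannot sleep means atomic" rule strips the cpuset flag as a side effect. The second flag function models one possible shape of a page-allocator fix, under the assumption that THP requests carry a marker bit (MODEL_GFP_NO_KSWAPD here) that genuinely atomic callers do not. The patch David posted is not reproduced in this message and may differ; every MODEL_* name and model_* function below is hypothetical, with invented bit values.

```c
/*
 * Hypothetical user-space model, for illustration only. Bit values and
 * names are invented and do not match the kernel's definitions.
 */
#include <stdbool.h>

typedef unsigned int gfp_t;

#define MODEL_GFP_WAIT       0x01u /* caller may sleep */
#define MODEL_GFP_NO_KSWAPD  0x04u /* assumed marker carried by THP requests */
#define MODEL_GFP_COMP       0x08u /* compound page */
#define MODEL_GFP_TRANSHUGE  (MODEL_GFP_WAIT | MODEL_GFP_NO_KSWAPD | MODEL_GFP_COMP)

#define MODEL_ALLOC_CPUSET   0x01u /* honour the task's cpuset mems */

/* With defrag off, drop the sleep bit so compaction/reclaim never run. */
static gfp_t model_thp_gfpmask(bool defrag)
{
    return MODEL_GFP_TRANSHUGE & ~(defrag ? 0u : MODEL_GFP_WAIT);
}

/* Behaviour described in the thread: any mask without the sleep bit is "atomic". */
static unsigned int model_flags_before(gfp_t gfp_mask)
{
    unsigned int flags = MODEL_ALLOC_CPUSET;

    if (!(gfp_mask & MODEL_GFP_WAIT))
        flags &= ~MODEL_ALLOC_CPUSET;
    return flags;
}

/* Hypothetical fix: only masks lacking both bits count as atomic. */
static unsigned int model_flags_after(gfp_t gfp_mask)
{
    unsigned int flags = MODEL_ALLOC_CPUSET;
    bool atomic = !(gfp_mask & (MODEL_GFP_WAIT | MODEL_GFP_NO_KSWAPD));

    if (atomic)
        flags &= ~MODEL_ALLOC_CPUSET;
    return flags;
}
```

With defrag off, model_flags_before(model_thp_gfpmask(false)) loses MODEL_ALLOC_CPUSET, which is the escape reported in this thread, while model_flags_after() keeps it and still strips it for a plain atomic mask of 0. The design point is that "cannot sleep" and "must not fail" are separate properties, so the cpuset exception should key on the latter.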
> It's also been a long-standing issue that cpusets and mempolicies are
> ignored by khugepaged that allows memory to be migrated remotely to nodes
> that are not allowed by a cpuset's mems or a mempolicy's nodemask. Even
> with this issue fixed, you may find that some memory is migrated remotely,
> although it may be negligible, by khugepaged.
A bit here and there is manageable. There is, of course, some work to
be done there, but for now we're mainly concerned with a job that's
supposed to be confined to a cpuset spilling out and soaking up all the
memory on a machine.
Thanks for the help, David. Much appreciated!
- Alex