linux-mm.kvack.org archive mirror
From: Alex Thorlton <athorlton@sgi.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Thorlton <athorlton@sgi.com>,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Ingo Molnar <mingo@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Hugh Dickins <hughd@google.com>, Bob Liu <lliubbo@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-mm@kvack.org
Subject: Re: [BUG] mm, thp: khugepaged can't allocate on requested node when confined to a cpuset
Date: Fri, 10 Oct 2014 13:56:20 -0500
Message-ID: <20141010185620.GA3745@sgi.com>
In-Reply-To: <20141010092052.GU4750@worktop.programming.kicks-ass.net>

On Fri, Oct 10, 2014 at 11:20:52AM +0200, Peter Zijlstra wrote:
> So for the numa thing we do everything from the affected tasks context.
> There was a lot of arguments early on that that could never really work,
> but here we are.
>
> Should we convert khugepaged to the same? Drive the whole thing from
> task_work? That would make this issue naturally go away.

That seems like a reasonable idea to me, but that will change the way
that the compaction scans work right now, by quite a bit.  As I'm sure
you're aware, the way it works now is we tack our mm onto the
khugepaged_scan list in do_huge_pmd_anonymous_page (there might be some
other ways to get on there - I can't remember), then when khugepaged
wakes up it scans through each mm on the list until it hits the maximum
number of pages to scan on each pass.
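
To make the current behavior concrete, here is a rough userspace model of
that loop.  This is not kernel code: toy_mm, toy_scan and toy_scan_pass
are made-up names, and the real scan does far more per page.  It only
shows one cursor walking a list of mms and stopping once a per-pass page
budget is used up, then resuming from the same spot on the next pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one mm on the scan list. */
struct toy_mm {
	unsigned long nr_pages;		/* pages this mm offers for scanning */
};

/* Hypothetical cursor: which mm we are in, and how far into it. */
struct toy_scan {
	size_t slot;			/* index of the mm being scanned */
	unsigned long offset;		/* next page to scan within that mm */
};

/*
 * Scan at most 'budget' pages, walking the mms in list order and
 * resuming from the cursor.  Stops early once every mm has been
 * finished once in this pass.  Returns the number of pages scanned.
 */
unsigned long toy_scan_pass(const struct toy_mm *mms, size_t nr_mms,
			    struct toy_scan *cur, unsigned long budget)
{
	unsigned long scanned = 0;
	size_t finished = 0;

	assert(nr_mms == 0 || cur->slot < nr_mms);

	while (scanned < budget && finished < nr_mms) {
		const struct toy_mm *mm = &mms[cur->slot];
		unsigned long left = mm->nr_pages - cur->offset;
		unsigned long take = budget - scanned;

		if (take > left)
			take = left;
		scanned += take;
		cur->offset += take;

		/* Done with this mm: move the cursor to the next one. */
		if (cur->offset == mm->nr_pages) {
			cur->slot = (cur->slot + 1) % nr_mms;
			cur->offset = 0;
			finished++;
		}
	}
	return scanned;
}
```

With three mms of 300, 500 and 100 pages and a budget of 400, the first
pass finishes the first mm and gets 100 pages into the second; the next
pass picks up from there.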

If we move the compaction scan over to a task_work style function, we'll
only be able to scan the one task's mm at a time.  While the underlying
compaction infrastructure can function more or less the same, the timing
of when these scans occur, and exactly what the scans cover, will have
to change.  If we go for the most rudimentary approach, the scans will
occur each time a thread is about to return to userland after faulting
in a THP (we'll just replace the khugepaged_enter call with a
task_work_add), and will cover the mm for the current task.  A slightly
more advanced approach would involve a timer to ensure that scans don't
occur too often, as is currently handled by
khugepaged_scan_sleep_millisecs. In any case, I don't see a way around
the fact that we'll lose the multi-mm scanning functionality our
khugepaged_scan list provides, but maybe that's not a huge issue.

Before I run off and start writing patches, here's a brief summary of
what I think we could do here:

1) Dissolve the khugepaged thread and related structs/timers (I'm
   expecting some backlash on this one).
2) Replace khugepaged_enter calls with calls to task_work_add(work,
   our_new_scan_function) - new scan function will look almost exactly
   like khugepaged_scan_mm_slot.
3) Set up a timer similar to khugepaged_scan_sleep_millisecs that gets
   checked during/before our_new_scan_function to ensure that we're not
   scanning more often than necessary.  Also, set up progress markers to
   limit the number of pages scanned in a single pass.

By doing this, scans will get triggered each time a thread that has
faulted THPs is about to return to userland execution, throttled by our
new timer/progress indicators.  The major benefit here is that scans
will now occur in the desired task's context.
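
A rough userspace sketch of the throttling in step 3, just to pin down
what I mean.  Nothing here is kernel API: toy_task_scan and
toy_return_to_user_scan are hypothetical names, time is a plain
millisecond counter passed in by the caller, and the "scan" only
advances a progress marker.  The point is the two limits: a minimum
interval between scans and a page budget per scan.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task scan state. */
struct toy_task_scan {
	unsigned long nr_pages;		/* pages in this task's mm */
	unsigned long next_page;	/* progress marker: where to resume */
	unsigned long long last_scan_ms;/* time of the last scan */
	bool scanned_once;		/* false until the first scan runs */
};

/*
 * Called when the task is about to return to userland.  Skips the scan
 * if the previous one was less than 'min_interval_ms' ago; otherwise
 * scans up to 'budget' pages starting at the progress marker.
 * Assumes 'now_ms' never goes backwards.  Returns pages scanned.
 */
unsigned long toy_return_to_user_scan(struct toy_task_scan *ts,
				      unsigned long long now_ms,
				      unsigned long long min_interval_ms,
				      unsigned long budget)
{
	unsigned long left, take;

	assert(ts->next_page <= ts->nr_pages);

	/* Throttle: too soon since the last scan. */
	if (ts->scanned_once && now_ms - ts->last_scan_ms < min_interval_ms)
		return 0;

	left = ts->nr_pages - ts->next_page;
	take = budget < left ? budget : left;
	ts->next_page += take;

	/* Reached the end of the mm: start over on the next scan. */
	if (ts->next_page == ts->nr_pages)
		ts->next_page = 0;

	ts->last_scan_ms = now_ms;
	ts->scanned_once = true;
	return take;
}
```

Note that this state is per task, so unlike the khugepaged_scan list
there is no shared cursor spanning multiple mms.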

Let me know if anybody sees any major flaws in this approach.

Thanks a lot for your input, Peter!

- Alex
