From: Mel Gorman <mgorman@suse.de>
To: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>,
Alex Thorlton <athorlton@sgi.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Bob Liu <lliubbo@gmail.com>, David Rientjes <rientjes@google.com>,
"Eric W. Biederman" <ebiederm@xmission.com>,
Hugh Dickins <hughd@google.com>, Ingo Molnar <mingo@redhat.com>,
Kees Cook <keescook@chromium.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Oleg Nesterov <oleg@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Vladimir Davydov <vdavydov@parallels.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/4] Convert khugepaged to a task_work function
Date: Mon, 10 Nov 2014 11:03:57 +0000
Message-ID: <20141110110357.GY21422@suse.de>
In-Reply-To: <544F9302.4010001@redhat.com>
On Tue, Oct 28, 2014 at 08:58:42AM -0400, Rik van Riel wrote:
> On 10/28/2014 08:12 AM, Andi Kleen wrote:
> > Alex Thorlton <athorlton@sgi.com> writes:
> >
> >> Last week, while discussing possible fixes for some unexpected/unwanted behavior
> >> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
> >> mentioned possibly changing khugepaged to work as a task_work function
> >> instead of a kernel thread. This will give us finer grained control over the
> >> page collapse scans, eliminate some unnecessary scans since tasks that are
> >> relatively inactive will not be scanned often, and eliminate the unwanted
> >> behavior described in the email thread I mentioned.
> >
> > With your change, what would happen in a single threaded case?
> >
> > Previously one core would scan and another would run the workload.
> > With your change both scanning and running would be on the same
> > core.
> >
> > Would seem like a step backwards to me.
>
Only in the single-threaded, one process for the whole system case.
khugepaged can only scan one address space at a time and if processes
fail to allocate a huge page on fault then they must wait until
khugepaged gets to scan them. The wait time is not unbounded, but it
could be considerable.
As pointed out elsewhere, scanning from task-work context allows the
scan rate to adapt to different inputs -- runtime on CPU probably
being the most relevant. Another scan factor could be NUMA sharing within
THP boundaries, in which case we want neither to collapse nor to continue
scanning at the same rate.
> It's not just scanning, either.
>
> Memory compaction can spend a lot of time waiting on
> locks. Not consuming CPU or anything, but just waiting.
>
I did not pick apart the implementation closely as it's still an RFC but
there is no requirement for the reclaim/compaction to take place from
task work context. That would likely cause user-visible stalls in any
number of situations, which can trigger bug reports.
One possibility would be to try to allocate a THP with GFP_ATOMIC from
task_work context and only start the scan if that allocation succeeds. Scan
the address space for a THP to collapse. If a collapse target is found and
the allocated THP is on the correct node then great -- use it. If not,
the first page should be freed and a second GFP_ATOMIC allocation
attempt made.
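The flow above can be sketched roughly as follows. This is a userspace
model in C, not kernel code: struct thp, the one-page-per-node pool,
thp_alloc_atomic(), thp_free() and task_work_collapse() are all
hypothetical names made up for illustration, and treating the second
attempt as a strict allocation on the target's node is only one possible
reading of the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 2

/* Hypothetical stand-in for a huge page; only its NUMA node matters here. */
struct thp { int node; };

/* Hypothetical allocator model: at most one free huge page per node. */
struct thp pool[NR_NODES] = { { 0 }, { 1 } };
bool pool_free[NR_NODES];

/*
 * Models a non-blocking (GFP_ATOMIC-like) attempt: it never reclaims or
 * compacts, it either finds a free page immediately or fails. When strict
 * is false it may fall back to another node.
 */
struct thp *thp_alloc_atomic(int node, bool strict)
{
	assert(node >= 0 && node < NR_NODES);
	if (pool_free[node]) {
		pool_free[node] = false;
		return &pool[node];
	}
	if (strict)
		return NULL;
	for (int i = 0; i < NR_NODES; i++) {
		if (pool_free[i]) {
			pool_free[i] = false;
			return &pool[i];
		}
	}
	return NULL;
}

void thp_free(struct thp *page)
{
	pool_free[page->node] = true;
}

enum collapse_result {
	COLLAPSE_NO_PAGE,	/* first allocation failed, scan never started */
	COLLAPSE_NO_TARGET,	/* scan found nothing to collapse */
	COLLAPSE_DONE,		/* a page on the right node was used */
	COLLAPSE_DEFERRED,	/* wrong node and second attempt failed */
};

/*
 * target_node stands in for the result of the address space scan;
 * a negative value means no collapse target was found.
 */
enum collapse_result task_work_collapse(int local_node, int target_node)
{
	struct thp *page = thp_alloc_atomic(local_node, false);

	if (!page)
		return COLLAPSE_NO_PAGE;
	if (target_node < 0) {
		thp_free(page);
		return COLLAPSE_NO_TARGET;
	}
	if (page->node != target_node) {
		/* Wrong node: free the first page and try once more. */
		thp_free(page);
		page = thp_alloc_atomic(target_node, true);
		if (!page)
			return COLLAPSE_DEFERRED;
	}
	return COLLAPSE_DONE;	/* the page is consumed by the collapse */
}
```

The point of the ordering is that the scan cost is only paid when a page
is already in hand, and nothing in this path ever sleeps.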
If a THP allocation fails then we need to wake something to try to allocate
the page on the process's behalf. khugepaged could be repurposed to do the
reclaim/compaction step or kswapd could be woken up. Either option may
be tricky to get right as currently waking kswapd is avoided to prevent
excessive reclaim. khugepaged could do the work but would need similar
back-off logic in the event of failures. Workqueues could also be used
but I'd worry about controlling the number of active workqueue requests
and accounting for the reclaim/compaction work is trickier if workqueues
were used.
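As a rough illustration of what such back-off logic could look like, here
is a small userspace sketch in C. The policy (exponential, capped, reset
on success) is an assumption and not something settled in this thread, and
struct thp_defer, MAX_DEFER_SHIFT and the three helpers are hypothetical
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap: allow at most one attempt per 64 requests. */
#define MAX_DEFER_SHIFT 6

struct thp_defer {
	unsigned int shift;		/* current back-off exponent */
	unsigned int considered;	/* requests seen since the last failure */
};

/* A reclaim/compaction attempt failed: back off further, up to the cap. */
void thp_defer_failure(struct thp_defer *d)
{
	assert(d != NULL);
	d->considered = 0;
	if (d->shift < MAX_DEFER_SHIFT)
		d->shift++;
}

/* An attempt produced a huge page: stop backing off. */
void thp_defer_success(struct thp_defer *d)
{
	assert(d != NULL);
	d->shift = 0;
	d->considered = 0;
}

/*
 * Returns true if this request should be skipped. After a failure,
 * (1 << shift) - 1 requests are skipped before the next real attempt.
 */
bool thp_deferred(struct thp_defer *d)
{
	unsigned int limit = 1U << d->shift;

	assert(d != NULL);
	if (++d->considered >= limit) {
		d->considered = limit;	/* avoid overflow while idle */
		return false;
	}
	return true;
}
```

Whoever ends up doing the reclaim/compaction step would check
thp_deferred() before starting work and report the outcome through
thp_defer_failure() or thp_defer_success().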
> I am not convinced that moving all that waiting to task
> context is a good idea.
>
It allows the scanning of page tables to be parallelised, moves the
work into the task context where it can be accounted for and the scan
rate can be adapted to prevent useless work. I think those are desirable
characteristics although there is no data on the expected gains of doing
something like this. It's the proper deferral of THP allocations that is
likely to cause the most headaches.
--
Mel Gorman
SUSE Labs