From: Andrea Arcangeli <aarcange@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Pavel Emelyanov <xemul@parallels.com>,
	zhang.zhanghailiang@huawei.com,
	Dave Hansen <dave.hansen@intel.com>,
	Rik van Riel <riel@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Bamvor Zhang Jian <bamvor.zhangjian@linaro.org>,
	Bharata B Rao <bharata@linux.vnet.ibm.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>
Subject: Re: [PATCH 00/12] userfaultfd non-x86 and selftest updates for 4.2.0+
Date: Thu, 1 Oct 2015 18:04:30 +0200
Message-ID: <20151001160430.GJ19466@redhat.com>
In-Reply-To: <560C8161.5020602@oracle.com>

Hello Mike,

On Wed, Sep 30, 2015 at 05:42:09PM -0700, Mike Kravetz wrote:
> The use case I have is pretty simple.  Recently, fallocate hole punch
> support was added to hugetlbfs.  The reason for this is that the database
> people want to 'free up' huge pages they know will no longer be used.
> However, these huge pages are part of SGA areas sometimes mapped by tens
> of thousands of tasks.  They would like to 'catch' any tasks that
> (incorrectly) fault in a page after hole punch.  The thought is that
> this can be done with userfaultfd by registering these mappings with
> UFFDIO_REGISTER_MODE_MISSING.  No need for UFFDIO_COPY or UFFDIO_ZEROPAGE.
> We would just send a signal to the task (such as SIGBUS) and then do
> a UFFDIO_WAKE.  The only downside to this approach is having thousands
> of threads monitoring userfault fds to catch a database error condition.
> I believe the MADV_USERFAULT/NOUSERFAULT code you proposed some time back
> would be the ideal solution for this use case.  Unfortunately, I did not
> know of this use case or your proposal back then. :(

I see how MADV_USERFAULT would have been lighter weight, since it
avoids allocating the anon file structures and the associated anon
inode, but it's no big deal. A few thousand files are lost in the
noise in terms of memory footprint and there will be no performance
difference.

Note also that adding back MADV_USERFAULT always remains possible, but
you can avoid all those threads even with the userfaultfd API. CRIU
and postcopy live migration of containers are also going to use
similar logic (and for them the MADV_USERFAULT API would not be enough).

Even in light of this, I don't think MADV_USERFAULT was worth
saving. It was too flaky when copy-user or GUP fail in the context
of read/write or other syscalls that just return -EFAULT and are not
restarted by signals if the page fault fails. Not to mention that it
requires going back to userland and back into the kernel in order to
run the sigbus handler; userfaultfd optimizes that away. Last but not
least, a communication channel between the sigbus handler and the
userfault handler thread would need to be allocated manually by
userland anyway. With userfaultfd it's the kernel that talks directly
to the userfault handler thread, so there's no need to maintain
another communication channel because the userfaultfd provides it in
a more efficient way.

If you have a parent of all those processes that stays alive waiting
for SIGCHLD to reap the zombies, you can send the userfaultfd of the
child to a thread in the parent using unix domain sockets, then you
can release the fd in the child. The uffd will then be pollable in
the parent and it'll still work on the child "mm" as if it was a
per-child thread handling it. A single parent thread (or even the
main process thread itself if it's using an epoll driven loop) can
poll all the children. If you do it with a separate thread cloned by
the parent, there's no need for epoll in your case, as you only get
woken in case of memory corruption and failure to cleanup and report.
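
Passing the uffd from child to parent is ordinary SCM_RIGHTS fd
passing over an AF_UNIX socket. A minimal sketch follows; send_fd()
and recv_fd() are illustrative names, not an existing API, and the
socket is assumed to be already connected (for example a socketpair()
created before fork()).

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send fd over the connected AF_UNIX socket sock. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;		/* at least one data byte is required */
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock. Returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The child would call send_fd() with its uffd and then close() its own
copy; the fd returned by recv_fd() in the parent still refers to the
child "mm".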

Once a uffd gets woken you can send any signal to the child to kill
it (note that only SIGKILL is reliable for killing a task stuck in
handle_userfault(), because if the userfault happened inside a syscall
all other signals can't run until the child is woken by
UFFDIO_WAKE). SIGKILL always works reliably at killing a task stuck in
userfault, no matter if the fault originated in userland or not. To
decrease the latency of signals and to allow gdb/strace to work
seamlessly in most cases, we allowed signals to interrupt a blocked
userfault if it originated in userland, and in turn the fault will be
retried immediately after the signal handler sigreturns. It'll be as
if no page fault had happened yet by the time the signal returns. You
don't want to depend on this, as you won't know if the
handle_userfault() was originated by a userland or kernel page fault.

When a SIGCHLD is received by the parent and you call one of the
wait() variants to reap the zombie, you also close the associated uffd
to release the memory of the child.

Alternatively, if you are satisfied with just a hang instead of
ending up with memory corruption, you can just register it in the
child and leave the uffd open without ever polling it. If you have a
watchdog in the parent process detecting tasks in S state that are not
responding, you can still detect the corruption case by looking in
/proc/pid/stack: you'll see it hung in handle_userfault(). This won't
provide an accurate error message, but it'd be the simplest to
deploy. It'll still provide fully safe avoidance of memory corruption
and it may be enough considering what would happen if the userfault
wasn't armed.
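
The watchdog check could be as simple as the sketch below. The
function name stack_mentions() is made up, and it assumes the watchdog
is allowed to read /proc/pid/stack at all (that typically requires
root and a kernel built with stack trace support).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan the text file at path (for example
 * "/proc/1234/stack") for a line containing sym.
 * Returns 1 if found, 0 if not found, -1 if the file can't be opened.
 * Lines longer than the buffer are read in pieces, so a symbol split
 * across two pieces would be missed; stack lines are short in practice.
 */
static int stack_mentions(const char *path, const char *sym)
{
	char line[256];
	FILE *f = fopen(path, "r");
	int found = 0;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, sym)) {
			found = 1;
			break;
		}
	}

	fclose(f);
	return found;
}
```

A watchdog would build the path with snprintf() from the pid of the
unresponsive task and call stack_mentions(path, "handle_userfault").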

Thanks,
Andrea

