From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
Sanidhya Kashyap <sanidhya.gatech@gmail.com>,
Dave Hansen <dave.hansen@intel.com>,
zhang.zhanghailiang@huawei.com,
Pavel Emelyanov <xemul@parallels.com>,
Hugh Dickins <hughd@google.com>, Mel Gorman <mgorman@suse.de>,
"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Andres Lagar-Cavilla <andreslc@google.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
linux-mm@kvack.org, linux-api@vger.kernel.org,
linux-kernel@vger.kernel.org,
Andy Lutomirski <luto@amacapital.net>,
Johannes Weiner <hannes@cmpxchg.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Peter Feiner <pfeiner@google.com>
Subject: Re: [Qemu-devel] [PATCH 14/23] userfaultfd: wake pending userfaults
Date: Thu, 22 Oct 2015 16:18:31 +0200 [thread overview]
Message-ID: <20151022141831.GA1331@redhat.com> (raw)
In-Reply-To: <20151022133824.GR17308@twins.programming.kicks-ass.net>
On Thu, Oct 22, 2015 at 03:38:24PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:
>
> > If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> > would be a bug in the scheduler in my view. Luckily there doesn't seem
> > to be such a bug, or at least we never experienced it.
>
> Well, there will be a wakeup, just not the one you were hoping for.
>
> We have code that does:
>
> @cond = true;
> get_task_struct(p);
> queue(p)
>
> /* random wait somewhere */
> for (;;) {
> prepare_to_wait();
> if (@cond)
> break;
>
> ...
>
> handle_userfault()
> ...
> schedule();
> ...
>
> dequeue(p)
> wake_up_process(p) ---> wakeup without userfault wakeup
>
>
> These races are (extremely) rare, but they do exist. Therefore one must
> never assume schedule() will not spuriously wake because of these
> things.
>
> Also, see:
>
> lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojY2H6Q@mail.gmail.com
With one more spinlock taken in the fast path we could recheck if the
waitqueue is still queued and this is a false positive extremely rare
spurious wakeup, and in such case set the state back to TASK_KILLABLE
and schedule.
However in the long term such a spinlock should be removed because
it's faster to stick with the current lockless list_empty_careful and
not to recheck the auto-remove waitqueue, but then we must be able to
re-enter handle_userfault() even if FAULT_FLAG_TRIED was set
(currently we can't return VM_FAULT_RETRY if FAULT_FLAG_TRIED is set
and that's the problem). This change is planned for a long time as we
need it to arm the vma-less write protection while the app is running,
so I'm not sure if it's worth going for the short term fix if this is
extremely rare.
The risk of memory corruption is still zero no matter what happens
here, in the extremely rare case the app will get a SIGBUS or a
syscall will return -EFAULT. The kernel also cannot crash. So it's not
very severe concern if it happens extremely rarely (we never
reproduced it and stress testing run for months). Of course in the
longer term this would have been fixed regardless as said in previous
email.
I think going for the longer term fix that was already planned, is
better than doing a short term fix and the real question is how I
should proceed to change the arch code and gup to cope with
handle_userfault() being re-entered.
The simplest thing is to drop FAULT_FLAG_TRIED as a whole. Or I could
add a new VM_FAULT_USERFAULT flag specific to handle_userfault that
would be returned even if FAULT_FLAG_TRIED is set, so that only
userfaults will be allowed to be repeated indefinitely (and then
VM_FAULT_USERFAULT shouldn't trigger a transition to FAULT_FLAG_TRIED,
unlike VM_FAULT_RETRY does).
This is all about being allowed to drop the mmap_sem.
If we'd check the waitqueue with the spinlock (to be sure the wakeup
isn't happening from under us while we check if we got an userfault
wakeup or if this is a spurious schedule), we could also limit the
VM_FAULT_RETRY to 2 max events if I add a FAULT_FLAG_TRIED2 and I
still use VM_FAULT_RETRY (instead of VM_FAULT_USERFAULT).
Being able to return VM_FAULT_RETRY indefinitely is only needed if we
don't handle the extremely wakeup race condition in handle_userfault
by taking the spinlock once more time in the fast path (i.e. after the
schedule).
I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
so I'm tempted to drop FAULT_FLAG_TRIED entirely.
I've no real preference on how to tweak the page fault code to be able
to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
change possible, so if you've suggestions now it's good time.
Thanks,
Andrea
next prev parent reply other threads:[~2015-10-22 14:18 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-05-14 17:30 [Qemu-devel] [PATCH 00/23] userfaultfd v4 Andrea Arcangeli
2015-05-14 17:30 ` [Qemu-devel] [PATCH 01/23] userfaultfd: linux/Documentation/vm/userfaultfd.txt Andrea Arcangeli
2015-09-11 8:47 ` Michael Kerrisk (man-pages)
2015-12-04 15:50 ` Michael Kerrisk (man-pages)
2015-12-04 17:55 ` Andrea Arcangeli
2015-05-14 17:30 ` [Qemu-devel] [PATCH 02/23] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 03/23] userfaultfd: uAPI Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 04/23] userfaultfd: linux/userfaultfd_k.h Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 05/23] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 06/23] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 07/23] userfaultfd: call handle_userfault() for userfaultfd_missing() faults Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 08/23] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 09/23] userfaultfd: prevent khugepaged to merge if userfaultfd is armed Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 10/23] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2015-05-14 17:49 ` Linus Torvalds
2015-05-15 16:04 ` Andrea Arcangeli
2015-05-15 18:22 ` Linus Torvalds
2015-06-23 19:00 ` Dave Hansen
2015-06-23 21:41 ` Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 11/23] userfaultfd: Rename uffd_api.bits into .features Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 12/23] userfaultfd: Rename uffd_api.bits into .features fixup Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 13/23] userfaultfd: change the read API to return a uffd_msg Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 14/23] userfaultfd: wake pending userfaults Andrea Arcangeli
2015-10-22 12:10 ` Peter Zijlstra
2015-10-22 13:20 ` Andrea Arcangeli
2015-10-22 13:38 ` Peter Zijlstra
2015-10-22 14:18 ` Andrea Arcangeli [this message]
2015-10-22 15:15 ` Peter Zijlstra
2015-10-22 15:30 ` Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 15/23] userfaultfd: optimize read() and poll() to be O(1) Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 16/23] userfaultfd: allocate the userfaultfd_ctx cacheline aligned Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 17/23] userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 18/23] userfaultfd: buildsystem activation Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall Andrea Arcangeli
2015-08-11 10:07 ` Bharata B Rao
2015-08-11 13:48 ` Andrea Arcangeli
2015-08-12 5:23 ` Bharata B Rao
2015-09-08 6:08 ` Michael Ellerman
2015-09-08 6:39 ` Bharata B Rao
2015-09-08 7:14 ` Michael Ellerman
2015-09-08 10:40 ` Michael Ellerman
2015-09-08 12:28 ` Dr. David Alan Gilbert
2015-09-08 8:59 ` Dr. David Alan Gilbert
2015-09-08 10:00 ` Bharata B Rao
2015-09-08 12:46 ` Dr. David Alan Gilbert
2015-09-08 13:37 ` Bharata B Rao
2015-09-08 14:13 ` Dr. David Alan Gilbert
2015-09-10 12:24 ` Bharata B Rao
2015-09-11 19:15 ` Dr. David Alan Gilbert
2015-09-14 18:53 ` Dr. David Alan Gilbert
2015-05-14 17:31 ` [Qemu-devel] [PATCH 20/23] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 21/23] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic Andrea Arcangeli
2015-05-22 20:18 ` Andrew Morton
2015-05-22 20:48 ` Andrea Arcangeli
2015-05-22 21:18 ` Andrew Morton
2015-05-23 1:04 ` Andrea Arcangeli
2015-05-14 17:31 ` [Qemu-devel] [PATCH 23/23] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE Andrea Arcangeli
2015-05-18 14:24 ` [Qemu-devel] [PATCH 00/23] userfaultfd v4 Pavel Emelyanov
2015-05-19 21:38 ` Andrew Morton
2015-05-19 21:59 ` Richard Weinberger
2015-05-20 14:17 ` Andrea Arcangeli
2015-05-20 13:23 ` Andrea Arcangeli
2015-05-21 13:11 ` Kirill Smelkov
2015-05-21 15:52 ` Andrea Arcangeli
2015-05-22 16:35 ` Kirill Smelkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151022141831.GA1331@redhat.com \
--to=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andreslc@google.com \
--cc=dave.hansen@intel.com \
--cc=dgilbert@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kirill@shutemov.name \
--cc=kvm@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@amacapital.net \
--cc=mgorman@suse.de \
--cc=pbonzini@redhat.com \
--cc=peter.huangpeng@huawei.com \
--cc=peterz@infradead.org \
--cc=pfeiner@google.com \
--cc=qemu-devel@nongnu.org \
--cc=sanidhya.gatech@gmail.com \
--cc=torvalds@linux-foundation.org \
--cc=xemul@parallels.com \
--cc=zhang.zhanghailiang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).