qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: Robert Love <rlove@google.com>, Dave Hansen <dave@sr71.net>,
	Jan Kara <jack@suse.cz>, kvm-devel <kvm@vger.kernel.org>,
	Neil Brown <neilb@suse.de>, Stefan Hajnoczi <stefanha@gmail.com>,
	QEMU Developers <qemu-devel@nongnu.org>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Taras Glek <tglek@mozilla.com>,
	zhanghailiang <zhang.zhanghailiang@huawei.com>,
	Juan Quintela <quintela@redhat.com>,
	Hugh Dickins <hughd@google.com>, Mel Gorman <mgorman@suse.de>,
	Sasha Levin <sasha.levin@oracle.com>,
	Android Kernel Team <kernel-team@android.com>,
	Andrew Jones <drjones@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Christopher Covington <cov@codeaurora.org>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Mike Hommey <mh@glandium.org>, Keith Packard <keithp@keithp.com>,
	Wenchao Xia <wenchaoqemu@gmail.com>,
	lkml - Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Minchan Kim <minchan@kernel.org>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Feiner <pfeiner@google.com>
Subject: Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2
Date: Tue, 25 Nov 2014 20:45:05 +0100	[thread overview]
Message-ID: <20141125194505.GR4569@redhat.com> (raw)
In-Reply-To: <CAFEAcA9c7spP1wAqWFE4r=uZVyT2ZfZGZ2MJhjTCgfs5YiSONg@mail.gmail.com>

On Fri, Nov 21, 2014 at 11:05:45PM +0000, Peter Maydell wrote:
> If it's mapped and readable-but-not-writable then it should still
> fault on write accesses, though? These are cases we currently get
> SEGV for, anyway.

Yes then it'll work just fine.

> Ah, I guess we have a terminology difference. I was considering
> "page fault" to mean (roughly) "anything that causes the CPU to
> take an exception on an attempted load/store" and expected that
> userfaultfd would notify userspace of any of those. (Well, not
> alignment faults, maybe, but I'm definitely surprised that
> access permission issues don't get reported the same way as
> page-completely-missing issues. In other words I was expecting
> that this was "everything previously reported via SIGSEGV or
> SIGBUS now comes via userfaultfd".)

Just not PROT_NONE SIGSEGV faults, i.e. PROT_NONE would still SIGSEGV
currently. Because it's not a not-present fault (the page is present,
just not mapped readable) and it's neither a wrprotect fault (it is
trapped with the vma vm_flags permission bits instead before the
actual page fault handler is invoked). userfaultfd hooks into the
common code of the page fault handler.

> > Temporarily removing/moving the page with remap_anon_pages shall be
> > much better than using PROT_NONE for this (or alternative syscall name
> > to differentiate it further from remap_file_pages, or equivalent
> > userfaultfd command if we decide to hide the pte/pmd mangling as
> > userfaultfd commands instead of adding new standalone syscalls).
> 
> We don't use PROT_NONE for the linux-user situation, we just use
> mprotect() to remove the PAGE_WRITE permission so it's still
> readable.

Like said above it'll work just fine then.

> I suspect actually linux-user would be better off implementing
> something like "if this is a page which we've mapped read-only
> because we translated code out of it, then go ahead and remap
> it r/w and throw away the translation and retry the access,
> otherwise report SEGV to the guest", because taking SEGVs shouldn't
> be a fast path in the guest binary. That would let us work without
> architecture-specific junk and without requiring new kernel
> features either. So you can ignore this whole tangent thread :-)

You might get a significant boost if you use userfaultfd.

For postcopy live snapshot and postcopy live migration the main
benefit is the removal mprotect as a whole and the performance
improvement is a secondary benefit.

You can cap the max size of the JIT translated cache (and in turn the
maximal number of vmas generated by the mprotects) but we can't cap
the address space fragmentation. The faults may invoke way too many
mprotect and we may fragment the vma too much to the point we get
-ENOMEM.

Marking a page wrprotected however is always tricky, no matter if it's
fork doing it or KSM or something else. KSM just skips page that could
be under gup pins and retries them at the next pass. Fork simply won't
work right currently and it needs MADV_DONTFORK to avoid the
wrprotection entirely where you may use O_DIRECT mixed with threads
and fork.

For this new vma-less syscall (or ufd command) the best we could do is
to print a warning if any page marked wrprotected could be under GUP
pin (the warning could generate false positives as result of
speculative cache lookups that run lockless get_page_unless_zero() on
any pfn).

To avoid races the postcopy live snapshot feature I think it should be
enough to wait all in-flight I/O to complete before marking the guest
address space readonly (the KVM gup() side can be taken care of by
marking the shadow MMU readonly which is a must anyway, the mmu
notifier will take care of that part).

The postcopy live snapshot will have to copy the page so it's
effectively a COW in userland, and in turn it must ensure there's no
O_DIRECT in flight still writing to the page (despite we marked it
readonly) while the wrprotection syscall runs.

For your case probably there's no gup() in the equation unless you use
O_DIRECT (I don't think you use shadow-MMU in the kernel in
linux-user) so you don't have to worry about those races and it's just
simpler.

  reply	other threads:[~2014-11-25 19:46 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-03 17:07 [Qemu-devel] [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
2014-10-03 18:15   ` Linus Torvalds
2014-10-03 20:55     ` Paolo Bonzini
2014-10-03 17:07 ` [Qemu-devel] [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
2014-10-03 18:23   ` Linus Torvalds
2014-10-06 14:14     ` Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
2014-10-07  9:03   ` Kirill A. Shutemov
2014-11-06 20:08   ` Konstantin Khlebnikov
2014-10-03 17:07 ` [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
2014-10-03 23:13   ` Mike Hommey
2014-10-06 17:24     ` Andrea Arcangeli
2014-10-07 10:36   ` Kirill A. Shutemov
2014-10-07 10:46     ` Dr. David Alan Gilbert
2014-10-07 10:52       ` Kirill A. Shutemov
2014-10-07 11:01         ` Dr. David Alan Gilbert
2014-10-07 11:30           ` Kirill A. Shutemov
2014-10-07 13:24     ` Andrea Arcangeli
2014-10-07 15:21       ` Kirill A. Shutemov
2014-10-03 17:07 ` [Qemu-devel] [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2014-10-03 18:31   ` Linus Torvalds
2014-10-06  8:55     ` Dr. David Alan Gilbert
2014-10-06 16:41       ` Andrea Arcangeli
2014-10-07 12:47         ` Linus Torvalds
2014-10-07 14:19           ` Andrea Arcangeli
2014-10-07 15:52             ` Andrea Arcangeli
2014-10-07 15:54               ` Andy Lutomirski
2014-10-07 16:13               ` Peter Feiner
2014-10-07 16:56             ` Linus Torvalds
2014-10-07 17:07           ` Dr. David Alan Gilbert
2014-10-07 17:14             ` Paolo Bonzini
2014-10-07 17:25               ` Dr. David Alan Gilbert
2014-10-07 11:10   ` Kirill A. Shutemov
2014-10-07 13:37     ` Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 11/17] mm: swp_entry_swapcount Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 12/17] mm: sys_remap_anon_pages Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER Andrea Arcangeli
2014-10-27  9:32 ` [Qemu-devel] [PATCH 00/17] RFC: userfault v2 zhanghailiang
2014-10-29 17:46   ` Andrea Arcangeli
2014-10-29 17:56     ` Peter Maydell
2014-11-21 20:14       ` Andrea Arcangeli
2014-11-21 23:05         ` Peter Maydell
2014-11-25 19:45           ` Andrea Arcangeli [this message]
2014-10-30 11:31     ` zhanghailiang
2014-10-30 12:49       ` Dr. David Alan Gilbert
2014-10-31  1:26         ` zhanghailiang
2014-11-19 18:49           ` Andrea Arcangeli
2014-11-20  2:54             ` zhanghailiang
2014-11-20 17:38               ` Andrea Arcangeli
2014-11-21  7:19                 ` zhanghailiang
2014-10-31  2:23       ` Peter Feiner
2014-10-31  3:29         ` zhanghailiang
2014-10-31  4:38           ` zhanghailiang
2014-10-31  5:17             ` Andres Lagar-Cavilla
2014-10-31  8:11               ` zhanghailiang
2014-10-31 19:39           ` Peter Feiner
2014-11-01  8:48             ` zhanghailiang
2014-11-20 17:29             ` Andrea Arcangeli
2014-11-12  7:18       ` zhanghailiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141125194505.GR4569@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreslc@google.com \
    --cc=anthony@codemonkey.ws \
    --cc=cov@codeaurora.org \
    --cc=dave@sr71.net \
    --cc=dgilbert@redhat.com \
    --cc=dmitry.adamushko@gmail.com \
    --cc=drjones@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=keithp@keithp.com \
    --cc=kernel-team@android.com \
    --cc=kosaki.motohiro@gmail.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mgorman@suse.de \
    --cc=mh@glandium.org \
    --cc=minchan@kernel.org \
    --cc=neilb@suse.de \
    --cc=pbonzini@redhat.com \
    --cc=peter.huangpeng@huawei.com \
    --cc=peter.maydell@linaro.org \
    --cc=pfeiner@google.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=rlove@google.com \
    --cc=sasha.levin@oracle.com \
    --cc=stefanha@gmail.com \
    --cc=tglek@mozilla.com \
    --cc=walken@google.com \
    --cc=wenchaoqemu@gmail.com \
    --cc=zhang.zhanghailiang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).