Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Andrea Arcangeli <aarcange@redhat.com>
To: zhanghailiang <zhang.zhanghailiang@huawei.com>
Cc: Robert Love <rlove@google.com>, Dave Hansen <dave@sr71.net>,
	Jan Kara <jack@suse.cz>,
	kvm@vger.kernel.org, Neil Brown <neilb@suse.de>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	qemu-devel@nongnu.org,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Taras Glek <tglek@mozilla.com>, Andrew Jones <drjones@redhat.com>,
	Juan Quintela <quintela@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Isaku Yamahata <yamahata@valinux.co.jp>,
	Mel Gorman <mgorman@suse.de>,
	Sasha Levin <sasha.levin@oracle.com>,
	Android Kernel Team <kernel-team@android.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Christopher Covington <cov@codeaurora.org>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Keith Packard <keithp@keithp.com>,
	Wenchao Xia <wenchaoqemu@gmail.com>,
	linux-kernel@vger.kernel.org,
	Andy Lutomirski <luto@amacapital.net>,
	Minchan Kim <minchan@kernel.org>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Mike Hommey <mh@glandium.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Feiner <pfeiner@google.com>
Subject: Re: [Qemu-devel] [PATCH 00/17] RFC: userfault v2
Date: Wed, 29 Oct 2014 18:46:07 +0100	[thread overview]
Message-ID: <20141029174607.GK19606@redhat.com> (raw)
In-Reply-To: <544E1143.1080905@huawei.com>

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:
> Hi Andrea,
> 
> Thanks for your hard work on userfault;)
> 
> This is really a useful API.
> 
> I want to confirm a question:
> Can we support distinguishing between writing and reading memory for userfault?
> That is, we can decide whether writing a page, reading a page or both trigger userfault.
> 
> I think this will help supporting vhost-scsi,ivshmem for migration,
> we can trace dirty page in userspace.
> 
> Actually, i'm trying to relize live memory snapshot based on pre-copy and userfault,
> but reading memory from migration thread will also trigger userfault.
> It will be easy to implement live memory snapshot, if we support configuring
> userfault for writing memory only.

Mail is going to be long enough already so I'll just assume tracking
dirty memory in userland (instead of doing it in kernel) is worthy
feature to have here.

After some chat during the KVMForum I've been already thinking it
could be beneficial for some usage to give userland the information
about the fault being read or write, combined with the ability of
mapping pages wrprotected to mcopy_atomic (that would work without
false positives only with MADV_DONTFORK also set, but it's already set
in qemu). That will require "vma->vm_flags & VM_USERFAULT" to be
checked also in the wrprotect faults, not just in the not present
faults, but it's not a massive change. Returning the read/write
information is also a not massive change. This will then payoff mostly
if there's also a way to remove the memory atomically (kind of
remap_anon_pages).

Would that be enough? I mean are you still ok if non present read
fault traps too (you'd be notified it's a read) and you get
notification for both wrprotect and non present faults?

The question then is how you mark the memory readonly to let the
wrprotect faults trap if the memory already existed and you didn't map
it yourself in the guest with mcopy_atomic with a readonly flag.

My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the
  fast path check in the not-present and wrprotect page fault

- if VM_USERFAULT is set, find if there's a userfaultfd registered
  into that vma too

    if yes engage userfaultfd protocol

    otherwise raise SIGBUS (single threaded apps should be fine with
    SIGBUS and it'll avoid them to spawn a thread in order to talk the
    userfaultfd protocol)

- if userfaultfd protocol is engaged, return read|write fault + fault
  address to read(ufd) syscalls

- leave the "userfault" resolution mechanism independent of the
  userfaultfd protocol so we keep the two problems separated and we
  don't mix them in the same API which makes it even harder to
  finalize it.

    add mcopy_atomic (with a flag to map the page readonly too)

    The alternative would be to hide mcopy_atomic (and even
    remap_anon_pages in order to "remove" the memory atomically for
    the externalization into the cloud) as userfaultfd commands to
    write into the fd. But then there would be no much point to keep
    MADV_USERFAULT around if I do so and I could just remove it
    too or it doesn't look clean having to open the userfaultfd just
    to issue an hidden mcopy_atomic.

    So it becomes a decision if the basic SIGBUS mode for single
    threaded apps should be supported or not. As long as we support
    SIGBUS too and we don't force to use userfaultfd as the only
    mechanism to be notified about userfaults, having a separate
    mcopy_atomic syscall sounds cleaner.
 
    Perhaps mcopy_atomic could be used in other cases that may arise
    later that may not be connected with the userfault.

Questions to double check the above plan is ok:

1) should I drop the SIGBUS behavior and MADV_USERFAULT?

2) should I hide mcopy_atomic as a write into the userfaultfd?

   NOTE: even if I hide mcopy_atomic as a userfaultfd command to write
   into the fd, the buffer pointer passed to write() syscall would
   still _not_ be pointing to the data like a regular write, but it
   would be a pointer to a command structure that points to the source
   and destination data of the "hidden" mcopy_atomic, the only
   advantage is that perhaps I could wakeup the blocked page faults
   without requiring an additional syscall.

   The standalone mcopy_atomic would still require a write into the
   userfaultfd as it happens now after remap_anon_pages returns, in
   order to wakeup the stopped page faults.

3) should I add a registration command to trap only write faults?

   The protocol can always be extended later anyway in a backwards
   compatible way but it's better if we get it fully featured from the
   start.

For completeness, some answers for other questions I've seen floating
around but that weren't posted on the list yet (you can skip reading
the below part if not interested):

- open("/dev/userfault") instead of sys_userfaultfd(), I don't see the
  benefit: userfaultfd is just like eventfd in terms of kernel API and
  registering a /dev/ device actually sounds trickier. userfault is a
  core VM feature and generally we prefer syscalls for core VM
  features instead of running ioctl on some chardev that may or may
  not exist. (like we did with /dev/ksm -> MADV_MERGEABLE)

- there was a suggestion during KVMForum about allowing an external
  program to attach to any MM. Like ptrace. So you could have a single
  process managing all userfaults for different processes. However
  because I cannot allow multiple userfaultfd to register into the
  same range, this doesn't look very reliable (ptrace is kind of an
  optional/debug feature while if userfault goes wrong and returns
  -EBUSY things go bad) and there may be other complications. If I'd
  allow multiple userfaultfd to register into the same range, I
  wouldn't even know who to deliver the userfault to. It is an erratic
  behavior. Currently it'd return -EBUSY if the app has a bug and does
  that, but maybe later this can be relaxed to allow higher
  scalability with a flag (userfaultfd gets flags as parameters), but
  it still would need to be the same logic that manages userfaults and
  the only point of allowing multiple ufd to map the same range would
  be SMP scalability. So I tend to see the userfaultfd as a MM local
  thing. The thread managing the userfaults can still talk with
  another process in the local machine using pipes or sockets if it
  needs to.

- the userfaultfd protocol version handshake was done this way because
  it looked more reliable.

  Of course we could pass the version of the protocol as parameter to
  userfaultfd too, but running the syscall multiple times until
  -EPROTO didn't return anymore doesn't seem any better than writing
  into the fd the wanted protocol until you read it back instead of
  -1ULL. It just looked more reliable not having to run the syscall
  again and again while depending on -EPROTO or some other
  -Esomething.

Thanks,
Andrea

next prev parent reply	other threads:[~2014-10-29 17:47 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-03 17:07 [Qemu-devel] [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
2014-10-03 18:15   ` Linus Torvalds
2014-10-03 20:55     ` Paolo Bonzini
2014-10-03 17:07 ` [Qemu-devel] [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
2014-10-03 18:23   ` Linus Torvalds
2014-10-06 14:14     ` Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem Andrea Arcangeli
2014-10-03 17:07 ` [Qemu-devel] [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
2014-10-07  9:03   ` Kirill A. Shutemov
2014-11-06 20:08   ` Konstantin Khlebnikov
2014-10-03 17:07 ` [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
2014-10-03 23:13   ` Mike Hommey
2014-10-06 17:24     ` Andrea Arcangeli
2014-10-07 10:36   ` Kirill A. Shutemov
2014-10-07 10:46     ` Dr. David Alan Gilbert
2014-10-07 10:52       ` Kirill A. Shutemov
2014-10-07 11:01         ` Dr. David Alan Gilbert
2014-10-07 11:30           ` Kirill A. Shutemov
2014-10-07 13:24     ` Andrea Arcangeli
2014-10-07 15:21       ` Kirill A. Shutemov
2014-10-03 17:07 ` [Qemu-devel] [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2014-10-03 18:31   ` Linus Torvalds
2014-10-06  8:55     ` Dr. David Alan Gilbert
2014-10-06 16:41       ` Andrea Arcangeli
2014-10-07 12:47         ` Linus Torvalds
2014-10-07 14:19           ` Andrea Arcangeli
2014-10-07 15:52             ` Andrea Arcangeli
2014-10-07 15:54               ` Andy Lutomirski
2014-10-07 16:13               ` Peter Feiner
2014-10-07 16:56             ` Linus Torvalds
2014-10-07 17:07           ` Dr. David Alan Gilbert
2014-10-07 17:14             ` Paolo Bonzini
2014-10-07 17:25               ` Dr. David Alan Gilbert
2014-10-07 11:10   ` Kirill A. Shutemov
2014-10-07 13:37     ` Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 11/17] mm: swp_entry_swapcount Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 12/17] mm: sys_remap_anon_pages Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd Andrea Arcangeli
2014-10-03 17:08 ` [Qemu-devel] [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER Andrea Arcangeli
2014-10-27  9:32 ` [Qemu-devel] [PATCH 00/17] RFC: userfault v2 zhanghailiang
2014-10-29 17:46   ` Andrea Arcangeli [this message]
2014-10-29 17:56     ` Peter Maydell
2014-11-21 20:14       ` Andrea Arcangeli
2014-11-21 23:05         ` Peter Maydell
2014-11-25 19:45           ` Andrea Arcangeli
2014-10-30 11:31     ` zhanghailiang
2014-10-30 12:49       ` Dr. David Alan Gilbert
2014-10-31  1:26         ` zhanghailiang
2014-11-19 18:49           ` Andrea Arcangeli
2014-11-20  2:54             ` zhanghailiang
2014-11-20 17:38               ` Andrea Arcangeli
2014-11-21  7:19                 ` zhanghailiang
2014-10-31  2:23       ` Peter Feiner
2014-10-31  3:29         ` zhanghailiang
2014-10-31  4:38           ` zhanghailiang
2014-10-31  5:17             ` Andres Lagar-Cavilla
2014-10-31  8:11               ` zhanghailiang
2014-10-31 19:39           ` Peter Feiner
2014-11-01  8:48             ` zhanghailiang
2014-11-20 17:29             ` Andrea Arcangeli
2014-11-12  7:18       ` zhanghailiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141029174607.GK19606@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreslc@google.com \
    --cc=anthony@codemonkey.ws \
    --cc=cov@codeaurora.org \
    --cc=dave@sr71.net \
    --cc=dgilbert@redhat.com \
    --cc=dmitry.adamushko@gmail.com \
    --cc=drjones@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=keithp@keithp.com \
    --cc=kernel-team@android.com \
    --cc=kosaki.motohiro@gmail.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mgorman@suse.de \
    --cc=mh@glandium.org \
    --cc=minchan@kernel.org \
    --cc=neilb@suse.de \
    --cc=pbonzini@redhat.com \
    --cc=peter.huangpeng@huawei.com \
    --cc=pfeiner@google.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=rlove@google.com \
    --cc=sasha.levin@oracle.com \
    --cc=stefanha@gmail.com \
    --cc=tglek@mozilla.com \
    --cc=walken@google.com \
    --cc=wenchaoqemu@gmail.com \
    --cc=yamahata@valinux.co.jp \
    --cc=zhang.zhanghailiang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).