Re: [PATCH 2/3] uffd: Introduce the v2 API

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Pavel Emelyanov <xemul@parallels.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>,
	Linux API <linux-api@vger.kernel.org>,
	Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Subject: Re: [PATCH 2/3] uffd: Introduce the v2 API
Date: Thu, 30 Apr 2015 12:50:43 +0300	[thread overview]
Message-ID: <5541FAF3.4080008@parallels.com> (raw)
In-Reply-To: <20150427211236.GB24035@redhat.com>

On 04/28/2015 12:12 AM, Andrea Arcangeli wrote:
> Hello,
> 
> On Thu, Apr 23, 2015 at 09:29:07AM +0300, Pavel Emelyanov wrote:
>> So your proposal is to always report 16 bytes per PF from read() and
>> let userspace decide itself how to handle the result?
> 
> Reading 16bytes for each userfault (instead of 8) and sharing the same
> read(2) protocol (UFFD_API) for both the cooperative and
> non-cooperative usages, is something I just suggested for
> consideration after reading your patchset.
> 
> The pros of using a single protocol for both is that it would reduce
> amount of code and there would be just one file operation for the
> .read method. The cons is that it will waste 8 bytes per userfault in
> terms of memory footprint. The other major cons is that it would force
> us to define the format of the non cooperative protocol now despite it's
> not fully finished yet.
> 
> I'm also ok with two protocols if nobody else objects, but if we use
> two protocols, we should at least use different file operation methods
> and use __always_inline with constants passed as parameter to optimize
> away the branches at build time. This way we get the reduced memory
> footprint in the read syscall without other runtime overhead
> associated with it.

OK. I would go with two protocols then and will reshuffle the code to
use two ops.

>>>> +struct uffd_v2_msg {
>>>> +	__u64	type;
>>>> +	__u64	arg;
>>>> +};
>>>> +
>>>> +#define UFFD_PAGEFAULT	0x1
>>>> +
>>>> +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
>>>> +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
>>>> +
>>>> +/*
>>>> + * Lower PAGE_SHIFT bits are used to report those supported
>>>> + * by the pagefault message itself. Other bits are used to
>>>> + * report the message types v2 API supports
>>>> + */
>>>> +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
>>>> +
>>>
>>> And why exactly is this 12 hardcoded?
>>
>> Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
>> would be OK to have different shifts in different arches.
>>
>> But taking into account your comment that bits field id bad for these
>> values, if we introduce the new .features one for api message, then this
>> 12 will just go away.
> 
> Ok.
> 
>>> And which field should be masked
>>> with the bits? In the V1 protocol it was the "arg" (userfault address)
>>> not the "type". So this is a bit confusing and probably requires
>>> simplification.
>>
>> I see. Actually I decided that since bits higher than 12th (for x86) is
>> always 0 in api message (no bits allowed there, since pfn sits in this
>> place), it would be OK to put non-PF bits there.
> 
> That was ok yes.
> 
>> Should I better introduce another .features field in uffd API message?
> 
> What about renaming "uffdio_api.bits" to "uffdio_api.features"?

Yup, agreed, will do.

> And then we set uffdio_api.features to
> UFFD_FEATURE_WRITE|UFFD_FEATURE_WP|UFFD_FEATURE_FORK as needed.
> 
> UFFD_FEATURE_WRITE would always be enabled, it's there only in case we
> want to disable it later (mostly if some arch has trouble with it,
> which is unlikely, but qemu doesn't need that bit of information at
> all for example so qemu would be fine if UFFD_FEATURE_WRITE
> disappears).
> 
> UFFD_FEATURE_WP would signal also that the wrprotection feature (not
> implemented yet) is available (then later the register ioctl would
> also show the new wrprotection ioctl numbers available to mangle the
> wrprotection). The UFFD_FEATURE_WP feature in the cooperative usage
> (qemu live snapshotting) can use the UFFD_API first protocol too.
> 
> UFFD_FEATURE_FORK would be returned if the UFFD_API_V2 was set in
> uffdio.api, and it would be part of the incremental non-cooperative
> patchset.
> 
> We could also not define "UFFD_FEATURE_FORK" at all and imply that
> fork/mremap/MADV_DONTNEED are all available if UFFD_API_V2 uffdio_api
> ioctl succeeds... That's only doable if we keep two different read
> protocols though. UFFD_FEATURE_FORK (or UFFD_FEATURE_NON_COOPERATIVE)
> are really strictly needed only if we share the same read(2) protocol
> for both the cooperative and non-cooperative usages.
> 
> The idea is that there's not huge benefit of only having the "fork"
> feature supported but missing "mremap" and "madv_dontneed".
> 
> In fact if a new syscall that works like mremap is added later (call
> it mremap2), we would need to fail the UFFDIO_API_V2 and require a
> UFFDIO_API_V3 for such kernel that can return a new mremap2 type of
> event. Userland couldn't just assume it is ok to use postcopy live
> migration for containers, because
> UFFD_FEATURE_FORK|MREMAP|MADV_DONTNEED are present in the
> uffdio.features when it asked for API_V2. There shall be something
> that tells userland "hey there's a new mremap2 that the software
> inside the container can run on top of this kernel, so you are going
> to get a new mremap2 type of userfault event too".

But that's why I assumed to use per-sycall bits -- UFFD_FEATURE_FORK,
_MREMAP, _MWHATEVER so that userspace can read those bits and make sure
it contains only bits it understands with other bits set to zero.

If we had only one UFFD_API_NON_COOPERATIVE userspace would have no idea
what kind of messages it may receive.

> In any case, regardless of how we solve the above,
> "uffdio_api.features" sounds better than ".bits".
> 
> If we retain two different UFFD_API, we'll be able to freeze the
> current one and decide later if
> UFFD_FEATURE_FORK|UFFD_FEATURE_MREMAP|UFFD_FEATURE_MADV_DONTNEED shall
> be added to the .features, or if to rely on UFFD_API_V2 succeeding to
> let userland know that the non-cooperative usage is fully supported by
> the kernel.
> 
> Not having to freeze these details now is the main benefit of having
> two different UFFD_API after all...
> .
> 

-- Pavel

next prev parent reply	other threads:[~2015-04-30  9:50 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-18 19:34 [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Pavel Emelyanov
2015-03-18 19:34 ` [PATCH 1/3] uffd: Tossing bits around Pavel Emelyanov
2015-03-18 19:35 ` [PATCH 2/3] uffd: Introduce the v2 API Pavel Emelyanov
2015-04-21 12:18   ` Andrea Arcangeli
2015-04-23  6:29     ` Pavel Emelyanov
2015-04-27 21:12       ` Andrea Arcangeli
2015-04-30  9:50         ` Pavel Emelyanov [this message]
2015-03-18 19:35 ` [PATCH 3/3] uffd: Introduce fork() notification Pavel Emelyanov
2015-04-21 12:02 ` [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Andrea Arcangeli
2015-04-23  6:34   ` Pavel Emelyanov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5541FAF3.4080008@parallels.com \
    --to=xemul@parallels.com \
    --cc=aarcange@redhat.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=sanidhya.gatech@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox