public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Quentin Perret <qperret@google.com>
To: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Fuad Tabba <tabba@google.com>,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
	chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
	paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu, seanjc@google.com,
	viro@zeniv.linux.org.uk, brauner@kernel.org,
	akpm@linux-foundation.org, xiaoyao.li@intel.com,
	yilun.xu@intel.com, chao.p.peng@linux.intel.com,
	jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
	yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
	mic@digikod.net, vbabka@suse.cz, vannapurve@google.com,
	ackerleytng@google.com, mail@maciej.szmigiero.name,
	michael.roth@amd.com, wei.w.wang@intel.com,
	liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
	steven.price@arm.com, quic_mnalajal@quicinc.com,
	quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
	quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, catalin.marinas@arm.com,
	james.morse@arm.com, yuzenghui@huawei.com,
	oliver.upton@linux.dev, maz@kernel.org, will@kernel.org,
	keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
Date: Wed, 28 Feb 2024 12:44:47 +0000	[thread overview]
Message-ID: <Zd8qvwQ05xBDXEkp@google.com> (raw)
In-Reply-To: <755911e5-8d4a-4e24-89c7-a087a26ec5f6@redhat.com>

On Wednesday 28 Feb 2024 at 12:11:30 (+0100), David Hildenbrand wrote:
> On 28.02.24 11:48, Quentin Perret wrote:
> > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote:
> > > > 
> > > > Ah, this was something I hadn't thought about. I think both Fuad and I
> > > > need to update our series to check the refcount rather than mapcount
> > > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me).
> > > 
> > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But
> > > checking for any unexpected references might be better (there are still some
> > > GUP users that don't use FOLL_PIN).
> > 
> > As a non-mm person I'm not sure to understand to consequences of holding
> > a GUP pin to a page that is not covered by any VMA. The absence of VMAs
> > imply that userspace cannot access the page right? Presumably the kernel
> > can't be coerced into accessing that page either? Is that correct?
> 
> Simple example: register the page using an iouring fixed buffer, then unmap
> the VMA. iouring now has the page pinned and can read/write it using an
> address in the kernel vitual address space (direct map).
> 
> Then, you can happily make the kernel read/write that page using iouring,
> even though no VMA still covers/maps that page.

Makes sense, and yes that would be a major bug if we let that happen,
thanks for the explanation.

> [...]
> 
> > > Instead of
> > > 
> > > 1) Converting a page to private only if there are no unexpected
> > >     references (no mappings, GUP pins, ...) and no VMAs covering it where
> > >     we could fault it in later
> > > 2) Disallowing mmap when the range would contain any private page
> > > 3) Handling races between mmap and page conversion
> > 
> > The one thing that makes the second option cleaner from a userspace
> > perspective (IMO) is that the conversion to private is happening lazily
> > during guest faults. So whether or not an mmapped page can indeed be
> > accessed from userspace will be entirely undeterministic as it depends
> > on the guest faulting pattern which userspace is entirely unaware of.
> > Elliot's suggestion would prevent spurious crashes caused by that
> > somewhat odd behaviour, though arguably sane userspace software
> > shouldn't be doing that to start with.
> 
> The last sentence is the important one. User space should not access that
> memory. If it does, it gets a slap on the hand. Because it should not access
> that memory.
> 
> We might even be able to export to user space which pages are currently
> accessible and which ones not (e.g., pagemap), although it would be racy as
> long as the VM is running and can trigger a conversion.
> 
> > 
> > To add a layer of paint to the shed, the usage of SIGBUS for
> > something that is really a permission access problem doesn't feel
> 
> SIGBUS stands for "BUS error (bad memory access)."
> 
> Which makes sense, if you try accessing something that can no longer be
> accessed. It's now inaccessible. Even if it is temporarily.
> 
> Just like a page with an MCE error. Swapin errors. Etc. You cannot access
> it.
> 
> It might be a permission problem on the pKVM side, but it's not the
> traditional "permission problem" as in mprotect() and friends. You cannot
> resolve that permission problem yourself. It's a higher entity that turned
> that memory inaccessible.

Well that's where I'm not sure to agree. Userspace can, in fact, get
back all of that memory by simply killing the protected VM. With the
approach suggested here, the guestmem pages are entirely accessible to
the host until they are attached to a running protected VM which
triggers the protection. It is very much userspace saying "I promise not
to touch these pages from now on" when it does that, in a way that I
personally find very comparable to the mprotect case. It is not some
other entity that pulls the carpet from under userspace's feet, it is
userspace being inconsistent with itself that causes the issue here, and
that's why SIGBUS feels kinda wrong as it tends to be used to report
external errors of some sort.

> > appropriate. Allocating memory via guestmem and donating that to a
> > protected guest is a way for userspace to voluntarily relinquish access
> > permissions to the memory it allocated. So a userspace process violating
> > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the
> > point that signal would be sent, the page would have been accounted
> > against that userspace process, so not sure the paging examples that
> > were discussed earlier are exactly comparable. To illustrate that
> > differently, given that pKVM and Gunyah use MMU-based protection, there
> > is nothing architecturally that prevents a guest from sharing a page
> > back with Linux as RO.
> 
> Sure, then allow page faults that allow for reads and give a signal on write
> faults.
> 
> In the scenario, it even makes more sense to not constantly require new
> mmap's from user space just to access a now-shared page.
> 
> > Note that we don't currently support this, so I
> > don't want to conflate this use case, but that hopefully makes it a
> > little more obvious that this is a "there is a page, but you don't
> > currently have the permission to access it" problem rather than "sorry
> > but we ran out of pages" problem.
> 
> We could user other signals, at least as the semantics are clear and it's
> documented. Maybe SIGSEGV would be warranted.
> 
> I consider that a minor detail, though.
>
> Requiring mmap()/munmap() dances just to access a page that is now shared
> from user space sounds a bit suboptimal. But I don't know all the details of
> the user space implementation.

Agreed, if we could save having to mmap() each page that gets shared
back that would be a nice performance optimization.

> "mmap() the whole thing once and only access what you are supposed to
> access" sounds reasonable to me. If you don't play by the rules, you get a
> signal.

"... you get a signal, or maybe you don't". But yes I understand your
point, and as per the above there are real benefits to this approach so
why not.

What do we expect userspace to do when a page goes from shared back to
being guest-private, because e.g. the guest decides to unshare? Use
munmap() on that page? Or perhaps an madvise() call of some sort? Note
that this will be needed when starting a guest as well, as userspace
needs to copy the guest payload in the guestmem file prior to starting
the protected VM.

Thanks,
Quentin

  reply	other threads:[~2024-02-28 12:44 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-22 16:10 [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 01/26] KVM: Split KVM memory attributes into user and kernel attributes Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 02/26] KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 03/26] KVM: Add restricted support for mapping guestmem by the host Fuad Tabba
2024-02-22 16:28   ` David Hildenbrand
2024-02-26  8:58     ` Fuad Tabba
2024-02-26  9:57       ` David Hildenbrand
2024-02-26 17:30         ` Fuad Tabba
2024-02-27  7:40           ` David Hildenbrand
2024-02-22 16:10 ` [RFC PATCH v1 04/26] KVM: Don't allow private attribute to be set if mapped by host Fuad Tabba
2024-04-17 23:27   ` Sean Christopherson
2024-04-18 10:54   ` David Hildenbrand
2024-02-22 16:10 ` [RFC PATCH v1 05/26] KVM: Don't allow private attribute to be removed for unmappable memory Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 06/26] KVM: Implement kvm_(read|/write)_guest_page for private memory slots Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 07/26] KVM: arm64: Turn llist of pinned pages into an rb-tree Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 08/26] KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 09/26] KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 10/26] KVM: arm64: Avoid unnecessary unmap walk " Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 11/26] KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 12/26] KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 13/26] KVM: arm64: Create hypercall return handler Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 14/26] KVM: arm64: Refactor code around handling return from host to guest Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 15/26] KVM: arm64: Rename kvm_pinned_page to kvm_guest_page Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 16/26] KVM: arm64: Add a field to indicate whether the guest page was pinned Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 17/26] KVM: arm64: Do not allow changes to private memory slots Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 18/26] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 19/26] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 20/26] KVM: arm64: Track sharing of memory from protected guest to host Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 21/26] KVM: arm64: Mark a protected VM's memory as unmappable at initialization Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 22/26] KVM: arm64: Handle unshare on way back to guest entry rather than exit Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 23/26] KVM: arm64: Check that host unmaps memory unshared by guest Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 24/26] KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 25/26] KVM: arm64: Enable private memory support when pKVM is enabled Fuad Tabba
2024-02-22 16:10 ` [RFC PATCH v1 26/26] KVM: arm64: Enable private memory kconfig for arm64 Fuad Tabba
2024-02-22 23:43 ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Elliot Berman
2024-02-23  0:35   ` folio_mmapped Matthew Wilcox
2024-02-26  9:28     ` folio_mmapped David Hildenbrand
2024-02-26 21:14       ` folio_mmapped Elliot Berman
2024-02-27 14:59         ` folio_mmapped David Hildenbrand
2024-02-28 10:48           ` folio_mmapped Quentin Perret
2024-02-28 11:11             ` folio_mmapped David Hildenbrand
2024-02-28 12:44               ` Quentin Perret [this message]
2024-02-28 13:00                 ` folio_mmapped David Hildenbrand
2024-02-28 13:34                   ` folio_mmapped Quentin Perret
2024-02-28 18:43                     ` folio_mmapped Elliot Berman
2024-02-28 18:51                       ` Quentin Perret
2024-02-29 10:04                     ` folio_mmapped David Hildenbrand
2024-02-29 19:01                       ` folio_mmapped Fuad Tabba
2024-03-01  0:40                         ` folio_mmapped Elliot Berman
2024-03-01 11:16                           ` folio_mmapped David Hildenbrand
2024-03-04 12:53                             ` folio_mmapped Quentin Perret
2024-03-04 20:22                               ` folio_mmapped David Hildenbrand
2024-03-01 11:06                         ` folio_mmapped David Hildenbrand
2024-03-04 12:36                       ` folio_mmapped Quentin Perret
2024-03-04 19:04                         ` folio_mmapped Sean Christopherson
2024-03-04 20:17                           ` folio_mmapped David Hildenbrand
2024-03-04 21:43                             ` folio_mmapped Elliot Berman
2024-03-04 21:58                               ` folio_mmapped David Hildenbrand
2024-03-19  9:47                                 ` folio_mmapped Quentin Perret
2024-03-19  9:54                                   ` folio_mmapped David Hildenbrand
2024-03-18 17:06                             ` folio_mmapped Vishal Annapurve
2024-03-18 22:02                               ` folio_mmapped David Hildenbrand
2024-03-18 23:07                                 ` folio_mmapped Vishal Annapurve
2024-03-19  0:10                                   ` folio_mmapped Sean Christopherson
2024-03-19 10:26                                     ` folio_mmapped David Hildenbrand
2024-03-19 13:19                                       ` folio_mmapped David Hildenbrand
2024-03-19 14:31                                       ` folio_mmapped Will Deacon
2024-03-19 23:54                                         ` folio_mmapped Elliot Berman
2024-03-22 16:36                                           ` Will Deacon
2024-03-22 18:46                                             ` Elliot Berman
2024-03-27 19:31                                               ` Will Deacon
2024-03-22 17:52                                         ` folio_mmapped David Hildenbrand
2024-03-22 21:21                                           ` folio_mmapped David Hildenbrand
2024-03-26 22:04                                             ` folio_mmapped Elliot Berman
2024-03-27 17:50                                               ` folio_mmapped David Hildenbrand
2024-03-27 19:34                                           ` folio_mmapped Will Deacon
2024-03-28  9:06                                             ` folio_mmapped David Hildenbrand
2024-03-28 10:10                                               ` folio_mmapped Quentin Perret
2024-03-28 10:32                                                 ` folio_mmapped David Hildenbrand
2024-03-28 10:58                                                   ` folio_mmapped Quentin Perret
2024-03-28 11:41                                                     ` folio_mmapped David Hildenbrand
2024-03-29 18:38                                                       ` folio_mmapped Vishal Annapurve
2024-04-04  0:15                                             ` folio_mmapped Sean Christopherson
2024-03-19 15:04                                       ` folio_mmapped Sean Christopherson
2024-03-22 17:16                                         ` folio_mmapped David Hildenbrand
2024-02-26  9:03   ` [RFC PATCH v1 00/26] KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support Fuad Tabba
2024-02-23 12:00 ` Alexandru Elisei
2024-02-26  9:05   ` Fuad Tabba
2024-02-26  9:47 ` David Hildenbrand
2024-02-27  9:37   ` Fuad Tabba
2024-02-27 14:41     ` David Hildenbrand
2024-02-27 14:49       ` David Hildenbrand
2024-02-28  9:57       ` Fuad Tabba
2024-02-28 10:12         ` David Hildenbrand
2024-02-28 14:01           ` Quentin Perret
2024-02-29  9:51             ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zd8qvwQ05xBDXEkp@google.com \
    --to=qperret@google.com \
    --cc=ackerleytng@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=amoorthy@google.com \
    --cc=anup@brainfault.org \
    --cc=aou@eecs.berkeley.edu \
    --cc=brauner@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=chao.p.peng@linux.intel.com \
    --cc=chenhuacai@kernel.org \
    --cc=david@redhat.com \
    --cc=dmatlack@google.com \
    --cc=isaku.yamahata@gmail.com \
    --cc=isaku.yamahata@intel.com \
    --cc=james.morse@arm.com \
    --cc=jarkko@kernel.org \
    --cc=keirf@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=liam.merwick@oracle.com \
    --cc=linux-mm@kvack.org \
    --cc=mail@maciej.szmigiero.name \
    --cc=maz@kernel.org \
    --cc=mic@digikod.net \
    --cc=michael.roth@amd.com \
    --cc=mpe@ellerman.id.au \
    --cc=oliver.upton@linux.dev \
    --cc=palmer@dabbelt.com \
    --cc=paul.walmsley@sifive.com \
    --cc=pbonzini@redhat.com \
    --cc=quic_cvanscha@quicinc.com \
    --cc=quic_mnalajal@quicinc.com \
    --cc=quic_pderrin@quicinc.com \
    --cc=quic_pheragu@quicinc.com \
    --cc=quic_svaddagi@quicinc.com \
    --cc=quic_tsoni@quicinc.com \
    --cc=seanjc@google.com \
    --cc=steven.price@arm.com \
    --cc=suzuki.poulose@arm.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wei.w.wang@intel.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xiaoyao.li@intel.com \
    --cc=yilun.xu@intel.com \
    --cc=yu.c.zhang@linux.intel.com \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox