From: David Hildenbrand <david@redhat.com>
To: Fuad Tabba <tabba@google.com>,
Quentin Perret <qperret@google.com>,
Matthew Wilcox <willy@infradead.org>,
kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
paul.walmsley@sifive.com, palmer@dabbelt.com,
aou@eecs.berkeley.edu, seanjc@google.com, brauner@kernel.org,
akpm@linux-foundation.org, xiaoyao.li@intel.com,
yilun.xu@intel.com, chao.p.peng@linux.intel.com,
jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
mic@digikod.net, vbabka@suse.cz, vannapurve@google.com,
ackerleytng@google.com, mail@maciej.szmigiero.name,
michael.roth@amd.com, wei.w.wang@intel.com,
liam.merwick@oracle.com, isaku.yamahata@gmail.com,
kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
steven.price@arm.com, quic_mnalajal@quicinc.com,
quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
quic_pheragu@quicinc.com, catalin.marinas@arm.com,
james.morse@arm.com, yuzenghui@huawei.com,
oliver.upton@linux.dev, maz@kernel.org, will@kernel.org,
keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
Date: Fri, 1 Mar 2024 12:16:54 +0100
Message-ID: <d8e6c848-e26a-4014-b0c2-f3a21fb4e636@redhat.com>
In-Reply-To: <20240229114526893-0800.eberman@hu-eberman-lv.qualcomm.com>
>> I don't think that we can assume that only a single VMA covers a page.
>>
>>> But of course, no rmap walk is always better.
>>
>> We've been thinking some more about how to handle the case where the
>> host userspace has a mapping of a page that later becomes private.
>>
>> One idea is to refuse to run the guest (i.e., exit vcpu_run() back to
>> the host with a meaningful exit reason) until the host unmaps that
>> page, and check the refcount of the page as you mentioned earlier.
>> This is essentially what the RFC I sent does (minus the bugs :) ) .
>>
>> The other idea is to use the rmap walk as you suggested to zap that
>> page. If the host tries to access that page again, it would get a
>> SIGBUS on the fault. This has the advantage that, as you'd mentioned,
>> the host doesn't need to constantly mmap() and munmap() pages. It
>> could potentially be optimised further as suggested if we have a
>> cooperating VMM that would issue a MADV_DONTNEED or something like
>> that, but that's just an optimisation and we would still need to have
>> the option of the rmap walk. However, I was wondering how practical
>> this idea would be if more than a single VMA covers a page?
>>
>
> Agree with all your points here. I changed Gunyah's implementation to do
> the unmap instead of erroring out. I didn't observe a significant
> performance difference. However, doing unmap might be a little faster
> because we can check folio_mapped() before doing the rmap walk. When
> erroring out at mmap() level, we always have to do the walk.

Right. On the mmap() level you won't really have to walk page tables, as
the munmap() already zapped the page and removed the "problematic" VMA.

Likely, you really want to avoid repeatedly calling mmap()+munmap() just
to access shared memory; but that's just my best guess about your user
space app :)
>
>> Also, there's the question of what to do if the page is gupped? In
>> this case I think the only thing we can do is refuse to run the guest
>> until the gup (and all references) are released, which also brings us
>> back to the way things (kind of) are...
>>
>
> If there are gup users who don't do FOLL_PIN, I think we either need to
> fix them or live with the possibility here? We don't have a reliable
> refcount for a folio to be safe to unmap: it might be that another vCPU
> is trying to get the same page, has incremented the refcount, and is
> waiting for the folio_lock.

Likely there could be a way to detect that when only the vCPUs are your
concern? But yes, it's nasty.

(has to be handled in either case :()

Disallowing any FOLL_GET|FOLL_PIN could work. Not sure how some
core-kernel FOLL_GET users would react to that, though.

See vma_is_secretmem() and folio_is_secretmem() in mm/gup.c, where we
disallow any FOLL_GET|FOLL_PIN of secretmem pages.

We'd need a way to teach core-mm similarly about guest_memfd, which
might end up rather tricky, but not impossible :)
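
For illustration, the secretmem behaviour mentioned above can be probed
from user space. This is only a rough sketch under assumptions: the
function name secretmem_gup_probe() is made up for this example, the
fallback syscall number 447 is assumed for headers that lack
SYS_memfd_secret, and the kernel must have secretmem support enabled
(otherwise the probe just reports "unavailable"). It uses vmsplice() as
a convenient way to trigger GUP on the mapping.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* vmsplice() */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447 /* assumption: value for older headers */
#endif

/*
 * Hypothetical helper, not part of any kernel or libc API.
 * Returns  0 if GUP on a secretmem page was refused,
 *          1 if vmsplice() unexpectedly succeeded,
 *         -1 if secretmem is unavailable or the setup failed.
 */
int secretmem_gup_probe(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	int pipefd[2];
	int result = -1;
	char *page;
	int fd;

	/* glibc has no wrapper for memfd_secret(), so use syscall(). */
	fd = (int)syscall(SYS_memfd_secret, 0);
	if (fd < 0)
		return -1;

	if (page_size <= 0 || ftruncate(fd, page_size) < 0) {
		close(fd);
		return -1;
	}

	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (page == MAP_FAILED) {
		close(fd);
		return -1;
	}
	page[0] = 'x'; /* fault the secretmem page in */

	if (pipe(pipefd) == 0) {
		struct iovec iov = { .iov_base = page, .iov_len = 1 };

		/* vmsplice() has to GUP the page; secretmem should refuse. */
		result = vmsplice(pipefd[1], &iov, 1, 0) < 0 ? 0 : 1;
		close(pipefd[0]);
		close(pipefd[1]);
	}

	munmap(page, page_size);
	close(fd);
	return result;
}
```

On a kernel with secretmem enabled I would expect the probe to return 0;
a comparable check for guest_memfd is what core-mm would have to learn.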
> This problem exists whether we block the
> mmap() or do SIGBUS.

There is work on doing more conversion to FOLL_PIN, but some cases are
harder to convert. Most of O_DIRECT should be using it nowadays, but
some other known use cases don't.

The simplest and readily-available example is still vmsplice(). I don't
think it was fixed yet to use FOLL_PIN.

Use vmsplice() to pin the page in the pipe (read-only). Unmap the VMA.
You can read the page any time later by reading from the pipe.

So I wouldn't bet on all relevant cases being gone in the near future.
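
The vmsplice() sequence described above can be sketched in user space
roughly as follows. This is a minimal illustration, not a reproducer for
guest_memfd: it uses an ordinary anonymous page, and the function name
read_after_unmap() and its parameters are made up for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice() */
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy msg into a fresh anonymous page, vmsplice()
 * that page into a pipe, unmap the page, and only then read the data
 * back from the pipe. Returns the number of bytes read, or -1 on error.
 */
ssize_t read_after_unmap(const char *msg, char *out, size_t out_len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len = strlen(msg) + 1;
	int pipefd[2];
	ssize_t ret = -1;
	char *page;

	if (page_size <= 0 || len > (size_t)page_size || len > out_len)
		return -1;

	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	memcpy(page, msg, len); /* faults the page in */

	if (pipe(pipefd) < 0) {
		munmap(page, page_size);
		return -1;
	}

	/* The pipe now holds a reference to the page, not a copy of it. */
	struct iovec iov = { .iov_base = page, .iov_len = len };
	if (vmsplice(pipefd[1], &iov, 1, 0) != (ssize_t)len) {
		munmap(page, page_size);
		goto out;
	}

	/* The VMA is gone, but the page stays alive via the pipe. */
	if (munmap(page, page_size) == 0)
		ret = read(pipefd[0], out, len);
out:
	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}
```

Calling read_after_unmap("hello", buf, sizeof(buf)) should hand back the
string even though the mapping no longer exists by the time of the read.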
--
Cheers,
David / dhildenb