Re: folio_mmapped - David Hildenbrand


From: David Hildenbrand <david@redhat.com>
To: Vishal Annapurve <vannapurve@google.com>
Cc: Sean Christopherson <seanjc@google.com>,
	Quentin Perret <qperret@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Fuad Tabba <tabba@google.com>,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
	chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
	paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk,
	brauner@kernel.org, akpm@linux-foundation.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com,
	chao.p.peng@linux.intel.com, jarkko@kernel.org,
	amoorthy@google.com, dmatlack@google.com,
	yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
	mic@digikod.net, vbabka@suse.cz, ackerleytng@google.com,
	mail@maciej.szmigiero.name, michael.roth@amd.com,
	wei.w.wang@intel.com, liam.merwick@oracle.com,
	isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
	suzuki.poulose@arm.com, steven.price@arm.com,
	quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
	quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
	quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
	catalin.marinas@arm.com, james.morse@arm.com,
	yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
	will@kernel.org, keirf@google.com, linux-mm@kvack.org
Subject: Re: folio_mmapped
Date: Mon, 18 Mar 2024 23:02:17 +0100
Message-ID: <7470390a-5a97-475d-aaad-0f6dfb3d26ea@redhat.com>
In-Reply-To: <CAGtprH-17s7ipmr=+cC6YuH-R0Bvr7kJS7Zo9a+Dc9VEt2BAcQ@mail.gmail.com>

On 18.03.24 18:06, Vishal Annapurve wrote:
> On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.03.24 20:04, Sean Christopherson wrote:
>>> On Mon, Mar 04, 2024, Quentin Perret wrote:
>>>>> As discussed in the sub-thread, that might still be required.
>>>>>
>>>>> One could think about completely forbidding GUP on these mmap'ed
>>>>> guest-memfds. But likely, there might be use cases in the future where you
>>>>> want to use GUP on shared memory inside a guest_memfd.
>>>>>
>>>>> (the iouring example I gave might currently not work because
>>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
>>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
>>>>> details)
>>>>
>>>> Perhaps it would be wise to start with GUP being forbidden if the
>>>> current users do not need it (not sure if that is the case in Android,
>>>> I'll check) ? We can always relax this constraint later when/if the
>>>> use-cases arise, which is obviously much harder to do the other way
>>>> around.
>>>
>>> +1000.  At least on the KVM side, I would like to be as conservative as possible
>>> when it comes to letting anything other than the guest access guest_memfd.
>>
>> So we'll have to do it similar to any occurrences of "secretmem" in
>> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
>> similar to e.g., folio_is_secretmem().
>>
>> IIRC, we might not be able to de-reference the actual mapping because it
>> could get freed concurrently ...
>>
>> That will then prohibit any kind of GUP access to these pages, including
>> reading/writing for ptrace/debugging purposes, for core dumping purposes
>> etc. But at least, you know that nobody was able to obtain page
>> references using GUP that might be used for reading/writing later.
>>
> 
> There has been little discussion about supporting 1G pages with
> guest_memfd for TDX/SNP or pKVM. I would like to restart this
> discussion [1]. 1G pages should be a very important usecase for guest
> memfd, especially considering large VM sizes supporting confidential
> GPU/TPU workloads.
> 
> Using separate backing stores for private and shared memory ranges is
> not going to work effectively when using 1G pages. Consider the
> following scenario of memory conversion when using 1G pages to back
> private memory:
> * Guest requests conversion of 4KB range from private to shared, host
> in response ideally does following steps:
>      a) Updates the guest memory attributes
>      b) Unbacks the corresponding private memory
>      c) Allocates corresponding shared memory or let it be faulted in
> when guest accesses it
> 
> Step b above can't be skipped here, otherwise we would have two
> physical pages (1 backing private memory, another backing the shared
> memory) for the same GPA range causing "double allocation".
> 
> With 1G pages, it would be difficult to punch KBs or even MBs sized
> hole since to support that:
> 1G page would need to be split (which hugetlbfs doesn't support today,
> for good reasons), causing -
>          - loss of vmemmap optimization [3]
>          - losing ability to reconstitute the huge page again,
> especially as private pages in CVMs are not relocatable today,
> increasing overall fragmentation over time.
>                - unless a smarter algorithm is devised for memory
> reclaim to reconstitute large pages for unmovable memory.
> 
> With the above limitations in place, best thing could be to allow:
>   - single backing store for both shared and private memory ranges
>   - host userspace to mmap the guest memfd (as this series is trying to do)
>   - allow userspace to fault in memfd file ranges that correspond to
> shared GPA ranges
>       - pagetable mappings will need to be restricted to shared memory
> ranges causing higher granularity mappings (somewhat similar to what
> HGM series from James [2] was trying to do) than 1G.
>   - Allow IOMMU also to map those pages (pfns would be requested using
> get_user_pages* APIs) to allow devices to access shared memory. IOMMU
> management code would have to be enlightened or somehow restricted to
> map only shared regions of guest memfd.
>   - Upon conversion from shared to private, host will have to ensure
> that there are no mappings/references present for the memory ranges
> being converted to private.
> 
> If the above usecase sounds reasonable, GUP access to guest memfd
> pages should be allowed.

To say it with nice words: "Not a fan".

First, I don't think only 1 GiB pages will be problematic. Already 2 MiB
pages will be problematic, and so far it is even unclear how guest_memfd
will consume them in a way acceptable to upstream MM. Likely not using
hugetlb, from what I recall after the previous discussions with Mike.

Second, we should find better ways to let an IOMMU map these pages, 
*not* using GUP. There were already discussions on providing a similar 
fd+offset-style interface instead. GUP really sounds like the wrong 
approach here. Maybe we should look into passing not only guest_memfd, 
but also "ordinary" memfds.

Third, I don't think we should be using huge pages where huge pages 
don't make any sense. Using a 1 GiB page so the VM will convert some 
pieces to map it using PTEs will destroy the whole purpose of using 1 
GiB pages. It doesn't make any sense.
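
To put rough numbers on that, here is a back-of-the-envelope sketch. It
is only an illustration: the helper names are made up for this mail, and
it assumes 4 KiB base pages with 512 entries per page-table page (as on
x86-64); other architectures/configurations will differ.

```c
#include <assert.h>

#define SZ_4K (4UL << 10)
#define SZ_2M (2UL << 20)
#define SZ_1G (1UL << 30)

/* Assumption: 512 entries per page-table page (x86-64 with 4 KiB pages). */
#define ENTRIES_PER_TABLE 512UL

/* Leaf entries needed to map `size` bytes using `granule`-sized mappings. */
static unsigned long leaf_entries(unsigned long size, unsigned long granule)
{
	assert(granule != 0 && size % granule == 0);
	return size / granule;
}

/* Page-table pages needed to hold the PTEs when mapping `size` with 4 KiB PTEs. */
static unsigned long pte_table_pages(unsigned long size)
{
	unsigned long ptes = leaf_entries(size, SZ_4K);

	return (ptes + ENTRIES_PER_TABLE - 1) / ENTRIES_PER_TABLE;
}
```

Under these assumptions, leaf_entries(SZ_1G, SZ_1G) is 1, whereas
leaf_entries(SZ_1G, SZ_4K) is 262144 and pte_table_pages(SZ_1G) is 512:
a single gigantic mapping turns into 262144 PTEs spread over 512
page-table pages.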

A direction that might make sense is either (A) enlightening the VM about 
the granularity in which memory can be converted (but also problematic 
for 1 GiB pages) and/or (B) physically restricting the memory that can 
be converted.

For example, one could create a GPA layout where some regions are backed 
by gigantic pages that cannot be converted/can only be converted as a 
whole, and some are backed by 4k pages that can be converted back and 
forth. We'd use multiple guest_memfds for that. I recall that physically 
restricting such conversions/locations (e.g., for bounce buffers) in 
Linux was already discussed somewhere, but I don't recall the details.
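
As a purely hypothetical sketch of what such a layout description could
look like (none of these names exist in KVM or guest_memfd; the struct,
the helper and the example layout are invented for illustration), each
region carries the granularity at which it may be converted, and a
conversion request is only accepted if it stays inside one region and
is aligned to that granularity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KIB (1ULL << 10)
#define MIB (1ULL << 20)
#define GIB (1ULL << 30)

/* Hypothetical description of one GPA region backed by its own guest_memfd. */
struct gpa_region {
	uint64_t start;           /* first GPA of the region */
	uint64_t size;            /* region size in bytes */
	uint64_t convert_granule; /* conversion unit in bytes; 0: never convertible */
};

/* Example: 4 GiB of gigantic pages, then 64 MiB of 4k pages (e.g., bounce buffers). */
static const struct gpa_region example_layout[] = {
	{ .start = 0,       .size = 4 * GIB,  .convert_granule = 1 * GIB },
	{ .start = 4 * GIB, .size = 64 * MIB, .convert_granule = 4 * KIB },
};

/* Would converting [gpa, gpa + len) be permitted under this layout? */
static bool can_convert(const struct gpa_region *layout, size_t nr,
			uint64_t gpa, uint64_t len)
{
	assert(layout != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		const struct gpa_region *r = &layout[i];
		uint64_t off;

		if (gpa < r->start || gpa - r->start >= r->size)
			continue;
		off = gpa - r->start;
		/* Reject empty ranges and ranges crossing the region end. */
		if (len == 0 || len > r->size - off)
			return false;
		if (r->convert_granule == 0)
			return false;
		return off % r->convert_granule == 0 &&
		       len % r->convert_granule == 0;
	}
	return false; /* GPA not covered by any region */
}
```

With this example layout, converting a single 4k page at 4 GiB + 4 KiB
is accepted, converting a 4k page at GPA 0x1000 inside the gigantic
region is rejected, and converting the whole 1 GiB unit at GPA 1 GiB is
accepted.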

It's all not trivial and not easy to get "clean".

Concluding that individual pieces of a 1 GiB / 2 MiB huge page should 
not be converted back and forth might be reasonable. Although I'm sure 
people will argue the opposite and develop hackish solutions in 
desperate ways to make it work somehow.

Huge pages, and especially gigantic pages, are simply a bad fit if the 
VM will convert individual 4k pages.


But to answer your last question: we might be able to avoid GUP by using 
a different mapping API, similar to the ones KVM now provides.

-- 
Cheers,

David / dhildenb
