From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 78A4EC4167D for ; Tue, 31 Oct 2023 22:13:29 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E2C1E6B0332; Tue, 31 Oct 2023 18:13:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DDC226B0333; Tue, 31 Oct 2023 18:13:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CA3EC6B0334; Tue, 31 Oct 2023 18:13:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id BA79B6B0332 for ; Tue, 31 Oct 2023 18:13:28 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 49CC180BCD for ; Tue, 31 Oct 2023 22:13:28 +0000 (UTC) X-FDA: 81407158896.28.00DD287 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) by imf13.hostedemail.com (Postfix) with ESMTP id 5A14220010 for ; Tue, 31 Oct 2023 22:13:26 +0000 (UTC) Authentication-Results: imf13.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=qTQ5xiXP; spf=pass (imf13.hostedemail.com: domain of 3BXxBZQYKCDQiUQdZSWeeWbU.SecbYdkn-ccalQSa.ehW@flex--seanjc.bounces.google.com designates 209.85.128.202 as permitted sender) smtp.mailfrom=3BXxBZQYKCDQiUQdZSWeeWbU.SecbYdkn-ccalQSa.ehW@flex--seanjc.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1698790406; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zs8biX5N196Q/2fqRFnHuCbiMBjZx3eWIMZhT9G5lkM=; b=xy0iM+SrHStieEx8ye73X6EtMD3KjeWmKnRHm7w8SUOyMbob5uhfzBEzjn8EXXstf/8reU JZnpfRc4xPDSL21bFGbfmttrCx9HUUwSTyrHd9kg4EriDd4wYRbxblo/zzqD2fgfj0Wscj FTvLgctIAfoei6TwY4JtwpB87hfnXo4= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1698790406; a=rsa-sha256; cv=none; b=N6Q9m9ka4l23jaqBwzkzHiCCdOAZgSNhB19IGWyK2E1gB5/7K1NyENLa72c4vOL514H0Lm k8SwT/KOfvyPpI2QvEvazaPTTPX5hKxd/IOCGRg5HaiFKt+Lgt7tMlZp4itPqF2cyjE9uA 0WwMNVluD8Jl9HigZoxc3YwR/DO2Ic8= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=qTQ5xiXP; spf=pass (imf13.hostedemail.com: domain of 3BXxBZQYKCDQiUQdZSWeeWbU.SecbYdkn-ccalQSa.ehW@flex--seanjc.bounces.google.com designates 209.85.128.202 as permitted sender) smtp.mailfrom=3BXxBZQYKCDQiUQdZSWeeWbU.SecbYdkn-ccalQSa.ehW@flex--seanjc.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-5b3715f3b41so12179947b3.2 for ; Tue, 31 Oct 2023 15:13:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1698790405; x=1699395205; darn=kvack.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=zs8biX5N196Q/2fqRFnHuCbiMBjZx3eWIMZhT9G5lkM=; b=qTQ5xiXPWEFg5HtJ9XDc5/sP5GpM36CGesB2jljQPwYMF+KjVYIkKRi7d3uTwho9UV Y2Y1iqF6Qje+xqpzN8zlToxRD58nhg1X1IOP74WYsN353Wfu7WJtsBIPk6XVcUDGL0lv 8wSCjdqiQ3NIPiVx2jVU6yYzjvRS1L9hSBVI3ibR/z77V50u2ZLZOw5SynBabkbLnQi3 VXtNEpyG/wdhJn9EFDh5Evh1V4HauE41JzAHri8EUSxtPxxD3mmB5s+V1RqHJAiusMCp NEcYcclrjJ+dugXKkk3sa3FSgqb+Z1RkbEjMHWZcJtOPAiLhQBhnUkGpbLcv3qZWBt10 apqA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1698790405; x=1699395205; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=zs8biX5N196Q/2fqRFnHuCbiMBjZx3eWIMZhT9G5lkM=; b=Bik9Ad+TyqNw4K8YYfvGOMCfV+G5WF/BjBtK23ovG4nBEyQYcjNOd99ttjKj1SCkeG 3njChHSIV1ViXsfcWMTG9vg0+71Pc6vWSBticwn112p3+c0ohslvi4oHiG//qs5/AJg+ OiiFFpx6q5irJ2iwgq7Hq+5FmsUD1/BfUvPY3ZYc2IYcd3nw74L4h8ngBDfeqnBQvQQE IBEPwDhxSURdN77tB52Zjb3E96LJP+Lx7efa4IJtWMF2+DbITOuy8ZFehU0K0tgBgvw0 BR4NM3x4XLXhiuLBQ/qVWGcX4cSFWRQ+1B8trwbmbvTTlz7le17s7f4rLPbaH0xwChGl gB1w== X-Gm-Message-State: AOJu0Yz87ZRnWMJSyTtpu4OLVHHmbXV18fqkBLru6GXjqqdHGb6hFUzW GVtRXWyzTbFnfSl/jM65J50taBu0QSk= X-Google-Smtp-Source: AGHT+IFQSb3cO99m4kmVkxpYDSR1DbV1n07kMYxpUGk3r7RnUdGfKSLZ2dTeiYGKN4sKIS0uvhtR25tzD1o= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a05:6902:1746:b0:d9a:59cb:8bed with SMTP id bz6-20020a056902174600b00d9a59cb8bedmr238072ybb.5.1698790405257; Tue, 31 Oct 2023 15:13:25 -0700 (PDT) Date: Tue, 31 Oct 2023 15:13:23 -0700 In-Reply-To: Mime-Version: 1.0 References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com> Message-ID: Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory From: Sean Christopherson To: Fuad Tabba Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , Alexander Viro , Christian Brauner , "Matthew Wilcox (Oracle)" , Andrew Morton , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li , Xu Yilun , Chao Peng , Jarkko Sakkinen , Anish Moorthy , David Matlack , Yu Zhang , Isaku Yamahata , "=?utf-8?Q?Micka=C3=ABl_Sala=C3=BCn?=" , Vlastimil Babka , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 5A14220010 X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: 7f5rqxuhyk9jkzgcceqbmak3wkwyzyq1 X-HE-Tag: 1698790406-254083 X-HE-Meta: U2FsdGVkX1+Tru1hWBTDTj61UM4yg9kcaic2PDcWq9M7u9rYk9JLx432xj+NxIvwBRDdcxoFvZItwifYq6bS3YszGhaxdPVv6nGs0SJWDifv/gDZ0ADqpbEAD74q8QtRVNa71N1wMsh1jJrM3WdoHJHg9QymYz3H5HDODDT7sYpFOa4h1HITLKYWtN9X3JNsqYFg526adBCzwpm8YJO74acrvgWPy+2woVT2H6G5wOP4jPivgM80nEIy95f4WrZzN1vYQxHIzMDchvf5gdFBwoqUBVlcKuWFDymcgjRJNkH3ekiiy5pJe3eWWP1iSuCTioIlWYV7l+lqznMG9FiL+irv6+ISOe3wOvfn2wKw2goM2cvArxPYsBCJpQusXgQWGDNzIXR5/SwqeeGuKJCuK/FcL436eoPhZUbj1/BPgWpILln+s5RCzLFwt41Nt0sH4DiPsxaFlXfeaS4N2/IvfwijAoTVNBvL0oIXd4cYGYTK0bwbqSElTamvxVzfgQWk9XdoFpidOP1/GNTDS6WFAOfaxxg14yafxSmf/NsKn8+MSJyCUMZUGAE5IfMuW1DI1AOr1f6kPbOuN6Cb0eWhiENvPnb/7NVqTkJkS9mTu6ussZQIxSN3hwggh/iScc8ymX+ypOjTdM+PSw7UCVO6pglKi+wWjQR8/Qd/Q1FFY4WI09tX46GGlk2JQLTS7lA6HQHvdZRszpPLnpDcaup1LrQgSunUHaOsTHotF1/mk1/JIjVtVa70MfXbQnZpK2BidUYDzbYg30yfBRjwI8cfCweCKZrRGywPMAVlLIKtcGTS4EYC8fnYBQY0BkwycftSGcVolK3UcGpaEBjCCmyeCC56J2IxiBgvxcGHzWTynDb0dTd5c4Ug9fN1fnWU+mZ6jt0uLB2WLIE1eZMirf6BlZmkqCwgA28XcchDPx+wwxdFY6TfNXjW5H7FF287a8LH9e3FAiVwjd9IE1wPRkv iqEHaHap CQxtAfOJD6vbZV2XN5OuUkttbj36Ce+gn4cGdHNUke+oWf0Agg7qJordPdhQB1GcFCw3OEEQiLZEJtt315W9NGFLPS9HMDM2Fg/bK63RynZYAwCX95wyt3LKpRFt2Dmq0GoPNfh2czxUAjhOgPW/Mop3GaJ8VqWlBPH9HlebntklYx5xuCgaMBAl37Q8ZiXrEZcepIGllAw8pEGRWw//s8SGVvd4bf2L7/SAPE2gkuFpFCEz856vzrFKTZ6cK7eBU5Pl/Ka0DxPWUn9SMVFFcHPF2ILP+EHp45Y6MDCst6+5NYVRd5Ps14OaKPV7bAowXXtNq1d309sG3NBGJtj8fJvwnKjGcFjb0o6XisPYmPIvNnRRGu23Y+Ffov6Vv0b6Lo1bPXamfsZux13DPBe5sGEHAaYyiAwcO2/y0/qJY3UkvociSh70InwpDPEY0Xt3gwu8KeQeyEOnqVZsjYoRzB9yeDl6zFAt+hj/5LNBHKECx+UdUfIORUfAE7o3YyhcQusdKOGKm0DPbkFHeoRZoIZGZUw5ZKIo+nCgTwXC451XbJW1l/Bb2w0qOuPa5B1d2kY+RxuXDsT7EdWVzSqJw6P6RJa2IJVGsYdTfgX6w9JzJVoK0aqpzdVjR4ezqC+s/hrFymIU/sSZ2JlBk/wBGDPPKha0CKGq3AdyT3BSudvEvmWoPux2Nm/tixbrOLsgoPNV0sNBCDbzL/Tx4Ftl/eWBhHT2FG8anhScrN7sGXb1vU/2pmUPMys6VSDtqqXvZeSAAf2qqNTllr60vM/paaBl0dd86pSEG2/Wg X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Oct 31, 2023, Fuad Tabba wrote: > Hi, >=20 > On Fri, Oct 27, 2023 at 7:23=E2=80=AFPM Sean Christopherson wrote: >=20 > ... >=20 > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/ap= i.rst > > index e2252c748fd6..e82c69d5e755 100644 > > --- a/Documentation/virt/kvm/api.rst > > +++ b/Documentation/virt/kvm/api.rst > > @@ -6079,6 +6079,15 @@ applied. > > :Parameters: struct kvm_userspace_memory_region2 (in) > > :Returns: 0 on success, -1 on error > > > > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REG= ION that > > +allows mapping guest_memfd memory into a guest. All fields shared wit= h > > +KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_PRI= VATE in > > +flags to have KVM bind the memory region to a given guest_memfd range = of > > +[guest_memfd_offset, guest_memfd_offset + memory_size]. The target gu= est_memfd > > +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current= VM, and > > +the target range must not be bound to any other memory region. All st= andard > > +bounds checks apply (use common sense). > > + >=20 > Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for > this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., that > a region marked as KVM_MEM_PRIVATE is only potentially private. It did > confuse the rest of the team when I walked them through a previous > version of this code once. Would something like KVM_MEM_GUESTMEM make > more sense? Heh, deja vu. We discussed this back in v7[*], and I came to the conclusio= n that choosing a name that wasn't explicitly tied to private memory wasn't justif= ied. But that was before a KVM-owned guest_memfd was even an idea, and thus befo= re we had anything close to a real use case. Since we now know that at least pKVM will use guest_memfd for shared memory= , and odds are quite good that "regular" VMs will also do the same, i.e. will wan= t guest_memfd with the concept of private memory, I agree that we should avoi= d PRIVATE. Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or KVM_MEM_USE_GUEST_MEMFD). I.e. do our best to avoid ambiguity between refe= rring to "guest memory" at-large and guest_memfd. Copying a few relevant points from v7 to save a click or three. : I don't have a concrete use case (this is a recent idea on my end), but = since we're : already adding fd-based memory, I can't think of a good reason not make = it more generic : for not much extra cost. And there are definitely classes of VMs for wh= ich fd-based : memory would Just Work, e.g. large VMs that are never oversubscribed on = memory don't : need to support reclaim, so the fact that fd-based memslots won't suppor= t page aging : (among other things) right away is a non-issue. ... : Hrm, but basing private memory on top of a generic FD_VALID would effect= ively require : shared memory to use hva-based memslots for confidential VMs. That'd yi= eld a very : weird API, e.g. non-confidential VMs could be backed entirely by fd-base= d memslots, : but confidential VMs would be forced to use hva-based memslots. :=20 : Ignore this idea for now. If there's an actual use case for generic fd-= based memory : then we'll want a separate flag, fd, and offset, i.e. that support could= be added : independent of KVM_MEM_PRIVATE. ... : One alternative would be to call it KVM_MEM_PROTECTED. That shouldn't c= ause : problems for the known use of "private" (TDX and SNP), and it gives us a= little : wiggle room, e.g. if we ever get a use case where VMs can share memory t= hat is : otherwise protected. :=20 : That's a pretty big "if" though, and odds are good we'd need more memslo= t flags and : fd+offset pairs to allow differentiating "private" vs. "protected-shared= " without : forcing userspace to punch holes in memslots, so I don't know that hedgi= ng now will : buy us anything. :=20 : So I'd say that if people think KVM_MEM_PRIVATE brings additional and me= aningful : clarity over KVM_MEM_PROTECTECD, then lets go with PRIVATE. But if PROT= ECTED is : just as good, go with PROTECTED as it gives us a wee bit of wiggle room = for the : future. [*] https://lore.kernel.org/all/Yuh0ikhoh+tCK6VW@google.com =20 > > -See KVM_SET_USER_MEMORY_REGION. > > +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memo= ry) and > > +userspace_addr (shared memory). However, "valid" for userspace_addr s= imply > > +means that the address itself must be a legal userspace address. The = backing > > +mapping for userspace_addr is not required to be valid/populated at th= e time of > > +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/a= llocated > > +on-demand. >=20 > Regarding requiring that a private region have both a valid > guest_memfd and a userspace_addr, should this be > implementation-specific? In pKVM at least, all regions for protected > VMs are private, and KVM doesn't care about the host userspace address > for those regions even when part of the memory is shared. Hmm, as of this patch, no, because the pKVM usage doesn't exist. E.g.=20 . Because this literally documents the current ABI. When > > +When mapping a gfn into the guest, KVM selects shared vs. private, i.e= consumes > > +userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUT= E_PRIVATE > > +state. At VM creation time, all memory is shared, i.e. the PRIVATE at= tribute > > +is '0' for all gfns. Userspace can control whether memory is shared/p= rivate by > > +toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as= needed. >=20 > In pKVM, guest memory is private by default, and most of it will > remain so for the lifetime of the VM. Userspace could explicitly mark > all the guest's memory as private at initialization, but it would save > a slight amount of work. That said, I understand that it might be > better to be consistent across implementations. Yeah, we discussed this in v12[*]. The default really doesn't matter for m= emory overheads or performances once supports range-based xarray entries, and if = that isn't sufficient, KVM can internally invert the polarity of PRIVATE. But for the ABI, I think we put a stake in the ground and say that all memo= ry is shared by default. That way CoCo VMs and regular VMs (i.e VMs without the = concept of private memory) all have the same ABI. Practically speaking, the cost t= o pKVM (and likely every other CoCo VM type) is a single ioctl() during VM creatio= n to "convert" all memory to private. [*] https://lore.kernel.org/all/ZRw6X2BptZnRPNK7@google.com > > --- /dev/null > > +++ b/virt/kvm/guest_memfd.c > > @@ -0,0 +1,548 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#include > > +#include > > +#include > > +#include > > +#include >=20 > nit: should this include be first (to maintain alphabetical ordering > of the includes)? Heh, yeah. I would argue this isn't a nit though ;-) > > +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, lo= ff_t len) > > +{ > > + struct list_head *gmem_list =3D &inode->i_mapping->private_list= ; > > + pgoff_t start =3D offset >> PAGE_SHIFT; > > + pgoff_t end =3D (offset + len) >> PAGE_SHIFT; > > + struct kvm_gmem *gmem; > > + > > + /* > > + * Bindings must stable across invalidation to ensure the start= +end >=20 > nit: Bindings must _be/stay?_ stable "be" is what's intended. > ... >=20 > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 78a0b09ef2a5..5d1a2f1b4e94 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, = gfn_t start, gfn_t end) > > } > > } > > > > -static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_ra= nge *range) > > +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *ra= nge) > > { > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end); > > return kvm_unmap_gfn_range(kvm, range); > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_m= emory_slot *memslot) > > /* This does not remove the slot from struct kvm_memslots data structu= res */ > > static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *= slot) > > { > > + if (slot->flags & KVM_MEM_PRIVATE) > > + kvm_gmem_unbind(slot); > > + >=20 > Should this be called after kvm_arch_free_memslot()? Arch-specific ode > might need some of the data before the unbinding, something I thought > might be necessary at one point for the pKVM port when deleting a > memslot, but realized later that kvm_invalidate_memslot() -> > kvm_arch_guest_memory_reclaimed() was the more logical place for it. > Also, since that seems to be the pattern for arch-specific handlers in > KVM. Maybe? But only if we can about symmetry between the allocation and free p= aths I really don't think kvm_arch_free_memslot() should be doing anything beyon= d a "pure" free. E.g. kvm_arch_free_memslot() is also called after moving a me= mslot, which hopefully we never actually have to allow for guest_memfd, but any co= de in kvm_arch_free_memslot() would bring about "what if" questions regarding mem= slot movement. I.e. the API is intended to be a "free arch metadata associated = with the memslot". Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_reclai= med()?