Date: Tue, 31 Oct 2023 14:36:37 -0700
References: <20231027182217.3615211-1-seanjc@google.com>
 <20231027182217.3615211-17-seanjc@google.com>
Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for
 guest-specific backing memory
From: Sean Christopherson
To: David Matlack
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
 Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
 Alexander Viro, Christian Brauner, "Matthew Wilcox (Oracle)",
 Andrew Morton, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
 linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun,
 Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
 Isaku Yamahata, Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
 Ackerley Tng, Maciej Szmigiero, David Hildenbrand, Quentin Perret,
 Michael Roth, Wang, Liam Merwick, Isaku Yamahata, "Kirill A . Shutemov"

On Tue, Oct 31, 2023, David Matlack wrote:
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
> > memory that is tied to a specific KVM virtual machine and whose primary
> > purpose is to serve guest memory.
> >
> > A guest-first memory subsystem allows for optimizations and enhancements
> > that are kludgy or outright infeasible to implement/support in a generic
> > memory subsystem. With guest_memfd, guest protections and mapping sizes
> > are fully decoupled from host userspace mappings. E.g. KVM currently
> > doesn't support mapping memory as writable in the guest without it also
> > being writable in host userspace, as KVM's ABI uses VMA protections to
> > define the allow guest protection. Userspace can fudge this by
> > establishing two mappings, a writable mapping for the guest and readable
> > one for itself, but that’s suboptimal on multiple fronts.
> >
> > Similarly, KVM currently requires the guest mapping size to be a strict
> > subset of the host userspace mapping size, e.g. KVM doesn’t support
> > creating a 1GiB guest mapping unless userspace also has a 1GiB guest
> > mapping. Decoupling the mappings sizes would allow userspace to precisely
> > map only what is needed without impacting guest performance, e.g. to
> > harden against unintentional accesses to guest memory.
> >
> > Decoupling guest and userspace mappings may also allow for a cleaner
> > alternative to high-granularity mappings for HugeTLB, which has reached a
> > bit of an impasse and is unlikely to ever be merged.
> >
> > A guest-first memory subsystem also provides clearer line of sight to
> > things like a dedicated memory pool (for slice-of-hardware VMs) and
> > elimination of "struct page" (for offload setups where userspace _never_
> > needs to mmap() guest memory).
>
> All of these use-cases involve using guest_memfd for shared pages, but
> this entire series sets up KVM to only use guest_memfd for private
> pages.
>
> For example, the per-page attributes are a property of a KVM VM, not the
> underlying guest_memfd. So that implies we will need separate
> guest_memfds for private and shared pages. But a given memslot can have
> a mix of private and shared pages. So that implies a memslot will need
> to support 2 guest_memfds?

Yes, someday this may be true. Allowing guest_memfd (it was probably called
something else at that point) for "regular" memory was discussed in, I think,
v10? We made a conscious decision to defer supporting 2 guest_memfds because it
isn't strictly necessary to support the TDX/SNP use cases for which all of this
was initially designed, and adding a second guest_memfd and the infrastructure
needed to let userspace map a guest_memfd can be done on top with minimal
overhead.

> But the UAPI only allows 1 and uses the HVA for shared mappings.
>
> My initial reaction after reading through this series is that the
> per-page private/shared should be a property of the guest_memfd, not the
> VM. Maybe it would even be cleaner in the long-run to make all memory
> attributes a property of the guest_memfd. That way we can scope the
> support to only guest_memfds and not have to worry about making per-page
> attributes work with "legacy" HVA-based memslots.

Making the private vs. shared state a property of the guest_memfd doesn't work
for TDX and SNP. We (upstream x86 and KVM maintainers) have taken a hard stance
that in-place conversion will not be allowed for TDX/SNP due to the ease with
which a misbehaving userspace and/or guest can crash the host.

We'd also be betting that there would *never* be a use case for per-gfn
attributes for non-standard memory, e.g. virtio-gpu buffers, any kind of device
memory, etc.

We'd also effectively be signing up to either support swap and page migration in
guest_memfd, or make those mutually exclusive with per-gfn attributes too.

guest_memfd is only intended for guest DRAM, and if I get my way, will never
support swap (page migration is less scary). I.e. guest_memfd isn't intended to
be a one-size-fits-all solution, nor is it intended to wholesale replace
memslots, which is effectively what we'd be doing by deprecating HVA-based guest
memory.

And ignoring all that, the ABI would end up being rather bizarre due to the way
guest_memfd interacts with memslots. guest_memfd itself has no real notion of
gfns, i.e. the shared vs. private state would be tied to a file offset, not a
gfn. That's a solvable problem, e.g. we could make a gfn:offset binding
"sticky", but that would add extra complexity to the ABI, and AFAICT wouldn't
buy us that much, if anything.

> Maybe can you sketch out how you see this proposal being extensible to
> using guest_memfd for shared mappings?

For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.
What's missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM
needs to ensure there are no outstanding references when converting back to
private.

For TDX/SNP, assuming we don't find a performant and robust way to do in-place
conversions, a second fd+offset pair would be needed.
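
To illustrate the gfn vs. file offset distinction discussed above, here is a toy
model in Python. It is only a sketch under stated assumptions, not KVM code:
Memslot, ToyVM, resolve() and the 4KiB page size are hypothetical names and
values picked for the example. It assumes the private/shared attribute is
tracked per gfn in the VM, that shared accesses go through the memslot's HVA,
and that private accesses go through the memslot's single fd+offset binding.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumption: 4KiB pages, purely for illustration


@dataclass
class Memslot:
    """Toy stand-in for a memslot; all field names are hypothetical."""
    base_gfn: int
    npages: int
    userspace_addr: int      # HVA that backs the shared view
    guest_memfd: int         # fd that backs the private view
    guest_memfd_offset: int  # byte offset of this slot within that fd

    def contains(self, gfn: int) -> bool:
        return self.base_gfn <= gfn < self.base_gfn + self.npages


@dataclass
class ToyVM:
    """The private/shared state is keyed by gfn and lives in the VM."""
    slots: list = field(default_factory=list)
    private_gfns: set = field(default_factory=set)

    def set_private(self, gfn: int, private: bool) -> None:
        # The attribute changes here; the backing file is never consulted.
        if private:
            self.private_gfns.add(gfn)
        else:
            self.private_gfns.discard(gfn)

    def resolve(self, gfn: int):
        """Return where an access to gfn would be backed in this model."""
        slot = next((s for s in self.slots if s.contains(gfn)), None)
        if slot is None:
            raise LookupError(f"no memslot for gfn {gfn:#x}")
        delta = (gfn - slot.base_gfn) * PAGE_SIZE
        if gfn in self.private_gfns:
            # The file only ever sees an offset; the gfn is translated
            # through the memslot binding before reaching it.
            return ("guest_memfd", slot.guest_memfd,
                    slot.guest_memfd_offset + delta)
        return ("hva", slot.userspace_addr + delta)
```

In this model, flipping a gfn between shared and private only touches ToyVM
state, and the same gfn resolves to either the HVA or the fd+offset pair.
Moving the attribute into the file would mean keying it by offset instead,
which is where the "sticky" gfn:offset binding mentioned above would come in.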