From: Fuad Tabba
Date: Thu, 22 May 2025 16:07:29 +0100
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Sean Christopherson
Cc: Vishal Annapurve, Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
 aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
 amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
 aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
 brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
 chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
 dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
 fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
 hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
 isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
 jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
 jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
 kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
 kirill.shutemov@intel.com, liam.merwick@oracle.com,
 maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name, maz@kernel.org,
 mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
 muchun.song@linux.dev, nikunj@amd.com, nsaenz@amazon.es,
 oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
 paul.walmsley@sifive.com, pbonzini@redhat.com, pdurrant@amazon.co.uk,
 peterx@redhat.com, pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, shuah@kernel.org, steven.price@arm.com,
 steven.sistare@oracle.com, suzuki.poulose@arm.com, thomas.lendacky@amd.com,
 usama.arif@bytedance.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
 vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
 willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
 yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com

Hi Sean,

On Thu, 22 May 2025 at 15:52, Sean Christopherson wrote:
>
> On Wed, May 21, 2025, Fuad Tabba wrote:
> > On Wed, 21 May 2025 at 16:51, Vishal Annapurve wrote:
> > > On Wed, May 21, 2025 at 8:22 AM Fuad Tabba wrote:
> > > > On Wed, 21 May 2025 at 15:42, Vishal Annapurve wrote:
> > > > > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba wrote:
> > > > > There are a bunch of complexities here, reboot sequence on x86 can be
> > > > > triggered in multiple ways that I don't fully understand, but a few
> > > > > of them include reading/writing to "reset register" in MMIO/PCI config
> > > > > space that are emulated by the host userspace directly. Host has to
> > > > > know when the guest is shutting down to manage its lifecycle.
> > > >
> > > > In that case, I think we need to fully understand these complexities
> > > > before adding new IOCTLs. It could be that once we understand these
> > > > issues, we find that we don't need these IOCTLs. It's hard to justify
> > > > adding an IOCTL for something we don't understand.
> > > >
> > >
> > > I don't understand all the ways x86 guest can trigger reboot but I do
> > > know that x86 CoCo linux guest kernel triggers reset using MMIO/PCI
> > > config register write that is emulated by host userspace.
> > >
> > > > > x86 CoCo VM firmwares don't support warm/soft reboot and even if they
> > > > > do in the future, guest kernel can choose a different reboot mechanism.
> > > > > So guest reboot needs to be emulated by always starting from scratch.
> > > > > This sequence needs initial guest firmware payload to be installed
> > > > > into private ranges of guest_memfd.
> > > > >
> > > > > >
> > > > > > Either the host doesn't (or cannot even) know that the guest is
> > > > > > rebooting, in which case I don't see how having an IOCTL would help.
> > > > >
> > > > > Host does know that the guest is rebooting.
> > > >
> > > > In that case, that (i.e., the host finding out that the guest is
> > > > rebooting) could trigger the conversion back to private. No need for an
> > > > IOCTL.
> > >
> > > In the reboot scenarios, it's the host userspace finding out that the guest
> > > kernel wants to reboot.
> >
> > How does the host userspace find that out? If the host userspace is capable
> > of finding that out, then surely KVM is also capable of finding out the same.
>
> Nope, not on x86.
> Well, not without userspace invoking a new ioctl, which would
> defeat the purpose of adding these ioctls.
>
> KVM is only responsible for emulating/virtualizing the "CPU". The chipset, e.g.
> the PCI config space, is fully owned by userspace. KVM doesn't even know whether
> or not PCI exists for the VM. And reboot may be emulated by simply creating a
> new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> change in state would happen in an entirely new struct kvm.
>
> That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> front. The changelog asserts that:
>
>   A guest_memfd ioctl is used because shareability is a property of the memory,
>   and this property should be modifiable independently of the attached struct kvm
>
> but then follows with a very weak and IMO largely irrelevant justification of:
>
>   This allows shareability to be modified even if the memory is not yet bound
>   using memslots.
>
> Allowing userspace to change shareability without memslots is one relatively minor
> flow in one very specific use case.
>
> The real justification for these ioctls is that fundamentally, shareability for
> in-place conversions is a property of a guest_memfd instance and not a struct kvm
> instance, and so needs to be owned by guest_memfd.

Thanks for the clarification Sean. I have a couple of followup
questions/comments that you might be able to help with:

From a conceptual point of view, I understand that the in-place
conversion is a property of guest_memfd. But that doesn't necessarily
mean that the interface between kvm <-> guest_memfd is a userspace
IOCTL. We already communicate directly between the two. Other, even
less related subsystems within the kernel also interact without going
through userspace. Why can't we do the same here? I'm not suggesting
that it not be owned by guest_memfd, only that we communicate directly.
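
To make the ownership question concrete, here is a toy userspace model
in C. It is only a sketch and not kernel code: every name in it
(toy_gmem, toy_vm, toy_gmem_convert and so on) is hypothetical and does
not come from the series. It assumes that per-page shareability lives in
the guest_memfd-like object, that the VM object merely points at it, and
that a userspace-facing entry point and a VM-side direct call can both
funnel into the same owner.

```c
/* Toy model only: all types and functions here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_GMEM_PAGES 8

enum toy_shareability { TOY_PRIVATE = 0, TOY_SHARED = 1 };

/* Stand-in for a guest_memfd instance: it owns per-page shareability. */
struct toy_gmem {
    enum toy_shareability state[TOY_GMEM_PAGES];
};

/* Stand-in for struct kvm: it references the gmem but keeps no copy. */
struct toy_vm {
    struct toy_gmem *gmem;
};

/* All pages start out private. */
static void toy_gmem_init(struct toy_gmem *gmem)
{
    assert(gmem != NULL);
    for (size_t i = 0; i < TOY_GMEM_PAGES; i++)
        gmem->state[i] = TOY_PRIVATE;
}

/* The single owner of conversions. Returns 0, or -1 for a bad range. */
static int toy_gmem_convert(struct toy_gmem *gmem, size_t first,
                            size_t count, enum toy_shareability to)
{
    assert(gmem != NULL);
    if (first >= TOY_GMEM_PAGES || count > TOY_GMEM_PAGES - first)
        return -1;
    for (size_t i = first; i < first + count; i++)
        gmem->state[i] = to;
    return 0;
}

/* Path A: userspace asks the gmem directly (ioctl-like); no VM needed. */
static int toy_user_convert(struct toy_gmem *gmem, size_t first,
                            size_t count, enum toy_shareability to)
{
    return toy_gmem_convert(gmem, first, count, to);
}

/* Path B: the VM side calls the owner directly, without userspace. */
static int toy_vm_convert(struct toy_vm *vm, size_t first,
                          size_t count, enum toy_shareability to)
{
    assert(vm != NULL);
    return toy_gmem_convert(vm->gmem, first, count, to);
}

/* The VM queries the gmem instead of tracking the attribute itself. */
static bool toy_vm_page_is_shared(const struct toy_vm *vm, size_t page)
{
    assert(vm != NULL && vm->gmem != NULL && page < TOY_GMEM_PAGES);
    return vm->gmem->state[page] == TOY_SHARED;
}
```

In this model, a conversion made through one toy_vm is still visible
after that toy_vm is discarded and a second one is pointed at the same
toy_gmem, which is the "reboot creates a new struct kvm" case. Both
entry points reach the same state, so the choice between them is about
who initiates the conversion, not about who owns the state.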

From a performance point of view, I would expect the common case to be
that when KVM gets an unshare request from the guest, it would be able
to unmap those pages from the (cooperative) host userspace, and return
to the guest. In this scenario, the host userspace wouldn't even
need to be involved. Having a userspace IOCTL as part of this makes
that trip unnecessarily longer for the common case.

Cheers,
/fuad

> I.e. focus on justifying the change from a design and conceptual perspective,
> not from a mechanical perspective of a flow that is likely somewhat unique to our
> specific environment. Y'all are getting deep into the weeds on a random aspect
> of x86 platform architecture, instead of focusing on the overall design.
>
> The other issue that's likely making this more confusing than it needs to be is
> that this series is actually two completely different series bundled into one,
> with very little explanation. Moving shared vs. private ownership into
> guest_memfd isn't a requirement for 1GiB support, it's a requirement for in-place
> shared/private conversion in guest_memfd.
>
> For the current guest_memfd implementation, shared vs. private is tracked in the
> VM via memory attributes, because a guest_memfd instance is *only* private. I.e.
> shared vs. private is a property of the VM, not of the guest_memfd instance. But
> when in-place conversion support comes along, ownership of that particular
> attribute needs to shift to the guest_memfd instance.
>
> I know I gave feedback on an earlier posting about there being too many series flying
> around, but shoving two distinct concepts into a single series is not the answer.
> My complaint about too much noise wasn't that there were multiple series, it was
> that there was very little coordination and lots of chaos.
>
> If you split this series in two, which should be trivial since you've already
> organized the patches as a split, then sans the selftests (thank you for those!),
> in-place conversion support will be its own (much smaller!) series that can focus
> on that specific aspect of the design, and can provide a cover letter that
> expounds on the design goals and uAPI.
>
>   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
>   KVM: Query guest_memfd for private/shared status
>   KVM: guest_memfd: Skip LRU for guest_memfd folios
>   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
>   KVM: guest_memfd: Introduce and use shareability to guard faulting
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
>
> And then you can post the 1GiB series separately. So long as you provide pointers
> to dependencies along with a link to a repo+branch with the kitchen sink, I won't
> complain about things being too chaotic :-)