Date: Thu, 3 Apr 2025 07:51:16 -0700
From: Sean Christopherson
To: Fuad Tabba
Cc: Ackerley Tng, kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org,
 linux-mm@kvack.org, pbonzini@redhat.com, chenhuacai@kernel.org,
 mpe@ellerman.id.au, anup@brainfault.org, paul.walmsley@sifive.com,
 palmer@dabbelt.com, aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk,
 brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org,
 xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com,
 jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
 isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz,
 vannapurve@google.com, mail@maciej.szmigiero.name, david@redhat.com,
 michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com,
 isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
 suzuki.poulose@arm.com, steven.price@arm.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
 quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
 quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
 catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com,
 oliver.upton@linux.dev, maz@kernel.org, will@kernel.org,
 qperret@google.com, keirf@google.com, roypat@amazon.co.uk,
 shuah@kernel.org, hch@infradead.org, jgg@nvidia.com, rientjes@google.com,
 jhubbard@nvidia.com, fvdl@google.com, hughd@google.com,
 jthoughton@google.com, peterx@redhat.com, pankaj.gupta@amd.com
Subject: Re: [PATCH v7 3/7] KVM: guest_memfd: Track folio sharing within a struct kvm_gmem_private
References: <20250328153133.3504118-4-tabba@google.com>

On Thu, Apr 03, 2025, Fuad Tabba wrote:
> On Thu, 3 Apr 2025 at 00:56, Sean Christopherson wrote:
> > On Wed, Apr 02, 2025, Ackerley Tng wrote:
> > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > > index ac6b8853699d..cde16ed3b230 100644
> > > > --- a/virt/kvm/guest_memfd.c
> > > > +++ b/virt/kvm/guest_memfd.c
> > > > @@ -17,6 +17,18 @@ struct kvm_gmem {
> > > >  	struct list_head entry;
> > > >  };
> > > >
> > > > +struct kvm_gmem_inode_private {
> > > > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > +	struct xarray shared_offsets;
> > > > +	rwlock_t offsets_lock;
> > >
> > > This lock doesn't work, either that or this lock can't be held while
> > > faulting, because holding this lock means we can't sleep, and we need to
> > > sleep to allocate.
> >
> > rwlock_t is a variant of a spinlock, which can't be held when sleeping.
> >
> > What exactly does offsets_lock protect, and what are the rules for holding it?
> > At a glance, it's flawed. Something needs to prevent KVM from installing a mapping
> > for a private gfn that is being converted to shared.
> > KVM doesn't hold references
> > to PFNs while they're mapped into the guest, and kvm_gmem_get_pfn() doesn't check
> > shared_offsets let alone take offsets_lock.
>
> You're right about the rwlock_t. The goal of the offsets_lock is to
> protect the shared offsets -- i.e., it's just meant to protect the
> SHARED/PRIVATE status of a folio, not more, hence why it's not checked
> in kvm_gmem_get_pfn(). It used to be protected by the
> filemap_invalidate_lock, but the problem is that it would be called
> from an interrupt context.
>
> However, this is wrong, as you've pointed out. The purpose of locking
> is to ensure that no two conversions of the same folio happen at the
> same time. An alternative I had written up is to rely on having
> exclusive access to the folio to ensure that, since this is tied to
> the folio. That could be either by acquiring the folio lock, or
> ensuring that the folio doesn't have any outstanding references,
> indicating that we have exclusive access to it. This would avoid the
> whole locking issue.
>
> > ... Something needs to prevent KVM from installing a mapping
> > for a private gfn that is being converted to shared. ...
>
> > guest_memfd currently handles races between kvm_gmem_fault() and PUNCH_HOLE via
> > kvm_gmem_invalidate_{begin,end}(). I don't see any equivalent functionality in
> > the shared/private conversion code.
>
> For in-place sharing, KVM can install a mapping for a SHARED gfn. What
> it cannot do is install a mapping for a transient (i.e., NONE) gfn. We
> don't rely on kvm_gmem_get_pfn() for that, but on the individual KVM
> mmu fault handlers, but that said...

Consumption of shared/private physical pages _must_ be enforced by guest_memfd.
The private vs. shared state in the MMU handlers is that VM's view of the world
and desired state. The guest_memfd inode is the single source of truth for the
state of the _physical_ page.

E.g.
on TDX, if KVM installs a private SPTE for a PFN that is in actuality shared,
there will be machine checks and the host will likely crash.

> > I would much, much prefer one large series that shows the full picture than a
> > mish mash of partial series that I can't actually review, even if the big series
> > is 100+ patches (hopefully not).
>
> Dropping the RFC from the second series was not intentional, the first
> series is the one where I intended to drop the RFC. I apologize for
> that. Especially since I obviously don't know how to handle modules
> and wanted some input on how to do that :)

In this case, the rules for modules are pretty simple. Code in mm/ can't call
into KVM. Either avoid callbacks entirely, or implement via a layer of
indirection, e.g. function pointer or ops table, so that KVM can provide its
implementation at runtime.
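
To illustrate the ops table option, here is a minimal userspace sketch of the
pattern, not actual kernel code. Every name in it (gmem_folio_ops,
core_register_gmem_ops(), core_release_folio(), provider_ops, etc.) is
hypothetical and made up for this example; none of them are existing mm/ or
KVM symbols. The "core" half stands in for built-in code, the "provider" half
for a module such as KVM.

```c
#include <stddef.h>

/*
 * Hypothetical ops table: the names below are illustrative only and are
 * not existing kernel symbols.
 */
struct gmem_folio_ops {
	/* Called by core code when it is done with a page; 0 on success. */
	int (*folio_released)(unsigned long pfn);
};

/* Core side: holds a pointer to whatever implementation is registered. */
static const struct gmem_folio_ops *gmem_ops;

/* Register an implementation; fails if one is already installed. */
int core_register_gmem_ops(const struct gmem_folio_ops *ops)
{
	if (!ops || gmem_ops)
		return -1;
	gmem_ops = ops;
	return 0;
}

/* Drop the registered implementation, e.g. when the provider goes away. */
void core_unregister_gmem_ops(void)
{
	gmem_ops = NULL;
}

/*
 * Core code never references the provider directly; it only goes through
 * the table, and tolerates the provider being absent.
 */
int core_release_folio(unsigned long pfn)
{
	if (!gmem_ops || !gmem_ops->folio_released)
		return -1;
	return gmem_ops->folio_released(pfn);
}

/* Provider side (standing in for KVM): supplies the implementation. */
static unsigned long last_released_pfn;

static int provider_folio_released(unsigned long pfn)
{
	last_released_pfn = pfn;
	return 0;
}

static const struct gmem_folio_ops provider_ops = {
	.folio_released = provider_folio_released,
};
```

The provider would call core_register_gmem_ops(&provider_ops) when it loads and
core_unregister_gmem_ops() when it unloads, so the core side has no link-time
dependency on the provider. The sketch deliberately ignores locking and
lifetime; real code would also have to ensure the provider can't disappear
while a callback is in flight.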