From: Vishal Annapurve <vannapurve@google.com>
Date: Tue, 24 Jun 2025 07:10:38 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
References: <9502503f-e0c2-489e-99b0-94146f9b6f85@amd.com> <20250624130811.GB72557@ziepe.ca>
In-Reply-To: <20250624130811.GB72557@ziepe.ca>
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, Fuad Tabba, Ackerley Tng, kvm@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    linux-fsdevel@vger.kernel.org, ajones@ventanamicro.com,
    akpm@linux-foundation.org, amoorthy@google.com,
    anthony.yznaga@oracle.com, anup@brainfault.org, aou@eecs.berkeley.edu,
    bfoster@redhat.com, binbin.wu@linux.intel.com, brauner@kernel.org,
    catalin.marinas@arm.com, chao.p.peng@intel.com, chenhuacai@kernel.org,
    dave.hansen@intel.com, david@redhat.com, dmatlack@google.com,
    dwmw@amazon.co.uk, erdemaktas@google.com, fan.du@intel.com,
    fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
    hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
    isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
    jarkko@kernel.org, jgowans@amazon.com, jhubbard@nvidia.com,
    jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
    kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
    kirill.shutemov@intel.com, liam.merwick@oracle.com,
    maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name,
    maz@kernel.org, mic@digikod.net, michael.roth@amd.com,
    mpe@ellerman.id.au, muchun.song@linux.dev, nikunj@amd.com,
    nsaenz@amazon.es, oliver.upton@linux.dev, palmer@dabbelt.com,
    pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com,
    pdurrant@amazon.co.uk, peterx@redhat.com, pgonda@google.com,
    pvorel@suse.cz, qperret@google.com, quic_cvanscha@quicinc.com,
    quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
    quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
    quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
    richard.weiyang@gmail.com, rick.p.edgecombe@intel.com,
    rientjes@google.com, roypat@amazon.co.uk, rppt@kernel.org,
    seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    steven.sistare@oracle.com, suzuki.poulose@arm.com,
    thomas.lendacky@amd.com, usama.arif@bytedance.com, vbabka@suse.cz,
    viro@zeniv.linux.org.uk, vkuznets@redhat.com, wei.w.wang@intel.com,
    will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com,
    yan.y.zhao@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com,
    zhiquan1.li@intel.com

On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe wrote:
>
> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>
> > Now, I am rebasing my RFC on top of this patchset and it fails in
> > kvm_gmem_has_safe_refcount() as the IOMMU holds references to all
> > these folios in my RFC.
> >
> > So what is the expected sequence here? The userspace unmaps a DMA
> > page and maps it back right away, all from the userspace? The end
> > result will be exactly the same, which seems useless. And the IOMMU TLB

As Jason described, ideally the IOMMU, just like KVM, should:
1) Directly rely on guest_memfd for pinning -> no page refcounts taken
by the IOMMU stack
2) Directly query pfns from guest_memfd for both shared/private ranges
3) Implement an invalidation callback that guest_memfd can invoke on
conversions (a rough sketch follows below).
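To make 3) concrete, here is a minimal sketch of one shape such a
callback interface could take, assuming a notifier-style registration
hung off the guest_memfd inode. None of these names (gmem_notifier,
gmem_invalidate_ops, the reason codes) exist today; they are purely
illustrative:

/* Illustrative only -- none of these structures exist in the kernel. */
#include <linux/types.h>
#include <linux/list.h>

struct gmem_notifier;

/* Reason codes, so the IOMMU can plug in either invalidation scheme. */
enum gmem_invalidate_reason {
	GMEM_INVALIDATE_TO_SHARED,	/* private -> shared conversion */
	GMEM_INVALIDATE_TO_PRIVATE,	/* shared -> private conversion */
	GMEM_INVALIDATE_ADDR_CHANGE,	/* truncation, hole punch, etc. */
};

struct gmem_invalidate_ops {
	/*
	 * Zap IOMMU entries for [start, end) and flush the IOTLB; no new
	 * mappings for the range may be created until invalidate_end().
	 */
	void (*invalidate_begin)(struct gmem_notifier *n, pgoff_t start,
				 pgoff_t end, enum gmem_invalidate_reason r);
	void (*invalidate_end)(struct gmem_notifier *n, pgoff_t start,
			       pgoff_t end);
};

/* One per IOMMU binding, linked into a per-guest_memfd notifier list. */
struct gmem_notifier {
	const struct gmem_invalidate_ops *ops;
	struct list_head entry;
};

With something like this, guest_memfd would walk its notifier list
during a conversion, and point 1) holds because coherence comes from
the callback rather than from elevated page refcounts.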
Current flow: Private to Shared conversion via kvm_gmem_convert_range() -
1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges on
each bound memslot overlapping with the range
   -> KVM has the concept of invalidation_begin() and end(), which
effectively ensures that between these function calls, no new EPT/NPT
entries can be added for the range.
2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the KVM SEPT/NPT entries.
3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then splits the folios if needed.

Shared to Private conversion via kvm_gmem_convert_range() -
1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges on
each bound memslot overlapping with the range
2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the host mappings, which in turn unmaps the KVM
non-secure EPT/NPT entries.
3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then merges the folios if needed.

==============================

For the IOMMU, could something like the below work?

* A new UAPI to bind IOMMU FDs with guest_memfd ranges
* VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
guest_memfd ranges using kvm_gmem_get_pfn()
   -> KVM invokes kvm_gmem_is_private() to check the range
shareability; the IOMMU could use the same, or we could add an API in
gmem that takes in the access type and checks the shareability before
returning the pfn.
* The IOMMU stack exposes an invalidation callback that can be invoked
by guest_memfd.

Private to Shared conversion via kvm_gmem_convert_range() -
1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges on
each bound memslot overlapping with the range
2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the KVM SEPT/NPT entries.
   -> guest_memfd invokes the IOMMU invalidation callback to zap the
secure IOMMU entries.
3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then splits the folios if needed.
4) Userspace invokes the IOMMU map operation to map the ranges in the
non-secure IOMMU.

Shared to Private conversion via kvm_gmem_convert_range() -
1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges on
each bound memslot overlapping with the range
2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the host mappings, which in turn unmaps the KVM
non-secure EPT/NPT entries.
   -> guest_memfd invokes the IOMMU invalidation callback to zap the
non-secure IOMMU entries.
3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then merges the folios if needed.
4) Userspace invokes the IOMMU map operation to map the ranges in the
secure IOMMU.

There should be a way to block external IOMMU pagetable updates while
guest_memfd is performing a conversion, e.g. something like
kvm_invalidate_begin()/end(). A hypothetical userspace-side sketch of
this sequence is at the end of this mail.

> > is going to be flushed on a page conversion anyway (the RMPUPDATE
> > instruction does that). All this is about AMD's x86 though.
>
> The iommu should not be using the VMA to manage the mapping. It should

+1.

> be directly linked to the guestmemfd in some way that does not disturb
> its operations. I imagine there would be some kind of invalidation
> callback directly to the iommu.
>
> Presumably that invalidation call back can include a reason for the
> invalidation (addr change, shared/private conversion, etc)
>
> I'm not sure how we will figure out which case is which but guestmemfd
> should allow the iommu to plug in either invalidation scheme..
>
> Probably invalidation should be a global to the FD thing, I imagine
> that once invalidation is established the iommu will not be
> incrementing page refcounts.

+1.

> Jason
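To tie the above together, below is a purely hypothetical
userspace-side sketch of the Private to Shared flow.
KVM_GMEM_CONVERT_SHARED is the ioctl this patchset introduces, but the
argument layout is a guess, and the ioctl numbers plus the whole
VFIO_GMEM_DMA_MAP side are invented placeholders for the proposed new
UAPI, not existing interfaces:

/* Hypothetical UAPI sketch; no struct layouts or ioctl numbers are real. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct kvm_gmem_convert {		/* guessed layout for this patchset's ioctl */
	uint64_t offset;		/* byte offset into the guest_memfd */
	uint64_t size;			/* length of the range to convert */
};

struct vfio_gmem_dma_map {		/* proposed: map directly from gmem, no VMA */
	uint32_t argsz;
	uint32_t flags;
	uint64_t iova;			/* IOVA to establish the mapping at */
	uint64_t size;
	int32_t  gmem_fd;		/* guest_memfd backing the range */
	uint64_t gmem_offset;		/* offset into the guest_memfd */
};

/* Placeholder ioctl numbers, for illustration only. */
#define KVM_GMEM_CONVERT_SHARED	_IOW('K', 0xd0, struct kvm_gmem_convert)
#define VFIO_GMEM_DMA_MAP	_IOW(';', 0x70, struct vfio_gmem_dma_map)

static int convert_to_shared_and_remap(int gmem_fd, int vfio_fd,
				       uint64_t offset, uint64_t size,
				       uint64_t iova)
{
	struct kvm_gmem_convert conv = { .offset = offset, .size = size };
	struct vfio_gmem_dma_map map = {
		.argsz = sizeof(map),
		.iova = iova,
		.size = size,
		.gmem_fd = gmem_fd,
		.gmem_offset = offset,
	};

	/*
	 * Steps 1-3 happen inside the kernel: invalidate_begin() blocks
	 * new mappings, the KVM SEPT/NPT entries and the secure IOMMU
	 * entries are zapped (the latter via the invalidation callback),
	 * and the shareability is updated.
	 */
	if (ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &conv) < 0)
		return -1;

	/* Step 4: userspace re-establishes the non-secure IOMMU mapping. */
	return ioctl(vfio_fd, VFIO_GMEM_DMA_MAP, &map) < 0 ? -1 : 0;
}

The property worth noting is that neither call takes a long-term page
refcount; the IOMMU mapping stays coherent via the invalidation
callback, which is what would let a kvm_gmem_has_safe_refcount()-style
check keep passing.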