From mboxrd@z Thu Jan 1 00:00:00 1970
X-Mailing-List: linux-fsdevel@vger.kernel.org
MIME-Version: 1.0
References: <9502503f-e0c2-489e-99b0-94146f9b6f85@amd.com>
 <20250624130811.GB72557@ziepe.ca>
 <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
In-Reply-To: <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
From: Vishal Annapurve
Date: Fri, 27 Jun 2025 08:17:34 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, ajones@ventanamicro.com,
 akpm@linux-foundation.org, amoorthy@google.com, anthony.yznaga@oracle.com,
 anup@brainfault.org, aou@eecs.berkeley.edu, bfoster@redhat.com,
 binbin.wu@linux.intel.com, brauner@kernel.org, catalin.marinas@arm.com,
 chao.p.peng@intel.com, chenhuacai@kernel.org, dave.hansen@intel.com,
 david@redhat.com, dmatlack@google.com, dwmw@amazon.co.uk,
 erdemaktas@google.com, fan.du@intel.com, fvdl@google.com, graf@amazon.com,
 haibo1.xu@intel.com, hch@infradead.org, hughd@google.com,
 ira.weiny@intel.com, isaku.yamahata@intel.com, jack@suse.cz,
 james.morse@arm.com, jarkko@kernel.org, jgowans@amazon.com,
 jhubbard@nvidia.com, jroedel@suse.de, jthoughton@google.com,
 jun.miao@intel.com, kai.huang@intel.com, keirf@google.com,
 kent.overstreet@linux.dev, kirill.shutemov@intel.com,
 liam.merwick@oracle.com, maciej.wieczor-retman@intel.com,
 mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
 michael.roth@amd.com, mpe@ellerman.id.au, muchun.song@linux.dev,
 nikunj@amd.com, nsaenz@amazon.es, oliver.upton@linux.dev,
 palmer@dabbelt.com, pankaj.gupta@amd.com, paul.walmsley@sifive.com,
 pbonzini@redhat.com, pdurrant@amazon.co.uk, peterx@redhat.com,
 pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
 quic_tsoni@quicinc.com, richard.weiyang@gmail.com,
 rick.p.edgecombe@intel.com, rientjes@google.com, roypat@amazon.co.uk,
 rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
 steven.sistare@oracle.com, suzuki.poulose@arm.com, thomas.lendacky@amd.com,
 usama.arif@bytedance.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
 vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
 willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
 yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com
Content-Type: text/plain; charset="UTF-8"

On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy wrote:
>
>
> On 25/6/25 00:10, Vishal Annapurve wrote:
> > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe wrote:
> >>
> >> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >>
> >>> Now, I am rebasing my RFC on top of this patchset and it fails in
> >>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >>> folios in my RFC.
> >>>
> >>> So what is the expected sequence here? The userspace unmaps a DMA
> >>> page and maps it back right away, all from the userspace? The end
> >>> result will be exactly the same, which seems useless. And the IOMMU TLB
> >
> > As Jason described, ideally the IOMMU, just like KVM, should:
> > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> >    by the IOMMU stack
> > 2) Directly query pfns from guest_memfd for both shared/private ranges
> > 3) Implement an invalidation callback that guest_memfd can invoke on
> >    conversions.

Conversions and truncations both.

> >
> > Current flow:
> > Private to Shared conversion via kvm_gmem_convert_range() -
> >   1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >      on each bound memslot overlapping with the range
> >        -> KVM has the concept of invalidation_begin() and end(), which
> >           effectively ensures that between these function calls, no new
> >           EPT/NPT entries can be added for the range.
> >   2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> >      actually unmaps the KVM SEPT/NPT entries.
> >   3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> >      shareability and then splits the folios if needed.
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> >   1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >      on each bound memslot overlapping with the range
> >   2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> >      actually unmaps the host mappings, which will unmap the KVM
> >      non-secure EPT/NPT entries.
> >   3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> >      shareability and then merges the folios if needed.
> >
> > ============================
> >
> > For IOMMU, could something like below work?
> >
> >   * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>
> Done that.
>
> >   * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> >     guest_memfd ranges using kvm_gmem_get_pfn()
>
> This API imho should drop the confusing kvm_ prefix.
>
> >        -> kvm invokes kvm_gmem_is_private() to check the range's
> >           shareability; IOMMU could use the same, or we could add an API
> >           in gmem that takes in the access type and checks the
> >           shareability before returning the pfn.
>
> Right now I cut-n-pasted kvm_gmem_get_folio() (which essentially is
> filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to
> avoid new links between iommufd.ko and kvm.ko. It is probably
> unavoidable though.

I don't think that's the way to avoid links between iommufd.ko and
kvm.ko. A cleaner way is probably to have the gmem logic built-in and
allow runtime registration of invalidation callbacks from the KVM/IOMMU
backends. Need to think about this more.
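
To make the runtime-registration idea a little more concrete, here is a
very rough sketch of the shape of interface I have in mind. Every name
below is made up for illustration (none of this exists today), locking
is omitted, and it is written as standalone C rather than kernel code:

#include <stdint.h>

typedef uint64_t gfn_t;

/* One registered consumer of a guest_memfd range (KVM, iommufd, ...). */
struct gmem_invalidate_ops {
	/* Zap any mappings for [start, end) before shareability changes. */
	void (*invalidate)(void *priv, gfn_t start, gfn_t end);
	void *priv;
	struct gmem_invalidate_ops *next;
};

struct gmem_file {
	struct gmem_invalidate_ops *invalidators;
};

/* Runtime registration, e.g. when iommufd binds to the gmem fd. */
static void gmem_register_invalidator(struct gmem_file *f,
				      struct gmem_invalidate_ops *ops)
{
	ops->next = f->invalidators;
	f->invalidators = ops;
}

/*
 * Called by guest_memfd on conversion or truncation, after blocking new
 * mappings and before updating shareability, so that both KVM and IOMMU
 * page tables get torn down for the affected range.
 */
static void gmem_invalidate_range(struct gmem_file *f, gfn_t start, gfn_t end)
{
	struct gmem_invalidate_ops *ops;

	for (ops = f->invalidators; ops; ops = ops->next)
		ops->invalidate(ops->priv, start, end);
}

With something along these lines, guest_memfd would not need to know
whether a consumer is KVM or an IOMMU backend; it would just walk the
registered callbacks on every conversion/truncation, and neither
consumer would need to hold page refcounts.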
> >
> >   * IOMMU stack exposes an invalidation callback that can be invoked by
> >     guest_memfd.
> >
> > Private to Shared conversion via kvm_gmem_convert_range() -
> >   1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >      on each bound memslot overlapping with the range
> >   2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> >      actually unmaps the KVM SEPT/NPT entries.
> >        -> guest_memfd invokes the IOMMU invalidation callback to zap
> >           the secure IOMMU entries.
> >   3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> >      shareability and then splits the folios if needed.
> >   4) Userspace invokes the IOMMU map operation to map the ranges in the
> >      non-secure IOMMU.
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> >   1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >      on each bound memslot overlapping with the range
> >   2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> >      actually unmaps the host mappings, which will unmap the KVM
> >      non-secure EPT/NPT entries.
> >        -> guest_memfd invokes the IOMMU invalidation callback to zap
> >           the non-secure IOMMU entries.
> >   3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> >      shareability and then merges the folios if needed.
> >   4) Userspace invokes the IOMMU map operation to map the ranges in the
> >      secure IOMMU.
>
> Alright (although this zap+map is not necessary on the AMD hw).

IMO guest_memfd ideally should not directly interact with or cater to
arch-specific needs; it should implement a mechanism that works for all
archs. KVM/IOMMU implement the invalidation callbacks and have all the
architecture-specific knowledge to take the right decisions.

> >
> > There should be a way to block external IOMMU pagetable updates while
> > guest_memfd is performing a conversion, e.g. something like
> > kvm_invalidate_begin()/end().
> >
> >>> is going to be flushed on a page conversion anyway (the RMPUPDATE
> >>> instruction does that). All this is about AMD's x86 though.
> >>
> >> The iommu should not be using the VMA to manage the mapping. It should
> >
> > +1.
>
> Yeah, not doing this already, because I physically cannot map gmemfd's
> memory in IOMMU via VMA (which allocates memory via gup(), so the wrong
> memory is mapped in the IOMMU). Thanks,
>
> >> be directly linked to the guestmemfd in some way that does not disturb
> >> its operations. I imagine there would be some kind of invalidation
> >> callback directly to the iommu.
> >>
> >> Presumably that invalidation callback can include a reason for the
> >> invalidation (addr change, shared/private conversion, etc)
> >>
> >> I'm not sure how we will figure out which case is which but guestmemfd
> >> should allow the iommu to plug in either invalidation scheme..
> >>
> >> Probably invalidation should be a global to the FD thing, I imagine
> >> that once invalidation is established the iommu will not be
> >> incrementing page refcounts.
> >
> > +1.
>
> Alright. Thanks for the comments.
>
> >>
> >> Jason
>
> --
> Alexey
>
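
P.S. On the point above about blocking external IOMMU pagetable updates
while guest_memfd is performing a conversion: a begin/end scheme along
the lines of the kvm_invalidate_begin()/end() idea could be enough.
Purely as a sketch, with hypothetical names and locking/error handling
omitted:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Tracks in-flight invalidations for one guest_memfd instance. */
struct gmem_invalidate_state {
	unsigned long in_progress;	/* nesting count of begin/end pairs */
	gfn_t start;			/* union of ranges under invalidation */
	gfn_t end;
};

/* Conversion path: call before zapping KVM/IOMMU mappings. */
static void gmem_invalidate_begin(struct gmem_invalidate_state *s,
				  gfn_t start, gfn_t end)
{
	if (!s->in_progress++) {
		s->start = start;
		s->end = end;
	} else {
		s->start = start < s->start ? start : s->start;
		s->end = end > s->end ? end : s->end;
	}
}

/* Conversion path: call after shareability has been updated. */
static void gmem_invalidate_end(struct gmem_invalidate_state *s)
{
	s->in_progress--;
}

/*
 * Mapping paths (KVM fault-in, IOMMU map) would check this and back off
 * or retry, so no new entries appear for a range mid-conversion.
 */
static bool gmem_range_under_invalidation(struct gmem_invalidate_state *s,
					  gfn_t start, gfn_t end)
{
	return s->in_progress && start < s->end && end > s->start;
}

The IOMMU map path would check gmem_range_under_invalidation() (or block
on the same state under a lock) before installing entries, which would
give the IOMMU side the same "no new mappings mid-conversion" guarantee
that the KVM side already has.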