From: Vishal Annapurve
Date: Mon, 30 Jun 2025 07:19:47 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, ajones@ventanamicro.com,
 akpm@linux-foundation.org, amoorthy@google.com, anthony.yznaga@oracle.com,
 anup@brainfault.org, aou@eecs.berkeley.edu, bfoster@redhat.com,
 binbin.wu@linux.intel.com, brauner@kernel.org, catalin.marinas@arm.com,
 chao.p.peng@intel.com, chenhuacai@kernel.org, dave.hansen@intel.com,
 david@redhat.com, dmatlack@google.com, dwmw@amazon.co.uk,
 erdemaktas@google.com, fan.du@intel.com, fvdl@google.com, graf@amazon.com,
 haibo1.xu@intel.com, hch@infradead.org, hughd@google.com,
 ira.weiny@intel.com, isaku.yamahata@intel.com, jack@suse.cz,
 james.morse@arm.com, jarkko@kernel.org, jgowans@amazon.com,
 jhubbard@nvidia.com, jroedel@suse.de, jthoughton@google.com,
 jun.miao@intel.com, kai.huang@intel.com, keirf@google.com,
 kent.overstreet@linux.dev, kirill.shutemov@intel.com,
 liam.merwick@oracle.com, maciej.wieczor-retman@intel.com,
 mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
 michael.roth@amd.com, mpe@ellerman.id.au, muchun.song@linux.dev,
 nikunj@amd.com, nsaenz@amazon.es, oliver.upton@linux.dev,
 palmer@dabbelt.com, pankaj.gupta@amd.com, paul.walmsley@sifive.com,
 pbonzini@redhat.com, pdurrant@amazon.co.uk, peterx@redhat.com,
 pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
 steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
 thomas.lendacky@amd.com, usama.arif@bytedance.com, vbabka@suse.cz,
 viro@zeniv.linux.org.uk, vkuznets@redhat.com, wei.w.wang@intel.com,
 will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com,
 yan.y.zhao@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com,
 zhiquan1.li@intel.com
References: <9502503f-e0c2-489e-99b0-94146f9b6f85@amd.com>
 <20250624130811.GB72557@ziepe.ca>
 <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
 <8f04f1df-d68d-4ef8-b176-595bbf00a9d1@amd.com>
In-Reply-To: <8f04f1df-d68d-4ef8-b176-595bbf00a9d1@amd.com>

On Sun, Jun 29, 2025 at 5:19 PM Alexey Kardashevskiy wrote:
> ...
> >>> ============================
> >>>
> >>> For IOMMU, could something like below work?
> >>>
> >>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
> >>
> >> Done that.
> >>
> >>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> >>> guest_memfd ranges using kvm_gmem_get_pfn()
> >>
> >> This API imho should drop the confusing kvm_ prefix.
> >>
> >>>     -> kvm invokes kvm_gmem_is_private() to check for the range
> >>> shareability; the IOMMU could use the same, or we could add an API in
> >>> gmem that takes in the access type and checks the shareability before
> >>> returning the pfn.
> >>
> >> Right now I cut-n-pasted kvm_gmem_get_folio() (which essentially is
> >> filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to
> >> avoid new links between iommufd.ko and kvm.ko. It is probably
> >> unavoidable though.
> >
> > I don't think that's the way to avoid links between iommufd.ko and
> > kvm.ko. A cleaner way is probably to have the gmem logic built in and
> > allow runtime registration of invalidation callbacks from the KVM/IOMMU
> > backends. Need to think about this more.
>
> Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd
> (==kvm.ko) at run time, so more beloved symbol_get() :)
>
> >>> * IOMMU stack exposes an invalidation callback that can be invoked by
> >>> guest_memfd.
> >>>
> >>> Private to shared conversion via kvm_gmem_convert_range():
> >>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >>>        on each bound memslot overlapping with the range.
> >>>     2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> >>>        actually unmaps the KVM SEPT/NPT entries.
> >>>        -> guest_memfd invokes the IOMMU invalidation callback to zap
> >>>           the secure IOMMU entries.
> >>>     3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> >>>        shareability and then splits the folios if needed.
> >>>     4) Userspace invokes an IOMMU map operation to map the ranges in
> >>>        the non-secure IOMMU.
> >>>
> >>> Shared to private conversion via kvm_gmem_convert_range():
> >>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >>>        on each bound memslot overlapping with the range.
> >>>     2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> >>>        actually unmaps the host mappings, which in turn unmaps the
> >>>        KVM non-secure EPT/NPT entries.
> >>>        -> guest_memfd invokes the IOMMU invalidation callback to zap
> >>>           the non-secure IOMMU entries.
> >>>     3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> >>>        shareability and then merges the folios if needed.
> >>>     4) Userspace invokes an IOMMU map operation to map the ranges in
> >>>        the secure IOMMU.
> >>
> >> Alright (although this zap+map is not necessary on the AMD hw).
> >
> > IMO guest_memfd ideally should not directly interact with or cater to
> > arch-specific needs; it should implement a mechanism that works for all
> > archs. KVM/IOMMU implement the invalidation callbacks and have all the
> > architecture-specific knowledge to make the right decisions.
>
> Every page conversion will go through:
>
> kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (built-in).
>
> Which one decides on the IOMMU not needing (un)mapping? Got to be (1),
> but then it needs to propagate the decision to amd-iommu (and we do not
> have (3) in that path at the moment).

If there is a need, guest_memfd can support two different callbacks (a
rough interface sketch follows below):
     1) A conversion notifier/callback invoked by guest_memfd during
conversion handling.
     2) An invalidation notifier/callback invoked by guest_memfd during
truncation.
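
Roughly something like below. All names here are made up purely for
illustration; none of this is existing kernel API, it is just one
possible shape for the two notifiers:

    struct gmem_notifier;

    struct gmem_notifier_ops {
            /*
             * Invoked during shared<->private conversion, before
             * guest_memfd updates shareability. The consumer (KVM or
             * iommufd) decides, based on its arch knowledge, whether
             * its mappings actually need to be zapped.
             */
            int (*convert)(struct gmem_notifier *nb, pgoff_t start,
                           pgoff_t nr_pages, bool to_private);

            /*
             * Invoked on truncation/hole punch. The backing pages go
             * away, so the consumer must always unmap here.
             */
            void (*invalidate)(struct gmem_notifier *nb, pgoff_t start,
                               pgoff_t nr_pages);
    };

    struct gmem_notifier {
            const struct gmem_notifier_ops *ops;
            struct list_head list;
    };

    /* Runtime registration, avoiding a static iommufd.ko <-> kvm.ko link */
    int gmem_register_notifier(struct file *gmem_file,
                               struct gmem_notifier *nb);
    void gmem_unregister_notifier(struct file *gmem_file,
                                  struct gmem_notifier *nb);

guest_memfd would simply walk the registered notifiers at conversion and
truncation time, keeping all arch-specific policy out of guest_memfd
itself.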
Iommufd/KVM can handle the conversion callback/notifier as per the needs
of the underlying architecture, e.g. TDX Connect would do the unmapping
while SEV Trusted I/O would skip it. The invalidation callback/notifier
will always need to be handled by unmapping the page tables.

> > Or do we just always do unmap+map (and trigger unwanted huge page
> > smashing)? All of it is doable and neither option is particularly
> > horrible; I'm trying to see where the consensus is now. Thanks,

I assume that by huge page smashing you mean huge page NPT mappings
getting split. AFAIR, based on discussion with Michael during the
guest_memfd calls, stage2 NPT entries need to be of the same granularity
as the RMP table entries for AMD SNP guests, i.e. huge page NPT mappings
need to be smashed on the KVM side during conversion. So today
guest_memfd sends invalidation notifications to KVM for both conversion
and truncation. Doesn't the same constraint of keeping the IOMMU page
tables at the same granularity as the RMP tables hold for trusted IO?
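
To make the arch-specific decision point concrete, here is a rough
sketch of how the iommufd side could implement the hypothetical
conversion notifier from above (again, all names are made up; whether
the secure IOMMU mapping survives a conversion is the per-arch call):

    struct iommufd_gmem_binding {
            struct gmem_notifier notifier;
            bool zap_on_convert;    /* set from IOMMU/arch capabilities */
            /* ... backrefs to the IOAS/area this binding covers ... */
    };

    static int iommufd_gmem_convert(struct gmem_notifier *nb, pgoff_t start,
                                    pgoff_t nr_pages, bool to_private)
    {
            struct iommufd_gmem_binding *binding =
                    container_of(nb, struct iommufd_gmem_binding, notifier);

            /*
             * e.g. AMD hw: the IOMMU mapping stays valid across
             * conversion, so nothing to zap and huge mappings are
             * preserved.
             */
            if (!binding->zap_on_convert)
                    return 0;

            /*
             * e.g. TDX Connect: zap the existing entries; userspace
             * remaps the range via the (non-)secure IOMMU afterwards.
             * iommufd_gmem_unmap_range() is hypothetical as well.
             */
            return iommufd_gmem_unmap_range(binding, start, nr_pages);
    }

That would keep guest_memfd arch-agnostic while still letting AMD avoid
the unnecessary unmap+map cycle.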