Date: Thu, 5 Mar 2026 10:33:35 -0700
From: Alex Williamson
Cc: alex@shazbot.org
Subject: Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
Message-ID: <20260305103335.74fb8141@shazbot.org>
In-Reply-To: <20260223155514.152435-1-ankita@nvidia.com>
References: <20260223155514.152435-1-ankita@nvidia.com>
X-Mailing-List: kvm@vger.kernel.org

On Mon, 23 Feb 2026 15:54:59 +0000 wrote:

> From: Ankit Agrawal
>
> Background
> ----------
> Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
> feature that enable the GPU to access the system memory allocations
> within and across nodes through high bandwidth path. This access path
> goes as: GPU <--> NVswitch <--> GPU <--> CPU.
> The GPU can utilize the
> system memory located on the same socket or from a different socket
> or even on a different node in a multi-node system [1]. This feature is
> being extended to virtualization.
>
>
> Design Details
> --------------
> EGM when enabled in the virtualization stack, the host memory
> is partitioned into 2 parts: One partition for the Host OS usage
> called Hypervisor region, and a second Hypervisor-Invisible (HI) region
> for the VM. Only the hypervisor region is part of the host EFI map
> and is thus visible to the host OS on bootup. Since the entire VM
> sysmem is eligible for EGM allocations within the VM, the HI partition
> is interchangeably called as EGM region in the series. This HI/EGM region
> range base SPA and size is exposed through the ACPI DSDT properties.
>
> Whilst the EGM region is accessible on the host, it is not added to
> the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
> to the SPA using remap_pfn_range().
>
> The following figure shows the memory map in the virtualization
> environment.
>
> |---- Sysmem ----|                  |--- GPU mem ---|  VM Memory
> |                |                  |               |
> |IPA <-> SPA map |                  |IPA <-> SPA map|
> |                |                  |               |
> |--- HI / EGM ---|-- Host Mem --|   |--- GPU mem ---|  Host Memory
>
> The patch series introduce a new nvgrace-egm auxiliary driver module
> to manage and map the HI/EGM region in the Grace Blackwell systems.
> This binds to the auxiliary device created by the parent
> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
> (out-of-tree open source module for SRIOV vGPU) to manage the
> EGM region for the VM. Note that there is a unique EGM region per
> socket and the auxiliary device gets created for every region. The
> parent module fetches the EGM region information from the ACPI
> tables and populate to the data structures shared with the auxiliary
> nvgrace-egm module.
>
> nvgrace-egm module handles the following:
> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> from the parent device shared EGM region structure.
> 2. Create a char device that can be used as memory-backend-file by Qemu
> for the VM and implement file operations. The char device is /dev/egmX,
> where X is the PXM node ID of the EGM being mapped fetched in 1.
> 3. Zero the EGM memory on first device open().
> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
> 5. Cleaning up state and destroying the chardev on device unbind.
> 6. Handle presence of retired poisoned pages on the EGM region.
>
> Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
> in the same directory.

Pondering this series for a bit, is this auxiliary chardev approach
really the model we should be pursuing?

I know we're trying to disassociate the EGM region from the GPU, and
de-duplicate it between GPUs on the same socket, but is there actually
a use case of the EGM chardev separate from the GPU?

The independent lifecycle of this aux device is troubling and it hasn't
been confirmed whether or not access to the EGM region has some
dependency on the state of the GPU. nvgrace-gpu is manipulating sysfs
on devices owned by nvgrace-egm, we don't have mechanisms to manage the
aux device relative to the state of the GPU, we're trying to add a
driver that can bind to a device created by an out-of-tree driver, and
we're inventing new uAPIs on the chardev for things that already exist
for vfio regions.

Therefore, does it actually make more sense to expose EGM as a device
specific region on the vfio device fd?

For example, nvgrace-gpu might manage the de-duplication by only
exposing this device specific region on the lowest BDF GPU per socket.
The existing REGION_INFO ioctl handles reporting the size to the user.
The direct association to the GPU device handles reporting the node
locality. If necessary, a capability on the region could report the
associated PXM, and maybe even the retired page list.
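
To illustrate the region-based alternative, here is a minimal userspace
sketch in C of how a VMM might recognize such a region from the
capability chain returned by VFIO_DEVICE_GET_REGION_INFO. It is only a
sketch under stated assumptions: the EGM_REGION_TYPE and
EGM_REGION_SUBTYPE values and the region_is_egm() helper are
hypothetical, and neither the series nor this reply defines them. Only
the structures and constants taken from <linux/vfio.h> are existing
uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/vfio.h>

/*
 * HYPOTHETICAL identifiers for an EGM device specific region. The type
 * follows the existing "PCI vendor type | vendor ID" convention (0x10de
 * is NVIDIA's PCI vendor ID); the subtype value is made up for this
 * sketch and is not defined anywhere in the kernel.
 */
#define EGM_REGION_TYPE \
	((uint32_t)VFIO_REGION_TYPE_PCI_VENDOR_TYPE | 0x10deu)
#define EGM_REGION_SUBTYPE 0x100u

/*
 * Walk the capability chain of an already filled-in vfio_region_info
 * buffer (info->argsz bytes long) and report whether it carries a
 * region-type capability matching the hypothetical EGM type/subtype.
 */
static bool region_is_egm(const struct vfio_region_info *info)
{
	const uint8_t *base = (const uint8_t *)info;
	uint32_t off;

	assert(info != NULL);

	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
		return false;

	for (off = info->cap_offset; off != 0; ) {
		const struct vfio_info_cap_header *hdr;

		/* Capabilities live after the fixed struct, inside argsz. */
		if (off < sizeof(*info) || off + sizeof(*hdr) > info->argsz)
			return false;
		hdr = (const void *)(base + off);

		if (hdr->id == VFIO_REGION_INFO_CAP_TYPE) {
			const struct vfio_region_info_cap_type *cap =
				(const void *)hdr;

			if (off + sizeof(*cap) > info->argsz)
				return false;
			if (cap->type == EGM_REGION_TYPE &&
			    cap->subtype == EGM_REGION_SUBTYPE)
				return true;
		}

		/* Refuse chains that do not move forward, to avoid loops. */
		if (hdr->next != 0 && hdr->next <= off)
			return false;
		off = hdr->next;
	}
	return false;
}
```

In real use the buffer would come from the usual two-step query: call
VFIO_DEVICE_GET_REGION_INFO with argsz = sizeof(struct vfio_region_info),
then, if the returned argsz is larger, reallocate to that size and call
again so the capability chain is included. The region's size and offset
fields then give the length and the mmap() offset on the device fd.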

All of the lifecycle issues are automatically handled, there's no
separate aux device. If necessary, zapping and faulting across reset is
handled just like a BAR mapping.

If we need to expose the EGM size and GPU association via sysfs for
management tooling, nvgrace-gpu could add an "egm_size" attribute to the
PCI device's sysfs node. This could also avoid the implicit
implementation knowledge about which GPU exposes the EGM device
specific region.

Was such a design considered? It seems much, much simpler and could be
implemented by either nvgrace-gpu or identically by an out-of-tree
driver without references in an in-kernel ID table.

I'd like to understand the pros and cons of such an approach vs the one
presented here. Thanks,

Alex