Date: Wed, 11 Mar 2026 14:37:06 -0600
From: Alex Williamson
To: Ankit Agrawal, Jason Gunthorpe
Cc: Vikram Sethi, Matt Ochs, jgg@ziepe.ca, Shameer Kolothum Thodi,
 Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas, kevin.tian@intel.com,
 kvm@vger.kernel.org, linux-kernel@vger.kernel.org, alex@shazbot.org
Subject: Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
Message-ID: <20260311143706.2095a547@shazbot.org>
References: <20260223155514.152435-1-ankita@nvidia.com>
 <20260305103335.74fb8141@shazbot.org>

On Wed, 11 Mar 2026 06:47:12 +0000
Ankit Agrawal wrote:

> Thanks Alex for the review.
>
> >> The patch series introduces a new nvgrace-egm auxiliary driver
> >> module to manage and map the HI/EGM region on Grace Blackwell
> >> systems. It binds to the auxiliary device created by the parent,
> >> either nvgrace-gpu (in-tree module for device assignment) or
> >> nvidia-vgpu-vfio (out-of-tree open source module for SR-IOV vGPU),
> >> to manage the EGM region for the VM. Note that there is a unique
> >> EGM region per socket and the auxiliary device gets created for
> >> every region. The parent module fetches the EGM region information
> >> from the ACPI tables and populates the data structures shared with
> >> the auxiliary nvgrace-egm module.
> >>
> >> The nvgrace-egm module handles the following:
> >> 1. Fetch the EGM memory properties (base HPA, length, proximity
> >>    domain) from the parent device's shared EGM region structure.
> >> 2. Create a char device that QEMU can use as a memory-backend-file
> >>    for the VM, and implement its file operations. The char device
> >>    is /dev/egmX, where X is the PXM node ID of the EGM being
> >>    mapped, fetched in 1.
> >> 3. Zero the EGM memory on first device open().
> >> 4. Map the QEMU VMA to the EGM region using remap_pfn_range().
> >> 5. Clean up state and destroy the chardev on device unbind.
> >> 6. Handle the presence of retired poisoned pages in the EGM region.
> >>
> >> Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept
> >> in the same directory.
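For anyone following along, the mmap path in items 2 and 4 above
amounts to roughly the following (a sketch with hypothetical names,
not the actual patch code):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* hypothetical per-region state, filled from the parent's shared struct */
struct egm_region {
	phys_addr_t base_hpa;
	size_t length;
};

static int egm_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct egm_region *egm = filp->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	/* refuse mappings that would run past the carveout */
	if (vma->vm_pgoff + PHYS_PFN(len) > PHYS_PFN(egm->length))
		return -EINVAL;

	/* map the EGM carveout directly into the QEMU VMA */
	return remap_pfn_range(vma, vma->vm_start,
			       PHYS_PFN(egm->base_hpa) + vma->vm_pgoff,
			       len, vma->vm_page_prot);
}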
> > Pondering this series for a bit, is this auxiliary chardev approach
> > really the model we should be pursuing?
> >
> > I know we're trying to disassociate the EGM region from the GPU, and
> > de-duplicate it between GPUs on the same socket, but is there
> > actually a use case for the EGM chardev separate from the GPU?
>
> It is not just de-duplication. The EGM is a carveout of system memory,
> logically and physically separate and disconnected from the GPU. The
> uniqueness here is that the information (SPA, size) for the region is
> present in the GPU ACPI tables.
>
> > The independent lifecycle of this aux device is troubling and it
> > hasn't been confirmed whether or not access to the EGM region has
> > some dependency on the state of the GPU.
>
> The EGM region is independent of the state of the GPU. One can
> plausibly boot up the VM with just the EGM memory chardev as the
> backend file and no GPU.

If EGM is fully independent of the state of the PCI device, then
basing the lifecycle of the aux devices on the state of the PCI driver
seems like the wrong model.

> > nvgrace-gpu is manipulating sysfs on devices owned by nvgrace-egm,
> > we don't have mechanisms to manage the aux device relative to the
> > state of the GPU, we're trying to add a driver that can bind to a
> > device created by an out-of-tree driver, and we're inventing new
> > uAPIs on the chardev for things that already exist for vfio regions.
>
> Sorry for the confusion. nvgrace-egm would not bind to the device
> created by the out-of-tree driver. We would have a separate
> out-of-tree equivalent of nvgrace-egm to bind to the device created
> by the out-of-tree vfio driver. Maybe we can consider exposing
> register/unregister APIs from nvgrace-egm, where a module (in-tree
> nvgrace-gpu or out-of-tree) can register a pdev that nvgrace-egm uses
> to fetch the region info.

Ok, this wasn't clear to me, but does that also mean that if some GPUs
are managed by nvgrace-gpu and others by out-of-tree drivers, the
in-kernel and out-of-tree equivalent drivers are both installing
chardevs as /dev/egmXX? Playing in the same space is ugly, but what
happens when the 2 GPUs per socket are split between drivers and both
try to add the same chardev?

> > Therefore, does it actually make more sense to expose EGM as a
> > device specific region on the vfio device fd?
> >
> > For example, nvgrace-gpu might manage the de-duplication by only
> > exposing this device specific region on the lowest BDF GPU per
> > socket. The existing REGION_INFO ioctl handles reporting the size
> > to the user. The direct association to the GPU device handles
> > reporting the node locality. If necessary, a capability on the
> > region could report the associated PXM, and maybe even the retired
> > page list.
> >
> > All of the lifecycle issues are automatically handled, there's no
> > separate aux device. If necessary, zapping and faulting across
> > reset is handled just like a BAR mapping.
>
> The EGM memory (which becomes the system memory of the VM) cannot be
> tied to GPU reset, as it is unrelated to the GPU device. We would not
> want that to happen to system memory on GPU reset.

It's not the state of the EGM/system memory that I'm concerned about,
it's the fact that the routing to access that memory traverses two
GPUs and both the backplane and C2C NVLink connections. If access
through that channel is 100% independent of the state of either GPU,
then GPU resets are irrelevant. However, I'd then ask why we're
associating EGM with the GPU PCI driver at all. For instance, why
should nvgrace-gpu spawn aux devices to feed into an nvgrace-egm
driver, and duplicate that whole arrangement in an out-of-tree driver,
when we could just have one in-kernel platform(?) driver walk ACPI,
find these ranges, and expose them as chardevs entirely independent of
the PCI driver bound to the GPU?
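Concretely, I'm picturing something of this shape (a sketch --
hypothetical _HID and helper, and it assumes firmware describes EGM as
its own ACPI object, which I come back to below):

#include <linux/acpi.h>
#include <linux/module.h>
#include <linux/platform_device.h>

static const struct acpi_device_id egm_acpi_ids[] = {
	{ "NVDA9999" },		/* made-up _HID, not a real firmware ID */
	{ }
};
MODULE_DEVICE_TABLE(acpi, egm_acpi_ids);

static int egm_probe(struct platform_device *pdev)
{
	struct resource *res;

	/* base HPA/length would come from the EGM object's _CRS */
	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	if (!res)
		return -ENODEV;

	/* hypothetical helper: registers /dev/egmX for this range */
	return egm_create_chardev(&pdev->dev, res->start,
				  resource_size(res));
}

static struct platform_driver egm_driver = {
	.probe = egm_probe,
	.driver = {
		.name = "nvgrace-egm",
		.acpi_match_table = egm_acpi_ids,
	},
};
module_platform_driver(egm_driver);

One copy of the code, no PCI driver in the loop, and the chardev
lifecycle follows the platform device rather than whichever vfio
variant driver happens to own the GPU.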
> > If we need to expose the EGM size and GPU association via sysfs for
> > management tooling, nvgrace-gpu could add an "egm_size" attribute
> > to the PCI device's sysfs node. This could also avoid the implicit
> > implementation knowledge about which GPU exposes the EGM device
> > specific region.
> >
> > Was such a design considered? It seems much, much simpler and could
> > be implemented by either nvgrace-gpu or identically by an
> > out-of-tree driver without references in an in-kernel ID table.
> >
> > I'd like to understand the pros and cons of such an approach vs the
> > one presented here. Thanks,
>
> We didn't consider it as a separate BAR / region because the EGM
> memory (part of the system memory) is unrelated to the GPU device,
> beyond having its information in the GPU ACPI table, and becomes the
> system memory of the VM. Treating it as part of the device BAR /
> region would tie the lifecycle of the EGM region to the GPU device.
> Also, we cannot consider zapping/faulting across GPU reset, as it is
> system memory of the VM.

It's curious why the EGM description is associated with the GPU ACPI
object if it really is fully independent. It seems like perhaps it
should be a unique ACPI object in that case, which would make claiming
it via a platform driver easier. Maybe we don't need to be tied to
that firmware decision in the design of the software driver, though.
Thanks,

Alex
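P.S. For the sysfs piece above, I'm picturing nothing more elaborate
than this (again a sketch; the drvdata type and field are stand-ins
for whatever nvgrace-gpu actually keeps):

#include <linux/device.h>
#include <linux/sysfs.h>

struct nvgrace_gpu_state {	/* hypothetical driver state */
	u64 egm_length;
};

static ssize_t egm_size_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	struct nvgrace_gpu_state *state = dev_get_drvdata(dev);

	return sysfs_emit(buf, "%llu\n",
			  (unsigned long long)state->egm_length);
}
static DEVICE_ATTR_RO(egm_size);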