From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 27 Sep 2023 12:33:18 +0100
To: Ankit Agrawal
CC: Alex Williamson, David Hildenbrand, Cédric Le Goater, Jason Gunthorpe,
 "shannon.zhaosl@gmail.com", "peter.maydell@linaro.org", "ani@anisinha.ca",
 "Aniket Agashe", Neo Jia, Kirti Wankhede, "Tarun Gupta (SW-GPU)",
 Vikram Sethi, Andy Currid, "qemu-arm@nongnu.org", "qemu-devel@nongnu.org",
 Gavin Shan
Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
Message-ID: <20230927123318.00005aad@Huawei.com>
References: <20230915024559.6565-1-ankita@nvidia.com>
 <20230915084754.4b49d5c0.alex.williamson@redhat.com>
 <769b577a-65b0-dbfe-3e99-db57cea08529@redhat.com>
 <20230926131427.1e441670.alex.williamson@redhat.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.
Reply-to: Jonathan Cameron
From: Jonathan Cameron

On Wed, 27 Sep 2023 07:14:28 +0000
Ankit Agrawal wrote:

> > >
> > > Based on the suggestions here, can we consider something like the
> > > following?
> > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > > structure to make it hotpluggable.
> >
> > Is that "devnode=on" parameter required?
> > Can't we simply expose any node
> > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

That needs some checking of what extra stuff we'd instantiate on CPU-only
nodes or (once we implement them) Generic Initiator / Generic Port nodes.
I'm definitely not keen on doing so for generic ports (which QEMU doesn't
yet do, though there have been some RFCs I think).

> > Right now, with "ordinary", fixed-location memory devices
> > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> > covers the device memory region for these devices with
> > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> > which does not quite work IIRC. All applicable nodes that don't have
> > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.
>
> Yeah, you're right that it isn't required. Exposing the node without any memory as
> MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using
> "devnode=on".
>
> > In your example, which memory ranges would we use for these nodes in SRAT?
>
> We are setting the Base Address and the Size as 0 in the SRAT memory affinity
> structures.
> This is done through the following:
> build_srat_memory(table_data, 0, 0, i,
>                   MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
>
> This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug
>
> I would re-iterate that we are just emulating the baremetal behavior here.
>
>
> > I don't see how these numa-node args on a vfio-pci device have any
> > general utility.  They're only used to create a firmware table, so why
> > don't we be explicit about it and define the firmware table as an
> > object?  For example:
> >
> >        -numa node,nodeid=2 \
> >        -numa node,nodeid=3 \
> >        -numa node,nodeid=4 \
> >        -numa node,nodeid=5 \
> >        -numa node,nodeid=6 \
> >        -numa node,nodeid=7 \
> >        -numa node,nodeid=8 \
> >        -numa node,nodeid=9 \
> >        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
> >        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \
>
> Yeah, that is fine with me. If we agree with this approach, I can go
> implement it.
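
For reference, the zero-length entries discussed above can be sketched outside
QEMU. The Python below packs one 40-byte SRAT Memory Affinity Structure per
node, following the field layout in the ACPI specification, with base 0 and
length 0 as in the quoted build_srat_memory() call. The helper names
pack_srat_memory and inclusive_end are illustrative assumptions, not QEMU
code, and the flag values are restated from the ACPI spec (bit 0 Enabled,
bit 1 Hot Pluggable). The inclusive_end helper shows one plausible reason the
guest log prints an end address of 0xffffffffffffffff: an inclusive end
computed as base + length - 1 wraps around in 64-bit arithmetic when both are
0. That explanation is an assumption that is consistent with the log, not
something stated in the thread.

```python
import struct

# Flag bits of the SRAT Memory Affinity Structure (ACPI spec).
# The names mirror the constants in the quoted call; the values are
# restated here so the sketch is self-contained.
MEM_AFFINITY_ENABLED = 1 << 0
MEM_AFFINITY_HOTPLUGGABLE = 1 << 1


def pack_srat_memory(base, length, node, flags):
    """Pack one 40-byte Memory Affinity Structure (hypothetical helper)."""
    return struct.pack(
        "<BBIHQQIIQ",
        1,       # Type: 1 = Memory Affinity Structure
        40,      # Length of this structure in bytes
        node,    # Proximity Domain
        0,       # Reserved
        base,    # Base Address (low and high dwords as one 64-bit value)
        length,  # Length (low and high dwords as one 64-bit value)
        0,       # Reserved
        flags,   # Flags
        0,       # Reserved
    )


def inclusive_end(base, length):
    """Inclusive end address as an unsigned 64-bit value (wraps if length is 0)."""
    return (base + length - 1) & 0xFFFFFFFFFFFFFFFF


# One zero-length, hotpluggable entry for each of nodes 2-9.
entries = [
    pack_srat_memory(0, 0, node, MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED)
    for node in range(2, 10)
]
print(len(entries), len(entries[0]), hex(inclusive_end(0, 0)))
```
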
>
>
> > There are some suggestions in this thread that CXL could have similar
> > requirements,

For the CXL side of things, if we're talking about memory devices (type 3),
I'm not sure what the use case for this feature would be. Either we treat
them as normal memory, in which case it will all be static at boot of the VM
(for SRAT anyway; we might plug things in and out of ranges), or it will be
whole device hotplug and look like pc-dimm hotplug (which should be into a
statically defined range in SRAT).

Longer term, if we look at virtualizing dynamic capacity devices (not sure we
need to, other than possibly to leverage sparse DAX etc on top of them), then
we might want to provide emulated CXL Fixed Memory Windows in the guest (which
get their own NUMA nodes anyway) and plug the memory into those. We'd probably
hide away interleaving etc in the host, as all the guest should care about is
performance information, and I doubt we'd want to emulate the complexity of
address routing.

Similar to host PA ranges used in CXL fixed memory windows, I'm not sure we
wouldn't just allow for the guest to have 'all' possible setups that might get
plugged later by just burning a lot of HPA space, and hence just be able to use
static SRAT nodes covering each region. This would be less painful than for
real PAs because, as we are emulating the CXL devices (probably as one emulated
type 3 device per potential set of real devices in an interleave set), we can
avoid all the ordering constraints of CXL address decoders that end up eating
up Host PA space.

Virtualizing DCD is going to be a fun topic (that's next year's plumbers CXL
uconf session sorted ;), but I can see it might be done completely differently
and look nothing like a CXL device, in which case maybe what you have here
will make sense.

Come to think of it, you 'could' potentially do that for your use case if the
regions are reasonably bounded in maximum size, at the cost of large GPA usage?
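
The trade-off in the last paragraph can be made concrete with a rough sketch.
It assumes each device memory node gets a fixed, naturally aligned GPA window
of some maximum size, so that SRAT could describe every node statically. The
function name, the 1 TiB starting address and the 128 GiB per-node bound are
made-up values for illustration only; nothing in the thread specifies them.

```python
GiB = 1 << 30


def static_windows(first_node, node_count, max_size, gpa_base):
    """Reserve one fixed GPA window per node (hypothetical layout).

    Each window is max_size bytes and aligned to max_size, so a static
    SRAT entry could cover it however much memory ends up plugged.
    """
    # Round the starting address up to a multiple of max_size.
    base = (gpa_base + max_size - 1) // max_size * max_size
    windows = {}
    for node in range(first_node, first_node + node_count):
        windows[node] = (base, base + max_size - 1)  # inclusive range
        base += max_size
    return windows


# Assumed numbers: nodes 2-9, at most 128 GiB each, placed above 1 TiB.
windows = static_windows(2, 8, 128 * GiB, 1024 * GiB)
reserved = sum(end - start + 1 for start, end in windows.values())
for node, (start, end) in windows.items():
    print(f"node {node}: [mem {start:#x}-{end:#x}]")
print(f"GPA space reserved: {reserved // GiB} GiB")
```

With these assumed numbers, eight nodes reserve 1 TiB of guest physical
address space whether or not any of it is ever backed by device memory, which
is the cost being weighed against zero-length nodes.
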
CXL accelerators / GPUs etc are a different question, but who has one of those
anyway? :)

> > but I haven't found any evidence that these
> > dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> > any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> > rather than an NVIDIA specific one.
>
> Maybe Jason, Jonathan can chime in on this?

I'm not aware of anything general around this. A PCI device can have a _PXM,
and I think you could define subdevices each with a _PXM of their own? Those
subdevices would need drivers to interpret the structure anyway, so there is
no real benefit over a _DSD that I can immediately think of...

If we think this will be common long term, anyone want to take multiple _PXM
per device support as a proposal to ACPI?

So agreed, it's not general, so if it's acceptable to have 0 length NUMA nodes
(and I think we have to emulate them given that's what real hardware is doing,
even if some of us think the real hardware shouldn't have done that!) then just
spinning them up explicitly as nodes + device specific stuff for the NVIDIA
device seems fine to me.

>
>
> > It seems like we could almost meet the requirement for this table via
> > -acpitable, but I think we'd like to avoid the VM orchestration tool
> > from creating, compiling, and passing ACPI data blobs into the VM.
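
To illustrate the device-specific _DSD discussed above, here is a sketch of
how a driver might turn the two properties into a list of proximity domains.
The property names dev-mem-pxm-start and dev-mem-pxm-count come from the
thread. Representing the _DSD as a plain dictionary, the function name and
the sample values (chosen to match nodeset=2-9 in the example command line)
are assumptions for illustration.

```python
def pxm_range_from_dsd(dsd):
    """Expand start/count _DSD properties into proximity domain IDs.

    `dsd` is a plain dict standing in for the device's _DSD properties
    (an assumption; a real driver would read them from firmware).
    """
    start = dsd["dev-mem-pxm-start"]
    count = dsd["dev-mem-pxm-count"]
    if start < 0 or count < 1:
        raise ValueError("invalid PXM range")
    return list(range(start, start + count))


# Sample values chosen to match nodes 2-9 from the example command line.
nodes = pxm_range_from_dsd({"dev-mem-pxm-start": 2, "dev-mem-pxm-count": 8})
print(nodes)
```
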