* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order [not found] <20260222020812.26475-1-ankita@nvidia.com> @ 2026-02-23 7:28 ` Igor Mammedov [not found] ` <SA1PR12MB7199F0C2E1D2325B0062B004B077A@SA1PR12MB7199.namprd12.prod.outlook.com> 0 siblings, 1 reply; 15+ messages in thread From: Igor Mammedov @ 2026-02-23 7:28 UTC (permalink / raw) To: ankita Cc: vsethi, jgg, skolothumtho, alex, mst, anisinha, aniketa, cjia, kwankhede, targupta, zhiw, mochs, kjaju, qemu-devel On Sun, 22 Feb 2026 02:08:12 +0000 <ankita@nvidia.com> wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > During creation of the VM's SRAT table, the generic initiator entries > are added. Currently the order in the entries are not controllable from > the qemu command. This is due to the fact that the code queries the > object tree which may not be in the order objects were inserted. > > As a fix the patch maintains a GPtrArray of generic initiator objects > that preserves their insertion order. Objects are automatically added > to the array when initialized and removed when finalized. When building > the SRAT table, objects are processed in the order they were first > inserted. so question would be, why does it matter? Is ther a requirement in spec for SRAT entries being put in a particular order? > > E.g. for the following qemu command. > ... > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \ > -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \ > -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \ > -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \ > -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \ > -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \ > -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \ > -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \ > ... 
> > Original PXM in the VM SRAT table: > [1A4h 0420 004h] Proximity Domain : 00000007 > [1C4h 0452 004h] Proximity Domain : 00000006 > [1E4h 0484 004h] Proximity Domain : 00000005 > [204h 0516 004h] Proximity Domain : 00000004 > [224h 0548 004h] Proximity Domain : 00000003 > [244h 0580 004h] Proximity Domain : 00000009 > [264h 0612 004h] Proximity Domain : 00000002 > [284h 0644 004h] Proximity Domain : 00000008 > [2A2h 0674 004h] Proximity Domain : 00000009 > > After the patch (preserves insertion order): > [1A4h 0420 004h] Proximity Domain : 00000002 > [1C4h 0452 004h] Proximity Domain : 00000003 > [1E4h 0484 004h] Proximity Domain : 00000004 > [204h 0516 004h] Proximity Domain : 00000005 > [224h 0548 004h] Proximity Domain : 00000006 > [244h 0580 004h] Proximity Domain : 00000007 > [264h 0612 004h] Proximity Domain : 00000008 > [284h 0644 004h] Proximity Domain : 00000009 > > cc: Shameer Kolothum <skolothumtho@nvidia.com> > Fixes: 0a5b5acdf2 ("hw/acpi: Implement the SRAT GI affinity structure") > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > hw/acpi/pci.c | 44 ++++++++++++++++++++++++++++++++------------ > 1 file changed, 32 insertions(+), 12 deletions(-) > > diff --git a/hw/acpi/pci.c b/hw/acpi/pci.c > index 8c7ed10479..d97e6e9105 100644 > --- a/hw/acpi/pci.c > +++ b/hw/acpi/pci.c > @@ -88,18 +88,30 @@ OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator, > > OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR) > > +static GPtrArray *acpi_generic_initiator_list; > + > static void acpi_generic_initiator_init(Object *obj) > { > AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj); > > gi->node = MAX_NODES; > gi->pci_dev = NULL; > + > + /* Initialize array on first use */ > + if (!acpi_generic_initiator_list) { > + acpi_generic_initiator_list = g_ptr_array_new(); > + } > + > + g_ptr_array_add(acpi_generic_initiator_list, gi); > } > > static void acpi_generic_initiator_finalize(Object *obj) > { > 
AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj); > > + if (acpi_generic_initiator_list) { > + g_ptr_array_remove(acpi_generic_initiator_list, gi); > + } > g_free(gi->pci_dev); > } > > @@ -145,20 +157,15 @@ static void acpi_generic_initiator_class_init(ObjectClass *oc, const void *data) > "NUMA node associated with the PCI device"); > } > > -static int build_acpi_generic_initiator(Object *obj, void *opaque) > + > +static void build_acpi_generic_initiator(AcpiGenericInitiator *gi, > + GArray *table_data) > { > MachineState *ms = MACHINE(qdev_get_machine()); > - AcpiGenericInitiator *gi; > - GArray *table_data = opaque; > int32_t devfn; > uint8_t bus; > Object *o; > > - if (!object_dynamic_cast(obj, TYPE_ACPI_GENERIC_INITIATOR)) { > - return 0; > - } > - > - gi = ACPI_GENERIC_INITIATOR(obj); > if (gi->node >= ms->numa_state->num_nodes) { > error_printf("%s: Specified node %d is invalid.\n", > TYPE_ACPI_GENERIC_INITIATOR, gi->node); > @@ -178,8 +185,22 @@ static int build_acpi_generic_initiator(Object *obj, void *opaque) > assert(devfn >= 0 && devfn < PCI_DEVFN_MAX); > > build_srat_pci_generic_initiator(table_data, gi->node, 0, bus, devfn); > +} > > - return 0; > +static void build_all_acpi_generic_initiators(GArray *table_data) > +{ > + AcpiGenericInitiator *gi; > + guint i; > + > + if (!acpi_generic_initiator_list) { > + return; > + } > + > + /* Iterate array in insertion order */ > + for (i = 0; i < acpi_generic_initiator_list->len; i++) { > + gi = g_ptr_array_index(acpi_generic_initiator_list, i); > + build_acpi_generic_initiator(gi, table_data); > + } > } > > typedef struct AcpiGenericPort { > @@ -295,9 +316,8 @@ static int build_acpi_generic_port(Object *obj, void *opaque) > > void build_srat_generic_affinity_structures(GArray *table_data) > { > - object_child_foreach_recursive(object_get_root(), > - build_acpi_generic_initiator, > - table_data); > + build_all_acpi_generic_initiators(table_data); > + > object_child_foreach_recursive(object_get_root(), 
> build_acpi_generic_port,
> table_data);
> }
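The registry idea in the quoted patch can be modelled without GLib. The following is a minimal sketch, not QEMU code: `Registry`, `registry_add`, `registry_remove` and `REGISTRY_CAP` are hypothetical names chosen for illustration. It shows why removal has to shift the later elements down by one slot (which is what `g_ptr_array_remove()` does) rather than fill the hole with the last element, if insertion order is to survive an object being finalized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REGISTRY_CAP 16 /* hypothetical fixed capacity for this sketch */

typedef struct {
    const void *items[REGISTRY_CAP];
    size_t len;
} Registry;

/* Append at the tail so that iteration order equals insertion order. */
static int registry_add(Registry *r, const void *item)
{
    assert(r != NULL);
    if (r->len == REGISTRY_CAP) {
        return -1; /* full */
    }
    r->items[r->len++] = item;
    return 0;
}

/*
 * Remove the first matching entry and move the following entries
 * down one place, keeping the relative order of the survivors.
 * Returns 1 if an entry was removed, 0 if it was not found.
 */
static int registry_remove(Registry *r, const void *item)
{
    assert(r != NULL);
    for (size_t i = 0; i < r->len; i++) {
        if (r->items[i] == item) {
            memmove(&r->items[i], &r->items[i + 1],
                    (r->len - i - 1) * sizeof(r->items[0]));
            r->len--;
            return 1;
        }
    }
    return 0;
}
```

Adding three entries and removing the middle one leaves the first and third in their original relative order, which is the property the patch relies on when walking the array to emit SRAT entries.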
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order [not found] ` <SA1PR12MB7199F0C2E1D2325B0062B004B077A@SA1PR12MB7199.namprd12.prod.outlook.com> @ 2026-02-23 9:44 ` Igor Mammedov 2026-02-23 11:13 ` Jonathan Cameron via qemu development 0 siblings, 1 reply; 15+ messages in thread From: Igor Mammedov @ 2026-02-23 9:44 UTC (permalink / raw) To: Ankit Agrawal Cc: Vikram Sethi, Jason Gunthorpe, Shameer Kolothum Thodi, alex@shazbot.org, mst@redhat.com, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Mon, 23 Feb 2026 07:49:51 +0000 Ankit Agrawal <ankita@nvidia.com> wrote: > >> During creation of the VM's SRAT table, the generic initiator entries > >> are added. Currently the order in the entries are not controllable from > >> the qemu command. This is due to the fact that the code queries the > >> object tree which may not be in the order objects were inserted. > >> > >> As a fix the patch maintains a GPtrArray of generic initiator objects > >> that preserves their insertion order. Objects are automatically added > >> to the array when initialized and removed when finalized. When building > >> the SRAT table, objects are processed in the order they were first > >> inserted. > > > > so question would be, why does it matter? > > Is ther a requirement in spec for SRAT entries being put in a particular order? > > Hi Igor, reposting my response. I'll make this information as part of the next > version if and when I refresh. > > VM's Linux kernel parses the generic initiator (GI) structures present in the SRAT > table sequentially in the order of their occurrence and assigns a numa node > id when a new proximity domain (that is part of the GI structure) is encountered. > A jumbled up entries in the VM's SRAT consequently results in the jumbled up > sequence on numa nodes v/s the ones intended to be assigned through the > qemu command line. 
> This messes up the internode numa distances assignment
> through the qemu command line as the VM's view of the corresponding nodes
> is entirely different.

Assuming that the QEMU CLI is correctly defined, the above looks very much like
a Linux kernel bug.

That is: if the kernel is not mapping proximity IDs to its internal node IDs
correctly and then links them with something else entirely, it is the kernel
that is in the wrong, not the ACPI tables QEMU provides.

IMHO it should be fixed on the kernel side (unless you find a statement in the
spec that mandates a particular ordering in the SRAT).
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-23 9:44 ` Igor Mammedov @ 2026-02-23 11:13 ` Jonathan Cameron via qemu development 2026-02-24 13:51 ` Jason Gunthorpe 0 siblings, 1 reply; 15+ messages in thread From: Jonathan Cameron via qemu development @ 2026-02-23 11:13 UTC (permalink / raw) To: Igor Mammedov Cc: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Shameer Kolothum Thodi, alex@shazbot.org, mst@redhat.com, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Mon, 23 Feb 2026 10:44:11 +0100 Igor Mammedov <imammedo@redhat.com> wrote: > On Mon, 23 Feb 2026 07:49:51 +0000 > Ankit Agrawal <ankita@nvidia.com> wrote: > > > >> During creation of the VM's SRAT table, the generic initiator entries > > >> are added. Currently the order in the entries are not controllable from > > >> the qemu command. This is due to the fact that the code queries the > > >> object tree which may not be in the order objects were inserted. > > >> > > >> As a fix the patch maintains a GPtrArray of generic initiator objects > > >> that preserves their insertion order. Objects are automatically added > > >> to the array when initialized and removed when finalized. When building > > >> the SRAT table, objects are processed in the order they were first > > >> inserted. > > > > > > so question would be, why does it matter? > > > Is ther a requirement in spec for SRAT entries being put in a particular order? > > > > Hi Igor, reposting my response. I'll make this information as part of the next > > version if and when I refresh. > > > > VM's Linux kernel parses the generic initiator (GI) structures present in the SRAT > > table sequentially in the order of their occurrence and assigns a numa node > > id when a new proximity domain (that is part of the GI structure) is encountered. 
> > A jumbled up entries in the VM's SRAT consequently results in the jumbled up
> > sequence on numa nodes v/s the ones intended to be assigned through the
> > qemu command line. This messes up the internode numa distances assignment
> > through the qemu command line as the VM's view of the corresponding nodes
> > is entirely different.
>
> Assuming that QEMU CLI is correctly defined, above looks very much like a linux
> kernel bug.
>
> Aka: if kernel is not mapping proximity ID to its internal node ids correctly
> and then links them with something else entirely, it's kernel in wrong
> and not ACPI tables QEMU provides.
>
> IMHO it should be fixed on kernel side. (unless you find statement in spec
> that mandates the particular ordering in SRAT)
>

Ankit, can you give an example complete with table dumps please?

I'm a little unsure where things are getting scrambled.
Everything should be keyed off PXM. Sounds like we have a bug
somewhere, but ordering shouldn't be relevant.

Jonathan
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-23 11:13 ` Jonathan Cameron via qemu development @ 2026-02-24 13:51 ` Jason Gunthorpe 2026-02-24 14:01 ` Michael S. Tsirkin 0 siblings, 1 reply; 15+ messages in thread From: Jason Gunthorpe @ 2026-02-24 13:51 UTC (permalink / raw) To: Jonathan Cameron Cc: Igor Mammedov, Ankit Agrawal, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, mst@redhat.com, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Mon, Feb 23, 2026 at 11:13:02AM +0000, Jonathan Cameron wrote: > Ankit, can you give an example complete with table dumps please. > > I'm a little unsure on where things are getting scrambled. > Everything should be keyed of PXM. Sounds like we have a bug > somewhere but ordering shouldn't be relevant. I understood the issue is Linux assigns the uAPI visible NUMA node numbers based on the ordering. The proximity/etc internal to the kernel (I thought) was OK? Then the problem is that uAPI has developed meaning based on what the bare metal HW does and now there are SW stacks that are expecting these platforms to have certain NUMA IDs in the Linux uAPI. Sure you can argue this is bad/etc/etc but the point of QEMU is to allow creating VMs that closely match real HW and in this instance real HW produces an ACPI table with a certain ordering and the SW is sensitive to this ordering. Even if there is some Linux bug mis-parsing the ACPI, then that still should be addressed from a qemu perspective by providing the ACPI construction that doesn't trigger any bug so existing VM images will work under qemu. Thus qemu needs a way to reflect the ordering on the command line to properly emulate this system and accomodate the existing VM software... Jason ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 13:51 ` Jason Gunthorpe @ 2026-02-24 14:01 ` Michael S. Tsirkin 2026-02-24 14:42 ` Jason Gunthorpe 2026-02-24 14:54 ` Jonathan Cameron via qemu development 0 siblings, 2 replies; 15+ messages in thread From: Michael S. Tsirkin @ 2026-02-24 14:01 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jonathan Cameron, Igor Mammedov, Ankit Agrawal, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Tue, Feb 24, 2026 at 09:51:06AM -0400, Jason Gunthorpe wrote: > On Mon, Feb 23, 2026 at 11:13:02AM +0000, Jonathan Cameron wrote: > > > Ankit, can you give an example complete with table dumps please. > > > > I'm a little unsure on where things are getting scrambled. > > Everything should be keyed of PXM. Sounds like we have a bug > > somewhere but ordering shouldn't be relevant. > > I understood the issue is Linux assigns the uAPI visible NUMA node > numbers based on the ordering. The proximity/etc internal to the > kernel (I thought) was OK? > > Then the problem is that uAPI has developed meaning based on what the > bare metal HW does and now there are SW stacks that are expecting > these platforms to have certain NUMA IDs in the Linux uAPI. Sure you > can argue this is bad/etc/etc but the point of QEMU is to allow > creating VMs that closely match real HW and in this instance real HW > produces an ACPI table with a certain ordering and the SW is sensitive > to this ordering. > > Even if there is some Linux bug mis-parsing the ACPI, then that still > should be addressed from a qemu perspective by providing the ACPI > construction that doesn't trigger any bug so existing VM images will > work under qemu. 
>
> Thus qemu needs a way to reflect the ordering on the command line to
> properly emulate this system and accomodate the existing VM software...
>
> Jason

Not arguing against this, but if there's a Linux bug it is important
to fix it as a first step, QEMU workarounds for broken guests
notwithstanding. Then we can check how long the uAPI has been around,
how practical a bugfix backport in Linux is, and decide whether
a host-side workaround is worth it.

--
MST
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 14:01 ` Michael S. Tsirkin @ 2026-02-24 14:42 ` Jason Gunthorpe 2026-02-24 14:48 ` Michael S. Tsirkin 2026-02-24 14:51 ` Ankit Agrawal 2026-02-24 14:54 ` Jonathan Cameron via qemu development 1 sibling, 2 replies; 15+ messages in thread From: Jason Gunthorpe @ 2026-02-24 14:42 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Jonathan Cameron, Igor Mammedov, Ankit Agrawal, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Tue, Feb 24, 2026 at 09:01:30AM -0500, Michael S. Tsirkin wrote: > On Tue, Feb 24, 2026 at 09:51:06AM -0400, Jason Gunthorpe wrote: > > On Mon, Feb 23, 2026 at 11:13:02AM +0000, Jonathan Cameron wrote: > > > > > Ankit, can you give an example complete with table dumps please. > > > > > > I'm a little unsure on where things are getting scrambled. > > > Everything should be keyed of PXM. Sounds like we have a bug > > > somewhere but ordering shouldn't be relevant. > > > > I understood the issue is Linux assigns the uAPI visible NUMA node > > numbers based on the ordering. The proximity/etc internal to the > > kernel (I thought) was OK? > > > > Then the problem is that uAPI has developed meaning based on what the > > bare metal HW does and now there are SW stacks that are expecting > > these platforms to have certain NUMA IDs in the Linux uAPI. Sure you > > can argue this is bad/etc/etc but the point of QEMU is to allow > > creating VMs that closely match real HW and in this instance real HW > > produces an ACPI table with a certain ordering and the SW is sensitive > > to this ordering. 
> > > > Even if there is some Linux bug mis-parsing the ACPI, then that still > > should be addressed from a qemu perspective by providing the ACPI > > construction that doesn't trigger any bug so existing VM images will > > work under qemu. > > > > Thus qemu needs a way to reflect the ordering on the command line to > > properly emulate this system and accomodate the existing VM software... > > > > Jason > > Not arguing against this, but if there's a linux bug it is important > to fix it as a 1st step. qemu work arounds for broken guests > notwithstanding. then we can check how long the uapi has been around, > how practical bugfix backport in linux is, and decide on whether > a host side work around is worth it. Yeah, Ankit should provide more details to check for kernel bugs, but I fear it is more userspace bugs in practice :\ However, I think even if it is minor and easier to backport it still doesn't matter. The CSPs all built their VM types for this HW with the exepcted ACPI and this stuff is now very widely deployed. It makes no sense for qemu to be incompatible with everything pre-existing... Jason ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 14:42 ` Jason Gunthorpe @ 2026-02-24 14:48 ` Michael S. Tsirkin 2026-02-24 14:51 ` Ankit Agrawal 1 sibling, 0 replies; 15+ messages in thread From: Michael S. Tsirkin @ 2026-02-24 14:48 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jonathan Cameron, Igor Mammedov, Ankit Agrawal, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Tue, Feb 24, 2026 at 10:42:43AM -0400, Jason Gunthorpe wrote: > On Tue, Feb 24, 2026 at 09:01:30AM -0500, Michael S. Tsirkin wrote: > > On Tue, Feb 24, 2026 at 09:51:06AM -0400, Jason Gunthorpe wrote: > > > On Mon, Feb 23, 2026 at 11:13:02AM +0000, Jonathan Cameron wrote: > > > > > > > Ankit, can you give an example complete with table dumps please. > > > > > > > > I'm a little unsure on where things are getting scrambled. > > > > Everything should be keyed of PXM. Sounds like we have a bug > > > > somewhere but ordering shouldn't be relevant. > > > > > > I understood the issue is Linux assigns the uAPI visible NUMA node > > > numbers based on the ordering. The proximity/etc internal to the > > > kernel (I thought) was OK? > > > > > > Then the problem is that uAPI has developed meaning based on what the > > > bare metal HW does and now there are SW stacks that are expecting > > > these platforms to have certain NUMA IDs in the Linux uAPI. Sure you > > > can argue this is bad/etc/etc but the point of QEMU is to allow > > > creating VMs that closely match real HW and in this instance real HW > > > produces an ACPI table with a certain ordering and the SW is sensitive > > > to this ordering. 
> > > > > > Even if there is some Linux bug mis-parsing the ACPI, then that still > > > should be addressed from a qemu perspective by providing the ACPI > > > construction that doesn't trigger any bug so existing VM images will > > > work under qemu. > > > > > > Thus qemu needs a way to reflect the ordering on the command line to > > > properly emulate this system and accomodate the existing VM software... > > > > > > Jason > > > > Not arguing against this, but if there's a linux bug it is important > > to fix it as a 1st step. qemu work arounds for broken guests > > notwithstanding. then we can check how long the uapi has been around, > > how practical bugfix backport in linux is, and decide on whether > > a host side work around is worth it. > > Yeah, Ankit should provide more details to check for kernel bugs, but > I fear it is more userspace bugs in practice :\ > > However, I think even if it is minor and easier to backport it still > doesn't matter. The CSPs all built their VM types for this HW with the > exepcted ACPI and this stuff is now very widely deployed. It makes no > sense for qemu to be incompatible with everything pre-existing... > > Jason not having the data I can't judge, but this would be something to detail in the commit log. e.g. since which kernel version did it expose this, etc etc. regardless, asking that kernel fix is developed (if possible) and not just a qemu workaround, is something i routinely do since otherwise it is hard to find people willing to fix things properly. -- MST ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 14:42 ` Jason Gunthorpe 2026-02-24 14:48 ` Michael S. Tsirkin @ 2026-02-24 14:51 ` Ankit Agrawal 2026-02-24 14:54 ` Michael S. Tsirkin 2026-02-24 14:58 ` Jason Gunthorpe 1 sibling, 2 replies; 15+ messages in thread From: Ankit Agrawal @ 2026-02-24 14:51 UTC (permalink / raw) To: Jason Gunthorpe, Michael S. Tsirkin Cc: Jonathan Cameron, Igor Mammedov, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org >> >> Not arguing against this, but if there's a linux bug it is important >> to fix it as a 1st step. qemu work arounds for broken guests >> notwithstanding. then we can check how long the uapi has been around, >> how practical bugfix backport in linux is, and decide on whether >> a host side work around is worth it. > > Yeah, Ankit should provide more details to check for kernel bugs, but > I fear it is more userspace bugs in practice :\ > > However, I think even if it is minor and easier to backport it still > doesn't matter. The CSPs all built their VM types for this HW with the > exepcted ACPI and this stuff is now very widely deployed. It makes no > sense for qemu to be incompatible with everything pre-existing... I don't think this is a kernel bug. To give somewhat more details, we pass memory-less, cpu-less NUMA node to the VM such as the following through the command line. qemu-system-aarch64 \ .. -numa node,nodeid=2 -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \ .. The qemu code here only adds the numa node if the size > 0. https://github.com/qemu/qemu/blob/stable-10.2/hw/arm/virt-acpi-build.c#L718 These nodes are thus discovered through the Generic Initiator Affinity SRAT structures that is build for the acpi-generic-initiator object such as the following. 
[1A0h 0416 001h]               Subtable Type : 05 [Generic Initiator Affinity]
[1A1h 0417 001h]                      Length : 20

[1A2h 0418 001h]                   Reserved1 : 00
[1A3h 0419 001h]          Device Handle Type : 01
[1A4h 0420 004h]            Proximity Domain : 00000002
[1A8h 0424 010h]               Device Handle : 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
[1B8h 0440 004h]       Flags (decoded below) : 00000001
                                     Enabled : 1
                  Architectural Transactions : 0
[1BCh 0444 004h]                   Reserved2 : 00000000

Now the kernel parses these structures in the order of their occurrence.
A jumbled-up sequence thus results in a jumbled-up assignment.

Thanks
Ankit Agrawal
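The behaviour described in the message above, where a node ID is handed out the first time a proximity domain is seen while walking the table, can be modelled with a few lines of C. This is a simplified sketch of that description, not the Linux kernel's code: `NodeMap`, `node_map_init`, `node_map_lookup`, `node_map_scan` and `MAX_PXM` are hypothetical names, and it assumes node IDs 0 and 1 were already taken by memory-backed nodes before the GI entries are scanned.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PXM 16   /* hypothetical upper bound on proximity domains */
#define NO_NODE (-1)

typedef struct {
    int pxm_to_node[MAX_PXM];
    int next_node; /* next unused node ID */
} NodeMap;

static void node_map_init(NodeMap *m, int first_free_node)
{
    for (size_t i = 0; i < MAX_PXM; i++) {
        m->pxm_to_node[i] = NO_NODE;
    }
    m->next_node = first_free_node;
}

/* Return the node for pxm, allocating the next free ID on first sight. */
static int node_map_lookup(NodeMap *m, int pxm)
{
    assert(pxm >= 0 && pxm < MAX_PXM);
    if (m->pxm_to_node[pxm] == NO_NODE) {
        m->pxm_to_node[pxm] = m->next_node++;
    }
    return m->pxm_to_node[pxm];
}

/* Walk a list of proximity domains in table order. */
static void node_map_scan(NodeMap *m, const int *pxms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        (void)node_map_lookup(m, pxms[i]);
    }
}
```

Under these assumptions, scanning the pre-patch order from the commit message (7, 6, 5, 4, 3, 9, 2, 8) with the first free node set to 2 puts proximity domain 7 on node 2 and proximity domain 2 on node 8, while scanning 2 through 9 in order gives each proximity domain the node ID of the same number. Whether a real guest behaves this way is exactly what the thread is trying to establish.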
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 14:51 ` Ankit Agrawal @ 2026-02-24 14:54 ` Michael S. Tsirkin 2026-02-24 14:58 ` Jason Gunthorpe 1 sibling, 0 replies; 15+ messages in thread From: Michael S. Tsirkin @ 2026-02-24 14:54 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, Jonathan Cameron, Igor Mammedov, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Tue, Feb 24, 2026 at 02:51:42PM +0000, Ankit Agrawal wrote: > >> > >> Not arguing against this, but if there's a linux bug it is important > >> to fix it as a 1st step. qemu work arounds for broken guests > >> notwithstanding. then we can check how long the uapi has been around, > >> how practical bugfix backport in linux is, and decide on whether > >> a host side work around is worth it. > > > > Yeah, Ankit should provide more details to check for kernel bugs, but > > I fear it is more userspace bugs in practice :\ > > > > However, I think even if it is minor and easier to backport it still > > doesn't matter. The CSPs all built their VM types for this HW with the > > exepcted ACPI and this stuff is now very widely deployed. It makes no > > sense for qemu to be incompatible with everything pre-existing... > > I don't think this is a kernel bug. To give somewhat more details, > we pass memory-less, cpu-less NUMA node to the VM such as > the following through the command line. > > qemu-system-aarch64 \ > .. > -numa node,nodeid=2 > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \ > .. > > The qemu code here only adds the numa node if the size > 0. > https://github.com/qemu/qemu/blob/stable-10.2/hw/arm/virt-acpi-build.c#L718 > These nodes are thus discovered through the Generic Initiator > Affinity SRAT structures that is build for the acpi-generic-initiator > object such as the following. 
>
>
> [1A0h 0416 001h]               Subtable Type : 05 [Generic Initiator Affinity]
> [1A1h 0417 001h]                      Length : 20
>
> [1A2h 0418 001h]                   Reserved1 : 00
> [1A3h 0419 001h]          Device Handle Type : 01
> [1A4h 0420 004h]            Proximity Domain : 00000002
> [1A8h 0424 010h]               Device Handle : 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1B8h 0440 004h]       Flags (decoded below) : 00000001
>                                      Enabled : 1
>                   Architectural Transactions : 0
> [1BCh 0444 004h]                   Reserved2 : 00000000
>
>
> Now the kernel parse it in the sequence of their occurrence. A jumbled up
> sequence thus results in a jumbled up assignment.
>
> Thanks
> Ankit Agrawal

it can parse in any order it wants. but you are saying it exposes
the offset in the table as a proximity domain?
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 14:51 ` Ankit Agrawal 2026-02-24 14:54 ` Michael S. Tsirkin @ 2026-02-24 14:58 ` Jason Gunthorpe 2026-02-24 16:22 ` Ankit Agrawal 1 sibling, 1 reply; 15+ messages in thread From: Jason Gunthorpe @ 2026-02-24 14:58 UTC (permalink / raw) To: Ankit Agrawal Cc: Michael S. Tsirkin, Jonathan Cameron, Igor Mammedov, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Tue, Feb 24, 2026 at 02:51:42PM +0000, Ankit Agrawal wrote: > Now the kernel parse it in the sequence of their occurrence. A jumbled up > sequence thus results in a jumbled up assignment. But what is the actual failure mode here? So the numa IDs are all in a weird order, what goes wrong from that? Jason ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 14:58 ` Jason Gunthorpe @ 2026-02-24 16:22 ` Ankit Agrawal 2026-02-24 16:30 ` Michael S. Tsirkin 2026-02-24 16:41 ` Jonathan Cameron via qemu development 0 siblings, 2 replies; 15+ messages in thread From: Ankit Agrawal @ 2026-02-24 16:22 UTC (permalink / raw) To: Jason Gunthorpe Cc: Michael S. Tsirkin, Jonathan Cameron, Igor Mammedov, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org >> Now the kernel parse it in the sequence of their occurrence. A jumbled up >> sequence thus results in a jumbled up assignment. > > But what is the actual failure mode here? So the numa IDs are all in a > weird order, what goes wrong from that? This interferes with the ability to replicate the numa distance topology on host in the VM through qemu command line. E.g. consider a NUMA system with 2 sockets each with a GPU. 0,1 are the node ids for the sysmem on socket 0,1 respectively and 2,3 are the node ids for the GPU memory on socket 0,1 respectively dist(0,2) = X dist(0,3) = Y If we try to replicate this for the VM by passing qemu arguments with 4 numa nodes and assign numa distances similar to host, and for the sake of example qemu mixes up by putting GI for 3 over 2. The SLIT which sets up the distances do it considering the original order in the qemu command line. https://github.com/qemu/qemu/blob/stable-10.2/hw/acpi/aml-build.c#L2040 This would lead to a different numa config in terms of distance within the VM that the one intended through the qemu command line. Thanks Ankit Agrawal ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order 2026-02-24 16:22 ` Ankit Agrawal @ 2026-02-24 16:30 ` Michael S. Tsirkin 2026-02-24 16:41 ` Jonathan Cameron via qemu development 1 sibling, 0 replies; 15+ messages in thread From: Michael S. Tsirkin @ 2026-02-24 16:30 UTC (permalink / raw) To: Ankit Agrawal Cc: Jason Gunthorpe, Jonathan Cameron, Igor Mammedov, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org On Tue, Feb 24, 2026 at 04:22:56PM +0000, Ankit Agrawal wrote: > >> Now the kernel parse it in the sequence of their occurrence. A jumbled up > >> sequence thus results in a jumbled up assignment. > > > > But what is the actual failure mode here? So the numa IDs are all in a > > weird order, what goes wrong from that? > > This interferes with the ability to replicate the numa distance topology > on host in the VM through qemu command line. > > E.g. consider a NUMA system with 2 sockets each with a GPU. > 0,1 are the node ids for the sysmem on socket 0,1 respectively and > 2,3 are the node ids for the GPU memory on socket 0,1 respectively > dist(0,2) = X > dist(0,3) = Y > > If we try to replicate this for the VM by passing qemu arguments with > 4 numa nodes and assign numa distances similar to host, and for the > sake of example qemu mixes up by putting GI for 3 over 2. The SLIT > which sets up the distances do it considering the original order in the > qemu command line. > https://github.com/qemu/qemu/blob/stable-10.2/hw/acpi/aml-build.c#L2040 > > This would lead to a different numa config in terms of distance within > the VM that the one intended through the qemu command line. > > Thanks > Ankit Agrawal but this is not how SLIT is formatted, is it? it does not refer to entries by their location in the table, or am I confused? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order
  2026-02-24 16:22 ` Ankit Agrawal
  2026-02-24 16:30 ` Michael S. Tsirkin
@ 2026-02-24 16:41 ` Jonathan Cameron via qemu development
  2026-02-24 17:13 ` Jonathan Cameron via qemu development
  1 sibling, 1 reply; 15+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-02-24 16:41 UTC
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, Michael S. Tsirkin, Igor Mammedov, Vikram Sethi,
      Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com,
      Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
      Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org

On Tue, 24 Feb 2026 16:22:56 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> >> Now the kernel parses it in the sequence of their occurrence. A jumbled up
> >> sequence thus results in a jumbled up assignment.
> >
> > But what is the actual failure mode here? So the numa IDs are all in a
> > weird order, what goes wrong from that?
>
> This interferes with the ability to replicate the numa distance topology
> on host in the VM through qemu command line.
>
> E.g. consider a NUMA system with 2 sockets each with a GPU.
> 0,1 are the node ids for the sysmem on socket 0,1 respectively and
> 2,3 are the node ids for the GPU memory on socket 0,1 respectively
> dist(0,2) = X
> dist(0,3) = Y
>
> If we try to replicate this for the VM by passing qemu arguments with
> 4 numa nodes and assign numa distances similar to host, and for the
> sake of example qemu mixes up by putting GI for 3 over 2. The SLIT
> which sets up the distances does it considering the original order in the
> qemu command line.
> https://github.com/qemu/qemu/blob/stable-10.2/hw/acpi/aml-build.c#L2040
>
> This would lead to a different numa config in terms of distance within
> the VM than the one intended through the qemu command line.

This is the case where I'd like to see an example of the tables before
and after your patch. If the SLIT is not correctly created wrt PXMs
(rather than the order of the commands) then we indeed have a QEMU bug that
needs fixing. However, I'm confused as SLIT should also not be ordered
by command line if, say, the command line was:

-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \

and numa stuff was something like
-numa dist,src=3,dst=0,val=100
-numa dist,src=4,dst=0,val=200
-numa dist,src=5,dst=0,val=300
-numa dist,src=6,dst=0,val=100
-numa dist,src=7,dst=0,val=200
-numa dist,src=8,dst=0,val=300
-numa dist,src=9,dst=0,val=100

Then it should be matching src numbers here to node in the GIs whatever the order.

Thanks,

Jonathan

>
> Thanks
> Ankit Agrawal
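To illustrate the point that distances are keyed by node number rather than by command-line position, here is a minimal sketch in C. It is not QEMU's SLIT builder; `Slit`, `slit_init`, `slit_set`, `slit_get`, `NODES` and `DEFAULT_DIST` are hypothetical names, the entries are treated as symmetric for simplicity, and the distance values are illustrative ones chosen to fit in the one-byte SLIT entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NODES        10  /* hypothetical: nodes 0..9, as in the example above */
#define LOCAL_DIST   10  /* ACPI SLIT value for a node's distance to itself */
#define DEFAULT_DIST 20  /* illustrative default for pairs nobody specified */

/* Distance matrix indexed directly by node number (PXM). */
typedef struct {
    uint8_t dist[NODES][NODES];
} Slit;

static void slit_init(Slit *s)
{
    for (int i = 0; i < NODES; i++) {
        for (int j = 0; j < NODES; j++) {
            s->dist[i][j] = (i == j) ? LOCAL_DIST : DEFAULT_DIST;
        }
    }
}

/* Record one "src, dst, val" distance; assumed symmetric in this sketch. */
static void slit_set(Slit *s, int src, int dst, uint8_t val)
{
    assert(src >= 0 && src < NODES && dst >= 0 && dst < NODES);
    s->dist[src][dst] = val;
    s->dist[dst][src] = val;
}

/* Look up a distance by node number, never by insertion position. */
static uint8_t slit_get(const Slit *s, int src, int dst)
{
    assert(src >= 0 && src < NODES && dst >= 0 && dst < NODES);
    return s->dist[src][dst];
}
```

Because every write lands at `dist[src][dst]`, feeding the same distances in a different order yields a byte-identical matrix, which is the property expected of the SLIT here.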
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order
  2026-02-24 16:41 ` Jonathan Cameron via qemu development
@ 2026-02-24 17:13 ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 15+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-02-24 17:13 UTC
  To: Jonathan Cameron via qemu development
  Cc: Jonathan Cameron, Ankit Agrawal, Jason Gunthorpe, Michael S. Tsirkin,
      Igor Mammedov, Vikram Sethi, Shameer Kolothum Thodi, alex@shazbot.org,
      anisinha@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
      Tarun Gupta (SW-GPU), Zhi Wang, Matt Ochs, Krishnakant Jaju

On Tue, 24 Feb 2026 16:41:16 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> On Tue, 24 Feb 2026 16:22:56 +0000
> Ankit Agrawal <ankita@nvidia.com> wrote:
>
> > >> Now the kernel parses it in the sequence of their occurrence. A jumbled up
> > >> sequence thus results in a jumbled up assignment.
> > >
> > > But what is the actual failure mode here? So the numa IDs are all in a
> > > weird order, what goes wrong from that?
> >
> > This interferes with the ability to replicate the numa distance topology
> > on host in the VM through qemu command line.
> >
> > E.g. consider a NUMA system with 2 sockets each with a GPU.
> > 0,1 are the node ids for the sysmem on socket 0,1 respectively and
> > 2,3 are the node ids for the GPU memory on socket 0,1 respectively
> > dist(0,2) = X
> > dist(0,3) = Y
> >
> > If we try to replicate this for the VM by passing qemu arguments with
> > 4 numa nodes and assign numa distances similar to host, and for the
> > sake of example qemu mixes up by putting GI for 3 over 2. The SLIT
> > which sets up the distances does it considering the original order in the
> > qemu command line.
> > https://github.com/qemu/qemu/blob/stable-10.2/hw/acpi/aml-build.c#L2040
> >
> > This would lead to a different numa config in terms of distance within
> > the VM than the one intended through the qemu command line.
>
> This is the case where I'd like to see an example of the tables before
> and after your patch. If the SLIT is not correctly created wrt PXMs
> (rather than the order of the commands) then we indeed have a QEMU bug that
> needs fixing. However, I'm confused as SLIT should also not be ordered
> by command line if, say, the command line was:
>
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=3 \
> -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=4 \
> -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=6 \
> -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=2 \
> -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
>
> and numa stuff was something like
> -numa dist,src=3,dst=0,val=100
> -numa dist,src=4,dst=0,val=200
> -numa dist,src=5,dst=0,val=300
> -numa dist,src=6,dst=0,val=100
> -numa dist,src=7,dst=0,val=200
> -numa dist,src=8,dst=0,val=300
> -numa dist,src=9,dst=0,val=100
>
> Then it should be matching src numbers here to node in the GIs whatever the order.

I had a mess around and it seems SLIT is stable to ordering of the nodes
(based on a very minimal test so I may well be missing something!), but because
the /sys/bus/node/devices/nodeX/distance is reordered by the PXM to kernel
numa node mapping (which as you've observed is first come first served in
parsing for GIs in new nodes), you will see it apparently reordered to
reflect the kernel numa node order.

How do you associate the resulting numa node with a particular resource on
your GPU? That mapping should also be by PXM and as a result I would expect
to see it refer to the appropriate entry after PXM to node translation in the
kernel whatever order stuff under /sys/bus/node/devices/nodeX ends up in.

For extra fun I put my CPUs and memory on different nodes and that always ends
up mapped to the first node in Linux (assuming they are all on one node) with
appropriate reordering of the nodeX/distance entries.

Jonathan

>
> Thanks,
>
> Jonathan
>
> >
> > Thanks
> > Ankit Agrawal
>
>
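The first-come-first-served behaviour described above can be sketched as follows. This is a simplified model in C, not the kernel's actual code; `PxmMap`, `pxm_map_init`, `pxm_to_node` and `parse_srat_order` are hypothetical names, and the sketch assumes a single pass over the table, which glosses over the fact that a real kernel handles different SRAT entry types at different points.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PXM 16     /* hypothetical upper bound for this sketch */
#define NO_NODE (-1)   /* marks a PXM that has not been seen yet */

typedef struct {
    int node_of_pxm[MAX_PXM]; /* PXM -> node number, NO_NODE if unassigned */
    int next_node;            /* next unused node number */
} PxmMap;

static void pxm_map_init(PxmMap *m)
{
    for (int i = 0; i < MAX_PXM; i++) {
        m->node_of_pxm[i] = NO_NODE;
    }
    m->next_node = 0;
}

/* Return the node for a PXM, handing out the next free node on first sight. */
static int pxm_to_node(PxmMap *m, int pxm)
{
    assert(pxm >= 0 && pxm < MAX_PXM);
    if (m->node_of_pxm[pxm] == NO_NODE) {
        m->node_of_pxm[pxm] = m->next_node++;
    }
    return m->node_of_pxm[pxm];
}

/* Walk the PXMs in the order they appear in the table. */
static void parse_srat_order(PxmMap *m, const int *pxms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        (void)pxm_to_node(m, pxms[i]);
    }
}
```

Under these assumptions, and assuming PXMs 0 and 1 (CPUs and memory) are seen first, feeding the generic initiator PXMs in the shuffled order from the patch description (7, 6, 5, 4, 3, 9, 2, 8) puts PXM 2 on node 8, whereas the sorted order puts it on node 2. Any lookup that goes through the PXM still reaches the right node either way; only the visible node numbers move.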
* Re: [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order
  2026-02-24 14:01 ` Michael S. Tsirkin
  2026-02-24 14:42 ` Jason Gunthorpe
@ 2026-02-24 14:54 ` Jonathan Cameron via qemu development
  1 sibling, 0 replies; 15+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-02-24 14:54 UTC
  To: Michael S. Tsirkin
  Cc: Jason Gunthorpe, Igor Mammedov, Ankit Agrawal, Vikram Sethi,
      Shameer Kolothum Thodi, alex@shazbot.org, anisinha@redhat.com,
      Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
      Zhi Wang, Matt Ochs, Krishnakant Jaju, qemu-devel@nongnu.org

On Tue, 24 Feb 2026 09:01:30 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Feb 24, 2026 at 09:51:06AM -0400, Jason Gunthorpe wrote:
> > On Mon, Feb 23, 2026 at 11:13:02AM +0000, Jonathan Cameron wrote:
> >
> > > Ankit, can you give an example complete with table dumps please.
> > >
> > > I'm a little unsure on where things are getting scrambled.
> > > Everything should be keyed off PXM. Sounds like we have a bug
> > > somewhere but ordering shouldn't be relevant.
> >
> > I understood the issue is Linux assigns the uAPI visible NUMA node
> > numbers based on the ordering. The proximity/etc internal to the
> > kernel (I thought) was OK?
> >
> > Then the problem is that uAPI has developed meaning based on what the
> > bare metal HW does and now there are SW stacks that are expecting
> > these platforms to have certain NUMA IDs in the Linux uAPI. Sure you
> > can argue this is bad/etc/etc but the point of QEMU is to allow
> > creating VMs that closely match real HW and in this instance real HW
> > produces an ACPI table with a certain ordering and the SW is sensitive
> > to this ordering.
> >
> > Even if there is some Linux bug mis-parsing the ACPI, then that still
> > should be addressed from a qemu perspective by providing the ACPI
> > construction that doesn't trigger any bug so existing VM images will
> > work under qemu.
> >
> > Thus qemu needs a way to reflect the ordering on the command line to
> > properly emulate this system and accommodate the existing VM software...
> >
> > Jason
>
> Not arguing against this, but if there's a linux bug it is important
> to fix it as a 1st step. qemu work arounds for broken guests
> notwithstanding. then we can check how long the uapi has been around,
> how practical bugfix backport in linux is, and decide on whether
> a host side work around is worth it.
>

IIRC NUMA IDs in linux aren't even consistent across architectures.
I think there are cases where the x86 code handles certain SRAT entries
earlier than the arm64 code does (was either CPUless or memory less nodes
if my memory is right). I haven't poked this stuff for a while though
so maybe those differences got ironed out. Anyhow, relying on those numbers
being stable is optimistic at best.

Longer term I'd like to see the ACPI spec comprehend this case where
lots of GIs are the same device and hence add some way of distinguishing
between them that isn't the PXM.

I'd not be against having a kernel patch that sorted at least GI only
nodes by PXM rather than order in the ACPI table (probably GP only ones
as well, though they aren't as visible anyway). It may be controversial!
With that in place it might also make sense to make qemu stop generating
them in a random order (So what this patch is doing).

Note I did a very similar thing for CXL fixed memory windows, but
that was to maintain consistency when I made the fixed memory windows
devices as before that we had them in a list built from the command line.
Without it we got breakage in the bios tests and the physical addresses
shuffled in a fashion that depended on a hash that in theory might change
at any time. In that case I didn't have an explicit list, but instead
stashed an index parameter in the object and built temporary lists for
sorting purposes.

https://lore.kernel.org/all/20250625161926.549812-3-Jonathan.Cameron@huawei.com/

Jonathan
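The "stash an index, build a temporary list, sort" idea mentioned above might look roughly like this. It is a sketch only and not the code from the linked CXL series; `GiEntry`, `gi_cmp_index` and `gi_sort_by_index` are hypothetical names, and the `index` field stands in for whatever creation-order value the object would carry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical snapshot of one generic initiator, copied into a temp list. */
typedef struct {
    unsigned index; /* creation order stashed in the object (assumption) */
    unsigned node;  /* proximity domain (PXM) the initiator belongs to */
} GiEntry;

/* qsort comparator: ascending by stashed index, without overflow tricks. */
static int gi_cmp_index(const void *a, const void *b)
{
    const GiEntry *x = a;
    const GiEntry *y = b;

    return (x->index > y->index) - (x->index < y->index);
}

/* Sort a temporary list so entries come out in creation order. */
static void gi_sort_by_index(GiEntry *list, size_t n)
{
    assert(list != NULL || n == 0);
    if (n > 1) {
        qsort(list, n, sizeof(*list), gi_cmp_index);
    }
}
```

The same shape would work for sorting by PXM instead: only the comparator changes, so the order emitted into the table no longer depends on how the object tree happens to be walked.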
Thread overview: 15+ messages
[not found] <20260222020812.26475-1-ankita@nvidia.com>
2026-02-23 7:28 ` [PATCH v1 1/1] hw/acpi/pci.c: preserve generic initiator insertion order Igor Mammedov
[not found] ` <SA1PR12MB7199F0C2E1D2325B0062B004B077A@SA1PR12MB7199.namprd12.prod.outlook.com>
2026-02-23 9:44 ` Igor Mammedov
2026-02-23 11:13 ` Jonathan Cameron via qemu development
2026-02-24 13:51 ` Jason Gunthorpe
2026-02-24 14:01 ` Michael S. Tsirkin
2026-02-24 14:42 ` Jason Gunthorpe
2026-02-24 14:48 ` Michael S. Tsirkin
2026-02-24 14:51 ` Ankit Agrawal
2026-02-24 14:54 ` Michael S. Tsirkin
2026-02-24 14:58 ` Jason Gunthorpe
2026-02-24 16:22 ` Ankit Agrawal
2026-02-24 16:30 ` Michael S. Tsirkin
2026-02-24 16:41 ` Jonathan Cameron via qemu development
2026-02-24 17:13 ` Jonathan Cameron via qemu development
2026-02-24 14:54 ` Jonathan Cameron via qemu development