* [PATCH] nvme-pci: do not set the NUMA node of device if it has none @ 2023-07-25 11:06 Pratyush Yadav 2023-07-25 14:35 ` Keith Busch 2024-07-23 9:49 ` Maurizio Lombardi 0 siblings, 2 replies; 16+ messages in thread From: Pratyush Yadav @ 2023-07-25 11:06 UTC (permalink / raw) To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg Cc: Pratyush Yadav, linux-nvme, linux-kernel If a device has no NUMA node information associated with it, the driver puts the device in node first_memory_node (say node 0). As a side effect, this gives an indication to userspace IRQ balancing programs that the device is in node 0 so they prefer CPUs in node 0 to handle the IRQs associated with the queues. For example, irqbalance will only let CPUs in node 0 handle the interrupts. This reduces random access performance on CPUs in node 1 since the interrupt for command completion will fire on node 0. For example, AWS EC2's i3.16xlarge instance does not expose NUMA information for the NVMe devices. This means all NVMe devices have NUMA_NO_NODE by default. Without this patch, random 4k read performance measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on both nodes get similar performance (around 315k IOPS). Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> --- drivers/nvme/host/pci.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index baf69af7ea78e..f5ba2d7102eae 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2916,9 +2916,6 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev, struct nvme_dev *dev; int ret = -ENOMEM; - if (node == NUMA_NO_NODE) - set_dev_node(&pdev->dev, first_memory_node); - dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node); if (!dev) return ERR_PTR(-ENOMEM); -- 2.40.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
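[Editorial note] For readers following the patch, the three deleted lines implement a fallback that the commit message describes: a device with no NUMA node gets pinned to first_memory_node. A minimal Python sketch of that behavior (hypothetical helper names; the real logic lives in nvme_pci_alloc_dev() and set_dev_node()):

```python
FIRST_MEMORY_NODE = 0   # assumption: node 0 is the first node with memory
NUMA_NO_NODE = -1       # the kernel's "no node" sentinel, seen as -1 in /proc

def effective_dev_node(dev_node: int) -> int:
    """Before the patch: a device reporting no NUMA node is forced onto
    first_memory_node, so allocations and userspace balancers treat it
    as a node-0 device. After the patch: NUMA_NO_NODE is left alone and
    node-aware allocators fall back to the allocating CPU's local node."""
    if dev_node == NUMA_NO_NODE:
        return FIRST_MEMORY_NODE   # this is the branch the patch deletes
    return dev_node

assert effective_dev_node(NUMA_NO_NODE) == 0
assert effective_dev_node(1) == 1
```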
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-25 11:06 [PATCH] nvme-pci: do not set the NUMA node of device if it has none Pratyush Yadav @ 2023-07-25 14:35 ` Keith Busch 2023-07-26 7:58 ` Sagi Grimberg 2024-07-23 9:49 ` Maurizio Lombardi 1 sibling, 1 reply; 16+ messages in thread From: Keith Busch @ 2023-07-25 14:35 UTC (permalink / raw) To: Pratyush Yadav Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel On Tue, Jul 25, 2023 at 01:06:22PM +0200, Pratyush Yadav wrote: > If a device has no NUMA node information associated with it, the driver > puts the device in node first_memory_node (say node 0). As a side > effect, this gives an indication to userspace IRQ balancing programs > that the device is in node 0 so they prefer CPUs in node 0 to handle the > IRQs associated with the queues. For example, irqbalance will only let > CPUs in node 0 handle the interrupts. This reduces random access > performance on CPUs in node 1 since the interrupt for command completion > will fire on node 0. > > For example, AWS EC2's i3.16xlarge instance does not expose NUMA > information for the NVMe devices. This means all NVMe devices have > NUMA_NO_NODE by default. Without this patch, random 4k read performance > measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% > less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on > both nodes get similar performance (around 315k IOPS). irqbalance doesn't work with this driver though: the interrupts are managed by the kernel. Is there some other reason to explain the perf difference? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-25 14:35 ` Keith Busch @ 2023-07-26 7:58 ` Sagi Grimberg 2023-07-26 13:14 ` Christoph Hellwig 0 siblings, 1 reply; 16+ messages in thread From: Sagi Grimberg @ 2023-07-26 7:58 UTC (permalink / raw) To: Keith Busch, Pratyush Yadav Cc: Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel >> If a device has no NUMA node information associated with it, the driver >> puts the device in node first_memory_node (say node 0). As a side >> effect, this gives an indication to userspace IRQ balancing programs >> that the device is in node 0 so they prefer CPUs in node 0 to handle the >> IRQs associated with the queues. For example, irqbalance will only let >> CPUs in node 0 handle the interrupts. This reduces random access >> performance on CPUs in node 1 since the interrupt for command completion >> will fire on node 0. >> >> For example, AWS EC2's i3.16xlarge instance does not expose NUMA >> information for the NVMe devices. This means all NVMe devices have >> NUMA_NO_NODE by default. Without this patch, random 4k read performance >> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% >> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on >> both nodes get similar performance (around 315k IOPS). > > irqbalance doesn't work with this driver though: the interrupts are > managed by the kernel. Is there some other reason to explain the perf > difference? Maybe its because the numa_node goes to the tagset which allocates stuff based on that numa-node ? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-26 7:58 ` Sagi Grimberg @ 2023-07-26 13:14 ` Christoph Hellwig 2023-07-26 15:30 ` Pratyush Yadav 0 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2023-07-26 13:14 UTC (permalink / raw) To: Sagi Grimberg Cc: Keith Busch, Pratyush Yadav, Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote: >>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA >>> information for the NVMe devices. This means all NVMe devices have >>> NUMA_NO_NODE by default. Without this patch, random 4k read performance >>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% >>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on >>> both nodes get similar performance (around 315k IOPS). >> >> irqbalance doesn't work with this driver though: the interrupts are >> managed by the kernel. Is there some other reason to explain the perf >> difference? > > Maybe its because the numa_node goes to the tagset which allocates > stuff based on that numa-node ? Yeah, the only explanation I could come up with is that without this the allocations gets spread, and that somehow helps. All of this is a little obscure, but so is the NVMe practice of setting the node id to first_memory_node, which no other driver does. I'd really like to understand what's going on here first. After that this patch probably is the right thing, I'd just like to understand why. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-26 13:14 ` Christoph Hellwig @ 2023-07-26 15:30 ` Pratyush Yadav 2023-07-26 16:17 ` Keith Busch 0 siblings, 1 reply; 16+ messages in thread From: Pratyush Yadav @ 2023-07-26 15:30 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Keith Busch, Jens Axboe, linux-nvme, linux-kernel On Wed, Jul 26 2023, Christoph Hellwig wrote: Hi all, > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote: >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA >>>> information for the NVMe devices. This means all NVMe devices have >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on >>>> both nodes get similar performance (around 315k IOPS). >>> >>> irqbalance doesn't work with this driver though: the interrupts are >>> managed by the kernel. Is there some other reason to explain the perf >>> difference? Hmm, I did not know that. I have not gone and looked at the code but I think the same reasoning should hold, just with s/irqbalance/kernel. If the kernel IRQ balancer sees the device is on node 0, it would deliver its interrupts to CPUs on node 0. In my tests I can see that the interrupts for NVME queues are sent only to CPUs from node 0 without this patch. With this patch CPUs from both nodes get the interrupts. >> >> Maybe its because the numa_node goes to the tagset which allocates >> stuff based on that numa-node ? > > Yeah, the only explanation I could come up with is that without this > the allocations gets spread, and that somehow helps. All of this > is a little obscure, but so is the NVMe practice of setting the node id > to first_memory_node, which no other driver does. I'd really like to > understand what's going on here first. 
After that this patch probably > is the right thing, I'd just like to understand why. See above for my conjecture on why this happens. More specifically, I discovered this when running an application pinned to a node 1 CPU reading from an NVME device. I noticed it was performing worse than when it was pinned to node 0. If the process is free to move around it might not see such a large performance difference since it could move to a node 0 CPU. But if it is pinned to a CPU in node 1 then the interrupt will always hit a node 0 CPU and create higher latency for the reads. I have a simple fio test that can reproduce this. Save this [1] as fio.txt and then run numactl --cpunodebind 1 fio ./fio.txt. You can run it on any host with an NVME device that has no NUMA node. I have tested this on AWS EC2's i3.16xlarge instance type. [1] [global] ioengine=libaio filename=/dev/nvme0n1 group_reporting=1 direct=1 verify=0 norandommap=0 size=10% time_based=1 runtime=30 ramp_time=0 randrepeat=0 log_max_value=1 unified_rw_reporting=1 percentile_list=50:99:99.9:99.99:99.999 bwavgtime=10000 [4k_randread_qd16_4w] stonewall bs=4k rw=randread iodepth=32 numjobs=1 -- Regards, Pratyush Yadav Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-26 15:30 ` Pratyush Yadav @ 2023-07-26 16:17 ` Keith Busch 2023-07-26 19:32 ` Pratyush Yadav 0 siblings, 1 reply; 16+ messages in thread From: Keith Busch @ 2023-07-26 16:17 UTC (permalink / raw) To: Pratyush Yadav Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel On Wed, Jul 26, 2023 at 05:30:33PM +0200, Pratyush Yadav wrote: > On Wed, Jul 26 2023, Christoph Hellwig wrote: > > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote: > >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA > >>>> information for the NVMe devices. This means all NVMe devices have > >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance > >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% > >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on > >>>> both nodes get similar performance (around 315k IOPS). > >>> > >>> irqbalance doesn't work with this driver though: the interrupts are > >>> managed by the kernel. Is there some other reason to explain the perf > >>> difference? > > Hmm, I did not know that. I have not gone and looked at the code but I > think the same reasoning should hold, just with s/irqbalance/kernel. If > the kernel IRQ balancer sees the device is on node 0, it would deliver > its interrupts to CPUs on node 0. > > In my tests I can see that the interrupts for NVME queues are sent only > to CPUs from node 0 without this patch. With this patch CPUs from both > nodes get the interrupts. Could you send the output of: numactl --hardware and then with and without your patch: for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ cat /proc/irq/$i/{smp,effective}_affinity_list; \ done ? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-26 16:17 ` Keith Busch @ 2023-07-26 19:32 ` Pratyush Yadav 2023-07-26 22:25 ` Keith Busch 0 siblings, 1 reply; 16+ messages in thread From: Pratyush Yadav @ 2023-07-26 19:32 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel On Wed, Jul 26 2023, Keith Busch wrote: > On Wed, Jul 26, 2023 at 05:30:33PM +0200, Pratyush Yadav wrote: >> On Wed, Jul 26 2023, Christoph Hellwig wrote: >> > On Wed, Jul 26, 2023 at 10:58:36AM +0300, Sagi Grimberg wrote: >> >>>> For example, AWS EC2's i3.16xlarge instance does not expose NUMA >> >>>> information for the NVMe devices. This means all NVMe devices have >> >>>> NUMA_NO_NODE by default. Without this patch, random 4k read performance >> >>>> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50% >> >>>> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on >> >>>> both nodes get similar performance (around 315k IOPS). >> >>> >> >>> irqbalance doesn't work with this driver though: the interrupts are >> >>> managed by the kernel. Is there some other reason to explain the perf >> >>> difference? >> >> Hmm, I did not know that. I have not gone and looked at the code but I >> think the same reasoning should hold, just with s/irqbalance/kernel. If >> the kernel IRQ balancer sees the device is on node 0, it would deliver >> its interrupts to CPUs on node 0. >> >> In my tests I can see that the interrupts for NVME queues are sent only >> to CPUs from node 0 without this patch. With this patch CPUs from both >> nodes get the interrupts. 
> > Could you send the output of: > > numactl --hardware $ numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 0 size: 245847 MB node 0 free: 245211 MB node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 245932 MB node 1 free: 245328 MB node distances: node 0 1 0: 10 21 1: 21 10 > > and then with and without your patch: > > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done Without my patch: $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done 40 40 33 33 44 44 9 9 32 32 2 2 6 6 11 11 1 1 35 35 39 39 13 13 42 42 46 46 41 41 46 46 15 15 5 5 43 43 0 0 14 14 8 8 12 12 7 7 10 10 47 47 38 38 36 36 3 3 34 34 45 45 5 5 With my patch: $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done 9 9 15 15 5 5 23 23 38 38 52 52 21 21 36 36 13 13 56 56 44 44 42 42 31 31 48 48 5 5 3 3 1 1 11 11 28 28 18 18 34 34 29 29 58 58 46 46 54 54 59 59 32 32 7 7 56 56 62 62 49 49 57 57 -- Regards, Pratyush Yadav Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-26 19:32 ` Pratyush Yadav @ 2023-07-26 22:25 ` Keith Busch 2023-07-28 18:09 ` Pratyush Yadav 0 siblings, 1 reply; 16+ messages in thread From: Keith Busch @ 2023-07-26 22:25 UTC (permalink / raw) To: Pratyush Yadav Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote: > On Wed, Jul 26 2023, Keith Busch wrote: > > Could you send the output of: > > > > numactl --hardware > > $ numactl --hardware > available: 2 nodes (0-1) > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 > node 0 size: 245847 MB > node 0 free: 245211 MB > node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 > node 1 size: 245932 MB > node 1 free: 245328 MB > node distances: > node 0 1 > 0: 10 21 > 1: 21 10 > > > > > and then with and without your patch: > > > > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done > > Without my patch: > > $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done Hm, I wonder if there's something wrong with my script. All the cpu's should be accounted for in the smp_affinity_list, assuming it captured all the vectors of the nvme device, but both examples are missing half the CPUs. It looks like you have 32 vectors. Does that sound right? This does show the effective affinity is indeed always on node 0 without your patch. I don't see why, though: the "group_cpus_evenly()" function that spreads the interrupts doesn't know anything about the device the resource is being grouped for, so it shouldn't even take its NUMA node into consideration. 
It's just supposed to ensure all CPUs have a shared resource, preferring to not share across numa nodes. I'll emulate a similar CPU topology with similar nvme vector count and see if I can find anything suspicious. I'm a little concerned we may have the same problem for devices that have an associated NUMA node that your patch isn't addressing. > 41 > 40 > 33 > 33 > 44 > 44 > 9 > 9 > 32 > 32 > 2 > 2 > 6 > 6 > 11 > 11 > 1 > 1 > 35 > 35 > 39 > 39 > 13 > 13 > 42 > 42 > 46 > 46 > 41 > 41 > 46 > 46 > 15 > 15 > 5 > 5 > 43 > 43 > 0 > 0 > 14 > 14 > 8 > 8 > 12 > 12 > 7 > 7 > 10 > 10 > 47 > 47 > 38 > 38 > 36 > 36 > 3 > 3 > 34 > 34 > 45 > 45 > 5 > 5 > > With my patch: > > $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done > 9 > 9 > 15 > 15 > 5 > 5 > 23 > 23 > 38 > 38 > 52 > 52 > 21 > 21 > 36 > 36 > 13 > 13 > 56 > 56 > 44 > 44 > 42 > 42 > 31 > 31 > 48 > 48 > 5 > 5 > 3 > 3 > 1 > 1 > 11 > 11 > 28 > 28 > 18 > 18 > 34 > 34 > 29 > 29 > 58 > 58 > 46 > 46 > 54 > 54 > 59 > 59 > 32 > 32 > 7 > 7 > 56 > 56 > 62 > 62 > 49 > 49 > 57 > 57 ^ permalink raw reply [flat|nested] 16+ messages in thread
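[Editorial note] The spreading Keith describes can be sketched as follows. This is a greatly simplified model, not the kernel's actual group_cpus_evenly() implementation: vectors are divided across NUMA nodes roughly in proportion to each node's CPU count, and each node's CPUs are then chunked into that node's share of vectors, so no vector mixes CPUs from two nodes:

```python
def group_cpus_evenly(numgrps, node_cpus):
    """Simplified sketch: give each node a share of the groups
    proportional to its CPU count (at least one), then round-robin
    that node's CPUs into its groups. Groups never span nodes."""
    total = sum(len(c) for c in node_cpus.values())
    nodes = sorted(node_cpus)
    groups, remaining = [], numgrps
    for i, node in enumerate(nodes):
        cpus = sorted(node_cpus[node])
        if i == len(nodes) - 1:
            share = remaining            # last node takes the rest
        else:
            share = max(1, round(numgrps * len(cpus) / total))
            share = min(share, remaining - (len(nodes) - 1 - i))
        remaining -= share
        groups.extend(cpus[j::share] for j in range(share))
    return groups

# the topology from this thread: 64 CPUs, two nodes, 32 I/O vectors
grps = group_cpus_evenly(32, {0: list(range(0, 16)) + list(range(32, 48)),
                              1: list(range(16, 32)) + list(range(48, 64))})
assert len(grps) == 32
assert sorted(c for g in grps for c in g) == list(range(64))
```

Note that the device's NUMA node never appears as an input, which is why the node assignment alone should not change how vectors are spread.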
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-26 22:25 ` Keith Busch @ 2023-07-28 18:09 ` Pratyush Yadav 2023-07-28 19:34 ` Keith Busch 0 siblings, 1 reply; 16+ messages in thread From: Pratyush Yadav @ 2023-07-28 18:09 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel Hi, On Wed, Jul 26 2023, Keith Busch wrote: > On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote: >> On Wed, Jul 26 2023, Keith Busch wrote: >> > Could you send the output of: >> > >> > numactl --hardware >> >> $ numactl --hardware >> available: 2 nodes (0-1) >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 >> node 0 size: 245847 MB >> node 0 free: 245211 MB >> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 >> node 1 size: 245932 MB >> node 1 free: 245328 MB >> node distances: >> node 0 1 >> 0: 10 21 >> 1: 21 10 >> >> > >> > and then with and without your patch: >> > >> > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ >> > cat /proc/irq/$i/{smp,effective}_affinity_list; \ >> > done >> >> Without my patch: >> >> $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ >> > cat /proc/irq/$i/{smp,effective}_affinity_list; \ >> > done > > Hm, I wonder if there's something wrong with my script. All the cpu's > should be accounted for in the smp_affinity_list, assuming it captured > all the vectors of the nvme device, but both examples are missing half > the CPUs. It looks like you have 32 vectors. Does that sound right? Yes, there are 32 vectors, from nvme0q0 to nvme0q31. Should there be one vector for each CPU? Perhaps the device does not support that many queues? 
FWIW, $ sudo nvme get-feature /dev/nvme0n1 -f 7 -H get-feature:0x7 (Number of Queues), Current value:0x1e001e Number of IO Completion Queues Allocated (NCQA): 31 Number of IO Submission Queues Allocated (NSQA): 31 > > This does show the effective affinity is indeed always on node 0 without > your patch. I don't see why, though: the "group_cpus_evenly()" function > that spreads the interrupts doesn't know anything about the device the > resource is being grouped for, so it shouldn't even take its NUMA node > into consideration. It's just supposed to ensure all CPUs have a shared > resource, preferring to not share across numa nodes. I am guessing you are looking at irq_create_affinity_masks(). Yeah, It does not take into account the NUMA information. In fact, even if it did, the NUMA node associated with the IRQ is NUMA_NO_NODE (/proc/$irq/node == -1). I did some more digging over the week to figure out what is going on. It seems like the kernel _does_ in fact allow all CPUs in the affinity. I added some prints in set_affinity_irq() in drivers/xen/events/events_base.c (since that is the irqchip for the interrupt). I see it being called with mask: ffffffff,ffffffff. But I later see the function being called again with a different mask: 00000000,00008000. The stack trace shows the call is coming from ksys_write(). The process doing the write is irqbalance. So I think your earlier statement was incorrect. irqbalance does in fact balance these interrupts and it probably looks at the NUMA information of the device to make that decision. My original reasoning holds and irqbalance is the one picking the affinity. With this explanation, do you think the patch is good to go? BTW, could you please also add the below when applying? I forgot to add it when sending the patch. Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq") > > I'll emulate a similar CPU topology with similar nvme vector count and > see if I can find anything suspicious. 
I'm a little concerned we may > have the same problem for devices that have an associated NUMA node that > your patch isn't addressing. > [...] -- Regards, Pratyush Yadav Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 ^ permalink raw reply [flat|nested] 16+ messages in thread
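[Editorial note] The claim in this exchange, that without the patch every vector's affinity sits entirely in node 0, can be checked mechanically against the smp_affinity_list values posted earlier. A small helper sketch, with the node CPU ranges taken from the numactl --hardware output in this thread:

```python
def parse_cpulist(s):
    """Expand a kernel cpulist string like '16-17,48,65' into a set."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

NODE0 = set(range(0, 16)) | set(range(32, 48))   # from numactl --hardware
NODE1 = set(range(16, 32)) | set(range(48, 64))

# without the patch, every listed affinity is node-0 only
assert parse_cpulist("40") <= NODE0
assert parse_cpulist("33") <= NODE0
# with the patch, vectors also land on node-1 CPUs
assert parse_cpulist("23") <= NODE1
assert parse_cpulist("16-17,48") <= NODE1
```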
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-28 18:09 ` Pratyush Yadav @ 2023-07-28 19:34 ` Keith Busch 2023-08-04 14:50 ` Pratyush Yadav 0 siblings, 1 reply; 16+ messages in thread From: Keith Busch @ 2023-07-28 19:34 UTC (permalink / raw) To: Pratyush Yadav Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote: > > I am guessing you are looking at irq_create_affinity_masks(). Yeah, It > does not take into account the NUMA information. In fact, even if it > did, the NUMA node associated with the IRQ is NUMA_NO_NODE > (/proc/$irq/node == -1). > > I did some more digging over the week to figure out what is going on. It > seems like the kernel _does_ in fact allow all CPUs in the affinity. I > added some prints in set_affinity_irq() in > drivers/xen/events/events_base.c (since that is the irqchip for the > interrupt). I see it being called with mask: ffffffff,ffffffff. > > But I later see the function being called again with a different mask: > 00000000,00008000. The stack trace shows the call is coming from > ksys_write(). The process doing the write is irqbalance. > > So I think your earlier statement was incorrect. irqbalance does in fact > balance these interrupts and it probably looks at the NUMA information > of the device to make that decision. My original reasoning holds and > irqbalance is the one picking the affinity. > > With this explanation, do you think the patch is good to go? irqbalance still writes to the /proc/<irq>/smp_affinity to change it, right? That's just getting I/O errors on my machines because it fails irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except the first vector, but that one is not used for I/O). Is there another path irqbalance is using that's somehow getting past the appropriate checks? Or perhaps is your xen irq_chip somehow bypassing the managed irq property? 
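[Editorial note] Keith's argument hinges on the kernel rejecting userspace affinity writes for managed interrupts. A simplified model of that check (not the actual kernel code; the real predicate is irq_can_set_affinity_usr() in kernel/irq/manage.c):

```python
from dataclasses import dataclass

@dataclass
class IrqDesc:
    can_balance: bool = True   # IRQD_NO_BALANCING not set
    managed: bool = False      # IRQD_AFFINITY_MANAGED

def irq_can_set_affinity_usr(d: IrqDesc) -> bool:
    """Simplified: userspace may change affinity only for balanceable,
    non-managed interrupts; writes to managed ones fail with -EIO."""
    return d.can_balance and not d.managed

# an nvme I/O vector set up as managed rejects irqbalance's write
assert not irq_can_set_affinity_usr(IrqDesc(managed=True))
# the Xen path discussed below loses the managed flag, so writes succeed
assert irq_can_set_affinity_usr(IrqDesc(managed=False))
```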
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-07-28 19:34 ` Keith Busch @ 2023-08-04 14:50 ` Pratyush Yadav 2023-08-04 15:19 ` Keith Busch 0 siblings, 1 reply; 16+ messages in thread From: Pratyush Yadav @ 2023-08-04 14:50 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel On Fri, Jul 28 2023, Keith Busch wrote: > On Fri, Jul 28, 2023 at 08:09:32PM +0200, Pratyush Yadav wrote: >> >> I am guessing you are looking at irq_create_affinity_masks(). Yeah, It >> does not take into account the NUMA information. In fact, even if it >> did, the NUMA node associated with the IRQ is NUMA_NO_NODE >> (/proc/$irq/node == -1). >> >> I did some more digging over the week to figure out what is going on. It >> seems like the kernel _does_ in fact allow all CPUs in the affinity. I >> added some prints in set_affinity_irq() in >> drivers/xen/events/events_base.c (since that is the irqchip for the >> interrupt). I see it being called with mask: ffffffff,ffffffff. >> >> But I later see the function being called again with a different mask: >> 00000000,00008000. The stack trace shows the call is coming from >> ksys_write(). The process doing the write is irqbalance. >> >> So I think your earlier statement was incorrect. irqbalance does in fact >> balance these interrupts and it probably looks at the NUMA information >> of the device to make that decision. My original reasoning holds and >> irqbalance is the one picking the affinity. >> >> With this explanation, do you think the patch is good to go? > > irqbalance still writes to the /proc/<irq>/smp_affinity to change it, > right? That's just getting I/O errors on my machines because it fails > irq_can_set_affinity_usr() for nvme's kernel managed interrupts (except > the first vector, but that one is not used for I/O). Is there another > path irqbalance is using that's somehow getting past the appropriate > checks? 
Or perhaps is your xen irq_chip somehow bypassing the managed > irq property? I picked the interrupt "nvme4q26" as example. The call stack is (printed via WARN_ON(1)): ? __warn+0x7d/0x140 ? set_affinity_irq+0xf0/0x220 ? report_bug+0xf8/0x1e0 ? handle_bug+0x44/0x80 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? set_affinity_irq+0xf0/0x220 ? set_affinity_irq+0xf0/0x220 irq_do_set_affinity+0x135/0x1e0 irq_set_affinity_locked+0x186/0x1f0 __irq_set_affinity+0x41/0x70 write_irq_affinity.isra.8+0xf6/0x120 proc_reg_write+0x59/0x80 vfs_write+0xc7/0x3c0 ? __do_sys_newfstat+0x35/0x60 ? __fget_light+0xcb/0x120 ksys_write+0xa5/0xe0 do_syscall_64+0x42/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd The check you mention is in write_irq_affinity(). I added some prints there and it turns out that __irqd_can_set_affinity() returns true and irqd_affinity_is_managed() is false. I did some more digging and it turns out that the masks are called by irq_create_affinity_masks() which sets is_managed to the IRQ affinity descriptors. This is then passed down to __msi_domain_alloc_locked(). On a non-Xen system you would end up calling __msi_domain_alloc_irqs() next since ops->domain_alloc_irqs() is only implemented by Xen. This function finds the masks created earlier and passes them down to __irq_domain_alloc_irqs(). This then eventually lands in alloc_descs() which checks is_managed and sets IRQD_AFFINITY_MANAGED. On Xen though, xen_msi_domain_alloc_irqs() is called. This eventually lands in xen_allocate_irqs_dynamic() which calls irq_alloc_descs(). This macro calls __irq_alloc_descs() with affinity set to NULL. This leads to us losing the is_managed flag and the affinities created by irq_create_affinity_masks() via group_cpus_evenly(). As a result of this, MSI IRQs on Xen can never be managed by the kernel. They are marked as userspace manageable and irqbalance can set their affinity. 
Applying the (hacky) patch below fixes this problem and lets the kernel manage the affinities. ------ 8< ------ diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index c7715f8bd4522..15f36e34e28b4 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -743,9 +743,10 @@ static void xen_irq_init(unsigned irq) list_add_tail(&info->list, &xen_irq_list_head); } -static int __must_check xen_allocate_irqs_dynamic(int nvec) +static int __must_check xen_allocate_irqs_dynamic(int nvec, + struct irq_affinity_desc *affd) { - int i, irq = irq_alloc_descs(-1, 0, nvec, -1); + int i, irq = __irq_alloc_descs(-1, 0, nvec, -1, THIS_MODULE, affd); if (irq >= 0) { for (i = 0; i < nvec; i++) @@ -758,7 +759,7 @@ static int __must_check xen_allocate_irqs_dynamic(int nvec) static inline int __must_check xen_allocate_irq_dynamic(void) { - return xen_allocate_irqs_dynamic(1); + return xen_allocate_irqs_dynamic(1, NULL); } static int __must_check xen_allocate_irq_gsi(unsigned gsi) @@ -1108,7 +1109,7 @@ int xen_bind_pirq_msi_to_irq(struct pci_dev *dev, struct msi_desc *msidesc, mutex_lock(&irq_mapping_update_lock); - irq = xen_allocate_irqs_dynamic(nvec); + irq = xen_allocate_irqs_dynamic(nvec, msidesc->affinity); if (irq < 0) goto out; ------ >8 ------ With this patch, I get the below affinities: $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > done 8 8 16-17,48,65,67,69 18-19,50,71,73,75 20,52,77,79 21,53,81,83 22,54,85,87 23,55,89,91 24,56,93,95 25,57,97,99 26,58,101,103 27,59,105,107 28,60,109,111 29,61,113,115 30,62,117,119 31,63,121,123 49,51,125,127 0,32,64,66 1,33,68,70 2,34,72,74 3,35,76,78 4,36,80,82 5,37,84,86 6,38,88,90 7,39,92,94 8,40,96,98 9,41,100,102 10,42,104,106 11,43,108,110 12,44,112,114 13,45,116,118 14,46,120,122 15,47,124,126 The blank lines are because effective_smp_affinity is blank for all but the first interrupt. 
The problem is, even with this I still get the same performance difference when running on Node 1 vs Node 0. I am not sure why. Any pointers? -- Regards, Pratyush Yadav Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none 2023-08-04 14:50 ` Pratyush Yadav @ 2023-08-04 15:19 ` Keith Busch 2023-08-08 15:51 ` Pratyush Yadav 0 siblings, 1 reply; 16+ messages in thread From: Keith Busch @ 2023-08-04 15:19 UTC (permalink / raw) To: Pratyush Yadav Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote: > With this patch, I get the below affinities: Something still seems off without effective_affinity set. That attribute should always reflect one CPU from the smp_affinity_list. At least with your patch, the smp_affinity_list looks as expected: every CPU is accounted for, and no vector appears to share the resource among CPUs in different nodes. > $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \ > > cat /proc/irq/$i/{smp,effective}_affinity_list; \ > > done > 8 > 8 > 16-17,48,65,67,69 > > 18-19,50,71,73,75 > > 20,52,77,79 > > 21,53,81,83 > > 22,54,85,87 > > 23,55,89,91 > > 24,56,93,95 > > 25,57,97,99 > > 26,58,101,103 > > 27,59,105,107 > > 28,60,109,111 > > 29,61,113,115 > > 30,62,117,119 > > 31,63,121,123 > > 49,51,125,127 > > 0,32,64,66 > > 1,33,68,70 > > 2,34,72,74 > > 3,35,76,78 > > 4,36,80,82 > > 5,37,84,86 > > 6,38,88,90 > > 7,39,92,94 > > 8,40,96,98 > > 9,41,100,102 > > 10,42,104,106 > > 11,43,108,110 > > 12,44,112,114 > > 13,45,116,118 > > 14,46,120,122 > > 15,47,124,126 > > The blank lines are because effective_smp_affinity is blank for all but the first interrupt. > > The problem is, even with this I still get the same performance > difference when running on Node 1 vs Node 0. I am not sure why. Any > pointers? I suspect effective_affinity isn't getting set and interrupts are triggering on unexpected CPUs. If you check /proc/interrupts, can you confirm if the interrupts are firing on CPUs within the smp_affinity_list or some other CPU? 
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
  From: Pratyush Yadav @ 2023-08-08 15:51 UTC
  To: Keith Busch
  Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel

On Fri, Aug 04 2023, Keith Busch wrote:
> On Fri, Aug 04, 2023 at 04:50:16PM +0200, Pratyush Yadav wrote:
>> With this patch, I get the below affinities:
>
> Something still seems off without effective_affinity set. That attribute
> should always reflect one CPU from the smp_affinity_list.
>
> At least with your patch, the smp_affinity_list looks as expected: every
> CPU is accounted for, and no vector appears to share the resource among
> CPUs in different nodes.
>
>> $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>> > cat /proc/irq/$i/{smp,effective}_affinity_list; \
>> > done
>> 8
>> 8
>> 16-17,48,65,67,69
>>
>> 18-19,50,71,73,75
>>
>> 20,52,77,79
>>
>> 21,53,81,83
>>
>> 22,54,85,87
>>
>> 23,55,89,91
>>
>> 24,56,93,95
>>
>> 25,57,97,99
>>
>> 26,58,101,103
>>
>> 27,59,105,107
>>
>> 28,60,109,111
>>
>> 29,61,113,115
>>
>> 30,62,117,119
>>
>> 31,63,121,123
>>
>> 49,51,125,127
>>
>> 0,32,64,66
>>
>> 1,33,68,70
>>
>> 2,34,72,74
>>
>> 3,35,76,78
>>
>> 4,36,80,82
>>
>> 5,37,84,86
>>
>> 6,38,88,90
>>
>> 7,39,92,94
>>
>> 8,40,96,98
>>
>> 9,41,100,102
>>
>> 10,42,104,106
>>
>> 11,43,108,110
>>
>> 12,44,112,114
>>
>> 13,45,116,118
>>
>> 14,46,120,122
>>
>> 15,47,124,126
>>
>> The blank lines are because effective_smp_affinity is blank for all but
>> the first interrupt.
>>
>> The problem is, even with this I still get the same performance
>> difference when running on Node 1 vs Node 0. I am not sure why. Any
>> pointers?
>
> I suspect effective_affinity isn't getting set and interrupts are
> triggering on unexpected CPUs.
> If you check /proc/interrupts, can you confirm if the interrupts are
> firing on CPUs within the smp_affinity_list or some other CPU?

All interrupts are hitting CPU0. I did some more digging. Xen sets the
effective affinity via the function set_affinity_irq(). But it only sets
it if info->evtchn is valid. But info->evtchn is set by startup_pirq(),
which is called _after_ set_affinity_irq().

This worked before because irqbalance was coming in after all this
happened and set the affinity, causing set_affinity_irq() to be called
again. By that time startup_pirq() has been called and info->evtchn is
set.

This whole thing needs some refactoring to work properly. I wrote up a
hacky patch (on top of my previous hacky patch) that fixes it. I will
write up a proper patch when I find some more time next week.

With this, I am now seeing good performance on node 1. It is still
slightly slower than node 0, but that might be due to some other reasons
and I do not want to dive into that for now. Thanks for your help with
this :-)

BTW, do you think I should re-send the patch with updated reasoning,
since Christoph thinks we should do this regardless of the performance
reasons [0]?

[0] https://lore.kernel.org/linux-nvme/20230726131408.GA15909@lst.de/

----- 8< -----
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd4522..ed33d33a2f1c9 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -108,6 +108,7 @@ struct irq_info {
 	unsigned irq;
 	evtchn_port_t evtchn;		/* event channel */
 	unsigned short cpu;		/* cpu bound */
+	unsigned short suggested_cpu;
 	unsigned short eoi_cpu;		/* EOI must happen on this cpu-1 */
 	unsigned int irq_epoch;		/* If eoi_cpu valid: irq_epoch of event */
 	u64 eoi_time;			/* Time in jiffies when to EOI. */
@@ -871,6 +873,8 @@ static void mask_ack_pirq(struct irq_data *data)
 	eoi_pirq(data);
 }
 
+static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu);
+
 static unsigned int __startup_pirq(unsigned int irq)
 {
 	struct evtchn_bind_pirq bind_pirq;
@@ -903,6 +907,14 @@ static unsigned int __startup_pirq(unsigned int irq)
 	info->evtchn = evtchn;
 	bind_evtchn_to_cpu(evtchn, 0, false);
 
+	if (info->suggested_cpu) {
+		int ret;
+
+		ret = xen_rebind_evtchn_to_cpu(info, info->suggested_cpu);
+		if (!ret)
+			irq_data_update_effective_affinity(irq_get_irq_data(irq),
+					cpumask_of(info->suggested_cpu));
+	}
+
 	rc = xen_evtchn_port_setup(evtchn);
 	if (rc)
 		goto err;
@@ -1856,12 +1872,17 @@ static unsigned int select_target_cpu(const struct cpumask *dest)
 static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest,
 			    bool force)
 {
+	struct irq_info *info = info_for_irq(data->irq);
 	unsigned int tcpu = select_target_cpu(dest);
 	int ret;
 
-	ret = xen_rebind_evtchn_to_cpu(info_for_irq(data->irq), tcpu);
-	if (!ret)
+	ret = xen_rebind_evtchn_to_cpu(info, tcpu);
+	if (!ret) {
 		irq_data_update_effective_affinity(data, cpumask_of(tcpu));
+	} else {
+		if (info)
+			info->suggested_cpu = tcpu;
+	}
 
 	return ret;
 }

--
Regards,
Pratyush Yadav
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
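The ordering problem described above — an affinity request arriving before the event channel exists, so it is silently lost until something like irqbalance retries it — and the suggested_cpu fix can be modeled outside the kernel. This is a toy Python sketch, not the kernel code: the names mirror the patch, but the logic is deliberately simplified (no cpumasks, no locking):

```python
# Toy model of the Xen pirq ordering bug and the deferred-affinity fix:
# a request that arrives before the event channel is bound is remembered
# in suggested_cpu and replayed when the pirq starts up.
class PirqInfo:
    def __init__(self):
        self.evtchn = 0          # 0 = event channel not yet bound
        self.cpu = 0             # CPU the interrupt currently targets
        self.suggested_cpu = 0   # stashed request, applied at startup

def set_affinity(info, tcpu):
    if not info.evtchn:
        info.suggested_cpu = tcpu   # too early: stash the request
        return -1
    info.cpu = tcpu                 # normal path: rebind immediately
    return 0

def startup_pirq(info, evtchn):
    info.evtchn = evtchn
    info.cpu = 0                    # models bind_evtchn_to_cpu(evtchn, 0)
    if info.suggested_cpu:          # replay the stashed request
        set_affinity(info, info.suggested_cpu)

info = PirqInfo()
set_affinity(info, 8)    # kernel applies managed affinity before startup
startup_pirq(info, 42)
print(info.cpu)          # -> 8: effective target now matches the request
```

Without the suggested_cpu replay, the early set_affinity() call would simply fail and the vector would stay on CPU0 — matching the "all interrupts are hitting CPU0" observation.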
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
  From: Keith Busch @ 2023-08-08 16:35 UTC
  To: Pratyush Yadav
  Cc: Christoph Hellwig, Sagi Grimberg, Jens Axboe, linux-nvme, linux-kernel

On Tue, Aug 08, 2023 at 05:51:01PM +0200, Pratyush Yadav wrote:
> BTW, do you think I should re-send the patch with updated reasoning,
> since Christoph thinks we should do this regardless of the performance
> reasons [0]?

Thanks for the investigation and finding the cause! Yeah, we should go
with your original patch as-is, just with an updated changelog. While
that may work around your immediate issue, I hope your xen patches from
this analysis will be considered for upstream as well :)
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
  From: Maurizio Lombardi @ 2024-07-23 9:49 UTC
  To: Pratyush Yadav
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel

On Tue, Jul 25, 2023 at 13:07, Pratyush Yadav <ptyadav@amazon.de> wrote:
>
> If a device has no NUMA node information associated with it, the driver
> puts the device in node first_memory_node (say node 0). As a side
> effect, this gives an indication to userspace IRQ balancing programs
> that the device is in node 0 so they prefer CPUs in node 0 to handle the
> IRQs associated with the queues. For example, irqbalance will only let
> CPUs in node 0 handle the interrupts. This reduces random access
> performance on CPUs in node 1 since the interrupt for command completion
> will fire on node 0.
>
> For example, AWS EC2's i3.16xlarge instance does not expose NUMA
> information for the NVMe devices. This means all NVMe devices have
> NUMA_NO_NODE by default. Without this patch, random 4k read performance
> measured via fio on CPUs from node 1 (around 165k IOPS) is almost 50%
> less than CPUs from node 0 (around 315k IOPS). With this patch, CPUs on
> both nodes get similar performance (around 315k IOPS).
>
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> ---
>  drivers/nvme/host/pci.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index baf69af7ea78e..f5ba2d7102eae 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2916,9 +2916,6 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
>  	struct nvme_dev *dev;
>  	int ret = -ENOMEM;
>
> -	if (node == NUMA_NO_NODE)
> -		set_dev_node(&pdev->dev, first_memory_node);
> -
>  	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
>  	if (!dev)
>  		return ERR_PTR(-ENOMEM);
> --

FYI, we have received bug reports because of this patch. All
single-NUMA-node systems, like a VMware guest or a system set to
interleaved mode, will now see -1 as the numa_node attribute.

Apparently, some applications like Lightbits do not expect to see -1.

Maurizio
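For userspace that reads the sysfs attribute, the report above boils down to handling -1 (NUMA_NO_NODE) as a valid value. This is a hedged sketch, not code from any application named in the thread; parse_numa_node is a hypothetical helper:

```python
# Hypothetical userspace helper: the sysfs numa_node attribute (e.g.
# /sys/class/nvme/nvme0/device/numa_node) may read -1 on systems without
# NUMA information; treat that as "no node constraint", not an error.
def parse_numa_node(sysfs_value):
    node = int(sysfs_value.strip())
    if node < 0:            # -1 == NUMA_NO_NODE
        return None         # caller may fall back to any/local node
    return node

print(parse_numa_node("-1\n"))   # -> None (NUMA_NO_NODE)
print(parse_numa_node("0\n"))    # -> 0
```

Note that, as Christoph points out in the reply below this message, -1 is already possible for many other device types, so consumers of numa_node need this fallback regardless of the nvme change.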
* Re: [PATCH] nvme-pci: do not set the NUMA node of device if it has none
  From: Christoph Hellwig @ 2024-07-23 14:39 UTC
  To: Maurizio Lombardi
  Cc: Pratyush Yadav, Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel

On Tue, Jul 23, 2024 at 11:49:24AM +0200, Maurizio Lombardi wrote:
> FYI, we have received bug reports because of this patch. All
> single-NUMA-node systems, like a VMware guest or a system set to
> interleaved mode, will now see -1 as the numa_node attribute.
>
> Apparently, some applications like Lightbits do not expect to see -1.

But they can see the same for all kinds of other devices. I can't see
how we could cater to this kind of behavior.
end of thread [~2024-07-23 14:39 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-25 11:06 [PATCH] nvme-pci: do not set the NUMA node of device if it has none Pratyush Yadav
2023-07-25 14:35 ` Keith Busch
2023-07-26  7:58 ` Sagi Grimberg
2023-07-26 13:14 ` Christoph Hellwig
2023-07-26 15:30 ` Pratyush Yadav
2023-07-26 16:17 ` Keith Busch
2023-07-26 19:32 ` Pratyush Yadav
2023-07-26 22:25 ` Keith Busch
2023-07-28 18:09 ` Pratyush Yadav
2023-07-28 19:34 ` Keith Busch
2023-08-04 14:50 ` Pratyush Yadav
2023-08-04 15:19 ` Keith Busch
2023-08-08 15:51 ` Pratyush Yadav
2023-08-08 16:35 ` Keith Busch
2024-07-23  9:49 ` Maurizio Lombardi
2024-07-23 14:39 ` Christoph Hellwig