* [PATCH 1/2] genirq/affinity: remove rsvd check against minvec
@ 2024-05-10 14:14 Keith Busch
  2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
  2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
  0 siblings, 2 replies; 21+ messages in thread
From: Keith Busch @ 2024-05-10 14:14 UTC (permalink / raw)
  To: linux-nvme, linux-kernel; +Cc: hch, tglx, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The reserved vectors are just the desired vectors that don't need to be
managed.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 kernel/irq/affinity.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315c..74b7cccb51a16 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -113,9 +113,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
 	unsigned int resv = affd->pre_vectors + affd->post_vectors;
 	unsigned int set_vecs;
 
-	if (resv > minvec)
-		return 0;
-
 	if (affd->calc_sets) {
 		set_vecs = maxvec - resv;
 	} else {
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread
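[Editor's note: the effect of dropping the check can be seen in a simplified Python model of irq_calc_affinity_vectors(). This is an illustration, not kernel code; `ncpus` stands in for the weight of cpu_possible_mask, and `keep_rsvd_check` toggles the pre-patch behavior.]

```python
def calc_affinity_vectors(minvec, maxvec, pre_vectors, post_vectors,
                          ncpus, calc_sets=False, keep_rsvd_check=False):
    """Toy model of irq_calc_affinity_vectors() before/after this patch."""
    resv = pre_vectors + post_vectors  # vectors that are not managed
    if keep_rsvd_check and resv > minvec:
        return 0  # pre-patch: mostly-reserved requests were rejected
    if calc_sets:
        set_vecs = maxvec - resv
    else:
        set_vecs = ncpus  # kernel uses cpumask_weight(cpu_possible_mask)
    return resv + min(set_vecs, maxvec - resv)

# With resv (5) > minvec (2), the old code returned 0 vectors; after the
# patch the reserved vectors simply count toward the total.
old = calc_affinity_vectors(2, 8, 1, 4, ncpus=4, keep_rsvd_check=True)
new = calc_affinity_vectors(2, 8, 1, 4, ncpus=4)
print(old, new)  # 0 8
```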
* [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-10 14:14 [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Keith Busch
@ 2024-05-10 14:14 ` Keith Busch
  2024-05-10 15:10   ` Christoph Hellwig
  2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
  1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2024-05-10 14:14 UTC (permalink / raw)
  To: linux-nvme, linux-kernel; +Cc: hch, tglx, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Some people _really_ want to control their interrupt affinity.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685d..4c2799c3f45f5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -63,6 +63,11 @@ MODULE_PARM_DESC(sgl_threshold,
 		"Use SGLs when average request segment size is larger or equal to "
 		"this size. Use 0 to disable SGLs.");
 
+static bool managed_irqs = true;
+module_param(managed_irqs, bool, 0444);
+MODULE_PARM_DESC(managed_irqs,
+		 "set to false for user controlled irq affinity");
+
 #define NVME_PCI_MIN_QUEUE_SIZE 2
 #define NVME_PCI_MAX_QUEUE_SIZE 4095
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
@@ -456,7 +461,7 @@ static void nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		 * affinity), so use the regular blk-mq cpu mapping
 		 */
 		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL && offset)
+		if (managed_irqs && i != HCTX_TYPE_POLL && offset)
 			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
 		else
 			blk_mq_map_queues(map);
@@ -2180,6 +2185,9 @@ static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 	struct nvme_dev *dev = affd->priv;
 	unsigned int nr_read_queues, nr_write_queues = dev->nr_write_queues;
 
+	if (!nrirqs)
+		nrirqs = affd->post_vectors;
+
 	/*
 	 * If there is no interrupt available for queues, ensure that
 	 * the default queue is set to 1. The affinity set size is
@@ -2226,6 +2234,9 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		poll_queues = min(dev->nr_poll_queues, nr_io_queues - 1);
 	dev->io_queues[HCTX_TYPE_POLL] = poll_queues;
 
+	if (!managed_irqs)
+		affd.post_vectors = nr_io_queues - poll_queues;
+
 	/*
 	 * Initialize for the single interrupt case, will be updated in
 	 * nvme_calc_irq_sets().
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
@ 2024-05-10 15:10   ` Christoph Hellwig
  2024-05-10 16:20     ` Keith Busch
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-10 15:10 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-kernel, hch, tglx, ming.lei, Keith Busch

On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Some people _really_ want to control their interrupt affinity.

So let them argue why. I'd rather have a really, really, really
good argument for this crap, and I'd like to hear it from the horse's
mouth.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-10 15:10 ` Christoph Hellwig
@ 2024-05-10 16:20   ` Keith Busch
  2024-05-10 23:50     ` Ming Lei
  2024-05-13  7:33   ` Benjamin Meier
  2024-05-13 13:12   ` Bart Van Assche
  2 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2024-05-10 16:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme, linux-kernel, tglx, ming.lei

On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > Some people _really_ want to control their interrupt affinity.
> 
> So let them argue why. I'd rather have a really, really, really
> good argument for this crap, and I'd like to hear it from the horses
> mouth.

It's just prioritizing predictable user task scheduling for a subset of
CPUs instead of having consistently better storage performance.

We already have "isolcpus=managed_irq," parameter to prevent managed
interrupts from running on a subset of CPUs, so the use case is already
kind of supported. The problem with that parameter is it is a no-op if
the starting affinity spread contains only isolated CPUs.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-10 16:20 ` Keith Busch @ 2024-05-10 23:50 ` Ming Lei 2024-05-11 0:41 ` Keith Busch 0 siblings, 1 reply; 21+ messages in thread From: Ming Lei @ 2024-05-10 23:50 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel, tglx On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote: > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote: > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote: > > > From: Keith Busch <kbusch@kernel.org> > > > > > > Some people _really_ want to control their interrupt affinity. > > > > So let them argue why. I'd rather have a really, really, really > > good argument for this crap, and I'd like to hear it from the horses > > mouth. > > It's just prioritizing predictable user task scheduling for a subset of > CPUs instead of having consistently better storage performance. > > We already have "isolcpus=managed_irq," parameter to prevent managed > interrupts from running on a subset of CPUs, so the use case is already > kind of supported. The problem with that parameter is it is a no-op if > the starting affinity spread contains only isolated CPUs. Can you explain a bit why it is a no-op? If only isolated CPUs are spread on one queue, there will be no IO originated from these isolated CPUs, that is exactly what the isolation needs. Thanks, Ming ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-10 23:50 ` Ming Lei
@ 2024-05-11  0:41   ` Keith Busch
  2024-05-11  0:59     ` Ming Lei
  2024-05-12  6:35     ` Thomas Gleixner
  0 siblings, 2 replies; 21+ messages in thread
From: Keith Busch @ 2024-05-11 0:41 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel, tglx

On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > From: Keith Busch <kbusch@kernel.org>
> > > > 
> > > > Some people _really_ want to control their interrupt affinity.
> > > 
> > > So let them argue why. I'd rather have a really, really, really
> > > good argument for this crap, and I'd like to hear it from the horses
> > > mouth.
> > 
> > It's just prioritizing predictable user task scheduling for a subset of
> > CPUs instead of having consistently better storage performance.
> > 
> > We already have "isolcpus=managed_irq," parameter to prevent managed
> > interrupts from running on a subset of CPUs, so the use case is already
> > kind of supported. The problem with that parameter is it is a no-op if
> > the starting affinity spread contains only isolated CPUs.
> 
> Can you explain a bit why it is a no-op? If only isolated CPUs are
> spread on one queue, there will be no IO originated from these isolated
> CPUs, that is exactly what the isolation needs.

The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
It only limits where the managed irq will assign the effective_cpus as a
best effort.

Example: I boot a system with 4 threads, one nvme device, and the
kernel parameter:

	isolcpus=managed_irq,2-3

Run this:

	for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done

Check /proc/interrupts | grep nvme0:

	          CPU0       CPU1       CPU2       CPU3
	...
	 26:      1000          0          0          0  PCI-MSIX-0000:00:05.0   1-edge      nvme0q1
	 27:         0       1004          0          0  PCI-MSIX-0000:00:05.0   2-edge      nvme0q2
	 28:         0          0       1000          0  PCI-MSIX-0000:00:05.0   3-edge      nvme0q3
	 29:         0          0          0       1043  PCI-MSIX-0000:00:05.0   4-edge      nvme0q4

The isolcpus did nothing because each vector's mask had just one CPU;
there was nowhere else that the managed irq could send it. The
documentation seems to indicate that was by design as a "best effort".

^ permalink raw reply	[flat|nested] 21+ messages in thread
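[Editor's note: the no-op Keith describes can be reproduced with a toy model of the documented best-effort semantics. This is an illustration only, not the kernel's implementation; CPU masks are modeled as Python sets.]

```python
def managed_irq_effective_mask(vector_mask, isolated):
    """Best-effort rule of isolcpus=managed_irq: steer the IRQ to the
    housekeeping CPUs in the vector's mask if there are any; otherwise
    fall back to the full mask (the no-op case above)."""
    housekeeping = vector_mask - isolated
    return housekeeping if housekeeping else vector_mask

isolated = {2, 3}
# Four queues, one CPU per vector, as in the /proc/interrupts dump:
masks = [{0}, {1}, {2}, {3}]
print([sorted(managed_irq_effective_mask(m, isolated)) for m in masks])
# nvme0q3/nvme0q4 still fire on the isolated CPUs 2 and 3

# With a wider spread there is somewhere else to go, so the isolated
# CPU is excluded as intended:
print(sorted(managed_irq_effective_mask({1, 2}, isolated)))
```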
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-11  0:41 ` Keith Busch
@ 2024-05-11  0:59   ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2024-05-11 0:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel, tglx

On Fri, May 10, 2024 at 06:41:58PM -0600, Keith Busch wrote:
> On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> > On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > > From: Keith Busch <kbusch@kernel.org>
> > > > > 
> > > > > Some people _really_ want to control their interrupt affinity.
> > > > 
> > > > So let them argue why. I'd rather have a really, really, really
> > > > good argument for this crap, and I'd like to hear it from the horses
> > > > mouth.
> > > 
> > > It's just prioritizing predictable user task scheduling for a subset of
> > > CPUs instead of having consistently better storage performance.
> > > 
> > > We already have "isolcpus=managed_irq," parameter to prevent managed
> > > interrupts from running on a subset of CPUs, so the use case is already
> > > kind of supported. The problem with that parameter is it is a no-op if
> > > the starting affinity spread contains only isolated CPUs.
> > 
> > Can you explain a bit why it is a no-op? If only isolated CPUs are
> > spread on one queue, there will be no IO originated from these isolated
> > CPUs, that is exactly what the isolation needs.
> 
> The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.

Please see commit a46c27026da1 ("blk-mq: don't schedule block kworker
on isolated CPUs") in for-6.10/block.

> It only limits where the managed irq will assign the effective_cpus as a
> best effort.

Most of the time it does work.

> Example, I boot with a system with 4 threads, one nvme device, and
> kernel parameter:
> 
> 	isolcpus=managed_irq,2-3
> 
> Run this:
> 
> 	for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done

That is a problem with the test: when you try to isolate CPUs 2-3, you
aren't expected to submit IO or run applications on those isolated CPUs
in the first place.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-11 0:41 ` Keith Busch 2024-05-11 0:59 ` Ming Lei @ 2024-05-12 6:35 ` Thomas Gleixner 2024-05-20 15:37 ` Christoph Hellwig 1 sibling, 1 reply; 21+ messages in thread From: Thomas Gleixner @ 2024-05-12 6:35 UTC (permalink / raw) To: Keith Busch, Ming Lei Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel On Fri, May 10 2024 at 18:41, Keith Busch wrote: > On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote: >> Can you explain a bit why it is a no-op? If only isolated CPUs are >> spread on one queue, there will be no IO originated from these isolated >> CPUs, that is exactly what the isolation needs. > > The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs. > It only limits where the managed irq will assign the effective_cpus as a > best effort. > > Example, I boot with a system with 4 threads, one nvme device, and > kernel parameter: > > isolcpus=managed_irq,2-3 > > Run this: > > for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done > > Check /proc/interrupts | grep nvme0: > > CPU0 CPU1 CPU2 CPU3 > .. > 26: 1000 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 > 27: 0 1004 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 > 28: 0 0 1000 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 > 29: 0 0 0 1043 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 > > The isolcpus did nothing becuase the each vector's mask had just one > cpu; there was no where else that the managed irq could send it. The > documentation seems to indicate that was by design as a "best effort". That's expected as you pin the I/O operation on the isolated CPUs which in turn makes them use the per CPU queue. The isolated CPUs are only excluded for device management interrupts, but not for the affinity spread of the queues. Thanks, tglx ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-12  6:35 ` Thomas Gleixner
@ 2024-05-20 15:37   ` Christoph Hellwig
  2024-05-20 20:34     ` Thomas Gleixner
  2024-05-21  2:31     ` Ming Lei
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-20 15:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Keith Busch, Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel

On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
> That's expected as you pin the I/O operation on the isolated CPUs which
> in turn makes them use the per CPU queue.
> 
> The isolated CPUs are only excluded for device management interrupts,
> but not for the affinity spread of the queues.

We'll probably need a version of isolcpus that also excludes the
interrupt spread given that users are asking for it. And I'd much
prefer that over adding random module options to every driver to
disable managed interrupts.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-20 15:37 ` Christoph Hellwig @ 2024-05-20 20:34 ` Thomas Gleixner 2024-05-21 2:31 ` Ming Lei 1 sibling, 0 replies; 21+ messages in thread From: Thomas Gleixner @ 2024-05-20 20:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel On Mon, May 20 2024 at 17:37, Christoph Hellwig wrote: > On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote: >> That's expected as you pin the I/O operation on the isolated CPUs which >> in turn makes them use the per CPU queue. >> >> The isolated CPUs are only excluded for device management interrupts, >> but not for the affinity spread of the queues. > > We'll probably need a version of isolcpus that also excludes the > interrupt spread given that users are asking for it. And I'd much > prefer that over adding radom module options to every driver to disable > managed interrupts. No objections from my side. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-20 15:37 ` Christoph Hellwig
  2024-05-20 20:34   ` Thomas Gleixner
@ 2024-05-21  2:31   ` Ming Lei
  2024-05-21  8:38     ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-21 2:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Keith Busch, Keith Busch, linux-nvme, linux-kernel

On Mon, May 20, 2024 at 05:37:42PM +0200, Christoph Hellwig wrote:
> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
> > That's expected as you pin the I/O operation on the isolated CPUs which
> > in turn makes them use the per CPU queue.
> > 
> > The isolated CPUs are only excluded for device management interrupts,
> > but not for the affinity spread of the queues.
> 
> We'll probably need a version of isolcpus that also excludes the
> interrupt spread given that users are asking for it. And I'd much
> prefer that over adding radom module options to every driver to disable
> managed interrupts.

BTW, isolcpus has been marked as deprecated, and it can't be adjusted
at runtime:

	isolcpus=	[KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
			[Deprecated - use cpusets instead]

Thanks,
Ming

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-21 2:31 ` Ming Lei @ 2024-05-21 8:38 ` Thomas Gleixner 2024-05-21 10:06 ` Frederic Weisbecker 0 siblings, 1 reply; 21+ messages in thread From: Thomas Gleixner @ 2024-05-21 8:38 UTC (permalink / raw) To: Ming Lei, Christoph Hellwig Cc: Keith Busch, Keith Busch, linux-nvme, linux-kernel, Frederic Weisbecker On Tue, May 21 2024 at 10:31, Ming Lei wrote: > On Mon, May 20, 2024 at 05:37:42PM +0200, Christoph Hellwig wrote: >> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote: >> > That's expected as you pin the I/O operation on the isolated CPUs which >> > in turn makes them use the per CPU queue. >> > >> > The isolated CPUs are only excluded for device management interrupts, >> > but not for the affinity spread of the queues. >> >> We'll probably need a version of isolcpus that also excludes the >> interrupt spread given that users are asking for it. And I'd much >> prefer that over adding radom module options to every driver to disable >> managed interrupts. > > BTW, isolcpus has been marked as deprecated, and it can't be adjust > runtime. Which is far from reality as cpusets do not allow to do what isolcpus does today. Also runtime adjusting managed interrupts needs way more thoughts. Thanks tglx ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-21 8:38 ` Thomas Gleixner @ 2024-05-21 10:06 ` Frederic Weisbecker 0 siblings, 0 replies; 21+ messages in thread From: Frederic Weisbecker @ 2024-05-21 10:06 UTC (permalink / raw) To: Thomas Gleixner Cc: Ming Lei, Christoph Hellwig, Keith Busch, Keith Busch, linux-nvme, linux-kernel Le Tue, May 21, 2024 at 10:38:25AM +0200, Thomas Gleixner a écrit : > On Tue, May 21 2024 at 10:31, Ming Lei wrote: > > > On Mon, May 20, 2024 at 05:37:42PM +0200, Christoph Hellwig wrote: > >> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote: > >> > That's expected as you pin the I/O operation on the isolated CPUs which > >> > in turn makes them use the per CPU queue. > >> > > >> > The isolated CPUs are only excluded for device management interrupts, > >> > but not for the affinity spread of the queues. > >> > >> We'll probably need a version of isolcpus that also excludes the > >> interrupt spread given that users are asking for it. And I'd much > >> prefer that over adding radom module options to every driver to disable > >> managed interrupts. > > > > BTW, isolcpus has been marked as deprecated, and it can't be adjust > > runtime. > > Which is far from reality as cpusets do not allow to do what isolcpus > does today. > > Also runtime adjusting managed interrupts needs way more thoughts. I'll remove that comment (unless someone beats me at it?). We used to think that cpusets would indeed deprecate isolcpus but for several reasons this will never be the case. Thanks. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-10 15:10 ` Christoph Hellwig 2024-05-10 16:20 ` Keith Busch @ 2024-05-13 7:33 ` Benjamin Meier 2024-05-13 8:39 ` Ming Lei 2024-05-13 13:12 ` Bart Van Assche 2 siblings, 1 reply; 21+ messages in thread From: Benjamin Meier @ 2024-05-13 7:33 UTC (permalink / raw) To: hch; +Cc: kbusch, kbusch, linux-kernel, linux-nvme, ming.lei, tglx > From: Christoph Hellwig <hch@lst.de> > > So let them argue why. I'd rather have a really, really, really > good argument for this crap, and I'd like to hear it from the horses > mouth. I reached out to Keith to explore the possibility of manually defining which cores handle NVMe interrupts. The application which we develop and maintain (in the company I work) has very high requirements regarding latency. We have some isolated cores and we run our application on those. Our system is using kernel 5.4 which unfortunately does not support "isolcpus=managed_irq". Actually, we did not even know about that option, because we are focussed on kernel 5.4. It solves part of our problem, but being able to specify where exactly interrupts are running is still superior in our opinion. E.g. assume the number of house-keeping cores is small, because we want to have full control over the system. In our case we have threads of different priorities where some get an exclusive core. Some other threads share a core (or a group of cores) with other threads. Now we are still happy to assign some interrupts to some of the cores which we consider as "medium-priority". Due to the small number of non-isolated cores, it can be tricky to assign all interrupts to those without a performance-penalty. Given these requirements, manually specifying interrupt/core assignments would offer greater flexibility and control over system performance. Moreover, the proposed code changes appear minimal and have no impact on existing functionalities. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-13 7:33 ` Benjamin Meier @ 2024-05-13 8:39 ` Ming Lei 2024-05-13 8:59 ` Benjamin Meier 0 siblings, 1 reply; 21+ messages in thread From: Ming Lei @ 2024-05-13 8:39 UTC (permalink / raw) To: Benjamin Meier Cc: hch, kbusch, kbusch, linux-kernel, linux-nvme, tglx, ming.lei On Mon, May 13, 2024 at 09:33:27AM +0200, Benjamin Meier wrote: > > From: Christoph Hellwig <hch@lst.de> > > > > So let them argue why. I'd rather have a really, really, really > > good argument for this crap, and I'd like to hear it from the horses > > mouth. > > I reached out to Keith to explore the possibility of manually defining > which cores handle NVMe interrupts. > > The application which we develop and maintain (in the company I work) > has very high requirements regarding latency. We have some isolated cores Are these isolated cores controlled by kernel command line `isolcpus=`? > and we run our application on those. > > Our system is using kernel 5.4 which unfortunately does not support > "isolcpus=managed_irq". Actually, we did not even know about that > option, because we are focussed on kernel 5.4. It solves part > of our problem, but being able to specify where exactly interrupts > are running is still superior in our opinion. > > E.g. assume the number of house-keeping cores is small, because we > want to have full control over the system. In our case we have threads > of different priorities where some get an exclusive core. Some other threads > share a core (or a group of cores) with other threads. Now we are still > happy to assign some interrupts to some of the cores which we consider as > "medium-priority". Due to the small number of non-isolated cores, it can So these "medium-priority" cores belong to isolated cpu list, you still expect NVMe interrupts can be handled on these cpu cores, do I understand correctly? 
If yes, I think your case can still be covered with 'isolcpus=managed_irq',
which needn't use the same CPU list as plain `isolcpus=`: for example,
exclude the medium-priority cores from 'isolcpus=managed_irq' while still
including them in the plain `isolcpus=` list.

> be tricky to assign all interrupts to those without a performance-penalty.
> 
> Given these requirements, manually specifying interrupt/core assignments
> would offer greater flexibility and control over system performance.
> Moreover, the proposed code changes appear minimal and have no
> impact on existing functionalities.

It looks like your main concern is performance, but as Keith mentioned,
the proposed change may degrade nvme perf too:

https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/

thanks,
Ming

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-13 8:39 ` Ming Lei @ 2024-05-13 8:59 ` Benjamin Meier 2024-05-13 9:25 ` Ming Lei 0 siblings, 1 reply; 21+ messages in thread From: Benjamin Meier @ 2024-05-13 8:59 UTC (permalink / raw) To: ming.lei Cc: benjamin.meier70, hch, kbusch, kbusch, linux-kernel, linux-nvme, tglx > > The application which we develop and maintain (in the company I work) > > has very high requirements regarding latency. We have some isolated cores > > Are these isolated cores controlled by kernel command line `isolcpus=`? Yes, exactly. > > and we run our application on those. > > > > Our system is using kernel 5.4 which unfortunately does not support > > "isolcpus=managed_irq". Actually, we did not even know about that > > option, because we are focussed on kernel 5.4. It solves part > > of our problem, but being able to specify where exactly interrupts > > are running is still superior in our opinion. > > > > E.g. assume the number of house-keeping cores is small, because we > > want to have full control over the system. In our case we have threads > > of different priorities where some get an exclusive core. Some other threads > > share a core (or a group of cores) with other threads. Now we are still > > happy to assign some interrupts to some of the cores which we consider as > > "medium-priority". Due to the small number of non-isolated cores, it can > > So these "medium-priority" cores belong to isolated cpu list, you still expect > NVMe interrupts can be handled on these cpu cores, do I understand correctly? We want to avoid that the NVMe interrupts are on the "high priority" cores. Having noise on them is quite bad for us, so we wanted to move some interrupts to house keeping cores and if needed (due to performance issues) keep some on those "medium-priority" isolated cores. NVMe is not that highest priority for us, but possibly running too much on the house-keeping cores could also be bad. 
> If yes, I think your case still can be covered with 'isolcpus=managed_irq' which > needn't to be same with cpu cores specified from `isolcpus=`, such as > excluding medium-priority cores from 'isolcpus=managed_irq', and > meantime include them in plain `isolcpus=`. Unfortunately, our kernel version (5.4) does not support "managed_irq" and due to that we're happy with the patch. However, I see that for newer kernel versions the already existing arguments could be sufficient to do everything. > > be tricky to assign all interrupts to those without a performance-penalty. > > > > Given these requirements, manually specifying interrupt/core assignments > > would offer greater flexibility and control over system performance. > > Moreover, the proposed code changes appear minimal and have no > > impact on existing functionalities. > > Looks your main concern is performance, but as Keith mentioned, the proposed > change may degrade nvme perf too: > > https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/ Yes, but for NVMe it's not that critical. The most important point for us is to keep them away from our "high-priority" cores. We still wanted to have control where we run those interrupts, but also because we just did not know the "managed_irq" option. Thanks, Benjamin ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts 2024-05-13 8:59 ` Benjamin Meier @ 2024-05-13 9:25 ` Ming Lei 2024-05-13 12:33 ` Benjamin Meier 0 siblings, 1 reply; 21+ messages in thread From: Ming Lei @ 2024-05-13 9:25 UTC (permalink / raw) To: Benjamin Meier; +Cc: hch, kbusch, kbusch, linux-kernel, linux-nvme, tglx On Mon, May 13, 2024 at 10:59:02AM +0200, Benjamin Meier wrote: > > > The application which we develop and maintain (in the company I work) > > > has very high requirements regarding latency. We have some isolated > cores > > > > Are these isolated cores controlled by kernel command line `isolcpus=`? > > Yes, exactly. > > > > and we run our application on those. > > > > > > Our system is using kernel 5.4 which unfortunately does not support > > > "isolcpus=managed_irq". Actually, we did not even know about that > > > option, because we are focussed on kernel 5.4. It solves part > > > of our problem, but being able to specify where exactly interrupts > > > are running is still superior in our opinion. > > > > > > E.g. assume the number of house-keeping cores is small, because we > > > want to have full control over the system. In our case we have threads > > > of different priorities where some get an exclusive core. Some other > threads > > > share a core (or a group of cores) with other threads. Now we are still > > > happy to assign some interrupts to some of the cores which we consider > as > > > "medium-priority". Due to the small number of non-isolated cores, it can > > > > So these "medium-priority" cores belong to isolated cpu list, you still > expect > > NVMe interrupts can be handled on these cpu cores, do I understand > correctly? > > We want to avoid that the NVMe interrupts are on the "high priority" cores. > Having > noise on them is quite bad for us, so we wanted to move some interrupts to > house > keeping cores and if needed (due to performance issues) keep some on those > "medium-priority" isolated cores. 
NVMe is not that highest priority for us, > but possibly running too much on the house-keeping cores could also be bad. > > > If yes, I think your case still can be covered with 'isolcpus=managed_irq' > which > > needn't to be same with cpu cores specified from `isolcpus=`, such as > > excluding medium-priority cores from 'isolcpus=managed_irq', and > > meantime include them in plain `isolcpus=`. > > Unfortunately, our kernel version (5.4) does not support "managed_irq" and > due > to that we're happy with the patch. However, I see that for newer kernel > versions > the already existing arguments could be sufficient to do everything. 'isolcpus=managed_irq' enablement patches are small, and shouldn't be very hard to backport. > > > > be tricky to assign all interrupts to those without a > performance-penalty. > > > > > > Given these requirements, manually specifying interrupt/core assignments > > > would offer greater flexibility and control over system performance. > > > Moreover, the proposed code changes appear minimal and have no > > > impact on existing functionalities. > > > > Looks your main concern is performance, but as Keith mentioned, the > proposed > > change may degrade nvme perf too: > > > > https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/ > > Yes, but for NVMe it's not that critical. The most important point for us is > to keep them away from our "high-priority" cores. We still wanted to have > control > where we run those interrupts, but also because we just did not know the > "managed_irq" > option. OK, thanks for share the input! Now from upstream viewpoint, 'isolcpus=managed_irq' should work for your case, and seems not necessary to support nvme unmanaged irq for this requirement at least. thanks, Ming ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-13  9:25             ` Ming Lei
@ 2024-05-13 12:33               ` Benjamin Meier
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Meier @ 2024-05-13 12:33 UTC (permalink / raw)
  To: ming.lei
  Cc: benjamin.meier70, hch, kbusch, kbusch, linux-kernel, linux-nvme, tglx

> The 'isolcpus=managed_irq' enablement patches are small, and shouldn't
> be very hard to backport.

I have big respect for kernel code, and it's probably not so easy for
non-kernel devs :) But yeah, we'll look into this.

> > > > be tricky to assign all interrupts to those without a
> > > > performance penalty.
> > > >
> > > > Given these requirements, manually specifying interrupt/core
> > > > assignments would offer greater flexibility and control over
> > > > system performance. Moreover, the proposed code changes appear
> > > > minimal and have no impact on existing functionality.
> > >
> > > It looks like your main concern is performance, but as Keith
> > > mentioned, the proposed change may degrade nvme perf too:
> > >
> > > https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/
> >
> > Yes, but for NVMe it's not that critical. The most important point
> > for us is to keep them away from our "high-priority" cores. We still
> > wanted to have control over where we run those interrupts, but also
> > because we just did not know about the "managed_irq" option.
>
> OK, thanks for sharing the input!
>
> Now, from an upstream viewpoint, 'isolcpus=managed_irq' should work for
> your case, and it seems unnecessary to support nvme unmanaged irq for
> this requirement, at least.

Yes, probably that will do it. Personally, I still think it would be nice
to be able to assign interrupts to specific cores, but practically the
advantages are likely not that big compared to 'isolcpus=managed_irq'.

Thanks for all the explanations
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
  2024-05-10 15:10   ` Christoph Hellwig
  2024-05-10 16:20     ` Keith Busch
  2024-05-13  7:33     ` Benjamin Meier
@ 2024-05-13 13:12     ` Bart Van Assche
  2 siblings, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2024-05-13 13:12 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: linux-nvme, linux-kernel, tglx, ming.lei, Keith Busch

On 5/10/24 08:10, Christoph Hellwig wrote:
> On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> Some people _really_ want to control their interrupt affinity.
>
> So let them argue why. I'd rather have a really, really, really
> good argument for this crap, and I'd like to hear it from the horse's
> mouth.

Performance can be increased by assigning interrupts carefully,
especially in storage appliances that have to process a large number of
network and storage interrupts. With careful assignment, the number of
completions processed per interrupt can be increased, and hence
performance also increases. In 2014 I was working on a product that
benefited from this approach.

Thanks,

Bart.
* Re: [PATCH 1/2] genirq/affinity: remove rsvd check against minvec
  2024-05-10 14:14 [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Keith Busch
  2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
@ 2024-05-10 15:15 ` Ming Lei
  2024-05-10 16:47   ` Keith Busch
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-10 15:15 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, linux-kernel, hch, tglx, Keith Busch

On Fri, May 10, 2024 at 07:14:58AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The reserved vectors are just the desired vectors that don't need to be
> managed.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  kernel/irq/affinity.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 44a4eba80315c..74b7cccb51a16 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -113,9 +113,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
>  	unsigned int resv = affd->pre_vectors + affd->post_vectors;
>  	unsigned int set_vecs;
>
> -	if (resv > minvec)
> -		return 0;
> -

This behavior was introduced in 6f9a22bc5775 ("PCI/MSI: Ignore affinity
if pre/post vector count is more than min_vecs"), which is a bug fix.

Thanks,
Ming
* Re: [PATCH 1/2] genirq/affinity: remove rsvd check against minvec
  2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
@ 2024-05-10 16:47   ` Keith Busch
  0 siblings, 0 replies; 21+ messages in thread
From: Keith Busch @ 2024-05-10 16:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, linux-nvme, linux-kernel, hch, tglx

On Fri, May 10, 2024 at 11:15:54PM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 07:14:58AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > The reserved vectors are just the desired vectors that don't need to be
> > managed.
> >
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > ---
> >  kernel/irq/affinity.c | 3 ---
> >  1 file changed, 3 deletions(-)
> >
> > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > index 44a4eba80315c..74b7cccb51a16 100644
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -113,9 +113,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
> >  	unsigned int resv = affd->pre_vectors + affd->post_vectors;
> >  	unsigned int set_vecs;
> >
> > -	if (resv > minvec)
> > -		return 0;
> > -
>
> This behavior was introduced in 6f9a22bc5775 ("PCI/MSI: Ignore affinity
> if pre/post vector count is more than min_vecs"), which is a bug fix.

Thanks for the pointer. It's probably best I avoid messing with irq code
just for this use case, so I can have the driver disable the
PCI_IRQ_AFFINITY flag instead ... assuming hch doesn't "nak" it. :)
end of thread, other threads:[~2024-05-21 10:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2024-05-10 14:14 [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Keith Busch
2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
2024-05-10 15:10   ` Christoph Hellwig
2024-05-10 16:20     ` Keith Busch
2024-05-10 23:50       ` Ming Lei
2024-05-11  0:41         ` Keith Busch
2024-05-11  0:59           ` Ming Lei
2024-05-12  6:35             ` Thomas Gleixner
2024-05-20 15:37               ` Christoph Hellwig
2024-05-20 20:34                 ` Thomas Gleixner
2024-05-21  2:31                   ` Ming Lei
2024-05-21  8:38                     ` Thomas Gleixner
2024-05-21 10:06                       ` Frederic Weisbecker
2024-05-13  7:33     ` Benjamin Meier
2024-05-13  8:39       ` Ming Lei
2024-05-13  8:59         ` Benjamin Meier
2024-05-13  9:25           ` Ming Lei
2024-05-13 12:33             ` Benjamin Meier
2024-05-13 13:12     ` Bart Van Assche
2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
2024-05-10 16:47   ` Keith Busch