* [PATCH 1/2] genirq/affinity: remove rsvd check against minvec
@ 2024-05-10 14:14 Keith Busch
2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
0 siblings, 2 replies; 21+ messages in thread
From: Keith Busch @ 2024-05-10 14:14 UTC (permalink / raw)
To: linux-nvme, linux-kernel; +Cc: hch, tglx, ming.lei, Keith Busch
From: Keith Busch <kbusch@kernel.org>
The reserved vectors are just the desired vectors that don't need to be
managed.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
kernel/irq/affinity.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315c..74b7cccb51a16 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -113,9 +113,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
unsigned int resv = affd->pre_vectors + affd->post_vectors;
unsigned int set_vecs;
- if (resv > minvec)
- return 0;
-
if (affd->calc_sets) {
set_vecs = maxvec - resv;
} else {
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
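For context, the guard being removed lives in irq_calc_affinity_vectors(). Below is a minimal userspace model of that calculation, simplified to the affd->calc_sets branch, with the removed check kept behind a flag; the function name and parameters are illustrative, not the kernel's:

```c
#include <assert.h>

/*
 * Userspace sketch of irq_calc_affinity_vectors(): resv counts the
 * pre/post vectors that are excluded from the managed affinity spread.
 * With the guard enabled, the function gives up (returns 0) whenever
 * the reserved vectors alone exceed minvec; the patch drops that guard.
 */
unsigned int calc_affinity_vectors(unsigned int minvec,
                                   unsigned int maxvec,
                                   unsigned int pre_vectors,
                                   unsigned int post_vectors,
                                   int with_removed_guard)
{
    unsigned int resv = pre_vectors + post_vectors;
    unsigned int set_vecs;

    if (with_removed_guard && resv > minvec)
        return 0;

    /* affd->calc_sets branch: every non-reserved vector is spreadable */
    set_vecs = maxvec - resv;

    /* kernel returns resv + min(set_vecs, maxvec - resv); in this
     * simplified branch the two terms are equal */
    return resv + set_vecs;
}
```

With the guard in place, a driver that reserves more vectors than its minimum (as patch 2 does when every I/O vector becomes a post_vector) would get 0 back; without it, the reserved vectors simply pass through.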
* [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 14:14 [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Keith Busch
@ 2024-05-10 14:14 ` Keith Busch
2024-05-10 15:10 ` Christoph Hellwig
2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2024-05-10 14:14 UTC (permalink / raw)
To: linux-nvme, linux-kernel; +Cc: hch, tglx, ming.lei, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Some people _really_ want to control their interrupt affinity.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/pci.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685d..4c2799c3f45f5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -63,6 +63,11 @@ MODULE_PARM_DESC(sgl_threshold,
"Use SGLs when average request segment size is larger or equal to "
"this size. Use 0 to disable SGLs.");
+static bool managed_irqs = true;
+module_param(managed_irqs, bool, 0444);
+MODULE_PARM_DESC(managed_irqs,
+ "set to false for user controlled irq affinity");
+
#define NVME_PCI_MIN_QUEUE_SIZE 2
#define NVME_PCI_MAX_QUEUE_SIZE 4095
static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
@@ -456,7 +461,7 @@ static void nvme_pci_map_queues(struct blk_mq_tag_set *set)
* affinity), so use the regular blk-mq cpu mapping
*/
map->queue_offset = qoff;
- if (i != HCTX_TYPE_POLL && offset)
+ if (managed_irqs && i != HCTX_TYPE_POLL && offset)
blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
else
blk_mq_map_queues(map);
@@ -2180,6 +2185,9 @@ static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
struct nvme_dev *dev = affd->priv;
unsigned int nr_read_queues, nr_write_queues = dev->nr_write_queues;
+ if (!nrirqs)
+ nrirqs = affd->post_vectors;
+
/*
* If there is no interrupt available for queues, ensure that
* the default queue is set to 1. The affinity set size is
@@ -2226,6 +2234,9 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
poll_queues = min(dev->nr_poll_queues, nr_io_queues - 1);
dev->io_queues[HCTX_TYPE_POLL] = poll_queues;
+ if (!managed_irqs)
+ affd.post_vectors = nr_io_queues - poll_queues;
+
/*
* Initialize for the single interrupt case, will be updated in
* nvme_calc_irq_sets().
--
2.43.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
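To see what the module parameter changes in the vector accounting, here is a hedged userspace sketch of the nvme_setup_irqs() arithmetic from the hunk above (struct and names are simplified stand-ins, not the actual driver code): with managed_irqs=false, every remaining I/O vector is declared a post_vector, so nothing is left for the managed affinity spread.

```c
#include <assert.h>

struct irq_affinity_sketch {
    unsigned int pre_vectors;   /* admin queue vector */
    unsigned int post_vectors;  /* vectors reserved as unmanaged */
};

/*
 * Sketch of the accounting in patch 2: poll queues never take an
 * interrupt, and with managed_irqs=false the remaining I/O vectors
 * are all reserved as post_vectors, leaving zero vectors for the
 * managed affinity spread. Returns the number of spreadable vectors.
 */
unsigned int spreadable_vectors(int managed_irqs,
                                unsigned int nr_io_queues,
                                unsigned int poll_queues,
                                struct irq_affinity_sketch *affd)
{
    unsigned int irq_queues = nr_io_queues - poll_queues;

    affd->pre_vectors = 1;
    affd->post_vectors = managed_irqs ? 0 : irq_queues;

    return irq_queues - affd->post_vectors;
}
```

The fallback in nvme_calc_irq_sets() (`if (!nrirqs) nrirqs = affd->post_vectors;`) then lets the driver size its queue sets from the unmanaged vector count when the managed spread contributed none.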
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
@ 2024-05-10 15:10 ` Christoph Hellwig
2024-05-10 16:20 ` Keith Busch
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-10 15:10 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, linux-kernel, hch, tglx, ming.lei, Keith Busch
On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Some people _really_ want to control their interrupt affinity.
So let them argue why. I'd rather have a really, really, really
good argument for this crap, and I'd like to hear it from the horse's
mouth.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] genirq/affinity: remove rsvd check against minvec
2024-05-10 14:14 [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Keith Busch
2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
@ 2024-05-10 15:15 ` Ming Lei
2024-05-10 16:47 ` Keith Busch
1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-10 15:15 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, linux-kernel, hch, tglx, Keith Busch
On Fri, May 10, 2024 at 07:14:58AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The reserved vectors are just the desired vectors that don't need to be
> managed.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
> kernel/irq/affinity.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 44a4eba80315c..74b7cccb51a16 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -113,9 +113,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
> unsigned int resv = affd->pre_vectors + affd->post_vectors;
> unsigned int set_vecs;
>
> - if (resv > minvec)
> - return 0;
> -
This behavior was introduced by commit 6f9a22bc5775 ("PCI/MSI: Ignore affinity
if pre/post vector count is more than min_vecs"), which was a bug fix.
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 15:10 ` Christoph Hellwig
@ 2024-05-10 16:20 ` Keith Busch
2024-05-10 23:50 ` Ming Lei
2024-05-13 7:33 ` Benjamin Meier
2024-05-13 13:12 ` Bart Van Assche
2 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2024-05-10 16:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme, linux-kernel, tglx, ming.lei
On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Some people _really_ want to control their interrupt affinity.
>
> So let them argue why. I'd rather have a really, really, really
> good argument for this crap, and I'd like to hear it from the horses
> mouth.
It's just prioritizing predictable user task scheduling for a subset of
CPUs instead of having consistently better storage performance.
We already have "isolcpus=managed_irq," parameter to prevent managed
interrupts from running on a subset of CPUs, so the use case is already
kind of supported. The problem with that parameter is it is a no-op if
the starting affinity spread contains only isolated CPUs.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] genirq/affinity: remove rsvd check against minvec
2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
@ 2024-05-10 16:47 ` Keith Busch
0 siblings, 0 replies; 21+ messages in thread
From: Keith Busch @ 2024-05-10 16:47 UTC (permalink / raw)
To: Ming Lei; +Cc: Keith Busch, linux-nvme, linux-kernel, hch, tglx
On Fri, May 10, 2024 at 11:15:54PM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 07:14:58AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > The reserved vectors are just the desired vectors that don't need to be
> > managed.
> >
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > ---
> > kernel/irq/affinity.c | 3 ---
> > 1 file changed, 3 deletions(-)
> >
> > diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> > index 44a4eba80315c..74b7cccb51a16 100644
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -113,9 +113,6 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
> > unsigned int resv = affd->pre_vectors + affd->post_vectors;
> > unsigned int set_vecs;
> >
> > - if (resv > minvec)
> > - return 0;
> > -
>
> This behavior is introduced in 6f9a22bc5775 ("PCI/MSI: Ignore affinity if pre/post
> vector count is more than min_vecs"), which is one bug fix.
Thanks for the pointer. Probably best I avoid messing with irq code just
for this use case, so I can have the driver disable the PCI_IRQ_AFFINITY
flag instead ... assuming hch doesn't "nak" it. :)
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 16:20 ` Keith Busch
@ 2024-05-10 23:50 ` Ming Lei
2024-05-11 0:41 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-10 23:50 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel, tglx
On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > Some people _really_ want to control their interrupt affinity.
> >
> > So let them argue why. I'd rather have a really, really, really
> > good argument for this crap, and I'd like to hear it from the horses
> > mouth.
>
> It's just prioritizing predictable user task scheduling for a subset of
> CPUs instead of having consistently better storage performance.
>
> We already have "isolcpus=managed_irq," parameter to prevent managed
> interrupts from running on a subset of CPUs, so the use case is already
> kind of supported. The problem with that parameter is it is a no-op if
> the starting affinity spread contains only isolated CPUs.
Can you explain a bit why it is a no-op? If only isolated CPUs are
spread on one queue, there will be no IO originated from these isolated
CPUs, that is exactly what the isolation needs.
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 23:50 ` Ming Lei
@ 2024-05-11 0:41 ` Keith Busch
2024-05-11 0:59 ` Ming Lei
2024-05-12 6:35 ` Thomas Gleixner
0 siblings, 2 replies; 21+ messages in thread
From: Keith Busch @ 2024-05-11 0:41 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel, tglx
On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > From: Keith Busch <kbusch@kernel.org>
> > > >
> > > > Some people _really_ want to control their interrupt affinity.
> > >
> > > So let them argue why. I'd rather have a really, really, really
> > > good argument for this crap, and I'd like to hear it from the horses
> > > mouth.
> >
> > It's just prioritizing predictable user task scheduling for a subset of
> > CPUs instead of having consistently better storage performance.
> >
> > We already have "isolcpus=managed_irq," parameter to prevent managed
> > interrupts from running on a subset of CPUs, so the use case is already
> > kind of supported. The problem with that parameter is it is a no-op if
> > the starting affinity spread contains only isolated CPUs.
>
> Can you explain a bit why it is a no-op? If only isolated CPUs are
> spread on one queue, there will be no IO originated from these isolated
> CPUs, that is exactly what the isolation needs.
The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
It only limits where the managed irq will assign the effective_cpus as a
best effort.
Example, I boot with a system with 4 threads, one nvme device, and
kernel parameter:
isolcpus=managed_irq,2-3
Run this:
for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done
Check /proc/interrupts | grep nvme0:
CPU0 CPU1 CPU2 CPU3
...
26: 1000 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1
27: 0 1004 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2
28: 0 0 1000 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3
29: 0 0 0 1043 PCI-MSIX-0000:00:05.0 4-edge nvme0q4
The isolcpus option did nothing because each vector's mask had just one
CPU; there was nowhere else the managed irq could be sent. The
documentation seems to indicate that this was by design, as a "best effort".
^ permalink raw reply [flat|nested] 21+ messages in thread
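The no-op described above can be modeled numerically. A toy round-robin spread (this mimics, rather than reproduces, the kernel's affinity spreading) shows why: when the queue count equals the CPU count, each vector's mask holds exactly one CPU, so masking out isolated CPUs would empty the mask entirely.

```c
#include <assert.h>

/*
 * Toy affinity spread: vector q out of nr_queues over nr_cpus CPUs,
 * round-robin, returned as a CPU bitmask. With nr_queues == nr_cpus
 * every mask has exactly one bit set, so removing isolated CPUs from
 * it would leave nothing -- hence managed_irq isolation can only be
 * "best effort" in that configuration.
 */
unsigned int queue_mask(unsigned int q, unsigned int nr_queues,
                        unsigned int nr_cpus)
{
    unsigned int mask = 0, cpu;

    for (cpu = q; cpu < nr_cpus; cpu += nr_queues)
        mask |= 1u << cpu;
    return mask;
}
```

With fewer vectors than CPUs there is slack: queue_mask(0, 2, 4) covers CPUs 0 and 2, so the effective CPU can be steered away from an isolated one. With one CPU per mask, as in the /proc/interrupts output above, there is nowhere to steer.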
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-11 0:41 ` Keith Busch
@ 2024-05-11 0:59 ` Ming Lei
2024-05-12 6:35 ` Thomas Gleixner
1 sibling, 0 replies; 21+ messages in thread
From: Ming Lei @ 2024-05-11 0:59 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel, tglx
On Fri, May 10, 2024 at 06:41:58PM -0600, Keith Busch wrote:
> On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> > On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > > From: Keith Busch <kbusch@kernel.org>
> > > > >
> > > > > Some people _really_ want to control their interrupt affinity.
> > > >
> > > > So let them argue why. I'd rather have a really, really, really
> > > > good argument for this crap, and I'd like to hear it from the horses
> > > > mouth.
> > >
> > > It's just prioritizing predictable user task scheduling for a subset of
> > > CPUs instead of having consistently better storage performance.
> > >
> > > We already have "isolcpus=managed_irq," parameter to prevent managed
> > > interrupts from running on a subset of CPUs, so the use case is already
> > > kind of supported. The problem with that parameter is it is a no-op if
> > > the starting affinity spread contains only isolated CPUs.
> >
> > Can you explain a bit why it is a no-op? If only isolated CPUs are
> > spread on one queue, there will be no IO originated from these isolated
> > CPUs, that is exactly what the isolation needs.
>
> The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
Please see commit a46c27026da1 ("blk-mq: don't schedule block kworker on isolated CPUs")
in for-6.10/block.
> It only limits where the managed irq will assign the effective_cpus as a
> best effort.
Most of the time it does work.
>
> Example, I boot with a system with 4 threads, one nvme device, and
> kernel parameter:
>
> isolcpus=managed_irq,2-3
>
> Run this:
>
> for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done
That is a problem with the test: when you isolate CPUs 2-3, you are not
expected to submit IO or run applications on those isolated CPUs in the
first place.
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-11 0:41 ` Keith Busch
2024-05-11 0:59 ` Ming Lei
@ 2024-05-12 6:35 ` Thomas Gleixner
2024-05-20 15:37 ` Christoph Hellwig
1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2024-05-12 6:35 UTC (permalink / raw)
To: Keith Busch, Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-kernel
On Fri, May 10 2024 at 18:41, Keith Busch wrote:
> On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
>> Can you explain a bit why it is a no-op? If only isolated CPUs are
>> spread on one queue, there will be no IO originated from these isolated
>> CPUs, that is exactly what the isolation needs.
>
> The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
> It only limits where the managed irq will assign the effective_cpus as a
> best effort.
>
> Example, I boot with a system with 4 threads, one nvme device, and
> kernel parameter:
>
> isolcpus=managed_irq,2-3
>
> Run this:
>
> for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done
>
> Check /proc/interrupts | grep nvme0:
>
> CPU0 CPU1 CPU2 CPU3
> ..
> 26: 1000 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1
> 27: 0 1004 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2
> 28: 0 0 1000 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3
> 29: 0 0 0 1043 PCI-MSIX-0000:00:05.0 4-edge nvme0q4
>
> The isolcpus did nothing becuase the each vector's mask had just one
> cpu; there was no where else that the managed irq could send it. The
> documentation seems to indicate that was by design as a "best effort".
That's expected as you pin the I/O operation on the isolated CPUs which
in turn makes them use the per CPU queue.
The isolated CPUs are only excluded for device management interrupts,
but not for the affinity spread of the queues.
Thanks,
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 15:10 ` Christoph Hellwig
2024-05-10 16:20 ` Keith Busch
@ 2024-05-13 7:33 ` Benjamin Meier
2024-05-13 8:39 ` Ming Lei
2024-05-13 13:12 ` Bart Van Assche
2 siblings, 1 reply; 21+ messages in thread
From: Benjamin Meier @ 2024-05-13 7:33 UTC (permalink / raw)
To: hch; +Cc: kbusch, kbusch, linux-kernel, linux-nvme, ming.lei, tglx
> From: Christoph Hellwig <hch@lst.de>
>
> So let them argue why. I'd rather have a really, really, really
> good argument for this crap, and I'd like to hear it from the horses
> mouth.
I reached out to Keith to explore the possibility of manually defining
which cores handle NVMe interrupts.
The application which we develop and maintain (in the company I work)
has very high requirements regarding latency. We have some isolated cores
and we run our application on those.
Our system is using kernel 5.4 which unfortunately does not support
"isolcpus=managed_irq". Actually, we did not even know about that
option, because we are focussed on kernel 5.4. It solves part
of our problem, but being able to specify where exactly interrupts
are running is still superior in our opinion.
E.g. assume the number of house-keeping cores is small, because we
want to have full control over the system. In our case we have threads
of different priorities where some get an exclusive core. Some other threads
share a core (or a group of cores) with other threads. Now we are still
happy to assign some interrupts to some of the cores which we consider as
"medium-priority". Due to the small number of non-isolated cores, it can
be tricky to assign all interrupts to those without a performance-penalty.
Given these requirements, manually specifying interrupt/core assignments
would offer greater flexibility and control over system performance.
Moreover, the proposed code changes appear minimal and have no
impact on existing functionalities.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-13 7:33 ` Benjamin Meier
@ 2024-05-13 8:39 ` Ming Lei
2024-05-13 8:59 ` Benjamin Meier
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-13 8:39 UTC (permalink / raw)
To: Benjamin Meier
Cc: hch, kbusch, kbusch, linux-kernel, linux-nvme, tglx, ming.lei
On Mon, May 13, 2024 at 09:33:27AM +0200, Benjamin Meier wrote:
> > From: Christoph Hellwig <hch@lst.de>
> >
> > So let them argue why. I'd rather have a really, really, really
> > good argument for this crap, and I'd like to hear it from the horses
> > mouth.
>
> I reached out to Keith to explore the possibility of manually defining
> which cores handle NVMe interrupts.
>
> The application which we develop and maintain (in the company I work)
> has very high requirements regarding latency. We have some isolated cores
Are these isolated cores controlled by kernel command line `isolcpus=`?
> and we run our application on those.
>
> Our system is using kernel 5.4 which unfortunately does not support
> "isolcpus=managed_irq". Actually, we did not even know about that
> option, because we are focussed on kernel 5.4. It solves part
> of our problem, but being able to specify where exactly interrupts
> are running is still superior in our opinion.
>
> E.g. assume the number of house-keeping cores is small, because we
> want to have full control over the system. In our case we have threads
> of different priorities where some get an exclusive core. Some other threads
> share a core (or a group of cores) with other threads. Now we are still
> happy to assign some interrupts to some of the cores which we consider as
> "medium-priority". Due to the small number of non-isolated cores, it can
So these "medium-priority" cores belong to the isolated CPU list, yet you still
expect NVMe interrupts to be handled on those cores, do I understand correctly?
If yes, I think your case can still be covered by 'isolcpus=managed_irq', whose
CPU list need not match the one given to plain `isolcpus=`: for example, exclude
the medium-priority cores from 'isolcpus=managed_irq' while including them in
plain `isolcpus=`.
> be tricky to assign all interrupts to those without a performance-penalty.
>
> Given these requirements, manually specifying interrupt/core assignments
> would offer greater flexibility and control over system performance.
> Moreover, the proposed code changes appear minimal and have no
> impact on existing functionalities.
It looks like your main concern is performance, but as Keith mentioned, the
proposed change may degrade NVMe performance too:
https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/
thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-13 8:39 ` Ming Lei
@ 2024-05-13 8:59 ` Benjamin Meier
2024-05-13 9:25 ` Ming Lei
0 siblings, 1 reply; 21+ messages in thread
From: Benjamin Meier @ 2024-05-13 8:59 UTC (permalink / raw)
To: ming.lei
Cc: benjamin.meier70, hch, kbusch, kbusch, linux-kernel, linux-nvme,
tglx
> > The application which we develop and maintain (in the company I work)
> > has very high requirements regarding latency. We have some isolated
> > cores
>
> Are these isolated cores controlled by kernel command line `isolcpus=`?
Yes, exactly.
> > and we run our application on those.
> >
> > Our system is using kernel 5.4 which unfortunately does not support
> > "isolcpus=managed_irq". Actually, we did not even know about that
> > option, because we are focussed on kernel 5.4. It solves part
> > of our problem, but being able to specify where exactly interrupts
> > are running is still superior in our opinion.
> >
> > E.g. assume the number of house-keeping cores is small, because we
> > want to have full control over the system. In our case we have threads
> > of different priorities where some get an exclusive core. Some other threads
> > share a core (or a group of cores) with other threads. Now we are still
> > happy to assign some interrupts to some of the cores which we consider as
> > "medium-priority". Due to the small number of non-isolated cores, it can
>
> So these "medium-priority" cores belong to isolated cpu list, you still expect
> NVMe interrupts can be handled on these cpu cores, do I understand correctly?
We want to avoid that the NVMe interrupts are on the "high priority"
cores. Having noise on them is quite bad for us, so we wanted to move
some interrupts to house-keeping cores and, if needed (due to
performance issues), keep some on those "medium-priority" isolated
cores. NVMe is not the highest priority for us, but possibly running
too much on the house-keeping cores could also be bad.
> If yes, I think your case still can be covered with 'isolcpus=managed_irq' which
> needn't to be same with cpu cores specified from `isolcpus=`, such as
> excluding medium-priority cores from 'isolcpus=managed_irq', and
> meantime include them in plain `isolcpus=`.
Unfortunately, our kernel version (5.4) does not support "managed_irq",
and due to that we're happy with the patch. However, I see that for
newer kernel versions the already existing arguments could be
sufficient to do everything.
> > be tricky to assign all interrupts to those without a
> > performance-penalty.
> >
> > Given these requirements, manually specifying interrupt/core assignments
> > would offer greater flexibility and control over system performance.
> > Moreover, the proposed code changes appear minimal and have no
> > impact on existing functionalities.
>
> Looks your main concern is performance, but as Keith mentioned, the proposed
> change may degrade nvme perf too:
>
> https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/
Yes, but for NVMe it's not that critical. The most important point for us
is to keep them away from our "high-priority" cores. We still wanted to
have control over where those interrupts run, partly because we simply
did not know about the "managed_irq" option.
Thanks,
Benjamin
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-13 8:59 ` Benjamin Meier
@ 2024-05-13 9:25 ` Ming Lei
2024-05-13 12:33 ` Benjamin Meier
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-13 9:25 UTC (permalink / raw)
To: Benjamin Meier; +Cc: hch, kbusch, kbusch, linux-kernel, linux-nvme, tglx
On Mon, May 13, 2024 at 10:59:02AM +0200, Benjamin Meier wrote:
> > > The application which we develop and maintain (in the company I work)
> > > has very high requirements regarding latency. We have some isolated
> > > cores
> >
> > Are these isolated cores controlled by kernel command line `isolcpus=`?
>
> Yes, exactly.
>
> > > and we run our application on those.
> > >
> > > Our system is using kernel 5.4 which unfortunately does not support
> > > "isolcpus=managed_irq". Actually, we did not even know about that
> > > option, because we are focussed on kernel 5.4. It solves part
> > > of our problem, but being able to specify where exactly interrupts
> > > are running is still superior in our opinion.
> > >
> > > E.g. assume the number of house-keeping cores is small, because we
> > > want to have full control over the system. In our case we have threads
> > > of different priorities where some get an exclusive core. Some other threads
> > > share a core (or a group of cores) with other threads. Now we are still
> > > happy to assign some interrupts to some of the cores which we consider as
> > > "medium-priority". Due to the small number of non-isolated cores, it can
> >
> > So these "medium-priority" cores belong to isolated cpu list, you still expect
> > NVMe interrupts can be handled on these cpu cores, do I understand correctly?
>
> We want to avoid that the NVMe interrupts are on the "high priority" cores.
> Having noise on them is quite bad for us, so we wanted to move some interrupts
> to house-keeping cores and if needed (due to performance issues) keep some on
> those "medium-priority" isolated cores. NVMe is not that highest priority for us,
> but possibly running too much on the house-keeping cores could also be bad.
>
> > If yes, I think your case still can be covered with 'isolcpus=managed_irq' which
> > needn't to be same with cpu cores specified from `isolcpus=`, such as
> > excluding medium-priority cores from 'isolcpus=managed_irq', and
> > meantime include them in plain `isolcpus=`.
>
> Unfortunately, our kernel version (5.4) does not support "managed_irq" and due
> to that we're happy with the patch. However, I see that for newer kernel
> versions the already existing arguments could be sufficient to do everything.
'isolcpus=managed_irq' enablement patches are small, and shouldn't be very
hard to backport.
>
> > > be tricky to assign all interrupts to those without a
> > > performance-penalty.
> > >
> > > Given these requirements, manually specifying interrupt/core assignments
> > > would offer greater flexibility and control over system performance.
> > > Moreover, the proposed code changes appear minimal and have no
> > > impact on existing functionalities.
> >
> > Looks your main concern is performance, but as Keith mentioned, the proposed
> > change may degrade nvme perf too:
> >
> > https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/
>
> Yes, but for NVMe it's not that critical. The most important point for us is
> to keep them away from our "high-priority" cores. We still wanted to have
> control
> where we run those interrupts, but also because we just did not know the
> "managed_irq"
> option.
OK, thanks for sharing the input!
Now, from the upstream viewpoint, 'isolcpus=managed_irq' should work for your
case, and it seems unnecessary to support NVMe unmanaged irqs for this
requirement, at least.
thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-13 9:25 ` Ming Lei
@ 2024-05-13 12:33 ` Benjamin Meier
0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Meier @ 2024-05-13 12:33 UTC (permalink / raw)
To: ming.lei
Cc: benjamin.meier70, hch, kbusch, kbusch, linux-kernel, linux-nvme,
tglx
> 'isolcpus=managed_irq' enablement patches are small, and shouldn't be very
> hard to backport.
I have big respect for kernel code, and for non-kernel devs it's
probably not so easy :)
But yeah, we'll look into this.
> > > > be tricky to assign all interrupts to those without a
> > > > performance-penalty.
> > > >
> > > > Given these requirements, manually specifying interrupt/core assignments
> > > > would offer greater flexibility and control over system performance.
> > > > Moreover, the proposed code changes appear minimal and have no
> > > > impact on existing functionalities.
> > >
> > > Looks your main concern is performance, but as Keith mentioned, the proposed
> > > change may degrade nvme perf too:
> > >
> > > https://lore.kernel.org/linux-nvme/Zj6745UDnwX1BteO@kbusch-mbp.dhcp.thefacebook.com/
> >
> > Yes, but for NVMe it's not that critical. The most important point for us is
> > to keep them away from our "high-priority" cores. We still wanted to have
> > control where we run those interrupts, but also because we just did not
> > know the "managed_irq" option.
>
> OK, thanks for share the input!
>
> Now from upstream viewpoint, 'isolcpus=managed_irq' should work for your case,
> and seems not necessary to support nvme unmanaged irq for this requirement
> at least.
Yes, probably that will do it. Personally, I still think it's a nice
thing if it's possible to assign interrupts to specific cores, but
practically the advantages are likely not that big compared to
'isolcpus=managed_irq'.
Thanks for all the explanations
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-10 15:10 ` Christoph Hellwig
2024-05-10 16:20 ` Keith Busch
2024-05-13 7:33 ` Benjamin Meier
@ 2024-05-13 13:12 ` Bart Van Assche
2 siblings, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2024-05-13 13:12 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: linux-nvme, linux-kernel, tglx, ming.lei, Keith Busch
On 5/10/24 08:10, Christoph Hellwig wrote:
> On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> Some people _really_ want to control their interrupt affinity.
>
> So let them argue why. I'd rather have a really, really, really
> good argument for this crap, and I'd like to hear it from the horses
> mouth.
Performance can be increased by assigning interrupts carefully, especially
in storage appliances that have to process a large number of network and
storage interrupts. With careful assignment, the number of completions
processed per interrupt goes up, and hence performance increases as well.
In 2014 I was working on a product that benefited from this approach.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-12 6:35 ` Thomas Gleixner
@ 2024-05-20 15:37 ` Christoph Hellwig
2024-05-20 20:34 ` Thomas Gleixner
2024-05-21 2:31 ` Ming Lei
0 siblings, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2024-05-20 15:37 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Keith Busch, Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme,
linux-kernel
On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
> That's expected as you pin the I/O operation on the isolated CPUs which
> in turn makes them use the per CPU queue.
>
> The isolated CPUs are only excluded for device management interrupts,
> but not for the affinity spread of the queues.
We'll probably need a version of isolcpus that also excludes the
interrupt spread given that users are asking for it. And I'd much
prefer that over adding random module options to every driver to disable
managed interrupts.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-20 15:37 ` Christoph Hellwig
@ 2024-05-20 20:34 ` Thomas Gleixner
2024-05-21 2:31 ` Ming Lei
1 sibling, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2024-05-20 20:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme,
linux-kernel
On Mon, May 20 2024 at 17:37, Christoph Hellwig wrote:
> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
>> That's expected as you pin the I/O operation on the isolated CPUs which
>> in turn makes them use the per CPU queue.
>>
>> The isolated CPUs are only excluded for device management interrupts,
>> but not for the affinity spread of the queues.
>
> We'll probably need a version of isolcpus that also excludes the
> interrupt spread given that users are asking for it. And I'd much
> prefer that over adding random module options to every driver to disable
> managed interrupts.
No objections from my side.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-20 15:37 ` Christoph Hellwig
2024-05-20 20:34 ` Thomas Gleixner
@ 2024-05-21 2:31 ` Ming Lei
2024-05-21 8:38 ` Thomas Gleixner
1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-05-21 2:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Thomas Gleixner, Keith Busch, Keith Busch, linux-nvme,
linux-kernel
On Mon, May 20, 2024 at 05:37:42PM +0200, Christoph Hellwig wrote:
> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
> > That's expected as you pin the I/O operation on the isolated CPUs which
> > in turn makes them use the per CPU queue.
> >
> > The isolated CPUs are only excluded for device management interrupts,
> > but not for the affinity spread of the queues.
>
> We'll probably need a version of isolcpus that also excludes the
> interrupt spread given that users are asking for it. And I'd much
> prefer that over adding random module options to every driver to disable
> managed interrupts.
BTW, isolcpus has been marked as deprecated, and it can't be adjusted at
runtime.
isolcpus= [KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
[Deprecated - use cpusets instead]
Thanks,
Ming
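The cpuset alternative pointed to by that documentation note maps onto
the cgroup v2 "isolated" partition; a sketch, assuming a cgroup-v2 mount
at /sys/fs/cgroup, a kernel recent enough to support the 'isolated'
partition value, and root privileges (the path name and CPU list are
illustrative):

```
# Requires the cpuset controller enabled in the parent's
# cgroup.subtree_control and exclusive ownership of the listed CPUs.
mkdir /sys/fs/cgroup/isolated
echo "2-3" > /sys/fs/cgroup/isolated/cpuset.cpus
echo isolated > /sys/fs/cgroup/isolated/cpuset.cpus.partition
```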
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-21 2:31 ` Ming Lei
@ 2024-05-21 8:38 ` Thomas Gleixner
2024-05-21 10:06 ` Frederic Weisbecker
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2024-05-21 8:38 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig
Cc: Keith Busch, Keith Busch, linux-nvme, linux-kernel,
Frederic Weisbecker
On Tue, May 21 2024 at 10:31, Ming Lei wrote:
> On Mon, May 20, 2024 at 05:37:42PM +0200, Christoph Hellwig wrote:
>> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
>> > That's expected as you pin the I/O operation on the isolated CPUs which
>> > in turn makes them use the per CPU queue.
>> >
>> > The isolated CPUs are only excluded for device management interrupts,
>> > but not for the affinity spread of the queues.
>>
>> We'll probably need a version of isolcpus that also excludes the
>> interrupt spread given that users are asking for it. And I'd much
>> prefer that over adding random module options to every driver to disable
>> managed interrupts.
>
> BTW, isolcpus has been marked as deprecated, and it can't be adjusted at
> runtime.
Which is far from reality, as cpusets do not allow doing what isolcpus
does today.
Also runtime adjusting managed interrupts needs way more thoughts.
Thanks
tglx
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts
2024-05-21 8:38 ` Thomas Gleixner
@ 2024-05-21 10:06 ` Frederic Weisbecker
0 siblings, 0 replies; 21+ messages in thread
From: Frederic Weisbecker @ 2024-05-21 10:06 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Ming Lei, Christoph Hellwig, Keith Busch, Keith Busch, linux-nvme,
linux-kernel
On Tue, May 21, 2024 at 10:38:25AM +0200, Thomas Gleixner wrote:
> On Tue, May 21 2024 at 10:31, Ming Lei wrote:
>
> > On Mon, May 20, 2024 at 05:37:42PM +0200, Christoph Hellwig wrote:
> >> On Sun, May 12, 2024 at 08:35:55AM +0200, Thomas Gleixner wrote:
> >> > That's expected as you pin the I/O operation on the isolated CPUs which
> >> > in turn makes them use the per CPU queue.
> >> >
> >> > The isolated CPUs are only excluded for device management interrupts,
> >> > but not for the affinity spread of the queues.
> >>
> >> We'll probably need a version of isolcpus that also excludes the
> >> interrupt spread given that users are asking for it. And I'd much
> >> prefer that over adding random module options to every driver to disable
> >> managed interrupts.
> >
> > BTW, isolcpus has been marked as deprecated, and it can't be adjusted at
> > runtime.
>
> Which is far from reality, as cpusets do not allow doing what isolcpus
> does today.
>
> Also runtime adjusting managed interrupts needs way more thoughts.
I'll remove that comment (unless someone beats me to it?). We used to
think that cpusets would indeed deprecate isolcpus, but for several
reasons this will never be the case.
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2024-05-21 10:07 UTC | newest]
Thread overview: 21+ messages
-- links below jump to the message on this page --
2024-05-10 14:14 [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Keith Busch
2024-05-10 14:14 ` [PATCH 2/2] nvme-pci: allow unmanaged interrupts Keith Busch
2024-05-10 15:10 ` Christoph Hellwig
2024-05-10 16:20 ` Keith Busch
2024-05-10 23:50 ` Ming Lei
2024-05-11 0:41 ` Keith Busch
2024-05-11 0:59 ` Ming Lei
2024-05-12 6:35 ` Thomas Gleixner
2024-05-20 15:37 ` Christoph Hellwig
2024-05-20 20:34 ` Thomas Gleixner
2024-05-21 2:31 ` Ming Lei
2024-05-21 8:38 ` Thomas Gleixner
2024-05-21 10:06 ` Frederic Weisbecker
2024-05-13 7:33 ` Benjamin Meier
2024-05-13 8:39 ` Ming Lei
2024-05-13 8:59 ` Benjamin Meier
2024-05-13 9:25 ` Ming Lei
2024-05-13 12:33 ` Benjamin Meier
2024-05-13 13:12 ` Bart Van Assche
2024-05-10 15:15 ` [PATCH 1/2] genirq/affinity: remove rsvd check against minvec Ming Lei
2024-05-10 16:47 ` Keith Busch