* [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Ming Lei @ 2024-07-02 10:41 UTC
To: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg
Cc: Lawrence Troup, Marcelo Tosatti, Ming Lei
From: Keith Busch <kbusch@kernel.org>
People _really_ want to control their interrupt affinity in some
cases, such as Openshift with the Performance profile, in which each
irq's affinity is completely specified from userspace. It turns out
that 'isolcpus=managed_irq' isn't enough.

Add a module parameter to allow unmanaged interrupts, just as some
SCSI drivers already do.
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
v2->v3:
- rebase on for-next
- add openshift use case
v1->v2: skip the AFFINITY vector allocation if the parameter is
provided, instead of trying to make the vector code handle all post_vectors.
drivers/nvme/host/pci.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5d8035218de9..a39c99c9b64d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -63,6 +63,11 @@ MODULE_PARM_DESC(sgl_threshold,
"Use SGLs when average request segment size is larger or equal to "
"this size. Use 0 to disable SGLs.");
+static bool managed_irqs = true;
+module_param(managed_irqs, bool, 0444);
+MODULE_PARM_DESC(managed_irqs,
+ "set to false for user controlled irq affinity");
+
#define NVME_PCI_MIN_QUEUE_SIZE 2
#define NVME_PCI_MAX_QUEUE_SIZE 4095
static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
@@ -456,7 +461,7 @@ static void nvme_pci_map_queues(struct blk_mq_tag_set *set)
* affinity), so use the regular blk-mq cpu mapping
*/
map->queue_offset = qoff;
- if (i != HCTX_TYPE_POLL && offset)
+ if (managed_irqs && i != HCTX_TYPE_POLL && offset)
blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
else
blk_mq_map_queues(map);
@@ -2226,6 +2231,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
};
unsigned int irq_queues, poll_queues;
unsigned int flags = PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY;
+ int ret;
/*
* Poll queues don't need interrupts, but we need at least one I/O queue
@@ -2251,8 +2257,16 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
irq_queues += (nr_io_queues - poll_queues);
if (dev->ctrl.quirks & NVME_QUIRK_BROKEN_MSI)
flags &= ~PCI_IRQ_MSI;
- return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
+
+ if (managed_irqs)
+ return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
&affd);
+
+ flags &= ~PCI_IRQ_AFFINITY;
+ ret = pci_alloc_irq_vectors(pdev, 1, irq_queues, flags);
+ if (ret > 0)
+ nvme_calc_irq_sets(&affd, ret - 1);
+ return ret;
}
static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
--
2.44.0
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Christoph Hellwig @ 2024-07-02 11:50 UTC
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
Lawrence Troup, Marcelo Tosatti
On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> People _really_ want to control their interrupt affinity in some
> cases, such as Openshift with Performance profile, in which each
> irq's affinity is completely specified from userspace. Turns out
> that 'isolcpus=managed_irqs' isn't enough.
>
> Add module parameter to allow unmanaged interrupts, just as some
> SCSI drivers are doing.
Same as before: hell no. We can't just add hacky global kernel
parameters everywhere. We need the cpu isolation infrastructure to
work properly instead of piling hacks of hacks in every relevant driver.
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Ming Lei @ 2024-07-02 12:12 UTC
To: Christoph Hellwig
Cc: Keith Busch, linux-nvme, Sagi Grimberg, Lawrence Troup,
Marcelo Tosatti, ming.lei
On Tue, Jul 02, 2024 at 01:50:02PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > People _really_ want to control their interrupt affinity in some
> > cases, such as Openshift with Performance profile, in which each
> > irq's affinity is completely specified from userspace. Turns out
> > that 'isolcpus=managed_irqs' isn't enough.
> >
> > Add module parameter to allow unmanaged interrupts, just as some
> > SCSI drivers are doing.
>
> Same as before: hell no. We can't just add hacky global kernel
> parameters everywhere. We need the cpu isolation infrastructure to
> work properly instead of piling hacks of hacks in every relevant driver.
Per my understanding, the cpu isolation infrastructure can't work for
Openshift, where IO workloads may be run by applications executing on
isolated CPUs, while userspace still expects interrupts to be triggered
only on user-specified CPU cores, in a controllable way.

Marcelo and Lawrence may have more input in this area.

Also, irq allocation really belongs to device & driver code, so how can
that be a hack? We may not even be able to abstract a public API in the
block layer for handling irq-related things.
Thanks,
Ming
* RE: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Lawrence Troup (ltroup) @ 2024-07-02 12:20 UTC
To: Ming Lei, Christoph Hellwig
Cc: Keith Busch, linux-nvme@lists.infradead.org, Sagi Grimberg,
Marcelo Tosatti, Franck Baudin
Openshift needs the ability to dynamically move the IRQs of all drivers away from a specific set of CPUs at the point that an isolated workload starts running on those CPUs and requires high performance guarantees, i.e. no HW interrupts occurring. To achieve this, dynamic setting of smp_affinity for all drivers is used - at the moment the NVMe driver does not support this, so the NVMe IRQs remain on CPUs they should not be on, and so impact the performance of the isolated workload.

The ability to set smp_affinity seems like a generic feature that should be supported by all drivers - I'm unclear how adding it to the NVMe driver can be viewed as a hack.
Thanks,
Lawrence
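(For concreteness: the dynamic affinity setting described above is just a
write to /proc/irq/<N>/smp_affinity_list. Below is a minimal userspace
sketch; the irq number 45 and the CPU list "0-3" are made up for
illustration. Note that for a managed irq the kernel typically rejects such
a write, which is exactly the limitation being discussed here.)

/* Sketch only: move one irq to a user-chosen set of CPUs from userspace.
 * Values are illustrative; error handling is minimal. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

static int move_irq_to_cpus(int irq, const char *cpulist)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fputs(cpulist, f) < 0) {
		fclose(f);
		return -EIO;
	}
	/* the write only reaches the kernel on flush/close, so check fclose() */
	return fclose(f) ? -errno : 0;
}

int main(void)
{
	int ret = move_irq_to_cpus(45, "0-3");	/* made-up irq and CPU list */

	if (ret)
		fprintf(stderr, "failed to set irq affinity: %s\n", strerror(-ret));
	return ret ? 1 : 0;
}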
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Christoph Hellwig @ 2024-07-02 15:28 UTC
To: Lawrence Troup (ltroup)
Cc: Ming Lei, Christoph Hellwig, Keith Busch,
linux-nvme@lists.infradead.org, Sagi Grimberg, Marcelo Tosatti,
Franck Baudin
On Tue, Jul 02, 2024 at 12:20:30PM +0000, Lawrence Troup (ltroup) wrote:
> Openshift needs the ability to dynamically move IRQs of all drivers away from a specific set of CPUs, at the point that an isolated workload starts running on those CPUs, and requires high performance guarantees, i.e. no HW interrupts to occur. To achieve this, dynamic setting of the smp_affinity for all drivers is used - at the moment, the NVME driver does not support this, so the NVME IRQs remain running on CPUs they should not be on, and so impact performance of the isolated workload.
Then you need to create a core kernel interface that moves all managed
IRQs of these CPUs. Without that you have a never ending whack a mole
adding hacks to the drivers of the day that you care about.
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Ming Lei @ 2024-07-03 1:57 UTC
To: Christoph Hellwig
Cc: Lawrence Troup (ltroup), Keith Busch,
linux-nvme@lists.infradead.org, Sagi Grimberg, Marcelo Tosatti,
Franck Baudin
On Tue, Jul 02, 2024 at 05:28:09PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 12:20:30PM +0000, Lawrence Troup (ltroup) wrote:
> > Openshift needs the ability to dynamically move IRQs of all drivers away from a specific set of CPUs, at the point that an isolated workload starts running on those CPUs, and requires high performance guarantees, i.e. no HW interrupts to occur. To achieve this, dynamic setting of the smp_affinity for all drivers is used - at the moment, the NVME driver does not support this, so the NVME IRQs remain running on CPUs they should not be on, and so impact performance of the isolated workload.
>
> Then you need to create a core kernel interface that moves all managed
> IRQs of these CPUs.
That is basically not doable, or at least very hard, since the blk-mq
queue mapping depends on the managed irq affinity.
> Without that you have a never ending whack a mole
> adding hacks to the drivers of the day that you care about.
So far, most popular SCSI host drivers support this option via a module
parameter, and the only remaining one may be nvme-pci.
Thanks,
Ming
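(To make the dependency above concrete: blk-mq derives its CPU-to-hw-queue
map directly from the managed irq affinity masks. The sketch below is a
paraphrase of what blk_mq_pci_map_queues() does, not the exact upstream
code.)

/* Paraphrase of the PCI queue-mapping helper: every CPU in a vector's
 * affinity mask is mapped to the hw queue behind that vector. If userspace
 * could later move the irq, this map would silently go stale - which is why
 * the patch falls back to blk_mq_map_queues() for unmanaged irqs. */
#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <linux/pci.h>

static void pci_map_queues_sketch(struct blk_mq_queue_map *qmap,
				  struct pci_dev *pdev, int offset)
{
	unsigned int queue, cpu;

	for (queue = 0; queue < qmap->nr_queues; queue++) {
		const struct cpumask *mask =
			pci_irq_get_affinity(pdev, queue + offset);

		if (!mask) {			/* no affinity info available */
			blk_mq_map_queues(qmap);	/* regular cpu mapping */
			return;
		}
		for_each_cpu(cpu, mask)
			qmap->mq_map[cpu] = qmap->queue_offset + queue;
	}
}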
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Christoph Hellwig @ 2024-07-03 5:24 UTC
To: Ming Lei
Cc: Christoph Hellwig, Lawrence Troup (ltroup), Keith Busch,
linux-nvme@lists.infradead.org, Sagi Grimberg, Marcelo Tosatti,
Franck Baudin
On Wed, Jul 03, 2024 at 09:57:29AM +0800, Ming Lei wrote:
> So far, most popular SCSI hosts support the option via module parameter, and
> the only remainder could be nvme-pci.
Until something new comes along and then we need to play this game again.
I might sound like a broken record but we need to fix this for real.
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Daniel Wagner @ 2024-07-02 16:28 UTC
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
Lawrence Troup, Marcelo Tosatti
On Tue, Jul 02, 2024 at 08:12:11PM GMT, Ming Lei wrote:
> On Tue, Jul 02, 2024 at 01:50:02PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > People _really_ want to control their interrupt affinity in some
> > > cases, such as Openshift with Performance profile, in which each
> > > irq's affinity is completely specified from userspace. Turns out
> > > that 'isolcpus=managed_irqs' isn't enough.
> > >
> > > Add module parameter to allow unmanaged interrupts, just as some
> > > SCSI drivers are doing.
> >
> > Same as before: hell no. We can't just add hacky global kernel
> > parameters everywhere. We need the cpu isolation infrastructure to
> > work properly instead of piling hacks of hacks in every relevant driver.
>
> Per my understanding, here cpu isolation infrastructure can't work for
> Openshift, in which IO workload can be run on applications which are executed
> on isolated CPUs, meantime userspace do expect that interrupts can be
> triggered on user-specified CPU cores only in controllable way.
>
> Marcelo and Lawrence may have more input in this area.
>
> Also irq allocation really belongs to device & driver stuff, how can that be
> hack? We even may not abstract public API in block layer for handling
> irq related thing.
I am confused. I thought you told me that my series 'nvme-pci: honor
isolcpus configuration' is not necessary, but you still need this patch
to get the affinity sorted out? Wouldn't it make sense to figure out how
we can make my series work for your use case as well? E.g. we could
introduce another HK type (io_queue) to control the affinity. This would
decouple it from the managed_irq option.
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Ming Lei @ 2024-07-03 1:51 UTC
To: Daniel Wagner
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
Lawrence Troup, Marcelo Tosatti
On Tue, Jul 02, 2024 at 06:28:19PM +0200, Daniel Wagner wrote:
> On Tue, Jul 02, 2024 at 08:12:11PM GMT, Ming Lei wrote:
> > On Tue, Jul 02, 2024 at 01:50:02PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> > > > From: Keith Busch <kbusch@kernel.org>
> > > >
> > > > People _really_ want to control their interrupt affinity in some
> > > > cases, such as Openshift with Performance profile, in which each
> > > > irq's affinity is completely specified from userspace. Turns out
> > > > that 'isolcpus=managed_irqs' isn't enough.
> > > >
> > > > Add module parameter to allow unmanaged interrupts, just as some
> > > > SCSI drivers are doing.
> > >
> > > Same as before: hell no. We can't just add hacky global kernel
> > > parameters everywhere. We need the cpu isolation infrastructure to
> > > work properly instead of piling hacks of hacks in every relevant driver.
> >
> > Per my understanding, here cpu isolation infrastructure can't work for
> > Openshift, in which IO workload can be run on applications which are executed
> > on isolated CPUs, meantime userspace do expect that interrupts can be
> > triggered on user-specified CPU cores only in controllable way.
> >
> > Marcelo and Lawrence may have more input in this area.
> >
> > Also irq allocation really belongs to device & driver stuff, how can that be
> > hack? We even may not abstract public API in block layer for handling
> > irq related thing.
>
> I am confused. I though you told me that my series 'nvme-pci: honor
> isolcpus configuration' is not necessary. But you still need this patch
Your patch basically fixes nothing, and meanwhile it introduces regressions.
But I don't object to the approach if the blk-mq regressions can be solved.
> to get the affinity sorted out? Wouldn't it make sense to figure out how
> we can make my series working also for your use case? E.g. we could
> introduce another HK type (io_queue) to control the affinity. This would
> decouple if from the managed_irq option.
Adding a new HK type can't help this issue, because the Openshift environment
needs to control each irq's affinity dynamically on its own, and IO workloads
may even be run on isolated CPUs.
Thanks,
Ming
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Hannes Reinecke @ 2024-07-16 10:18 UTC
To: Ming Lei, Daniel Wagner
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
Lawrence Troup, Marcelo Tosatti
On 7/3/24 03:51, Ming Lei wrote:
> On Tue, Jul 02, 2024 at 06:28:19PM +0200, Daniel Wagner wrote:
>> On Tue, Jul 02, 2024 at 08:12:11PM GMT, Ming Lei wrote:
>>> On Tue, Jul 02, 2024 at 01:50:02PM +0200, Christoph Hellwig wrote:
>>>> On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
>>>>> From: Keith Busch <kbusch@kernel.org>
>>>>>
>>>>> People _really_ want to control their interrupt affinity in some
>>>>> cases, such as Openshift with Performance profile, in which each
>>>>> irq's affinity is completely specified from userspace. Turns out
>>>>> that 'isolcpus=managed_irqs' isn't enough.
>>>>>
>>>>> Add module parameter to allow unmanaged interrupts, just as some
>>>>> SCSI drivers are doing.
>>>>
>>>> Same as before: hell no. We can't just add hacky global kernel
>>>> parameters everywhere. We need the cpu isolation infrastructure to
>>>> work properly instead of piling hacks of hacks in every relevant driver.
>>>
>>> Per my understanding, here cpu isolation infrastructure can't work for
>>> Openshift, in which IO workload can be run on applications which are executed
>>> on isolated CPUs, meantime userspace do expect that interrupts can be
>>> triggered on user-specified CPU cores only in controllable way.
>>>
>>> Marcelo and Lawrence may have more input in this area.
>>>
>>> Also irq allocation really belongs to device & driver stuff, how can that be
>>> hack? We even may not abstract public API in block layer for handling
>>> irq related thing.
>>
>> I am confused. I though you told me that my series 'nvme-pci: honor
>> isolcpus configuration' is not necessary. But you still need this patch
>
> Your patch fixes nothing basically, meantime it introduces regression. But
> I don't object the approach if blk-mq regressions can be solved.
>
>> to get the affinity sorted out? Wouldn't it make sense to figure out how
>> we can make my series working also for your use case? E.g. we could
>> introduce another HK type (io_queue) to control the affinity. This would
>> decouple if from the managed_irq option.
>
> Adding new HK type can't help this issue because Openshift environment needs
> to control each irq's affinity by themselves dynamically, and even IO workload
> may be run on isolated CPUs.
>
Understood, but that is what Daniel is working on: splitting the pool of
available interrupts such that each side (housekeeping and isolated CPUs)
gets a fair share of interrupts.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Marcelo Tosatti @ 2024-07-15 16:03 UTC
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
Lawrence Troup
On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> People _really_ want to control their interrupt affinity in some
> cases, such as Openshift with Performance profile, in which each
> irq's affinity is completely specified from userspace. Turns out
> that 'isolcpus=managed_irqs' isn't enough.
>
> Add module parameter to allow unmanaged interrupts, just as some
> SCSI drivers are doing.
>
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> v2->v3:
> - rebase on for-next
> - add openshift use case
>
> v1->v2: skip the the AFFINITY vector allocation if the parameter is
> provided instead trying to make the vector code handle all post_vectors.
>
> drivers/nvme/host/pci.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 5d8035218de9..a39c99c9b64d 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -63,6 +63,11 @@ MODULE_PARM_DESC(sgl_threshold,
> "Use SGLs when average request segment size is larger or equal to "
> "this size. Use 0 to disable SGLs.");
>
> +static bool managed_irqs = true;
> +module_param(managed_irqs, bool, 0444);
> +MODULE_PARM_DESC(managed_irqs,
> + "set to false for user controlled irq affinity");
> +
What if you set "static bool managed_irqs" to false when isolcpus is being used?
For example:
	if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
		managed_irqs = false;
Then there is no additional parameter to tune (which addresses
Christoph's concern).
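(A sketch of how that suggestion could be wired into the driver; purely
illustrative - the helper name is made up, and housekeeping_enabled()
simply reports whether a housekeeping mask has been configured for the
given type.)

#include <linux/sched/isolation.h>

/* Illustrative only: derive the default from the CPU-isolation setup
 * instead of adding a second knob. Keying off HK_TYPE_MANAGED_IRQ assumes
 * the admin already booted with isolcpus=managed_irq,... */
static bool nvme_use_managed_irqs(void)
{
	if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
		return false;	/* admin wants irqs steered away from isolated CPUs */
	return managed_irqs;	/* module parameter from the patch above */
}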
> #define NVME_PCI_MIN_QUEUE_SIZE 2
> #define NVME_PCI_MAX_QUEUE_SIZE 4095
> static int io_queue_depth_set(const char *val, const struct kernel_param *kp);
> @@ -456,7 +461,7 @@ static void nvme_pci_map_queues(struct blk_mq_tag_set *set)
> * affinity), so use the regular blk-mq cpu mapping
> */
> map->queue_offset = qoff;
> - if (i != HCTX_TYPE_POLL && offset)
> + if (managed_irqs && i != HCTX_TYPE_POLL && offset)
> blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
> else
> blk_mq_map_queues(map);
> @@ -2226,6 +2231,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
> };
> unsigned int irq_queues, poll_queues;
> unsigned int flags = PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY;
> + int ret;
>
> /*
> * Poll queues don't need interrupts, but we need at least one I/O queue
> @@ -2251,8 +2257,16 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
> irq_queues += (nr_io_queues - poll_queues);
> if (dev->ctrl.quirks & NVME_QUIRK_BROKEN_MSI)
> flags &= ~PCI_IRQ_MSI;
> - return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
> +
> + if (managed_irqs)
> + return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, flags,
> &affd);
> +
> + flags &= ~PCI_IRQ_AFFINITY;
> + ret = pci_alloc_irq_vectors(pdev, 1, irq_queues, flags);
> + if (ret > 0)
> + nvme_calc_irq_sets(&affd, ret - 1);
> + return ret;
> }
>
> static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
> --
> 2.44.0
>
>
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Marcelo Tosatti @ 2024-07-15 16:23 UTC
To: Ming Lei
Cc: Christoph Hellwig, Keith Busch, linux-nvme, Sagi Grimberg,
Lawrence Troup
On Mon, Jul 15, 2024 at 01:03:02PM -0300, Marcelo Tosatti wrote:
> On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > People _really_ want to control their interrupt affinity in some
> > cases, such as Openshift with Performance profile, in which each
> > irq's affinity is completely specified from userspace. Turns out
> > that 'isolcpus=managed_irqs' isn't enough.
> >
> > Add module parameter to allow unmanaged interrupts, just as some
> > SCSI drivers are doing.
> >
> > Cc: Marcelo Tosatti <mtosatti@redhat.com>
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > v2->v3:
> > - rebase on for-next
> > - add openshift use case
> >
> > v1->v2: skip the the AFFINITY vector allocation if the parameter is
> > provided instead trying to make the vector code handle all post_vectors.
> >
> > drivers/nvme/host/pci.c | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 5d8035218de9..a39c99c9b64d 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -63,6 +63,11 @@ MODULE_PARM_DESC(sgl_threshold,
> > "Use SGLs when average request segment size is larger or equal to "
> > "this size. Use 0 to disable SGLs.");
> >
> > +static bool managed_irqs = true;
> > +module_param(managed_irqs, bool, 0444);
> > +MODULE_PARM_DESC(managed_irqs,
> > + "set to false for user controlled irq affinity");
> > +
>
> What if you set "static bool managed_irqs" to false when isolcpus is being used?
>
> For example:
>
> if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
> managed_irqs = false;
Well, that is confusing.
Maybe just:
	if (housekeeping_enabled(HK_FLAG_MISC))
		managed_irqs = false;
Instead.
>
> Then there is no additional parameter to tune (which addresses
> Christoph's concern).
* Re: [PATCH V3] nvme-pci: allow unmanaged interrupts
From: Christoph Hellwig @ 2024-07-16 4:29 UTC
To: Marcelo Tosatti
Cc: Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme,
Sagi Grimberg, Lawrence Troup
On Mon, Jul 15, 2024 at 01:03:02PM -0300, Marcelo Tosatti wrote:
> What if you set "static bool managed_irqs" to false when isolcpus is being used?
>
> For example:
>
> if (housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
> managed_irqs = false;
>
> Then there is no additional parameter to tune (which addresses
> Christoph's concern).
No, it absolutely does not. We still need to properly spread the irqs
using the managed irq setup; we just need to do it only on the CPUs they
should be spread on.
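(For completeness: what is being asked for here amounts to restricting the
managed spreading itself rather than opting out of it. The sketch below is
purely hypothetical - no such hook exists in the core irq code today - and
only illustrates the intent.)

#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/* Hypothetical: if the core managed-irq spreading only ever considered
 * housekeeping CPUs, drivers would keep managed irqs and consistent queue
 * maps, while isolated CPUs would never be chosen as irq targets. */
static void restrict_spread_to_housekeeping(struct cpumask *spread_mask)
{
	cpumask_and(spread_mask, spread_mask,
		    housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
}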