* Why NVMe MSIx vectors affinity set across NUMA nodes?
@ 2018-01-22 4:25 Ganapatrao Kulkarni
2018-01-22 17:14 ` Keith Busch
0 siblings, 1 reply; 13+ messages in thread
From: Ganapatrao Kulkarni @ 2018-01-22 4:25 UTC (permalink / raw)
Hi,
I have observed that the NVMe driver splits the interrupt affinity of its
MSI-X vectors among the available NUMA nodes. Is there any specific
reason for that?

I see this happens because the PCI flag PCI_IRQ_AFFINITY is set in
nvme_setup_io_queues():

    nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
                    PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);

IMO, keeping all vectors on CPUs of the same node gives better interrupt
latency than distributing them across all nodes.
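
For concreteness, a minimal sketch of the alternative (hypothetical code,
not the driver's; the function name is invented for illustration):
allocate the vectors without PCI_IRQ_AFFINITY and hint each one at the
device's local NUMA node.

    /* Hypothetical sketch, not the driver's actual code: allocate the
     * vectors without PCI_IRQ_AFFINITY and hint all of them at the CPUs
     * of the device's local NUMA node. */
    static int nvme_setup_local_node_irqs(struct pci_dev *pdev, int nr_io_queues)
    {
            const struct cpumask *mask = cpumask_of_node(dev_to_node(&pdev->dev));
            int i, nr;

            nr = pci_alloc_irq_vectors(pdev, 1, nr_io_queues, PCI_IRQ_ALL_TYPES);
            if (nr < 0)
                    return nr;

            for (i = 0; i < nr; i++)
                    irq_set_affinity_hint(pci_irq_vector(pdev, i), mask);

            return nr;
    }

Note that the hint by itself does not move the interrupts; userspace
(irqbalance or a write to /proc/irq/<n>/smp_affinity) would still have
to apply it.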
thanks
Ganapat
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 4:25 Why NVMe MSIx vectors affinity set across NUMA nodes? Ganapatrao Kulkarni
@ 2018-01-22 17:14 ` Keith Busch
2018-01-22 17:22 ` Ganapatrao Kulkarni
0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2018-01-22 17:14 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 09:55:55AM +0530, Ganapatrao Kulkarni wrote:
> Hi,
>
> I have observed that the NVMe driver splits the interrupt affinity of its
> MSI-X vectors among the available NUMA nodes. Is there any specific
> reason for that?
>
> I see this happens because the PCI flag PCI_IRQ_AFFINITY is set in
> nvme_setup_io_queues():
>
>     nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
>                     PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
>
> IMO, keeping all vectors on CPUs of the same node gives better interrupt
> latency than distributing them across all nodes.
What affinity maps are you seeing? It's not supposed to share one vector
across two NUMA nodes, unless you simply don't have enough vectors.
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 17:14 ` Keith Busch
@ 2018-01-22 17:22 ` Ganapatrao Kulkarni
2018-01-22 17:32 ` Keith Busch
0 siblings, 1 reply; 13+ messages in thread
From: Ganapatrao Kulkarni @ 2018-01-22 17:22 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 10:44 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Jan 22, 2018 at 09:55:55AM +0530, Ganapatrao Kulkarni wrote:
>> Hi,
>>
>> I have observed that the NVMe driver splits the interrupt affinity of its
>> MSI-X vectors among the available NUMA nodes. Is there any specific
>> reason for that?
>>
>> I see this happens because the PCI flag PCI_IRQ_AFFINITY is set in
>> nvme_setup_io_queues():
>>
>>     nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
>>                     PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
>>
>> IMO, keeping all vectors on CPUs of the same node gives better interrupt
>> latency than distributing them across all nodes.
>
> What affinity maps are you seeing? It's not supposed to share one vector
> across two NUMA nodes, unless you simply don't have enough vectors.
There are 31 MSI-X vectors being initialised (one per NVMe queue). Of
these, vectors 0-15 get their affinity set to node 0 CPUs and vectors
16-30 to node 1 CPUs. My question is: why not set the affinity of all
vectors to CPUs of the same node? What was the need for the
PCI_IRQ_AFFINITY flag?
thanks
Ganapat
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 17:22 ` Ganapatrao Kulkarni
@ 2018-01-22 17:32 ` Keith Busch
2018-01-22 17:55 ` Ganapatrao Kulkarni
0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2018-01-22 17:32 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 10:52:59PM +0530, Ganapatrao Kulkarni wrote:
>
> There are 31 MSI-X vectors being initialised (one per NVMe queue). Of
> these, vectors 0-15 get their affinity set to node 0 CPUs and vectors
> 16-30 to node 1 CPUs. My question is: why not set the affinity of all
> vectors to CPUs of the same node? What was the need for the
> PCI_IRQ_AFFINITY flag?
I'm sorry, but I am not able to parse this question.
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 17:32 ` Keith Busch
@ 2018-01-22 17:55 ` Ganapatrao Kulkarni
2018-01-22 18:05 ` Keith Busch
0 siblings, 1 reply; 13+ messages in thread
From: Ganapatrao Kulkarni @ 2018-01-22 17:55 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 11:02 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Jan 22, 2018 at 10:52:59PM +0530, Ganapatrao Kulkarni wrote:
>>
>> There are 31 MSI-X vectors being initialised (one per NVMe queue). Of
>> these, vectors 0-15 get their affinity set to node 0 CPUs and vectors
>> 16-30 to node 1 CPUs. My question is: why not set the affinity of all
>> vectors to CPUs of the same node? What was the need for the
>> PCI_IRQ_AFFINITY flag?
>
> I'm sorry, but I am not able to parse this question.
OK, let me rephrase: what was the need to use the PCI_IRQ_AFFINITY flag
in the NVMe driver? I don't see this flag being used widely.
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 17:55 ` Ganapatrao Kulkarni
@ 2018-01-22 18:05 ` Keith Busch
2018-01-22 18:12 ` Ganapatrao Kulkarni
0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2018-01-22 18:05 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 11:25:45PM +0530, Ganapatrao Kulkarni wrote:
> OK, let me rephrase: what was the need to use the PCI_IRQ_AFFINITY flag
> in the NVMe driver? I don't see this flag being used widely.
The flag is how we get the affinity set in the first place. Without it,
we'd have to rely on user space to set irq affinities.
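
For reference, a rough sketch of the difference at the PCI level
(4.15-era API, details approximate; the helper name is invented):

    /* Rough sketch of the two allocation paths; this is not the
     * driver's code, just an illustration. */
    static int alloc_vectors_sketch(struct pci_dev *pdev, int nr_io_queues,
                                    bool managed)
    {
            if (!managed)
                    /* Vectors come up with a default mask; userspace
                     * (irqbalance, /proc/irq/<n>/smp_affinity) is free
                     * to move them later. */
                    return pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
                                                 PCI_IRQ_ALL_TYPES);

            /* With PCI_IRQ_AFFINITY the core spreads the vectors over
             * all possible CPUs and marks the interrupts as
             * kernel-managed, so their masks can no longer be changed
             * from userspace. */
            return pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
                                         PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
    }

There is also pci_alloc_irq_vectors_affinity(), which additionally takes
a struct irq_affinity so that, for example, pre_vectors can be kept out
of the automatic spreading.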
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 18:05 ` Keith Busch
@ 2018-01-22 18:12 ` Ganapatrao Kulkarni
2018-01-22 18:20 ` Keith Busch
0 siblings, 1 reply; 13+ messages in thread
From: Ganapatrao Kulkarni @ 2018-01-22 18:12 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 11:35 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Jan 22, 2018 at 11:25:45PM +0530, Ganapatrao Kulkarni wrote:
>> OK, let me rephrase: what was the need to use the PCI_IRQ_AFFINITY flag
>> in the NVMe driver? I don't see this flag being used widely.
>
> The flag is how we get the affinity set in the first place. Without it,
> we'd have to rely on user space to set irq affinities.
AFAIK, drivers usually set a default affinity, which is likely to be
node affinity on NUMA systems. After that, it is user space (irqbalance
etc.) that decides the affinity, not the driver.
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 18:12 ` Ganapatrao Kulkarni
@ 2018-01-22 18:20 ` Keith Busch
2018-01-23 13:30 ` Sagi Grimberg
0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2018-01-22 18:20 UTC (permalink / raw)
On Mon, Jan 22, 2018 at 11:42:45PM +0530, Ganapatrao Kulkarni wrote:
>
> AFAIK, drivers usually set a default affinity, which is likely to be
> node affinity on NUMA systems. After that, it is user space (irqbalance
> etc.) that decides the affinity, not the driver.
Relying on userspace to provide an optimal setting is a bad idea,
especially for NVMe where we have submission queue cpu affinity that
doesn't work very efficiently if the completion affinity doesn't match.
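
To illustrate the coupling (roughly the 4.15-era driver code, quoted
from memory, so treat it as approximate): the blk-mq queue map is built
from the same per-vector masks, so the CPU that submits to a queue is
also in that queue's interrupt mask.

    /* blk_mq_pci_map_queues() copies the pci_irq_get_affinity() mask of
     * each vector into the blk-mq cpu-to-queue map, so a command
     * submitted on a given CPU completes on a vector affinitized to
     * that CPU (or at least to its node). */
    static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
    {
            struct nvme_dev *dev = set->driver_data;

            return blk_mq_pci_map_queues(set, to_pci_dev(dev->dev));
    }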
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-22 18:20 ` Keith Busch
@ 2018-01-23 13:30 ` Sagi Grimberg
2018-01-24 2:17 ` Ganapatrao Kulkarni
0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-01-23 13:30 UTC (permalink / raw)
>> AFAIK, drivers usually set a default affinity, which is likely to be
>> node affinity on NUMA systems. After that, it is user space (irqbalance
>> etc.) that decides the affinity, not the driver.
>
> Relying on userspace to provide an optimal setting is a bad idea,
> especially for NVMe where we have submission queue cpu affinity that
> doesn't work very efficiently if the completion affinity doesn't match.
I tend to agree. Also, application locality is just as important as
device locality, so spreading across NUMA nodes helps applications
running on the far NUMA node as well.
Also, a recent thread [1] related to PCI_IRQ_AFFINITY not allowing
userspace to modify irq affinity suggested that this could perhaps be
supported, but I'm not sure what happened to it.
[1] https://www.spinics.net/lists/netdev/msg464301.html
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-23 13:30 ` Sagi Grimberg
@ 2018-01-24 2:17 ` Ganapatrao Kulkarni
2018-01-24 15:48 ` Keith Busch
0 siblings, 1 reply; 13+ messages in thread
From: Ganapatrao Kulkarni @ 2018-01-24 2:17 UTC (permalink / raw)
On Tue, Jan 23, 2018 at 7:00 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>> AFAIK, drivers usually set a default affinity, which is likely to be
>>> node affinity on NUMA systems. After that, it is user space (irqbalance
>>> etc.) that decides the affinity, not the driver.
>>
>>
>> Relying on userspace to provide an optimal setting is a bad idea,
>> especially for NVMe where we have submission queue cpu affinity that
>> doesn't work very efficiently if the completion affinity doesn't match.
>
>
> I tend to agree. Also, application locality is just as important as
> device locality, so spreading across NUMA nodes helps applications
> running on the far NUMA node as well.
>
> Also, a recent thread [1] related to PCI_IRQ_AFFINITY not allowing
> userspace to modify irq affinity suggested that this could perhaps be
> supported, but I'm not sure what happened to it.
>
The application uses libnuma to align with NUMA locality; here the
driver is breaking that affinity. Having a vector affinitized to a
remote node's CPUs will certainly add latency to the interrupt response
time, and that is what happens here for some of the NVMe queues.
> [1] https://www.spinics.net/lists/netdev/msg464301.html
thanks
Ganapat
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-24 2:17 ` Ganapatrao Kulkarni
@ 2018-01-24 15:48 ` Keith Busch
2018-01-24 19:39 ` Sagi Grimberg
0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2018-01-24 15:48 UTC (permalink / raw)
On Wed, Jan 24, 2018 at 07:47:44AM +0530, Ganapatrao Kulkarni wrote:
> The application uses libnuma to align with NUMA locality; here the
> driver is breaking that affinity. Having a vector affinitized to a
> remote node's CPUs will certainly add latency to the interrupt response
> time, and that is what happens here for some of the NVMe queues.
I bet you can't come up with an IRQ CPU affinity map that performs better
than the current setup.
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-24 15:48 ` Keith Busch
@ 2018-01-24 19:39 ` Sagi Grimberg
2018-01-24 20:38 ` Keith Busch
0 siblings, 1 reply; 13+ messages in thread
From: Sagi Grimberg @ 2018-01-24 19:39 UTC (permalink / raw)
>> The application uses libnuma to align with NUMA locality; here the
>> driver is breaking that affinity. Having a vector affinitized to a
>> remote node's CPUs will certainly add latency to the interrupt response
>> time, and that is what happens here for some of the NVMe queues.
>
> I bet you can't come up with an IRQ CPU affinity map that performs better
> than the current setup.
:)
While I agree that managed affinity will probably get the optimal
affinitization in 99% of the cases, this is the second complaint we've
had that managed affinity breaks an existing user interface (even though
it was a sure way to allow userspace to screw up for years).
My mlx5 conversion ended up being reverted because of that...
* Why NVMe MSIx vectors affinity set across NUMA nodes?
2018-01-24 19:39 ` Sagi Grimberg
@ 2018-01-24 20:38 ` Keith Busch
0 siblings, 0 replies; 13+ messages in thread
From: Keith Busch @ 2018-01-24 20:38 UTC (permalink / raw)
On Wed, Jan 24, 2018 at 09:39:02PM +0200, Sagi Grimberg wrote:
>
> > > The application uses libnuma to align with NUMA locality; here the
> > > driver is breaking that affinity. Having a vector affinitized to a
> > > remote node's CPUs will certainly add latency to the interrupt response
> > > time, and that is what happens here for some of the NVMe queues.
> >
> > I bet you can't come up with an IRQ CPU affinity map that performs better
> > than the current setup.
>
> :)
>
> While I agree that managed affinity will probably get the optimal
> affinitization in 99% of the cases, this is the second complaint we've
> had that managed affinity breaks an existing user interface (even though
> it was a sure way to allow userspace to screw up for years).
Well, this is the only complaint for NVMe, and it doesn't seem to be aware
of how this works. If libnuma is used to run on a specific node, interrupts
will occur only on that node. An interrupt sent to a remote node means
you submitted a command from there, and handling the interrupt there is
cheaper than bouncing hot cache lines across nodes.
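
As a userspace illustration of that point (a hypothetical example using
libnuma; the device path and node number are placeholders):

    /* Pin the process to node 0 with libnuma; direct I/O submitted from
     * these CPUs is queued on a node-0 submission queue and the matching
     * completion interrupt is affinitized to node-0 CPUs as well.
     * Build with -lnuma. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <numa.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            void *buf;
            int fd;

            if (numa_available() < 0)
                    return 1;
            numa_run_on_node(0);            /* run only on node 0 CPUs */

            if (posix_memalign(&buf, 4096, 4096))
                    return 1;

            fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
            if (fd < 0)
                    return 1;

            /* Submitted from a node 0 CPU, completed by a node 0 vector. */
            pread(fd, buf, 4096, 0);

            close(fd);
            free(buf);
            return 0;
    }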