public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* IRQ affinity problem from virtio_blk
@ 2022-11-15  3:40 Angus Chen
  2022-11-15 22:19 ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-15  3:40 UTC (permalink / raw)
  To: tglx@linutronix.de
  Cc: linux-kernel@vger.kernel.org, Michael S. Tsirkin, Ming Lei,
	Jason Wang

Hi All.
I tested Linux 6.1 and found that virtio_blk requests its interrupts
with IRQD_AFFINITY_MANAGED set.
The machine has 80 CPUs across two NUMA nodes.

Before probing one virtio_blk device:
crash_cts> p *vector_matrix
$44 = {
  matrix_bits = 256,
  alloc_start = 32,
  alloc_end = 236,
  alloc_size = 204,
  global_available = 15354,
  global_reserved = 154,
  systembits_inalloc = 3,
  total_allocated = 411,
  online_maps = 80,
  maps = 0x46100,
  scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
  system_map = {1125904739729407, 0, 1, 18435221191850459136}
}
After probing one virtio_blk device:
crash_cts> p *vector_matrix
$45 = {
  matrix_bits = 256,
  alloc_start = 32,
  alloc_end = 236,
  alloc_size = 204,
  global_available = 15273,
  global_reserved = 154,
  systembits_inalloc = 3,
  total_allocated = 413,
  online_maps = 80,
  maps = 0x46100,
  scratch_map = {25769803776, 0, 0, 14680064},
  system_map = {1125904739729407, 0, 1, 18435221191850459136}
}

We can see global_available drop from 15354 to 15273, a difference of 81,
while total_allocated increases from 411 to 413: one config irq and one
vq irq.

This makes it easy to exhaust the irq vector resources, because there can
be more than 512 virtio_blk devices in a system.
From reading the irq matrix code, this reservation with
IRQD_AFFINITY_MANAGED set is intentional; it is a kind of feature.

If we exhaust the irq vectors, per_vq_vectors allocation breaks, so
virtblk_map_queues will eventually fall back to blk_mq_map_queues.

Even without full exhaustion, if the vector bits of one CPU are used more
heavily than those of the others, allocation with IRQD_AFFINITY_MANAGED
will fail too, because the usage is not balanced.

I'm not a native English speaker; any suggestions will be appreciated.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15  3:40 IRQ affinity problem from virtio_blk Angus Chen
@ 2022-11-15 22:19 ` Thomas Gleixner
  2022-11-15 22:44   ` Michael S. Tsirkin
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-15 22:19 UTC (permalink / raw)
  To: Angus Chen
  Cc: linux-kernel@vger.kernel.org, Michael S. Tsirkin, Ming Lei,
	Jason Wang

On Tue, Nov 15 2022 at 03:40, Angus Chen wrote:
> Before probe one virtio_blk.
> crash_cts> p *vector_matrix
> $44 = {
>   matrix_bits = 256,
>   alloc_start = 32,
>   alloc_end = 236,
>   alloc_size = 204,
>   global_available = 15354,
>   global_reserved = 154,
>   systembits_inalloc = 3,
>   total_allocated = 411,
>   online_maps = 80,
>   maps = 0x46100,
>   scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
>   system_map = {1125904739729407, 0, 1, 18435221191850459136}
> }
> After probe one virtio_blk.
> crash_cts> p *vector_matrix
> $45 = {
>   matrix_bits = 256,
>   alloc_start = 32,
>   alloc_end = 236,
>   alloc_size = 204,
>   global_available = 15273,
>   global_reserved = 154,
>   systembits_inalloc = 3,
>   total_allocated = 413,
>   online_maps = 80,
>   maps = 0x46100,
>   scratch_map = {25769803776, 0, 0, 14680064},
>   system_map = {1125904739729407, 0, 1, 18435221191850459136}
> }
>
> We can see global_available drop from 15354 to 15273, is 81.
> And the total_allocated increase from 411 to 413. One config irq,and
> one vq irq.

Right. That's perfectly fine. At the point where you are looking at it,
the matrix allocator has given out 2 vectors as can be seen via
total_allocated.

But then it also has another 79 vectors put aside for the other queues,
but those queues have not yet requested the interrupts so there is no
allocation yet. But the vectors are guaranteed to be available when
request_irq() for those queues runs, which does the actual allocation.

Btw, you can enable CONFIG_GENERIC_IRQ_DEBUGFS and then look at the
content of /sys/kernel/debug/irq/domains/VECTOR which gives you a very
clear picture of what's going on. No need for gdb.
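
For reference, a minimal sketch of pulling the global summary out of that
file (assumptions: the field names follow the dump format shown elsewhere
in this thread and may differ across kernel versions; SAMPLE stands in
for the real file contents):

```python
# Parse the global summary lines of /sys/kernel/debug/irq/domains/VECTOR.
# SAMPLE mimics the dump format; on a real system read the file instead.
SAMPLE = """\
name:   VECTOR
 size:   0
 mapped: 2015
 flags:  0x00000003
Online bitmaps:       80
Global available:      0
Global reserved:     154
Total allocated:    1861
"""

def parse_vector_summary(text: str) -> dict:
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # keep only "key: value" lines, skip the per-CPU table
            summary[key.strip()] = value.strip()
    return summary

info = parse_vector_summary(SAMPLE)
print(info["Global available"])   # -> 0
print(info["Total allocated"])    # -> 1861
```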

> It is easy to expend the irq resource ,because virtio_blk device could
> be more than 512.

How so? virtio_blk allocates a config interrupt and one queue interrupt
per CPU. So in your case a total of 81.

How would you exhaust the vector space? Each CPU has about ~200 (in your
case exactly 204) vectors which can be handed out to devices. You'd need
to instantiate about 200 virtio_blk devices to get to the point of
vector exhaustion.
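
As a rough sanity check on that figure (a back-of-the-envelope sketch
only; the real limit is enforced per CPU, while this simply divides the
global budget from the matrix dump):

```python
# Estimate how many virtio_blk devices fit before vector exhaustion.
cpus = 80
global_available = 15354        # from the dump, before the first probe
consumed_per_device = cpus + 1  # one managed vector per CPU + config irq

print(global_available // consumed_per_device)  # -> 189, i.e. ~200 devices
```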

So what are you actually worried about and which problem are you trying
to solve?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15 22:19 ` Thomas Gleixner
@ 2022-11-15 22:44   ` Michael S. Tsirkin
  2022-11-15 23:04     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Michael S. Tsirkin @ 2022-11-15 22:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

Thanks Thomas, I have a question:

On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 03:40, Angus Chen wrote:
> > Before probe one virtio_blk.
> > crash_cts> p *vector_matrix
> > $44 = {
> >   matrix_bits = 256,
> >   alloc_start = 32,
> >   alloc_end = 236,
> >   alloc_size = 204,
> >   global_available = 15354,
> >   global_reserved = 154,
> >   systembits_inalloc = 3,
> >   total_allocated = 411,
> >   online_maps = 80,
> >   maps = 0x46100,
> >   scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
> >   system_map = {1125904739729407, 0, 1, 18435221191850459136}
> > }
> > After probe one virtio_blk.
> > crash_cts> p *vector_matrix
> > $45 = {
> >   matrix_bits = 256,
> >   alloc_start = 32,
> >   alloc_end = 236,
> >   alloc_size = 204,
> >   global_available = 15273,
> >   global_reserved = 154,
> >   systembits_inalloc = 3,
> >   total_allocated = 413,
> >   online_maps = 80,
> >   maps = 0x46100,
> >   scratch_map = {25769803776, 0, 0, 14680064},
> >   system_map = {1125904739729407, 0, 1, 18435221191850459136}
> > }
> >
> > We can see global_available drop from 15354 to 15273, is 81.
> > And the total_allocated increase from 411 to 413. One config irq,and
> > one vq irq.
> 
> Right. That's perfectly fine. At the point where you looking at it, the
> matrix allocator has given out 2 vectors as can be seen via
> total_allocated.
> 
> But then it also has another 79 vectors put aside for the other queues,


What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?



> but those queues have not yet requested the interrupts so there is no
> allocation yet. But the vectors are guaranteed to be available when
> request_irq() for those queues runs, which does the actual allocation.
> 
> Btw, you can enable CONFIG_GENERIC_IRQ_DEBUGFS and then look at the
> content of /sys/kernel/debug/irq/domain/VECTOR which gives you a very
> clear picture of what's going on. No need for gdb.
> 
> > It is easy to expend the irq resource ,because virtio_blk device could
> > be more than 512.
> 
> How so? virtio_blk allocates a config interrupt and one queue interrupt
> per CPU. So in your case a total of 81.
> 
> How would you exhaust the vector space? Each CPU has about ~200 (in your
> case exactly 204) vectors which can be handed out to devices. You'd need
> to instantiate about 200 virtio_blk devices to get to the point of
> vector exhaustion.
> 
> So what are you actually worried about and which problem are you trying
> to solve?
> 
> Thanks,
> 
>         tglx
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15 22:44   ` Michael S. Tsirkin
@ 2022-11-15 23:04     ` Thomas Gleixner
  2022-11-15 23:24       ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-15 23:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
>> > We can see global_available drop from 15354 to 15273, is 81.
>> > And the total_allocated increase from 411 to 413. One config irq,and
>> > one vq irq.
>> 
>> Right. That's perfectly fine. At the point where you looking at it, the
>> matrix allocator has given out 2 vectors as can be seen via
>> total_allocated.
>> 
>> But then it also has another 79 vectors put aside for the other queues,
>
> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?

init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()

init_vq() hands in a struct irq_affinity which means that
pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
for config and one per queue if vp_request_msix_vectors() is invoked
with per_vq_vectors == true, which is what the first invocation in
vp_find_vqs() does.

Thanks,

        tglx






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15 23:04     ` Thomas Gleixner
@ 2022-11-15 23:24       ` Thomas Gleixner
  2022-11-15 23:36         ` Michael S. Tsirkin
  2022-11-16  0:46         ` Angus Chen
  0 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-15 23:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:

> On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
>> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
>>> > We can see global_available drop from 15354 to 15273, is 81.
>>> > And the total_allocated increase from 411 to 413. One config irq,and
>>> > one vq irq.
>>> 
>>> Right. That's perfectly fine. At the point where you looking at it, the
>>> matrix allocator has given out 2 vectors as can be seen via
>>> total_allocated.
>>> 
>>> But then it also has another 79 vectors put aside for the other queues,
>>
>> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
>
> init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
>
> init_vq() hands in a struct irq_affinity which means that
> pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> for config and one per queue if vp_request_msix_vectors() is invoked
> with per_vq_vectors == true, which is what the first invocation in
> vp_find_vqs() does.

I just checked on a random VM. The PCI device as advertised to the guest
does not expose that many vectors. One has 2 and the other 4.

But as the interrupts are requested 'managed' the core ends up setting
the vectors aside. That's a fundamental property of managed interrupts.

Assume you have fewer queues than CPUs, which is the case with 2 vectors
and tons of CPUs, i.e. one ends up for config and the other for the
actual queue. So the affinity spreading code will end up having the full
cpumask for the queue vector, which is marked managed. And managed means
that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
be migrated to a still online CPU.

So we end up setting 79 vectors aside (one per CPU) in the case that the
virtio device only provides two vectors.

But that's not the end of the world as you really would need ~200 such
devices to exhaust the vector space...

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15 23:24       ` Thomas Gleixner
@ 2022-11-15 23:36         ` Michael S. Tsirkin
  2022-11-16  1:02           ` Angus Chen
  2022-11-16 10:43           ` Thomas Gleixner
  2022-11-16  0:46         ` Angus Chen
  1 sibling, 2 replies; 15+ messages in thread
From: Michael S. Tsirkin @ 2022-11-15 23:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
> 
> > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> >>> > We can see global_available drop from 15354 to 15273, is 81.
> >>> > And the total_allocated increase from 411 to 413. One config irq,and
> >>> > one vq irq.
> >>> 
> >>> Right. That's perfectly fine. At the point where you looking at it, the
> >>> matrix allocator has given out 2 vectors as can be seen via
> >>> total_allocated.
> >>> 
> >>> But then it also has another 79 vectors put aside for the other queues,
> >>
> >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> >
> > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> >
> > init_vq() hands in a struct irq_affinity which means that
> > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > for config and one per queue if vp_request_msix_vectors() is invoked
> > with per_vq_vectors == true, which is what the first invocation in
> > vp_find_vqs() does.
> 
> I just checked on a random VM. The PCI device as advertised to the guest
> does not expose that many vectors. One has 2 and the other 4.
> 
> But as the interrupts are requested 'managed' the core ends up setting
> the vectors aside. That's a fundamental property of managed interrupts.
> 
> Assume you have less queues than CPUs, which is the case with 2 vectors
> and tons of CPUs, i.e. one ends up for config and the other for the
> actual queue. So the affinity spreading code will end up having the full
> cpumask for the queue vector, which is marked managed. And managed means
> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> be migrated to a still online CPU.
> 
> So we end up setting 79 vectors aside (one per CPU) in the case that the
> virtio device only provides two vectors.
> 
> But that's not the end of the world as you really would need ~200 such
> devices to exhaust the vector space...
> 
> Thanks,
> 
>         tglx

Let's say we have 20 queues - then just 10 devices will exhaust the
vector space right?

-- 
MST


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: IRQ affinity problem from virtio_blk
  2022-11-15 23:24       ` Thomas Gleixner
  2022-11-15 23:36         ` Michael S. Tsirkin
@ 2022-11-16  0:46         ` Angus Chen
  2022-11-16 10:54           ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-16  0:46 UTC (permalink / raw)
  To: Thomas Gleixner, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang



> -----Original Message-----
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, November 16, 2022 7:24 AM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Angus Chen <angus.chen@jaguarmicro.com>; linux-kernel@vger.kernel.org;
> Ming Lei <ming.lei@redhat.com>; Jason Wang <jasowang@redhat.com>
> Subject: Re: IRQ affinity problem from virtio_blk
> 
> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
> 
> > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> >>> > We can see global_available drop from 15354 to 15273, is 81.
> >>> > And the total_allocated increase from 411 to 413. One config irq,and
> >>> > one vq irq.
> >>>
> >>> Right. That's perfectly fine. At the point where you looking at it, the
> >>> matrix allocator has given out 2 vectors as can be seen via
> >>> total_allocated.
> >>>
> >>> But then it also has another 79 vectors put aside for the other queues,
Well, that is not the case here; in fact, I have just one queue per virtio_blk:

crash_cts> struct virtio_blk.num_vqs 0xffff888147b79c00
  num_vqs = 1,
I think this is the key point we are talking about.
> >>
> >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> >
> > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> >
> > init_vq() hands in a struct irq_affinity which means that
> > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > for config and one per queue if vp_request_msix_vectors() is invoked
> > with per_vq_vectors == true, which is what the first invocation in
> > vp_find_vqs() does.
> 
> I just checked on a random VM. The PCI device as advertised to the guest
> does not expose that many vectors. One has 2 and the other 4.
> 
> But as the interrupts are requested 'managed' the core ends up setting
> the vectors aside. That's a fundamental property of managed interrupts.
> 
> Assume you have less queues than CPUs, which is the case with 2 vectors
> and tons of CPUs, i.e. one ends up for config and the other for the
> actual queue. So the affinity spreading code will end up having the full
> cpumask for the queue vector, which is marked managed. And managed means
> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> be migrated to a still online CPU.
> 
> So we end up setting 79 vectors aside (one per CPU) in the case that the
> virtio device only provides two vectors.
> 
> But that's not the end of the world as you really would need ~200 such
> devices to exhaust the vector space...
> 
Thank you for your reply.
Let's look at the dmesg for more information:
...
Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: 1/0/0 default/read/poll queues
Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: [vdpr] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: enabling device (0000 -> 0002)
Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: virtio_pci: leaving for legacy driver
Nov 14 11:48:46 localhost kernel: virtio_blk virtio182: 1/0/0 default/read/poll queues---------virtio182 means index 182
Nov 14 11:48:46 localhost kernel: vp_find_vqs_msix return err=-28-----------------------------the first time we get a 'no space' (-ENOSPC) error from the irq subsystem
...
At that point it is easy to get the following output:
crash_cts> p *vector_matrix
$97 = {
  matrix_bits = 256,
  alloc_start = 32,
  alloc_end = 236,
  alloc_size = 204,
  global_available = 0,------------the vector space is exhausted
  global_reserved = 154,
  systembits_inalloc = 3,
  total_allocated = 1861,
  online_maps = 80,
  maps = 0x46100,
  scratch_map = {18446744069952503807, 18446744073709551615, 18446744073709551615, 18435229987943481343},
  system_map = {1125904739729407, 0, 1, 18435221191850459136}
}

After that, all irq requests fail with 'no space'.

The more asymmetric the per-CPU vector usage is, the sooner we hit the
'no space' error when probing interrupts with IRQD_AFFINITY_MANAGED.
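
This asymmetry point can be illustrated with a simplified model of the
managed reservation (loosely modelled on irq_matrix_reserve_managed() in
kernel/irq/matrix.c; the per-CPU accounting here is an illustrative
sketch, not the kernel code):

```python
ENOSPC = 28

def reserve_managed(avail, mask):
    """Reserve one vector on every CPU in mask; fail if any CPU is full."""
    if any(avail[cpu] == 0 for cpu in mask):
        return -ENOSPC          # one exhausted CPU blocks the whole mask
    for cpu in mask:
        avail[cpu] -= 1
    return 0

# Asymmetric usage: CPU 2 is already out of vectors, so a managed
# reservation spanning it fails even though other CPUs have room.
avail = {0: 5, 1: 4, 2: 0, 3: 6}
print(reserve_managed(avail, mask=[0, 1, 2, 3]))  # -> -28 ('no space')
```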

> Thanks,
> 
>         tglx
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: IRQ affinity problem from virtio_blk
  2022-11-15 23:36         ` Michael S. Tsirkin
@ 2022-11-16  1:02           ` Angus Chen
  2022-11-16 10:55             ` Thomas Gleixner
  2022-11-16 10:43           ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-16  1:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Thomas Gleixner
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang



> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 16, 2022 7:37 AM
> To: Thomas Gleixner <tglx@linutronix.de>
> Cc: Angus Chen <angus.chen@jaguarmicro.com>; linux-kernel@vger.kernel.org;
> Ming Lei <ming.lei@redhat.com>; Jason Wang <jasowang@redhat.com>
> Subject: Re: IRQ affinity problem from virtio_blk
> 
> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> > On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
> >
> > > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> > >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> > >>> > We can see global_available drop from 15354 to 15273, is 81.
> > >>> > And the total_allocated increase from 411 to 413. One config irq,and
> > >>> > one vq irq.
> > >>>
> > >>> Right. That's perfectly fine. At the point where you looking at it, the
> > >>> matrix allocator has given out 2 vectors as can be seen via
> > >>> total_allocated.
> > >>>
> > >>> But then it also has another 79 vectors put aside for the other queues,
> > >>
> > >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> > >
> > > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> > >
> > > init_vq() hands in a struct irq_affinity which means that
> > > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > > for config and one per queue if vp_request_msix_vectors() is invoked
> > > with per_vq_vectors == true, which is what the first invocation in
> > > vp_find_vqs() does.
> >
> > I just checked on a random VM. The PCI device as advertised to the guest
> > does not expose that many vectors. One has 2 and the other 4.
> >
> > But as the interrupts are requested 'managed' the core ends up setting
> > the vectors aside. That's a fundamental property of managed interrupts.
> >
> > Assume you have less queues than CPUs, which is the case with 2 vectors
> > and tons of CPUs, i.e. one ends up for config and the other for the
> > actual queue. So the affinity spreading code will end up having the full
> > cpumask for the queue vector, which is marked managed. And managed
> means
> > that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> > be migrated to a still online CPU.
> >
> > So we end up setting 79 vectors aside (one per CPU) in the case that the
> > virtio device only provides two vectors.
> >
> > But that's not the end of the world as you really would need ~200 such
> > devices to exhaust the vector space...

Here is the VECTOR information:
[root@localhost domains]# cat VECTOR
name:   VECTOR
 size:   0
 mapped: 2015
 flags:  0x00000003
Online bitmaps:       80
Global available:      0
Global reserved:     154
Total allocated:    1861
System: 39: 0-19,29,32,50,128,236,240-242,244,246-255
 | CPU | avl | man | mac | act | vectors
     0     0   180     2   23  33-46,48,110,132,162,185,206-207,228-229
     1     0   180     2   23  33-37,41,44,124,134,156-157,167,180,186-187,198,225,228-233
     2     0   180     2   23  33-40,123,154-155,164,177,186,202,221-224,227,232-233,235
     3     0   180     2   23  33-36,70,123-124,140,156,168,174,197,199,201,207,225-228,232-235
     4     0   180     2   23  33-39,101,122,133,147,207,217-221,227-228,231,233-235
     5     0   180     2   23  33-38,83,115,156,165-166,177,207-209,220-222,228,231-234
     6     0   180     2   23  33-38,55,91,146,154,160,164,187-188,209,217-218,221-222,232-235
     7     0   180     2   23  33-37,81-82,113,145,154,186-188,207,221-224,226,229,232,234-235
     8     0   180     2   23  33-37,81,91,148-149,189,198-199,201,210,217-218,222,229-232,234-235
     9     0   180     2   23  33-38,59,133,146,157,165,174,196,205,207,220-221,225-226,232-235
    10     0   180     2   23  33-36,87,133-134,142,174,188,198-199,206,214,217-220,228-230,234-235
    11     0   180     2   23  33-35,83,94,113,127,129,157,187-188,209,219-224,229-230,233-235
    12     0   180     2   23  33-34,36,55,113-114,129,158-159,168,175,189-190,197,208-209,219-220,227,232-235
    13     0   180     2   23  33-34,37-38,83,94,156-158,186-187,207,221-222,225-227,230-235
    14     0   180     2   23  33-35,43,70,101-102,170,175-177,215,217-218,220,226-230,232-233,235
    15     0   180     2   23  33-35,104,112,134,144,158,167-168,170,175-176,187,198,208,221-222,228-229,233-235
    16     0   180     2   23  34-36,71,91,146,155-156,189-190,217-219,223,225-228,231-235
    17     0   180     2   23  33-34,49,92,101,134,144,187,195-197,207-209,216-217,221,230-235
    18     0   180     2   23  33-34,135-136,146,174,198,206-209,217,224-231,233-235
    19     0   180     2   23  33-34,58,91,101,113,122,135,165,197-199,206,221-223,228-229,231-235
    20     0   180     2   23  33-34,215-235
    21     0   180     2   23  33-34,214,216-235
    22     0   180     2   23  33-34,215-235
    23     0   180     2   23  33-34,215-235
    24     0   180     2   23  33-35,216-235
    25     0   180     2   23  33-35,216-235
    26     0   180     2   23  33-35,216-235
    27     0   180     2   23  33-35,216-235
    28     0   180     2   23  33-35,216-235
    29     0   180     2   23  33-35,216-235
    30     0   180     2   23  33-35,216-235
    31     0   180     2   23  33-35,216-235
    32     0   180     2   23  33-34,215-235
    33     0   180     2   23  33-34,215-235
    34     0   180     2   23  33-34,215-235
    35     0   180     2   23  33-34,215-235
    36     0   180     2   23  33-34,211,216-235
    37     0   180     2   23  33-34,215-235
    38     0   180     2   23  33-34,215-235
    39     0   180     2   23  33-34,215-235
    40     0   180     2   23  33-34,56,65,134,170,176-178,207-210,225-229,231-235
    41     0   180     2   23  33-34,54,113,135-137,143,169,195-198,216-217,224,228-230,232-235
    42     0   180     2   23  33,36,57,111-112,126,164,175-176,199-200,207-210,225-226,230-235
    43     0   180     2   23  33-34,70,82,133-135,145,155,166,174,188-189,207,209,218,226-229,233-235
    44     0   180     2   23  33-34,59,103,111,126,166-167,185-186,207-208,217-218,226-232,234-235
    45     0   180     2   23  33,35-36,81,106,145-146,165,176,187,195,220-221,226-235
    46     0   180     2   23  33-34,69,137,143,155,176,180,185-187,197,206-207,212-213,225-228,230,234-235
    47     0   180     2   23  34,36,71,91-92,103-104,143,165,179,185-186,195,208-209,220-221,230-235
    48     0   180     2   23  33-34,36,93,122,157,174,186-188,198,208-209,225,227-235
    49     0   180     2   23  34-35,132-133,147-148,156,176-177,194-197,212,226-228,230-235
    50     0   180     2   23  33-34,45,123,138,162,164-166,195-196,208-209,219,224-226,228,230-231,233-235
    51     0   180     2   23  33-34,55,69-70,110,167,179-181,197-198,217-220,228-230,232-235
    52     0   180     2   23  33-34,70,132,145,156,178,186-188,190,210-212,218-219,228-230,232-235
    53     0   180     2   23  33,35,70,111,144,194-195,197,209,216-219,224,226-231,233-235
    54     0   180     2   23  33-34,102,115,147,154,164-166,181,188,200,210-211,219-220,228-229,231-235
    55     0   180     2   23  33-36,55,114,154-156,174,187,198,207-209,224-225,227-229,233-235
    56     0   180     2   23  33-34,54,104,113,132,154,175,188,209,216-221,226-227,230-233,235
    57     0   180     2   23  34-35,47,100,127,132-133,176-178,196-197,208,220,224-226,230-235
    58     0   180     2   23  34,37,42,100,110-111,143,164-165,185,198,206-208,216-218,228-229,231,233-235
    59     0   180     3   24  33-35,39,43,81-82,111,126,164-165,184,186,211-212,219-221,223,231-235
    60     0   180     3   24  33-35,215-235
    61     0   180     3   24  33-35,215-235
    62     0   180     3   24  33-35,215-235
    63     0   180     3   24  33-35,215-235
    64     0   180     3   24  33-35,215-235
    65     0   180     3   24  33-35,215-235
    66     0   180     3   24  33-35,215-235
    67     0   180     3   24  33-35,215-235
    68     0   180     3   24  33-35,211,216-235
    69     0   180     3   24  33-35,215-235
    70     0   180     3   24  33-35,215-235
    71     0   180     3   24  33-35,215-235
    72     0   180     3   24  33-35,215-235
    73     0   180     3   24  33-35,215-235
    74     0   180     3   24  33-35,215-235
    75     0   180     3   24  33-35,215-235
    76     0   180     3   24  33-35,215-235
    77     0   180     3   24  33-35,215-235
    78     0   180     3   24  33-35,214,216-235
    79     0   180     3   24  33-35,215-235



crash_cts> p *vector_matrix
$98 = {
  matrix_bits = 256,
  alloc_start = 32,
  alloc_end = 236,
  alloc_size = 204,
  global_available = 0,
  global_reserved = 154,
  systembits_inalloc = 3,
  total_allocated = 1861,
  online_maps = 80,
  maps = 0x46100,
  scratch_map = {18446744069952503807, 18446744073709551615, 18446744073709551615, 18435229987943481343},
  system_map = {1125904739729407, 0, 1, 18435221191850459136}
}
If there is any other information I need to provide, please tell me.
Thanks.
> >
> > Thanks,
> >
> >         tglx
> 
> Let's say we have 20 queues - then just 10 devices will exhaust the
> vector space right?
> 
> --
> MST


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15 23:36         ` Michael S. Tsirkin
  2022-11-16  1:02           ` Angus Chen
@ 2022-11-16 10:43           ` Thomas Gleixner
  2022-11-16 11:35             ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 10:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Tue, Nov 15 2022 at 18:36, Michael S. Tsirkin wrote:
> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
>> I just checked on a random VM. The PCI device as advertised to the guest
>> does not expose that many vectors. One has 2 and the other 4.
>> 
>> But as the interrupts are requested 'managed' the core ends up setting
>> the vectors aside. That's a fundamental property of managed interrupts.
>> 
>> Assume you have less queues than CPUs, which is the case with 2 vectors
>> and tons of CPUs, i.e. one ends up for config and the other for the
>> actual queue. So the affinity spreading code will end up having the full
>> cpumask for the queue vector, which is marked managed. And managed means
>> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
>> be migrated to a still online CPU.
>> 
>> So we end up setting 79 vectors aside (one per CPU) in the case that the
>> virtio device only provides two vectors.
>> 
>> But that's not the end of the world as you really would need ~200 such
>> devices to exhaust the vector space...
>
> Let's say we have 20 queues - then just 10 devices will exhaust the
> vector space right?

No.

If you have 20 queues then the queues are spread out over the
CPUs. Assume 80 CPUs:

Then each queue is associated to 80/20 = 4 CPUs and the resulting
affinity mask of each queue contains exactly 4 CPUs:

q0:      0 -  3
q1:      4 -  7
...
q19:    76 - 79

So this puts exactly 80 vectors aside, one per CPU.

As long as at least one CPU of a queue mask is online the queue is
enabled. If the last CPU of a queue mask goes offline then the queue is
shutdown which means the interrupt associated to the queue is shut down
too. That's all handled by the block MQ and the interrupt core. If a CPU
of a queue mask comes back online then the guaranteed vector is
allocated again.

So it does not matter how many queues per device you have it will
reserve exactly ONE interrupt per CPU.

Ergo you need 200 devices to exhaust the vector space.
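
The spreading above can be sketched as follows (a simplification of the
kernel's affinity spreading; NUMA locality, which the real code honours,
is ignored here):

```python
def spread_queues(num_cpus, num_queues):
    """Assign each queue a contiguous, evenly sized block of CPUs."""
    base, extra = divmod(num_cpus, num_queues)
    masks, start = [], 0
    for q in range(num_queues):
        size = base + (1 if q < extra else 0)
        masks.append(list(range(start, start + size)))
        start += size
    return masks

masks = spread_queues(80, 20)
print(masks[0])                    # -> [0, 1, 2, 3]
print(masks[19])                   # -> [76, 77, 78, 79]
# One managed vector reserved per CPU, independent of the queue count:
print(sum(len(m) for m in masks))  # -> 80
```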

Thanks,

        tglx








^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: IRQ affinity problem from virtio_blk
  2022-11-16  0:46         ` Angus Chen
@ 2022-11-16 10:54           ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 10:54 UTC (permalink / raw)
  To: Angus Chen, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16 2022 at 00:46, Angus Chen wrote:
>> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>> >>> But then it also has another 79 vectors put aside for the other queues,
> en,it not the truth,in fact ,I just has one queue for one virtio_blk.

Which does not matter. See my reply to Michael. It's ONE vector per CPU
and block device.

> Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: 1/0/0 default/read/poll queues
> Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: [vdpr] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
> Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: enabling device (0000 -> 0002)
> Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: virtio_pci: leaving for legacy driver
> Nov 14 11:48:46 localhost kernel: virtio_blk virtio182: 1/0/0 default/read/poll queues  <- virtio182 means device index 182
> Nov 14 11:48:46 localhost kernel: vp_find_vqs_msix return err=-28                       <- the first time we get the 'no space' (-ENOSPC) error from the irq subsystem

That's close to 200 virtio devices and the vector space is exhausted.
Works as expected.

Interrupt vectors are a limited resource on x86 and not only on x86. Not
any different from any other resource.
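As a hypothetical back-of-the-envelope check with the matrix numbers
from your first mail (global_available dropping from 15354 to 15273 per
single-queue device):

```python
# Each single-queue virtio_blk device on this 80-CPU box costs one
# managed vector reservation per CPU plus one config vector.
num_cpus = 80
cost_per_device = num_cpus + 1        # matches 15354 - 15273 = 81
global_available = 15354              # before the first device probed

devices_until_exhaustion = global_available // cost_per_device
print(devices_until_exhaustion)       # -> 189
# Same order of magnitude as the observed -ENOSPC around virtio182;
# other vector users on the system account for the difference.
```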

Thanks,

        tglx


* RE: IRQ affinity problem from virtio_blk
  2022-11-16  1:02           ` Angus Chen
@ 2022-11-16 10:55             ` Thomas Gleixner
  2022-11-16 11:24               ` Angus Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 10:55 UTC (permalink / raw)
  To: Angus Chen, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16 2022 at 01:02, Angus Chen wrote:
>> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> If there is any other information I need to provide, please tell me.

A sensible use case for 180+ virtio block devices in a single guest.

Thanks,

        tglx


* RE: IRQ affinity problem from virtio_blk
  2022-11-16 10:55             ` Thomas Gleixner
@ 2022-11-16 11:24               ` Angus Chen
  2022-11-16 13:27                 ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-16 11:24 UTC (permalink / raw)
  To: Thomas Gleixner, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang



> -----Original Message-----
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, November 16, 2022 6:56 PM
> To: Angus Chen <angus.chen@jaguarmicro.com>; Michael S. Tsirkin
> <mst@redhat.com>
> Cc: linux-kernel@vger.kernel.org; Ming Lei <ming.lei@redhat.com>; Jason
> Wang <jasowang@redhat.com>
> Subject: RE: IRQ affinity problem from virtio_blk
> 
> On Wed, Nov 16 2022 at 01:02, Angus Chen wrote:
> >> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> > If there is any other information I need to provide, please tell me.
> 
> A sensible use case for 180+ virtio block devices in a single guest.
> 
Our card can provide more than 512 virtio_blk devices.
Each virtio_blk device is passed through to one container, like docker,
so we need that many devices.
In the first patch, I removed the IRQD_AFFINITY_MANAGED flag in virtio_blk.

As you know, even if we just use a small number of queues, like 1 or 2,
we still occupy 80 vectors. That is rather wasteful, and it makes it
easy to exhaust the IRQ vector resource.

IRQD_AFFINITY_MANAGED is not the problem by itself, but many devices
all using IRQD_AFFINITY_MANAGED will be a problem.

Thanks.

> Thanks,
> 
>         tglx


* Re: IRQ affinity problem from virtio_blk
  2022-11-16 10:43           ` Thomas Gleixner
@ 2022-11-16 11:35             ` Ming Lei
  2022-11-16 13:06               ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2022-11-16 11:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michael S. Tsirkin, Angus Chen, linux-kernel@vger.kernel.org,
	Jason Wang

On Wed, Nov 16, 2022 at 11:43:24AM +0100, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 18:36, Michael S. Tsirkin wrote:
> > On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> >> I just checked on a random VM. The PCI device as advertised to the guest
> >> does not expose that many vectors. One has 2 and the other 4.
> >> 
> >> But as the interrupts are requested 'managed' the core ends up setting
> >> the vectors aside. That's a fundamental property of managed interrupts.
> >> 
> >> Assume you have less queues than CPUs, which is the case with 2 vectors
> >> and tons of CPUs, i.e. one ends up for config and the other for the
> >> actual queue. So the affinity spreading code will end up having the full
> >> cpumask for the queue vector, which is marked managed. And managed means
> >> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> >> be migrated to a still online CPU.
> >> 
> >> So we end up setting 79 vectors aside (one per CPU) in the case that the
> >> virtio device only provides two vectors.
> >> 
> >> But that's not the end of the world as you really would need ~200 such
> >> devices to exhaust the vector space...
> >
> > Let's say we have 20 queues - then just 10 devices will exhaust the
> > vector space right?
> 
> No.
> 
> If you have 20 queues then the queues are spread out over the
> CPUs. Assume 80 CPUs:
> 
> Then each queue is associated to 80/20 = 4 CPUs and the resulting
> affinity mask of each queue contains exactly 4 CPUs:
> 
> q0:      0 -  3
> q1:      4 -  7
> ...
> q19:    76 - 79
> 
> So this puts exactly 80 vectors aside, one per CPU.
> 
> As long as at least one CPU of a queue mask is online the queue is
> enabled. If the last CPU of a queue mask goes offline then the queue is
> shutdown which means the interrupt associated to the queue is shut down
> too. That's all handled by the block MQ and the interrupt core. If a CPU
> of a queue mask comes back online then the guaranteed vector is
> allocated again.
> 
> So it does not matter how many queues per device you have: it will
> reserve exactly ONE interrupt per CPU.
> 
> Ergo you need 200 devices to exhaust the vector space.

Hi Thomas,

I am wondering why one interrupt needs to be reserved for each CPU;
in theory one queue needs only one irq, as I understand it, so would
you mind explaining the story a bit?


Thanks,
Ming


* Re: IRQ affinity problem from virtio_blk
  2022-11-16 11:35             ` Ming Lei
@ 2022-11-16 13:06               ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 13:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Michael S. Tsirkin, Angus Chen, linux-kernel@vger.kernel.org,
	Jason Wang

On Wed, Nov 16 2022 at 19:35, Ming Lei wrote:
> On Wed, Nov 16, 2022 at 11:43:24AM +0100, Thomas Gleixner wrote:
>> > Let's say we have 20 queues - then just 10 devices will exhaust the
>> > vector space right?
>> 
>> No.
>> 
>> If you have 20 queues then the queues are spread out over the
>> CPUs. Assume 80 CPUs:
>> 
>> Then each queue is associated to 80/20 = 4 CPUs and the resulting
>> affinity mask of each queue contains exactly 4 CPUs:
>> 
>> q0:      0 -  3
>> q1:      4 -  7
>> ...
>> q19:    76 - 79
>> 
>> So this puts exactly 80 vectors aside, one per CPU.
>> 
>> As long as at least one CPU of a queue mask is online the queue is
>> enabled. If the last CPU of a queue mask goes offline then the queue is
>> shutdown which means the interrupt associated to the queue is shut down
>> too. That's all handled by the block MQ and the interrupt core. If a CPU
>> of a queue mask comes back online then the guaranteed vector is
>> allocated again.
>> 
>> So it does not matter how many queues per device you have: it will
>> reserve exactly ONE interrupt per CPU.
>> 
>> Ergo you need 200 devices to exhaust the vector space.
>
> I am wondering why one interrupt needs to be reserved for each CPU;
> in theory one queue needs only one irq, as I understand it, so would
> you mind explaining the story a bit?

It's only one interrupt per queue. Interrupt != vector.

The guarantee of managed interrupts always was that if there are less
queues than CPUs that CPU hotunplug cannot result in vector exhaustion.

Therefore we differentiate between managed and non-managed
interrupts. Managed have a guaranteed reservation, non-managed do not.

That's been a very deliberate design decision from the very beginning.
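A toy model of that accounting, a hypothetical simplification rather
than the real kernel/irq/matrix.c logic, might look like this:

```python
class VectorMatrixModel:
    """Toy model of x86 vector-matrix accounting for managed vs.
    non-managed interrupts (hypothetical simplification)."""

    def __init__(self, cpus, vectors_per_cpu):
        self.available = cpus * vectors_per_cpu

    def alloc_managed(self, cpumask_size):
        # Managed: a vector is *reserved* on every CPU in the affinity
        # mask at setup time, so a later hotplug migration cannot fail.
        if cpumask_size > self.available:
            raise MemoryError("no space (-ENOSPC)")
        self.available -= cpumask_size

    def alloc_nonmanaged(self):
        # Non-managed: only one vector, on the current target CPU.
        if self.available < 1:
            raise MemoryError("no space (-ENOSPC)")
        self.available -= 1

m = VectorMatrixModel(cpus=80, vectors_per_cpu=200)
m.alloc_managed(80)     # single-queue device: mask spans all 80 CPUs
m.alloc_nonmanaged()    # its config interrupt
print(m.available)      # 16000 - 80 - 1 = 15919
```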

Thanks,

        tglx


* RE: IRQ affinity problem from virtio_blk
  2022-11-16 11:24               ` Angus Chen
@ 2022-11-16 13:27                 ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 13:27 UTC (permalink / raw)
  To: Angus Chen, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang,
	Christoph Hellwig, Jens Axboe

On Wed, Nov 16 2022 at 11:24, Angus Chen wrote:
>> >> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
>> > If there is any other information I need to provide, please tell me.
>> 
>> A sensible use case for 180+ virtio block devices in a single guest.
>> 
> Our card can provide more than 512 virtio_blk devices.
> Each virtio_blk device is passed through to one container, like docker.

I'm not sure whether that's sensible, but that's how your hardware is
designed. You could have provided this information upfront instead of
random memory dumps of the irq matrix internals.

> So we need that many devices.
> In the first patch, I removed the IRQD_AFFINITY_MANAGED flag in virtio_blk.

There is no IRQD_AFFINITY_MANAGED in virtio_blk. That flag is internal
to the interrupt core code and you can neither delete it nor fiddle with
it from inside virtio_blk.

You can do that in your private kernel, but that's not an option for
mainline as it will break existing setups and it's fundamentally wrong.

The block-mq code has assumptions about the semantics of managed
interrupts. It happens to work for the single queue case because that
always ends up with queue affinity == cpu_possible_mask.

For anything else which assigns the queues to partitions of the CPU
space it definitely expects the semantics of managed interrupts.

> As you know, even if we just use a small number of queues, like 1 or
> 2, we still occupy 80 vectors. That is rather wasteful, and it makes
> it easy to exhaust the IRQ vector resource.

We know that by now. No point in repeating this over and over. Aside of
that it's not that easy because this is the first time within 5 years
that someone ran into this problem.

The real question is how to solve this proper without creating problems
for other scenarios. That needs involvment of the blk-mq people.

Thanks,

        tglx


end of thread, other threads:[~2022-11-16 13:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-15  3:40 IRQ affinity problem from virtio_blk Angus Chen
2022-11-15 22:19 ` Thomas Gleixner
2022-11-15 22:44   ` Michael S. Tsirkin
2022-11-15 23:04     ` Thomas Gleixner
2022-11-15 23:24       ` Thomas Gleixner
2022-11-15 23:36         ` Michael S. Tsirkin
2022-11-16  1:02           ` Angus Chen
2022-11-16 10:55             ` Thomas Gleixner
2022-11-16 11:24               ` Angus Chen
2022-11-16 13:27                 ` Thomas Gleixner
2022-11-16 10:43           ` Thomas Gleixner
2022-11-16 11:35             ` Ming Lei
2022-11-16 13:06               ` Thomas Gleixner
2022-11-16  0:46         ` Angus Chen
2022-11-16 10:54           ` Thomas Gleixner
