* IRQ affinity problem from virtio_blk
@ 2022-11-15 3:40 Angus Chen
2022-11-15 22:19 ` Thomas Gleixner
0 siblings, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-15 3:40 UTC (permalink / raw)
To: tglx@linutronix.de
Cc: linux-kernel@vger.kernel.org, Michael S. Tsirkin, Ming Lei,
Jason Wang
Hi All.
I tested Linux 6.1 and found that virtio_blk requests its interrupts with IRQD_AFFINITY_MANAGED (managed irq affinity).
The machine has 80 CPUs across two NUMA nodes.
Before probing one virtio_blk device:
crash_cts> p *vector_matrix
$44 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 15354,
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 411,
online_maps = 80,
maps = 0x46100,
scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}
After probing one virtio_blk device:
crash_cts> p *vector_matrix
$45 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 15273,
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 413,
online_maps = 80,
maps = 0x46100,
scratch_map = {25769803776, 0, 0, 14680064},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}
We can see global_available drop from 15354 to 15273, a difference of 81.
And total_allocated increases from 411 to 413: one config irq and one vq irq.
It is easy to exhaust the irq resources, because there can be more than 512 virtio_blk devices.
I read the irq matrix code; with IRQD_AFFINITY_MANAGED set, this reservation is intentional.
If we exhaust the irqs, per_vq_vectors allocation breaks, so 'virtblk_map_queues' will
finally fall back to blk_mq_map_queues.
Or even if we do not exhaust the irqs, but simply use more irq bits on one CPU than on others,
the IRQD_AFFINITY_MANAGED allocation can fail too, because it is not balanced.
I'm not a native English speaker, any suggestion will be appreciated.
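The arithmetic in the dumps above (a drop of 81 for one device on an 80-CPU box) can be modeled in a few lines of Python. This is only a sketch of the accounting; the function name and the "one reservation per CPU" simplification are illustrative, not kernel code:

```python
def global_available_drop(nr_cpus, nr_managed_queues=1):
    """Model the drop in vector_matrix.global_available when one
    virtio_blk device is probed.

    Assumptions (a back-of-the-envelope sketch, not the real allocator):
    - the config interrupt is non-managed and costs one vector,
    - managed queue interrupts reserve one guaranteed vector per CPU
      whenever there are fewer queues than CPUs.
    """
    config_irq = 1
    managed_reserved = nr_cpus if nr_managed_queues < nr_cpus else nr_managed_queues
    return config_irq + managed_reserved

# 80 CPUs, one I/O queue: matches 15354 - 15273
print(global_available_drop(80))  # 81
```

Note that total_allocated only grows by 2 (config plus the single queue that actually ran request_irq()); the other 79 vectors are reserved but not yet allocated.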
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: IRQ affinity problem from virtio_blk
  2022-11-15  3:40 IRQ affinity problem from virtio_blk Angus Chen
@ 2022-11-15 22:19 ` Thomas Gleixner
  2022-11-15 22:44   ` Michael S. Tsirkin
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-15 22:19 UTC (permalink / raw)
To: Angus Chen
Cc: linux-kernel@vger.kernel.org, Michael S. Tsirkin, Ming Lei,
	Jason Wang

On Tue, Nov 15 2022 at 03:40, Angus Chen wrote:
> Before probe one virtio_blk.
> crash_cts> p *vector_matrix
> $44 = {
> matrix_bits = 256,
> alloc_start = 32,
> alloc_end = 236,
> alloc_size = 204,
> global_available = 15354,
> global_reserved = 154,
> systembits_inalloc = 3,
> total_allocated = 411,
> online_maps = 80,
> maps = 0x46100,
> scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
> system_map = {1125904739729407, 0, 1, 18435221191850459136}
> }
> After probe one virtio_blk.
> crash_cts> p *vector_matrix
> $45 = {
> matrix_bits = 256,
> alloc_start = 32,
> alloc_end = 236,
> alloc_size = 204,
> global_available = 15273,
> global_reserved = 154,
> systembits_inalloc = 3,
> total_allocated = 413,
> online_maps = 80,
> maps = 0x46100,
> scratch_map = {25769803776, 0, 0, 14680064},
> system_map = {1125904739729407, 0, 1, 18435221191850459136}
> }
>
> We can see global_available drop from 15354 to 15273, is 81.
> And the total_allocated increase from 411 to 413. One config irq,and
> one vq irq.

Right. That's perfectly fine. At the point where you are looking at it,
the matrix allocator has given out 2 vectors, as can be seen via
total_allocated.

But then it also has another 79 vectors put aside for the other queues,
but those queues have not yet requested the interrupts, so there is no
allocation yet. But the vectors are guaranteed to be available when
request_irq() for those queues runs, which does the actual allocation.

Btw, you can enable CONFIG_GENERIC_IRQ_DEBUGFS and then look at the
content of /sys/kernel/debug/irq/domains/VECTOR, which gives you a very
clear picture of what's going on. No need for gdb.

> It is easy to expend the irq resource ,because virtio_blk device could
> be more than 512.

How so? virtio_blk allocates a config interrupt and one queue interrupt
per CPU. So in your case a total of 81.

How would you exhaust the vector space? Each CPU has about ~200 (in your
case exactly 204) vectors which can be handed out to devices. You'd need
to instantiate about 200 virtio_blk devices to get to the point of
vector exhaustion.

So what are you actually worried about and which problem are you trying
to solve?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: IRQ affinity problem from virtio_blk
  2022-11-15 22:19 ` Thomas Gleixner
@ 2022-11-15 22:44   ` Michael S. Tsirkin
  2022-11-15 23:04     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Michael S. Tsirkin @ 2022-11-15 22:44 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

Thanks Thomas, I have a question:

On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 03:40, Angus Chen wrote:
> > Before probe one virtio_blk.
> > crash_cts> p *vector_matrix
> > $44 = {
> > matrix_bits = 256,
> > alloc_start = 32,
> > alloc_end = 236,
> > alloc_size = 204,
> > global_available = 15354,
> > global_reserved = 154,
> > systembits_inalloc = 3,
> > total_allocated = 411,
> > online_maps = 80,
> > maps = 0x46100,
> > scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
> > system_map = {1125904739729407, 0, 1, 18435221191850459136}
> > }
> > After probe one virtio_blk.
> > crash_cts> p *vector_matrix
> > $45 = {
> > matrix_bits = 256,
> > alloc_start = 32,
> > alloc_end = 236,
> > alloc_size = 204,
> > global_available = 15273,
> > global_reserved = 154,
> > systembits_inalloc = 3,
> > total_allocated = 413,
> > online_maps = 80,
> > maps = 0x46100,
> > scratch_map = {25769803776, 0, 0, 14680064},
> > system_map = {1125904739729407, 0, 1, 18435221191850459136}
> > }
> >
> > We can see global_available drop from 15354 to 15273, is 81.
> > And the total_allocated increase from 411 to 413. One config irq,and
> > one vq irq.
>
> Right. That's perfectly fine. At the point where you looking at it, the
> matrix allocator has given out 2 vectors as can be seen via
> total_allocated.
>
> But then it also has another 79 vectors put aside for the other queues,

What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?

> but those queues have not yet requested the interrupts so there is no
> allocation yet. But the vectors are guaranteed to be available when
> request_irq() for those queues runs, which does the actual allocation.
>
> Btw, you can enable CONFIG_GENERIC_IRQ_DEBUGFS and then look at the
> content of /sys/kernel/debug/irq/domain/VECTOR which gives you a very
> clear picture of what's going on. No need for gdb.
>
> > It is easy to expend the irq resource ,because virtio_blk device could
> > be more than 512.
>
> How so? virtio_blk allocates a config interrupt and one queue interrupt
> per CPU. So in your case a total of 81.
>
> How would you exhaust the vector space? Each CPU has about ~200 (in your
> case exactly 204) vectors which can be handed out to devices. You'd need
> to instantiate about 200 virtio_blk devices to get to the point of
> vector exhaustion.
>
> So what are you actually worried about and which problem are you trying
> to solve?
>
> Thanks,
>
> tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: IRQ affinity problem from virtio_blk
  2022-11-15 22:44 ` Michael S. Tsirkin
@ 2022-11-15 23:04   ` Thomas Gleixner
  2022-11-15 23:24     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-15 23:04 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
>> > We can see global_available drop from 15354 to 15273, is 81.
>> > And the total_allocated increase from 411 to 413. One config irq,and
>> > one vq irq.
>>
>> Right. That's perfectly fine. At the point where you looking at it, the
>> matrix allocator has given out 2 vectors as can be seen via
>> total_allocated.
>>
>> But then it also has another 79 vectors put aside for the other queues,
>
> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?

init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
  vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()

init_vq() hands in a struct irq_affinity which means that
pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
for config and one per queue if vp_request_msix_vectors() is invoked
with per_vq_vectors == true, which is what the first invocation in
vp_find_vqs() does.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: IRQ affinity problem from virtio_blk
  2022-11-15 23:04 ` Thomas Gleixner
@ 2022-11-15 23:24   ` Thomas Gleixner
  2022-11-15 23:36     ` Michael S. Tsirkin
  2022-11-16  0:46     ` Angus Chen
  0 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-15 23:24 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
>> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
>>> > We can see global_available drop from 15354 to 15273, is 81.
>>> > And the total_allocated increase from 411 to 413. One config irq,and
>>> > one vq irq.
>>>
>>> Right. That's perfectly fine. At the point where you looking at it, the
>>> matrix allocator has given out 2 vectors as can be seen via
>>> total_allocated.
>>>
>>> But then it also has another 79 vectors put aside for the other queues,
>>
>> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
>
> init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
>
> init_vq() hands in a struct irq_affinity which means that
> pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> for config and one per queue if vp_request_msix_vectors() is invoked
> with per_vq_vectors == true, which is what the first invocation in
> vp_find_vqs() does.

I just checked on a random VM. The PCI device as advertised to the guest
does not expose that many vectors. One has 2 and the other 4.

But as the interrupts are requested 'managed' the core ends up setting
the vectors aside. That's a fundamental property of managed interrupts.

Assume you have less queues than CPUs, which is the case with 2 vectors
and tons of CPUs, i.e. one ends up for config and the other for the
actual queue. So the affinity spreading code will end up having the full
cpumask for the queue vector, which is marked managed. And managed means
that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
be migrated to a still online CPU.

So we end up setting 79 vectors aside (one per CPU) in the case that the
virtio device only provides two vectors.

But that's not the end of the world as you really would need ~200 such
devices to exhaust the vector space...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: IRQ affinity problem from virtio_blk
  2022-11-15 23:24 ` Thomas Gleixner
@ 2022-11-15 23:36   ` Michael S. Tsirkin
  2022-11-16  1:02     ` Angus Chen
  2022-11-16 10:43     ` Thomas Gleixner
  1 sibling, 2 replies; 15+ messages in thread
From: Michael S. Tsirkin @ 2022-11-15 23:36 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>
> > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> >>> > We can see global_available drop from 15354 to 15273, is 81.
> >>> > And the total_allocated increase from 411 to 413. One config irq,and
> >>> > one vq irq.
> >>>
> >>> Right. That's perfectly fine. At the point where you looking at it, the
> >>> matrix allocator has given out 2 vectors as can be seen via
> >>> total_allocated.
> >>>
> >>> But then it also has another 79 vectors put aside for the other queues,
> >>
> >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> >
> > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> >
> > init_vq() hands in a struct irq_affinity which means that
> > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > for config and one per queue if vp_request_msix_vectors() is invoked
> > with per_vq_vectors == true, which is what the first invocation in
> > vp_find_vqs() does.
>
> I just checked on a random VM. The PCI device as advertised to the guest
> does not expose that many vectors. One has 2 and the other 4.
>
> But as the interrupts are requested 'managed' the core ends up setting
> the vectors aside. That's a fundamental property of managed interrupts.
>
> Assume you have less queues than CPUs, which is the case with 2 vectors
> and tons of CPUs, i.e. one ends up for config and the other for the
> actual queue. So the affinity spreading code will end up having the full
> cpumask for the queue vector, which is marked managed. And managed means
> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> be migrated to a still online CPU.
>
> So we end up setting 79 vectors aside (one per CPU) in the case that the
> virtio device only provides two vectors.
>
> But that's not the end of the world as you really would need ~200 such
> devices to exhaust the vector space...
>
> Thanks,
>
>         tglx

Let's say we have 20 queues - then just 10 devices will exhaust the
vector space right?

-- 
MST

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: IRQ affinity problem from virtio_blk
  2022-11-15 23:36 ` Michael S. Tsirkin
@ 2022-11-16  1:02   ` Angus Chen
  2022-11-16 10:55     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-16 1:02 UTC (permalink / raw)
To: Michael S. Tsirkin, Thomas Gleixner
Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 16, 2022 7:37 AM
> To: Thomas Gleixner <tglx@linutronix.de>
> Cc: Angus Chen <angus.chen@jaguarmicro.com>; linux-kernel@vger.kernel.org;
> Ming Lei <ming.lei@redhat.com>; Jason Wang <jasowang@redhat.com>
> Subject: Re: IRQ affinity problem from virtio_blk
>
> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> > On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
> >
> > > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> > >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> > >>> > We can see global_available drop from 15354 to 15273, is 81.
> > >>> > And the total_allocated increase from 411 to 413. One config irq,and
> > >>> > one vq irq.
> > >>>
> > >>> Right. That's perfectly fine. At the point where you looking at it, the
> > >>> matrix allocator has given out 2 vectors as can be seen via
> > >>> total_allocated.
> > >>>
> > >>> But then it also has another 79 vectors put aside for the other queues,
> > >>
> > >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> > >
> > > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> > >
> > > init_vq() hands in a struct irq_affinity which means that
> > > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > > for config and one per queue if vp_request_msix_vectors() is invoked
> > > with per_vq_vectors == true, which is what the first invocation in
> > > vp_find_vqs() does.
> >
> > I just checked on a random VM. The PCI device as advertised to the guest
> > does not expose that many vectors. One has 2 and the other 4.
> >
> > But as the interrupts are requested 'managed' the core ends up setting
> > the vectors aside. That's a fundamental property of managed interrupts.
> >
> > Assume you have less queues than CPUs, which is the case with 2 vectors
> > and tons of CPUs, i.e. one ends up for config and the other for the
> > actual queue. So the affinity spreading code will end up having the full
> > cpumask for the queue vector, which is marked managed. And managed means
> > that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> > be migrated to a still online CPU.
> >
> > So we end up setting 79 vectors aside (one per CPU) in the case that the
> > virtio device only provides two vectors.
> >
> > But that's not the end of the world as you really would need ~200 such
> > devices to exhaust the vector space...

Provide the VECTOR information:

[root@localhost domains]# cat VECTOR
name:   VECTOR
 size:   0
 mapped: 2015
 flags:  0x00000003
Online bitmaps:       80
Global available:      0
Global reserved:     154
Total allocated:    1861
System: 39: 0-19,29,32,50,128,236,240-242,244,246-255
 | CPU | avl | man | mac | act | vectors
     0    0   180    2    23  33-46,48,110,132,162,185,206-207,228-229
     1    0   180    2    23  33-37,41,44,124,134,156-157,167,180,186-187,198,225,228-233
     2    0   180    2    23  33-40,123,154-155,164,177,186,202,221-224,227,232-233,235
     3    0   180    2    23  33-36,70,123-124,140,156,168,174,197,199,201,207,225-228,232-235
     4    0   180    2    23  33-39,101,122,133,147,207,217-221,227-228,231,233-235
     5    0   180    2    23  33-38,83,115,156,165-166,177,207-209,220-222,228,231-234
     6    0   180    2    23  33-38,55,91,146,154,160,164,187-188,209,217-218,221-222,232-235
     7    0   180    2    23  33-37,81-82,113,145,154,186-188,207,221-224,226,229,232,234-235
     8    0   180    2    23  33-37,81,91,148-149,189,198-199,201,210,217-218,222,229-232,234-235
     9    0   180    2    23  33-38,59,133,146,157,165,174,196,205,207,220-221,225-226,232-235
    10    0   180    2    23  33-36,87,133-134,142,174,188,198-199,206,214,217-220,228-230,234-235
    11    0   180    2    23  33-35,83,94,113,127,129,157,187-188,209,219-224,229-230,233-235
    12    0   180    2    23  33-34,36,55,113-114,129,158-159,168,175,189-190,197,208-209,219-220,227,232-235
    13    0   180    2    23  33-34,37-38,83,94,156-158,186-187,207,221-222,225-227,230-235
    14    0   180    2    23  33-35,43,70,101-102,170,175-177,215,217-218,220,226-230,232-233,235
    15    0   180    2    23  33-35,104,112,134,144,158,167-168,170,175-176,187,198,208,221-222,228-229,233-235
    16    0   180    2    23  34-36,71,91,146,155-156,189-190,217-219,223,225-228,231-235
    17    0   180    2    23  33-34,49,92,101,134,144,187,195-197,207-209,216-217,221,230-235
    18    0   180    2    23  33-34,135-136,146,174,198,206-209,217,224-231,233-235
    19    0   180    2    23  33-34,58,91,101,113,122,135,165,197-199,206,221-223,228-229,231-235
    20    0   180    2    23  33-34,215-235
    21    0   180    2    23  33-34,214,216-235
    22    0   180    2    23  33-34,215-235
    23    0   180    2    23  33-34,215-235
    24    0   180    2    23  33-35,216-235
    25    0   180    2    23  33-35,216-235
    26    0   180    2    23  33-35,216-235
    27    0   180    2    23  33-35,216-235
    28    0   180    2    23  33-35,216-235
    29    0   180    2    23  33-35,216-235
    30    0   180    2    23  33-35,216-235
    31    0   180    2    23  33-35,216-235
    32    0   180    2    23  33-34,215-235
    33    0   180    2    23  33-34,215-235
    34    0   180    2    23  33-34,215-235
    35    0   180    2    23  33-34,215-235
    36    0   180    2    23  33-34,211,216-235
    37    0   180    2    23  33-34,215-235
    38    0   180    2    23  33-34,215-235
    39    0   180    2    23  33-34,215-235
    40    0   180    2    23  33-34,56,65,134,170,176-178,207-210,225-229,231-235
    41    0   180    2    23  33-34,54,113,135-137,143,169,195-198,216-217,224,228-230,232-235
    42    0   180    2    23  33,36,57,111-112,126,164,175-176,199-200,207-210,225-226,230-235
    43    0   180    2    23  33-34,70,82,133-135,145,155,166,174,188-189,207,209,218,226-229,233-235
    44    0   180    2    23  33-34,59,103,111,126,166-167,185-186,207-208,217-218,226-232,234-235
    45    0   180    2    23  33,35-36,81,106,145-146,165,176,187,195,220-221,226-235
    46    0   180    2    23  33-34,69,137,143,155,176,180,185-187,197,206-207,212-213,225-228,230,234-235
    47    0   180    2    23  34,36,71,91-92,103-104,143,165,179,185-186,195,208-209,220-221,230-235
    48    0   180    2    23  33-34,36,93,122,157,174,186-188,198,208-209,225,227-235
    49    0   180    2    23  34-35,132-133,147-148,156,176-177,194-197,212,226-228,230-235
    50    0   180    2    23  33-34,45,123,138,162,164-166,195-196,208-209,219,224-226,228,230-231,233-235
    51    0   180    2    23  33-34,55,69-70,110,167,179-181,197-198,217-220,228-230,232-235
    52    0   180    2    23  33-34,70,132,145,156,178,186-188,190,210-212,218-219,228-230,232-235
    53    0   180    2    23  33,35,70,111,144,194-195,197,209,216-219,224,226-231,233-235
    54    0   180    2    23  33-34,102,115,147,154,164-166,181,188,200,210-211,219-220,228-229,231-235
    55    0   180    2    23  33-36,55,114,154-156,174,187,198,207-209,224-225,227-229,233-235
    56    0   180    2    23  33-34,54,104,113,132,154,175,188,209,216-221,226-227,230-233,235
    57    0   180    2    23  34-35,47,100,127,132-133,176-178,196-197,208,220,224-226,230-235
    58    0   180    2    23  34,37,42,100,110-111,143,164-165,185,198,206-208,216-218,228-229,231,233-235
    59    0   180    3    24  33-35,39,43,81-82,111,126,164-165,184,186,211-212,219-221,223,231-235
    60    0   180    3    24  33-35,215-235
    61    0   180    3    24  33-35,215-235
    62    0   180    3    24  33-35,215-235
    63    0   180    3    24  33-35,215-235
    64    0   180    3    24  33-35,215-235
    65    0   180    3    24  33-35,215-235
    66    0   180    3    24  33-35,215-235
    67    0   180    3    24  33-35,215-235
    68    0   180    3    24  33-35,211,216-235
    69    0   180    3    24  33-35,215-235
    70    0   180    3    24  33-35,215-235
    71    0   180    3    24  33-35,215-235
    72    0   180    3    24  33-35,215-235
    73    0   180    3    24  33-35,215-235
    74    0   180    3    24  33-35,215-235
    75    0   180    3    24  33-35,215-235
    76    0   180    3    24  33-35,215-235
    77    0   180    3    24  33-35,215-235
    78    0   180    3    24  33-35,214,216-235
    79    0   180    3    24  33-35,215-235

crash_cts> p *vector_matrix
$98 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 0,
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 1861,
online_maps = 80,
maps = 0x46100,
scratch_map = {18446744069952503807, 18446744073709551615, 18446744073709551615, 18435229987943481343},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}

Any other information I need to provide,pls tell me.

Thanks.

> >
> > Thanks,
> >
> > tglx
>
> Let's say we have 20 queues - then just 10 devices will exhaust the
> vector space right?
>
> --
> MST

^ permalink raw reply	[flat|nested] 15+ messages in thread
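The dump above is consistent with Thomas's earlier ~200-device estimate: each CPU has 204 allocatable vectors, roughly 24 of them are active for other purposes, so about 180 single-queue managed devices (man = 180) drive avl to 0. A rough Python model of that accounting (the function and its even-reservation assumption are illustrative, not the real allocator):

```python
def devices_until_exhaustion(per_cpu_vectors, per_cpu_in_use,
                             reserved_per_device=1):
    """Estimate how many managed-irq devices fit before the per-CPU
    vector pool is exhausted.  Each device with fewer queues than CPUs
    reserves one guaranteed vector on every CPU."""
    return (per_cpu_vectors - per_cpu_in_use) // reserved_per_device

# 204 vectors per CPU, ~24 already active per CPU as in the dump above
print(devices_until_exhaustion(204, 24))  # 180, matching man = 180
```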
* RE: IRQ affinity problem from virtio_blk
  2022-11-16  1:02 ` Angus Chen
@ 2022-11-16 10:55   ` Thomas Gleixner
  2022-11-16 11:24     ` Angus Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 10:55 UTC (permalink / raw)
To: Angus Chen, Michael S. Tsirkin
Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16 2022 at 01:02, Angus Chen wrote:
>> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> Any other information I need to provide,pls tell me.

A sensible use case for 180+ virtio block devices in a single guest.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: IRQ affinity problem from virtio_blk
  2022-11-16 10:55 ` Thomas Gleixner
@ 2022-11-16 11:24   ` Angus Chen
  2022-11-16 13:27     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-16 11:24 UTC (permalink / raw)
To: Thomas Gleixner, Michael S. Tsirkin
Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

> -----Original Message-----
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, November 16, 2022 6:56 PM
> To: Angus Chen <angus.chen@jaguarmicro.com>; Michael S. Tsirkin
> <mst@redhat.com>
> Cc: linux-kernel@vger.kernel.org; Ming Lei <ming.lei@redhat.com>; Jason
> Wang <jasowang@redhat.com>
> Subject: RE: IRQ affinity problem from virtio_blk
>
> On Wed, Nov 16 2022 at 01:02, Angus Chen wrote:
> >> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> > Any other information I need to provide,pls tell me.
>
> A sensible use case for 180+ virtio block devices in a single guest.

Our card can provide more than 512 virtio_blk devices. One virtio_blk
device is passed through to one container, like docker, so we need that
many devices.

In the first patch, I removed IRQD_AFFINITY_MANAGED in virtio_blk.

As you know, even if we use a small number of queues, like 1 or 2, we
still occupy 80 vectors. That is a kind of waste, and it makes it easy
to exhaust the irq resources.

IRQD_AFFINITY_MANAGED is not the problem, but many devices all using
IRQD_AFFINITY_MANAGED will be a problem.

Thanks.

> Thanks,
>
> tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: IRQ affinity problem from virtio_blk
  2022-11-16 11:24 ` Angus Chen
@ 2022-11-16 13:27   ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 13:27 UTC (permalink / raw)
To: Angus Chen, Michael S. Tsirkin
Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang,
	Christoph Hellwig, Jens Axboe

On Wed, Nov 16 2022 at 11:24, Angus Chen wrote:
>> >> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
>> > Any other information I need to provide,pls tell me.
>>
>> A sensible use case for 180+ virtio block devices in a single guest.
>
> Our card can provide more than 512 virtio_blk devices .
> one virtio_blk passthrough to one container,like docker.

I'm not sure whether that's sensible, but that's how your hardware is
designed.

You could have provided this information upfront instead of random
memory dumps of the irq matrix internals.

> So we need so much devices.
> In the first patch ,I del the IRQD_AFFINITY_MANAGED in virtio_blk .

There is no IRQD_AFFINITY_MANAGED in virtio_blk. That flag is internal
to the interrupt core code and you can neither delete it nor fiddle
with it from inside virtio_blk.

You can do that in your private kernel, but that's not an option for
mainline as it will break existing setups and it's fundamentally wrong.

The block-mq code has assumptions about the semantics of managed
interrupts. It happens to work for the single queue case because that
always ends up with queue affinity == cpu_possible_mask. For anything
else which assigns the queues to partitions of the CPU space it
definitely expects the semantics of managed interrupts.

> As you know, if we just use small queues number ,like 1or 2,we Still
> occupy 80 vector ,that is kind of waste,and it is easy to eahausted
> the Irq resource.

We know that by now. No point in repeating this over and over.

Aside of that it's not that easy because this is the first time within
5 years that someone ran into this problem. The real question is how to
solve this proper without creating problems for other scenarios. That
needs involvement of the blk-mq people.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: IRQ affinity problem from virtio_blk
  2022-11-15 23:36 ` Michael S. Tsirkin
  2022-11-16  1:02   ` Angus Chen
@ 2022-11-16 10:43   ` Thomas Gleixner
  2022-11-16 11:35     ` Ming Lei
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 10:43 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Angus Chen, linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Tue, Nov 15 2022 at 18:36, Michael S. Tsirkin wrote:
> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
>> I just checked on a random VM. The PCI device as advertised to the guest
>> does not expose that many vectors. One has 2 and the other 4.
>>
>> But as the interrupts are requested 'managed' the core ends up setting
>> the vectors aside. That's a fundamental property of managed interrupts.
>>
>> Assume you have less queues than CPUs, which is the case with 2 vectors
>> and tons of CPUs, i.e. one ends up for config and the other for the
>> actual queue. So the affinity spreading code will end up having the full
>> cpumask for the queue vector, which is marked managed. And managed means
>> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
>> be migrated to a still online CPU.
>>
>> So we end up setting 79 vectors aside (one per CPU) in the case that the
>> virtio device only provides two vectors.
>>
>> But that's not the end of the world as you really would need ~200 such
>> devices to exhaust the vector space...
>
> Let's say we have 20 queues - then just 10 devices will exhaust the
> vector space right?

No.

If you have 20 queues then the queues are spread out over the
CPUs. Assume 80 CPUs:

Then each queue is associated to 80/20 = 4 CPUs and the resulting
affinity mask of each queue contains exactly 4 CPUs:

   q0:    0 -  3
   q1:    4 -  7
   ...
   q19:  76 - 79

So this puts exactly 80 vectors aside, one per CPU.

As long as at least one CPU of a queue mask is online the queue is
enabled. If the last CPU of a queue mask goes offline then the queue is
shutdown which means the interrupt associated to the queue is shut down
too. That's all handled by the block MQ and the interrupt core. If a CPU
of a queue mask comes back online then the guaranteed vector is
allocated again.

So it does not matter how many queues per device you have it will
reserve exactly ONE interrupt per CPU.

Ergo you need 200 devices to exhaust the vector space.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
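The spreading described above can be sketched in a few lines of Python. This is an even-split model of the 80-CPU/20-queue example only (the function name is mine; the real spreading code also handles remainders and NUMA locality):

```python
def spread_queues(nr_cpus, nr_queues):
    """Divide the CPUs evenly among the queues, as in the 80/20 example.
    Assumes nr_queues divides nr_cpus; a sketch, not the kernel code."""
    per_queue = nr_cpus // nr_queues
    return [list(range(q * per_queue, (q + 1) * per_queue))
            for q in range(nr_queues)]

masks = spread_queues(80, 20)
print(masks[0])   # [0, 1, 2, 3]       -> q0:  CPUs 0-3
print(masks[19])  # [76, 77, 78, 79]   -> q19: CPUs 76-79
# one reserved vector per CPU in each mask: 80 total, whatever the queue count
print(sum(len(m) for m in masks))  # 80
```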
* Re: IRQ affinity problem from virtio_blk
  2022-11-16 10:43 ` Thomas Gleixner
@ 2022-11-16 11:35   ` Ming Lei
  2022-11-16 13:06     ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2022-11-16 11:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Michael S. Tsirkin, Angus Chen, linux-kernel@vger.kernel.org,
	Jason Wang

On Wed, Nov 16, 2022 at 11:43:24AM +0100, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 18:36, Michael S. Tsirkin wrote:
> > On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> >> I just checked on a random VM. The PCI device as advertised to the guest
> >> does not expose that many vectors. One has 2 and the other 4.
> >>
> >> But as the interrupts are requested 'managed' the core ends up setting
> >> the vectors aside. That's a fundamental property of managed interrupts.
> >>
> >> Assume you have less queues than CPUs, which is the case with 2 vectors
> >> and tons of CPUs, i.e. one ends up for config and the other for the
> >> actual queue. So the affinity spreading code will end up having the full
> >> cpumask for the queue vector, which is marked managed. And managed means
> >> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> >> be migrated to a still online CPU.
> >>
> >> So we end up setting 79 vectors aside (one per CPU) in the case that the
> >> virtio device only provides two vectors.
> >>
> >> But that's not the end of the world as you really would need ~200 such
> >> devices to exhaust the vector space...
> >
> > Let's say we have 20 queues - then just 10 devices will exhaust the
> > vector space right?
>
> No.
>
> If you have 20 queues then the queues are spread out over the
> CPUs. Assume 80 CPUs:
>
> Then each queue is associated to 80/20 = 4 CPUs and the resulting
> affinity mask of each queue contains exactly 4 CPUs:
>
> q0: 0 - 3
> q1: 4 - 7
> ...
> q19: 76 - 79
>
> So this puts exactly 80 vectors aside, one per CPU.
>
> As long as at least one CPU of a queue mask is online the queue is
> enabled. If the last CPU of a queue mask goes offline then the queue is
> shutdown which means the interrupt associated to the queue is shut down
> too. That's all handled by the block MQ and the interrupt core. If a CPU
> of a queue mask comes back online then the guaranteed vector is
> allocated again.
>
> So it does not matter how many queues per device you have it will
> reserve exactly ONE interrupt per CPU.
>
> Ergo you need 200 devices to exhaust the vector space.

Hi Thomas,

I am wondering why one interrupt needs to be reserved for each CPU, in
theory one queue needs one irq, I understand, so would you mind
explaining the story a bit?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: IRQ affinity problem from virtio_blk 2022-11-16 11:35 ` Ming Lei @ 2022-11-16 13:06 ` Thomas Gleixner 0 siblings, 0 replies; 15+ messages in thread From: Thomas Gleixner @ 2022-11-16 13:06 UTC (permalink / raw) To: Ming Lei Cc: Michael S. Tsirkin, Angus Chen, linux-kernel@vger.kernel.org, Jason Wang On Wed, Nov 16 2022 at 19:35, Ming Lei wrote: > On Wed, Nov 16, 2022 at 11:43:24AM +0100, Thomas Gleixner wrote: >> > Let's say we have 20 queues - then just 10 devices will exhaust the >> > vector space right? >> >> No. >> >> If you have 20 queues then the queues are spread out over the >> CPUs. Assume 80 CPUs: >> >> Then each queue is associated to 80/20 = 4 CPUs and the resulting >> affinity mask of each queue contains exactly 4 CPUs: >> >> q0: 0 - 3 >> q1: 4 - 7 >> ... >> q19: 76 - 79 >> >> So this puts exactly 80 vectors aside, one per CPU. >> >> As long as at least one CPU of a queue mask is online the queue is >> enabled. If the last CPU of a queue mask goes offline then the queue is >> shutdown which means the interrupt associated to the queue is shut down >> too. That's all handled by the block MQ and the interrupt core. If a CPU >> of a queue mask comes back online then the guaranteed vector is >> allocated again. >> >> So it does not matter how many queues per device you have it will >> reserve exactly ONE interrupt per CPU. >> >> Ergo you need 200 devices to exhaust the vector space. > > I am wondering why one interrupt needs to be reserved for each CPU, in > theory one queue needs one irq, I understand, so would you mind > explaining the story a bit? It's only one interrupt per queue. Interrupt != vector. The guarantee of managed interrupts always was that if there are less queues than CPUs that CPU hotunplug cannot result in vector exhaustion. Therefore we differentiate between managed and non-managed interrupts. Managed have a guaranteed reservation, non-managed do not. That's been a very deliberate design decision from the very beginning. 
Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
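The spreading arithmetic in the message above (80 CPUs, 20 queues, so 4 CPUs per queue mask) can be sketched as a simplified model. The helper name is invented for the illustration; the kernel's actual spreading code (`irq_create_affinity_masks()`) also accounts for NUMA topology and uneven splits, while this sketch assumes an even split:

```python
# Simplified sketch of spreading 20 queues over 80 CPUs as described
# above: each queue mask gets a contiguous, equal-sized slice of CPUs.

def spread_queues(num_cpus, num_queues):
    per_queue = num_cpus // num_queues
    return [list(range(q * per_queue, (q + 1) * per_queue))
            for q in range(num_queues)]

masks = spread_queues(80, 20)
print(masks[0])    # [0, 1, 2, 3]       -> q0:   0 -  3
print(masks[19])   # [76, 77, 78, 79]   -> q19: 76 - 79

# Exactly one vector is put aside per CPU, independent of queue count:
print(sum(len(m) for m in masks))   # 80
```

The last line is the key point of the message: the reservation is bounded by the CPU count, not by the number of queues, which is why only a device count on the order of 200 can exhaust the vector space.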
* RE: IRQ affinity problem from virtio_blk
  2022-11-15 23:24       ` Thomas Gleixner
  2022-11-15 23:36         ` Michael S. Tsirkin
@ 2022-11-16  0:46         ` Angus Chen
  2022-11-16 10:54           ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Angus Chen @ 2022-11-16 0:46 UTC (permalink / raw)
  To: Thomas Gleixner, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

> -----Original Message-----
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, November 16, 2022 7:24 AM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Angus Chen <angus.chen@jaguarmicro.com>; linux-kernel@vger.kernel.org;
> Ming Lei <ming.lei@redhat.com>; Jason Wang <jasowang@redhat.com>
> Subject: Re: IRQ affinity problem from virtio_blk
>
> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>
> > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> >>> > We can see global_available drop from 15354 to 15273, is 81.
> >>> > And the total_allocated increase from 411 to 413. One config irq,and
> >>> > one vq irq.
> >>>
> >>> Right. That's perfectly fine. At the point where you looking at it, the
> >>> matrix allocator has given out 2 vectors as can be seen via
> >>> total_allocated.
> >>>
> >>> But then it also has another 79 vectors put aside for the other queues,

Hmm, that is not the case; in fact I have just one queue per virtio_blk:

crash_cts> struct virtio_blk.num_vqs 0xffff888147b79c00
  num_vqs = 1,

I think this is the key point we are discussing.

> >>
> >> What makes it put these vectors aside?

pci_alloc_irq_vectors_affinity ?
> >
> > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> >   vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> >
> > init_vq() hands in a struct irq_affinity which means that
> > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > for config and one per queue if vp_request_msix_vectors() is invoked
> > with per_vq_vectors == true, which is what the first invocation in
> > vp_find_vqs() does.
>
> I just checked on a random VM. The PCI device as advertised to the guest
> does not expose that many vectors. One has 2 and the other 4.
>
> But as the interrupts are requested 'managed' the core ends up setting
> the vectors aside. That's a fundamental property of managed interrupts.
>
> Assume you have less queues than CPUs, which is the case with 2 vectors
> and tons of CPUs, i.e. one ends up for config and the other for the
> actual queue. So the affinity spreading code will end up having the full
> cpumask for the queue vector, which is marked managed. And managed means
> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> be migrated to a still online CPU.
>
> So we end up setting 79 vectors aside (one per CPU) in the case that the
> virtio device only provides two vectors.
>
> But that's not the end of the world as you really would need ~200 such
> devices to exhaust the vector space...

Thank you for your reply. Let's look at the dmesg for more information:

...
Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: 1/0/0 default/read/poll queues
Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: [vdpr] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: enabling device (0000 -> 0002)
Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: virtio_pci: leaving for legacy driver
Nov 14 11:48:46 localhost kernel: virtio_blk virtio182: 1/0/0 default/read/poll queues   <-- virtio182 means index 182
Nov 14 11:48:46 localhost kernel: vp_find_vqs_msix return err=-28   <-- the first 'no space' error from the irq subsystem
...

It is easy to reproduce; the output at that point is:

crash_cts> p *vector_matrix
$97 = {
  matrix_bits = 256,
  alloc_start = 32,
  alloc_end = 236,
  alloc_size = 204,
  global_available = 0,        <-- the vector space is exhausted
  global_reserved = 154,
  systembits_inalloc = 3,
  total_allocated = 1861,
  online_maps = 80,
  maps = 0x46100,
  scratch_map = {18446744069952503807, 18446744073709551615, 18446744073709551615, 18435229987943481343},
  system_map = {1125904739729407, 0, 1, 18435221191850459136}
}

After that, all irq requests are returned "no space".
And the more asymmetric the per-CPU vector usage is, the more quickly we
get the 'no space' error when probing an irq with IRQD_AFFINITY_MANAGED.

> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
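Taking the figures from the two vector_matrix dumps in this thread, the exhaustion rate can be estimated with a back-of-the-envelope model. This is only an estimate: the matrix also serves other consumers, which is consistent with the failure being observed around device index 182 rather than exactly at the computed bound:

```python
# Rough model of vector-space exhaustion using the numbers from the
# dumps above. Each single-queue managed virtio-blk device costs one
# non-managed config vector plus one reserved vector per CPU.

num_cpus = 80
global_available = 15354            # from the first dump in this thread

cost_per_device = 1 + num_cpus      # matches the observed drop of 81

print(cost_per_device)                      # 81
print(global_available // cost_per_device)  # 189 devices until -ENOSPC
```

This matches Thomas's point that on this 80-CPU machine roughly 200 such devices exhaust the vector space, and the observed drop of global_available from 15354 to 15273 (exactly 81) per probed device.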
* RE: IRQ affinity problem from virtio_blk
  2022-11-16  0:46         ` Angus Chen
@ 2022-11-16 10:54           ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2022-11-16 10:54 UTC (permalink / raw)
  To: Angus Chen, Michael S. Tsirkin
  Cc: linux-kernel@vger.kernel.org, Ming Lei, Jason Wang

On Wed, Nov 16 2022 at 00:46, Angus Chen wrote:
>> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>>
>>> But then it also has another 79 vectors put aside for the other queues,
>
> Hmm, that is not the case; in fact I have just one queue per virtio_blk.

Which does not matter. See my reply to Michael. It's ONE vector per CPU
and block device.

> Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: 1/0/0 default/read/poll queues
> Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: [vdpr] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
> Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: enabling device (0000 -> 0002)
> Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: virtio_pci: leaving for legacy driver
> Nov 14 11:48:46 localhost kernel: virtio_blk virtio182: 1/0/0 default/read/poll queues   <-- virtio182 means index 182
> Nov 14 11:48:46 localhost kernel: vp_find_vqs_msix return err=-28   <-- the first 'no space' error from the irq subsystem

That's close to 200 virtio devices and the vector space is exhausted.
Works as expected.

Interrupt vectors are a limited resource on x86, and not only on x86.
Not any different from any other resource.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2022-11-16 13:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2022-11-15  3:40 IRQ affinity problem from virtio_blk Angus Chen
2022-11-15 22:19 ` Thomas Gleixner
2022-11-15 22:44   ` Michael S. Tsirkin
2022-11-15 23:04     ` Thomas Gleixner
2022-11-15 23:24       ` Thomas Gleixner
2022-11-15 23:36         ` Michael S. Tsirkin
2022-11-16  1:02           ` Angus Chen
2022-11-16 10:55             ` Thomas Gleixner
2022-11-16 11:24               ` Angus Chen
2022-11-16 13:27                 ` Thomas Gleixner
2022-11-16 10:43         ` Thomas Gleixner
2022-11-16 11:35           ` Ming Lei
2022-11-16 13:06             ` Thomas Gleixner
2022-11-16  0:46     ` Angus Chen
2022-11-16 10:54       ` Thomas Gleixner