* mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
@ 2023-05-30 13:04 Niklas Schnelle
2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
0 siblings, 1 reply; 5+ messages in thread
From: Niklas Schnelle @ 2023-05-30 13:04 UTC (permalink / raw)
To: Shay Drory, Saeed Mahameed, Eli Cohen, netdev
Hi Saeed, Eli, Shay,
With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
the following commit to be broken:
commit 1da438c0ae02396dc5018b63237492cb5908608d
Author: Shay Drory <shayd@nvidia.com>
Date: Mon Apr 17 10:57:50 2023 +0300
net/mlx5: Fix indexing of mlx5_irq
After the cited patch, mlx5_irq xarray index can be different then
mlx5_irq MSIX table index.
Fix it by storing both mlx5_irq xarray index and MSIX table index.
Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The problem is that our IRQs currently still use a legacy mode instead
of a full fledged IRQ domain. One consequence of that is that
pci_msix_can_alloc_dyn(dev->pdev) returns false. That lands us in the
non dynamic case in mlx5_irq_alloc() where irq->map.index is set to 0.
Now prior to the above commit irq->map.index would later be set to i
(the irq number) but that was replaced with just setting irq-
>pool_index = i. For the dynamic case this is fine because
pci_msix_alloc_irq_at() sets it but for the non-dynamic case this leave
it unset. With the following diff the RX/TX timeouts go away:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index db5687d9fec9..94dce3735204 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -237,7 +237,7 @@ struct mlx5_irq *mlx5_irq_alloc(struct
mlx5_irq_pool *pool, int i,
* vectors have also been allocated.
*/
irq->map.virq = pci_irq_vector(dev->pdev, i);
- irq->map.index = 0;
+ irq->map.index = i;
} else {
irq->map = pci_msix_alloc_irq_at(dev->pdev,
MSI_ANY_INDEX, af_desc);
if (!irq->map.virq) {
I'll sent a patch with the above shortly but wanted to give you a heads
up since I'd really like to get this fixed for -rc5 or at least -rc6.
Thanks,
Niklas
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
2023-05-30 13:04 mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4 Niklas Schnelle
@ 2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-05-31 13:43 ` Niklas Schnelle
0 siblings, 1 reply; 5+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-05-31 13:33 UTC (permalink / raw)
To: Niklas Schnelle, Shay Drory, Saeed Mahameed, Eli Cohen, netdev
Cc: Linux kernel regressions list
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
On 30.05.23 15:04, Niklas Schnelle wrote:
>
> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
> the following commit to be broken:
>
> commit 1da438c0ae02396dc5018b63237492cb5908608d
> Author: Shay Drory <shayd@nvidia.com>
> Date: Mon Apr 17 10:57:50 2023 +0300
> [...]
Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:
#regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
#regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
ConnectX-6 VFs on s390
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.
Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-05-31 13:43 ` Niklas Schnelle
2023-05-31 13:57 ` Linux regression tracking (Thorsten Leemhuis)
0 siblings, 1 reply; 5+ messages in thread
From: Niklas Schnelle @ 2023-05-31 13:43 UTC (permalink / raw)
To: Linux regressions mailing list, Shay Drory, Saeed Mahameed,
Eli Cohen, netdev
On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
(Thorsten Leemhuis) wrote:
> [CCing the regression list, as it should be in the loop for regressions:
> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>
> [TLDR: I'm adding this report to the list of tracked Linux kernel
> regressions; the text you find below is based on a few templates
> paragraphs you might have encountered already in similar form.
> See link in footer if these mails annoy you.]
>
> On 30.05.23 15:04, Niklas Schnelle wrote:
> >
> > With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
> > use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
> > the following commit to be broken:
> >
> > commit 1da438c0ae02396dc5018b63237492cb5908608d
> > Author: Shay Drory <shayd@nvidia.com>
> > Date: Mon Apr 17 10:57:50 2023 +0300
> > [...]
>
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
>
> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
> ConnectX-6 VFs on s390
> #regzbot ignore-activity
>
> This isn't a regression? This issue or a fix for it are already
> discussed somewhere else? It was fixed already? You want to clarify when
> the regression started to happen? Or point out I got the title or
> something else totally wrong? Then just reply and tell me -- ideally
> while also telling regzbot about it, as explained by the page listed in
> the footer of this mail.
>
> Developers: When fixing the issue, remember to add 'Link:' tags pointing
> to the report (the parent of this mail). See page linked in footer for
> details.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.
Hi Thorsten,
Thanks for tracking. I actually already sent a fix patch (and v2) for
this. Sadly I forgot to link to this mail. Let's see if I can get the
regzbot command right to update it. As for the humans the latest fix
patch is here:
https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
Thanks,
Niklas
#regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
2023-05-31 13:43 ` Niklas Schnelle
@ 2023-05-31 13:57 ` Linux regression tracking (Thorsten Leemhuis)
2023-05-31 21:48 ` Saeed Mahameed
0 siblings, 1 reply; 5+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-05-31 13:57 UTC (permalink / raw)
To: Niklas Schnelle, Linux regressions mailing list, Shay Drory,
Saeed Mahameed, Eli Cohen, netdev
On 31.05.23 15:43, Niklas Schnelle wrote:
> On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
> (Thorsten Leemhuis) wrote:
>> [CCing the regression list, as it should be in the loop for regressions:
>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>
>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>> regressions; the text you find below is based on a few templates
>> paragraphs you might have encountered already in similar form.
>> See link in footer if these mails annoy you.]
>>
>> On 30.05.23 15:04, Niklas Schnelle wrote:
>>>
>>> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
>>> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
>>> the following commit to be broken:
>>>
>>> commit 1da438c0ae02396dc5018b63237492cb5908608d
>>> Author: Shay Drory <shayd@nvidia.com>
>>> Date: Mon Apr 17 10:57:50 2023 +0300
>>> [...]
>>
>> Thanks for the report. To be sure the issue doesn't fall through the
>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>> tracking bot:
>>
>> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
>> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
>> ConnectX-6 VFs on s390
>> #regzbot ignore-activity
>>
>> This isn't a regression? This issue or a fix for it are already
>> discussed somewhere else? It was fixed already? You want to clarify when
>> the regression started to happen? Or point out I got the title or
>> something else totally wrong? Then just reply and tell me -- ideally
>> while also telling regzbot about it, as explained by the page listed in
>> the footer of this mail.
>>
>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>> to the report (the parent of this mail). See page linked in footer for
>> details.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> That page also explains what to do if mails like this annoy you.
>
> Hi Thorsten,
>
> Thanks for tracking. I actually already sent a fix patch (and v2) for
> this. Sadly I forgot to link to this mail. Let's see if I can get the
> regzbot command right to update it. As for the humans the latest fix
> patch is here:
>
> https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>
> Thanks,
> Niklas
>
> #regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case
Looks right, many thx. Sorry, should have looked for that myself. Sadly
regzbot doesn't yet search for existing post on lore with a matching
subject, so for completeness let me point manually to it while at it:
#regzbot monitor:
https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
Ciao, Thorsten
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
2023-05-31 13:57 ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-05-31 21:48 ` Saeed Mahameed
0 siblings, 0 replies; 5+ messages in thread
From: Saeed Mahameed @ 2023-05-31 21:48 UTC (permalink / raw)
To: Linux regressions mailing list
Cc: Niklas Schnelle, Shay Drory, chuck.lever, Saeed Mahameed,
Eli Cohen, netdev
On 31 May 15:57, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>
>On 31.05.23 15:43, Niklas Schnelle wrote:
>> On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
>> (Thorsten Leemhuis) wrote:
>>> [CCing the regression list, as it should be in the loop for regressions:
>>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>>
>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>> regressions; the text you find below is based on a few templates
>>> paragraphs you might have encountered already in similar form.
>>> See link in footer if these mails annoy you.]
>>>
>>> On 30.05.23 15:04, Niklas Schnelle wrote:
>>>>
>>>> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
>>>> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
>>>> the following commit to be broken:
>>>>
>>>> commit 1da438c0ae02396dc5018b63237492cb5908608d
>>>> Author: Shay Drory <shayd@nvidia.com>
>>>> Date: Mon Apr 17 10:57:50 2023 +0300
>>>> [...]
>>>
>>> Thanks for the report. To be sure the issue doesn't fall through the
>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>> tracking bot:
>>>
>>> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
>>> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
>>> ConnectX-6 VFs on s390
>>> #regzbot ignore-activity
>>>
>>> This isn't a regression? This issue or a fix for it are already
>>> discussed somewhere else? It was fixed already? You want to clarify when
>>> the regression started to happen? Or point out I got the title or
>>> something else totally wrong? Then just reply and tell me -- ideally
>>> while also telling regzbot about it, as explained by the page listed in
>>> the footer of this mail.
>>>
>>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>>> to the report (the parent of this mail). See page linked in footer for
>>> details.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> That page also explains what to do if mails like this annoy you.
>>
>> Hi Thorsten,
>>
>> Thanks for tracking. I actually already sent a fix patch (and v2) for
>> this. Sadly I forgot to link to this mail. Let's see if I can get the
>> regzbot command right to update it. As for the humans the latest fix
>> patch is here:
>>
>> https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>>
I picked up this patch to net-mlx5 tree, will post to net shortly.
Thanks,
Saeed.
>> Thanks,
>> Niklas
>>
>> #regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case
>
>Looks right, many thx. Sorry, should have looked for that myself. Sadly
>regzbot doesn't yet search for existing post on lore with a matching
>subject, so for completeness let me point manually to it while at it:
>
>#regzbot monitor:
>https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>
>Ciao, Thorsten
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-05-31 21:48 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-30 13:04 mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4 Niklas Schnelle
2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-05-31 13:43 ` Niklas Schnelle
2023-05-31 13:57 ` Linux regression tracking (Thorsten Leemhuis)
2023-05-31 21:48 ` Saeed Mahameed
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).