* mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
@ 2023-05-30 13:04 Niklas Schnelle
  2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
  0 siblings, 1 reply; 5+ messages in thread
From: Niklas Schnelle @ 2023-05-30 13:04 UTC (permalink / raw)
  To: Shay Drory, Saeed Mahameed, Eli Cohen, netdev

Hi Saeed, Eli, Shay,

With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
the following commit to be broken:

commit 1da438c0ae02396dc5018b63237492cb5908608d
Author: Shay Drory <shayd@nvidia.com>
Date:   Mon Apr 17 10:57:50 2023 +0300

    net/mlx5: Fix indexing of mlx5_irq

    After the cited patch, mlx5_irq xarray index can be different then
    mlx5_irq MSIX table index.
    Fix it by storing both mlx5_irq xarray index and MSIX table index.

    Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
    Signed-off-by: Shay Drory <shayd@nvidia.com>
    Reviewed-by: Eli Cohen <elic@nvidia.com>
    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

The problem is that our IRQs currently still use a legacy mode instead
of a full-fledged IRQ domain. One consequence of that is that
pci_msix_can_alloc_dyn(dev->pdev) returns false. That lands us in the
non-dynamic case in mlx5_irq_alloc() where irq->map.index is set to 0.
Prior to the above commit, irq->map.index would later be set to i
(the irq number), but that was replaced with just setting
irq->pool_index = i. For the dynamic case this is fine because
pci_msix_alloc_irq_at() sets it, but for the non-dynamic case this
leaves it at 0. With the following diff the RX/TX timeouts go away:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index db5687d9fec9..94dce3735204 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -237,7 +237,7 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
                 * vectors have also been allocated.
                 */
                irq->map.virq = pci_irq_vector(dev->pdev, i);
-               irq->map.index = 0;
+               irq->map.index = i;
        } else {
                irq->map = pci_msix_alloc_irq_at(dev->pdev, MSI_ANY_INDEX, af_desc);
                if (!irq->map.virq) {
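
To illustrate the effect described above, here is a minimal user-space
sketch of the static (non-dynamic) branch only. It is not the driver
code: the toy_* types and helpers are made up for illustration, the
"fixed" flag merely selects between the old and new assignment, and the
virq values are arbitrary placeholders for what pci_irq_vector() would
return.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_msi_map {
	int index;	/* MSI-X table index */
	int virq;	/* Linux IRQ number */
};

struct toy_irq {
	struct toy_msi_map map;
	int pool_index;	/* index of the IRQ in the pool's xarray */
};

/*
 * Hypothetical model of the static branch in mlx5_irq_alloc().
 * fixed == false: map.index is left at 0, as in v6.4-rc4.
 * fixed == true:  map.index follows i, as in the diff above.
 */
void toy_irq_alloc_static(struct toy_irq *irq, int i, bool fixed)
{
	assert(i >= 0);
	irq->map.virq = 100 + i;	/* placeholder for pci_irq_vector() */
	irq->map.index = fixed ? i : 0;
	irq->pool_index = i;
}

/* Count how many of nvec statically allocated IRQs report index 0. */
int toy_count_index_zero(int nvec, bool fixed)
{
	int count = 0;

	for (int i = 0; i < nvec; i++) {
		struct toy_irq irq;

		toy_irq_alloc_static(&irq, i, fixed);
		if (irq.map.index == 0)
			count++;
	}
	return count;
}
```

In this model, toy_count_index_zero(4, false) returns 4, i.e. every
IRQ claims MSI-X table index 0 even though pool_index differs, while
toy_count_index_zero(4, true) returns 1 because only the IRQ with
i == 0 has index 0.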



I'll send a patch with the above shortly but wanted to give you a
heads-up, since I'd really like to get this fixed for -rc5 or at least
-rc6.

Thanks,
Niklas

* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
  2023-05-30 13:04 mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4 Niklas Schnelle
@ 2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
  2023-05-31 13:43   ` Niklas Schnelle
  0 siblings, 1 reply; 5+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-05-31 13:33 UTC (permalink / raw)
  To: Niklas Schnelle, Shay Drory, Saeed Mahameed, Eli Cohen, netdev
  Cc: Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 30.05.23 15:04, Niklas Schnelle wrote:
> 
> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
> the following commit to be broken:
> 
> commit 1da438c0ae02396dc5018b63237492cb5908608d
> Author: Shay Drory <shayd@nvidia.com>
> Date:   Mon Apr 17 10:57:50 2023 +0300
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
#regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
ConnectX-6 VFs on s390
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
  2023-05-31 13:33 ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-05-31 13:43   ` Niklas Schnelle
  2023-05-31 13:57     ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 5+ messages in thread
From: Niklas Schnelle @ 2023-05-31 13:43 UTC (permalink / raw)
  To: Linux regressions mailing list, Shay Drory, Saeed Mahameed,
	Eli Cohen, netdev

On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
(Thorsten Leemhuis) wrote:
> [CCing the regression list, as it should be in the loop for regressions:
> https://docs.kernel.org/admin-guide/reporting-regressions.html]
> 
> [TLDR: I'm adding this report to the list of tracked Linux kernel
> regressions; the text you find below is based on a few template
> paragraphs you might have encountered already in similar form.
> See link in footer if these mails annoy you.]
> 
> On 30.05.23 15:04, Niklas Schnelle wrote:
> > 
> > With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
> > use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
> > the following commit to be broken:
> > 
> > commit 1da438c0ae02396dc5018b63237492cb5908608d
> > Author: Shay Drory <shayd@nvidia.com>
> > Date:   Mon Apr 17 10:57:50 2023 +0300
> > [...]
> 
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
> ConnectX-6 VFs on s390
> #regzbot ignore-activity
> 
> This isn't a regression? This issue or a fix for it are already
> discussed somewhere else? It was fixed already? You want to clarify when
> the regression started to happen? Or point out I got the title or
> something else totally wrong? Then just reply and tell me -- ideally
> while also telling regzbot about it, as explained by the page listed in
> the footer of this mail.
> 
> Developers: When fixing the issue, remember to add 'Link:' tags pointing
> to the report (the parent of this mail). See page linked in footer for
> details.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.

Hi Thorsten,

Thanks for tracking. I actually already sent a fix patch (and v2) for
this. Sadly I forgot to link to this mail. Let's see if I can get the
regzbot command right to update it. As for the humans the latest fix
patch is here:

https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/

Thanks,
Niklas

#regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case


* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
  2023-05-31 13:43   ` Niklas Schnelle
@ 2023-05-31 13:57     ` Linux regression tracking (Thorsten Leemhuis)
  2023-05-31 21:48       ` Saeed Mahameed
  0 siblings, 1 reply; 5+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-05-31 13:57 UTC (permalink / raw)
  To: Niklas Schnelle, Linux regressions mailing list, Shay Drory,
	Saeed Mahameed, Eli Cohen, netdev



On 31.05.23 15:43, Niklas Schnelle wrote:
> On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
> (Thorsten Leemhuis) wrote:
>> [CCing the regression list, as it should be in the loop for regressions:
>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>
>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>> regressions; the text you find below is based on a few template
>> paragraphs you might have encountered already in similar form.
>> See link in footer if these mails annoy you.]
>>
>> On 30.05.23 15:04, Niklas Schnelle wrote:
>>>
>>> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
>>> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
>>> the following commit to be broken:
>>>
>>> commit 1da438c0ae02396dc5018b63237492cb5908608d
>>> Author: Shay Drory <shayd@nvidia.com>
>>> Date:   Mon Apr 17 10:57:50 2023 +0300
>>> [...]
>>
>> Thanks for the report. To be sure the issue doesn't fall through the
>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>> tracking bot:
>>
>> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
>> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
>> ConnectX-6 VFs on s390
>> #regzbot ignore-activity
>>
>> This isn't a regression? This issue or a fix for it are already
>> discussed somewhere else? It was fixed already? You want to clarify when
>> the regression started to happen? Or point out I got the title or
>> something else totally wrong? Then just reply and tell me -- ideally
>> while also telling regzbot about it, as explained by the page listed in
>> the footer of this mail.
>>
>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>> to the report (the parent of this mail). See page linked in footer for
>> details.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> That page also explains what to do if mails like this annoy you.
> 
> Hi Thorsten,
> 
> Thanks for tracking. I actually already sent a fix patch (and v2) for
> this. Sadly I forgot to link to this mail. Let's see if I can get the
> regzbot command right to update it. As for the humans the latest fix
> patch is here:
> 
> https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
> 
> Thanks,
> Niklas
> 
> #regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case

Looks right, many thx. Sorry, should have looked for that myself. Sadly
regzbot doesn't yet search for existing posts on lore with a matching
subject, so for completeness let me point manually to it while at it:

#regzbot monitor:
https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/

Ciao, Thorsten

* Re: mlx5 driver is broken when pci_msix_can_alloc_dyn() is false with v6.4-rc4
  2023-05-31 13:57     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-05-31 21:48       ` Saeed Mahameed
  0 siblings, 0 replies; 5+ messages in thread
From: Saeed Mahameed @ 2023-05-31 21:48 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Niklas Schnelle, Shay Drory, chuck.lever, Saeed Mahameed,
	Eli Cohen, netdev

On 31 May 15:57, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>
>On 31.05.23 15:43, Niklas Schnelle wrote:
>> On Wed, 2023-05-31 at 15:33 +0200, Linux regression tracking #adding
>> (Thorsten Leemhuis) wrote:
>>> [CCing the regression list, as it should be in the loop for regressions:
>>> https://docs.kernel.org/admin-guide/reporting-regressions.html]
>>>
>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>> regressions; the text you find below is based on a few template
>>> paragraphs you might have encountered already in similar form.
>>> See link in footer if these mails annoy you.]
>>>
>>> On 30.05.23 15:04, Niklas Schnelle wrote:
>>>>
>>>> With v6.4-rc4 I'm getting a stream of RX and TX timeouts when trying to
>>>> use ConnectX-4 and ConnectX-6 VFs on s390. I've bisected this and found
>>>> the following commit to be broken:
>>>>
>>>> commit 1da438c0ae02396dc5018b63237492cb5908608d
>>>> Author: Shay Drory <shayd@nvidia.com>
>>>> Date:   Mon Apr 17 10:57:50 2023 +0300
>>>> [...]
>>>
>>> Thanks for the report. To be sure the issue doesn't fall through the
>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
>>> tracking bot:
>>>
>>> #regzbot ^introduced 1da438c0ae02396dc5018b63237492cb5908608d
>>> #regzbot title net/mlx5: RX and TX timeouts with ConnectX-4 and
>>> ConnectX-6 VFs on s390
>>> #regzbot ignore-activity
>>>
>>> This isn't a regression? This issue or a fix for it are already
>>> discussed somewhere else? It was fixed already? You want to clarify when
>>> the regression started to happen? Or point out I got the title or
>>> something else totally wrong? Then just reply and tell me -- ideally
>>> while also telling regzbot about it, as explained by the page listed in
>>> the footer of this mail.
>>>
>>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
>>> to the report (the parent of this mail). See page linked in footer for
>>> details.
>>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> That page also explains what to do if mails like this annoy you.
>>
>> Hi Thorsten,
>>
>> Thanks for tracking. I actually already sent a fix patch (and v2) for
>> this. Sadly I forgot to link to this mail. Let's see if I can get the
>> regzbot command right to update it. As for the humans the latest fix
>> patch is here:
>>
>> https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>>

I picked up this patch into the net-mlx5 tree and will post it to net
shortly.

Thanks,
Saeed.

>> Thanks,
>> Niklas
>>
>> #regzbot fix: net/mlx5: Fix setting of irq->map.index for static IRQ case
>
>Looks right, many thx. Sorry, should have looked for that myself. Sadly
>regzbot doesn't yet search for existing posts on lore with a matching
>subject, so for completeness let me point manually to it while at it:
>
>#regzbot monitor:
>https://lore.kernel.org/netdev/20230531084856.2091666-1-schnelle@linux.ibm.com/
>
>Ciao, Thorsten
>
