linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
@ 2025-06-27  6:50 Mohith Kumar Thummaluru
  2025-07-02 17:58 ` Jacob Keller
  2025-09-16  5:24 ` Shay Drori
  0 siblings, 2 replies; 10+ messages in thread
From: Mohith Kumar Thummaluru @ 2025-06-27  6:50 UTC (permalink / raw)
  To: saeedm@nvidia.com, leon@kernel.org, tariqt@nvidia.com,
	netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, jacob.e.keller@intel.com,
	shayd@nvidia.com, elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Mohith Kumar Thummaluru,
	Anand Khoje, Manjunath Patil, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom, Rohit Sajan Kumar, Moshe Shemesh,
	Mark Bloch, Qing Huang

The mlx5_irq_alloc() function can inadvertently free the entire rmap
when request_irq() fails due to exhausted IRQ vectors, leading to a
crash[1] when other threads try to access it. This commit modifies
the cleanup to remove only the specific IRQ mapping that was just
added.

This prevents removal of other valid mappings and ensures precise
cleanup of the failed IRQ allocation's associated glue object.

Note: This error is observed when both fwctl and rds configs are enabled.

[1]
mlx5_core 0000:05:00.0: Successfully registered panic handler for port 1
mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to 
request irq. err = -28
infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while 
trying to test write-combining support
mlx5_core 0000:05:00.0: Successfully unregistered panic handler for port 1
mlx5_core 0000:06:00.0: Successfully registered panic handler for port 1
mlx5_core 0000:06:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to 
request irq. err = -28
infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while 
trying to test write-combining support
mlx5_core 0000:06:00.0: Successfully unregistered panic handler for port 1
mlx5_core 0000:03:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to 
request irq. err = -28
mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to 
request irq. err = -28
general protection fault, probably for non-canonical address 
0xe277a58fde16f291: 0000 [#1] SMP NOPTI

RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
Call Trace:
   <TASK>
   ? show_trace_log_lvl+0x1d6/0x2f9
   ? show_trace_log_lvl+0x1d6/0x2f9
   ? mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
   ? __die_body.cold+0x8/0xa
   ? die_addr+0x39/0x53
   ? exc_general_protection+0x1c4/0x3e9
   ? dev_vprintk_emit+0x5f/0x90
   ? asm_exc_general_protection+0x22/0x27
   ? free_irq_cpu_rmap+0x23/0x7d
   mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
   irq_pool_request_vector+0x7d/0x90 [mlx5_core]
   mlx5_irq_request+0x2e/0xe0 [mlx5_core]
   mlx5_irq_request_vector+0xad/0xf7 [mlx5_core]
   comp_irq_request_pci+0x64/0xf0 [mlx5_core]
   create_comp_eq+0x71/0x385 [mlx5_core]
   ? mlx5e_open_xdpsq+0x11c/0x230 [mlx5_core]
   mlx5_comp_eqn_get+0x72/0x90 [mlx5_core]
   ? xas_load+0x8/0x91
   mlx5_comp_irqn_get+0x40/0x90 [mlx5_core]
   mlx5e_open_channel+0x7d/0x3c7 [mlx5_core]
   mlx5e_open_channels+0xad/0x250 [mlx5_core]
   mlx5e_open_locked+0x3e/0x110 [mlx5_core]
   mlx5e_open+0x23/0x70 [mlx5_core]
   __dev_open+0xf1/0x1a5
   __dev_change_flags+0x1e1/0x249
   dev_change_flags+0x21/0x5c
   do_setlink+0x28b/0xcc4
   ? __nla_parse+0x22/0x3d
   ? inet6_validate_link_af+0x6b/0x108
   ? cpumask_next+0x1f/0x35
   ? __snmp6_fill_stats64.constprop.0+0x66/0x107
   ? __nla_validate_parse+0x48/0x1e6
   __rtnl_newlink+0x5ff/0xa57
   ? kmem_cache_alloc_trace+0x164/0x2ce
   rtnl_newlink+0x44/0x6e
   rtnetlink_rcv_msg+0x2bb/0x362
   ? __netlink_sendskb+0x4c/0x6c
   ? netlink_unicast+0x28f/0x2ce
   ? rtnl_calcit.isra.0+0x150/0x146
   netlink_rcv_skb+0x5f/0x112
   netlink_unicast+0x213/0x2ce
   netlink_sendmsg+0x24f/0x4d9
   __sock_sendmsg+0x65/0x6a
   ____sys_sendmsg+0x28f/0x2c9
   ? import_iovec+0x17/0x2b
   ___sys_sendmsg+0x97/0xe0
   __sys_sendmsg+0x81/0xd8
   do_syscall_64+0x35/0x87
   entry_SYSCALL_64_after_hwframe+0x6e/0x0
RIP: 0033:0x7fc328603727
Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 0b ed 
ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ed ff ff 48
RSP: 002b:00007ffe8eb3f1a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc328603727
RDX: 0000000000000000 RSI: 00007ffe8eb3f1f0 RDI: 000000000000000d
RBP: 00007ffe8eb3f1f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000000 R14: 00007ffe8eb3f3c8 R15: 00007ffe8eb3f3bc
   </TASK>
---[ end trace f43ce73c3c2b13a2 ]---
RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
Code: 0f 1f 80 00 00 00 00 48 85 ff 74 6b 55 48 89 fd 53 66 83 7f 06 00 
74 24 31 db 48 8b 55 08 0f b7 c3 48 8b 04 c2 48 85 c0 74 09 <8b> 38 31 
f6 e8 c4 0a b8 ff 83 c3 01 66 3b 5d 06 72 de b8 ff ff ff
RSP: 0018:ff384881640eaca0 EFLAGS: 00010282
RAX: e277a58fde16f291 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ff2335e2e20b3600 RSI: 0000000000000000 RDI: ff2335e2e20b3400
RBP: ff2335e2e20b3400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffe4 R12: ff384881640ead88
R13: ff2335c3760751e0 R14: ff2335e2e1672200 R15: ff2335c3760751f8
FS:  00007fc32ac22480(0000) GS:ff2335e2d6e00000(0000) 
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f651ab54000 CR3: 00000029f1206003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range: 
0xffffffff80000000-0xffffffffbfffffff)
kvm-guest: disable async PF for cpu 0


Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Tested-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
---
   drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 3 +--
   1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 40024cfa3099..822e92ed2d45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -325,8 +325,7 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool 
*pool, int i,
   err_req_irq:
   #ifdef CONFIG_RFS_ACCEL
       if (i && rmap && *rmap) {
-        free_irq_cpu_rmap(*rmap);
-        *rmap = NULL;
+        irq_cpu_rmap_remove(*rmap, irq->map.virq);
       }
   err_irq_rmap:
   #endif
-- 
2.43.5




^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-06-27  6:50 [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure Mohith Kumar Thummaluru
@ 2025-07-02 17:58 ` Jacob Keller
  2025-09-12  3:56   ` Pradyumn Rahar
  2025-09-16  5:24 ` Shay Drori
  1 sibling, 1 reply; 10+ messages in thread
From: Jacob Keller @ 2025-07-02 17:58 UTC (permalink / raw)
  To: Mohith Kumar Thummaluru, saeedm@nvidia.com, leon@kernel.org,
	tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, shayd@nvidia.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Moshe Shemesh, Mark Bloch, Qing Huang


[-- Attachment #1.1: Type: text/plain, Size: 679 bytes --]



On 6/26/2025 11:50 PM, Mohith Kumar Thummaluru wrote:
> The mlx5_irq_alloc() function can inadvertently free the entire rmap
> when request_irq() fails due to exhausted IRQ vectors, leading to a
> crash[1] when other threads try to access it. This commit modifies
> the cleanup to remove only the specific IRQ mapping that was just
> added.
> 
> This prevents removal of other valid mappings and ensures precise
> cleanup of the failed IRQ allocation's associated glue object.
> 
> Note: This error is observed when both fwctl and rds configs are enabled.
> 
FWIW, figured I should add it here:

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-07-02 17:58 ` Jacob Keller
@ 2025-09-12  3:56   ` Pradyumn Rahar
  0 siblings, 0 replies; 10+ messages in thread
From: Pradyumn Rahar @ 2025-09-12  3:56 UTC (permalink / raw)
  To: Jacob Keller, Mohith Kumar Thummaluru, saeedm@nvidia.com,
	leon@kernel.org, tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, shayd@nvidia.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Moshe Shemesh, Mark Bloch, Qing Huang

On 02-07-2025 23:28, Jacob Keller wrote:
> On 6/26/2025 11:50 PM, Mohith Kumar Thummaluru wrote:
>> The mlx5_irq_alloc() function can inadvertently free the entire rmap
>> when request_irq() fails due to exhausted IRQ vectors, leading to a
>> crash[1] when other threads try to access it. This commit modifies
>> the cleanup to remove only the specific IRQ mapping that was just
>> added.
>>
>> This prevents removal of other valid mappings and ensures precise
>> cleanup of the failed IRQ allocation's associated glue object.
>>
>> Note: This error is observed when both fwctl and rds configs are enabled.
>>
> FWIW, figured I should add it here:
>
> Reviewed-by: Jacob Keller<jacob.e.keller@intel.com>
Hi, this patch has been reviewed but hasn't been applied yet. Could you
please look into it?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-06-27  6:50 [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure Mohith Kumar Thummaluru
  2025-07-02 17:58 ` Jacob Keller
@ 2025-09-16  5:24 ` Shay Drori
  2025-09-16 19:53   ` Tariq Toukan
  2025-09-23  6:28   ` [PATCH net v2 " Pradyumn Rahar
  1 sibling, 2 replies; 10+ messages in thread
From: Shay Drori @ 2025-09-16  5:24 UTC (permalink / raw)
  To: Mohith Kumar Thummaluru, saeedm@nvidia.com, leon@kernel.org,
	tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, jacob.e.keller@intel.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Moshe Shemesh, Mark Bloch, Qing Huang

Hi, sorry for the late response :(

On 27/06/2025 9:50, Mohith Kumar Thummaluru wrote:
> External email: Use caution opening links or attachments
> 
> 
> The mlx5_irq_alloc() function can inadvertently free the entire rmap
> when request_irq() fails due to exhausted IRQ vectors, leading to a
> crash[1] when other threads try to access it. This commit modifies
> the cleanup to remove only the specific IRQ mapping that was just
> added.
> 
> This prevents removal of other valid mappings and ensures precise
> cleanup of the failed IRQ allocation's associated glue object.
> 
> Note: This error is observed when both fwctl and rds configs are enabled.
> 
> [1]
> mlx5_core 0000:05:00.0: Successfully registered panic handler for port 1
> mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to
> request irq. err = -28
> infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while
> trying to test write-combining support
> mlx5_core 0000:05:00.0: Successfully unregistered panic handler for port 1
> mlx5_core 0000:06:00.0: Successfully registered panic handler for port 1
> mlx5_core 0000:06:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to
> request irq. err = -28
> infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while
> trying to test write-combining support
> mlx5_core 0000:06:00.0: Successfully unregistered panic handler for port 1
> mlx5_core 0000:03:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to
> request irq. err = -28
> mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to
> request irq. err = -28
> general protection fault, probably for non-canonical address
> 0xe277a58fde16f291: 0000 [#1] SMP NOPTI
> 
> RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
> Call Trace:
>    <TASK>
>    ? show_trace_log_lvl+0x1d6/0x2f9
>    ? show_trace_log_lvl+0x1d6/0x2f9
>    ? mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
>    ? __die_body.cold+0x8/0xa
>    ? die_addr+0x39/0x53
>    ? exc_general_protection+0x1c4/0x3e9
>    ? dev_vprintk_emit+0x5f/0x90
>    ? asm_exc_general_protection+0x22/0x27
>    ? free_irq_cpu_rmap+0x23/0x7d
>    mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
>    irq_pool_request_vector+0x7d/0x90 [mlx5_core]
>    mlx5_irq_request+0x2e/0xe0 [mlx5_core]
>    mlx5_irq_request_vector+0xad/0xf7 [mlx5_core]
>    comp_irq_request_pci+0x64/0xf0 [mlx5_core]
>    create_comp_eq+0x71/0x385 [mlx5_core]
>    ? mlx5e_open_xdpsq+0x11c/0x230 [mlx5_core]
>    mlx5_comp_eqn_get+0x72/0x90 [mlx5_core]
>    ? xas_load+0x8/0x91
>    mlx5_comp_irqn_get+0x40/0x90 [mlx5_core]
>    mlx5e_open_channel+0x7d/0x3c7 [mlx5_core]
>    mlx5e_open_channels+0xad/0x250 [mlx5_core]
>    mlx5e_open_locked+0x3e/0x110 [mlx5_core]
>    mlx5e_open+0x23/0x70 [mlx5_core]
>    __dev_open+0xf1/0x1a5
>    __dev_change_flags+0x1e1/0x249
>    dev_change_flags+0x21/0x5c
>    do_setlink+0x28b/0xcc4
>    ? __nla_parse+0x22/0x3d
>    ? inet6_validate_link_af+0x6b/0x108
>    ? cpumask_next+0x1f/0x35
>    ? __snmp6_fill_stats64.constprop.0+0x66/0x107
>    ? __nla_validate_parse+0x48/0x1e6
>    __rtnl_newlink+0x5ff/0xa57
>    ? kmem_cache_alloc_trace+0x164/0x2ce
>    rtnl_newlink+0x44/0x6e
>    rtnetlink_rcv_msg+0x2bb/0x362
>    ? __netlink_sendskb+0x4c/0x6c
>    ? netlink_unicast+0x28f/0x2ce
>    ? rtnl_calcit.isra.0+0x150/0x146
>    netlink_rcv_skb+0x5f/0x112
>    netlink_unicast+0x213/0x2ce
>    netlink_sendmsg+0x24f/0x4d9
>    __sock_sendmsg+0x65/0x6a
>    ____sys_sendmsg+0x28f/0x2c9
>    ? import_iovec+0x17/0x2b
>    ___sys_sendmsg+0x97/0xe0
>    __sys_sendmsg+0x81/0xd8
>    do_syscall_64+0x35/0x87
>    entry_SYSCALL_64_after_hwframe+0x6e/0x0
> RIP: 0033:0x7fc328603727
> Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 0b ed
> ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ed ff ff 48
> RSP: 002b:00007ffe8eb3f1a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc328603727
> RDX: 0000000000000000 RSI: 00007ffe8eb3f1f0 RDI: 000000000000000d
> RBP: 00007ffe8eb3f1f0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> R13: 0000000000000000 R14: 00007ffe8eb3f3c8 R15: 00007ffe8eb3f3bc
>    </TASK>
> ---[ end trace f43ce73c3c2b13a2 ]---
> RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
> Code: 0f 1f 80 00 00 00 00 48 85 ff 74 6b 55 48 89 fd 53 66 83 7f 06 00
> 74 24 31 db 48 8b 55 08 0f b7 c3 48 8b 04 c2 48 85 c0 74 09 <8b> 38 31
> f6 e8 c4 0a b8 ff 83 c3 01 66 3b 5d 06 72 de b8 ff ff ff
> RSP: 0018:ff384881640eaca0 EFLAGS: 00010282
> RAX: e277a58fde16f291 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ff2335e2e20b3600 RSI: 0000000000000000 RDI: ff2335e2e20b3400
> RBP: ff2335e2e20b3400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 00000000ffffffe4 R12: ff384881640ead88
> R13: ff2335c3760751e0 R14: ff2335e2e1672200 R15: ff2335c3760751f8
> FS:  00007fc32ac22480(0000) GS:ff2335e2d6e00000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f651ab54000 CR3: 00000029f1206003 CR4: 0000000000771ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffffbfffffff)
> kvm-guest: disable async PF for cpu 0
> 
> 
> Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
> Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
> Tested-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> ---
>    drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 3 +--
>    1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index 40024cfa3099..822e92ed2d45 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -325,8 +325,7 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool
> *pool, int i,
>    err_req_irq:
>    #ifdef CONFIG_RFS_ACCEL
>        if (i && rmap && *rmap) {
> -        free_irq_cpu_rmap(*rmap);
> -        *rmap = NULL;
> +        irq_cpu_rmap_remove(*rmap, irq->map.virq);
>        }

Now that the condition body is only one line, you need to remove the
braces.

Other than that:
Reviewed-by: Shay Drory <shayd@nvidia.com>

>    err_irq_rmap:
>    #endif
> -- 
> 2.43.5
> 
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-09-16  5:24 ` Shay Drori
@ 2025-09-16 19:53   ` Tariq Toukan
  2025-09-23  6:28   ` [PATCH net v2 " Pradyumn Rahar
  1 sibling, 0 replies; 10+ messages in thread
From: Tariq Toukan @ 2025-09-16 19:53 UTC (permalink / raw)
  To: Shay Drori, Mohith Kumar Thummaluru, saeedm@nvidia.com,
	leon@kernel.org, tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, jacob.e.keller@intel.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Moshe Shemesh, Mark Bloch, Qing Huang



On 16/09/2025 8:24, Shay Drori wrote:
> Hi, sorry for the late response :(
> 
> On 27/06/2025 9:50, Mohith Kumar Thummaluru wrote:
>> External email: Use caution opening links or attachments
>>
>>
..

> 
> Now that the condition body is only one line, you need to remove the
> braces.
> 
> Other than that:
> Reviewed-by: Shay Drory <shayd@nvidia.com>
> 

LGTM.

Acked-by: Tariq Toukan <tariqt@nvidia.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH net v2 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-09-16  5:24 ` Shay Drori
  2025-09-16 19:53   ` Tariq Toukan
@ 2025-09-23  6:28   ` Pradyumn Rahar
  2025-09-24 18:14     ` Keller, Jacob E
                       ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Pradyumn Rahar @ 2025-09-23  6:28 UTC (permalink / raw)
  To: shayd
  Cc: anand.a.khoje, andrew+netdev, davem, edumazet, elic,
	jacob.e.keller, kuba, leon, linux-kernel, linux-rdma,
	manjunath.b.patil, mbloch, pradyumn.rahar, moshe, netdev, pabeni,
	qing.huang, rajesh.sivaramasubramaniom, rama.nichanamatlu,
	rohit.sajan.kumar, saeedm, tariqt

The mlx5_irq_alloc() function can inadvertently free the entire rmap
when request_irq() fails due to exhausted IRQ vectors, leading to a
crash[1] when other threads try to access it. This commit modifies
the cleanup to remove only the specific IRQ mapping that was just
added.

This prevents removal of other valid mappings and ensures precise
cleanup of the failed IRQ allocation's associated glue object.

Note: This error is observed when both fwctl and rds configs are enabled.

[1]
mlx5_core 0000:05:00.0: Successfully registered panic handler for port 1
mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to
request irq. err = -28
infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while
trying to test write-combining support
mlx5_core 0000:05:00.0: Successfully unregistered panic handler for port 1
mlx5_core 0000:06:00.0: Successfully registered panic handler for port 1
mlx5_core 0000:06:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to
request irq. err = -28
infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while
trying to test write-combining support
mlx5_core 0000:06:00.0: Successfully unregistered panic handler for port 1
mlx5_core 0000:03:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to
request irq. err = -28
mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to
request irq. err = -28
general protection fault, probably for non-canonical address
0xe277a58fde16f291: 0000 [#1] SMP NOPTI

RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
Call Trace:
   <TASK>
   ? show_trace_log_lvl+0x1d6/0x2f9
   ? show_trace_log_lvl+0x1d6/0x2f9
   ? mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
   ? __die_body.cold+0x8/0xa
   ? die_addr+0x39/0x53
   ? exc_general_protection+0x1c4/0x3e9
   ? dev_vprintk_emit+0x5f/0x90
   ? asm_exc_general_protection+0x22/0x27
   ? free_irq_cpu_rmap+0x23/0x7d
   mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
   irq_pool_request_vector+0x7d/0x90 [mlx5_core]
   mlx5_irq_request+0x2e/0xe0 [mlx5_core]
   mlx5_irq_request_vector+0xad/0xf7 [mlx5_core]
   comp_irq_request_pci+0x64/0xf0 [mlx5_core]
   create_comp_eq+0x71/0x385 [mlx5_core]
   ? mlx5e_open_xdpsq+0x11c/0x230 [mlx5_core]
   mlx5_comp_eqn_get+0x72/0x90 [mlx5_core]
   ? xas_load+0x8/0x91
   mlx5_comp_irqn_get+0x40/0x90 [mlx5_core]
   mlx5e_open_channel+0x7d/0x3c7 [mlx5_core]
   mlx5e_open_channels+0xad/0x250 [mlx5_core]
   mlx5e_open_locked+0x3e/0x110 [mlx5_core]
   mlx5e_open+0x23/0x70 [mlx5_core]
   __dev_open+0xf1/0x1a5
   __dev_change_flags+0x1e1/0x249
   dev_change_flags+0x21/0x5c
   do_setlink+0x28b/0xcc4
   ? __nla_parse+0x22/0x3d
   ? inet6_validate_link_af+0x6b/0x108
   ? cpumask_next+0x1f/0x35
   ? __snmp6_fill_stats64.constprop.0+0x66/0x107
   ? __nla_validate_parse+0x48/0x1e6
   __rtnl_newlink+0x5ff/0xa57
   ? kmem_cache_alloc_trace+0x164/0x2ce
   rtnl_newlink+0x44/0x6e
   rtnetlink_rcv_msg+0x2bb/0x362
   ? __netlink_sendskb+0x4c/0x6c
   ? netlink_unicast+0x28f/0x2ce
   ? rtnl_calcit.isra.0+0x150/0x146
   netlink_rcv_skb+0x5f/0x112
   netlink_unicast+0x213/0x2ce
   netlink_sendmsg+0x24f/0x4d9
   __sock_sendmsg+0x65/0x6a
   ____sys_sendmsg+0x28f/0x2c9
   ? import_iovec+0x17/0x2b
   ___sys_sendmsg+0x97/0xe0
   __sys_sendmsg+0x81/0xd8
   do_syscall_64+0x35/0x87
   entry_SYSCALL_64_after_hwframe+0x6e/0x0
RIP: 0033:0x7fc328603727
Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 0b ed
ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ed ff ff 48
RSP: 002b:00007ffe8eb3f1a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc328603727
RDX: 0000000000000000 RSI: 00007ffe8eb3f1f0 RDI: 000000000000000d
RBP: 00007ffe8eb3f1f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000000 R14: 00007ffe8eb3f3c8 R15: 00007ffe8eb3f3bc
   </TASK>
---[ end trace f43ce73c3c2b13a2 ]---
RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
Code: 0f 1f 80 00 00 00 00 48 85 ff 74 6b 55 48 89 fd 53 66 83 7f 06 00
74 24 31 db 48 8b 55 08 0f b7 c3 48 8b 04 c2 48 85 c0 74 09 <8b> 38 31
f6 e8 c4 0a b8 ff 83 c3 01 66 3b 5d 06 72 de b8 ff ff ff
RSP: 0018:ff384881640eaca0 EFLAGS: 00010282
RAX: e277a58fde16f291 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ff2335e2e20b3600 RSI: 0000000000000000 RDI: ff2335e2e20b3400
RBP: ff2335e2e20b3400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffe4 R12: ff384881640ead88
R13: ff2335c3760751e0 R14: ff2335e2e1672200 R15: ff2335c3760751f8
FS:  00007fc32ac22480(0000) GS:ff2335e2d6e00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f651ab54000 CR3: 00000029f1206003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range:
0xffffffff80000000-0xffffffffbfffffff)
kvm-guest: disable async PF for cpu 0

Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Tested-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Pradyumn Rahar <pradyumn.rahar@oracle.com>
---
v1->v2: removed unnecessary braces from the if condition.
---
 drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 692ef9c2f729..82ada674f8e2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -324,10 +324,8 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
 	free_irq(irq->map.virq, &irq->nh);
 err_req_irq:
 #ifdef CONFIG_RFS_ACCEL
-	if (i && rmap && *rmap) {
-		free_irq_cpu_rmap(*rmap);
-		*rmap = NULL;
-	}
+	if (i && rmap && *rmap)
+		irq_cpu_rmap_remove(*rmap, irq->map.virq);
 err_irq_rmap:
 #endif
 	if (i && pci_msix_can_alloc_dyn(dev->pdev))
-- 
2.43.7


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* RE: [PATCH net v2 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-09-23  6:28   ` [PATCH net v2 " Pradyumn Rahar
@ 2025-09-24 18:14     ` Keller, Jacob E
  2025-09-28  5:55     ` Tariq Toukan
  2025-10-27 14:29     ` Pradyumn Rahar
  2 siblings, 0 replies; 10+ messages in thread
From: Keller, Jacob E @ 2025-09-24 18:14 UTC (permalink / raw)
  To: Pradyumn Rahar, shayd@nvidia.com
  Cc: anand.a.khoje@oracle.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, elic@nvidia.com,
	kuba@kernel.org, leon@kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, manjunath.b.patil@oracle.com,
	mbloch@nvidia.com, moshe@nvidia.com, netdev@vger.kernel.org,
	pabeni@redhat.com, qing.huang@oracle.com,
	rajesh.sivaramasubramaniom@oracle.com,
	rama.nichanamatlu@oracle.com, rohit.sajan.kumar@oracle.com,
	saeedm@nvidia.com, tariqt@nvidia.com



> -----Original Message-----
> From: Pradyumn Rahar <pradyumn.rahar@oracle.com>
> Sent: Monday, September 22, 2025 11:28 PM
> To: shayd@nvidia.com
> Cc: anand.a.khoje@oracle.com; andrew+netdev@lunn.ch;
> davem@davemloft.net; edumazet@google.com; elic@nvidia.com; Keller, Jacob E
> <jacob.e.keller@intel.com>; kuba@kernel.org; leon@kernel.org; linux-
> kernel@vger.kernel.org; linux-rdma@vger.kernel.org;
> manjunath.b.patil@oracle.com; mbloch@nvidia.com;
> pradyumn.rahar@oracle.com; moshe@nvidia.com; netdev@vger.kernel.org;
> pabeni@redhat.com; qing.huang@oracle.com;
> rajesh.sivaramasubramaniom@oracle.com; rama.nichanamatlu@oracle.com;
> rohit.sajan.kumar@oracle.com; saeedm@nvidia.com; tariqt@nvidia.com
> Subject: [PATCH net v2 1/1] net/mlx5: Clean up only new IRQ glue on request_irq()
> failure
> 
> The mlx5_irq_alloc() function can inadvertently free the entire rmap
> when request_irq() fails due to exhausted IRQ vectors, leading to a
> crash[1] when other threads try to access it. This commit modifies
> the cleanup to remove only the specific IRQ mapping that was just
> added.
> 
> This prevents removal of other valid mappings and ensures precise
> cleanup of the failed IRQ allocation's associated glue object.
> 
> Note: This error is observed when both fwctl and rds configs are enabled.
> 
> [1]
> mlx5_core 0000:05:00.0: Successfully registered panic handler for port 1
> mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to
> request irq. err = -28
> infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while
> trying to test write-combining support
> mlx5_core 0000:05:00.0: Successfully unregistered panic handler for port 1
> mlx5_core 0000:06:00.0: Successfully registered panic handler for port 1
> mlx5_core 0000:06:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to
> request irq. err = -28
> infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while
> trying to test write-combining support
> mlx5_core 0000:06:00.0: Successfully unregistered panic handler for port 1
> mlx5_core 0000:03:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to
> request irq. err = -28
> mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to
> request irq. err = -28
> general protection fault, probably for non-canonical address
> 0xe277a58fde16f291: 0000 [#1] SMP NOPTI
> 
> RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
> Call Trace:
>    <TASK>
>    ? show_trace_log_lvl+0x1d6/0x2f9
>    ? show_trace_log_lvl+0x1d6/0x2f9
>    ? mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
>    ? __die_body.cold+0x8/0xa
>    ? die_addr+0x39/0x53
>    ? exc_general_protection+0x1c4/0x3e9
>    ? dev_vprintk_emit+0x5f/0x90
>    ? asm_exc_general_protection+0x22/0x27
>    ? free_irq_cpu_rmap+0x23/0x7d
>    mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
>    irq_pool_request_vector+0x7d/0x90 [mlx5_core]
>    mlx5_irq_request+0x2e/0xe0 [mlx5_core]
>    mlx5_irq_request_vector+0xad/0xf7 [mlx5_core]
>    comp_irq_request_pci+0x64/0xf0 [mlx5_core]
>    create_comp_eq+0x71/0x385 [mlx5_core]
>    ? mlx5e_open_xdpsq+0x11c/0x230 [mlx5_core]
>    mlx5_comp_eqn_get+0x72/0x90 [mlx5_core]
>    ? xas_load+0x8/0x91
>    mlx5_comp_irqn_get+0x40/0x90 [mlx5_core]
>    mlx5e_open_channel+0x7d/0x3c7 [mlx5_core]
>    mlx5e_open_channels+0xad/0x250 [mlx5_core]
>    mlx5e_open_locked+0x3e/0x110 [mlx5_core]
>    mlx5e_open+0x23/0x70 [mlx5_core]
>    __dev_open+0xf1/0x1a5
>    __dev_change_flags+0x1e1/0x249
>    dev_change_flags+0x21/0x5c
>    do_setlink+0x28b/0xcc4
>    ? __nla_parse+0x22/0x3d
>    ? inet6_validate_link_af+0x6b/0x108
>    ? cpumask_next+0x1f/0x35
>    ? __snmp6_fill_stats64.constprop.0+0x66/0x107
>    ? __nla_validate_parse+0x48/0x1e6
>    __rtnl_newlink+0x5ff/0xa57
>    ? kmem_cache_alloc_trace+0x164/0x2ce
>    rtnl_newlink+0x44/0x6e
>    rtnetlink_rcv_msg+0x2bb/0x362
>    ? __netlink_sendskb+0x4c/0x6c
>    ? netlink_unicast+0x28f/0x2ce
>    ? rtnl_calcit.isra.0+0x150/0x146
>    netlink_rcv_skb+0x5f/0x112
>    netlink_unicast+0x213/0x2ce
>    netlink_sendmsg+0x24f/0x4d9
>    __sock_sendmsg+0x65/0x6a
>    ____sys_sendmsg+0x28f/0x2c9
>    ? import_iovec+0x17/0x2b
>    ___sys_sendmsg+0x97/0xe0
>    __sys_sendmsg+0x81/0xd8
>    do_syscall_64+0x35/0x87
>    entry_SYSCALL_64_after_hwframe+0x6e/0x0
> RIP: 0033:0x7fc328603727
> Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 0b ed
> ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ed ff ff 48
> RSP: 002b:00007ffe8eb3f1a0 EFLAGS: 00000293 ORIG_RAX:
> 000000000000002e
> RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc328603727
> RDX: 0000000000000000 RSI: 00007ffe8eb3f1f0 RDI: 000000000000000d
> RBP: 00007ffe8eb3f1f0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> R13: 0000000000000000 R14: 00007ffe8eb3f3c8 R15: 00007ffe8eb3f3bc
>    </TASK>
> ---[ end trace f43ce73c3c2b13a2 ]---
> RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
> Code: 0f 1f 80 00 00 00 00 48 85 ff 74 6b 55 48 89 fd 53 66 83 7f 06 00
> 74 24 31 db 48 8b 55 08 0f b7 c3 48 8b 04 c2 48 85 c0 74 09 <8b> 38 31
> f6 e8 c4 0a b8 ff 83 c3 01 66 3b 5d 06 72 de b8 ff ff ff
> RSP: 0018:ff384881640eaca0 EFLAGS: 00010282
> RAX: e277a58fde16f291 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ff2335e2e20b3600 RSI: 0000000000000000 RDI: ff2335e2e20b3400
> RBP: ff2335e2e20b3400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 00000000ffffffe4 R12: ff384881640ead88
> R13: ff2335c3760751e0 R14: ff2335e2e1672200 R15: ff2335c3760751f8
> FS:  00007fc32ac22480(0000) GS:ff2335e2d6e00000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f651ab54000 CR3: 00000029f1206003 CR4: 0000000000771ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffffbfffffff)
> kvm-guest: disable async PF for cpu 0
> 
> Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
> Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
> Tested-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> Signed-off-by: Pradyumn Rahar <pradyumn.rahar@oracle.com>
> ---
> v1->v2: removed unnecessary braces from if condition.
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index 692ef9c2f729..82ada674f8e2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -324,10 +324,8 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
>  	free_irq(irq->map.virq, &irq->nh);
>  err_req_irq:
>  #ifdef CONFIG_RFS_ACCEL
> -	if (i && rmap && *rmap) {
> -		free_irq_cpu_rmap(*rmap);
> -		*rmap = NULL;
> -	}
> +	if (i && rmap && *rmap)
> +		irq_cpu_rmap_remove(*rmap, irq->map.virq);

Presumably, if this fails during initialization, the caller of mlx5_irq_alloc(), which allocates multiple IRQs, would be responsible for cleaning up anything it allocated before failing. That makes sense, and cleaning up only what this function itself did makes even more sense.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

>  err_irq_rmap:
>  #endif
>  	if (i && pci_msix_can_alloc_dyn(dev->pdev))
> --
> 2.43.7


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net v2 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-09-23  6:28   ` [PATCH net v2 " Pradyumn Rahar
  2025-09-24 18:14     ` Keller, Jacob E
@ 2025-09-28  5:55     ` Tariq Toukan
  2025-10-27 14:29     ` Pradyumn Rahar
  2 siblings, 0 replies; 10+ messages in thread
From: Tariq Toukan @ 2025-09-28  5:55 UTC (permalink / raw)
  To: Pradyumn Rahar, shayd
  Cc: anand.a.khoje, andrew+netdev, davem, edumazet, elic,
	jacob.e.keller, kuba, leon, linux-kernel, linux-rdma,
	manjunath.b.patil, mbloch, moshe, netdev, pabeni, qing.huang,
	rajesh.sivaramasubramaniom, rama.nichanamatlu, rohit.sajan.kumar,
	saeedm, tariqt



On 23/09/2025 9:28, Pradyumn Rahar wrote:
> The mlx5_irq_alloc() function can inadvertently free the entire rmap
> and end up in a crash[1] when the other threads tries to access this,
> when request_irq() fails due to exhausted IRQ vectors. This commit
> modifies the cleanup to remove only the specific IRQ mapping that was
> just added.
> 
> [remainder of commit message, stack trace, and diff trimmed; quoted in
> full earlier in the thread]

Acked-by: Tariq Toukan <tariqt@nvidia.com>

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net v2 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-09-23  6:28   ` [PATCH net v2 " Pradyumn Rahar
  2025-09-24 18:14     ` Keller, Jacob E
  2025-09-28  5:55     ` Tariq Toukan
@ 2025-10-27 14:29     ` Pradyumn Rahar
  2025-11-03 11:37       ` Tariq Toukan
  2 siblings, 1 reply; 10+ messages in thread
From: Pradyumn Rahar @ 2025-10-27 14:29 UTC (permalink / raw)
  To: shayd
  Cc: anand.a.khoje, andrew+netdev, davem, edumazet, elic,
	jacob.e.keller, kuba, leon, linux-kernel, linux-rdma,
	manjunath.b.patil, mbloch, moshe, netdev, pabeni, qing.huang,
	rajesh.sivaramasubramaniom, rama.nichanamatlu, rohit.sajan.kumar,
	saeedm, tariqt


On 23-09-2025 11:58, Pradyumn Rahar wrote:
> The mlx5_irq_alloc() function can inadvertently free the entire rmap
> and end up in a crash[1] when the other threads tries to access this,
> when request_irq() fails due to exhausted IRQ vectors. This commit
> modifies the cleanup to remove only the specific IRQ mapping that was
> just added.
>
> [remainder of commit message, stack trace, and diff trimmed; quoted in
> full earlier in the thread]

Hi, this patch has been reviewed but hasn't been applied yet. Could you 
please look into it?

Thanks.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net v2 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
  2025-10-27 14:29     ` Pradyumn Rahar
@ 2025-11-03 11:37       ` Tariq Toukan
  0 siblings, 0 replies; 10+ messages in thread
From: Tariq Toukan @ 2025-11-03 11:37 UTC (permalink / raw)
  To: Pradyumn Rahar, shayd
  Cc: anand.a.khoje, andrew+netdev, davem, edumazet, elic,
	jacob.e.keller, kuba, leon, linux-kernel, linux-rdma,
	manjunath.b.patil, mbloch, moshe, netdev, pabeni, qing.huang,
	rajesh.sivaramasubramaniom, rama.nichanamatlu, rohit.sajan.kumar,
	saeedm, tariqt



On 27/10/2025 16:29, Pradyumn Rahar wrote:
> 
> On 23-09-2025 11:58, Pradyumn Rahar wrote:
>> The mlx5_irq_alloc() function can inadvertently free the entire rmap
>> and end up in a crash[1] when the other threads tries to access this,
>> when request_irq() fails due to exhausted IRQ vectors. This commit
>> modifies the cleanup to remove only the specific IRQ mapping that was
>> just added.
>>
>> [remainder of commit message, stack trace, and diff trimmed; quoted in
>> full earlier in the thread]
> 
> Hi, this patch has been reviewed but hasn't been applied yet. Could you 
> please look into it?
> 
> Thanks.
> 
> 

I'll re-send it.

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-11-03 11:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27  6:50 [PATCH net 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure Mohith Kumar Thummaluru
2025-07-02 17:58 ` Jacob Keller
2025-09-12  3:56   ` Pradyumn Rahar
2025-09-16  5:24 ` Shay Drori
2025-09-16 19:53   ` Tariq Toukan
2025-09-23  6:28   ` [PATCH net v2 " Pradyumn Rahar
2025-09-24 18:14     ` Keller, Jacob E
2025-09-28  5:55     ` Tariq Toukan
2025-10-27 14:29     ` Pradyumn Rahar
2025-11-03 11:37       ` Tariq Toukan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).