netdev.vger.kernel.org archive mirror
* [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
From: Mohith Kumar Thummaluru @ 2025-06-26  6:04 UTC
  To: saeedm@nvidia.com, leon@kernel.org, tariqt@nvidia.com,
	netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, jacob.e.keller@intel.com,
	shayd@nvidia.com, elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Mohith Kumar Thummaluru,
	Anand Khoje, Manjunath Patil, Rama Nichanamatlu,
	Rajesh Sivaramasubramaniom, Rohit Sajan Kumar, Qing Huang

When request_irq() fails because the IRQ vectors are exhausted, the
error path in mlx5_irq_alloc() frees the entire rmap; other threads
that still hold references to it can then crash[1] on access. Narrow
the cleanup to remove only the IRQ mapping that was just added.

This preserves the other valid mappings and precisely cleans up the
glue object associated with the failed IRQ allocation.

Note: this error is observed when both the fwctl and rds configs are
enabled.
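
To make the failure mode concrete, below is a minimal sketch of the
shared-rmap lifecycle involved. This is hypothetical, simplified code,
not the actual mlx5 error path; the helper name, the "example" IRQ
name and the context argument are illustrative only, while
irq_cpu_rmap_add(), irq_cpu_rmap_remove(), free_irq_cpu_rmap() and
request_irq() are the real kernel APIs.

#include <linux/cpu_rmap.h>
#include <linux/interrupt.h>

/*
 * One shared struct cpu_rmap covers all completion IRQs of a device;
 * each vector contributes its own "glue" entry to the map.
 */
static int example_request_vector(struct cpu_rmap *rmap, int virq,
				  irq_handler_t handler, void *ctx)
{
	int err;

	/* Add the reverse-map glue entry for this vector. */
	err = irq_cpu_rmap_add(rmap, virq);
	if (err)
		return err;

	err = request_irq(virq, handler, 0, "example", ctx);
	if (err) {
		/*
		 * Buggy cleanup: calling free_irq_cpu_rmap(rmap) here
		 * frees the glue of every vector plus the map itself,
		 * while other threads may still dereference the map
		 * (use-after-free).
		 *
		 * Correct cleanup: drop only the glue added above.
		 */
		irq_cpu_rmap_remove(rmap, virq);
		return err;
	}

	return 0;
}

With this shape, a failure on vector N leaves the map and the glue of
vectors 0..N-1 intact, and the map's eventual free stays on the single
teardown path that owns it.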

[1]
mlx5_core 0000:05:00.0: Successfully registered panic handler for port 1
mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to request irq. err = -28
infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while trying to test write-combining support
mlx5_core 0000:05:00.0: Successfully unregistered panic handler for port 1
mlx5_core 0000:06:00.0: Successfully registered panic handler for port 1
mlx5_core 0000:06:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to request irq. err = -28
infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while trying to test write-combining support
mlx5_core 0000:06:00.0: Successfully unregistered panic handler for port 1
mlx5_core 0000:03:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to request irq. err = -28
mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to request irq. err = -28
general protection fault, probably for non-canonical address 0xe277a58fde16f291: 0000 [#1] SMP NOPTI

RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
Call Trace:
   <TASK>
   ? show_trace_log_lvl+0x1d6/0x2f9
   ? show_trace_log_lvl+0x1d6/0x2f9
   ? mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
   ? __die_body.cold+0x8/0xa
   ? die_addr+0x39/0x53
   ? exc_general_protection+0x1c4/0x3e9
   ? dev_vprintk_emit+0x5f/0x90
   ? asm_exc_general_protection+0x22/0x27
   ? free_irq_cpu_rmap+0x23/0x7d
   mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
   irq_pool_request_vector+0x7d/0x90 [mlx5_core]
   mlx5_irq_request+0x2e/0xe0 [mlx5_core]
   mlx5_irq_request_vector+0xad/0xf7 [mlx5_core]
   comp_irq_request_pci+0x64/0xf0 [mlx5_core]
   create_comp_eq+0x71/0x385 [mlx5_core]
   ? mlx5e_open_xdpsq+0x11c/0x230 [mlx5_core]
   mlx5_comp_eqn_get+0x72/0x90 [mlx5_core]
   ? xas_load+0x8/0x91
   mlx5_comp_irqn_get+0x40/0x90 [mlx5_core]
   mlx5e_open_channel+0x7d/0x3c7 [mlx5_core]
   mlx5e_open_channels+0xad/0x250 [mlx5_core]
   mlx5e_open_locked+0x3e/0x110 [mlx5_core]
   mlx5e_open+0x23/0x70 [mlx5_core]
   __dev_open+0xf1/0x1a5
   __dev_change_flags+0x1e1/0x249
   dev_change_flags+0x21/0x5c
   do_setlink+0x28b/0xcc4
   ? __nla_parse+0x22/0x3d
   ? inet6_validate_link_af+0x6b/0x108
   ? cpumask_next+0x1f/0x35
   ? __snmp6_fill_stats64.constprop.0+0x66/0x107
   ? __nla_validate_parse+0x48/0x1e6
   __rtnl_newlink+0x5ff/0xa57
   ? kmem_cache_alloc_trace+0x164/0x2ce
   rtnl_newlink+0x44/0x6e
   rtnetlink_rcv_msg+0x2bb/0x362
   ? __netlink_sendskb+0x4c/0x6c
   ? netlink_unicast+0x28f/0x2ce
   ? rtnl_calcit.isra.0+0x150/0x146
   netlink_rcv_skb+0x5f/0x112
   netlink_unicast+0x213/0x2ce
   netlink_sendmsg+0x24f/0x4d9
   __sock_sendmsg+0x65/0x6a
   ____sys_sendmsg+0x28f/0x2c9
   ? import_iovec+0x17/0x2b
   ___sys_sendmsg+0x97/0xe0
   __sys_sendmsg+0x81/0xd8
   do_syscall_64+0x35/0x87
   entry_SYSCALL_64_after_hwframe+0x6e/0x0
RIP: 0033:0x7fc328603727
Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 0b ed ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ed ff ff 48
RSP: 002b:00007ffe8eb3f1a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc328603727
RDX: 0000000000000000 RSI: 00007ffe8eb3f1f0 RDI: 000000000000000d
RBP: 00007ffe8eb3f1f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000000 R14: 00007ffe8eb3f3c8 R15: 00007ffe8eb3f3bc
   </TASK>
---[ end trace f43ce73c3c2b13a2 ]---
RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
Code: 0f 1f 80 00 00 00 00 48 85 ff 74 6b 55 48 89 fd 53 66 83 7f 06 00 74 24 31 db 48 8b 55 08 0f b7 c3 48 8b 04 c2 48 85 c0 74 09 <8b> 38 31 f6 e8 c4 0a b8 ff 83 c3 01 66 3b 5d 06 72 de b8 ff ff ff
RSP: 0018:ff384881640eaca0 EFLAGS: 00010282
RAX: e277a58fde16f291 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ff2335e2e20b3600 RSI: 0000000000000000 RDI: ff2335e2e20b3400
RBP: ff2335e2e20b3400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffe4 R12: ff384881640ead88
R13: ff2335c3760751e0 R14: ff2335e2e1672200 R15: ff2335c3760751f8
FS:  00007fc32ac22480(0000) GS:ff2335e2d6e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f651ab54000 CR3: 00000029f1206003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
kvm-guest: disable async PF for cpu 0


Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Tested-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 40024cfa3099..822e92ed2d45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -325,8 +325,7 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
 err_req_irq:
 #ifdef CONFIG_RFS_ACCEL
 	if (i && rmap && *rmap) {
-		free_irq_cpu_rmap(*rmap);
-		*rmap = NULL;
+		irq_cpu_rmap_remove(*rmap, irq->map.virq);
 	}
 err_irq_rmap:
 #endif
-- 
2.43.5




* Re: [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
From: Mark Bloch @ 2025-06-26 11:58 UTC
  To: Mohith Kumar Thummaluru, saeedm@nvidia.com, leon@kernel.org,
	tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, jacob.e.keller@intel.com,
	shayd@nvidia.com, elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Qing Huang



On 26/06/2025 9:04, Mohith Kumar Thummaluru wrote:
> When request_irq() fails because the IRQ vectors are exhausted, the
> error path in mlx5_irq_alloc() frees the entire rmap; other threads
> that still hold references to it can then crash[1] on access. Narrow
> the cleanup to remove only the IRQ mapping that was just added.
> 
> This preserves the other valid mappings and precisely cleans up the
> glue object associated with the failed IRQ allocation.
> 
> Note: this error is observed when both the fwctl and rds configs are
> enabled.
> 

Please target net and not net-next.

Mark




* Re: [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
From: Mohith Kumar Thummaluru @ 2025-06-27  6:43 UTC
  To: Mark Bloch, saeedm@nvidia.com, leon@kernel.org, tariqt@nvidia.com,
	netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, jacob.e.keller@intel.com,
	shayd@nvidia.com, elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Qing Huang


On 26/06/25 5:28 pm, Mark Bloch wrote:
>
> On 26/06/2025 9:04, Mohith Kumar Thummaluru wrote:
>> When request_irq() fails because the IRQ vectors are exhausted, the
>> error path in mlx5_irq_alloc() frees the entire rmap; other threads
>> that still hold references to it can then crash[1] on access. Narrow
>> the cleanup to remove only the IRQ mapping that was just added.
>>
>> This preserves the other valid mappings and precisely cleans up the
>> glue object associated with the failed IRQ allocation.
>>
>> Note: this error is observed when both the fwctl and rds configs are
>> enabled.
>>
> Please target net and not net-next.
>
> Mark

Thanks, Mark. Let me do that.

- Mohith.



* Re: [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
From: Jacob Keller @ 2025-07-01 20:51 UTC
  To: Mohith Kumar Thummaluru, saeedm@nvidia.com, leon@kernel.org,
	tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, shayd@nvidia.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Qing Huang





On 6/25/2025 10:32 PM, Mohith Kumar Thummaluru wrote:
> When request_irq() fails because the IRQ vectors are exhausted, the
> error path in mlx5_irq_alloc() frees the entire rmap; other threads
> that still hold references to it can then crash[1] on access. Narrow
> the cleanup to remove only the IRQ mapping that was just added.
> 
> This preserves the other valid mappings and precisely cleans up the
> glue object associated with the failed IRQ allocation.
> 
> Note: this error is observed when both the fwctl and rds configs are
> enabled.
> 

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>



* Re: [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
From: Mohith Kumar Thummaluru @ 2025-07-02  5:07 UTC
  To: Jacob Keller, saeedm@nvidia.com, leon@kernel.org,
	tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, shayd@nvidia.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Qing Huang, Mark Bloch


On 02/07/25 2:21 am, Jacob Keller wrote:
>
> On 6/25/2025 10:32 PM, Mohith Kumar Thummaluru wrote:
>> When request_irq() fails because the IRQ vectors are exhausted, the
>> error path in mlx5_irq_alloc() frees the entire rmap; other threads
>> that still hold references to it can then crash[1] on access. Narrow
>> the cleanup to remove only the IRQ mapping that was just added.
>>
>> This preserves the other valid mappings and precisely cleans up the
>> glue object associated with the failed IRQ allocation.
>>
>> Note: this error is observed when both the fwctl and rds configs are
>> enabled.
>>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Thanks for the review, Jacob!

Based on Mark Bloch's input, I’ve sent out another patch targeting the 
net tree. You can find it here:
https://lore.kernel.org/netdev/1eda4785-6e3e-4660-ac04-62e474133d71@oracle.com/

Regards,
Mohith Kumar Thummaluru



* Re: [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
From: Jacob Keller @ 2025-07-02 17:56 UTC
  To: Mohith Kumar Thummaluru, saeedm@nvidia.com, leon@kernel.org,
	tariqt@nvidia.com, netdev@vger.kernel.org
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, shayd@nvidia.com,
	elic@nvidia.com, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Anand Khoje, Manjunath Patil,
	Rama Nichanamatlu, Rajesh Sivaramasubramaniom, Rohit Sajan Kumar,
	Qing Huang, Mark Bloch





On 7/1/2025 10:07 PM, Mohith Kumar Thummaluru wrote:
> 
> On 02/07/25 2:21 am, Jacob Keller wrote:
>>
>> On 6/25/2025 10:32 PM, Mohith Kumar Thummaluru wrote:
>>> When request_irq() fails because the IRQ vectors are exhausted, the
>>> error path in mlx5_irq_alloc() frees the entire rmap; other threads
>>> that still hold references to it can then crash[1] on access. Narrow
>>> the cleanup to remove only the IRQ mapping that was just added.
>>>
>>> This preserves the other valid mappings and precisely cleans up the
>>> glue object associated with the failed IRQ allocation.
>>>
>>> Note: this error is observed when both the fwctl and rds configs are
>>> enabled.
>>>
>> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
> 
> Thanks for the review, Jacob!
> 
> Based on Mark Bloch's input, I’ve sent out another patch targeting the 
> net tree. You can find it here:
> https://lore.kernel.org/netdev/1eda4785-6e3e-4660-ac04-62e474133d71@oracle.com/
> 
> Regards,
> Mohith Kumar Thummaluru
> 

Thanks. Yeah, I missed it while scrolling my inbox and thought this
was the latest.


