From: Mark Bloch <mbloch@nvidia.com>
To: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>,
"saeedm@nvidia.com" <saeedm@nvidia.com>,
"leon@kernel.org" <leon@kernel.org>,
"tariqt@nvidia.com" <tariqt@nvidia.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Cc: "andrew+netdev@lunn.ch" <andrew+netdev@lunn.ch>,
"davem@davemloft.net" <davem@davemloft.net>,
"edumazet@google.com" <edumazet@google.com>,
"kuba@kernel.org" <kuba@kernel.org>,
"pabeni@redhat.com" <pabeni@redhat.com>,
"jacob.e.keller@intel.com" <jacob.e.keller@intel.com>,
"shayd@nvidia.com" <shayd@nvidia.com>,
"elic@nvidia.com" <elic@nvidia.com>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Anand Khoje <anand.a.khoje@oracle.com>,
Manjunath Patil <manjunath.b.patil@oracle.com>,
Rama Nichanamatlu <rama.nichanamatlu@oracle.com>,
Rajesh Sivaramasubramaniom
<rajesh.sivaramasubramaniom@oracle.com>,
Rohit Sajan Kumar <rohit.sajan.kumar@oracle.com>,
Qing Huang <qing.huang@oracle.com>
Subject: Re: [RESEND PATCH net-next 1/1] net/mlx5: Clean up only new IRQ glue on request_irq() failure
Date: Thu, 26 Jun 2025 14:58:49 +0300
Message-ID: <2fc66048-36be-408d-a79f-0393ce2b4040@nvidia.com>
In-Reply-To: <f4e25a98-5d50-4c9b-b891-c4ac042debd9@oracle.com>
On 26/06/2025 9:04, Mohith Kumar Thummaluru wrote:
> When request_irq() fails because IRQ vectors are exhausted,
> mlx5_irq_alloc() can inadvertently free the entire rmap, which leads
> to a crash[1] when other threads later try to access it. This commit
> modifies the cleanup to remove only the specific IRQ mapping that was
> just added.
>
> This prevents removal of other valid mappings and ensures precise
> cleanup of the failed IRQ allocation's associated glue object.
>
> Note: This error is observed when both fwctl and rds configs are enabled.
>
Please target net and not net-next.
Mark
> [1]
> mlx5_core 0000:05:00.0: Successfully registered panic handler for port 1
> mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to request irq. err = -28
> infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while trying to test write-combining support
> mlx5_core 0000:05:00.0: Successfully unregistered panic handler for port 1
> mlx5_core 0000:06:00.0: Successfully registered panic handler for port 1
> mlx5_core 0000:06:00.0: mlx5_irq_alloc:293:(pid 66740): Failed to request irq. err = -28
> infiniband mlx5_0: mlx5_ib_test_wc:290:(pid 66740): Error -28 while trying to test write-combining support
> mlx5_core 0000:06:00.0: Successfully unregistered panic handler for port 1
> mlx5_core 0000:03:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to request irq. err = -28
> mlx5_core 0000:05:00.0: mlx5_irq_alloc:293:(pid 28895): Failed to request irq. err = -28
> general protection fault, probably for non-canonical address 0xe277a58fde16f291: 0000 [#1] SMP NOPTI
>
> RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
> Call Trace:
> <TASK>
> ? show_trace_log_lvl+0x1d6/0x2f9
> ? show_trace_log_lvl+0x1d6/0x2f9
> ? mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
> ? __die_body.cold+0x8/0xa
> ? die_addr+0x39/0x53
> ? exc_general_protection+0x1c4/0x3e9
> ? dev_vprintk_emit+0x5f/0x90
> ? asm_exc_general_protection+0x22/0x27
> ? free_irq_cpu_rmap+0x23/0x7d
> mlx5_irq_alloc.cold+0x5d/0xf3 [mlx5_core]
> irq_pool_request_vector+0x7d/0x90 [mlx5_core]
> mlx5_irq_request+0x2e/0xe0 [mlx5_core]
> mlx5_irq_request_vector+0xad/0xf7 [mlx5_core]
> comp_irq_request_pci+0x64/0xf0 [mlx5_core]
> create_comp_eq+0x71/0x385 [mlx5_core]
> ? mlx5e_open_xdpsq+0x11c/0x230 [mlx5_core]
> mlx5_comp_eqn_get+0x72/0x90 [mlx5_core]
> ? xas_load+0x8/0x91
> mlx5_comp_irqn_get+0x40/0x90 [mlx5_core]
> mlx5e_open_channel+0x7d/0x3c7 [mlx5_core]
> mlx5e_open_channels+0xad/0x250 [mlx5_core]
> mlx5e_open_locked+0x3e/0x110 [mlx5_core]
> mlx5e_open+0x23/0x70 [mlx5_core]
> __dev_open+0xf1/0x1a5
> __dev_change_flags+0x1e1/0x249
> dev_change_flags+0x21/0x5c
> do_setlink+0x28b/0xcc4
> ? __nla_parse+0x22/0x3d
> ? inet6_validate_link_af+0x6b/0x108
> ? cpumask_next+0x1f/0x35
> ? __snmp6_fill_stats64.constprop.0+0x66/0x107
> ? __nla_validate_parse+0x48/0x1e6
> __rtnl_newlink+0x5ff/0xa57
> ? kmem_cache_alloc_trace+0x164/0x2ce
> rtnl_newlink+0x44/0x6e
> rtnetlink_rcv_msg+0x2bb/0x362
> ? __netlink_sendskb+0x4c/0x6c
> ? netlink_unicast+0x28f/0x2ce
> ? rtnl_calcit.isra.0+0x150/0x146
> netlink_rcv_skb+0x5f/0x112
> netlink_unicast+0x213/0x2ce
> netlink_sendmsg+0x24f/0x4d9
> __sock_sendmsg+0x65/0x6a
> ____sys_sendmsg+0x28f/0x2c9
> ? import_iovec+0x17/0x2b
> ___sys_sendmsg+0x97/0xe0
> __sys_sendmsg+0x81/0xd8
> do_syscall_64+0x35/0x87
> entry_SYSCALL_64_after_hwframe+0x6e/0x0
> RIP: 0033:0x7fc328603727
> Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 0b ed ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 44 ed ff ff 48
> RSP: 002b:00007ffe8eb3f1a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
> RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc328603727
> RDX: 0000000000000000 RSI: 00007ffe8eb3f1f0 RDI: 000000000000000d
> RBP: 00007ffe8eb3f1f0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> R13: 0000000000000000 R14: 00007ffe8eb3f3c8 R15: 00007ffe8eb3f3bc
> </TASK>
> ---[ end trace f43ce73c3c2b13a2 ]---
> RIP: 0010:free_irq_cpu_rmap+0x23/0x7d
> Code: 0f 1f 80 00 00 00 00 48 85 ff 74 6b 55 48 89 fd 53 66 83 7f 06 00 74 24 31 db 48 8b 55 08 0f b7 c3 48 8b 04 c2 48 85 c0 74 09 <8b> 38 31 f6 e8 c4 0a b8 ff 83 c3 01 66 3b 5d 06 72 de b8 ff ff ff
> RSP: 0018:ff384881640eaca0 EFLAGS: 00010282
> RAX: e277a58fde16f291 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ff2335e2e20b3600 RSI: 0000000000000000 RDI: ff2335e2e20b3400
> RBP: ff2335e2e20b3400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 00000000ffffffe4 R12: ff384881640ead88
> R13: ff2335c3760751e0 R14: ff2335e2e1672200 R15: ff2335c3760751f8
> FS: 00007fc32ac22480(0000) GS:ff2335e2d6e00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f651ab54000 CR3: 00000029f1206003 CR4: 0000000000771ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> kvm-guest: disable async PF for cpu 0
>
>
> Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
> Signed-off-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
> Tested-by: Mohith Kumar Thummaluru <mohith.k.kumar.thummaluru@oracle.com>
> Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index 40024cfa3099..822e92ed2d45 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -325,8 +325,7 @@ struct mlx5_irq *mlx5_irq_alloc(struct mlx5_irq_pool *pool, int i,
>  err_req_irq:
>  #ifdef CONFIG_RFS_ACCEL
>  	if (i && rmap && *rmap) {
> -		free_irq_cpu_rmap(*rmap);
> -		*rmap = NULL;
> +		irq_cpu_rmap_remove(*rmap, irq->map.virq);
>  	}
>  err_irq_rmap:
>  #endif
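
For readers following the thread, the difference between the old and new cleanup can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_rmap, struct toy_glue and every toy_* helper are hypothetical stand-ins invented for illustration, not the kernel's cpu_rmap implementation. toy_rmap_free_all() plays the role of the old free_irq_cpu_rmap() path and toy_rmap_remove() the role of irq_cpu_rmap_remove().

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_RMAP_SLOTS 8

/* Hypothetical stand-in for the per-IRQ glue object. */
struct toy_glue {
	int virq;
};

/* Hypothetical stand-in for an rmap shared by all IRQs of a pool. */
struct toy_rmap {
	struct toy_glue *slot[TOY_RMAP_SLOTS];
};

static struct toy_rmap *toy_rmap_alloc(void)
{
	return calloc(1, sizeof(struct toy_rmap));
}

/* Add a mapping for one IRQ; returns 0 on success, -1 on failure. */
static int toy_rmap_add(struct toy_rmap *m, int virq)
{
	for (int i = 0; i < TOY_RMAP_SLOTS; i++) {
		if (!m->slot[i]) {
			m->slot[i] = malloc(sizeof(*m->slot[i]));
			if (!m->slot[i])
				return -1;
			m->slot[i]->virq = virq;
			return 0;
		}
	}
	return -1;
}

/* Targeted cleanup: drop only the mapping that belongs to virq. */
static void toy_rmap_remove(struct toy_rmap *m, int virq)
{
	for (int i = 0; i < TOY_RMAP_SLOTS; i++) {
		if (m->slot[i] && m->slot[i]->virq == virq) {
			free(m->slot[i]);
			m->slot[i] = NULL;
			return;
		}
	}
}

/* Blanket cleanup: drop every mapping and the map itself. */
static void toy_rmap_free_all(struct toy_rmap *m)
{
	if (!m)
		return;
	for (int i = 0; i < TOY_RMAP_SLOTS; i++)
		free(m->slot[i]);
	free(m);
}

static int toy_rmap_count(const struct toy_rmap *m)
{
	int n = 0;

	if (!m)
		return 0;
	for (int i = 0; i < TOY_RMAP_SLOTS; i++)
		if (m->slot[i])
			n++;
	return n;
}

/*
 * Three IRQs are set up successfully, then a fourth gets its mapping
 * added but its (simulated) request_irq() fails. Returns how many
 * mappings survive the error path, or -1 if the model itself cannot
 * allocate memory.
 */
static int toy_survivors_after_failure(int targeted)
{
	struct toy_rmap *m = toy_rmap_alloc();
	int n;

	if (!m)
		return -1;
	if (toy_rmap_add(m, 10) || toy_rmap_add(m, 11) ||
	    toy_rmap_add(m, 12) || toy_rmap_add(m, 13)) {
		toy_rmap_free_all(m);
		return -1;
	}

	if (targeted) {
		toy_rmap_remove(m, 13);	/* new behaviour */
	} else {
		toy_rmap_free_all(m);	/* old behaviour */
		m = NULL;
	}

	n = toy_rmap_count(m);
	toy_rmap_free_all(m);
	return n;
}
```

In this model, toy_survivors_after_failure(1) leaves the three earlier mappings in place, while toy_survivors_after_failure(0) leaves none. In the driver the earlier IRQs keep using the shared rmap after it has been freed, which is consistent with the general protection fault in the quoted trace.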