* [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
@ 2025-08-07 13:11 Gerd Bayer
2025-08-10 11:51 ` Shay Drori
0 siblings, 1 reply; 4+ messages in thread
From: Gerd Bayer @ 2025-08-07 13:11 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Jakub Kicinski, Paolo Abeni, Moshe Shemesh, Shay Drori
Cc: Niklas Schnelle, Alexandra Winter, Gerd Bayer, Andrew Lunn,
David S. Miller, Eric Dumazet, Yuanyuan Zhong, Mohamed Khalfella,
netdev, linux-rdma, linux-kernel
During error recovery testing a pair of tasks was reported to be hung
due to a dead-lock situation:
- mlx5_unload_one() trying to acquire devlink lock while the PCI error
recovery code had acquired the pci_cfg_access_lock().
- mlx5_crdump_collect() trying to acquire the pci_cfg_access_lock()
while devlink_health_report() had acquired the devlink lock.
Move the check for pci_channel_offline prior to acquiring the
pci_cfg_access_lock in mlx5_vsc_gw_lock since collecting the crdump will
fail anyhow while PCI error recovery is running.
Fixes: 33afbfcc105a ("net/mlx5: Stop waiting for PCI if pci channel is offline")
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
Hi all,
while the initial hit was recorded during "random" testing, where PCI
error recovery and poll_health() tripped almost simultaneously, I found
a way to reproduce a very similar hang at will on s390:
Inject a PCI error recovery event on a Physical Function <BDF> with
zpcictl --reset-fw <BDF>
then request a crdump with
devlink health dump show pci/<BDF> reporter fw_fatal
With the patch applied I didn't get the hang but kernel logs showed:
[ 792.885743] mlx5_core 000a:00:00.0: mlx5_crdump_collect:51:(pid 1415): crdump: failed to lock vsc gw err -13
and the crdump request ended with:
kernel answers: Permission denied
Thanks, Gerd
---
drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
index 432c98f2626d..d2d3b57a57d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
@@ -73,16 +73,15 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
u32 lock_val;
int ret;
+ if (pci_channel_offline(dev->pdev))
+ return -EACCES;
+
pci_cfg_access_lock(dev->pdev);
do {
if (retries > VSC_MAX_RETRIES) {
ret = -EBUSY;
goto pci_unlock;
}
- if (pci_channel_offline(dev->pdev)) {
- ret = -EACCES;
- goto pci_unlock;
- }
/* Check if semaphore is already locked */
ret = vsc_read(dev, VSC_SEMAPHORE_OFFSET, &lock_val);
--
2.48.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
2025-08-07 13:11 [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter Gerd Bayer
@ 2025-08-10 11:51 ` Shay Drori
2025-08-11 7:29 ` Gerd Bayer
0 siblings, 1 reply; 4+ messages in thread
From: Shay Drori @ 2025-08-10 11:51 UTC (permalink / raw)
To: Gerd Bayer, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Jakub Kicinski, Paolo Abeni, Moshe Shemesh
Cc: Niklas Schnelle, Alexandra Winter, Andrew Lunn, David S. Miller,
Eric Dumazet, Yuanyuan Zhong, Mohamed Khalfella, netdev,
linux-rdma, linux-kernel
On 07/08/2025 16:11, Gerd Bayer wrote:
> External email: Use caution opening links or attachments
>
>
> During error recovery testing a pair of tasks was reported to be hung
> due to a dead-lock situation:
>
> - mlx5_unload_one() trying to acquire devlink lock while the PCI error
> recovery code had acquired the pci_cfg_access_lock().
could you please add traces here?
I looked at the code and didn't see where pci_cfg_access_lock() is
taken...
Thanks!
> - mlx5_crdump_collect() trying to acquire the pci_cfg_access_lock()
> while devlink_health_report() had acquired the devlink lock.>
> Move the check for pci_channel_offline prior to acquiring the
> pci_cfg_access_lock in mlx5_vsc_gw_lock since collecting the crdump will
> fail anyhow while PCI error recovery is running.
>
> Fixes: 33afbfcc105a ("net/mlx5: Stop waiting for PCI if pci channel is offline")
> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
> ---
>
> Hi all,
>
> while the initial hit was recorded during "random" testing, where PCI
> error recovery and poll_health() tripped almost simultaneously, I found
> a way to reproduce a very similar hang at will on s390:
>
> Inject a PCI error recovery event on a Physical Function <BDF> with
> zpcictl --reset-fw <BDF>
>
> then request a crdump with
> devlink health dump show pci/<BDF> reporter fw_fatal
>
> With the patch applied I didn't get the hang but kernel logs showed:
> [ 792.885743] mlx5_core 000a:00:00.0: mlx5_crdump_collect:51:(pid 1415): crdump: failed to lock vsc gw err -13
>
> and the crdump request ended with:
> kernel answers: Permission denied
>
> Thanks, Gerd
> ---
> drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> index 432c98f2626d..d2d3b57a57d5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> @@ -73,16 +73,15 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
> u32 lock_val;
> int ret;
>
> + if (pci_channel_offline(dev->pdev))
> + return -EACCES;
> +
> pci_cfg_access_lock(dev->pdev);
> do {
> if (retries > VSC_MAX_RETRIES) {
> ret = -EBUSY;
> goto pci_unlock;
> }
> - if (pci_channel_offline(dev->pdev)) {
> - ret = -EACCES;
> - goto pci_unlock;
> - }
>
> /* Check if semaphore is already locked */
> ret = vsc_read(dev, VSC_SEMAPHORE_OFFSET, &lock_val);
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
2025-08-10 11:51 ` Shay Drori
@ 2025-08-11 7:29 ` Gerd Bayer
2025-08-11 12:04 ` Shay Drori
0 siblings, 1 reply; 4+ messages in thread
From: Gerd Bayer @ 2025-08-11 7:29 UTC (permalink / raw)
To: Shay Drori, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Jakub Kicinski, Paolo Abeni, Moshe Shemesh
Cc: Niklas Schnelle, Alexandra Winter, Andrew Lunn, David S. Miller,
Eric Dumazet, Yuanyuan Zhong, Mohamed Khalfella, netdev,
linux-rdma, linux-kernel
On Sun, 2025-08-10 at 14:51 +0300, Shay Drori wrote:
> On 07/08/2025 16:11, Gerd Bayer wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > During error recovery testing a pair of tasks was reported to be hung
> > due to a dead-lock situation:
> >
> > - mlx5_unload_one() trying to acquire devlink lock while the PCI error
> > recovery code had acquired the pci_cfg_access_lock().
>
> could you please add traces here?
> I looked at the code and didn't see where pci_cfg_access_lock() is
> taken...
Sure thing. This is the original hung task message:
10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
[10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
[10320.359169] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359172] task:kmcheck state:D stack:0 pid:72 tgid:72 ppid:2 flags:0x00000000
[10320.359176] Call Trace:
[10320.359178] [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359187] [<000000065256f356>] schedule+0x36/0xe0
[10320.359189] [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
[10320.359192] [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
[10320.359194] [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
[10320.359360] [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
[10320.359400] [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
[10320.359406] [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
[10320.359411] [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
[10320.359416] [<00000006522baf1a>] crw_collect_info+0x272/0x338
[10320.359418] [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359422] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359425] [<0000000652576642>] ret_from_fork+0xa/0x30
[10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
[10320.359441] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359443] task:kworker/u1664:6 state:D stack:0 pid:1514 tgid:1514 ppid:2 flags:0x00000000
[10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[10320.359492] Call Trace:
[10320.359521] [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359524] [<000000065256f356>] schedule+0x36/0xe0
[10320.359526] [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
[10320.359532] [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
[10320.359534] [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
[10320.359585] [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
[10320.359637] [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
[10320.359680] [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
[10320.359683] [<0000000652513212>] devlink_health_report+0x19a/0x230
[10320.359685] [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
[10320.359728] [<0000000651bbf852>] process_one_work+0x1c2/0x458
[10320.359733] [<0000000651bc073e>] worker_thread+0x3ce/0x528
[10320.359735] [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359737] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359739] [<0000000652576642>] ret_from_fork+0xa/0x30
The pci_config_access_lock() is acquired in zpci_event_attempt_error_recovery() by way of pci_dev_lock().
And this is from the reproduction by injection with a recent kernel (with lockdep enabled):
Aug 05 18:32:23 a83lp68.lnxne.boe kernel: zpci: 000a:00:00.0: The device is ready to resume operations
Aug 05 18:32:23 a83lp68.lnxne.boe kernel: mlx5_core 000a:00:00.0: mlx5_pci_resume Device state = 2 health sensors: 3 pci_status: 1. Enter, loading d>
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: INFO: task kmcheck:161 blocked for more than 122 seconds.
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: Not tainted 6.16.0-11676-g19dc172eb595 #7
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: task:kmcheck state:D stack:10160 pid:161 tgid:161 ppid:2 task_flags:0x208040 flags:0x0000>
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: Call Trace:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f00be>] __schedule+0xa3e/0x1c80
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f1360>] schedule+0x60/0x100
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f148a>] schedule_preempt_disabled+0x2a/0x40
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f3668>] __mutex_lock_common+0x938/0x10b0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f2bac>] mutex_lock_nested+0x3c/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10cf84c4>] mlx5_pci_resume+0x74/0x110 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d907d0b7e>] zpci_event_attempt_error_recovery+0x2ae/0x410
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d907d0546>] zpci_event_error+0x2a6/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d9155367c>] chsc_process_crw+0x2dc/0x990
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d9155b600>] crw_collect_info+0x190/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d90814796>] kthread+0x296/0x2c0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d90780a9e>] __ret_from_fork+0x3e/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918fc75a>] ret_from_fork+0xa/0x30
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 4 locks held by kmcheck/161:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #0: 0000033d922455d0 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x162/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #1: 00000233282d08d0 (&zdev->state_lock){+.+.}-{3:3}, at: zpci_event_error+0x15c/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #2: 00000233205f1190 (&dev->mutex){....}-{3:3}, at: pci_dev_lock+0x2c/0x40
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #3: 000002334b95a270 (&devlink->lock_key){+.+.}-{3:3}, at: mlx5_pci_resume+0x74/0x110 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: INFO: task kmcheck:161 is blocked on a mutex likely owned by task devlink:1274.
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: task:devlink state:D stack:11984 pid:1274 tgid:1274 ppid:1247 task_flags:0x400100 flags:0x0000>
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: Call Trace:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f00be>] __schedule+0xa3e/0x1c80
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f1360>] schedule+0x60/0x100
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d914291d2>] pci_wait_cfg+0xa2/0xc0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d91429630>] pci_cfg_access_lock+0x50/0x70
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10c12726>] mlx5_vsc_gw_lock+0x36/0x150 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10c19a42>] mlx5_crdump_collect+0x32/0x230 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10bf0d0e>] mlx5_fw_fatal_reporter_dump+0x6e/0xe0 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d9187a5a0>] devlink_health_do_dump+0x130/0x270
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d9187c0e2>] devlink_nl_health_reporter_dump_get_dumpit+0x222/0x2a0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c2232>] genl_dumpit+0x62/0x90
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916b9fe0>] netlink_dump+0x1b0/0x3f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916b9c2e>] __netlink_dump_start+0x1be/0x230
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c1ee0>] genl_family_rcv_msg_dumpit+0xc0/0xf0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c1dc6>] genl_rcv_msg+0x4e6/0x540
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916ba806>] netlink_rcv_skb+0xc6/0xf0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c164a>] genl_rcv+0x3a/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916b7f86>] netlink_unicast+0x1f6/0x320
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916bc580>] netlink_sendmsg+0x310/0x3a0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d915e3c60>] __sys_sendto+0x120/0x190
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d915e5e28>] __s390x_sys_socketcall+0x358/0x410
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918eaa3c>] __do_syscall+0x14c/0x3e0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918fc72e>] system_call+0x6e/0x90
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 3 locks held by devlink/1274:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #0: 0000033d92263be0 (cb_lock){++++}-{3:3}, at: genl_rcv+0x2a/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #1: 000002331a2b6f20 (nlk_cb_mutex-GENERIC){+.+.}-{3:3}, at: __netlink_dump_start+0x84/0x230
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #2: 000002334b95a270 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0xf4/0x160
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: INFO: task devlink:1274 blocked for more than 122 seconds.
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: Not tainted 6.16.0-11676-g19dc172eb595 #7
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: task:devlink state:D stack:11984 pid:1274 tgid:1274 ppid:1247 task_flags:0x400100 flags:0x0000>
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: Call Trace:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f00be>] __schedule+0xa3e/0x1c80
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918f1360>] schedule+0x60/0x100
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d914291d2>] pci_wait_cfg+0xa2/0xc0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d91429630>] pci_cfg_access_lock+0x50/0x70
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10c12726>] mlx5_vsc_gw_lock+0x36/0x150 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10c19a42>] mlx5_crdump_collect+0x32/0x230 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d10bf0d0e>] mlx5_fw_fatal_reporter_dump+0x6e/0xe0 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d9187a5a0>] devlink_health_do_dump+0x130/0x270
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d9187c0e2>] devlink_nl_health_reporter_dump_get_dumpit+0x222/0x2a0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c2232>] genl_dumpit+0x62/0x90
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916b9fe0>] netlink_dump+0x1b0/0x3f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916b9c2e>] __netlink_dump_start+0x1be/0x230
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c1ee0>] genl_family_rcv_msg_dumpit+0xc0/0xf0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c1dc6>] genl_rcv_msg+0x4e6/0x540
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916ba806>] netlink_rcv_skb+0xc6/0xf0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916c164a>] genl_rcv+0x3a/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916b7f86>] netlink_unicast+0x1f6/0x320
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d916bc580>] netlink_sendmsg+0x310/0x3a0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d915e3c60>] __sys_sendto+0x120/0x190
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d915e5e28>] __s390x_sys_socketcall+0x358/0x410
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918eaa3c>] __do_syscall+0x14c/0x3e0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: [<0000033d918fc72e>] system_call+0x6e/0x90
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 3 locks held by devlink/1274:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #0: 0000033d92263be0 (cb_lock){++++}-{3:3}, at: genl_rcv+0x2a/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #1: 000002331a2b6f20 (nlk_cb_mutex-GENERIC){+.+.}-{3:3}, at: __netlink_dump_start+0x84/0x230
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #2: 000002334b95a270 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0xf4/0x160
Aug 05 18:34:42 a83lp68.lnxne.boe kernel:
Showing all locks held in the system:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 1 lock held by khungtaskd/132:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #0: 0000033d921366a8 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire+0x14/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 2 locks held by kworker/12:1/140:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 4 locks held by kmcheck/161:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #0: 0000033d922455d0 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x162/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #1: 00000233282d08d0 (&zdev->state_lock){+.+.}-{3:3}, at: zpci_event_error+0x15c/0x2f0
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #2: 00000233205f1190 (&dev->mutex){....}-{3:3}, at: pci_dev_lock+0x2c/0x40
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #3: 000002334b95a270 (&devlink->lock_key){+.+.}-{3:3}, at: mlx5_pci_resume+0x74/0x110 [mlx5_core]
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: 3 locks held by devlink/1274:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #0: 0000033d92263be0 (cb_lock){++++}-{3:3}, at: genl_rcv+0x2a/0x50
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #1: 000002331a2b6f20 (nlk_cb_mutex-GENERIC){+.+.}-{3:3}, at: __netlink_dump_start+0x84/0x230
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: #2: 000002334b95a270 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0xf4/0x160
Aug 05 18:34:42 a83lp68.lnxne.boe kernel:
Aug 05 18:34:42 a83lp68.lnxne.boe kernel: =============================================
>
> Thanks!
Hope this helps?
I may be slow with further responses, as I'm traveling for about a
week.
Thanks,
Gerd
>
> > - mlx5_crdump_collect() trying to acquire the pci_cfg_access_lock()
> > while devlink_health_report() had acquired the devlink lock.>
> > Move the check for pci_channel_offline prior to acquiring the
> > pci_cfg_access_lock in mlx5_vsc_gw_lock since collecting the crdump will
> > fail anyhow while PCI error recovery is running.
> >
> > Fixes: 33afbfcc105a ("net/mlx5: Stop waiting for PCI if pci channel is offline")
> > Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
> > ---
> >
> > Hi all,
> >
> > while the initial hit was recorded during "random" testing, where PCI
> > error recovery and poll_health() tripped almost simultaneously, I found
> > a way to reproduce a very similar hang at will on s390:
> >
> > Inject a PCI error recovery event on a Physical Function <BDF> with
> > zpcictl --reset-fw <BDF>
> >
> > then request a crdump with
> > devlink health dump show pci/<BDF> reporter fw_fatal
> >
> > With the patch applied I didn't get the hang but kernel logs showed:
> > [ 792.885743] mlx5_core 000a:00:00.0: mlx5_crdump_collect:51:(pid 1415): crdump: failed to lock vsc gw err -13
> >
> > and the crdump request ended with:
> > kernel answers: Permission denied
> >
> > Thanks, Gerd
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 7 +++----
> > 1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> > index 432c98f2626d..d2d3b57a57d5 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> > @@ -73,16 +73,15 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
> > u32 lock_val;
> > int ret;
> >
> > + if (pci_channel_offline(dev->pdev))
> > + return -EACCES;
> > +
> > pci_cfg_access_lock(dev->pdev);
> > do {
> > if (retries > VSC_MAX_RETRIES) {
> > ret = -EBUSY;
> > goto pci_unlock;
> > }
> > - if (pci_channel_offline(dev->pdev)) {
> > - ret = -EACCES;
> > - goto pci_unlock;
> > - }
> >
> > /* Check if semaphore is already locked */
> > ret = vsc_read(dev, VSC_SEMAPHORE_OFFSET, &lock_val);
> > --
> > 2.48.1
> >
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
2025-08-11 7:29 ` Gerd Bayer
@ 2025-08-11 12:04 ` Shay Drori
0 siblings, 0 replies; 4+ messages in thread
From: Shay Drori @ 2025-08-11 12:04 UTC (permalink / raw)
To: Gerd Bayer, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
Mark Bloch, Jakub Kicinski, Paolo Abeni, Moshe Shemesh
Cc: Niklas Schnelle, Alexandra Winter, Andrew Lunn, David S. Miller,
Eric Dumazet, Yuanyuan Zhong, Mohamed Khalfella, netdev,
linux-rdma, linux-kernel
On 11/08/2025 10:29, Gerd Bayer wrote:
> External email: Use caution opening links or attachments
>
>
> On Sun, 2025-08-10 at 14:51 +0300, Shay Drori wrote:
>> On 07/08/2025 16:11, Gerd Bayer wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> During error recovery testing a pair of tasks was reported to be hung
>>> due to a dead-lock situation:
>>>
>>> - mlx5_unload_one() trying to acquire devlink lock while the PCI error
>>> recovery code had acquired the pci_cfg_access_lock().
>>
>> could you please add traces here?
>> I looked at the code and didn't see where pci_cfg_access_lock() is
>> taken...
>
> Sure thing. This is the original hung task message:
>
> 10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
> [10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
> [10320.359169] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359172] task:kmcheck state:D stack:0 pid:72 tgid:72 ppid:2 flags:0x00000000
> [10320.359176] Call Trace:
> [10320.359178] [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359187] [<000000065256f356>] schedule+0x36/0xe0
> [10320.359189] [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
> [10320.359192] [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
> [10320.359194] [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
> [10320.359360] [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
> [10320.359400] [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
> [10320.359406] [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
> [10320.359411] [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
> [10320.359416] [<00000006522baf1a>] crw_collect_info+0x272/0x338
> [10320.359418] [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359422] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359425] [<0000000652576642>] ret_from_fork+0xa/0x30
> [10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
> [10320.359441] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359443] task:kworker/u1664:6 state:D stack:0 pid:1514 tgid:1514 ppid:2 flags:0x00000000
> [10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
> [10320.359492] Call Trace:
> [10320.359521] [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359524] [<000000065256f356>] schedule+0x36/0xe0
> [10320.359526] [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
> [10320.359532] [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
> [10320.359534] [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
> [10320.359585] [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
> [10320.359637] [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
> [10320.359680] [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
> [10320.359683] [<0000000652513212>] devlink_health_report+0x19a/0x230
> [10320.359685] [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
> [10320.359728] [<0000000651bbf852>] process_one_work+0x1c2/0x458
> [10320.359733] [<0000000651bc073e>] worker_thread+0x3ce/0x528
> [10320.359735] [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359737] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359739] [<0000000652576642>] ret_from_fork+0xa/0x30
>
> The pci_config_access_lock() is acquired in zpci_event_attempt_error_recovery() by way of pci_dev_lock().
Thanks a lot!
can you please add the above to the commit message?
snip
<...>
>>> ---
>>> drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 7 +++----
>>> 1 file changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> index 432c98f2626d..d2d3b57a57d5 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> @@ -73,16 +73,15 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
>>> u32 lock_val;
>>> int ret;
>>>
>>> + if (pci_channel_offline(dev->pdev))
>>> + return -EACCES;
>>> +
There is still a race here.
it is possible that mlx5 have passed the above check, while
zpci_event_attempt_error_recovery() already took the cfg_look but still
didn't change the pdev to error state :(
>>> pci_cfg_access_lock(dev->pdev);
>>> do {
>>> if (retries > VSC_MAX_RETRIES) {
>>> ret = -EBUSY;
>>> goto pci_unlock;
>>> }
>>> - if (pci_channel_offline(dev->pdev)) {
>>> - ret = -EACCES;
>>> - goto pci_unlock;
>>> - }
>>>
>>> /* Check if semaphore is already locked */
>>> ret = vsc_read(dev, VSC_SEMAPHORE_OFFSET, &lock_val);
>>> --
>>> 2.48.1
>>>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-08-11 12:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07 13:11 [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter Gerd Bayer
2025-08-10 11:51 ` Shay Drori
2025-08-11 7:29 ` Gerd Bayer
2025-08-11 12:04 ` Shay Drori
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).