netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net v2] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
@ 2025-10-07 14:48 Gerd Bayer
  2025-10-07 16:21 ` Jason Gunthorpe
  0 siblings, 1 reply; 3+ messages in thread
From: Gerd Bayer @ 2025-10-07 14:48 UTC (permalink / raw)
  To: Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Shay Drori,
	Mark Bloch
  Cc: Gerd Bayer, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alex Vesker, Feras Daoud, netdev,
	linux-rdma, linux-kernel, Niklas Schnelle, linux-s390

Try to block further PCI config accesses just once when trying to acquire
the VSC GW lock. PCI error recovery on s390 may be blocking accesses
while trying to acquire the devlink lock that mlx5_crdump_collect is
holding already. In effect, this sacrifices the crdump if there is
contention with other tasks about PCI config accesses.

During error recovery testing a pair of tasks was reported to be hung:

[10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
[10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
[10320.359169]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359172] task:kmcheck         state:D stack:0     pid:72    tgid:72    ppid:2      flags:0x00000000
[10320.359176] Call Trace:
[10320.359178]  [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359187]  [<000000065256f356>] schedule+0x36/0xe0
[10320.359189]  [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
[10320.359192]  [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
[10320.359194]  [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
[10320.359360]  [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
[10320.359400]  [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
[10320.359406]  [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
[10320.359411]  [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
[10320.359416]  [<00000006522baf1a>] crw_collect_info+0x272/0x338
[10320.359418]  [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359422]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359425]  [<0000000652576642>] ret_from_fork+0xa/0x30
[10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
[10320.359441]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359443] task:kworker/u1664:6 state:D stack:0     pid:1514  tgid:1514  ppid:2      flags:0x00000000
[10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[10320.359492] Call Trace:
[10320.359521]  [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359524]  [<000000065256f356>] schedule+0x36/0xe0
[10320.359526]  [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
[10320.359532]  [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
[10320.359534]  [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
[10320.359585]  [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
[10320.359637]  [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
[10320.359680]  [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
[10320.359683]  [<0000000652513212>] devlink_health_report+0x19a/0x230
[10320.359685]  [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
[10320.359728]  [<0000000651bbf852>] process_one_work+0x1c2/0x458
[10320.359733]  [<0000000651bc073e>] worker_thread+0x3ce/0x528
[10320.359735]  [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359737]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359739]  [<0000000652576642>] ret_from_fork+0xa/0x30

No kernel log of the exact same error with an upstream kernel is
available - but the very same deadlock situation can be constructed there,
too:

- task: kmcheck
  mlx5_unload_one() tries to acquire devlink lock while the PCI error
  recovery code has set pdev->block_cfg_access by way of
  pci_cfg_access_lock()
- task: kworker
  mlx5_crdump_collect() tries to set block_cfg_access through
  pci_cfg_access_lock() while devlink_health_report() had acquired
  the devlink lock.

A similar deadlock situation can be reproduced by requesting a
crdump with
  > devlink health dump show pci/<BDF> reporter fw_fatal

while PCI error recovery is executed on the same <BDF> physical function
by mlx5_core's pci_error_handlers. On s390 this can be injected with
  > zpcictl --reset-fw <BDF>

Extensive tests with the same setup that showed the original dead-lock
didn't reproduce, nor did the second deadlock situation hit with
this patch applied.

Fixes: b25bbc2f24dc ("net/mlx5: Add Vendor Specific Capability access gateway")
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>

---

v1:
Attempted to fix this differently, but still had potential for deadlocks
and the second inject reproduced it quite regularly, still
https://lore.kernel.org/netdev/20250807131130.4056349-1-gbayer@linux.ibm.com/
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
index 432c98f2626d..f668237b6bb0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
@@ -73,7 +73,9 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
 	u32 lock_val;
 	int ret;
 
-	pci_cfg_access_lock(dev->pdev);
+	if (!pci_cfg_access_trylock(dev->pdev))
+		return -EBUSY;
+
 	do {
 		if (retries > VSC_MAX_RETRIES) {
 			ret = -EBUSY;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH net v2] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
  2025-10-07 14:48 [PATCH net v2] net/mlx5: Avoid deadlock between PCI error recovery and health reporter Gerd Bayer
@ 2025-10-07 16:21 ` Jason Gunthorpe
  2025-10-08 16:46   ` Gerd Bayer
  0 siblings, 1 reply; 3+ messages in thread
From: Jason Gunthorpe @ 2025-10-07 16:21 UTC (permalink / raw)
  To: Gerd Bayer
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Shay Drori,
	Mark Bloch, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alex Vesker, Feras Daoud, netdev,
	linux-rdma, linux-kernel, Niklas Schnelle, linux-s390

On Tue, Oct 07, 2025 at 04:48:26PM +0200, Gerd Bayer wrote:
> - task: kmcheck
>   mlx5_unload_one() tries to acquire devlink lock while the PCI error
>   recovery code has set pdev->block_cfg_access by way of
>   pci_cfg_access_lock()

This seems wrong, arch code shouldn't invoke the driver's error
handler while hodling pci_dev_lock().

Or at least if we do want to do this the locking should be documented
and some lockdep map should be added to pci_cfg_access_lock() and the
normal AER path..

Jason

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH net v2] net/mlx5: Avoid deadlock between PCI error recovery and health reporter
  2025-10-07 16:21 ` Jason Gunthorpe
@ 2025-10-08 16:46   ` Gerd Bayer
  0 siblings, 0 replies; 3+ messages in thread
From: Gerd Bayer @ 2025-10-08 16:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky, Shay Drori,
	Mark Bloch, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alex Vesker, Feras Daoud, netdev,
	linux-rdma, linux-kernel, Niklas Schnelle, linux-s390

On Tue, 2025-10-07 at 13:21 -0300, Jason Gunthorpe wrote:
> On Tue, Oct 07, 2025 at 04:48:26PM +0200, Gerd Bayer wrote:
> > - task: kmcheck
> >   mlx5_unload_one() tries to acquire devlink lock while the PCI
> > error
> >   recovery code has set pdev->block_cfg_access by way of
> >   pci_cfg_access_lock()
> 
> This seems wrong, arch code shouldn't invoke the driver's error
> handler while hodling pci_dev_lock().

Seeing how powerpc's EEH is also just acquiring the device_lock while
executing the PCI error recovery call-back, I'll be investigating that
route by "demoting" pci_dev_lock() to device_lock() (i.e. not including
the blockage of PCI config accesses)

Initial tests look promising, but I need to do more experimenting and
want to check the AER path in passing, too.

> Or at least if we do want to do this the locking should be documented
> and some lockdep map should be added to pci_cfg_access_lock() and the
> normal AER path..

This change of contract sounds a lot more intrusive to device drivers -
so I'm not actually pursuing this.

> 
> Jason

Thanks, Gerd

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-10-08 16:47 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-07 14:48 [PATCH net v2] net/mlx5: Avoid deadlock between PCI error recovery and health reporter Gerd Bayer
2025-10-07 16:21 ` Jason Gunthorpe
2025-10-08 16:46   ` Gerd Bayer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).