[PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump

linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump
@ 2025-10-15 11:05 Gerd Bayer
  2025-10-15 14:10 ` Niklas Schnelle
  0 siblings, 1 reply; 3+ messages in thread
From: Gerd Bayer @ 2025-10-15 11:05 UTC (permalink / raw)
  To: Niklas Schnelle, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Shay Drori, Jason Gunthorpe
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky,
	Christian Borntraeger, Sven Schnelle, Pierre Morel,
	Matthew Rosato, linux-s390, linux-kernel, netdev, linux-rdma,
	Gerd Bayer

Do not block PCI config accesses through pci_cfg_access_lock() when
executing the s390 variant of PCI error recovery: Acquire just
device_lock() instead of pci_dev_lock() as powerpc's EEH and
generig PCI AER processing do.

During error recovery testing a pair of tasks was reported to be hung:

[10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
[10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
[10320.359169]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359172] task:kmcheck         state:D stack:0     pid:72    tgid:72    ppid:2      flags:0x00000000
[10320.359176] Call Trace:
[10320.359178]  [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359187]  [<000000065256f356>] schedule+0x36/0xe0
[10320.359189]  [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
[10320.359192]  [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
[10320.359194]  [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
[10320.359360]  [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
[10320.359400]  [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
[10320.359406]  [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
[10320.359411]  [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
[10320.359416]  [<00000006522baf1a>] crw_collect_info+0x272/0x338
[10320.359418]  [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359422]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359425]  [<0000000652576642>] ret_from_fork+0xa/0x30
[10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
[10320.359441]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
[10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10320.359443] task:kworker/u1664:6 state:D stack:0     pid:1514  tgid:1514  ppid:2      flags:0x00000000
[10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[10320.359492] Call Trace:
[10320.359521]  [<000000065256f030>] __schedule+0x2a0/0x590
[10320.359524]  [<000000065256f356>] schedule+0x36/0xe0
[10320.359526]  [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
[10320.359532]  [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
[10320.359534]  [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
[10320.359585]  [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
[10320.359637]  [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
[10320.359680]  [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
[10320.359683]  [<0000000652513212>] devlink_health_report+0x19a/0x230
[10320.359685]  [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
[10320.359728]  [<0000000651bbf852>] process_one_work+0x1c2/0x458
[10320.359733]  [<0000000651bc073e>] worker_thread+0x3ce/0x528
[10320.359735]  [<0000000651bc9de0>] kthread+0x108/0x110
[10320.359737]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
[10320.359739]  [<0000000652576642>] ret_from_fork+0xa/0x30

No kernel log of the exact same error with an upstream kernel is
available - but the very same deadlock situation can be constructed there,
too:

- task: kmcheck
  mlx5_unload_one() tries to acquire devlink lock while the PCI error
  recovery code has set pdev->block_cfg_access by way of
  pci_cfg_access_lock()
- task: kworker
  mlx5_crdump_collect() tries to set block_cfg_access through
  pci_cfg_access_lock() while devlink_health_report() had acquired
  the devlink lock.

A similar deadlock situation can be reproduced by requesting a
crdump with
  > devlink health dump show pci/<BDF> reporter fw_fatal

while PCI error recovery is executed on the same <BDF> physical function
by mlx5_core's pci_error_handlers. On s390 this can be injected with
  > zpcictl --reset-fw <BDF>

Tests with this patch failed to reproduce that second deadlock situation,
the devlink command is rejected with "kernel answers: Permission denied" -
and we get a kernel log message of:

Oct 14 13:32:39 b46lp03.lnxne.boe kernel: mlx5_core 1ed0:00:00.1: mlx5_crdump_collect:50:(pid 254382): crdump: failed to lock vsc gw err -5

because the config read of VSC_SEMAPHORE is rejected by the underlying
hardware.

Two prior attempts to address this issue have been discussed and
ultimately rejected [see link], with the primary argument that s390's
implementation of PCI error recovery is imposing restrictions that
neither powerpc's EEH nor PCI AER handling need. Tests show that PCI
error recovery on s390 is running to completion even without blocking
access to PCI config space.

Link: https://lore.kernel.org/all/20251007144826.2825134-1-gbayer@linux.ibm.com/

Fixes: 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery")
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
---
All,

sorry for the immediate v2, but I had to rebase this to a current
upstream commit since v1 didn't apply cleanly, as Niklas pointed out in
private. The following assessment from v1 is still valid, though:

Hi Niklas, Shay, Jason,

by now I believe fixing this in s390/pci is the right way to go, since
the other PCI error recovery implementations apparently don't require
this strict blocking of accesses to the PCI config space.
    
Hi Alexander, Vasily, Heiko,
    
while I sent this to netdev since prior versions were discussed there,
I assume this patch will go through the s390 tree, right?
    
Thanks,
Gerd

    
---
 arch/s390/pci/pci_event.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index b95376041501f479eee20705d45fb8c68553da71..27db1e72c623f8a289cae457e87f0a9896ed241d 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -188,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	 * is unbound or probed and that userspace can't access its
 	 * configuration space while we perform recovery.
 	 */
-	pci_dev_lock(pdev);
+	device_lock(&pdev->dev);
 	if (pdev->error_state == pci_channel_io_perm_failure) {
 		ers_res = PCI_ERS_RESULT_DISCONNECT;
 		goto out_unlock;
@@ -257,7 +257,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 		driver->err_handler->resume(pdev);
 	pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
 out_unlock:
-	pci_dev_unlock(pdev);
+	device_unlock(&pdev->dev);
 	zpci_report_status(zdev, "recovery", status_str);
 
 	return ers_res;

---
base-commit: 9b332cece987ee1790b2ed4c989e28162fa47860
change-id: 20251015-fix_pcirecov_master-55fe3705c6c6

Best regards,
-- 
Gerd Bayer <gbayer@linux.ibm.com>


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump
  2025-10-15 11:05 [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump Gerd Bayer
@ 2025-10-15 14:10 ` Niklas Schnelle
  2025-10-16  8:40   ` Gerd Bayer
  0 siblings, 1 reply; 3+ messages in thread
From: Niklas Schnelle @ 2025-10-15 14:10 UTC (permalink / raw)
  To: Gerd Bayer, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Shay Drori, Jason Gunthorpe
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky,
	Christian Borntraeger, Sven Schnelle, Pierre Morel,
	Matthew Rosato, linux-s390, linux-kernel, netdev, linux-rdma

On Wed, 2025-10-15 at 13:05 +0200, Gerd Bayer wrote:
> Do not block PCI config accesses through pci_cfg_access_lock() when
> executing the s390 variant of PCI error recovery: Acquire just
> device_lock() instead of pci_dev_lock() as powerpc's EEH and
> generig PCI AER processing do.
> 
> During error recovery testing a pair of tasks was reported to be hung:
> 
> [10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
> [10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
> [10320.359169]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359172] task:kmcheck         state:D stack:0     pid:72    tgid:72    ppid:2      flags:0x00000000
> [10320.359176] Call Trace:
> [10320.359178]  [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359187]  [<000000065256f356>] schedule+0x36/0xe0
> [10320.359189]  [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
> [10320.359192]  [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
> [10320.359194]  [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
> [10320.359360]  [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
> [10320.359400]  [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
> [10320.359406]  [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
> [10320.359411]  [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
> [10320.359416]  [<00000006522baf1a>] crw_collect_info+0x272/0x338
> [10320.359418]  [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359422]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359425]  [<0000000652576642>] ret_from_fork+0xa/0x30
> [10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
> [10320.359441]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359443] task:kworker/u1664:6 state:D stack:0     pid:1514  tgid:1514  ppid:2      flags:0x00000000
> [10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
> [10320.359492] Call Trace:
> [10320.359521]  [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359524]  [<000000065256f356>] schedule+0x36/0xe0
> [10320.359526]  [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
> [10320.359532]  [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
> [10320.359534]  [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
> [10320.359585]  [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
> [10320.359637]  [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
> [10320.359680]  [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
> [10320.359683]  [<0000000652513212>] devlink_health_report+0x19a/0x230
> [10320.359685]  [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
> [10320.359728]  [<0000000651bbf852>] process_one_work+0x1c2/0x458
> [10320.359733]  [<0000000651bc073e>] worker_thread+0x3ce/0x528
> [10320.359735]  [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359737]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359739]  [<0000000652576642>] ret_from_fork+0xa/0x30

I'd tend to prune this a bit, at the very least I would remove the time
stamp prefix.

> 
> No kernel log of the exact same error with an upstream kernel is
> available - but the very same deadlock situation can be constructed there,
> too:
> 
> - task: kmcheck
>   mlx5_unload_one() tries to acquire devlink lock while the PCI error
>   recovery code has set pdev->block_cfg_access by way of
>   pci_cfg_access_lock()
> - task: kworker
>   mlx5_crdump_collect() tries to set block_cfg_access through
>   pci_cfg_access_lock() while devlink_health_report() had acquired
>   the devlink lock.
> 
> A similar deadlock situation can be reproduced by requesting a
> crdump with
>   > devlink health dump show pci/<BDF> reporter fw_fatal
> 
> while PCI error recovery is executed on the same <BDF> physical function
> by mlx5_core's pci_error_handlers. On s390 this can be injected with
>   > zpcictl --reset-fw <BDF>
> 
> Tests with this patch failed to reproduce that second deadlock situation,
> the devlink command is rejected with "kernel answers: Permission denied" -
> and we get a kernel log message of:
> 
> Oct 14 13:32:39 b46lp03.lnxne.boe kernel: mlx5_core 1ed0:00:00.1: mlx5_crdump_collect:50:(pid 254382): crdump: failed to lock vsc gw err -5

Same as above I'd remove everyting before the "mlx5_core: …" as that
adds no relevant information.

> 
> because the config read of VSC_SEMAPHORE is rejected by the underlying
> hardware.
> 
> Two prior attempts to address this issue have been discussed and
> ultimately rejected [see link], with the primary argument that s390's
> implementation of PCI error recovery is imposing restrictions that
> neither powerpc's EEH nor PCI AER handling need. Tests show that PCI
> error recovery on s390 is running to completion even without blocking
> access to PCI config space.
> 
> Link: https://lore.kernel.org/all/20251007144826.2825134-1-gbayer@linux.ibm.com/
> 
> Fixes: 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery")

Besides a Fixes tag let's add "Cc: stable@vger.kernel.org" and you can
put that instead of the empty line between Link and Fixes since usually
these are an uninterrupted block of tags.

> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
> ---
> All,
> 
> sorry for the immediate v2, but I had to rebase this to a current
> upstream commit since v1 didn't apply cleanly, as Niklas pointed out in
> private. The following assessment from v1 is still valid, though:
> 
> Hi Niklas, Shay, Jason,
> 
> by now I believe fixing this in s390/pci is the right way to go, since
> the other PCI error recovery implementations apparently don't require
> this strict blocking of accesses to the PCI config space.
>     
> Hi Alexander, Vasily, Heiko,
>     
> while I sent this to netdev since prior versions were discussed there,
> I assume this patch will go through the s390 tree, right?
>     
> Thanks,
> Gerd
> 
>     
> ---
>  arch/s390/pci/pci_event.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index b95376041501f479eee20705d45fb8c68553da71..27db1e72c623f8a289cae457e87f0a9896ed241d 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -188,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  	 * is unbound or probed and that userspace can't access its
>  	 * configuration space while we perform recovery.
>  	 */
> -	pci_dev_lock(pdev);
> +	device_lock(&pdev->dev);
>  	if (pdev->error_state == pci_channel_io_perm_failure) {
>  		ers_res = PCI_ERS_RESULT_DISCONNECT;
>  		goto out_unlock;
> @@ -257,7 +257,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  		driver->err_handler->resume(pdev);
>  	pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
>  out_unlock:
> -	pci_dev_unlock(pdev);
> +	device_unlock(&pdev->dev);
>  	zpci_report_status(zdev, "recovery", status_str);
>  
>  	return ers_res;

Code-wise this looks good to me and I've confirmed that EEH and AER
indeed only use the device_lock() and I don't see a reason why that
shouldn't be enough for us too if it is enough for them. I think I just
picked pci_dev_lock() because it seemed fitting but it probably was too
big a hammer.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump
  2025-10-15 14:10 ` Niklas Schnelle
@ 2025-10-16  8:40   ` Gerd Bayer
  0 siblings, 0 replies; 3+ messages in thread
From: Gerd Bayer @ 2025-10-16  8:40 UTC (permalink / raw)
  To: Niklas Schnelle, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Shay Drori, Jason Gunthorpe
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky,
	Christian Borntraeger, Sven Schnelle, Pierre Morel,
	Matthew Rosato, linux-s390, linux-kernel, netdev, linux-rdma

On Wed, 2025-10-15 at 16:10 +0200, Niklas Schnelle wrote:
> On Wed, 2025-10-15 at 13:05 +0200, Gerd Bayer wrote:
> > Do not block PCI config accesses through pci_cfg_access_lock() when
> > executing the s390 variant of PCI error recovery: Acquire just
> > device_lock() instead of pci_dev_lock() as powerpc's EEH and
> > generig PCI AER processing do.
> > 

[... snip ...]

> 
> Code-wise this looks good to me and I've confirmed that EEH and AER
> indeed only use the device_lock() and I don't see a reason why that
> shouldn't be enough for us too if it is enough for them. I think I
> just
> picked pci_dev_lock() because it seemed fitting but it probably was
> too
> big a hammer.
> 
> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

I'll send around a v3 with your suggested changes to the commit message
included.

Thanks for the review,
Gerd

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-10-16  8:40 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-15 11:05 [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump Gerd Bayer
2025-10-15 14:10 ` Niklas Schnelle
2025-10-16  8:40   ` Gerd Bayer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).