linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Gerd Bayer <gbayer@linux.ibm.com>,
	Gerald Schaefer	 <gerald.schaefer@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Shay Drori	 <shayd@nvidia.com>, Jason Gunthorpe <jgg@ziepe.ca>
Cc: Tariq Toukan <tariqt@nvidia.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Leon Romanovsky	 <leon@kernel.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Pierre Morel <pmorel@linux.ibm.com>,
	Matthew Rosato	 <mjrosato@linux.ibm.com>,
	linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump
Date: Wed, 15 Oct 2025 16:10:33 +0200	[thread overview]
Message-ID: <358e3a7a1e735d5b1e761cf159db8ae735a9578d.camel@linux.ibm.com> (raw)
In-Reply-To: <20251015-fix_pcirecov_master-v2-1-e07962fe9558@linux.ibm.com>

On Wed, 2025-10-15 at 13:05 +0200, Gerd Bayer wrote:
> Do not block PCI config accesses through pci_cfg_access_lock() when
> executing the s390 variant of PCI error recovery: Acquire just
> device_lock() instead of pci_dev_lock() as powerpc's EEH and
> generig PCI AER processing do.
> 
> During error recovery testing a pair of tasks was reported to be hung:
> 
> [10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
> [10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
> [10320.359169]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359172] task:kmcheck         state:D stack:0     pid:72    tgid:72    ppid:2      flags:0x00000000
> [10320.359176] Call Trace:
> [10320.359178]  [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359187]  [<000000065256f356>] schedule+0x36/0xe0
> [10320.359189]  [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
> [10320.359192]  [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
> [10320.359194]  [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
> [10320.359360]  [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
> [10320.359400]  [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
> [10320.359406]  [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
> [10320.359411]  [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
> [10320.359416]  [<00000006522baf1a>] crw_collect_info+0x272/0x338
> [10320.359418]  [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359422]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359425]  [<0000000652576642>] ret_from_fork+0xa/0x30
> [10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
> [10320.359441]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359443] task:kworker/u1664:6 state:D stack:0     pid:1514  tgid:1514  ppid:2      flags:0x00000000
> [10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
> [10320.359492] Call Trace:
> [10320.359521]  [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359524]  [<000000065256f356>] schedule+0x36/0xe0
> [10320.359526]  [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
> [10320.359532]  [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
> [10320.359534]  [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
> [10320.359585]  [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
> [10320.359637]  [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
> [10320.359680]  [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
> [10320.359683]  [<0000000652513212>] devlink_health_report+0x19a/0x230
> [10320.359685]  [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
> [10320.359728]  [<0000000651bbf852>] process_one_work+0x1c2/0x458
> [10320.359733]  [<0000000651bc073e>] worker_thread+0x3ce/0x528
> [10320.359735]  [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359737]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359739]  [<0000000652576642>] ret_from_fork+0xa/0x30

I'd tend to prune this a bit, at the very least I would remove the time
stamp prefix.

> 
> No kernel log of the exact same error with an upstream kernel is
> available - but the very same deadlock situation can be constructed there,
> too:
> 
> - task: kmcheck
>   mlx5_unload_one() tries to acquire devlink lock while the PCI error
>   recovery code has set pdev->block_cfg_access by way of
>   pci_cfg_access_lock()
> - task: kworker
>   mlx5_crdump_collect() tries to set block_cfg_access through
>   pci_cfg_access_lock() while devlink_health_report() had acquired
>   the devlink lock.
> 
> A similar deadlock situation can be reproduced by requesting a
> crdump with
>   > devlink health dump show pci/<BDF> reporter fw_fatal
> 
> while PCI error recovery is executed on the same <BDF> physical function
> by mlx5_core's pci_error_handlers. On s390 this can be injected with
>   > zpcictl --reset-fw <BDF>
> 
> Tests with this patch failed to reproduce that second deadlock situation,
> the devlink command is rejected with "kernel answers: Permission denied" -
> and we get a kernel log message of:
> 
> Oct 14 13:32:39 b46lp03.lnxne.boe kernel: mlx5_core 1ed0:00:00.1: mlx5_crdump_collect:50:(pid 254382): crdump: failed to lock vsc gw err -5

Same as above I'd remove everyting before the "mlx5_core: …" as that
adds no relevant information.

> 
> because the config read of VSC_SEMAPHORE is rejected by the underlying
> hardware.
> 
> Two prior attempts to address this issue have been discussed and
> ultimately rejected [see link], with the primary argument that s390's
> implementation of PCI error recovery is imposing restrictions that
> neither powerpc's EEH nor PCI AER handling need. Tests show that PCI
> error recovery on s390 is running to completion even without blocking
> access to PCI config space.
> 
> Link: https://lore.kernel.org/all/20251007144826.2825134-1-gbayer@linux.ibm.com/
> 
> Fixes: 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery")

Besides a Fixes tag let's add "Cc: stable@vger.kernel.org" and you can
put that instead of the empty line between Link and Fixes since usually
these are an uninterrupted block of tags.

> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
> ---
> All,
> 
> sorry for the immediate v2, but I had to rebase this to a current
> upstream commit since v1 didn't apply cleanly, as Niklas pointed out in
> private. The following assessment from v1 is still valid, though:
> 
> Hi Niklas, Shay, Jason,
> 
> by now I believe fixing this in s390/pci is the right way to go, since
> the other PCI error recovery implementations apparently don't require
> this strict blocking of accesses to the PCI config space.
>     
> Hi Alexander, Vasily, Heiko,
>     
> while I sent this to netdev since prior versions were discussed there,
> I assume this patch will go through the s390 tree, right?
>     
> Thanks,
> Gerd
> 
>     
> ---
>  arch/s390/pci/pci_event.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index b95376041501f479eee20705d45fb8c68553da71..27db1e72c623f8a289cae457e87f0a9896ed241d 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -188,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  	 * is unbound or probed and that userspace can't access its
>  	 * configuration space while we perform recovery.
>  	 */
> -	pci_dev_lock(pdev);
> +	device_lock(&pdev->dev);
>  	if (pdev->error_state == pci_channel_io_perm_failure) {
>  		ers_res = PCI_ERS_RESULT_DISCONNECT;
>  		goto out_unlock;
> @@ -257,7 +257,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
>  		driver->err_handler->resume(pdev);
>  	pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
>  out_unlock:
> -	pci_dev_unlock(pdev);
> +	device_unlock(&pdev->dev);
>  	zpci_report_status(zdev, "recovery", status_str);
>  
>  	return ers_res;

Code-wise this looks good to me and I've confirmed that EEH and AER
indeed only use the device_lock() and I don't see a reason why that
shouldn't be enough for us too if it is enough for them. I think I just
picked pci_dev_lock() because it seemed fitting but it probably was too
big a hammer.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>

  reply	other threads:[~2025-10-15 14:10 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-15 11:05 [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump Gerd Bayer
2025-10-15 14:10 ` Niklas Schnelle [this message]
2025-10-16  8:40   ` Gerd Bayer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=358e3a7a1e735d5b1e761cf159db8ae735a9578d.camel@linux.ibm.com \
    --to=schnelle@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=gbayer@linux.ibm.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=jgg@ziepe.ca \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mjrosato@linux.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=pmorel@linux.ibm.com \
    --cc=saeedm@nvidia.com \
    --cc=shayd@nvidia.com \
    --cc=svens@linux.ibm.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).