From: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
To: Andy Xu <andy.xu@hj-micro.com>, bhelgaas@google.com, lukas@wunner.de
Cc: mahesh@linux.ibm.com, oohall@gmail.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	jemma.zhang@hj-micro.com, peter.du@hj-micro.com
Subject: Re: [PATCH] PCI/DPC: Extend DPC recovery timeout
Date: Mon, 7 Jul 2025 10:04:28 -0700
Message-ID: <24dfe8e2-e4b3-40e9-b9ac-026e057abd30@linux.intel.com>
In-Reply-To: <20250707103014.1279262-1-andy.xu@hj-micro.com>


On 7/7/25 3:30 AM, Andy Xu wrote:
> From: Hongbo Yao <andy.xu@hj-micro.com>
>
> Extend the DPC recovery timeout from 4 seconds to 7 seconds to
> support Mellanox ConnectX series network adapters.
>
> My environment:
>    - Platform: arm64 N2 based server
>    - Endpoint1: Mellanox Technologies MT27800 Family [ConnectX-5]
>    - Endpoint2: Mellanox Technologies MT2910 Family [ConnectX-7]
>
> With the original 4s timeout, hotplug would still be triggered:
>
> [ 81.012463] pcieport 0004:00:00.0: DPC: containment event, status:0x1f01 source:0x0000
> [ 81.014536] pcieport 0004:00:00.0: DPC: unmasked uncorrectable error detected
> [ 81.029598] pcieport 0004:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> [ 81.040830] pcieport 0004:00:00.0: device [0823:0110] error status/mask=00008000/04d40000
> [ 81.049870] pcieport 0004:00:00.0: [ 0] ERCR (First)
> [ 81.053520] pcieport 0004:00:00.0: AER: TLP Header: 60008010 010000ff 00001000 9c4c0000
> [ 81.065793] mlx5_core 0004:01:00.0: mlx5_pci_err_detected Device state = 1 health sensors: 1 pci_status: 1. Enter, pci channel state = 2
> [ 81.076183] mlx5_core 0004:01:00.0: mlx5_error_sw_reset:231:(pid 1618): start
> [ 81.083307] mlx5_core 0004:01:00.0: mlx5_error_sw_reset:252:(pid 1618): PCI channel offline, stop waiting for NIC IFC
> [ 81.077428] mlx5_core 0004:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), neovfs(0), active vports(0)
> [ 81.486693] mlx5_core 0004:01:00.0: mlx5_wait_for_pages:786:(pid 1618): Skipping wait for vf pages stage
> [ 81.496965] mlx5_core 0004:01:00.0: mlx5_wait_for_pages:786:(pid 1618): Skipping wait for vf pages stage
> [ 82.395040] mlx5_core 0004:01:00.1: print_health:819:(pid 0): Fatal error detected
> [ 82.395493] mlx5_core 0004:01:00.1: print_health_info:423:(pid 0): PCI slot 1 is unavailable
> [ 83.431094] mlx5_core 0004:01:00.0: mlx5_pci_err_detected Device state = 2 pci_status: 0. Exit, result = 3, need reset
> [ 83.442100] mlx5_core 0004:01:00.1: mlx5_pci_err_detected Device state = 2 health sensors: 1 pci_status: 1. Enter, pci channel state = 2
> [ 83.441801] mlx5_core 0004:01:00.0: mlx5_crdump_collect:50:(pid 2239): crdump: failed to lock gw status -13
> [ 83.454050] mlx5_core 0004:01:00.1: mlx5_error_sw_reset:231:(pid 1618): start
> [ 83.454050] mlx5_core 0004:01:00.1: mlx5_error_sw_reset:252:(pid 1618): PCI channel offline, stop waiting for NIC IFC
> [ 83.849429] mlx5_core 0004:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), neovfs(0), active vports(0)
> [ 83.858892] mlx5_core 0004:01:00.1: mlx5_wait_for_pages:786:(pid 1618): Skipping wait for vf pages stage
> [ 83.869464] mlx5_core 0004:01:00.1: mlx5_wait_for_pages:786:(pid 1618): Skipping wait for vf pages stage
> [ 85.201433] pcieport 0004:00:00.0: pciehp: Slot(41): Link Down
> [ 85.815016] mlx5_core 0004:01:00.1: mlx5_health_try_recover:335:(pid 2239): handling bad device here
> [ 85.824164] mlx5_core 0004:01:00.1: mlx5_error_sw_reset:231:(pid 2239): start
> [ 85.831283] mlx5_core 0004:01:00.1: mlx5_error_sw_reset:252:(pid 2239): PCI channel offline, stop waiting for NIC IFC
> [ 85.841899] mlx5_core 0004:01:00.1: mlx5_unload_one_dev_locked:1612:(pid 2239): mlx5_unload_one_dev_locked: interface is down, NOP
> [ 85.853799] mlx5_core 0004:01:00.1: mlx5_health_wait_pci_up:325:(pid 2239): PCI channel offline, stop waiting for PCI
> [ 85.863494] mlx5_core 0004:01:00.1: mlx5_health_try_recover:338:(pid 2239): health recovery flow aborted, PCI reads still not working
> [ 85.873231] mlx5_core 0004:01:00.1: mlx5_pci_err_detected Device state = 2 pci_status: 0. Exit, result = 3, need reset
> [ 85.879899] mlx5_core 0004:01:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), neovfs(0), active vports(0)
> [ 85.921428] mlx5_core 0004:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), neovfs(0), active vports(0)
> [ 85.930491] mlx5_core 0004:01:00.1: mlx5_wait_for_pages:786:(pid 1617): Skipping wait for vf pages stage
> [ 85.940849] mlx5_core 0004:01:00.1: mlx5_wait_for_pages:786:(pid 1617): Skipping wait for vf pages stage
> [ 85.949971] mlx5_core 0004:01:00.1: mlx5_uninit_one:1528:(pid 1617): mlx5_uninit_one: interface is down, NOP
> [ 85.959944] mlx5_core 0004:01:00.1: E-Switch: cleanup
> [ 86.035541] mlx5_core 0004:01:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), neovfs(0), active vports(0)
> [ 86.077568] mlx5_core 0004:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), neovfs(0), active vports(0)
> [ 86.071727] mlx5_core 0004:01:00.0: mlx5_wait_for_pages:786:(pid 1617): Skipping wait for vf pages stage
> [ 86.096577] mlx5_core 0004:01:00.0: mlx5_wait_for_pages:786:(pid 1617): Skipping wait for vf pages stage
> [ 86.106909] mlx5_core 0004:01:00.0: mlx5_uninit_one:1528:(pid 1617): mlx5_uninit_one: interface is down, NOP
> [ 86.115940] pcieport 0004:00:00.0: AER: subordinate device reset failed
> [ 86.122557] pcieport 0004:00:00.0: AER: device recovery failed
> [ 86.128571] mlx5_core 0004:01:00.0: E-Switch: cleanup
>
> I added some prints and found that:
>   - ConnectX-5 requires >5s for full recovery
>   - ConnectX-7 requires >6s for full recovery
>
> Setting the timeout to 7s covers both devices with a safety margin.


Instead of extending the recovery timeout, can you check why recovery takes
so long on your devices and whether it can be fixed on the device side?


> Signed-off-by: Hongbo Yao <andy.xu@hj-micro.com>
> ---
>   drivers/pci/pcie/dpc.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index fc18349614d7..35a37fd86dcd 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -118,10 +118,10 @@ bool pci_dpc_recovered(struct pci_dev *pdev)
>   	/*
>   	 * Need a timeout in case DPC never completes due to failure of
>   	 * dpc_wait_rp_inactive().  The spec doesn't mandate a time limit,
> -	 * but reports indicate that DPC completes within 4 seconds.
> +	 * but reports indicate that DPC completes within 7 seconds.
>   	 */
>   	wait_event_timeout(dpc_completed_waitqueue, dpc_completed(pdev),
> -			   msecs_to_jiffies(4000));
> +			   msecs_to_jiffies(7000));
>   
>   	return test_and_clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>   }
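
For reference, the timeout arithmetic in the quoted hunk can be sketched in
userspace with a simulated millisecond clock. This is only a rough model, not
kernel code: sim_dpc_completed() and sim_wait_event_timeout() are hypothetical
stand-ins for dpc_completed() and wait_event_timeout(), and the 5500 ms and
6500 ms recovery times used in the note below are assumed values standing in
for the reported ">5s" and ">6s".

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for dpc_completed(): the simulated device is
 * considered recovered once recovery_ms have elapsed since the event.
 */
bool sim_dpc_completed(unsigned int now_ms, unsigned int recovery_ms)
{
	return now_ms >= recovery_ms;
}

/*
 * Simulated-clock analogue of wait_event_timeout(): returns 0 if the
 * condition is still false when the timeout expires, otherwise the
 * remaining time in ms (at least 1).
 */
unsigned int sim_wait_event_timeout(unsigned int recovery_ms,
				    unsigned int timeout_ms)
{
	for (unsigned int now_ms = 0; now_ms <= timeout_ms; now_ms++) {
		if (sim_dpc_completed(now_ms, recovery_ms)) {
			unsigned int left = timeout_ms - now_ms;

			return left ? left : 1;
		}
	}
	return 0;
}
```

With these assumed numbers, sim_wait_event_timeout(5500, 4000) returns 0
(the waiter gives up before recovery finishes), while
sim_wait_event_timeout(5500, 7000) and sim_wait_event_timeout(6500, 7000)
return 1500 and 500, i.e. the 7s limit leaves roughly half a second of
margin for the slower device.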

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

