From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Cc: linux-pci@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
Bjorn Helgaas <bhelgaas@google.com>,
Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
Lukas Wunner <lukas@wunner.de>,
Yazen Ghannam <yazen.ghannam@amd.com>
Subject: Re: [PATCH v9] PCI/DPC: Ignore Surprise Down error on hot removal
Date: Thu, 8 Feb 2024 14:01:49 +0200 (EET) [thread overview]
Message-ID: <3c5da397-6d72-26cd-7204-4388ff3da1dd@linux.intel.com> (raw)
In-Reply-To: <20240207181854.121335-1-Smita.KoralahalliChannabasappa@amd.com>
[-- Attachment #1: Type: text/plain, Size: 3924 bytes --]
On Wed, 7 Feb 2024, Smita Koralahalli wrote:
> According to PCIe r6.0 sec 6.7.6 [1], async removal with DPC may result in
> surprise down error. This error is expected and is just a side-effect of
> async remove.
>
> Ignore surprise down error generated as a side-effect of async remove.
> Typically, this error is benign as the pciehp handler invoked by PDC
> or/and DLLSC alongside DPC, de-enumerates and brings down the device
> appropriately. But the error messages might confuse users. Get rid of
> these irritating log messages with a 1s delay while pciehp waits for
> dpc recovery.
>
> The implementation is as follows: On an async remove a DPC is triggered
> along with a Presence Detect State change and/or DLL State Change.
> Determine it's an async remove by checking for DPC Trigger Status in DPC
> Status Register and Surprise Down Error Status in AER Uncorrected Error
> Status to be non-zero. If true, treat the DPC event as a side-effect of
> async remove, clear the error status registers and continue with hot-plug
> tear down routines. If not, follow the existing routine to handle AER and
> DPC errors.
>
> Please note that, masking Surprise Down Errors was explored as an
> alternative approach, but left due to the odd behavior that masking only
> avoids the interrupt, but still records an error per PCIe r6.0.1 Section
> 6.2.3.2.2. That stale error is going to be reported the next time some
> error other than Surprise Down is handled.
>
> Dmesg before:
>
> pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
> pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
> pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:01.4: device [1022:14ab] error status/mask=00000020/04004000
> pcieport 0000:00:01.4: [ 5] SDES (First)
> nvme nvme2: frozen state error detected, reset controller
> pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
> pcieport 0000:00:01.4: AER: subordinate device reset failed
> pcieport 0000:00:01.4: AER: device recovery failed
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme2n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 49
>
> Dmesg after:
>
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme1n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 37
>
> [1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
> https://members.pcisig.com/wg/PCI-SIG/document/16609
>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Lukas Wunner <lukas@wunner.de>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
> +{
> + if (pdev->dpc_rp_extensions)
> + pci_write_config_dword(pdev, pdev->dpc_cap +
> + PCI_EXP_DPC_RP_PIO_STATUS, ~0);
> +
> + /*
> + * In practice, Surprise Down errors have been observed to also set
> + * error bits in the Status Register as well as the Fatal Error
> + * Detected bit in the Device Status Register.
> + */
> + pci_write_config_word(pdev, PCI_STATUS, 0xffff);
Nit: one of these is using ~0 and the other 0xffff which is a bit
inconsistent.
> +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
> +{
> + u16 status;
> +
> + if (!pdev->is_hotplug_bridge)
> + return false;
> +
> + if (pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS,
> + &status))
> + return false;
Since you need a line split, I'd have used:
ret = pci_read_config_word(...
...);
if (ret != PCIBIOS_SUCCESSFUL)
return false;
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
--
i.
next prev parent reply other threads:[~2024-02-08 12:01 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-07 18:18 [PATCH v9] PCI/DPC: Ignore Surprise Down error on hot removal Smita Koralahalli
2024-02-08 12:01 ` Ilpo Järvinen [this message]
2024-02-28 16:25 ` Bjorn Helgaas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3c5da397-6d72-26cd-7204-4388ff3da1dd@linux.intel.com \
--to=ilpo.jarvinen@linux.intel.com \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=bhelgaas@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=mahesh@linux.ibm.com \
--cc=yazen.ghannam@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.