public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Lukas Wunner <lukas@wunner.de>
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, bhelgaas@google.com,
	kbusch@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	mahesh@linux.ibm.com, oohall@gmail.com,
	Jonathan.Cameron@huawei.com, terry.bowman@amd.com,
	tianruidong@linux.alibaba.com
Subject: Re: [PATCH v6 3/5] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
Date: Mon, 20 Oct 2025 22:17:10 +0800	[thread overview]
Message-ID: <43390d36-147f-482c-b31a-d02c2624061f@linux.alibaba.com> (raw)
In-Reply-To: <aPY--DJnNam9ejpT@wunner.de>

Hi, Lukas,

在 2025/10/20 21:54, Lukas Wunner 写道:
> On Mon, Oct 20, 2025 at 08:58:55PM +0800, Shuai Xue wrote:
>> ??? 2025/10/20 18:10, Lukas Wunner ??????:
>>> On Wed, Oct 15, 2025 at 10:41:57AM +0800, Shuai Xue wrote:
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -253,6 +254,16 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>    			pci_warn(bridge, "subordinate device reset failed\n");
>>>>    			goto failed;
>>>>    		}
>>>> +
>>>> +		/* Link recovered, report fatal errors of RCiEP or EP */
>>>> +		if (state == pci_channel_io_frozen &&
>>>> +		    (type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_RC_END)) {
>>>> +			aer_add_error_device(&info, dev);
>>>> +			info.severity = AER_FATAL;
>>>> +			if (aer_get_device_error_info(&info, 0, true))
>>>> +				aer_print_error(&info, 0);
>>>> +			pci_dev_put(dev);
>>>> +		}
>>>
>>> Where is the the pci_dev_get() to balance the pci_dev_put() here?
>>
>> The corresponding pci_dev_get() is called in add_error_device(). Please
>> refer to commit 60271ab044a5 ("PCI/AER: Take reference on error
>> devices") which introduced this reference counting mechanism.
> 
> That is non-obvious and needs a code comment.

Agreed. I'll add a comment to clarify the reference counting relationship.

> 
>>> It feels awkward to leak AER-specific details into pcie_do_recovery().
>>> That function is supposed to implement the flow described in
>>> Documentation/PCI/pci-error-recovery.rst in a platform-agnostic way
>>> so that powerpc (EEH) and s390 could conceivably take advantage of it.
>>>
>>> Can you find a way to avoid this, e.g. report errors after
>>> pcie_do_recovery() has concluded?
>>
>> I understand your concern about keeping pcie_do_recovery()
>> platform-agnostic.
> 
> The code you're adding above, with the exception of the check for
> pci_channel_io_frozen, should live in a helper in aer.c.
> Then you also don't need to rename add_error_device().

Good point.

That's a much cleaner approach. I'll create a helper function in aer.c,
something like:

void aer_report_frozen_error(struct pci_dev *dev)
{
     struct aer_err_info info;

     if (dev->pci_type != PCI_EXP_TYPE_ENDPOINT &&
         dev->pci_type != PCI_EXP_TYPE_RC_END)
         return;

     aer_info_init(&info);
     aer_add_error_device(&info, dev);
     info.severity = AER_FATAL;
     if (aer_get_device_error_info(&info, 0, true))
         aer_print_error(&info, 0);

     /* pci_dev_put() pairs with pci_dev_get() in aer_add_error_device() */
     pci_dev_put(dev);
}

>> I explored the possibility of reporting errors after
>> recovery concludes, but unfortunately, this approach isn't feasible due
>> to the recovery sequence. The issue is that most drivers'
>> pci_error_handlers implement .slot_reset() which internally calls
>> pci_restore_state() to restore the device's configuration space and
>> state. This function also clears the device's AER status registers:
>>
>>    .slot_reset()
>>      => pci_restore_state()
>>        => pci_aer_clear_status()
> 
> This was added in 2015 by b07461a8e45b.  The commit claims that
> the errors are stale and can be ignored.  It turns out they cannot.
> 
> So maybe pci_restore_state() should print information about the
> errors before clearing them?

While that could work, we would lose the error severity information at
that point, which could lead to duplicate or less informative error
messages compared to what the AER driver provides. The helper function
approach preserves all error details for proper reporting.

> 
> Actually pci_restore_state() is only supposed to restore state,
> as the name implies, and not clear errors.  It seems questionable
> that the commit amended it to do that.
> 
>>> I'm also worried that errors are reported *during* recovery.
>>> I imagine this looks confusing to a user.  The logged messages
>>> should make it clear that these are errors that occurred *earlier*
>>> and are reported belatedly.
>>
>> You raise an excellent point about potential user confusion. The current
>> aer_print_error() interface doesn't indicate that these are historical
>> errors being reported belatedly. Would it be acceptable to add a
>> clarifying message before calling aer_print_error()? For example:
>>
>>    pci_err(dev, "Reporting error that occurred before recovery:\n");
> 
> Yes, something like that.  "Errors reported prior to reset"?  Dunno.

I'll use "Errors reported prior to reset" - it's clear and concise.

> 
> Thanks,
> 
> Lukas

Thanks.
Shuai

  reply	other threads:[~2025-10-20 14:17 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-15  2:41 [PATCH v6 0/5] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd Shuai Xue
2025-10-15  2:41 ` [PATCH v6 1/5] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
2025-10-15  2:41 ` [PATCH v6 2/5] PCI/DPC: Run recovery on device that detected the error Shuai Xue
2025-10-15  2:41 ` [PATCH v6 3/5] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd Shuai Xue
2025-10-20 10:10   ` Lukas Wunner
2025-10-20 12:58     ` Shuai Xue
2025-10-20 13:54       ` Lukas Wunner
2025-10-20 14:17         ` Shuai Xue [this message]
2025-10-20 14:24           ` Lukas Wunner
2025-10-20 15:20             ` Shuai Xue
2025-10-23 10:48               ` Lukas Wunner
2025-10-24  6:43                 ` Shuai Xue
2025-12-16  8:07                   ` Shuai Xue
2025-10-20 18:38   ` Kuppuswamy Sathyanarayanan
2025-10-21  1:51     ` Shuai Xue
2025-10-15  2:41 ` [PATCH v6 4/5] PCI/ERR: Use pcie_aer_is_native() to check for native AER control Shuai Xue
2025-10-20 10:17   ` Lukas Wunner
2025-10-20 13:09     ` Shuai Xue
2025-10-20 13:58       ` Lukas Wunner
2025-10-20 14:45         ` Shuai Xue
2025-10-23 10:29           ` Lukas Wunner
2025-10-24  3:09             ` Shuai Xue
2025-10-24  3:14               ` Lukas Wunner
2025-10-24  3:38                 ` Shuai Xue
2025-10-24  4:03                   ` Lukas Wunner
2025-10-24  5:37                     ` Shuai Xue
2025-10-20 18:43   ` Kuppuswamy Sathyanarayanan
2025-10-15  2:41 ` [PATCH v6 5/5] PCI/AER: Clear both AER fatal and non-fatal status Shuai Xue
2025-10-20 18:44   ` Kuppuswamy Sathyanarayanan
2025-10-21  1:33     ` Shuai Xue

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=43390d36-147f-482c-b31a-d02c2624061f@linux.alibaba.com \
    --to=xueshuai@linux.alibaba.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=bhelgaas@google.com \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=terry.bowman@amd.com \
    --cc=tianruidong@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox