linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
To: Lukas Wunner <lukas@wunner.de>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>,
	Bjorn Helgaas <helgaas@kernel.org>,
	Riana Tauro <riana.tauro@intel.com>,
	Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>,
	"Sean C. Dardis" <sean.c.dardis@intel.com>,
	Terry Bowman <terry.bowman@amd.com>,
	Linas Vepstas <linasvepstas@gmail.com>,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	Oliver OHalloran <oohall@gmail.com>,
	Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>,
	linuxppc-dev@lists.ozlabs.org, linux-pci@vger.kernel.org
Subject: Re: [PATCH 1/5] PCI/AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors
Date: Sun, 17 Aug 2025 09:10:49 -0700	[thread overview]
Message-ID: <cba69372-1ba1-4a9c-8aa1-5126314d6890@linux.intel.com> (raw)
In-Reply-To: <aKHWf3L0NCl_CET5@wunner.de>


On 8/17/25 6:17 AM, Lukas Wunner wrote:
> On Thu, Aug 14, 2025 at 12:29:25PM -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 8/14/25 2:36 AM, Lukas Wunner wrote:
>>> On Thu, Aug 14, 2025 at 09:56:09AM +0200, Niklas Schnelle wrote:
>>>> On Wed, 2025-08-13 at 07:11 +0200, Lukas Wunner wrote:
>>>>> @@ -233,6 +228,14 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>>    		pci_walk_bridge(bridge, report_mmio_enabled, &status);
>>>>>    	}
>>>>> +	if (status == PCI_ERS_RESULT_NEED_RESET ||
>>>>> +	    state == pci_channel_io_frozen) {
>>>>> +		if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
>>>>> +			pci_warn(bridge, "subordinate device reset failed\n");
>>>>> +			goto failed;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>>    	if (status == PCI_ERS_RESULT_NEED_RESET) {
>>>>>    		/*
>>>>>    		 * TODO: Should call platform-specific
>>>> I wonder if it might make sense to merge the reset into the above
>>>> existing if.
>>> There are drivers such as drivers/bus/mhi/host/pci_generic.c which
>>> return PCI_ERS_RESULT_RECOVERED from ->error_detected().  So they
>>> fall through directly to the ->resume() stage.  They're doing this
>>> even in the pci_channel_io_frozen case (i.e. for Fatal Errors).
>>>
>>> But for DPC we must call reset_subordinates() to bring the link back up.
>>> And for Fatal Errors, Documentation/PCI/pcieaer-howto.rst suggests that
>>> we must likewise call it because the link may be unreliable.
>> For fatal errors, since we already ignore the value returned by
>> ->error_detected() (by calling reset_subordinates() unconditionally), why
>> not update status accordingly in report_frozen_detected() and notify the
>> driver about the reset?
>>
>> That way, the reset logic could be unified under a single if
>> (status == PCI_ERS_RESULT_NEED_RESET) condition.
>>
>> Checking the drivers/bus/mhi/host/pci_generic.c implementation, it looks
>> like calling slot_reset callback looks harmless.
> Unfortunately it's not harmless:
>
> mhi_pci_slot_reset() calls pci_enable_device().  But a corresponding
> call to pci_disable_device() is only performed before in
> mhi_pci_error_detected() if that function returns
> PCI_ERS_RESULT_NEED_RESET.
>
> So there would be an enable_cnt imbalance if I'd change the logic to
> overwrite the driver's vote with PCI_ERS_RESULT_NEED_RESET in the
> pci_channel_io_frozen case and call its ->slot_reset() callback.
>
> The approach taken by this patch is to minimize risk, avoid any changes
> to drivers, make do with minimal changes to pcie_do_recovery() and
> limit the behavioral change.
>
> I think overriding status = PCI_ERS_RESULT_NEED_RESET and calling drivers'
> ->slot_reset() would have to be done in a separate patch on top and would
> require going through all drivers again to see which ones need to be
> amended.
>
> Also, note that report_frozen_detected() is too early to set
> "status = PCI_ERS_RESULT_NEED_RESET".  That needs to happen after the
> ->mmio_enabled() step, so that drivers get a chance to examine the
> device even in the pci_channel_io_frozen case before a reset is
> performed.  (The ->mmio_enabled() step is only performed if "status" is
> PCI_ERS_RESULT_CAN_RECOVER.)
>
> So then the code would look like this:
>
> 	if (state == pci_channel_io_frozen)
> 		status = PCI_ERS_RESULT_NEED_RESET;
>
> 	if (status == PCI_ERS_RESULT_NEED_RESET) {
> 		if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
> 			pci_warn(bridge, "subordinate device reset failed\n");
> 			goto failed;
> 		}
>
> 		status = PCI_ERS_RESULT_RECOVERED;
> 		pci_dbg(bridge, "broadcast slot_reset message\n");
> 		pci_walk_bridge(bridge, report_slot_reset, &status);
> 	}
>
> ... which isn't very different from the present patch:
>
>          if (status == PCI_ERS_RESULT_NEED_RESET ||
>              state == pci_channel_io_frozen) {
>                  if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
>                          pci_warn(bridge, "subordinate device reset failed\n");
>                          goto failed;
>                  }
>          }
>
>          if (status == PCI_ERS_RESULT_NEED_RESET) {
>                  status = PCI_ERS_RESULT_RECOVERED;
>                  pci_dbg(bridge, "broadcast slot_reset message\n");
>                  pci_walk_bridge(bridge, report_slot_reset, &status);
>          }
>
> ... except that this patch avoids touching any drivers.


Makes sense. Thanks for the clarification.


>
> Thanks,
>
> Lukas
>
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


  reply	other threads:[~2025-08-17 16:11 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-13  5:11 [PATCH 0/5] PCI: Reduce AER / EEH deviations Lukas Wunner
2025-08-13  5:11 ` [PATCH 1/5] PCI/AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors Lukas Wunner
2025-08-13 23:01   ` Sathyanarayanan Kuppuswamy
2025-08-17 13:45     ` Lukas Wunner
2025-08-14  7:56   ` Niklas Schnelle
2025-08-14  9:36     ` Lukas Wunner
2025-08-14 19:29       ` Sathyanarayanan Kuppuswamy
2025-08-17 13:17         ` Lukas Wunner
2025-08-17 16:10           ` Sathyanarayanan Kuppuswamy [this message]
2025-08-14 20:31       ` Niklas Schnelle
2025-08-18 23:17     ` Linas Vepstas
2025-08-17 16:11   ` Sathyanarayanan Kuppuswamy
2025-08-13  5:11 ` [PATCH 2/5] PCI/ERR: Fix uevent on failure to recover Lukas Wunner
2025-08-13 23:01   ` Sathyanarayanan Kuppuswamy
2025-08-14  7:08   ` Niklas Schnelle
2025-08-13  5:11 ` [PATCH 3/5] PCI/ERR: Notify drivers " Lukas Wunner
2025-08-13 23:05   ` Sathyanarayanan Kuppuswamy
2025-08-13  5:11 ` [PATCH 4/5] PCI/ERR: Update device error_state already after reset Lukas Wunner
2025-08-13 23:43   ` Sathyanarayanan Kuppuswamy
2025-08-13  5:11 ` [PATCH 5/5] PCI/ERR: Remove remnants of .link_reset() callback Lukas Wunner
2025-08-14  0:40   ` Sathyanarayanan Kuppuswamy
2025-08-13 18:21 ` [PATCH 0/5] PCI: Reduce AER / EEH deviations Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cba69372-1ba1-4a9c-8aa1-5126314d6890@linux.intel.com \
    --to=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=helgaas@kernel.org \
    --cc=linasvepstas@gmail.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=manivannan.sadhasivam@oss.qualcomm.com \
    --cc=oohall@gmail.com \
    --cc=riana.tauro@intel.com \
    --cc=schnelle@linux.ibm.com \
    --cc=sean.c.dardis@intel.com \
    --cc=terry.bowman@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).