public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <bhelgaas@google.com>
To: "Pandarathil, Vijaymohan R" <vijaymohan.pandarathil@hp.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linasvepstas@gmail.com" <linasvepstas@gmail.com>,
	Myron Stowe <myron.stowe@gmail.com>,
	Lance Ortiz <lance.ortiz@hp.com>,
	Huang Ying <ying.huang@intel.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	Andrew Patterson <andrew.patterson@hp.com>,
	Zhang Yanmin <yanmin.zhang@intel.com>
Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers
Date: Wed, 14 Nov 2012 17:51:00 -0700	[thread overview]
Message-ID: <20121115005100.GA6613@google.com> (raw)
In-Reply-To: <F9E001219150CB45BEDC82A650F360C90146794E@G9W0717.americas.hpqcorp.net>

[+cc Lance, Huang, Hidetoshi, Andrew, Zhang]

On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote:
> When an error is detected on a PCIe device which does not have an
> AER-aware driver, prevent AER infrastructure from reporting
> successful error recovery.
> 
> This is because the report_error_detected() function that gets
> called in the first phase of recovery process allows forward
> progress even when the driver for the device does not have AER
> capabilities. It seems that all callbacks (in pci_error_handlers
> structure) registered by drivers that gets called during error
> recovery are not mandatory. So the intention of the infrastructure
> design seems to be to allow forward progress even when a specific
> callback has not been registered by a driver. However, if error
> handler structure itself has not been registered, it doesn't make
> sense to allow forward progress.
> 
> As a result of the current design, in the case of a single device
> having an AER-unaware driver or in the case of any function in a
> multi-function card having an AER-unaware driver, a successful
> recovery is reported.
> 
> Typical scenario this happens is when a PCI device is detached
> from a KVM host and the pci-stub driver on the host claims the
> device. The pci-stub driver does not have error handling capabilities
> but the AER infrastructure still reports that the device recovered
> successfully.
> 
> The changes proposed here leaves the device in an unrecovered state
> if the driver for the device or for any function in a multi-function
> card does not have error handler structure registered. This reflects
> the true state of the device and prevents any partial recovery (or no
> recovery at all) reported as successful.
> 
> Please also see comments from Linas Vepstas at the following link
> http://www.spinics.net/lists/linux-pci/msg18572.html
> 
> Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com>
> Reviewed-by: Myron Stowe <mstowe <at> redhat.com>
> Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> hp.com>
> 
> ---
> 
> drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 06bad96..030b229 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev *dev, void *data)
>  
>  	dev->error_state = result_data->state;
>  
> +	if ((!dev->driver || !dev->driver->err_handler) &&
> +		!(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) {
> +		dev_info(&dev->dev, "AER: Error detected but no driver has claimed this device or the driver is AER-unaware\n");
> +		result_data->result = PCI_ERS_RESULT_NONE;
> +		return 1;

This doesn't seem right because returning 1 here causes pci_walk_bus()
to terminate, which means we won't set dev->error_state or notify
drivers for any devices we haven't visited yet.

> +	}
>  	if (!dev->driver ||
>  		!dev->driver->err_handler ||
>  		!dev->driver->err_handler->error_detected) {

If none of the drivers in the affected hierarchy supports error handling,
I think the call tree looks like this:

    do_recovery                                 # uncorrectable only
        broadcast_error_message(dev, ..., report_error_detected)
            result_data.result = CAN_RECOVER            
            pci_walk_bus(..., report_error_detected)
                report_error_detected           # (each dev in subtree)
                    return 0                    # no change to result
            return result_data.result
        broadcast_error_message(dev, ..., report_mmio_enabled)
            result_data.result = PCI_ERS_RESULT_RECOVERED       
            pci_walk_bus(..., report_mmio_enabled)
                report_mmio_enabled             # (each dev in subtree)
                    return 0                    # no change to result
        dev_info("recovery successful")

Specifically, there are no error_handler functions, so we never change
result_data.result, and the default is that we treat the error as
"recovered successfully."  That seems broken.  An uncorrectable error
is by definition recoverable only by device-specific software, i.e.,
the driver.  We didn't call any drivers, so we can't have recovered
anything.

What do you think of the following alternative?  I don't know why you
checked for bridge devices in your patch, so I don't know whether
that's important here or not.

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 06bad96..a109c68 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, void *data)
 				   dev->driver ?
 				   "no AER-aware driver" : "no driver");
 		}
-		return 0;
+		vote = PCI_ERS_RESULT_DISCONNECT;
+	} else {
+		err_handler = dev->driver->err_handler;
+		vote = err_handler->error_detected(dev, result_data->state);
 	}
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->error_detected(dev, result_data->state);
 	result_data->result = merge_result(result_data->result, vote);
 	return 0;
 }

  reply	other threads:[~2012-11-15  0:51 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-10  7:41 [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers Pandarathil, Vijaymohan R
2012-11-15  0:51 ` Bjorn Helgaas [this message]
2012-11-15  1:35   ` Linas Vepstas
2012-11-15  7:09   ` Pandarathil, Vijaymohan R
2012-11-16  1:08     ` Bjorn Helgaas
2012-11-16  4:26       ` Pandarathil, Vijaymohan R

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121115005100.GA6613@google.com \
    --to=bhelgaas@google.com \
    --cc=andrew.patterson@hp.com \
    --cc=lance.ortiz@hp.com \
    --cc=linasvepstas@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=myron.stowe@gmail.com \
    --cc=seto.hidetoshi@jp.fujitsu.com \
    --cc=vijaymohan.pandarathil@hp.com \
    --cc=yanmin.zhang@intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox