All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <bhelgaas@google.com>
To: "Pandarathil, Vijaymohan R" <vijaymohan.pandarathil@hp.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linasvepstas@gmail.com" <linasvepstas@gmail.com>,
	Myron Stowe <myron.stowe@gmail.com>,
	Lance Ortiz <lance.ortiz@hp.com>,
	Huang Ying <ying.huang@intel.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	Andrew Patterson <andrew.patterson@hp.com>,
	Zhang Yanmin <yanmin.zhang@intel.com>
Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers
Date: Wed, 14 Nov 2012 17:51:00 -0700	[thread overview]
Message-ID: <20121115005100.GA6613@google.com> (raw)
In-Reply-To: <F9E001219150CB45BEDC82A650F360C90146794E@G9W0717.americas.hpqcorp.net>

[+cc Lance, Huang, Hidetoshi, Andrew, Zhang]

On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote:
> When an error is detected on a PCIe device which does not have an
> AER-aware driver, prevent AER infrastructure from reporting
> successful error recovery.
> 
> This is because the report_error_detected() function that gets
> called in the first phase of recovery process allows forward
> progress even when the driver for the device does not have AER
> capabilities. It seems that all callbacks (in pci_error_handlers
> structure) registered by drivers that gets called during error
> recovery are not mandatory. So the intention of the infrastructure
> design seems to be to allow forward progress even when a specific
> callback has not been registered by a driver. However, if error
> handler structure itself has not been registered, it doesn't make
> sense to allow forward progress.
> 
> As a result of the current design, in the case of a single device
> having an AER-unaware driver or in the case of any function in a
> multi-function card having an AER-unaware driver, a successful
> recovery is reported.
> 
> Typical scenario this happens is when a PCI device is detached
> from a KVM host and the pci-stub driver on the host claims the
> device. The pci-stub driver does not have error handling capabilities
> but the AER infrastructure still reports that the device recovered
> successfully.
> 
> The changes proposed here leaves the device in an unrecovered state
> if the driver for the device or for any function in a multi-function
> card does not have error handler structure registered. This reflects
> the true state of the device and prevents any partial recovery (or no
> recovery at all) reported as successful.
> 
> Please also see comments from Linas Vepstas at the following link
> http://www.spinics.net/lists/linux-pci/msg18572.html
> 
> Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com>
> Reviewed-by: Myron Stowe <mstowe <at> redhat.com>
> Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> hp.com>
> 
> ---
> 
> drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 06bad96..030b229 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev *dev, void *data)
>  
>  	dev->error_state = result_data->state;
>  
> +	if ((!dev->driver || !dev->driver->err_handler) &&
> +		!(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) {
> +		dev_info(&dev->dev, "AER: Error detected but no driver has claimed this device or the driver is AER-unaware\n");
> +		result_data->result = PCI_ERS_RESULT_NONE;
> +		return 1;

This doesn't seem right because returning 1 here causes pci_walk_bus()
to terminate, which means we won't set dev->error_state or notify
drivers for any devices we haven't visited yet.

> +	}
>  	if (!dev->driver ||
>  		!dev->driver->err_handler ||
>  		!dev->driver->err_handler->error_detected) {

If none of the drivers in the affected hierarchy supports error handling,
I think the call tree looks like this:

    do_recovery                                 # uncorrectable only
        broadcast_error_message(dev, ..., report_error_detected)
            result_data.result = CAN_RECOVER            
            pci_walk_bus(..., report_error_detected)
                report_error_detected           # (each dev in subtree)
                    return 0                    # no change to result
            return result_data.result
        broadcast_error_message(dev, ..., report_mmio_enabled)
            result_data.result = PCI_ERS_RESULT_RECOVERED       
            pci_walk_bus(..., report_mmio_enabled)
                report_mmio_enabled             # (each dev in subtree)
                    return 0                    # no change to result
        dev_info("recovery successful")

Specifically, there are no error_handler functions, so we never change
result_data.result, and the default is that we treat the error as
"recovered successfully."  That seems broken.  An uncorrectable error
is by definition recoverable only by device-specific software, i.e.,
the driver.  We didn't call any drivers, so we can't have recovered
anything.

What do you think of the following alternative?  I don't know why you
checked for bridge devices in your patch, so I don't know whether
that's important here or not.

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 06bad96..a109c68 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, void *data)
 				   dev->driver ?
 				   "no AER-aware driver" : "no driver");
 		}
-		return 0;
+		vote = PCI_ERS_RESULT_DISCONNECT;
+	} else {
+		err_handler = dev->driver->err_handler;
+		vote = err_handler->error_detected(dev, result_data->state);
 	}
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->error_detected(dev, result_data->state);
 	result_data->result = merge_result(result_data->result, vote);
 	return 0;
 }

  reply	other threads:[~2012-11-15  0:51 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-10  7:41 [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers Pandarathil, Vijaymohan R
2012-11-15  0:51 ` Bjorn Helgaas [this message]
2012-11-15  1:35   ` Linas Vepstas
2012-11-15  7:09   ` Pandarathil, Vijaymohan R
2012-11-16  1:08     ` Bjorn Helgaas
2012-11-16  4:26       ` Pandarathil, Vijaymohan R

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121115005100.GA6613@google.com \
    --to=bhelgaas@google.com \
    --cc=andrew.patterson@hp.com \
    --cc=lance.ortiz@hp.com \
    --cc=linasvepstas@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=myron.stowe@gmail.com \
    --cc=seto.hidetoshi@jp.fujitsu.com \
    --cc=vijaymohan.pandarathil@hp.com \
    --cc=yanmin.zhang@intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.