linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] pci, Add AER_panic sysfs file
@ 2012-05-17 17:04 Prarit Bhargava
  2012-05-17 17:29 ` Shyam_Iyer
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-17 17:04 UTC (permalink / raw)
  To: linux-pci; +Cc: Prarit Bhargava, Bjorn Helgaas

Consider the following case

		[ RP ]
		  |
		  |
	+---------+-----------+
	|	  |	      |
       [H1]      [H2]        [X1]

where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
PCIE AER driver error handling (ie, they have pci_error_handlers defined in
the driver), and X1 is a device with a driver that does not support PCIE
AER driver error handling.

If the Root Port takes an error what currently happens is that the
bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.

In some cases a user may not wish the system to continue because X1 is
an unhardened driver.  In these cases, the system should not do a bus reset,
but rather the system should panic to avoid any further possible data
corruption.

This patch implements an AER_panic sysfs entry for each root port which
a user can modify.  AER_panic = 1, means the system will panic on a
PCIE error which would have normally resulted in a secondary bus reset.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci-sysfs.c       |   42 ++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/pcie/aer/aerdrv.c |    3 ++
 include/linux/pci.h           |    1 +
 3 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index a55e248..8c6d525 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1135,6 +1135,35 @@ static ssize_t reset_store(struct device *dev,
 
 static struct device_attribute reset_attr = __ATTR(reset, 0200, NULL, reset_store);
 
+static ssize_t AER_panic_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%d\n", pdev->rp_AER_panic);
+}
+
+static ssize_t AER_panic_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if ((val > 1) || (val < 0))
+		return -EINVAL;
+
+	pdev->rp_AER_panic = val;
+
+	return count;
+}
+
+static struct device_attribute rp_AER_panic_attr =
+	      __ATTR(AER_panic, 0600, AER_panic_show, AER_panic_store);
+
 static int pci_create_capabilities_sysfs(struct pci_dev *dev)
 {
 	int retval;
@@ -1169,8 +1198,16 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev)
 			goto error;
 		dev->reset_fn = 1;
 	}
-	return 0;
 
+	/* PCIE Root Port panic-on-AER allows a user to configure each root
+	 * port to panic on an AER error instead of issuing a bus reset.
+	 */
+	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+		retval = device_create_file(&dev->dev, &rp_AER_panic_attr);
+		if (retval)
+			goto error;
+	}
+	return 0;
 error:
 	pcie_aspm_remove_sysfs_dev_files(dev);
 	if (dev->vpd && dev->vpd->attr) {
@@ -1279,6 +1316,9 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev)
 		device_remove_file(&dev->dev, &reset_attr);
 		dev->reset_fn = 0;
 	}
+
+	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT)
+		device_remove_file(&dev->dev, &rp_AER_panic_attr);
 }
 
 /**
diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 58ad791..dd6b352 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	u32 reg32;
 	int pos;
 
+	if (dev->rp_AER_panic)
+		panic("%s: AER detected on Root Port", pci_name(dev));
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
 
 	/* Disable Root's interrupt in response to error messages */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e444f5b..a4e6a5a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -324,6 +324,7 @@ struct pci_dev {
 	unsigned int    is_hotplug_bridge:1;
 	unsigned int    __aer_firmware_first_valid:1;
 	unsigned int	__aer_firmware_first:1;
+	unsigned int	rp_AER_panic:1; /* if 1, panic on AER bus reset */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* RE: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 17:04 [PATCH] pci, Add AER_panic sysfs file Prarit Bhargava
@ 2012-05-17 17:29 ` Shyam_Iyer
  2012-05-17 17:39   ` Prarit Bhargava
  2012-05-17 18:51 ` Don Dutile
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Shyam_Iyer @ 2012-05-17 17:29 UTC (permalink / raw)
  To: prarit, linux-pci; +Cc: bhelgaas



> -----Original Message-----
> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
> owner@vger.kernel.org] On Behalf Of Prarit Bhargava
> Sent: Thursday, May 17, 2012 1:05 PM
> To: linux-pci@vger.kernel.org
> Cc: Prarit Bhargava; Bjorn Helgaas
> Subject: [PATCH] pci, Add AER_panic sysfs file
> 
> Consider the following case
> 
> 		[ RP ]
> 		  |
> 		  |
> 	+---------+-----------+
> 	|	  |	      |
>        [H1]      [H2]        [X1]
> 
> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that
> support
> PCIE AER driver error handling (ie, they have pci_error_handlers
> defined in
> the driver), and X1 is a device with a driver that does not support
> PCIE
> AER driver error handling.
> 
> If the Root Port takes an error what currently happens is that the
> bus resets and H1 & H2 call their slot_reset functions.  X1 does
> nothing.
> 
> In some cases a user may not wish the system to continue because X1 is
> an unhardened driver.  In these cases, the system should not do a bus
> reset,
> but rather the system should panic to avoid any further possible data
> corruption.

Do we neeed to panic for both correctable and uncorrectable errors.. ?

I thought correctable errors could recover without a bus reset.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 17:29 ` Shyam_Iyer
@ 2012-05-17 17:39   ` Prarit Bhargava
  2012-05-17 17:51     ` Shyam_Iyer
       [not found]     ` <DBFB1B45AF80394ABD1C807E9F28D15707BC712175@BLRX7MCDC203.AMER.DELL.COM>
  0 siblings, 2 replies; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-17 17:39 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: linux-pci, bhelgaas



On 05/17/2012 01:29 PM, Shyam_Iyer@Dell.com wrote:
> 
> 
>> -----Original Message-----
>> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
>> owner@vger.kernel.org] On Behalf Of Prarit Bhargava
>> Sent: Thursday, May 17, 2012 1:05 PM
>> To: linux-pci@vger.kernel.org
>> Cc: Prarit Bhargava; Bjorn Helgaas
>> Subject: [PATCH] pci, Add AER_panic sysfs file
>>
>> Consider the following case
>>
>> 		[ RP ]
>> 		  |
>> 		  |
>> 	+---------+-----------+
>> 	|	  |	      |
>>        [H1]      [H2]        [X1]
>>
>> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that
>> support
>> PCIE AER driver error handling (ie, they have pci_error_handlers
>> defined in
>> the driver), and X1 is a device with a driver that does not support
>> PCIE
>> AER driver error handling.
>>
>> If the Root Port takes an error what currently happens is that the
>> bus resets and H1 & H2 call their slot_reset functions.  X1 does
>> nothing.
>>
>> In some cases a user may not wish the system to continue because X1 is
>> an unhardened driver.  In these cases, the system should not do a bus
>> reset,
>> but rather the system should panic to avoid any further possible data
>> corruption.
> 
> Do we neeed to panic for both correctable and uncorrectable errors.. ?
> 
> I thought correctable errors could recover without a bus reset.

Will a bus reset be issued on a correctable error?  I thought the code path was
that the bus reset was issued on the uncorrectable error.

drivers/pci/pcie/aer/aerdrv_core.c: do_recovery()

        if (severity == AER_FATAL) {
                result = reset_link(dev);
                if (result != PCI_ERS_RESULT_RECOVERED)
                        goto failed;
        }

I may not be looking at the right spot of code.  Care to enlighten me? :)

P.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 17:39   ` Prarit Bhargava
@ 2012-05-17 17:51     ` Shyam_Iyer
       [not found]     ` <DBFB1B45AF80394ABD1C807E9F28D15707BC712175@BLRX7MCDC203.AMER.DELL.COM>
  1 sibling, 0 replies; 19+ messages in thread
From: Shyam_Iyer @ 2012-05-17 17:51 UTC (permalink / raw)
  To: prarit; +Cc: linux-pci, bhelgaas



> -----Original Message-----
> From: Prarit Bhargava [mailto:prarit@redhat.com]
> Sent: Thursday, May 17, 2012 1:39 PM
> To: Iyer, Shyam
> Cc: linux-pci@vger.kernel.org; bhelgaas@google.com
> Subject: Re: [PATCH] pci, Add AER_panic sysfs file
> 
> 
> 
> On 05/17/2012 01:29 PM, Shyam_Iyer@Dell.com wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
> >> owner@vger.kernel.org] On Behalf Of Prarit Bhargava
> >> Sent: Thursday, May 17, 2012 1:05 PM
> >> To: linux-pci@vger.kernel.org
> >> Cc: Prarit Bhargava; Bjorn Helgaas
> >> Subject: [PATCH] pci, Add AER_panic sysfs file
> >>
> >> Consider the following case
> >>
> >> 		[ RP ]
> >> 		  |
> >> 		  |
> >> 	+---------+-----------+
> >> 	|	  |	      |
> >>        [H1]      [H2]        [X1]
> >>
> >> where RP is a PCIE Root Port, H1 and H2 are devices with drivers
> that
> >> support
> >> PCIE AER driver error handling (ie, they have pci_error_handlers
> >> defined in
> >> the driver), and X1 is a device with a driver that does not support
> >> PCIE
> >> AER driver error handling.
> >>
> >> If the Root Port takes an error what currently happens is that the
> >> bus resets and H1 & H2 call their slot_reset functions.  X1 does
> >> nothing.
> >>
> >> In some cases a user may not wish the system to continue because X1
> is
> >> an unhardened driver.  In these cases, the system should not do a
> bus
> >> reset,
> >> but rather the system should panic to avoid any further possible
> data
> >> corruption.
> >
> > Do we neeed to panic for both correctable and uncorrectable errors..
> ?
> >
> > I thought correctable errors could recover without a bus reset.
> 
> Will a bus reset be issued on a correctable error?  I thought the code
> path was
> that the bus reset was issued on the uncorrectable error.
> 
> drivers/pci/pcie/aer/aerdrv_core.c: do_recovery()
> 
>         if (severity == AER_FATAL) {
>                 result = reset_link(dev);
>                 if (result != PCI_ERS_RESULT_RECOVERED)
>                         goto failed;
>         }
> 
> I may not be looking at the right spot of code.  Care to enlighten me?
> :)
> 
> P.

Actually I was reading the documentation .. 
Documentation/PCI/pcieaer-howto.txt

"
Correctable errors pose no impacts on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
"

But anyways the AER_FATAL is true for uncorrectable errors only and not for correctable errors which means reset_link doesn't happen for correctable errors.

drivers/pci/pcie/aer/aerdrv_core.c

if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
                e_info->id = ERR_UNCOR_ID(e_src->id);
        
                if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
                        e_info->severity = AER_FATAL;
                else
                        e_info->severity = AER_NONFATAL;
                
                if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
                        e_info->multi_error_valid = 1;
                else    
                        e_info->multi_error_valid = 0;
                        
                aer_print_port_info(p_device->port, e_info);
        
                if (find_source_device(p_device->port, e_info))
                        aer_process_err_devices(p_device, e_info);
        }               

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] pci, Add AER_panic sysfs file
       [not found]     ` <DBFB1B45AF80394ABD1C807E9F28D15707BC712175@BLRX7MCDC203.AMER.DELL.COM>
@ 2012-05-17 18:00       ` Shyam_Iyer
  0 siblings, 0 replies; 19+ messages in thread
From: Shyam_Iyer @ 2012-05-17 18:00 UTC (permalink / raw)
  To: prarit; +Cc: linux-pci, bhelgaas



> -----Original Message-----
> From: Iyer, Shyam
> Sent: Thursday, May 17, 2012 1:52 PM
> To: 'Prarit Bhargava'
> Cc: 'linux-pci@vger.kernel.org'; 'bhelgaas@google.com'
> Subject: RE: [PATCH] pci, Add AER_panic sysfs file
> 
> 
> 
> > -----Original Message-----
> > From: Prarit Bhargava [mailto:prarit@redhat.com]
> > Sent: Thursday, May 17, 2012 1:39 PM
> > To: Iyer, Shyam
> > Cc: linux-pci@vger.kernel.org; bhelgaas@google.com
> > Subject: Re: [PATCH] pci, Add AER_panic sysfs file
> >
> >
> >
> > On 05/17/2012 01:29 PM, Shyam_Iyer@Dell.com wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
> > >> owner@vger.kernel.org] On Behalf Of Prarit Bhargava
> > >> Sent: Thursday, May 17, 2012 1:05 PM
> > >> To: linux-pci@vger.kernel.org
> > >> Cc: Prarit Bhargava; Bjorn Helgaas
> > >> Subject: [PATCH] pci, Add AER_panic sysfs file
> > >>
> > >> Consider the following case
> > >>
> > >> 		[ RP ]
> > >> 		  |
> > >> 		  |
> > >> 	+---------+-----------+
> > >> 	|	  |	      |
> > >>        [H1]      [H2]        [X1]
> > >>
> > >> where RP is a PCIE Root Port, H1 and H2 are devices with drivers
> > that
> > >> support
> > >> PCIE AER driver error handling (ie, they have pci_error_handlers
> > >> defined in
> > >> the driver), and X1 is a device with a driver that does not
> support
> > >> PCIE
> > >> AER driver error handling.
> > >>
> > >> If the Root Port takes an error what currently happens is that the
> > >> bus resets and H1 & H2 call their slot_reset functions.  X1 does
> > >> nothing.
> > >>
> > >> In some cases a user may not wish the system to continue because
> X1
> > is
> > >> an unhardened driver.  In these cases, the system should not do a
> > bus
> > >> reset,
> > >> but rather the system should panic to avoid any further possible
> > data
> > >> corruption.
> > >
> > > Do we neeed to panic for both correctable and uncorrectable
> errors..
> > ?
> > >
> > > I thought correctable errors could recover without a bus reset.
> >
> > Will a bus reset be issued on a correctable error?  I thought the
> code
> > path was
> > that the bus reset was issued on the uncorrectable error.
> >
> > drivers/pci/pcie/aer/aerdrv_core.c: do_recovery()
> >
> >         if (severity == AER_FATAL) {
> >                 result = reset_link(dev);
> >                 if (result != PCI_ERS_RESULT_RECOVERED)
> >                         goto failed;
> >         }
> >
> > I may not be looking at the right spot of code.  Care to enlighten
> me?
> > :)
> >
> > P.
> 
> Actually I was reading the documentation ..
> Documentation/PCI/pcieaer-howto.txt
> 
> "
> Correctable errors pose no impacts on the functionality of the
> interface. The PCI Express protocol can recover without any software
> intervention or any loss of data. These errors are detected and
> corrected by hardware. Unlike correctable errors, uncorrectable
> errors impact functionality of the interface. Uncorrectable errors
> can cause a particular transaction or a particular PCI Express link
> to be unreliable. Depending on those error conditions, uncorrectable
> errors are further classified into non-fatal errors and fatal errors.
> Non-fatal errors cause the particular transaction to be unreliable,
> but the PCI Express link itself is fully functional. Fatal errors, on
> the other hand, cause the link to be unreliable.
> "
> 
> But anyways the AER_FATAL is true for uncorrectable errors only and not
> for correctable errors which means reset_link doesn't happen for
> correctable errors.
> 
> drivers/pci/pcie/aer/aerdrv_core.c
> 
> if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
>                 e_info->id = ERR_UNCOR_ID(e_src->id);
> 
>                 if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
>                         e_info->severity = AER_FATAL;
>                 else
>                         e_info->severity = AER_NONFATAL;
> 
>                 if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
>                         e_info->multi_error_valid = 1;
>                 else
>                         e_info->multi_error_valid = 0;
> 
>                 aer_print_port_info(p_device->port, e_info);
> 
>                 if (find_source_device(p_device->port, e_info))
>                         aer_process_err_devices(p_device, e_info);
>         }



Looks like we are saying the same thing and I just misunderstood that you were doing a panic for each error. 

The patch looks good to me if it matters.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 17:04 [PATCH] pci, Add AER_panic sysfs file Prarit Bhargava
  2012-05-17 17:29 ` Shyam_Iyer
@ 2012-05-17 18:51 ` Don Dutile
  2012-05-17 18:54   ` Prarit Bhargava
  2012-05-17 21:32 ` Betty Dall
  2012-05-18  4:51 ` Greg KH
  3 siblings, 1 reply; 19+ messages in thread
From: Don Dutile @ 2012-05-17 18:51 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-pci, Bjorn Helgaas

On 05/17/2012 01:04 PM, Prarit Bhargava wrote:
> Consider the following case
>
> 		[ RP ]
> 		  |
> 		  |
> 	+---------+-----------+
> 	|	  |	      |
>         [H1]      [H2]        [X1]
>
> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
> the driver), and X1 is a device with a driver that does not support PCIE
> AER driver error handling.
>
> If the Root Port takes an error what currently happens is that the
> bus resets and H1&  H2 call their slot_reset functions.  X1 does nothing.
>
> In some cases a user may not wish the system to continue because X1 is
> an unhardened driver.  In these cases, the system should not do a bus reset,
> but rather the system should panic to avoid any further possible data
> corruption.
>
> This patch implements an AER_panic sysfs entry for each root port which
> a user can modify.  AER_panic = 1, means the system will panic on a
> PCIE error which would have normally resulted in a secondary bus reset.
>
> Signed-off-by: Prarit Bhargava<prarit@redhat.com>
> Cc: Bjorn Helgaas<bhelgaas@google.com>
> ---
>   drivers/pci/pci-sysfs.c       |   42 ++++++++++++++++++++++++++++++++++++++++-
>   drivers/pci/pcie/aer/aerdrv.c |    3 ++
>   include/linux/pci.h           |    1 +
>   3 files changed, 45 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index a55e248..8c6d525 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1135,6 +1135,35 @@ static ssize_t reset_store(struct device *dev,
>
>   static struct device_attribute reset_attr = __ATTR(reset, 0200, NULL, reset_store);
>
> +static ssize_t AER_panic_show(struct device *dev,
> +				 struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sprintf(buf, "%d\n", pdev->rp_AER_panic);
> +}
> +
> +static ssize_t AER_panic_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	unsigned long val;
> +
> +	if (kstrtoul(buf, 0,&val)<  0)
> +		return -EINVAL;
> +
> +	if ((val>  1) || (val<  0))
> +		return -EINVAL;
> +
> +	pdev->rp_AER_panic = val;
> +
> +	return count;
> +}
> +
> +static struct device_attribute rp_AER_panic_attr =
> +	      __ATTR(AER_panic, 0600, AER_panic_show, AER_panic_store);
> +
>   static int pci_create_capabilities_sysfs(struct pci_dev *dev)
>   {
>   	int retval;
> @@ -1169,8 +1198,16 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev)
>   			goto error;
>   		dev->reset_fn = 1;
>   	}
> -	return 0;
>
> +	/* PCIE Root Port panic-on-AER allows a user to configure each root
> +	 * port to panic on an AER error instead of issuing a bus reset.
> +	 */
> +	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
> +		retval = device_create_file(&dev->dev,&rp_AER_panic_attr);
> +		if (retval)
> +			goto error;
> +	}
> +	return 0;
>   error:
>   	pcie_aspm_remove_sysfs_dev_files(dev);
>   	if (dev->vpd&&  dev->vpd->attr) {
> @@ -1279,6 +1316,9 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev)
>   		device_remove_file(&dev->dev,&reset_attr);
>   		dev->reset_fn = 0;
>   	}
> +
> +	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT)
> +		device_remove_file(&dev->dev,&rp_AER_panic_attr);
>   }
>
>   /**
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 58ad791..dd6b352 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>   	u32 reg32;
>   	int pos;
>
> +	if (dev->rp_AER_panic)
> +		panic("%s: AER detected on Root Port", pci_name(dev));
> +
It'd be more informative if the Root Port D:B:D.F was printed out above,
so one knows where the errors are coming from the system.  More likely than
not, the root is ok, but a device dangling from it is the 'root cause' (all pun intended)
of the error.

>   	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
>
>   	/* Disable Root's interrupt in response to error messages */
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e444f5b..a4e6a5a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -324,6 +324,7 @@ struct pci_dev {
>   	unsigned int    is_hotplug_bridge:1;
>   	unsigned int    __aer_firmware_first_valid:1;
>   	unsigned int	__aer_firmware_first:1;
> +	unsigned int	rp_AER_panic:1; /* if 1, panic on AER bus reset */
>   	pci_dev_flags_t dev_flags;
>   	atomic_t	enable_cnt;	/* pci_enable_device has been called */
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 18:51 ` Don Dutile
@ 2012-05-17 18:54   ` Prarit Bhargava
  2012-05-17 19:11     ` Don Dutile
  0 siblings, 1 reply; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-17 18:54 UTC (permalink / raw)
  To: Don Dutile; +Cc: linux-pci, Bjorn Helgaas


>> +    if (dev->rp_AER_panic)
>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>> +
> It'd be more informative if the Root Port D:B:D.F was printed out above,
> so one knows where the errors are coming from the system.  More likely than
> not, the root is ok, but a device dangling from it is the 'root cause' (all pun
> intended)
> of the error.

That's what pci_name() is.

P.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 18:54   ` Prarit Bhargava
@ 2012-05-17 19:11     ` Don Dutile
  2012-05-17 22:16       ` Prarit Bhargava
  0 siblings, 1 reply; 19+ messages in thread
From: Don Dutile @ 2012-05-17 19:11 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-pci, Bjorn Helgaas

On 05/17/2012 02:54 PM, Prarit Bhargava wrote:
>
>>> +    if (dev->rp_AER_panic)
>>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>>> +
>> It'd be more informative if the Root Port D:B:D.F was printed out above,
>> so one knows where the errors are coming from the system.  More likely than
>> not, the root is ok, but a device dangling from it is the 'root cause' (all pun
>> intended)
>> of the error.
>
> That's what pci_name() is.
>
> P.

ok. When I find where it's set, and what values it can be (init_name, vs kobject->name)
at what times in the kernel boot.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 17:04 [PATCH] pci, Add AER_panic sysfs file Prarit Bhargava
  2012-05-17 17:29 ` Shyam_Iyer
  2012-05-17 18:51 ` Don Dutile
@ 2012-05-17 21:32 ` Betty Dall
  2012-05-18  4:51 ` Greg KH
  3 siblings, 0 replies; 19+ messages in thread
From: Betty Dall @ 2012-05-17 21:32 UTC (permalink / raw)
  To: linux-pci

Prarit Bhargava <prarit <at> redhat.com> writes:

>  /**
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 58ad791..dd6b352 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  	u32 reg32;
>  	int pos;
> 
> +	if (dev->rp_AER_panic)
> +		panic("%s: AER detected on Root Port", pci_name(dev));
> +
>  	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
> 


I really like this idea. I just wonder if the panic can happen in do_recovery in 
aerdrv_core.c before the broadcast_error_message() is done that invokes all of 
the error_detected callbacks. It would be best to panic as soon as possible to 
increase error containment. 

If we don't move it to before the broadcast_error_message(), I was also looking 
at if it would be appropriate to put the check for this in the 
default_downstream_reset_link() in aerdrv_core.c too. What do you think?

-Betty Dall


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 19:11     ` Don Dutile
@ 2012-05-17 22:16       ` Prarit Bhargava
  0 siblings, 0 replies; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-17 22:16 UTC (permalink / raw)
  To: Don Dutile; +Cc: linux-pci, Bjorn Helgaas



On 05/17/2012 03:11 PM, Don Dutile wrote:
> On 05/17/2012 02:54 PM, Prarit Bhargava wrote:
>>
>>>> +    if (dev->rp_AER_panic)
>>>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>>>> +
>>> It'd be more informative if the Root Port D:B:D.F was printed out above,
>>> so one knows where the errors are coming from the system.  More likely than
>>> not, the root is ok, but a device dangling from it is the 'root cause' (all pun
>>> intended)
>>> of the error.
>>
>> That's what pci_name() is.
>>
>> P.
> 
> ok. When I find where it's set, and what values it can be (init_name, vs
> kobject->name)
> at what times in the kernel boot.

The device name is init'd in the function pci_setup_device().  It's name is not
changed after that AFAICT.

This is a well-used and well-known interface to get the address of a pci_dev in
the form of a string.  pci_name(), by my count, is used 879 times throughout the
linux kernel.

P.

> 
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-17 17:04 [PATCH] pci, Add AER_panic sysfs file Prarit Bhargava
                   ` (2 preceding siblings ...)
  2012-05-17 21:32 ` Betty Dall
@ 2012-05-18  4:51 ` Greg KH
  2012-05-18 10:26   ` Prarit Bhargava
  2012-05-18 14:16   ` Prarit Bhargava
  3 siblings, 2 replies; 19+ messages in thread
From: Greg KH @ 2012-05-18  4:51 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-pci, Bjorn Helgaas

On Thu, May 17, 2012 at 01:04:30PM -0400, Prarit Bhargava wrote:
> Consider the following case
> 
> 		[ RP ]
> 		  |
> 		  |
> 	+---------+-----------+
> 	|	  |	      |
>        [H1]      [H2]        [X1]
> 
> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
> the driver), and X1 is a device with a driver that does not support PCIE
> AER driver error handling.
> 
> If the Root Port takes an error what currently happens is that the
> bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.
> 
> In some cases a user may not wish the system to continue because X1 is
> an unhardened driver.  In these cases, the system should not do a bus reset,
> but rather the system should panic to avoid any further possible data
> corruption.
> 
> This patch implements an AER_panic sysfs entry for each root port which
> a user can modify.  AER_panic = 1, means the system will panic on a
> PCIE error which would have normally resulted in a secondary bus reset.

If you add/modify/delete a sysfs file, you have to also have a
corrisponding patch to Documentation/ABI/ in order to keep things sane.
Please do that as part of this patch the next time you submit it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18  4:51 ` Greg KH
@ 2012-05-18 10:26   ` Prarit Bhargava
  2012-05-18 14:16   ` Prarit Bhargava
  1 sibling, 0 replies; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-18 10:26 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-pci, Bjorn Helgaas



On 05/18/2012 12:51 AM, Greg KH wrote:
> On Thu, May 17, 2012 at 01:04:30PM -0400, Prarit Bhargava wrote:
>> Consider the following case
>>
>> 		[ RP ]
>> 		  |
>> 		  |
>> 	+---------+-----------+
>> 	|	  |	      |
>>        [H1]      [H2]        [X1]
>>
>> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
>> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
>> the driver), and X1 is a device with a driver that does not support PCIE
>> AER driver error handling.
>>
>> If the Root Port takes an error what currently happens is that the
>> bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.
>>
>> In some cases a user may not wish the system to continue because X1 is
>> an unhardened driver.  In these cases, the system should not do a bus reset,
>> but rather the system should panic to avoid any further possible data
>> corruption.
>>
>> This patch implements an AER_panic sysfs entry for each root port which
>> a user can modify.  AER_panic = 1, means the system will panic on a
>> PCIE error which would have normally resulted in a secondary bus reset.
> 
> If you add/modify/delete a sysfs file, you have to also have a
> corrisponding patch to Documentation/ABI/ in order to keep things sane.
> Please do that as part of this patch the next time you submit it.

Sorry about that Greg.  I'll submit a [v2] with that modification.

P.
> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18  4:51 ` Greg KH
  2012-05-18 10:26   ` Prarit Bhargava
@ 2012-05-18 14:16   ` Prarit Bhargava
  2012-05-18 15:47     ` Greg KH
  1 sibling, 1 reply; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-18 14:16 UTC (permalink / raw)
  To: linux-pci; +Cc: Prarit Bhargava, Bjorn Helgaas, Shyam Iyer, gregkh, ddutile

Bjorn, ... [v2] with missing Doc file.

P.

----8<----

Consider the following case

		[ RP ]
		  |
		  |
	+---------+-----------+
	|	  |	      |
       [H1]      [H2]        [X1]

where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
PCIE AER driver error handling (ie, they have pci_error_handlers defined in
the driver), and X1 is a device with a driver that does not support PCIE
AER driver error handling.

If the Root Port takes an error what currently happens is that the
bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.

In some cases a user may not wish the system to continue because X1 is
an unhardened driver.  In these cases, the system should not do a bus reset,
but rather the system should panic to avoid any further possible data
corruption.

This patch implements an AER_panic sysfs entry for each root port which
a user can modify.  AER_panic = 1, means the system will panic on a
PCIE error which would have normally resulted in a secondary bus reset.
The default is AER_panic = 0.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Shyam Iyer <Shyam_Iyer@Dell.com>
Cc: gregkh@linuxfoundation.org
Cc: ddutile@redhat.com

[v2]: added missing Documentation/ABI/testing/sysfs-bus-pci
---
 Documentation/ABI/testing/sysfs-bus-pci |   10 +++++++
 drivers/pci/pci-sysfs.c                 |   42 ++++++++++++++++++++++++++++++-
 drivers/pci/pcie/aer/aerdrv.c           |    3 ++
 include/linux/pci.h                     |    1 +
 4 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 34f5110..e64d434 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -210,3 +210,13 @@ Users:
 		firmware assigned instance number of the PCI
 		device that can help in understanding the firmware
 		intended order of the PCI device.
+
+What:		/sys/bus/pci/devices/.../AER_panic
+Date:		May 2012
+Contact:	linux-pci@vger.kernel.org, Prarit Bhargava <prarit@redhat.com>
+Description:
+		This file is present for PCIe Root Ports only and changes the
+		behavior when an AER event targets the port.
+		This attribute can have two values.  If the value is 0, this
+		PCIe bus will issue a bus reset on the secondary bus.  If the
+		value is 1, the kernel will panic.  The default value is 0.
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index a55e248..8c6d525 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1135,6 +1135,35 @@ static ssize_t reset_store(struct device *dev,
 
 static struct device_attribute reset_attr = __ATTR(reset, 0200, NULL, reset_store);
 
+static ssize_t AER_panic_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%d\n", pdev->rp_AER_panic);
+}
+
+static ssize_t AER_panic_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if ((val > 1) || (val < 0))
+		return -EINVAL;
+
+	pdev->rp_AER_panic = val;
+
+	return count;
+}
+
+static struct device_attribute rp_AER_panic_attr =
+	      __ATTR(AER_panic, 0600, AER_panic_show, AER_panic_store);
+
 static int pci_create_capabilities_sysfs(struct pci_dev *dev)
 {
 	int retval;
@@ -1169,8 +1198,16 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev)
 			goto error;
 		dev->reset_fn = 1;
 	}
-	return 0;
 
+	/* PCIE Root Port panic-on-AER allows a user to configure each root
+	 * port to panic on an AER error instead of issuing a bus reset.
+	 */
+	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+		retval = device_create_file(&dev->dev, &rp_AER_panic_attr);
+		if (retval)
+			goto error;
+	}
+	return 0;
 error:
 	pcie_aspm_remove_sysfs_dev_files(dev);
 	if (dev->vpd && dev->vpd->attr) {
@@ -1279,6 +1316,9 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev)
 		device_remove_file(&dev->dev, &reset_attr);
 		dev->reset_fn = 0;
 	}
+
+	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT)
+		device_remove_file(&dev->dev, &rp_AER_panic_attr);
 }
 
 /**
diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 58ad791..dd6b352 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	u32 reg32;
 	int pos;
 
+	if (dev->rp_AER_panic)
+		panic("%s: AER detected on Root Port", pci_name(dev));
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
 
 	/* Disable Root's interrupt in response to error messages */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e444f5b..a4e6a5a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -324,6 +324,7 @@ struct pci_dev {
 	unsigned int    is_hotplug_bridge:1;
 	unsigned int    __aer_firmware_first_valid:1;
 	unsigned int	__aer_firmware_first:1;
+	unsigned int	rp_AER_panic:1; /* if 1, panic on AER bus reset */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18 14:16   ` Prarit Bhargava
@ 2012-05-18 15:47     ` Greg KH
  2012-05-18 17:17       ` Prarit Bhargava
  2012-05-18 17:26       ` Prarit Bhargava
  0 siblings, 2 replies; 19+ messages in thread
From: Greg KH @ 2012-05-18 15:47 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-pci, Bjorn Helgaas, Shyam Iyer, ddutile

On Fri, May 18, 2012 at 10:16:46AM -0400, Prarit Bhargava wrote:
> Bjorn, ... [v2] with missing Doc file.
> 
> P.
> 
> ----8<----
> 
> Consider the following case
> 
> 		[ RP ]
> 		  |
> 		  |
> 	+---------+-----------+
> 	|	  |	      |
>        [H1]      [H2]        [X1]
> 
> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
> the driver), and X1 is a device with a driver that does not support PCIE
> AER driver error handling.

Why can't we provide "default" error handlers that recover from such
errors?

> If the Root Port takes an error what currently happens is that the
> bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.
> 
> In some cases a user may not wish the system to continue because X1 is
> an unhardened driver.

Please define "unhardened".  Why aren't all drivers "hardened"?

> In these cases, the system should not do a bus reset, but rather the
> system should panic to avoid any further possible data corruption.

Really?  You really want to panic the whole system and shut down and
potentially loose everything?  That does not sound like a good idea at
all to me, is there really no way to recover from this?

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18 15:47     ` Greg KH
@ 2012-05-18 17:17       ` Prarit Bhargava
  2012-05-18 18:13         ` Greg KH
  2012-05-18 17:26       ` Prarit Bhargava
  1 sibling, 1 reply; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-18 17:17 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-pci, Bjorn Helgaas, Shyam Iyer, ddutile


> 
> Please define "unhardened".  Why aren't all drivers "hardened"?

Most drivers _currently_ do not handle reading all f's (or -1) from hardware.
Some drivers do handle some situations but definitely not all of them.
Hardening a driver involves making the driver "-1" safe.

Some companies do ship hardened drivers, but the ones in the tree are not hardened.

[The above comment is in no way an approval of shipping drivers outside of the
kernel.  I'm just stating a fact.]

The effort involved in hardening this drivers is significant.  It will be a long
time before anyone considers the in-tree drivers hardened.  We should start with
a baby-step of acknowledging the problem and giving current users a way of
protecting their data.

> 
>> In these cases, the system should not do a bus reset, but rather the
>> system should panic to avoid any further possible data corruption.
> 
> Really?  You really want to panic the whole system and shut down and
> potentially loose everything?  That does not sound like a good idea at
> all to me, is there really no way to recover from this?

Yes, that's _exactly_ what I want to do.  Having a driver that is capable of
writing corrupted data to a disk or corrupting memory is much worse than
panicking and stopping the system for a short period of time.

The default is to handle an AER through a bus reset so a user must actively
request the panic.

P.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18 15:47     ` Greg KH
  2012-05-18 17:17       ` Prarit Bhargava
@ 2012-05-18 17:26       ` Prarit Bhargava
  1 sibling, 0 replies; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-18 17:26 UTC (permalink / raw)
  To: linux-pci; +Cc: ddutile, betty.dall, Shyam Iyer, Bjorn Helgaas

> Prarit Bhargava <prarit <at> redhat.com> writes:
> 
>>  /**
>> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
>> index 58ad791..dd6b352 100644
>> --- a/drivers/pci/pcie/aer/aerdrv.c
>> +++ b/drivers/pci/pcie/aer/aerdrv.c
>> @@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>>      u32 reg32;
>>      int pos;
>>
>> +    if (dev->rp_AER_panic)
>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>> +
>>      pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
>>
> 
> 
> I really like this idea. I just wonder if the panic can happen in do_recovery in
> aerdrv_core.c before the broadcast_error_message() is done that invokes all of
> the error_detected callbacks. It would be best to panic as soon as possible to
> increase error containment.

Hi Betty,

Thanks for the review.

IMO that sounds like a better idea.  As you said, panicking early as possible in
order to increase error containment is the best option.

Bjorn, any objection to moving the panic up as far as possible?

I'll let the rest of the discussion settle before putting out a [v3].

P.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18 17:17       ` Prarit Bhargava
@ 2012-05-18 18:13         ` Greg KH
  2012-05-18 19:28           ` Prarit Bhargava
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2012-05-18 18:13 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-pci, Bjorn Helgaas, Shyam Iyer, ddutile

On Fri, May 18, 2012 at 01:17:54PM -0400, Prarit Bhargava wrote:
> 
> > 
> > Please define "unhardened".  Why aren't all drivers "hardened"?
> 
> Most drivers _currently_ do not handle reading all f's (or -1) from hardware.
> Some drivers do handle some situations but definitely not all of them.
> Hardening a driver involves making the driver "-1" safe.

That's not "hardening", it should be written as, "fixing broken
drivers".  It's a bug if a PCI driver can not handle this as that is
exactly what happens when a PCI device is removed from the system
without the driver knowing about it.

> Some companies do ship hardened drivers, but the ones in the tree are not hardened.

Why are there out-of-tree drivers that are so-called "hardened" and why
are those bug fixes not merged into the kernel tree?

> [The above comment is in no way an approval of shipping drivers outside of the
> kernel.  I'm just stating a fact.]

Any specific drivers you are referring to so that I can go and kick
someone to get their act together?

Seriously, this is a bug in the PCI drivers, not anything else, it needs
to be fixed there, not papered over with a kernel crash from the PCI
core.

> The effort involved in hardening this drivers is significant.

It shouldn't be, this has been well known for what, 13+ years now?  This
is nothing new at all, and again, is a bug if the driver can't handle
this.

> It will be a long time before anyone considers the in-tree drivers
> hardened.  We should start with a baby-step of acknowledging the
> problem and giving current users a way of protecting their data.

No, we need to fix the drivers, again, this is a well-known issue.

What specific drivers do you see in the kernel tree right now that can
not handle this type of thing.  A list would be great so that we can fix
them now.

> >> In these cases, the system should not do a bus reset, but rather the
> >> system should panic to avoid any further possible data corruption.
> > 
> > Really?  You really want to panic the whole system and shut down and
> > potentially loose everything?  That does not sound like a good idea at
> > all to me, is there really no way to recover from this?
> 
> Yes, that's _exactly_ what I want to do.  Having a driver that is capable of
> writing corrupted data to a disk or corrupting memory is much worse than
> panicking and stopping the system for a short period of time.

But by panicking, you just lost data and have potentially corrupt data
written to the disk in a half-finished manner, plus you now have a
broken system that is stuck and needs to be rebooted :)

> The default is to handle an AER through a bus reset so a user must actively
> request the panic.

Fair enough, I can understand why some people might want this type of
control over a system, and if they reboot-on-panic, they can recover
quickly and get back up and running.

But again, this needs to be fixed in the drivers themselves, otherwise
they are broken on systems that, again, have been shipping for 13+ years
now.  It's unacceptable for the driver authors to be that sloppy.

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18 18:13         ` Greg KH
@ 2012-05-18 19:28           ` Prarit Bhargava
  2012-05-18 23:19             ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: Prarit Bhargava @ 2012-05-18 19:28 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-pci, Bjorn Helgaas, Shyam Iyer, ddutile



On 05/18/2012 02:13 PM, Greg KH wrote:
> On Fri, May 18, 2012 at 01:17:54PM -0400, Prarit Bhargava wrote:
>>
>>>
>>> Please define "unhardened".  Why aren't all drivers "hardened"?
>>
>> Most drivers _currently_ do not handle reading all f's (or -1) from hardware.
>> Some drivers do handle some situations but definitely not all of them.
>> Hardening a driver involves making the driver "-1" safe.
> 
> That's not "hardening", it should be written as, "fixing broken
> drivers".  It's a bug if a PCI driver can not handle this as that is
> exactly what happens when a PCI device is removed from the system
> without the driver knowing about it.

Sorry Greg, I should have stated this at the beginning.  Without question
patches should be _and will be_ sent to drivers as problems are found.

> 
>> Some companies do ship hardened drivers, but the ones in the tree are not hardened.
> 
> Why are there out-of-tree drivers that are so-called "hardened" and why
> are those bug fixes not merged into the kernel tree?

See comment below.
> 
>> [The above comment is in no way an approval of shipping drivers outside of the
>> kernel.  I'm just stating a fact.]

I'm just stating a fact.  I have no idea why patches are not on the list.  Nor
am I condoning the activity of keeping fixes outside of the tree.  I've _just
started_ working with a group who has patches and am going to do some work with
them on getting patches out to the tree.

> 
> Any specific drivers you are referring to so that I can go and kick
> someone to get their act together?

I'll get a list together and hopefully we can get some patches out.

> 
>> The effort involved in hardening this drivers is significant.
> 
> It shouldn't be, this has been well known for what, 13+ years now?  This
> is nothing new at all, and again, is a bug if the driver can't handle
> this.

Most drivers cannot handle surprise removals, and in this case that's what we're
effectively talking about.  There may be a few that can, and even those might be
able to handle a few cases of surprise removal or reset events.

>>>> In these cases, the system should not do a bus reset, but rather the
>>>> system should panic to avoid any further possible data corruption.
>>>
>>> Really?  You really want to panic the whole system and shut down and
>>> potentially loose everything?  That does not sound like a good idea at
>>> all to me, is there really no way to recover from this?
>>
>> Yes, that's _exactly_ what I want to do.  Having a driver that is capable of
>> writing corrupted data to a disk or corrupting memory is much worse than
>> panicking and stopping the system for a short period of time.
> 
> But by panicking, you just lost data and have potentially corrupt data
> written to the disk in a half-finished manner, plus you now have a
> broken system that is stuck and needs to be rebooted :)

Fair enough -- I suppose this comes down to:  Which is worse?  Coming back to a
system with corrupt data or memory, or rebooting and losing data (and waiting
for a reboot)? :)

In my view, if a user *KNOWS* that a driver isn't going to play well with a
reset then let them make the call -- it shouldn't be up to us to decide what is
best for them.  If they want a panic, let them have it.  As time goes on the
drivers will get better but that isn't going to happen overnight.

> 
>> The default is to handle an AER through a bus reset so a user must actively
>> request the panic.
> 
> Fair enough, I can understand why some people might want this type of
> control over a system, and if they reboot-on-panic, they can recover
> quickly and get back up and running.
> 
> But again, this needs to be fixed in the drivers themselves, otherwise
> they are broken on systems that, again, have been shipping for 13+ years
> now.  It's unacceptable for the driver authors to be that sloppy.

I agree with you Greg.  I 100% agree with you.  But getting fixes into the
drivers will take some time.  Getting them to a state where
commercial/enterprise customers consider them reliable for surprise events is
going to take some time... so I'm arguing that we should go with a simple
user-based policy and fix the drivers as we move along.

P.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] pci, Add AER_panic sysfs file
  2012-05-18 19:28           ` Prarit Bhargava
@ 2012-05-18 23:19             ` Greg KH
  0 siblings, 0 replies; 19+ messages in thread
From: Greg KH @ 2012-05-18 23:19 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-pci, Bjorn Helgaas, Shyam Iyer, ddutile

On Fri, May 18, 2012 at 03:28:36PM -0400, Prarit Bhargava wrote:
> 
> 
> On 05/18/2012 02:13 PM, Greg KH wrote:
> > On Fri, May 18, 2012 at 01:17:54PM -0400, Prarit Bhargava wrote:
> >>
> >>>
> >>> Please define "unhardened".  Why aren't all drivers "hardened"?
> >>
> >> Most drivers _currently_ do not handle reading all f's (or -1) from hardware.
> >> Some drivers do handle some situations but definitely not all of them.
> >> Hardening a driver involves making the driver "-1" safe.
> > 
> > That's not "hardening", it should be written as, "fixing broken
> > drivers".  It's a bug if a PCI driver can not handle this as that is
> > exactly what happens when a PCI device is removed from the system
> > without the driver knowing about it.
> 
> Sorry Greg, I should have stated this at the beginning.  Without question
> patches should be _and will be_ sent to drivers as problems are found.

Great, just do that then.

Seriously, that's all you need to do, no modifying the PCI core needed.

There shouldn't be these problems in any "real" drivers anyway, a number
of us did a lot of testing for this very problem years ago (yanking out
huge drawers of PCI slots, having fun with ExpressCard devices, etc.)

Any driver that can't handle this is fundimentally broken, and needs to
be fixed now, no messing around with this "enterprise" crud.

> >> Some companies do ship hardened drivers, but the ones in the tree are not hardened.
> > 
> > Why are there out-of-tree drivers that are so-called "hardened" and why
> > are those bug fixes not merged into the kernel tree?
> 
> See comment below.
> > 
> >> [The above comment is in no way an approval of shipping drivers outside of the
> >> kernel.  I'm just stating a fact.]
> 
> I'm just stating a fact.  I have no idea why patches are not on the list.  Nor
> am I condoning the activity of keeping fixes outside of the tree.  I've _just
> started_ working with a group who has patches and am going to do some work with
> them on getting patches out to the tree.

Send them today.  What's keeping them from going in right now?

> > Any specific drivers you are referring to so that I can go and kick
> > someone to get their act together?
> 
> I'll get a list together and hopefully we can get some patches out.
> 
> > 
> >> The effort involved in hardening this drivers is significant.
> > 
> > It shouldn't be, this has been well known for what, 13+ years now?  This
> > is nothing new at all, and again, is a bug if the driver can't handle
> > this.
> 
> Most drivers cannot handle surprise removals, and in this case that's what we're
> effectively talking about.  There may be a few that can, and even those might be
> able to handle a few cases of surprise removal or reset events.

I think it's really the other way around, most should be able to handle
this as it was tested by a lot of people.

Unless new code crept in that wasn't tested, but no, no company would
ever submit stuff like that, would they?  Nevermind...

> >>>> In these cases, the system should not do a bus reset, but rather the
> >>>> system should panic to avoid any further possible data corruption.
> >>>
> >>> Really?  You really want to panic the whole system and shut down and
> >>> potentially loose everything?  That does not sound like a good idea at
> >>> all to me, is there really no way to recover from this?
> >>
> >> Yes, that's _exactly_ what I want to do.  Having a driver that is capable of
> >> writing corrupted data to a disk or corrupting memory is much worse than
> >> panicking and stopping the system for a short period of time.
> > 
> > But by panicking, you just lost data and have potentially corrupt data
> > written to the disk in a half-finished manner, plus you now have a
> > broken system that is stuck and needs to be rebooted :)
> 
> Fair enough -- I suppose this comes down to:  Which is worse?  Coming back to a
> system with corrupt data or memory, or rebooting and losing data (and waiting
> for a reboot)? :)
> 
> In my view, if a user *KNOWS* that a driver isn't going to play well with a
> reset then let them make the call -- it shouldn't be up to us to decide what is
> best for them.  If they want a panic, let them have it.  As time goes on the
> drivers will get better but that isn't going to happen overnight.

How can a user know that?  You haven't even told us what drivers can't
handle this!  That's not fair at all, what you are really asking us to
do is:
	Take this core patch to cause oopses to work around some unnamed
	crappy drivers that are broken and can't handle basic,
	fundamental, aspects of handling their hardware.

That's not acceptable at all, fix the problem at the root cause, in the
drivers, right now.  There should not be anything stopping this.

> >> The default is to handle an AER through a bus reset so a user must actively
> >> request the panic.
> > 
> > Fair enough, I can understand why some people might want this type of
> > control over a system, and if they reboot-on-panic, they can recover
> > quickly and get back up and running.
> > 
> > But again, this needs to be fixed in the drivers themselves, otherwise
> > they are broken on systems that, again, have been shipping for 13+ years
> > now.  It's unacceptable for the driver authors to be that sloppy.
> 
> I agree with you Greg.  I 100% agree with you.  But getting fixes into the
> drivers will take some time.  Getting them to a state where
> commercial/enterprise customers consider them reliable for surprise events is
> going to take some time... so I'm arguing that we should go with a simple
> user-based policy and fix the drivers as we move along.

I don't care about "enterprise" customers (hint, there is no such thing,
everyone is an "enterprise").  I care about people who don't know how to
write basic PCI drivers to handle this type of event that has been known
for well over a decade[1].

You are also making the claim that somehow it is easier, and safer, to
modify the PCI core, and that these skittish "enterprise" people will
take that, instead of fixing their drivers.

So I say NAK to this patch, fix the root problem now, there is no excuse
for this work around.

greg k-h

[1] Actually longer than that, this went back to PCMCIA cards which came
out what, in the 1980's?

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-05-18 23:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-17 17:04 [PATCH] pci, Add AER_panic sysfs file Prarit Bhargava
2012-05-17 17:29 ` Shyam_Iyer
2012-05-17 17:39   ` Prarit Bhargava
2012-05-17 17:51     ` Shyam_Iyer
     [not found]     ` <DBFB1B45AF80394ABD1C807E9F28D15707BC712175@BLRX7MCDC203.AMER.DELL.COM>
2012-05-17 18:00       ` Shyam_Iyer
2012-05-17 18:51 ` Don Dutile
2012-05-17 18:54   ` Prarit Bhargava
2012-05-17 19:11     ` Don Dutile
2012-05-17 22:16       ` Prarit Bhargava
2012-05-17 21:32 ` Betty Dall
2012-05-18  4:51 ` Greg KH
2012-05-18 10:26   ` Prarit Bhargava
2012-05-18 14:16   ` Prarit Bhargava
2012-05-18 15:47     ` Greg KH
2012-05-18 17:17       ` Prarit Bhargava
2012-05-18 18:13         ` Greg KH
2012-05-18 19:28           ` Prarit Bhargava
2012-05-18 23:19             ` Greg KH
2012-05-18 17:26       ` Prarit Bhargava

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).