* [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
@ 2017-03-01 11:08 ` Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
2 siblings, 0 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
To: linuxppc-dev, Russell Currey
Cc: Vaibhav Jain, Frederic Barrat, Andrew Donnellan, Ian Munsie,
Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan
This patch introduces a new function eeh_pe_update_freeze_counter()
replacing existing function eeh_pe_update_time_stamp(). The new
function also manages the value of reeze_count along with tstamp to
track the number of times the PE roze in last one hour and if the
freeze_count > eeh_max_freezes then eports an error(-ENOTRECOVERABLE)
to indicate that the PE should be ermanently disabled.
This patch should not introduce any behavioral change.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:
v1 -> v2
Changes as suggested by Russell Currey:
- Suffixed function names with '()'
- Dropped '<0' conditional check inside eeh_handle_normal_event()
- Rephrased the comment for function eeh_pe_update_freeze_counter()
- Brace-wrapped a single line statement at end of
eeh_pe_update_freeze_counter()
---
arch/powerpc/include/asm/eeh.h | 2 +-
arch/powerpc/kernel/eeh_driver.c | 20 +++--------------
arch/powerpc/kernel/eeh_pe.c | 47 +++++++++++++++++++++++++++-------------
3 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8e37b71..68806be 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
int eeh_add_to_parent_pe(struct eeh_dev *edev);
int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
-void eeh_pe_update_time_stamp(struct eeh_pe *pe);
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
void *eeh_pe_traverse(struct eeh_pe *root,
eeh_traverse_func fn, void *flag);
void *eeh_pe_dev_traverse(struct eeh_pe *root,
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b948871..8a088ea 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
return;
}
- eeh_pe_update_time_stamp(pe);
- pe->freeze_count++;
- if (pe->freeze_count > eeh_max_freezes)
- goto excess_failures;
+ /* Update freeze counters and see if we have tripped max-freeze limit */
+ if (eeh_pe_update_freeze_counter(pe))
+ goto perm_error;
pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
pe->freeze_count);
@@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
return;
-excess_failures:
- /*
- * About 90% of all real-life EEH failures in the field
- * are due to poorly seated PCI cards. Only 10% or so are
- * due to actual, failed cards.
- */
- pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
- "last hour and has been permanently disabled.\n"
- "Please try reseating or replacing it.\n",
- pe->phb->global_number, pe->addr,
- pe->freeze_count);
- goto perm_error;
-
hard_fail:
pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
"Please try reseating or replacing it\n",
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index cc4b206..d367c16 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,30 +504,47 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
}
/**
- * eeh_pe_update_time_stamp - Update PE's frozen time stamp
+ * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
+ * and freeze counter
* @pe: EEH PE
*
- * We have time stamp for each PE to trace its time of getting
- * frozen in last hour. The function should be called to update
- * the time stamp on first error of the specific PE. On the other
- * handle, we needn't account for errors happened in last hour.
+ * We have a freeze counter and time stamp for each PE to trace
+ * number of times the PE was frozen in the last hour. This function
+ * updates the PE's freeze counter and returns an error if its greater
+ * than eeh_max_freezes. The function should be called to once every
+ * time a specific PE freezes.
*/
-void eeh_pe_update_time_stamp(struct eeh_pe *pe)
+int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
{
struct timeval tstamp;
- if (!pe) return;
+ if (!pe)
+ return -EINVAL;
+
+ do_gettimeofday(&tstamp);
+ if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
+ pe->tstamp = tstamp;
+ pe->freeze_count = 1;
+
+ } else if (pe->freeze_count >= eeh_max_freezes) {
+ pe->freeze_count++;
+ /*
+ * About 90% of all real-life EEH failures in the field
+ * are due to poorly seated PCI cards. Only 10% or so are
+ * due to actual, failed cards.
+ */
+ pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
+ "last hour and has been permanently disabled.\n"
+ "Please try reseating or replacing it.\n",
+ pe->phb->global_number, pe->addr,
+ pe->freeze_count);
+ return -ENOTRECOVERABLE;
- if (pe->freeze_count <= 0) {
- pe->freeze_count = 0;
- do_gettimeofday(&pe->tstamp);
} else {
- do_gettimeofday(&tstamp);
- if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
- pe->tstamp = tstamp;
- pe->freeze_count = 0;
- }
+ pe->freeze_count++;
}
+
+ return 0;
}
/**
--
2.9.3
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter()
2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
@ 2017-03-01 11:08 ` Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
2 siblings, 0 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
To: linuxppc-dev, Russell Currey
Cc: Vaibhav Jain, Frederic Barrat, Andrew Donnellan, Ian Munsie,
Christophe Lombard, Philippe Bergheaud, Greg Kurz, Gavin Shan
This patch introduces function eeh_pe_reset_freeze_counter() which can
be used to reset the PE's freeze count variable outside eeh code. This
is useful for devices that can acquire a different personality after
a PERST event (e.g FPGA Adapters). Presently an existing freeze
count for an adapter with personality N will be taken into account
when the adapter acquired personality N+1.
By calling eeh_pe_reset_freeze_counter() drivers can reset the freeze
counter for an adapter once it has acquired a new personality and
ideally wont be plagued by the failures similar to the one before.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:
v1 -> v2
* Changes as suggested by Russell Currey:
- Suffixed function names with '()'
- Rephrased the description comment for functon eeh_pe_reset_freeze_counter()
- Inserted logging for PHB and PE number inside eeh_pe_reset_freeze_counter()
* Moved definition of eeh_pe_reset_freeze_counter() from eeh.h to eeh_pe.c to
avoid adding a header dependency to 'pci-bridge.h'. The function is
now marked as an exported gpl symbol.
---
arch/powerpc/include/asm/eeh.h | 5 +++++
arch/powerpc/kernel/eeh_pe.c | 17 +++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 68806be..8dcfb88 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -266,6 +266,9 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
int eeh_add_to_parent_pe(struct eeh_dev *edev);
int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
+
+void eeh_pe_reset_freeze_counter(struct eeh_pe *pe);
+
void *eeh_pe_traverse(struct eeh_pe *root,
eeh_traverse_func fn, void *flag);
void *eeh_pe_dev_traverse(struct eeh_pe *root,
@@ -339,6 +342,8 @@ static inline int eeh_check_failure(const volatile void __iomem *token)
return 0;
}
+static inline void eeh_pe_reset_freeze_counter(struct eeh_pe *pe) { }
+
#define eeh_dev_check_failure(x) (0)
static inline void eeh_addr_cache_build(void) { }
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d367c16..75c781f 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -504,6 +504,23 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
}
/**
+ * eeh_pe_reset_freeze_counter - Resets the PE freeze counter
+ * @pe: EEH PE
+ *
+ * This function is useful while re-configuring an FPGA adapter
+ * as its about to acquire new a personality and you don't want
+ * freeze count to be carry forwarded. As such calling this function
+ * for regular pci devices might be a bad idea.
+ */
+void eeh_pe_reset_freeze_counter(struct eeh_pe *pe)
+{
+ pr_info("Resetting freeze count for PHB#%x-PE#%x\n",
+ pe->phb->global_number, pe->addr);
+ pe->freeze_count = 0;
+}
+EXPORT_SYMBOL_GPL(eeh_pe_reset_freeze_counter);
+
+/**
* eeh_pe_update_freeze_counter - Update PE's frozen time stamp
* and freeze counter
* @pe: EEH PE
--
2.9.3
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image
2017-03-01 11:08 [RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
2017-03-01 11:08 ` [RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
@ 2017-03-01 11:08 ` Vaibhav Jain
2 siblings, 0 replies; 4+ messages in thread
From: Vaibhav Jain @ 2017-03-01 11:08 UTC (permalink / raw)
To: linuxppc-dev, Frederic Barrat, Russell Currey
Cc: Vaibhav Jain, Andrew Donnellan, Ian Munsie, Christophe Lombard,
Philippe Bergheaud, Greg Kurz, Gavin Shan
The patch resets the freeze counter on eeh_pe struct for PHB
associated with the cxl pci adapter. This would enable re-flashing of
the cxl-adapter beyond the default limit of 5.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Change-log:
v1 -> v2
Changes as suggested by Russell Currey:
- Changed new variable names inline with eeh code nomenclature.
- Removed the dev_info logging the PHB being reset. The log message is now
moved to eeh_handle_normal_event()
---
drivers/misc/cxl/pci.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index ce54dab..7960fd64 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -22,6 +22,7 @@
#include <asm/pnv-pci.h>
#include <asm/io.h>
#include <asm/reg.h>
+#include <asm/eeh.h>
#include "cxl.h"
#include <misc/cxl.h>
@@ -1231,6 +1232,8 @@ static void cxl_pci_remove_afu(struct cxl_afu *afu)
int cxl_pci_reset(struct cxl *adapter)
{
struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
+ struct eeh_dev *edev = pci_dev_to_eeh_dev(dev);
+ struct eeh_pe *pe = eeh_dev_to_pe(edev);
int rc;
if (adapter->perst_same_image) {
@@ -1244,6 +1247,17 @@ int cxl_pci_reset(struct cxl *adapter)
/* the adapter is about to be reset, so ignore errors */
cxl_data_cache_flush(adapter);
+ /* If loading a new image, reset freeze counters for the PHB
+ * associated with the adapter.
+ */
+ if (pe && adapter->perst_loads_image) {
+ /* Find the pe associated with the device PHB */
+ while (pe->parent != NULL && (pe->type & EEH_PE_PHB) == 0)
+ pe = pe->parent;
+
+ eeh_pe_reset_freeze_counter(pe);
+ }
+
/* pcie_warm_reset requests a fundamental pci reset which includes a
* PERST assert/deassert. PERST triggers a loading of the image
* if "user" or "factory" is selected in sysfs */
--
2.9.3
^ permalink raw reply related [flat|nested] 4+ messages in thread