* [PATCH v2 0/9] EEH improvement
@ 2014-02-25 5:37 Gavin Shan
2014-02-25 5:37 ` [PATCH 1/9] powerpc/eeh: Remove EEH_PE_PHB_DEAD Gavin Shan
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
This series intends to improve the reliability of EEH on the PowerNV
platform. First of all, we had multiple duplicate states (flags) for
PHB and PE, so we remove the duplicates to simplify the code.
Besides, the PHB diag-data was corrupted in the frozen PE case. To
solve the problem, we introduce eeh_ops->event(): the EEH core notifies
the (PowerNV) platform when creating or destroying a PE instance so
that the platform can allocate or free the PHB diag-data backend. We
then cache the PHB diag-data on the first call to eeh_ops->get_state()
and dump it afterwards, which yields correct PHB diag-data.

With the patchset applied, we never dump PHB diag-data for INF errors.
Instead, we just maintain statistics in /proc/powerpc/eeh_inf_err. Also,
we changed the PHB diag-data dump format slightly to have multiple
fields per line and to omit lines whose fields are all zero, as Ben
suggested.

v1 -> v2:
* Amend commit logs
* Support eeh_ops->event() and maintain PHB diag-data on a per-PE
basis
* When dumping PHB diag-data, replace "-" with "00000000" and
omit a line if all of its fields are zero.
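
The dump rule in the last item can be sketched in userspace C as
follows. This is only an illustration under stated assumptions:
dump_pair() and its output format are hypothetical and are not the
helper used in pci.c.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical helper: format two 32-bit diag fields on one line.
 * Returns the number of characters written, or 0 when both fields
 * are zero and the line is omitted.
 */
static int dump_pair(char *out, size_t len, const char *label,
		     uint32_t a, uint32_t b)
{
	if (!a && !b)
		return 0;	/* all-zero line: print nothing */

	/* A zero field is printed as 00000000 rather than "-" */
	return snprintf(out, len, "%s: %08x %08x\n", label,
			(unsigned int)a, (unsigned int)b);
}
```

Printing zero fields as 00000000 keeps the columns aligned, while
dropping all-zero lines keeps the log short.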
---
arch/powerpc/include/asm/eeh.h | 7 ++-
arch/powerpc/kernel/eeh.c | 10 +---
arch/powerpc/kernel/eeh_driver.c | 10 ++--
arch/powerpc/kernel/eeh_pe.c | 39 ++++++++++++-
arch/powerpc/platforms/powernv/eeh-ioda.c | 193 ++++++++++++++++++++++++++++++++++++-------------------------
arch/powerpc/platforms/powernv/eeh-powernv.c | 74 +++++++++++++++++++-----
arch/powerpc/platforms/powernv/pci.c | 228 +++++++++++++++++++++++++++++++++++++++++-------------------------
arch/powerpc/platforms/powernv/pci.h | 11 ++--
arch/powerpc/platforms/pseries/eeh_pseries.c | 3 +-
9 files changed, 358 insertions(+), 217 deletions(-)
Thanks,
Gavin
* [PATCH 1/9] powerpc/eeh: Remove EEH_PE_PHB_DEAD
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The PE state (for an eeh_pe instance) EEH_PE_PHB_DEAD duplicates
EEH_PE_ISOLATED. Originally, PHBs (PHB PEs) marked EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that flag with EEH_PE_ISOLATED.

The patch also clears EEH_PE_RECOVERING after a fenced PHB has been
handled, whether recovery failed or succeeded. That keeps the PHB PE
state consistent:

PHB functions normally              NONE
PHB has been removed                EEH_PE_ISOLATED
PHB fenced, recovery in progress    EEH_PE_ISOLATED | EEH_PE_RECOVERING
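
The three states above can be told apart with two flag tests. A
minimal sketch follows; the flag values match asm/eeh.h, but the two
helper functions are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Flag values as defined in arch/powerpc/include/asm/eeh.h */
#define EEH_PE_ISOLATED		(1 << 0)
#define EEH_PE_RECOVERING	(1 << 1)

/* Hypothetical helper: isolated and not being recovered */
static bool phb_pe_is_removed(int state)
{
	return (state & EEH_PE_ISOLATED) &&
	       !(state & EEH_PE_RECOVERING);
}

/* Hypothetical helper: fenced PHB with recovery in progress */
static bool phb_pe_is_recovering(int state)
{
	int both = EEH_PE_ISOLATED | EEH_PE_RECOVERING;

	return (state & both) == both;
}
```

The first test mirrors the check this patch adds to
eeh_handle_special_event() before removing a dead PHB.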
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/eeh.h | 1 -
arch/powerpc/kernel/eeh.c | 10 ++--------
arch/powerpc/kernel/eeh_driver.c | 10 +++++-----
3 files changed, 7 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index d4dd41f..a61b06f 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -53,7 +53,6 @@ struct device_node;
#define EEH_PE_ISOLATED (1 << 0) /* Isolated PE */
#define EEH_PE_RECOVERING (1 << 1) /* Recovering PE */
-#define EEH_PE_PHB_DEAD (1 << 2) /* Dead PHB */
#define EEH_PE_KEEP (1 << 8) /* Keep PE on hotplug */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index e7b76a6..f167676 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -232,7 +232,6 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
{
size_t loglen = 0;
struct eeh_dev *edev, *tmp;
- bool valid_cfg_log = true;
/*
* When the PHB is fenced or dead, it's pointless to collect
@@ -240,12 +239,7 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
* 0xFF's. For ER, we still retrieve the data from the PCI
* config space.
*/
- if (eeh_probe_mode_dev() &&
- (pe->type & EEH_PE_PHB) &&
- (pe->state & (EEH_PE_ISOLATED | EEH_PE_PHB_DEAD)))
- valid_cfg_log = false;
-
- if (valid_cfg_log) {
+ if (!(pe->type & EEH_PE_PHB)) {
eeh_pci_enable(pe, EEH_OPT_THAW_MMIO);
eeh_ops->configure_bridge(pe);
eeh_pe_restore_bars(pe);
@@ -309,7 +303,7 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
/* If the PHB has been in problematic state */
eeh_serialize_lock(&flags);
- if (phb_pe->state & (EEH_PE_ISOLATED | EEH_PE_PHB_DEAD)) {
+ if (phb_pe->state & EEH_PE_ISOLATED) {
ret = 0;
goto out;
}
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index fdc679d..4cf0467 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -665,8 +665,7 @@ static void eeh_handle_special_event(void)
phb_pe = eeh_phb_pe_get(hose);
if (!phb_pe) continue;
- eeh_pe_state_mark(phb_pe,
- EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+ eeh_pe_state_mark(phb_pe, EEH_PE_ISOLATED);
}
eeh_serialize_unlock(flags);
@@ -682,8 +681,7 @@ static void eeh_handle_special_event(void)
eeh_remove_event(pe);
if (rc == EEH_NEXT_ERR_DEAD_PHB)
- eeh_pe_state_mark(pe,
- EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+ eeh_pe_state_mark(pe, EEH_PE_ISOLATED);
else
eeh_pe_state_mark(pe,
EEH_PE_ISOLATED | EEH_PE_RECOVERING);
@@ -707,12 +705,14 @@ static void eeh_handle_special_event(void)
if (rc == EEH_NEXT_ERR_FROZEN_PE ||
rc == EEH_NEXT_ERR_FENCED_PHB) {
eeh_handle_normal_event(pe);
+ eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
} else {
pci_lock_rescan_remove();
list_for_each_entry(hose, &hose_list, list_node) {
phb_pe = eeh_phb_pe_get(hose);
if (!phb_pe ||
- !(phb_pe->state & EEH_PE_PHB_DEAD))
+ !(phb_pe->state & EEH_PE_ISOLATED) ||
+ (phb_pe->state & EEH_PE_RECOVERING))
continue;
/* Notify all devices to be down */
--
1.7.10.4
* [PATCH 2/9] powerpc/powernv: Remove PNV_EEH_STATE_REMOVED
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb is no
longer useful and it duplicates EEH_PE_ISOLATED. The patch
replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/platforms/powernv/eeh-ioda.c | 56 ++++++++---------------------
arch/powerpc/platforms/powernv/pci.h | 1 -
2 files changed, 15 insertions(+), 42 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index f514743..0d1d424 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -662,22 +662,6 @@ static void ioda_eeh_phb_diag(struct pci_controller *hose)
pnv_pci_dump_phb_diag_data(hose, phb->diag.blob);
}
-static int ioda_eeh_get_phb_pe(struct pci_controller *hose,
- struct eeh_pe **pe)
-{
- struct eeh_pe *phb_pe;
-
- phb_pe = eeh_phb_pe_get(hose);
- if (!phb_pe) {
- pr_warning("%s Can't find PE for PHB#%d\n",
- __func__, hose->global_number);
- return -EEXIST;
- }
-
- *pe = phb_pe;
- return 0;
-}
-
static int ioda_eeh_get_pe(struct pci_controller *hose,
u16 pe_no, struct eeh_pe **pe)
{
@@ -685,7 +669,8 @@ static int ioda_eeh_get_pe(struct pci_controller *hose,
struct eeh_dev dev;
/* Find the PHB PE */
- if (ioda_eeh_get_phb_pe(hose, &phb_pe))
+ phb_pe = eeh_phb_pe_get(hose);
+ if (!phb_pe)
return -EEXIST;
/* Find the PE according to PE# */
@@ -713,6 +698,7 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
{
struct pci_controller *hose;
struct pnv_phb *phb;
+ struct eeh_pe *phb_pe;
u64 frozen_pe_no;
u16 err_type, severity;
long rc;
@@ -729,10 +715,12 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
list_for_each_entry(hose, &hose_list, list_node) {
/*
* If the subordinate PCI buses of the PHB has been
- * removed, we needn't take care of it any more.
+ * removed or is exactly under error recovery, we
+ * needn't take care of it any more.
*/
phb = hose->private_data;
- if (phb->eeh_state & PNV_EEH_STATE_REMOVED)
+ phb_pe = eeh_phb_pe_get(hose);
+ if (!phb_pe || (phb_pe->state & EEH_PE_ISOLATED))
continue;
rc = opal_pci_next_error(phb->opal_id,
@@ -765,12 +753,6 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
switch (err_type) {
case OPAL_EEH_IOC_ERROR:
if (severity == OPAL_EEH_SEV_IOC_DEAD) {
- list_for_each_entry(hose, &hose_list,
- list_node) {
- phb = hose->private_data;
- phb->eeh_state |= PNV_EEH_STATE_REMOVED;
- }
-
pr_err("EEH: dead IOC detected\n");
ret = EEH_NEXT_ERR_DEAD_IOC;
} else if (severity == OPAL_EEH_SEV_INF) {
@@ -783,17 +765,12 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
break;
case OPAL_EEH_PHB_ERROR:
if (severity == OPAL_EEH_SEV_PHB_DEAD) {
- if (ioda_eeh_get_phb_pe(hose, pe))
- break;
-
+ *pe = phb_pe;
pr_err("EEH: dead PHB#%x detected\n",
hose->global_number);
- phb->eeh_state |= PNV_EEH_STATE_REMOVED;
ret = EEH_NEXT_ERR_DEAD_PHB;
} else if (severity == OPAL_EEH_SEV_PHB_FENCED) {
- if (ioda_eeh_get_phb_pe(hose, pe))
- break;
-
+ *pe = phb_pe;
pr_err("EEH: fenced PHB#%x detected\n",
hose->global_number);
ret = EEH_NEXT_ERR_FENCED_PHB;
@@ -813,15 +790,12 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
* fenced PHB so that it can be recovered.
*/
if (ioda_eeh_get_pe(hose, frozen_pe_no, pe)) {
- if (!ioda_eeh_get_phb_pe(hose, pe)) {
- pr_err("EEH: Escalated fenced PHB#%x "
- "detected for PE#%llx\n",
- hose->global_number,
- frozen_pe_no);
- ret = EEH_NEXT_ERR_FENCED_PHB;
- } else {
- ret = EEH_NEXT_ERR_NONE;
- }
+ *pe = phb_pe;
+ pr_err("EEH: Escalated fenced PHB#%x "
+ "detected for PE#%llx\n",
+ hose->global_number,
+ frozen_pe_no);
+ ret = EEH_NEXT_ERR_FENCED_PHB;
} else {
pr_err("EEH: Frozen PE#%x on PHB#%x detected\n",
(*pe)->addr, (*pe)->phb->global_number);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index cde1694..6870f60 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -83,7 +83,6 @@ struct pnv_eeh_ops {
};
#define PNV_EEH_STATE_ENABLED (1 << 0) /* EEH enabled */
-#define PNV_EEH_STATE_REMOVED (1 << 1) /* PHB removed */
#endif /* CONFIG_EEH */
--
1.7.10.4
* [PATCH 3/9] powerpc/powernv: Move PNV_EEH_STATE_ENABLED around
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The flag PNV_EEH_STATE_ENABLED is kept in pnv_phb::eeh_state,
which is protected by CONFIG_EEH. We don't need that. Instead, the
patch introduces pnv_phb::flags to hold all flags, and it renames
PNV_EEH_STATE_ENABLED to PNV_PHB_FLAG_EEH.
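
A minimal sketch of why the move helps, assuming a reduced stand-in
for struct pnv_phb: because the flags field always exists, callers can
test it without #ifdef CONFIG_EEH. The struct and helper names below
are hypothetical.

```c
#include <stdbool.h>

#define PNV_PHB_FLAG_EEH	(1 << 0)

/* Reduced stand-in for struct pnv_phb; only the field of interest */
struct pnv_phb_sketch {
	int flags;	/* present whether or not CONFIG_EEH is set */
};

/* Hypothetical helper: no #ifdef CONFIG_EEH needed at the call site */
static bool phb_eeh_enabled(const struct pnv_phb_sketch *phb)
{
	return (phb->flags & PNV_PHB_FLAG_EEH) != 0;
}
```

With CONFIG_EEH disabled the flag is simply never set, so the test
evaluates to false.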
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/platforms/powernv/eeh-ioda.c | 2 +-
arch/powerpc/platforms/powernv/pci.c | 8 ++------
arch/powerpc/platforms/powernv/pci.h | 7 +++----
3 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 0d1d424..04b4710 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -153,7 +153,7 @@ static int ioda_eeh_post_init(struct pci_controller *hose)
}
#endif
- phb->eeh_state |= PNV_EEH_STATE_ENABLED;
+ phb->flags |= PNV_PHB_FLAG_EEH;
return 0;
}
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 95633d7..3955fc0 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -396,7 +396,7 @@ int pnv_pci_cfg_read(struct device_node *dn,
if (phb_pe && (phb_pe->state & EEH_PE_ISOLATED))
return PCIBIOS_SUCCESSFUL;
- if (phb->eeh_state & PNV_EEH_STATE_ENABLED) {
+ if (phb->flags & PNV_PHB_FLAG_EEH) {
if (*val == EEH_IO_ERROR_VALUE(size) &&
eeh_dev_check_failure(of_node_to_eeh_dev(dn)))
return PCIBIOS_DEVICE_NOT_FOUND;
@@ -434,12 +434,8 @@ int pnv_pci_cfg_write(struct device_node *dn,
}
/* Check if the PHB got frozen due to an error (no response) */
-#ifdef CONFIG_EEH
- if (!(phb->eeh_state & PNV_EEH_STATE_ENABLED))
+ if (!(phb->flags & PNV_PHB_FLAG_EEH))
pnv_pci_config_check_eeh(phb, dn);
-#else
- pnv_pci_config_check_eeh(phb, dn);
-#endif
return PCIBIOS_SUCCESSFUL;
}
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6870f60..94e3495 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -81,24 +81,23 @@ struct pnv_eeh_ops {
int (*configure_bridge)(struct eeh_pe *pe);
int (*next_error)(struct eeh_pe **pe);
};
-
-#define PNV_EEH_STATE_ENABLED (1 << 0) /* EEH enabled */
-
#endif /* CONFIG_EEH */
+#define PNV_PHB_FLAG_EEH (1 << 0)
+
struct pnv_phb {
struct pci_controller *hose;
enum pnv_phb_type type;
enum pnv_phb_model model;
u64 hub_id;
u64 opal_id;
+ int flags;
void __iomem *regs;
int initialized;
spinlock_t lock;
#ifdef CONFIG_EEH
struct pnv_eeh_ops *eeh_ops;
- int eeh_state;
#endif
#ifdef CONFIG_DEBUG_FS
--
1.7.10.4
* [PATCH 4/9] powerpc/eeh: Introduce eeh_pe_free()
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The patch introduces eeh_pe_free() to replace the original kfree(pe)
so that we can add more checks there, as well as calls to the platform
interface supplied by eeh_ops in the future.
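
The idea of a guarded free can be modelled in userspace as below. This
is a sketch, not the kernel code: counters stand in for the list
heads, and the boolean return value is an assumption added so the
behaviour can be observed (the real eeh_pe_free() returns void and
warns instead).

```c
#include <stdbool.h>
#include <stdlib.h>

/* Reduced stand-in for struct eeh_pe: counters replace list heads */
struct pe_model {
	int nr_children;	/* models !list_empty(&pe->child_list) */
	int nr_edevs;		/* models !list_empty(&pe->edevs) */
};

/* Free the PE only when nothing references it; true if it was freed */
static bool pe_model_free(struct pe_model *pe)
{
	if (!pe)
		return false;

	if (pe->nr_children || pe->nr_edevs)
		return false;	/* still in use: refuse to free */

	free(pe);
	return true;
}
```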
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/kernel/eeh_pe.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index f0c353f..2add834 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -60,6 +60,27 @@ static struct eeh_pe *eeh_pe_alloc(struct pci_controller *phb, int type)
}
/**
+ * eeh_pe_free - Free PE
+ * @pe: EEH PE
+ *
+ * Free PE instance dynamically
+ */
+static void eeh_pe_free(struct eeh_pe *pe)
+{
+ if (!pe)
+ return;
+
+ if (!list_empty(&pe->child_list) ||
+ !list_empty(&pe->edevs)) {
+ pr_warn("%s: PHB#%x-PE#%x has child PE or EEH dev\n",
+ __func__, pe->phb->global_number, pe->addr);
+ return;
+ }
+
+ kfree(pe);
+}
+
+/**
* eeh_phb_pe_create - Create PHB PE
* @phb: PCI controller
*
@@ -374,7 +395,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
pr_err("%s: No PHB PE is found (PHB Domain=%d)\n",
__func__, edev->phb->global_number);
edev->pe = NULL;
- kfree(pe);
+ eeh_pe_free(pe);
return -EEXIST;
}
}
@@ -433,7 +454,7 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
if (list_empty(&pe->edevs) &&
list_empty(&pe->child_list)) {
list_del(&pe->child);
- kfree(pe);
+ eeh_pe_free(pe);
} else {
break;
}
--
1.7.10.4
* [PATCH 5/9] powerpc/eeh: Introduce eeh_ops->event()
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The patch introduces eeh_ops->event() so that we can pass various
events to the underlying platform. One reason to have it is to allocate
or free PHB diag-data for individual PEs on the PowerNV platform in
the future, when the EEH core creates or destroys PE instances.
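
The contract can be sketched in userspace C as follows. The event
numbers match this patch; everything else (pe_sketch, ops_sketch,
backend_event() and its 64-byte buffer) is a hypothetical stand-in,
with calloc()/free() in place of kzalloc()/kfree().

```c
#include <stdlib.h>

#define EEH_EVENT_PE_ALLOC	0
#define EEH_EVENT_PE_FREE	1

/* Reduced stand-ins for struct eeh_pe and struct eeh_ops */
struct pe_sketch {
	void *data;	/* platform dependent data */
};

struct ops_sketch {
	int (*event)(int event, void *data);	/* optional, may be NULL */
};

/* Hypothetical platform backend: attach a 64-byte buffer to the PE */
static int backend_event(int event, void *data)
{
	struct pe_sketch *pe = data;

	switch (event) {
	case EEH_EVENT_PE_ALLOC:
		pe->data = calloc(1, 64);
		return pe->data ? 0 : -1;
	case EEH_EVENT_PE_FREE:
		free(pe->data);
		pe->data = NULL;
		return 0;
	default:
		return 0;	/* unknown events are ignored */
	}
}

/* Allocate a PE and notify the platform; undo the allocation on failure */
static struct pe_sketch *pe_sketch_alloc(const struct ops_sketch *ops)
{
	struct pe_sketch *pe = calloc(1, sizeof(*pe));

	if (!pe)
		return NULL;

	if (ops->event && ops->event(EEH_EVENT_PE_ALLOC, pe)) {
		free(pe);
		return NULL;
	}

	return pe;
}

/* Notify the platform first so it can release pe->data, then free */
static void pe_sketch_free(struct pe_sketch *pe, const struct ops_sketch *ops)
{
	if (!pe)
		return;

	if (ops->event)
		ops->event(EEH_EVENT_PE_FREE, pe);

	free(pe);
}
```

A platform that leaves the hook NULL sees no behaviour change, since
both paths skip the call in that case.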
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/eeh.h | 6 ++++++
arch/powerpc/kernel/eeh_pe.c | 14 ++++++++++++++
2 files changed, 20 insertions(+)
diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index a61b06f..8fd1c2d 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -71,6 +71,7 @@ struct eeh_pe {
struct list_head child_list; /* Link PE to the child list */
struct list_head edevs; /* Link list of EEH devices */
struct list_head child; /* Child PEs */
+ void *data; /* Platform dependent data */
};
#define eeh_pe_for_each_dev(pe, edev, tmp) \
@@ -151,6 +152,10 @@ enum {
#define EEH_LOG_TEMP 1 /* EEH temporary error log */
#define EEH_LOG_PERM 2 /* EEH permanent error log */
+/* EEH events sent to platform */
+#define EEH_EVENT_PE_ALLOC 0
+#define EEH_EVENT_PE_FREE 1
+
struct eeh_ops {
char *name;
int (*init)(void);
@@ -168,6 +173,7 @@ struct eeh_ops {
int (*write_config)(struct device_node *dn, int where, int size, u32 val);
int (*next_error)(struct eeh_pe **pe);
int (*restore_config)(struct device_node *dn);
+ int (*event)(int event, void *data);
};
extern struct eeh_ops *eeh_ops;
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 2add834..6cdc7a8 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -44,6 +44,7 @@ static LIST_HEAD(eeh_phb_pe);
static struct eeh_pe *eeh_pe_alloc(struct pci_controller *phb, int type)
{
struct eeh_pe *pe;
+ int ret;
/* Allocate PHB PE */
pe = kzalloc(sizeof(struct eeh_pe), GFP_KERNEL);
@@ -56,6 +57,16 @@ static struct eeh_pe *eeh_pe_alloc(struct pci_controller *phb, int type)
INIT_LIST_HEAD(&pe->child);
INIT_LIST_HEAD(&pe->edevs);
+ if (eeh_ops->event) {
+ ret = eeh_ops->event(EEH_EVENT_PE_ALLOC, pe);
+ if (ret) {
+ pr_warn("%s: Can't alloc PE (%d)\n",
+ __func__, ret);
+ kfree(pe);
+ return NULL;
+ }
+ }
+
return pe;
}
@@ -77,6 +88,9 @@ static void eeh_pe_free(struct eeh_pe *pe)
return;
}
+ if (eeh_ops->event)
+ eeh_ops->event(EEH_EVENT_PE_FREE, pe);
+
kfree(pe);
}
--
1.7.10.4
* [PATCH 6/9] powerpc/powernv: Support eeh_ops->event()
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The patch implements the backend for eeh_ops->event() on the PowerNV
platform so that we can allocate or free the PHB diag-data buffer,
which is attached to eeh_pe::data.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/platforms/powernv/eeh-powernv.c | 42 +++++++++++++++++++++++++-
arch/powerpc/platforms/pseries/eeh_pseries.c | 3 +-
2 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index a59788e..cfba40a 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -365,6 +365,45 @@ static int powernv_eeh_restore_config(struct device_node *dn)
return 0;
}
+static int powernv_eeh_event(int event, void *data)
+{
+ struct eeh_pe *pe = data;
+ struct pnv_phb *phb;
+ int ret = 0;
+
+ switch (event) {
+ case EEH_EVENT_PE_ALLOC:
+ if (!pe) {
+ ret = -EINVAL;
+ break;
+ } else if (pe->data) {
+ ret = -EEXIST;
+ break;
+ }
+
+ phb = pe->phb->private_data;
+ if (phb->model == PNV_PHB_MODEL_P7IOC ||
+ phb->model == PNV_PHB_MODEL_PHB3) {
+ pe->data = kzalloc(PNV_PCI_DIAG_BUF_SIZE, GFP_KERNEL);
+ if (!pe->data)
+ ret = -ENOMEM;
+ }
+
+ break;
+ case EEH_EVENT_PE_FREE:
+ if (pe->data) {
+ kfree(pe->data);
+ pe->data = NULL;
+ }
+
+ break;
+ default:
+ return 0;
+ }
+
+ return ret;
+}
+
static struct eeh_ops powernv_eeh_ops = {
.name = "powernv",
.init = powernv_eeh_init,
@@ -381,7 +420,8 @@ static struct eeh_ops powernv_eeh_ops = {
.read_config = pnv_pci_cfg_read,
.write_config = pnv_pci_cfg_write,
.next_error = powernv_eeh_next_error,
- .restore_config = powernv_eeh_restore_config
+ .restore_config = powernv_eeh_restore_config,
+ .event = powernv_eeh_event
};
/**
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 8a8f047..b9a4ddb 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -691,7 +691,8 @@ static struct eeh_ops pseries_eeh_ops = {
.read_config = pseries_eeh_read_config,
.write_config = pseries_eeh_write_config,
.next_error = NULL,
- .restore_config = NULL
+ .restore_config = NULL,
+ .event = NULL
};
/**
--
1.7.10.4
* [PATCH 7/9] powerpc/powernv: Cache PHB diag-data
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The PHB diag-data helps locate the root cause of a frozen PE or
fenced PHB. However, the EEH core enables the IO path by clearing
part of the HW registers before collecting it, so we ended up with
broken PHB diag-data.

The patch fixes that by caching the PHB diag-data in eeh_pe::data in
advance, when the frozen/fenced state of the PE or PHB is detected
for the first time in the eeh_ops::get_state() or next_error() backend.
Also, we collect diag-data for INF errors without dumping it.
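
The caching rule can be sketched as below. This is a simplification
under stated assumptions: the struct, DIAG_SIZE and cache_diag_once()
are hypothetical, and the sketch sets EEH_PE_ISOLATED itself, whereas
in the patch only the next_error() path marks the PE before caching.

```c
#include <stdbool.h>
#include <string.h>

#define EEH_PE_ISOLATED	(1 << 0)
#define DIAG_SIZE	16	/* stand-in for PNV_PCI_DIAG_BUF_SIZE */

struct pe_cache_sketch {
	int state;
	unsigned char data[DIAG_SIZE];	/* models the eeh_pe::data buffer */
};

/*
 * Hypothetical helper: copy the live diag-data into the PE the first
 * time an error is seen. Later calls leave the snapshot alone, so
 * register changes made during recovery cannot overwrite it.
 * Returns true if a snapshot was taken.
 */
static bool cache_diag_once(struct pe_cache_sketch *pe,
			    const unsigned char *live)
{
	if (pe->state & EEH_PE_ISOLATED)
		return false;	/* already detected: keep first snapshot */

	memcpy(pe->data, live, DIAG_SIZE);
	pe->state |= EEH_PE_ISOLATED;
	return true;
}
```

The log backend then only has to dump the cached buffer instead of
reading the hardware again.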
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/platforms/powernv/eeh-ioda.c | 84 ++++++++++++++------------
arch/powerpc/platforms/powernv/eeh-powernv.c | 32 ++++++----
arch/powerpc/platforms/powernv/pci.h | 2 +-
3 files changed, 67 insertions(+), 51 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index 04b4710..cd06c52 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -114,6 +114,23 @@ DEFINE_SIMPLE_ATTRIBUTE(ioda_eeh_inbB_dbgfs_ops, ioda_eeh_inbB_dbgfs_get,
ioda_eeh_inbB_dbgfs_set, "0x%llx\n");
#endif /* CONFIG_DEBUG_FS */
+static void ioda_eeh_phb_diag(struct pci_controller *hose, char *buf)
+{
+ struct pnv_phb *phb = hose->private_data;
+ long rc;
+
+ if (!buf)
+ return;
+
+ rc = opal_pci_get_phb_diag_data2(phb->opal_id, buf,
+ PNV_PCI_DIAG_BUF_SIZE);
+ if (rc != OPAL_SUCCESS) {
+ pr_warn("%s: Failed to get PHB#%x diag-data (%ld)\n",
+ __func__, hose->global_number, rc);
+ return;
+ }
+}
+
/**
* ioda_eeh_post_init - Chip dependent post initialization
* @hose: PCI controller
@@ -224,12 +241,13 @@ static int ioda_eeh_set_option(struct eeh_pe *pe, int option)
/**
* ioda_eeh_get_state - Retrieve the state of PE
* @pe: EEH PE
+ * @cache_diag: Cache PHB diag-data or not
*
* The PE's state should be retrieved from the PEEV, PEST
* IODA tables. Since the OPAL has exported the function
* to do it, it'd better to use that.
*/
-static int ioda_eeh_get_state(struct eeh_pe *pe)
+static int ioda_eeh_get_state(struct eeh_pe *pe, bool cache_diag)
{
s64 ret = 0;
u8 fstate;
@@ -272,6 +290,9 @@ static int ioda_eeh_get_state(struct eeh_pe *pe)
result |= EEH_STATE_DMA_ACTIVE;
result |= EEH_STATE_MMIO_ENABLED;
result |= EEH_STATE_DMA_ENABLED;
+ } else if (cache_diag && !(pe->state & EEH_PE_ISOLATED)) {
+ /* Cache diag-data for fenced PHB */
+ ioda_eeh_phb_diag(hose, pe->data);
}
return result;
@@ -315,6 +336,14 @@ static int ioda_eeh_get_state(struct eeh_pe *pe)
__func__, fstate, hose->global_number, pe_no);
}
+ /* Cache PHB diag-data for frozen PE */
+ if (cache_diag &&
+ result != EEH_STATE_NOT_SUPPORT &&
+ (result & (EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE)) !=
+ (EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE) &&
+ !(pe->state & EEH_PE_ISOLATED))
+ ioda_eeh_phb_diag(hose, pe->data);
+
return result;
}
@@ -541,26 +570,10 @@ static int ioda_eeh_reset(struct eeh_pe *pe, int option)
static int ioda_eeh_get_log(struct eeh_pe *pe, int severity,
char *drv_log, unsigned long len)
{
- s64 ret;
- unsigned long flags;
- struct pci_controller *hose = pe->phb;
- struct pnv_phb *phb = hose->private_data;
+ if (!pe->data)
+ return 0;
- spin_lock_irqsave(&phb->lock, flags);
-
- ret = opal_pci_get_phb_diag_data2(phb->opal_id,
- phb->diag.blob, PNV_PCI_DIAG_BUF_SIZE);
- if (ret) {
- spin_unlock_irqrestore(&phb->lock, flags);
- pr_warning("%s: Can't get log for PHB#%x-PE#%x (%lld)\n",
- __func__, hose->global_number, pe->addr, ret);
- return -EIO;
- }
-
- /* The PHB diag-data is always indicative */
- pnv_pci_dump_phb_diag_data(hose, phb->diag.blob);
-
- spin_unlock_irqrestore(&phb->lock, flags);
+ pnv_pci_dump_phb_diag_data(pe->phb, pe->data);
return 0;
}
@@ -646,22 +659,6 @@ static void ioda_eeh_hub_diag(struct pci_controller *hose)
}
}
-static void ioda_eeh_phb_diag(struct pci_controller *hose)
-{
- struct pnv_phb *phb = hose->private_data;
- long rc;
-
- rc = opal_pci_get_phb_diag_data2(phb->opal_id, phb->diag.blob,
- PNV_PCI_DIAG_BUF_SIZE);
- if (rc != OPAL_SUCCESS) {
- pr_warning("%s: Failed to get diag-data for PHB#%x (%ld)\n",
- __func__, hose->global_number, rc);
- return;
- }
-
- pnv_pci_dump_phb_diag_data(hose, phb->diag.blob);
-}
-
static int ioda_eeh_get_pe(struct pci_controller *hose,
u16 pe_no, struct eeh_pe **pe)
{
@@ -778,7 +775,7 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
pr_info("EEH: PHB#%x informative error "
"detected\n",
hose->global_number);
- ioda_eeh_phb_diag(hose);
+ ioda_eeh_phb_diag(hose, phb->diag.blob);
ret = EEH_NEXT_ERR_NONE;
}
@@ -809,6 +806,19 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
}
/*
+ * EEH core will try recover from fenced PHB or
+ * frozen PE. In the time for frozen PE, EEH core
+ * enable IO path for that before collecting logs,
+ * but it ruins the site. So we have to cache the
+ * log in advance here.
+ */
+ if (ret == EEH_NEXT_ERR_FROZEN_PE ||
+ ret == EEH_NEXT_ERR_FENCED_PHB) {
+ eeh_pe_state_mark(*pe, EEH_PE_ISOLATED);
+ ioda_eeh_phb_diag(hose, (*pe)->data);
+ }
+
+ /*
* If we have no errors on the specific PHB or only
* informative error there, we continue poking it.
* Otherwise, we need actions to be taken by upper
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index cfba40a..54051bf 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -190,24 +190,15 @@ static int powernv_eeh_get_pe_addr(struct eeh_pe *pe)
return pe->addr;
}
-/**
- * powernv_eeh_get_state - Retrieve PE state
- * @pe: EEH PE
- * @delay: delay while PE state is temporarily unavailable
- *
- * Retrieve the state of the specified PE. For IODA-compitable
- * platform, it should be retrieved from IODA table. Therefore,
- * we prefer passing down to hardware implementation to handle
- * it.
- */
-static int powernv_eeh_get_state(struct eeh_pe *pe, int *delay)
+static int __powernv_eeh_get_state(struct eeh_pe *pe,
+ int *delay, bool cache_diag)
{
struct pci_controller *hose = pe->phb;
struct pnv_phb *phb = hose->private_data;
int ret = EEH_STATE_NOT_SUPPORT;
if (phb->eeh_ops && phb->eeh_ops->get_state) {
- ret = phb->eeh_ops->get_state(pe);
+ ret = phb->eeh_ops->get_state(pe, cache_diag);
/*
* If the PE state is temporarily unavailable,
@@ -225,6 +216,21 @@ static int powernv_eeh_get_state(struct eeh_pe *pe, int *delay)
}
/**
+ * powernv_eeh_get_state - Retrieve PE state
+ * @pe: EEH PE
+ * @delay: delay while PE state is temporarily unavailable
+ *
+ * Retrieve the state of the specified PE. For IODA-compitable
+ * platform, it should be retrieved from IODA table. Therefore,
+ * we prefer passing down to hardware implementation to handle
+ * it.
+ */
+static int powernv_eeh_get_state(struct eeh_pe *pe, int *delay)
+{
+ return __powernv_eeh_get_state(pe, delay, true);
+}
+
+/**
* powernv_eeh_reset - Reset the specified PE
* @pe: EEH PE
* @option: reset option
@@ -257,7 +263,7 @@ static int powernv_eeh_wait_state(struct eeh_pe *pe, int max_wait)
int mwait;
while (1) {
- ret = powernv_eeh_get_state(pe, &mwait);
+ ret = __powernv_eeh_get_state(pe, &mwait, false);
/*
* If the PE's state is temporarily unavailable,
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 94e3495..3645fc4 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -74,7 +74,7 @@ struct pnv_ioda_pe {
struct pnv_eeh_ops {
int (*post_init)(struct pci_controller *hose);
int (*set_option)(struct eeh_pe *pe, int option);
- int (*get_state)(struct eeh_pe *pe);
+ int (*get_state)(struct eeh_pe *pe, bool cache_diag);
int (*reset)(struct eeh_pe *pe, int option);
int (*get_log)(struct eeh_pe *pe, int severity,
char *drv_log, unsigned long len);
--
1.7.10.4
* [PATCH 8/9] powerpc/powernv: Add /proc/powerpc/eeh_inf_err
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
The patch adds /proc/powerpc/eeh_inf_err to count the INF errors
that happened on PHBs, as Ben suggested.
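
The report for an enabled EEH subsystem can be modelled in userspace
as below. The line formats follow ioda_eeh_proc_show() in this patch;
format_inf_err() itself is hypothetical, an array of counters replaces
the walk over hose_list, and the array index stands in for
hose->global_number.

```c
#include <stdio.h>

/*
 * Hypothetical model of the report. Returns the number of characters
 * written, or -1 if the buffer is too small.
 */
static int format_inf_err(char *out, size_t len,
			  unsigned long long ioc_err,
			  const unsigned long long *phb_err, int nr_phbs)
{
	size_t n;
	int i, rc;

	rc = snprintf(out, len, "EEH Subsystem enabled\n");
	if (rc < 0 || (size_t)rc >= len)
		return -1;
	n = (size_t)rc;

	/* The IOC counter is only reported once it is non-zero */
	if (ioc_err > 0) {
		rc = snprintf(out + n, len - n,
			      "\nIOC INF Errors: %llu\n\n", ioc_err);
		if (rc < 0 || (size_t)rc >= len - n)
			return -1;
		n += (size_t)rc;
	}

	/* One line per PHB, including those with a zero count */
	for (i = 0; i < nr_phbs; i++) {
		rc = snprintf(out + n, len - n,
			      "PHB#%d INF Errors: %llu\n", i, phb_err[i]);
		if (rc < 0 || (size_t)rc >= len - n)
			return -1;
		n += (size_t)rc;
	}

	return (int)n;
}
```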
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/platforms/powernv/eeh-ioda.c | 51 +++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci.h | 1 +
2 files changed, 52 insertions(+)
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index cd06c52..3ddd706 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -20,6 +20,7 @@
#include <linux/msi.h>
#include <linux/notifier.h>
#include <linux/pci.h>
+#include <linux/proc_fs.h>
#include <linux/string.h>
#include <asm/eeh.h>
@@ -35,6 +36,8 @@
#include "powernv.h"
#include "pci.h"
+static u64 ioda_eeh_ioc_inf_err = 0;
+static int ioda_eeh_proc_init = 0;
static int ioda_eeh_nb_init = 0;
static int ioda_eeh_event(struct notifier_block *nb,
@@ -114,6 +117,44 @@ DEFINE_SIMPLE_ATTRIBUTE(ioda_eeh_inbB_dbgfs_ops, ioda_eeh_inbB_dbgfs_get,
ioda_eeh_inbB_dbgfs_set, "0x%llx\n");
#endif /* CONFIG_DEBUG_FS */
+#ifdef CONFIG_PROC_FS
+static int ioda_eeh_proc_show(struct seq_file *m, void *v)
+{
+ struct pci_controller *hose;
+ struct pnv_phb *phb;
+
+ if (!eeh_enabled()) {
+ seq_printf(m, "EEH Subsystem disabled\n");
+ return 0;
+ }
+
+ seq_printf(m, "EEH Subsystem enabled\n");
+ if (ioda_eeh_ioc_inf_err > 0)
+ seq_printf(m, "\nIOC INF Errors: %llu\n\n",
+ ioda_eeh_ioc_inf_err);
+
+ list_for_each_entry(hose, &hose_list, list_node) {
+ phb = hose->private_data;
+ seq_printf(m, "PHB#%d INF Errors: %llu\n",
+ hose->global_number, phb->inf_err);
+ }
+
+ return 0;
+}
+
+static int ioda_eeh_proc_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, ioda_eeh_proc_show, NULL);
+}
+
+static const struct file_operations ioda_eeh_proc_ops = {
+ .open = ioda_eeh_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif /* CONFIG_PROC_FS */
+
static void ioda_eeh_phb_diag(struct pci_controller *hose, char *buf)
{
struct pnv_phb *phb = hose->private_data;
@@ -170,6 +211,14 @@ static int ioda_eeh_post_init(struct pci_controller *hose)
}
#endif
+#ifdef CONFIG_PROC_FS
+ if (!ioda_eeh_proc_init) {
+ ioda_eeh_proc_init = 1;
+ proc_create("powerpc/eeh_inf_err", 0,
+ NULL, &ioda_eeh_proc_ops);
+ }
+#endif
+
phb->flags |= PNV_PHB_FLAG_EEH;
return 0;
@@ -755,6 +804,7 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
} else if (severity == OPAL_EEH_SEV_INF) {
pr_info("EEH: IOC informative error "
"detected\n");
+ ioda_eeh_ioc_inf_err++;
ioda_eeh_hub_diag(hose);
ret = EEH_NEXT_ERR_NONE;
}
@@ -775,6 +825,7 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
pr_info("EEH: PHB#%x informative error "
"detected\n",
hose->global_number);
+ phb->inf_err++;
ioda_eeh_phb_diag(hose, phb->diag.blob);
ret = EEH_NEXT_ERR_NONE;
}
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 3645fc4..64ca719 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -97,6 +97,7 @@ struct pnv_phb {
spinlock_t lock;
#ifdef CONFIG_EEH
+ u64 inf_err;
struct pnv_eeh_ops *eeh_ops;
#endif
--
1.7.10.4
* [PATCH 9/9] powerpc/powernv: Refactor PHB diag-data dump
2014-02-25 5:37 [PATCH v2 0/9] EEH improvement Gavin Shan
` (7 preceding siblings ...)
2014-02-25 5:37 ` [PATCH 8/9] powerpc/powernv: Add /proc/powerpc/eeh_inf_err Gavin Shan
@ 2014-02-25 5:37 ` Gavin Shan
2014-02-25 7:26 ` [PATCH v2 0/9] EEH improvement Gavin Shan
9 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-02-25 5:37 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Gavin Shan
As Ben suggested, the patch prints PHB diag-data with multiple
fields in one line and omits the line if the fields of that
line are all zero.
With the patch applied, the PHB3 diag-data dump looks like:
PHB3 PHB#3 Diag-data (Version: 1)
brdgCtl: 00000002
RootSts: 0000000f 00400000 b0830008 00100147 00002000
nFir: 0000000000000000 0030006e00000000 0000000000000000
PhbSts: 0000001c00000000 0000000000000000
Lem: 0000000000100000 42498e327f502eae 0000000000000000
InAErr: 8000000000000000 8000000000000000 0402030000000000 \
0000000000000000
PE[ 8] A/B: 8480002b00000000 8000000000000000
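The "omit the line if all fields are zero" rule above can be sketched in
user space as follows. This is only an illustration under assumptions:
format_diag_line() is a hypothetical helper, whereas the patch itself
open-codes one if-condition per register group, and label padding is
left out for brevity.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "<label>: v0 v1 ..." into buf as 16-digit hex fields.
 * Returns 1 if a line was written, 0 if every value is zero (buf is
 * left untouched so the caller can skip the line), and -1 if buf is
 * too small. Hypothetical helper, not part of the patch.
 */
int format_diag_line(char *buf, size_t len, const char *label,
		     const uint64_t *vals, size_t n)
{
	size_t i, off;
	int any = 0, ret;

	/* A line is worth printing only if at least one field is set. */
	for (i = 0; i < n; i++)
		any |= (vals[i] != 0);
	if (!any)
		return 0;

	ret = snprintf(buf, len, "%s:", label);
	if (ret < 0 || (size_t)ret >= len)
		return -1;
	off = (size_t)ret;

	for (i = 0; i < n; i++) {
		ret = snprintf(buf + off, len - off, " %016" PRIx64, vals[i]);
		if (ret < 0 || (size_t)ret >= len - off)
			return -1;
		off += (size_t)ret;
	}
	return 1;
}
```

Zero fields are still printed when another field on the same line is
non-zero, which matches the "PhbSts" line in the sample dump above.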
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
arch/powerpc/platforms/powernv/pci.c | 220 +++++++++++++++++++---------------
1 file changed, 125 insertions(+), 95 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 3955fc0..114e1a7 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -134,57 +134,72 @@ static void pnv_pci_dump_p7ioc_diag_data(struct pci_controller *hose,
pr_info("P7IOC PHB#%d Diag-data (Version: %d)\n\n",
hose->global_number, common->version);
- pr_info(" brdgCtl: %08x\n", data->brdgCtl);
-
- pr_info(" portStatusReg: %08x\n", data->portStatusReg);
- pr_info(" rootCmplxStatus: %08x\n", data->rootCmplxStatus);
- pr_info(" busAgentStatus: %08x\n", data->busAgentStatus);
-
- pr_info(" deviceStatus: %08x\n", data->deviceStatus);
- pr_info(" slotStatus: %08x\n", data->slotStatus);
- pr_info(" linkStatus: %08x\n", data->linkStatus);
- pr_info(" devCmdStatus: %08x\n", data->devCmdStatus);
- pr_info(" devSecStatus: %08x\n", data->devSecStatus);
-
- pr_info(" rootErrorStatus: %08x\n", data->rootErrorStatus);
- pr_info(" uncorrErrorStatus: %08x\n", data->uncorrErrorStatus);
- pr_info(" corrErrorStatus: %08x\n", data->corrErrorStatus);
- pr_info(" tlpHdr1: %08x\n", data->tlpHdr1);
- pr_info(" tlpHdr2: %08x\n", data->tlpHdr2);
- pr_info(" tlpHdr3: %08x\n", data->tlpHdr3);
- pr_info(" tlpHdr4: %08x\n", data->tlpHdr4);
- pr_info(" sourceId: %08x\n", data->sourceId);
- pr_info(" errorClass: %016llx\n", data->errorClass);
- pr_info(" correlator: %016llx\n", data->correlator);
- pr_info(" p7iocPlssr: %016llx\n", data->p7iocPlssr);
- pr_info(" p7iocCsr: %016llx\n", data->p7iocCsr);
- pr_info(" lemFir: %016llx\n", data->lemFir);
- pr_info(" lemErrorMask: %016llx\n", data->lemErrorMask);
- pr_info(" lemWOF: %016llx\n", data->lemWOF);
- pr_info(" phbErrorStatus: %016llx\n", data->phbErrorStatus);
- pr_info(" phbFirstErrorStatus: %016llx\n", data->phbFirstErrorStatus);
- pr_info(" phbErrorLog0: %016llx\n", data->phbErrorLog0);
- pr_info(" phbErrorLog1: %016llx\n", data->phbErrorLog1);
- pr_info(" mmioErrorStatus: %016llx\n", data->mmioErrorStatus);
- pr_info(" mmioFirstErrorStatus: %016llx\n", data->mmioFirstErrorStatus);
- pr_info(" mmioErrorLog0: %016llx\n", data->mmioErrorLog0);
- pr_info(" mmioErrorLog1: %016llx\n", data->mmioErrorLog1);
- pr_info(" dma0ErrorStatus: %016llx\n", data->dma0ErrorStatus);
- pr_info(" dma0FirstErrorStatus: %016llx\n", data->dma0FirstErrorStatus);
- pr_info(" dma0ErrorLog0: %016llx\n", data->dma0ErrorLog0);
- pr_info(" dma0ErrorLog1: %016llx\n", data->dma0ErrorLog1);
- pr_info(" dma1ErrorStatus: %016llx\n", data->dma1ErrorStatus);
- pr_info(" dma1FirstErrorStatus: %016llx\n", data->dma1FirstErrorStatus);
- pr_info(" dma1ErrorLog0: %016llx\n", data->dma1ErrorLog0);
- pr_info(" dma1ErrorLog1: %016llx\n", data->dma1ErrorLog1);
+ if (data->brdgCtl)
+ pr_info(" brdgCtl: %08x\n",
+ data->brdgCtl);
+ if (data->portStatusReg || data->rootCmplxStatus ||
+ data->busAgentStatus)
+ pr_info(" UtlSts: %08x %08x %08x\n",
+ data->portStatusReg, data->rootCmplxStatus,
+ data->busAgentStatus);
+ if (data->deviceStatus || data->slotStatus ||
+ data->linkStatus || data->devCmdStatus ||
+ data->devSecStatus)
+ pr_info(" RootSts: %08x %08x %08x %08x %08x\n",
+ data->deviceStatus, data->slotStatus,
+ data->linkStatus, data->devCmdStatus,
+ data->devSecStatus);
+ if (data->rootErrorStatus || data->uncorrErrorStatus ||
+ data->corrErrorStatus)
+ pr_info(" RootErrSts: %08x %08x %08x\n",
+ data->rootErrorStatus, data->uncorrErrorStatus,
+ data->corrErrorStatus);
+ if (data->tlpHdr1 || data->tlpHdr2 ||
+ data->tlpHdr3 || data->tlpHdr4)
+ pr_info(" RootErrLog: %08x %08x %08x %08x\n",
+ data->tlpHdr1, data->tlpHdr2,
+ data->tlpHdr3, data->tlpHdr4);
+ if (data->sourceId || data->errorClass ||
+ data->correlator)
+ pr_info(" RootErrLog1: %08x %016llx %016llx\n",
+ data->sourceId, data->errorClass,
+ data->correlator);
+ if (data->p7iocPlssr || data->p7iocCsr)
+ pr_info(" PhbSts: %016llx %016llx\n",
+ data->p7iocPlssr, data->p7iocCsr);
+ if (data->lemFir || data->lemErrorMask ||
+ data->lemWOF)
+ pr_info(" Lem: %016llx %016llx %016llx\n",
+ data->lemFir, data->lemErrorMask,
+ data->lemWOF);
+ if (data->phbErrorStatus || data->phbFirstErrorStatus ||
+ data->phbErrorLog0 || data->phbErrorLog1)
+ pr_info(" PhbErr: %016llx %016llx %016llx %016llx\n",
+ data->phbErrorStatus, data->phbFirstErrorStatus,
+ data->phbErrorLog0, data->phbErrorLog1);
+ if (data->mmioErrorStatus || data->mmioFirstErrorStatus ||
+ data->mmioErrorLog0 || data->mmioErrorLog1)
+ pr_info(" OutErr: %016llx %016llx %016llx %016llx\n",
+ data->mmioErrorStatus, data->mmioFirstErrorStatus,
+ data->mmioErrorLog0, data->mmioErrorLog1);
+ if (data->dma0ErrorStatus || data->dma0FirstErrorStatus ||
+ data->dma0ErrorLog0 || data->dma0ErrorLog1)
+ pr_info(" InAErr: %016llx %016llx %016llx %016llx\n",
+ data->dma0ErrorStatus, data->dma0FirstErrorStatus,
+ data->dma0ErrorLog0, data->dma0ErrorLog1);
+ if (data->dma1ErrorStatus || data->dma1FirstErrorStatus ||
+ data->dma1ErrorLog0 || data->dma1ErrorLog1)
+ pr_info(" InBErr: %016llx %016llx %016llx %016llx\n",
+ data->dma1ErrorStatus, data->dma1FirstErrorStatus,
+ data->dma1ErrorLog0, data->dma1ErrorLog1);
for (i = 0; i < OPAL_P7IOC_NUM_PEST_REGS; i++) {
if ((data->pestA[i] >> 63) == 0 &&
(data->pestB[i] >> 63) == 0)
continue;
- pr_info(" PE[%3d] PESTA: %016llx\n", i, data->pestA[i]);
- pr_info(" PESTB: %016llx\n", data->pestB[i]);
+ pr_info(" PE[%3d] A/B: %016llx %016llx\n",
+ i, data->pestA[i], data->pestB[i]);
}
}
@@ -197,62 +212,77 @@ static void pnv_pci_dump_phb3_diag_data(struct pci_controller *hose,
data = (struct OpalIoPhb3ErrorData*)common;
pr_info("PHB3 PHB#%d Diag-data (Version: %d)\n\n",
hose->global_number, common->version);
-
- pr_info(" brdgCtl: %08x\n", data->brdgCtl);
-
- pr_info(" portStatusReg: %08x\n", data->portStatusReg);
- pr_info(" rootCmplxStatus: %08x\n", data->rootCmplxStatus);
- pr_info(" busAgentStatus: %08x\n", data->busAgentStatus);
-
- pr_info(" deviceStatus: %08x\n", data->deviceStatus);
- pr_info(" slotStatus: %08x\n", data->slotStatus);
- pr_info(" linkStatus: %08x\n", data->linkStatus);
- pr_info(" devCmdStatus: %08x\n", data->devCmdStatus);
- pr_info(" devSecStatus: %08x\n", data->devSecStatus);
-
- pr_info(" rootErrorStatus: %08x\n", data->rootErrorStatus);
- pr_info(" uncorrErrorStatus: %08x\n", data->uncorrErrorStatus);
- pr_info(" corrErrorStatus: %08x\n", data->corrErrorStatus);
- pr_info(" tlpHdr1: %08x\n", data->tlpHdr1);
- pr_info(" tlpHdr2: %08x\n", data->tlpHdr2);
- pr_info(" tlpHdr3: %08x\n", data->tlpHdr3);
- pr_info(" tlpHdr4: %08x\n", data->tlpHdr4);
- pr_info(" sourceId: %08x\n", data->sourceId);
- pr_info(" errorClass: %016llx\n", data->errorClass);
- pr_info(" correlator: %016llx\n", data->correlator);
-
- pr_info(" nFir: %016llx\n", data->nFir);
- pr_info(" nFirMask: %016llx\n", data->nFirMask);
- pr_info(" nFirWOF: %016llx\n", data->nFirWOF);
- pr_info(" PhbPlssr: %016llx\n", data->phbPlssr);
- pr_info(" PhbCsr: %016llx\n", data->phbCsr);
- pr_info(" lemFir: %016llx\n", data->lemFir);
- pr_info(" lemErrorMask: %016llx\n", data->lemErrorMask);
- pr_info(" lemWOF: %016llx\n", data->lemWOF);
- pr_info(" phbErrorStatus: %016llx\n", data->phbErrorStatus);
- pr_info(" phbFirstErrorStatus: %016llx\n", data->phbFirstErrorStatus);
- pr_info(" phbErrorLog0: %016llx\n", data->phbErrorLog0);
- pr_info(" phbErrorLog1: %016llx\n", data->phbErrorLog1);
- pr_info(" mmioErrorStatus: %016llx\n", data->mmioErrorStatus);
- pr_info(" mmioFirstErrorStatus: %016llx\n", data->mmioFirstErrorStatus);
- pr_info(" mmioErrorLog0: %016llx\n", data->mmioErrorLog0);
- pr_info(" mmioErrorLog1: %016llx\n", data->mmioErrorLog1);
- pr_info(" dma0ErrorStatus: %016llx\n", data->dma0ErrorStatus);
- pr_info(" dma0FirstErrorStatus: %016llx\n", data->dma0FirstErrorStatus);
- pr_info(" dma0ErrorLog0: %016llx\n", data->dma0ErrorLog0);
- pr_info(" dma0ErrorLog1: %016llx\n", data->dma0ErrorLog1);
- pr_info(" dma1ErrorStatus: %016llx\n", data->dma1ErrorStatus);
- pr_info(" dma1FirstErrorStatus: %016llx\n", data->dma1FirstErrorStatus);
- pr_info(" dma1ErrorLog0: %016llx\n", data->dma1ErrorLog0);
- pr_info(" dma1ErrorLog1: %016llx\n", data->dma1ErrorLog1);
+ if (data->brdgCtl)
+ pr_info(" brdgCtl: %08x\n",
+ data->brdgCtl);
+ if (data->portStatusReg || data->rootCmplxStatus ||
+ data->busAgentStatus)
+ pr_info(" UtlSts: %08x %08x %08x\n",
+ data->portStatusReg, data->rootCmplxStatus,
+ data->busAgentStatus);
+ if (data->deviceStatus || data->slotStatus ||
+ data->linkStatus || data->devCmdStatus ||
+ data->devSecStatus)
+ pr_info(" RootSts: %08x %08x %08x %08x %08x\n",
+ data->deviceStatus, data->slotStatus,
+ data->linkStatus, data->devCmdStatus,
+ data->devSecStatus);
+ if (data->rootErrorStatus || data->uncorrErrorStatus ||
+ data->corrErrorStatus)
+ pr_info(" RootErrSts: %08x %08x %08x\n",
+ data->rootErrorStatus, data->uncorrErrorStatus,
+ data->corrErrorStatus);
+ if (data->tlpHdr1 || data->tlpHdr2 ||
+ data->tlpHdr3 || data->tlpHdr4)
+ pr_info(" RootErrLog: %08x %08x %08x %08x\n",
+ data->tlpHdr1, data->tlpHdr2,
+ data->tlpHdr3, data->tlpHdr4);
+ if (data->sourceId || data->errorClass ||
+ data->correlator)
+ pr_info(" RootErrLog1: %08x %016llx %016llx\n",
+ data->sourceId, data->errorClass,
+ data->correlator);
+ if (data->nFir || data->nFirMask ||
+ data->nFirWOF)
+ pr_info(" nFir: %016llx %016llx %016llx\n",
+ data->nFir, data->nFirMask,
+ data->nFirWOF);
+ if (data->phbPlssr || data->phbCsr)
+ pr_info(" PhbSts: %016llx %016llx\n",
+ data->phbPlssr, data->phbCsr);
+ if (data->lemFir || data->lemErrorMask ||
+ data->lemWOF)
+ pr_info(" Lem: %016llx %016llx %016llx\n",
+ data->lemFir, data->lemErrorMask,
+ data->lemWOF);
+ if (data->phbErrorStatus || data->phbFirstErrorStatus ||
+ data->phbErrorLog0 || data->phbErrorLog1)
+ pr_info(" PhbErr: %016llx %016llx %016llx %016llx\n",
+ data->phbErrorStatus, data->phbFirstErrorStatus,
+ data->phbErrorLog0, data->phbErrorLog1);
+ if (data->mmioErrorStatus || data->mmioFirstErrorStatus ||
+ data->mmioErrorLog0 || data->mmioErrorLog1)
+ pr_info(" OutErr: %016llx %016llx %016llx %016llx\n",
+ data->mmioErrorStatus, data->mmioFirstErrorStatus,
+ data->mmioErrorLog0, data->mmioErrorLog1);
+ if (data->dma0ErrorStatus || data->dma0FirstErrorStatus ||
+ data->dma0ErrorLog0 || data->dma0ErrorLog1)
+ pr_info(" InAErr: %016llx %016llx %016llx %016llx\n",
+ data->dma0ErrorStatus, data->dma0FirstErrorStatus,
+ data->dma0ErrorLog0, data->dma0ErrorLog1);
+ if (data->dma1ErrorStatus || data->dma1FirstErrorStatus ||
+ data->dma1ErrorLog0 || data->dma1ErrorLog1)
+ pr_info(" InBErr: %016llx %016llx %016llx %016llx\n",
+ data->dma1ErrorStatus, data->dma1FirstErrorStatus,
+ data->dma1ErrorLog0, data->dma1ErrorLog1);
for (i = 0; i < OPAL_PHB3_NUM_PEST_REGS; i++) {
if ((data->pestA[i] >> 63) == 0 &&
(data->pestB[i] >> 63) == 0)
continue;
- pr_info(" PE[%3d] PESTA: %016llx\n", i, data->pestA[i]);
- pr_info(" PESTB: %016llx\n", data->pestB[i]);
+ pr_info(" PE[%3d] A/B: %016llx %016llx\n",
+ i, data->pestA[i], data->pestB[i]);
}
}
--
1.7.10.4
* Re: [PATCH v2 0/9] EEH improvement
2014-02-25 5:37 [PATCH v2 0/9] EEH improvement Gavin Shan
` (8 preceding siblings ...)
2014-02-25 5:37 ` [PATCH 9/9] powerpc/powernv: Refactor PHB diag-data dump Gavin Shan
@ 2014-02-25 7:26 ` Gavin Shan
9 siblings, 0 replies; 11+ messages in thread
From: Gavin Shan @ 2014-02-25 7:26 UTC (permalink / raw)
To: Gavin Shan; +Cc: linuxppc-dev
On Tue, Feb 25, 2014 at 01:37:41PM +0800, Gavin Shan wrote:
>The series of patches intends to improve reliability of EEH on PowerNV
>platform. First all, we have had multiple duplicate states (flags) for
>PHB and PE, so we remove those duplicate states to simplify the code.
>Besides, we had corrupted PHB diag-data for case of frozen PE. In order
>to solve the problem, we introduce eeh_ops->event() and notifications
>are sent from EEH core to (PowerNV) platform on creating or destroying
>PE instance so that we can allocate or free PHB diag-data backend. Then
>we cache the PHB diag-data on the first call to eeh_ops->get_state()
>and dump it afterwards, which helps to get correct PHB diag-data.
>
>With the patchset applied, we never dump PHB diag-data for INF errors.
>Instead, we just maintain statistics in /proc/powerpc/eeh_inf_err. Also,
>we changed the PHB diag-data dump format for a bit to have multiple
>fields per line and omits the line with all zero'd fields as Ben suggested.
>
>
>v1 -> v2:
> * Amending commit logs
> * Support eeh_ops->event() and maintain PHB diag-data on basis
> of PE instance
> * When dumping PHB diag-data, to replace "-" with "00000000" and
> omit the line if the fields of it are all zeros.
>
Please ignore this series. I'm going to send out v3, where we just
grab and dump the PHB diag-data directly (without caching it any more),
as Ben suggested :-)
Thanks,
Gavin
>---
>
>arch/powerpc/include/asm/eeh.h | 7 ++-
>arch/powerpc/kernel/eeh.c | 10 +---
>arch/powerpc/kernel/eeh_driver.c | 10 ++--
>arch/powerpc/kernel/eeh_pe.c | 39 ++++++++++++-
>arch/powerpc/platforms/powernv/eeh-ioda.c | 193 ++++++++++++++++++++++++++++++++++++-------------------------
>arch/powerpc/platforms/powernv/eeh-powernv.c | 74 +++++++++++++++++++-----
>arch/powerpc/platforms/powernv/pci.c | 228 +++++++++++++++++++++++++++++++++++++++++-------------------------
>arch/powerpc/platforms/powernv/pci.h | 11 ++--
>arch/powerpc/platforms/pseries/eeh_pseries.c | 3 +-
>9 files changed, 358 insertions(+), 217 deletions(-)
>
>Thanks,
>Gavin
>
Thread overview: 11+ messages
2014-02-25 5:37 [PATCH v2 0/9] EEH improvement Gavin Shan
2014-02-25 5:37 ` [PATCH 1/9] powerpc/eeh: Remove EEH_PE_PHB_DEAD Gavin Shan
2014-02-25 5:37 ` [PATCH 2/9] powerpc/powernv: Remove PNV_EEH_STATE_REMOVED Gavin Shan
2014-02-25 5:37 ` [PATCH 3/9] powerpc/powernv: Move PNV_EEH_STATE_ENABLED around Gavin Shan
2014-02-25 5:37 ` [PATCH 4/9] powerpc/eeh: Introduce eeh_pe_free() Gavin Shan
2014-02-25 5:37 ` [PATCH 5/9] powerpc/eeh: Introduce eeh_ops->event() Gavin Shan
2014-02-25 5:37 ` [PATCH 6/9] powerpc/powernv: Support eeh_ops->event() Gavin Shan
2014-02-25 5:37 ` [PATCH 7/9] powerpc/powernv: Cache PHB diag-data Gavin Shan
2014-02-25 5:37 ` [PATCH 8/9] powerpc/powernv: Add /proc/powerpc/eeh_inf_err Gavin Shan
2014-02-25 5:37 ` [PATCH 9/9] powerpc/powernv: Refactor PHB diag-data dump Gavin Shan
2014-02-25 7:26 ` [PATCH v2 0/9] EEH improvement Gavin Shan