From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Gavin Shan <shangw@linux.vnet.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: [PATCH 3.13 31/46] powerpc/eeh: Handle multiple EEH errors
Date: Fri, 28 Mar 2014 10:32:15 -0700
Message-ID: <20140328173138.889267879@linuxfoundation.org>
In-Reply-To: <20140328173134.630198216@linuxfoundation.org>

3.13-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Gavin Shan <shangw@linux.vnet.ibm.com>

commit 7e4e7867b1e551b7b8f326da3604c47332972bc6 upstream.

A single PCI-error-related OPAL event may correspond to multiple
EEH errors, for example multiple frozen PEs detected on different
PHBs. Unfortunately, we didn't cover that case. The patch
enumerates the return values of eeh_ops::next_error() and changes
eeh_handle_special_event() and eeh_ops::next_error() to handle all
existing EEH errors.

As Ben pointed out, we needn't use list_for_each_entry_safe() since
we are not deleting any PHB from the hose_list, and the EEH serialized
lock should be held while purging EEH events. The patch covers those
suggestions as well.
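
The control flow described above can be sketched in userspace C. This is
only an illustration under stated assumptions, not the kernel code:
struct mock_phb, mock_next_error() and mock_handle_special_event() are
hypothetical stand-ins for the PHB list, eeh_ops::next_error() and
eeh_handle_special_event(). Only the enum values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Mirrors the enum this patch adds to asm/eeh.h. */
enum {
	EEH_NEXT_ERR_NONE = 0,
	EEH_NEXT_ERR_INF,
	EEH_NEXT_ERR_FROZEN_PE,
	EEH_NEXT_ERR_FENCED_PHB,
	EEH_NEXT_ERR_DEAD_PHB,
	EEH_NEXT_ERR_DEAD_IOC
};

/* Hypothetical stand-in for a PHB: at most one pending error each. */
struct mock_phb {
	int pending;
};

/*
 * Hypothetical stand-in for eeh_ops::next_error(): scan the PHBs,
 * consume informative errors in place, and stop at the first error
 * that needs action from the caller (ret > EEH_NEXT_ERR_INF).
 */
static int mock_next_error(struct mock_phb *phbs, size_t n, size_t *which)
{
	int ret = EEH_NEXT_ERR_NONE;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = phbs[i].pending;
		phbs[i].pending = EEH_NEXT_ERR_NONE;	/* error consumed */

		if (ret > EEH_NEXT_ERR_INF) {
			*which = i;
			break;
		}
		ret = EEH_NEXT_ERR_NONE;	/* nothing actionable here */
	}

	return ret;
}

/*
 * Hypothetical stand-in for eeh_handle_special_event(): keep asking
 * for the next error until none is left, or until a dead IOC makes
 * further polling pointless. Returns the number of errors handled.
 */
static int mock_handle_special_event(struct mock_phb *phbs, size_t n)
{
	size_t which = 0;
	int handled = 0;
	int rc;

	assert(phbs != NULL);

	do {
		rc = mock_next_error(phbs, n, &which);

		switch (rc) {
		case EEH_NEXT_ERR_NONE:
			return handled;
		case EEH_NEXT_ERR_FROZEN_PE:
		case EEH_NEXT_ERR_FENCED_PHB:
		case EEH_NEXT_ERR_DEAD_PHB:
		case EEH_NEXT_ERR_DEAD_IOC:
			printf("handling error %d from mock PHB %zu\n",
			       rc, which);
			handled++;
			break;
		default:
			fprintf(stderr, "invalid value %d\n", rc);
			return handled;
		}

		/* After a dead IOC, every PHB is gone: stop polling. */
		if (rc == EEH_NEXT_ERR_DEAD_IOC)
			break;
	} while (rc != EEH_NEXT_ERR_NONE);

	return handled;
}
```

Calling mock_handle_special_event() on an array whose entries hold a
frozen PE, an informative error and a fenced PHB handles the two
actionable errors in one invocation, whereas the old single-shot logic
would have stopped after the first.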

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/powerpc/include/asm/eeh.h            |   10 ++
 arch/powerpc/kernel/eeh_driver.c          |  150 +++++++++++++++---------------
 arch/powerpc/platforms/powernv/eeh-ioda.c |   39 ++++---
 3 files changed, 112 insertions(+), 87 deletions(-)

--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -117,6 +117,16 @@ static inline struct pci_dev *eeh_dev_to
 	return edev ? edev->pdev : NULL;
 }
 
+/* Return values from eeh_ops::next_error */
+enum {
+	EEH_NEXT_ERR_NONE = 0,
+	EEH_NEXT_ERR_INF,
+	EEH_NEXT_ERR_FROZEN_PE,
+	EEH_NEXT_ERR_FENCED_PHB,
+	EEH_NEXT_ERR_DEAD_PHB,
+	EEH_NEXT_ERR_DEAD_IOC
+};
+
 /*
  * The struct is used to trace the registered EEH operation
  * callback functions. Actually, those operation callback
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -626,84 +626,90 @@ static void eeh_handle_special_event(voi
 {
 	struct eeh_pe *pe, *phb_pe;
 	struct pci_bus *bus;
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	unsigned long flags;
-	int rc = 0;
+	int rc;
 
-	/*
-	 * The return value from next_error() has been classified as follows.
-	 * It might be good to enumerate them. However, next_error() is only
-	 * supported by PowerNV platform for now. So it would be fine to use
-	 * integer directly:
-	 *
-	 * 4 - Dead IOC           3 - Dead PHB
-	 * 2 - Fenced PHB         1 - Frozen PE
-	 * 0 - No error found
-	 *
-	 */
-	rc = eeh_ops->next_error(&pe);
-	if (rc <= 0)
-		return;
-
-	switch (rc) {
-	case 4:
-		/* Mark all PHBs in dead state */
-		eeh_serialize_lock(&flags);
-		list_for_each_entry_safe(hose, tmp,
-				&hose_list, list_node) {
-			phb_pe = eeh_phb_pe_get(hose);
-			if (!phb_pe) continue;
 
-			eeh_pe_state_mark(phb_pe,
-				EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+	do {
+		rc = eeh_ops->next_error(&pe);
+
+		switch (rc) {
+		case EEH_NEXT_ERR_DEAD_IOC:
+			/* Mark all PHBs in dead state */
+			eeh_serialize_lock(&flags);
+
+			/* Purge all events */
+			eeh_remove_event(NULL);
+
+			list_for_each_entry(hose, &hose_list, list_node) {
+				phb_pe = eeh_phb_pe_get(hose);
+				if (!phb_pe) continue;
+
+				eeh_pe_state_mark(phb_pe,
+					EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+			}
+
+			eeh_serialize_unlock(flags);
+
+			break;
+		case EEH_NEXT_ERR_FROZEN_PE:
+		case EEH_NEXT_ERR_FENCED_PHB:
+		case EEH_NEXT_ERR_DEAD_PHB:
+			/* Mark the PE in fenced state */
+			eeh_serialize_lock(&flags);
+
+			/* Purge all events of the PHB */
+			eeh_remove_event(pe);
+
+			if (rc == EEH_NEXT_ERR_DEAD_PHB)
+				eeh_pe_state_mark(pe,
+					EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+			else
+				eeh_pe_state_mark(pe,
+					EEH_PE_ISOLATED | EEH_PE_RECOVERING);
+
+			eeh_serialize_unlock(flags);
+
+			break;
+		case EEH_NEXT_ERR_NONE:
+			return;
+		default:
+			pr_warn("%s: Invalid value %d from next_error()\n",
+				__func__, rc);
+			return;
 		}
-		eeh_serialize_unlock(flags);
 
-		/* Purge all events */
-		eeh_remove_event(NULL);
-		break;
-	case 3:
-	case 2:
-	case 1:
-		/* Mark the PE in fenced state */
-		eeh_serialize_lock(&flags);
-		if (rc == 3)
-			eeh_pe_state_mark(pe,
-				EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
-		else
-			eeh_pe_state_mark(pe,
-				EEH_PE_ISOLATED | EEH_PE_RECOVERING);
-		eeh_serialize_unlock(flags);
-
-		/* Purge all events of the PHB */
-		eeh_remove_event(pe);
-		break;
-	default:
-		pr_err("%s: Invalid value %d from next_error()\n",
-		       __func__, rc);
-		return;
-	}
-
-	/*
-	 * For fenced PHB and frozen PE, it's handled as normal
-	 * event. We have to remove the affected PHBs for dead
-	 * PHB and IOC
-	 */
-	if (rc == 2 || rc == 1)
-		eeh_handle_normal_event(pe);
-	else {
-		list_for_each_entry_safe(hose, tmp,
-			&hose_list, list_node) {
-			phb_pe = eeh_phb_pe_get(hose);
-			if (!phb_pe || !(phb_pe->state & EEH_PE_PHB_DEAD))
-				continue;
-
-			bus = eeh_pe_bus_get(phb_pe);
-			/* Notify all devices that they're about to go down. */
-			eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
-			pcibios_remove_pci_devices(bus);
+		/*
+		 * For fenced PHB and frozen PE, it's handled as normal
+		 * event. We have to remove the affected PHBs for dead
+		 * PHB and IOC
+		 */
+		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
+		    rc == EEH_NEXT_ERR_FENCED_PHB) {
+			eeh_handle_normal_event(pe);
+		} else {
+			list_for_each_entry(hose, &hose_list, list_node) {
+				phb_pe = eeh_phb_pe_get(hose);
+				if (!phb_pe ||
+				    !(phb_pe->state & EEH_PE_PHB_DEAD))
+					continue;
+
+				/* Notify all devices to be down */
+				bus = eeh_pe_bus_get(phb_pe);
+				eeh_pe_dev_traverse(pe,
+					eeh_report_failure, NULL);
+				pcibios_remove_pci_devices(bus);
+			}
 		}
-	}
+
+		/*
+		 * If we have detected dead IOC, we needn't proceed
+		 * any more since all PHBs would have been removed
+		 */
+		if (rc == EEH_NEXT_ERR_DEAD_IOC)
+			break;
+	} while (rc != EEH_NEXT_ERR_NONE);
 }
 
 /**
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -718,12 +718,12 @@ static int ioda_eeh_get_pe(struct pci_co
  */
 static int ioda_eeh_next_error(struct eeh_pe **pe)
 {
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	struct pnv_phb *phb;
 	u64 frozen_pe_no;
 	u16 err_type, severity;
 	long rc;
-	int ret = 1;
+	int ret = EEH_NEXT_ERR_NONE;
 
 	/*
 	 * While running here, it's safe to purge the event queue.
@@ -733,7 +733,7 @@ static int ioda_eeh_next_error(struct ee
 	eeh_remove_event(NULL);
 	opal_notifier_update_evt(OPAL_EVENT_PCI_ERROR, 0x0ul);
 
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+	list_for_each_entry(hose, &hose_list, list_node) {
 		/*
 		 * If the subordinate PCI buses of the PHB has been
 		 * removed, we needn't take care of it any more.
@@ -772,19 +772,19 @@ static int ioda_eeh_next_error(struct ee
 		switch (err_type) {
 		case OPAL_EEH_IOC_ERROR:
 			if (severity == OPAL_EEH_SEV_IOC_DEAD) {
-				list_for_each_entry_safe(hose, tmp,
-						&hose_list, list_node) {
+				list_for_each_entry(hose, &hose_list,
+						    list_node) {
 					phb = hose->private_data;
 					phb->eeh_state |= PNV_EEH_STATE_REMOVED;
 				}
 
 				pr_err("EEH: dead IOC detected\n");
-				ret = 4;
-				goto out;
+				ret = EEH_NEXT_ERR_DEAD_IOC;
 			} else if (severity == OPAL_EEH_SEV_INF) {
 				pr_info("EEH: IOC informative error "
 					"detected\n");
 				ioda_eeh_hub_diag(hose);
+				ret = EEH_NEXT_ERR_NONE;
 			}
 
 			break;
@@ -796,21 +796,20 @@ static int ioda_eeh_next_error(struct ee
 				pr_err("EEH: dead PHB#%x detected\n",
 					hose->global_number);
 				phb->eeh_state |= PNV_EEH_STATE_REMOVED;
-				ret = 3;
-				goto out;
+				ret = EEH_NEXT_ERR_DEAD_PHB;
 			} else if (severity == OPAL_EEH_SEV_PHB_FENCED) {
 				if (ioda_eeh_get_phb_pe(hose, pe))
 					break;
 
 				pr_err("EEH: fenced PHB#%x detected\n",
 					hose->global_number);
-				ret = 2;
-				goto out;
+				ret = EEH_NEXT_ERR_FENCED_PHB;
 			} else if (severity == OPAL_EEH_SEV_INF) {
 				pr_info("EEH: PHB#%x informative error "
 					"detected\n",
 					hose->global_number);
 				ioda_eeh_phb_diag(hose);
+				ret = EEH_NEXT_ERR_NONE;
 			}
 
 			break;
@@ -820,13 +819,23 @@ static int ioda_eeh_next_error(struct ee
 
 			pr_err("EEH: Frozen PE#%x on PHB#%x detected\n",
 				(*pe)->addr, (*pe)->phb->global_number);
-			ret = 1;
-			goto out;
+			ret = EEH_NEXT_ERR_FROZEN_PE;
+			break;
+		default:
+			pr_warn("%s: Unexpected error type %d\n",
+				__func__, err_type);
 		}
+
+		/*
+		 * If we have no errors on the specific PHB or only
+		 * informative error there, we continue poking it.
+		 * Otherwise, we need actions to be taken by upper
+		 * layer.
+		 */
+		if (ret > EEH_NEXT_ERR_INF)
+			break;
 	}
 
-	ret = 0;
-out:
 	return ret;
 }
 


