[RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists

linuxppc-dev.lists.ozlabs.org archive mirror

* [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists
@ 2016-10-14  9:38 Vaibhav Jain
  2016-10-21 22:02 ` [RESEND, " Michael Ellerman
  2016-11-04  6:27 ` [RESEND] [PATCH " Andrew Donnellan
  0 siblings, 2 replies; 7+ messages in thread
From: Vaibhav Jain @ 2016-10-14  9:38 UTC (permalink / raw)
  To: linuxppc-dev, Frederic Barrat, Michael Ellerman
  Cc: Vaibhav Jain, Ian Munsie, Andrew Donnellan, Christophe Lombard,
	Philippe Bergheaud, gkurz, stable

This patch prevents resetting the cxl adapter via sysfs while one or
more cxl_contexts are active on it. This protects against an
unrecoverable error caused by the PSL still owning a dirty cache line
after the reset when the host then tries to touch the same cache line.
If a forced reset of the card is required irrespective of any active
contexts, the int value -1 can be written to the 'reset' sysfs attribute
of the card.

The patch introduces a new atomic_t member named contexts_num inside
struct cxl that holds the number of active contexts attached to the
card, which is checked against '0' before proceeding with the reset. To
guard against a race where a context is activated just after the reset
check is performed, contexts_num is atomically set to '-1' as part of
that check, indicating that no more contexts can be activated on the
card.

Before activating a context we atomically test whether contexts_num is
non-negative and, if so, increment its value by one. A negative value of
contexts_num indicates that the card is about to be reset, and context
activation fails at that point.
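
As a rough userspace illustration of the counter protocol described
above (this is not the kernel code: it uses C11 <stdatomic.h> instead of
atomic_t, and every model_* name is made up for this sketch):

```c
#include <errno.h>
#include <stdatomic.h>

/* Userspace stand-in for the contexts_num member of struct cxl. */
struct model_adapter {
	atomic_int contexts_num;	/* >0: active, 0: idle, -1: locked */
};

/* The adapter starts with the context lock taken. */
static void model_init(struct model_adapter *a)
{
	atomic_init(&a->contexts_num, -1);
}

/* Increment the count unless it is negative; -EBUSY when locked. */
static int model_context_get(struct model_adapter *a)
{
	int old = atomic_load(&a->contexts_num);

	do {
		if (old < 0)
			return -EBUSY;
	} while (!atomic_compare_exchange_weak(&a->contexts_num, &old, old + 1));

	return 0;
}

/* Decrement the count, but only while it is positive. */
static void model_context_put(struct model_adapter *a)
{
	int old = atomic_load(&a->contexts_num);

	while (old > 0 &&
	       !atomic_compare_exchange_weak(&a->contexts_num, &old, old - 1))
		;
}

/* Take the lock (0 -> -1); -EBUSY if any context is active or it is held. */
static int model_context_lock(struct model_adapter *a)
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&a->contexts_num, &expected, -1))
		return 0;
	return -EBUSY;
}

/*
 * Release the lock (-1 -> 0). If the lock was not held, force the count
 * to 0 anyway. Returns the value found: -1 for a matched unlock.
 */
static int model_context_unlock(struct model_adapter *a)
{
	int expected = -1;

	if (!atomic_compare_exchange_strong(&a->contexts_num, &expected, 0))
		atomic_store(&a->contexts_num, 0);
	return expected;
}
```

model_context_unlock() returns the value it found only so that a caller
of the sketch can tell a mismatched unlock (anything other than -1) from
a normal one; the driver itself issues a WARN in that case.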

Cc: stable@vger.kernel.org
Fixes: 62fa19d4 ("cxl: Add ability to reset the card")
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
---
Changelog:

RESEND v3	* Marked the patch for stable and added sign-offs & Fixes tag

v3..v1 ->       * Context-lock is now taken earlier in cxl_start_context to
                  prevent against leaking ctx->pid in error path as pointed
                  out by Frederic Barrat.
                * Fixed tabs that sneaked their way into sysfs-class-cxl
                  changes. Thanks Andrew Donnellan for catching that.

v2..v1 ->       * Addressed following review comments from Frederic Barrat:
                - Spell error changing 'Incase' to 'In case'.
                - Changed the comment description for context_num member
                  to use a slightly more universal notation for integers.
                - Added cleanup code for context irqs in case context lock
                  is taken.
                - Added a new function called cxl_adapter_context_unlock
                  that sets context_num to '0' (forcibly if needed).
                - cxl adapter struct when allocated is initialized with
                  context lock taken and released when the card config is
                  complete.
                - Simplified code flow in function reset_adapter_store.
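
For anyone exercising the new behaviour from userspace, a minimal
sketch of writing to the 'reset' attribute follows. The helper name
example_cxl_write_reset() is hypothetical (it is not part of libcxl or
the driver), and the card name in the path comment is only an assumed
example. With this patch applied, a non-forced write is expected to
fail with EBUSY while contexts are active.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "1" (reset only if no contexts are active) or "-1" (forced
 * reset) to a cxl 'reset' attribute. attr_path is supplied by the
 * caller, e.g. /sys/class/cxl/card0/reset (card0 is an assumption).
 * Returns 0 on success or a negative errno value.
 */
static int example_cxl_write_reset(const char *attr_path, int force)
{
	const char *val = force ? "-1" : "1";
	size_t len = strlen(val);
	ssize_t n;
	int fd, rc = 0;

	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, val, len);
	if (n < 0)
		rc = -errno;		/* e.g. -EBUSY or -EINVAL from the driver */
	else if ((size_t)n != len)
		rc = -EIO;		/* short write, treated as an error here */

	close(fd);
	return rc;
}
```

A caller would typically try the non-forced reset first and fall back
to force = 1 only when losing the active contexts is acceptable.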

---
 Documentation/ABI/testing/sysfs-class-cxl |  7 ++++--
 drivers/misc/cxl/api.c                    |  9 +++++++
 drivers/misc/cxl/context.c                |  3 +++
 drivers/misc/cxl/cxl.h                    | 24 ++++++++++++++++++
 drivers/misc/cxl/file.c                   | 11 ++++++++
 drivers/misc/cxl/guest.c                  |  3 +++
 drivers/misc/cxl/main.c                   | 42 ++++++++++++++++++++++++++++++-
 drivers/misc/cxl/pci.c                    |  2 ++
 drivers/misc/cxl/sysfs.c                  | 27 +++++++++++++++++---
 9 files changed, 121 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 4ba0a2a..640f65e 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -220,8 +220,11 @@ What:           /sys/class/cxl/<card>/reset
 Date:           October 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    write only
-                Writing 1 will issue a PERST to card which may cause the card
-                to reload the FPGA depending on load_image_on_perst.
+                Writing 1 will issue a PERST to card provided there are no
+                contexts active on any one of the card AFUs. This may cause
+                the card to reload the FPGA depending on load_image_on_perst.
+                Writing -1 will do a force PERST irrespective of any active
+                contexts on the card AFUs.
 Users:		https://github.com/ibm-capi/libcxl
 
 What:		/sys/class/cxl/<card>/perst_reloads_same_image (not in a guest)
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index f3d34b9..af23d7d 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -229,6 +229,14 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 	if (ctx->status == STARTED)
 		goto out; /* already started */
 
+	/*
+	 * Increment the mapped context count for adapter. This also checks
+	 * if adapter_context_lock is taken.
+	 */
+	rc = cxl_adapter_context_get(ctx->afu->adapter);
+	if (rc)
+		goto out;
+
 	if (task) {
 		ctx->pid = get_task_pid(task, PIDTYPE_PID);
 		ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID);
@@ -240,6 +248,7 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 
 	if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
 		put_pid(ctx->pid);
+		cxl_adapter_context_put(ctx->afu->adapter);
 		cxl_ctx_put();
 		goto out;
 	}
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index c466ee2..5e506c1 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -238,6 +238,9 @@ int __detach_context(struct cxl_context *ctx)
 	put_pid(ctx->glpid);
 
 	cxl_ctx_put();
+
+	/* Decrease the attached context count on the adapter */
+	cxl_adapter_context_put(ctx->afu->adapter);
 	return 0;
 }
 
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 01d372a..a144073 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -618,6 +618,14 @@ struct cxl {
 	bool perst_select_user;
 	bool perst_same_image;
 	bool psl_timebase_synced;
+
+	/*
+	 * number of contexts mapped on to this card. Possible values are:
+	 * >0: Number of contexts mapped and new one can be mapped.
+	 *  0: No active contexts and new ones can be mapped.
+	 * -1: No contexts mapped and new ones cannot be mapped.
+	 */
+	atomic_t contexts_num;
 };
 
 int cxl_pci_alloc_one_irq(struct cxl *adapter);
@@ -944,4 +952,20 @@ bool cxl_pci_is_vphb_device(struct pci_dev *dev);
 
 /* decode AFU error bits in the PSL register PSL_SERR_An */
 void cxl_afu_decode_psl_serr(struct cxl_afu *afu, u64 serr);
+
+/*
+ * Increments the number of attached contexts on an adapter.
+ * In case an adapter_context_lock is taken then return -EBUSY.
+ */
+int cxl_adapter_context_get(struct cxl *adapter);
+
+/* Decrements the number of attached contexts on an adapter */
+void cxl_adapter_context_put(struct cxl *adapter);
+
+/* If no active contexts then prevents contexts from being attached */
+int cxl_adapter_context_lock(struct cxl *adapter);
+
+/* Unlock the contexts-lock if taken. Warn and force unlock otherwise */
+void cxl_adapter_context_unlock(struct cxl *adapter);
+
 #endif
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 5fb9894..d0b421f 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -205,11 +205,22 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 	ctx->pid = get_task_pid(current, PIDTYPE_PID);
 	ctx->glpid = get_task_pid(current->group_leader, PIDTYPE_PID);
 
+	/*
+	 * Increment the mapped context count for adapter. This also checks
+	 * if adapter_context_lock is taken.
+	 */
+	rc = cxl_adapter_context_get(ctx->afu->adapter);
+	if (rc) {
+		afu_release_irqs(ctx, ctx);
+		goto out;
+	}
+
 	trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);
 
 	if ((rc = cxl_ops->attach_process(ctx, false, work.work_element_descriptor,
 							amr))) {
 		afu_release_irqs(ctx, ctx);
+		cxl_adapter_context_put(ctx->afu->adapter);
 		goto out;
 	}
 
diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 9aa58a7..3e102cd 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -1152,6 +1152,9 @@ struct cxl *cxl_guest_init_adapter(struct device_node *np, struct platform_devic
 	if ((rc = cxl_sysfs_adapter_add(adapter)))
 		goto err_put1;
 
+	/* release the context lock as the adapter is configured */
+	cxl_adapter_context_unlock(adapter);
+
 	return adapter;
 
 err_put1:
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index d9be23b2..62e0dfb 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -243,8 +243,10 @@ struct cxl *cxl_alloc_adapter(void)
 	if (dev_set_name(&adapter->dev, "card%i", adapter->adapter_num))
 		goto err2;
 
-	return adapter;
+	/* start with context lock taken */
+	atomic_set(&adapter->contexts_num, -1);
 
+	return adapter;
 err2:
 	cxl_remove_adapter_nr(adapter);
 err1:
@@ -286,6 +288,44 @@ int cxl_afu_select_best_mode(struct cxl_afu *afu)
 	return 0;
 }
 
+int cxl_adapter_context_get(struct cxl *adapter)
+{
+	int rc;
+
+	rc = atomic_inc_unless_negative(&adapter->contexts_num);
+	return rc ? 0 : -EBUSY;
+}
+
+void cxl_adapter_context_put(struct cxl *adapter)
+{
+	atomic_dec_if_positive(&adapter->contexts_num);
+}
+
+int cxl_adapter_context_lock(struct cxl *adapter)
+{
+	int rc;
+	/* no active contexts -> contexts_num == 0 */
+	rc = atomic_cmpxchg(&adapter->contexts_num, 0, -1);
+	return rc ? -EBUSY : 0;
+}
+
+void cxl_adapter_context_unlock(struct cxl *adapter)
+{
+	int val = atomic_cmpxchg(&adapter->contexts_num, -1, 0);
+
+	/*
+	 * contexts lock taken -> contexts_num == -1
+	 * If not true then show a warning and force reset the lock.
+	 * This will happen when context_unlock was requested without
+	 * doing a context_lock.
+	 */
+	if (val != -1) {
+		atomic_set(&adapter->contexts_num, 0);
+		WARN(1, "Adapter context unlocked with %d active contexts",
+		     val);
+	}
+}
+
 static int __init init_cxl(void)
 {
 	int rc = 0;
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 7afad84..e96be9c 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1487,6 +1487,8 @@ static int cxl_configure_adapter(struct cxl *adapter, struct pci_dev *dev)
 	if ((rc = cxl_native_register_psl_err_irq(adapter)))
 		goto err;
 
+	/* Release the context lock as adapter is configured */
+	cxl_adapter_context_unlock(adapter);
 	return 0;
 
 err:
diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
index b043c20..a8b6d6a 100644
--- a/drivers/misc/cxl/sysfs.c
+++ b/drivers/misc/cxl/sysfs.c
@@ -75,12 +75,31 @@ static ssize_t reset_adapter_store(struct device *device,
 	int val;
 
 	rc = sscanf(buf, "%i", &val);
-	if ((rc != 1) || (val != 1))
+	if ((rc != 1) || (val != 1 && val != -1))
 		return -EINVAL;
 
-	if ((rc = cxl_ops->adapter_reset(adapter)))
-		return rc;
-	return count;
+	/*
+	 * See if we can lock the context mapping that's only allowed
+	 * when there are no contexts attached to the adapter. Once
+	 * taken this will also prevent any context from getting activated.
+	 */
+	if (val == 1) {
+		rc =  cxl_adapter_context_lock(adapter);
+		if (rc)
+			goto out;
+
+		rc = cxl_ops->adapter_reset(adapter);
+		/* In case reset failed release context lock */
+		if (rc)
+			cxl_adapter_context_unlock(adapter);
+
+	} else if (val == -1) {
+		/* Perform a forced adapter reset */
+		rc = cxl_ops->adapter_reset(adapter);
+	}
+
+out:
+	return rc ? rc : count;
 }
 
 static ssize_t load_image_on_perst_show(struct device *device,
-- 
2.7.4


* Re: [RESEND, v3] cxl: Prevent adapter reset if an active context exists
  2016-10-14  9:38 [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists Vaibhav Jain
@ 2016-10-21 22:02 ` Michael Ellerman
  2016-11-04  6:27 ` [RESEND] [PATCH " Andrew Donnellan
  1 sibling, 0 replies; 7+ messages in thread
From: Michael Ellerman @ 2016-10-21 22:02 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Frederic Barrat
  Cc: Philippe Bergheaud, Christophe Lombard, Vaibhav Jain, stable,
	Ian Munsie, Andrew Donnellan, gkurz

On Fri, 2016-10-14 at 09:38:36 UTC, Vaibhav Jain wrote:
> This patch prevents resetting the cxl adapter via sysfs in presence of
> one or more active cxl_context on it. This protects against an
> unrecoverable error caused by PSL owning a dirty cache line even after
> reset and host tries to touch the same cache line. In case a force reset
> of the card is required irrespective of any active contexts, the int
> value -1 can be stored in the 'reset' sysfs attribute of the card.
> 
> The patch introduces a new atomic_t member named contexts_num inside
> struct cxl that holds the number of active context attached to the card
> , which is checked against '0' before proceeding with the reset. To
> prevent against a race condition where a context is activated just after
> reset check is performed, the contexts_num is atomically set to '-1'
> after reset-check to indicate that no more contexts can be activated on
> the card anymore.
> 
> Before activating a context we atomically test if contexts_num is
> non-negative and if so, increment its value by one. In case the value of
> contexts_num is negative then it indicates that the card is about to be
> reset and context activation is error-ed out at that point.
> 
> Cc: stable@vger.kernel.org
> Fixes: 62fa19d4 ("cxl: Add ability to reset the card")
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/70b565bbdb911023373e035225ab10

cheers


* Re: [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists
  2016-10-14  9:38 [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists Vaibhav Jain
  2016-10-21 22:02 ` [RESEND, " Michael Ellerman
@ 2016-11-04  6:27 ` Andrew Donnellan
  2016-11-04 12:07   ` Frederic Barrat
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Donnellan @ 2016-11-04  6:27 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Frederic Barrat, Michael Ellerman
  Cc: Philippe Bergheaud, Christophe Lombard, stable, Ian Munsie, gkurz

On 14/10/16 20:38, Vaibhav Jain wrote:
> This patch prevents resetting the cxl adapter via sysfs in presence of
> one or more active cxl_context on it. This protects against an
> unrecoverable error caused by PSL owning a dirty cache line even after
> reset and host tries to touch the same cache line. In case a force reset
> of the card is required irrespective of any active contexts, the int
> value -1 can be stored in the 'reset' sysfs attribute of the card.
>
> The patch introduces a new atomic_t member named contexts_num inside
> struct cxl that holds the number of active context attached to the card
> , which is checked against '0' before proceeding with the reset. To
> prevent against a race condition where a context is activated just after
> reset check is performed, the contexts_num is atomically set to '-1'
> after reset-check to indicate that no more contexts can be activated on
> the card anymore.
>
> Before activating a context we atomically test if contexts_num is
> non-negative and if so, increment its value by one. In case the value of
> contexts_num is negative then it indicates that the card is about to be
> reset and context activation is error-ed out at that point.
>
> Cc: stable@vger.kernel.org
> Fixes: 62fa19d4 ("cxl: Add ability to reset the card")
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>

When I inject an EEH error, this patch causes the following WARN. Thoughts?


[   55.965011] EEH: PHB#0 failure detected, location: N/A
[   55.965078] CPU: 20 PID: 9933 Comm: lspci Not tainted 4.9.0-rc1-ajd-00006-g6fb17cc #4
[   55.965080] Call Trace:
[   55.965091] [c00000036818fab0] [c000000000950ec8] dump_stack+0xb0/0xf0 (unreliable)
[   55.965100] [c00000036818faf0] [c00000000002eb44] eeh_dev_check_failure+0x1e4/0x540
[   55.965107] [c00000036818fb90] [c000000000064090] pnv_pci_read_config+0xc0/0x130
[   55.965114] [c00000036818fbd0] [c0000000004bec24] pci_user_read_config_dword+0x84/0x160
[   55.965119] [c00000036818fc20] [c0000000004d12f4] pci_read_config+0x164/0x2a0
[   55.965125] [c00000036818fca0] [c000000000318e70] sysfs_kf_bin_read+0x70/0xc0
[   55.965131] [c00000036818fcc0] [c000000000317ff8] kernfs_fop_read+0xd8/0x260
[   55.965136] [c00000036818fd10] [c000000000278b7c] __vfs_read+0x3c/0x180
[   55.965141] [c00000036818fda0] [c000000000279e2c] vfs_read+0xac/0x1a0
[   55.965146] [c00000036818fde0] [c00000000027bc24] SyS_pread64+0xb4/0xd0
[   55.965152] [c00000036818fe30] [c00000000000bd20] system_call+0x38/0xfc
[   55.965171] EEH: Detected error on PHB#0
[   55.965173] EEH: This PCI device has failed 1 times in the last hour
[   55.965174] EEH: Notify device drivers to shutdown
[   55.965182] cxl afu0.0: Deactivating AFU directed mode
[   55.965261] Harmless Hypervisor Maintenance interrupt [Recovered]
[   55.965263]  Error detail: Unknown
[   55.965265]  HMER: 8040000000000000
[   55.965267] Harmless Hypervisor Maintenance interrupt [Recovered]
[   55.965268]  Error detail: Unknown
[   55.965270]  HMER: 8040000000000000
[   55.965326] cxl afu0.0: PSL Purge called with link down, ignoring
[   55.965563] EEH: Collect temporary log
[   55.965565] PHB3 PHB#0 Diag-data (Version: 1)
[   55.965566] brdgCtl:     0000ffff
[   55.965568] UtlSts:      00200000 00000000 00000000
[   55.965570] RootSts:     ffffffff ffffffff ffffffff ffffffff 0000ffff
[   55.965571] RootErrSts:  ffffffff ffffffff ffffffff
[   55.965572] RootErrLog:  ffffffff ffffffff ffffffff ffffffff
[   55.965574] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[   55.965575] nFir:        0000809000000000 0030006e00000000 0000800000000000
[   55.965577] PhbSts:      0000001c00000000 0000001c00000000
[   55.965578] Lem:         0000020000100000 40018e2400022482 0000000000100000
[   55.965582] OutErr:      0000002000000000 0000002000000000 0000000000000000 0000000000000000
[   55.965584] InAErr:      8000000000000000 8000000000000000 0402000000000000 0000000000000000
[   55.965586] PE[  0] A/B: 8000000000000000 8000000000000000
[   55.965587] EEH: Reset without hotplug activity
[   60.592750] EEH: Notify device drivers the completion of reset
[   60.592760] cxl-pci 0000:01:00.0: enabling device (0140 -> 0142)
[   60.593018] pci 0000:01     : [PE# 000] Switching PHB to CXL
[   60.593116] pci 0000:01     : [PE# 000] Switching PHB to CXL
[   60.622727] Adapter context unlocked with 0 active contexts
[   60.622762] ------------[ cut here ]------------
[   60.622771] WARNING: CPU: 12 PID: 627 at ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
[   60.622772] Modules linked in: fuse powernv_rng rng_core leds_powernv powernv_op_panel led_class vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath bnx2x mdio libcrc32c cxl
[   60.622794] CPU: 12 PID: 627 Comm: eehd Not tainted 4.9.0-rc1-ajd-00006-g6fb17cc #4
[   60.622795] task: c0000003be084900 task.stack: c0000003be108000
[   60.622797] NIP: d000000004350be0 LR: d000000004350bdc CTR: c000000000492fd0
[   60.622799] REGS: c0000003be10b660 TRAP: 0700   Not tainted (4.9.0-rc1-ajd-00006-g6fb17cc)
[   60.622800] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
[   60.622810]   CR: 28000282  XER: 20000000
[   60.622811] SOFTE: 1 CFAR: c00000000094fc88
[   60.622814] GPR00: d000000004350bdc c0000003be10b8e0 d000000004379ae8 000000000000002f
[   60.622818] GPR04: 0000000000000001 0000000000000000 00000000000003b8 0000000000000000
[   60.622822] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
[   60.622826] GPR12: 0000000000000000 c00000000fe03000 c0000000000baac8 c0000003c5166500
[   60.622830] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   60.622834] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000b14fe8
[   60.622837] GPR24: c000000000b14fc0 c0000003afc10400 c0000003b0c40000 0000000000000000
[   60.622841] GPR28: c0000003c505a098 0000000000000000 c0000003afc10400 0000000000000006
[   60.622850] NIP [d000000004350be0] cxl_adapter_context_unlock+0x60/0x80 [cxl]
[   60.622856] LR [d000000004350bdc] cxl_adapter_context_unlock+0x5c/0x80 [cxl]
[   60.622857] Call Trace:
[   60.622863] [c0000003be10b8e0] [d000000004350bdc] cxl_adapter_context_unlock+0x5c/0x80 [cxl] (unreliable)
[   60.622871] [c0000003be10b940] [d00000000435e810] cxl_configure_adapter+0x930/0x960 [cxl]
[   60.622879] [c0000003be10b9f0] [d00000000435e88c] cxl_pci_slot_reset+0x4c/0x230 [cxl]
[   60.622883] [c0000003be10baa0] [c000000000032cd4] eeh_report_reset+0x164/0x1a0
[   60.622887] [c0000003be10bae0] [c000000000031220] eeh_pe_dev_traverse+0x90/0x170
[   60.622890] [c0000003be10bb70] [c000000000033354] eeh_handle_normal_event+0x3d4/0x520
[   60.622892] [c0000003be10bc20] [c000000000033624] eeh_handle_event+0x44/0x360
[   60.622895] [c0000003be10bcd0] [c000000000033a58] eeh_event_handler+0x118/0x1d0
[   60.622898] [c0000003be10bd80] [c0000000000babc8] kthread+0x108/0x130
[   60.622902] [c0000003be10be30] [c00000000000c0a0] ret_from_kernel_thread+0x5c/0xbc
[   60.622903] Instruction dump:
[   60.622905] 2f84ffff 4dfe0020 7c0802a6 7c8407b4 39200000 f8010010 f821ffa1 91230348
[   60.622911] 3c620000 e8638070 48016639 e8410018 <0fe00000> 38210060 e8010010 7c0803a6
[   60.622918] ---[ end trace d358551c9a007b4f ]---
[   60.622959] cxl afu0.0: Activating AFU directed mode
[   60.623097] EEH: Notify device driver to resume


-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited


* Re: [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists
  2016-11-04  6:27 ` [RESEND] [PATCH " Andrew Donnellan
@ 2016-11-04 12:07   ` Frederic Barrat
  2016-11-04 13:15     ` Uma Krishnan
  2016-11-07  0:29     ` Andrew Donnellan
  0 siblings, 2 replies; 7+ messages in thread
From: Frederic Barrat @ 2016-11-04 12:07 UTC (permalink / raw)
  To: Andrew Donnellan, Vaibhav Jain, linuxppc-dev, Michael Ellerman
  Cc: Philippe Bergheaud, Christophe Lombard, stable, Ian Munsie, gkurz

Hi Andrew,

On 04/11/2016 at 07:27, Andrew Donnellan wrote:
> On 14/10/16 20:38, Vaibhav Jain wrote:
>> This patch prevents resetting the cxl adapter via sysfs in presence of
>> one or more active cxl_context on it. This protects against an
>> unrecoverable error caused by PSL owning a dirty cache line even after
>> reset and host tries to touch the same cache line. In case a force reset
>> of the card is required irrespective of any active contexts, the int
>> value -1 can be stored in the 'reset' sysfs attribute of the card.
>>
>> The patch introduces a new atomic_t member named contexts_num inside
>> struct cxl that holds the number of active context attached to the card
>> , which is checked against '0' before proceeding with the reset. To
>> prevent against a race condition where a context is activated just after
>> reset check is performed, the contexts_num is atomically set to '-1'
>> after reset-check to indicate that no more contexts can be activated on
>> the card anymore.
>>
>> Before activating a context we atomically test if contexts_num is
>> non-negative and if so, increment its value by one. In case the value of
>> contexts_num is negative then it indicates that the card is about to be
>> reset and context activation is error-ed out at that point.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 62fa19d4 ("cxl: Add ability to reset the card")
>> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
>> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
>
> When I inject an EEH error, this patch causes the following WARN. Thoughts?

mmm, hard to see a relation with that patch. I couldn't reproduce 
either. Could it bear any relation with the patch you're working on 
(lspci called while the capi device is unconfigured)?

   Fred


>
>
> [   55.965011] EEH: PHB#0 failure detected, location: N/A
> [   55.965078] CPU: 20 PID: 9933 Comm: lspci Not tainted
> 4.9.0-rc1-ajd-00006-g6fb17cc #4
> [   55.965080] Call Trace:
> [   55.965091] [c00000036818fab0] [c000000000950ec8]
> dump_stack+0xb0/0xf0 (unreliable)
> [   55.965100] [c00000036818faf0] [c00000000002eb44]
> eeh_dev_check_failure+0x1e4/0x540
> [   55.965107] [c00000036818fb90] [c000000000064090]
> pnv_pci_read_config+0xc0/0x130
> [   55.965114] [c00000036818fbd0] [c0000000004bec24]
> pci_user_read_config_dword+0x84/0x160
> [   55.965119] [c00000036818fc20] [c0000000004d12f4]
> pci_read_config+0x164/0x2a0
> [   55.965125] [c00000036818fca0] [c000000000318e70]
> sysfs_kf_bin_read+0x70/0xc0
> [   55.965131] [c00000036818fcc0] [c000000000317ff8]
> kernfs_fop_read+0xd8/0x260
> [   55.965136] [c00000036818fd10] [c000000000278b7c] __vfs_read+0x3c/0x180
> [   55.965141] [c00000036818fda0] [c000000000279e2c] vfs_read+0xac/0x1a0
> [   55.965146] [c00000036818fde0] [c00000000027bc24] SyS_pread64+0xb4/0xd0
> [   55.965152] [c00000036818fe30] [c00000000000bd20] system_call+0x38/0xfc
> [   55.965171] EEH: Detected error on PHB#0
> [   55.965173] EEH: This PCI device has failed 1 times in the last hour
> [   55.965174] EEH: Notify device drivers to shutdown
> [   55.965182] cxl afu0.0: Deactivating AFU directed mode
> [   55.965261] Harmless Hypervisor Maintenance interrupt [Recovered]
> [   55.965263]  Error detail: Unknown
> [   55.965265]  HMER: 8040000000000000
> [   55.965267] Harmless Hypervisor Maintenance interrupt [Recovered]
> [   55.965268]  Error detail: Unknown
> [   55.965270]  HMER: 8040000000000000
> [   55.965326] cxl afu0.0: PSL Purge called with link down, ignoring
> [   55.965563] EEH: Collect temporary log
> [   55.965565] PHB3 PHB#0 Diag-data (Version: 1)
> [   55.965566] brdgCtl:     0000ffff
> [   55.965568] UtlSts:      00200000 00000000 00000000
> [   55.965570] RootSts:     ffffffff ffffffff ffffffff ffffffff 0000ffff
> [   55.965571] RootErrSts:  ffffffff ffffffff ffffffff
> [   55.965572] RootErrLog:  ffffffff ffffffff ffffffff ffffffff
> [   55.965574] RootErrLog1: ffffffff 0000000000000000 0000000000000000
> [   55.965575] nFir:        0000809000000000 0030006e00000000
> 0000800000000000
> [   55.965577] PhbSts:      0000001c00000000 0000001c00000000
> [   55.965578] Lem:         0000020000100000 40018e2400022482
> 0000000000100000
> [   55.965582] OutErr:      0000002000000000 0000002000000000
> 0000000000000000 0000000000000000
> [   55.965584] InAErr:      8000000000000000 8000000000000000
> 0402000000000000 0000000000000000
> [   55.965586] PE[  0] A/B: 8000000000000000 8000000000000000
> [   55.965587] EEH: Reset without hotplug activity
> [   60.592750] EEH: Notify device drivers the completion of reset
> [   60.592760] cxl-pci 0000:01:00.0: enabling device (0140 -> 0142)
> [   60.593018] pci 0000:01     : [PE# 000] Switching PHB to CXL
> [   60.593116] pci 0000:01     : [PE# 000] Switching PHB to CXL
> [   60.622727] Adapter context unlocked with 0 active contexts
> [   60.622762] ------------[ cut here ]------------
> [   60.622771] WARNING: CPU: 12 PID: 627 at
> ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
> [   60.622772] Modules linked in: fuse powernv_rng rng_core leds_powernv
> powernv_op_panel led_class vmx_crypto ib_iser rdma_cm iw_cm ib_cm
> ib_core libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
> multipath bnx2x mdio libcrc32c cxl
> [   60.622794] CPU: 12 PID: 627 Comm: eehd Not tainted
> 4.9.0-rc1-ajd-00006-g6fb17cc #4
> [   60.622795] task: c0000003be084900 task.stack: c0000003be108000
> [   60.622797] NIP: d000000004350be0 LR: d000000004350bdc CTR:
> c000000000492fd0
> [   60.622799] REGS: c0000003be10b660 TRAP: 0700   Not tainted
> (4.9.0-rc1-ajd-00006-g6fb17cc)
> [   60.622800] MSR: 900000010282b033
> <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
> [   60.622810]   CR: 28000282  XER: 20000000
> [   60.622811] SOFTE: 1 CFAR: c00000000094fc88
> [   60.622814] GPR00: d000000004350bdc c0000003be10b8e0 d000000004379ae8
> 000000000000002f
> [   60.622818] GPR04: 0000000000000001 0000000000000000 00000000000003b8
> 0000000000000000
> [   60.622822] GPR08: 0000000000000000 0000000000000000 0000000000000000
> 0000000000000001
> [   60.622826] GPR12: 0000000000000000 c00000000fe03000 c0000000000baac8
> c0000003c5166500
> [   60.622830] GPR16: 0000000000000000 0000000000000000 0000000000000000
> 0000000000000000
> [   60.622834] GPR20: 0000000000000000 0000000000000000 0000000000000000
> c000000000b14fe8
> [   60.622837] GPR24: c000000000b14fc0 c0000003afc10400 c0000003b0c40000
> 0000000000000000
> [   60.622841] GPR28: c0000003c505a098 0000000000000000 c0000003afc10400
> 0000000000000006
> [   60.622850] NIP [d000000004350be0]
> cxl_adapter_context_unlock+0x60/0x80 [cxl]
> [   60.622856] LR [d000000004350bdc]
> cxl_adapter_context_unlock+0x5c/0x80 [cxl]
> [   60.622857] Call Trace:
> [   60.622863] [c0000003be10b8e0] [d000000004350bdc]
> cxl_adapter_context_unlock+0x5c/0x80 [cxl] (unreliable)
> [   60.622871] [c0000003be10b940] [d00000000435e810]
> cxl_configure_adapter+0x930/0x960 [cxl]
> [   60.622879] [c0000003be10b9f0] [d00000000435e88c]
> cxl_pci_slot_reset+0x4c/0x230 [cxl]
> [   60.622883] [c0000003be10baa0] [c000000000032cd4]
> eeh_report_reset+0x164/0x1a0
> [   60.622887] [c0000003be10bae0] [c000000000031220]
> eeh_pe_dev_traverse+0x90/0x170
> [   60.622890] [c0000003be10bb70] [c000000000033354]
> eeh_handle_normal_event+0x3d4/0x520
> [   60.622892] [c0000003be10bc20] [c000000000033624]
> eeh_handle_event+0x44/0x360
> [   60.622895] [c0000003be10bcd0] [c000000000033a58]
> eeh_event_handler+0x118/0x1d0
> [   60.622898] [c0000003be10bd80] [c0000000000babc8] kthread+0x108/0x130
> [   60.622902] [c0000003be10be30] [c00000000000c0a0]
> ret_from_kernel_thread+0x5c/0xbc
> [   60.622903] Instruction dump:
> [   60.622905] 2f84ffff 4dfe0020 7c0802a6 7c8407b4 39200000 f8010010
> f821ffa1 91230348
> [   60.622911] 3c620000 e8638070 48016639 e8410018 <0fe00000> 38210060
> e8010010 7c0803a6
> [   60.622918] ---[ end trace d358551c9a007b4f ]---
> [   60.622959] cxl afu0.0: Activating AFU directed mode
> [   60.623097] EEH: Notify device driver to resume
>
>


* Re: [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists
  2016-11-04 12:07   ` Frederic Barrat
@ 2016-11-04 13:15     ` Uma Krishnan
  2016-11-07  0:28       ` Andrew Donnellan
  2016-11-07  0:29     ` Andrew Donnellan
  1 sibling, 1 reply; 7+ messages in thread
From: Uma Krishnan @ 2016-11-04 13:15 UTC (permalink / raw)
  To: Frederic Barrat, Andrew Donnellan, Vaibhav Jain, linuxppc-dev,
	Michael Ellerman
  Cc: Philippe Bergheaud, gkurz, Christophe Lombard, stable, Ian Munsie

Frederic/Andrew,

Just recently this issue was reported by system test without either
of the two patches you suspect: neither this patch nor the lspci patch.
I was hoping the lspci patch from Andrew could possibly solve it.
The system test CQ is SW370625. The stack reported there is the same:

[ 5895.245959] EEH: PHB#2 failure detected, location: N/A
[ 5895.246078] CPU: 19 PID: 121774 Comm: lspci Not tainted
3.10.0-514.el7.ppc64le #1
[ 5895.246240] Call Trace:
[ 5895.246307] [c0000009f3707a60] [c000000000017ce0]
show_stack+0x80/0x330 (unreliable)
[ 5895.246501] [c0000009f3707b10] [c0000000009b22f4]
dump_stack+0x30/0x44
[ 5895.246665] [c0000009f3707b30] [c00000000003b9ac]
eeh_dev_check_failure+0x21c/0x580
[ 5895.246855] [c0000009f3707bd0] [c0000000000879dc]
pnv_pci_read_config+0xbc/0x160
[ 5895.247045] [c0000009f3707c10] [c000000000527d54]
pci_user_read_config_dword+0x84/0x160
[ 5895.247233] [c0000009f3707c60] [c000000000547224]
pci_read_config+0xf4/0x2e0
[ 5895.247398] [c0000009f3707ce0] [c0000000003efb3c] read+0x10c/0x2a0
[ 5895.247561] [c0000009f3707da0] [c00000000031d160]
vfs_read+0x110/0x290
[ 5895.247726] [c0000009f3707de0] [c00000000031ec70]
SyS_pread64+0xb0/0xd0

Uma Krishnan


On 11/4/2016 7:07 AM, Frederic Barrat wrote:
> Hi Andrew,
>
> On 04/11/2016 at 07:27, Andrew Donnellan wrote:
>> On 14/10/16 20:38, Vaibhav Jain wrote:
>>> This patch prevents resetting the cxl adapter via sysfs in presence of
>>> one or more active cxl_context on it. This protects against an
>>> unrecoverable error caused by PSL owning a dirty cache line even after
>>> reset and host tries to touch the same cache line. In case a force reset
>>> of the card is required irrespective of any active contexts, the int
>>> value -1 can be stored in the 'reset' sysfs attribute of the card.
>>>
>>> The patch introduces a new atomic_t member named contexts_num inside
>>> struct cxl that holds the number of active contexts attached to the
>>> card, which is checked against '0' before proceeding with the reset. To
>>> protect against a race condition where a context is activated just after
>>> the reset check is performed, contexts_num is atomically set to '-1'
>>> after the reset check to indicate that no more contexts can be activated
>>> on the card.
>>>
>>> Before activating a context we atomically test if contexts_num is
>>> non-negative and if so, increment its value by one. If the value of
>>> contexts_num is negative, the card is about to be reset and context
>>> activation fails with an error at that point.
>>>
>>> Cc: stable@vger.kernel.org
>>> Fixes: 62fa19d4 ("cxl: Add ability to reset the card")
>>> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
>>> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>>> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
>>
>> When I inject an EEH error, this patch causes the following WARN.
>> Thoughts?
>
> mmm, hard to see a relation with that patch. I couldn't reproduce
> either. Could it bear any relation with the patch you're working on
> (lspci called while the capi device is unconfigured)?
>
>   Fred
>
>
>>
>>
>> [   55.965011] EEH: PHB#0 failure detected, location: N/A
>> [   55.965078] CPU: 20 PID: 9933 Comm: lspci Not tainted
>> 4.9.0-rc1-ajd-00006-g6fb17cc #4
>> [   55.965080] Call Trace:
>> [   55.965091] [c00000036818fab0] [c000000000950ec8]
>> dump_stack+0xb0/0xf0 (unreliable)
>> [   55.965100] [c00000036818faf0] [c00000000002eb44]
>> eeh_dev_check_failure+0x1e4/0x540
>> [   55.965107] [c00000036818fb90] [c000000000064090]
>> pnv_pci_read_config+0xc0/0x130
>> [   55.965114] [c00000036818fbd0] [c0000000004bec24]
>> pci_user_read_config_dword+0x84/0x160
>> [   55.965119] [c00000036818fc20] [c0000000004d12f4]
>> pci_read_config+0x164/0x2a0
>> [   55.965125] [c00000036818fca0] [c000000000318e70]
>> sysfs_kf_bin_read+0x70/0xc0
>> [   55.965131] [c00000036818fcc0] [c000000000317ff8]
>> kernfs_fop_read+0xd8/0x260
>> [   55.965136] [c00000036818fd10] [c000000000278b7c]
>> __vfs_read+0x3c/0x180
>> [   55.965141] [c00000036818fda0] [c000000000279e2c] vfs_read+0xac/0x1a0
>> [   55.965146] [c00000036818fde0] [c00000000027bc24]
>> SyS_pread64+0xb4/0xd0
>> [   55.965152] [c00000036818fe30] [c00000000000bd20]
>> system_call+0x38/0xfc
>> [   55.965171] EEH: Detected error on PHB#0
>> [   55.965173] EEH: This PCI device has failed 1 times in the last hour
>> [   55.965174] EEH: Notify device drivers to shutdown
>> [   55.965182] cxl afu0.0: Deactivating AFU directed mode
>> [   55.965261] Harmless Hypervisor Maintenance interrupt [Recovered]
>> [   55.965263]  Error detail: Unknown
>> [   55.965265]  HMER: 8040000000000000
>> [   55.965267] Harmless Hypervisor Maintenance interrupt [Recovered]
>> [   55.965268]  Error detail: Unknown
>> [   55.965270]  HMER: 8040000000000000
>> [   55.965326] cxl afu0.0: PSL Purge called with link down, ignoring
>> [   55.965563] EEH: Collect temporary log
>> [   55.965565] PHB3 PHB#0 Diag-data (Version: 1)
>> [   55.965566] brdgCtl:     0000ffff
>> [   55.965568] UtlSts:      00200000 00000000 00000000
>> [   55.965570] RootSts:     ffffffff ffffffff ffffffff ffffffff 0000ffff
>> [   55.965571] RootErrSts:  ffffffff ffffffff ffffffff
>> [   55.965572] RootErrLog:  ffffffff ffffffff ffffffff ffffffff
>> [   55.965574] RootErrLog1: ffffffff 0000000000000000 0000000000000000
>> [   55.965575] nFir:        0000809000000000 0030006e00000000
>> 0000800000000000
>> [   55.965577] PhbSts:      0000001c00000000 0000001c00000000
>> [   55.965578] Lem:         0000020000100000 40018e2400022482
>> 0000000000100000
>> [   55.965582] OutErr:      0000002000000000 0000002000000000
>> 0000000000000000 0000000000000000
>> [   55.965584] InAErr:      8000000000000000 8000000000000000
>> 0402000000000000 0000000000000000
>> [   55.965586] PE[  0] A/B: 8000000000000000 8000000000000000
>> [   55.965587] EEH: Reset without hotplug activity
>> [   60.592750] EEH: Notify device drivers the completion of reset
>> [   60.592760] cxl-pci 0000:01:00.0: enabling device (0140 -> 0142)
>> [   60.593018] pci 0000:01     : [PE# 000] Switching PHB to CXL
>> [   60.593116] pci 0000:01     : [PE# 000] Switching PHB to CXL
>> [   60.622727] Adapter context unlocked with 0 active contexts
>> [   60.622762] ------------[ cut here ]------------
>> [   60.622771] WARNING: CPU: 12 PID: 627 at
>> ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
>> [   60.622772] Modules linked in: fuse powernv_rng rng_core leds_powernv
>> powernv_op_panel led_class vmx_crypto ib_iser rdma_cm iw_cm ib_cm
>> ib_core libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456
>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
>> multipath bnx2x mdio libcrc32c cxl
>> [   60.622794] CPU: 12 PID: 627 Comm: eehd Not tainted
>> 4.9.0-rc1-ajd-00006-g6fb17cc #4
>> [   60.622795] task: c0000003be084900 task.stack: c0000003be108000
>> [   60.622797] NIP: d000000004350be0 LR: d000000004350bdc CTR:
>> c000000000492fd0
>> [   60.622799] REGS: c0000003be10b660 TRAP: 0700   Not tainted
>> (4.9.0-rc1-ajd-00006-g6fb17cc)
>> [   60.622800] MSR: 900000010282b033
>> <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
>> [   60.622810]   CR: 28000282  XER: 20000000
>> [   60.622811] SOFTE: 1 CFAR: c00000000094fc88
>> [   60.622814] GPR00: d000000004350bdc c0000003be10b8e0 d000000004379ae8
>> 000000000000002f
>> [   60.622818] GPR04: 0000000000000001 0000000000000000 00000000000003b8
>> 0000000000000000
>> [   60.622822] GPR08: 0000000000000000 0000000000000000 0000000000000000
>> 0000000000000001
>> [   60.622826] GPR12: 0000000000000000 c00000000fe03000 c0000000000baac8
>> c0000003c5166500
>> [   60.622830] GPR16: 0000000000000000 0000000000000000 0000000000000000
>> 0000000000000000
>> [   60.622834] GPR20: 0000000000000000 0000000000000000 0000000000000000
>> c000000000b14fe8
>> [   60.622837] GPR24: c000000000b14fc0 c0000003afc10400 c0000003b0c40000
>> 0000000000000000
>> [   60.622841] GPR28: c0000003c505a098 0000000000000000 c0000003afc10400
>> 0000000000000006
>> [   60.622850] NIP [d000000004350be0]
>> cxl_adapter_context_unlock+0x60/0x80 [cxl]
>> [   60.622856] LR [d000000004350bdc]
>> cxl_adapter_context_unlock+0x5c/0x80 [cxl]
>> [   60.622857] Call Trace:
>> [   60.622863] [c0000003be10b8e0] [d000000004350bdc]
>> cxl_adapter_context_unlock+0x5c/0x80 [cxl] (unreliable)
>> [   60.622871] [c0000003be10b940] [d00000000435e810]
>> cxl_configure_adapter+0x930/0x960 [cxl]
>> [   60.622879] [c0000003be10b9f0] [d00000000435e88c]
>> cxl_pci_slot_reset+0x4c/0x230 [cxl]
>> [   60.622883] [c0000003be10baa0] [c000000000032cd4]
>> eeh_report_reset+0x164/0x1a0
>> [   60.622887] [c0000003be10bae0] [c000000000031220]
>> eeh_pe_dev_traverse+0x90/0x170
>> [   60.622890] [c0000003be10bb70] [c000000000033354]
>> eeh_handle_normal_event+0x3d4/0x520
>> [   60.622892] [c0000003be10bc20] [c000000000033624]
>> eeh_handle_event+0x44/0x360
>> [   60.622895] [c0000003be10bcd0] [c000000000033a58]
>> eeh_event_handler+0x118/0x1d0
>> [   60.622898] [c0000003be10bd80] [c0000000000babc8] kthread+0x108/0x130
>> [   60.622902] [c0000003be10be30] [c00000000000c0a0]
>> ret_from_kernel_thread+0x5c/0xbc
>> [   60.622903] Instruction dump:
>> [   60.622905] 2f84ffff 4dfe0020 7c0802a6 7c8407b4 39200000 f8010010
>> f821ffa1 91230348
>> [   60.622911] 3c620000 e8638070 48016639 e8410018 <0fe00000> 38210060
>> e8010010 7c0803a6
>> [   60.622918] ---[ end trace d358551c9a007b4f ]---
>> [   60.622959] cxl afu0.0: Activating AFU directed mode
>> [   60.623097] EEH: Notify device driver to resume
>>
>>
>


* Re: [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists
  2016-11-04 13:15     ` Uma Krishnan
@ 2016-11-07  0:28       ` Andrew Donnellan
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Donnellan @ 2016-11-07  0:28 UTC (permalink / raw)
  To: Uma Krishnan, Frederic Barrat, Vaibhav Jain, linuxppc-dev,
	Michael Ellerman
  Cc: Philippe Bergheaud, gkurz, Christophe Lombard, Ian Munsie

On 05/11/16 00:15, Uma Krishnan wrote:
> Frederic/Andrew,
>
> This issue was recently reported by system test without either of the
> two patches you suspect: neither this patch nor the lspci patch was
> applied. I was hoping the lspci patch from Andrew might solve it.
> The system test CQ is SW370625. The stack reported there is the same:
>
> [ 5895.245959] EEH: PHB#2 failure detected, location: N/A
> [ 5895.246078] CPU: 19 PID: 121774 Comm: lspci Not tainted
> 3.10.0-514.el7.ppc64le #1
> [ 5895.246240] Call Trace:
> [ 5895.246307] [c0000009f3707a60] [c000000000017ce0]
> show_stack+0x80/0x330 (unreliable)
> [ 5895.246501] [c0000009f3707b10] [c0000000009b22f4]
> dump_stack+0x30/0x44
> [ 5895.246665] [c0000009f3707b30] [c00000000003b9ac]
> eeh_dev_check_failure+0x21c/0x580
> [ 5895.246855] [c0000009f3707bd0] [c0000000000879dc]
> pnv_pci_read_config+0xbc/0x160
> [ 5895.247045] [c0000009f3707c10] [c000000000527d54]
> pci_user_read_config_dword+0x84/0x160
> [ 5895.247233] [c0000009f3707c60] [c000000000547224]
> pci_read_config+0xf4/0x2e0
> [ 5895.247398] [c0000009f3707ce0] [c0000000003efb3c] read+0x10c/0x2a0
> [ 5895.247561] [c0000009f3707da0] [c00000000031d160]
> vfs_read+0x110/0x290
> [ 5895.247726] [c0000009f3707de0] [c00000000031ec70]
> SyS_pread64+0xb0/0xd0

This isn't a WARN: the EEH code prints this stack trace explicitly when
it detects a PHB failure (arch/powerpc/kernel/eeh.c, line 403).


Andrew

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited


* Re: [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists
  2016-11-04 12:07   ` Frederic Barrat
  2016-11-04 13:15     ` Uma Krishnan
@ 2016-11-07  0:29     ` Andrew Donnellan
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Donnellan @ 2016-11-07  0:29 UTC (permalink / raw)
  To: Frederic Barrat, Vaibhav Jain, linuxppc-dev, Michael Ellerman
  Cc: Philippe Bergheaud, Christophe Lombard, stable, Ian Munsie, gkurz

On 04/11/16 23:07, Frederic Barrat wrote:
>> When I inject an EEH error, this patch causes the following WARN.
>> Thoughts?
>
> mmm, hard to see a relation with that patch. I couldn't reproduce
> either. Could it bear any relation with the patch you're working on
> (lspci called while the capi device is unconfigured)?

No, this was without any other patches...

>> [   60.593116] pci 0000:01     : [PE# 000] Switching PHB to CXL
>> [   60.622727] Adapter context unlocked with 0 active contexts
>> [   60.622762] ------------[ cut here ]------------
>> [   60.622771] WARNING: CPU: 12 PID: 627 at
>> ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
>> [   60.622772] Modules linked in: fuse powernv_rng rng_core leds_powernv
>> powernv_op_panel led_class vmx_crypto ib_iser rdma_cm iw_cm ib_cm
>> ib_core libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456
>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
>> multipath bnx2x mdio libcrc32c cxl
>> [   60.622794] CPU: 12 PID: 627 Comm: eehd Not tainted
>> 4.9.0-rc1-ajd-00006-g6fb17cc #4
>> [   60.622795] task: c0000003be084900 task.stack: c0000003be108000
>> [   60.622797] NIP: d000000004350be0 LR: d000000004350bdc CTR:
>> c000000000492fd0
>> [   60.622799] REGS: c0000003be10b660 TRAP: 0700   Not tainted
>> (4.9.0-rc1-ajd-00006-g6fb17cc)
>> [   60.622800] MSR: 900000010282b033
>> <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
>> [   60.622810]   CR: 28000282  XER: 20000000
>> [   60.622811] SOFTE: 1 CFAR: c00000000094fc88
>> [   60.622814] GPR00: d000000004350bdc c0000003be10b8e0 d000000004379ae8
>> 000000000000002f
>> [   60.622818] GPR04: 0000000000000001 0000000000000000 00000000000003b8
>> 0000000000000000
>> [   60.622822] GPR08: 0000000000000000 0000000000000000 0000000000000000
>> 0000000000000001
>> [   60.622826] GPR12: 0000000000000000 c00000000fe03000 c0000000000baac8
>> c0000003c5166500
>> [   60.622830] GPR16: 0000000000000000 0000000000000000 0000000000000000
>> 0000000000000000
>> [   60.622834] GPR20: 0000000000000000 0000000000000000 0000000000000000
>> c000000000b14fe8
>> [   60.622837] GPR24: c000000000b14fc0 c0000003afc10400 c0000003b0c40000
>> 0000000000000000
>> [   60.622841] GPR28: c0000003c505a098 0000000000000000 c0000003afc10400
>> 0000000000000006
>> [   60.622850] NIP [d000000004350be0]
>> cxl_adapter_context_unlock+0x60/0x80 [cxl]
>> [   60.622856] LR [d000000004350bdc]
>> cxl_adapter_context_unlock+0x5c/0x80 [cxl]
>> [   60.622857] Call Trace:
>> [   60.622863] [c0000003be10b8e0] [d000000004350bdc]
>> cxl_adapter_context_unlock+0x5c/0x80 [cxl] (unreliable)
>> [   60.622871] [c0000003be10b940] [d00000000435e810]
>> cxl_configure_adapter+0x930/0x960 [cxl]
>> [   60.622879] [c0000003be10b9f0] [d00000000435e88c]
>> cxl_pci_slot_reset+0x4c/0x230 [cxl]
>> [   60.622883] [c0000003be10baa0] [c000000000032cd4]
>> eeh_report_reset+0x164/0x1a0
>> [   60.622887] [c0000003be10bae0] [c000000000031220]
>> eeh_pe_dev_traverse+0x90/0x170
>> [   60.622890] [c0000003be10bb70] [c000000000033354]
>> eeh_handle_normal_event+0x3d4/0x520
>> [   60.622892] [c0000003be10bc20] [c000000000033624]
>> eeh_handle_event+0x44/0x360
>> [   60.622895] [c0000003be10bcd0] [c000000000033a58]
>> eeh_event_handler+0x118/0x1d0
>> [   60.622898] [c0000003be10bd80] [c0000000000babc8] kthread+0x108/0x130
>> [   60.622902] [c0000003be10be30] [c00000000000c0a0]
>> ret_from_kernel_thread+0x5c/0xbc
>> [   60.622903] Instruction dump:
>> [   60.622905] 2f84ffff 4dfe0020 7c0802a6 7c8407b4 39200000 f8010010
>> f821ffa1 91230348
>> [   60.622911] 3c620000 e8638070 48016639 e8410018 <0fe00000> 38210060
>> e8010010 7c0803a6
>> [   60.622918] ---[ end trace d358551c9a007b4f ]---
>> [   60.622959] cxl afu0.0: Activating AFU directed mode
>> [   60.623097] EEH: Notify device driver to resume

That *definitely* looks related to this patch...
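
For reference, the counting scheme from the patch description can be
modelled in user space roughly as below. This is only a sketch under
stated assumptions, not the driver code: it uses C11 atomics instead of
the kernel's atomic_t, and struct adapter_model and the context_*()
helpers are hypothetical names made up for illustration. It shows one way
a message like "unlocked with 0 active contexts" could arise, namely an
unlock issued while the counter is not at -1. The thread itself does not
establish that this is what happens in the EEH path.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the adapter; only the counter is modelled. */
struct adapter_model {
	atomic_int contexts_num; /* >= 0: active contexts, -1: locked for reset */
};

/* Context activation: increment only while the counter is non-negative. */
static bool context_get(struct adapter_model *a)
{
	int cur = atomic_load(&a->contexts_num);

	while (cur >= 0) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&a->contexts_num, &cur, cur + 1))
			return true;
	}
	return false; /* a reset is pending, refuse activation */
}

/* Context teardown: drop one active context. */
static void context_put(struct adapter_model *a)
{
	atomic_fetch_sub(&a->contexts_num, 1);
}

/* Reset path: succeeds only if no context is active (0 -> -1). */
static bool context_lock(struct adapter_model *a)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&a->contexts_num, &expected, -1);
}

/*
 * Release the reset lock (-1 -> 0). Returns false, after printing a
 * message and forcing the counter back to 0, if the lock was not held.
 */
static bool context_unlock(struct adapter_model *a)
{
	int expected = -1;

	if (atomic_compare_exchange_strong(&a->contexts_num, &expected, 0))
		return true;

	/* expected now holds the value that was actually found. */
	fprintf(stderr, "Adapter context unlocked with %d active contexts\n",
		expected);
	atomic_store(&a->contexts_num, 0);
	return false;
}
```

In this model, calling context_unlock() twice in a row, or calling it on
a path that never called context_lock(), takes the warning branch with a
reported count of 0.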


Andrew

-- 
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com  IBM Australia Limited


Thread overview: 7+ messages
2016-10-14  9:38 [RESEND] [PATCH v3] cxl: Prevent adapter reset if an active context exists Vaibhav Jain
2016-10-21 22:02 ` [RESEND, " Michael Ellerman
2016-11-04  6:27 ` [RESEND] [PATCH " Andrew Donnellan
2016-11-04 12:07   ` Frederic Barrat
2016-11-04 13:15     ` Uma Krishnan
2016-11-07  0:28       ` Andrew Donnellan
2016-11-07  0:29     ` Andrew Donnellan
