linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect
@ 2016-06-29 18:51 Ian Munsie
  2016-06-29 18:51 ` [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset Ian Munsie
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Ian Munsie @ 2016-06-29 18:51 UTC (permalink / raw)
  To: Michael Ellerman, mikey, linuxppc-dev, Frederic Barrat,
	Huy Nguyen
  Cc: Ian Munsie

From: Ian Munsie <imunsie@au1.ibm.com>

The AFU disable operation has a bug where it will not clear the enable
bit and therefore will have no effect. To date this has likely been
masked by fact that we perform an AFU reset before the disable, which
also has the effect of clearing the enable bit, making the following
disable operation effectively a noop on most hardware. This patch
modifies the afu_control function to take a parameter to clear from the
AFU control register so that the disable operation can clear the
appropriate bit.

This bug was uncovered on the Mellanox CX4, which uses an XSL rather
than a PSL. On the XSL the reset operation will not complete while the
AFU is enabled, meaning the enable bit was still set at the start of the
disable and as a result this bug was hit and the disable also timed out.

Because of this difference in behaviour between the PSL and XSL, this
patch now makes the reset dependent on the card using a PSL to avoid
waiting for a timeout on the XSL. It is entirely possible that we may be
able to drop the reset altogether if it turns out we only ever needed it
due to this bug - however I am not willing to drop it without further
regression testing.

This also fixes a small issue where the AFU_Cntl register was read
outside of the lock that protects it.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
---
 drivers/misc/cxl/cxl.h    |  1 +
 drivers/misc/cxl/native.c | 36 ++++++++++++++++++++++++++++--------
 drivers/misc/cxl/pci.c    |  1 +
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index ce2b9d5..bab8dfd 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -544,6 +544,7 @@ struct cxl_service_layer_ops {
 	void (*write_timebase_ctrl)(struct cxl *adapter);
 	u64 (*timebase_read)(struct cxl *adapter);
 	int capi_mode;
+	bool needs_reset_before_disable;
 };
 
 struct cxl_native {
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 120c468..9479bfc 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -21,10 +21,10 @@
 #include "cxl.h"
 #include "trace.h"
 
-static int afu_control(struct cxl_afu *afu, u64 command,
+static int afu_control(struct cxl_afu *afu, u64 command, u64 clear,
 		       u64 result, u64 mask, bool enabled)
 {
-	u64 AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
+	u64 AFU_Cntl;
 	unsigned long timeout = jiffies + (HZ * CXL_TIMEOUT);
 	int rc = 0;
 
@@ -33,7 +33,8 @@ static int afu_control(struct cxl_afu *afu, u64 command,
 
 	trace_cxl_afu_ctrl(afu, command);
 
-	cxl_p2n_write(afu, CXL_AFU_Cntl_An, AFU_Cntl | command);
+	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
+	cxl_p2n_write(afu, CXL_AFU_Cntl_An, (AFU_Cntl & ~clear) | command);
 
 	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
 	while ((AFU_Cntl & mask) != result) {
@@ -67,7 +68,7 @@ static int afu_enable(struct cxl_afu *afu)
 {
 	pr_devel("AFU enable request\n");
 
-	return afu_control(afu, CXL_AFU_Cntl_An_E,
+	return afu_control(afu, CXL_AFU_Cntl_An_E, 0,
 			   CXL_AFU_Cntl_An_ES_Enabled,
 			   CXL_AFU_Cntl_An_ES_MASK, true);
 }
@@ -76,7 +77,8 @@ int cxl_afu_disable(struct cxl_afu *afu)
 {
 	pr_devel("AFU disable request\n");
 
-	return afu_control(afu, 0, CXL_AFU_Cntl_An_ES_Disabled,
+	return afu_control(afu, 0, CXL_AFU_Cntl_An_E,
+			   CXL_AFU_Cntl_An_ES_Disabled,
 			   CXL_AFU_Cntl_An_ES_MASK, false);
 }
 
@@ -85,7 +87,7 @@ static int native_afu_reset(struct cxl_afu *afu)
 {
 	pr_devel("AFU reset request\n");
 
-	return afu_control(afu, CXL_AFU_Cntl_An_RA,
+	return afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
 			   CXL_AFU_Cntl_An_RS_Complete | CXL_AFU_Cntl_An_ES_Disabled,
 			   CXL_AFU_Cntl_An_RS_MASK | CXL_AFU_Cntl_An_ES_MASK,
 			   false);
@@ -595,7 +597,16 @@ static int deactivate_afu_directed(struct cxl_afu *afu)
 	cxl_sysfs_afu_m_remove(afu);
 	cxl_chardev_afu_remove(afu);
 
-	cxl_ops->afu_reset(afu);
+	if (afu->adapter->native->sl_ops->needs_reset_before_disable) {
+		/*
+		 * XXX: We may be able to do away with this entirely - it is
+		 * possible that this was only ever needed due to a bug where
+		 * the disable operation did not clear the enable bit, however
+		 * I will only consider dropping this after more regression
+		 * testing on earlier PSL images.
+		 */
+		cxl_ops->afu_reset(afu);
+	}
 	cxl_afu_disable(afu);
 	cxl_psl_purge(afu);
 
@@ -735,7 +746,16 @@ static int native_attach_process(struct cxl_context *ctx, bool kernel,
 
 static inline int detach_process_native_dedicated(struct cxl_context *ctx)
 {
-	cxl_ops->afu_reset(ctx->afu);
+	if (ctx->afu->adapter->native->sl_ops->needs_reset_before_disable) {
+		/*
+		 * XXX: We may be able to do away with this entirely - it is
+		 * possible that this was only ever needed due to a bug where
+		 * the disable operation did not clear the enable bit, however
+		 * I will only consider dropping this after more regression
+		 * testing on earlier PSL images.
+		 */
+		cxl_ops->afu_reset(ctx->afu);
+	}
 	cxl_afu_disable(ctx->afu);
 	cxl_psl_purge(ctx->afu);
 	return 0;
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 58d7d821..b7f2e96 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1309,6 +1309,7 @@ static const struct cxl_service_layer_ops psl_ops = {
 	.write_timebase_ctrl = write_timebase_ctrl_psl,
 	.timebase_read = timebase_read_psl,
 	.capi_mode = OPAL_PHB_CAPI_MODE_CAPI,
+	.needs_reset_before_disable = true,
 };
 
 static const struct cxl_service_layer_ops xsl_ops = {
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset
  2016-06-29 18:51 [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Ian Munsie
@ 2016-06-29 18:51 ` Ian Munsie
  2016-06-30 11:05   ` Frederic Barrat
  2016-07-11 10:19   ` [2/2] " Michael Ellerman
  2016-06-30 14:19 ` [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Frederic Barrat
  2016-06-30 16:50 ` [PATCH v2] " Ian Munsie
  2 siblings, 2 replies; 11+ messages in thread
From: Ian Munsie @ 2016-06-29 18:51 UTC (permalink / raw)
  To: Michael Ellerman, mikey, linuxppc-dev, Frederic Barrat,
	Huy Nguyen
  Cc: Ian Munsie

From: Ian Munsie <imunsie@au1.ibm.com>

An issue was noted in our debug logs where the XSL would leave the RA
bit asserted after an AFU reset operation, which would effectively
prevent further AFU reset operations from working.

Workaround the issue by clearing the RA bit with an MMIO write if it is
still asserted after any AFU control operation.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
---
 drivers/misc/cxl/native.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 9479bfc..bc79be8 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -55,6 +55,16 @@ static int afu_control(struct cxl_afu *afu, u64 command, u64 clear,
 		cpu_relax();
 		AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
 	};
+
+	if (AFU_Cntl & CXL_AFU_Cntl_An_RA) {
+		/*
+		 * Workaround for a bug in the XSL used in the Mellanox CX4
+		 * that fails to clear the RA bit after an AFU reset,
+		 * preventing subsequent AFU resets from working.
+		 */
+		cxl_p2n_write(afu, CXL_AFU_Cntl_An, AFU_Cntl & ~CXL_AFU_Cntl_An_RA);
+	}
+
 	pr_devel("AFU command complete: %llx\n", command);
 	afu->enabled = enabled;
 out:
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset
  2016-06-29 18:51 ` [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset Ian Munsie
@ 2016-06-30 11:05   ` Frederic Barrat
  2016-07-11 10:19   ` [2/2] " Michael Ellerman
  1 sibling, 0 replies; 11+ messages in thread
From: Frederic Barrat @ 2016-06-30 11:05 UTC (permalink / raw)
  To: Ian Munsie, Michael Ellerman, mikey, linuxppc-dev,
	Frederic Barrat, Huy Nguyen



Le 29/06/2016 20:51, Ian Munsie a écrit :
> From: Ian Munsie <imunsie@au1.ibm.com>
>
> An issue was noted in our debug logs where the XSL would leave the RA
> bit asserted after an AFU reset operation, which would effectively
> prevent further AFU reset operations from working.
>
> Workaround the issue by clearing the RA bit with an MMIO write if it is
> still asserted after any AFU control operation.
>
> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>


OK, should be safe and valid even on non-xsl hardware.

Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>


   Fred

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-29 18:51 [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Ian Munsie
  2016-06-29 18:51 ` [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset Ian Munsie
@ 2016-06-30 14:19 ` Frederic Barrat
  2016-06-30 15:32   ` Ian Munsie
  2016-06-30 16:50 ` [PATCH v2] " Ian Munsie
  2 siblings, 1 reply; 11+ messages in thread
From: Frederic Barrat @ 2016-06-30 14:19 UTC (permalink / raw)
  To: Ian Munsie, Michael Ellerman, mikey, linuxppc-dev,
	Frederic Barrat, Huy Nguyen

Hi Ian,

> -static int afu_control(struct cxl_afu *afu, u64 command,
> +static int afu_control(struct cxl_afu *afu, u64 command, u64 clear,
>   		       u64 result, u64 mask, bool enabled)

I'm not a big fan of the new "clear" argument, which forces us to pass 
an extra 0 most of the time. Why not always clearing the "action" bits 
of the register before applying the command? They are mutually 
exclusive, so it should work. I.e. :

+	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
+	AFU_Cntl &= ~(CXL_AFU_Cntl_An_E | CXL_AFU_Cntl_An_RA);
+	AFU_Cntl |= command;




> @@ -595,7 +597,16 @@ static int deactivate_afu_directed(struct cxl_afu *afu)
>   	cxl_sysfs_afu_m_remove(afu);
>   	cxl_chardev_afu_remove(afu);
>
> -	cxl_ops->afu_reset(afu);
> +	if (afu->adapter->native->sl_ops->needs_reset_before_disable) {
> +		/*
> +		 * XXX: We may be able to do away with this entirely - it is
> +		 * possible that this was only ever needed due to a bug where
> +		 * the disable operation did not clear the enable bit, however
> +		 * I will only consider dropping this after more regression
> +		 * testing on earlier PSL images.
> +		 */
> +		cxl_ops->afu_reset(afu);
> +	}
>   	cxl_afu_disable(afu);
>   	cxl_psl_purge(afu);
>
> @@ -735,7 +746,16 @@ static int native_attach_process(struct cxl_context *ctx, bool kernel,
>
>   static inline int detach_process_native_dedicated(struct cxl_context *ctx)
>   {
> -	cxl_ops->afu_reset(ctx->afu);
> +	if (ctx->afu->adapter->native->sl_ops->needs_reset_before_disable) {
> +		/*
> +		 * XXX: We may be able to do away with this entirely - it is
> +		 * possible that this was only ever needed due to a bug where
> +		 * the disable operation did not clear the enable bit, however
> +		 * I will only consider dropping this after more regression
> +		 * testing on earlier PSL images.
> +		 */
> +		cxl_ops->afu_reset(ctx->afu);
> +	}

For dedicated mode, the CAIA recommends an explicit reset of the AFU 
(section 2.1.1).

For directed mode, CAIA says it's AFU-specific. So for xsl, we only 
disable the afu and purge the xsl. Are we getting rid of the reset 
because it's useless in that environment, or because it times out? If 
just because of timeout, would it make sense to switch the order and 
disable first, then reset? I don't see any afu-reset on the next activation.

   Fred

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-30 14:19 ` [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Frederic Barrat
@ 2016-06-30 15:32   ` Ian Munsie
  2016-06-30 15:50     ` Frederic Barrat
  0 siblings, 1 reply; 11+ messages in thread
From: Ian Munsie @ 2016-06-30 15:32 UTC (permalink / raw)
  To: Frederic Barrat
  Cc: Michael Ellerman, mikey, linuxppc-dev, Frederic Barrat,
	Huy Nguyen

Excerpts from Frederic Barrat's message of 2016-06-30 16:19:54 +0200:
> I'm not a big fan of the new "clear" argument, which forces us to pass 
> an extra 0 most of the time. Why not always clearing the "action" bits 
> of the register before applying the command? They are mutually 
> exclusive, so it should work. I.e. :
> 
> +    AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
> +    AFU_Cntl &= ~(CXL_AFU_Cntl_An_E | CXL_AFU_Cntl_An_RA);
> +    AFU_Cntl |= command;

In theory that should be OK, but I'd want to test it on some PSL images
first just in case they don't behave quite how we expect since we've had
problems in this area in the past (although after discovering the bug in
the disable path that may provide the explanation for those problems).

The risk I see with that approach is setting the reset bit and clearing
the enable bit at the same time is commanding the hardware to do two
similar but subtly different operations simultaneously, which is not
something we have done before or tested - we can always clean this up in
a later patch after we've tested it on a couple of PSLs, cxlflash, etc
and are happy that it works with no ill effects. I don't see any reason
that needs to be done in this patch though since it's just churn inside
the driver and doesn't impact anyone else.

> >   static inline int detach_process_native_dedicated(struct cxl_context *ctx)
> >   {
> > -    cxl_ops->afu_reset(ctx->afu);
> > +    if (ctx->afu->adapter->native->sl_ops->needs_reset_before_disable) {
> > +        /*
> > +         * XXX: We may be able to do away with this entirely - it is
> > +         * possible that this was only ever needed due to a bug where
> > +         * the disable operation did not clear the enable bit, however
> > +         * I will only consider dropping this after more regression
> > +         * testing on earlier PSL images.
> > +         */
> > +        cxl_ops->afu_reset(ctx->afu);
> > +    }
> 
> For dedicated mode, the CAIA recommends an explicit reset of the AFU 
> (section 2.1.1).

True, I had forgotten that procedure was added to the document before it
was made public - I'll update the comment and resend.

> For directed mode, CAIA says it's AFU-specific.

Damnit, that should never have been architected as AFU specific - PSL
implementation specific would have been ok, but AFU specific is just
asking for trouble since this is all generic code with no way to
identify the AFU.

Oh well, in practice the AFU developers will be coding against our
implementation now, so I guess we effectively became the authority on
how this works by default... and therefore should probably continue to
do the reset here.

Just hope there aren't too many more ASIC implementations that implement
some contradictory behaviour and are set in stone before anyone
realises.

I'll resend this with these two comments updated.

> So for xsl, we only disable the afu and purge the xsl. Are we getting
> rid of the reset because it's useless in that environment, or because
> it times out?
>
> If just because of timeout, would it make sense to switch the order
> and disable first, then reset? I don't see any afu-reset on the next
> activation.

It times out if the AFU is enabled when we attempt the reset - that's
OK, but is a bit of a waste of time and generates unnecessary noise in
the kernel log. We could switch the order, but I don't think that the
AFU reset is necessary - the documentation certainly suggests that only
a disable is required (but then again, the XSL has been full of
surprises already so I reserve the right to send a later patch adding a
reset if it has one more in store for me :-p).

A reset is also pretty meaningless on the XSL as far as I can tell - it
looks like it will assert a bit to the CX4 hardware, so I assume the
rest of the card could choose to do something with it, but unlike the
FPGA cards I don't think it actually resets anything (and I doubt we
even need the one we do while initialising the card since the AFU
descriptor is in the CX4 hardware and should be readable without that).

Cheers,
-Ian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-30 15:32   ` Ian Munsie
@ 2016-06-30 15:50     ` Frederic Barrat
  2016-06-30 16:45       ` Ian Munsie
  0 siblings, 1 reply; 11+ messages in thread
From: Frederic Barrat @ 2016-06-30 15:50 UTC (permalink / raw)
  To: Ian Munsie; +Cc: mikey, linuxppc-dev, Huy Nguyen, Frederic Barrat



Le 30/06/2016 17:32, Ian Munsie a écrit :
>> For dedicated mode, the CAIA recommends an explicit reset of the AFU
>> >(section 2.1.1).
> True, I had forgotten that procedure was added to the document before it
> was made public - I'll update the comment and resend.
>

Actually, my point was that for dedicated mode, we shouldn't have the 
"if" and always reset. It's only for dedicated mode, so it wouldn't 
impact cx4 and we would stay CAIA-compliant. If one day, there's a xsl 
with a dedicated mode AFU, they are expected to follow the spec.

   Fred

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-30 15:50     ` Frederic Barrat
@ 2016-06-30 16:45       ` Ian Munsie
  0 siblings, 0 replies; 11+ messages in thread
From: Ian Munsie @ 2016-06-30 16:45 UTC (permalink / raw)
  To: Frederic Barrat; +Cc: mikey, linuxppc-dev, Huy Nguyen, Frederic Barrat

Excerpts from Frederic Barrat's message of 2016-06-30 17:50:00 +0200:
> 
> Le 30/06/2016 17:32, Ian Munsie a écrit :
> >> For dedicated mode, the CAIA recommends an explicit reset of the AFU
> >> >(section 2.1.1).
> > True, I had forgotten that procedure was added to the document before it
> > was made public - I'll update the comment and resend.
> >
> 
> Actually, my point was that for dedicated mode, we shouldn't have the 
> "if" and always reset. It's only for dedicated mode, so it wouldn't 
> impact cx4 and we would stay CAIA-compliant. If one day, there's a xsl 
> with a dedicated mode AFU, they are expected to follow the spec.

Yeah, I thought of that as well while I was updating the patch and
removed that from the dedicated path. I still added a comment to that
path to note that though.

Cheers,
-Ian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-29 18:51 [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Ian Munsie
  2016-06-29 18:51 ` [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset Ian Munsie
  2016-06-30 14:19 ` [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Frederic Barrat
@ 2016-06-30 16:50 ` Ian Munsie
  2016-07-01  8:09   ` Frederic Barrat
  2016-07-11 10:19   ` [v2] " Michael Ellerman
  2 siblings, 2 replies; 11+ messages in thread
From: Ian Munsie @ 2016-06-30 16:50 UTC (permalink / raw)
  To: Michael Ellerman, mikey, linuxppc-dev, Frederic Barrat,
	Huy Nguyen
  Cc: Ian Munsie

From: Ian Munsie <imunsie@au1.ibm.com>

The AFU disable operation has a bug where it will not clear the enable
bit and therefore will have no effect. To date this has likely been
masked by fact that we perform an AFU reset before the disable, which
also has the effect of clearing the enable bit, making the following
disable operation effectively a noop on most hardware. This patch
modifies the afu_control function to take a parameter to clear from the
AFU control register so that the disable operation can clear the
appropriate bit.

This bug was uncovered on the Mellanox CX4, which uses an XSL rather
than a PSL. On the XSL the reset operation will not complete while the
AFU is enabled, meaning the enable bit was still set at the start of the
disable and as a result this bug was hit and the disable also timed out.

Because of this difference in behaviour between the PSL and XSL, this
patch now makes the reset dependent on the card using a PSL to avoid
waiting for a timeout on the XSL. It is entirely possible that we may be
able to drop the reset altogether if it turns out we only ever needed it
due to this bug - however I am not willing to drop it without further
regression testing and have added comments to the code explaining the
background.

This also fixes a small issue where the AFU_Cntl register was read
outside of the lock that protects it.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
---
Changes since v1:
	- Modified comment to dedicated process disable path to explain
	  the architected procedure, and the origin of our more heavy weight
	  procedure.
	- Modified comment to afu directed disable path with a note that the
	  procedure is AFU specific, and explaining the origin of our heavy
	  weight prcedure.
	- Removed needs_reset_before_disable condition from dedicated process
	  disable path. The XSL in the CX4 does not use this mode (AFU directed
	  only), and since the reset in this procedure is architected we
	  *should* never need to skip it.

 drivers/misc/cxl/cxl.h    |  1 +
 drivers/misc/cxl/native.c | 58 +++++++++++++++++++++++++++++++++++++++++------
 drivers/misc/cxl/pci.c    |  1 +
 3 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index ce2b9d5..bab8dfd 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -544,6 +544,7 @@ struct cxl_service_layer_ops {
 	void (*write_timebase_ctrl)(struct cxl *adapter);
 	u64 (*timebase_read)(struct cxl *adapter);
 	int capi_mode;
+	bool needs_reset_before_disable;
 };
 
 struct cxl_native {
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 120c468..e774505 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -21,10 +21,10 @@
 #include "cxl.h"
 #include "trace.h"
 
-static int afu_control(struct cxl_afu *afu, u64 command,
+static int afu_control(struct cxl_afu *afu, u64 command, u64 clear,
 		       u64 result, u64 mask, bool enabled)
 {
-	u64 AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
+	u64 AFU_Cntl;
 	unsigned long timeout = jiffies + (HZ * CXL_TIMEOUT);
 	int rc = 0;
 
@@ -33,7 +33,8 @@ static int afu_control(struct cxl_afu *afu, u64 command,
 
 	trace_cxl_afu_ctrl(afu, command);
 
-	cxl_p2n_write(afu, CXL_AFU_Cntl_An, AFU_Cntl | command);
+	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
+	cxl_p2n_write(afu, CXL_AFU_Cntl_An, (AFU_Cntl & ~clear) | command);
 
 	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
 	while ((AFU_Cntl & mask) != result) {
@@ -67,7 +68,7 @@ static int afu_enable(struct cxl_afu *afu)
 {
 	pr_devel("AFU enable request\n");
 
-	return afu_control(afu, CXL_AFU_Cntl_An_E,
+	return afu_control(afu, CXL_AFU_Cntl_An_E, 0,
 			   CXL_AFU_Cntl_An_ES_Enabled,
 			   CXL_AFU_Cntl_An_ES_MASK, true);
 }
@@ -76,7 +77,8 @@ int cxl_afu_disable(struct cxl_afu *afu)
 {
 	pr_devel("AFU disable request\n");
 
-	return afu_control(afu, 0, CXL_AFU_Cntl_An_ES_Disabled,
+	return afu_control(afu, 0, CXL_AFU_Cntl_An_E,
+			   CXL_AFU_Cntl_An_ES_Disabled,
 			   CXL_AFU_Cntl_An_ES_MASK, false);
 }
 
@@ -85,7 +87,7 @@ static int native_afu_reset(struct cxl_afu *afu)
 {
 	pr_devel("AFU reset request\n");
 
-	return afu_control(afu, CXL_AFU_Cntl_An_RA,
+	return afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
 			   CXL_AFU_Cntl_An_RS_Complete | CXL_AFU_Cntl_An_ES_Disabled,
 			   CXL_AFU_Cntl_An_RS_MASK | CXL_AFU_Cntl_An_ES_MASK,
 			   false);
@@ -595,7 +597,33 @@ static int deactivate_afu_directed(struct cxl_afu *afu)
 	cxl_sysfs_afu_m_remove(afu);
 	cxl_chardev_afu_remove(afu);
 
-	cxl_ops->afu_reset(afu);
+	/*
+	 * The CAIA section 2.2.1 indicates that the procedure for starting and
+	 * stopping an AFU in AFU directed mode is AFU specific, which is not
+	 * ideal since this code is generic and with one exception has no
+	 * knowledge of the AFU. This is in contrast to the procedure for
+	 * disabling a dedicated process AFU, which is documented to just
+	 * require a reset. The architecture does indicate that both an AFU
+	 * reset and an AFU disable should result in the AFU being disabled and
+	 * we do both followed by a PSL purge for safety.
+	 *
+	 * Notably we used to have some issues with the disable sequence on PSL
+	 * cards, which is why we ended up using this heavy weight procedure in
+	 * the first place, however a bug was discovered that had rendered the
+	 * disable operation ineffective, so it is conceivable that was the
+	 * sole explanation for those difficulties. Careful regression testing
+	 * is recommended if anyone attempts to remove or reorder these
+	 * operations.
+	 *
+	 * The XSL on the Mellanox CX4 behaves a little differently from the
+	 * PSL based cards and will time out an AFU reset if the AFU is still
+	 * enabled. That card is special in that we do have a means to identify
+	 * it from this code, so in that case we skip the reset and just use a
+	 * disable/purge to avoid the timeout and corresponding noise in the
+	 * kernel log.
+	 */
+	if (afu->adapter->native->sl_ops->needs_reset_before_disable)
+		cxl_ops->afu_reset(afu);
 	cxl_afu_disable(afu);
 	cxl_psl_purge(afu);
 
@@ -735,6 +763,22 @@ static int native_attach_process(struct cxl_context *ctx, bool kernel,
 
 static inline int detach_process_native_dedicated(struct cxl_context *ctx)
 {
+	/*
+	 * The CAIA section 2.1.1 indicates that we need to do an AFU reset to
+	 * stop the AFU in dedicated mode (we therefore do not make that
+	 * optional like we do in the afu directed path). It does not indicate
+	 * that we need to do an explicit disable (which should occur
+	 * implicitly as part of the reset) or purge, but we do these as well
+	 * to be on the safe side.
+	 *
+	 * Notably we used to have some issues with the disable sequence
+	 * (before the sequence was spelled out in the architecture) which is
+	 * why we were so heavy weight in the first place, however a bug was
+	 * discovered that had rendered the disable operation ineffective, so
+	 * it is conceivable that was the sole explanation for those
+	 * difficulties. Point is, we should be careful and do some regression
+	 * testing if we ever attempt to remove any part of this procedure.
+	 */
 	cxl_ops->afu_reset(ctx->afu);
 	cxl_afu_disable(ctx->afu);
 	cxl_psl_purge(ctx->afu);
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 58d7d821..b7f2e96 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1309,6 +1309,7 @@ static const struct cxl_service_layer_ops psl_ops = {
 	.write_timebase_ctrl = write_timebase_ctrl_psl,
 	.timebase_read = timebase_read_psl,
 	.capi_mode = OPAL_PHB_CAPI_MODE_CAPI,
+	.needs_reset_before_disable = true,
 };
 
 static const struct cxl_service_layer_ops xsl_ops = {
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-30 16:50 ` [PATCH v2] " Ian Munsie
@ 2016-07-01  8:09   ` Frederic Barrat
  2016-07-11 10:19   ` [v2] " Michael Ellerman
  1 sibling, 0 replies; 11+ messages in thread
From: Frederic Barrat @ 2016-07-01  8:09 UTC (permalink / raw)
  To: Ian Munsie, Michael Ellerman, mikey, linuxppc-dev,
	Frederic Barrat, Huy Nguyen

It looks good to me. I can live with the extra 'clear' parameter, 
because as you say, the approach I had suggested would require more 
testing, so let's play safe for the time being.
And thanks for clarifying/fixing the deactivate paths.

Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

   Fred


Le 30/06/2016 18:50, Ian Munsie a écrit :
> From: Ian Munsie <imunsie@au1.ibm.com>
>
> The AFU disable operation has a bug where it will not clear the enable
> bit and therefore will have no effect. To date this has likely been
> masked by fact that we perform an AFU reset before the disable, which
> also has the effect of clearing the enable bit, making the following
> disable operation effectively a noop on most hardware. This patch
> modifies the afu_control function to take a parameter to clear from the
> AFU control register so that the disable operation can clear the
> appropriate bit.
>
> This bug was uncovered on the Mellanox CX4, which uses an XSL rather
> than a PSL. On the XSL the reset operation will not complete while the
> AFU is enabled, meaning the enable bit was still set at the start of the
> disable and as a result this bug was hit and the disable also timed out.
>
> Because of this difference in behaviour between the PSL and XSL, this
> patch now makes the reset dependent on the card using a PSL to avoid
> waiting for a timeout on the XSL. It is entirely possible that we may be
> able to drop the reset altogether if it turns out we only ever needed it
> due to this bug - however I am not willing to drop it without further
> regression testing and have added comments to the code explaining the
> background.
>
> This also fixes a small issue where the AFU_Cntl register was read
> outside of the lock that protects it.
>
> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
> ---
> Changes since v1:
> 	- Modified comment to dedicated process disable path to explain
> 	  the architected procedure, and the origin of our more heavy weight
> 	  procedure.
> 	- Modified comment to afu directed disable path with a note that the
> 	  procedure is AFU specific, and explaining the origin of our heavy
> 	  weight prcedure.
> 	- Removed needs_reset_before_disable condition from dedicated process
> 	  disable path. The XSL in the CX4 does not use this mode (AFU directed
> 	  only), and since the reset in this procedure is architected we
> 	  *should* never need to skip it.
>
>   drivers/misc/cxl/cxl.h    |  1 +
>   drivers/misc/cxl/native.c | 58 +++++++++++++++++++++++++++++++++++++++++------
>   drivers/misc/cxl/pci.c    |  1 +
>   3 files changed, 53 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
> index ce2b9d5..bab8dfd 100644
> --- a/drivers/misc/cxl/cxl.h
> +++ b/drivers/misc/cxl/cxl.h
> @@ -544,6 +544,7 @@ struct cxl_service_layer_ops {
>   	void (*write_timebase_ctrl)(struct cxl *adapter);
>   	u64 (*timebase_read)(struct cxl *adapter);
>   	int capi_mode;
> +	bool needs_reset_before_disable;
>   };
>
>   struct cxl_native {
> diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
> index 120c468..e774505 100644
> --- a/drivers/misc/cxl/native.c
> +++ b/drivers/misc/cxl/native.c
> @@ -21,10 +21,10 @@
>   #include "cxl.h"
>   #include "trace.h"
>
> -static int afu_control(struct cxl_afu *afu, u64 command,
> +static int afu_control(struct cxl_afu *afu, u64 command, u64 clear,
>   		       u64 result, u64 mask, bool enabled)
>   {
> -	u64 AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
> +	u64 AFU_Cntl;
>   	unsigned long timeout = jiffies + (HZ * CXL_TIMEOUT);
>   	int rc = 0;
>
> @@ -33,7 +33,8 @@ static int afu_control(struct cxl_afu *afu, u64 command,
>
>   	trace_cxl_afu_ctrl(afu, command);
>
> -	cxl_p2n_write(afu, CXL_AFU_Cntl_An, AFU_Cntl | command);
> +	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
> +	cxl_p2n_write(afu, CXL_AFU_Cntl_An, (AFU_Cntl & ~clear) | command);
>
>   	AFU_Cntl = cxl_p2n_read(afu, CXL_AFU_Cntl_An);
>   	while ((AFU_Cntl & mask) != result) {
> @@ -67,7 +68,7 @@ static int afu_enable(struct cxl_afu *afu)
>   {
>   	pr_devel("AFU enable request\n");
>
> -	return afu_control(afu, CXL_AFU_Cntl_An_E,
> +	return afu_control(afu, CXL_AFU_Cntl_An_E, 0,
>   			   CXL_AFU_Cntl_An_ES_Enabled,
>   			   CXL_AFU_Cntl_An_ES_MASK, true);
>   }
> @@ -76,7 +77,8 @@ int cxl_afu_disable(struct cxl_afu *afu)
>   {
>   	pr_devel("AFU disable request\n");
>
> -	return afu_control(afu, 0, CXL_AFU_Cntl_An_ES_Disabled,
> +	return afu_control(afu, 0, CXL_AFU_Cntl_An_E,
> +			   CXL_AFU_Cntl_An_ES_Disabled,
>   			   CXL_AFU_Cntl_An_ES_MASK, false);
>   }
>
> @@ -85,7 +87,7 @@ static int native_afu_reset(struct cxl_afu *afu)
>   {
>   	pr_devel("AFU reset request\n");
>
> -	return afu_control(afu, CXL_AFU_Cntl_An_RA,
> +	return afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
>   			   CXL_AFU_Cntl_An_RS_Complete | CXL_AFU_Cntl_An_ES_Disabled,
>   			   CXL_AFU_Cntl_An_RS_MASK | CXL_AFU_Cntl_An_ES_MASK,
>   			   false);
> @@ -595,7 +597,33 @@ static int deactivate_afu_directed(struct cxl_afu *afu)
>   	cxl_sysfs_afu_m_remove(afu);
>   	cxl_chardev_afu_remove(afu);
>
> -	cxl_ops->afu_reset(afu);
> +	/*
> +	 * The CAIA section 2.2.1 indicates that the procedure for starting and
> +	 * stopping an AFU in AFU directed mode is AFU specific, which is not
> +	 * ideal since this code is generic and with one exception has no
> +	 * knowledge of the AFU. This is in contrast to the procedure for
> +	 * disabling a dedicated process AFU, which is documented to just
> +	 * require a reset. The architecture does indicate that both an AFU
> +	 * reset and an AFU disable should result in the AFU being disabled and
> +	 * we do both followed by a PSL purge for safety.
> +	 *
> +	 * Notably we used to have some issues with the disable sequence on PSL
> +	 * cards, which is why we ended up using this heavy weight procedure in
> +	 * the first place, however a bug was discovered that had rendered the
> +	 * disable operation ineffective, so it is conceivable that was the
> +	 * sole explanation for those difficulties. Careful regression testing
> +	 * is recommended if anyone attempts to remove or reorder these
> +	 * operations.
> +	 *
> +	 * The XSL on the Mellanox CX4 behaves a little differently from the
> +	 * PSL based cards and will time out an AFU reset if the AFU is still
> +	 * enabled. That card is special in that we do have a means to identify
> +	 * it from this code, so in that case we skip the reset and just use a
> +	 * disable/purge to avoid the timeout and corresponding noise in the
> +	 * kernel log.
> +	 */
> +	if (afu->adapter->native->sl_ops->needs_reset_before_disable)
> +		cxl_ops->afu_reset(afu);
>   	cxl_afu_disable(afu);
>   	cxl_psl_purge(afu);
>
> @@ -735,6 +763,22 @@ static int native_attach_process(struct cxl_context *ctx, bool kernel,
>
>   static inline int detach_process_native_dedicated(struct cxl_context *ctx)
>   {
> +	/*
> +	 * The CAIA section 2.1.1 indicates that we need to do an AFU reset to
> +	 * stop the AFU in dedicated mode (we therefore do not make that
> +	 * optional like we do in the afu directed path). It does not indicate
> +	 * that we need to do an explicit disable (which should occur
> +	 * implicitly as part of the reset) or purge, but we do these as well
> +	 * to be on the safe side.
> +	 *
> +	 * Notably we used to have some issues with the disable sequence
> +	 * (before the sequence was spelled out in the architecture) which is
> +	 * why we were so heavy weight in the first place, however a bug was
> +	 * discovered that had rendered the disable operation ineffective, so
> +	 * it is conceivable that was the sole explanation for those
> +	 * difficulties. Point is, we should be careful and do some regression
> +	 * testing if we ever attempt to remove any part of this procedure.
> +	 */
>   	cxl_ops->afu_reset(ctx->afu);
>   	cxl_afu_disable(ctx->afu);
>   	cxl_psl_purge(ctx->afu);
> diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
> index 58d7d821..b7f2e96 100644
> --- a/drivers/misc/cxl/pci.c
> +++ b/drivers/misc/cxl/pci.c
> @@ -1309,6 +1309,7 @@ static const struct cxl_service_layer_ops psl_ops = {
>   	.write_timebase_ctrl = write_timebase_ctrl_psl,
>   	.timebase_read = timebase_read_psl,
>   	.capi_mode = OPAL_PHB_CAPI_MODE_CAPI,
> +	.needs_reset_before_disable = true,
>   };
>
>   static const struct cxl_service_layer_ops xsl_ops = {
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset
  2016-06-29 18:51 ` [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset Ian Munsie
  2016-06-30 11:05   ` Frederic Barrat
@ 2016-07-11 10:19   ` Michael Ellerman
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Ellerman @ 2016-07-11 10:19 UTC (permalink / raw)
  To: Ian Munsie, mikey, linuxppc-dev, Frederic Barrat, Huy Nguyen; +Cc: Ian Munsie

On Wed, 2016-29-06 at 18:51:26 UTC, Ian Munsie wrote:
> From: Ian Munsie <imunsie@au1.ibm.com>
> 
> An issue was noted in our debug logs where the XSL would leave the RA
> bit asserted after an AFU reset operation, which would effectively
> prevent further AFU reset operations from working.
> 
> Workaround the issue by clearing the RA bit with an MMIO write if it is
> still asserted after any AFU control operation.
> 
> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/2a4f667aadb2d61c289a52a0d6

cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [v2] cxl: Fix bug where AFU disable operation had no effect
  2016-06-30 16:50 ` [PATCH v2] " Ian Munsie
  2016-07-01  8:09   ` Frederic Barrat
@ 2016-07-11 10:19   ` Michael Ellerman
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Ellerman @ 2016-07-11 10:19 UTC (permalink / raw)
  To: Ian Munsie, mikey, linuxppc-dev, Frederic Barrat, Huy Nguyen; +Cc: Ian Munsie

On Thu, 2016-30-06 at 16:50:40 UTC, Ian Munsie wrote:
> From: Ian Munsie <imunsie@au1.ibm.com>
> 
> The AFU disable operation has a bug where it will not clear the enable
> bit and therefore will have no effect. To date this has likely been
> masked by fact that we perform an AFU reset before the disable, which
> also has the effect of clearing the enable bit, making the following
> disable operation effectively a noop on most hardware. This patch
> modifies the afu_control function to take a parameter to clear from the
> AFU control register so that the disable operation can clear the
> appropriate bit.
> 
> This bug was uncovered on the Mellanox CX4, which uses an XSL rather
> than a PSL. On the XSL the reset operation will not complete while the
> AFU is enabled, meaning the enable bit was still set at the start of the
> disable and as a result this bug was hit and the disable also timed out.
> 
> Because of this difference in behaviour between the PSL and XSL, this
> patch now makes the reset dependent on the card using a PSL to avoid
> waiting for a timeout on the XSL. It is entirely possible that we may be
> able to drop the reset altogether if it turns out we only ever needed it
> due to this bug - however I am not willing to drop it without further
> regression testing and have added comments to the code explaining the
> background.
> 
> This also fixes a small issue where the AFU_Cntl register was read
> outside of the lock that protects it.
> 
> Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/5e7823c9bc44965c2e7d1d755b

cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-07-11 10:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-06-29 18:51 [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Ian Munsie
2016-06-29 18:51 ` [PATCH 2/2] cxl: Workaround XSL bug that does not clear the RA bit after a reset Ian Munsie
2016-06-30 11:05   ` Frederic Barrat
2016-07-11 10:19   ` [2/2] " Michael Ellerman
2016-06-30 14:19 ` [PATCH 1/2] cxl: Fix bug where AFU disable operation had no effect Frederic Barrat
2016-06-30 15:32   ` Ian Munsie
2016-06-30 15:50     ` Frederic Barrat
2016-06-30 16:45       ` Ian Munsie
2016-06-30 16:50 ` [PATCH v2] " Ian Munsie
2016-07-01  8:09   ` Frederic Barrat
2016-07-11 10:19   ` [v2] " Michael Ellerman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).