linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 15/18] RDMA/bnxt_re: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 16/18] IB/mlx4: " Sinan Kaya
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Selvin Xavier,
	Devesh Sharma, Somnath Kotur, Sriharsha Basavapatna, Doug Ledford,
	Jason Gunthorpe, linux-rdma, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
index 8329ec6..4a6b981 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
+++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
@@ -181,10 +181,10 @@ static int __send_message(struct bnxt_qplib_rcfw *rcfw, struct cmdq_base *req,
 
 	/* ring CMDQ DB */
 	wmb();
-	writel(cmdq_prod, rcfw->cmdq_bar_reg_iomem +
-	       rcfw->cmdq_bar_reg_prod_off);
-	writel(RCFW_CMDQ_TRIG_VAL, rcfw->cmdq_bar_reg_iomem +
-	       rcfw->cmdq_bar_reg_trig_off);
+	writel_relaxed(cmdq_prod, rcfw->cmdq_bar_reg_iomem +
+		       rcfw->cmdq_bar_reg_prod_off);
+	writel_relaxed(RCFW_CMDQ_TRIG_VAL, rcfw->cmdq_bar_reg_iomem +
+		       rcfw->cmdq_bar_reg_trig_off);
 done:
 	spin_unlock_irqrestore(&cmdq->lock, flags);
 	/* Return the CREQ response pointer */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 16/18] IB/mlx4: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
  2018-03-16 16:16 ` [PATCH v3 15/18] RDMA/bnxt_re: Eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 17/18] RDMA/i40iw: " Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
  3 siblings, 0 replies; 19+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Yishai Hadas,
	Doug Ledford, Jason Gunthorpe, linux-rdma, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/mlx4/qp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index f045491..74b27b0 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -3880,8 +3880,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		 */
 		wmb();
 
-		writel(qp->doorbell_qpn,
-		       to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
+		writel_relaxed(qp->doorbell_qpn,
+			to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
 
 		/*
 		 * Make sure doorbells don't leak out of SQ spinlock
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 17/18] RDMA/i40iw: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
  2018-03-16 16:16 ` [PATCH v3 15/18] RDMA/bnxt_re: Eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 16/18] IB/mlx4: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
  3 siblings, 0 replies; 19+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Faisal Latif,
	Shiraz Saleem, Doug Ledford, Jason Gunthorpe, linux-rdma,
	linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/i40iw/i40iw_ctrl.c  |  6 ++++--
 drivers/infiniband/hw/i40iw/i40iw_osdep.h |  1 +
 drivers/infiniband/hw/i40iw/i40iw_uk.c    |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_utils.c | 11 +++++++++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_ctrl.c b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
index c74fd33..47f473e 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
@@ -706,9 +706,11 @@ static void i40iw_sc_ccq_arm(struct i40iw_sc_cq *ccq)
 	wmb();       /* make sure shadow area is updated before arming */
 
 	if (ccq->dev->is_pf)
-		i40iw_wr32(ccq->dev->hw, I40E_PFPE_CQARM, ccq->cq_uk.cq_id);
+		i40iw_wr32_relaxed(ccq->dev->hw, I40E_PFPE_CQARM,
+				   ccq->cq_uk.cq_id);
 	else
-		i40iw_wr32(ccq->dev->hw, I40E_VFPE_CQARM1, ccq->cq_uk.cq_id);
+		i40iw_wr32_relaxed(ccq->dev->hw, I40E_VFPE_CQARM1,
+				   ccq->cq_uk.cq_id);
 }
 
 /**
diff --git a/drivers/infiniband/hw/i40iw/i40iw_osdep.h b/drivers/infiniband/hw/i40iw/i40iw_osdep.h
index f27be3e..e06f4b9 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_osdep.h
+++ b/drivers/infiniband/hw/i40iw/i40iw_osdep.h
@@ -213,5 +213,6 @@ void i40iw_hw_stats_start_timer(struct i40iw_sc_vsi *vsi);
 void i40iw_hw_stats_stop_timer(struct i40iw_sc_vsi *vsi);
 #define i40iw_mmiowb() mmiowb()
 void i40iw_wr32(struct i40iw_hw *hw, u32 reg, u32 value);
+void i40iw_wr32_relaxed(struct i40iw_hw *hw, u32 reg, u32 value);
 u32  i40iw_rd32(struct i40iw_hw *hw, u32 reg);
 #endif				/* _I40IW_OSDEP_H_ */
diff --git a/drivers/infiniband/hw/i40iw/i40iw_uk.c b/drivers/infiniband/hw/i40iw/i40iw_uk.c
index 8afa5a6..7f0ebed 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_uk.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_uk.c
@@ -723,7 +723,7 @@ static void i40iw_cq_request_notification(struct i40iw_cq_uk *cq,
 
 	wmb(); /* make sure WQE is populated before valid bit is set */
 
-	writel(cq->cq_id, cq->cqe_alloc_reg);
+	writel_relaxed(cq->cq_id, cq->cqe_alloc_reg);
 }
 
 /**
diff --git a/drivers/infiniband/hw/i40iw/i40iw_utils.c b/drivers/infiniband/hw/i40iw/i40iw_utils.c
index ddc1056..99aa6f8 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_utils.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_utils.c
@@ -125,6 +125,17 @@ inline void i40iw_wr32(struct i40iw_hw *hw, u32 reg, u32 value)
 }
 
 /**
+ * i40iw_wr32_relaxed - write 32 bits to hw register without ordering
+ * @hw: hardware information including registers
+ * @reg: register offset
+ * @value: vvalue to write to register
+ */
+inline void i40iw_wr32_relaxed(struct i40iw_hw *hw, u32 reg, u32 value)
+{
+	writel_relaxed(value, hw->hw_addr + reg);
+}
+
+/**
  * i40iw_rd32 - read a 32 bit hw register
  * @hw: hardware information including registers
  * @reg: register offset
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (2 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 17/18] RDMA/i40iw: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 21:05   ` Steve Wise
  3 siblings, 1 reply; 19+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Steve Wise,
	Doug Ledford, Jason Gunthorpe, linux-rdma, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..7a48c9e 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
 				 (u64 *)wqe);
 		} else {
 			pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-			writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-			       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+				       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
 		}
 
 		/* Flush user doorbell area writes. */
 		wmb();
 		return;
 	}
-	writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+	writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
 				 (void *)wqe);
 		} else {
 			pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-			writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-			       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+				       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
 		}
 
 		/* Flush user doorbell area writes. */
 		wmb();
 		return;
 	}
-	writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+	writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline int t4_wq_in_error(struct t4_wq *wq)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
@ 2018-03-16 21:05   ` Steve Wise
  2018-03-16 21:46     ` Sinan Kaya
  2018-03-16 22:13     ` Jason Gunthorpe
  0 siblings, 2 replies; 19+ messages in thread
From: Steve Wise @ 2018-03-16 21:05 UTC (permalink / raw)
  To: 'Sinan Kaya', netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

> Code includes wmb() followed by writel(). writel() already has a barrier
on
> some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>

NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
writeX().  

I was just looking at this with Chelsio developers, and they said the
writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
get rid of the extra barrier for all architectures.

Also, t4.h:pio_copy() needs to  use __raw_writeq() to enable the write
combining fastpath for ARM and PowerPC.  The code as it stands doesn't
achieve any write combining on PowerPC at least.  

And the writel()s at the end of the ring functions (the non bar2 udb path)
needs a mmiowb() afterwards if you're going to use __raw_writeX() there.
However that path is only used for very old hardware (T4), so I wouldn't
worry about them. 

Steve.


> ---
>  drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/cxgb4/t4.h
> b/drivers/infiniband/hw/cxgb4/t4.h
> index 8369c7c..7a48c9e 100644
> --- a/drivers/infiniband/hw/cxgb4/t4.h
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
>  				 (u64 *)wqe);
>  		} else {
>  			pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -			writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -			       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >sq.bar2_qid),
> +				       wq->sq.bar2_va +
> SGE_UDB_KDOORBELL);
>  		}
> 
>  		/* Flush user doorbell area writes. */
>  		wmb();
>  		return;
>  	}
> -	writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +	writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
>  				 (void *)wqe);
>  		} else {
>  			pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -			writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -			       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >rq.bar2_qid),
> +				       wq->rq.bar2_va +
> SGE_UDB_KDOORBELL);
>  		}
> 
>  		/* Flush user doorbell area writes. */
>  		wmb();
>  		return;
>  	}
> -	writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +	writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>  }
> 
>  static inline int t4_wq_in_error(struct t4_wq *wq)
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 21:05   ` Steve Wise
@ 2018-03-16 21:46     ` Sinan Kaya
  2018-03-16 23:05       ` Steve Wise
  2018-03-16 22:13     ` Jason Gunthorpe
  1 sibling, 1 reply; 19+ messages in thread
From: Sinan Kaya @ 2018-03-16 21:46 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

On 3/16/2018 5:05 PM, Steve Wise wrote:
>> Code includes wmb() followed by writel(). writel() already has a barrier
> on
>> some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> writeX().  
> 
> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

OK. I can do that but isn't the problem at PowerPC adaptation?

/*
 * We don't do relaxed operations yet, at least not with this semantic
 */
#define readb_relaxed(addr)	readb(addr)
#define readw_relaxed(addr)	readw(addr)
#define readl_relaxed(addr)	readl(addr)
#define readq_relaxed(addr)	readq(addr)
#define writeb_relaxed(v, addr)	writeb(v, addr)
#define writew_relaxed(v, addr)	writew(v, addr)
#define writel_relaxed(v, addr)	writel(v, addr)
#define writeq_relaxed(v, addr)	writeq(v, addr)

Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

>From API perspective both __raw_writeX() and writeX_relaxed() are correct.
It is just PowerPC doesn't seem the follow the definition yet.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 21:05   ` Steve Wise
  2018-03-16 21:46     ` Sinan Kaya
@ 2018-03-16 22:13     ` Jason Gunthorpe
  2018-03-16 23:04       ` Steve Wise
  1 sibling, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2018-03-16 22:13 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Sinan Kaya', netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > Code includes wmb() followed by writel(). writel() already has a barrier
> on
> > some architectures like arm64.
> > 
> > This ends up CPU observing two barriers back to back before executing the
> > register write.
> > 
> > Since code already has an explicit barrier call, changing writel() to
> > writel_relaxed().
> > 
> > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> writeX().  

?? Why is changing writex() to writeX() a NAK then?

> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

That doesn't seem semanticaly sane.

__raw_writeX() should not appear in driver code, IMHO. Only the arch
code can know what the exact semantics of that accessor are..

If ppc can't use writel_relaxed to optimize then we probably need yet
another io accessor semantic defined :(

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 22:13     ` Jason Gunthorpe
@ 2018-03-16 23:04       ` Steve Wise
  2018-03-17  4:08         ` Timur Tabi
  0 siblings, 1 reply; 19+ messages in thread
From: Steve Wise @ 2018-03-16 23:04 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Sinan Kaya', netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

> 
> On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > > Code includes wmb() followed by writel(). writel() already has a
barrier
> > on
> > > some architectures like arm64.
> > >
> > > This ends up CPU observing two barriers back to back before executing
> the
> > > register write.
> > >
> > > Since code already has an explicit barrier call, changing writel() to
> > > writel_relaxed().
> > >
> > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >
> > NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is
just
> > writeX().
> 
> ?? Why is changing writex() to writeX() a NAK then?

Because I want it correct for PPC as well.

> 
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(),
to
> > get rid of the extra barrier for all architectures.
> 
> That doesn't seem semanticaly sane.
> 
> __raw_writeX() should not appear in driver code, IMHO. Only the arch
> code can know what the exact semantics of that accessor are..
> 
> If ppc can't use writel_relaxed to optimize then we probably need yet
> another io accessor semantic defined :(


Anybody understand why the PPC implementation of writeX_relaxed() isn't
relaxed?


Steve.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 21:46     ` Sinan Kaya
@ 2018-03-16 23:05       ` Steve Wise
  2018-03-17  3:40         ` Sinan Kaya
  0 siblings, 1 reply; 19+ messages in thread
From: Steve Wise @ 2018-03-16 23:05 UTC (permalink / raw)
  To: 'Sinan Kaya', netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

> 
> On 3/16/2018 5:05 PM, Steve Wise wrote:
> >> Code includes wmb() followed by writel(). writel() already has a barrier
> > on
> >> some architectures like arm64.
> >>
> >> This ends up CPU observing two barriers back to back before executing
> the
> >> register write.
> >>
> >> Since code already has an explicit barrier call, changing writel() to
> >> writel_relaxed().
> >>
> >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >
> > NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> > writeX().
> >
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> > get rid of the extra barrier for all architectures.
> 
> OK. I can do that but isn't the problem at PowerPC adaptation?
> 
> /*
>  * We don't do relaxed operations yet, at least not with this semantic
>  */
> #define readb_relaxed(addr)	readb(addr)
> #define readw_relaxed(addr)	readw(addr)
> #define readl_relaxed(addr)	readl(addr)
> #define readq_relaxed(addr)	readq(addr)
> #define writeb_relaxed(v, addr)	writeb(v, addr)
> #define writew_relaxed(v, addr)	writew(v, addr)
> #define writel_relaxed(v, addr)	writel(v, addr)
> #define writeq_relaxed(v, addr)	writeq(v, addr)
> 
> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?


> 
> >From API perspective both __raw_writeX() and writeX_relaxed() are
> correct.
> It is just PowerPC doesn't seem the follow the definition yet.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 23:05       ` Steve Wise
@ 2018-03-17  3:40         ` Sinan Kaya
  2018-03-17  4:03           ` Sinan Kaya
  0 siblings, 1 reply; 19+ messages in thread
From: Sinan Kaya @ 2018-03-17  3:40 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

On 3/16/2018 7:05 PM, Steve Wise wrote:
>>
>> On 3/16/2018 5:05 PM, Steve Wise wrote:
>>>> Code includes wmb() followed by writel(). writel() already has a barrier
>>> on
>>>> some architectures like arm64.
>>>>
>>>> This ends up CPU observing two barriers back to back before executing
>> the
>>>> register write.
>>>>
>>>> Since code already has an explicit barrier call, changing writel() to
>>>> writel_relaxed().
>>>>
>>>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>>>
>>> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
>>> writeX().
>>>
>>> I was just looking at this with Chelsio developers, and they said the
>>> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
>>> get rid of the extra barrier for all architectures.
>>
>> OK. I can do that but isn't the problem at PowerPC adaptation?
>>
>> /*
>>  * We don't do relaxed operations yet, at least not with this semantic
>>  */
>> #define readb_relaxed(addr)	readb(addr)
>> #define readw_relaxed(addr)	readw(addr)
>> #define readl_relaxed(addr)	readl(addr)
>> #define readq_relaxed(addr)	readq(addr)
>> #define writeb_relaxed(v, addr)	writeb(v, addr)
>> #define writew_relaxed(v, addr)	writew(v, addr)
>> #define writel_relaxed(v, addr)	writel(v, addr)
>> #define writeq_relaxed(v, addr)	writeq(v, addr)
>>
>> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?
> 
> I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?
> 

I found this article: https://lwn.net/Articles/698014/#PowerPC%20Implementation

Apparently, this issue was discussed at a conference in 2015.

Based on how I read this article, writel() and writel_relaxed() behave exactly
the same on PowerPC because the barrier is enforced by the time code is leaving
a critical section not during MMIO execution.

I also see that __raw_writel() is being heavily used in drivers directory.

https://elixir.bootlin.com/linux/latest/ident/__raw_writel

I'll change writel_relaxed() with __raw_writel() in the series like you suggested
and also look at your other comments.

This seems to be the current norm.

> 
>>
>> >From API perspective both __raw_writeX() and writeX_relaxed() are
>> correct.
>> It is just PowerPC doesn't seem the follow the definition yet.
> 
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  3:40         ` Sinan Kaya
@ 2018-03-17  4:03           ` Sinan Kaya
  2018-03-17  4:25             ` Sinan Kaya
  0 siblings, 1 reply; 19+ messages in thread
From: Sinan Kaya @ 2018-03-17  4:03 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> and also look at your other comments.

I spoke too soon.

Now that I realized, code needs to follow one of the following patterns for correctness

1)
wmb()
writel()/writel_relaxed()

or 

2)
wmb()
__raw_wrltel()
mmiowb()

but definitely not

wmb()
__raw_wrltel()

Since #1 == #2, I'll stick to my current implementation of writel_relaxed()

Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
for correctness. ARM's mmiowb() implementation is empty.

So, there is no one size fits all solution with the current state of affairs.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 23:04       ` Steve Wise
@ 2018-03-17  4:08         ` Timur Tabi
  0 siblings, 0 replies; 19+ messages in thread
From: Timur Tabi @ 2018-03-17  4:08 UTC (permalink / raw)
  To: Steve Wise, 'Jason Gunthorpe',
	linuxppc-dev@lists.ozlabs.org
  Cc: 'Sinan Kaya', netdev, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On 3/16/18 6:04 PM, Steve Wise wrote:
> Anybody understand why the PPC implementation of writeX_relaxed() isn't
> relaxed?

You probably should ask that on the linuxppc-dev@lists.ozlabs.org 
mailing list.

I've always wondered why PowerPC has non-standard I/O accessors.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:03           ` Sinan Kaya
@ 2018-03-17  4:25             ` Sinan Kaya
  2018-03-17  4:30               ` Timur Tabi
                                 ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Sinan Kaya @ 2018-03-17  4:25 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>> and also look at your other comments.
> 
> I spoke too soon.
> 
> Now that I realized, code needs to follow one of the following patterns for correctness
> 
> 1)
> wmb()
> writel()/writel_relaxed()
> 
> or 
> 
> 2)
> wmb()
> __raw_wrltel()
> mmiowb()
> 
> but definitely not
> 
> wmb()
> __raw_wrltel()
> 
> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> 
> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> for correctness. ARM's mmiowb() implementation is empty.
> 
> So, there is no one size fits all solution with the current state of affairs.
> 
> 

I think I finally got what you mean.

Code seems to have

wmb()
writel()/writeq()
wmb()

this can be safely replaced with

wmb()
__raw_writel()/__raw_writeq()
wmb()

This will work on all arches. Below is the new version. Let me know if this is OK.

+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
        int count = 8;

        while (count) {
-               writeq(*src, dst);
+               __raw_writeq(*src, dst);
                src++;
                dst++;
                count--;
@@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
                                 (u64 *)wqe);
                } else {
                        pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
                }

                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       mmiowmb()
 }

 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
                                 (void *)wqe);
                } else {
                        pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
                }

                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       mmiowmb();
 }


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
@ 2018-03-17  4:30               ` Timur Tabi
  2018-03-17 13:23               ` Steve Wise
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Timur Tabi @ 2018-03-17  4:30 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Steve Wise, netdev, Timur Tabi, sulrich, Casey Leedom, Steve Wise,
	linux-rdma, linux-arm-msm, lkml, Jason Gunthorpe, Doug Ledford,
	Michael Werner, linux-arm-kernel

On Fri, Mar 16, 2018 at 11:25 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),

I think Jason might be right that drivers shouldn't be calling the
__raw_write functions.  You definitely need to run this by the PPC
developers, specifically Ben Herrenschmidt.


-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
  2018-03-17  4:30               ` Timur Tabi
@ 2018-03-17 13:23               ` Steve Wise
  2018-03-17 13:27               ` David Miller
  2018-03-17 15:05               ` Jason Gunthorpe
  3 siblings, 0 replies; 19+ messages in thread
From: Steve Wise @ 2018-03-17 13:23 UTC (permalink / raw)
  To: 'Sinan Kaya', netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe', linux-rdma,
	linux-kernel, 'Michael Werner', 'Casey Leedom'

> 
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you
> suggested
> >> and also look at your other comments.
> >
> > I spoke too soon.
> >
> > Now that I realized, code needs to follow one of the following patterns
> for correctness
> >
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> >
> > or
> >
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> >
> > but definitely not
> >
> > wmb()
> > __raw_wrltel()
> >
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> >
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough.
> PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> >
> > So, there is no one size fits all solution with the current state of affairs.
> >
> >
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is
> OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64
> *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb()
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb();
>  }
> 
> 

Yes, this is what chelsio recommended to me.  

Reviewed-by: Steve Wise <swise@opengridcomputing.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
  2018-03-17  4:30               ` Timur Tabi
  2018-03-17 13:23               ` Steve Wise
@ 2018-03-17 13:27               ` David Miller
  2018-03-17 15:05               ` Jason Gunthorpe
  3 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2018-03-17 13:27 UTC (permalink / raw)
  To: okaya
  Cc: swise, netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	swise, dledford, jgg, linux-rdma, linux-kernel, werner, leedom

From: Sinan Kaya <okaya@codeaurora.org>
Date: Sat, 17 Mar 2018 00:25:14 -0400

> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.

Unfortunately, I think this won't work.

At least on sparc, the __raw_*() variants also change the endianness to
native endianness.

PowerPC does this as well, even documented in a comment :-)

/*
 * Non ordered and non-swapping "raw" accessors
 */

Only the non-__raw_ variants do the endianness swap to/from
little-endian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
                                 ` (2 preceding siblings ...)
  2018-03-17 13:27               ` David Miller
@ 2018-03-17 15:05               ` Jason Gunthorpe
  2018-03-17 18:30                 ` Sinan Kaya
  3 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2018-03-17 15:05 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Steve Wise, netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> >> and also look at your other comments.
> > 
> > I spoke too soon.
> > 
> > Now that I realized, code needs to follow one of the following patterns for correctness
> > 
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> > 
> > or 
> > 
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> > 
> > but definitely not
> > 
> > wmb()
> > __raw_wrltel()
> > 
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> > 
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> > 
> > So, there is no one size fits all solution with the current state of affairs.
> > 
> > 
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb()
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb();

No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!

Your first patch was right, replacing
  wmb()
  writel()

With wmb(); writel_relaxed() is fine, and helps ARM.

The problem with PPC has nothing to do with the writel, it is with the
spinlock, and we can't improve it without adding some new
infrastructure. Certainly don't hack around the problem by using
__raw_writel in multi-arch drivers!

In userspace we settled on something like this as a pattern:

 mmio_wc_spin_lock()
 writel_wc_mmio()
 mmio_wc_spin_unlock()

Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
fully contain all MMIO access to WC and UC memory.

Due to our userspace the pattern is specifically used with MMIO writes
to WC BAR memory, but it could be generalized..

This allows all known arches to use the best code in all cases. x86
can even optimize a lot and combine the mmiowmb() into the
spin_unlock, apparently.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17 15:05               ` Jason Gunthorpe
@ 2018-03-17 18:30                 ` Sinan Kaya
  2018-03-19  1:48                   ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Sinan Kaya @ 2018-03-17 18:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Steve Wise, netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom',
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

+linuxppc-dev@lists.ozlabs.org

On 3/17/2018 11:05 AM, Jason Gunthorpe wrote:
> On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
>> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
>>> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>>>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>>>> and also look at your other comments.
>>>
>>> I spoke too soon.
>>>
>>> Now that I realized, code needs to follow one of the following patterns for correctness
>>>
>>> 1)
>>> wmb()
>>> writel()/writel_relaxed()
>>>
>>> or 
>>>
>>> 2)
>>> wmb()
>>> __raw_wrltel()
>>> mmiowb()
>>>
>>> but definitely not
>>>
>>> wmb()
>>> __raw_wrltel()
>>>
>>> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>>>
>>> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
>>> for correctness. ARM's mmiowb() implementation is empty.
>>>
>>> So, there is no one size fits all solution with the current state of affairs.
>>>
>>>
>>
>> I think I finally got what you mean.
>>
>> Code seems to have
>>
>> wmb()
>> writel()/writeq()
>> wmb()
>>
>> this can be safely replaced with
>>
>> wmb()
>> __raw_writel()/__raw_writeq()
>> wmb()
>>
>> This will work on all arches. Below is the new version. Let me know if this is OK.
>>
>> +++ b/drivers/infiniband/hw/cxgb4/t4.h
>> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>>         int count = 8;
>>
>>         while (count) {
>> -               writeq(*src, dst);
>> +               __raw_writeq(*src, dst);
>>                 src++;
>>                 dst++;
>>                 count--;
>> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>>                                  (u64 *)wqe);
>>                 } else {
>>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
>> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>>                 }
>>
>>                 /* Flush user doorbell area writes. */
>>                 wmb();
>>                 return;
>>         }
>> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +       mmiowmb()
>>  }
>>
>>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>>                                  (void *)wqe);
>>                 } else {
>>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
>> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>>                 }
>>
>>                 /* Flush user doorbell area writes. */
>>                 wmb();
>>                 return;
>>         }
>> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +       mmiowmb();
> 
> No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!
> 
> Your first patch was right, replacing
>   wmb()
>   writel()
> 
> With wmb(); writel_relaxed() is fine, and helps ARM.
> 
> The problem with PPC has nothing to do with the writel, it is with the
> spinlock, and we can't improve it without adding some new
> infrastructure. Certainly don't hack around the problem by using
> __raw_writel in multi-arch drivers!
> 
> In userspace we settled on something like this as a pattern:
> 
>  mmio_wc_spin_lock()
>  writel_wc_mmio()
>  mmio_wc_spin_unlock()

> 
> Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
> fully contain all MMIO access to WC and UC memory.
> 
> Due to our userspace the pattern is specifically used with MMIO writes
> to WC BAR memory, but it could be generalized..
> 
> This allows all known arches to use the best code in all cases. x86
> can even optimize a lot and combine the mmiowmb() into the
> spin_unlock, apparently.

Given both Dave [1] and Jason's feedback, we have to ask PowerPC developers
to fix writel_relaxed().

Otherwise, PowerPC won't be able to take advantage of the optimizations
introduced in this series.

I don't think we need writel_really_relaxed() API.

Somebody also has to take a task and work very hard to get rid of __raw_writeX()
APIs in drivers/net directory. It looked like a very common practice though
it clearly violates multiarch portability concerns Jason and Deve highlighted.

This will be the next version:

iff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..6e5658a 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
        int count = 8;
 
        while (count) {
-               writeq(*src, dst);
+               writeq_relaxed(*src, dst);
                src++;
                dst++;
                count--;
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
                                 (u64 *)wqe);
                } else {
                        pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+                       writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+                                      wq->sq.bar2_va + SGE_UDB_KDOORBELL);
                }
 
                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
                                 (void *)wqe);
                } else {
                        pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+                       writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+                                      wq->rq.bar2_va + SGE_UDB_KDOORBELL);
                }
 
                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
 }


[1] https://lkml.org/lkml/2018/3/17/100

> 
> Jason
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17 18:30                 ` Sinan Kaya
@ 2018-03-19  1:48                   ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2018-03-19  1:48 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Steve Wise, netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom',
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Sat, Mar 17, 2018 at 02:30:10PM -0400, Sinan Kaya wrote:

> Somebody also has to take a task and work very hard to get rid of __raw_writeX()
> APIs in drivers/net directory. It looked like a very common practice though
> it clearly violates multiarch portability concerns Jason and Deve highlighted.

When you posted your list I thought most of the hits were in what I'd
think of 'one-arch drivers', eg an IRQ controller or clock driver or
something.. Some might have a reason for it (eg avoiding the swap, for
instance), maybe it is a hold over from before writel_relaxed, or
maybe it is just a cargo-cult behavior..

It is the obviously multi-arch drivers that probably need some
attention..

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-03-19  1:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
2018-03-16 16:16 ` [PATCH v3 15/18] RDMA/bnxt_re: Eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 16/18] IB/mlx4: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 17/18] RDMA/i40iw: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
2018-03-16 21:05   ` Steve Wise
2018-03-16 21:46     ` Sinan Kaya
2018-03-16 23:05       ` Steve Wise
2018-03-17  3:40         ` Sinan Kaya
2018-03-17  4:03           ` Sinan Kaya
2018-03-17  4:25             ` Sinan Kaya
2018-03-17  4:30               ` Timur Tabi
2018-03-17 13:23               ` Steve Wise
2018-03-17 13:27               ` David Miller
2018-03-17 15:05               ` Jason Gunthorpe
2018-03-17 18:30                 ` Sinan Kaya
2018-03-19  1:48                   ` Jason Gunthorpe
2018-03-16 22:13     ` Jason Gunthorpe
2018-03-16 23:04       ` Steve Wise
2018-03-17  4:08         ` Timur Tabi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).