* [PATCH for-next 1/3] RDMA/erdma: Add a workqueue for WRs reflushing
2022-11-16 2:31 [PATCH for-next 0/3] RDMA/erdma: Support flushing all WRs after QP state changed to ERROR Cheng Xu
@ 2022-11-16 2:31 ` Cheng Xu
2022-11-16 2:31 ` [PATCH for-next 2/3] RDMA/erdma: Implement the lifecycle of reflushing work for each QP Cheng Xu
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
The ERDMA driver uses a workqueue for asynchronous reflush command posting.
Implement the lifecycle of this workqueue.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
drivers/infiniband/hw/erdma/erdma.h | 1 +
drivers/infiniband/hw/erdma/erdma_main.c | 14 ++++++++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/hw/erdma/erdma.h b/drivers/infiniband/hw/erdma/erdma.h
index bb23d897c710..7bd053a1147a 100644
--- a/drivers/infiniband/hw/erdma/erdma.h
+++ b/drivers/infiniband/hw/erdma/erdma.h
@@ -190,6 +190,7 @@ struct erdma_dev {
struct net_device *netdev;
struct pci_dev *pdev;
struct notifier_block netdev_nb;
+ struct workqueue_struct *reflush_wq;
resource_size_t func_bar_addr;
resource_size_t func_bar_len;
diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c
index e44b06fea595..5dc31e5df5cb 100644
--- a/drivers/infiniband/hw/erdma/erdma_main.c
+++ b/drivers/infiniband/hw/erdma/erdma_main.c
@@ -521,13 +521,22 @@ static int erdma_ib_device_add(struct pci_dev *pdev)
u64_to_ether_addr(mac, dev->attrs.peer_addr);
+ dev->reflush_wq = alloc_workqueue("erdma-reflush-wq", WQ_UNBOUND,
+ WQ_UNBOUND_MAX_ACTIVE);
+ if (!dev->reflush_wq) {
+ ret = -ENOMEM;
+ goto err_alloc_workqueue;
+ }
+
ret = erdma_device_register(dev);
if (ret)
- goto err_out;
+ goto err_register;
return 0;
-err_out:
+err_register:
+ destroy_workqueue(dev->reflush_wq);
+err_alloc_workqueue:
xa_destroy(&dev->qp_xa);
xa_destroy(&dev->cq_xa);
@@ -543,6 +552,7 @@ static void erdma_ib_device_remove(struct pci_dev *pdev)
unregister_netdevice_notifier(&dev->netdev_nb);
ib_unregister_device(&dev->ibdev);
+ destroy_workqueue(dev->reflush_wq);
erdma_res_cb_free(dev);
xa_destroy(&dev->qp_xa);
xa_destroy(&dev->cq_xa);
--
2.27.0
* [PATCH for-next 2/3] RDMA/erdma: Implement the lifecycle of reflushing work for each QP
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
Each QP has a work item for the reflushing purpose. In the work handler,
the driver reports the latest PI to the hardware.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
drivers/infiniband/hw/erdma/erdma_hw.h | 8 ++++++++
drivers/infiniband/hw/erdma/erdma_verbs.c | 18 ++++++++++++++++++
drivers/infiniband/hw/erdma/erdma_verbs.h | 2 ++
3 files changed, 28 insertions(+)
diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
index 1b2e2b70678f..ab371fec610c 100644
--- a/drivers/infiniband/hw/erdma/erdma_hw.h
+++ b/drivers/infiniband/hw/erdma/erdma_hw.h
@@ -145,6 +145,7 @@ enum CMDQ_RDMA_OPCODE {
CMDQ_OPCODE_MODIFY_QP = 3,
CMDQ_OPCODE_CREATE_CQ = 4,
CMDQ_OPCODE_DESTROY_CQ = 5,
+ CMDQ_OPCODE_REFLUSH = 6,
CMDQ_OPCODE_REG_MR = 8,
CMDQ_OPCODE_DEREG_MR = 9
};
@@ -301,6 +302,13 @@ struct erdma_cmdq_destroy_qp_req {
u32 qpn;
};
+struct erdma_cmdq_reflush_req {
+ u64 hdr;
+ u32 qpn;
+ u32 sq_pi;
+ u32 rq_pi;
+};
+
/* cap qword 0 definition */
#define ERDMA_CMD_DEV_CAP_MAX_CQE_MASK GENMASK_ULL(47, 40)
#define ERDMA_CMD_DEV_CAP_FLAGS_MASK GENMASK_ULL(31, 24)
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index d843ce1f35f3..5dab1e87975b 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -379,6 +379,21 @@ int erdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
return 0;
}
+static void erdma_flush_worker(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct erdma_qp *qp =
+ container_of(dwork, struct erdma_qp, reflush_dwork);
+ struct erdma_cmdq_reflush_req req;
+
+ erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA,
+ CMDQ_OPCODE_REFLUSH);
+ req.qpn = QP_ID(qp);
+ req.sq_pi = qp->kern_qp.sq_pi;
+ req.rq_pi = qp->kern_qp.rq_pi;
+ erdma_post_cmd_wait(&qp->dev->cmdq, &req, sizeof(req), NULL, NULL);
+}
+
static int erdma_qp_validate_cap(struct erdma_dev *dev,
struct ib_qp_init_attr *attrs)
{
@@ -735,6 +750,7 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
qp->attrs.max_send_sge = attrs->cap.max_send_sge;
qp->attrs.max_recv_sge = attrs->cap.max_recv_sge;
qp->attrs.state = ERDMA_QP_STATE_IDLE;
+ INIT_DELAYED_WORK(&qp->reflush_dwork, erdma_flush_worker);
ret = create_qp_cmd(dev, qp);
if (ret)
@@ -1028,6 +1044,8 @@ int erdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
erdma_modify_qp_internal(qp, &qp_attrs, ERDMA_QP_ATTR_STATE);
up_write(&qp->state_lock);
+ cancel_delayed_work_sync(&qp->reflush_dwork);
+
erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA,
CMDQ_OPCODE_DESTROY_QP);
req.qpn = QP_ID(qp);
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
index a5574f0252bb..9f341d032069 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.h
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
@@ -197,6 +197,8 @@ struct erdma_qp {
struct erdma_cep *cep;
struct rw_semaphore state_lock;
+ struct delayed_work reflush_dwork;
+
union {
struct erdma_kqp kern_qp;
struct erdma_uqp user_qp;
--
2.27.0
* [PATCH for-next 3/3] RDMA/erdma: Notify the latest PI to FW for reflushing when necessary
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
Firmware is responsible for flushing WRs in HW, and it is a little
difficult for firmware to get the latest PI of QPs, especially for RQs,
after the QP state has changed to ERROR. So we introduce a new CMDQ
command, by which the driver can notify the latest PI to the FW, and then
the FW can flush all posted WRs.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
drivers/infiniband/hw/erdma/erdma_qp.c | 30 ++++++++++++++++-------
drivers/infiniband/hw/erdma/erdma_verbs.h | 5 ++++
2 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
index 521e97258de7..d088d6bef431 100644
--- a/drivers/infiniband/hw/erdma/erdma_qp.c
+++ b/drivers/infiniband/hw/erdma/erdma_qp.c
@@ -120,6 +120,7 @@ static int erdma_modify_qp_state_to_stop(struct erdma_qp *qp,
int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
enum erdma_qp_attr_mask mask)
{
+ bool need_reflush = false;
int drop_conn, ret = 0;
if (!mask)
@@ -135,6 +136,7 @@ int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
ret = erdma_modify_qp_state_to_rts(qp, attrs, mask);
} else if (attrs->state == ERDMA_QP_STATE_ERROR) {
qp->attrs.state = ERDMA_QP_STATE_ERROR;
+ need_reflush = true;
if (qp->cep) {
erdma_cep_put(qp->cep);
qp->cep = NULL;
@@ -145,17 +147,12 @@ int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
case ERDMA_QP_STATE_RTS:
drop_conn = 0;
- if (attrs->state == ERDMA_QP_STATE_CLOSING) {
+ if (attrs->state == ERDMA_QP_STATE_CLOSING ||
+ attrs->state == ERDMA_QP_STATE_TERMINATE ||
+ attrs->state == ERDMA_QP_STATE_ERROR) {
ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
drop_conn = 1;
- } else if (attrs->state == ERDMA_QP_STATE_TERMINATE) {
- qp->attrs.state = ERDMA_QP_STATE_TERMINATE;
- ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
- drop_conn = 1;
- } else if (attrs->state == ERDMA_QP_STATE_ERROR) {
- ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
- qp->attrs.state = ERDMA_QP_STATE_ERROR;
- drop_conn = 1;
+ need_reflush = true;
}
if (drop_conn)
@@ -180,6 +177,12 @@ int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
break;
}
+ if (need_reflush && !ret && rdma_is_kernel_res(&qp->ibqp.res)) {
+ qp->flags |= ERDMA_QP_IN_FLUSHING;
+ mod_delayed_work(qp->dev->reflush_wq, &qp->reflush_dwork,
+ usecs_to_jiffies(100));
+ }
+
return ret;
}
@@ -527,6 +530,10 @@ int erdma_post_send(struct ib_qp *ibqp, const struct ib_send_wr *send_wr,
}
spin_unlock_irqrestore(&qp->lock, flags);
+ if (unlikely(qp->flags & ERDMA_QP_IN_FLUSHING))
+ mod_delayed_work(qp->dev->reflush_wq, &qp->reflush_dwork,
+ usecs_to_jiffies(100));
+
return ret;
}
@@ -580,5 +587,10 @@ int erdma_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *recv_wr,
}
spin_unlock_irqrestore(&qp->lock, flags);
+
+ if (unlikely(qp->flags & ERDMA_QP_IN_FLUSHING))
+ mod_delayed_work(qp->dev->reflush_wq, &qp->reflush_dwork,
+ usecs_to_jiffies(100));
+
return ret;
}
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
index 9f341d032069..e0a993bc032a 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.h
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
@@ -173,6 +173,10 @@ enum erdma_qp_attr_mask {
ERDMA_QP_ATTR_MPA = (1 << 7)
};
+enum erdma_qp_flags {
+ ERDMA_QP_IN_FLUSHING = (1 << 0),
+};
+
struct erdma_qp_attrs {
enum erdma_qp_state state;
enum erdma_cc_alg cc; /* Congestion control algorithm */
@@ -197,6 +201,7 @@ struct erdma_qp {
struct erdma_cep *cep;
struct rw_semaphore state_lock;
+ unsigned long flags;
struct delayed_work reflush_dwork;
union {
--
2.27.0
* Re: [PATCH for-next 0/3] RDMA/erdma: Support flushing all WRs after QP state changed to ERROR
From: Jason Gunthorpe @ 2022-11-24 19:00 UTC (permalink / raw)
To: Cheng Xu; +Cc: leon, linux-rdma, KaiShen
On Wed, Nov 16, 2022 at 10:31:04AM +0800, Cheng Xu wrote:
> Hi,
>
> This series introduces the support of flushing all WRs posted to hardware
> after QP state changed to ERROR.
>
> Old firmware may not flush the newly posted WRs after the QP state changed
> to ERROR, because it is a little difficult for firmware to get the realtime
> PI (producer index) of QPs, especially for the RQs.
>
> Previously we wanted to avoid this issue by implementing custom
> drain_{sq/rq} [1], but this has a flaw, as Tom and Jason pointed out,
> which we also hit in some scenarios, for example, NoF fatal recovery.
>
> So, we introduce a new mechanism to fix this. When registering the ibdev,
> we create a workqueue for reflushing (we name it "reflush" because the
> hardware has already started flushing the QPs at that time, and this
> mechanism is used to make the hardware flush the newly posted WRs). Once a
> QP needs to flush WRs, or new WRs are posted while flushing, we queue a
> delayed work on the workqueue, or modify its delay if it is already
> queued. In the work handler, the driver notifies the latest PIs to the
> firmware via the CMDQ, so that the firmware can flush all the newly
> posted WRs. This applies to kernel QPs first.
>
> - #1 adds a workqueue for WRs reflushing.
> - #2 adds a reflushing work for each QP.
> - #3 notifies the latest PIs to firmware for reflushing.
>
> [1] https://lore.kernel.org/all/20220824094251.23190-3-chengyou@linux.alibaba.com/t/
>
> Thanks,
> Cheng Xu
>
> Cheng Xu (3):
> RDMA/erdma: Add a workqueue for WRs reflushing
> RDMA/erdma: Implement the lifecycle of reflushing work for each QP
> RDMA/erdma: Notify the latest PI to FW for reflushing when necessary
Applied to for-next, thanks
Jason