* [PATCH for-next 0/3] RDMA/erdma: Support flushing all WRs after QP state changed to ERROR
@ 2022-11-16 2:31 Cheng Xu
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
Hi,
This series introduces support for flushing all WRs posted to hardware
after the QP state changes to ERROR.
Old firmware may not flush newly posted WRs after the QP state changes to
ERROR, because it is difficult for firmware to obtain the real-time
PI (producer index) of QPs, especially for the RQs.
Previously we tried to avoid this issue by implementing custom
drain_{sq/rq} [1], but that approach has a flaw, as Tom and Jason pointed
out, which we have also hit in some scenarios, for example NoF fatal
recovery.
So, we introduce a new mechanism to fix this. When registering the ibdev,
we create a workqueue for reflushing (we name it "reflush" because
hardware has already started flushing the QPs at that point, and the
workqueue is used to make hardware flush the newly posted WRs). Whenever a
QP needs to flush WRs, or new WRs are posted after flushing has begun, we
queue a delayed work on the workqueue, or update its delay if it is
already queued. In the work, the driver notifies the latest PIs to
firmware via CMDQ, so that firmware can flush all the newly posted WRs.
For now this applies to kernel QPs only.
- #1 adds a workqueue for WRs reflushing.
- #2 adds a reflushing work for each QP.
- #3 notifies the latest PIs to firmware for reflushing.
[1] https://lore.kernel.org/all/20220824094251.23190-3-chengyou@linux.alibaba.com/t/
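As a rough illustration of the debouncing pattern described above (the
identifiers `my_qp`, `my_reflush_worker`, and `REFLUSH_DELAY_US` below are
simplified assumptions, not the actual erdma names, and this kernel-context
fragment is a sketch rather than a compilable unit):

```c
/* Sketch of the "reflush" debounce pattern: one delayed work per QP,
 * re-armed on every new post while the QP is flushing, so a burst of
 * posts collapses into a single PI notification to firmware. */
#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define REFLUSH_DELAY_US 100 /* assumed coalescing window */

struct my_qp {
	struct workqueue_struct *reflush_wq; /* shared per-device workqueue */
	struct delayed_work reflush_dwork;   /* per-QP reflush work item */
};

static void my_reflush_worker(struct work_struct *work)
{
	struct my_qp *qp = container_of(to_delayed_work(work),
					struct my_qp, reflush_dwork);

	/* Here the driver would post a CMDQ command carrying the latest
	 * SQ/RQ PIs so firmware can flush the newly posted WRs. */
	(void)qp;
}

/* Called when the QP enters ERROR, and again from post_send/post_recv
 * while flushing. mod_delayed_work() queues the work if it is idle, or
 * just pushes back its timer if it is already pending, so only one CMDQ
 * command is issued per burst of activity. */
static void kick_reflush(struct my_qp *qp)
{
	mod_delayed_work(qp->reflush_wq, &qp->reflush_dwork,
			 usecs_to_jiffies(REFLUSH_DELAY_US));
}
```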
Thanks,
Cheng Xu
Cheng Xu (3):
RDMA/erdma: Add a workqueue for WRs reflushing
RDMA/erdma: Implement the lifecycle of reflushing work for each QP
RDMA/erdma: Notify the latest PI to FW for reflushing when necessary
drivers/infiniband/hw/erdma/erdma.h | 1 +
drivers/infiniband/hw/erdma/erdma_hw.h | 8 ++++++
drivers/infiniband/hw/erdma/erdma_main.c | 14 +++++++++--
drivers/infiniband/hw/erdma/erdma_qp.c | 30 ++++++++++++++++-------
drivers/infiniband/hw/erdma/erdma_verbs.c | 18 ++++++++++++++
drivers/infiniband/hw/erdma/erdma_verbs.h | 7 ++++++
6 files changed, 67 insertions(+), 11 deletions(-)
--
2.27.0
* [PATCH for-next 1/3] RDMA/erdma: Add a workqueue for WRs reflushing
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
The ERDMA driver uses a workqueue for asynchronous reflush command
posting. Implement the lifecycle of this workqueue.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
drivers/infiniband/hw/erdma/erdma.h | 1 +
drivers/infiniband/hw/erdma/erdma_main.c | 14 ++++++++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/hw/erdma/erdma.h b/drivers/infiniband/hw/erdma/erdma.h
index bb23d897c710..7bd053a1147a 100644
--- a/drivers/infiniband/hw/erdma/erdma.h
+++ b/drivers/infiniband/hw/erdma/erdma.h
@@ -190,6 +190,7 @@ struct erdma_dev {
struct net_device *netdev;
struct pci_dev *pdev;
struct notifier_block netdev_nb;
+ struct workqueue_struct *reflush_wq;
resource_size_t func_bar_addr;
resource_size_t func_bar_len;
diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c
index e44b06fea595..5dc31e5df5cb 100644
--- a/drivers/infiniband/hw/erdma/erdma_main.c
+++ b/drivers/infiniband/hw/erdma/erdma_main.c
@@ -521,13 +521,22 @@ static int erdma_ib_device_add(struct pci_dev *pdev)
u64_to_ether_addr(mac, dev->attrs.peer_addr);
+ dev->reflush_wq = alloc_workqueue("erdma-reflush-wq", WQ_UNBOUND,
+ WQ_UNBOUND_MAX_ACTIVE);
+ if (!dev->reflush_wq) {
+ ret = -ENOMEM;
+ goto err_alloc_workqueue;
+ }
+
ret = erdma_device_register(dev);
if (ret)
- goto err_out;
+ goto err_register;
return 0;
-err_out:
+err_register:
+ destroy_workqueue(dev->reflush_wq);
+err_alloc_workqueue:
xa_destroy(&dev->qp_xa);
xa_destroy(&dev->cq_xa);
@@ -543,6 +552,7 @@ static void erdma_ib_device_remove(struct pci_dev *pdev)
unregister_netdevice_notifier(&dev->netdev_nb);
ib_unregister_device(&dev->ibdev);
+ destroy_workqueue(dev->reflush_wq);
erdma_res_cb_free(dev);
xa_destroy(&dev->qp_xa);
xa_destroy(&dev->cq_xa);
--
2.27.0
* [PATCH for-next 2/3] RDMA/erdma: Implement the lifecycle of reflushing work for each QP
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
Each QP has a work item for the reflushing purpose. In this work, the
driver reports the latest PI to hardware.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
drivers/infiniband/hw/erdma/erdma_hw.h | 8 ++++++++
drivers/infiniband/hw/erdma/erdma_verbs.c | 18 ++++++++++++++++++
drivers/infiniband/hw/erdma/erdma_verbs.h | 2 ++
3 files changed, 28 insertions(+)
diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
index 1b2e2b70678f..ab371fec610c 100644
--- a/drivers/infiniband/hw/erdma/erdma_hw.h
+++ b/drivers/infiniband/hw/erdma/erdma_hw.h
@@ -145,6 +145,7 @@ enum CMDQ_RDMA_OPCODE {
CMDQ_OPCODE_MODIFY_QP = 3,
CMDQ_OPCODE_CREATE_CQ = 4,
CMDQ_OPCODE_DESTROY_CQ = 5,
+ CMDQ_OPCODE_REFLUSH = 6,
CMDQ_OPCODE_REG_MR = 8,
CMDQ_OPCODE_DEREG_MR = 9
};
@@ -301,6 +302,13 @@ struct erdma_cmdq_destroy_qp_req {
u32 qpn;
};
+struct erdma_cmdq_reflush_req {
+ u64 hdr;
+ u32 qpn;
+ u32 sq_pi;
+ u32 rq_pi;
+};
+
/* cap qword 0 definition */
#define ERDMA_CMD_DEV_CAP_MAX_CQE_MASK GENMASK_ULL(47, 40)
#define ERDMA_CMD_DEV_CAP_FLAGS_MASK GENMASK_ULL(31, 24)
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index d843ce1f35f3..5dab1e87975b 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -379,6 +379,21 @@ int erdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
return 0;
}
+static void erdma_flush_worker(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+ struct erdma_qp *qp =
+ container_of(dwork, struct erdma_qp, reflush_dwork);
+ struct erdma_cmdq_reflush_req req;
+
+ erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA,
+ CMDQ_OPCODE_REFLUSH);
+ req.qpn = QP_ID(qp);
+ req.sq_pi = qp->kern_qp.sq_pi;
+ req.rq_pi = qp->kern_qp.rq_pi;
+ erdma_post_cmd_wait(&qp->dev->cmdq, &req, sizeof(req), NULL, NULL);
+}
+
static int erdma_qp_validate_cap(struct erdma_dev *dev,
struct ib_qp_init_attr *attrs)
{
@@ -735,6 +750,7 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
qp->attrs.max_send_sge = attrs->cap.max_send_sge;
qp->attrs.max_recv_sge = attrs->cap.max_recv_sge;
qp->attrs.state = ERDMA_QP_STATE_IDLE;
+ INIT_DELAYED_WORK(&qp->reflush_dwork, erdma_flush_worker);
ret = create_qp_cmd(dev, qp);
if (ret)
@@ -1028,6 +1044,8 @@ int erdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
erdma_modify_qp_internal(qp, &qp_attrs, ERDMA_QP_ATTR_STATE);
up_write(&qp->state_lock);
+ cancel_delayed_work_sync(&qp->reflush_dwork);
+
erdma_cmdq_build_reqhdr(&req.hdr, CMDQ_SUBMOD_RDMA,
CMDQ_OPCODE_DESTROY_QP);
req.qpn = QP_ID(qp);
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
index a5574f0252bb..9f341d032069 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.h
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
@@ -197,6 +197,8 @@ struct erdma_qp {
struct erdma_cep *cep;
struct rw_semaphore state_lock;
+ struct delayed_work reflush_dwork;
+
union {
struct erdma_kqp kern_qp;
struct erdma_uqp user_qp;
--
2.27.0
* [PATCH for-next 3/3] RDMA/erdma: Notify the latest PI to FW for reflushing when necessary
From: Cheng Xu @ 2022-11-16 2:31 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, KaiShen
Firmware is responsible for flushing WRs in HW, but it is difficult for
firmware to get the latest PI of QPs, especially for RQs, after the QP
state changes to ERROR. So we introduce a new CMDQ command, by which the
driver can notify the latest PI to FW, and then FW can flush all posted
WRs.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
drivers/infiniband/hw/erdma/erdma_qp.c | 30 ++++++++++++++++-------
drivers/infiniband/hw/erdma/erdma_verbs.h | 5 ++++
2 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
index 521e97258de7..d088d6bef431 100644
--- a/drivers/infiniband/hw/erdma/erdma_qp.c
+++ b/drivers/infiniband/hw/erdma/erdma_qp.c
@@ -120,6 +120,7 @@ static int erdma_modify_qp_state_to_stop(struct erdma_qp *qp,
int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
enum erdma_qp_attr_mask mask)
{
+ bool need_reflush = false;
int drop_conn, ret = 0;
if (!mask)
@@ -135,6 +136,7 @@ int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
ret = erdma_modify_qp_state_to_rts(qp, attrs, mask);
} else if (attrs->state == ERDMA_QP_STATE_ERROR) {
qp->attrs.state = ERDMA_QP_STATE_ERROR;
+ need_reflush = true;
if (qp->cep) {
erdma_cep_put(qp->cep);
qp->cep = NULL;
@@ -145,17 +147,12 @@ int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
case ERDMA_QP_STATE_RTS:
drop_conn = 0;
- if (attrs->state == ERDMA_QP_STATE_CLOSING) {
+ if (attrs->state == ERDMA_QP_STATE_CLOSING ||
+ attrs->state == ERDMA_QP_STATE_TERMINATE ||
+ attrs->state == ERDMA_QP_STATE_ERROR) {
ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
drop_conn = 1;
- } else if (attrs->state == ERDMA_QP_STATE_TERMINATE) {
- qp->attrs.state = ERDMA_QP_STATE_TERMINATE;
- ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
- drop_conn = 1;
- } else if (attrs->state == ERDMA_QP_STATE_ERROR) {
- ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
- qp->attrs.state = ERDMA_QP_STATE_ERROR;
- drop_conn = 1;
+ need_reflush = true;
}
if (drop_conn)
@@ -180,6 +177,12 @@ int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
break;
}
+ if (need_reflush && !ret && rdma_is_kernel_res(&qp->ibqp.res)) {
+ qp->flags |= ERDMA_QP_IN_FLUSHING;
+ mod_delayed_work(qp->dev->reflush_wq, &qp->reflush_dwork,
+ usecs_to_jiffies(100));
+ }
+
return ret;
}
@@ -527,6 +530,10 @@ int erdma_post_send(struct ib_qp *ibqp, const struct ib_send_wr *send_wr,
}
spin_unlock_irqrestore(&qp->lock, flags);
+ if (unlikely(qp->flags & ERDMA_QP_IN_FLUSHING))
+ mod_delayed_work(qp->dev->reflush_wq, &qp->reflush_dwork,
+ usecs_to_jiffies(100));
+
return ret;
}
@@ -580,5 +587,10 @@ int erdma_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *recv_wr,
}
spin_unlock_irqrestore(&qp->lock, flags);
+
+ if (unlikely(qp->flags & ERDMA_QP_IN_FLUSHING))
+ mod_delayed_work(qp->dev->reflush_wq, &qp->reflush_dwork,
+ usecs_to_jiffies(100));
+
return ret;
}
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
index 9f341d032069..e0a993bc032a 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.h
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
@@ -173,6 +173,10 @@ enum erdma_qp_attr_mask {
ERDMA_QP_ATTR_MPA = (1 << 7)
};
+enum erdma_qp_flags {
+ ERDMA_QP_IN_FLUSHING = (1 << 0),
+};
+
struct erdma_qp_attrs {
enum erdma_qp_state state;
enum erdma_cc_alg cc; /* Congestion control algorithm */
@@ -197,6 +201,7 @@ struct erdma_qp {
struct erdma_cep *cep;
struct rw_semaphore state_lock;
+ unsigned long flags;
struct delayed_work reflush_dwork;
union {
--
2.27.0
* Re: [PATCH for-next 0/3] RDMA/erdma: Support flushing all WRs after QP state changed to ERROR
From: Jason Gunthorpe @ 2022-11-24 19:00 UTC (permalink / raw)
To: Cheng Xu; +Cc: leon, linux-rdma, KaiShen
On Wed, Nov 16, 2022 at 10:31:04AM +0800, Cheng Xu wrote:
> Hi,
>
> This series introduces support for flushing all WRs posted to hardware
> after the QP state changes to ERROR.
>
> Old firmware may not flush newly posted WRs after the QP state changes to
> ERROR, because it is difficult for firmware to obtain the real-time
> PI (producer index) of QPs, especially for the RQs.
>
> Previously we tried to avoid this issue by implementing custom
> drain_{sq/rq} [1], but that approach has a flaw, as Tom and Jason pointed
> out, which we have also hit in some scenarios, for example NoF fatal
> recovery.
>
> So, we introduce a new mechanism to fix this. When registering the ibdev,
> we create a workqueue for reflushing (we name it "reflush" because
> hardware has already started flushing the QPs at that point, and the
> workqueue is used to make hardware flush the newly posted WRs). Whenever a
> QP needs to flush WRs, or new WRs are posted after flushing has begun, we
> queue a delayed work on the workqueue, or update its delay if it is
> already queued. In the work, the driver notifies the latest PIs to
> firmware via CMDQ, so that firmware can flush all the newly posted WRs.
> For now this applies to kernel QPs only.
>
> - #1 adds a workqueue for WRs reflushing.
> - #2 adds a reflushing work for each QP.
> - #3 notifies the latest PIs to firmware for reflushing.
>
> [1] https://lore.kernel.org/all/20220824094251.23190-3-chengyou@linux.alibaba.com/t/
>
> Thanks,
> Cheng Xu
>
> Cheng Xu (3):
> RDMA/erdma: Add a workqueue for WRs reflushing
> RDMA/erdma: Implement the lifecycle of reflushing work for each QP
> RDMA/erdma: Notify the latest PI to FW for reflushing when necessary
Applied to for-next, thanks
Jason