* [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory
From: Bob Pearson @ 2022-02-18 0:35 UTC (permalink / raw)
To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson
Well-behaved applications will free all memory allocated by multicast.
If this fails, rdma-core will detach all qp's from multicast groups when
an application terminates. This patch prints a warning and cleans out
memory allocated by multicast if both of these fail to occur.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe.c | 2 ++
drivers/infiniband/sw/rxe/rxe_loc.h | 1 +
drivers/infiniband/sw/rxe/rxe_mcast.c | 38 +++++++++++++++++++++++++++
3 files changed, 41 insertions(+)
diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 3520eb2db685..603b0156f889 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -29,6 +29,8 @@ void rxe_dealloc(struct ib_device *ib_dev)
rxe_pool_cleanup(&rxe->mr_pool);
rxe_pool_cleanup(&rxe->mw_pool);
+ rxe_cleanup_mcast(rxe);
+
if (rxe->tfm)
crypto_free_shash(rxe->tfm);
}
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 409efeecd581..0bc1b7e2877c 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -44,6 +44,7 @@ struct rxe_mcg *rxe_lookup_mcg(struct rxe_dev *rxe, union ib_gid *mgid);
int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid);
int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid);
void rxe_cleanup_mcg(struct kref *kref);
+void rxe_cleanup_mcast(struct rxe_dev *rxe);
/* rxe_mmap.c */
struct rxe_mmap_info {
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 4935fe5c5868..447d78bea28b 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -388,3 +388,41 @@ int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
return rxe_detach_mcg(rxe, qp, mgid);
}
+
+/**
+ * rxe_cleanup_mcast - cleanup all resources still held by mcast
+ * @rxe: rxe object
+ *
+ * Called when rxe device is unloaded. Walk red-black tree to
+ * find all mcg's and then walk mcg->qp_list to find all mca's and
+ * free them.
+ *
+ * This is belt and suspenders. These should have been recovered
+ * already by the application or else by rdma-core.
+ */
+void rxe_cleanup_mcast(struct rxe_dev *rxe)
+{
+ struct rb_root *root = &rxe->mcg_tree;
+ struct rb_node *node, *next;
+ struct rxe_mcg *mcg;
+ struct rxe_mca *mca, *tmp;
+
+ if (RB_EMPTY_ROOT(root))
+ return; /* expected behavior */
+
+ pr_warn("RDMA/core bug: Recovering leaked multicast resources\n");
+
+ for (node = rb_first(root); node; node = next) {
+ next = rb_next(node);
+ mcg = rb_entry(node, typeof(*mcg), node);
+
+ spin_lock_bh(&rxe->mcg_lock);
+ list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list)
+ kfree(mca);
+
+ __rxe_remove_mcg(mcg);
+ spin_unlock_bh(&rxe->mcg_lock);
+
+ kfree(mcg);
+ }
+}
--
2.32.0
* Re: [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory
From: Jason Gunthorpe @ 2022-02-23 19:33 UTC (permalink / raw)
To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma
On Thu, Feb 17, 2022 at 06:35:39PM -0600, Bob Pearson wrote:
> Well behaved applications will free all memory allocated by multicast.
> If this fails rdma-core will detach all qp's from multicast groups when
> an application terminates. This patch prints a warning and cleans out
> memory allocated by multicast if both of these fail to occur.
>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> drivers/infiniband/sw/rxe/rxe.c | 2 ++
> drivers/infiniband/sw/rxe/rxe_loc.h | 1 +
> drivers/infiniband/sw/rxe/rxe_mcast.c | 38 +++++++++++++++++++++++++++
> 3 files changed, 41 insertions(+)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 3520eb2db685..603b0156f889 100644
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -29,6 +29,8 @@ void rxe_dealloc(struct ib_device *ib_dev)
> rxe_pool_cleanup(&rxe->mr_pool);
> rxe_pool_cleanup(&rxe->mw_pool);
>
> + rxe_cleanup_mcast(rxe);
Let's just put
WARN_ON(!RB_EMPTY_ROOT(&rxe->mcg_tree));
Right here and forget about the rest. Leaking memory if the kernel is
buggy is OK.
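For illustration, this is roughly how the tail of rxe_dealloc() would look with that
suggestion applied (lines taken from the hunk quoted above, with the WARN_ON()
substituted for the rxe_cleanup_mcast() call; a sketch of the proposal, not the posted
patch):

	rxe_pool_cleanup(&rxe->mr_pool);
	rxe_pool_cleanup(&rxe->mw_pool);

	/* every mcg should already have been torn down when its last
	 * qp detached; a non-empty tree means a kernel-side leak
	 */
	WARN_ON(!RB_EMPTY_ROOT(&rxe->mcg_tree));

	if (rxe->tfm)
		crypto_free_shash(rxe->tfm);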
Jason
* [PATCH for-next v12 2/6] RDMA/rxe: Collect mca init code in a subroutine
From: Bob Pearson @ 2022-02-18 0:35 UTC (permalink / raw)
To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson
Collect initialization code for struct rxe_mca into a subroutine,
__rxe_init_mca(), to clean up rxe_attach_mcg() in rxe_mcast.c. Check
the limit on the total number of attached qp's.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_mcast.c | 58 ++++++++++++++++++++-------
drivers/infiniband/sw/rxe/rxe_verbs.h | 1 +
2 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 447d78bea28b..53db0984a9a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -259,6 +259,46 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
spin_unlock_irqrestore(&mcg->rxe->mcg_lock, flags);
}
+/**
+ * __rxe_init_mca - initialize a new mca holding lock
+ * @qp: qp object
+ * @mcg: mcg object
+ * @mca: empty space for new mca
+ *
+ * Context: caller must hold references on qp and mcg, rxe->mcg_lock
+ * and pass memory for new mca
+ *
+ * Returns: 0 on success else an error
+ */
+static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
+ struct rxe_mca *mca)
+{
+ struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
+ int n;
+
+ n = atomic_inc_return(&rxe->mcg_attach);
+ if (n > rxe->attr.max_total_mcast_qp_attach) {
+ atomic_dec(&rxe->mcg_attach);
+ return -ENOMEM;
+ }
+
+ n = atomic_inc_return(&mcg->qp_num);
+ if (n > rxe->attr.max_mcast_qp_attach) {
+ atomic_dec(&mcg->qp_num);
+ atomic_dec(&rxe->mcg_attach);
+ return -ENOMEM;
+ }
+
+ atomic_inc(&qp->mcg_num);
+
+ rxe_add_ref(qp);
+ mca->qp = qp;
+
+ list_add_tail(&mca->qp_list, &mcg->qp_list);
+
+ return 0;
+}
+
static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
struct rxe_mcg *mcg)
{
@@ -291,22 +331,9 @@ static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
}
}
- /* check limits after checking if already attached */
- if (atomic_inc_return(&mcg->qp_num) > rxe->attr.max_mcast_qp_attach) {
- atomic_dec(&mcg->qp_num);
+ err = __rxe_init_mca(qp, mcg, mca);
+ if (err)
kfree(mca);
- err = -ENOMEM;
- goto out;
- }
-
- /* protect pointer to qp in mca */
- rxe_add_ref(qp);
- mca->qp = qp;
-
- atomic_inc(&qp->mcg_num);
- list_add(&mca->qp_list, &mcg->qp_list);
-
- err = 0;
out:
spin_unlock_irqrestore(&rxe->mcg_lock, flags);
return err;
@@ -329,6 +356,7 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
if (mca->qp == qp) {
list_del(&mca->qp_list);
atomic_dec(&qp->mcg_num);
+ atomic_dec(&rxe->mcg_attach);
rxe_drop_ref(qp);
/* if the number of qp's attached to the
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 20fe3ee6589d..6b15251ff67a 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -401,6 +401,7 @@ struct rxe_dev {
spinlock_t mcg_lock;
struct rb_root mcg_tree;
atomic_t mcg_num;
+ atomic_t mcg_attach;
spinlock_t pending_lock; /* guard pending_mmaps */
struct list_head pending_mmaps;
--
2.32.0
* [PATCH for-next v12 3/6] RDMA/rxe: Collect cleanup mca code in a subroutine
From: Bob Pearson @ 2022-02-18 0:35 UTC (permalink / raw)
To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson
Collect cleanup code for struct rxe_mca into a subroutine,
__rxe_cleanup_mca(), called from rxe_detach_mcg() in rxe_mcast.c.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_mcast.c | 41 +++++++++++++++++----------
1 file changed, 26 insertions(+), 15 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 53db0984a9a1..0f0655e81d40 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -339,13 +339,31 @@ static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
return err;
}
+/**
+ * __rxe_cleanup_mca - cleanup mca object holding lock
+ * @mca: mca object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold a reference to mcg and rxe->mcg_lock
+ */
+static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
+{
+ list_del(&mca->qp_list);
+
+ atomic_dec(&mcg->qp_num);
+ atomic_dec(&mcg->rxe->mcg_attach);
+ atomic_dec(&mca->qp->mcg_num);
+ rxe_drop_ref(mca->qp);
+
+ kfree(mca);
+}
+
static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
union ib_gid *mgid)
{
struct rxe_mcg *mcg;
struct rxe_mca *mca, *tmp;
unsigned long flags;
- int err;
mcg = rxe_lookup_mcg(rxe, mgid);
if (!mcg)
@@ -354,37 +372,30 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
spin_lock_irqsave(&rxe->mcg_lock, flags);
list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
if (mca->qp == qp) {
- list_del(&mca->qp_list);
- atomic_dec(&qp->mcg_num);
- atomic_dec(&rxe->mcg_attach);
- rxe_drop_ref(qp);
+ __rxe_cleanup_mca(mca, mcg);
/* if the number of qp's attached to the
* mcast group falls to zero go ahead and
* tear it down. This will not free the
* object since we are still holding a ref
- * from the get key above.
+ * from the get key above
*/
- if (atomic_dec_return(&mcg->qp_num) <= 0)
+ if (atomic_read(&mcg->qp_num) <= 0)
__rxe_destroy_mcg(mcg);
/* drop the ref from get key. This will free the
* object if qp_num is zero.
*/
kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
- kfree(mca);
- err = 0;
- goto out_unlock;
+
+ spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+ return 0;
}
}
/* we didn't find the qp on the list */
- kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
- err = -EINVAL;
-
-out_unlock:
spin_unlock_irqrestore(&rxe->mcg_lock, flags);
- return err;
+ return -EINVAL;
}
int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
--
2.32.0
* [PATCH for-next v12 4/6] RDMA/rxe: Cleanup rxe_mcast.c
From: Bob Pearson @ 2022-02-18 0:35 UTC (permalink / raw)
To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson
Finish adding subroutine comment headers to the subroutines in
rxe_mcast.c. Make minor API changes and cleanups.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++++++------
1 file changed, 78 insertions(+), 19 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 0f0655e81d40..349a6fac2fcc 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -1,12 +1,33 @@
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/*
+ * Copyright (c) 2022 Hewlett Packard Enterprise, Inc. All rights reserved.
* Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
* Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
*/
+/*
+ * rxe_mcast.c implements driver support for multicast transport.
+ * It is based on two data structures struct rxe_mcg ('mcg') and
+ * struct rxe_mca ('mca'). An mcg is allocated each time a qp is
+ * attached to a new mgid for the first time. These are indexed by
+ * a red-black tree using the mgid. This data structure is searched
+ * for the mcg when a multicast packet is received and when another
+ * qp is attached to the same mgid. It is cleaned up when the last qp
+ * is detached from the mcg. Each time a qp is attached to an mcg an
+ * mca is created. It holds a pointer to the qp and is added to a list
+ * of qp's that are attached to the mcg. The qp_list is used to replicate
+ * mcast packets in the rxe receive path.
+ */
+
#include "rxe.h"
-#include "rxe_loc.h"
+/**
+ * rxe_mcast_add - add multicast address to rxe device
+ * @rxe: rxe device object
+ * @mgid: multicast address as a gid
+ *
+ * Returns 0 on success else an error
+ */
static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
{
unsigned char ll_addr[ETH_ALEN];
@@ -16,6 +37,13 @@ static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
return dev_mc_add(rxe->ndev, ll_addr);
}
+/**
+ * rxe_mcast_delete - delete multicast address from rxe device
+ * @rxe: rxe device object
+ * @mgid: multicast address as a gid
+ *
+ * Returns 0 on success else an error
+ */
static int rxe_mcast_delete(struct rxe_dev *rxe, union ib_gid *mgid)
{
unsigned char ll_addr[ETH_ALEN];
@@ -216,7 +244,7 @@ static struct rxe_mcg *rxe_get_mcg(struct rxe_dev *rxe, union ib_gid *mgid)
/**
* rxe_cleanup_mcg - cleanup mcg for kref_put
- * @kref:
+ * @kref: struct kref embedded in mcg
*/
void rxe_cleanup_mcg(struct kref *kref)
{
@@ -299,9 +327,17 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
return 0;
}
-static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
- struct rxe_mcg *mcg)
+/**
+ * rxe_attach_mcg - attach qp to mcg if not already attached
+ * @qp: qp object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold reference on qp and mcg.
+ * Returns: 0 on success else an error
+ */
+static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
{
+ struct rxe_dev *rxe = mcg->rxe;
struct rxe_mca *mca, *tmp;
unsigned long flags;
int err;
@@ -358,17 +394,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
kfree(mca);
}
-static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
- union ib_gid *mgid)
+/**
+ * rxe_detach_mcg - detach qp from mcg
+ * @mcg: mcg object
+ * @qp: qp object
+ *
+ * Returns: 0 on success else an error if qp is not attached.
+ */
+static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
{
- struct rxe_mcg *mcg;
+ struct rxe_dev *rxe = mcg->rxe;
struct rxe_mca *mca, *tmp;
unsigned long flags;
- mcg = rxe_lookup_mcg(rxe, mgid);
- if (!mcg)
- return -EINVAL;
-
spin_lock_irqsave(&rxe->mcg_lock, flags);
list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
if (mca->qp == qp) {
@@ -378,16 +416,11 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
* mcast group falls to zero go ahead and
* tear it down. This will not free the
* object since we are still holding a ref
- * from the get key above
+ * from the caller
*/
if (atomic_read(&mcg->qp_num) <= 0)
__rxe_destroy_mcg(mcg);
- /* drop the ref from get key. This will free the
- * object if qp_num is zero.
- */
- kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-
spin_unlock_irqrestore(&rxe->mcg_lock, flags);
return 0;
}
@@ -398,6 +431,14 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
return -EINVAL;
}
+/**
+ * rxe_attach_mcast - attach qp to multicast group (see IBA-11.3.1)
+ * @ibqp: (IB) qp object
+ * @mgid: multicast IP address
+ * @mlid: multicast LID, ignored for RoCEv2 (see IBA-A17.5.6)
+ *
+ * Returns: 0 on success else an errno
+ */
int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
{
int err;
@@ -410,22 +451,40 @@ int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
if (IS_ERR(mcg))
return PTR_ERR(mcg);
- err = rxe_attach_mcg(rxe, qp, mcg);
+ err = rxe_attach_mcg(mcg, qp);
/* if we failed to attach the first qp to mcg tear it down */
if (atomic_read(&mcg->qp_num) == 0)
rxe_destroy_mcg(mcg);
kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+
return err;
}
+/**
+ * rxe_detach_mcast - detach qp from multicast group (see IBA-11.3.2)
+ * @ibqp: address of (IB) qp object
+ * @mgid: multicast IP address
+ * @mlid: multicast LID, ignored for RoCEv2 (see IBA-A17.5.6)
+ *
+ * Returns: 0 on success else an errno
+ */
int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
{
struct rxe_dev *rxe = to_rdev(ibqp->device);
struct rxe_qp *qp = to_rqp(ibqp);
+ struct rxe_mcg *mcg;
+ int err;
+
+ mcg = rxe_lookup_mcg(rxe, mgid);
+ if (!mcg)
+ return -EINVAL;
- return rxe_detach_mcg(rxe, qp, mgid);
+ err = rxe_detach_mcg(mcg, qp);
+ kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+
+ return err;
}
/**
--
2.32.0
* [PATCH for-next v12 5/6] RDMA/rxe: For mcast copy qp list to temp array
From: Bob Pearson @ 2022-02-18 0:35 UTC (permalink / raw)
To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson
Currently rxe_rcv_mcast_pkt() performs most of its work under
rxe->mcg_lock and calls into rxe_rcv(), which queues the packets
to the responder and completer tasklets while still holding the lock,
which is a very bad idea. This patch walks the qp_list in the mcg and
copies the qp addresses to a temporary array under the lock, but does
the rest of the work without holding the lock. The critical section is
now very small.
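In outline the change has the shape sketched below (condensed from the diff that
follows; allocation-failure handling and the per-qp validation and skb cloning are
elided here):

	/* take a snapshot of the attached qp's while holding the lock,
	 * leaving a little headroom for concurrent attaches
	 */
	nmax = atomic_read(&mcg->qp_num) + 4;
	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);

	n = 0;
	spin_lock_bh(&rxe->mcg_lock);
	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
		rxe_add_ref(mca->qp);	/* keep each qp alive after unlock */
		qp_array[n++] = mca->qp;
		if (n == nmax)
			break;
	}
	spin_unlock_bh(&rxe->mcg_lock);

	/* validate, clone and deliver the skb to each qp in qp_array
	 * via rxe_rcv_pkt() with rxe->mcg_lock no longer held
	 */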
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_recv.c | 103 +++++++++++++++++----------
1 file changed, 64 insertions(+), 39 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 53924453abef..9b21cbb22602 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -232,11 +232,15 @@ static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
{
+ struct sk_buff *skb_copy;
struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
+ struct rxe_pkt_info *pkt_copy;
struct rxe_mcg *mcg;
struct rxe_mca *mca;
struct rxe_qp *qp;
+ struct rxe_qp **qp_array;
union ib_gid dgid;
+ int n, nmax;
int err;
if (skb->protocol == htons(ETH_P_IP))
@@ -248,68 +252,89 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
/* lookup mcast group corresponding to mgid, takes a ref */
mcg = rxe_lookup_mcg(rxe, &dgid);
if (!mcg)
- goto drop; /* mcast group not registered */
+ goto err_drop; /* mcast group not registered */
+
+ /* this is the current number of qp's attached to mcg plus a
+ * little room in case new qp's are attached between here
+ * and when we finish walking the qp list. If someone can
+ * attach more than 4 new qp's we will miss forwarding
+ * packets to those qp's. This is actually OK since UD is
+ * a unreliable service.
+ */
+ nmax = atomic_read(&mcg->qp_num) + 4;
+ qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
+ n = 0;
spin_lock_bh(&rxe->mcg_lock);
-
- /* this is unreliable datagram service so we let
- * failures to deliver a multicast packet to a
- * single QP happen and just move on and try
- * the rest of them on the list
- */
list_for_each_entry(mca, &mcg->qp_list, qp_list) {
- qp = mca->qp;
+ /* protect the qp pointers in the list */
+ rxe_add_ref(mca->qp);
+ qp_array[n++] = mca->qp;
+ if (n == nmax)
+ break;
+ }
+ spin_unlock_bh(&rxe->mcg_lock);
+ nmax = n;
+ kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
- /* validate qp for incoming packet */
+ for (n = 0; n < nmax; n++) {
+ qp = qp_array[n];
+
+ /* since this is an unreliable transport if
+ * one of the qp's fails to pass these checks
+ * just don't forward a packet and continue
+ * on to the other qp's. If there aren't any
+ * drop the skb
+ */
err = check_type_state(rxe, pkt, qp);
- if (err)
+ if (err) {
+ rxe_drop_ref(qp);
+ if (n == nmax - 1)
+ goto err_free;
continue;
+ }
err = check_keys(rxe, pkt, bth_qpn(pkt), qp);
- if (err)
+ if (err) {
+ rxe_drop_ref(qp);
+ if (n == nmax - 1)
+ goto err_free;
continue;
+ }
- /* for all but the last QP create a new clone of the
- * skb and pass to the QP. Pass the original skb to
- * the last QP in the list.
+ /* for all but the last qp create a new copy(clone)
+ * of the skb and pass to the qp. Pass the original
+ * skb to the last qp in the list unless it failed
+ * checks above
*/
- if (mca->qp_list.next != &mcg->qp_list) {
- struct sk_buff *cskb;
- struct rxe_pkt_info *cpkt;
-
- cskb = skb_clone(skb, GFP_ATOMIC);
- if (unlikely(!cskb))
+ if (n < nmax - 1) {
+ skb_copy = skb_clone(skb, GFP_KERNEL);
+ if (unlikely(!skb_copy)) {
+ rxe_drop_ref(qp);
continue;
+ }
if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
- kfree_skb(cskb);
- break;
+ kfree_skb(skb_copy);
+ rxe_drop_ref(qp);
+ continue;
}
- cpkt = SKB_TO_PKT(cskb);
- cpkt->qp = qp;
- rxe_add_ref(qp);
- rxe_rcv_pkt(cpkt, cskb);
+ pkt_copy = SKB_TO_PKT(skb_copy);
+ pkt_copy->qp = qp;
+ rxe_rcv_pkt(pkt_copy, skb_copy);
} else {
pkt->qp = qp;
- rxe_add_ref(qp);
rxe_rcv_pkt(pkt, skb);
- skb = NULL; /* mark consumed */
}
}
- spin_unlock_bh(&rxe->mcg_lock);
-
- kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-
- if (likely(!skb))
- return;
-
- /* This only occurs if one of the checks fails on the last
- * QP in the list above
- */
+ kfree(qp_array);
+ return;
-drop:
+err_free:
+ kfree(qp_array);
+err_drop:
kfree_skb(skb);
ib_device_put(&rxe->ib_dev);
}
--
2.32.0
* [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
From: Bob Pearson @ 2022-02-18 0:35 UTC (permalink / raw)
To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson
Replace the spinlock with RCU read locks for read-side operations
on mca in rxe_recv.c and rxe_mcast.c.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++----------
drivers/infiniband/sw/rxe/rxe_recv.c | 6 +-
drivers/infiniband/sw/rxe/rxe_verbs.h | 2 +
3 files changed, 67 insertions(+), 38 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 349a6fac2fcc..2bfec3748e1e 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -17,6 +17,12 @@
* mca is created. It holds a pointer to the qp and is added to a list
* of qp's that are attached to the mcg. The qp_list is used to replicate
* mcast packets in the rxe receive path.
+ *
+ * The highest performance operations are mca list traversal when
+ * processing incoming multicast packets which need to be fanned out
+ * to the attached qp's. This list is protected by RCU locking for read
+ * operations and a spinlock in the rxe_dev struct for write operations.
+ * The red-black tree is protected by the same spinlock.
*/
#include "rxe.h"
@@ -288,7 +294,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
}
/**
- * __rxe_init_mca - initialize a new mca holding lock
+ * __rxe_init_mca_rcu - initialize a new mca holding lock
* @qp: qp object
* @mcg: mcg object
* @mca: empty space for new mca
@@ -298,8 +304,8 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
*
* Returns: 0 on success else an error
*/
-static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
- struct rxe_mca *mca)
+static int __rxe_init_mca_rcu(struct rxe_qp *qp, struct rxe_mcg *mcg,
+ struct rxe_mca *mca)
{
struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
int n;
@@ -322,7 +328,10 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
rxe_add_ref(qp);
mca->qp = qp;
- list_add_tail(&mca->qp_list, &mcg->qp_list);
+ kref_get(&mcg->ref_cnt);
+ mca->mcg = mcg;
+
+ list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);
return 0;
}
@@ -343,14 +352,14 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
int err;
/* check to see if the qp is already a member of the group */
- spin_lock_irqsave(&rxe->mcg_lock, flags);
- list_for_each_entry(mca, &mcg->qp_list, qp_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
if (mca->qp == qp) {
- spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+ rcu_read_unlock();
return 0;
}
}
- spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+ rcu_read_unlock();
/* speculative alloc new mca without using GFP_ATOMIC */
mca = kzalloc(sizeof(*mca), GFP_KERNEL);
@@ -367,7 +376,7 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
}
}
- err = __rxe_init_mca(qp, mcg, mca);
+ err = __rxe_init_mca_rcu(qp, mcg, mca);
if (err)
kfree(mca);
out:
@@ -376,15 +385,16 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
}
/**
- * __rxe_cleanup_mca - cleanup mca object holding lock
- * @mca: mca object
- * @mcg: mcg object
+ * __rxe_destroy_mca - free mca resources
+ * @head: rcu_head embedded in mca
*
* Context: caller must hold a reference to mcg and rxe->mcg_lock
+ * all rcu read operations should be complete
*/
-static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
+static void __rxe_destroy_mca(struct rcu_head *head)
{
- list_del(&mca->qp_list);
+ struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
+ struct rxe_mcg *mcg = mca->mcg;
atomic_dec(&mcg->qp_num);
atomic_dec(&mcg->rxe->mcg_attach);
@@ -394,6 +404,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
kfree(mca);
}
+/**
+ * __rxe_cleanup_mca_rcu - cleanup mca object holding lock
+ * @mca: mca object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold a reference to mcg and rxe->mcg_lock
+ */
+static void __rxe_cleanup_mca_rcu(struct rxe_mca *mca, struct rxe_mcg *mcg)
+{
+ list_del_rcu(&mca->qp_list);
+ call_rcu(&mca->rcu, __rxe_destroy_mca);
+}
+
/**
* rxe_detach_mcg - detach qp from mcg
* @mcg: mcg object
@@ -404,31 +427,35 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
{
struct rxe_dev *rxe = mcg->rxe;
- struct rxe_mca *mca, *tmp;
- unsigned long flags;
+ struct rxe_mca *mca;
+ int ret;
- spin_lock_irqsave(&rxe->mcg_lock, flags);
- list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
- if (mca->qp == qp) {
- __rxe_cleanup_mca(mca, mcg);
-
- /* if the number of qp's attached to the
- * mcast group falls to zero go ahead and
- * tear it down. This will not free the
- * object since we are still holding a ref
- * from the caller
- */
- if (atomic_read(&mcg->qp_num) <= 0)
- __rxe_destroy_mcg(mcg);
-
- spin_unlock_irqrestore(&rxe->mcg_lock, flags);
- return 0;
- }
+ spin_lock_bh(&rxe->mcg_lock);
+ list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
+ if (mca->qp == qp)
+ goto found;
}
/* we didn't find the qp on the list */
- spin_unlock_irqrestore(&rxe->mcg_lock, flags);
- return -EINVAL;
+ ret = -EINVAL;
+ goto done;
+
+found:
+ __rxe_cleanup_mca_rcu(mca, mcg);
+
+ /* if the number of qp's attached to the
+ * mcast group falls to zero go ahead and
+ * tear it down. This will not free the
+ * object since we are still holding a ref
+ * from the caller
+ */
+ if (atomic_read(&mcg->qp_num) <= 0)
+ __rxe_destroy_mcg(mcg);
+
+ ret = 0;
+done:
+ spin_unlock_bh(&rxe->mcg_lock);
+ return ret;
}
/**
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 9b21cbb22602..c2cab85c6576 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -265,15 +265,15 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
n = 0;
- spin_lock_bh(&rxe->mcg_lock);
- list_for_each_entry(mca, &mcg->qp_list, qp_list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
/* protect the qp pointers in the list */
rxe_add_ref(mca->qp);
qp_array[n++] = mca->qp;
if (n == nmax)
break;
}
- spin_unlock_bh(&rxe->mcg_lock);
+ rcu_read_unlock();
nmax = n;
kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 6b15251ff67a..4ee51de50b95 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -364,7 +364,9 @@ struct rxe_mcg {
struct rxe_mca {
struct list_head qp_list;
+ struct rcu_head rcu;
struct rxe_qp *qp;
+ struct rxe_mcg *mcg;
};
struct rxe_port {
--
2.32.0
* Re: [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
From: Jason Gunthorpe @ 2022-02-23 19:52 UTC (permalink / raw)
To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma
On Thu, Feb 17, 2022 at 06:35:44PM -0600, Bob Pearson wrote:
> Replace spinlock with rcu read locks for read side operations
> on mca in rxe_recv.c and rxe_mcast.c.
>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++----------
> drivers/infiniband/sw/rxe/rxe_recv.c | 6 +-
> drivers/infiniband/sw/rxe/rxe_verbs.h | 2 +
> 3 files changed, 67 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
> index 349a6fac2fcc..2bfec3748e1e 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
> @@ -17,6 +17,12 @@
> * mca is created. It holds a pointer to the qp and is added to a list
> * of qp's that are attached to the mcg. The qp_list is used to replicate
> * mcast packets in the rxe receive path.
> + *
> + * The highest performance operations are mca list traversal when
> + * processing incoming multicast packets which need to be fanned out
> + * to the attached qp's. This list is protected by RCU locking for read
> + * operations and a spinlock in the rxe_dev struct for write operations.
> + * The red-black tree is protected by the same spinlock.
> */
>
> #include "rxe.h"
> @@ -288,7 +294,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
> }
>
> /**
> - * __rxe_init_mca - initialize a new mca holding lock
> + * __rxe_init_mca_rcu - initialize a new mca holding lock
> * @qp: qp object
> * @mcg: mcg object
> * @mca: empty space for new mca
> @@ -298,8 +304,8 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
> *
> * Returns: 0 on success else an error
> */
> -static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
> - struct rxe_mca *mca)
> +static int __rxe_init_mca_rcu(struct rxe_qp *qp, struct rxe_mcg *mcg,
> + struct rxe_mca *mca)
> {
This isn't really 'rcu', it still has to hold the write side lock
> struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
> int n;
> @@ -322,7 +328,10 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
> rxe_add_ref(qp);
> mca->qp = qp;
>
> - list_add_tail(&mca->qp_list, &mcg->qp_list);
> + kref_get(&mcg->ref_cnt);
> + mca->mcg = mcg;
> +
> + list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);
list_add_tail gets to be called rcu because it is slower than the
non-rcu safe version.
> -static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
> +static void __rxe_destroy_mca(struct rcu_head *head)
> {
> - list_del(&mca->qp_list);
> + struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
> + struct rxe_mcg *mcg = mca->mcg;
>
> atomic_dec(&mcg->qp_num);
> atomic_dec(&mcg->rxe->mcg_attach);
> @@ -394,6 +404,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
> kfree(mca);
> +/**
> + * __rxe_cleanup_mca_rcu - cleanup mca object holding lock
> + * @mca: mca object
> + * @mcg: mcg object
> + *
> + * Context: caller must hold a reference to mcg and rxe->mcg_lock
> + */
> +static void __rxe_cleanup_mca_rcu(struct rxe_mca *mca, struct rxe_mcg *mcg)
> +{
> + list_del_rcu(&mca->qp_list);
> + call_rcu(&mca->rcu, __rxe_destroy_mca);
> +}
I think this is OK now..
> +
> /**
> * rxe_detach_mcg - detach qp from mcg
> * @mcg: mcg object
> @@ -404,31 +427,35 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
> static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
> {
> struct rxe_dev *rxe = mcg->rxe;
> - struct rxe_mca *mca, *tmp;
> - unsigned long flags;
> + struct rxe_mca *mca;
> + int ret;
>
> - spin_lock_irqsave(&rxe->mcg_lock, flags);
> - list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
> - if (mca->qp == qp) {
> - __rxe_cleanup_mca(mca, mcg);
> -
> - /* if the number of qp's attached to the
> - * mcast group falls to zero go ahead and
> - * tear it down. This will not free the
> - * object since we are still holding a ref
> - * from the caller
> - */
> - if (atomic_read(&mcg->qp_num) <= 0)
> - __rxe_destroy_mcg(mcg);
> -
> - spin_unlock_irqrestore(&rxe->mcg_lock, flags);
> - return 0;
> - }
> + spin_lock_bh(&rxe->mcg_lock);
> + list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
> + if (mca->qp == qp)
> + goto found;
> }
>
> /* we didn't find the qp on the list */
> - spin_unlock_irqrestore(&rxe->mcg_lock, flags);
> - return -EINVAL;
> + ret = -EINVAL;
> + goto done;
> +
> +found:
> + __rxe_cleanup_mca_rcu(mca, mcg);
> +
> + /* if the number of qp's attached to the
> + * mcast group falls to zero go ahead and
> + * tear it down. This will not free the
> + * object since we are still holding a ref
> + * from the caller
> + */
Fix the word wrap
Would prefer to avoid the found label in the middle of the code.
> + rcu_read_lock();
> + list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
> /* protect the qp pointers in the list */
> rxe_add_ref(mca->qp);
The other approach you could take is to also kref_free the qp and allow
its kref to become zero here. But this is fine, I think.
Jason
* Re: [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
From: Bob Pearson @ 2022-02-23 22:40 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: zyjzyj2000, linux-rdma
On 2/23/22 13:52, Jason Gunthorpe wrote:
> On Thu, Feb 17, 2022 at 06:35:44PM -0600, Bob Pearson wrote:
>> Replace spinlock with rcu read locks for read side operations
>> on mca in rxe_recv.c and rxe_mcast.c.
>>
>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>> drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++----------
>> drivers/infiniband/sw/rxe/rxe_recv.c | 6 +-
>> drivers/infiniband/sw/rxe/rxe_verbs.h | 2 +
>> 3 files changed, 67 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
>> index 349a6fac2fcc..2bfec3748e1e 100644
>> +++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
>> @@ -17,6 +17,12 @@
>> * mca is created. It holds a pointer to the qp and is added to a list
>> * of qp's that are attached to the mcg. The qp_list is used to replicate
>> * mcast packets in the rxe receive path.
>> + *
>> + * The highest performance operations are mca list traversal when
>> + * processing incoming multicast packets which need to be fanned out
>> + * to the attached qp's. This list is protected by RCU locking for read
>> + * operations and a spinlock in the rxe_dev struct for write operations.
>> + * The red-black tree is protected by the same spinlock.
>> */
>>
>> #include "rxe.h"
>> @@ -288,7 +294,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
>> }
>>
>> /**
>> - * __rxe_init_mca - initialize a new mca holding lock
>> + * __rxe_init_mca_rcu - initialize a new mca holding lock
>> * @qp: qp object
>> * @mcg: mcg object
>> * @mca: empty space for new mca
>> @@ -298,8 +304,8 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
>> *
>> * Returns: 0 on success else an error
>> */
>> -static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
>> - struct rxe_mca *mca)
>> +static int __rxe_init_mca_rcu(struct rxe_qp *qp, struct rxe_mcg *mcg,
>> + struct rxe_mca *mca)
>> {
>
> This isn't really 'rcu', it still has to hold the write side lock
>
>> struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
>> int n;
>> @@ -322,7 +328,10 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
>> rxe_add_ref(qp);
>> mca->qp = qp;
>>
>> - list_add_tail(&mca->qp_list, &mcg->qp_list);
>> + kref_get(&mcg->ref_cnt);
>> + mca->mcg = mcg;
>> +
>> + list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);
>
> list_add_tail gets to be called rcu because it is slower than the
> non-rcu safe version.
>
>> -static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>> +static void __rxe_destroy_mca(struct rcu_head *head)
>> {
>> - list_del(&mca->qp_list);
>> + struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
>> + struct rxe_mcg *mcg = mca->mcg;
>>
>> atomic_dec(&mcg->qp_num);
>> atomic_dec(&mcg->rxe->mcg_attach);
>> @@ -394,6 +404,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>> kfree(mca);
>
>> +/**
>> + * __rxe_cleanup_mca_rcu - cleanup mca object holding lock
>> + * @mca: mca object
>> + * @mcg: mcg object
>> + *
>> + * Context: caller must hold a reference to mcg and rxe->mcg_lock
>> + */
>> +static void __rxe_cleanup_mca_rcu(struct rxe_mca *mca, struct rxe_mcg *mcg)
>> +{
>> + list_del_rcu(&mca->qp_list);
>> + call_rcu(&mca->rcu, __rxe_destroy_mca);
>> +}
>
> I think this is OK now..
>
>> +
>> /**
>> * rxe_detach_mcg - detach qp from mcg
>> * @mcg: mcg object
>> @@ -404,31 +427,35 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>> static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
>> {
>> struct rxe_dev *rxe = mcg->rxe;
>> - struct rxe_mca *mca, *tmp;
>> - unsigned long flags;
>> + struct rxe_mca *mca;
>> + int ret;
>>
>> - spin_lock_irqsave(&rxe->mcg_lock, flags);
>> - list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
>> - if (mca->qp == qp) {
>> - __rxe_cleanup_mca(mca, mcg);
>> -
>> - /* if the number of qp's attached to the
>> - * mcast group falls to zero go ahead and
>> - * tear it down. This will not free the
>> - * object since we are still holding a ref
>> - * from the caller
>> - */
>> - if (atomic_read(&mcg->qp_num) <= 0)
>> - __rxe_destroy_mcg(mcg);
>> -
>> - spin_unlock_irqrestore(&rxe->mcg_lock, flags);
>> - return 0;
>> - }
>> + spin_lock_bh(&rxe->mcg_lock);
>> + list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
>> + if (mca->qp == qp)
>> + goto found;
>> }
>>
>> /* we didn't find the qp on the list */
>> - spin_unlock_irqrestore(&rxe->mcg_lock, flags);
>> - return -EINVAL;
>> + ret = -EINVAL;
>> + goto done;
>> +
>> +found:
>> + __rxe_cleanup_mca_rcu(mca, mcg);
>> +
>> + /* if the number of qp's attached to the
>> + * mcast group falls to zero go ahead and
>> + * tear it down. This will not free the
>> + * object since we are still holding a ref
>> + * from the caller
>> + */
>
> Fix the word wrap
>
> Would prefer to avoid the found label in the middle of the code.
>
>> + rcu_read_lock();
>> + list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
>> /* protect the qp pointers in the list */
>> rxe_add_ref(mca->qp);
>
> The other approach you could take is to als kref_free the qp and allow
> its kref to become zero here. But this is fine, I think.
>
> Jason
OK, I looked at this again. Here is what happens.
Without the complete()/wait_for_completion() fix the test passes, but the (private, i.e. not visible to rdma-core)
mca object is holding a reference on the qp which is only dropped after the call_rcu() callback runs.
Meanwhile the ib_detach_mcast() call returns to rdma-core, and since there is nothing left to do the test is over
and the python code closes the device. But this generates a warning (rdma_core.c line 940) since not all the
user objects (i.e. the qp) were freed. In other contexts we are planning on putting complete()/wait...()'s
in all the other verbs calls which ref count objects, for the same reason. I think it applies here too.
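As a rough sketch, the pattern being described looks like this (the names, the struct
layout, and the direct kref usage are illustrative assumptions only; the real rxe code
uses its pool reference counting helpers):

/* illustrative sketch - not the current rxe implementation */
struct qp_refs {
	struct kref ref_cnt;		/* kref_init()'d at create time */
	struct completion drained;	/* init_completion()'d at create time */
};

static void qp_refs_release(struct kref *kref)
{
	struct qp_refs *r = container_of(kref, struct qp_refs, ref_cnt);

	complete(&r->drained);		/* last holder just dropped its ref */
}

static int destroy_qp_wait_for_refs(struct qp_refs *r)
{
	kref_put(&r->ref_cnt, qp_refs_release);

	/* do not return to rdma-core until every holder, including an
	 * mca whose reference is only dropped from a call_rcu() callback,
	 * has released the qp
	 */
	wait_for_completion(&r->drained);
	return 0;
}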
Bob
* Re: [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
From: Jason Gunthorpe @ 2022-02-24 0:04 UTC (permalink / raw)
To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma
On Wed, Feb 23, 2022 at 04:40:19PM -0600, Bob Pearson wrote:
> OK I looked at this again. Here is what happens.
>
> without the complete()/wait_for_completion() fix the test passes but
> the (private i.e. not visible to rdma-core) mca object is holding a
> reference on qp which will be freed after the call_rcu() subroutine
> takes place. Meanwhile the ib_detach_mcast() call returns to
> rdma-core and since there is nothing left to do the test is over and
> the python code closes the device. But this generates a warning
> (rdma_core.c line 940) since not all the user objects (i.e. the qp)
> were freed. In other contexts we are planning on putting
> complete()/wait...()'s in all the other verbs calls which ref count
> objects for the same reason. I think it applies here too.
This explanation makes sense..
The mistake is on the QP side where it has a refcount but doesn't wait
for it to drain I suppose.
Jason