* [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c
@ 2022-02-18  0:35 Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory Bob Pearson
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

This patch series completes the separation of the mc_grp and mc_elem
object pools from rxe_pool.c and the move of their code to rxe_mcast.c.
This makes sense because these two pools differ from the other pools:
they are the only ones whose objects are not shared with rdma-core and
the only ones that use keys instead of indices to look up objects. This
change will enable a significant simplification of the normal object
pools.
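
For illustration, a minimal sketch (not code from the series; the struct
and function names are made up) of the kind of mgid-keyed red-black tree
lookup these pools rely on, in contrast to the index-based lookup used by
the other pools:

struct sketch_mcg {
	struct rb_node node;
	union ib_gid mgid;	/* the lookup key */
};

static struct sketch_mcg *sketch_lookup(struct rb_root *root,
					union ib_gid *mgid)
{
	struct rb_node *n = root->rb_node;

	/* tree is assumed to be ordered by memcmp() of the mgid */
	while (n) {
		struct sketch_mcg *mcg = rb_entry(n, struct sketch_mcg, node);
		int cmp = memcmp(&mcg->mgid, mgid, sizeof(*mgid));

		if (cmp > 0)
			n = n->rb_left;
		else if (cmp < 0)
			n = n->rb_right;
		else
			return mcg;
	}

	return NULL;
}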

These patches correspond to 08-11/11 of the previous version of this
series. The earlier patches have been accepted.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v12
  Addressed issues raised by Jason Gunthorpe.
  Added a warning if rdma-core fails to clean up mcast memory.
  Split the 'cleanup patch' into more manageable patches.
  Fixed an error in the rcu patch by moving the kfree and
  rxe_drop_ref(qp) to a call_rcu() callback.
v11
  Restructured the patch series to simplify it and make each patch
  simpler. Addressed some additional issues raised by Jason Gunthorpe
  in the review of that last version.
v10
  Corrected issues reported by Jason Gunthorpe
  Isolated patches 01-17 from the remaining patches.
  They will be submitted later
v9
  Corrected issues reported by Jason Gunthorpe,
  Converted locking in rxe_mcast.c and rxe_pool.c to use RCU
  Split up the patches into smaller changes
v8
  Fixed an additional race in 3/8 which was not handled correctly.
v7
  Corrected issues reported by Jason Gunthorpe
Link: https://lore.kernel.org/linux-rdma/20211207190947.GH6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207191857.GI6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207192824.GJ6385@nvidia.com/
v6
  Fixed a kzalloc flags bug.
  Fixed comment bug reported by 'Kernel Test Robot'.
  Changed type of rxe_pool.c in __rxe_fini().
v5
  Removed patches already accepted into for-next and addressed comments
  from Jason Gunthorpe.
v4
  Restructured patch series to change to xarray earlier which
  greatly simplified the changes.
  Rebased to current for-next
v3
  Changed rxe_alloc to use GFP_KERNEL
  Addressed other comments by Jason Gunthorpe
  Merged the previous 06/10 and 07/10 patches into one since they overlapped
  Added some minor cleanups as 10/10
v2
  Rebased to current for-next.
  Added 4 additional patches

Bob Pearson (6):
  RDMA/rxe: Add code to cleanup mcast memory
  RDMA/rxe: Collect init code for mca in a subroutine
  RDMA/rxe: Collect cleanup mca code in a subroutine
  RDMA/rxe: Cleanup rxe_mcast.c
  RDMA/rxe: For mcast copy qp list to temp array
  RDMA/rxe: Convert mca read locking to RCU

 drivers/infiniband/sw/rxe/rxe.c       |   2 +
 drivers/infiniband/sw/rxe/rxe_loc.h   |   1 +
 drivers/infiniband/sw/rxe/rxe_mcast.c | 289 ++++++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_recv.c  | 107 ++++++----
 drivers/infiniband/sw/rxe/rxe_verbs.h |   3 +
 5 files changed, 298 insertions(+), 104 deletions(-)


This patch series applies to the current version of wip/jgg-for-next
base-commit: 3810c1a1cbe8f3157f644b4e42f6c0157dfd22cb
-- 
2.32.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory
  2022-02-18  0:35 [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c Bob Pearson
@ 2022-02-18  0:35 ` Bob Pearson
  2022-02-23 19:33   ` Jason Gunthorpe
  2022-02-18  0:35 ` [PATCH for-next v12 2/6] RDMA/rxe: Collect mca init code in a subroutine Bob Pearson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Well behaved applications will free all memory allocated by multicast.
If this fails rdma-core will detach all qp's from multicast groups when
an application terminates. This patch prints a warning and cleans out
memory allocated by multicast if both of these fail to occur.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c       |  2 ++
 drivers/infiniband/sw/rxe/rxe_loc.h   |  1 +
 drivers/infiniband/sw/rxe/rxe_mcast.c | 38 +++++++++++++++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 3520eb2db685..603b0156f889 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -29,6 +29,8 @@ void rxe_dealloc(struct ib_device *ib_dev)
 	rxe_pool_cleanup(&rxe->mr_pool);
 	rxe_pool_cleanup(&rxe->mw_pool);
 
+	rxe_cleanup_mcast(rxe);
+
 	if (rxe->tfm)
 		crypto_free_shash(rxe->tfm);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 409efeecd581..0bc1b7e2877c 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -44,6 +44,7 @@ struct rxe_mcg *rxe_lookup_mcg(struct rxe_dev *rxe, union ib_gid *mgid);
 int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid);
 int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid);
 void rxe_cleanup_mcg(struct kref *kref);
+void rxe_cleanup_mcast(struct rxe_dev *rxe);
 
 /* rxe_mmap.c */
 struct rxe_mmap_info {
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 4935fe5c5868..447d78bea28b 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -388,3 +388,41 @@ int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 
 	return rxe_detach_mcg(rxe, qp, mgid);
 }
+
+/**
+ * rxe_cleanup_mcast - cleanup all resources still held by mcast
+ * @rxe: rxe object
+ *
+ * Called when rxe device is unloaded. Walk red-black tree to
+ * find all mcg's and then walk mcg->qp_list to find all mca's and
+ * free them.
+ *
+ * This is belt and suspenders. These should have been recovered
+ * already by the application or else by rdma-core.
+ */
+void rxe_cleanup_mcast(struct rxe_dev *rxe)
+{
+	struct rb_root *root = &rxe->mcg_tree;
+	struct rb_node *node, *next;
+	struct rxe_mcg *mcg;
+	struct rxe_mca *mca, *tmp;
+
+	if (RB_EMPTY_ROOT(root))
+		return;		/* expected behavior */
+
+	pr_warn("RDMA/core bug: Recovering leaked multicast resources\n");
+
+	for (node = rb_first(root); node; node = next) {
+		next = rb_next(node);
+		mcg = rb_entry(node, typeof(*mcg), node);
+
+		spin_lock_bh(&rxe->mcg_lock);
+		list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list)
+			kfree(mca);
+
+		__rxe_remove_mcg(mcg);
+		spin_unlock_bh(&rxe->mcg_lock);
+
+		kfree(mcg);
+	}
+}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next v12 2/6] RDMA/rxe: Collect mca init code in a subroutine
  2022-02-18  0:35 [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory Bob Pearson
@ 2022-02-18  0:35 ` Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 3/6] RDMA/rxe: Collect cleanup mca " Bob Pearson
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Collect initialization code for struct rxe_mca into a subroutine,
__rxe_init_mca(), to clean up rxe_attach_mcg() in rxe_mcast.c. Check
the limit on the total number of attached qp's.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 58 ++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
 2 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 447d78bea28b..53db0984a9a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -259,6 +259,46 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
 	spin_unlock_irqrestore(&mcg->rxe->mcg_lock, flags);
 }
 
+/**
+ * __rxe_init_mca - initialize a new mca holding lock
+ * @qp: qp object
+ * @mcg: mcg object
+ * @mca: empty space for new mca
+ *
+ * Context: caller must hold references on qp and mcg, rxe->mcg_lock
+ * and pass memory for new mca
+ *
+ * Returns: 0 on success else an error
+ */
+static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
+			  struct rxe_mca *mca)
+{
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
+	int n;
+
+	n = atomic_inc_return(&rxe->mcg_attach);
+	if (n > rxe->attr.max_total_mcast_qp_attach) {
+		atomic_dec(&rxe->mcg_attach);
+		return -ENOMEM;
+	}
+
+	n = atomic_inc_return(&mcg->qp_num);
+	if (n > rxe->attr.max_mcast_qp_attach) {
+		atomic_dec(&mcg->qp_num);
+		atomic_dec(&rxe->mcg_attach);
+		return -ENOMEM;
+	}
+
+	atomic_inc(&qp->mcg_num);
+
+	rxe_add_ref(qp);
+	mca->qp = qp;
+
+	list_add_tail(&mca->qp_list, &mcg->qp_list);
+
+	return 0;
+}
+
 static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 				  struct rxe_mcg *mcg)
 {
@@ -291,22 +331,9 @@ static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 		}
 	}
 
-	/* check limits after checking if already attached */
-	if (atomic_inc_return(&mcg->qp_num) > rxe->attr.max_mcast_qp_attach) {
-		atomic_dec(&mcg->qp_num);
+	err = __rxe_init_mca(qp, mcg, mca);
+	if (err)
 		kfree(mca);
-		err = -ENOMEM;
-		goto out;
-	}
-
-	/* protect pointer to qp in mca */
-	rxe_add_ref(qp);
-	mca->qp = qp;
-
-	atomic_inc(&qp->mcg_num);
-	list_add(&mca->qp_list, &mcg->qp_list);
-
-	err = 0;
 out:
 	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
 	return err;
@@ -329,6 +356,7 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 		if (mca->qp == qp) {
 			list_del(&mca->qp_list);
 			atomic_dec(&qp->mcg_num);
+			atomic_dec(&rxe->mcg_attach);
 			rxe_drop_ref(qp);
 
 			/* if the number of qp's attached to the
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 20fe3ee6589d..6b15251ff67a 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -401,6 +401,7 @@ struct rxe_dev {
 	spinlock_t		mcg_lock;
 	struct rb_root		mcg_tree;
 	atomic_t		mcg_num;
+	atomic_t		mcg_attach;
 
 	spinlock_t		pending_lock; /* guard pending_mmaps */
 	struct list_head	pending_mmaps;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next v12 3/6] RDMA/rxe: Collect cleanup mca code in a subroutine
  2022-02-18  0:35 [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 2/6] RDMA/rxe: Collect mca init code in a subroutine Bob Pearson
@ 2022-02-18  0:35 ` Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 4/6] RDMA/rxe: Cleanup rxe_mcast.c Bob Pearson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Collect cleanup code for struct rxe_mca into a subroutine,
__rxe_cleanup_mca(), called from rxe_detach_mcg() in rxe_mcast.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 41 +++++++++++++++++----------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 53db0984a9a1..0f0655e81d40 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -339,13 +339,31 @@ static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 	return err;
 }
 
+/**
+ * __rxe_cleanup_mca - cleanup mca object holding lock
+ * @mca: mca object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold a reference to mcg and rxe->mcg_lock
+ */
+static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
+{
+	list_del(&mca->qp_list);
+
+	atomic_dec(&mcg->qp_num);
+	atomic_dec(&mcg->rxe->mcg_attach);
+	atomic_dec(&mca->qp->mcg_num);
+	rxe_drop_ref(mca->qp);
+
+	kfree(mca);
+}
+
 static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 				   union ib_gid *mgid)
 {
 	struct rxe_mcg *mcg;
 	struct rxe_mca *mca, *tmp;
 	unsigned long flags;
-	int err;
 
 	mcg = rxe_lookup_mcg(rxe, mgid);
 	if (!mcg)
@@ -354,37 +372,30 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 	spin_lock_irqsave(&rxe->mcg_lock, flags);
 	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
-			list_del(&mca->qp_list);
-			atomic_dec(&qp->mcg_num);
-			atomic_dec(&rxe->mcg_attach);
-			rxe_drop_ref(qp);
+			__rxe_cleanup_mca(mca, mcg);
 
 			/* if the number of qp's attached to the
 			 * mcast group falls to zero go ahead and
 			 * tear it down. This will not free the
 			 * object since we are still holding a ref
-			 * from the get key above.
+			 * from the get key above
 			 */
-			if (atomic_dec_return(&mcg->qp_num) <= 0)
+			if (atomic_read(&mcg->qp_num) <= 0)
 				__rxe_destroy_mcg(mcg);
 
 			/* drop the ref from get key. This will free the
 			 * object if qp_num is zero.
 			 */
 			kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-			kfree(mca);
-			err = 0;
-			goto out_unlock;
+
+			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+			return 0;
 		}
 	}
 
 	/* we didn't find the qp on the list */
-	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-	err = -EINVAL;
-
-out_unlock:
 	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
-	return err;
+	return -EINVAL;
 }
 
 int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next v12 4/6] RDMA/rxe: Cleanup rxe_mcast.c
  2022-02-18  0:35 [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (2 preceding siblings ...)
  2022-02-18  0:35 ` [PATCH for-next v12 3/6] RDMA/rxe: Collect cleanup mca " Bob Pearson
@ 2022-02-18  0:35 ` Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 5/6] RDMA/rxe: For mcast copy qp list to temp array Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU Bob Pearson
  5 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Finish adding subroutine comment headers to the subroutines in
rxe_mcast.c. Make minor API changes and cleanups.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++++++------
 1 file changed, 78 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 0f0655e81d40..349a6fac2fcc 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -1,12 +1,33 @@
 // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
 /*
+ * Copyright (c) 2022 Hewlett Packard Enterprise, Inc. All rights reserved.
  * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
  * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
  */
 
+/*
+ * rxe_mcast.c implements driver support for multicast transport.
+ * It is based on two data structures struct rxe_mcg ('mcg') and
+ * struct rxe_mca ('mca'). An mcg is allocated each time a qp is
+ * attached to a new mgid for the first time. These are indexed by
+ * a red-black tree using the mgid. This data structure is searched
+ * for the mcg when a multicast packet is received and when another
+ * qp is attached to the same mgid. It is cleaned up when the last qp
+ * is detached from the mcg. Each time a qp is attached to an mcg an
+ * mca is created. It holds a pointer to the qp and is added to a list
+ * of qp's that are attached to the mcg. The qp_list is used to replicate
+ * mcast packets in the rxe receive path.
+ */
+
 #include "rxe.h"
-#include "rxe_loc.h"
 
+/**
+ * rxe_mcast_add - add multicast address to rxe device
+ * @rxe: rxe device object
+ * @mgid: multicast address as a gid
+ *
+ * Returns 0 on success else an error
+ */
 static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
 {
 	unsigned char ll_addr[ETH_ALEN];
@@ -16,6 +37,13 @@ static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
 	return dev_mc_add(rxe->ndev, ll_addr);
 }
 
+/**
+ * rxe_mcast_delete - delete multicast address from rxe device
+ * @rxe: rxe device object
+ * @mgid: multicast address as a gid
+ *
+ * Returns 0 on success else an error
+ */
 static int rxe_mcast_delete(struct rxe_dev *rxe, union ib_gid *mgid)
 {
 	unsigned char ll_addr[ETH_ALEN];
@@ -216,7 +244,7 @@ static struct rxe_mcg *rxe_get_mcg(struct rxe_dev *rxe, union ib_gid *mgid)
 
 /**
  * rxe_cleanup_mcg - cleanup mcg for kref_put
- * @kref:
+ * @kref: struct kref embedded in mcg
  */
 void rxe_cleanup_mcg(struct kref *kref)
 {
@@ -299,9 +327,17 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
 	return 0;
 }
 
-static int rxe_attach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
-				  struct rxe_mcg *mcg)
+/**
+ * rxe_attach_mcg - attach qp to mcg if not already attached
+ * @qp: qp object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold reference on qp and mcg.
+ * Returns: 0 on success else an error
+ */
+static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 {
+	struct rxe_dev *rxe = mcg->rxe;
 	struct rxe_mca *mca, *tmp;
 	unsigned long flags;
 	int err;
@@ -358,17 +394,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
 	kfree(mca);
 }
 
-static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
-				   union ib_gid *mgid)
+/**
+ * rxe_detach_mcg - detach qp from mcg
+ * @mcg: mcg object
+ * @qp: qp object
+ *
+ * Returns: 0 on success else an error if qp is not attached.
+ */
+static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 {
-	struct rxe_mcg *mcg;
+	struct rxe_dev *rxe = mcg->rxe;
 	struct rxe_mca *mca, *tmp;
 	unsigned long flags;
 
-	mcg = rxe_lookup_mcg(rxe, mgid);
-	if (!mcg)
-		return -EINVAL;
-
 	spin_lock_irqsave(&rxe->mcg_lock, flags);
 	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
@@ -378,16 +416,11 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 			 * mcast group falls to zero go ahead and
 			 * tear it down. This will not free the
 			 * object since we are still holding a ref
-			 * from the get key above
+			 * from the caller
 			 */
 			if (atomic_read(&mcg->qp_num) <= 0)
 				__rxe_destroy_mcg(mcg);
 
-			/* drop the ref from get key. This will free the
-			 * object if qp_num is zero.
-			 */
-			kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-
 			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
 			return 0;
 		}
@@ -398,6 +431,14 @@ static int rxe_detach_mcg(struct rxe_dev *rxe, struct rxe_qp *qp,
 	return -EINVAL;
 }
 
+/**
+ * rxe_attach_mcast - attach qp to multicast group (see IBA-11.3.1)
+ * @ibqp: (IB) qp object
+ * @mgid: multicast IP address
+ * @mlid: multicast LID, ignored for RoCEv2 (see IBA-A17.5.6)
+ *
+ * Returns: 0 on success else an errno
+ */
 int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 {
 	int err;
@@ -410,22 +451,40 @@ int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 	if (IS_ERR(mcg))
 		return PTR_ERR(mcg);
 
-	err = rxe_attach_mcg(rxe, qp, mcg);
+	err = rxe_attach_mcg(mcg, qp);
 
 	/* if we failed to attach the first qp to mcg tear it down */
 	if (atomic_read(&mcg->qp_num) == 0)
 		rxe_destroy_mcg(mcg);
 
 	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+
 	return err;
 }
 
+/**
+ * rxe_detach_mcast - detach qp from multicast group (see IBA-11.3.2)
+ * @ibqp: address of (IB) qp object
+ * @mgid: multicast IP address
+ * @mlid: multicast LID, ignored for RoCEv2 (see IBA-A17.5.6)
+ *
+ * Returns: 0 on success else an errno
+ */
 int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 {
 	struct rxe_dev *rxe = to_rdev(ibqp->device);
 	struct rxe_qp *qp = to_rqp(ibqp);
+	struct rxe_mcg *mcg;
+	int err;
+
+	mcg = rxe_lookup_mcg(rxe, mgid);
+	if (!mcg)
+		return -EINVAL;
 
-	return rxe_detach_mcg(rxe, qp, mgid);
+	err = rxe_detach_mcg(mcg, qp);
+	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+
+	return err;
 }
 
 /**
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next v12 5/6] RDMA/rxe: For mcast copy qp list to temp array
  2022-02-18  0:35 [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (3 preceding siblings ...)
  2022-02-18  0:35 ` [PATCH for-next v12 4/6] RDMA/rxe: Cleanup rxe_mcast.c Bob Pearson
@ 2022-02-18  0:35 ` Bob Pearson
  2022-02-18  0:35 ` [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU Bob Pearson
  5 siblings, 0 replies; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently rxe_rcv_mcast_pkt() performs most of its work under
rxe->mcg_lock and calls into rxe_rcv(), which queues the packets
to the responder and completer tasklets while still holding the
lock, which is a very bad idea. This patch walks the qp_list in the
mcg and copies the qp addresses to a temporary array under the lock,
but does the rest of the work without holding the lock. The critical
section is now very small.
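
For illustration, a simplified sketch of the approach (not the patch
itself; error handling is omitted and the names only loosely follow the
driver): snapshot the qp list into an array under the lock, then do the
expensive per-qp work with the lock dropped.

static void rxe_mcast_fanout_sketch(struct rxe_dev *rxe, struct rxe_mcg *mcg)
{
	int i, n = 0, nmax = atomic_read(&mcg->qp_num) + 4;
	struct rxe_qp **qp_array;
	struct rxe_mca *mca;

	qp_array = kmalloc_array(nmax, sizeof(*qp_array), GFP_KERNEL);
	if (!qp_array)
		return;

	spin_lock_bh(&rxe->mcg_lock);		/* short critical section */
	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
		rxe_add_ref(mca->qp);		/* keep qp alive after unlock */
		qp_array[n++] = mca->qp;
		if (n == nmax)
			break;
	}
	spin_unlock_bh(&rxe->mcg_lock);

	for (i = 0; i < n; i++) {
		/* clone the skb and queue it to qp_array[i] here,
		 * with no lock held, then drop the qp reference
		 */
		rxe_drop_ref(qp_array[i]);
	}

	kfree(qp_array);
}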

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_recv.c | 103 +++++++++++++++++----------
 1 file changed, 64 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 53924453abef..9b21cbb22602 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -232,11 +232,15 @@ static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 
 static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 {
+	struct sk_buff *skb_copy;
 	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
+	struct rxe_pkt_info *pkt_copy;
 	struct rxe_mcg *mcg;
 	struct rxe_mca *mca;
 	struct rxe_qp *qp;
+	struct rxe_qp **qp_array;
 	union ib_gid dgid;
+	int n, nmax;
 	int err;
 
 	if (skb->protocol == htons(ETH_P_IP))
@@ -248,68 +252,89 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 	/* lookup mcast group corresponding to mgid, takes a ref */
 	mcg = rxe_lookup_mcg(rxe, &dgid);
 	if (!mcg)
-		goto drop;	/* mcast group not registered */
+		goto err_drop;	/* mcast group not registered */
+
+	/* this is the current number of qp's attached to mcg plus a
+	 * little room in case new qp's are attached between here
+	 * and when we finish walking the qp list. If someone can
+	 * attach more than 4 new qp's we will miss forwarding
+	 * packets to those qp's. This is actually OK since UD is
+	 * a unreliable service.
+	 * an unreliable service.
+	nmax = atomic_read(&mcg->qp_num) + 4;
+	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
+	n = 0;
 
 	spin_lock_bh(&rxe->mcg_lock);
-
-	/* this is unreliable datagram service so we let
-	 * failures to deliver a multicast packet to a
-	 * single QP happen and just move on and try
-	 * the rest of them on the list
-	 */
 	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
-		qp = mca->qp;
+		/* protect the qp pointers in the list */
+		rxe_add_ref(mca->qp);
+		qp_array[n++] = mca->qp;
+		if (n == nmax)
+			break;
+	}
+	spin_unlock_bh(&rxe->mcg_lock);
+	nmax = n;
+	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
 
-		/* validate qp for incoming packet */
+	for (n = 0; n < nmax; n++) {
+		qp = qp_array[n];
+
+		/* since this is an unreliable transport if
+		 * one of the qp's fails to pass these checks
+		 * just don't forward a packet and continue
+		 * on to the other qp's. If there aren't any
+		 * drop the skb
+		 */
 		err = check_type_state(rxe, pkt, qp);
-		if (err)
+		if (err) {
+			rxe_drop_ref(qp);
+			if (n == nmax - 1)
+				goto err_free;
 			continue;
+		}
 
 		err = check_keys(rxe, pkt, bth_qpn(pkt), qp);
-		if (err)
+		if (err) {
+			rxe_drop_ref(qp);
+			if (n == nmax - 1)
+				goto err_free;
 			continue;
+		}
 
-		/* for all but the last QP create a new clone of the
-		 * skb and pass to the QP. Pass the original skb to
-		 * the last QP in the list.
+		/* for all but the last qp create a new copy(clone)
+		 * of the skb and pass to the qp. Pass the original
+		 * skb to the last qp in the list unless it failed
+		 * checks above
 		 */
-		if (mca->qp_list.next != &mcg->qp_list) {
-			struct sk_buff *cskb;
-			struct rxe_pkt_info *cpkt;
-
-			cskb = skb_clone(skb, GFP_ATOMIC);
-			if (unlikely(!cskb))
+		if (n < nmax - 1) {
+			skb_copy = skb_clone(skb, GFP_KERNEL);
+			if (unlikely(!skb_copy)) {
+				rxe_drop_ref(qp);
 				continue;
+			}
 
 			if (WARN_ON(!ib_device_try_get(&rxe->ib_dev))) {
-				kfree_skb(cskb);
-				break;
+				kfree_skb(skb_copy);
+				rxe_drop_ref(qp);
+				continue;
 			}
 
-			cpkt = SKB_TO_PKT(cskb);
-			cpkt->qp = qp;
-			rxe_add_ref(qp);
-			rxe_rcv_pkt(cpkt, cskb);
+			pkt_copy = SKB_TO_PKT(skb_copy);
+			pkt_copy->qp = qp;
+			rxe_rcv_pkt(pkt_copy, skb_copy);
 		} else {
 			pkt->qp = qp;
-			rxe_add_ref(qp);
 			rxe_rcv_pkt(pkt, skb);
-			skb = NULL;	/* mark consumed */
 		}
 	}
 
-	spin_unlock_bh(&rxe->mcg_lock);
-
-	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
-
-	if (likely(!skb))
-		return;
-
-	/* This only occurs if one of the checks fails on the last
-	 * QP in the list above
-	 */
+	kfree(qp_array);
+	return;
 
-drop:
+err_free:
+	kfree(qp_array);
+err_drop:
 	kfree_skb(skb);
 	ib_device_put(&rxe->ib_dev);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
  2022-02-18  0:35 [PATCH for-next v12 0/6] Move two object pools to rxe_mcast.c Bob Pearson
                   ` (4 preceding siblings ...)
  2022-02-18  0:35 ` [PATCH for-next v12 5/6] RDMA/rxe: For mcast copy qp list to temp array Bob Pearson
@ 2022-02-18  0:35 ` Bob Pearson
  2022-02-23 19:52   ` Jason Gunthorpe
  5 siblings, 1 reply; 11+ messages in thread
From: Bob Pearson @ 2022-02-18  0:35 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Replace spinlock with rcu read locks for read side operations
on mca in rxe_recv.c and rxe_mcast.c.
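
For illustration, a minimal sketch of the read-side pattern after the
conversion (not code from the patch; the debug print is just a
placeholder): readers walk the list under rcu_read_lock() instead of
rxe->mcg_lock, while writers still serialize on the spinlock and free
each mca through call_rcu().

static void rxe_mcg_walk_sketch(struct rxe_mcg *mcg)
{
	struct rxe_mca *mca;

	rcu_read_lock();
	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
		/* mca may be unlinked concurrently, but its memory
		 * stays valid until after rcu_read_unlock()
		 */
		pr_debug("qp#%u is attached\n", mca->qp->ibqp.qp_num);
	}
	rcu_read_unlock();
}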

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++----------
 drivers/infiniband/sw/rxe/rxe_recv.c  |  6 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
 3 files changed, 67 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 349a6fac2fcc..2bfec3748e1e 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -17,6 +17,12 @@
  * mca is created. It holds a pointer to the qp and is added to a list
  * of qp's that are attached to the mcg. The qp_list is used to replicate
  * mcast packets in the rxe receive path.
+ *
+ * The highest performance operations are mca list traversal when
+ * processing incoming multicast packets which need to be fanned out
+ * to the attached qp's. This list is protected by RCU locking for read
+ * operations and a spinlock in the rxe_dev struct for write operations.
+ * The red-black tree is protected by the same spinlock.
  */
 
 #include "rxe.h"
@@ -288,7 +294,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
 }
 
 /**
- * __rxe_init_mca - initialize a new mca holding lock
+ * __rxe_init_mca_rcu - initialize a new mca holding lock
  * @qp: qp object
  * @mcg: mcg object
  * @mca: empty space for new mca
@@ -298,8 +304,8 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
  *
  * Returns: 0 on success else an error
  */
-static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
-			  struct rxe_mca *mca)
+static int __rxe_init_mca_rcu(struct rxe_qp *qp, struct rxe_mcg *mcg,
+			      struct rxe_mca *mca)
 {
 	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	int n;
@@ -322,7 +328,10 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
 	rxe_add_ref(qp);
 	mca->qp = qp;
 
-	list_add_tail(&mca->qp_list, &mcg->qp_list);
+	kref_get(&mcg->ref_cnt);
+	mca->mcg = mcg;
+
+	list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);
 
 	return 0;
 }
@@ -343,14 +352,14 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 	int err;
 
 	/* check to see if the qp is already a member of the group */
-	spin_lock_irqsave(&rxe->mcg_lock, flags);
-	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
 		if (mca->qp == qp) {
-			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+			rcu_read_unlock();
 			return 0;
 		}
 	}
-	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
+	rcu_read_unlock();
 
 	/* speculative alloc new mca without using GFP_ATOMIC */
 	mca = kzalloc(sizeof(*mca), GFP_KERNEL);
@@ -367,7 +376,7 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 		}
 	}
 
-	err = __rxe_init_mca(qp, mcg, mca);
+	err = __rxe_init_mca_rcu(qp, mcg, mca);
 	if (err)
 		kfree(mca);
 out:
@@ -376,15 +385,16 @@ static int rxe_attach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 }
 
 /**
- * __rxe_cleanup_mca - cleanup mca object holding lock
- * @mca: mca object
- * @mcg: mcg object
+ * __rxe_destroy_mca - free mca resources
+ * @head: rcu_head embedded in mca
  *
  * Context: caller must hold a reference to mcg and rxe->mcg_lock
+ *          all rcu read operations should be complete
  */
-static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
+static void __rxe_destroy_mca(struct rcu_head *head)
 {
-	list_del(&mca->qp_list);
+	struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
+	struct rxe_mcg *mcg = mca->mcg;
 
 	atomic_dec(&mcg->qp_num);
 	atomic_dec(&mcg->rxe->mcg_attach);
@@ -394,6 +404,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
 	kfree(mca);
 }
 
+/**
+ * __rxe_cleanup_mca_rcu - cleanup mca object holding lock
+ * @mca: mca object
+ * @mcg: mcg object
+ *
+ * Context: caller must hold a reference to mcg and rxe->mcg_lock
+ */
+static void __rxe_cleanup_mca_rcu(struct rxe_mca *mca, struct rxe_mcg *mcg)
+{
+	list_del_rcu(&mca->qp_list);
+	call_rcu(&mca->rcu, __rxe_destroy_mca);
+}
+
 /**
  * rxe_detach_mcg - detach qp from mcg
  * @mcg: mcg object
@@ -404,31 +427,35 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
 static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
 {
 	struct rxe_dev *rxe = mcg->rxe;
-	struct rxe_mca *mca, *tmp;
-	unsigned long flags;
+	struct rxe_mca *mca;
+	int ret;
 
-	spin_lock_irqsave(&rxe->mcg_lock, flags);
-	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
-		if (mca->qp == qp) {
-			__rxe_cleanup_mca(mca, mcg);
-
-			/* if the number of qp's attached to the
-			 * mcast group falls to zero go ahead and
-			 * tear it down. This will not free the
-			 * object since we are still holding a ref
-			 * from the caller
-			 */
-			if (atomic_read(&mcg->qp_num) <= 0)
-				__rxe_destroy_mcg(mcg);
-
-			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
-			return 0;
-		}
+	spin_lock_bh(&rxe->mcg_lock);
+	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
+		if (mca->qp == qp)
+			goto found;
 	}
 
 	/* we didn't find the qp on the list */
-	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
-	return -EINVAL;
+	ret = -EINVAL;
+	goto done;
+
+found:
+	__rxe_cleanup_mca_rcu(mca, mcg);
+
+	/* if the number of qp's attached to the
+	 * mcast group falls to zero go ahead and
+	 * tear it down. This will not free the
+	 * object since we are still holding a ref
+	 * from the caller
+	 */
+	if (atomic_read(&mcg->qp_num) <= 0)
+		__rxe_destroy_mcg(mcg);
+
+	ret = 0;
+done:
+	spin_unlock_bh(&rxe->mcg_lock);
+	return ret;
 }
 
 /**
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index 9b21cbb22602..c2cab85c6576 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -265,15 +265,15 @@ static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
 	qp_array = kmalloc_array(nmax, sizeof(qp), GFP_KERNEL);
 	n = 0;
 
-	spin_lock_bh(&rxe->mcg_lock);
-	list_for_each_entry(mca, &mcg->qp_list, qp_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
 		/* protect the qp pointers in the list */
 		rxe_add_ref(mca->qp);
 		qp_array[n++] = mca->qp;
 		if (n == nmax)
 			break;
 	}
-	spin_unlock_bh(&rxe->mcg_lock);
+	rcu_read_unlock();
 	nmax = n;
 	kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 6b15251ff67a..4ee51de50b95 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -364,7 +364,9 @@ struct rxe_mcg {
 
 struct rxe_mca {
 	struct list_head	qp_list;
+	struct rcu_head		rcu;
 	struct rxe_qp		*qp;
+	struct rxe_mcg		*mcg;
 };
 
 struct rxe_port {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory
  2022-02-18  0:35 ` [PATCH for-next v12 1/6] RDMA/rxe: Add code to cleanup mcast memory Bob Pearson
@ 2022-02-23 19:33   ` Jason Gunthorpe
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2022-02-23 19:33 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Thu, Feb 17, 2022 at 06:35:39PM -0600, Bob Pearson wrote:
> Well behaved applications will free all memory allocated by multicast.
> If this fails rdma-core will detach all qp's from multicast groups when
> an application terminates. This patch prints a warning and cleans out
> memory allocated by multicast if both of these fail to occur.
> 
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>  drivers/infiniband/sw/rxe/rxe.c       |  2 ++
>  drivers/infiniband/sw/rxe/rxe_loc.h   |  1 +
>  drivers/infiniband/sw/rxe/rxe_mcast.c | 38 +++++++++++++++++++++++++++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 3520eb2db685..603b0156f889 100644
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -29,6 +29,8 @@ void rxe_dealloc(struct ib_device *ib_dev)
>  	rxe_pool_cleanup(&rxe->mr_pool);
>  	rxe_pool_cleanup(&rxe->mw_pool);
>  
> +	rxe_cleanup_mcast(rxe);

Let's just put

WARN_ON(!RB_EMPTY_ROOT(&rxe->mcg_tree));

Right here and forget about the rest. Leaking memory if the kernel is
buggy is OK.
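
For illustration, a minimal sketch of rxe_dealloc() with this suggestion
applied (not code from the thread; the existing pool cleanups are assumed
to stay unchanged):

void rxe_dealloc(struct ib_device *ib_dev)
{
	struct rxe_dev *rxe = container_of(ib_dev, struct rxe_dev, ib_dev);

	/* ... existing rxe_pool_cleanup() calls ... */
	rxe_pool_cleanup(&rxe->mr_pool);
	rxe_pool_cleanup(&rxe->mw_pool);

	/* assert instead of recovering; a leak here means a kernel bug */
	WARN_ON(!RB_EMPTY_ROOT(&rxe->mcg_tree));

	if (rxe->tfm)
		crypto_free_shash(rxe->tfm);
}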

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
  2022-02-18  0:35 ` [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU Bob Pearson
@ 2022-02-23 19:52   ` Jason Gunthorpe
  2022-02-23 22:40     ` Bob Pearson
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2022-02-23 19:52 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Thu, Feb 17, 2022 at 06:35:44PM -0600, Bob Pearson wrote:
> Replace spinlock with rcu read locks for read side operations
> on mca in rxe_recv.c and rxe_mcast.c.
> 
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>  drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++----------
>  drivers/infiniband/sw/rxe/rxe_recv.c  |  6 +-
>  drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
>  3 files changed, 67 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
> index 349a6fac2fcc..2bfec3748e1e 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
> @@ -17,6 +17,12 @@
>   * mca is created. It holds a pointer to the qp and is added to a list
>   * of qp's that are attached to the mcg. The qp_list is used to replicate
>   * mcast packets in the rxe receive path.
> + *
> + * The highest performance operations are mca list traversal when
> + * processing incoming multicast packets which need to be fanned out
> + * to the attached qp's. This list is protected by RCU locking for read
> + * operations and a spinlock in the rxe_dev struct for write operations.
> + * The red-black tree is protected by the same spinlock.
>   */
>  
>  #include "rxe.h"
> @@ -288,7 +294,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
>  }
>  
>  /**
> - * __rxe_init_mca - initialize a new mca holding lock
> + * __rxe_init_mca_rcu - initialize a new mca holding lock
>   * @qp: qp object
>   * @mcg: mcg object
>   * @mca: empty space for new mca
> @@ -298,8 +304,8 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
>   *
>   * Returns: 0 on success else an error
>   */
> -static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
> -			  struct rxe_mca *mca)
> +static int __rxe_init_mca_rcu(struct rxe_qp *qp, struct rxe_mcg *mcg,
> +			      struct rxe_mca *mca)
>  {

This isn't really 'rcu', it still has to hold the write side lock

>  	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
>  	int n;
> @@ -322,7 +328,10 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
>  	rxe_add_ref(qp);
>  	mca->qp = qp;
>  
> -	list_add_tail(&mca->qp_list, &mcg->qp_list);
> +	kref_get(&mcg->ref_cnt);
> +	mca->mcg = mcg;
> +
> +	list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);

list_add_tail gets to be called rcu because it is slower than the
non-rcu safe version.

> -static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
> +static void __rxe_destroy_mca(struct rcu_head *head)
>  {
> -	list_del(&mca->qp_list);
> +	struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
> +	struct rxe_mcg *mcg = mca->mcg;
>  
>  	atomic_dec(&mcg->qp_num);
>  	atomic_dec(&mcg->rxe->mcg_attach);
> @@ -394,6 +404,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>  	kfree(mca);

> +/**
> + * __rxe_cleanup_mca_rcu - cleanup mca object holding lock
> + * @mca: mca object
> + * @mcg: mcg object
> + *
> + * Context: caller must hold a reference to mcg and rxe->mcg_lock
> + */
> +static void __rxe_cleanup_mca_rcu(struct rxe_mca *mca, struct rxe_mcg *mcg)
> +{
> +	list_del_rcu(&mca->qp_list);
> +	call_rcu(&mca->rcu, __rxe_destroy_mca);
> +}

I think this is OK now..

> +
>  /**
>   * rxe_detach_mcg - detach qp from mcg
>   * @mcg: mcg object
> @@ -404,31 +427,35 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>  static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
>  {
>  	struct rxe_dev *rxe = mcg->rxe;
> -	struct rxe_mca *mca, *tmp;
> -	unsigned long flags;
> +	struct rxe_mca *mca;
> +	int ret;
>  
> -	spin_lock_irqsave(&rxe->mcg_lock, flags);
> -	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
> -		if (mca->qp == qp) {
> -			__rxe_cleanup_mca(mca, mcg);
> -
> -			/* if the number of qp's attached to the
> -			 * mcast group falls to zero go ahead and
> -			 * tear it down. This will not free the
> -			 * object since we are still holding a ref
> -			 * from the caller
> -			 */
> -			if (atomic_read(&mcg->qp_num) <= 0)
> -				__rxe_destroy_mcg(mcg);
> -
> -			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
> -			return 0;
> -		}
> +	spin_lock_bh(&rxe->mcg_lock);
> +	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
> +		if (mca->qp == qp)
> +			goto found;
>  	}
>  
>  	/* we didn't find the qp on the list */
> -	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
> -	return -EINVAL;
> +	ret = -EINVAL;
> +	goto done;
> +
> +found:
> +	__rxe_cleanup_mca_rcu(mca, mcg);
> +
> +	/* if the number of qp's attached to the
> +	 * mcast group falls to zero go ahead and
> +	 * tear it down. This will not free the
> +	 * object since we are still holding a ref
> +	 * from the caller
> +	 */

Fix the word wrap

Would prefer to avoid the found label in the middle of the code.

> +	rcu_read_lock();
> +	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
>  		/* protect the qp pointers in the list */
>  		rxe_add_ref(mca->qp);

The other approach you could take is to also kref_free the qp and allow
its kref to become zero here. But this is fine, I think.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
  2022-02-23 19:52   ` Jason Gunthorpe
@ 2022-02-23 22:40     ` Bob Pearson
  2022-02-24  0:04       ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Bob Pearson @ 2022-02-23 22:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, linux-rdma

On 2/23/22 13:52, Jason Gunthorpe wrote:
> On Thu, Feb 17, 2022 at 06:35:44PM -0600, Bob Pearson wrote:
>> Replace spinlock with rcu read locks for read side operations
>> on mca in rxe_recv.c and rxe_mcast.c.
>>
>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>>  drivers/infiniband/sw/rxe/rxe_mcast.c | 97 +++++++++++++++++----------
>>  drivers/infiniband/sw/rxe/rxe_recv.c  |  6 +-
>>  drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
>>  3 files changed, 67 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
>> index 349a6fac2fcc..2bfec3748e1e 100644
>> +++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
>> @@ -17,6 +17,12 @@
>>   * mca is created. It holds a pointer to the qp and is added to a list
>>   * of qp's that are attached to the mcg. The qp_list is used to replicate
>>   * mcast packets in the rxe receive path.
>> + *
>> + * The highest performance operations are mca list traversal when
>> + * processing incoming multicast packets which need to be fanned out
>> + * to the attached qp's. This list is protected by RCU locking for read
>> + * operations and a spinlock in the rxe_dev struct for write operations.
>> + * The red-black tree is protected by the same spinlock.
>>   */
>>  
>>  #include "rxe.h"
>> @@ -288,7 +294,7 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
>>  }
>>  
>>  /**
>> - * __rxe_init_mca - initialize a new mca holding lock
>> + * __rxe_init_mca_rcu - initialize a new mca holding lock
>>   * @qp: qp object
>>   * @mcg: mcg object
>>   * @mca: empty space for new mca
>> @@ -298,8 +304,8 @@ static void rxe_destroy_mcg(struct rxe_mcg *mcg)
>>   *
>>   * Returns: 0 on success else an error
>>   */
>> -static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
>> -			  struct rxe_mca *mca)
>> +static int __rxe_init_mca_rcu(struct rxe_qp *qp, struct rxe_mcg *mcg,
>> +			      struct rxe_mca *mca)
>>  {
> 
> This isn't really 'rcu', it still has to hold the write side lock
> 
>>  	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
>>  	int n;
>> @@ -322,7 +328,10 @@ static int __rxe_init_mca(struct rxe_qp *qp, struct rxe_mcg *mcg,
>>  	rxe_add_ref(qp);
>>  	mca->qp = qp;
>>  
>> -	list_add_tail(&mca->qp_list, &mcg->qp_list);
>> +	kref_get(&mcg->ref_cnt);
>> +	mca->mcg = mcg;
>> +
>> +	list_add_tail_rcu(&mca->qp_list, &mcg->qp_list);
> 
> list_add_tail gets to be called rcu because it is slower than the
> non-rcu safe version.
> 
>> -static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>> +static void __rxe_destroy_mca(struct rcu_head *head)
>>  {
>> -	list_del(&mca->qp_list);
>> +	struct rxe_mca *mca = container_of(head, typeof(*mca), rcu);
>> +	struct rxe_mcg *mcg = mca->mcg;
>>  
>>  	atomic_dec(&mcg->qp_num);
>>  	atomic_dec(&mcg->rxe->mcg_attach);
>> @@ -394,6 +404,19 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>>  	kfree(mca);
> 
>> +/**
>> + * __rxe_cleanup_mca_rcu - cleanup mca object holding lock
>> + * @mca: mca object
>> + * @mcg: mcg object
>> + *
>> + * Context: caller must hold a reference to mcg and rxe->mcg_lock
>> + */
>> +static void __rxe_cleanup_mca_rcu(struct rxe_mca *mca, struct rxe_mcg *mcg)
>> +{
>> +	list_del_rcu(&mca->qp_list);
>> +	call_rcu(&mca->rcu, __rxe_destroy_mca);
>> +}
> 
> I think this is OK now..
> 
>> +
>>  /**
>>   * rxe_detach_mcg - detach qp from mcg
>>   * @mcg: mcg object
>> @@ -404,31 +427,35 @@ static void __rxe_cleanup_mca(struct rxe_mca *mca, struct rxe_mcg *mcg)
>>  static int rxe_detach_mcg(struct rxe_mcg *mcg, struct rxe_qp *qp)
>>  {
>>  	struct rxe_dev *rxe = mcg->rxe;
>> -	struct rxe_mca *mca, *tmp;
>> -	unsigned long flags;
>> +	struct rxe_mca *mca;
>> +	int ret;
>>  
>> -	spin_lock_irqsave(&rxe->mcg_lock, flags);
>> -	list_for_each_entry_safe(mca, tmp, &mcg->qp_list, qp_list) {
>> -		if (mca->qp == qp) {
>> -			__rxe_cleanup_mca(mca, mcg);
>> -
>> -			/* if the number of qp's attached to the
>> -			 * mcast group falls to zero go ahead and
>> -			 * tear it down. This will not free the
>> -			 * object since we are still holding a ref
>> -			 * from the caller
>> -			 */
>> -			if (atomic_read(&mcg->qp_num) <= 0)
>> -				__rxe_destroy_mcg(mcg);
>> -
>> -			spin_unlock_irqrestore(&rxe->mcg_lock, flags);
>> -			return 0;
>> -		}
>> +	spin_lock_bh(&rxe->mcg_lock);
>> +	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
>> +		if (mca->qp == qp)
>> +			goto found;
>>  	}
>>  
>>  	/* we didn't find the qp on the list */
>> -	spin_unlock_irqrestore(&rxe->mcg_lock, flags);
>> -	return -EINVAL;
>> +	ret = -EINVAL;
>> +	goto done;
>> +
>> +found:
>> +	__rxe_cleanup_mca_rcu(mca, mcg);
>> +
>> +	/* if the number of qp's attached to the
>> +	 * mcast group falls to zero go ahead and
>> +	 * tear it down. This will not free the
>> +	 * object since we are still holding a ref
>> +	 * from the caller
>> +	 */
> 
> Fix the word wrap
> 
> Would prefer to avoid the found label in the middle of the code.
> 
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(mca, &mcg->qp_list, qp_list) {
>>  		/* protect the qp pointers in the list */
>>  		rxe_add_ref(mca->qp);
> 
> The other approach you could take is to also kref_free the qp and allow
> its kref to become zero here. But this is fine, I think.
> 
> Jason

OK I looked at this again. Here is what happens.

Without the complete()/wait_for_completion() fix the test passes, but the (private, i.e. not visible to rdma-core)
mca object is still holding a reference on the qp, which will only be freed after the call_rcu() callback runs.
Meanwhile the ib_detach_mcast() call returns to rdma-core, and since there is nothing left to do the test is over
and the python code closes the device. But this generates a warning (rdma_core.c line 940) since not all the
user objects (i.e. the qp) were freed. In other contexts we are planning on putting complete()/wait...()'s
in all the other verbs calls which ref count objects, for the same reason. I think it applies here too.
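
For illustration, a minimal sketch of the complete()/wait_for_completion()
pattern being described (not code from the series; the struct and function
names are made up): the destroy path waits until every outstanding
reference, including one dropped from a call_rcu() callback, has been
released before returning to rdma-core.

struct sketch_qp {
	struct kref ref;		/* held by mca, in-flight packets, etc. */
	struct completion drained;	/* init_completion() at create time */
};

static void sketch_qp_release(struct kref *kref)
{
	struct sketch_qp *qp = container_of(kref, struct sketch_qp, ref);

	complete(&qp->drained);
}

static void sketch_destroy_qp(struct sketch_qp *qp)
{
	kref_put(&qp->ref, sketch_qp_release);	/* drop the creation ref */
	wait_for_completion(&qp->drained);	/* wait for remaining refs */
	kfree(qp);
}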

Bob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next v12 6/6] RDMA/rxe: Convert mca read locking to RCU
  2022-02-23 22:40     ` Bob Pearson
@ 2022-02-24  0:04       ` Jason Gunthorpe
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2022-02-24  0:04 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Wed, Feb 23, 2022 at 04:40:19PM -0600, Bob Pearson wrote:

> OK I looked at this again. Here is what happens.
> 
> without the complete()/wait_for_completion() fix the test passes but
> the (private i.e. not visible to rdma-core) mca object is holding a
> reference on qp which will be freed after the call_rcu() subroutine
> takes place.  Meanwhile the ib_detach_mcast() call returns to
> rdma-core and since there is nothing left to do the test is over and
> the python code closes the device. But this generates a warning
> (rdma_core.c line 940) since not all the user objects (i.e. the qp)
> were freed. In other contexts we are planning on putting
> complete()/wait...()'s in all the other verbs calls which ref count
> objects for the same reason. I think it applies here too.

This explanation makes sense..

The mistake is on the QP side where it has a refcount but doesn't wait
for it to drain I suppose.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread
