* [PATCH for-next v15 0/2] Fix race conditions in rxe_pool
@ 2022-06-06 14:18 Bob Pearson
2022-06-06 14:18 ` [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
2022-06-06 14:18 ` [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu Bob Pearson
0 siblings, 2 replies; 7+ messages in thread
From: Bob Pearson @ 2022-06-06 14:18 UTC (permalink / raw)
To: jgg, zyjzyj2000, bvanassche, linux-rdma, frank.zago, jhack; +Cc: Bob Pearson
There are several race conditions discovered in the current rdma_rxe
driver. They mostly relate to races between normal operations and
destroying objects. This patch series includes the remaining two
patches of the original series.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v15
Rebased to the current for-next branch of 5.19.0-rc1+.
Adds support for RDMA_AH_CREATE/DESTROY_SLEEPABLE.
v14
Rebased to current wip/jgg-for-next
Removed patch 3 as unnecessary
Waited for resolution of bugs in rxe_resp and some locking bugs.
Note: With rcu read lock in rxe_pool_get_index there are no bottom
half spinlocks from looking up AH or non AH objects to conflict
with the default xa_lock so no lockdep warnings occur. The rxe_pool
alloc functions can hold locks simultanteously with the rcu read
lock so it does not have to prevent soft or hard IRQs.
v13
Rebased to current for-next
Addressed Jason's comments on patch 1, 8 and 9. 8 and 9 are
combined into one patch.
v12
Rebased to current wip/jgg-for-next
Dropped patches already accepted from v11.
Moved all object cleanup code to type specific cleanup routines.
Renamed to match Jason's requests.
Addressed some other issued raised.
Kept the contentious rxe_hide() function but renamed to
rxe_disable_lookup() which says what it does. I am still convinced
this cleaner than other alternatives like moving xa_erase to the
top of destroy routines but just for indexed objects.
v11
Rebased to current for-next.
Reordered patches and made other changes to respond to issues
reported by Jason Gunthorpe.
v10
Rebased to current wip/jgg-for-next.
Split some patches into smaller ones.
v9
Corrected issues reported by Jason Gunthorpe,
Converted locking in rxe_mcast.c and rxe_pool.c to use RCU
Split up the patches into smaller changes
v8
Fixed an additional race in 3/8 which was not handled correctly.
v7
Corrected issues reported by Jason Gunthorpe
Link: https://lore.kernel.org/linux-rdma/20211207190947.GH6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207191857.GI6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207192824.GJ6385@nvidia.com/
v6
Fixed a kzalloc flags bug.
Fixed comment bug reported by 'Kernel Test Robot'.
Changed type of rxe_pool.c in __rxe_fini().
v5
Removed patches already accepted into for-next and addressed comments
from Jason Gunthorpe.
v4
Restructured patch series to change to xarray earlier which
greatly simplified the changes.
Rebased to current for-next
v3
Changed rxe_alloc to use GFP_KERNEL
Addressed other comments by Jason Gunthorp
Merged the previous 06/10 and 07/10 patches into one since they overlapped
Added some minor cleanups as 10/10
v2
Rebased to current for-next.
Added 4 additional patches
Bob Pearson (2):
RDMA/rxe: Stop lookup of partially built objects
RDMA/rxe: Convert read side locking to rcu
drivers/infiniband/sw/rxe/rxe_mr.c | 2 +-
drivers/infiniband/sw/rxe/rxe_mw.c | 4 +-
drivers/infiniband/sw/rxe/rxe_pool.c | 97 ++++++++++++++++++++++++---
drivers/infiniband/sw/rxe/rxe_pool.h | 18 +++--
drivers/infiniband/sw/rxe/rxe_req.c | 37 ++++++++--
drivers/infiniband/sw/rxe/rxe_verbs.c | 39 +++++++----
6 files changed, 160 insertions(+), 37 deletions(-)
base-commit: f2906aa863381afb0015a9eb7fefad885d4e5a56
--
2.34.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects
2022-06-06 14:18 [PATCH for-next v15 0/2] Fix race conditions in rxe_pool Bob Pearson
@ 2022-06-06 14:18 ` Bob Pearson
2022-06-06 17:10 ` Jason Gunthorpe
2022-06-06 14:18 ` [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu Bob Pearson
1 sibling, 1 reply; 7+ messages in thread
From: Bob Pearson @ 2022-06-06 14:18 UTC (permalink / raw)
To: jgg, zyjzyj2000, bvanassche, linux-rdma, frank.zago, jhack; +Cc: Bob Pearson
Currently the rdma_rxe driver has a security weakness due to giving
objects which are partially initialized indices allowing external
actors to gain access to them by sending packets which refer to
their index (e.g. qpn, rkey, etc) causing unpredictable results.
This patch adds a new API rxe_finalize(obj) which enables looking up
pool objects from indices using rxe_pool_get_index() for
AH, QP, MR, and MW. They are added in create verbs only after the
objects are fully initialized.
It also adds wait for completion to destroy/dealloc verbs to assure that
all references have been dropped before returning to rdma_core by
implementing a new rxe_pool API rxe_cleanup() which drops a reference
to the object and then waits for all other references to be dropped.
When the last reference is dropped the object is completed by kref.
After that it cleans up the object and if locally allocated frees
the memory. In the special case of address handle objects the delay
is implemented separately if the destroy_ah call is not sleepable.
Combined with deferring cleanup code to type specific cleanup
routines this allows all pending activity referring to objects to
complete before returning to rdma_core.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_mr.c | 2 +-
drivers/infiniband/sw/rxe/rxe_mw.c | 4 +-
drivers/infiniband/sw/rxe/rxe_pool.c | 93 +++++++++++++++++++++++++--
drivers/infiniband/sw/rxe/rxe_pool.h | 18 ++++--
drivers/infiniband/sw/rxe/rxe_req.c | 37 +++++++++--
drivers/infiniband/sw/rxe/rxe_verbs.c | 39 +++++++----
6 files changed, 161 insertions(+), 32 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index fc3942e04a1f..9a5c2af6a56f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -687,7 +687,7 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
if (atomic_read(&mr->num_mw) > 0)
return -EINVAL;
- rxe_put(mr);
+ rxe_cleanup(mr);
return 0;
}
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 2e1fa844fabf..86e63d7dc1f3 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -33,6 +33,8 @@ int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
RXE_MW_STATE_FREE : RXE_MW_STATE_VALID;
spin_lock_init(&mw->lock);
+ rxe_finalize(mw);
+
return 0;
}
@@ -40,7 +42,7 @@ int rxe_dealloc_mw(struct ib_mw *ibmw)
{
struct rxe_mw *mw = to_rmw(ibmw);
- rxe_put(mw);
+ rxe_cleanup(mw);
return 0;
}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 19b14826385b..7a5685f0713e 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -6,6 +6,7 @@
#include "rxe.h"
+#define RXE_POOL_TIMEOUT (200)
#define RXE_POOL_ALIGN (16)
static const struct rxe_type_info {
@@ -136,8 +137,12 @@ void *rxe_alloc(struct rxe_pool *pool)
elem->pool = pool;
elem->obj = obj;
kref_init(&elem->ref_cnt);
+ init_completion(&elem->complete);
- err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
+ /* allocate index in array but leave pointer as NULL so it
+ * can't be looked up until rxe_finalize() is called
+ */
+ err = xa_alloc_cyclic(&pool->xa, &elem->index, NULL, pool->limit,
&pool->next, GFP_KERNEL);
if (err)
goto err_free;
@@ -151,9 +156,11 @@ void *rxe_alloc(struct rxe_pool *pool)
return NULL;
}
-int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
+int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
+ bool sleepable)
{
int err;
+ gfp_t gfp_flags;
if (WARN_ON(pool->type == RXE_TYPE_MR))
return -EINVAL;
@@ -164,9 +171,16 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
elem->pool = pool;
elem->obj = (u8 *)elem - pool->elem_offset;
kref_init(&elem->ref_cnt);
+ init_completion(&elem->complete);
- err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
- &pool->next, GFP_KERNEL);
+ /* AH objects are unique in that the create_ah verb
+ * can be called in atomic context. If the create_ah
+ * call is not sleepable use GFP_ATOMIC.
+ */
+ gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
+
+ err = xa_alloc_cyclic(&pool->xa, &elem->index, NULL,
+ pool->limit, &pool->next, gfp_flags);
if (err)
goto err_cnt;
@@ -198,9 +212,64 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
static void rxe_elem_release(struct kref *kref)
{
struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
+
+ complete(&elem->complete);
+}
+
+int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
+{
struct rxe_pool *pool = elem->pool;
+ struct xarray *xa = &pool->xa;
+ static int timeout = RXE_POOL_TIMEOUT;
+ unsigned long flags;
+ int ret, err = 0;
+ void *xa_ret;
- xa_erase(&pool->xa, elem->index);
+ /* erase xarray entry to prevent looking up
+ * the pool elem from its index
+ */
+ xa_lock_irqsave(xa, flags);
+ xa_ret = __xa_erase(xa, elem->index);
+ xa_unlock_irqrestore(xa, flags);
+ WARN_ON(xa_err(xa_ret));
+
+ /* if this is the last call to rxe_put complete the
+ * object. It is safe to touch obj->elem after this since
+ * it is freed below
+ */
+ __rxe_put(elem);
+
+ /* wait until all references to the object have been
+ * dropped before final object specific cleanup and
+ * return to rdma-core
+ */
+ if (sleepable) {
+ if (!completion_done(&elem->complete) && timeout) {
+ ret = wait_for_completion_timeout(&elem->complete,
+ timeout);
+
+ /* Shouldn't happen. There are still references to
+ * the object but, rather than deadlock, free the
+ * object or pass back to rdma-core.
+ */
+ if (WARN_ON(!ret))
+ err = -EINVAL;
+ }
+ } else {
+ unsigned long until = jiffies + timeout;
+
+ /* AH objects are unique in that the destroy_ah verb
+ * can be called in atomic context. This delay
+ * replaces the wait_for_completion call above
+ * when the destroy_ah call is not sleepable
+ */
+ while (!completion_done(&elem->complete) &&
+ time_before(jiffies, until))
+ mdelay(1);
+
+ if (WARN_ON(!completion_done(&elem->complete)))
+ err = -EINVAL;
+ }
if (pool->cleanup)
pool->cleanup(elem);
@@ -209,6 +278,8 @@ static void rxe_elem_release(struct kref *kref)
kfree(elem->obj);
atomic_dec(&pool->num_elem);
+
+ return err;
}
int __rxe_get(struct rxe_pool_elem *elem)
@@ -220,3 +291,15 @@ int __rxe_put(struct rxe_pool_elem *elem)
{
return kref_put(&elem->ref_cnt, rxe_elem_release);
}
+
+void __rxe_finalize(struct rxe_pool_elem *elem)
+{
+ struct xarray *xa = &elem->pool->xa;
+ unsigned long flags;
+ void *ret;
+
+ xa_lock_irqsave(xa, flags);
+ ret = __xa_store(&elem->pool->xa, elem->index, elem, GFP_KERNEL);
+ xa_unlock_irqrestore(xa, flags);
+ WARN_ON(xa_err(ret));
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 0860660d65ec..9d83cb32092f 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -24,6 +24,7 @@ struct rxe_pool_elem {
void *obj;
struct kref ref_cnt;
struct list_head list;
+ struct completion complete;
u32 index;
};
@@ -57,21 +58,28 @@ void rxe_pool_cleanup(struct rxe_pool *pool);
void *rxe_alloc(struct rxe_pool *pool);
/* connect already allocated object to pool */
-int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem);
-
-#define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->elem)
+int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
+ bool sleepable);
+#define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->elem, true)
+#define rxe_add_to_pool_ah(pool, obj, sleepable) __rxe_add_to_pool(pool, \
+ &(obj)->elem, sleepable)
/* lookup an indexed object from index. takes a reference on object */
void *rxe_pool_get_index(struct rxe_pool *pool, u32 index);
int __rxe_get(struct rxe_pool_elem *elem);
-
#define rxe_get(obj) __rxe_get(&(obj)->elem)
int __rxe_put(struct rxe_pool_elem *elem);
-
#define rxe_put(obj) __rxe_put(&(obj)->elem)
+int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable);
+#define rxe_cleanup(obj) __rxe_cleanup(&(obj)->elem, true)
+#define rxe_cleanup_ah(obj, sleepable) __rxe_cleanup(&(obj)->elem, sleepable)
+
#define rxe_read(obj) kref_read(&(obj)->elem.ref_cnt)
+void __rxe_finalize(struct rxe_pool_elem *elem);
+#define rxe_finalize(obj) __rxe_finalize(&(obj)->elem)
+
#endif /* RXE_POOL_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 9d98237389cf..e8a1664a40eb 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -161,16 +161,36 @@ static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
(wqe->state != wqe_state_processing)))
return NULL;
- if (unlikely((wqe->wr.send_flags & IB_SEND_FENCE) &&
- (index != cons))) {
- qp->req.wait_fence = 1;
- return NULL;
- }
-
wqe->mask = wr_opcode_mask(wqe->wr.opcode, qp);
return wqe;
}
+/**
+ * rxe_wqe_is_fenced - check if next wqe is fenced
+ * @qp: the queue pair
+ * @wqe: the next wqe
+ *
+ * Returns: 1 if wqe needs to wait
+ * 0 if wqe is ready to go
+ */
+static int rxe_wqe_is_fenced(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
+{
+ /* Local invalidate fence (LIF) see IBA 10.6.5.1
+ * Requires ALL previous operations on the send queue
+ * are complete. Make mandatory for the rxe driver.
+ */
+ if (wqe->wr.opcode == IB_WR_LOCAL_INV)
+ return qp->req.wqe_index != queue_get_consumer(qp->sq.queue,
+ QUEUE_TYPE_FROM_CLIENT);
+
+ /* Fence see IBA 10.8.3.3
+ * Requires that all previous read and atomic operations
+ * are complete.
+ */
+ return (wqe->wr.send_flags & IB_SEND_FENCE) &&
+ atomic_read(&qp->req.rd_atomic) != qp->attr.max_rd_atomic;
+}
+
static int next_opcode_rc(struct rxe_qp *qp, u32 opcode, int fits)
{
switch (opcode) {
@@ -632,6 +652,11 @@ int rxe_requester(void *arg)
if (unlikely(!wqe))
goto exit;
+ if (rxe_wqe_is_fenced(qp, wqe)) {
+ qp->req.wait_fence = 1;
+ goto exit;
+ }
+
if (wqe->mask & WR_LOCAL_OP_MASK) {
ret = rxe_do_local_ops(qp, wqe);
if (unlikely(ret))
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 9d995854a174..151c6280abd5 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -115,7 +115,7 @@ static void rxe_dealloc_ucontext(struct ib_ucontext *ibuc)
{
struct rxe_ucontext *uc = to_ruc(ibuc);
- rxe_put(uc);
+ rxe_cleanup(uc);
}
static int rxe_port_immutable(struct ib_device *dev, u32 port_num,
@@ -149,7 +149,7 @@ static int rxe_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
{
struct rxe_pd *pd = to_rpd(ibpd);
- rxe_put(pd);
+ rxe_cleanup(pd);
return 0;
}
@@ -176,7 +176,8 @@ static int rxe_create_ah(struct ib_ah *ibah,
if (err)
return err;
- err = rxe_add_to_pool(&rxe->ah_pool, ah);
+ err = rxe_add_to_pool_ah(&rxe->ah_pool, ah,
+ init_attr->flags & RDMA_CREATE_AH_SLEEPABLE);
if (err)
return err;
@@ -188,7 +189,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
err = copy_to_user(&uresp->ah_num, &ah->ah_num,
sizeof(uresp->ah_num));
if (err) {
- rxe_put(ah);
+ rxe_cleanup(ah);
return -EFAULT;
}
} else if (ah->is_user) {
@@ -197,6 +198,8 @@ static int rxe_create_ah(struct ib_ah *ibah,
}
rxe_init_av(init_attr->ah_attr, &ah->av);
+ rxe_finalize(ah);
+
return 0;
}
@@ -228,7 +231,8 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
{
struct rxe_ah *ah = to_rah(ibah);
- rxe_put(ah);
+ rxe_cleanup_ah(ah, flags & RDMA_DESTROY_AH_SLEEPABLE);
+
return 0;
}
@@ -308,12 +312,13 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
err = rxe_srq_from_init(rxe, srq, init, udata, uresp);
if (err)
- goto err_put;
+ goto err_cleanup;
return 0;
-err_put:
- rxe_put(srq);
+err_cleanup:
+ rxe_cleanup(srq);
+
return err;
}
@@ -362,7 +367,7 @@ static int rxe_destroy_srq(struct ib_srq *ibsrq, struct ib_udata *udata)
{
struct rxe_srq *srq = to_rsrq(ibsrq);
- rxe_put(srq);
+ rxe_cleanup(srq);
return 0;
}
@@ -429,10 +434,11 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
if (err)
goto qp_init;
+ rxe_finalize(qp);
return 0;
qp_init:
- rxe_put(qp);
+ rxe_cleanup(qp);
return err;
}
@@ -485,7 +491,7 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
if (ret)
return ret;
- rxe_put(qp);
+ rxe_cleanup(qp);
return 0;
}
@@ -803,7 +809,7 @@ static int rxe_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
rxe_cq_disable(cq);
- rxe_put(cq);
+ rxe_cleanup(cq);
return 0;
}
@@ -898,6 +904,7 @@ static struct ib_mr *rxe_get_dma_mr(struct ib_pd *ibpd, int access)
rxe_get(pd);
rxe_mr_init_dma(pd, access, mr);
+ rxe_finalize(mr);
return &mr->ibmr;
}
@@ -926,11 +933,13 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
if (err)
goto err3;
+ rxe_finalize(mr);
+
return &mr->ibmr;
err3:
rxe_put(pd);
- rxe_put(mr);
+ rxe_cleanup(mr);
err2:
return ERR_PTR(err);
}
@@ -958,11 +967,13 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
if (err)
goto err2;
+ rxe_finalize(mr);
+
return &mr->ibmr;
err2:
rxe_put(pd);
- rxe_put(mr);
+ rxe_cleanup(mr);
err1:
return ERR_PTR(err);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu
2022-06-06 14:18 [PATCH for-next v15 0/2] Fix race conditions in rxe_pool Bob Pearson
2022-06-06 14:18 ` [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
@ 2022-06-06 14:18 ` Bob Pearson
2022-06-06 17:07 ` Jason Gunthorpe
1 sibling, 1 reply; 7+ messages in thread
From: Bob Pearson @ 2022-06-06 14:18 UTC (permalink / raw)
To: jgg, zyjzyj2000, bvanassche, linux-rdma, frank.zago, jhack; +Cc: Bob Pearson
Use rcu_read_lock() for protecting read side operations in rxe_pool.c.
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_pool.c | 26 ++++++++++----------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 7a5685f0713e..103bf0c03441 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -195,16 +195,15 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
{
struct rxe_pool_elem *elem;
struct xarray *xa = &pool->xa;
- unsigned long flags;
void *obj;
- xa_lock_irqsave(xa, flags);
+ rcu_read_lock();
elem = xa_load(xa, index);
if (elem && kref_get_unless_zero(&elem->ref_cnt))
obj = elem->obj;
else
obj = NULL;
- xa_unlock_irqrestore(xa, flags);
+ rcu_read_unlock();
return obj;
}
@@ -221,16 +220,15 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
struct rxe_pool *pool = elem->pool;
struct xarray *xa = &pool->xa;
static int timeout = RXE_POOL_TIMEOUT;
- unsigned long flags;
int ret, err = 0;
void *xa_ret;
+ WARN_ON(!in_task());
+
/* erase xarray entry to prevent looking up
* the pool elem from its index
*/
- xa_lock_irqsave(xa, flags);
- xa_ret = __xa_erase(xa, elem->index);
- xa_unlock_irqrestore(xa, flags);
+ xa_ret = xa_erase(xa, elem->index);
WARN_ON(xa_err(xa_ret));
/* if this is the last call to rxe_put complete the
@@ -275,7 +273,7 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
pool->cleanup(elem);
if (pool->type == RXE_TYPE_MR)
- kfree(elem->obj);
+ kfree_rcu(elem->obj);
atomic_dec(&pool->num_elem);
@@ -294,12 +292,8 @@ int __rxe_put(struct rxe_pool_elem *elem)
void __rxe_finalize(struct rxe_pool_elem *elem)
{
- struct xarray *xa = &elem->pool->xa;
- unsigned long flags;
- void *ret;
-
- xa_lock_irqsave(xa, flags);
- ret = __xa_store(&elem->pool->xa, elem->index, elem, GFP_KERNEL);
- xa_unlock_irqrestore(xa, flags);
- WARN_ON(xa_err(ret));
+ void *xa_ret;
+
+ xa_ret = xa_store(&elem->pool->xa, elem->index, elem, GFP_KERNEL);
+ WARN_ON(xa_err(xa_ret));
}
--
2.34.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu
2022-06-06 14:18 ` [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu Bob Pearson
@ 2022-06-06 17:07 ` Jason Gunthorpe
2022-06-06 20:51 ` Pearson, Robert B
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2022-06-06 17:07 UTC (permalink / raw)
To: Bob Pearson; +Cc: zyjzyj2000, bvanassche, linux-rdma, frank.zago, jhack
On Mon, Jun 06, 2022 at 09:18:04AM -0500, Bob Pearson wrote:
> Use rcu_read_lock() for protecting read side operations in rxe_pool.c.
>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> drivers/infiniband/sw/rxe/rxe_pool.c | 26 ++++++++++----------------
> 1 file changed, 10 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 7a5685f0713e..103bf0c03441 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -195,16 +195,15 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
> {
> struct rxe_pool_elem *elem;
> struct xarray *xa = &pool->xa;
> - unsigned long flags;
> void *obj;
>
> - xa_lock_irqsave(xa, flags);
> + rcu_read_lock();
> elem = xa_load(xa, index);
> if (elem && kref_get_unless_zero(&elem->ref_cnt))
> obj = elem->obj;
> else
> obj = NULL;
> - xa_unlock_irqrestore(xa, flags);
> + rcu_read_unlock();
>
> return obj;
> }
> @@ -221,16 +220,15 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
> struct rxe_pool *pool = elem->pool;
> struct xarray *xa = &pool->xa;
> static int timeout = RXE_POOL_TIMEOUT;
> - unsigned long flags;
> int ret, err = 0;
> void *xa_ret;
>
> + WARN_ON(!in_task());
I think this should just be
if (sleepable)
might_sleep();
?
It all seems OK, any chance the AH bit can be split to its own patch
or is it all interconnected? It would be easier for rc this way.
Jason
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects
2022-06-06 14:18 ` [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
@ 2022-06-06 17:10 ` Jason Gunthorpe
2022-06-06 20:48 ` Pearson, Robert B
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2022-06-06 17:10 UTC (permalink / raw)
To: Bob Pearson; +Cc: zyjzyj2000, bvanassche, linux-rdma, frank.zago, jhack
On Mon, Jun 06, 2022 at 09:18:03AM -0500, Bob Pearson wrote:
> @@ -164,9 +171,16 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
> elem->pool = pool;
> elem->obj = (u8 *)elem - pool->elem_offset;
> kref_init(&elem->ref_cnt);
> + init_completion(&elem->complete);
>
> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
> - &pool->next, GFP_KERNEL);
> + /* AH objects are unique in that the create_ah verb
> + * can be called in atomic context. If the create_ah
> + * call is not sleepable use GFP_ATOMIC.
> + */
> + gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
I would add a 'if (sleepable) might_sleep()' here too
> + } else {
> + unsigned long until = jiffies + timeout;
> +
> + /* AH objects are unique in that the destroy_ah verb
> + * can be called in atomic context. This delay
> + * replaces the wait_for_completion call above
> + * when the destroy_ah call is not sleepable
> + */
> + while (!completion_done(&elem->complete) &&
> + time_before(jiffies, until))
> + mdelay(1);
Is it even possible that a transient AH reference can exist?
> +/**
> + * rxe_wqe_is_fenced - check if next wqe is fenced
> + * @qp: the queue pair
> + * @wqe: the next wqe
> + *
> + * Returns: 1 if wqe needs to wait
> + * 0 if wqe is ready to go
> + */
> +static int rxe_wqe_is_fenced(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
> +{
> + /* Local invalidate fence (LIF) see IBA 10.6.5.1
> + * Requires ALL previous operations on the send queue
> + * are complete. Make mandatory for the rxe driver.
> + */
> + if (wqe->wr.opcode == IB_WR_LOCAL_INV)
> + return qp->req.wqe_index != queue_get_consumer(qp->sq.queue,
> + QUEUE_TYPE_FROM_CLIENT);
> +
> + /* Fence see IBA 10.8.3.3
> + * Requires that all previous read and atomic operations
> + * are complete.
> + */
> + return (wqe->wr.send_flags & IB_SEND_FENCE) &&
> + atomic_read(&qp->req.rd_atomic) != qp->attr.max_rd_atomic;
> +}
Does this belong in this patch?
Jason
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects
2022-06-06 17:10 ` Jason Gunthorpe
@ 2022-06-06 20:48 ` Pearson, Robert B
0 siblings, 0 replies; 7+ messages in thread
From: Pearson, Robert B @ 2022-06-06 20:48 UTC (permalink / raw)
To: Jason Gunthorpe, Bob Pearson
Cc: zyjzyj2000@gmail.com, bvanassche@acm.org,
linux-rdma@vger.kernel.org, Zago, Frank,
Hack, Jenny (Ft. Collins)
I'm on windows for the week in Minnesota without access to my Linux system
It will be next weekend before I can get to this.
Bob
-----Original Message-----
From: Jason Gunthorpe <jgg@nvidia.com>
Sent: Monday, June 6, 2022 12:10 PM
To: Bob Pearson <rpearsonhpe@gmail.com>
Cc: zyjzyj2000@gmail.com; bvanassche@acm.org; linux-rdma@vger.kernel.org; Zago, Frank <frank.zago@hpe.com>; Hack, Jenny (Ft. Collins) <jhack@hpe.com>
Subject: Re: [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects
On Mon, Jun 06, 2022 at 09:18:03AM -0500, Bob Pearson wrote:
> @@ -164,9 +171,16 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
> elem->pool = pool;
> elem->obj = (u8 *)elem - pool->elem_offset;
> kref_init(&elem->ref_cnt);
> + init_completion(&elem->complete);
>
> - err = xa_alloc_cyclic(&pool->xa, &elem->index, elem, pool->limit,
> - &pool->next, GFP_KERNEL);
> + /* AH objects are unique in that the create_ah verb
> + * can be called in atomic context. If the create_ah
> + * call is not sleepable use GFP_ATOMIC.
> + */
> + gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
I would add a 'if (sleepable) might_sleep()' here too
Sure.
> + } else {
> + unsigned long until = jiffies + timeout;
> +
> + /* AH objects are unique in that the destroy_ah verb
> + * can be called in atomic context. This delay
> + * replaces the wait_for_completion call above
> + * when the destroy_ah call is not sleepable
> + */
> + while (!completion_done(&elem->complete) &&
> + time_before(jiffies, until))
> + mdelay(1);
Is it even possible that a transient AH reference can exist?
One of the tasklets takes a reference on the AH when building a packet.
It mostly seems to not hold it for long so the answer is likely no but I'm not 100% sure.
> +/**
> + * rxe_wqe_is_fenced - check if next wqe is fenced
> + * @qp: the queue pair
> + * @wqe: the next wqe
> + *
> + * Returns: 1 if wqe needs to wait
> + * 0 if wqe is ready to go
> + */
> +static int rxe_wqe_is_fenced(struct rxe_qp *qp, struct rxe_send_wqe
> +*wqe) {
> + /* Local invalidate fence (LIF) see IBA 10.6.5.1
> + * Requires ALL previous operations on the send queue
> + * are complete. Make mandatory for the rxe driver.
> + */
> + if (wqe->wr.opcode == IB_WR_LOCAL_INV)
> + return qp->req.wqe_index != queue_get_consumer(qp->sq.queue,
> + QUEUE_TYPE_FROM_CLIENT);
> +
> + /* Fence see IBA 10.8.3.3
> + * Requires that all previous read and atomic operations
> + * are complete.
> + */
> + return (wqe->wr.send_flags & IB_SEND_FENCE) &&
> + atomic_read(&qp->req.rd_atomic) != qp->attr.max_rd_atomic; }
Does this belong in this patch?
Nope. That fragment got lost and here it is.
Jason
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu
2022-06-06 17:07 ` Jason Gunthorpe
@ 2022-06-06 20:51 ` Pearson, Robert B
0 siblings, 0 replies; 7+ messages in thread
From: Pearson, Robert B @ 2022-06-06 20:51 UTC (permalink / raw)
To: Jason Gunthorpe, Bob Pearson
Cc: zyjzyj2000@gmail.com, bvanassche@acm.org,
linux-rdma@vger.kernel.org, Zago, Frank,
Hack, Jenny (Ft. Collins)
-----Original Message-----
From: Jason Gunthorpe <jgg@nvidia.com>
Sent: Monday, June 6, 2022 12:07 PM
To: Bob Pearson <rpearsonhpe@gmail.com>
Cc: zyjzyj2000@gmail.com; bvanassche@acm.org; linux-rdma@vger.kernel.org; Zago, Frank <frank.zago@hpe.com>; Hack, Jenny (Ft. Collins) <jhack@hpe.com>
Subject: Re: [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu
On Mon, Jun 06, 2022 at 09:18:04AM -0500, Bob Pearson wrote:
> Use rcu_read_lock() for protecting read side operations in rxe_pool.c.
>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> drivers/infiniband/sw/rxe/rxe_pool.c | 26 ++++++++++----------------
> 1 file changed, 10 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c
> b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 7a5685f0713e..103bf0c03441 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -195,16 +195,15 @@ void *rxe_pool_get_index(struct rxe_pool *pool,
> u32 index) {
> struct rxe_pool_elem *elem;
> struct xarray *xa = &pool->xa;
> - unsigned long flags;
> void *obj;
>
> - xa_lock_irqsave(xa, flags);
> + rcu_read_lock();
> elem = xa_load(xa, index);
> if (elem && kref_get_unless_zero(&elem->ref_cnt))
> obj = elem->obj;
> else
> obj = NULL;
> - xa_unlock_irqrestore(xa, flags);
> + rcu_read_unlock();
>
> return obj;
> }
> @@ -221,16 +220,15 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
> struct rxe_pool *pool = elem->pool;
> struct xarray *xa = &pool->xa;
> static int timeout = RXE_POOL_TIMEOUT;
> - unsigned long flags;
> int ret, err = 0;
> void *xa_ret;
>
> + WARN_ON(!in_task());
I think this should just be
if (sleepable)
might_sleep();
?
It all seems OK, any chance the AH bit can be split to its own patch or is it all interconnected? It would be easier for rc this way.
I can't not do anything about AH because it throws kernel splats. I'd rather not break the driver
even for one patch. If one says this won't work unless you apply all of them you may as well just have
one patch.
Bob
Jason
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-06-06 21:03 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-06-06 14:18 [PATCH for-next v15 0/2] Fix race conditions in rxe_pool Bob Pearson
2022-06-06 14:18 ` [PATCH for-next v15 1/2] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
2022-06-06 17:10 ` Jason Gunthorpe
2022-06-06 20:48 ` Pearson, Robert B
2022-06-06 14:18 ` [PATCH for-next v15 2/2] RDMA/rxe: Convert read side locking to rcu Bob Pearson
2022-06-06 17:07 ` Jason Gunthorpe
2022-06-06 20:51 ` Pearson, Robert B
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox