From: Liibaan Egal <liibaegal@gmail.com>
To: linux-rdma@vger.kernel.org
Cc: zyjzyj2000@gmail.com, jgg@ziepe.ca, leon@kernel.org,
linux-kernel@vger.kernel.org
Subject: [RFC PATCH rdma-next 1/2] RDMA/rxe: add local implicit ODP MR support
Date: Tue, 12 May 2026 15:14:52 -0500
Message-ID: <20260512201453.21156-2-liibaegal@gmail.com>
In-Reply-To: <20260512201453.21156-1-liibaegal@gmail.com>
RXE already supports explicit ODP MRs. The implicit registration form
(addr == 0, length == U64_MAX, IB_ACCESS_ON_DEMAND) is recognized but
not implemented: the implicit branch in rxe_odp_mr_init_user() is a
placeholder that returns -EINVAL, and no path creates child umems for
SGE accesses on an implicit MR.
Wire the implicit registration case through ib_umem_odp_alloc_implicit()
and route the local SGE walker through per-chunk child umems.
Registration. rxe_odp_mr_init_implicit() rejects remote access bits
(-EOPNOTSUPP), allocates the empty parent umem via
ib_umem_odp_alloc_implicit(), and initializes mr->implicit_children via
xa_init(). rxe_odp_init_pages() is skipped because there are no pages
to fault at registration time.
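From userspace the accepted shape is the usual implicit-ODP
registration; a hedged rdma-core sketch (pd and variable names
hypothetical, not part of this patch):

  /* addr == NULL and length == SIZE_MAX select the implicit form; any
   * remote access flag makes this series return -EOPNOTSUPP.
   */
  struct ibv_mr *mr = ibv_reg_mr(pd, NULL, SIZE_MAX,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_ON_DEMAND);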
Chunking. Implicit MRs split the address space into fixed-size chunks
defined by RXE_ODP_CHILD_SHIFT (21, 2 MiB). Each chunk is backed by at
most one child ib_umem_odp allocated on demand. The 2 MiB size is a
trade-off between how many children an MR can accumulate and how much
VA each child covers; whether the size should be fixed, derived from
page_shift, or configurable is an open design question.
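To make the chunk mapping concrete (addresses hypothetical), the
lookup key used by rxe_odp_get_child() in the hunk below is derived
as:

  aligned_start = iova & ~RXE_ODP_CHILD_MASK;   /* 2 MiB chunk base */
  key = aligned_start >> RXE_ODP_CHILD_SHIFT;   /* xarray index     */

  /* e.g. iova 0x7f0000356000 -> aligned_start 0x7f0000200000,
   * key 0x3f80001; a later access at 0x7f00003f0000 maps to the
   * same key and reuses the same child umem.
   */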
SGE fault path. rxe_odp_umem_for_iova() returns the parent umem for
explicit MRs and the chunk's child, via rxe_odp_get_child(), for
implicit MRs. The child lookup is xa_load() ->
ib_umem_odp_alloc_child() -> xa_cmpxchg(GFP_KERNEL); if a racing fault
inserts first, the losing allocation is released and the existing
child is used. rxe_odp_chunk_len_at() reports how many bytes of an
access can be served by one umem: for explicit MRs that is the full
request, for implicit MRs it is the bytes remaining in the current
chunk. rxe_odp_mr_copy() loops across chunks, resolving,
locking, copying, and unlocking each child independently. Explicit
MRs run the loop exactly once with identical behavior to the pre-patch
path.
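For example (hypothetical addresses), a 128 KiB copy starting 64 KiB
below the 2 MiB boundary, the cross-chunk case in the test list below,
runs two iterations:

  rxe_odp_chunk_len_at(mr, 0x1f0000, 0x20000); /* -> 0x10000, chunk 0 */
  rxe_odp_chunk_len_at(mr, 0x200000, 0x10000); /* -> 0x10000, chunk 1 */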
Prefetch. rxe_odp_prefetch_one() uses the same chunk loop. Async
prefetch faults one chunk at a time, holding each child's umem_mutex
only for that chunk, so a long range does not stall concurrent
invalidators.
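The async path can be exercised from userspace through rdma-core's
ibv_advise_mr(); a hedged sketch (buffer, pd, and sizes hypothetical,
not part of this patch):

  struct ibv_sge sg = {
          .addr   = (uint64_t)(uintptr_t)buf, /* spans several 2 MiB chunks */
          .length = 8 * 1024 * 1024,
          .lkey   = mr->lkey,
  };

  /* flags == 0 takes the queued-work path; each chunk is faulted
   * under its own child's umem_mutex by rxe_odp_prefetch_one().
   */
  ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE, 0, &sg, 1);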
Atomic, flush, and atomic-write paths reject implicit MRs at the top
of each helper. These helpers walk mr->umem->pfn_list directly, which
is empty for an implicit parent; extending them is out of scope for
this series.
Lifetime. rxe_mr_cleanup() walks mr->implicit_children with
xa_for_each() and releases each child via ib_umem_odp_release() before
releasing the parent via ib_umem_release(), so each child's
mmu_interval_notifier tears down while the parent's per_mm is still
alive. The xarray is then destroyed with xa_destroy().
Per-transport ODP caps are unchanged: they describe RC/UD behavior on
explicit ODP MRs. Advertising IB_ODP_SUPPORT_IMPLICIT to userspace is
a separate patch, since whether the existing capability bit is the
right surface for a local-access-only operation matrix is an open
question for review.
Limitations. The xarray grows monotonically per MR: a child is not
reclaimed until the MR is destroyed, so long-lived MRs that touch a
sparse address space accumulate children. A reclaim mechanism is the
natural follow-up.
Tested on Linux 7.1-rc2 (arm64, Soft-RoCE over loopback):
- five-case registration accept/reject matrix passes
- single-chunk 64 KiB RDMA WRITE through an implicit lkey delivers
- two-chunk multi test (two 1 MiB WRITEs from buffers in different
2 MiB chunks of one implicit MR) delivers
- cross-chunk single-SGE test (128 KiB WRITE spanning a 2 MiB
boundary) delivers
The benchmark measures registration latency and RSS only; first-touch
and steady-state data-path costs are not characterized in this series.
Signed-off-by: Liibaan Egal <liibaegal@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_mr.c | 19 ++
drivers/infiniband/sw/rxe/rxe_odp.c | 288 +++++++++++++++++++++-----
drivers/infiniband/sw/rxe/rxe_verbs.h | 18 ++
3 files changed, 269 insertions(+), 56 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index c696ff8749..c429bf0e6f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -6,6 +6,8 @@
#include <linux/libnvdimm.h>
+#include <rdma/ib_umem_odp.h>
+
#include "rxe.h"
#include "rxe_loc.h"
@@ -809,6 +811,23 @@ void rxe_mr_cleanup(struct rxe_pool_elem *elem)
struct rxe_mr *mr = container_of(elem, typeof(*mr), elem);
rxe_put(mr_pd(mr));
+
+ /* Implicit ODP MRs may have created child umems on demand for each
+ * accessed 2 MiB chunk. Release them before the parent so each
+ * child's mmu_interval_notifier tears down while the parent's
+ * per_mm is still alive. The xarray is empty for explicit MRs, so
+ * walking it is a no-op there.
+ */
+ if (mr->umem && mr->umem->is_odp &&
+ to_ib_umem_odp(mr->umem)->is_implicit_odp) {
+ struct ib_umem_odp *child;
+ unsigned long key;
+
+ xa_for_each(&mr->implicit_children, key, child)
+ ib_umem_odp_release(child);
+ xa_destroy(&mr->implicit_children);
+ }
+
ib_umem_release(mr->umem);
if (mr->ibmr.type != IB_MR_TYPE_DMA)
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index ff904d5e54..b90cb8f64f 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -5,6 +5,7 @@
#include <linux/hmm.h>
#include <linux/libnvdimm.h>
+#include <linux/xarray.h>
#include <rdma/ib_umem_odp.h>
@@ -41,9 +42,14 @@ const struct mmu_interval_notifier_ops rxe_mn_ops = {
#define RXE_PAGEFAULT_DEFAULT 0
#define RXE_PAGEFAULT_RDONLY BIT(0)
#define RXE_PAGEFAULT_SNAPSHOT BIT(1)
-static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+
+/* Low-level fault helper. Operates directly on a umem_odp (parent for
+ * explicit MRs, child for implicit). On success the caller holds
+ * umem_odp->umem_mutex via ib_umem_odp_map_dma_and_lock.
+ */
+static int rxe_odp_do_pagefault_and_lock(struct ib_umem_odp *umem_odp,
+ u64 user_va, int bcnt, u32 flags)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
u64 access_mask = 0;
int np;
@@ -51,11 +57,6 @@ static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcn
if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
access_mask |= HMM_PFN_WRITE;
- /*
- * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
- * Callers must release the lock later to let invalidation handler
- * do its work again.
- */
np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
access_mask, fault);
return np;
@@ -66,7 +67,8 @@ static int rxe_odp_init_pages(struct rxe_mr *mr)
struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
int ret;
- ret = rxe_odp_do_pagefault_and_lock(mr, mr->umem->address,
+ /* Explicit MR only: snapshot the page table at registration. */
+ ret = rxe_odp_do_pagefault_and_lock(umem_odp, mr->umem->address,
mr->umem->length,
RXE_PAGEFAULT_SNAPSHOT);
@@ -76,6 +78,50 @@ static int rxe_odp_init_pages(struct rxe_mr *mr)
return ret >= 0 ? 0 : ret;
}
+/* Remote access on an implicit MR is intentionally out of scope. A
+ * remote rkey on a full-VA-shaped MR would let a peer drive faults
+ * against arbitrary process memory, and that surface needs separate
+ * thinking. Reject up front.
+ */
+#define RXE_REMOTE_ACCESS_MASK (IB_ACCESS_REMOTE_READ | \
+ IB_ACCESS_REMOTE_WRITE | \
+ IB_ACCESS_REMOTE_ATOMIC)
+
+static int rxe_odp_mr_init_implicit(struct rxe_dev *rxe, int access_flags,
+ struct rxe_mr *mr)
+{
+ struct ib_umem_odp *umem_odp;
+
+ if (access_flags & RXE_REMOTE_ACCESS_MASK)
+ return -EOPNOTSUPP;
+
+ umem_odp = ib_umem_odp_alloc_implicit(&rxe->ib_dev, access_flags);
+ if (IS_ERR(umem_odp)) {
+ rxe_dbg_mr(mr, "implicit umem alloc failed err=%d\n",
+ (int)PTR_ERR(umem_odp));
+ return PTR_ERR(umem_odp);
+ }
+
+ umem_odp->private = mr;
+ mr->umem = &umem_odp->umem;
+ mr->access = access_flags;
+ mr->ibmr.length = U64_MAX;
+ mr->ibmr.iova = 0;
+
+ /* Init the per-MR child xarray here so the cleanup path can
+ * unconditionally xa_destroy() regardless of MR mode. Explicit MRs
+ * never touch this xarray, so it stays empty for them. The xarray
+ * allocator is invoked under GFP_KERNEL on the cmpxchg insertion
+ * path below.
+ */
+ xa_init(&mr->implicit_children);
+
+ mr->state = RXE_MR_STATE_VALID;
+ mr->ibmr.type = IB_MR_TYPE_USER;
+
+ return 0;
+}
+
int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
u64 iova, int access_flags, struct rxe_mr *mr)
{
@@ -93,7 +139,7 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
return -EINVAL;
- /* Never reach here, for implicit ODP is not implemented. */
+ return rxe_odp_mr_init_implicit(rxe, access_flags, mr);
}
umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
@@ -123,6 +169,73 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
return err;
}
+/* Look up or create the child umem covering the chunk that contains iova.
+ * Each chunk is RXE_ODP_CHILD_SIZE aligned. A cmpxchg insertion avoids
+ * leaking a child if a concurrent fault wins the race.
+ */
+static struct ib_umem_odp *rxe_odp_get_child(struct rxe_mr *mr, u64 iova)
+{
+ struct ib_umem_odp *parent = to_ib_umem_odp(mr->umem);
+ struct ib_umem_odp *child, *existing;
+ unsigned long aligned_start = iova & ~RXE_ODP_CHILD_MASK;
+ unsigned long key = aligned_start >> RXE_ODP_CHILD_SHIFT;
+
+ child = xa_load(&mr->implicit_children, key);
+ if (child)
+ return child;
+
+ child = ib_umem_odp_alloc_child(parent, aligned_start,
+ RXE_ODP_CHILD_SIZE, &rxe_mn_ops);
+ if (IS_ERR(child))
+ return child;
+ child->private = mr;
+
+ existing = xa_cmpxchg(&mr->implicit_children, key, NULL, child,
+ GFP_KERNEL);
+ if (xa_is_err(existing)) {
+ ib_umem_odp_release(child);
+ return ERR_PTR(xa_err(existing));
+ }
+ if (existing) {
+ /* Another thread inserted while this allocation was in
+ * flight. Drop the loser and use the winner.
+ */
+ ib_umem_odp_release(child);
+ return existing;
+ }
+ return child;
+}
+
+/* Pick the umem_odp to use for an operation on mr at iova. For explicit
+ * MRs that is mr->umem. For implicit MRs it is the chunk's child. The
+ * caller is responsible for clamping the access length to one chunk via
+ * rxe_odp_chunk_len_at(); each call here returns one child.
+ */
+static struct ib_umem_odp *rxe_odp_umem_for_iova(struct rxe_mr *mr, u64 iova)
+{
+ struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+ if (!umem_odp->is_implicit_odp)
+ return umem_odp;
+ return rxe_odp_get_child(mr, iova);
+}
+
+/* How many bytes of an access starting at iova can be served by a single
+ * umem? For explicit MRs the answer is "the whole request" (bounded by
+ * mr length elsewhere). For implicit MRs it is the bytes remaining in
+ * the current chunk.
+ */
+static int rxe_odp_chunk_len_at(struct rxe_mr *mr, u64 iova, int length)
+{
+ u64 next_boundary;
+
+ if (!to_ib_umem_odp(mr->umem)->is_implicit_odp)
+ return length;
+
+ next_boundary = (iova & ~RXE_ODP_CHILD_MASK) + RXE_ODP_CHILD_SIZE;
+ return min_t(u64, (u64)length, next_boundary - iova);
+}
+
static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp, u64 iova,
int length)
{
@@ -132,7 +245,6 @@ static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp, u64 iova,
addr = iova & (~(BIT(umem_odp->page_shift) - 1));
- /* Skim through all pages that are to be accessed. */
while (addr < iova + length) {
idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
@@ -156,23 +268,32 @@ static unsigned long rxe_odp_iova_to_page_offset(struct ib_umem_odp *umem_odp, u
return iova & (BIT(umem_odp->page_shift) - 1);
}
-static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u32 flags)
+/* Resolve, lock, and fault one chunk worth of access. On success the
+ * caller holds umem_odp->umem_mutex and gets the chosen umem_odp via
+ * *out_umem_odp. length must already be clamped via rxe_odp_chunk_len_at.
+ */
+static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length,
+ u32 flags,
+ struct ib_umem_odp **out_umem_odp)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+ struct ib_umem_odp *umem_odp;
bool need_fault;
int err;
if (unlikely(length < 1))
return -EINVAL;
+ umem_odp = rxe_odp_umem_for_iova(mr, iova);
+ if (IS_ERR(umem_odp))
+ return PTR_ERR(umem_odp);
+
mutex_lock(&umem_odp->umem_mutex);
need_fault = rxe_check_pagefault(umem_odp, iova, length);
if (need_fault) {
mutex_unlock(&umem_odp->umem_mutex);
- /* umem_mutex is locked on success. */
- err = rxe_odp_do_pagefault_and_lock(mr, iova, length,
+ err = rxe_odp_do_pagefault_and_lock(umem_odp, iova, length,
flags);
if (err < 0)
return err;
@@ -184,13 +305,14 @@ static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u
}
}
+ *out_umem_odp = umem_odp;
return 0;
}
-static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
- int length, enum rxe_mr_copy_dir dir)
+static int __rxe_odp_mr_copy_one(struct ib_umem_odp *umem_odp, u64 iova,
+ void *addr, int length,
+ enum rxe_mr_copy_dir dir)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
struct page *page;
int idx, bytes;
size_t offset;
@@ -226,8 +348,10 @@ static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
enum rxe_mr_copy_dir dir)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
u32 flags = RXE_PAGEFAULT_DEFAULT;
+ u64 cur_iova = iova;
+ u8 *cur_addr = addr;
+ int remaining = length;
int err;
if (length == 0)
@@ -248,15 +372,43 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
return -EINVAL;
}
- err = rxe_odp_map_range_and_lock(mr, iova, length, flags);
- if (err)
- return err;
+ /* Walk one chunk at a time. For explicit MRs the chunk-length helper
+ * returns the full remaining length, so this loop runs exactly once
+ * and is identical to the pre-implicit behavior.
+ */
+ while (remaining > 0) {
+ struct ib_umem_odp *umem_odp;
+ int this_len = rxe_odp_chunk_len_at(mr, cur_iova, remaining);
- err = __rxe_odp_mr_copy(mr, iova, addr, length, dir);
+ err = rxe_odp_map_range_and_lock(mr, cur_iova, this_len, flags,
+ &umem_odp);
+ if (err)
+ return err;
- mutex_unlock(&umem_odp->umem_mutex);
+ err = __rxe_odp_mr_copy_one(umem_odp, cur_iova, cur_addr,
+ this_len, dir);
+ mutex_unlock(&umem_odp->umem_mutex);
+ if (err)
+ return err;
- return err;
+ cur_iova += this_len;
+ cur_addr += this_len;
+ remaining -= this_len;
+ }
+
+ return 0;
+}
+
+/* Atomic, flush, and atomic-write paths assume mr->umem itself holds the
+ * pfn_list. That is true for explicit MRs only. The implicit parent has
+ * no pages of its own. Reject those operations on implicit MRs rather
+ * than extend them: remote access on implicit is already out of scope,
+ * so the only way these helpers could be reached is via a local atomic
+ * or flush, which the test matrix does not exercise.
+ */
+static inline bool rxe_odp_mr_is_implicit(struct rxe_mr *mr)
+{
+ return to_ib_umem_odp(mr->umem)->is_implicit_odp;
}
static enum resp_states rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova,
@@ -313,11 +465,16 @@ static enum resp_states rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova,
enum resp_states rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
u64 compare, u64 swap_add, u64 *orig_val)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+ struct ib_umem_odp *umem_odp;
int err;
+ if (rxe_odp_mr_is_implicit(mr)) {
+ rxe_dbg_mr(mr, "atomic op not supported on implicit ODP MR\n");
+ return RESPST_ERR_RKEY_VIOLATION;
+ }
+
err = rxe_odp_map_range_and_lock(mr, iova, sizeof(char),
- RXE_PAGEFAULT_DEFAULT);
+ RXE_PAGEFAULT_DEFAULT, &umem_odp);
if (err < 0)
return RESPST_ERR_RKEY_VIOLATION;
@@ -331,7 +488,7 @@ enum resp_states rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
unsigned int length)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+ struct ib_umem_odp *umem_odp;
unsigned int page_offset;
unsigned long index;
struct page *page;
@@ -339,8 +496,11 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
int err;
u8 *va;
+ if (rxe_odp_mr_is_implicit(mr))
+ return -EOPNOTSUPP;
+
err = rxe_odp_map_range_and_lock(mr, iova, length,
- RXE_PAGEFAULT_DEFAULT);
+ RXE_PAGEFAULT_DEFAULT, &umem_odp);
if (err)
return err;
@@ -368,13 +528,16 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
enum resp_states rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
{
- struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+ struct ib_umem_odp *umem_odp;
unsigned int page_offset;
unsigned long index;
struct page *page;
int err;
u64 *va;
+ if (rxe_odp_mr_is_implicit(mr))
+ return RESPST_ERR_RKEY_VIOLATION;
+
/* See IBA oA19-28 */
err = mr_check_range(mr, iova, sizeof(value));
if (unlikely(err)) {
@@ -383,7 +546,7 @@ enum resp_states rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
}
err = rxe_odp_map_range_and_lock(mr, iova, sizeof(value),
- RXE_PAGEFAULT_DEFAULT);
+ RXE_PAGEFAULT_DEFAULT, &umem_odp);
if (err)
return RESPST_ERR_RKEY_VIOLATION;
@@ -419,6 +582,38 @@ struct prefetch_mr_work {
} frags[];
};
+/* Prefetch one SGE range. For implicit MRs the range may span multiple
+ * chunks; fault each chunk separately and drop the lock between them
+ * so concurrent invalidators are not blocked across the whole range.
+ */
+static int rxe_odp_prefetch_one(struct rxe_mr *mr, u64 io_virt, size_t length,
+ u32 pf_flags)
+{
+ u64 cur = io_virt;
+ size_t remaining = length;
+ int ret;
+
+ while (remaining > 0) {
+ struct ib_umem_odp *umem_odp;
+ int this_len = rxe_odp_chunk_len_at(mr, cur, remaining);
+
+ umem_odp = rxe_odp_umem_for_iova(mr, cur);
+ if (IS_ERR(umem_odp))
+ return PTR_ERR(umem_odp);
+
+ ret = rxe_odp_do_pagefault_and_lock(umem_odp, cur, this_len,
+ pf_flags);
+ if (ret < 0)
+ return ret;
+
+ mutex_unlock(&umem_odp->umem_mutex);
+
+ cur += this_len;
+ remaining -= this_len;
+ }
+ return 0;
+}
+
static void rxe_ib_prefetch_mr_work(struct work_struct *w)
{
struct prefetch_mr_work *work =
@@ -426,28 +621,16 @@ static void rxe_ib_prefetch_mr_work(struct work_struct *w)
int ret;
u32 i;
- /*
- * We rely on IB/core that work is executed
- * if we have num_sge != 0 only.
- */
WARN_ON(!work->num_sge);
for (i = 0; i < work->num_sge; ++i) {
- struct ib_umem_odp *umem_odp;
-
- ret = rxe_odp_do_pagefault_and_lock(work->frags[i].mr,
- work->frags[i].io_virt,
- work->frags[i].length,
- work->pf_flags);
- if (ret < 0) {
+ ret = rxe_odp_prefetch_one(work->frags[i].mr,
+ work->frags[i].io_virt,
+ work->frags[i].length,
+ work->pf_flags);
+ if (ret < 0)
rxe_dbg_mr(work->frags[i].mr,
"failed to prefetch the mr\n");
- goto deref;
- }
-
- umem_odp = to_ib_umem_odp(work->frags[i].mr->umem);
- mutex_unlock(&umem_odp->umem_mutex);
-deref:
rxe_put(work->frags[i].mr);
}
@@ -465,7 +648,6 @@ static int rxe_ib_prefetch_sg_list(struct ib_pd *ibpd,
for (i = 0; i < num_sge; ++i) {
struct rxe_mr *mr;
- struct ib_umem_odp *umem_odp;
mr = lookup_mr(pd, IB_ACCESS_LOCAL_WRITE,
sg_list[i].lkey, RXE_LOOKUP_LOCAL);
@@ -483,17 +665,14 @@ static int rxe_ib_prefetch_sg_list(struct ib_pd *ibpd,
return -EPERM;
}
- ret = rxe_odp_do_pagefault_and_lock(
- mr, sg_list[i].addr, sg_list[i].length, pf_flags);
+ ret = rxe_odp_prefetch_one(mr, sg_list[i].addr,
+ sg_list[i].length, pf_flags);
if (ret < 0) {
rxe_dbg_mr(mr, "failed to prefetch the mr\n");
rxe_put(mr);
return ret;
}
- umem_odp = to_ib_umem_odp(mr->umem);
- mutex_unlock(&umem_odp->umem_mutex);
-
rxe_put(mr);
}
@@ -517,7 +696,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT)
pf_flags |= RXE_PAGEFAULT_SNAPSHOT;
- /* Synchronous call */
if (flags & IB_UVERBS_ADVISE_MR_FLAG_FLUSH)
return rxe_ib_prefetch_sg_list(ibpd, advice, pf_flags, sg_list,
num_sge);
@@ -532,7 +710,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
work->num_sge = num_sge;
for (i = 0; i < num_sge; ++i) {
- /* Takes a reference, which will be released in the queued work */
mr = lookup_mr(pd, IB_ACCESS_LOCAL_WRITE,
sg_list[i].lkey, RXE_LOOKUP_LOCAL);
if (!mr) {
@@ -550,7 +727,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
return 0;
err:
- /* rollback reference counts for the invalid request */
while (i > 0) {
i--;
rxe_put(work->frags[i].mr);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index d92f80d16f..a783dee95d 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -341,12 +341,30 @@ struct rxe_mr_page {
unsigned int offset; /* offset in system page */
};
+/* For implicit ODP MRs the virtual address space is split into fixed-size
+ * chunks. Each chunk is backed by at most one child umem allocated on
+ * first access. The 2 MiB chunk size keeps the child count bounded while
+ * limiting the amount of VA covered by each child. Whether the chunk
+ * size should be fixed, derived from page_shift, or configurable is an
+ * open design question for review.
+ */
+#define RXE_ODP_CHILD_SHIFT 21
+#define RXE_ODP_CHILD_SIZE (BIT(RXE_ODP_CHILD_SHIFT))
+#define RXE_ODP_CHILD_MASK (RXE_ODP_CHILD_SIZE - 1)
+
struct rxe_mr {
struct rxe_pool_elem elem;
struct ib_mr ibmr;
struct ib_umem *umem;
+ /* For implicit ODP MRs only: xarray of child umems keyed by
+ * (aligned_start >> RXE_ODP_CHILD_SHIFT). Each entry covers one
+ * RXE_ODP_CHILD_SIZE-aligned chunk and is created lazily on first
+ * access. Unused (xa_empty) for explicit MRs.
+ */
+ struct xarray implicit_children;
+
u32 lkey;
u32 rkey;
enum rxe_mr_state state;
--
2.43.0