The Linux Kernel Mailing List
* [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support
@ 2026-05-12 20:14 Liibaan Egal
  2026-05-12 20:14 ` [RFC PATCH rdma-next 1/2] " Liibaan Egal
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Liibaan Egal @ 2026-05-12 20:14 UTC (permalink / raw)
  To: linux-rdma; +Cc: zyjzyj2000, jgg, leon, linux-kernel

This RFC adds local-access implicit On-Demand Paging memory regions to
RXE (Soft-RoCE).

RXE already supports explicit ODP MRs. The implicit registration form
(addr == 0, length == U64_MAX, IB_ACCESS_ON_DEMAND) is recognized but
not implemented: the implicit branch in rxe_odp_mr_init_user() returns
-EINVAL through a placeholder block, and no path creates child umems
for SGE accesses on an implicit MR.

This series wires the implicit registration case through
ib_umem_odp_alloc_implicit() and routes the local SGE walker through
per-chunk child umems. The chunk size is fixed at 2 MiB
(RXE_ODP_CHILD_SHIFT = 21) and children are allocated lazily on first
access via ib_umem_odp_alloc_child(), stored in a per-MR xarray.

Patches
-------

  1/2 RDMA/rxe: add local implicit ODP MR support

      Adds rxe_odp_mr_init_implicit() (rejects remote access bits with
      -EOPNOTSUPP, allocates the parent umem). Adds rxe_odp_get_child()
      and the per-chunk loop in rxe_odp_mr_copy() and in the prefetch
      path. Atomic, flush, and atomic-write paths reject implicit MRs
      up front because those helpers walk mr->umem->pfn_list directly,
      and that list is empty for an implicit parent. rxe_mr_cleanup()
      walks the child xarray and releases each child before the
      parent.

      This patch leaves IB_ODP_SUPPORT_IMPLICIT unadvertised, so
      rxe_odp_mr_init_user() still returns -EINVAL on the implicit
      form. No user-visible behavior change yet.

  2/2 RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access

      Flip the cap bit so userspace can probe support via
      ibv_query_device. Kept as its own patch so the policy question
      is separable from the implementation.

Question for reviewers
----------------------

Patch 2/2 advertises IB_ODP_SUPPORT_IMPLICIT for a local-access-only
operation matrix. Local SGE access on implicit MRs works; remote rkey
access, atomic, flush, and atomic-write on implicit MRs do not. Is
this an acceptable use of the capability bit, or should capability
exposure wait for a broader operation matrix? Splitting the cap flip
out is meant to keep that decision separable from the implementation.

Scope and limitations
---------------------

Out of scope in this series:

- Remote rkey access on implicit MRs. Rejected at registration time
  with -EOPNOTSUPP.
- Atomic, flush, atomic-write paths. These return -EOPNOTSUPP /
  RESPST_ERR_RKEY_VIOLATION on implicit MRs.
- Child reclaim. The xarray grows monotonically per MR; a child is
  not freed until MR destroy. Long-lived implicit MRs that touch a
  sparse address space accumulate children. A reclaim mechanism is
  the natural follow-up.

Tested
------

Verified on rdma/for-next at commit 7fd2df204f34 (Linux 7.1-rc2),
arm64, Soft-RoCE over loopback:

- Registration accept/reject matrix (5 cases).
- Single-chunk 64 KiB RDMA WRITE through an implicit lkey.
- Two-chunk multi-range test: two 1 MiB WRITEs from buffers in
  different 2 MiB chunks of one implicit MR.
- Cross-chunk single-SGE test: one 128 KiB WRITE whose SGE spans a
  2 MiB chunk boundary.

Each patch builds cleanly standalone (M=drivers/infiniband/sw/rxe).

Registration latency was measured from 4 KiB to 1 GiB for both the
explicit and implicit forms. Explicit registration latency grows with
size and fails with ENOMEM at 1 GiB on a 6 GiB host. Implicit median
latency stays in the low microseconds across all sizes; peak RSS
during an implicit registration stays at the baseline, while explicit
RSS climbs with the registered size. The benchmark measures
registration-time work only; it does not characterize first-touch or
steady-state data path cost. Tests, benchmark code, and raw numbers
are in the companion repository:
https://github.com/Liibon/rxe-implicit-odp

scripts/checkpatch.pl --strict on each patch: 0 errors, 0 warnings,
0 checks.

---

Liibaan Egal (2):
  RDMA/rxe: add local implicit ODP MR support
  RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access

 drivers/infiniband/sw/rxe/rxe.c       |   7 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    |  19 +++
 drivers/infiniband/sw/rxe/rxe_odp.c   | 288 +++++++++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  18 +++
 4 files changed, 275 insertions(+), 57 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH rdma-next 1/2] RDMA/rxe: add local implicit ODP MR support
  2026-05-12 20:14 [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support Liibaan Egal
@ 2026-05-12 20:14 ` Liibaan Egal
  2026-05-12 20:40   ` Liibaan Egal
  2026-05-12 20:14 ` [RFC PATCH rdma-next 2/2] RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access Liibaan Egal
  2026-05-12 22:56 ` [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support yanjun.zhu
  2 siblings, 1 reply; 5+ messages in thread
From: Liibaan Egal @ 2026-05-12 20:14 UTC (permalink / raw)
  To: linux-rdma; +Cc: zyjzyj2000, jgg, leon, linux-kernel

RXE already supports explicit ODP MRs. The implicit registration form
(addr == 0, length == U64_MAX, IB_ACCESS_ON_DEMAND) is recognized but
not implemented: the implicit branch in rxe_odp_mr_init_user() returns
-EINVAL through a placeholder block, and no path creates child umems
for SGE accesses on an implicit MR.

Wire the implicit registration case through ib_umem_odp_alloc_implicit()
and route the local SGE walker through per-chunk child umems.

Registration. rxe_odp_mr_init_implicit() rejects remote access bits
(-EOPNOTSUPP), allocates the empty parent umem via
ib_umem_odp_alloc_implicit(), and initializes mr->implicit_children via
xa_init(). rxe_odp_init_pages() is skipped because there are no pages
to fault at registration time.

Chunking. Implicit MRs split the address space into fixed-size chunks
defined by RXE_ODP_CHILD_SHIFT (21, 2 MiB). Each chunk is backed by at
most one child ib_umem_odp allocated on demand. The chunk size keeps
the child count bounded while limiting the amount of VA covered by
each child; whether the size should be fixed, derived, or configurable
is an open design question.

SGE fault path. rxe_odp_umem_for_iova() returns the parent for
explicit MRs and rxe_odp_get_child() for implicit MRs. The child
lookup is xa_load -> ib_umem_odp_alloc_child -> xa_cmpxchg(GFP_KERNEL);
a racing insertion drops the loser. rxe_odp_chunk_len_at() reports how
many bytes of an access can be served by one umem; for explicit MRs
that is the full request, for implicit it is the bytes remaining in
the current chunk. rxe_odp_mr_copy() loops across chunks, resolving,
locking, copying, and unlocking each child independently. Explicit
MRs run the loop exactly once with identical behavior to the pre-patch
path.

Prefetch. rxe_odp_prefetch_one() uses the same chunk loop. Async
prefetch walks per chunk under short-held mutexes so a long range
does not stall concurrent invalidators.

Atomic, flush, and atomic-write paths reject implicit MRs at the top
of each helper. Those helpers walk mr->umem->pfn_list directly, which
is empty for an implicit parent; extending them is out of scope for
this series.

Lifetime. rxe_mr_cleanup walks mr->implicit_children with xa_for_each
and releases each child via ib_umem_odp_release() before releasing
the parent via ib_umem_release(), so each child's
mmu_interval_notifier tears down while the parent's per_mm is alive.
The xarray is xa_destroy()ed afterwards.

Per-transport ODP caps are unchanged: they describe RC/UD behavior on
explicit ODP MRs. Advertising IB_ODP_SUPPORT_IMPLICIT to userspace is
a separate patch, since whether the existing capability bit is the
right surface for a local-access-only operation matrix is an open
question for review.

Limitations. The xarray grows monotonically per MR: a child is not
reclaimed until MR destroy. Long-lived MRs that touch a sparse address
space accumulate children. A reclaim mechanism is the natural
follow-up.

Tested on Linux 7.1-rc2 (arm64, Soft-RoCE over loopback):
- five-case registration accept/reject matrix passes
- single-chunk 64 KiB RDMA WRITE through an implicit lkey completes
  with correct data
- two-chunk multi-range test (two 1 MiB WRITEs from buffers in
  different 2 MiB chunks of one implicit MR) completes with correct
  data
- cross-chunk single-SGE test (128 KiB WRITE spanning a 2 MiB
  boundary) completes with correct data

The companion benchmark measures registration latency and RSS only;
first-touch and steady-state data path costs are not characterized in
this series.

Signed-off-by: Liibaan Egal <liibaegal@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    |  19 ++
 drivers/infiniband/sw/rxe/rxe_odp.c   | 288 +++++++++++++++++++++-----
 drivers/infiniband/sw/rxe/rxe_verbs.h |  18 ++
 3 files changed, 269 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index c696ff8749..c429bf0e6f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -6,6 +6,8 @@
 
 #include <linux/libnvdimm.h>
 
+#include <rdma/ib_umem_odp.h>
+
 #include "rxe.h"
 #include "rxe_loc.h"
 
@@ -809,6 +811,23 @@ void rxe_mr_cleanup(struct rxe_pool_elem *elem)
 	struct rxe_mr *mr = container_of(elem, typeof(*mr), elem);
 
 	rxe_put(mr_pd(mr));
+
+	/* Implicit ODP MRs may have created child umems on demand for each
+	 * accessed 2 MiB chunk. Release them before the parent so each
+	 * child's mmu_interval_notifier tears down while the parent's
+	 * per_mm is still alive. Explicit MRs never initialize the child
+	 * xarray, so this whole block is skipped for them.
+	 */
+	if (mr->umem && mr->umem->is_odp &&
+	    to_ib_umem_odp(mr->umem)->is_implicit_odp) {
+		struct ib_umem_odp *child;
+		unsigned long key;
+
+		xa_for_each(&mr->implicit_children, key, child)
+			ib_umem_odp_release(child);
+		xa_destroy(&mr->implicit_children);
+	}
+
 	ib_umem_release(mr->umem);
 
 	if (mr->ibmr.type != IB_MR_TYPE_DMA)
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index ff904d5e54..b90cb8f64f 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -5,6 +5,7 @@
 
 #include <linux/hmm.h>
 #include <linux/libnvdimm.h>
+#include <linux/xarray.h>
 
 #include <rdma/ib_umem_odp.h>
 
@@ -41,9 +42,14 @@ const struct mmu_interval_notifier_ops rxe_mn_ops = {
 #define RXE_PAGEFAULT_DEFAULT 0
 #define RXE_PAGEFAULT_RDONLY BIT(0)
 #define RXE_PAGEFAULT_SNAPSHOT BIT(1)
-static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+
+/* Low-level fault helper. Operates directly on a umem_odp (parent for
+ * explicit MRs, child for implicit). On success the caller holds
+ * umem_odp->umem_mutex via ib_umem_odp_map_dma_and_lock.
+ */
+static int rxe_odp_do_pagefault_and_lock(struct ib_umem_odp *umem_odp,
+					 u64 user_va, int bcnt, u32 flags)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
 	u64 access_mask = 0;
 	int np;
@@ -51,11 +57,6 @@ static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcn
 	if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
 		access_mask |= HMM_PFN_WRITE;
 
-	/*
-	 * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
-	 * Callers must release the lock later to let invalidation handler
-	 * do its work again.
-	 */
 	np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
 					  access_mask, fault);
 	return np;
@@ -66,7 +67,8 @@ static int rxe_odp_init_pages(struct rxe_mr *mr)
 	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	int ret;
 
-	ret = rxe_odp_do_pagefault_and_lock(mr, mr->umem->address,
+	/* Explicit MR only: snapshot the page table at registration. */
+	ret = rxe_odp_do_pagefault_and_lock(umem_odp, mr->umem->address,
 					    mr->umem->length,
 					    RXE_PAGEFAULT_SNAPSHOT);
 
@@ -76,6 +78,50 @@ static int rxe_odp_init_pages(struct rxe_mr *mr)
 	return ret >= 0 ? 0 : ret;
 }
 
+/* Remote access on an implicit MR is intentionally out of scope. A
+ * remote rkey on a full-VA-shaped MR would let a peer drive faults
+ * against arbitrary process memory, and that surface needs separate
+ * thinking. Reject up front.
+ */
+#define RXE_REMOTE_ACCESS_MASK (IB_ACCESS_REMOTE_READ |	\
+				IB_ACCESS_REMOTE_WRITE |	\
+				IB_ACCESS_REMOTE_ATOMIC)
+
+static int rxe_odp_mr_init_implicit(struct rxe_dev *rxe, int access_flags,
+				    struct rxe_mr *mr)
+{
+	struct ib_umem_odp *umem_odp;
+
+	if (access_flags & RXE_REMOTE_ACCESS_MASK)
+		return -EOPNOTSUPP;
+
+	umem_odp = ib_umem_odp_alloc_implicit(&rxe->ib_dev, access_flags);
+	if (IS_ERR(umem_odp)) {
+		rxe_dbg_mr(mr, "implicit umem alloc failed err=%d\n",
+			   (int)PTR_ERR(umem_odp));
+		return PTR_ERR(umem_odp);
+	}
+
+	umem_odp->private = mr;
+	mr->umem = &umem_odp->umem;
+	mr->access = access_flags;
+	mr->ibmr.length = U64_MAX;
+	mr->ibmr.iova = 0;
+
+	/* Initialize the per-MR child xarray. Only implicit MRs use it:
+	 * explicit MRs never call xa_init(), and rxe_mr_cleanup() walks
+	 * and destroys it only for implicit MRs. Children are inserted
+	 * on the lookup path via xa_cmpxchg() under GFP_KERNEL and are
+	 * released before the parent umem at MR destroy.
+	 */
+	xa_init(&mr->implicit_children);
+
+	mr->state = RXE_MR_STATE_VALID;
+	mr->ibmr.type = IB_MR_TYPE_USER;
+
+	return 0;
+}
+
 int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 			 u64 iova, int access_flags, struct rxe_mr *mr)
 {
@@ -93,7 +139,7 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 		if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
 			return -EINVAL;
 
-		/* Never reach here, for implicit ODP is not implemented. */
+		return rxe_odp_mr_init_implicit(rxe, access_flags, mr);
 	}
 
 	umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
@@ -123,6 +169,73 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 	return err;
 }
 
+/* Look up or create the child umem covering the chunk that contains iova.
+ * Each chunk is RXE_ODP_CHILD_SIZE aligned. A cmpxchg insertion avoids
+ * leaking a child if a concurrent fault wins the race.
+ */
+static struct ib_umem_odp *rxe_odp_get_child(struct rxe_mr *mr, u64 iova)
+{
+	struct ib_umem_odp *parent = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *child, *existing;
+	unsigned long aligned_start = iova & ~RXE_ODP_CHILD_MASK;
+	unsigned long key = aligned_start >> RXE_ODP_CHILD_SHIFT;
+
+	child = xa_load(&mr->implicit_children, key);
+	if (child)
+		return child;
+
+	child = ib_umem_odp_alloc_child(parent, aligned_start,
+					RXE_ODP_CHILD_SIZE, &rxe_mn_ops);
+	if (IS_ERR(child))
+		return child;
+	child->private = mr;
+
+	existing = xa_cmpxchg(&mr->implicit_children, key, NULL, child,
+			      GFP_KERNEL);
+	if (xa_is_err(existing)) {
+		ib_umem_odp_release(child);
+		return ERR_PTR(xa_err(existing));
+	}
+	if (existing) {
+		/* Another thread inserted while this allocation was in
+		 * flight. Drop the loser and use the winner.
+		 */
+		ib_umem_odp_release(child);
+		return existing;
+	}
+	return child;
+}
+
+/* Pick the umem_odp to use for an operation on mr at iova. For explicit
+ * MRs that is mr->umem. For implicit MRs it is the chunk's child. The
+ * caller is responsible for clamping the access length to one chunk via
+ * rxe_odp_chunk_len_at(); each call here returns one child.
+ */
+static struct ib_umem_odp *rxe_odp_umem_for_iova(struct rxe_mr *mr, u64 iova)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	if (!umem_odp->is_implicit_odp)
+		return umem_odp;
+	return rxe_odp_get_child(mr, iova);
+}
+
+/* How many bytes of an access starting at iova can be served by a single
+ * umem? For explicit MRs the answer is "the whole request" (bounded by
+ * mr length elsewhere). For implicit MRs it is the bytes remaining in
+ * the current chunk.
+ */
+static int rxe_odp_chunk_len_at(struct rxe_mr *mr, u64 iova, int length)
+{
+	u64 next_boundary;
+
+	if (!to_ib_umem_odp(mr->umem)->is_implicit_odp)
+		return length;
+
+	next_boundary = (iova & ~RXE_ODP_CHILD_MASK) + RXE_ODP_CHILD_SIZE;
+	return min_t(u64, (u64)length, next_boundary - iova);
+}
+
 static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp, u64 iova,
 				       int length)
 {
@@ -132,7 +245,6 @@ static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp, u64 iova,
 
 	addr = iova & (~(BIT(umem_odp->page_shift) - 1));
 
-	/* Skim through all pages that are to be accessed. */
 	while (addr < iova + length) {
 		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 
@@ -156,23 +268,32 @@ static unsigned long rxe_odp_iova_to_page_offset(struct ib_umem_odp *umem_odp, u
 	return iova & (BIT(umem_odp->page_shift) - 1);
 }
 
-static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u32 flags)
+/* Resolve, lock, and fault one chunk worth of access. On success the
+ * caller holds umem_odp->umem_mutex and gets the chosen umem_odp via
+ * *out_umem_odp. length must already be clamped via rxe_odp_chunk_len_at.
+ */
+static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length,
+				      u32 flags,
+				      struct ib_umem_odp **out_umem_odp)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	bool need_fault;
 	int err;
 
 	if (unlikely(length < 1))
 		return -EINVAL;
 
+	umem_odp = rxe_odp_umem_for_iova(mr, iova);
+	if (IS_ERR(umem_odp))
+		return PTR_ERR(umem_odp);
+
 	mutex_lock(&umem_odp->umem_mutex);
 
 	need_fault = rxe_check_pagefault(umem_odp, iova, length);
 	if (need_fault) {
 		mutex_unlock(&umem_odp->umem_mutex);
 
-		/* umem_mutex is locked on success. */
-		err = rxe_odp_do_pagefault_and_lock(mr, iova, length,
+		err = rxe_odp_do_pagefault_and_lock(umem_odp, iova, length,
 						    flags);
 		if (err < 0)
 			return err;
@@ -184,13 +305,14 @@ static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u
 		}
 	}
 
+	*out_umem_odp = umem_odp;
 	return 0;
 }
 
-static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
-			     int length, enum rxe_mr_copy_dir dir)
+static int __rxe_odp_mr_copy_one(struct ib_umem_odp *umem_odp, u64 iova,
+				 void *addr, int length,
+				 enum rxe_mr_copy_dir dir)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	struct page *page;
 	int idx, bytes;
 	size_t offset;
@@ -226,8 +348,10 @@ static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		    enum rxe_mr_copy_dir dir)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	u32 flags = RXE_PAGEFAULT_DEFAULT;
+	u64 cur_iova = iova;
+	u8 *cur_addr = addr;
+	int remaining = length;
 	int err;
 
 	if (length == 0)
@@ -248,15 +372,43 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		return -EINVAL;
 	}
 
-	err = rxe_odp_map_range_and_lock(mr, iova, length, flags);
-	if (err)
-		return err;
+	/* Walk one chunk at a time. For explicit MRs the chunk-length helper
+	 * returns the full remaining length, so this loop runs exactly once
+	 * and is identical to the pre-implicit behavior.
+	 */
+	while (remaining > 0) {
+		struct ib_umem_odp *umem_odp;
+		int this_len = rxe_odp_chunk_len_at(mr, cur_iova, remaining);
 
-	err =  __rxe_odp_mr_copy(mr, iova, addr, length, dir);
+		err = rxe_odp_map_range_and_lock(mr, cur_iova, this_len, flags,
+						 &umem_odp);
+		if (err)
+			return err;
 
-	mutex_unlock(&umem_odp->umem_mutex);
+		err = __rxe_odp_mr_copy_one(umem_odp, cur_iova, cur_addr,
+					    this_len, dir);
+		mutex_unlock(&umem_odp->umem_mutex);
+		if (err)
+			return err;
 
-	return err;
+		cur_iova += this_len;
+		cur_addr += this_len;
+		remaining -= this_len;
+	}
+
+	return 0;
+}
+
+/* Atomic, flush, and atomic-write paths assume mr->umem itself holds the
+ * pfn_list. That is true for explicit MRs only. The implicit parent has
+ * no pages of its own. Reject those operations on implicit MRs rather
+ * than extend them: remote access on implicit is already out of scope,
+ * so the only way these helpers could be reached is via a local atomic
+ * or flush, which the test matrix does not exercise.
+ */
+static inline bool rxe_odp_mr_is_implicit(struct rxe_mr *mr)
+{
+	return to_ib_umem_odp(mr->umem)->is_implicit_odp;
 }
 
 static enum resp_states rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova,
@@ -313,11 +465,16 @@ static enum resp_states rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova,
 enum resp_states rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 				   u64 compare, u64 swap_add, u64 *orig_val)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	int err;
 
+	if (rxe_odp_mr_is_implicit(mr)) {
+		rxe_dbg_mr(mr, "atomic op not supported on implicit ODP MR\n");
+		return RESPST_ERR_RKEY_VIOLATION;
+	}
+
 	err = rxe_odp_map_range_and_lock(mr, iova, sizeof(char),
-					 RXE_PAGEFAULT_DEFAULT);
+					 RXE_PAGEFAULT_DEFAULT, &umem_odp);
 	if (err < 0)
 		return RESPST_ERR_RKEY_VIOLATION;
 
@@ -331,7 +488,7 @@ enum resp_states rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 			    unsigned int length)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	unsigned int page_offset;
 	unsigned long index;
 	struct page *page;
@@ -339,8 +496,11 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 	int err;
 	u8 *va;
 
+	if (rxe_odp_mr_is_implicit(mr))
+		return -EOPNOTSUPP;
+
 	err = rxe_odp_map_range_and_lock(mr, iova, length,
-					 RXE_PAGEFAULT_DEFAULT);
+					 RXE_PAGEFAULT_DEFAULT, &umem_odp);
 	if (err)
 		return err;
 
@@ -368,13 +528,16 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 
 enum resp_states rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	unsigned int page_offset;
 	unsigned long index;
 	struct page *page;
 	int err;
 	u64 *va;
 
+	if (rxe_odp_mr_is_implicit(mr))
+		return RESPST_ERR_RKEY_VIOLATION;
+
 	/* See IBA oA19-28 */
 	err = mr_check_range(mr, iova, sizeof(value));
 	if (unlikely(err)) {
@@ -383,7 +546,7 @@ enum resp_states rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 	}
 
 	err = rxe_odp_map_range_and_lock(mr, iova, sizeof(value),
-					 RXE_PAGEFAULT_DEFAULT);
+					 RXE_PAGEFAULT_DEFAULT, &umem_odp);
 	if (err)
 		return RESPST_ERR_RKEY_VIOLATION;
 
@@ -419,6 +582,38 @@ struct prefetch_mr_work {
 	} frags[];
 };
 
+/* Prefetch one SGE range. For implicit MRs the range may span multiple
+ * chunks; fault each chunk separately and drop the lock between them
+ * so concurrent invalidators are not blocked across the whole range.
+ */
+static int rxe_odp_prefetch_one(struct rxe_mr *mr, u64 io_virt, size_t length,
+				u32 pf_flags)
+{
+	u64 cur = io_virt;
+	size_t remaining = length;
+	int ret;
+
+	while (remaining > 0) {
+		struct ib_umem_odp *umem_odp;
+		int this_len = rxe_odp_chunk_len_at(mr, cur, remaining);
+
+		umem_odp = rxe_odp_umem_for_iova(mr, cur);
+		if (IS_ERR(umem_odp))
+			return PTR_ERR(umem_odp);
+
+		ret = rxe_odp_do_pagefault_and_lock(umem_odp, cur, this_len,
+						    pf_flags);
+		if (ret < 0)
+			return ret;
+
+		mutex_unlock(&umem_odp->umem_mutex);
+
+		cur += this_len;
+		remaining -= this_len;
+	}
+	return 0;
+}
+
 static void rxe_ib_prefetch_mr_work(struct work_struct *w)
 {
 	struct prefetch_mr_work *work =
@@ -426,28 +621,16 @@ static void rxe_ib_prefetch_mr_work(struct work_struct *w)
 	int ret;
 	u32 i;
 
-	/*
-	 * We rely on IB/core that work is executed
-	 * if we have num_sge != 0 only.
-	 */
 	WARN_ON(!work->num_sge);
 	for (i = 0; i < work->num_sge; ++i) {
-		struct ib_umem_odp *umem_odp;
-
-		ret = rxe_odp_do_pagefault_and_lock(work->frags[i].mr,
-						    work->frags[i].io_virt,
-						    work->frags[i].length,
-						    work->pf_flags);
-		if (ret < 0) {
+		ret = rxe_odp_prefetch_one(work->frags[i].mr,
+					   work->frags[i].io_virt,
+					   work->frags[i].length,
+					   work->pf_flags);
+		if (ret < 0)
 			rxe_dbg_mr(work->frags[i].mr,
 				   "failed to prefetch the mr\n");
-			goto deref;
-		}
-
-		umem_odp = to_ib_umem_odp(work->frags[i].mr->umem);
-		mutex_unlock(&umem_odp->umem_mutex);
 
-deref:
 		rxe_put(work->frags[i].mr);
 	}
 
@@ -465,7 +648,6 @@ static int rxe_ib_prefetch_sg_list(struct ib_pd *ibpd,
 
 	for (i = 0; i < num_sge; ++i) {
 		struct rxe_mr *mr;
-		struct ib_umem_odp *umem_odp;
 
 		mr = lookup_mr(pd, IB_ACCESS_LOCAL_WRITE,
 			       sg_list[i].lkey, RXE_LOOKUP_LOCAL);
@@ -483,17 +665,14 @@ static int rxe_ib_prefetch_sg_list(struct ib_pd *ibpd,
 			return -EPERM;
 		}
 
-		ret = rxe_odp_do_pagefault_and_lock(
-			mr, sg_list[i].addr, sg_list[i].length, pf_flags);
+		ret = rxe_odp_prefetch_one(mr, sg_list[i].addr,
+					   sg_list[i].length, pf_flags);
 		if (ret < 0) {
 			rxe_dbg_mr(mr, "failed to prefetch the mr\n");
 			rxe_put(mr);
 			return ret;
 		}
 
-		umem_odp = to_ib_umem_odp(mr->umem);
-		mutex_unlock(&umem_odp->umem_mutex);
-
 		rxe_put(mr);
 	}
 
@@ -517,7 +696,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
 	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT)
 		pf_flags |= RXE_PAGEFAULT_SNAPSHOT;
 
-	/* Synchronous call */
 	if (flags & IB_UVERBS_ADVISE_MR_FLAG_FLUSH)
 		return rxe_ib_prefetch_sg_list(ibpd, advice, pf_flags, sg_list,
 					       num_sge);
@@ -532,7 +710,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
 	work->num_sge = num_sge;
 
 	for (i = 0; i < num_sge; ++i) {
-		/* Takes a reference, which will be released in the queued work */
 		mr = lookup_mr(pd, IB_ACCESS_LOCAL_WRITE,
 			       sg_list[i].lkey, RXE_LOOKUP_LOCAL);
 		if (!mr) {
@@ -550,7 +727,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
 	return 0;
 
  err:
-	/* rollback reference counts for the invalid request */
 	while (i > 0) {
 		i--;
 		rxe_put(work->frags[i].mr);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index d92f80d16f..a783dee95d 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -341,12 +341,30 @@ struct rxe_mr_page {
 	unsigned int		offset; /* offset in system page */
 };
 
+/* For implicit ODP MRs the virtual address space is split into fixed-size
+ * chunks. Each chunk is backed by at most one child umem allocated on
+ * first access. The 2 MiB chunk size keeps the child count bounded while
+ * limiting the amount of VA covered by each child. Whether the chunk
+ * size should be fixed, derived from page_shift, or configurable is an
+ * open design question for review.
+ */
+#define RXE_ODP_CHILD_SHIFT 21
+#define RXE_ODP_CHILD_SIZE  (BIT(RXE_ODP_CHILD_SHIFT))
+#define RXE_ODP_CHILD_MASK  (RXE_ODP_CHILD_SIZE - 1)
+
 struct rxe_mr {
 	struct rxe_pool_elem	elem;
 	struct ib_mr		ibmr;
 
 	struct ib_umem		*umem;
 
+	/* For implicit ODP MRs only: xarray of child umems keyed by
+	 * (aligned_start >> RXE_ODP_CHILD_SHIFT). Each entry covers one
+	 * RXE_ODP_CHILD_SIZE-aligned chunk and is created lazily on first
+	 * access. Unused (xa_empty) for explicit MRs.
+	 */
+	struct xarray		implicit_children;
+
 	u32			lkey;
 	u32			rkey;
 	enum rxe_mr_state	state;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH rdma-next 2/2] RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access
  2026-05-12 20:14 [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support Liibaan Egal
  2026-05-12 20:14 ` [RFC PATCH rdma-next 1/2] " Liibaan Egal
@ 2026-05-12 20:14 ` Liibaan Egal
  2026-05-12 22:56 ` [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support yanjun.zhu
  2 siblings, 0 replies; 5+ messages in thread
From: Liibaan Egal @ 2026-05-12 20:14 UTC (permalink / raw)
  To: linux-rdma; +Cc: zyjzyj2000, jgg, leon, linux-kernel

Now that the implicit ODP registration and local SGE fault paths are
in place, advertise IB_ODP_SUPPORT_IMPLICIT in general_odp_caps so
userspace can probe the support via ibv_query_device.

The advertised support is intentionally scoped to local access:
remote rkey access on implicit MRs is rejected at registration time,
and the atomic, flush, and atomic-write paths reject implicit MRs at
the top of each helper.

Question for reviewers: is IB_ODP_SUPPORT_IMPLICIT the right
capability bit to advertise for this local-access-only operation
matrix, or should capability exposure wait for broader operation
coverage? The cap-flip is kept in its own patch so the policy
decision is separable from the implementation.

Signed-off-by: Liibaan Egal <liibaegal@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index b0714f9abe..581313591d 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -94,8 +94,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe, struct net_device *ndev)
 	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
 		rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
 
-		/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
 		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+		/* IMPLICIT is gated to the local-access subset. The fault path
+		 * in rxe_odp.c rejects remote-access implicit forms at
+		 * registration time. Per-transport caps below stay unchanged:
+		 * they describe explicit ODP MR semantics and remain accurate.
+		 */
+		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT_IMPLICIT;
 
 		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
 		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_RECV;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH rdma-next 1/2] RDMA/rxe: add local implicit ODP MR support
  2026-05-12 20:14 ` [RFC PATCH rdma-next 1/2] " Liibaan Egal
@ 2026-05-12 20:40   ` Liibaan Egal
  0 siblings, 0 replies; 5+ messages in thread
From: Liibaan Egal @ 2026-05-12 20:40 UTC (permalink / raw)
  To: linux-rdma; +Cc: zyjzyj2000, jgg, leon, linux-kernel

One clarification on the wording in the cover letter and patch 1/2
commit message: when I said the per-transport ODP caps describe
explicit ODP MR semantics, that was too strong. What I meant is that
this series leaves the existing per-transport caps unchanged and
implements only the local lkey implicit-MR access path, while
rejecting remote rkey, atomic, flush, and atomic-write uses of
implicit MRs.

The intended review question remains whether advertising
IB_ODP_SUPPORT_IMPLICIT is acceptable for that local-access-only
implicit operation matrix, or whether the cap should wait for broader
implicit coverage. I will reword this in a v2 if the series moves
forward.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support
  2026-05-12 20:14 [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support Liibaan Egal
  2026-05-12 20:14 ` [RFC PATCH rdma-next 1/2] " Liibaan Egal
  2026-05-12 20:14 ` [RFC PATCH rdma-next 2/2] RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access Liibaan Egal
@ 2026-05-12 22:56 ` yanjun.zhu
  2 siblings, 0 replies; 5+ messages in thread
From: yanjun.zhu @ 2026-05-12 22:56 UTC (permalink / raw)
  To: Liibaan Egal, linux-rdma, Zhu Yanjun; +Cc: zyjzyj2000, jgg, leon, linux-kernel

On 5/12/26 1:14 PM, Liibaan Egal wrote:
> This RFC adds local-access implicit On-Demand Paging memory regions to
> RXE (Soft-RoCE).
> 
> RXE already supports explicit ODP MRs. The implicit registration form
> (addr == 0, length == U64_MAX, IB_ACCESS_ON_DEMAND) is recognized but
> not implemented: the implicit branch in rxe_odp_mr_init_user() returns
> -EINVAL through a placeholder block, and no path creates child umems
> for SGE accesses on an implicit MR.
> 
> This series wires the implicit registration case through
> ib_umem_odp_alloc_implicit() and routes the local SGE walker through
> per-chunk child umems. The chunk size is fixed at 2 MiB
> (RXE_ODP_CHILD_SHIFT = 21) and children are allocated lazily on first
> access via ib_umem_odp_alloc_child(), stored in a per-MR xarray.
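
The fixed-shift chunking and the per-chunk walk described above reduce
to simple mask arithmetic. A minimal userspace sketch, assuming only
RXE_ODP_CHILD_SHIFT = 21 from this series (the helper names are
illustrative, not the patch's):

```c
#include <stdint.h>

/* From the cover letter: fixed 2 MiB chunks. */
#define RXE_ODP_CHILD_SHIFT	21
#define RXE_ODP_CHILD_SIZE	(UINT64_C(1) << RXE_ODP_CHILD_SHIFT)

/* Illustrative: xarray index of the child umem covering iova. */
static uint64_t chunk_index(uint64_t iova)
{
	return iova >> RXE_ODP_CHILD_SHIFT;
}

/* Illustrative per-chunk walk: split [iova, iova + len) at 2 MiB
 * boundaries, the way a cross-chunk SGE must be handled. Returns the
 * number of chunk-sized segments visited.
 */
static int walk_chunks(uint64_t iova, uint64_t len)
{
	int segments = 0;

	while (len) {
		uint64_t offset = iova & (RXE_ODP_CHILD_SIZE - 1);
		uint64_t bytes = RXE_ODP_CHILD_SIZE - offset;

		if (bytes > len)
			bytes = len;
		/* A real walker would look up, or lazily allocate, the
		 * child umem at chunk_index(iova) here before copying.
		 */
		iova += bytes;
		len -= bytes;
		segments++;
	}
	return segments;
}
```

A 128 KiB range that starts 64 KiB below a 2 MiB boundary is visited as
two segments, which is the shape of the cross-chunk test later in this
cover letter.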
> 
> Patches
> -------
> 
>    1/2 RDMA/rxe: add local implicit ODP MR support
> 
>        Adds rxe_odp_mr_init_implicit() (rejects remote access bits with
>        -EOPNOTSUPP, allocates the parent umem). Adds rxe_odp_get_child()
>        and the per-chunk loop in rxe_odp_mr_copy() and the prefetch
>        path. Atomic, flush, and atomic-write paths reject implicit MRs
>        at the top, because those helpers walk mr->umem->pfn_list
>        directly, which is empty for an implicit parent. rxe_mr_cleanup
>        walks the child xarray and releases each child before the
>        parent.
> 
>        This patch leaves IB_ODP_SUPPORT_IMPLICIT unadvertised, so
>        rxe_odp_mr_init_user() still returns -EINVAL on the implicit
>        form. No user-visible behavior change yet.
> 
>    2/2 RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access
> 
>        Flip the cap bit so userspace can probe support via
>        ibv_query_device. Kept as its own patch so the policy question
>        is separable from the implementation.
> 
> Question for reviewers
> ----------------------
> 
> Patch 2/2 advertises IB_ODP_SUPPORT_IMPLICIT for a local-access-only
> operation matrix. Local SGE access on implicit MRs works; remote rkey
> access, atomic, flush, and atomic-write on implicit MRs do not. Is
> this an acceptable use of the capability bit, or should capability
> exposure wait for a broader operation matrix? Splitting the cap flip
> out is meant to keep that decision separable from the implementation.
> 
> Scope and limitations
> ---------------------
> 
> Out of scope in this series:
> 
> - Remote rkey access on implicit MRs. Rejected at registration time
>    with -EOPNOTSUPP.
> - Atomic, flush, atomic-write paths. These return -EOPNOTSUPP /
>    RESPST_ERR_RKEY_VIOLATION on implicit MRs.
> - Child reclaim. The xarray grows monotonically per MR; a child is
>    not freed until MR destroy. Long-lived implicit MRs that touch a
>    sparse address space accumulate children. A reclaim mechanism is
>    the natural follow-up.
> 
> Tested
> ------
> 
> Verified on rdma/for-next at commit 7fd2df204f34 (Linux 7.1-rc2),
> arm64, Soft-RoCE over loopback:
> 
> - Registration accept/reject matrix (5 cases).
> - Single-chunk 64 KiB RDMA WRITE through an implicit lkey.
> - Two-chunk multi-range test: two 1 MiB WRITEs from buffers in
>    different 2 MiB chunks of one implicit MR.
> - Cross-chunk single-SGE test: one 128 KiB WRITE whose SGE spans a
>    2 MiB chunk boundary.
> 
> Each patch builds cleanly standalone (M=drivers/infiniband/sw/rxe).

IMO, please add a shell script like the following as a selftest, placed
in tools/testing/selftests/rdma/.

Or you can add more test cases to prove your feature.

"
#!/bin/bash
# Enable exit on error for better debugging
set -e

# 1. Cleanup old environment
echo "Cleaning up..."
ip netns delete ns0 2>/dev/null || true
ip link delete nk1 2>/dev/null || true

# 2. Setup Network Namespaces and Netkit interfaces
echo "Setting up network..."
ip netns add ns0

# Create netkit pair: nk1 (host) and nk0 (to be moved to ns0)
ip link add nk1 type netkit mode l2 peer name nk0

# Set host side up
ip link set nk1 up
ip addr add 10.0.0.2/24 dev nk1

# Move nk0 to namespace ns0
ip link set nk0 netns ns0
ip netns exec ns0 ip addr add 10.0.0.1/24 dev nk0
ip netns exec ns0 ip link set nk0 up
ip netns exec ns0 ip link set lo up

# Verify connectivity
echo "Verifying IP connectivity..."
ping -c 2 10.0.0.1 -I nk1

# 3. Setup Soft-RoCE (RXE) links
echo "Configuring RXE..."
# In namespace ns0
ip netns exec ns0 rdma link add rxe0 type rxe netdev nk0
# In host namespace
rdma link add rxe1 type rxe netdev nk1

# Wait for RDMA devices to initialize
sleep 1
rdma link

# 4. Run ibv_rc_pingpong with Implicit ODP (-O)
echo "Starting ibv_rc_pingpong with Implicit ODP..."

# Start Server in ns0
# -g 1: GID index (usually 1 for RoCE v2)
# -O: Use Implicit ODP
ip netns exec ns0 ibv_rc_pingpong -g 1 -O &
SERVER_PID=$!

# Give the server a moment to bind
sleep 2

# Start Client in host
# -O: Use Implicit ODP
ibv_rc_pingpong -g 1 -O 10.0.0.1

# 5. Collect Statistics
echo "--- Post-test Statistics ---"
echo "Host Stats:"
ip -s link show nk1
echo "Namespace ns0 Stats:"
ip netns exec ns0 ip -s link show nk0

# 6. Cleanup
echo "Cleaning up..."
kill $SERVER_PID 2>/dev/null || true
rdma link del rxe0 2>/dev/null || true
rdma link del rxe1 2>/dev/null || true
ip link del nk1
ip netns delete ns0

echo "Test Complete."
"

The output should be the following

"
# ./implicit_odp.sh
Cleaning up...
Setting up network...
Verifying IP connectivity...
PING 10.0.0.1 (10.0.0.1) from 10.0.0.2 nk1: 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.040 ms

--- 10.0.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1013ms
rtt min/avg/max/mdev = 0.040/0.055/0.071/0.015 ms
Configuring RXE...
link rxe0/1 state ACTIVE physical_state LINK_UP
link rxe1/1 state ACTIVE physical_state LINK_UP netdev nk1
Starting ibv_rc_pingpong with Implicit ODP...
   local address:  LID 0x0000, QPN 0x000011, PSN 0x51486a, GID ::ffff:10.0.0.1
   local address:  LID 0x0000, QPN 0x000012, PSN 0xc14439, GID ::ffff:10.0.0.1
   remote address: LID 0x0000, QPN 0x000011, PSN 0x51486a, GID ::ffff:10.0.0.1
   remote address: LID 0x0000, QPN 0x000012, PSN 0xc14439, GID ::ffff:10.0.0.1
8192000 bytes in 0.03 seconds = 2341.91 Mbit/sec
8192000 bytes in 0.03 seconds = 2354.70 Mbit/sec
1000 iters in 0.03 seconds = 27.83 usec/iter
1000 iters in 0.03 seconds = 27.98 usec/iter
--- Post-test Statistics ---
Host Stats:
8: nk1@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
     link/ether ba:48:69:41:c7:71 brd ff:ff:ff:ff:ff:ff link-netns ns0
     RX:  bytes packets errors dropped  missed   mcast
           1078      13      0       0       0       0
     TX:  bytes packets errors dropped carrier collsns
           4326      35      0       1       0       0
Namespace ns0 Stats:
7: nk0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
     link/ether 3a:46:ee:e9:12:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
     RX:  bytes packets errors dropped  missed   mcast
           4326      35      0       0       0       0
     TX:  bytes packets errors dropped carrier collsns
           1078      13      0       0       0       0
Cleaning up...
Test Complete.
"

If you think that rdma-core is a better place for the test, I am fine
with it.

Anyway, some test cases are needed to prove your feature.

Zhu Yanjun


> 
> Registration latency was measured for 4 KiB to 1 GiB across explicit
> and implicit forms. Explicit grows with size and fails ENOMEM at 1 GiB
> on a 6 GiB host. Implicit median latency stays in the low microseconds
> across all sizes; peak RSS during an implicit registration stays at
> the baseline, while explicit RSS climbs with the registered size. The
> benchmark measures registration-time work only; it does not
> characterize first-touch or steady-state data path cost. Tests, bench
> and raw numbers are in the companion repository:
> https://github.com/Liibon/rxe-implicit-odp
> 
> scripts/checkpatch.pl --strict on each patch: 0 errors, 0 warnings,
> 0 checks.
> 
> ---
> 
> Liibaan Egal (2):
>    RDMA/rxe: add local implicit ODP MR support
>    RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access
> 
>   drivers/infiniband/sw/rxe/rxe.c       |   7 +-
>   drivers/infiniband/sw/rxe/rxe_mr.c    |  19 +++
>   drivers/infiniband/sw/rxe/rxe_odp.c   | 288 +++++++++++++++++++++++++++-------
>   drivers/infiniband/sw/rxe/rxe_verbs.h |  18 +++
>   4 files changed, 275 insertions(+), 57 deletions(-)
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-05-12 22:56 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-12 20:14 [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support Liibaan Egal
2026-05-12 20:14 ` [RFC PATCH rdma-next 1/2] " Liibaan Egal
2026-05-12 20:40   ` Liibaan Egal
2026-05-12 20:14 ` [RFC PATCH rdma-next 2/2] RDMA/rxe: advertise IB_ODP_SUPPORT_IMPLICIT for local access Liibaan Egal
2026-05-12 22:56 ` [RFC PATCH rdma-next 0/2] RDMA/rxe: add local implicit ODP MR support yanjun.zhu
