From: Liibaan Egal <liibaegal@gmail.com>
To: linux-rdma@vger.kernel.org
Cc: zyjzyj2000@gmail.com, jgg@ziepe.ca, leon@kernel.org,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH rdma-next 1/2] RDMA/rxe: add local implicit ODP MR support
Date: Tue, 12 May 2026 15:14:52 -0500
Message-Id: <20260512201453.21156-2-liibaegal@gmail.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
In-Reply-To: <20260512201453.21156-1-liibaegal@gmail.com>
References: <20260512201453.21156-1-liibaegal@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

RXE already supports explicit ODP MRs. The implicit registration form
(addr == 0, length == U64_MAX, IB_ACCESS_ON_DEMAND) is recognized but not
implemented: the implicit branch in rxe_odp_mr_init_user() returns -EINVAL
through a placeholder block, and no path creates child umems for SGE
accesses on an implicit MR.

Wire the implicit registration case through ib_umem_odp_alloc_implicit()
and route the local SGE walker through per-chunk child umems.

Registration. rxe_odp_mr_init_implicit() rejects remote access bits
(-EOPNOTSUPP), allocates the empty parent umem via
ib_umem_odp_alloc_implicit(), and initializes mr->implicit_children via
xa_init(). rxe_odp_init_pages() is skipped because there are no pages to
fault at registration time.

Chunking. Implicit MRs split the address space into fixed-size chunks
defined by RXE_ODP_CHILD_SHIFT (21, i.e. 2 MiB). Each chunk is backed by
at most one child ib_umem_odp allocated on demand. The chunk size keeps
the child count bounded while limiting the amount of VA covered by each
child; whether the size should be fixed, derived, or configurable is an
open design question. (A standalone sketch of the chunk arithmetic follows
the path descriptions below.)

SGE fault path. rxe_odp_umem_for_iova() returns the parent for explicit
MRs and rxe_odp_get_child() for implicit MRs. The child lookup is
xa_load -> ib_umem_odp_alloc_child -> xa_cmpxchg(GFP_KERNEL); a racing
insertion releases the losing allocation and uses the winner.
rxe_odp_chunk_len_at() reports how many bytes of an access can be served
by one umem; for explicit MRs that is the full request, for implicit MRs
it is the bytes remaining in the current chunk. rxe_odp_mr_copy() loops
across chunks, resolving, locking, copying, and unlocking each child
independently. Explicit MRs run the loop exactly once, with behavior
identical to the pre-patch path.

Prefetch. rxe_odp_prefetch_one() uses the same chunk loop. Async prefetch
walks per chunk under short-held mutexes, so a long range does not stall
concurrent invalidators.
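
To make the chunk walk concrete, here is a userspace-only sketch of the
arithmetic (illustrative, not part of the patch). It mirrors the
RXE_ODP_CHILD_* macros added in rxe_verbs.h and the clamping done by
rxe_odp_chunk_len_at(); the starting address is an arbitrary example:

  /* chunk key and per-chunk length for a local SGE access */
  #include <stdint.h>
  #include <stdio.h>

  #define CHILD_SHIFT 21                       /* RXE_ODP_CHILD_SHIFT, 2 MiB */
  #define CHILD_SIZE  (1ULL << CHILD_SHIFT)
  #define CHILD_MASK  (CHILD_SIZE - 1)

  int main(void)
  {
          uint64_t iova = 0x7f32bf0000ULL;     /* 64 KiB below a 2 MiB boundary */
          uint64_t remaining = 128 * 1024;     /* access spans two chunks */

          while (remaining) {
                  uint64_t start = iova & ~CHILD_MASK;      /* chunk start */
                  uint64_t key = start >> CHILD_SHIFT;      /* xarray key */
                  uint64_t in_chunk = start + CHILD_SIZE - iova;
                  uint64_t len = remaining < in_chunk ? remaining : in_chunk;

                  printf("chunk key %llu: %llu bytes at 0x%llx\n",
                         (unsigned long long)key, (unsigned long long)len,
                         (unsigned long long)iova);
                  iova += len;
                  remaining -= len;
          }
          return 0;
  }

Each iteration corresponds to one rxe_odp_umem_for_iova() +
__rxe_odp_mr_copy_one() step in the loop below; this example prints two
chunks, matching the cross-chunk single-SGE test.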

Atomic, flush, and atomic-write paths reject implicit MRs at the top of
each helper. These walk mr->umem->pfn_list directly, which is empty for an
implicit parent; extending them is not in this series.

Lifetime. rxe_mr_cleanup() walks mr->implicit_children with xa_for_each()
and releases each child via ib_umem_odp_release() before releasing the
parent via ib_umem_release(), so each child's mmu_interval_notifier tears
down while the parent's per_mm is still alive. The xarray is destroyed
with xa_destroy() afterwards.

Per-transport ODP caps are unchanged: they describe RC/UD behavior on
explicit ODP MRs. Advertising IB_ODP_SUPPORT_IMPLICIT to userspace is left
to a separate patch, since whether the existing capability bit is the
right surface for a local-access-only operation matrix is an open question
for review.

Limitations. The xarray grows monotonically per MR: a child is not
reclaimed until the MR is destroyed. Long-lived MRs that touch a sparse
address space therefore accumulate children. A reclaim mechanism is the
natural follow-up.

Tested on Linux 7.1-rc2 (arm64, Soft-RoCE over loopback):
- five-case registration accept/reject matrix passes
- single-chunk 64 KiB RDMA WRITE through an implicit lkey delivers
- two-chunk test (two 1 MiB WRITEs from buffers in different 2 MiB chunks
  of one implicit MR) delivers
- cross-chunk single-SGE test (128 KiB WRITE spanning a 2 MiB boundary)
  delivers

Benchmarking covers registration latency and RSS only; first-touch and
steady-state data path costs are not characterized in this series.

Signed-off-by: Liibaan Egal <liibaegal@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    |  19 ++
 drivers/infiniband/sw/rxe/rxe_odp.c   | 288 +++++++++++++++++++++-----
 drivers/infiniband/sw/rxe/rxe_verbs.h |  18 ++
 3 files changed, 269 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index c696ff8749..c429bf0e6f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -6,6 +6,8 @@
 
 #include <linux/libnvdimm.h>
 
+#include <rdma/ib_umem_odp.h>
+
 #include "rxe.h"
 #include "rxe_loc.h"
 
@@ -809,6 +811,23 @@ void rxe_mr_cleanup(struct rxe_pool_elem *elem)
 	struct rxe_mr *mr = container_of(elem, typeof(*mr), elem);
 
 	rxe_put(mr_pd(mr));
+
+	/* Implicit ODP MRs may have created child umems on demand for each
+	 * accessed 2 MiB chunk. Release them before the parent so each
+	 * child's mmu_interval_notifier tears down while the parent's
+	 * per_mm is still alive. The xarray is empty for explicit MRs, so
+	 * walking it is a no-op there.
+	 */
+	if (mr->umem && mr->umem->is_odp &&
+	    to_ib_umem_odp(mr->umem)->is_implicit_odp) {
+		struct ib_umem_odp *child;
+		unsigned long key;
+
+		xa_for_each(&mr->implicit_children, key, child)
+			ib_umem_odp_release(child);
+		xa_destroy(&mr->implicit_children);
+	}
+
 	ib_umem_release(mr->umem);
 
 	if (mr->ibmr.type != IB_MR_TYPE_DMA)
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index ff904d5e54..b90cb8f64f 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -5,6 +5,7 @@
 
 #include <linux/hmm.h>
 #include <linux/libnvdimm.h>
+#include <linux/xarray.h>
 
 #include <rdma/ib_umem_odp.h>
 
@@ -41,9 +42,14 @@ const struct mmu_interval_notifier_ops rxe_mn_ops = {
 #define RXE_PAGEFAULT_DEFAULT 0
 #define RXE_PAGEFAULT_RDONLY BIT(0)
 #define RXE_PAGEFAULT_SNAPSHOT BIT(1)
-static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+
+/* Low-level fault helper. Operates directly on a umem_odp (parent for
+ * explicit MRs, child for implicit). On success the caller holds
+ * umem_odp->umem_mutex via ib_umem_odp_map_dma_and_lock.
+ */
+static int rxe_odp_do_pagefault_and_lock(struct ib_umem_odp *umem_odp,
+					 u64 user_va, int bcnt, u32 flags)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
 	u64 access_mask = 0;
 	int np;
@@ -51,11 +57,6 @@ static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcn
 	if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
 		access_mask |= HMM_PFN_WRITE;
 
-	/*
-	 * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
-	 * Callers must release the lock later to let invalidation handler
-	 * do its work again.
-	 */
 	np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
 					  access_mask, fault);
 	return np;
@@ -66,7 +67,8 @@ static int rxe_odp_init_pages(struct rxe_mr *mr)
 	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	int ret;
 
-	ret = rxe_odp_do_pagefault_and_lock(mr, mr->umem->address,
+	/* Explicit MR only: snapshot the page table at registration. */
+	ret = rxe_odp_do_pagefault_and_lock(umem_odp, mr->umem->address,
 					    mr->umem->length,
 					    RXE_PAGEFAULT_SNAPSHOT);
 
@@ -76,6 +78,50 @@ static int rxe_odp_init_pages(struct rxe_mr *mr)
 	return ret >= 0 ? 0 : ret;
 }
 
+/* Remote access on an implicit MR is intentionally out of scope. A
+ * remote rkey on a full-VA-shaped MR would let a peer drive faults
+ * against arbitrary process memory, and that surface needs separate
+ * thinking. Reject up front.
+ */
+#define RXE_REMOTE_ACCESS_MASK (IB_ACCESS_REMOTE_READ | \
+				IB_ACCESS_REMOTE_WRITE | \
+				IB_ACCESS_REMOTE_ATOMIC)
+
+static int rxe_odp_mr_init_implicit(struct rxe_dev *rxe, int access_flags,
+				    struct rxe_mr *mr)
+{
+	struct ib_umem_odp *umem_odp;
+
+	if (access_flags & RXE_REMOTE_ACCESS_MASK)
+		return -EOPNOTSUPP;
+
+	umem_odp = ib_umem_odp_alloc_implicit(&rxe->ib_dev, access_flags);
+	if (IS_ERR(umem_odp)) {
+		rxe_dbg_mr(mr, "implicit umem alloc failed err=%d\n",
+			   (int)PTR_ERR(umem_odp));
+		return PTR_ERR(umem_odp);
+	}
+
+	umem_odp->private = mr;
+	mr->umem = &umem_odp->umem;
+	mr->access = access_flags;
+	mr->ibmr.length = U64_MAX;
+	mr->ibmr.iova = 0;
+
+	/* Init the per-MR child xarray here so the cleanup path can
+	 * unconditionally xa_destroy() regardless of MR mode. Explicit MRs
+	 * never touch this xarray, so it stays empty for them. The xarray
+	 * allocator is invoked under GFP_KERNEL on the cmpxchg insertion
+	 * path below.
+	 */
+	xa_init(&mr->implicit_children);
+
+	mr->state = RXE_MR_STATE_VALID;
+	mr->ibmr.type = IB_MR_TYPE_USER;
+
+	return 0;
+}
+
 int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 			 u64 iova, int access_flags, struct rxe_mr *mr)
 {
@@ -93,7 +139,7 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 		if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
 			return -EINVAL;
 
-		/* Never reach here, for implicit ODP is not implemented. */
+		return rxe_odp_mr_init_implicit(rxe, access_flags, mr);
 	}
 
 	umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
@@ -123,6 +169,73 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 	return err;
 }
 
+/* Look up or create the child umem covering the chunk that contains iova.
+ * Each chunk is RXE_ODP_CHILD_SIZE aligned. A cmpxchg insertion avoids
+ * leaking a child if a concurrent fault wins the race.
+ */
+static struct ib_umem_odp *rxe_odp_get_child(struct rxe_mr *mr, u64 iova)
+{
+	struct ib_umem_odp *parent = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *child, *existing;
+	unsigned long aligned_start = iova & ~RXE_ODP_CHILD_MASK;
+	unsigned long key = aligned_start >> RXE_ODP_CHILD_SHIFT;
+
+	child = xa_load(&mr->implicit_children, key);
+	if (child)
+		return child;
+
+	child = ib_umem_odp_alloc_child(parent, aligned_start,
+					RXE_ODP_CHILD_SIZE, &rxe_mn_ops);
+	if (IS_ERR(child))
+		return child;
+	child->private = mr;
+
+	existing = xa_cmpxchg(&mr->implicit_children, key, NULL, child,
+			      GFP_KERNEL);
+	if (xa_is_err(existing)) {
+		ib_umem_odp_release(child);
+		return ERR_PTR(xa_err(existing));
+	}
+	if (existing) {
+		/* Another thread inserted while this allocation was in
+		 * flight. Drop the loser and use the winner.
+		 */
+		ib_umem_odp_release(child);
+		return existing;
+	}
+	return child;
+}
+
+/* Pick the umem_odp to use for an operation on mr at iova. For explicit
+ * MRs that is mr->umem. For implicit MRs it is the chunk's child. The
+ * caller is responsible for clamping the access length to one chunk via
+ * rxe_odp_chunk_len_at(); each call here returns one child.
+ */
+static struct ib_umem_odp *rxe_odp_umem_for_iova(struct rxe_mr *mr, u64 iova)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	if (!umem_odp->is_implicit_odp)
+		return umem_odp;
+	return rxe_odp_get_child(mr, iova);
+}
+
+/* How many bytes of an access starting at iova can be served by a single
+ * umem? For explicit MRs the answer is "the whole request" (bounded by
+ * mr length elsewhere). For implicit MRs it is the bytes remaining in
+ * the current chunk.
+ */
+static int rxe_odp_chunk_len_at(struct rxe_mr *mr, u64 iova, int length)
+{
+	u64 next_boundary;
+
+	if (!to_ib_umem_odp(mr->umem)->is_implicit_odp)
+		return length;
+
+	next_boundary = (iova & ~RXE_ODP_CHILD_MASK) + RXE_ODP_CHILD_SIZE;
+	return min_t(u64, (u64)length, next_boundary - iova);
+}
+
 static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp, u64 iova,
 				       int length)
 {
@@ -132,7 +245,6 @@ static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp, u64 iova,
 	addr = iova & (~(BIT(umem_odp->page_shift) - 1));
 
-	/* Skim through all pages that are to be accessed. */
 	while (addr < iova + length) {
 		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 
@@ -156,23 +268,32 @@ static unsigned long rxe_odp_iova_to_page_offset(struct ib_umem_odp *umem_odp, u
 	return iova & (BIT(umem_odp->page_shift) - 1);
 }
 
-static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u32 flags)
+/* Resolve, lock, and fault one chunk worth of access. On success the
+ * caller holds umem_odp->umem_mutex and gets the chosen umem_odp via
+ * *out_umem_odp. length must already be clamped via rxe_odp_chunk_len_at.
+ */
+static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length,
+				      u32 flags,
+				      struct ib_umem_odp **out_umem_odp)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	bool need_fault;
 	int err;
 
 	if (unlikely(length < 1))
 		return -EINVAL;
 
+	umem_odp = rxe_odp_umem_for_iova(mr, iova);
+	if (IS_ERR(umem_odp))
+		return PTR_ERR(umem_odp);
+
 	mutex_lock(&umem_odp->umem_mutex);
 
 	need_fault = rxe_check_pagefault(umem_odp, iova, length);
 	if (need_fault) {
 		mutex_unlock(&umem_odp->umem_mutex);
 
-		/* umem_mutex is locked on success. */
-		err = rxe_odp_do_pagefault_and_lock(mr, iova, length,
+		err = rxe_odp_do_pagefault_and_lock(umem_odp, iova, length,
 						    flags);
 		if (err < 0)
 			return err;
@@ -184,13 +305,14 @@ static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u
 		}
 	}
 
+	*out_umem_odp = umem_odp;
 	return 0;
 }
 
-static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
-			     int length, enum rxe_mr_copy_dir dir)
+static int __rxe_odp_mr_copy_one(struct ib_umem_odp *umem_odp, u64 iova,
+				 void *addr, int length,
+				 enum rxe_mr_copy_dir dir)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	struct page *page;
 	int idx, bytes;
 	size_t offset;
@@ -226,8 +348,10 @@ static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		    enum rxe_mr_copy_dir dir)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	u32 flags = RXE_PAGEFAULT_DEFAULT;
+	u64 cur_iova = iova;
+	u8 *cur_addr = addr;
+	int remaining = length;
 	int err;
 
 	if (length == 0)
@@ -248,15 +372,43 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		return -EINVAL;
 	}
 
-	err = rxe_odp_map_range_and_lock(mr, iova, length, flags);
-	if (err)
-		return err;
+	/* Walk one chunk at a time. For explicit MRs the chunk-length helper
+	 * returns the full remaining length, so this loop runs exactly once
+	 * and is identical to the pre-implicit behavior.
+	 */
+	while (remaining > 0) {
+		struct ib_umem_odp *umem_odp;
+		int this_len = rxe_odp_chunk_len_at(mr, cur_iova, remaining);
 
-	err = __rxe_odp_mr_copy(mr, iova, addr, length, dir);
+		err = rxe_odp_map_range_and_lock(mr, cur_iova, this_len, flags,
+						 &umem_odp);
+		if (err)
+			return err;
 
-	mutex_unlock(&umem_odp->umem_mutex);
+		err = __rxe_odp_mr_copy_one(umem_odp, cur_iova, cur_addr,
+					    this_len, dir);
+		mutex_unlock(&umem_odp->umem_mutex);
+		if (err)
+			return err;
 
-	return err;
+		cur_iova += this_len;
+		cur_addr += this_len;
+		remaining -= this_len;
+	}
+
+	return 0;
+}
+
+/* Atomic, flush, and atomic-write paths assume mr->umem itself holds the
+ * pfn_list. That is true for explicit MRs only. The implicit parent has
+ * no pages of its own. Reject those operations on implicit MRs rather
+ * than extend them: remote access on implicit is already out of scope,
+ * so the only way these helpers could be reached is via a local atomic
+ * or flush, which the test matrix does not exercise.
+ */
+static inline bool rxe_odp_mr_is_implicit(struct rxe_mr *mr)
+{
+	return to_ib_umem_odp(mr->umem)->is_implicit_odp;
 }
 
 static enum resp_states rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova,
@@ -313,11 +465,16 @@ static enum resp_states rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova,
 enum resp_states rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 				   u64 compare, u64 swap_add, u64 *orig_val)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	int err;
 
+	if (rxe_odp_mr_is_implicit(mr)) {
+		rxe_dbg_mr(mr, "atomic op not supported on implicit ODP MR\n");
+		return RESPST_ERR_RKEY_VIOLATION;
+	}
+
 	err = rxe_odp_map_range_and_lock(mr, iova, sizeof(char),
-					 RXE_PAGEFAULT_DEFAULT);
+					 RXE_PAGEFAULT_DEFAULT, &umem_odp);
 	if (err < 0)
 		return RESPST_ERR_RKEY_VIOLATION;
 
@@ -331,7 +488,7 @@ enum resp_states rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 			    unsigned int length)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	unsigned int page_offset;
 	unsigned long index;
 	struct page *page;
@@ -339,8 +496,11 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 	int err;
 	u8 *va;
 
+	if (rxe_odp_mr_is_implicit(mr))
+		return -EOPNOTSUPP;
+
 	err = rxe_odp_map_range_and_lock(mr, iova, length,
-					 RXE_PAGEFAULT_DEFAULT);
+					 RXE_PAGEFAULT_DEFAULT, &umem_odp);
 	if (err)
 		return err;
 
@@ -368,13 +528,16 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 
 enum resp_states rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 {
-	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	struct ib_umem_odp *umem_odp;
 	unsigned int page_offset;
 	unsigned long index;
 	struct page *page;
 	int err;
 	u64 *va;
 
+	if (rxe_odp_mr_is_implicit(mr))
+		return RESPST_ERR_RKEY_VIOLATION;
+
 	/* See IBA oA19-28 */
 	err = mr_check_range(mr, iova, sizeof(value));
 	if (unlikely(err)) {
@@ -383,7 +546,7 @@ enum resp_states rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 	}
 
 	err = rxe_odp_map_range_and_lock(mr, iova, sizeof(value),
-					 RXE_PAGEFAULT_DEFAULT);
+					 RXE_PAGEFAULT_DEFAULT, &umem_odp);
 	if (err)
 		return RESPST_ERR_RKEY_VIOLATION;
 
@@ -419,6 +582,38 @@ struct prefetch_mr_work {
 	} frags[];
 };
 
+/* Prefetch one SGE range. For implicit MRs the range may span multiple
+ * chunks; fault each chunk separately and drop the lock between them
+ * so concurrent invalidators are not blocked across the whole range.
+ */
+static int rxe_odp_prefetch_one(struct rxe_mr *mr, u64 io_virt, size_t length,
+				u32 pf_flags)
+{
+	u64 cur = io_virt;
+	size_t remaining = length;
+	int ret;
+
+	while (remaining > 0) {
+		struct ib_umem_odp *umem_odp;
+		int this_len = rxe_odp_chunk_len_at(mr, cur, remaining);
+
+		umem_odp = rxe_odp_umem_for_iova(mr, cur);
+		if (IS_ERR(umem_odp))
+			return PTR_ERR(umem_odp);
+
+		ret = rxe_odp_do_pagefault_and_lock(umem_odp, cur, this_len,
+						    pf_flags);
+		if (ret < 0)
+			return ret;
+
+		mutex_unlock(&umem_odp->umem_mutex);
+
+		cur += this_len;
+		remaining -= this_len;
+	}
+	return 0;
+}
+
 static void rxe_ib_prefetch_mr_work(struct work_struct *w)
 {
 	struct prefetch_mr_work *work =
@@ -426,28 +621,16 @@ static void rxe_ib_prefetch_mr_work(struct work_struct *w)
 	int ret;
 	u32 i;
 
-	/*
-	 * We rely on IB/core that work is executed
-	 * if we have num_sge != 0 only.
-	 */
 	WARN_ON(!work->num_sge);
 
 	for (i = 0; i < work->num_sge; ++i) {
-		struct ib_umem_odp *umem_odp;
-
-		ret = rxe_odp_do_pagefault_and_lock(work->frags[i].mr,
-						    work->frags[i].io_virt,
-						    work->frags[i].length,
-						    work->pf_flags);
-		if (ret < 0) {
+		ret = rxe_odp_prefetch_one(work->frags[i].mr,
+					   work->frags[i].io_virt,
+					   work->frags[i].length,
+					   work->pf_flags);
+		if (ret < 0)
 			rxe_dbg_mr(work->frags[i].mr,
 				   "failed to prefetch the mr\n");
-			goto deref;
-		}
-
-		umem_odp = to_ib_umem_odp(work->frags[i].mr->umem);
-		mutex_unlock(&umem_odp->umem_mutex);
-deref:
 		rxe_put(work->frags[i].mr);
 	}
 
@@ -465,7 +648,6 @@ static int rxe_ib_prefetch_sg_list(struct ib_pd *ibpd,
 	for (i = 0; i < num_sge; ++i) {
 		struct rxe_mr *mr;
-		struct ib_umem_odp *umem_odp;
 
 		mr = lookup_mr(pd, IB_ACCESS_LOCAL_WRITE, sg_list[i].lkey,
 			       RXE_LOOKUP_LOCAL);
@@ -483,17 +665,14 @@ static int rxe_ib_prefetch_sg_list(struct ib_pd *ibpd,
 			return -EPERM;
 		}
 
-		ret = rxe_odp_do_pagefault_and_lock(
-			mr, sg_list[i].addr, sg_list[i].length, pf_flags);
+		ret = rxe_odp_prefetch_one(mr, sg_list[i].addr,
+					   sg_list[i].length, pf_flags);
 		if (ret < 0) {
 			rxe_dbg_mr(mr, "failed to prefetch the mr\n");
 			rxe_put(mr);
 			return ret;
 		}
 
-		umem_odp = to_ib_umem_odp(mr->umem);
-		mutex_unlock(&umem_odp->umem_mutex);
-
 		rxe_put(mr);
 	}
 
@@ -517,7 +696,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
 	if (advice == IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT)
 		pf_flags |= RXE_PAGEFAULT_SNAPSHOT;
 
-	/* Synchronous call */
 	if (flags & IB_UVERBS_ADVISE_MR_FLAG_FLUSH)
 		return rxe_ib_prefetch_sg_list(ibpd, advice, pf_flags, sg_list,
 					       num_sge);
@@ -532,7 +710,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
 	work->num_sge = num_sge;
 
 	for (i = 0; i < num_sge; ++i) {
-		/* Takes a reference, which will be released in the queued work */
 		mr = lookup_mr(pd, IB_ACCESS_LOCAL_WRITE, sg_list[i].lkey,
 			       RXE_LOOKUP_LOCAL);
 		if (!mr) {
@@ -550,7 +727,6 @@ static int rxe_ib_advise_mr_prefetch(struct ib_pd *ibpd,
 	return 0;
 
 err:
-	/* rollback reference counts for the invalid request */
 	while (i > 0) {
 		i--;
 		rxe_put(work->frags[i].mr);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index d92f80d16f..a783dee95d 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -341,12 +341,30 @@ struct rxe_mr_page {
 	unsigned int		offset;	/* offset in system page */
 };
 
+/* For implicit ODP MRs the virtual address space is split into fixed-size
+ * chunks. Each chunk is backed by at most one child umem allocated on
+ * first access. The 2 MiB chunk size keeps the child count bounded while
+ * limiting the amount of VA covered by each child. Whether the chunk
+ * size should be fixed, derived from page_shift, or configurable is an
+ * open design question for review.
+ */
+#define RXE_ODP_CHILD_SHIFT	21
+#define RXE_ODP_CHILD_SIZE	(BIT(RXE_ODP_CHILD_SHIFT))
+#define RXE_ODP_CHILD_MASK	(RXE_ODP_CHILD_SIZE - 1)
+
 struct rxe_mr {
 	struct rxe_pool_elem	elem;
 	struct ib_mr		ibmr;
 	struct ib_umem		*umem;
 
+	/* For implicit ODP MRs only: xarray of child umems keyed by
+	 * (aligned_start >> RXE_ODP_CHILD_SHIFT). Each entry covers one
+	 * RXE_ODP_CHILD_SIZE-aligned chunk and is created lazily on first
+	 * access. Unused (xa_empty) for explicit MRs.
+	 */
+	struct xarray		implicit_children;
+
 	u32			lkey;
 	u32			rkey;
 	enum rxe_mr_state	state;
-- 
2.43.0