Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
* [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd
@ 2026-06-04  1:27 Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 01/10] IB/mlx5: Don't take the rereg_mr fallback without a new translation Jason Gunthorpe
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Sashiko pointed out an existing bug related to mr->pd: when IB_MR_REREG_PD
is used the mr->pd is changed while only holding the write side of the
MR's uobject lock.

Effectively, because IB_MR_REREG_PD is usually implemented by changing the
MR in-place, the mr->pd becomes unreadable outside an MR-specific system
call that holds the uobject lock. All the readers in this series could
race with an IB_MR_REREG_PD and potentially UAF the mr->pd.

 https://sashiko.dev/#/patchset/20260427-security-bug-fixes-v3-0-4621fa52de0e%40nvidia.com?part=4

This was presented as a simple 'oh look it can race with nldev' which is
correct. However, asking AI to fully audit mr->pd touches also revealed a
much more convoluted issue inside mlx5 ODP that is also using mr->pd from
the page fault work queue, advise mr work queue and advise mr system call
without any locking.

It turns out this mlx5 problem is entirely unnecessary since outside
implicit mr there are only three cases where the UMR actually flags the
PDN to be read by HW, umr_rereg_pas(), mlx5_ib_init_odp_mr() and
mlx5_ib_init_dmabuf_mr(). umr_rereg_pas() is particularly tricky because
it illegaly updates mr->pd inside the driver.  Reorganize the giant call
chain from mlx5r_umr_set_update_xlt_mkey_seg() upward so that pdn is
passed down from those three functions instead of unconditionally picked
out at the bottom.

nldev however is trickier to fix. To avoid disurbing the happy paths build
a synchronize barrier by removing the mr from the xarray and then putting
it right back. The kref completion acts as a positive signal that the
mr->pd is no longer being used.

Jason Gunthorpe (10):
  IB/mlx5: Don't take the rereg_mr fallback without a new translation
  RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs
  IB/mlx5: Properly support implicit ODP rereg_mr
  RDMA/nldev: Fix locking when accessing mr->pd
  IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift()
  IB/mlx5: Pull the pdn out of the depths of the umr machinery
  IB/mlx5: Don't mangle the mr->pd inside the rereg callback
  IB/mlx5: Push pdn above mlx5r_umr_update_xlt()
  IB/mlx5: Push pdn above pagfault_real_mr()
  IB/mlx5: Push pdn above pagefault_dmabuf_mr()

 drivers/infiniband/core/nldev.c      | 15 +++--
 drivers/infiniband/core/restrack.c   | 49 +++++++++++++++
 drivers/infiniband/core/restrack.h   |  1 +
 drivers/infiniband/core/uverbs_cmd.c | 10 ++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 ++--
 drivers/infiniband/hw/mlx5/mr.c      | 37 ++++++++---
 drivers/infiniband/hw/mlx5/odp.c     | 82 +++++++++++++++----------
 drivers/infiniband/hw/mlx5/umr.c     | 92 +++++++++++++---------------
 drivers/infiniband/hw/mlx5/umr.h     | 11 ++--
 include/rdma/ib_verbs.h              |  5 ++
 10 files changed, 203 insertions(+), 111 deletions(-)


base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 01/10] IB/mlx5: Don't take the rereg_mr fallback without a new translation
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 02/10] RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs Jason Gunthorpe
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Jumping to mlx5_ib_reg_user_mr() without IB_MR_REREG_TRANS set will use
garbage values for start, length, and iova. Recovering the original mr
parameters for ODP and DMABUF to properly recreate it is too hard in this
flow, so just fail it.

Fixes: ef3642c4f54d ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3b6da45061a552..35125bbb380c47 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1198,7 +1198,7 @@ struct ib_mr *mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 		}
 		/* DM or ODP MR's don't have a normal umem so we can't re-use it */
 		if (!mr->umem || is_odp_mr(mr) || is_dmabuf_mr(mr))
-			goto recreate;
+			return ERR_PTR(-EOPNOTSUPP);
 
 		/*
 		 * Only one active MR can refer to a umem at one time, revoke
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 02/10] RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 01/10] IB/mlx5: Don't take the rereg_mr fallback without a new translation Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 03/10] IB/mlx5: Properly support implicit ODP rereg_mr Jason Gunthorpe
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

DMABUF generally relies on the ODP EQ mechanism to safely implement the
move semantics. ODP requires a device-global one time startup of the ODP
machinery when the first MR is created, and this was missed on the DMABUF
path.

Call mlx5r_odp_create_eq() when creating a ODP'able DMABUF.

The core code prevents using IB_ACCESS_ON_DEMAND unless the driver
advertises IB_ODP_SUPPORT, so until now, mlx5r_odp_create_eq() cannot be
called unless the device has ODP support.

However, DMABUF has no such protection and a second bug was allowing
DMABUFs to be created on non-ODP capable HW. Add a guard at the start of
mlx5r_odp_create_eq(). This is necessary here anyhow as the
dev->odp_eq_mutex is not initialized without IB_ODP_SUPPORT.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c  | 4 ++++
 drivers/infiniband/hw/mlx5/odp.c | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 35125bbb380c47..c0ef50d04a2698 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -956,6 +956,10 @@ reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
 	atomic_add(ib_umem_num_pages(mr->umem), &dev->mdev->priv.reg_pages);
 	umem_dmabuf->private = mr;
 	if (!pinned_mode) {
+		err = mlx5r_odp_create_eq(dev, &dev->odp_pf_eq);
+		if (err)
+			goto err_dereg_mr;
+
 		err = mlx5r_store_odp_mkey(dev, &mr->mmkey);
 		if (err)
 			goto err_dereg_mr;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 1119ce163ea783..10c11b72f41247 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -1807,6 +1807,9 @@ int mlx5r_odp_create_eq(struct mlx5_ib_dev *dev, struct mlx5_ib_pf_eq *eq)
 	struct mlx5_eq_param param = {};
 	int err = 0;
 
+	if (!(dev->odp_caps.general_caps & IB_ODP_SUPPORT))
+		return -EOPNOTSUPP;
+
 	mutex_lock(&dev->odp_eq_mutex);
 	if (eq->core)
 		goto unlock;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 03/10] IB/mlx5: Properly support implicit ODP rereg_mr
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 01/10] IB/mlx5: Don't take the rereg_mr fallback without a new translation Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 02/10] RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 04/10] RDMA/nldev: Fix locking when accessing mr->pd Jason Gunthorpe
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Due to all the child mkeys in the implicit ODP configuration we cannot
change anything in place for the parent mkey. Instead the whole thing
needs to be rebuilt if any change is requested. If the user does not
specify a translation then force the implicit values which will then fall
through the logic into mlx5_ib_reg_user_mr() to allocate a completely new
MR.

Since implicit children were also touching the mr->pd, this removes
another case where the access was racy.

Fixes: ef3642c4f54d ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index c0ef50d04a2698..ab0ba34448eb89 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1188,6 +1188,21 @@ struct ib_mr *mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 	if (!(flags & IB_MR_REREG_PD))
 		new_pd = ib_mr->pd;
 
+	if (mr->is_odp_implicit && !(flags & IB_MR_REREG_TRANS)) {
+		if (!(new_access_flags & IB_ACCESS_ON_DEMAND))
+			return ERR_PTR(-EOPNOTSUPP);
+
+		/*
+		 * Due to all the child mkeys we cannot actually change an
+		 * implicit MR in place. If the user did not specify a new
+		 * translation then force the fixed implicit MR values.
+		 */
+		start = 0;
+		iova = 0;
+		length = U64_MAX;
+		flags |= IB_MR_REREG_TRANS;
+	}
+
 	if (!(flags & IB_MR_REREG_TRANS)) {
 		struct ib_umem *umem;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 04/10] RDMA/nldev: Fix locking when accessing mr->pd
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 03/10] IB/mlx5: Properly support implicit ODP rereg_mr Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 05/10] IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift() Jason Gunthorpe
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Sashiko points out that, due to rereg_mr, the PD is actually variable and
all the touches in nldev are racy.

Use mr->device instead of mr->pd->device.

Getting the PD restrack ID is more tricky. To avoid disturbing all the
happy paths, add an rdma_restrack_sync() operation which is sort of like
flush_workqueue() or synchronize_irq(): after it returns, all the old
nldev touches to the mr are gone and everything sees the new PD. This
makes it safe to reach into the PD pointer.

Link: https://sashiko.dev/#/patchset/20260427-security-bug-fixes-v3-0-4621fa52de0e%40nvidia.com?part=4
Fixes: da5c85078215 ("RDMA/nldev: add driver-specific resource tracking")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/nldev.c      | 15 +++++----
 drivers/infiniband/core/restrack.c   | 49 ++++++++++++++++++++++++++++
 drivers/infiniband/core/restrack.h   |  1 +
 drivers/infiniband/core/uverbs_cmd.c | 10 ++++--
 include/rdma/ib_verbs.h              |  5 +++
 5 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 5aaba2b9746ba6..02a0a9c0a4a6ad 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -695,7 +695,7 @@ static int fill_res_mr_entry(struct sk_buff *msg, bool has_cap_net_admin,
 			     struct rdma_restrack_entry *res, uint32_t port)
 {
 	struct ib_mr *mr = container_of(res, struct ib_mr, res);
-	struct ib_device *dev = mr->pd->device;
+	struct ib_device *dev = mr->device;
 
 	if (has_cap_net_admin) {
 		if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_RKEY, mr->rkey))
@@ -711,9 +711,12 @@ static int fill_res_mr_entry(struct sk_buff *msg, bool has_cap_net_admin,
 	if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_MRN, res->id))
 		return -EMSGSIZE;
 
-	if (!rdma_is_kernel_res(res) &&
-	    nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_PDN, mr->pd->res.id))
-		return -EMSGSIZE;
+	if (!rdma_is_kernel_res(res)) {
+		struct ib_pd *pd = READ_ONCE(mr->pd);
+
+		if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_PDN, pd->res.id))
+			return -EMSGSIZE;
+	}
 
 	if (fill_res_name_pid(msg, res))
 		return -EMSGSIZE;
@@ -727,7 +730,7 @@ static int fill_res_mr_raw_entry(struct sk_buff *msg, bool has_cap_net_admin,
 				 struct rdma_restrack_entry *res, uint32_t port)
 {
 	struct ib_mr *mr = container_of(res, struct ib_mr, res);
-	struct ib_device *dev = mr->pd->device;
+	struct ib_device *dev = mr->device;
 
 	if (!dev->ops.fill_res_mr_entry_raw)
 		return -EINVAL;
@@ -1017,7 +1020,7 @@ static int fill_stat_mr_entry(struct sk_buff *msg, bool has_cap_net_admin,
 			      struct rdma_restrack_entry *res, uint32_t port)
 {
 	struct ib_mr *mr = container_of(res, struct ib_mr, res);
-	struct ib_device *dev = mr->pd->device;
+	struct ib_device *dev = mr->device;
 
 	if (nla_put_u32(msg, RDMA_NLDEV_ATTR_RES_MRN, res->id))
 		goto err;
diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
index ac3688952cabbf..cfee2071586c16 100644
--- a/drivers/infiniband/core/restrack.c
+++ b/drivers/infiniband/core/restrack.c
@@ -71,6 +71,8 @@ int rdma_restrack_count(struct ib_device *dev, enum rdma_restrack_type type,
 
 	xa_lock(&rt->xa);
 	xas_for_each(&xas, e, U32_MAX) {
+		if (xa_is_zero(e))
+			continue;
 		if (xa_get_mark(&rt->xa, e->id, RESTRACK_DD) && !show_details)
 			continue;
 		cnt++;
@@ -276,6 +278,53 @@ int rdma_restrack_put(struct rdma_restrack_entry *res)
 }
 EXPORT_SYMBOL(rdma_restrack_put);
 
+/**
+ * rdma_restrack_sync() - Fence concurrent netlink dumps on an entry
+ * @res:  resource entry
+ *
+ * After this returns any concurrent netlink dump threads will see the current
+ * value of the object. This is useful if the object has to be changed and there
+ * is not locking to protect the nl side. Eg for mr->pd. This effectively
+ * destroys the object from a kref/xarray perspective and then immediately
+ * restores it. The kref is acting like a lock to barrier concurrent nl threads.
+ * Callers must ensure rdma_restrack_del() is not concurrently called.
+ */
+void rdma_restrack_sync(struct rdma_restrack_entry *res)
+{
+	struct rdma_restrack_entry *old;
+	struct rdma_restrack_root *rt;
+	struct task_struct *task;
+	struct ib_device *dev;
+
+	if (!res->valid || res->no_track)
+		return;
+
+	dev = res_to_dev(res);
+	if (WARN_ON(!dev))
+		return;
+
+	rt = &dev->res[res->type];
+	if (WARN_ON(xa_get_mark(&rt->xa, res->id, RESTRACK_DD)))
+		return;
+
+	old = xa_cmpxchg(&rt->xa, res->id, res, XA_ZERO_ENTRY, GFP_KERNEL);
+	if (WARN_ON(old != res))
+		return;
+
+	task = res->task;
+	if (task)
+		get_task_struct(task);
+	rdma_restrack_put(res);
+	wait_for_completion(&res->comp);
+	reinit_completion(&res->comp);
+	if (task)
+		res->task = task;
+	kref_init(&res->kref);
+
+	xa_cmpxchg(&rt->xa, res->id, XA_ZERO_ENTRY, res, GFP_KERNEL);
+}
+EXPORT_SYMBOL(rdma_restrack_sync);
+
 /**
  * rdma_restrack_del() - delete object from the resource tracking database
  * @res:  resource entry
diff --git a/drivers/infiniband/core/restrack.h b/drivers/infiniband/core/restrack.h
index 6a04fc41f73801..75b8d1005a984b 100644
--- a/drivers/infiniband/core/restrack.h
+++ b/drivers/infiniband/core/restrack.h
@@ -27,6 +27,7 @@ int rdma_restrack_init(struct ib_device *dev);
 void rdma_restrack_clean(struct ib_device *dev);
 void rdma_restrack_add(struct rdma_restrack_entry *res);
 void rdma_restrack_del(struct rdma_restrack_entry *res);
+void rdma_restrack_sync(struct rdma_restrack_entry *res);
 void rdma_restrack_new(struct rdma_restrack_entry *res,
 		       enum rdma_restrack_type type);
 void rdma_restrack_set_name(struct rdma_restrack_entry *res,
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 91a62d2ade4dd0..22793e4b1895e4 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -47,6 +47,7 @@
 
 #include "uverbs.h"
 #include "core_priv.h"
+#include "restrack.h"
 
 /*
  * Copy a response to userspace. If the provided 'resp' is larger than the
@@ -819,6 +820,10 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 			ret = PTR_ERR(new_pd);
 			goto put_uobjs;
 		}
+		if (new_pd == orig_pd) {
+			uobj_put_obj_read(new_pd);
+			cmd.flags &= ~IB_MR_REREG_PD;
+		}
 	} else {
 		new_pd = mr->pd;
 	}
@@ -866,9 +871,10 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 		mr = new_mr;
 	} else {
 		if (cmd.flags & IB_MR_REREG_PD) {
-			atomic_dec(&orig_pd->usecnt);
-			mr->pd = new_pd;
 			atomic_inc(&new_pd->usecnt);
+			WRITE_ONCE(mr->pd, new_pd);
+			rdma_restrack_sync(&mr->res);
+			atomic_dec(&orig_pd->usecnt);
 		}
 		if (cmd.flags & IB_MR_REREG_TRANS) {
 			mr->iova = cmd.hca_va;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9dd76f489a0ba4..46568a5221f403 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1977,6 +1977,11 @@ struct ib_dmah {
 
 struct ib_mr {
 	struct ib_device  *device;
+	/*
+	 * Due to IB_MR_REREG_PD pd is not a fixed pointer and can change. For a
+	 * user MR, this value should only be read from a system call that holds
+	 * the uobject lock, or the driver should disable in-place REREG_PD.
+	 */
 	struct ib_pd	  *pd;
 	u32		   lkey;
 	u32		   rkey;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 05/10] IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift()
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 04/10] RDMA/nldev: Fix locking when accessing mr->pd Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 06/10] IB/mlx5: Pull the pdn out of the depths of the umr machinery Jason Gunthorpe
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

The HW only processes mkc fields selected by mkey_mask.
pd, qpn and mkey_7_0 are never selected so they can be left as zero.

This removes a racy read of mr->pd.

Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/umr.c | 16 +++-------------
 drivers/infiniband/hw/mlx5/umr.h |  3 +--
 2 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/umr.c b/drivers/infiniband/hw/mlx5/umr.c
index f2139474be3751..062a47d0c991b7 100644
--- a/drivers/infiniband/hw/mlx5/umr.c
+++ b/drivers/infiniband/hw/mlx5/umr.c
@@ -937,8 +937,7 @@ int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
  * pinned and the HW can switch from 4K to huge-page alignment).
  */
 int mlx5r_umr_update_mr_page_shift(struct mlx5_ib_mr *mr,
-				   unsigned int page_shift,
-				   bool dd)
+				   unsigned int page_shift)
 {
 	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
 	struct mlx5r_umr_wqe wqe = {};
@@ -953,16 +952,8 @@ int mlx5r_umr_update_mr_page_shift(struct mlx5_ib_mr *mr,
 	/* Fill mkey segment with the new page size, keep the rest unchanged */
 	MLX5_SET(mkc, &wqe.mkey_seg, log_page_size, page_shift);
 
-	if (dd)
-		MLX5_SET(mkc, &wqe.mkey_seg, pd, dev->ddr.pdn);
-	else
-		MLX5_SET(mkc, &wqe.mkey_seg, pd, to_mpd(mr->ibmr.pd)->pdn);
-
 	MLX5_SET64(mkc, &wqe.mkey_seg, start_addr, mr->ibmr.iova);
 	MLX5_SET64(mkc, &wqe.mkey_seg, len, mr->ibmr.length);
-	MLX5_SET(mkc, &wqe.mkey_seg, qpn, 0xffffff);
-	MLX5_SET(mkc, &wqe.mkey_seg, mkey_7_0,
-		 mlx5_mkey_variant(mr->mmkey.key));
 
 	err = mlx5r_umr_post_send_wait(dev, mr->mmkey.key, &wqe, false);
 	if (!err)
@@ -1049,7 +1040,7 @@ static int _mlx5r_umr_zap_mkey(struct mlx5_ib_mr *mr,
 	 * non-present.
 	 */
 	if (*nblocks) {
-		err = mlx5r_umr_update_mr_page_shift(mr, max_page_shift, dd);
+		err = mlx5r_umr_update_mr_page_shift(mr, max_page_shift);
 		if (err) {
 			mr->page_shift = old_page_shift;
 			return err;
@@ -1114,8 +1105,7 @@ int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
 			goto err;
 	}
 
-	err = mlx5r_umr_update_mr_page_shift(mr, mr->page_shift,
-					     mr->data_direct);
+	err = mlx5r_umr_update_mr_page_shift(mr, mr->page_shift);
 	if (err)
 		goto err;
 	err = _mlx5r_dmabuf_umr_update_pas(mr, xlt_flags, 0, zapped_blocks,
diff --git a/drivers/infiniband/hw/mlx5/umr.h b/drivers/infiniband/hw/mlx5/umr.h
index 7eeaf6a94c9743..59809d4d7d7297 100644
--- a/drivers/infiniband/hw/mlx5/umr.h
+++ b/drivers/infiniband/hw/mlx5/umr.h
@@ -106,8 +106,7 @@ int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags);
 int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 			 int page_shift, int flags);
 int mlx5r_umr_update_mr_page_shift(struct mlx5_ib_mr *mr,
-				   unsigned int page_shift,
-				   bool dd);
+				   unsigned int page_shift);
 int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
 				 unsigned int page_shift);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 06/10] IB/mlx5: Pull the pdn out of the depths of the umr machinery
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 05/10] IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift() Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 07/10] IB/mlx5: Don't mangle the mr->pd inside the rereg callback Jason Gunthorpe
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Instead of getting the pdn deep inside the umr code, pass it in from the
top. to_mpd(mr->ibmr.pd)->pdn is not safe due to the rereg races, so all
the call sites need some revision to obtain the pdn in a safe way.

Mark them with mlx5_mr_pdn(); following patches will go through and remove
these.

Cases where the XLT flags are known and do not require the PDN can pass 0,
such as for mlx5_ib_dmabuf_invalidate_cb().

Also extract the DMABUF data_direct special case from inside the UMR code
and into the only place that needs it, pagefault_dmabuf_mr(). The actual
mr was created directly without using the UMR flow. Ultimately this will
be moved into mlx5_ib_init_dmabuf_mr().

Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  9 ++++
 drivers/infiniband/hw/mlx5/mr.c      |  8 +--
 drivers/infiniband/hw/mlx5/odp.c     | 12 +++--
 drivers/infiniband/hw/mlx5/umr.c     | 75 ++++++++++++++--------------
 drivers/infiniband/hw/mlx5/umr.h     |  6 +--
 5 files changed, 64 insertions(+), 46 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index e156dc4d752996..0a2b8ede0d818a 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -331,6 +331,10 @@ struct mlx5_ib_flow_db {
 #define MLX5_IB_QPT_DCT		IB_QPT_RESERVED4
 #define MLX5_IB_WR_UMR		IB_WR_RESERVED1
 
+/*
+ * A valid pdn is required when flags include MLX5_IB_UPD_XLT_ENABLE,
+ * MLX5_IB_UPD_XLT_PD or MLX5_IB_UPD_XLT_ACCESS.
+ */
 #define MLX5_IB_UPD_XLT_ZAP	      BIT(0)
 #define MLX5_IB_UPD_XLT_ENABLE	      BIT(1)
 #define MLX5_IB_UPD_XLT_ATOMIC	      BIT(2)
@@ -1209,6 +1213,11 @@ static inline struct mlx5_ib_pd *to_mpd(struct ib_pd *ibpd)
 	return container_of(ibpd, struct mlx5_ib_pd, ibpd);
 }
 
+static inline u32 mlx5_mr_pdn(struct mlx5_ib_mr *mr)
+{
+	return to_mpd(mr->ibmr.pd)->pdn;
+}
+
 static inline struct mlx5_ib_srq *to_msrq(struct ib_srq *ibsrq)
 {
 	return container_of(ibsrq, struct mlx5_ib_srq, ibsrq);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index ab0ba34448eb89..a7924e96c2817a 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -781,7 +781,8 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
 		 * configured properly but left disabled. It is safe to go ahead
 		 * and configure it again via UMR while enabling it.
 		 */
-		err = mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ENABLE);
+		err = mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ENABLE,
+					      to_mpd(pd)->pdn);
 		if (err) {
 			mlx5_ib_dereg_mr(&mr->ibmr, NULL);
 			return ERR_PTR(err);
@@ -890,7 +891,8 @@ static void mlx5_ib_dmabuf_invalidate_cb(struct dma_buf_attachment *attach)
 	if (!umem_dmabuf->sgt || !mr)
 		return;
 
-	mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ZAP);
+	/* MLX5_IB_UPD_XLT_ZAP does not change the pdn */
+	mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ZAP, 0);
 	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
 }
 
@@ -1145,7 +1147,7 @@ static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct ib_pd *pd,
 	mr->ibmr.length = new_umem->length;
 	mr->page_shift = order_base_2(page_size);
 	mr->umem = new_umem;
-	err = mlx5r_umr_update_mr_pas(mr, upd_flags);
+	err = mlx5r_umr_update_mr_pas(mr, upd_flags, mlx5_mr_pdn(mr));
 	if (err) {
 		/*
 		 * The MR is revoked at this point so there is no issue to free
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 10c11b72f41247..d7feb49b28fbac 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -833,12 +833,14 @@ static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
 			       u32 *bytes_mapped, u32 flags)
 {
 	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
+	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
 	int access_mode = mr->data_direct ? MLX5_MKC_ACCESS_MODE_KSM :
 					    MLX5_MKC_ACCESS_MODE_MTT;
 	unsigned int old_page_shift = mr->page_shift;
 	unsigned int page_shift;
 	unsigned long page_size;
 	u32 xlt_flags = 0;
+	u32 pdn = 0;
 	int err;
 
 	if (flags & MLX5_PF_FLAGS_ENABLE)
@@ -857,8 +859,12 @@ static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
 		err = -EINVAL;
 	} else {
 		page_shift = order_base_2(page_size);
+		if (mr->data_direct)
+			pdn = dev->ddr.pdn;
+		else
+			pdn = mlx5_mr_pdn(mr);
 		if (page_shift != mr->page_shift && mr->dmabuf_faulted) {
-			err = mlx5r_umr_dmabuf_update_pgsz(mr, xlt_flags,
+			err = mlx5r_umr_dmabuf_update_pgsz(mr, xlt_flags, pdn,
 							   page_shift);
 		} else {
 			mr->page_shift = page_shift;
@@ -866,8 +872,8 @@ static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
 				err = mlx5r_umr_update_data_direct_ksm_pas(
 					mr, xlt_flags);
 			else
-				err = mlx5r_umr_update_mr_pas(mr,
-							      xlt_flags);
+				err = mlx5r_umr_update_mr_pas(mr, xlt_flags,
+							      pdn);
 		}
 	}
 	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
diff --git a/drivers/infiniband/hw/mlx5/umr.c b/drivers/infiniband/hw/mlx5/umr.c
index 062a47d0c991b7..f92f222b899f8d 100644
--- a/drivers/infiniband/hw/mlx5/umr.c
+++ b/drivers/infiniband/hw/mlx5/umr.c
@@ -603,11 +603,11 @@ mlx5r_umr_set_update_xlt_ctrl_seg(struct mlx5_wqe_umr_ctrl_seg *ctrl_seg,
 
 static void mlx5r_umr_set_update_xlt_mkey_seg(struct mlx5_ib_dev *dev,
 					      struct mlx5_mkey_seg *mkey_seg,
-					      struct mlx5_ib_mr *mr,
+					      struct mlx5_ib_mr *mr, u32 pdn,
 					      unsigned int page_shift)
 {
 	mlx5r_umr_set_access_flags(dev, mkey_seg, mr->access_flags);
-	MLX5_SET(mkc, mkey_seg, pd, to_mpd(mr->ibmr.pd)->pdn);
+	MLX5_SET(mkc, mkey_seg, pd, pdn);
 	MLX5_SET64(mkc, mkey_seg, start_addr, mr->ibmr.iova);
 	MLX5_SET64(mkc, mkey_seg, len, mr->ibmr.length);
 	MLX5_SET(mkc, mkey_seg, log_page_size, page_shift);
@@ -670,23 +670,22 @@ static void mlx5r_umr_final_update_xlt(struct mlx5_ib_dev *dev,
 	wqe->data_seg.byte_count = cpu_to_be32(sg->length);
 }
 
-static void
-_mlx5r_umr_init_wqe(struct mlx5_ib_mr *mr, struct mlx5r_umr_wqe *wqe,
-		    struct ib_sge *sg, unsigned int flags,
-		    unsigned int page_shift, bool dd)
+static void _mlx5r_umr_init_wqe(struct mlx5_ib_mr *mr,
+				struct mlx5r_umr_wqe *wqe, struct ib_sge *sg,
+				unsigned int flags, u32 pdn,
+				unsigned int page_shift)
 {
 	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
 
 	mlx5r_umr_set_update_xlt_ctrl_seg(&wqe->ctrl_seg, flags, sg);
-	mlx5r_umr_set_update_xlt_mkey_seg(dev, &wqe->mkey_seg, mr, page_shift);
-	if (dd) /* Use the data direct internal kernel PD */
-		MLX5_SET(mkc, &wqe->mkey_seg, pd, dev->ddr.pdn);
+	mlx5r_umr_set_update_xlt_mkey_seg(dev, &wqe->mkey_seg, mr, pdn,
+					  page_shift);
 	mlx5r_umr_set_update_xlt_data_seg(&wqe->data_seg, sg);
 }
 
-static int
-_mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags, bool dd,
-			 size_t start_block, size_t nblocks)
+static int _mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags,
+				    u32 pdn, bool dd, size_t start_block,
+				    size_t nblocks)
 {
 	size_t ent_size = dd ? sizeof(struct mlx5_ksm) : sizeof(struct mlx5_mtt);
 	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
@@ -720,7 +719,7 @@ _mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags, bool dd,
 
 	orig_sg_length = sg.length;
 
-	_mlx5r_umr_init_wqe(mr, &wqe, &sg, flags, mr->page_shift, dd);
+	_mlx5r_umr_init_wqe(mr, &wqe, &sg, flags, pdn, mr->page_shift);
 
 	/* Set initial translation offset to start_block */
 	offset = (u64)start_block * ent_size;
@@ -811,7 +810,8 @@ int mlx5r_umr_update_data_direct_ksm_pas_range(struct mlx5_ib_mr *mr,
 	    !(flags & MLX5_IB_UPD_XLT_KEEP_PGSZ)))
 		return -EINVAL;
 
-	return _mlx5r_umr_update_mr_pas(mr, flags, true, start_block, nblocks);
+	return _mlx5r_umr_update_mr_pas(mr, flags, mr_to_mdev(mr)->ddr.pdn,
+					true, start_block, nblocks);
 }
 
 int mlx5r_umr_update_data_direct_ksm_pas(struct mlx5_ib_mr *mr,
@@ -821,12 +821,13 @@ int mlx5r_umr_update_data_direct_ksm_pas(struct mlx5_ib_mr *mr,
 }
 
 int mlx5r_umr_update_mr_pas_range(struct mlx5_ib_mr *mr, unsigned int flags,
-				  size_t start_block, size_t nblocks)
+				  u32 pdn, size_t start_block, size_t nblocks)
 {
 	if (WARN_ON(mr->umem->is_odp))
 		return -EINVAL;
 
-	return _mlx5r_umr_update_mr_pas(mr, flags, false, start_block, nblocks);
+	return _mlx5r_umr_update_mr_pas(mr, flags, pdn, false, start_block,
+					nblocks);
 }
 
 /*
@@ -834,9 +835,9 @@ int mlx5r_umr_update_mr_pas_range(struct mlx5_ib_mr *mr, unsigned int flags,
  * Dmabuf MR is handled in a similar way, except that the MLX5_IB_UPD_XLT_ZAP
  * flag may be used.
  */
-int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags)
+int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags, u32 pdn)
 {
-	return mlx5r_umr_update_mr_pas_range(mr, flags, 0, 0);
+	return mlx5r_umr_update_mr_pas_range(mr, flags, pdn, 0, 0);
 }
 
 static bool umr_can_use_indirect_mkey(struct mlx5_ib_dev *dev)
@@ -861,6 +862,7 @@ int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 	size_t orig_sg_length;
 	size_t pages_iter;
 	struct ib_sge sg;
+	u32 pdn = mlx5_mr_pdn(mr);
 	int err = 0;
 	void *xlt;
 
@@ -895,7 +897,8 @@ int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 	}
 
 	mlx5r_umr_set_update_xlt_ctrl_seg(&wqe.ctrl_seg, flags, &sg);
-	mlx5r_umr_set_update_xlt_mkey_seg(dev, &wqe.mkey_seg, mr, page_shift);
+	mlx5r_umr_set_update_xlt_mkey_seg(dev, &wqe.mkey_seg, mr, pdn,
+					  page_shift);
 	mlx5r_umr_set_update_xlt_data_seg(&wqe.data_seg, &sg);
 
 	for (pages_mapped = 0;
@@ -962,17 +965,18 @@ int mlx5r_umr_update_mr_page_shift(struct mlx5_ib_mr *mr,
 	return err;
 }
 
-static inline int
-_mlx5r_dmabuf_umr_update_pas(struct mlx5_ib_mr *mr, unsigned int flags,
-			     size_t start_block, size_t nblocks, bool dd)
+static inline int _mlx5r_dmabuf_umr_update_pas(struct mlx5_ib_mr *mr,
+					       unsigned int flags, u32 pdn,
+					       size_t start_block,
+					       size_t nblocks, bool dd)
 {
 	if (dd)
 		return mlx5r_umr_update_data_direct_ksm_pas_range(mr, flags,
 								  start_block,
 								  nblocks);
 	else
-		return mlx5r_umr_update_mr_pas_range(mr, flags, start_block,
-						     nblocks);
+		return mlx5r_umr_update_mr_pas_range(mr, flags, pdn,
+						     start_block, nblocks);
 }
 
 /**
@@ -986,11 +990,9 @@ _mlx5r_dmabuf_umr_update_pas(struct mlx5_ib_mr *mr, unsigned int flags,
  * Return: On success, returns the number of entries that were zapped.
  *         On error, returns a negative error code.
  */
-static int _mlx5r_umr_zap_mkey(struct mlx5_ib_mr *mr,
-			       unsigned int flags,
-			       unsigned int page_shift,
-			       size_t *nblocks,
-			       bool dd)
+static int _mlx5r_umr_zap_mkey(struct mlx5_ib_mr *mr, unsigned int flags,
+			       unsigned int page_shift, size_t *nblocks,
+			       u32 pdn, bool dd)
 {
 	unsigned int old_page_shift = mr->page_shift;
 	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
@@ -1030,7 +1032,7 @@ static int _mlx5r_umr_zap_mkey(struct mlx5_ib_mr *mr,
 	 */
 	if (*nblocks)
 		mr->page_shift = max_page_shift;
-	err = _mlx5r_dmabuf_umr_update_pas(mr, flags, 0, *nblocks, dd);
+	err = _mlx5r_dmabuf_umr_update_pas(mr, flags, pdn, 0, *nblocks, dd);
 	if (err) {
 		mr->page_shift = old_page_shift;
 		return err;
@@ -1055,6 +1057,7 @@ static int _mlx5r_umr_zap_mkey(struct mlx5_ib_mr *mr,
  * entries accordingly
  * @mr:        The memory region to update
  * @xlt_flags: Translation table update flags
+ * @pdn:       Protection domain number
  * @page_shift: The new (optimized) page shift to use
  *
  * This function updates the page size and mkey translation entries for a DMABUF
@@ -1074,7 +1077,7 @@ static int _mlx5r_umr_zap_mkey(struct mlx5_ib_mr *mr,
  *
  * Returns 0 on success or a negative error code on failure.
  */
-int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
+int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags, u32 pdn,
 				 unsigned int page_shift)
 {
 	unsigned int old_page_shift = mr->page_shift;
@@ -1083,7 +1086,7 @@ int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
 	int err;
 
 	err = _mlx5r_umr_zap_mkey(mr, xlt_flags, page_shift, &zapped_blocks,
-				  mr->data_direct);
+				  pdn, mr->data_direct);
 	if (err)
 		return err;
 
@@ -1096,10 +1099,8 @@ int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
 		 * the page size in the mkey yet.
 		 */
 		err = _mlx5r_dmabuf_umr_update_pas(
-			mr,
-			xlt_flags | MLX5_IB_UPD_XLT_KEEP_PGSZ,
-			zapped_blocks,
-			total_blocks - zapped_blocks,
+			mr, xlt_flags | MLX5_IB_UPD_XLT_KEEP_PGSZ, pdn,
+			zapped_blocks, total_blocks - zapped_blocks,
 			mr->data_direct);
 		if (err)
 			goto err;
@@ -1108,7 +1109,7 @@ int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
 	err = mlx5r_umr_update_mr_page_shift(mr, mr->page_shift);
 	if (err)
 		goto err;
-	err = _mlx5r_dmabuf_umr_update_pas(mr, xlt_flags, 0, zapped_blocks,
+	err = _mlx5r_dmabuf_umr_update_pas(mr, xlt_flags, pdn, 0, zapped_blocks,
 					   mr->data_direct);
 	if (err)
 		goto err;
diff --git a/drivers/infiniband/hw/mlx5/umr.h b/drivers/infiniband/hw/mlx5/umr.h
index 59809d4d7d7297..99192ec67957c7 100644
--- a/drivers/infiniband/hw/mlx5/umr.h
+++ b/drivers/infiniband/hw/mlx5/umr.h
@@ -101,13 +101,13 @@ int mlx5r_umr_update_data_direct_ksm_pas_range(struct mlx5_ib_mr *mr,
 					       size_t nblocks);
 int mlx5r_umr_update_data_direct_ksm_pas(struct mlx5_ib_mr *mr, unsigned int flags);
 int mlx5r_umr_update_mr_pas_range(struct mlx5_ib_mr *mr, unsigned int flags,
-				  size_t start_block, size_t nblocks);
-int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags);
+				  u32 pdn, size_t start_block, size_t nblocks);
+int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags, u32 pdn);
 int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 			 int page_shift, int flags);
 int mlx5r_umr_update_mr_page_shift(struct mlx5_ib_mr *mr,
 				   unsigned int page_shift);
-int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags,
+int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags, u32 pdn,
 				 unsigned int page_shift);
 
 #endif /* _MLX5_IB_UMR_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 07/10] IB/mlx5: Don't mangle the mr->pd inside the rereg callback
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 06/10] IB/mlx5: Pull the pdn out of the depths of the umr machinery Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 08/10] IB/mlx5: Push pdn above mlx5r_umr_update_xlt() Jason Gunthorpe
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

The rereg protocol expects the core code to change mr->pd and synchronize
that change with the atomics and syncs. The driver should not touch it.

mlx5 needed to update it in umr_rereg_pas() because
mlx5r_umr_update_mr_pas() required the updated mr->pd to build the
UMR.

Simply switch mlx5r_umr_update_mr_pas() to use the pdn directly from
the new pd and remove the mr->pd update.

Fixes: 56e11d628c5d ("IB/mlx5: Added support for re-registration of MRs")
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index a7924e96c2817a..a56443b9053734 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1134,10 +1134,8 @@ static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct ib_pd *pd,
 	if (err)
 		return err;
 
-	if (flags & IB_MR_REREG_PD) {
-		mr->ibmr.pd = pd;
+	if (flags & IB_MR_REREG_PD)
 		upd_flags |= MLX5_IB_UPD_XLT_PD;
-	}
 	if (flags & IB_MR_REREG_ACCESS) {
 		mr->access_flags = access_flags;
 		upd_flags |= MLX5_IB_UPD_XLT_ACCESS;
@@ -1147,7 +1145,7 @@ static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct ib_pd *pd,
 	mr->ibmr.length = new_umem->length;
 	mr->page_shift = order_base_2(page_size);
 	mr->umem = new_umem;
-	err = mlx5r_umr_update_mr_pas(mr, upd_flags, mlx5_mr_pdn(mr));
+	err = mlx5r_umr_update_mr_pas(mr, upd_flags, to_mpd(pd)->pdn);
 	if (err) {
 		/*
 		 * The MR is revoked at this point so there is no issue to free
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 08/10] IB/mlx5: Push pdn above mlx5r_umr_update_xlt()
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 07/10] IB/mlx5: Don't mangle the mr->pd inside the rereg callback Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 09/10] IB/mlx5: Push pdn above pagfault_real_mr() Jason Gunthorpe
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Keep pushing the pdn higher to remove more places touching mr->pd:

- XLT combinations that don't use PDN can just pass 0
- Use local pd values instead of mr->pd
- Implicit MR does not have inplace rereg, so the mr->pd is safe

Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/odp.c | 40 +++++++++++++++++---------------
 drivers/infiniband/hw/mlx5/umr.c |  3 +--
 drivers/infiniband/hw/mlx5/umr.h |  2 +-
 3 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index d7feb49b28fbac..50804b4c90e477 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -221,7 +221,8 @@ static void free_implicit_child_mr_work(struct work_struct *work)
 	mutex_lock(&odp_imr->umem_mutex);
 	mlx5r_umr_update_xlt(mr->parent,
 			     ib_umem_start(odp) >> mlx5_imr_mtt_shift, 1, 0,
-			     MLX5_IB_UPD_XLT_INDIRECT | MLX5_IB_UPD_XLT_ATOMIC);
+			     MLX5_IB_UPD_XLT_INDIRECT | MLX5_IB_UPD_XLT_ATOMIC,
+			     0);
 	mutex_unlock(&odp_imr->umem_mutex);
 	mlx5_ib_dereg_mr(&mr->ibmr, NULL);
 
@@ -318,10 +319,12 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 			u64 umr_offset = idx & umr_block_mask;
 
 			if (in_block && umr_offset == 0) {
-				mlx5r_umr_update_xlt(mr, blk_start_idx,
-						     idx - blk_start_idx, 0,
-						     MLX5_IB_UPD_XLT_ZAP |
-						     MLX5_IB_UPD_XLT_ATOMIC);
+				mlx5r_umr_update_xlt(
+					mr, blk_start_idx, idx - blk_start_idx,
+					0,
+					MLX5_IB_UPD_XLT_ZAP |
+						MLX5_IB_UPD_XLT_ATOMIC,
+					0);
 				in_block = 0;
 				/* Count page invalidations */
 				invalidations += idx - blk_start_idx + 1;
@@ -329,10 +332,9 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 		}
 	}
 	if (in_block) {
-		mlx5r_umr_update_xlt(mr, blk_start_idx,
-				     idx - blk_start_idx + 1, 0,
-				     MLX5_IB_UPD_XLT_ZAP |
-				     MLX5_IB_UPD_XLT_ATOMIC);
+		mlx5r_umr_update_xlt(
+			mr, blk_start_idx, idx - blk_start_idx + 1, 0,
+			MLX5_IB_UPD_XLT_ZAP | MLX5_IB_UPD_XLT_ATOMIC, 0);
 		/* Count page invalidations */
 		invalidations += idx - blk_start_idx + 1;
 	}
@@ -502,11 +504,9 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struct mlx5_ib_mr *imr,
 	 */
 	refcount_set(&mr->mmkey.usecount, 2);
 
-	err = mlx5r_umr_update_xlt(mr, 0,
-				   mlx5_imr_mtt_entries,
-				   PAGE_SHIFT,
-				   MLX5_IB_UPD_XLT_ZAP |
-				   MLX5_IB_UPD_XLT_ENABLE);
+	err = mlx5r_umr_update_xlt(mr, 0, mlx5_imr_mtt_entries, PAGE_SHIFT,
+				   MLX5_IB_UPD_XLT_ZAP | MLX5_IB_UPD_XLT_ENABLE,
+				   to_mpd(mr->ibmr.pd)->pdn);
 	if (err) {
 		ret = ERR_PTR(err);
 		goto out_mr;
@@ -647,7 +647,8 @@ struct mlx5_ib_mr *mlx5_ib_alloc_implicit_mr(struct mlx5_ib_pd *pd,
 				   mlx5_imr_ksm_page_shift,
 				   MLX5_IB_UPD_XLT_INDIRECT |
 				   MLX5_IB_UPD_XLT_ZAP |
-				   MLX5_IB_UPD_XLT_ENABLE);
+				   MLX5_IB_UPD_XLT_ENABLE,
+				   pd->pdn);
 	if (err)
 		goto out_mr;
 
@@ -720,7 +721,8 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	 * No need to check whether the MTTs really belong to this MR, since
 	 * ib_umem_odp_map_dma_and_lock already checks this.
 	 */
-	ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
+	ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags,
+				   mlx5_mr_pdn(mr));
 	mutex_unlock(&odp->umem_mutex);
 
 	if (ret < 0) {
@@ -818,9 +820,9 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
 	 * next pagefault handler will see the new information.
 	 */
 	mutex_lock(&odp_imr->umem_mutex);
-	err = mlx5r_umr_update_xlt(imr, upd_start_idx, upd_len, 0,
-				   MLX5_IB_UPD_XLT_INDIRECT |
-					  MLX5_IB_UPD_XLT_ATOMIC);
+	err = mlx5r_umr_update_xlt(
+		imr, upd_start_idx, upd_len, 0,
+		MLX5_IB_UPD_XLT_INDIRECT | MLX5_IB_UPD_XLT_ATOMIC, 0);
 	mutex_unlock(&odp_imr->umem_mutex);
 	if (err) {
 		mlx5_ib_err(mr_to_mdev(imr), "Failed to update PAS\n");
diff --git a/drivers/infiniband/hw/mlx5/umr.c b/drivers/infiniband/hw/mlx5/umr.c
index f92f222b899f8d..257d8eacd0363a 100644
--- a/drivers/infiniband/hw/mlx5/umr.c
+++ b/drivers/infiniband/hw/mlx5/umr.c
@@ -846,7 +846,7 @@ static bool umr_can_use_indirect_mkey(struct mlx5_ib_dev *dev)
 }
 
 int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
-			 int page_shift, int flags)
+			 int page_shift, int flags, u32 pdn)
 {
 	int desc_size = (flags & MLX5_IB_UPD_XLT_INDIRECT)
 			       ? sizeof(struct mlx5_klm)
@@ -862,7 +862,6 @@ int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 	size_t orig_sg_length;
 	size_t pages_iter;
 	struct ib_sge sg;
-	u32 pdn = mlx5_mr_pdn(mr);
 	int err = 0;
 	void *xlt;
 
diff --git a/drivers/infiniband/hw/mlx5/umr.h b/drivers/infiniband/hw/mlx5/umr.h
index 99192ec67957c7..bda7123781a953 100644
--- a/drivers/infiniband/hw/mlx5/umr.h
+++ b/drivers/infiniband/hw/mlx5/umr.h
@@ -104,7 +104,7 @@ int mlx5r_umr_update_mr_pas_range(struct mlx5_ib_mr *mr, unsigned int flags,
 				  u32 pdn, size_t start_block, size_t nblocks);
 int mlx5r_umr_update_mr_pas(struct mlx5_ib_mr *mr, unsigned int flags, u32 pdn);
 int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
-			 int page_shift, int flags);
+			 int page_shift, int flags, u32 pdn);
 int mlx5r_umr_update_mr_page_shift(struct mlx5_ib_mr *mr,
 				   unsigned int page_shift);
 int mlx5r_umr_dmabuf_update_pgsz(struct mlx5_ib_mr *mr, u32 xlt_flags, u32 pdn,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 09/10] IB/mlx5: Push pdn above pagfault_real_mr()
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 08/10] IB/mlx5: Push pdn above mlx5r_umr_update_xlt() Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-04  1:27 ` [PATCH 10/10] IB/mlx5: Push pdn above pagefault_dmabuf_mr() Jason Gunthorpe
  2026-06-08 17:52 ` [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Remove the mlx5_mr_pdn() in pagefault_real_mr() by pushing the pdn up, all
the callers use 0 since they don't pass MLX5_PF_FLAGS_ENABLE except the
ioctl reg_mr path which can use the ioctl pd.

Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  4 ++--
 drivers/infiniband/hw/mlx5/mr.c      |  2 +-
 drivers/infiniband/hw/mlx5/odp.c     | 20 ++++++++++++--------
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 0a2b8ede0d818a..1c88a11f8dfc93 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1422,7 +1422,7 @@ int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
 int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
 			       enum ib_uverbs_advise_mr_advice advice,
 			       u32 flags, struct ib_sge *sg_list, u32 num_sge);
-int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr);
+int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd);
 int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev) { return 0; }
@@ -1451,7 +1451,7 @@ mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
 {
 	return -EOPNOTSUPP;
 }
-static inline int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr)
+static inline int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index a56443b9053734..46847054491f50 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -842,7 +842,7 @@ static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length,
 	if (err)
 		goto err_dereg_mr;
 
-	err = mlx5_ib_init_odp_mr(mr);
+	err = mlx5_ib_init_odp_mr(mr, pd);
 	if (err)
 		goto err_dereg_mr;
 	return &mr->ibmr;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 50804b4c90e477..2dcc4f5339f9d3 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -687,12 +687,16 @@ void mlx5_ib_free_odp_mr(struct mlx5_ib_mr *mr)
 	}
 }
 
+/*
+ * pdn must be valid only when xlt_flags updates the mkey PD. In this path that
+ * is only MLX5_PF_FLAGS_ENABLE. DOWNGRADE and SNAPSHOT leave the PD masked out.
+ */
 #define MLX5_PF_FLAGS_DOWNGRADE BIT(1)
 #define MLX5_PF_FLAGS_SNAPSHOT BIT(2)
 #define MLX5_PF_FLAGS_ENABLE BIT(3)
 static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 			     u64 user_va, size_t bcnt, u32 *bytes_mapped,
-			     u32 flags)
+			     u32 flags, u32 pdn)
 {
 	int page_shift, ret, np;
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
@@ -722,7 +726,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	 * ib_umem_odp_map_dma_and_lock already checks this.
 	 */
 	ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags,
-				   mlx5_mr_pdn(mr));
+				   pdn);
 	mutex_unlock(&odp->umem_mutex);
 
 	if (ret < 0) {
@@ -788,7 +792,7 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
 		      user_va;
 
 		ret = pagefault_real_mr(mtt, umem_odp, user_va, len,
-					bytes_mapped, flags);
+					bytes_mapped, flags, 0);
 
 		mlx5r_deref_odp_mkey(&mtt->mmkey);
 
@@ -930,19 +934,20 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
 				    ib_umem_end(odp) - user_va < bcnt))
 			return -EFAULT;
 		return pagefault_real_mr(mr, odp, user_va, bcnt, bytes_mapped,
-					 flags);
+					 flags, 0);
 	}
 	return pagefault_implicit_mr(mr, odp, io_virt, bcnt, bytes_mapped,
 				     flags);
 }
 
-int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr)
+int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd)
 {
 	int ret;
 
 	ret = pagefault_real_mr(mr, to_ib_umem_odp(mr->umem), mr->umem->address,
 				mr->umem->length, NULL,
-				MLX5_PF_FLAGS_SNAPSHOT | MLX5_PF_FLAGS_ENABLE);
+				MLX5_PF_FLAGS_SNAPSHOT | MLX5_PF_FLAGS_ENABLE,
+				to_mpd(pd)->pdn);
 	return ret >= 0 ? 0 : ret;
 }
 
@@ -1575,8 +1580,7 @@ static void mlx5_ib_mr_memory_pfault_handler(struct mlx5_ib_dev *dev,
 	ret = pagefault_mr(mr, prefetch_va, prefetch_size, NULL, 0, true);
 	if (ret < 0) {
 		ret = pagefault_mr(mr, pfault->memory.va,
-				   pfault->memory.fault_byte_count, NULL, 0,
-				   true);
+				   pfault->memory.fault_byte_count, NULL, 0, true);
 		if (ret < 0)
 			goto err;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 10/10] IB/mlx5: Push pdn above pagefault_dmabuf_mr()
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 09/10] IB/mlx5: Push pdn above pagfault_real_mr() Jason Gunthorpe
@ 2026-06-04  1:27 ` Jason Gunthorpe
  2026-06-08 17:52 ` [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-04  1:27 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

Remove the mlx5_mr_pdn() inside pagefault_dmabuf_mr(), the only user of
the pdn is the init path which is inside an ioctl.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  9 ++-------
 drivers/infiniband/hw/mlx5/mr.c      |  2 +-
 drivers/infiniband/hw/mlx5/odp.c     | 21 +++++++++++----------
 3 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 1c88a11f8dfc93..74613f53483384 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1213,11 +1213,6 @@ static inline struct mlx5_ib_pd *to_mpd(struct ib_pd *ibpd)
 	return container_of(ibpd, struct mlx5_ib_pd, ibpd);
 }
 
-static inline u32 mlx5_mr_pdn(struct mlx5_ib_mr *mr)
-{
-	return to_mpd(mr->ibmr.pd)->pdn;
-}
-
 static inline struct mlx5_ib_srq *to_msrq(struct ib_srq *ibsrq)
 {
 	return container_of(ibsrq, struct mlx5_ib_srq, ibsrq);
@@ -1423,7 +1418,7 @@ int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
 			       enum ib_uverbs_advise_mr_advice advice,
 			       u32 flags, struct ib_sge *sg_list, u32 num_sge);
 int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd);
-int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr);
+int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev) { return 0; }
 static inline int mlx5r_odp_create_eq(struct mlx5_ib_dev *dev,
@@ -1455,7 +1450,7 @@ static inline int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd)
 {
 	return -EOPNOTSUPP;
 }
-static inline int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr)
+static inline int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 46847054491f50..193f874482edf1 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -969,7 +969,7 @@ reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
 		mr->data_direct = true;
 	}
 
-	err = mlx5_ib_init_dmabuf_mr(mr);
+	err = mlx5_ib_init_dmabuf_mr(mr, pd);
 	if (err)
 		goto err_dereg_mr;
 	return &mr->ibmr;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 2dcc4f5339f9d3..1badec9bf52708 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -836,17 +836,15 @@ static int pagefault_implicit_mr(struct mlx5_ib_mr *imr,
 }
 
 static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
-			       u32 *bytes_mapped, u32 flags)
+			       u32 *bytes_mapped, u32 flags, u32 pdn)
 {
 	struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
-	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
 	int access_mode = mr->data_direct ? MLX5_MKC_ACCESS_MODE_KSM :
 					    MLX5_MKC_ACCESS_MODE_MTT;
 	unsigned int old_page_shift = mr->page_shift;
 	unsigned int page_shift;
 	unsigned long page_size;
 	u32 xlt_flags = 0;
-	u32 pdn = 0;
 	int err;
 
 	if (flags & MLX5_PF_FLAGS_ENABLE)
@@ -865,10 +863,6 @@ static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
 		err = -EINVAL;
 	} else {
 		page_shift = order_base_2(page_size);
-		if (mr->data_direct)
-			pdn = dev->ddr.pdn;
-		else
-			pdn = mlx5_mr_pdn(mr);
 		if (page_shift != mr->page_shift && mr->dmabuf_faulted) {
 			err = mlx5r_umr_dmabuf_update_pgsz(mr, xlt_flags, pdn,
 							   page_shift);
@@ -915,7 +909,7 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
 		return -EFAULT;
 
 	if (mr->umem->is_dmabuf)
-		return pagefault_dmabuf_mr(mr, bcnt, bytes_mapped, flags);
+		return pagefault_dmabuf_mr(mr, bcnt, bytes_mapped, flags, 0);
 
 	if (!odp->is_implicit_odp) {
 		u64 offset = io_virt < mr->ibmr.iova ? 0 : io_virt - mr->ibmr.iova;
@@ -951,12 +945,19 @@ int mlx5_ib_init_odp_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd)
 	return ret >= 0 ? 0 : ret;
 }
 
-int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr)
+int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr, struct ib_pd *pd)
 {
+	struct mlx5_ib_dev *dev = mr_to_mdev(mr);
+	u32 pdn;
 	int ret;
 
+	if (mr->data_direct)
+		pdn = dev->ddr.pdn;
+	else
+		pdn = to_mpd(pd)->pdn;
+
 	ret = pagefault_dmabuf_mr(mr, mr->umem->length, NULL,
-				  MLX5_PF_FLAGS_ENABLE);
+				  MLX5_PF_FLAGS_ENABLE, pdn);
 
 	return ret >= 0 ? 0 : ret;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd
  2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2026-06-04  1:27 ` [PATCH 10/10] IB/mlx5: Push pdn above pagefault_dmabuf_mr() Jason Gunthorpe
@ 2026-06-08 17:52 ` Jason Gunthorpe
  10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2026-06-08 17:52 UTC (permalink / raw)
  To: Leon Romanovsky, linux-rdma
  Cc: Doug Ledford, Edward Srouji, Leon Romanovsky, Leon Romanovsky,
	Matan Barak, Michael Guralnik, Noa Osherovich, patches,
	Steve Wise

On Wed, Jun 03, 2026 at 10:27:39PM -0300, Jason Gunthorpe wrote:
> Sashiko pointed out an existing bug related to mr->pd: when IB_MR_REREG_PD
> is used the mr->pd is changed while only holding the write side of the
> MR's uobject lock.
> 
> Effectively, because IB_MR_REREG_PD is usually implemented by changing the
> MR in-place, the mr->pd becomes unreadable outside an MR-specific system
> call that holds the uobject lock. All the readers in this series could
> race with an IB_MR_REREG_PD and potentially UAF the mr->pd.
> 
>  https://sashiko.dev/#/patchset/20260427-security-bug-fixes-v3-0-4621fa52de0e%40nvidia.com?part=4
> 
> This was presented as a simple 'oh look it can race with nldev' which is
> correct. However, asking AI to fully audit mr->pd touches also revealed a
> much more convoluted issue inside mlx5 ODP that is also using mr->pd from
> the page fault work queue, advise mr work queue and advise mr system call
> without any locking.
> 
> It turns out this mlx5 problem is entirely unnecessary since outside
> implicit mr there are only three cases where the UMR actually flags the
> PDN to be read by HW, umr_rereg_pas(), mlx5_ib_init_odp_mr() and
> mlx5_ib_init_dmabuf_mr(). umr_rereg_pas() is particularly tricky because
> it illegaly updates mr->pd inside the driver.  Reorganize the giant call
> chain from mlx5r_umr_set_update_xlt_mkey_seg() upward so that pdn is
> passed down from those three functions instead of unconditionally picked
> out at the bottom.
> 
> nldev however is trickier to fix. To avoid disurbing the happy paths build
> a synchronize barrier by removing the mr from the xarray and then putting
> it right back. The kref completion acts as a positive signal that the
> mr->pd is no longer being used.
> 
> Jason Gunthorpe (10):
>   IB/mlx5: Don't take the rereg_mr fallback without a new translation
>   RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs
>   IB/mlx5: Properly support implicit ODP rereg_mr
>   RDMA/nldev: Fix locking when accessing mr->pd
>   IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift()
>   IB/mlx5: Pull the pdn out of the depths of the umr machinery
>   IB/mlx5: Don't mangle the mr->pd inside the rereg callback
>   IB/mlx5: Push pdn above mlx5r_umr_update_xlt()
>   IB/mlx5: Push pdn above pagfault_real_mr()
>   IB/mlx5: Push pdn above pagefault_dmabuf_mr()

Applied to for-next

Thanks,
Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-06-08 17:52 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-04  1:27 [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 01/10] IB/mlx5: Don't take the rereg_mr fallback without a new translation Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 02/10] RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 03/10] IB/mlx5: Properly support implicit ODP rereg_mr Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 04/10] RDMA/nldev: Fix locking when accessing mr->pd Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 05/10] IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift() Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 06/10] IB/mlx5: Pull the pdn out of the depths of the umr machinery Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 07/10] IB/mlx5: Don't mangle the mr->pd inside the rereg callback Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 08/10] IB/mlx5: Push pdn above mlx5r_umr_update_xlt() Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 09/10] IB/mlx5: Push pdn above pagfault_real_mr() Jason Gunthorpe
2026-06-04  1:27 ` [PATCH 10/10] IB/mlx5: Push pdn above pagefault_dmabuf_mr() Jason Gunthorpe
2026-06-08 17:52 ` [PATCH 00/10] Fix races around IB_MR_REREG_PD and mr->pd Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox