public inbox for linux-rdma@vger.kernel.org
* [PATCH rdma-next 0/7] Batch of mlx5_ib fixes
@ 2025-03-13 14:29 Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 1/7] RDMA/mlx5: Fix MR cache initialization error flow Leon Romanovsky
                   ` (7 more replies)
  0 siblings, 8 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chiara Meiohas, Edward Srouji, linux-rdma, Michael Guralnik,
	Or Har-Toov, Patrisious Haddad, Yishai Hadas

Hi,

This is a batch of various fixes to the mlx5_ib driver.

Thanks

Chiara Meiohas (1):
  RDMA/mlx5: Fix calculation of total invalidated pages

Michael Guralnik (5):
  RDMA/mlx5: Fix MR cache initialization error flow
  RDMA/mlx5: Fix cache entry update on dereg error
  RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
  RDMA/mlx5: Fix page_size variable overflow
  RDMA/mlx5: Align cap check of mkc page size to device specification

Patrisious Haddad (1):
  RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow

 drivers/infiniband/hw/mlx5/cq.c      |  2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 23 ++++++++++---
 drivers/infiniband/hw/mlx5/mr.c      | 50 +++++++++++++++++-----------
 drivers/infiniband/hw/mlx5/odp.c     | 10 +++---
 4 files changed, 56 insertions(+), 29 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH rdma-next 1/7] RDMA/mlx5: Fix MR cache initialization error flow
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 2/7] RDMA/mlx5: Fix cache entry update on dereg error Leon Romanovsky
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Yishai Hadas

From: Michael Guralnik <michaelgur@nvidia.com>

Destroy all previously created cache entries and work queue when rolling
back the MR cache initialization upon an error.
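The rollback pattern can be sketched outside the driver — a hedged illustration with invented names, not the mlx5_ib code: on a mid-initialization failure, tear down every entry created so far *and* the work queue, mirroring what this patch adds to the error path:

```c
#include <assert.h>

/* Illustrative stand-ins for the cache entries and work queue. */
static int entries_alive;
static int wq_alive;

static int create_entry(int fail)
{
	if (fail)
		return -1;
	entries_alive++;
	return 0;
}

static void destroy_all_entries(void) { entries_alive = 0; }
static void destroy_wq(void)          { wq_alive = 0; }

/* On any mid-loop failure, undo everything created so far:
 * previously-created entries and the work queue. */
static int cache_init(int fail_at)
{
	int i;

	wq_alive = 1;
	for (i = 0; i < 4; i++) {
		if (create_entry(i == fail_at)) {
			destroy_all_entries();
			destroy_wq();
			return -1;
		}
	}
	return 0;
}
```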

Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index bb02b6adbf2c..1ffa4b3d0f76 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -919,6 +919,25 @@ mlx5r_cache_create_ent_locked(struct mlx5_ib_dev *dev,
 	return ERR_PTR(ret);
 }
 
+static void mlx5r_destroy_cache_entries(struct mlx5_ib_dev *dev)
+{
+	struct rb_root *root = &dev->cache.rb_root;
+	struct mlx5_cache_ent *ent;
+	struct rb_node *node;
+
+	mutex_lock(&dev->cache.rb_lock);
+	node = rb_first(root);
+	while (node) {
+		ent = rb_entry(node, struct mlx5_cache_ent, node);
+		node = rb_next(node);
+		clean_keys(dev, ent);
+		rb_erase(&ent->node, root);
+		mlx5r_mkeys_uninit(ent);
+		kfree(ent);
+	}
+	mutex_unlock(&dev->cache.rb_lock);
+}
+
 int mlx5_mkey_cache_init(struct mlx5_ib_dev *dev)
 {
 	struct mlx5_mkey_cache *cache = &dev->cache;
@@ -970,6 +989,8 @@ int mlx5_mkey_cache_init(struct mlx5_ib_dev *dev)
 err:
 	mutex_unlock(&cache->rb_lock);
 	mlx5_mkey_cache_debugfs_cleanup(dev);
+	mlx5r_destroy_cache_entries(dev);
+	destroy_workqueue(cache->wq);
 	mlx5_ib_warn(dev, "failed to create mkey cache entry\n");
 	return ret;
 }
@@ -1003,17 +1024,7 @@ void mlx5_mkey_cache_cleanup(struct mlx5_ib_dev *dev)
 	mlx5_cmd_cleanup_async_ctx(&dev->async_ctx);
 
 	/* At this point all entries are disabled and have no concurrent work. */
-	mutex_lock(&dev->cache.rb_lock);
-	node = rb_first(root);
-	while (node) {
-		ent = rb_entry(node, struct mlx5_cache_ent, node);
-		node = rb_next(node);
-		clean_keys(dev, ent);
-		rb_erase(&ent->node, root);
-		mlx5r_mkeys_uninit(ent);
-		kfree(ent);
-	}
-	mutex_unlock(&dev->cache.rb_lock);
+	mlx5r_destroy_cache_entries(dev);
 
 	destroy_workqueue(dev->cache.wq);
 	del_timer_sync(&dev->delay_timer);
-- 
2.48.1



* [PATCH rdma-next 2/7] RDMA/mlx5: Fix cache entry update on dereg error
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 1/7] RDMA/mlx5: Fix MR cache initialization error flow Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-18 10:28   ` Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 3/7] RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc() Leon Romanovsky
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Or Har-Toov, Yishai Hadas

From: Michael Guralnik <michaelgur@nvidia.com>

Fix a double decrement of the 'in_use' counter when push_mkey_locked()
fails while deregistering an MR.
If we fail to return an mkey to the cache in cache_ent_find_and_store(),
it updates the 'in_use' counter. Its caller, revoke_mr(), also updates
it, resulting in a double decrement.

A wrong 'in_use' value is exposed through debugfs and can also cause
incorrect resizing of the cache when users set the cache entry size
through the 'size' debugfs file.

To address this, the 'in_use' counter is now decremented within
mlx5_revoke_mr(), also after a successful call to
cache_ent_find_and_store(), and no longer within
cache_ent_find_and_store(). The other success and failure flows, where
the counter was already decremented, remain unchanged.
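The shape of the fix, as posted in this patch, can be sketched generically — a hedged illustration with invented names, not the driver code: the helper no longer touches the counter, and the caller decrements exactly once on both the store-succeeded and store-failed paths:

```c
#include <assert.h>
#include <stdbool.h>

static int in_use;

/* Helper: tries to store the key back into the cache.
 * After the fix it no longer touches in_use. */
static bool find_and_store(bool ok)
{
	return ok;
}

/* Caller decrements exactly once, whatever the helper returns,
 * so success and failure paths can no longer double-decrement. */
static void revoke(bool store_ok)
{
	find_and_store(store_ok);
	in_use--;	/* single decrement, success or failure */
}
```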

Fixes: 8c1185fef68c ("RDMA/mlx5: Change check for cacheable mkeys")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 1ffa4b3d0f76..cbab0240c7e5 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1967,7 +1967,6 @@ static int cache_ent_find_and_store(struct mlx5_ib_dev *dev,
 
 	if (mr->mmkey.cache_ent) {
 		spin_lock_irq(&mr->mmkey.cache_ent->mkeys_queue.lock);
-		mr->mmkey.cache_ent->in_use--;
 		goto end;
 	}
 
@@ -2042,6 +2041,7 @@ static int mlx5_revoke_mr(struct mlx5_ib_mr *mr)
 		ent = mr->mmkey.cache_ent;
 		/* upon storing to a clean temp entry - schedule its cleanup */
 		spin_lock_irq(&ent->mkeys_queue.lock);
+		ent->in_use--;
 		if (ent->is_tmp && !ent->tmp_cleanup_scheduled) {
 			mod_delayed_work(ent->dev->cache.wq, &ent->dwork,
 					 msecs_to_jiffies(30 * 1000));
-- 
2.48.1



* [PATCH rdma-next 3/7] RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 1/7] RDMA/mlx5: Fix MR cache initialization error flow Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 2/7] RDMA/mlx5: Fix cache entry update on dereg error Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 4/7] RDMA/mlx5: Fix page_size variable overflow Leon Romanovsky
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Yishai Hadas

From: Michael Guralnik <michaelgur@nvidia.com>

Drop the unused access_flags parameter.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index cbab0240c7e5..2e5e25bb53f3 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -718,8 +718,7 @@ mkey_cache_ent_from_rb_key(struct mlx5_ib_dev *dev,
 }
 
 static struct mlx5_ib_mr *_mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
-					struct mlx5_cache_ent *ent,
-					int access_flags)
+					       struct mlx5_cache_ent *ent)
 {
 	struct mlx5_ib_mr *mr;
 	int err;
@@ -794,7 +793,7 @@ struct mlx5_ib_mr *mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
 	if (!ent)
 		return ERR_PTR(-EOPNOTSUPP);
 
-	return _mlx5_mr_cache_alloc(dev, ent, access_flags);
+	return _mlx5_mr_cache_alloc(dev, ent);
 }
 
 static void mlx5_mkey_cache_debugfs_cleanup(struct mlx5_ib_dev *dev)
@@ -1155,7 +1154,7 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd,
 		return mr;
 	}
 
-	mr = _mlx5_mr_cache_alloc(dev, ent, access_flags);
+	mr = _mlx5_mr_cache_alloc(dev, ent);
 	if (IS_ERR(mr))
 		return mr;
 
-- 
2.48.1



* [PATCH rdma-next 4/7] RDMA/mlx5: Fix page_size variable overflow
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
                   ` (2 preceding siblings ...)
  2025-03-13 14:29 ` [PATCH rdma-next 3/7] RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc() Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 5/7] RDMA/mlx5: Align cap check of mkc page size to device specification Leon Romanovsky
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Yishai Hadas

From: Michael Guralnik <michaelgur@nvidia.com>

Change all variables storing the mlx5_umem_mkc_find_best_pgsz() result
to unsigned long, to support page sizes that do not fit in 32 bits and
avoid overflow.

For example: if we try to register 4GB of memory that is contiguous in
physical memory, the driver will optimize the page_size and try to use
an mkey with a 4GB entity size. The 'unsigned int' page_size variable
then overflows to '0' and we hit the WARN_ON() in alloc_cacheable_mr().
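The overflow itself is easy to demonstrate in isolation (illustrative values, not driver code; the 64-bit 'unsigned long' assumes an LP64 target, as in the kernel on x86-64):

```c
#include <assert.h>

/* A 4GB region registered as one contiguous entity: the best page
 * size is 1 << 32, which truncates to 0 when stored in 32 bits. */
unsigned int  narrow = (unsigned int)(1ULL << 32); /* overflows to 0 */
unsigned long wide   = 1UL << 32;                  /* preserved on LP64 */
```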

WARNING: CPU: 2 PID: 1203 at drivers/infiniband/hw/mlx5/mr.c:1124 alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
Modules linked in: mlx5_ib mlx5_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre rdma_rxe rdma_ucm ib_uverbs ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm fuse ib_core [last unloaded: mlx5_core]
CPU: 2 UID: 70878 PID: 1203 Comm: rdma_resource_l Tainted: G        W          6.14.0-rc4-dirty #43
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
Code: 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 52 53 48 83 ec 30 f6 46 28 04 4c 8b 77 08 75 21 <0f> 0b 49 c7 c2 ea ff ff ff 48 8d 65 d0 4c 89 d0 5b 41 5a 41 5c 41
RSP: 0018:ffffc900006ffac8 EFLAGS: 00010246
RAX: 0000000004c0d0d0 RBX: ffff888217a22000 RCX: 0000000000100001
RDX: 00007fb7ac480000 RSI: ffff8882037b1240 RDI: ffff8882046f0600
RBP: ffffc900006ffb28 R08: 0000000000000001 R09: 0000000000000000
R10: 00000000000007e0 R11: ffffea0008011d40 R12: ffff8882037b1240
R13: ffff8882046f0600 R14: ffff888217a22000 R15: ffffc900006ffe00
FS:  00007fb7ed013340(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb7ed1d8000 CR3: 00000001fd8f6006 CR4: 0000000000772eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x81/0x130
 ? alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
 ? report_bug+0xfc/0x1e0
 ? handle_bug+0x55/0x90
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
 create_real_mr+0x54/0x150 [mlx5_ib]
 ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs]
 ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xca/0x140 [ib_uverbs]
 ib_uverbs_run_method+0x6d0/0x780 [ib_uverbs]
 ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x19b/0x360 [ib_uverbs]
 ? walk_system_ram_range+0x79/0xd0
 ? ___pte_offset_map+0x1b/0x110
 ? __pte_offset_map_lock+0x80/0x100
 ib_uverbs_ioctl+0xac/0x110 [ib_uverbs]
 __x64_sys_ioctl+0x94/0xb0
 do_syscall_64+0x50/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fb7ecf0737b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 2a 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffdbe03ecc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffdbe03edb8 RCX: 00007fb7ecf0737b
RDX: 00007ffdbe03eda0 RSI: 00000000c0181b01 RDI: 0000000000000003
RBP: 00007ffdbe03ed80 R08: 00007fb7ecc84010 R09: 00007ffdbe03eed4
R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffdbe03eed4
R13: 000000000000000c R14: 000000000000000c R15: 00007fb7ecc84150
 </TASK>

Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 2e5e25bb53f3..ed6908949c87 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -56,7 +56,7 @@ static void
 create_mkey_callback(int status, struct mlx5_async_work *context);
 static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem,
 				     u64 iova, int access_flags,
-				     unsigned int page_size, bool populate,
+				     unsigned long page_size, bool populate,
 				     int access_mode);
 static int __mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 
@@ -1125,7 +1125,7 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd,
 	struct mlx5r_cache_rb_key rb_key = {};
 	struct mlx5_cache_ent *ent;
 	struct mlx5_ib_mr *mr;
-	unsigned int page_size;
+	unsigned long page_size;
 
 	if (umem->is_dmabuf)
 		page_size = mlx5_umem_dmabuf_default_pgsz(umem, iova);
@@ -1229,7 +1229,7 @@ reg_create_crossing_vhca_mr(struct ib_pd *pd, u64 iova, u64 length, int access_f
  */
 static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem,
 				     u64 iova, int access_flags,
-				     unsigned int page_size, bool populate,
+				     unsigned long page_size, bool populate,
 				     int access_mode)
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
@@ -1435,7 +1435,7 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
 		mr = alloc_cacheable_mr(pd, umem, iova, access_flags,
 					MLX5_MKC_ACCESS_MODE_MTT);
 	} else {
-		unsigned int page_size =
+		unsigned long page_size =
 			mlx5_umem_mkc_find_best_pgsz(dev, umem, iova);
 
 		mutex_lock(&dev->slow_path_mutex);
-- 
2.48.1



* [PATCH rdma-next 5/7] RDMA/mlx5: Align cap check of mkc page size to device specification
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
                   ` (3 preceding siblings ...)
  2025-03-13 14:29 ` [PATCH rdma-next 4/7] RDMA/mlx5: Fix page_size variable overflow Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 6/7] RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow Leon Romanovsky
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Yishai Hadas

From: Michael Guralnik <michaelgur@nvidia.com>

Align the caps checked when using the 6th bit of log_page_size in the
mkey context to the PRM definition. The upper and lower bounds are set
by max/min caps, and modifying the 6th bit via UMR is allowed only when
a specific UMR cap is set.
The current implementation falsely assumes all page sizes up to 2^63
are supported whenever the UMR cap is set. If the upper-bound cap is
lower than 63, this can result in a FW syndrome on mkey creation.

The previous cap enforcement is still correct for all current HW, FW
and driver combinations. However, this patch aligns the code to be spec
compliant in the general case.
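The bound computation can be sketched on its own — a hedged illustration with a local stand-in for the kernel's GENMASK_ULL() and hypothetical cap values, not the driver code: the bitmap of supported log page sizes spans [min_log, max_log], where max_log is clamped to 31 unless the UMR cap allows modifying the 6th bit:

```c
#include <assert.h>

/* Local stand-in for the kernel's GENMASK_ULL(h, l): bits l..h set. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* max_cap: device's max log entity size cap (0 means "not reported",
 * fall back to 31). umr_bit6: whether UMR may modify the 6th bit.
 * min_log: device's min log entity size (e.g. 12 for 4K pages). */
static unsigned long long
supported_pgsz_bitmap(unsigned int max_cap, int umr_bit6,
		      unsigned int min_log)
{
	unsigned int max_log = max_cap ? max_cap : 31;

	/* Without the UMR cap, the 6th bit is off-limits: clamp to 31. */
	if (!umr_bit6 && max_log > 31)
		max_log = 31;
	return GENMASK_ULL(max_log, min_log);
}
```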

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ace2df3e1d9f..a6ef052c4344 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1753,10 +1753,25 @@ static __always_inline unsigned long
 mlx5_umem_mkc_find_best_pgsz(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 			     u64 iova)
 {
-	int page_size_bits =
-		MLX5_CAP_GEN_2(dev->mdev, umr_log_entity_size_5) ? 6 : 5;
-	unsigned long bitmap =
-		__mlx5_log_page_size_to_bitmap(page_size_bits, 0);
+	unsigned int max_log_size, max_log_size_cap, min_log_size;
+	unsigned long bitmap;
+
+	max_log_size_cap =
+		MLX5_CAP_GEN_2(dev->mdev, max_mkey_log_entity_size_mtt) ?
+			MLX5_CAP_GEN_2(dev->mdev,
+				       max_mkey_log_entity_size_mtt) :
+			31;
+
+	max_log_size = MLX5_CAP_GEN_2(dev->mdev, umr_log_entity_size_5) ?
+			       max_log_size_cap :
+			       min(max_log_size_cap, 31);
+
+	min_log_size =
+		MLX5_CAP_GEN_2(dev->mdev, log_min_mkey_entity_size) ?
+			MLX5_CAP_GEN_2(dev->mdev, log_min_mkey_entity_size) :
+			MLX5_ADAPTER_PAGE_SHIFT;
+
+	bitmap = GENMASK_ULL(max_log_size, min_log_size);
 
 	return ib_umem_find_best_pgsz(umem, bitmap, iova);
 }
-- 
2.48.1



* [PATCH rdma-next 6/7] RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
                   ` (4 preceding siblings ...)
  2025-03-13 14:29 ` [PATCH rdma-next 5/7] RDMA/mlx5: Align cap check of mkc page size to device specification Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-13 14:29 ` [PATCH rdma-next 7/7] RDMA/mlx5: Fix calculation of total invalidated pages Leon Romanovsky
  2025-03-18 10:29 ` [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Patrisious Haddad, Edward Srouji, linux-rdma

From: Patrisious Haddad <phaddad@nvidia.com>

When cur_qp isn't NULL, we avoid fetching the QP from the radix tree
again by checking whether the next CQE's QP is identical to the one we
already have.

The bug, however, is that we compare the QP number inside the CQE
against the QP number inside mlx5_ib_qp. That is wrong, since the QP
number in the CQE comes from FW, so it must be matched against
mlx5_core_qp, which holds our FW QP number.

Otherwise we could use the wrong QP when handling a CQE, which can
cause the kernel trace below.

This issue is mainly noticeable on QPs 0 and 1, since for now they are
the only QPs in our driver for which the QP number inside mlx5_ib_qp
doesn't match the QP number inside mlx5_core_qp.
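A generic sketch of the mistake — types and field names invented for illustration, not the driver's structures: the cached QP may be reused only when the CQE's FW-assigned number matches the FW-side field, not the user-visible one:

```c
#include <assert.h>
#include <stddef.h>

struct qp {
	unsigned int ibqp_num; /* user-visible QP number */
	unsigned int fw_qpn;   /* QP number as FW reports it in CQEs */
};

/* Decide whether the cached QP can be reused for a CQE: the CQE
 * carries the FW number, so it must be matched against fw_qpn
 * (comparing against ibqp_num is the bug this patch fixes). */
static int cache_hit(const struct qp *cur, unsigned int cqe_qpn)
{
	return cur && cqe_qpn == cur->fw_qpn;
}
```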

BUG: kernel NULL pointer dereference, address: 0000000000000012
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP
 CPU: 0 UID: 0 PID: 7927 Comm: kworker/u62:1 Not tainted 6.14.0-rc3+ #189
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
 RIP: 0010:mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib]
 Code: 03 00 00 8d 58 ff 21 cb 66 39 d3 74 39 48 c7 c7 3c 89 6e a0 0f b7 db e8 b7 d2 b3 e0 49 8b 86 60 03 00 00 48 c7 c7 4a 89 6e a0 <0f> b7 5c 98 02 e8 9f d2 b3 e0 41 0f b7 86 78 03 00 00 83 e8 01 21
 RSP: 0018:ffff88810511bd60 EFLAGS: 00010046
 RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff88885fa1b3c0 RDI: ffffffffa06e894a
 RBP: 00000000000000b0 R08: 0000000000000000 R09: ffff88810511bc10
 R10: 0000000000000001 R11: 0000000000000001 R12: ffff88810d593000
 R13: ffff88810e579108 R14: ffff888105146000 R15: 00000000000000b0
 FS:  0000000000000000(0000) GS:ffff88885fa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000012 CR3: 00000001077e6001 CR4: 0000000000370eb0
 Call Trace:
  <TASK>
  ? __die+0x20/0x60
  ? page_fault_oops+0x150/0x3e0
  ? exc_page_fault+0x74/0x130
  ? asm_exc_page_fault+0x22/0x30
  ? mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib]
  __ib_process_cq+0x5a/0x150 [ib_core]
  ib_cq_poll_work+0x31/0x90 [ib_core]
  process_one_work+0x169/0x320
  worker_thread+0x288/0x3a0
  ? work_busy+0xb0/0xb0
  kthread+0xd7/0x1f0
  ? kthreads_online_cpu+0x130/0x130
  ? kthreads_online_cpu+0x130/0x130
  ret_from_fork+0x2d/0x50
  ? kthreads_online_cpu+0x130/0x130
  ret_from_fork_asm+0x11/0x20
  </TASK>

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/cq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 4c54dc578069..1aa5311b03e9 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -490,7 +490,7 @@ static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 	}
 
 	qpn = ntohl(cqe64->sop_drop_qpn) & 0xffffff;
-	if (!*cur_qp || (qpn != (*cur_qp)->ibqp.qp_num)) {
+	if (!*cur_qp || (qpn != (*cur_qp)->trans_qp.base.mqp.qpn)) {
 		/* We do not have to take the QP table lock here,
 		 * because CQs will be locked while QPs are removed
 		 * from the table.
-- 
2.48.1



* [PATCH rdma-next 7/7] RDMA/mlx5: Fix calculation of total invalidated pages
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
                   ` (5 preceding siblings ...)
  2025-03-13 14:29 ` [PATCH rdma-next 6/7] RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow Leon Romanovsky
@ 2025-03-13 14:29 ` Leon Romanovsky
  2025-03-18 10:29 ` [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-13 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Chiara Meiohas, linux-rdma, Michael Guralnik

From: Chiara Meiohas <cmeiohas@nvidia.com>

When invalidating an address range in mlx5, there is an optimization to
do UMR operations in chunks.
Previously, the invalidation counter was incorrectly updated for the
same indexes within a chunk. Now the counter is updated only when a
chunk is complete and mlx5r_umr_update_xlt() is called, so it
accurately reflects the number of pages invalidated using UMR.
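The counting scheme can be sketched in isolation — a hedged, simplified illustration, not the ODP code: pages are scanned, runs of present pages form a block, and the counter is bumped once per block when the block is flushed (plus a final flush for a block reaching the end of the range):

```c
#include <assert.h>

/* Count invalidated pages per the fixed scheme: a run of present
 * pages is counted exactly once, at the moment the run (chunk)
 * is flushed, never per-index while the run is still open. */
static int count_invalidations(const int *present, int n)
{
	int invalidations = 0, blk_start = 0, in_block = 0, idx;

	for (idx = 0; idx < n; idx++) {
		if (present[idx]) {
			if (!in_block) {
				blk_start = idx;
				in_block = 1;
			}
		} else if (in_block) {
			/* flush: count the whole block once */
			invalidations += idx - blk_start;
			in_block = 0;
		}
	}
	if (in_block)	/* final flush for a run reaching the end */
		invalidations += n - blk_start;
	return invalidations;
}
```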

Fixes: a3de94e3d61e ("IB/mlx5: Introduce ODP diagnostic counters")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/hw/mlx5/odp.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f1e23583e6c0..de9683344f5e 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -308,9 +308,6 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 				blk_start_idx = idx;
 				in_block = 1;
 			}
-
-			/* Count page invalidations */
-			invalidations += idx - blk_start_idx + 1;
 		} else {
 			u64 umr_offset = idx & umr_block_mask;
 
@@ -320,14 +317,19 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 						     MLX5_IB_UPD_XLT_ZAP |
 						     MLX5_IB_UPD_XLT_ATOMIC);
 				in_block = 0;
+				/* Count page invalidations */
+				invalidations += idx - blk_start_idx + 1;
 			}
 		}
 	}
-	if (in_block)
+	if (in_block) {
 		mlx5r_umr_update_xlt(mr, blk_start_idx,
 				     idx - blk_start_idx + 1, 0,
 				     MLX5_IB_UPD_XLT_ZAP |
 				     MLX5_IB_UPD_XLT_ATOMIC);
+		/* Count page invalidations */
+		invalidations += idx - blk_start_idx + 1;
+	}
 
 	mlx5_update_odp_stats_with_handled(mr, invalidations, invalidations);
 
-- 
2.48.1



* Re: [PATCH rdma-next 2/7] RDMA/mlx5: Fix cache entry update on dereg error
  2025-03-13 14:29 ` [PATCH rdma-next 2/7] RDMA/mlx5: Fix cache entry update on dereg error Leon Romanovsky
@ 2025-03-18 10:28   ` Leon Romanovsky
  0 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-18 10:28 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Or Har-Toov, Yishai Hadas

On Thu, Mar 13, 2025 at 04:29:49PM +0200, Leon Romanovsky wrote:
> From: Michael Guralnik <michaelgur@nvidia.com>
> 
> Fix double decrement of 'in_use' counter on push_mkey_locked() failure
> while deregistering an MR.
> If we fail to return an mkey to the cache in cache_ent_find_and_store()
> it'll update the 'in_use' counter. Its caller, revoke_mr(), also updates
> it, thus having double decrement.
> 
> Wrong value of 'in_use' counter will be exposed through debugfs and can
> also cause wrong resizing of the cache when users try to set cache
> entry size using the 'size' debugfs.
> 
> To address this issue, the 'in_use' counter is now decremented within
> mlx5_revoke_mr() also after a successful call to
> cache_ent_find_and_store() and not within cache_ent_find_and_store().
> Other success or failure flows remains unchanged where it was also
> decremented.
> 
> Fixes: 8c1185fef68c ("RDMA/mlx5: Change check for cacheable mkeys")
> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
> Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/hw/mlx5/mr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

<...>

> @@ -2042,6 +2041,7 @@ static int mlx5_revoke_mr(struct mlx5_ib_mr *mr)
>  		ent = mr->mmkey.cache_ent;
>  		/* upon storing to a clean temp entry - schedule its cleanup */
>  		spin_lock_irq(&ent->mkeys_queue.lock);
> +		ent->in_use--;

This needs slightly different fix, fixed it locally.
@@ -2033,6 +2032,7 @@ static int mlx5_revoke_mr(struct mlx5_ib_mr *mr)
        struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
        struct mlx5_cache_ent *ent = mr->mmkey.cache_ent;
        bool is_odp = is_odp_mr(mr);
+       bool from_cache = !!ent;
        int ret = 0;

        if (is_odp)
@@ -2042,6 +2042,8 @@ static int mlx5_revoke_mr(struct mlx5_ib_mr *mr)
                ent = mr->mmkey.cache_ent;
                /* upon storing to a clean temp entry - schedule its cleanup */
                spin_lock_irq(&ent->mkeys_queue.lock);
+               if (from_cache)
+                       ent->in_use--;
                if (ent->is_tmp && !ent->tmp_cleanup_scheduled) {
                        mod_delayed_work(ent->dev->cache.wq, &ent->dwork,
                                         msecs_to_jiffies(30 * 1000));


>  		if (ent->is_tmp && !ent->tmp_cleanup_scheduled) {
>  			mod_delayed_work(ent->dev->cache.wq, &ent->dwork,
>  					 msecs_to_jiffies(30 * 1000));
> -- 
> 2.48.1
> 
> 


* Re: [PATCH rdma-next 0/7] Batch of mlx5_ib fixes
  2025-03-13 14:29 [PATCH rdma-next 0/7] Batch of mlx5_ib fixes Leon Romanovsky
                   ` (6 preceding siblings ...)
  2025-03-13 14:29 ` [PATCH rdma-next 7/7] RDMA/mlx5: Fix calculation of total invalidated pages Leon Romanovsky
@ 2025-03-18 10:29 ` Leon Romanovsky
  7 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-03-18 10:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chiara Meiohas, Edward Srouji, linux-rdma, Michael Guralnik,
	Or Har-Toov, Patrisious Haddad, Yishai Hadas

On Thu, Mar 13, 2025 at 04:29:47PM +0200, Leon Romanovsky wrote:
> Hi,
> 
> This is batch of various fixes to mlx5_ib driver.
> 
> Thanks
> 
> Chiara Meiohas (1):
>   RDMA/mlx5: Fix calculation of total invalidated pages
> 
> Michael Guralnik (5):
>   RDMA/mlx5: Fix MR cache initialization error flow
>   RDMA/mlx5: Fix cache entry update on dereg error
>   RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
>   RDMA/mlx5: Fix page_size variable overflow

Applied.

>   RDMA/mlx5: Align cap check of mkc page size to device specification

I was asked offline to drop it for now.

Thanks

> 
> Patrisious Haddad (1):
>   RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow
> 
>  drivers/infiniband/hw/mlx5/cq.c      |  2 +-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h | 23 ++++++++++---
>  drivers/infiniband/hw/mlx5/mr.c      | 50 +++++++++++++++++-----------
>  drivers/infiniband/hw/mlx5/odp.c     | 10 +++---
>  4 files changed, 56 insertions(+), 29 deletions(-)
> 
> -- 
> 2.48.1
> 
> 

