* [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes
@ 2023-06-05 10:33 Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 01/10] RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions Leon Romanovsky
` (10 more replies)
0 siblings, 11 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Leon Romanovsky, linux-rdma
From: Leon Romanovsky <leonro@nvidia.com>
Hi,
This is my collection of various fixes.
Thanks
Edward Srouji (1):
RDMA/uverbs: Restrict usage of privileged QKEYs
Maher Sanalla (1):
RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
Mark Bloch (2):
RDMA/mlx5: Create an indirect flow table for steering anchor
RDMA/mlx5: Fix affinity assignment
Mark Zhang (1):
RDMA/cma: Always set static rate to 0 for RoCE
Michael Guralnik (1):
RDMA/mlx5: Fix mkey cache possible deadlock on cleanup
Patrisious Haddad (3):
RDMA/mlx5: Fix Q-counters per vport allocation
RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
RDMA/mlx5: Fix Q-counters query in LAG mode
Yishai Hadas (1):
IB/uverbs: Fix to consider event queue closing also upon non-blocking
mode
drivers/infiniband/core/cma.c | 4 +-
drivers/infiniband/core/uverbs_cmd.c | 7 +-
drivers/infiniband/core/uverbs_main.c | 12 +-
drivers/infiniband/hw/mlx5/counters.c | 89 ++++--
drivers/infiniband/hw/mlx5/fs.c | 276 +++++++++++++++++-
drivers/infiniband/hw/mlx5/fs.h | 16 +
drivers/infiniband/hw/mlx5/main.c | 3 +
drivers/infiniband/hw/mlx5/mlx5_ib.h | 14 +
drivers/infiniband/hw/mlx5/mr.c | 10 +-
drivers/infiniband/hw/mlx5/qp.c | 3 +
.../ethernet/mellanox/mlx5/core/mlx5_core.h | 12 -
include/linux/mlx5/driver.h | 12 +
include/rdma/ib_addr.h | 23 --
13 files changed, 401 insertions(+), 80 deletions(-)
--
2.40.1
* [PATCH rdma-rc 01/10] RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 02/10] RDMA/mlx5: Create an indirect flow table for steering anchor Leon Romanovsky
` (9 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Maher Sanalla, linux-rdma, Maor Gottlieb
From: Maher Sanalla <msanalla@nvidia.com>
Delay drop data is initialized for PFs that have the rq_delay_drop
capability and use the RoCE profile.
However, PFs with the RAW Ethernet profile do not initialize delay drop
data on function load, causing a kernel panic if the delay drop struct
members are accessed later when a dropless RQ is created.
Thus, stage the delay drop initialization as part of the RAW Ethernet
PF loading process.
Fixes: b5ca15ad7e61 ("IB/mlx5: Add proper representors support")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 5d45de223c43..f0b394ed7452 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -4275,6 +4275,9 @@ const struct mlx5_ib_profile raw_eth_profile = {
STAGE_CREATE(MLX5_IB_STAGE_POST_IB_REG_UMR,
mlx5_ib_stage_post_ib_reg_umr_init,
NULL),
+ STAGE_CREATE(MLX5_IB_STAGE_DELAY_DROP,
+ mlx5_ib_stage_delay_drop_init,
+ mlx5_ib_stage_delay_drop_cleanup),
STAGE_CREATE(MLX5_IB_STAGE_RESTRACK,
mlx5_ib_restrack_init,
NULL),
--
2.40.1
* [PATCH rdma-rc 02/10] RDMA/mlx5: Create an indirect flow table for steering anchor
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 01/10] RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 03/10] RDMA/mlx5: Fix Q-counters per vport allocation Leon Romanovsky
` (8 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Mark Bloch, linux-rdma, Maor Gottlieb
From: Mark Bloch <mbloch@nvidia.com>
A misbehaved user can create a steering anchor that points to a kernel
flow table and then destroy the anchor without freeing the associated
STC. This creates a problem as the kernel can't destroy the flow
table since there is still a reference to it. As a result, this can
exhaust all available flow table resources, preventing other users from
using the RDMA device.
To prevent this problem, a solution is implemented where a special flow
table with two steering rules is created when a user creates a steering
anchor for the first time. The rules include one that drops all traffic
and another that points to the kernel flow table. If the steering anchor
is destroyed, only the rule pointing to the kernel's flow table is removed.
Any traffic reaching the special flow table after that is dropped.
Since the special flow table is kept when the steering anchor is
destroyed, a user that still references it (e.g. through an STC) can no
longer pin kernel flow table resources. The remaining anchor resources
are only destroyed when the RDMA device is destroyed, which happens
after all DEVX objects, including the STCs, are freed, thus mitigating
the issue.
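To summarize the lifecycle implemented below, here is a conceptual
sketch (not a literal excerpt; it reuses the helper names added in this
patch, with locking, reference-count details and error handling elided):
/* Anchor creation (first user on this priority): build the indirect
 * table once (anchor.ft, fg_drop, fg_goto_table, rule_drop,
 * rule_goto_table) and report its ID to userspace instead of the
 * kernel flow table's ID.
 */
if (!ft_prio->anchor.rule_goto_table_ref)
	steering_anchor_create_res(dev, ft_prio, ns_type);
ft_prio->anchor.rule_goto_table_ref++;
ft_id = mlx5_flow_table_id(ft_prio->anchor.ft);

/* Anchor destruction: only the forwarding rule is removed; the table
 * itself stays because the user's STC may still point at it, so any
 * traffic that still reaches it is dropped by rule_drop.
 */
if (!--ft_prio->anchor.rule_goto_table_ref)
	steering_anchor_destroy_rule_goto_table(ft_prio);

/* RDMA device teardown (after all DEVX objects/STCs are freed):
 * release whatever is left.
 */
mlx5_ib_fs_cleanup_anchor(dev);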
Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/fs.c | 276 ++++++++++++++++++++++++++-
drivers/infiniband/hw/mlx5/fs.h | 16 ++
drivers/infiniband/hw/mlx5/mlx5_ib.h | 11 ++
3 files changed, 296 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/fs.c b/drivers/infiniband/hw/mlx5/fs.c
index 3008632a6c20..1e419e080b53 100644
--- a/drivers/infiniband/hw/mlx5/fs.c
+++ b/drivers/infiniband/hw/mlx5/fs.c
@@ -695,8 +695,6 @@ static struct mlx5_ib_flow_prio *_get_prio(struct mlx5_ib_dev *dev,
struct mlx5_flow_table_attr ft_attr = {};
struct mlx5_flow_table *ft;
- if (mlx5_ib_shared_ft_allowed(&dev->ib_dev))
- ft_attr.uid = MLX5_SHARED_RESOURCE_UID;
ft_attr.prio = priority;
ft_attr.max_fte = num_entries;
ft_attr.flags = flags;
@@ -2025,6 +2023,237 @@ static int flow_matcher_cleanup(struct ib_uobject *uobject,
return 0;
}
+static int steering_anchor_create_ft(struct mlx5_ib_dev *dev,
+ struct mlx5_ib_flow_prio *ft_prio,
+ enum mlx5_flow_namespace_type ns_type)
+{
+ struct mlx5_flow_table_attr ft_attr = {};
+ struct mlx5_flow_namespace *ns;
+ struct mlx5_flow_table *ft;
+
+ if (ft_prio->anchor.ft)
+ return 0;
+
+ ns = mlx5_get_flow_namespace(dev->mdev, ns_type);
+ if (!ns)
+ return -EOPNOTSUPP;
+
+ ft_attr.flags = MLX5_FLOW_TABLE_UNMANAGED;
+ ft_attr.uid = MLX5_SHARED_RESOURCE_UID;
+ ft_attr.prio = 0;
+ ft_attr.max_fte = 2;
+ ft_attr.level = 1;
+
+ ft = mlx5_create_flow_table(ns, &ft_attr);
+ if (IS_ERR(ft))
+ return PTR_ERR(ft);
+
+ ft_prio->anchor.ft = ft;
+
+ return 0;
+}
+
+static void steering_anchor_destroy_ft(struct mlx5_ib_flow_prio *ft_prio)
+{
+ if (ft_prio->anchor.ft) {
+ mlx5_destroy_flow_table(ft_prio->anchor.ft);
+ ft_prio->anchor.ft = NULL;
+ }
+}
+
+static int
+steering_anchor_create_fg_drop(struct mlx5_ib_flow_prio *ft_prio)
+{
+ int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+ struct mlx5_flow_group *fg;
+ void *flow_group_in;
+ int err = 0;
+
+ if (ft_prio->anchor.fg_drop)
+ return 0;
+
+ flow_group_in = kvzalloc(inlen, GFP_KERNEL);
+ if (!flow_group_in)
+ return -ENOMEM;
+
+ MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 1);
+ MLX5_SET(create_flow_group_in, flow_group_in, end_flow_index, 1);
+
+ fg = mlx5_create_flow_group(ft_prio->anchor.ft, flow_group_in);
+ if (IS_ERR(fg)) {
+ err = PTR_ERR(fg);
+ goto out;
+ }
+
+ ft_prio->anchor.fg_drop = fg;
+
+out:
+ kvfree(flow_group_in);
+
+ return err;
+}
+
+static void
+steering_anchor_destroy_fg_drop(struct mlx5_ib_flow_prio *ft_prio)
+{
+ if (ft_prio->anchor.fg_drop) {
+ mlx5_destroy_flow_group(ft_prio->anchor.fg_drop);
+ ft_prio->anchor.fg_drop = NULL;
+ }
+}
+
+static int
+steering_anchor_create_fg_goto_table(struct mlx5_ib_flow_prio *ft_prio)
+{
+ int inlen = MLX5_ST_SZ_BYTES(create_flow_group_in);
+ struct mlx5_flow_group *fg;
+ void *flow_group_in;
+ int err = 0;
+
+ if (ft_prio->anchor.fg_goto_table)
+ return 0;
+
+ flow_group_in = kvzalloc(inlen, GFP_KERNEL);
+ if (!flow_group_in)
+ return -ENOMEM;
+
+ fg = mlx5_create_flow_group(ft_prio->anchor.ft, flow_group_in);
+ if (IS_ERR(fg)) {
+ err = PTR_ERR(fg);
+ goto out;
+ }
+ ft_prio->anchor.fg_goto_table = fg;
+
+out:
+ kvfree(flow_group_in);
+
+ return err;
+}
+
+static void
+steering_anchor_destroy_fg_goto_table(struct mlx5_ib_flow_prio *ft_prio)
+{
+ if (ft_prio->anchor.fg_goto_table) {
+ mlx5_destroy_flow_group(ft_prio->anchor.fg_goto_table);
+ ft_prio->anchor.fg_goto_table = NULL;
+ }
+}
+
+static int
+steering_anchor_create_rule_drop(struct mlx5_ib_flow_prio *ft_prio)
+{
+ struct mlx5_flow_act flow_act = {};
+ struct mlx5_flow_handle *handle;
+
+ if (ft_prio->anchor.rule_drop)
+ return 0;
+
+ flow_act.fg = ft_prio->anchor.fg_drop;
+ flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
+
+ handle = mlx5_add_flow_rules(ft_prio->anchor.ft, NULL, &flow_act,
+ NULL, 0);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ft_prio->anchor.rule_drop = handle;
+
+ return 0;
+}
+
+static void steering_anchor_destroy_rule_drop(struct mlx5_ib_flow_prio *ft_prio)
+{
+ if (ft_prio->anchor.rule_drop) {
+ mlx5_del_flow_rules(ft_prio->anchor.rule_drop);
+ ft_prio->anchor.rule_drop = NULL;
+ }
+}
+
+static int
+steering_anchor_create_rule_goto_table(struct mlx5_ib_flow_prio *ft_prio)
+{
+ struct mlx5_flow_destination dest = {};
+ struct mlx5_flow_act flow_act = {};
+ struct mlx5_flow_handle *handle;
+
+ if (ft_prio->anchor.rule_goto_table)
+ return 0;
+
+ flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
+ flow_act.flags |= FLOW_ACT_IGNORE_FLOW_LEVEL;
+ flow_act.fg = ft_prio->anchor.fg_goto_table;
+
+ dest.type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE;
+ dest.ft = ft_prio->flow_table;
+
+ handle = mlx5_add_flow_rules(ft_prio->anchor.ft, NULL, &flow_act,
+ &dest, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ft_prio->anchor.rule_goto_table = handle;
+
+ return 0;
+}
+
+static void
+steering_anchor_destroy_rule_goto_table(struct mlx5_ib_flow_prio *ft_prio)
+{
+ if (ft_prio->anchor.rule_goto_table) {
+ mlx5_del_flow_rules(ft_prio->anchor.rule_goto_table);
+ ft_prio->anchor.rule_goto_table = NULL;
+ }
+}
+
+static int steering_anchor_create_res(struct mlx5_ib_dev *dev,
+ struct mlx5_ib_flow_prio *ft_prio,
+ enum mlx5_flow_namespace_type ns_type)
+{
+ int err;
+
+ err = steering_anchor_create_ft(dev, ft_prio, ns_type);
+ if (err)
+ return err;
+
+ err = steering_anchor_create_fg_drop(ft_prio);
+ if (err)
+ goto destroy_ft;
+
+ err = steering_anchor_create_fg_goto_table(ft_prio);
+ if (err)
+ goto destroy_fg_drop;
+
+ err = steering_anchor_create_rule_drop(ft_prio);
+ if (err)
+ goto destroy_fg_goto_table;
+
+ err = steering_anchor_create_rule_goto_table(ft_prio);
+ if (err)
+ goto destroy_rule_drop;
+
+ return 0;
+
+destroy_rule_drop:
+ steering_anchor_destroy_rule_drop(ft_prio);
+destroy_fg_goto_table:
+ steering_anchor_destroy_fg_goto_table(ft_prio);
+destroy_fg_drop:
+ steering_anchor_destroy_fg_drop(ft_prio);
+destroy_ft:
+ steering_anchor_destroy_ft(ft_prio);
+
+ return err;
+}
+
+static void mlx5_steering_anchor_destroy_res(struct mlx5_ib_flow_prio *ft_prio)
+{
+ steering_anchor_destroy_rule_goto_table(ft_prio);
+ steering_anchor_destroy_rule_drop(ft_prio);
+ steering_anchor_destroy_fg_goto_table(ft_prio);
+ steering_anchor_destroy_fg_drop(ft_prio);
+ steering_anchor_destroy_ft(ft_prio);
+}
+
static int steering_anchor_cleanup(struct ib_uobject *uobject,
enum rdma_remove_reason why,
struct uverbs_attr_bundle *attrs)
@@ -2035,6 +2264,9 @@ static int steering_anchor_cleanup(struct ib_uobject *uobject,
return -EBUSY;
mutex_lock(&obj->dev->flow_db->lock);
+ if (!--obj->ft_prio->anchor.rule_goto_table_ref)
+ steering_anchor_destroy_rule_goto_table(obj->ft_prio);
+
put_flow_table(obj->dev, obj->ft_prio, true);
mutex_unlock(&obj->dev->flow_db->lock);
@@ -2042,6 +2274,24 @@ static int steering_anchor_cleanup(struct ib_uobject *uobject,
return 0;
}
+static void fs_cleanup_anchor(struct mlx5_ib_flow_prio *prio,
+ int count)
+{
+ while (count--)
+ mlx5_steering_anchor_destroy_res(&prio[count]);
+}
+
+void mlx5_ib_fs_cleanup_anchor(struct mlx5_ib_dev *dev)
+{
+ fs_cleanup_anchor(dev->flow_db->prios, MLX5_IB_NUM_FLOW_FT);
+ fs_cleanup_anchor(dev->flow_db->egress_prios, MLX5_IB_NUM_FLOW_FT);
+ fs_cleanup_anchor(dev->flow_db->sniffer, MLX5_IB_NUM_SNIFFER_FTS);
+ fs_cleanup_anchor(dev->flow_db->egress, MLX5_IB_NUM_EGRESS_FTS);
+ fs_cleanup_anchor(dev->flow_db->fdb, MLX5_IB_NUM_FDB_FTS);
+ fs_cleanup_anchor(dev->flow_db->rdma_rx, MLX5_IB_NUM_FLOW_FT);
+ fs_cleanup_anchor(dev->flow_db->rdma_tx, MLX5_IB_NUM_FLOW_FT);
+}
+
static int mlx5_ib_matcher_ns(struct uverbs_attr_bundle *attrs,
struct mlx5_ib_flow_matcher *obj)
{
@@ -2182,21 +2432,31 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_STEERING_ANCHOR_CREATE)(
return -ENOMEM;
mutex_lock(&dev->flow_db->lock);
+
ft_prio = _get_flow_table(dev, priority, ns_type, 0);
if (IS_ERR(ft_prio)) {
- mutex_unlock(&dev->flow_db->lock);
err = PTR_ERR(ft_prio);
goto free_obj;
}
ft_prio->refcount++;
- ft_id = mlx5_flow_table_id(ft_prio->flow_table);
- mutex_unlock(&dev->flow_db->lock);
+
+ if (!ft_prio->anchor.rule_goto_table_ref) {
+ err = steering_anchor_create_res(dev, ft_prio, ns_type);
+ if (err)
+ goto put_flow_table;
+ }
+
+ ft_prio->anchor.rule_goto_table_ref++;
+
+ ft_id = mlx5_flow_table_id(ft_prio->anchor.ft);
err = uverbs_copy_to(attrs, MLX5_IB_ATTR_STEERING_ANCHOR_FT_ID,
&ft_id, sizeof(ft_id));
if (err)
- goto put_flow_table;
+ goto destroy_res;
+
+ mutex_unlock(&dev->flow_db->lock);
uobj->object = obj;
obj->dev = dev;
@@ -2205,8 +2465,10 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_STEERING_ANCHOR_CREATE)(
return 0;
+destroy_res:
+ --ft_prio->anchor.rule_goto_table_ref;
+ mlx5_steering_anchor_destroy_res(ft_prio);
put_flow_table:
- mutex_lock(&dev->flow_db->lock);
put_flow_table(dev, ft_prio, true);
mutex_unlock(&dev->flow_db->lock);
free_obj:
diff --git a/drivers/infiniband/hw/mlx5/fs.h b/drivers/infiniband/hw/mlx5/fs.h
index ad320adaf321..b9734904f5f0 100644
--- a/drivers/infiniband/hw/mlx5/fs.h
+++ b/drivers/infiniband/hw/mlx5/fs.h
@@ -10,6 +10,7 @@
#if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
int mlx5_ib_fs_init(struct mlx5_ib_dev *dev);
+void mlx5_ib_fs_cleanup_anchor(struct mlx5_ib_dev *dev);
#else
static inline int mlx5_ib_fs_init(struct mlx5_ib_dev *dev)
{
@@ -21,9 +22,24 @@ static inline int mlx5_ib_fs_init(struct mlx5_ib_dev *dev)
mutex_init(&dev->flow_db->lock);
return 0;
}
+
+inline void mlx5_ib_fs_cleanup_anchor(struct mlx5_ib_dev *dev) {}
#endif
+
static inline void mlx5_ib_fs_cleanup(struct mlx5_ib_dev *dev)
{
+ /* When a steering anchor is created, a special flow table is also
+ * created for the user to reference. Since the user can reference it,
+ * the kernel cannot trust that when the user destroys the steering
+ * anchor, they no longer reference the flow table.
+ *
+ * To address this issue, when a user destroys a steering anchor, only
+ * the flow steering rule in the table is destroyed, but the table
+ * itself is kept to deal with the above scenario. The remaining
+ * resources are only removed when the RDMA device is destroyed, which
+ * is a safe assumption that all references are gone.
+ */
+ mlx5_ib_fs_cleanup_anchor(dev);
kfree(dev->flow_db);
}
#endif /* _MLX5_IB_FS_H */
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 2c148ef60088..2a2d2a356c41 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -238,8 +238,19 @@ enum {
#define MLX5_IB_NUM_SNIFFER_FTS 2
#define MLX5_IB_NUM_EGRESS_FTS 1
#define MLX5_IB_NUM_FDB_FTS MLX5_BY_PASS_NUM_REGULAR_PRIOS
+
+struct mlx5_ib_anchor {
+ struct mlx5_flow_table *ft;
+ struct mlx5_flow_group *fg_goto_table;
+ struct mlx5_flow_group *fg_drop;
+ struct mlx5_flow_handle *rule_goto_table;
+ struct mlx5_flow_handle *rule_drop;
+ unsigned int rule_goto_table_ref;
+};
+
struct mlx5_ib_flow_prio {
struct mlx5_flow_table *flow_table;
+ struct mlx5_ib_anchor anchor;
unsigned int refcount;
};
--
2.40.1
* [PATCH rdma-rc 03/10] RDMA/mlx5: Fix Q-counters per vport allocation
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 01/10] RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 02/10] RDMA/mlx5: Create an indirect flow table for steering anchor Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 04/10] RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters Leon Romanovsky
` (7 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Patrisious Haddad, linux-rdma, Mark Zhang
From: Patrisious Haddad <phaddad@nvidia.com>
Previously, Q-counters data was allocated on the PF for all of the
available vports, which isn't necessary since each VF or SF already has
a Q-counter allocated for itself.
Therefore, only two counters data structures need to be allocated: one
for the device's own counters and one shared by all the other vports to
expose the representors, since the representors only read it to
determine counter numbers and names.
This in turn also fixes a bug where the device could not be switched to
switchdev mode when more than 128 SFs/VFs were configured, since 128 is
the maximum number of Q-counters available for a single port. For
example, with 200 SFs configured, the old code allocated a Q-counter on
the PF for every vport and exceeded that limit; now at most two are
allocated regardless of the number of vports.
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/counters.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/counters.c b/drivers/infiniband/hw/mlx5/counters.c
index 1c06920505d2..3d7ef81a50b8 100644
--- a/drivers/infiniband/hw/mlx5/counters.c
+++ b/drivers/infiniband/hw/mlx5/counters.c
@@ -209,7 +209,8 @@ static const struct mlx5_ib_counters *get_counters(struct mlx5_ib_dev *dev,
!vport_qcounters_supported(dev)) || !port_num)
return &dev->port[0].cnts;
- return &dev->port[port_num - 1].cnts;
+ return is_mdev_switchdev_mode(dev->mdev) ?
+ &dev->port[1].cnts : &dev->port[port_num - 1].cnts;
}
/**
@@ -262,7 +263,7 @@ static struct rdma_hw_stats *
mlx5_ib_alloc_hw_port_stats(struct ib_device *ibdev, u32 port_num)
{
struct mlx5_ib_dev *dev = to_mdev(ibdev);
- const struct mlx5_ib_counters *cnts = &dev->port[port_num - 1].cnts;
+ const struct mlx5_ib_counters *cnts = get_counters(dev, port_num);
return do_alloc_stats(cnts);
}
@@ -725,11 +726,11 @@ static int __mlx5_ib_alloc_counters(struct mlx5_ib_dev *dev,
static void mlx5_ib_dealloc_counters(struct mlx5_ib_dev *dev)
{
u32 in[MLX5_ST_SZ_DW(dealloc_q_counter_in)] = {};
- int num_cnt_ports;
+ int num_cnt_ports = dev->num_ports;
int i, j;
- num_cnt_ports = (!is_mdev_switchdev_mode(dev->mdev) ||
- vport_qcounters_supported(dev)) ? dev->num_ports : 1;
+ if (is_mdev_switchdev_mode(dev->mdev))
+ num_cnt_ports = min(2, num_cnt_ports);
MLX5_SET(dealloc_q_counter_in, in, opcode,
MLX5_CMD_OP_DEALLOC_Q_COUNTER);
@@ -761,15 +762,22 @@ static int mlx5_ib_alloc_counters(struct mlx5_ib_dev *dev)
{
u32 out[MLX5_ST_SZ_DW(alloc_q_counter_out)] = {};
u32 in[MLX5_ST_SZ_DW(alloc_q_counter_in)] = {};
- int num_cnt_ports;
+ int num_cnt_ports = dev->num_ports;
int err = 0;
int i;
bool is_shared;
MLX5_SET(alloc_q_counter_in, in, opcode, MLX5_CMD_OP_ALLOC_Q_COUNTER);
is_shared = MLX5_CAP_GEN(dev->mdev, log_max_uctx) != 0;
- num_cnt_ports = (!is_mdev_switchdev_mode(dev->mdev) ||
- vport_qcounters_supported(dev)) ? dev->num_ports : 1;
+
+ /*
+ * In switchdev we need to allocate two ports, one that is used for
+ * the device Q_counters and it is essentially the real Q_counters of
+ * this device, while the other is used as a helper for PF to be able to
+ * query all other vports.
+ */
+ if (is_mdev_switchdev_mode(dev->mdev))
+ num_cnt_ports = min(2, num_cnt_ports);
for (i = 0; i < num_cnt_ports; i++) {
err = __mlx5_ib_alloc_counters(dev, &dev->port[i].cnts, i);
--
2.40.1
* [PATCH rdma-rc 04/10] RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (2 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 03/10] RDMA/mlx5: Fix Q-counters per vport allocation Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 05/10] RDMA/mlx5: Fix Q-counters query in LAG mode Leon Romanovsky
` (6 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Patrisious Haddad, linux-rdma, Mark Zhang
From: Patrisious Haddad <phaddad@nvidia.com>
Previously, the Q-counters initialization assumed that the vport
Q-counters structures and the normal Q-counters structures are identical
in size; hence, when a Q-counter was added to the normal Q-counters
structure but not to the vport Q-counters structure, that counter's name
ended up NULL in switchdev mode, which could cause the kernel crash
below.
Break the dependency between those two structures and always use the
appropriate structure size, removing the assumption that both sizes are
equal.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 20c64a067 P4D 20c64a067 PUD 20152b067 PMD 0
Oops: 0000 [#1] SMP
CPU: 19 PID: 11717 Comm: devlink Tainted: G OE 6.2.0_mlnx #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:strlen+0x0/0x20
Code: 66 2e 0f 1f 84 00 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 48 89 f8 74 10 48 83 c7 01 80 3f 00 75 f7 48 29 c7 48 89
RSP: 0018:ffffc9000318b618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000002c00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff888211918110 R09: ffff888211918000
R10: 000000000000001e R11: ffff888211918000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881038ec250
FS: 00007fa53342fe80(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002042b2003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
kernfs_name_hash+0x12/0x80
kernfs_find_ns+0x35/0xb0
kernfs_remove_by_name_ns+0x46/0xc0
remove_files.isra.1+0x30/0x70
internal_create_group+0x253/0x380
internal_create_groups.part.4+0x3e/0xa0
setup_port+0x27a/0x8c0 [ib_core]
ib_setup_port_attrs+0x9d/0x300 [ib_core]
ib_register_device+0x48e/0x550 [ib_core]
__mlx5_ib_add+0x2b/0x80 [mlx5_ib]
mlx5_ib_vport_rep_load+0x141/0x360 [mlx5_ib]
mlx5_esw_offloads_rep_load+0x48/0xa0 [mlx5_core]
esw_offloads_enable+0x41e/0xd10 [mlx5_core]
mlx5_eswitch_enable_locked+0x1e3/0x340 [mlx5_core]
? __cond_resched+0x15/0x30
mlx5_devlink_eswitch_mode_set+0x204/0x3c0 [mlx5_core]
devlink_nl_cmd_eswitch_set_doit+0x8d/0x100
genl_family_rcv_msg_doit.isra.19+0xea/0x110
genl_rcv_msg+0x19b/0x290
? devlink_nl_cmd_region_read_dumpit+0x760/0x760
? devlink_nl_cmd_port_param_get_doit+0x30/0x30
? devlink_put+0x50/0x50
? genl_get_cmd_both+0x60/0x60
netlink_rcv_skb+0x54/0x100
genl_rcv+0x24/0x40
netlink_unicast+0x1be/0x2a0
netlink_sendmsg+0x361/0x4d0
sock_sendmsg+0x30/0x40
__sys_sendto+0x11a/0x150
? handle_mm_fault+0x101/0x2b0
? do_user_addr_fault+0x21d/0x720
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fa533611cba
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
RSP: 002b:00007ffdb6a898a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000daab00 RCX: 00007fa533611cba
RDX: 0000000000000038 RSI: 0000000000daab00 RDI: 0000000000000003
RBP: 0000000000daa910 R08: 00007fa533822000 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
</TASK>
Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) mlxfw(OE) memtrack(OE) pci_hyperv_intf nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat dns_resolver nf_nat br_netfilter nfs bridge stp llc lockd grace fscache netfs rfkill overlay iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel i2c_i801 sunrpc lpc_ich sha512_ssse3 pcspkr i2c_smbus mfd_core drm sch_fq_codel i2c_core ip_tables fuse crc32c_intel serio_raw virtio_net net_failover failover [last unloaded: mlxfw]
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:strlen+0x0/0x20
Code: 66 2e 0f 1f 84 00 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 48 89 f8 74 10 48 83 c7 01 80 3f 00 75 f7 48 29 c7 48 89
RSP: 0018:ffffc9000318b618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000002c00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff888211918110 R09: ffff888211918000
R10: 000000000000001e R11: ffff888211918000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881038ec250
FS: 00007fa53342fe80(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002042b2003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/counters.c | 58 ++++++++++++++++++---------
1 file changed, 40 insertions(+), 18 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/counters.c b/drivers/infiniband/hw/mlx5/counters.c
index 3d7ef81a50b8..f40d9c61e30b 100644
--- a/drivers/infiniband/hw/mlx5/counters.c
+++ b/drivers/infiniband/hw/mlx5/counters.c
@@ -576,43 +576,53 @@ static void mlx5_ib_fill_counters(struct mlx5_ib_dev *dev,
bool is_vport = is_mdev_switchdev_mode(dev->mdev) &&
port_num != MLX5_VPORT_PF;
const struct mlx5_ib_counter *names;
- int j = 0, i;
+ int j = 0, i, size;
names = is_vport ? vport_basic_q_cnts : basic_q_cnts;
- for (i = 0; i < ARRAY_SIZE(basic_q_cnts); i++, j++) {
+ size = is_vport ? ARRAY_SIZE(vport_basic_q_cnts) :
+ ARRAY_SIZE(basic_q_cnts);
+ for (i = 0; i < size; i++, j++) {
descs[j].name = names[i].name;
- offsets[j] = basic_q_cnts[i].offset;
+ offsets[j] = names[i].offset;
}
names = is_vport ? vport_out_of_seq_q_cnts : out_of_seq_q_cnts;
+ size = is_vport ? ARRAY_SIZE(vport_out_of_seq_q_cnts) :
+ ARRAY_SIZE(out_of_seq_q_cnts);
if (MLX5_CAP_GEN(dev->mdev, out_of_seq_cnt)) {
- for (i = 0; i < ARRAY_SIZE(out_of_seq_q_cnts); i++, j++) {
+ for (i = 0; i < size; i++, j++) {
descs[j].name = names[i].name;
- offsets[j] = out_of_seq_q_cnts[i].offset;
+ offsets[j] = names[i].offset;
}
}
names = is_vport ? vport_retrans_q_cnts : retrans_q_cnts;
+ size = is_vport ? ARRAY_SIZE(vport_retrans_q_cnts) :
+ ARRAY_SIZE(retrans_q_cnts);
if (MLX5_CAP_GEN(dev->mdev, retransmission_q_counters)) {
- for (i = 0; i < ARRAY_SIZE(retrans_q_cnts); i++, j++) {
+ for (i = 0; i < size; i++, j++) {
descs[j].name = names[i].name;
- offsets[j] = retrans_q_cnts[i].offset;
+ offsets[j] = names[i].offset;
}
}
names = is_vport ? vport_extended_err_cnts : extended_err_cnts;
+ size = is_vport ? ARRAY_SIZE(vport_extended_err_cnts) :
+ ARRAY_SIZE(extended_err_cnts);
if (MLX5_CAP_GEN(dev->mdev, enhanced_error_q_counters)) {
- for (i = 0; i < ARRAY_SIZE(extended_err_cnts); i++, j++) {
+ for (i = 0; i < size; i++, j++) {
descs[j].name = names[i].name;
- offsets[j] = extended_err_cnts[i].offset;
+ offsets[j] = names[i].offset;
}
}
names = is_vport ? vport_roce_accl_cnts : roce_accl_cnts;
+ size = is_vport ? ARRAY_SIZE(vport_roce_accl_cnts) :
+ ARRAY_SIZE(roce_accl_cnts);
if (MLX5_CAP_GEN(dev->mdev, roce_accl)) {
- for (i = 0; i < ARRAY_SIZE(roce_accl_cnts); i++, j++) {
+ for (i = 0; i < size; i++, j++) {
descs[j].name = names[i].name;
- offsets[j] = roce_accl_cnts[i].offset;
+ offsets[j] = names[i].offset;
}
}
@@ -662,25 +672,37 @@ static void mlx5_ib_fill_counters(struct mlx5_ib_dev *dev,
static int __mlx5_ib_alloc_counters(struct mlx5_ib_dev *dev,
struct mlx5_ib_counters *cnts, u32 port_num)
{
- u32 num_counters, num_op_counters = 0;
+ bool is_vport = is_mdev_switchdev_mode(dev->mdev) &&
+ port_num != MLX5_VPORT_PF;
+ u32 num_counters, num_op_counters = 0, size;
- num_counters = ARRAY_SIZE(basic_q_cnts);
+ size = is_vport ? ARRAY_SIZE(vport_basic_q_cnts) :
+ ARRAY_SIZE(basic_q_cnts);
+ num_counters = size;
+ size = is_vport ? ARRAY_SIZE(vport_out_of_seq_q_cnts) :
+ ARRAY_SIZE(out_of_seq_q_cnts);
if (MLX5_CAP_GEN(dev->mdev, out_of_seq_cnt))
- num_counters += ARRAY_SIZE(out_of_seq_q_cnts);
+ num_counters += size;
+ size = is_vport ? ARRAY_SIZE(vport_retrans_q_cnts) :
+ ARRAY_SIZE(retrans_q_cnts);
if (MLX5_CAP_GEN(dev->mdev, retransmission_q_counters))
- num_counters += ARRAY_SIZE(retrans_q_cnts);
+ num_counters += size;
+ size = is_vport ? ARRAY_SIZE(vport_extended_err_cnts) :
+ ARRAY_SIZE(extended_err_cnts);
if (MLX5_CAP_GEN(dev->mdev, enhanced_error_q_counters))
- num_counters += ARRAY_SIZE(extended_err_cnts);
+ num_counters += size;
+ size = is_vport ? ARRAY_SIZE(vport_roce_accl_cnts) :
+ ARRAY_SIZE(roce_accl_cnts);
if (MLX5_CAP_GEN(dev->mdev, roce_accl))
- num_counters += ARRAY_SIZE(roce_accl_cnts);
+ num_counters += size;
cnts->num_q_counters = num_counters;
- if (is_mdev_switchdev_mode(dev->mdev) && port_num != MLX5_VPORT_PF)
+ if (is_vport)
goto skip_non_qcounters;
if (MLX5_CAP_GEN(dev->mdev, cc_query_allowed)) {
--
2.40.1
* [PATCH rdma-rc 05/10] RDMA/mlx5: Fix Q-counters query in LAG mode
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (3 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 04/10] RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Leon Romanovsky
` (5 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Patrisious Haddad, linux-rdma, Mark Zhang
From: Patrisious Haddad <phaddad@nvidia.com>
Previously, the core device associated with the IB device was used to
issue the Q-counters query to the FW, but in LAG mode it is possible
that this core device isn't the one that created the VF.
Hence, instead of using that core device to query the Q-counters, use
the E-Switch core device, which is guaranteed to be the one the VF
belongs to.
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/counters.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/counters.c b/drivers/infiniband/hw/mlx5/counters.c
index f40d9c61e30b..93257fa5aae8 100644
--- a/drivers/infiniband/hw/mlx5/counters.c
+++ b/drivers/infiniband/hw/mlx5/counters.c
@@ -330,6 +330,7 @@ static int mlx5_ib_query_q_counters_vport(struct mlx5_ib_dev *dev,
{
u32 out[MLX5_ST_SZ_DW(query_q_counter_out)] = {};
u32 in[MLX5_ST_SZ_DW(query_q_counter_in)] = {};
+ struct mlx5_core_dev *mdev;
__be32 val;
int ret, i;
@@ -337,12 +338,16 @@ static int mlx5_ib_query_q_counters_vport(struct mlx5_ib_dev *dev,
dev->port[port_num].rep->vport == MLX5_VPORT_UPLINK)
return 0;
+ mdev = mlx5_eswitch_get_core_dev(dev->port[port_num].rep->esw);
+ if (!mdev)
+ return -EOPNOTSUPP;
+
MLX5_SET(query_q_counter_in, in, opcode, MLX5_CMD_OP_QUERY_Q_COUNTER);
MLX5_SET(query_q_counter_in, in, other_vport, 1);
MLX5_SET(query_q_counter_in, in, vport_number,
dev->port[port_num].rep->vport);
MLX5_SET(query_q_counter_in, in, aggregate, 1);
- ret = mlx5_cmd_exec_inout(dev->mdev, query_q_counter, in, out);
+ ret = mlx5_cmd_exec_inout(mdev, query_q_counter, in, out);
if (ret)
return ret;
--
2.40.1
* [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (4 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 05/10] RDMA/mlx5: Fix Q-counters query in LAG mode Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 16:54 ` Jason Gunthorpe
2023-06-05 10:33 ` [PATCH rdma-rc 07/10] RDMA/cma: Always set static rate to 0 for RoCE Leon Romanovsky
` (4 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Maor Gottlieb
From: Michael Guralnik <michaelgur@nvidia.com>
Move the cancellation of the delayed cache work that adds or removes
mkeys into a separate iteration of the mkey cache cleanup, so that
someone_adding() is not called while holding the rb_lock.
Lockdep:
WARNING: possible circular locking dependency detected
6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted
------------------------------------------------------
devlink/53872 is trying to acquire lock:
ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900
but task is already holding lock:
ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&dev->cache.rb_lock){+.+.}-{3:3}:
__mutex_lock+0x14c/0x15c0
delayed_cache_work_func+0x2d1/0x610 [mlx5_ib]
process_one_work+0x7c2/0x1310
worker_thread+0x59d/0xec0
kthread+0x28f/0x330
ret_from_fork+0x1f/0x30
-> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}:
__lock_acquire+0x2d8a/0x5fe0
lock_acquire+0x1c1/0x540
__flush_work+0xe8/0x900
__cancel_work_timer+0x2c7/0x3f0
mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib]
mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib]
__mlx5_ib_remove+0x68/0x120 [mlx5_ib]
mlx5r_remove+0x63/0x80 [mlx5_ib]
auxiliary_bus_remove+0x52/0x70
device_release_driver_internal+0x3c1/0x600
bus_remove_device+0x2a5/0x560
device_del+0x492/0xb80
mlx5_detach_device+0x1a9/0x360 [mlx5_core]
mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core]
mlx5_devlink_reload_down+0x292/0x580 [mlx5_core]
devlink_reload+0x439/0x590
devlink_nl_cmd_reload+0xaef/0xff0
genl_family_rcv_msg_doit.isra.0+0x1bd/0x290
genl_rcv_msg+0x3ca/0x6c0
netlink_rcv_skb+0x12c/0x360
genl_rcv+0x24/0x40
netlink_unicast+0x438/0x710
netlink_sendmsg+0x7a1/0xca0
sock_sendmsg+0xc5/0x190
__sys_sendto+0x1bc/0x290
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&dev->cache.rb_lock);
lock((work_completion)(&(&ent->dwork)->work));
lock(&dev->cache.rb_lock);
lock((work_completion)(&(&ent->dwork)->work));
*** DEADLOCK ***
6 locks held by devlink/53872:
#0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
#1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0
#2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core]
#3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core]
#4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600
#5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/mr.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 1ce48e485c5b..f113656e4027 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1033,7 +1033,15 @@ void mlx5_mkey_cache_cleanup(struct mlx5_ib_dev *dev)
xa_lock_irq(&ent->mkeys);
ent->disabled = true;
xa_unlock_irq(&ent->mkeys);
- cancel_delayed_work_sync(&ent->dwork);
+ }
+
+ /* Run the canceling of delayed works on the cache in a separate loop after
+ * disabling all entries to ensure someone_adding() will not try taking the
+ * rb_lock while flushing the workqueue.
+ */
+ for (node = rb_first(root); node; node = rb_next(node)) {
+ ent = rb_entry(node, struct mlx5_cache_ent, node);
+ cancel_delayed_work(&ent->dwork);
}
mlx5_mkey_cache_debugfs_cleanup(dev);
--
2.40.1
* [PATCH rdma-rc 07/10] RDMA/cma: Always set static rate to 0 for RoCE
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (5 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 08/10] RDMA/uverbs: Restrict usage of privileged QKEYs Leon Romanovsky
` (3 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Mark Zhang, linux-rdma
From: Mark Zhang <markzhang@nvidia.com>
Set the static rate to 0, as it should be discovered by path query and
has no meaning for RoCE.
This also avoids using the rtnl lock and the ethtool API, which are a
bottleneck when trying to set up many rdma-cm connections at the same
time, especially with multiple processes.
Fixes: 3c86aa70bf67 ("RDMA/cm: Add RDMA CM support for IBoE devices")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/cma.c | 4 ++--
include/rdma/ib_addr.h | 23 -----------------------
2 files changed, 2 insertions(+), 25 deletions(-)
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 93a1c48d0c32..6b3f4384e46a 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -3295,7 +3295,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
route->path_rec->traffic_class = tos;
route->path_rec->mtu = iboe_get_mtu(ndev->mtu);
route->path_rec->rate_selector = IB_SA_EQ;
- route->path_rec->rate = iboe_get_rate(ndev);
+ route->path_rec->rate = IB_RATE_PORT_CURRENT;
dev_put(ndev);
route->path_rec->packet_life_time_selector = IB_SA_EQ;
/* In case ACK timeout is set, use this value to calculate
@@ -4964,7 +4964,7 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv,
if (!ndev)
return -ENODEV;
- ib.rec.rate = iboe_get_rate(ndev);
+ ib.rec.rate = IB_RATE_PORT_CURRENT;
ib.rec.hop_limit = 1;
ib.rec.mtu = iboe_get_mtu(ndev->mtu);
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index d808dc3d239e..811a0f11d0db 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -194,29 +194,6 @@ static inline enum ib_mtu iboe_get_mtu(int mtu)
return 0;
}
-static inline int iboe_get_rate(struct net_device *dev)
-{
- struct ethtool_link_ksettings cmd;
- int err;
-
- rtnl_lock();
- err = __ethtool_get_link_ksettings(dev, &cmd);
- rtnl_unlock();
- if (err)
- return IB_RATE_PORT_CURRENT;
-
- if (cmd.base.speed >= 40000)
- return IB_RATE_40_GBPS;
- else if (cmd.base.speed >= 30000)
- return IB_RATE_30_GBPS;
- else if (cmd.base.speed >= 20000)
- return IB_RATE_20_GBPS;
- else if (cmd.base.speed >= 10000)
- return IB_RATE_10_GBPS;
- else
- return IB_RATE_PORT_CURRENT;
-}
-
static inline int rdma_link_local_addr(struct in6_addr *addr)
{
if (addr->s6_addr32[0] == htonl(0xfe800000) &&
--
2.40.1
* [PATCH rdma-rc 08/10] RDMA/uverbs: Restrict usage of privileged QKEYs
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (6 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 07/10] RDMA/cma: Always set static rate to 0 for RoCE Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 16:55 ` Jason Gunthorpe
2023-06-05 10:33 ` [PATCH rdma-rc 09/10] IB/uverbs: Fix to consider event queue closing also upon non-blocking mode Leon Romanovsky
` (2 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Edward Srouji, linux-rdma
From: Edward Srouji <edwards@nvidia.com>
According to the IB specification rel-1.6, section 3.5.3:
"QKEYs with the most significant bit set are considered controlled
QKEYs, and a HCA does not allow a consumer to arbitrarily specify a
controlled QKEY."
Thus, block non-privileged users from setting such a QKEY.
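For illustration, a minimal libibverbs sketch of a call that is now
rejected for a process without CAP_NET_RAW (the QP and its creation are
assumed to exist elsewhere; the QKEY value is just an example of one
with the MSB set):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Move a UD QP to INIT while requesting a controlled QKEY (MSB set),
 * e.g. 0x80010000, the well-known GSI QKEY.  With this patch an
 * unprivileged caller gets EPERM instead of programming it.
 */
static int set_controlled_qkey(struct ibv_qp *qp)
{
	struct ibv_qp_attr attr = {
		.qp_state   = IBV_QPS_INIT,
		.pkey_index = 0,
		.port_num   = 1,
		.qkey       = 0x80010000,	/* most significant bit set */
	};
	int ret = ibv_modify_qp(qp, &attr,
				IBV_QP_STATE | IBV_QP_PKEY_INDEX |
				IBV_QP_PORT | IBV_QP_QKEY);

	if (ret)
		fprintf(stderr, "modify_qp: %s\n", strerror(ret));
	return ret;	/* EPERM expected without CAP_NET_RAW */
}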
Fixes: bc38a6abdd5a ("[PATCH] IB uverbs: core implementation")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/uverbs_cmd.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 4796f6a8828c..e836c9c477f6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1850,8 +1850,13 @@ static int modify_qp(struct uverbs_attr_bundle *attrs,
attr->path_mtu = cmd->base.path_mtu;
if (cmd->base.attr_mask & IB_QP_PATH_MIG_STATE)
attr->path_mig_state = cmd->base.path_mig_state;
- if (cmd->base.attr_mask & IB_QP_QKEY)
+ if (cmd->base.attr_mask & IB_QP_QKEY) {
+ if (cmd->base.qkey & IB_QP_SET_QKEY && !capable(CAP_NET_RAW)) {
+ ret = -EPERM;
+ goto release_qp;
+ }
attr->qkey = cmd->base.qkey;
+ }
if (cmd->base.attr_mask & IB_QP_RQ_PSN)
attr->rq_psn = cmd->base.rq_psn;
if (cmd->base.attr_mask & IB_QP_SQ_PSN)
--
2.40.1
* [PATCH rdma-rc 09/10] IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (7 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 08/10] RDMA/uverbs: Restrict usage of privileged QKEYs Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 10/10] RDMA/mlx5: Fix affinity assignment Leon Romanovsky
2023-06-11 9:21 ` (subset) [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Yishai Hadas, linux-rdma, Maor Gottlieb
From: Yishai Hadas <yishaih@nvidia.com>
Fix ib_uverbs_event_read() to consider event queue closing also in
non-blocking mode.
Once the queue is closed (e.g. in the hot-unplug flow), all the existing
events are cleaned up as part of ib_uverbs_free_event_queue().
An application that uses the non-blocking FD mode should get -EIO in
that case to let it know that the device was already removed.
Otherwise, it can lose the indication that the device was removed and
won't recover.
As part of that, refactor the code to have a single flow with regard to
'is_closed' for both blocking and non-blocking modes.
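For illustration, a sketch of a non-blocking consumer of the verbs
async-event FD (context setup assumed elsewhere); before this fix such a
loop kept seeing EAGAIN forever after hot-unplug, with the fix it gets
EIO once the queue is closed and empty:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain pending async events without blocking.  Returns 0 when nothing
 * is pending, -1 once the device has been removed.
 */
static int drain_async_events(struct ibv_context *ctx)
{
	struct ibv_async_event ev;
	int flags = fcntl(ctx->async_fd, F_GETFL);

	fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);

	for (;;) {
		if (ibv_get_async_event(ctx, &ev)) {
			if (errno == EAGAIN)
				return 0;	/* nothing pending, device alive */
			return -1;		/* EIO: queue closed, device gone */
		}
		fprintf(stderr, "async event %d\n", ev.event_type);
		ibv_ack_async_event(&ev);
	}
}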
Fixes: 14e23bd6d221 ("RDMA/core: Fix locking in ib_uverbs_event_read")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/uverbs_main.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index fbace69672ca..7c9c79c13941 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -222,8 +222,12 @@ static ssize_t ib_uverbs_event_read(struct ib_uverbs_event_queue *ev_queue,
spin_lock_irq(&ev_queue->lock);
while (list_empty(&ev_queue->event_list)) {
- spin_unlock_irq(&ev_queue->lock);
+ if (ev_queue->is_closed) {
+ spin_unlock_irq(&ev_queue->lock);
+ return -EIO;
+ }
+ spin_unlock_irq(&ev_queue->lock);
if (filp->f_flags & O_NONBLOCK)
return -EAGAIN;
@@ -233,12 +237,6 @@ static ssize_t ib_uverbs_event_read(struct ib_uverbs_event_queue *ev_queue,
return -ERESTARTSYS;
spin_lock_irq(&ev_queue->lock);
-
- /* If device was disassociated and no event exists set an error */
- if (list_empty(&ev_queue->event_list) && ev_queue->is_closed) {
- spin_unlock_irq(&ev_queue->lock);
- return -EIO;
- }
}
event = list_entry(ev_queue->event_list.next, struct ib_uverbs_event, list);
--
2.40.1
* [PATCH rdma-rc 10/10] RDMA/mlx5: Fix affinity assignment
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (8 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 09/10] IB/uverbs: Fix to consider event queue closing also upon non-blocking mode Leon Romanovsky
@ 2023-06-05 10:33 ` Leon Romanovsky
2023-06-11 9:21 ` (subset) [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-05 10:33 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Mark Bloch, linux-rdma, Maor Gottlieb
From: Mark Bloch <mbloch@nvidia.com>
The cited commit aimed to ensure that Virtual Functions (VFs) assign a
queue affinity to a Queue Pair (QP) to distribute traffic when
the LAG master creates a hardware LAG. If the affinity was set while
the hardware was not in LAG, the firmware would ignore the affinity value.
However, this commit unintentionally assigned an affinity to QPs on the LAG
master's VPORT even if the RDMA device was not marked as LAG-enabled.
In most cases, this was not an issue because when the hardware entered
hardware LAG configuration, the RDMA device of the LAG master would be
destroyed and a new one would be created, marked as LAG-enabled.
The problem arises when a user configures Equal-Cost Multipath (ECMP).
In ECMP mode, traffic can be directed to different physical ports based on
the queue affinity, which is intended for use by VPORTS other than the
E-Switch manager. ECMP mode is supported only if both E-Switch managers are
in switchdev mode and the appropriate route is configured via IP. In this
configuration, the RDMA device is not destroyed, and we retain the RDMA
device that is not marked as LAG-enabled.
To ensure correct behavior, Send Queues (SQs) opened by the E-Switch
manager through verbs should be assigned strict affinity. This means they
will only be able to communicate through the native physical port
associated with the E-Switch manager. This will prevent the firmware from
assigning affinity and will not allow the SQs to be remapped in case of
failover.
Fixes: 802dcc7fc5ec ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +++
drivers/infiniband/hw/mlx5/qp.c | 3 +++
drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 12 ------------
include/linux/mlx5/driver.h | 12 ++++++++++++
4 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 2a2d2a356c41..e0ad75f7026f 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1599,6 +1599,9 @@ static inline bool mlx5_ib_lag_should_assign_affinity(struct mlx5_ib_dev *dev)
MLX5_CAP_PORT_SELECTION(dev->mdev, port_select_flow_table_bypass))
return 0;
+ if (mlx5_lag_is_lacp_owner(dev->mdev) && !dev->lag_active)
+ return 0;
+
return dev->lag_active ||
(MLX5_CAP_GEN(dev->mdev, num_lag_ports) > 1 &&
MLX5_CAP_GEN(dev->mdev, lag_tx_port_affinity));
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index c4d04c05546a..f99e49499529 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1239,6 +1239,9 @@ static int create_raw_packet_qp_tis(struct mlx5_ib_dev *dev,
MLX5_SET(create_tis_in, in, uid, to_mpd(pd)->uid);
MLX5_SET(tisc, tisc, transport_domain, tdn);
+ if (!mlx5_ib_lag_should_assign_affinity(dev) &&
+ mlx5_lag_is_lacp_owner(dev->mdev))
+ MLX5_SET(tisc, tisc, strict_lag_tx_port_affinity, 1);
if (qp->flags & IB_QP_CREATE_SOURCE_QPN)
MLX5_SET(tisc, tisc, underlay_qpn, qp->underlay_qpn);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 1d879374acaa..229520405d4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -276,18 +276,6 @@ static inline bool mlx5_sriov_is_enabled(struct mlx5_core_dev *dev)
return pci_num_vf(dev->pdev) ? true : false;
}
-static inline int mlx5_lag_is_lacp_owner(struct mlx5_core_dev *dev)
-{
- /* LACP owner conditions:
- * 1) Function is physical.
- * 2) LAG is supported by FW.
- * 3) LAG is managed by driver (currently the only option).
- */
- return MLX5_CAP_GEN(dev, vport_group_manager) &&
- (MLX5_CAP_GEN(dev, num_lag_ports) > 1) &&
- MLX5_CAP_GEN(dev, lag_master);
-}
-
int mlx5_rescan_drivers_locked(struct mlx5_core_dev *dev);
static inline int mlx5_rescan_drivers(struct mlx5_core_dev *dev)
{
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index e67c603d507b..1e9f5bb4882b 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1227,6 +1227,18 @@ static inline u16 mlx5_core_max_vfs(const struct mlx5_core_dev *dev)
return dev->priv.sriov.max_vfs;
}
+static inline int mlx5_lag_is_lacp_owner(struct mlx5_core_dev *dev)
+{
+ /* LACP owner conditions:
+ * 1) Function is physical.
+ * 2) LAG is supported by FW.
+ * 3) LAG is managed by driver (currently the only option).
+ */
+ return MLX5_CAP_GEN(dev, vport_group_manager) &&
+ (MLX5_CAP_GEN(dev, num_lag_ports) > 1) &&
+ MLX5_CAP_GEN(dev, lag_master);
+}
+
static inline int mlx5_get_gid_table_len(u16 param)
{
if (param > 4) {
--
2.40.1
* Re: [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup
2023-06-05 10:33 ` [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Leon Romanovsky
@ 2023-06-05 16:54 ` Jason Gunthorpe
2023-06-06 5:50 ` Leon Romanovsky
0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2023-06-05 16:54 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Michael Guralnik, linux-rdma, Maor Gottlieb
On Mon, Jun 05, 2023 at 01:33:22PM +0300, Leon Romanovsky wrote:
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 1ce48e485c5b..f113656e4027 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1033,7 +1033,15 @@ void mlx5_mkey_cache_cleanup(struct mlx5_ib_dev *dev)
> xa_lock_irq(&ent->mkeys);
> ent->disabled = true;
> xa_unlock_irq(&ent->mkeys);
> - cancel_delayed_work_sync(&ent->dwork);
> + }
> +
> + /* Run the canceling of delayed works on the cache in a separate loop after
> + * disabling all entries to ensure someone_adding() will not try taking the
> + * rb_lock while flushing the workqueue.
> + */
> + for (node = rb_first(root); node; node = rb_next(node)) {
> + ent = rb_entry(node, struct mlx5_cache_ent, node);
> + cancel_delayed_work(&ent->dwork);
> }
>
This goes on to kfree end, so this can't drop the sync.
Jason
* Re: [PATCH rdma-rc 08/10] RDMA/uverbs: Restrict usage of privileged QKEYs
2023-06-05 10:33 ` [PATCH rdma-rc 08/10] RDMA/uverbs: Restrict usage of privileged QKEYs Leon Romanovsky
@ 2023-06-05 16:55 ` Jason Gunthorpe
0 siblings, 0 replies; 15+ messages in thread
From: Jason Gunthorpe @ 2023-06-05 16:55 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Edward Srouji, linux-rdma
On Mon, Jun 05, 2023 at 01:33:24PM +0300, Leon Romanovsky wrote:
> From: Edward Srouji <edwards@nvidia.com>
>
> According to the IB specification rel-1.6, section 3.5.3:
> "QKEYs with the most significant bit set are considered controlled
> QKEYs, and a HCA does not allow a consumer to arbitrarily specify a
> controlled QKEY."
>
> Thus, block non-privileged users from setting such a QKEY.
>
> Fixes: bc38a6abdd5a ("[PATCH] IB uverbs: core implementation")
cc: stable
Jason
* Re: [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup
2023-06-05 16:54 ` Jason Gunthorpe
@ 2023-06-06 5:50 ` Leon Romanovsky
0 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-06 5:50 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Michael Guralnik, linux-rdma, Maor Gottlieb
On Mon, Jun 05, 2023 at 01:54:38PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 01:33:22PM +0300, Leon Romanovsky wrote:
>
> > diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> > index 1ce48e485c5b..f113656e4027 100644
> > --- a/drivers/infiniband/hw/mlx5/mr.c
> > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > @@ -1033,7 +1033,15 @@ void mlx5_mkey_cache_cleanup(struct mlx5_ib_dev *dev)
> > xa_lock_irq(&ent->mkeys);
> > ent->disabled = true;
> > xa_unlock_irq(&ent->mkeys);
> > - cancel_delayed_work_sync(&ent->dwork);
> > + }
> > +
> > + /* Run the canceling of delayed works on the cache in a separate loop after
> > + * disabling all entries to ensure someone_adding() will not try taking the
> > + * rb_lock while flushing the workqueue.
> > + */
> > + for (node = rb_first(root); node; node = rb_next(node)) {
> > + ent = rb_entry(node, struct mlx5_cache_ent, node);
> > + cancel_delayed_work(&ent->dwork);
> > }
> >
> This goes on to kfree end, so this can't drop the sync.
With _sync, we will get the same code as before.
Let's put this patch aside.
Thanks
>
> Jason
* Re: (subset) [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
` (9 preceding siblings ...)
2023-06-05 10:33 ` [PATCH rdma-rc 10/10] RDMA/mlx5: Fix affinity assignment Leon Romanovsky
@ 2023-06-11 9:21 ` Leon Romanovsky
10 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-06-11 9:21 UTC (permalink / raw)
To: Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, Leon Romanovsky
On Mon, 05 Jun 2023 13:33:16 +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Hi,
>
> This is my collection of various fixes.
>
> Thanks
>
> [...]
Applied, thanks!
[01/10] RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
https://git.kernel.org/rdma/rdma/c/ee4d269eccfea6
[02/10] RDMA/mlx5: Create an indirect flow table for steering anchor
https://git.kernel.org/rdma/rdma/c/e1f4a52ac171dd
[03/10] RDMA/mlx5: Fix Q-counters per vport allocation
https://git.kernel.org/rdma/rdma/c/c2ea687e5e0e29
[04/10] RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters
https://git.kernel.org/rdma/rdma/c/e80ef139488fb4
[05/10] RDMA/mlx5: Fix Q-counters query in LAG mode
https://git.kernel.org/rdma/rdma/c/2de43f5b5137e0
[07/10] RDMA/cma: Always set static rate to 0 for RoCE
https://git.kernel.org/rdma/rdma/c/58030c76cce473
[08/10] RDMA/uverbs: Restrict usage of privileged QKEYs
https://git.kernel.org/rdma/rdma/c/0cadb4db79e1d9
[09/10] IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
https://git.kernel.org/rdma/rdma/c/62fab312fa1683
[10/10] RDMA/mlx5: Fix affinity assignment
https://git.kernel.org/rdma/rdma/c/617f5db1a626f1
Best regards,
--
Leon Romanovsky <leon@kernel.org>
end of thread (newest: 2023-06-11 9:21 UTC)
Thread overview: 15+ messages
2023-06-05 10:33 [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 01/10] RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 02/10] RDMA/mlx5: Create an indirect flow table for steering anchor Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 03/10] RDMA/mlx5: Fix Q-counters per vport allocation Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 04/10] RDMA/mlx5: Remove vport Q-counters dependency on normal Q-counters Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 05/10] RDMA/mlx5: Fix Q-counters query in LAG mode Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 06/10] RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Leon Romanovsky
2023-06-05 16:54 ` Jason Gunthorpe
2023-06-06 5:50 ` Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 07/10] RDMA/cma: Always set static rate to 0 for RoCE Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 08/10] RDMA/uverbs: Restrict usage of privileged QKEYs Leon Romanovsky
2023-06-05 16:55 ` Jason Gunthorpe
2023-06-05 10:33 ` [PATCH rdma-rc 09/10] IB/uverbs: Fix to consider event queue closing also upon non-blocking mode Leon Romanovsky
2023-06-05 10:33 ` [PATCH rdma-rc 10/10] RDMA/mlx5: Fix affinity assignment Leon Romanovsky
2023-06-11 9:21 ` (subset) [PATCH rdma-rc 00/10] Batch of uverbs and mlx5_ib fixes Leon Romanovsky