Netdev List
* [PATCH net-next V10 07/14] devlink: Allow rate node parents from other devlinks
From: Tariq Toukan @ 2026-07-01  7:32 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

This commit makes use of the building blocks previously added to
implement cross-device rate nodes.

A new 'supported_cross_device_rate_nodes' bool is added to devlink_ops
which lets drivers advertise support for cross-device rate objects.
If enabled and if there is a common shared devlink instance, then:
- all rate objects will be stored in the top-most common nested instance
  and
- rate objects can have parents from other devices sharing the same
  common instance.

Storing rates in the common shared ancestor is safe, because it is
reference counted by its nested devlink instances, so it's guaranteed to
outlive them. Furthermore, the shared devlink infra guarantees a given
nested devlink hierarchy is managed by the same driver.

The parent devlink from info->ctx is not locked, so none of its mutable
fields can be used. But parent setting only requires devlink pointer
comparisons. Additionally, since the shared devlink is locked, other
rate operations cannot happen concurrently.
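
The lock walk described above can be modelled outside the kernel. The
following is a minimal userspace sketch, not the kernel implementation:
struct toy_devlink and the toy_rate_lock()/toy_rate_unlock() helpers are
hypothetical stand-ins, and plain integer counters replace the real
instance lock and reference count. It only illustrates the order in
which the parent is taken and the intermediate instance is dropped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct devlink; not the kernel type. */
struct toy_devlink {
	struct toy_devlink *nested_in;	/* instance this one is nested in, or NULL */
	bool cross_device_rates;	/* models ops->supported_cross_device_rate_nodes */
	int lock_depth;			/* models the instance lock */
	int refs;			/* models the reference count */
};

/*
 * Caller already holds dl's lock. Returns the instance where rates are
 * stored. If that instance differs from dl, it is returned locked and
 * referenced; intermediate instances are released along the way.
 */
static struct toy_devlink *toy_rate_lock(struct toy_devlink *dl)
{
	struct toy_devlink *rate_dl = dl, *parent;

	while (rate_dl->cross_device_rates) {
		parent = rate_dl->nested_in;
		if (!parent)
			break;
		/* Take the parent before dropping the intermediate instance. */
		parent->refs++;
		parent->lock_depth++;
		if (rate_dl != dl) {
			rate_dl->lock_depth--;
			rate_dl->refs--;
		}
		rate_dl = parent;
	}
	return rate_dl;
}

/* Releases rate_dl only when it is not the caller's own instance. */
static void toy_rate_unlock(struct toy_devlink *dl,
			    struct toy_devlink *rate_dl)
{
	if (dl == rate_dl)
		return;
	rate_dl->lock_depth--;
	rate_dl->refs--;
}
```

In this model, a three-level chain ends with only the caller's instance
and the top-most instance held, which is the property the parent
comparison relies on.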

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../networking/devlink/devlink-port.rst       |  2 +
 include/net/devlink.h                         |  9 ++
 net/devlink/core.c                            |  4 +-
 net/devlink/rate.c                            | 86 +++++++++++++++++--
 4 files changed, 92 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 9374ebe70f48..18aca77006d5 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -420,6 +420,8 @@ API allows to configure following rate object's parameters:
   Parent node name. Parent node rate limits are considered as additional limits
   to all node children limits. ``tx_max`` is an upper limit for children.
   ``tx_share`` is a total bandwidth distributed among children.
+  If the device supports cross-function scheduling, the parent can be from a
+  different function of the same underlying device.
 
 ``tc_bw``
   Allow users to set the bandwidth allocation per traffic class on rate
diff --git a/include/net/devlink.h b/include/net/devlink.h
index dd546dbd57cf..ffe1ad5fb70b 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1594,6 +1594,15 @@ struct devlink_ops {
 				    struct devlink_rate *parent,
 				    void *priv_child, void *priv_parent,
 				    struct netlink_ext_ack *extack);
+	/* Indicates if cross-device rate nodes are supported.
+	 * This also requires a shared common ancestor object all devices that
+	 * could share rate nodes are nested in.
+	 * If enabled, rate operations may be called on an instance with only
+	 * the common ancestor lock held and *without that instance lock held*.
+	 * It is the driver's responsibility to ensure proper serialization
+	 * with other operations.
+	 */
+	bool supported_cross_device_rate_nodes;
 	/**
 	 * selftests_check() - queries if selftest is supported
 	 * @devlink: devlink instance
diff --git a/net/devlink/core.c b/net/devlink/core.c
index ee26c50b4118..c53a42e17a58 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -534,6 +534,9 @@ void devlink_free(struct devlink *devlink)
 {
 	ASSERT_DEVLINK_NOT_REGISTERED(devlink);
 
+	devl_lock(devlink);
+	WARN_ON(devlink_rates_check(devlink, NULL, NULL));
+	devl_unlock(devlink);
 	devlink_rel_put(devlink);
 
 	WARN_ON(!list_empty(&devlink->trap_policer_list));
@@ -544,7 +547,6 @@ void devlink_free(struct devlink *devlink)
 	WARN_ON(!list_empty(&devlink->resource_list));
 	WARN_ON(!list_empty(&devlink->dpipe_table_list));
 	WARN_ON(!list_empty(&devlink->sb_list));
-	WARN_ON(devlink_rates_check(devlink, NULL, NULL));
 	WARN_ON(!list_empty(&devlink->linecard_list));
 	WARN_ON(!xa_empty(&devlink->ports));
 
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 78a59d79c2ea..e727c8b8b33e 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -30,14 +30,42 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
 	return devlink_rate ?: ERR_PTR(-ENODEV);
 }
 
+/* Repeatedly walks the nested devlink chain while cross device rate nodes are
+ * supported and finds the topmost instance where rates should be stored.
+ * That instance is locked, referenced and returned.
+ * When cross device rate nodes aren't supported the original devlink instance
+ * is returned.
+ */
 static struct devlink *devl_rate_lock(struct devlink *devlink)
 {
-	return devlink;
+	struct devlink *rate_devlink = devlink, *parent;
+
+	devl_assert_locked(devlink);
+
+	while (rate_devlink->ops &&
+	       rate_devlink->ops->supported_cross_device_rate_nodes) {
+		parent = devlink_nested_in_get_lock(rate_devlink);
+		if (!parent)
+			break;
+		if (rate_devlink != devlink) {
+			/* Unlock intermediate instances. */
+			devl_unlock(rate_devlink);
+			devlink_put(rate_devlink);
+		}
+		rate_devlink = parent;
+	}
+	return rate_devlink;
 }
 
+/* Unlocks and puts 'rate devlink' if different than 'devlink'. */
 static void devl_rate_unlock(struct devlink *devlink,
 			     struct devlink *rate_devlink)
 {
+	if (devlink == rate_devlink)
+		return;
+
+	devl_unlock(rate_devlink);
+	devlink_put(rate_devlink);
 }
 
 static struct devlink_rate *
@@ -121,6 +149,25 @@ static int devlink_rate_put_tc_bws(struct sk_buff *msg, u32 *tc_bw)
 	return -EMSGSIZE;
 }
 
+static int devlink_nl_rate_parent_fill(struct sk_buff *msg,
+				       struct devlink_rate *devlink_rate)
+{
+	struct devlink_rate *parent = devlink_rate->parent;
+	struct devlink *devlink = parent->devlink;
+
+	if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
+			   parent->name))
+		return -EMSGSIZE;
+
+	if (devlink != devlink_rate->devlink &&
+	    devlink_nl_put_nested_handle(msg,
+					 devlink_net(devlink_rate->devlink),
+					 devlink, DEVLINK_ATTR_PARENT_DEV))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 static int devlink_nl_rate_fill(struct sk_buff *msg,
 				struct devlink_rate *devlink_rate,
 				enum devlink_command cmd, u32 portid, u32 seq,
@@ -165,10 +212,9 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
 			devlink_rate->tx_weight))
 		goto nla_put_failure;
 
-	if (devlink_rate->parent)
-		if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
-				   devlink_rate->parent->name))
-			goto nla_put_failure;
+	if (devlink_rate->parent &&
+	    devlink_nl_rate_parent_fill(msg, devlink_rate))
+		goto nla_put_failure;
 
 	if (devlink_rate_put_tc_bws(msg, devlink_rate->tc_bw))
 		goto nla_put_failure;
@@ -322,13 +368,14 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
 				struct genl_info *info,
 				struct nlattr *nla_parent)
 {
-	struct devlink *devlink = devlink_rate->devlink;
+	struct devlink *devlink = devlink_rate->devlink, *parent_devlink;
 	const char *parent_name = nla_data(nla_parent);
 	const struct devlink_ops *ops = devlink->ops;
 	size_t len = strlen(parent_name);
 	struct devlink_rate *parent;
 	int err = -EOPNOTSUPP;
 
+	parent_devlink = devlink_nl_ctx(info)->parent_devlink ? : devlink;
 	parent = devlink_rate->parent;
 
 	if (parent && !len) {
@@ -346,7 +393,13 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
 		refcount_dec(&parent->refcnt);
 		devlink_rate->parent = NULL;
 	} else if (len) {
-		parent = devlink_rate_node_get_by_name(rate_devlink, devlink,
+		/* parent_devlink (when different than devlink) isn't locked,
+		 * but the rate node devlink instance is, so nobody from the
+		 * same group of devices sharing rates could change the used
+		 * fields or unregister the parent.
+		 */
+		parent = devlink_rate_node_get_by_name(rate_devlink,
+						       parent_devlink,
 						       parent_name);
 		if (IS_ERR(parent))
 			return -ENODEV;
@@ -633,9 +686,11 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
 
 int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+	struct devlink *devlink = ctx->devlink;
 	struct devlink_rate *devlink_rate;
 	const struct devlink_ops *ops;
+	struct devlink *rate_devlink;
 	int err;
 
 	rate_devlink = devl_rate_lock(devlink);
@@ -652,6 +707,14 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
+	if (ctx->parent_devlink && ctx->parent_devlink != devlink &&
+	    !ops->supported_cross_device_rate_nodes) {
+		NL_SET_ERR_MSG(info->extack,
+			       "Cross-device rate parents aren't supported");
+		err = -EOPNOTSUPP;
+		goto unlock;
+	}
+
 	err = devlink_nl_rate_set(devlink_rate, rate_devlink, ops, info);
 
 	if (!err)
@@ -679,6 +742,13 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 	if (!devlink_rate_set_ops_supported(ops, info, DEVLINK_RATE_TYPE_NODE))
 		return -EOPNOTSUPP;
 
+	if (ctx->parent_devlink && ctx->parent_devlink != devlink &&
+	    !ops->supported_cross_device_rate_nodes) {
+		NL_SET_ERR_MSG(info->extack,
+			       "Cross-device rate parents aren't supported");
+		return -EOPNOTSUPP;
+	}
+
 	rate_devlink = devl_rate_lock(devlink);
 	rate_node = devlink_rate_node_get_from_attrs(rate_devlink, devlink,
 						     info->attrs);
-- 
2.44.0



* [PATCH net-next V10 09/14] net/mlx5: qos: Refactor vport QoS cleanup
From: Tariq Toukan @ 2026-07-01  7:32 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

QoS cleanup is complex because of the two modes of operation (legacy
and switchdev).

Leaf QoS is removed:
1. In legacy mode by esw_vport_cleanup() -> mlx5_esw_qos_vport_disable()
2. In switchdev mode by mlx5_esw_offloads_devlink_port_unregister() ->
   mlx5_esw_qos_vport_update_parent(). A little later in the same flow,
   the calls in 1 happen but they are no-ops.

Zooming out a bit, both mlx5_eswitch_disable_locked() and
mlx5_eswitch_disable_sriov() destroy the leaves before the nodes,
which is the reverse of the correct order.

For SFs there's no devl_rate_nodes_destroy() call to unparent the
affected leaf.

Sanitize all of this by:
1. Destroying nodes before leaves in both legacy and switchdev mode.
2. Only removing vport QoS from esw_vport_cleanup(), reachable from both
   legacy and switchdev and also reachable by SF removal.
3. Unexposing mlx5_esw_qos_vport_update_parent(), which becomes internal
   to qos.
4. Removing the WARN in mlx5_esw_qos_vport_disable().

This also takes care of a theoretical corner case, in which
mlx5_esw_qos_vport_update_parent() tries to reattach the vport to
the original parent on failure. That reattachment can fail as well,
leaving the vport in a broken state.
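
The ordering argument can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: struct toy_node, struct
toy_leaf and both helpers are hypothetical and do not correspond to mlx5
or devlink types. It models node destruction as the step that unparents
leaves, so that the later leaf cleanup never sees an attached leaf.

```c
#include <stddef.h>

/* Hypothetical scheduling node: counts the leaves attached to it. */
struct toy_node {
	int refcnt;
};

/* Hypothetical leaf (vport): points at its parent node, if any. */
struct toy_leaf {
	struct toy_node *parent;
};

/* Models node destruction: every attached leaf is unparented first. */
static void toy_nodes_destroy(struct toy_leaf *leaves, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!leaves[i].parent)
			continue;
		leaves[i].parent->refcnt--;
		leaves[i].parent = NULL;
	}
}

/* Returns 0 when the leaf is detached, -1 when it still has a parent. */
static int toy_leaf_destroy(const struct toy_leaf *leaf)
{
	return leaf->parent ? -1 : 0;
}
```

Calling toy_nodes_destroy() before toy_leaf_destroy() always yields 0 in
this model, whereas the reverse order reports an attached leaf.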

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../mellanox/mlx5/core/esw/devlink_port.c     |  1 -
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 14 ++++----------
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 19 ++++++++++---------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  2 --
 4 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index 6e50311faa27..8c27a33f9d7b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -268,7 +268,6 @@ void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_vport *vport)
 	dl_port = vport->dl_port;
 	mlx5_esw_devlink_port_res_unregister(&dl_port->dl_port);
 
-	mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
 	devl_rate_leaf_destroy(&dl_port->dl_port);
 
 	devl_port_unregister(&dl_port->dl_port);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index d04fda4b3778..204f47c99142 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -1139,18 +1139,10 @@ static void mlx5_esw_qos_vport_disable_locked(struct mlx5_vport *vport)
 void mlx5_esw_qos_vport_disable(struct mlx5_vport *vport)
 {
 	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
-	struct mlx5_esw_sched_node *parent;
 
 	lockdep_assert_held(&esw->state_lock);
 	esw_qos_lock(esw);
-	if (!vport->qos.sched_node)
-		goto unlock;
-
-	parent = vport->qos.sched_node->parent;
-	WARN(parent, "Disabling QoS on port before detaching it from node");
-
 	mlx5_esw_qos_vport_disable_locked(vport);
-unlock:
 	esw_qos_unlock(esw);
 }
 
@@ -1866,8 +1858,10 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
 	return 0;
 }
 
-int mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *parent,
-				     struct netlink_ext_ack *extack)
+static int
+mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
+				 struct mlx5_esw_sched_node *parent,
+				 struct netlink_ext_ack *extack)
 {
 	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
 	int err = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index a0e2ca87b8d8..b67f15a8f766 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1990,6 +1990,13 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
 		 esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
 
 	mlx5_eswitch_invalidate_wq(esw);
+
+	if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
+		struct devlink *devlink = priv_to_devlink(esw->dev);
+
+		devl_rate_nodes_destroy(devlink);
+	}
+
 	mlx5_esw_reps_block(esw);
 
 	if (!mlx5_core_is_ecpf(esw->dev)) {
@@ -2003,12 +2010,6 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
 	}
 
 	mlx5_esw_reps_unblock(esw);
-
-	if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
-		struct devlink *devlink = priv_to_devlink(esw->dev);
-
-		devl_rate_nodes_destroy(devlink);
-	}
 	/* Destroy legacy fdb when disabling sriov in legacy mode. */
 	if (esw->mode == MLX5_ESWITCH_LEGACY)
 		mlx5_eswitch_disable_locked(esw);
@@ -2039,6 +2040,9 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
 		 esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
 		 esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
 
+	if (esw->mode == MLX5_ESWITCH_OFFLOADS)
+		devl_rate_nodes_destroy(devlink);
+
 	if (esw->fdb_table.flags & MLX5_ESW_FDB_CREATED) {
 		esw->fdb_table.flags &= ~MLX5_ESW_FDB_CREATED;
 		if (esw->mode == MLX5_ESWITCH_OFFLOADS)
@@ -2047,9 +2051,6 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
 			esw_legacy_disable(esw);
 		mlx5_esw_acls_ns_cleanup(esw);
 	}
-
-	if (esw->mode == MLX5_ESWITCH_OFFLOADS)
-		devl_rate_nodes_destroy(devlink);
 }
 
 void mlx5_eswitch_disable(struct mlx5_eswitch *esw)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index fea72b1dedab..140343f2b913 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -482,8 +482,6 @@ int mlx5_eswitch_set_vport_trust(struct mlx5_eswitch *esw,
 				 u16 vport_num, bool setting);
 int mlx5_eswitch_set_vport_rate(struct mlx5_eswitch *esw, u16 vport,
 				u32 max_rate, u32 min_rate);
-int mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport, struct mlx5_esw_sched_node *node,
-				     struct netlink_ext_ack *extack);
 int mlx5_eswitch_set_vepa(struct mlx5_eswitch *esw, u8 setting);
 int mlx5_eswitch_get_vepa(struct mlx5_eswitch *esw, u8 *setting);
 int mlx5_eswitch_get_vport_config(struct mlx5_eswitch *esw,
-- 
2.44.0



* [PATCH net-next V10 03/14] devlink: Migrate from info->user_ptr to info->ctx
From: Tariq Toukan @ 2026-07-01  7:32 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

Replace deprecated info->user_ptr[0]/[1] with a typed
devlink_nl_ctx struct stored in info->ctx. The struct aliases
the same union memory, so the migration is safe.

There are no functionality changes here.
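
The aliasing argument can be checked with a small userspace sketch. It
is an illustration only: struct toy_info stands in for the relevant part
of struct genl_info, the 48-byte size of the ctx area is an assumption
made for the example, and toy_ctx() is a hypothetical counterpart of the
devlink_nl_ctx() accessor. The static assertions show that the typed
fields land on the same offsets as the two user_ptr slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct genl_info: only the union is modelled. */
struct toy_info {
	union {
		void *user_ptr[2];	/* old, untyped slots */
		unsigned char ctx[48];	/* opaque context area; size is illustrative */
	};
};

/* Typed view of the context area, mirroring the two old slots. */
struct toy_ctx {
	void *devlink;
	void *devlink_port;
};

/* Both union members start at the same offset, so the views overlap. */
static_assert(offsetof(struct toy_info, user_ptr) ==
	      offsetof(struct toy_info, ctx), "union members must overlap");
/* Field offsets match user_ptr[0] and user_ptr[1] respectively. */
static_assert(offsetof(struct toy_ctx, devlink) == 0,
	      "devlink must alias user_ptr[0]");
static_assert(offsetof(struct toy_ctx, devlink_port) == sizeof(void *),
	      "devlink_port must alias user_ptr[1]");

static inline struct toy_ctx *toy_ctx(struct toy_info *info)
{
	/* Fails the build if the typed struct outgrows the context area. */
	static_assert(sizeof(struct toy_ctx) <=
		      sizeof(((struct toy_info *)0)->ctx),
		      "typed context must fit in ctx");
	return (struct toy_ctx *)info->ctx;
}
```

A compile-time size check of this kind keeps later additions to the
typed struct from silently overflowing the context area.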

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/devlink/dev.c           | 16 ++++++++--------
 net/devlink/devl_internal.h | 13 +++++++++++++
 net/devlink/dpipe.c         | 14 +++++++-------
 net/devlink/health.c        | 12 ++++++------
 net/devlink/linecard.c      |  4 ++--
 net/devlink/netlink.c       |  8 ++++----
 net/devlink/param.c         |  4 ++--
 net/devlink/port.c          | 18 +++++++++---------
 net/devlink/rate.c          |  8 ++++----
 net/devlink/region.c        |  6 +++---
 net/devlink/resource.c      | 14 +++++++++-----
 net/devlink/sb.c            | 22 +++++++++++-----------
 net/devlink/trap.c          | 12 ++++++------
 13 files changed, 84 insertions(+), 67 deletions(-)

diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index 57b2b8f03543..bcf001554e84 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -222,7 +222,7 @@ static void devlink_notify(struct devlink *devlink, enum devlink_command cmd)
 
 int devlink_nl_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct sk_buff *msg;
 	int err;
 
@@ -519,7 +519,7 @@ devlink_nl_reload_actions_performed_snd(struct devlink *devlink, u32 actions_per
 
 int devlink_nl_reload_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	enum devlink_reload_action action;
 	enum devlink_reload_limit limit;
 	struct net *dest_net = NULL;
@@ -683,7 +683,7 @@ static int devlink_nl_eswitch_fill(struct sk_buff *msg, struct devlink *devlink,
 
 int devlink_nl_eswitch_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct sk_buff *msg;
 	int err;
 
@@ -704,7 +704,7 @@ int devlink_nl_eswitch_get_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_eswitch_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	const struct devlink_ops *ops = devlink->ops;
 	enum devlink_eswitch_encap_mode encap_mode;
 	u8 inline_mode;
@@ -906,7 +906,7 @@ devlink_nl_info_fill(struct sk_buff *msg, struct devlink *devlink,
 
 int devlink_nl_info_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct sk_buff *msg;
 	int err;
 
@@ -1134,7 +1134,7 @@ int devlink_nl_flash_update_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nlattr *nla_overwrite_mask, *nla_file_name;
 	struct devlink_flash_update_params params = {};
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	const char *file_name;
 	u32 supported_params;
 	int ret;
@@ -1302,7 +1302,7 @@ devlink_nl_selftests_fill(struct sk_buff *msg, struct devlink *devlink,
 
 int devlink_nl_selftests_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct sk_buff *msg;
 	int err;
 
@@ -1372,7 +1372,7 @@ static const struct nla_policy devlink_selftest_nl_policy[DEVLINK_ATTR_SELFTEST_
 int devlink_nl_selftests_run_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nlattr *tb[DEVLINK_ATTR_SELFTEST_ID_MAX + 1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct nlattr *attrs, *selftests;
 	struct sk_buff *msg;
 	void *hdr;
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 36dff282f9b0..52c8bf359dd4 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -151,6 +151,19 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
 				   bool *msg_updated);
 
 /* Netlink */
+struct devlink_nl_ctx {
+	struct devlink *devlink;
+	struct devlink_port *devlink_port;
+};
+
+static inline struct devlink_nl_ctx *
+devlink_nl_ctx(struct genl_info *info)
+{
+	BUILD_BUG_ON(sizeof(struct devlink_nl_ctx) >
+		     sizeof_field(struct genl_info, ctx));
+	return (struct devlink_nl_ctx *)info->ctx;
+}
+
 enum devlink_multicast_groups {
 	DEVLINK_MCGRP_CONFIG,
 };
diff --git a/net/devlink/dpipe.c b/net/devlink/dpipe.c
index c8d4a4374ae1..08c7b66fc3e8 100644
--- a/net/devlink/dpipe.c
+++ b/net/devlink/dpipe.c
@@ -213,7 +213,7 @@ static int devlink_dpipe_tables_fill(struct genl_info *info,
 				     struct list_head *dpipe_tables,
 				     const char *table_name)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_dpipe_table *table;
 	struct nlattr *tables_attr;
 	struct sk_buff *skb = NULL;
@@ -290,7 +290,7 @@ static int devlink_dpipe_tables_fill(struct genl_info *info,
 
 int devlink_nl_dpipe_table_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	const char *table_name =  NULL;
 
 	if (info->attrs[DEVLINK_ATTR_DPIPE_TABLE_NAME])
@@ -478,7 +478,7 @@ int devlink_dpipe_entry_ctx_prepare(struct devlink_dpipe_dump_ctx *dump_ctx)
 	if (!dump_ctx->hdr)
 		goto nla_put_failure;
 
-	devlink = dump_ctx->info->user_ptr[0];
+	devlink = devlink_nl_ctx(dump_ctx->info)->devlink;
 	if (devlink_nl_put_handle(dump_ctx->skb, devlink))
 		goto nla_put_failure;
 	dump_ctx->nest = nla_nest_start_noflag(dump_ctx->skb,
@@ -563,7 +563,7 @@ static int devlink_dpipe_entries_fill(struct genl_info *info,
 int devlink_nl_dpipe_entries_get_doit(struct sk_buff *skb,
 				      struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_dpipe_table *table;
 	const char *table_name;
 
@@ -650,7 +650,7 @@ static int devlink_dpipe_headers_fill(struct genl_info *info,
 				      struct devlink_dpipe_headers *
 				      dpipe_headers)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct nlattr *headers_attr;
 	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
@@ -713,7 +713,7 @@ static int devlink_dpipe_headers_fill(struct genl_info *info,
 int devlink_nl_dpipe_headers_get_doit(struct sk_buff *skb,
 				      struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 
 	if (!devlink->dpipe_headers)
 		return -EOPNOTSUPP;
@@ -747,7 +747,7 @@ static int devlink_dpipe_table_counters_set(struct devlink *devlink,
 int devlink_nl_dpipe_table_counters_set_doit(struct sk_buff *skb,
 					     struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	const char *table_name;
 	bool counters_enable;
 
diff --git a/net/devlink/health.c b/net/devlink/health.c
index ea7a334e939b..8ce6cd399cb7 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -358,7 +358,7 @@ devlink_health_reporter_get_from_info(struct devlink *devlink,
 int devlink_nl_health_reporter_get_doit(struct sk_buff *skb,
 					struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_health_reporter *reporter;
 	struct sk_buff *msg;
 	int err;
@@ -456,7 +456,7 @@ int devlink_nl_health_reporter_get_dumpit(struct sk_buff *skb,
 int devlink_nl_health_reporter_set_doit(struct sk_buff *skb,
 					struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_health_reporter *reporter;
 
 	reporter = devlink_health_reporter_get_from_info(devlink, info);
@@ -715,7 +715,7 @@ EXPORT_SYMBOL_GPL(devlink_health_reporter_state_update);
 int devlink_nl_health_reporter_recover_doit(struct sk_buff *skb,
 					    struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_health_reporter *reporter;
 
 	reporter = devlink_health_reporter_get_from_info(devlink, info);
@@ -1157,7 +1157,7 @@ static int devlink_fmsg_dumpit(struct devlink_fmsg *fmsg, struct sk_buff *skb,
 int devlink_nl_health_reporter_diagnose_doit(struct sk_buff *skb,
 					     struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_health_reporter *reporter;
 	struct devlink_fmsg *fmsg;
 	int err;
@@ -1252,7 +1252,7 @@ int devlink_nl_health_reporter_dump_get_dumpit(struct sk_buff *skb,
 int devlink_nl_health_reporter_dump_clear_doit(struct sk_buff *skb,
 					       struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_health_reporter *reporter;
 
 	reporter = devlink_health_reporter_get_from_info(devlink, info);
@@ -1269,7 +1269,7 @@ int devlink_nl_health_reporter_dump_clear_doit(struct sk_buff *skb,
 int devlink_nl_health_reporter_test_doit(struct sk_buff *skb,
 					 struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_health_reporter *reporter;
 
 	reporter = devlink_health_reporter_get_from_info(devlink, info);
diff --git a/net/devlink/linecard.c b/net/devlink/linecard.c
index 8315d35cb91d..fd18f2759770 100644
--- a/net/devlink/linecard.c
+++ b/net/devlink/linecard.c
@@ -171,7 +171,7 @@ void devlink_linecards_notify_unregister(struct devlink *devlink)
 
 int devlink_nl_linecard_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_linecard *linecard;
 	struct sk_buff *msg;
 	int err;
@@ -371,7 +371,7 @@ static int devlink_linecard_type_unset(struct devlink_linecard *linecard,
 int devlink_nl_linecard_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_linecard *linecard;
 	int err;
 
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index ae4afc739678..f0a857e286bc 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -252,18 +252,18 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
 	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
 
-	info->user_ptr[0] = devlink;
+	devlink_nl_ctx(info)->devlink = devlink;
 	if (flags & DEVLINK_NL_FLAG_NEED_PORT) {
 		devlink_port = devlink_port_get_from_info(devlink, info);
 		if (IS_ERR(devlink_port)) {
 			err = PTR_ERR(devlink_port);
 			goto unlock;
 		}
-		info->user_ptr[1] = devlink_port;
+		devlink_nl_ctx(info)->devlink_port = devlink_port;
 	} else if (flags & DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT) {
 		devlink_port = devlink_port_get_from_info(devlink, info);
 		if (!IS_ERR(devlink_port))
-			info->user_ptr[1] = devlink_port;
+			devlink_nl_ctx(info)->devlink_port = devlink_port;
 	}
 	return 0;
 
@@ -304,7 +304,7 @@ static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
 	bool dev_lock = flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK;
 	struct devlink *devlink;
 
-	devlink = info->user_ptr[0];
+	devlink = devlink_nl_ctx(info)->devlink;
 	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
 }
diff --git a/net/devlink/param.c b/net/devlink/param.c
index 3e9d2e5750c2..1cc562a6ebfd 100644
--- a/net/devlink/param.c
+++ b/net/devlink/param.c
@@ -627,7 +627,7 @@ devlink_param_get_from_info(struct xarray *params, struct genl_info *info)
 int devlink_nl_param_get_doit(struct sk_buff *skb,
 			      struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_param_item *param_item;
 	struct sk_buff *msg;
 	int err;
@@ -728,7 +728,7 @@ static int __devlink_nl_cmd_param_set_doit(struct devlink *devlink,
 
 int devlink_nl_param_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 
 	return __devlink_nl_cmd_param_set_doit(devlink, 0, &devlink->params,
 					       info, DEVLINK_CMD_PARAM_NEW);
diff --git a/net/devlink/port.c b/net/devlink/port.c
index 485029d43428..c268afefaed7 100644
--- a/net/devlink/port.c
+++ b/net/devlink/port.c
@@ -594,7 +594,7 @@ void devlink_ports_notify_unregister(struct devlink *devlink)
 
 int devlink_nl_port_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
 	struct sk_buff *msg;
 	int err;
 
@@ -830,7 +830,7 @@ static int devlink_port_function_set(struct devlink_port *port,
 
 int devlink_nl_port_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
 	int err;
 
 	if (info->attrs[DEVLINK_ATTR_PORT_TYPE]) {
@@ -856,8 +856,8 @@ int devlink_nl_port_set_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_port_split_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	u32 count;
 
 	if (GENL_REQ_ATTR_CHECK(info, DEVLINK_ATTR_PORT_SPLIT_COUNT))
@@ -887,8 +887,8 @@ int devlink_nl_port_split_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_port_unsplit_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 
 	if (!devlink_port->ops->port_unsplit)
 		return -EOPNOTSUPP;
@@ -899,7 +899,7 @@ int devlink_nl_port_new_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct netlink_ext_ack *extack = info->extack;
 	struct devlink_port_new_attrs new_attrs = {};
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_port *devlink_port;
 	struct sk_buff *msg;
 	int err;
@@ -961,9 +961,9 @@ int devlink_nl_port_new_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_port_del_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 
 	if (!devlink_port->ops->port_del)
 		return -EOPNOTSUPP;
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 533d21b028a7..630441e429b3 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -239,7 +239,7 @@ int devlink_nl_rate_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 
 int devlink_nl_rate_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *devlink_rate;
 	struct sk_buff *msg;
 	int err;
@@ -588,7 +588,7 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
 
 int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *devlink_rate;
 	const struct devlink_ops *ops;
 	int err;
@@ -610,7 +610,7 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *rate_node;
 	const struct devlink_ops *ops;
 	int err;
@@ -666,7 +666,7 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *rate_node;
 	int err;
 
diff --git a/net/devlink/region.c b/net/devlink/region.c
index 5588e3d560b9..537779bbff07 100644
--- a/net/devlink/region.c
+++ b/net/devlink/region.c
@@ -469,7 +469,7 @@ static void devlink_region_snapshot_del(struct devlink_region *region,
 
 int devlink_nl_region_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_port *port = NULL;
 	struct devlink_region *region;
 	const char *region_name;
@@ -588,7 +588,7 @@ int devlink_nl_region_get_dumpit(struct sk_buff *skb,
 
 int devlink_nl_region_del_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_snapshot *snapshot;
 	struct devlink_port *port = NULL;
 	struct devlink_region *region;
@@ -633,7 +633,7 @@ int devlink_nl_region_del_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_region_new_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_snapshot *snapshot;
 	struct devlink_port *port = NULL;
 	struct nlattr *snapshot_id_attr;
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 574108ccfe5d..c3cfda7ea070 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -117,7 +117,7 @@ devlink_resource_validate_size(struct devlink_resource *resource, u64 size,
 
 int devlink_nl_resource_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_resource *resource;
 	u64 resource_id;
 	u64 size;
@@ -251,8 +251,9 @@ static int devlink_resource_list_fill(struct sk_buff *skb,
 static int devlink_resource_fill(struct genl_info *info,
 				 enum devlink_command cmd, int flags)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+	struct devlink *devlink = ctx->devlink;
+	struct devlink_port *devlink_port;
 	struct devlink_resource *resource;
 	struct list_head *resource_list;
 	struct nlattr *resources_attr;
@@ -263,6 +264,7 @@ static int devlink_resource_fill(struct genl_info *info,
 	int i;
 	int err;
 
+	devlink_port = ctx->devlink_port;
 	resource_list = devlink_port ?
 		&devlink_port->resource_list : &devlink->resource_list;
 	resource = list_first_entry(resource_list,
@@ -326,10 +328,12 @@ static int devlink_resource_fill(struct genl_info *info,
 
 int devlink_nl_resource_dump_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+	struct devlink *devlink = ctx->devlink;
+	struct devlink_port *devlink_port;
 	struct list_head *resource_list;
 
+	devlink_port = ctx->devlink_port;
 	if (info->attrs[DEVLINK_ATTR_PORT_INDEX] && !devlink_port)
 		return -ENODEV;
 
diff --git a/net/devlink/sb.c b/net/devlink/sb.c
index 49fcbfe08f15..129bd016e302 100644
--- a/net/devlink/sb.c
+++ b/net/devlink/sb.c
@@ -204,7 +204,7 @@ static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 
 int devlink_nl_sb_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_sb *devlink_sb;
 	struct sk_buff *msg;
 	int err;
@@ -306,7 +306,7 @@ static int devlink_nl_sb_pool_fill(struct sk_buff *msg, struct devlink *devlink,
 
 int devlink_nl_sb_pool_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_sb *devlink_sb;
 	struct sk_buff *msg;
 	u16 pool_index;
@@ -415,7 +415,7 @@ static int devlink_sb_pool_set(struct devlink *devlink, unsigned int sb_index,
 
 int devlink_nl_sb_pool_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	enum devlink_sb_threshold_type threshold_type;
 	struct devlink_sb *devlink_sb;
 	u16 pool_index;
@@ -506,7 +506,7 @@ static int devlink_nl_sb_port_pool_fill(struct sk_buff *msg,
 int devlink_nl_sb_port_pool_get_doit(struct sk_buff *skb,
 				     struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
 	struct devlink *devlink = devlink_port->devlink;
 	struct devlink_sb *devlink_sb;
 	struct sk_buff *msg;
@@ -624,8 +624,8 @@ static int devlink_sb_port_pool_set(struct devlink_port *devlink_port,
 int devlink_nl_sb_port_pool_set_doit(struct sk_buff *skb,
 				     struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_sb *devlink_sb;
 	u16 pool_index;
 	u32 threshold;
@@ -716,7 +716,7 @@ devlink_nl_sb_tc_pool_bind_fill(struct sk_buff *msg, struct devlink *devlink,
 int devlink_nl_sb_tc_pool_bind_get_doit(struct sk_buff *skb,
 					struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
 	struct devlink *devlink = devlink_port->devlink;
 	struct devlink_sb *devlink_sb;
 	struct sk_buff *msg;
@@ -864,8 +864,8 @@ static int devlink_sb_tc_pool_bind_set(struct devlink_port *devlink_port,
 int devlink_nl_sb_tc_pool_bind_set_doit(struct sk_buff *skb,
 					struct genl_info *info)
 {
-	struct devlink_port *devlink_port = info->user_ptr[1];
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_port *devlink_port = devlink_nl_ctx(info)->devlink_port;
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	enum devlink_sb_pool_type pool_type;
 	struct devlink_sb *devlink_sb;
 	u16 tc_index;
@@ -902,7 +902,7 @@ int devlink_nl_sb_tc_pool_bind_set_doit(struct sk_buff *skb,
 
 int devlink_nl_sb_occ_snapshot_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	const struct devlink_ops *ops = devlink->ops;
 	struct devlink_sb *devlink_sb;
 
@@ -918,7 +918,7 @@ int devlink_nl_sb_occ_snapshot_doit(struct sk_buff *skb, struct genl_info *info)
 int devlink_nl_sb_occ_max_clear_doit(struct sk_buff *skb,
 				     struct genl_info *info)
 {
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	const struct devlink_ops *ops = devlink->ops;
 	struct devlink_sb *devlink_sb;
 
diff --git a/net/devlink/trap.c b/net/devlink/trap.c
index 8edb31654a68..793ffc66dc11 100644
--- a/net/devlink/trap.c
+++ b/net/devlink/trap.c
@@ -302,7 +302,7 @@ static int devlink_nl_trap_fill(struct sk_buff *msg, struct devlink *devlink,
 int devlink_nl_trap_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_trap_item *trap_item;
 	struct sk_buff *msg;
 	int err;
@@ -412,7 +412,7 @@ static int devlink_trap_action_set(struct devlink *devlink,
 int devlink_nl_trap_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_trap_item *trap_item;
 
 	if (list_empty(&devlink->trap_list))
@@ -511,7 +511,7 @@ devlink_nl_trap_group_fill(struct sk_buff *msg, struct devlink *devlink,
 int devlink_nl_trap_group_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_trap_group_item *group_item;
 	struct sk_buff *msg;
 	int err;
@@ -682,7 +682,7 @@ static int devlink_trap_group_set(struct devlink *devlink,
 int devlink_nl_trap_group_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_trap_group_item *group_item;
 	bool modified = false;
 	int err;
@@ -804,7 +804,7 @@ int devlink_nl_trap_policer_get_doit(struct sk_buff *skb,
 {
 	struct devlink_trap_policer_item *policer_item;
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 	struct sk_buff *msg;
 	int err;
 
@@ -924,7 +924,7 @@ int devlink_nl_trap_policer_set_doit(struct sk_buff *skb,
 {
 	struct devlink_trap_policer_item *policer_item;
 	struct netlink_ext_ack *extack = info->extack;
-	struct devlink *devlink = info->user_ptr[0];
+	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
 
 	if (list_empty(&devlink->trap_policer_list))
 		return -EOPNOTSUPP;
-- 
2.44.0



* [PATCH net-next V10 06/14] devlink: Allow parent dev for rate-set and rate-new
From: Tariq Toukan @ 2026-07-01  7:32 UTC (permalink / raw)
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

Currently, a devlink rate's parent device is assumed to be the same as
the one where the devlink rate is created.

This patch lets the rate-set and rate-new commands accept an additional
'parent-dev' argument that specifies the parent device. This allows
devlink rate groups to contain leafs from other devices.

Example of the new usage with ynl:

Creating a group on pci/0000:08:00.1 whose parent is the already
existing pci/0000:08:00.1/group1:
./tools/net/ynl/pyynl/cli.py --spec \
Documentation/netlink/specs/devlink.yaml --do rate-new --json '{
    "bus-name": "pci",
    "dev-name": "0000:08:00.1",
    "rate-node-name": "group2",
    "rate-parent-node-name": "group1",
    "parent-dev": {
        "bus-name": "pci",
        "dev-name": "0000:08:00.1"
    }
  }'

Setting the parent of leaf node pci/0000:08:00.1/65537 to
pci/0000:08:00.0/group1:
./tools/net/ynl/pyynl/cli.py --spec \
Documentation/netlink/specs/devlink.yaml --do rate-set --json '{
    "bus-name": "pci",
    "dev-name": "0000:08:00.1",
    "port-index": 65537,
    "parent-dev": {
        "bus-name": "pci",
        "dev-name": "0000:08:00.0"
    },
    "rate-parent-node-name": "group1"
  }'
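
The two requests above differ only in which device owns the parent
node. As a rough user-space sketch (not kernel code), the resolution
rule documented for 'parent-dev' in devlink.yaml, where an absent
attribute means the parent is looked up on the same instance, could be
modelled as below. All model_* names are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a devlink handle (bus-name + dev-name). */
struct model_dev_id {
	const char *bus_name;
	const char *dev_name;
};

/* Hypothetical model of a rate-set / rate-new request. */
struct model_rate_request {
	struct model_dev_id dev;		/* target of the request */
	const struct model_dev_id *parent_dev;	/* NULL when "parent-dev" is absent */
	const char *parent_node_name;		/* "rate-parent-node-name" */
};

/* Return the device on which the parent node name is resolved. */
const struct model_dev_id *
model_parent_owner(const struct model_rate_request *req)
{
	assert(req);
	return req->parent_dev ? req->parent_dev : &req->dev;
}

/* True when the parent node lives on a different device than the target. */
bool model_is_cross_device(const struct model_rate_request *req)
{
	const struct model_dev_id *owner = model_parent_owner(req);

	return strcmp(owner->bus_name, req->dev.bus_name) ||
	       strcmp(owner->dev_name, req->dev.dev_name);
}
```

With this model, the first request (parent-dev equal to the target
device) is not cross-device, while the second one (target 0000:08:00.1,
parent-dev 0000:08:00.0) is.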

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 Documentation/netlink/specs/devlink.yaml | 10 +++---
 net/devlink/netlink.c                    | 40 +++++++++++++++++++++++-
 net/devlink/netlink_gen.c                | 24 +++++++++-----
 net/devlink/netlink_gen.h                |  8 +++++
 net/devlink/rate.c                       |  4 ++-
 5 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 13d960b3abb1..38b1190f3d26 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -2309,8 +2309,8 @@ operations:
       dont-validate: [strict]
       flags: [admin-perm]
       do:
-        pre: devlink-nl-pre-doit
-        post: devlink-nl-post-doit
+        pre: devlink-nl-pre-doit-parent-dev-optional
+        post: devlink-nl-post-doit-parent-dev-optional
         request:
           attributes:
             - bus-name
@@ -2323,6 +2323,7 @@ operations:
             - rate-tx-weight
             - rate-parent-node-name
             - rate-tc-bws
+            - parent-dev
 
     -
       name: rate-new
@@ -2331,8 +2332,8 @@ operations:
       dont-validate: [strict]
       flags: [admin-perm]
       do:
-        pre: devlink-nl-pre-doit
-        post: devlink-nl-post-doit
+        pre: devlink-nl-pre-doit-parent-dev-optional
+        post: devlink-nl-post-doit-parent-dev-optional
         request:
           attributes:
             - bus-name
@@ -2345,6 +2346,7 @@ operations:
             - rate-tx-weight
             - rate-parent-node-name
             - rate-tc-bws
+            - parent-dev
 
     -
       name: rate-del
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 5a057dc86b0f..300580c1a217 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -243,7 +243,29 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
 struct devlink *
 devlink_get_parent_from_attrs_lock(struct net *net, struct nlattr **attrs)
 {
-	return ERR_PTR(-EOPNOTSUPP);
+	unsigned int maxtype = ARRAY_SIZE(devlink_dl_parent_dev_nl_policy) - 1;
+	struct devlink *devlink;
+	struct nlattr **tb;
+	int err;
+
+	if (!attrs[DEVLINK_ATTR_PARENT_DEV])
+		return ERR_PTR(-EINVAL);
+
+	tb = kcalloc(maxtype + 1, sizeof(*tb), GFP_KERNEL);
+	if (!tb)
+		return ERR_PTR(-ENOMEM);
+
+	err = nla_parse_nested(tb, maxtype, attrs[DEVLINK_ATTR_PARENT_DEV],
+			       devlink_dl_parent_dev_nl_policy, NULL);
+	if (err)
+		goto out;
+
+	devlink = devlink_get_from_attrs_lock(net, tb, false);
+	kfree(tb);
+	return devlink;
+out:
+	kfree(tb);
+	return ERR_PTR(err);
 }
 
 static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
@@ -322,6 +344,14 @@ int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
 	return __devlink_nl_pre_doit(skb, info, DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT);
 }
 
+int devlink_nl_pre_doit_parent_dev_optional(const struct genl_split_ops *ops,
+					    struct sk_buff *skb,
+					    struct genl_info *info)
+{
+	return __devlink_nl_pre_doit(skb, info,
+				     DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV);
+}
+
 static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
 				   u8 flags)
 {
@@ -348,6 +378,14 @@ devlink_nl_post_doit_dev_lock(const struct genl_split_ops *ops,
 	__devlink_nl_post_doit(skb, info, DEVLINK_NL_FLAG_NEED_DEV_LOCK);
 }
 
+void
+devlink_nl_post_doit_parent_dev_optional(const struct genl_split_ops *ops,
+					 struct sk_buff *skb,
+					 struct genl_info *info)
+{
+	__devlink_nl_post_doit(skb, info, DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV);
+}
+
 static int devlink_nl_inst_single_dumpit(struct sk_buff *msg,
 					 struct netlink_callback *cb, int flags,
 					 devlink_nl_dump_one_func_t *dump_one,
diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c
index f52b0c2b19ed..dec00133178d 100644
--- a/net/devlink/netlink_gen.c
+++ b/net/devlink/netlink_gen.c
@@ -46,6 +46,12 @@ devlink_attr_param_type_validate(const struct nlattr *attr,
 }
 
 /* Common nested types */
+const struct nla_policy devlink_dl_parent_dev_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+	[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
+	[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
+	[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
+};
+
 const struct nla_policy devlink_dl_port_function_nl_policy[DEVLINK_PORT_FN_ATTR_CAPS + 1] = {
 	[DEVLINK_PORT_FUNCTION_ATTR_HW_ADDR] = { .type = NLA_BINARY, },
 	[DEVLINK_PORT_FN_ATTR_STATE] = NLA_POLICY_MAX(NLA_U8, 1),
@@ -608,7 +614,7 @@ static const struct nla_policy devlink_rate_get_dump_nl_policy[DEVLINK_ATTR_INDE
 };
 
 /* DEVLINK_CMD_RATE_SET - do */
-static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_PARENT_DEV + 1] = {
 	[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
@@ -619,10 +625,11 @@ static const struct nla_policy devlink_rate_set_nl_policy[DEVLINK_ATTR_INDEX + 1
 	[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32, },
 	[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_RATE_TC_BWS] = NLA_POLICY_NESTED(devlink_dl_rate_tc_bws_nl_policy),
+	[DEVLINK_ATTR_PARENT_DEV] = NLA_POLICY_NESTED(devlink_dl_parent_dev_nl_policy),
 };
 
 /* DEVLINK_CMD_RATE_NEW - do */
-static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_INDEX + 1] = {
+static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_PARENT_DEV + 1] = {
 	[DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_INDEX] = NLA_POLICY_FULL_RANGE(NLA_UINT, &devlink_attr_index_range),
@@ -633,6 +640,7 @@ static const struct nla_policy devlink_rate_new_nl_policy[DEVLINK_ATTR_INDEX + 1
 	[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32, },
 	[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] = { .type = NLA_NUL_STRING, },
 	[DEVLINK_ATTR_RATE_TC_BWS] = NLA_POLICY_NESTED(devlink_dl_rate_tc_bws_nl_policy),
+	[DEVLINK_ATTR_PARENT_DEV] = NLA_POLICY_NESTED(devlink_dl_parent_dev_nl_policy),
 };
 
 /* DEVLINK_CMD_RATE_DEL - do */
@@ -1290,21 +1298,21 @@ const struct genl_split_ops devlink_nl_ops[75] = {
 	{
 		.cmd		= DEVLINK_CMD_RATE_SET,
 		.validate	= GENL_DONT_VALIDATE_STRICT,
-		.pre_doit	= devlink_nl_pre_doit,
+		.pre_doit	= devlink_nl_pre_doit_parent_dev_optional,
 		.doit		= devlink_nl_rate_set_doit,
-		.post_doit	= devlink_nl_post_doit,
+		.post_doit	= devlink_nl_post_doit_parent_dev_optional,
 		.policy		= devlink_rate_set_nl_policy,
-		.maxattr	= DEVLINK_ATTR_INDEX,
+		.maxattr	= DEVLINK_ATTR_PARENT_DEV,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
 	{
 		.cmd		= DEVLINK_CMD_RATE_NEW,
 		.validate	= GENL_DONT_VALIDATE_STRICT,
-		.pre_doit	= devlink_nl_pre_doit,
+		.pre_doit	= devlink_nl_pre_doit_parent_dev_optional,
 		.doit		= devlink_nl_rate_new_doit,
-		.post_doit	= devlink_nl_post_doit,
+		.post_doit	= devlink_nl_post_doit_parent_dev_optional,
 		.policy		= devlink_rate_new_nl_policy,
-		.maxattr	= DEVLINK_ATTR_INDEX,
+		.maxattr	= DEVLINK_ATTR_PARENT_DEV,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
 	{
diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h
index 20034b0929a8..a70e0e4769aa 100644
--- a/net/devlink/netlink_gen.h
+++ b/net/devlink/netlink_gen.h
@@ -13,6 +13,7 @@
 #include <uapi/linux/devlink.h>
 
 /* Common nested types */
+extern const struct nla_policy devlink_dl_parent_dev_nl_policy[DEVLINK_ATTR_INDEX + 1];
 extern const struct nla_policy devlink_dl_port_function_nl_policy[DEVLINK_PORT_FN_ATTR_CAPS + 1];
 extern const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_RATE_TC_ATTR_BW + 1];
 extern const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_ID_FLASH + 1];
@@ -29,12 +30,19 @@ int devlink_nl_pre_doit_port_optional(const struct genl_split_ops *ops,
 				      struct genl_info *info);
 int devlink_nl_pre_doit_dev_lock(const struct genl_split_ops *ops,
 				 struct sk_buff *skb, struct genl_info *info);
+int devlink_nl_pre_doit_parent_dev_optional(const struct genl_split_ops *ops,
+					    struct sk_buff *skb,
+					    struct genl_info *info);
 void
 devlink_nl_post_doit(const struct genl_split_ops *ops, struct sk_buff *skb,
 		     struct genl_info *info);
 void
 devlink_nl_post_doit_dev_lock(const struct genl_split_ops *ops,
 			      struct sk_buff *skb, struct genl_info *info);
+void
+devlink_nl_post_doit_parent_dev_optional(const struct genl_split_ops *ops,
+					 struct sk_buff *skb,
+					 struct genl_info *info);
 
 int devlink_nl_get_doit(struct sk_buff *skb, struct genl_info *info);
 int devlink_nl_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 295f4185fdfd..78a59d79c2ea 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -663,9 +663,11 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
 
 int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
+	struct devlink *devlink = ctx->devlink;
 	struct devlink_rate *rate_node;
 	const struct devlink_ops *ops;
+	struct devlink *rate_devlink;
 	int err;
 
 	ops = devlink->ops;
-- 
2.44.0



* [PATCH net-next V10 05/14] devlink: Add parent dev to devlink API
From: Tariq Toukan @ 2026-07-01  7:32 UTC (permalink / raw)
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

Upcoming changes to the rate commands need the parent devlink specified.
This change adds a nested 'parent-dev' attribute to the API and helpers
to obtain and put a reference to the parent devlink instance in
info->ctx.

To avoid deadlocks, the parent devlink is unlocked before the main
devlink instance, which is the target of the request, is obtained.
A reference to the parent is kept until the end of the request so that
it cannot disappear in the meantime.

Because the parent is not locked, this reference is of limited use
without additional protection.
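
For readers who want to trace the lock and reference ordering without
the netlink plumbing, here is a minimal user-space sketch of the
sequence used by the pre_doit/post_doit pair in this patch. It is only
a model: struct model_devlink, struct model_ctx and all model_*
helpers are hypothetical, and real devlink instances use a mutex and a
proper refcount rather than a flag and an int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct devlink: a refcount and a lock flag. */
struct model_devlink {
	int refcount;
	bool locked;
};

/* Hypothetical stand-in for struct devlink_nl_ctx. */
struct model_ctx {
	struct model_devlink *devlink;
	struct model_devlink *parent_devlink;
};

/* Models a lookup that returns the instance referenced and locked. */
static void model_get_and_lock(struct model_devlink *dl)
{
	dl->refcount++;
	assert(!dl->locked);
	dl->locked = true;
}

static void model_unlock(struct model_devlink *dl)
{
	assert(dl->locked);
	dl->locked = false;
}

static void model_put(struct model_devlink *dl)
{
	assert(dl->refcount > 0);
	dl->refcount--;
}

/* The optional parent is looked up first and unlocked right away, keeping
 * only its reference, so two instance locks are never held at once.
 */
void model_pre_doit(struct model_ctx *ctx, struct model_devlink *target,
		    struct model_devlink *parent)
{
	ctx->parent_devlink = NULL;
	if (parent) {
		model_get_and_lock(parent);
		ctx->parent_devlink = parent;
		model_unlock(parent);
	}
	model_get_and_lock(target);
	ctx->devlink = target;
}

/* Unlock and put the target, then drop the parent reference if any. */
void model_post_doit(struct model_ctx *ctx)
{
	model_unlock(ctx->devlink);
	model_put(ctx->devlink);
	if (ctx->parent_devlink)
		model_put(ctx->parent_devlink);
}
```

Between the two calls the parent is referenced but not locked, which is
why the handler may rely on the pointer staying valid but not on the
parent's mutable state.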

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 Documentation/netlink/specs/devlink.yaml | 20 +++++++++++++
 include/uapi/linux/devlink.h             |  2 ++
 net/devlink/devl_internal.h              |  3 ++
 net/devlink/netlink.c                    | 36 ++++++++++++++++++++----
 4 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml
index 52ad1e7805d1..13d960b3abb1 100644
--- a/Documentation/netlink/specs/devlink.yaml
+++ b/Documentation/netlink/specs/devlink.yaml
@@ -895,6 +895,16 @@ attribute-sets:
           resource-dump response. Bit 0 (dev) selects device-level
           resources; bit 1 (port) selects port-level resources.
           When absent all classes are returned.
+      -
+        name: parent-dev
+        type: nest
+        nested-attributes: dl-parent-dev
+        doc: |
+          Identifies the devlink instance which owns the parent rate node.
+          Used with rate-set and rate-new to parent a rate object to a node on
+          a different devlink instance, enabling cross-device rate scheduling.
+          When absent, the parent node is resolved on the same instance.
+
   -
     name: dl-dev-stats
     subset-of: devlink
@@ -1317,6 +1327,16 @@ attribute-sets:
              Specifies the bandwidth share assigned to the Traffic Class.
              The bandwidth for the traffic class is determined
              in proportion to the sum of the shares of all configured classes.
+  -
+    name: dl-parent-dev
+    subset-of: devlink
+    attributes:
+      -
+        name: bus-name
+      -
+        name: dev-name
+      -
+        name: index
 
 operations:
   enum-model: directional
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index ca713bcc47b9..a6801feb7744 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -648,6 +648,8 @@ enum devlink_attr {
 	DEVLINK_ATTR_INDEX,			/* uint */
 	DEVLINK_ATTR_RESOURCE_SCOPE_MASK,	/* u32 */
 
+	DEVLINK_ATTR_PARENT_DEV,		/* nested */
+
 	/* Add new attributes above here, update the spec in
 	 * Documentation/netlink/specs/devlink.yaml and re-generate
 	 * net/devlink/netlink_gen.c.
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 52c8bf359dd4..cdf894ba5a9d 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -154,6 +154,7 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
 struct devlink_nl_ctx {
 	struct devlink *devlink;
 	struct devlink_port *devlink_port;
+	struct devlink *parent_devlink;
 };
 
 static inline struct devlink_nl_ctx *
@@ -197,6 +198,8 @@ typedef int devlink_nl_dump_one_func_t(struct sk_buff *msg,
 struct devlink *
 devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
 			    bool dev_lock);
+struct devlink *
+devlink_get_parent_from_attrs_lock(struct net *net, struct nlattr **attrs);
 
 int devlink_nl_dumpit(struct sk_buff *msg, struct netlink_callback *cb,
 		      devlink_nl_dump_one_func_t *dump_one);
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index f0a857e286bc..5a057dc86b0f 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -12,6 +12,7 @@
 #define DEVLINK_NL_FLAG_NEED_PORT		BIT(0)
 #define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT	BIT(1)
 #define DEVLINK_NL_FLAG_NEED_DEV_LOCK		BIT(2)
+#define DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV	BIT(3)
 
 static const struct genl_multicast_group devlink_nl_mcgrps[] = {
 	[DEVLINK_MCGRP_CONFIG] = { .name = DEVLINK_GENL_MCGRP_CONFIG_NAME },
@@ -239,19 +240,39 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
 	return ERR_PTR(-ENODEV);
 }
 
+struct devlink *
+devlink_get_parent_from_attrs_lock(struct net *net, struct nlattr **attrs)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
 				 u8 flags)
 {
+	bool parent_dev = flags & DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV;
 	bool dev_lock = flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK;
+	struct devlink *devlink, *parent_devlink = NULL;
+	struct net *net = genl_info_net(info);
+	struct nlattr **attrs = info->attrs;
 	struct devlink_port *devlink_port;
-	struct devlink *devlink;
 	int err;
 
-	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
-					      dev_lock);
-	if (IS_ERR(devlink))
-		return PTR_ERR(devlink);
+	if (parent_dev && attrs[DEVLINK_ATTR_PARENT_DEV]) {
+		parent_devlink = devlink_get_parent_from_attrs_lock(net, attrs);
+		if (IS_ERR(parent_devlink))
+			return PTR_ERR(parent_devlink);
+		devlink_nl_ctx(info)->parent_devlink = parent_devlink;
+		/* Drop the parent devlink lock but don't release the reference.
+		 * This will keep it alive until the end of the request.
+		 */
+		devl_unlock(parent_devlink);
+	}
 
+	devlink = devlink_get_from_attrs_lock(net, attrs, dev_lock);
+	if (IS_ERR(devlink)) {
+		err = PTR_ERR(devlink);
+		goto parent_put;
+	}
 	devlink_nl_ctx(info)->devlink = devlink;
 	if (flags & DEVLINK_NL_FLAG_NEED_PORT) {
 		devlink_port = devlink_port_get_from_info(devlink, info);
@@ -270,6 +291,9 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
 unlock:
 	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
+parent_put:
+	if (parent_dev && parent_devlink)
+		devlink_put(parent_devlink);
 	return err;
 }
 
@@ -307,6 +331,8 @@ static void __devlink_nl_post_doit(struct sk_buff *skb, struct genl_info *info,
 	devlink = devlink_nl_ctx(info)->devlink;
 	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
+	if (devlink_nl_ctx(info)->parent_devlink)
+		devlink_put(devlink_nl_ctx(info)->parent_devlink);
 }
 
 void devlink_nl_post_doit(const struct genl_split_ops *ops,
-- 
2.44.0



* [PATCH net-next V10 04/14] devlink: Decouple rate storage from associated devlink object
From: Tariq Toukan @ 2026-07-01  7:32 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

Devlink rate leafs and nodes were stored in their respective devlink
objects pointed to by devlink_rate->devlink.

This patch removes that association by introducing the concept of
'rate node devlink', which is where all rates that could link to each
other are stored. For now this is the same as devlink_rate->devlink.

After this patch, the rates stored in a rate node devlink instance may
belong to multiple other devlink instances. All rate node manipulation
code is therefore updated to:
- compare against the owning devlink object during iteration.
- acquire additional locks where needed (a no-op for now).
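
As a rough illustration of the first point, the sketch below models a
lookup over a list that may hold rates owned by several devlink
instances. It is a userspace stand-in, not kernel code: struct
model_devlink, struct model_rate and model_rate_node_get_by_name() are
hypothetical names, and a plain array replaces the kernel list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for struct devlink / struct devlink_rate. */
struct model_devlink {
	int id;
};

struct model_rate {
	const struct model_devlink *owner; /* role of devlink_rate->devlink */
	const char *name;
	int is_node;
};

/* Look up a node by name in a list that may hold rates of several owners.
 * A name match alone is not enough once the list is shared, so the owner
 * is compared as well.
 */
static const struct model_rate *
model_rate_node_get_by_name(const struct model_rate *rates, size_t n,
			    const struct model_devlink *owner,
			    const char *name)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (rates[i].owner == owner && rates[i].is_node &&
		    !strcmp(rates[i].name, name))
			return &rates[i];
	}
	return NULL;
}
```

With two owners that both have a node called "group1", the lookup
returns a different entry for each owner, which is the behaviour the
owner comparison is meant to preserve.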

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/devlink/rate.c | 249 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 177 insertions(+), 72 deletions(-)

diff --git a/net/devlink/rate.c b/net/devlink/rate.c
index 630441e429b3..295f4185fdfd 100644
--- a/net/devlink/rate.c
+++ b/net/devlink/rate.c
@@ -30,13 +30,25 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
 	return devlink_rate ?: ERR_PTR(-ENODEV);
 }
 
+static struct devlink *devl_rate_lock(struct devlink *devlink)
+{
+	return devlink;
+}
+
+static void devl_rate_unlock(struct devlink *devlink,
+			     struct devlink *rate_devlink)
+{
+}
+
 static struct devlink_rate *
-devlink_rate_node_get_by_name(struct devlink *devlink, const char *node_name)
+devlink_rate_node_get_by_name(struct devlink *rate_devlink,
+			      struct devlink *devlink, const char *node_name)
 {
 	struct devlink_rate *devlink_rate;
 
-	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
-		if (devlink_rate_is_node(devlink_rate) &&
+	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
+		if (devlink_rate->devlink == devlink &&
+		    devlink_rate_is_node(devlink_rate) &&
 		    !strcmp(node_name, devlink_rate->name))
 			return devlink_rate;
 	}
@@ -44,7 +56,8 @@ devlink_rate_node_get_by_name(struct devlink *devlink, const char *node_name)
 }
 
 static struct devlink_rate *
-devlink_rate_node_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
+devlink_rate_node_get_from_attrs(struct devlink *rate_devlink,
+				 struct devlink *devlink, struct nlattr **attrs)
 {
 	const char *rate_node_name;
 	size_t len;
@@ -57,24 +70,30 @@ devlink_rate_node_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
 	if (!len || strspn(rate_node_name, "0123456789") == len)
 		return ERR_PTR(-EINVAL);
 
-	return devlink_rate_node_get_by_name(devlink, rate_node_name);
+	return devlink_rate_node_get_by_name(rate_devlink, devlink,
+					     rate_node_name);
 }
 
 static struct devlink_rate *
-devlink_rate_node_get_from_info(struct devlink *devlink, struct genl_info *info)
+devlink_rate_node_get_from_info(struct devlink *rate_devlink,
+				struct devlink *devlink,
+				struct genl_info *info)
 {
-	return devlink_rate_node_get_from_attrs(devlink, info->attrs);
+	return devlink_rate_node_get_from_attrs(rate_devlink, devlink,
+						info->attrs);
 }
 
 static struct devlink_rate *
-devlink_rate_get_from_info(struct devlink *devlink, struct genl_info *info)
+devlink_rate_get_from_info(struct devlink *rate_devlink,
+			   struct devlink *devlink, struct genl_info *info)
 {
 	struct nlattr **attrs = info->attrs;
 
 	if (attrs[DEVLINK_ATTR_PORT_INDEX])
 		return devlink_rate_leaf_get_from_info(devlink, info);
 	else if (attrs[DEVLINK_ATTR_RATE_NODE_NAME])
-		return devlink_rate_node_get_from_info(devlink, info);
+		return devlink_rate_node_get_from_info(rate_devlink, devlink,
+						       info);
 	else
 		return ERR_PTR(-EINVAL);
 }
@@ -190,17 +209,25 @@ static void devlink_rate_notify(struct devlink_rate *devlink_rate,
 void devlink_rates_notify_register(struct devlink *devlink)
 {
 	struct devlink_rate *rate_node;
+	struct devlink *rate_devlink;
 
-	list_for_each_entry(rate_node, &devlink->rate_list, list)
-		devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+	rate_devlink = devl_rate_lock(devlink);
+	list_for_each_entry(rate_node, &rate_devlink->rate_list, list)
+		if (rate_node->devlink == devlink)
+			devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+	devl_rate_unlock(devlink, rate_devlink);
 }
 
 void devlink_rates_notify_unregister(struct devlink *devlink)
 {
 	struct devlink_rate *rate_node;
+	struct devlink *rate_devlink;
 
-	list_for_each_entry_reverse(rate_node, &devlink->rate_list, list)
-		devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
+	rate_devlink = devl_rate_lock(devlink);
+	list_for_each_entry_reverse(rate_node, &rate_devlink->rate_list, list)
+		if (rate_node->devlink == devlink)
+			devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
+	devl_rate_unlock(devlink, rate_devlink);
 }
 
 static int
@@ -209,17 +236,20 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
 {
 	struct devlink_nl_dump_state *state = devlink_dump_state(cb);
 	struct devlink_rate *devlink_rate;
+	struct devlink *rate_devlink;
 	int idx = 0;
 	int err = 0;
 
-	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
+	rate_devlink = devl_rate_lock(devlink);
+	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
 		enum devlink_command cmd = DEVLINK_CMD_RATE_NEW;
 		u32 id = NETLINK_CB(cb->skb).portid;
 
-		if (idx < state->idx) {
+		if (idx < state->idx || devlink_rate->devlink != devlink) {
 			idx++;
 			continue;
 		}
+
 		err = devlink_nl_rate_fill(msg, devlink_rate, cmd, id,
 					   cb->nlh->nlmsg_seq, flags, NULL);
 		if (err) {
@@ -228,6 +258,7 @@ devlink_nl_rate_get_dump_one(struct sk_buff *msg, struct devlink *devlink,
 		}
 		idx++;
 	}
+	devl_rate_unlock(devlink, rate_devlink);
 
 	return err;
 }
@@ -239,28 +270,38 @@ int devlink_nl_rate_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 
 int devlink_nl_rate_get_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *devlink_rate;
 	struct sk_buff *msg;
 	int err;
 
-	devlink_rate = devlink_rate_get_from_info(devlink, info);
-	if (IS_ERR(devlink_rate))
-		return PTR_ERR(devlink_rate);
+	rate_devlink = devl_rate_lock(devlink);
+	devlink_rate = devlink_rate_get_from_info(rate_devlink, devlink, info);
+	if (IS_ERR(devlink_rate)) {
+		err = PTR_ERR(devlink_rate);
+		goto unlock;
+	}
 
 	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
-	if (!msg)
-		return -ENOMEM;
+	if (!msg) {
+		err = -ENOMEM;
+		goto unlock;
+	}
 
 	err = devlink_nl_rate_fill(msg, devlink_rate, DEVLINK_CMD_RATE_NEW,
 				   info->snd_portid, info->snd_seq, 0,
 				   info->extack);
-	if (err) {
-		nlmsg_free(msg);
-		return err;
-	}
+	if (err)
+		goto err_fill;
 
+	devl_rate_unlock(devlink, rate_devlink);
 	return genlmsg_reply(msg, info);
+
+err_fill:
+	nlmsg_free(msg);
+unlock:
+	devl_rate_unlock(devlink, rate_devlink);
+	return err;
 }
 
 static bool
@@ -277,6 +318,7 @@ devlink_rate_is_parent_node(struct devlink_rate *devlink_rate,
 
 static int
 devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
+				struct devlink *rate_devlink,
 				struct genl_info *info,
 				struct nlattr *nla_parent)
 {
@@ -304,7 +346,8 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
 		refcount_dec(&parent->refcnt);
 		devlink_rate->parent = NULL;
 	} else if (len) {
-		parent = devlink_rate_node_get_by_name(devlink, parent_name);
+		parent = devlink_rate_node_get_by_name(rate_devlink, devlink,
+						       parent_name);
 		if (IS_ERR(parent))
 			return -ENODEV;
 
@@ -423,6 +466,7 @@ static int devlink_nl_rate_tc_bw_set(struct devlink_rate *devlink_rate,
 }
 
 static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
+			       struct devlink *rate_devlink,
 			       const struct devlink_ops *ops,
 			       struct genl_info *info)
 {
@@ -497,7 +541,8 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
 	 */
 	nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
 	if (nla_parent) {
-		err = devlink_nl_rate_parent_node_set(devlink_rate, info,
+		err = devlink_nl_rate_parent_node_set(devlink_rate,
+						      rate_devlink, info,
 						      nla_parent);
 		if (err)
 			return err;
@@ -588,29 +633,37 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
 
 int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *devlink_rate;
 	const struct devlink_ops *ops;
 	int err;
 
-	devlink_rate = devlink_rate_get_from_info(devlink, info);
-	if (IS_ERR(devlink_rate))
-		return PTR_ERR(devlink_rate);
+	rate_devlink = devl_rate_lock(devlink);
+	devlink_rate = devlink_rate_get_from_info(rate_devlink, devlink, info);
+	if (IS_ERR(devlink_rate)) {
+		err = PTR_ERR(devlink_rate);
+		goto unlock;
+	}
 
 	ops = devlink->ops;
-	if (!ops || !devlink_rate_set_ops_supported(ops, info, devlink_rate->type))
-		return -EOPNOTSUPP;
+	if (!ops ||
+	    !devlink_rate_set_ops_supported(ops, info, devlink_rate->type)) {
+		err = -EOPNOTSUPP;
+		goto unlock;
+	}
 
-	err = devlink_nl_rate_set(devlink_rate, ops, info);
+	err = devlink_nl_rate_set(devlink_rate, rate_devlink, ops, info);
 
 	if (!err)
 		devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+unlock:
+	devl_rate_unlock(devlink, rate_devlink);
 	return err;
 }
 
 int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *rate_node;
 	const struct devlink_ops *ops;
 	int err;
@@ -624,15 +677,22 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 	if (!devlink_rate_set_ops_supported(ops, info, DEVLINK_RATE_TYPE_NODE))
 		return -EOPNOTSUPP;
 
-	rate_node = devlink_rate_node_get_from_attrs(devlink, info->attrs);
-	if (!IS_ERR(rate_node))
-		return -EEXIST;
-	else if (rate_node == ERR_PTR(-EINVAL))
-		return -EINVAL;
+	rate_devlink = devl_rate_lock(devlink);
+	rate_node = devlink_rate_node_get_from_attrs(rate_devlink, devlink,
+						     info->attrs);
+	if (!IS_ERR(rate_node)) {
+		err = -EEXIST;
+		goto unlock;
+	} else if (rate_node == ERR_PTR(-EINVAL)) {
+		err = -EINVAL;
+		goto unlock;
+	}
 
 	rate_node = kzalloc_obj(*rate_node);
-	if (!rate_node)
-		return -ENOMEM;
+	if (!rate_node) {
+		err = -ENOMEM;
+		goto unlock;
+	}
 
 	rate_node->devlink = devlink;
 	rate_node->type = DEVLINK_RATE_TYPE_NODE;
@@ -646,13 +706,14 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 	if (err)
 		goto err_node_new;
 
-	err = devlink_nl_rate_set(rate_node, ops, info);
+	err = devlink_nl_rate_set(rate_node, rate_devlink, ops, info);
 	if (err)
 		goto err_rate_set;
 
 	refcount_set(&rate_node->refcnt, 1);
-	list_add(&rate_node->list, &devlink->rate_list);
+	list_add(&rate_node->list, &rate_devlink->rate_list);
 	devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+	devl_rate_unlock(devlink, rate_devlink);
 	return 0;
 
 err_rate_set:
@@ -661,22 +722,29 @@ int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
 	kfree(rate_node->name);
 err_strdup:
 	kfree(rate_node);
+unlock:
+	devl_rate_unlock(devlink, rate_devlink);
 	return err;
 }
 
 int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	struct devlink *devlink = devlink_nl_ctx(info)->devlink;
+	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
 	struct devlink_rate *rate_node;
 	int err;
 
-	rate_node = devlink_rate_node_get_from_info(devlink, info);
-	if (IS_ERR(rate_node))
-		return PTR_ERR(rate_node);
+	rate_devlink = devl_rate_lock(devlink);
+	rate_node = devlink_rate_node_get_from_info(rate_devlink, devlink,
+						    info);
+	if (IS_ERR(rate_node)) {
+		err = PTR_ERR(rate_node);
+		goto unlock;
+	}
 
 	if (refcount_read(&rate_node->refcnt) > 1) {
 		NL_SET_ERR_MSG(info->extack, "Node has children. Cannot delete node.");
-		return -EBUSY;
+		err = -EBUSY;
+		goto unlock;
 	}
 
 	devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_DEL);
@@ -687,6 +755,8 @@ int devlink_nl_rate_del_doit(struct sk_buff *skb, struct genl_info *info)
 	list_del(&rate_node->list);
 	kfree(rate_node->name);
 	kfree(rate_node);
+unlock:
+	devl_rate_unlock(devlink, rate_devlink);
 	return err;
 }
 
@@ -695,14 +765,20 @@ int devlink_rates_check(struct devlink *devlink,
 			struct netlink_ext_ack *extack)
 {
 	struct devlink_rate *devlink_rate;
+	struct devlink *rate_devlink;
+	int err = 0;
 
-	list_for_each_entry(devlink_rate, &devlink->rate_list, list)
-		if (!rate_filter || rate_filter(devlink_rate)) {
+	rate_devlink = devl_rate_lock(devlink);
+	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list)
+		if (devlink_rate->devlink == devlink &&
+		    (!rate_filter || rate_filter(devlink_rate))) {
 			if (extack)
 				NL_SET_ERR_MSG(extack, "Rate node(s) exists.");
-			return -EBUSY;
+			err = -EBUSY;
+			break;
 		}
-	return 0;
+	devl_rate_unlock(devlink, rate_devlink);
+	return err;
 }
 
 /**
@@ -719,14 +795,21 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
 		      struct devlink_rate *parent)
 {
 	struct devlink_rate *rate_node;
-
-	rate_node = devlink_rate_node_get_by_name(devlink, node_name);
-	if (!IS_ERR(rate_node))
-		return ERR_PTR(-EEXIST);
+	struct devlink *rate_devlink;
+
+	rate_devlink = devl_rate_lock(devlink);
+	rate_node = devlink_rate_node_get_by_name(rate_devlink, devlink,
+						  node_name);
+	if (!IS_ERR(rate_node)) {
+		rate_node = ERR_PTR(-EEXIST);
+		goto unlock;
+	}
 
 	rate_node = kzalloc_obj(*rate_node);
-	if (!rate_node)
-		return ERR_PTR(-ENOMEM);
+	if (!rate_node) {
+		rate_node = ERR_PTR(-ENOMEM);
+		goto unlock;
+	}
 
 	rate_node->type = DEVLINK_RATE_TYPE_NODE;
 	rate_node->devlink = devlink;
@@ -735,7 +818,8 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
 	rate_node->name = kstrdup(node_name, GFP_KERNEL);
 	if (!rate_node->name) {
 		kfree(rate_node);
-		return ERR_PTR(-ENOMEM);
+		rate_node = ERR_PTR(-ENOMEM);
+		goto unlock;
 	}
 
 	if (parent) {
@@ -744,8 +828,10 @@ devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
 	}
 
 	refcount_set(&rate_node->refcnt, 1);
-	list_add(&rate_node->list, &devlink->rate_list);
+	list_add(&rate_node->list, &rate_devlink->rate_list);
 	devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+unlock:
+	devl_rate_unlock(devlink, rate_devlink);
 	return rate_node;
 }
 EXPORT_SYMBOL_GPL(devl_rate_node_create);
@@ -761,10 +847,10 @@ EXPORT_SYMBOL_GPL(devl_rate_node_create);
 int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
 			  struct devlink_rate *parent)
 {
-	struct devlink *devlink = devlink_port->devlink;
+	struct devlink *rate_devlink, *devlink = devlink_port->devlink;
 	struct devlink_rate *devlink_rate;
 
-	devl_assert_locked(devlink_port->devlink);
+	devl_assert_locked(devlink);
 
 	if (WARN_ON(devlink_port->devlink_rate))
 		return -EBUSY;
@@ -773,6 +859,7 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
 	if (!devlink_rate)
 		return -ENOMEM;
 
+	rate_devlink = devl_rate_lock(devlink);
 	if (parent) {
 		devlink_rate->parent = parent;
 		refcount_inc(&devlink_rate->parent->refcnt);
@@ -782,9 +869,10 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
 	devlink_rate->devlink = devlink;
 	devlink_rate->devlink_port = devlink_port;
 	devlink_rate->priv = priv;
-	list_add_tail(&devlink_rate->list, &devlink->rate_list);
+	list_add_tail(&devlink_rate->list, &rate_devlink->rate_list);
 	devlink_port->devlink_rate = devlink_rate;
 	devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+	devl_rate_unlock(devlink, rate_devlink);
 
 	return 0;
 }
@@ -800,16 +888,19 @@ EXPORT_SYMBOL_GPL(devl_rate_leaf_create);
 void devl_rate_leaf_destroy(struct devlink_port *devlink_port)
 {
 	struct devlink_rate *devlink_rate = devlink_port->devlink_rate;
+	struct devlink *rate_devlink, *devlink = devlink_port->devlink;
 
-	devl_assert_locked(devlink_port->devlink);
+	devl_assert_locked(devlink);
 	if (!devlink_rate)
 		return;
 
+	rate_devlink = devl_rate_lock(devlink);
 	devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_DEL);
 	if (devlink_rate->parent)
 		refcount_dec(&devlink_rate->parent->refcnt);
 	list_del(&devlink_rate->list);
 	devlink_port->devlink_rate = NULL;
+	devl_rate_unlock(devlink, rate_devlink);
 	kfree(devlink_rate);
 }
 EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
@@ -818,20 +909,30 @@ EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
  * devl_rate_nodes_destroy - destroy all devlink rate nodes on device
  * @devlink: devlink instance
  *
- * Unset parent for all rate objects and destroy all rate nodes
- * on specified device.
+ * Unset parent for all rate objects involving this device and destroy all rate
+ * nodes on it.
  */
 void devl_rate_nodes_destroy(struct devlink *devlink)
 {
-	const struct devlink_ops *ops = devlink->ops;
 	struct devlink_rate *devlink_rate, *tmp;
+	const struct devlink_ops *ops;
+	struct devlink *rate_devlink;
 
 	devl_assert_locked(devlink);
+	rate_devlink = devl_rate_lock(devlink);
 
-	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
-		if (!devlink_rate->parent)
+	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
+		if (!devlink_rate->parent ||
+		    (devlink_rate->devlink != devlink &&
+		     devlink_rate->parent->devlink != devlink))
 			continue;
 
+		/* This could destroy rate objects on other devlinks in the
+		 * same hierarchy under 'rate_devlink'. This is safe because
+		 * the shared common ancestor is locked so there can be no
+		 * other concurrent rate operations on devlink_rate->devlink.
+		 */
+		ops = devlink_rate->devlink->ops;
 		if (devlink_rate_is_leaf(devlink_rate))
 			ops->rate_leaf_parent_set(devlink_rate, NULL, devlink_rate->priv,
 						  NULL, NULL);
@@ -842,13 +943,17 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
 		refcount_dec(&devlink_rate->parent->refcnt);
 		devlink_rate->parent = NULL;
 	}
-	list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_list, list) {
-		if (devlink_rate_is_node(devlink_rate)) {
+	ops = devlink->ops;
+	list_for_each_entry_safe(devlink_rate, tmp, &rate_devlink->rate_list,
+				 list) {
+		if (devlink_rate->devlink == devlink &&
+		    devlink_rate_is_node(devlink_rate)) {
 			ops->rate_node_del(devlink_rate, devlink_rate->priv, NULL);
 			list_del(&devlink_rate->list);
 			kfree(devlink_rate->name);
 			kfree(devlink_rate);
 		}
 	}
+	devl_rate_unlock(devlink, rate_devlink);
 }
 EXPORT_SYMBOL_GPL(devl_rate_nodes_destroy);
-- 
2.44.0



* [PATCH net-next V10 02/14] devlink: Add a helper for getting a nested-in instance
From: Tariq Toukan @ 2026-07-01  7:32 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

Upcoming code will need to obtain references to locked nested-in
devlink instances. Add a helper to lock, reference and return the
nested-in instance.
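
The contract of such a helper can be sketched in userspace as follows.
This is only a model under stated assumptions: the struct and function
names are hypothetical, a bool stands in for the instance lock, a plain
counter stands in for the reference count, and the lookup by index is
reduced to a pointer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a devlink instance; not the kernel structure. */
struct model_devlink {
	int refcount;
	bool locked;
	bool registered;
	struct model_devlink *nested_in; /* NULL when not nested */
};

/* Return the nested-in instance locked and referenced, or NULL. */
static struct model_devlink *model_nested_in_get_lock(struct model_devlink *dl)
{
	struct model_devlink *parent = dl->nested_in;

	if (!parent)
		return NULL;
	parent->refcount++;	/* take a reference */
	parent->locked = true;	/* take the instance lock */
	if (parent->registered)
		return parent;
	/* Not registered: undo both steps so the caller owns nothing. */
	parent->locked = false;
	parent->refcount--;
	return NULL;
}

/* Caller-side release of what a successful get_lock returned. */
static void model_nested_in_unlock_put(struct model_devlink *parent)
{
	parent->locked = false;
	parent->refcount--;
}
```

The point of the model is the ownership rule: on success the caller
holds both the lock and a reference and must drop both; on NULL the
caller holds neither.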

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/devlink/core.c          | 16 ++++++++++++++++
 net/devlink/devl_internal.h |  4 ++++
 2 files changed, 20 insertions(+)

diff --git a/net/devlink/core.c b/net/devlink/core.c
index fe9f6a0a67d5..ee26c50b4118 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -67,6 +67,22 @@ static void __devlink_rel_put(struct devlink_rel *rel)
 		devlink_rel_free(rel);
 }
 
+struct devlink *__must_check devlink_nested_in_get_lock(struct devlink *devlink)
+{
+	devl_assert_locked(devlink);
+	if (!devlink->rel)
+		return NULL;
+	devlink = devlinks_xa_get(devlink->rel->nested_in.devlink_index);
+	if (!devlink)
+		return NULL;
+	devl_lock(devlink);
+	if (devl_is_registered(devlink))
+		return devlink;
+	devl_unlock(devlink);
+	devlink_put(devlink);
+	return NULL;
+}
+
 static void devlink_rel_nested_in_notify_work(struct work_struct *work)
 {
 	struct devlink_rel *rel = container_of(work, struct devlink_rel,
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index e4e48ee2da5a..36dff282f9b0 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -136,6 +136,10 @@ typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
 typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
 				      u32 rel_index);
 
+/* Returns the locked+referenced nested-in instance or NULL. */
+struct devlink *__must_check
+devlink_nested_in_get_lock(struct devlink *devlink);
+
 void devlink_rel_nested_in_clear(u32 rel_index);
 int devlink_rel_nested_in_add(u32 *rel_index, u32 devlink_index,
 			      u32 obj_index, devlink_rel_notify_cb_t *notify_cb,
-- 
2.44.0



* [PATCH net-next V10 01/14] devlink: Update nested instance locking comment
From: Tariq Toukan @ 2026-07-01  7:32 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman
In-Reply-To: <20260701073254.754518-1-tariqt@nvidia.com>

From: Cosmin Ratiu <cratiu@nvidia.com>

In commit [1] a comment about nested instance locking was updated. But
there's another place where this is mentioned, so update that as well.

[1] commit 0061b5199d7c ("devlink: Reverse locking order for nested
instances")

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 Documentation/networking/devlink/index.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 32f70879ddd0..4745148fecf4 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -31,10 +31,10 @@ sure to respect following rules:
 
  - Lock ordering should be maintained. If driver needs to take instance
    lock of both nested and parent instances at the same time, devlink
-   instance lock of the parent instance should be taken first, only then
-   instance lock of the nested instance could be taken.
- - Driver should use object-specific helpers to setup the nested relationship
-   before registering the nested devlink instance:
+   instance lock of the nested instance should be taken first, only then
+   instance lock of the parent instance could be taken.
+ - Driver should use object-specific helpers to setup the
+   nested relationship:
 
    - ``devl_nested_devlink_set()`` - called to setup devlink -> nested
      devlink relationship (could be used for multiple nested instances).
-- 
2.44.0



* [PATCH net-next V10 00/14] devlink and mlx5: Support cross-function rate scheduling
From: Tariq Toukan @ 2026-07-01  7:32 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Adithya Jayachandran, Bobby Eshleman, Carolina Jubran,
	Cosmin Ratiu, Daniel Borkmann, Daniel Jurgens, Daniel Zahka,
	David Wei, Donald Hunter, Dragos Tatulea, Jiri Pirko, Jiri Pirko,
	Joe Damato, Jonathan Corbet, Kees Cook, Leon Romanovsky,
	linux-doc, linux-kernel, linux-kselftest, linux-rdma, Mark Bloch,
	Moshe Shemesh, Or Har-Toov, Parav Pandit, Petr Machata,
	Ratheesh Kannoth, Saeed Mahameed, Shahar Shitrit, Shay Drori,
	Shuah Khan, Shuah Khan, Simon Horman, Stanislav Fomichev,
	Tariq Toukan, Willem de Bruijn, Gal Pressman

Hi,

This series by Cosmin adds support for cross-function rate scheduling in
devlink and mlx5.
See detailed explanation by Cosmin below [0].

Regards,
Tariq

[0]
devlink objects support rate management for TX scheduling, which
involves maintaining a tree of rate nodes that corresponds to TX
schedulers in hardware. 'man devlink-rate' has the full details.

The tree of rate nodes is maintained per devlink object, protected by
the devlink lock.

There exists hardware capable of instantiating TX scheduling trees
spanning multiple functions of the same physical device (and thus
multiple devlink objects), and therefore the current API and locking
scheme are insufficient.

This patch series changes the devlink rate implementation and API to
allow supporting such hardware and managing TX scheduling trees across
multiple functions of a physical device.

Modeling this requires having devlink rate nodes with parents in other
devlink objects. A naive approach that relies on the current
one-lock-per-devlink model is not workable, as it would in some cases
require acquiring multiple devlink locks in the correct order.

The solution proposed in this patch series makes use of the recently
introduced shared devlink instance [1] to manage rate hierarchy changes
across multiple functions.

V1 of this patch series was sent a long time ago [2], using a different
approach of storing rates in a shared rate domain with special locking
rules. This new approach uses standard devlink instances and nesting.

The first part of the series adds support to devlink rates for
maintaining the rate tree across multiple functions.

The second part changes the mlx5 implementation to make use of this (and
cleans up remnants of the previous approach, involving rate domains).

The neat part about using the shared devlink object is that it works for
SFs as well, which are already nested in their parent PF instances. So
with this series, complex scheduling trees spanning multiple SFs across
multiple PFs of the same NIC can now be supported.
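
To make the nesting idea concrete, here is a minimal userspace sketch.
It assumes a simplified model in which every instance has at most one
nested-in pointer and rates live in the top-most instance of that
chain; the names model_devlink, model_rate_root() and
model_can_share_rates() are hypothetical and the driver opt-in is left
out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each instance is nested in at most one other. */
struct model_devlink {
	struct model_devlink *nested_in; /* NULL for the top-most instance */
};

/* Follow the nested-in chain up to the top-most instance. */
static struct model_devlink *model_rate_root(struct model_devlink *dl)
{
	while (dl->nested_in)
		dl = dl->nested_in;
	return dl;
}

/* In this model, two instances can link rates to each other only when
 * both resolve to the same top-most instance.
 */
static bool model_can_share_rates(struct model_devlink *a,
				  struct model_devlink *b)
{
	return model_rate_root(a) == model_rate_root(b);
}
```

In this model, an SF nested in PF0 and an SF nested in PF1 resolve to
the same shared instance when both PFs are nested in it, while an
unrelated instance resolves to itself and cannot share rates with them.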

---

[1] https://lore.kernel.org/all/20260312100407.551173-1-jiri@resnulli.us/T/#u
[2] https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/
[3] https://lore.kernel.org/netdev/20260325152801.236343-1-jiri@resnulli.us/

Comments from internal Sashiko review:
> When a rate node is linked to a parent on a different devlink device, should
> the rate-get reply and dump operations also include the parent's device
> information?
> Currently, parent-dev is only added to the request attributes for rate-set
> and rate-new, but it appears to be omitted from the serialization logic and
> the rate-get reply definition. Does this render the cross-device
> relationship invisible to user space monitoring tools?

[CR] Preexisting problem with the devlink rate API: the rate-get command
specifies only a subset of output attributes. May be fixed separately.

> > diff --git a/net/devlink/rate.c b/net/devlink/rate.c
> > index 5436811cfbbd..2b30329a1394 100644
> > --- a/net/devlink/rate.c
> > +++ b/net/devlink/rate.c
> > @@ -660,9 +660,11 @@ int devlink_nl_rate_set_doit(struct sk_buff *skb, struct genl_info *info)
> >
> >  int devlink_nl_rate_new_doit(struct sk_buff *skb, struct genl_info *info)
> >  {
> > -	struct devlink *rate_devlink, *devlink = devlink_nl_ctx(info)->devlink;
> > +	struct devlink_nl_ctx *ctx = devlink_nl_ctx(info);
> > +	struct devlink *devlink = ctx->devlink;
> While ctx is now retrieved, does the rest of the function ignore
> ctx->parent_devlink?

> It looks like the command handlers never retrieve or use the parent devlink,
> and unconditionally use the child's devlink for lookups and assignments. Will
> this cause cross-device parent assignments to silently fail or return -ENODEV
> since they search for the parent node exclusively on the child's device?

[CR] False positive, it's used in devlink_nl_rate_parent_node_set().

> Additionally, looking at how the parent devlink is prepared in
> __devlink_nl_pre_doit() (which is called via
> DEVLINK_NL_FLAG_OPTIONAL_PARENT_DEV):
> __devlink_nl_pre_doit()
>     if (parent_dev && attrs[DEVLINK_ATTR_PARENT_DEV]) {
>         parent_devlink = devlink_get_parent_from_attrs_lock(net, attrs);
>         ...
>         devlink_nl_ctx(info)->parent_devlink = parent_devlink;
>         /* Drop the parent devlink lock but don't release the reference.
>          * This will keep it alive until the end of the request.
>          */
>         devl_unlock(parent_devlink);
>     }
> To actually implement the cross-device feature in the command handlers, the
> parent's rate list would need to be accessed. Since the parent devlink lock
> is explicitly dropped here, wouldn't accessing the parent's rate list
> without its lock violate devlink locking rules and lead to data races?

[CR] Locking changes are intentional. All rate changes for anything in
the rate group now acquire the root devlink lock, where rates are
stored.

> >   * devl_rate_nodes_destroy - destroy all devlink rate nodes on device
> >   * @devlink: devlink instance
> >   *
> > - * Unset parent for all rate objects and destroy all rate nodes
> > - * on specified device.
> > + * Unset parent for all rate objects involving this device and destroy all rate
> > + * nodes on it.
> >   */
> >  void devl_rate_nodes_destroy(struct devlink *devlink)
> >  {
> > -	const struct devlink_ops *ops = devlink->ops;
> >  	struct devlink_rate *devlink_rate, *tmp;
> > +	const struct devlink_ops *ops;
> > +	struct devlink *rate_devlink;
> >
> >  	devl_assert_locked(devlink);
> > +	rate_devlink = devl_rate_lock(devlink);
> >
> > -	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
> > -		if (!devlink_rate->parent)
> > +	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list) {
> > +		if (!devlink_rate->parent ||
> > +		    (devlink_rate->devlink != devlink &&
> > +		     devlink_rate->parent->devlink != devlink))
> >  			continue;
> >
> > +		/* This could destroy rate objects on other devlinks in the
> > +		 * same hierarchy under 'rate_devlink'. This is safe because
> > +		 * the shared common ancestor is locked so there can be no
> > +		 * other concurrent rate operations on devlink_rate->devlink.
> > +		 */
> > +		ops = devlink_rate->devlink->ops;
> Does this code invoke devlink_ops callbacks on other devlink instances
> without holding their instance lock?
> While the comment indicates the shared common ancestor lock prevents concurrent
> rate operations, the child device's instance lock (devlink_rate->devlink) is
> not held.
> Could this violate the devlink locking contract if the driver is executing
> concurrent non-rate operations on the child instance? Because devlink
> drivers rely on the instance lock to serialize access to their internal
> structures, could invoking these callbacks unlocked cause data corruption or
> trigger lockdep splats if the driver asserts devl_assert_locked()?

[CR] As mentioned in the comment, concurrent ops which touch rates
_will_ eventually acquire the shared devlink lock, so they're safe. We
don't care about other ops, since we are only touching rates here.

> > @@ -321,22 +307,19 @@ static int esw_qos_create_rate_limit_element(struct mlx5_esw_sched_node *node,
> >  	return esw_qos_node_create_sched_element(node, sched_ctx, extack);
> >  }
> >
> > -static u32 esw_qos_calculate_min_rate_divider(struct mlx5_eswitch *esw,
> > -					      struct mlx5_esw_sched_node *parent)
> > +static u32
> > +esw_qos_calculate_min_rate_divider(struct mlx5_esw_sched_node *parent)
> >  {
> > -	struct list_head *nodes = parent ? &parent->children : &esw->qos.domain->nodes;
> > -	u32 fw_max_bw_share = MLX5_CAP_QOS(esw->dev, max_tsar_bw_share);
> > +	u32 fw_max_bw_share = MLX5_CAP_QOS(parent->esw->dev, max_tsar_bw_share);
> This is a pre-existing issue, but does this division risk a divide-by-zero
> panic regression?
> If the hardware does not support bandwidth sharing, max_tsar_bw_share
> could be 0. If a user subsequently sets a non-zero tx_share (minimum rate
> guarantee), max_guarantee becomes greater than 0, resulting in
> max_guarantee / fw_max_bw_share.
> Should we validate whether fw_max_bw_share is non-zero before performing
> the division?

[CR] Preexisting, also not really possible for fw to give a value of 0
for the divisor.
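
For reference, the defensive check the review asks about could look
roughly like the userspace sketch below. It is only an illustration:
min_rate_divider() and its error convention are hypothetical, it is not
code from the series, and per the reply above firmware is not expected
to report 0 here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the mlx5 driver: computes
 * max_guarantee / fw_max_bw_share and refuses a zero divisor instead of
 * dividing by it.
 */
static int min_rate_divider(uint32_t max_guarantee, uint32_t fw_max_bw_share,
			    uint32_t *divider)
{
	if (!max_guarantee) {
		/* No min-rate guarantee configured, nothing to divide. */
		*divider = 0;
		return 0;
	}
	if (!fw_max_bw_share)
		return -EINVAL;	/* would otherwise divide by zero */

	*divider = max_guarantee / fw_max_bw_share;
	return 0;
}
```

The division is only reached with a non-zero guarantee, which matches
the scenario described in the review (a user setting a non-zero
tx_share).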

> > @@ -665,7 +631,6 @@ static int esw_qos_create_tc_arbiter_sched_elem(
> >  		struct netlink_ext_ack *extack)
> >  {
> >  	u32 tsar_ctx[MLX5_ST_SZ_DW(scheduling_context)] = {};
> > -	u32 tsar_parent_ix;
> >  	void *attr;
> This is a pre-existing issue, but is there a hardware scheduling element
> leak when transitioning a node to a TC arbiter?
> When esw_qos_tc_arbiter_scheduling_setup() calls
> esw_qos_create_tc_arbiter_sched_elem(node, extack), it provisions a new
> hardware element and immediately overwrites node->ix with its ID.
> If the subsequent step esw_qos_create_vports_tc_nodes(node, extack) fails,
> it jumps to err_tc_node_create and restores node->ix = curr_ix, but it
> doesn't appear to tear down the newly created scheduling element. Does this
> permanently leak the newly provisioned hardware element in the firmware?

[CR] False alarm, the call to esw_qos_node_destroy_sched_element()
destroys the new temporary scheduling element on failure.

> > @@ -1319,11 +1288,9 @@ static int esw_qos_switch_tc_arbiter_node_to_vports(
> >  	struct mlx5_esw_sched_node *node,
> >  	struct netlink_ext_ack *extack)
> >  {
> This is a pre-existing issue, but what happens if the rollback path fails
> during vport migration in esw_qos_switch_vports_node_to_tc_arbiter()?
> If an error occurs, it attempts to rollback by calling
> esw_qos_switch_vport_tcs_to_vport(). If this rollback fails internally
> (e.g. due to ENOMEM in esw_qos_vport_update()), the vport is
> successfully restored to the doomed TC arbiter node, and the error
> is silently ignored.
> Then, the caller esw_qos_node_enable_tc_arbitration() executes
> esw_qos_tc_arbiter_scheduling_teardown(), which frees the TC arbiter's
> VPORTS_TC_TSAR nodes without freeing the attached vport VPORT_TC nodes.
> Does this leak the vport TC nodes and leave them holding use-after-free
> pointers to their freed parents, leading to memory corruption when the
> vport is later disabled?

[CR] Preexisting, also hard to hit. Requires a failure mid-way when
switching nodes to TC arbitration AND a failure during rollback. Fixing
it is outside the scope of this series.

> > @@ -1888,7 +1855,9 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
> >  		err = mlx5_esw_qos_vport_enable(vport, type, parent, 0, 0,
> >  						extack);
> >  	} else if (vport->qos.sched_node) {
> > -		err = esw_qos_vport_update_parent(vport, parent, extack);
> > +		err = esw_qos_vport_update_parent(vport,
> > +						  parent ? : esw->qos.root,
> > +						  extack);
> Can this sequence cause a use-after-free regression on esw->qos.root
> when updating a vport's QoS configuration?
> If a vport is the only QoS user in the E-Switch, esw->qos.refcnt is 1.
> If its QoS configuration is updated, esw_qos_vport_update() is called
> with the parent argument pointing to esw->qos.root (since this patch
> changes the default parent from NULL to esw->qos.root).
> Inside esw_qos_vport_update(), mlx5_esw_qos_vport_disable_locked() drops
> the vport's QoS reference, decrementing esw->qos.refcnt to 0. This
> triggers esw_qos_destroy(), freeing esw->qos.root.
> The function then calls mlx5_esw_qos_vport_enable(..., parent, ...) passing
> the stale parent pointer. Because parent is no longer NULL, it doesn't
> fetch the newly re-allocated root node, and instead passes the freed
> pointer to __esw_qos_alloc_node(), resulting in a use-after-free.

[CR] False alarm. The finding confuses esw_qos_vport_disable() and
mlx5_esw_qos_vport_disable_locked(). esw_qos_vport_disable() doesn't
touch the refcnt.

> > @@ -1859,13 +1840,15 @@ mlx5_esw_qos_vport_update_parent(struct mlx5_vport *vport,
> >  						  extack);
> >  	}
> > -	esw_qos_unlock(esw);
> > +
> >  	return err;
> >  }
> >
> >  void mlx5_esw_qos_vport_clear_parent(struct mlx5_vport *vport)
> >  {
> > +	esw_qos_lock(vport->dev);
> >  	mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
> > +	esw_qos_unlock(vport->dev);
> >  }
> Could this lead to a recursive mutex deadlock during VF teardown on older
> hardware where shd is NULL?
> In the teardown path, mlx5_eswitch_unload_pf_vf_vport() already acquires
> esw->state_lock. It then proceeds to call:
> mlx5_eswitch_unload_pf_vf_vport()
>   mlx5_eswitch_unload_vport()
>     mlx5_esw_offloads_unload_rep()
>       mlx5_esw_offloads_devlink_port_unregister()
>         mlx5_esw_qos_vport_clear_parent()
> Since this patch changes mlx5_esw_qos_vport_clear_parent() to
> unconditionally call esw_qos_lock(), which falls back to acquiring
> esw->state_lock when dev->shd is NULL, won't this result in an attempt to
> re-acquire the non-recursive state_lock that is already held by the
> teardown process?

[CR] False alarm. state_lock isn't held during VF teardown.

> > @@ -839,13 +940,17 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
> >  		refcount_dec(&devlink_rate->parent->refcnt);
> >  		devlink_rate->parent = NULL;
> When unsetting the parent of a rate object in devl_rate_nodes_destroy(),
> this patch allows it to happen to cross-device child nodes (where
> devlink_rate->devlink != devlink). Since the child's devlink instance is
> still active, shouldn't its state change (losing its parent) be
> communicated to user space?
> Omitting devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW) here might
> leave user space monitors out of sync, incorrectly believing the active child
> is still attached to the deleted parent node.

[CR] Preexisting, devl_rate_nodes_destroy() didn't send notifications.

> > +static struct devlink *devl_rate_lock(struct devlink *devlink)
> >  {
> > -	return devlink;
> > +	struct devlink *rate_devlink = devlink, *parent;
> > +
> > +	devl_assert_locked(devlink);
> > +
> > +	while (rate_devlink->ops &&
> > +	       rate_devlink->ops->supported_cross_device_rate_nodes) {
> > +		parent = devlink_nested_in_get_lock(rate_devlink);
> > +		if (!parent)
> > +			break;
> Can this lead to list corruption if the parent devlink is unregistered?
> If the parent is unregistered concurrently (for example, a PF is unregistered
> before VF teardown), devlink_nested_in_get_lock() will return NULL because of
> this internal check:
> devlink_nested_in_get_lock() {
>     ...
>     if (devl_is_registered(devlink))
>         return devlink;
>     ...
>     return NULL;
> }
> Since parent is NULL, this loop breaks and returns the child devlink lock
> instead of the parent's lock. Later, in functions like devl_rate_leaf_destroy(),
> list_del(&devlink_rate->list) is called on the rate node.
> Because the rate node is chained into the parent's rate_list, does this
> modify the parent's list concurrently without holding its lock?

[CR] This misunderstands the nesting hierarchy. VFs are not nested in
PFs, PFs are nested in shd, which is reference counted (=> outlives all
PFs) and keeps all rate nodes for all PFs.
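
The walk performed by devl_rate_lock() can be modelled in userspace as
below. This is a toy sketch under stated assumptions: struct toy_devlink
and the toy_* helpers are invented for illustration, the instance lock
is a plain flag rather than a mutex, and the registration check inside
devlink_nested_in_get_lock() is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct devlink; every field here is illustrative. */
struct toy_devlink {
	struct toy_devlink *nested_in;	/* instance this one is nested in */
	bool cross_device_rates;	/* models supported_cross_device_rate_nodes */
	bool locked;			/* models the instance lock */
	int refcnt;			/* references taken by the walk */
};

/* Models devlink_nested_in_get_lock(): returns the parent locked and
 * referenced, or NULL when there is no parent.
 */
static struct toy_devlink *toy_nested_in_get_lock(struct toy_devlink *dl)
{
	struct toy_devlink *parent = dl->nested_in;

	if (!parent)
		return NULL;
	parent->refcnt++;
	parent->locked = true;
	return parent;
}

/* Returns the topmost instance, which is where rates are stored. */
static struct toy_devlink *toy_rate_lock(struct toy_devlink *dl)
{
	struct toy_devlink *rate_dl = dl, *parent;

	assert(dl->locked);	/* caller already holds its own lock */

	while (rate_dl->cross_device_rates) {
		parent = toy_nested_in_get_lock(rate_dl);
		if (!parent)
			break;
		if (rate_dl != dl) {
			/* Drop intermediate instances, keep only the top. */
			rate_dl->locked = false;
			rate_dl->refcnt--;
		}
		rate_dl = parent;
	}
	return rate_dl;
}
```

With a PF nested in shd, toy_rate_lock(&pf) returns shd locked and
referenced while the PF stays locked; an instance without the flag gets
itself back, which mirrors the non-cross-device case.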

> This is a pre-existing issue, but does devlink_nl_rate_del_doit()
> unconditionally free rate nodes even if the driver rejects the deletion?
> Looking at devlink_nl_rate_del_doit() in net/devlink/rate.c:
> devlink_nl_rate_del_doit() {
>     ...
>     err = devlink->ops->rate_node_del(rate_node, rate_node->priv,
>                                       info->extack);
>     if (rate_node->parent)
>         refcount_dec(&rate_node->parent->refcnt);
>     list_del(&rate_node->list);
>     kfree(rate_node->name);
>     kfree(rate_node);
>     ...
> }
> If the hardware or driver cannot fulfill the request and ops->rate_node_del()
> returns an error like -EBUSY, the function ignores the error regarding memory
> management and proceeds to kfree() the rate node.
> Could this leave the driver with a dangling pointer to the freed rate node,
> triggering a use-after-free later?

[CR] Preexisting, it is intentional that the node is removed no matter
what the rate_node_del op manages to do.

> > @@ -243,7 +243,29 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
> > +	err = nla_parse_nested(tb, maxtype, attrs[DEVLINK_ATTR_PARENT_DEV],
> > +			       devlink_dl_parent_dev_nl_policy, NULL);
> Could this nla_parse_nested() call pass down an extended ACK pointer instead
> of NULL?
> When passing NULL for the extack parameter, if the nested attribute is
> malformed (such as a missing NLA_F_NESTED flag), userspace will receive a
> silent -EINVAL without a descriptive error message.
> Because devlink_get_parent_from_attrs_lock() is called by
> __devlink_nl_pre_doit() which has access to the genl_info structure and
> its extack pointer, passing it down to this function would allow for
> reporting accurate validation errors to userspace.

[CR] Valid, but quite a lot of plumbing is necessary for extack to be
used here. Out of scope of the series.

> This is a pre-existing issue, but does the error path in
> devlink_nl_rate_new_doit() leak the parent's reference count?
> If devlink_nl_rate_set() successfully sets the parent (which increments
> parent->refcnt) but fails on a subsequent attribute validation like tc_bw,
> it returns an error.
> The cleanup path in devlink_nl_rate_new_doit() handles the error:
> err_rate_set:
>     ops->rate_node_del(rate_node, rate_node->priv, info->extack);
> err_node_new:
>     kfree(rate_node->name);
> err_strdup:
>     kfree(rate_node);
> Because this path frees the node but never decrements the parent's refcnt,
> does this permanently elevate the parent's refcount, falsely indicating it
> has children and preventing it from ever being deleted?

[CR] Preexisting, will be fixed separately with a net patch.
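
The general shape of an unwind that drops the parent reference is
sketched below as a userspace toy. It is not the planned net fix:
struct toy_rate, toy_set_parent() and toy_rate_new() are hypothetical
names, and the later_step_fails flag merely simulates a validation
failure after the parent was set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy rate node; refcnt counts the children pointing at this node. */
struct toy_rate {
	struct toy_rate *parent;
	int refcnt;
};

static void toy_set_parent(struct toy_rate *node, struct toy_rate *parent)
{
	node->parent = parent;
	parent->refcnt++;
}

/* Creates a child under @parent. @later_step_fails simulates a failure
 * in a step that runs after the parent has already been set.
 */
static struct toy_rate *toy_rate_new(struct toy_rate *parent,
				     bool later_step_fails)
{
	struct toy_rate *node = calloc(1, sizeof(*node));

	if (!node)
		return NULL;
	toy_set_parent(node, parent);
	if (later_step_fails)
		goto err_unwind;
	return node;

err_unwind:
	/* Undo the reference taken by toy_set_parent() before freeing. */
	node->parent->refcnt--;
	free(node);
	return NULL;
}
```

After a failed creation the parent's count is back to its previous
value, so the parent can still be deleted later.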

> > @@ -30,14 +30,42 @@ devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
> >  	return devlink_rate ?: ERR_PTR(-ENODEV);
> >  }
> >
> > +/* Repeatedly walks the nested devlink chain while cross device rate nodes are
> > + * supported and finds the topmost instance where rates should be stored.
> > + * That instance is locked, referenced and returned.
> > + * When cross device rate nodes aren't supported the original devlink instance
> > + * is returned.
> > + */
> >  static struct devlink *devl_rate_lock(struct devlink *devlink)
> >  {
> > -	return devlink;
> > +	struct devlink *rate_devlink = devlink, *parent;
> > +
> > +	devl_assert_locked(devlink);
> > +
> > +	while (rate_devlink->ops &&
> > +	       rate_devlink->ops->supported_cross_device_rate_nodes) {
> > +		parent = devlink_nested_in_get_lock(rate_devlink);
> > +		if (!parent)
> > +			break;
> > +		if (rate_devlink != devlink) {
> > +			/* Unlock intermediate instances. */
> > +			devl_unlock(rate_devlink);
> > +			devlink_put(rate_devlink);
> > +		}
> > +		rate_devlink = parent;
> > +	}
> > +	return rate_devlink;
> >  }
> Now that devl_rate_lock() returns the shared ancestor, does this cause
> devlink_rates_check() to skip child rate nodes?
> In net/devlink/rate.c:devlink_rates_check(), the loop filters nodes by the
> devlink pointer:
> 	rate_devlink = devl_rate_lock(devlink);
> 	list_for_each_entry(devlink_rate, &rate_devlink->rate_list, list)
> 		if (devlink_rate->devlink == devlink &&
> 		    (!rate_filter || rate_filter(devlink_rate))) {
> 			...
> 			err = -EBUSY;
> 			break;
> 		}
> Because of this check, if a user creates an SF rate node and then triggers an
> eswitch mode change to legacy on the PF devlink, devlink_rates_check() skips
> the SF node and returns 0 instead of -EBUSY.
> Could this allow mlx5_devlink_eswitch_mode_set() to proceed and unconditionally
> free the QoS domain in mlx5_esw_qos_cleanup(), while the child SF devlink and
> its software rate node remain active?
> If so, a subsequent operation on the active child SF rate node (like
> DEVLINK_CMD_RATE_DEL) would invoke mlx5_esw_devlink_rate_node_del() and call
> esw_qos_lock(), leading to a NULL pointer dereference when accessing
> esw->qos.domain->lock.

[CR] False alarm, no driver uses supported_cross_device_rate_nodes yet
so this code path isn't taken. QoS domains are removed in a later patch
in the series before the cross device flag is enabled, so this finding
is moot.

> >  static int mlx5_esw_qos_max_link_speed_get(struct mlx5_core_dev *mdev, u32 *link_speed_max,
> > -					   bool take_rtnl,
> >  					   struct netlink_ext_ack *extack)
> >  {
> >  	int err;
> >
> > -	if (!mlx5_lag_is_active(mdev))
> > +	if (!mlx5_lag_is_active(mdev) ||
> > +	    mlx5_lag_query_bond_speed(mdev, link_speed_max) < 0 ||
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is a pre-existing issue, but does mlx5_lag_query_bond_speed() have a race
> condition when accessing the ldev->tracker structure?
> The mlx5_lag_query_bond_speed() function reads ldev->tracker.bond_speed_mbps
> while holding lag_lock. However, mlx5_lag_netdev_event() modifies
> ldev->tracker holding only rtnl_lock, without holding lag_lock:
> drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c:mlx5_lag_netdev_event() {
>     ...
>     ldev->tracker = tracker;
>     ...
> }
> Concurrently, mlx5_ldev_add_netdev() modifies fields inside it under lag_lock,
> meaning its updates can be overwritten and lost by the read-modify-write in
> mlx5_lag_netdev_event().
> Since this patch now heavily relies on mlx5_lag_query_bond_speed() for QoS
> max link speed, could this expose readers to observing torn or inconsistent
> tracker states?

[CR] Preexisting, no impact. Either a value of 0 is read, or the correct one.

> > @@ -458,10 +461,17 @@ esw_qos_vport_create_sched_element(struct mlx5_esw_sched_node *vport_node,
> >  	MLX5_SET(scheduling_context, sched_ctx, max_average_bw,
> >  		 vport_node->max_rate);
> > +	if (vport->dev != dev) {
> > +		/* The port is assigned to a node on another eswitch. */
> > +		MLX5_SET(vport_element, attr, eswitch_owner_vhca_id_valid,
> > +			 true);
> > +		MLX5_SET(vport_element, attr, eswitch_owner_vhca_id,
> > +			 MLX5_CAP_GEN(vport->dev, vhca_id));
> Does this code silently truncate the vhca_id if it is >= 2048?
> The hardware structure mlx5_ifc_vport_element_bits defines
> eswitch_owner_vhca_id as an 11-bit field, but MLX5_CAP_GEN(..., vhca_id)
> returns a 16-bit identifier. The MLX5_SET macro will silently apply a
> mask of 0x7ff. Should there be a bounds check returning an error like
> -EOVERFLOW here, or does the hardware actually support a 16-bit field?

[CR] vhca_id >= 2048 would imply a NIC with that many physical ports,
which is only possible in theory. Not worth addressing.
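
To make the arithmetic in the review concrete: an 11-bit field holds
values 0..2047, so a masking setter stores 2048 as 0. The sketch below
is a standalone illustration with made-up helper names (toy_fits_field,
toy_masked); it does not use MLX5_SET and is not proposed for the
driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field width quoted in the review for eswitch_owner_vhca_id. */
#define TOY_OWNER_VHCA_ID_BITS 11

/* True when @value can be stored in a @bits wide field without loss. */
static bool toy_fits_field(uint32_t value, unsigned int bits)
{
	return bits >= 32 || value < (UINT32_C(1) << bits);
}

/* What a setter that masks to the field width would actually store. */
static uint32_t toy_masked(uint32_t value, unsigned int bits)
{
	return bits >= 32 ? value : value & ((UINT32_C(1) << bits) - 1);
}
```

A bounds check of the kind the review suggests would reject any id for
which toy_fits_field(id, TOY_OWNER_VHCA_ID_BITS) is false.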

> > @@ -1221,8 +1260,15 @@ static int esw_qos_vport_update(struct mlx5_vport *vport,
> >  	if (err)
> >  		return err;
> >
> > -	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type)
> > +	if (curr_type == SCHED_NODE_TYPE_TC_ARBITER_TSAR && curr_type == type) {
> >  		esw_qos_tc_arbiter_get_bw_shares(vport_node, curr_tc_bw);
> Does this condition skip the preservation and restoration of TC bandwidth
> shares when transitioning between different node types?
> When a vport transitions away from SCHED_NODE_TYPE_TC_ARBITER_TSAR
> (e.g., to a regular SCHED_NODE_TYPE_VPORT), curr_type != type evaluates to
> true, causing the function to skip saving the current tc_bw shares.
> esw_qos_vport_disable() then unconditionally clears vport_node->tc_bw to
> zero.
> If the subsequent esw_qos_vport_enable() fails, the rollback path attempts
> to restore the old TC_ARBITER_TSAR state.
> However, because curr_type != type, the code skips restoring the bandwidth
> shares via esw_qos_set_tc_arbiter_bw_shares().
> Are the vport's bandwidth guarantees permanently erased in this failure
> scenario?

[CR] Pre-existing, unlikely and low-impact: If tc-bw settings fail,
there will be an error and an opportunity for the user to retry.

> > @@ -1990,6 +1990,13 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw, bool clear_vf)
> >  		 esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
> >
> >  	mlx5_eswitch_invalidate_wq(esw);
> > +
> > +	if (esw->mode == MLX5_ESWITCH_OFFLOADS) {
> > +		struct devlink *devlink = priv_to_devlink(esw->dev);
> > +
> > +		devl_rate_nodes_destroy(devlink);
> > +	}
> > +
> Can this reordering cause a use-after-free if leaf unparenting fails?
> Devlink's devl_rate_nodes_destroy() calls ops->rate_leaf_parent_set(..., NULL)
> which maps to mlx5_esw_qos_vport_update_parent(). Devlink ignores the return
> value of this callback.
> If the hardware operation fails (e.g., due to a firmware timeout), the driver
> leaves vport->qos.sched_node->parent pointing to the original parent node.
> Devlink then unconditionally proceeds to destroy and free the rate node.
> Later, when mlx5_eswitch_unload_vf_vports() executes, it calls
> mlx5_esw_qos_vport_disable(), which reaches esw_qos_vport_disable():
> drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c:esw_qos_vport_disable() {
>     ...
> 	list_del_init(&vport_node->entry);
> 	esw_qos_normalize_min_rate(vport_node->esw, vport_node->parent, extack);
>     ...
> }
> Will this dereference the freed vport_node->parent, resulting in list
> corruption or a use-after-free?

[CR] This is a preexisting problem, brought to light by the reordering
of group destruction before leaf destruction. It's extremely unlikely,
requiring the firmware command to reparent a vport to its root to fail.
Fixing this properly requires multiple patches and will be pursued after
this series.

> > @@ -2039,6 +2040,9 @@ void mlx5_eswitch_disable_locked(struct mlx5_eswitch *esw)
> >  		 esw->mode == MLX5_ESWITCH_LEGACY ? "LEGACY" : "OFFLOADS",
> >  		 esw->esw_funcs.num_vfs, esw->esw_funcs.num_ec_vfs, esw->enabled_vports);
> >
> > +	if (esw->mode == MLX5_ESWITCH_OFFLOADS)
> > +		devl_rate_nodes_destroy(devlink);
> > +
> Does this identical reordering in the locked disable path suffer from the
> same unparenting failure use-after-free described above?

[CR] Same comment as above. QoS improvements for the error paths will
follow.

V10:
- Added a comment in devl_rate_nodes_destroy clarifying locking.
- Expanded 'supported_cross_device_rate_nodes' comment with locking
  expectations.
- Simplified rate locking by only keeping the common ancestor locked.
- Removed devlink_nested_in_get_locked and devlink_nested_in_put_unlock.
- devlink_nl_rate_parent_node_set iterates over the proper rate list.
- Refactored mlx5 locking given dev->shd is now optional (after [3]).
- Fixed a bug in pruning introduced by the root node patch.
- Fixed a bug on failure when detaching a node from parent.
- Clarified expectations for shared devlink rate storage.
- Fixed incorrect net namespace when listing shared instances.

V9:
https://lore.kernel.org/netdev/20260326065949.44058-1-tariqt@nvidia.com/

Cosmin Ratiu (14):
  devlink: Update nested instance locking comment
  devlink: Add a helper for getting a nested-in instance
  devlink: Migrate from info->user_ptr to info->ctx
  devlink: Decouple rate storage from associated devlink object
  devlink: Add parent dev to devlink API
  devlink: Allow parent dev for rate-set and rate-new
  devlink: Allow rate node parents from other devlinks
  net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed
  net/mlx5: qos: Refactor vport QoS cleanup
  net/mlx5: qos: Model the root node in the scheduling hierarchy
  net/mlx5: qos: Remove qos domains and use shd
  net/mlx5: qos: Support cross-device tx scheduling
  selftests: drv-net: Add test for cross-esw rate scheduling
  net/mlx5: Document devlink rates

 Documentation/netlink/specs/devlink.yaml      |  30 +-
 .../networking/devlink/devlink-port.rst       |   2 +
 Documentation/networking/devlink/index.rst    |   8 +-
 Documentation/networking/devlink/mlx5.rst     |  33 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   1 +
 .../mellanox/mlx5/core/esw/devlink_port.c     |   1 -
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 605 +++++++++---------
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |   3 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  27 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  18 +-
 include/net/devlink.h                         |   9 +
 include/uapi/linux/devlink.h                  |   2 +
 net/devlink/core.c                            |  20 +-
 net/devlink/dev.c                             |  16 +-
 net/devlink/devl_internal.h                   |  20 +
 net/devlink/dpipe.c                           |  14 +-
 net/devlink/health.c                          |  12 +-
 net/devlink/linecard.c                        |   4 +-
 net/devlink/netlink.c                         |  82 ++-
 net/devlink/netlink_gen.c                     |  24 +-
 net/devlink/netlink_gen.h                     |   8 +
 net/devlink/param.c                           |   4 +-
 net/devlink/port.c                            |  18 +-
 net/devlink/rate.c                            | 331 +++++++---
 net/devlink/region.c                          |   6 +-
 net/devlink/resource.c                        |  14 +-
 net/devlink/sb.c                              |  22 +-
 net/devlink/trap.c                            |  12 +-
 .../testing/selftests/drivers/net/hw/Makefile |   1 +
 .../drivers/net/hw/devlink_rate_cross_esw.py  | 296 +++++++++
 30 files changed, 1132 insertions(+), 511 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py


base-commit: 1c664ec4b9ea827b609d296921ed5bad8a40a158
-- 
2.44.0



* [PATCH net-next v9 5/5] net: wangxun: add pcie error handler
From: Jiawen Wu @ 2026-07-01  7:23 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Aleksandr Loktionov, Jacob Keller, Michal Swiatkowski,
	Simon Horman, Kees Cook, Larysa Zaremba, Greg Kroah-Hartman,
	Thomas Gleixner, Breno Leitao, Rongguang Wei,
	Uwe Kleine-König (The Capable Hub), Fabio Baltieri,
	Jiawen Wu
In-Reply-To: <20260701072357.33984-1-jiawenwu@trustnetic.com>

Add PCI error handlers so that the AER driver can recover the device
from PCIe errors. When a PCIe error occurs, the netdev watchdog Tx
timeout sometimes fires before the AER error is reported, and the MMIO
accesses made during the resulting reset would block the CPU. To prevent
this, check the PCIe error status in .ndo_tx_timeout. The ngbe support
is not yet fully developed; it will be completed in the future.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
---
 drivers/net/ethernet/wangxun/libwx/wx_err.c   | 148 +++++++++++++++++-
 drivers/net/ethernet/wangxun/libwx/wx_err.h   |   2 +
 drivers/net/ethernet/wangxun/libwx/wx_type.h  |   4 +
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c |  32 +++-
 .../net/ethernet/wangxun/txgbe/txgbe_main.c   |  31 +++-
 5 files changed, 212 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_err.c b/drivers/net/ethernet/wangxun/libwx/wx_err.c
index ee27f96735dc..c34c9406a5ae 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_err.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_err.c
@@ -4,11 +4,124 @@
 
 #include <linux/netdevice.h>
 #include <linux/pci.h>
+#include <linux/aer.h>
 
 #include "wx_type.h"
 #include "wx_lib.h"
 #include "wx_err.h"
 
+/**
+ * wx_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * Return: pci_ers_result_t.
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t wx_io_error_detected(struct pci_dev *pdev,
+					     pci_channel_state_t state)
+{
+	struct wx *wx = pci_get_drvdata(pdev);
+	struct net_device *netdev;
+
+	if (!wx)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	netdev = wx->netdev;
+	if (!netif_device_present(netdev))
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	if (state == pci_channel_io_perm_failure)
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	rtnl_lock();
+	netif_device_detach(netdev);
+	set_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags);
+	wx_soft_quiesce(wx);
+
+	if (!test_and_set_bit(WX_STATE_DISABLED, wx->state))
+		pci_disable_device(pdev);
+	rtnl_unlock();
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * wx_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Return: pci_ers_result_t.
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t wx_io_slot_reset(struct pci_dev *pdev)
+{
+	struct wx *wx = pci_get_drvdata(pdev);
+	pci_ers_result_t result;
+
+	if (pci_enable_device_mem(pdev)) {
+		wx_err(wx, "Cannot re-enable PCI device after reset.\n");
+		result = PCI_ERS_RESULT_DISCONNECT;
+	} else {
+		/* make all memory operations done before clearing the flag */
+		smp_mb__before_atomic();
+		clear_bit(WX_STATE_DISABLED, wx->state);
+		clear_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags);
+		pci_set_master(pdev);
+		pci_restore_state(pdev);
+		pci_wake_from_d3(pdev, false);
+
+		rtnl_lock();
+		if (netif_running(wx->netdev) && wx->down_suspend)
+			wx->down_suspend(wx);
+		if (wx->do_reset)
+			wx->do_reset(wx->netdev, false);
+		rtnl_unlock();
+		result = PCI_ERS_RESULT_RECOVERED;
+	}
+
+	pci_aer_clear_nonfatal_status(pdev);
+
+	return result;
+}
+
+/**
+ * wx_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * it's OK to resume normal operation.
+ */
+static void wx_io_resume(struct pci_dev *pdev)
+{
+	struct wx *wx = pci_get_drvdata(pdev);
+	struct net_device *netdev;
+	int err;
+
+	netdev = wx->netdev;
+	rtnl_lock();
+	if (netif_running(netdev)) {
+		err = netdev->netdev_ops->ndo_open(netdev);
+		if (err) {
+			wx_err(wx, "Failed to open netdev after reset\n");
+			goto out;
+		}
+	}
+	netif_device_attach(netdev);
+out:
+	rtnl_unlock();
+}
+
+const struct pci_error_handlers wx_err_handler = {
+	.error_detected = wx_io_error_detected,
+	.slot_reset = wx_io_slot_reset,
+	.resume = wx_io_resume,
+};
+EXPORT_SYMBOL(wx_err_handler);
+
 static void wx_pf_reset_subtask(struct wx *wx)
 {
 	if (!test_and_clear_bit(WX_FLAG_NEED_PF_RESET, wx->flags))
@@ -25,6 +138,9 @@ static void wx_reset_task(struct work_struct *work)
 
 	rtnl_lock();
 
+	if (test_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags))
+		wx_soft_quiesce(wx);
+
 	if (test_bit(WX_STATE_DOWN, wx->state) ||
 	    test_bit(WX_STATE_RESETTING, wx->state))
 		goto out;
@@ -139,6 +255,33 @@ void wx_check_hang_subtask(struct wx *wx)
 }
 EXPORT_SYMBOL(wx_check_hang_subtask);
 
+static bool wx_check_pcie_error(struct wx *wx)
+{
+	u16 vid, pci_cmd;
+
+	pci_read_config_word(wx->pdev, PCI_VENDOR_ID, &vid);
+	pci_read_config_word(wx->pdev, PCI_COMMAND, &pci_cmd);
+
+	/* PCIe link is lost or memory space is not accessible */
+	if (vid == U16_MAX || !(pci_cmd & PCI_COMMAND_MEMORY))
+		return true;
+
+	return false;
+}
+
+static void wx_tx_timeout_recovery(struct wx *wx)
+{
+	/*
+	 * When a PCIe hardware error occurs, the driver should initiate a PCIe
+	 * recovery mechanism. However, this recovery flow relies on the AER
+	 * driver for current kernel policy. Therefore, a self-contained
+	 * recovery mechanism is not implemented yet.
+	 */
+	set_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags);
+	wx_err(wx, "PCIe error detected during tx timeout\n");
+	queue_work(wx->reset_wq, &wx->reset_task);
+}
+
 static void wx_tx_timeout_reset(struct wx *wx)
 {
 	if (test_bit(WX_STATE_DOWN, wx->state))
@@ -153,7 +296,10 @@ void wx_tx_timeout(struct net_device *netdev, unsigned int __always_unused txque
 {
 	struct wx *wx = netdev_priv(netdev);
 
-	wx_tx_timeout_reset(wx);
+	if (wx_check_pcie_error(wx))
+		wx_tx_timeout_recovery(wx);
+	else
+		wx_tx_timeout_reset(wx);
 }
 EXPORT_SYMBOL(wx_tx_timeout);
 
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_err.h b/drivers/net/ethernet/wangxun/libwx/wx_err.h
index 1eed13e48095..a6a82a263528 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_err.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_err.h
@@ -7,6 +7,8 @@
 #ifndef _WX_ERR_H_
 #define _WX_ERR_H_
 
+extern const struct pci_error_handlers wx_err_handler;
+
 void wx_check_err_subtask(struct wx *wx);
 int wx_init_err_task(struct wx *wx);
 void wx_check_hang_subtask(struct wx *wx);
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
index a8b4e84787f4..c2edb74881f2 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
@@ -1221,6 +1221,8 @@ enum wx_state {
 	WX_STATE_PTP_RUNNING,
 	WX_STATE_PTP_TX_IN_PROGRESS,
 	WX_STATE_SERVICE_SCHED,
+	WX_STATE_DISABLED,
+	WX_STATE_RES_FREED,
 	WX_STATE_NBITS		/* must be last */
 };
 
@@ -1288,6 +1290,7 @@ enum wx_pf_flags {
 	WX_FLAG_RX_MERGE_ENABLED,
 	WX_FLAG_TXHEAD_WB_ENABLED,
 	WX_FLAG_NEED_PF_RESET,
+	WX_FLAG_NEED_PCIE_RECOVERY,
 	WX_PF_FLAGS_NBITS               /* must be last */
 };
 
@@ -1409,6 +1412,7 @@ struct wx {
 	void (*configure_fdir)(struct wx *wx);
 	int (*setup_tc)(struct net_device *netdev, u8 tc);
 	void (*do_reset)(struct net_device *netdev, bool reinit);
+	void (*down_suspend)(struct wx *wx);
 	int (*ptp_setup_sdp)(struct wx *wx);
 	void (*set_num_queues)(struct wx *wx);
 
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
index 92895f503511..56d4b63387fd 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -47,6 +47,20 @@ static const struct pci_device_id ngbe_pci_tbl[] = {
 	{ }
 };
 
+static void ngbe_down_suspend(struct wx *wx)
+{
+	if (test_and_set_bit(WX_STATE_RES_FREED, wx->state))
+		return;
+
+	phylink_stop(wx->phylink);
+	phylink_disconnect_phy(wx->phylink);
+	wx_clean_all_tx_rings(wx);
+	wx_clean_all_rx_rings(wx);
+	wx_free_irq(wx);
+	wx_free_isb_resources(wx);
+	wx_free_resources(wx);
+}
+
 /**
  *  ngbe_init_type_code - Initialize the shared code
  *  @wx: pointer to hardware structure
@@ -135,6 +149,7 @@ static int ngbe_sw_init(struct wx *wx)
 	wx->mbx.size = WX_VXMAILBOX_SIZE;
 	wx->setup_tc = ngbe_setup_tc;
 	wx->do_reset = ngbe_do_reset;
+	wx->down_suspend = ngbe_down_suspend;
 	set_bit(0, &wx->fwd_bitmask);
 
 	return 0;
@@ -413,6 +428,9 @@ static void ngbe_disable_device(struct wx *wx)
 
 static void ngbe_reset(struct wx *wx)
 {
+	if (test_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags))
+		return;
+
 	wx_flush_sw_mac_table(wx);
 	wx_mac_set_default_filter(wx, wx->mac.addr);
 	if (test_bit(WX_STATE_PTP_RUNNING, wx->state))
@@ -435,6 +453,7 @@ static void ngbe_up_complete(struct wx *wx)
 	/* make sure to complete pre-operations */
 	smp_mb__before_atomic();
 	clear_bit(WX_STATE_DOWN, wx->state);
+	clear_bit(WX_STATE_RES_FREED, wx->state);
 	wx_napi_enable_all(wx);
 	/* enable transmits */
 	netif_tx_start_all_queues(wx->netdev);
@@ -529,12 +548,16 @@ static int ngbe_close(struct net_device *netdev)
 {
 	struct wx *wx = netdev_priv(netdev);
 
+	if (test_bit(WX_STATE_RES_FREED, wx->state))
+		goto out;
+
 	wx_ptp_stop(wx);
 	ngbe_down(wx);
 	wx_free_irq(wx);
 	wx_free_isb_resources(wx);
 	wx_free_resources(wx);
 	phylink_disconnect_phy(wx->phylink);
+out:
 	wx_control_hw(wx, false);
 
 	return 0;
@@ -566,7 +589,8 @@ static void ngbe_dev_shutdown(struct pci_dev *pdev, bool *enable_wake)
 	*enable_wake = !!wufc;
 	wx_control_hw(wx, false);
 
-	pci_disable_device(pdev);
+	if (!test_and_set_bit(WX_STATE_DISABLED, wx->state))
+		pci_disable_device(pdev);
 }
 
 static void ngbe_shutdown(struct pci_dev *pdev)
@@ -855,6 +879,7 @@ static int ngbe_probe(struct pci_dev *pdev,
 		goto err_register;
 
 	pci_set_drvdata(pdev, wx);
+	pci_save_state(pdev);
 
 	return 0;
 
@@ -910,7 +935,8 @@ static void ngbe_remove(struct pci_dev *pdev)
 	kfree(wx->mac_table);
 	wx_clear_interrupt_scheme(wx);
 
-	pci_disable_device(pdev);
+	if (!test_and_set_bit(WX_STATE_DISABLED, wx->state))
+		pci_disable_device(pdev);
 }
 
 static int ngbe_suspend(struct pci_dev *pdev, pm_message_t state)
@@ -937,6 +963,7 @@ static int ngbe_resume(struct pci_dev *pdev)
 		wx_err(wx, "Cannot enable PCI device from suspend\n");
 		return err;
 	}
+	clear_bit(WX_STATE_DISABLED, wx->state);
 	pci_set_master(pdev);
 	device_wakeup_disable(&pdev->dev);
 
@@ -961,6 +988,7 @@ static struct pci_driver ngbe_driver = {
 	.resume   = ngbe_resume,
 	.shutdown = ngbe_shutdown,
 	.sriov_configure = wx_pci_sriov_configure,
+	.err_handler = &wx_err_handler,
 };
 
 module_pci_driver(ngbe_driver);
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
index a7bde03a98fe..d85ee83192e4 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
@@ -155,6 +155,7 @@ static void txgbe_up_complete(struct wx *wx)
 	/* make sure to complete pre-operations */
 	smp_mb__before_atomic();
 	clear_bit(WX_STATE_DOWN, wx->state);
+	clear_bit(WX_STATE_RES_FREED, wx->state);
 	wx_napi_enable_all(wx);
 
 	switch (wx->mac.type) {
@@ -198,6 +199,9 @@ static void txgbe_reset(struct wx *wx)
 	u8 old_addr[ETH_ALEN];
 	int err;
 
+	if (test_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags))
+		return;
+
 	err = txgbe_reset_hw(wx);
 	if (err != 0)
 		wx_err(wx, "Hardware Error: %d\n", err);
@@ -304,6 +308,20 @@ void txgbe_up(struct wx *wx)
 	txgbe_up_complete(wx);
 }
 
+static void txgbe_down_suspend(struct wx *wx)
+{
+	if (test_and_set_bit(WX_STATE_RES_FREED, wx->state))
+		return;
+
+	phylink_stop(wx->phylink);
+	wx_clean_all_tx_rings(wx);
+	wx_clean_all_rx_rings(wx);
+	wx_free_irq(wx);
+	txgbe_free_misc_irq(wx->priv);
+	wx_free_resources(wx);
+	txgbe_fdir_filter_exit(wx);
+}
+
 /**
  *  txgbe_init_type_code - Initialize the shared code
  *  @wx: pointer to hardware structure
@@ -420,6 +438,7 @@ static int txgbe_sw_init(struct wx *wx)
 
 	wx->setup_tc = txgbe_setup_tc;
 	wx->do_reset = txgbe_do_reset;
+	wx->down_suspend = txgbe_down_suspend;
 	set_bit(0, &wx->fwd_bitmask);
 
 	switch (wx->mac.type) {
@@ -530,12 +549,16 @@ static int txgbe_close(struct net_device *netdev)
 {
 	struct wx *wx = netdev_priv(netdev);
 
+	if (test_bit(WX_STATE_RES_FREED, wx->state))
+		goto out;
+
 	wx_ptp_stop(wx);
 	txgbe_down(wx);
 	wx_free_irq(wx);
 	txgbe_free_misc_irq(wx->priv);
 	wx_free_resources(wx);
 	txgbe_fdir_filter_exit(wx);
+out:
 	wx_control_hw(wx, false);
 
 	return 0;
@@ -556,7 +579,8 @@ static void txgbe_dev_shutdown(struct pci_dev *pdev)
 
 	wx_control_hw(wx, false);
 
-	pci_disable_device(pdev);
+	if (!test_and_set_bit(WX_STATE_DISABLED, wx->state))
+		pci_disable_device(pdev);
 }
 
 static void txgbe_shutdown(struct pci_dev *pdev)
@@ -907,6 +931,7 @@ static int txgbe_probe(struct pci_dev *pdev,
 		goto err_remove_phy;
 
 	pci_set_drvdata(pdev, wx);
+	pci_save_state(pdev);
 
 	netif_tx_stop_all_queues(netdev);
 
@@ -981,7 +1006,8 @@ static void txgbe_remove(struct pci_dev *pdev)
 	kfree(wx->mac_table);
 	wx_clear_interrupt_scheme(wx);
 
-	pci_disable_device(pdev);
+	if (!test_and_set_bit(WX_STATE_DISABLED, wx->state))
+		pci_disable_device(pdev);
 }
 
 static struct pci_driver txgbe_driver = {
@@ -991,6 +1017,7 @@ static struct pci_driver txgbe_driver = {
 	.remove   = txgbe_remove,
 	.shutdown = txgbe_shutdown,
 	.sriov_configure = wx_pci_sriov_configure,
+	.err_handler = &wx_err_handler,
 };
 
 module_pci_driver(txgbe_driver);
-- 
2.51.0



* [PATCH net-next v9 4/5] net: wangxun: implement soft quiesce for PCIe error recovery
From: Jiawen Wu @ 2026-07-01  7:23 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Aleksandr Loktionov, Jacob Keller, Michal Swiatkowski,
	Simon Horman, Kees Cook, Larysa Zaremba, Greg Kroah-Hartman,
	Thomas Gleixner, Breno Leitao, Rongguang Wei,
	Uwe Kleine-König (The Capable Hub), Fabio Baltieri,
	Jiawen Wu
In-Reply-To: <20260701072357.33984-1-jiawenwu@trustnetic.com>

wx_soft_quiesce() provides a lightweight shutdown path for PCIe error
recovery. It avoids MMIO-dependent operations while the device is in a
PCIe error state.

Waiting for the service task to complete may unnecessarily delay PCIe
error recovery, especially if the work item is already blocked by the
hardware failure that triggered AER. The service task is therefore not
explicitly cancelled in the quiesce path. To keep it from doing work
during recovery, checks of WX_STATE_DOWN and WX_STATE_RESETTING are
added at the entry of the relevant work items.
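
The gating idea can be sketched in plain userspace C. This is only an
illustration, not the driver code: struct dev_state, the ST_* bit
numbers and the helper names are hypothetical, and C11 atomics stand in
for the kernel's test_and_set_bit()/test_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for WX_STATE_DOWN / WX_STATE_RESETTING. */
enum { ST_DOWN = 0, ST_RESETTING = 1 };

struct dev_state {
	atomic_ulong bits;	/* state bitmap */
	int quiesce_runs;	/* times the quiesce body actually ran */
	int work_runs;		/* times the work item body actually ran */
};

/* Userspace analogue of test_and_set_bit(): returns the old bit value. */
static bool state_test_and_set(struct dev_state *s, int nr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(&s->bits, mask) & mask) != 0;
}

/* Userspace analogue of test_bit(). */
static bool state_test(struct dev_state *s, int nr)
{
	return (atomic_load(&s->bits) & (1UL << nr)) != 0;
}

/* Runs at most once per "up" period: the second caller sees DOWN set. */
static void model_soft_quiesce(struct dev_state *s, bool running)
{
	assert(s);
	if (!running || state_test_and_set(s, ST_DOWN))
		return;
	s->quiesce_runs++;
}

/* The work item is not cancelled; it bails out at its entry instead. */
static void model_work_item(struct dev_state *s)
{
	assert(s);
	if (state_test(s, ST_DOWN) || state_test(s, ST_RESETTING))
		return;
	s->work_runs++;
}
```

In this model a work item that fires after the quiesce returns
immediately, so the quiesce path never has to wait for it.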

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
 drivers/net/ethernet/wangxun/libwx/wx_lib.c   | 18 ++++++++++++++++
 drivers/net/ethernet/wangxun/libwx/wx_lib.h   |  1 +
 drivers/net/ethernet/wangxun/libwx/wx_ptp.c   | 21 +++++++++++++++++++
 drivers/net/ethernet/wangxun/libwx/wx_ptp.h   |  1 +
 .../net/ethernet/wangxun/txgbe/txgbe_main.c   |  8 +++++++
 5 files changed, 49 insertions(+)

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index e5a45356ba00..d3340b2b0682 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -3382,5 +3382,23 @@ void wx_service_timer(struct timer_list *t)
 }
 EXPORT_SYMBOL(wx_service_timer);
 
+void wx_soft_quiesce(struct wx *wx)
+{
+	if (!netif_running(wx->netdev) ||
+	    test_and_set_bit(WX_STATE_DOWN, wx->state))
+		return;
+
+	pci_clear_master(wx->pdev);
+	netif_tx_stop_all_queues(wx->netdev);
+	netif_carrier_off(wx->netdev);
+	netif_tx_disable(wx->netdev);
+	wx_napi_disable_all(wx);
+	wx_ptp_quiesce(wx);
+
+	clear_bit(WX_FLAG_NEED_PF_RESET, wx->flags);
+	timer_delete_sync(&wx->service_timer);
+}
+EXPORT_SYMBOL(wx_soft_quiesce);
+
 MODULE_DESCRIPTION("Common library for Wangxun(R) Ethernet drivers.");
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.h b/drivers/net/ethernet/wangxun/libwx/wx_lib.h
index aed6ea8cf0d6..11bd79985e17 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.h
@@ -41,5 +41,6 @@ void wx_set_ring(struct wx *wx, u32 new_tx_count,
 void wx_service_event_schedule(struct wx *wx);
 void wx_service_event_complete(struct wx *wx);
 void wx_service_timer(struct timer_list *t);
+void wx_soft_quiesce(struct wx *wx);
 
 #endif /* _WX_LIB_H_ */
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_ptp.c b/drivers/net/ethernet/wangxun/libwx/wx_ptp.c
index 44f3e6505246..a25eb6aed566 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_ptp.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_ptp.c
@@ -321,6 +321,9 @@ static long wx_ptp_do_aux_work(struct ptp_clock_info *ptp)
 	struct wx *wx = container_of(ptp, struct wx, ptp_caps);
 	int ts_done;
 
+	if (!test_bit(WX_STATE_PTP_RUNNING, wx->state))
+		return HZ;
+
 	ts_done = wx_ptp_tx_hwtstamp_work(wx);
 
 	wx_ptp_overflow_check(wx);
@@ -842,6 +845,24 @@ void wx_ptp_stop(struct wx *wx)
 }
 EXPORT_SYMBOL(wx_ptp_stop);
 
+void wx_ptp_quiesce(struct wx *wx)
+{
+	if (!test_and_clear_bit(WX_STATE_PTP_RUNNING, wx->state))
+		return;
+
+	clear_bit(WX_FLAG_PTP_PPS_ENABLED, wx->flags);
+
+	if (wx->ptp_clock)
+		ptp_cancel_worker_sync(wx->ptp_clock);
+
+	if (wx->ptp_tx_skb) {
+		dev_kfree_skb_any(wx->ptp_tx_skb);
+		wx->ptp_tx_skb = NULL;
+	}
+	clear_bit_unlock(WX_STATE_PTP_TX_IN_PROGRESS, wx->state);
+}
+EXPORT_SYMBOL(wx_ptp_quiesce);
+
 /**
  * wx_ptp_rx_hwtstamp - utility function which checks for RX time stamp
  * @wx: pointer to wx struct
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_ptp.h b/drivers/net/ethernet/wangxun/libwx/wx_ptp.h
index 50db90a6e3ee..ad2f824875d5 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_ptp.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_ptp.h
@@ -10,6 +10,7 @@ void wx_ptp_reset(struct wx *wx);
 void wx_ptp_init(struct wx *wx);
 void wx_ptp_suspend(struct wx *wx);
 void wx_ptp_stop(struct wx *wx);
+void wx_ptp_quiesce(struct wx *wx);
 void wx_ptp_rx_hwtstamp(struct wx *wx, struct sk_buff *skb);
 int wx_hwtstamp_get(struct net_device *dev,
 		    struct kernel_hwtstamp_config *cfg);
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
index a8773712cff8..a7bde03a98fe 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
@@ -94,6 +94,10 @@ static void txgbe_module_detection_subtask(struct wx *wx)
 {
 	int err;
 
+	if (test_bit(WX_STATE_DOWN, wx->state) ||
+	    test_bit(WX_STATE_RESETTING, wx->state))
+		return;
+
 	if (!test_and_clear_bit(WX_FLAG_NEED_MODULE_RESET, wx->flags))
 		return;
 
@@ -107,6 +111,10 @@ static void txgbe_module_detection_subtask(struct wx *wx)
 
 static void txgbe_link_config_subtask(struct wx *wx)
 {
+	if (test_bit(WX_STATE_DOWN, wx->state) ||
+	    test_bit(WX_STATE_RESETTING, wx->state))
+		return;
+
 	if (!test_and_clear_bit(WX_FLAG_NEED_LINK_CONFIG, wx->flags))
 		return;
 
-- 
2.51.0



* [PATCH net-next v9 3/5] net: wangxun: add reinit parameter to wx->do_reset callback
From: Jiawen Wu @ 2026-07-01  7:23 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Aleksandr Loktionov, Jacob Keller, Michal Swiatkowski,
	Simon Horman, Kees Cook, Larysa Zaremba, Greg Kroah-Hartman,
	Thomas Gleixner, Breno Leitao, Rongguang Wei,
	Uwe Kleine-König (The Capable Hub), Fabio Baltieri,
	Jiawen Wu
In-Reply-To: <20260701072357.33984-1-jiawenwu@trustnetic.com>

To implement a simple hardware reset without tearing down the network
interface state, introduce a boolean 'reinit' parameter to the
wx->do_reset callback.
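
The dispatch this parameter enables can be modelled in a few lines of
userspace C. The sketch below is illustrative only: struct nic_model
and model_do_reset() are hypothetical names, and the counters stand in
for the real reinit and hardware-reset paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an adapter; not the driver's struct wx. */
struct nic_model {
	bool running;		/* stands in for netif_running() */
	int reinit_count;	/* full down/up cycles taken */
	int hw_reset_count;	/* plain hardware resets taken */
};

/*
 * Full reinit only when the interface is running and the caller asked
 * for it; otherwise fall back to the plain hardware reset.
 */
static void model_do_reset(struct nic_model *nic, bool reinit)
{
	assert(nic);
	if (nic->running && reinit)
		nic->reinit_count++;
	else
		nic->hw_reset_count++;
}
```

Callers in this patch pass true, which keeps the previous behaviour;
passing false selects the plain reset even on a running interface.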

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
 drivers/net/ethernet/wangxun/libwx/wx_err.c     | 2 +-
 drivers/net/ethernet/wangxun/libwx/wx_ethtool.c | 2 +-
 drivers/net/ethernet/wangxun/libwx/wx_lib.c     | 4 ++--
 drivers/net/ethernet/wangxun/libwx/wx_type.h    | 2 +-
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c   | 4 ++--
 drivers/net/ethernet/wangxun/ngbe/ngbe_type.h   | 2 +-
 drivers/net/ethernet/wangxun/txgbe/txgbe_main.c | 4 ++--
 drivers/net/ethernet/wangxun/txgbe/txgbe_type.h | 2 +-
 8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/wangxun/libwx/wx_err.c b/drivers/net/ethernet/wangxun/libwx/wx_err.c
index b6e2d16d4a16..ee27f96735dc 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_err.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_err.c
@@ -16,7 +16,7 @@ static void wx_pf_reset_subtask(struct wx *wx)
 
 	wx_warn(wx, "Reset adapter.\n");
 	if (wx->do_reset)
-		wx->do_reset(wx->netdev);
+		wx->do_reset(wx->netdev, true);
 }
 
 static void wx_reset_task(struct work_struct *work)
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_ethtool.c b/drivers/net/ethernet/wangxun/libwx/wx_ethtool.c
index 5df971aca9e3..d1356ff5d69b 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_ethtool.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_ethtool.c
@@ -395,7 +395,7 @@ static void wx_update_rsc(struct wx *wx)
 
 	/* reset the device to apply the new RSC setting */
 	if (need_reset && wx->do_reset)
-		wx->do_reset(netdev);
+		wx->do_reset(netdev, true);
 }
 
 int wx_set_coalesce(struct net_device *netdev,
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index da4d9e229c9e..e5a45356ba00 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -3148,7 +3148,7 @@ int wx_set_features(struct net_device *netdev, netdev_features_t features)
 	netdev->features = features;
 
 	if (changed & NETIF_F_HW_VLAN_CTAG_RX && wx->do_reset)
-		wx->do_reset(netdev);
+		wx->do_reset(netdev, true);
 	else if (changed & (NETIF_F_HW_VLAN_CTAG_RX | NETIF_F_HW_VLAN_CTAG_FILTER))
 		wx_set_rx_mode(netdev);
 
@@ -3198,7 +3198,7 @@ int wx_set_features(struct net_device *netdev, netdev_features_t features)
 
 out:
 	if (need_reset && wx->do_reset)
-		wx->do_reset(netdev);
+		wx->do_reset(netdev, true);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
index 75d74ca2e259..a8b4e84787f4 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
@@ -1408,7 +1408,7 @@ struct wx {
 	void (*atr)(struct wx_ring *ring, struct wx_tx_buffer *first, u8 ptype);
 	void (*configure_fdir)(struct wx *wx);
 	int (*setup_tc)(struct net_device *netdev, u8 tc);
-	void (*do_reset)(struct net_device *netdev);
+	void (*do_reset)(struct net_device *netdev, bool reinit);
 	int (*ptp_setup_sdp)(struct wx *wx);
 	void (*set_num_queues)(struct wx *wx);
 
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
index 996c48da52d7..92895f503511 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -633,11 +633,11 @@ static void ngbe_reinit_locked(struct wx *wx)
 	mutex_unlock(&wx->reset_lock);
 }
 
-void ngbe_do_reset(struct net_device *netdev)
+void ngbe_do_reset(struct net_device *netdev, bool reinit)
 {
 	struct wx *wx = netdev_priv(netdev);
 
-	if (netif_running(netdev))
+	if (netif_running(netdev) && reinit)
 		ngbe_reinit_locked(wx);
 	else
 		ngbe_reset(wx);
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h b/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
index 4f648f272c08..c9233dc7ae50 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
@@ -125,6 +125,6 @@ extern char ngbe_driver_name[];
 void ngbe_down(struct wx *wx);
 void ngbe_up(struct wx *wx);
 int ngbe_setup_tc(struct net_device *dev, u8 tc);
-void ngbe_do_reset(struct net_device *netdev);
+void ngbe_do_reset(struct net_device *netdev, bool reinit);
 
 #endif /* _NGBE_TYPE_H_ */
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
index b1615f82a265..a8773712cff8 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
@@ -610,11 +610,11 @@ static void txgbe_reinit_locked(struct wx *wx)
 	mutex_unlock(&wx->reset_lock);
 }
 
-void txgbe_do_reset(struct net_device *netdev)
+void txgbe_do_reset(struct net_device *netdev, bool reinit)
 {
 	struct wx *wx = netdev_priv(netdev);
 
-	if (netif_running(netdev))
+	if (netif_running(netdev) && reinit)
 		txgbe_reinit_locked(wx);
 	else
 		txgbe_reset(wx);
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_type.h b/drivers/net/ethernet/wangxun/txgbe/txgbe_type.h
index 877234e3fdc2..3e93a3f309c1 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_type.h
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_type.h
@@ -313,7 +313,7 @@ extern char txgbe_driver_name[];
 void txgbe_down(struct wx *wx);
 void txgbe_up(struct wx *wx);
 int txgbe_setup_tc(struct net_device *dev, u8 tc);
-void txgbe_do_reset(struct net_device *netdev);
+void txgbe_do_reset(struct net_device *netdev, bool reinit);
 
 #define DECLARE_PHY_INTERFACE_MASK_ZERO(name) \
 	unsigned long name[PHY_INTERFACE_MODE_MAX] = { 0, }
-- 
2.51.0



* [PATCH net-next v9 0/5] net: wangxun: timeout and error
From: Jiawen Wu @ 2026-07-01  7:23 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Aleksandr Loktionov, Jacob Keller, Michal Swiatkowski,
	Simon Horman, Kees Cook, Larysa Zaremba, Greg Kroah-Hartman,
	Thomas Gleixner, Breno Leitao, Rongguang Wei,
	Uwe Kleine-König (The Capable Hub), Fabio Baltieri,
	Jiawen Wu

This series adds Tx timeout handling and pci_error_handlers.
When a PCIe error occurs, the txgbe device is able to recover on
platforms that support AER interrupts. For a Tx timeout, the txgbe driver
can recover the device through the reset process.

For ngbe devices, the required functionality is not yet implemented, so
they cannot be fully recovered after a PCIe error or Tx timeout. This
support will be completed in the future.

---
Changelog:
v9:
- Call ptp_cancel_worker_sync() in wx_ptp_quiesce().
- Fix the typo of 'out' for wx_control_hw().

v8: https://lore.kernel.org/all/20260630031016.19820-1-jiawenwu@trustnetic.com
- Not destroying PTP clock in wx_soft_quiesce(), and keeping the PTP worker
  alive but idle during PCIe recovery.
- Move wx_soft_quiesce() after wx_napi_disable_all().
- Use PCI_COMMAND_MEMORY and U16_MAX instead of magic number.
- Fix the leak of wx_control_hw() when WX_STATE_RES_FREED is set.

v7: https://lore.kernel.org/all/20260615065016.21672-1-jiawenwu@trustnetic.com
- Move ptp_clock_unregister() to be executed before freeing wx->ptp_tx_skb.

v6: https://lore.kernel.org/all/20260610060917.23980-1-jiawenwu@trustnetic.com
- Move the check of device status inside wx_soft_quiesce().
- Reverse the error return of txgbe_disable_device().
- Add PCIe error check in tx_timeout.
- Add WX_STATE_RES_FREED flag to avoid double-free of resources.

v5: https://lore.kernel.org/all/20260604085631.12720-1-jiawenwu@trustnetic.com
- Avoid the same name on two functions.
- Encode the device identity into the name of reset work queue.
- Change pr_err() to wx_err().
- Check WX_STATE_DOWN and WX_STATE_RESETTING at the entry of every work item.
- Implement wx_ptp_quiesce().
- Add netif_carrier_off() and netif_tx_disable() in soft_quiesce.
- Move resource free operations after PCIe recovery.
- Return error code in down path.

v4: https://lore.kernel.org/all/20260601072221.2952-1-jiawenwu@trustnetic.com
- Create a separate work queue for the reset task.
- Gate wx_watchdog_flush_tx() on netif_running().
- Add rtnl_lock() around wx->do_reset() in wx_io_slot_reset().
- Change .close_suspend() to .soft_quiesce() to avoid MMIO when PCI
  channel is frozen.

v3: https://lore.kernel.org/all/20260509100540.32612-1-jiawenwu@trustnetic.com
- Merge the multi-line string into one line in wx_handle_tx_hang().
- Remove the redundant warn messages.
- Use test_and_clear_bit() instead of checking the flag bit then clear it.
- Drop the Tx hang check in tx_timeout.
- Call wx_update_stats() before wx_check_tx_hang().
- Add Tx flush when link lost.
- Move wx_ptp_stop() into wx->close_suspend().
- Drop V2 patch 5/6 because WOL packets are handled before DMA ring.
- Check wx NULL pointer in wx_io_error_detected().
- Check perm failure before hardware teardown.

v2: https://lore.kernel.org/all/20260430082517.19612-1-jiawenwu@trustnetic.com
- Add the missing rtnl_unlock() at early return in wx_reset_subtask().
- Replace ngbe_close() with ngbe_close_suspend() in ngbe_dev_shutdown().
- Add a patch to clear stored DMA addresses.
 
v1: https://lore.kernel.org/r/20260428021156.13564-1-jiawenwu@trustnetic.com
---

Jiawen Wu (5):
  net: ngbe: implement libwx reset ops
  net: wangxun: add Tx timeout process
  net: wangxun: add reinit parameter to wx->do_reset callback
  net: wangxun: implement soft quiesce for PCIe error recovery
  net: wangxun: add pcie error handler

 drivers/net/ethernet/wangxun/libwx/Makefile   |   2 +-
 drivers/net/ethernet/wangxun/libwx/wx_err.c   | 321 ++++++++++++++++++
 drivers/net/ethernet/wangxun/libwx/wx_err.h   |  18 +
 .../net/ethernet/wangxun/libwx/wx_ethtool.c   |   2 +-
 drivers/net/ethernet/wangxun/libwx/wx_hw.c    |  17 +-
 drivers/net/ethernet/wangxun/libwx/wx_lib.c   |  59 +++-
 drivers/net/ethernet/wangxun/libwx/wx_lib.h   |   1 +
 drivers/net/ethernet/wangxun/libwx/wx_ptp.c   |  21 ++
 drivers/net/ethernet/wangxun/libwx/wx_ptp.h   |   1 +
 drivers/net/ethernet/wangxun/libwx/wx_type.h  |  25 +-
 .../net/ethernet/wangxun/ngbe/ngbe_ethtool.c  |   1 -
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c |  83 ++++-
 drivers/net/ethernet/wangxun/ngbe/ngbe_type.h |   1 +
 .../net/ethernet/wangxun/txgbe/txgbe_main.c   |  57 +++-
 .../net/ethernet/wangxun/txgbe/txgbe_type.h   |   2 +-
 15 files changed, 592 insertions(+), 19 deletions(-)
 create mode 100644 drivers/net/ethernet/wangxun/libwx/wx_err.c
 create mode 100644 drivers/net/ethernet/wangxun/libwx/wx_err.h

-- 
2.51.0



* [PATCH net-next v9 1/5] net: ngbe: implement libwx reset ops
From: Jiawen Wu @ 2026-07-01  7:23 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Aleksandr Loktionov, Jacob Keller, Michal Swiatkowski,
	Simon Horman, Kees Cook, Larysa Zaremba, Greg Kroah-Hartman,
	Thomas Gleixner, Breno Leitao, Rongguang Wei,
	Uwe Kleine-König (The Capable Hub), Fabio Baltieri,
	Jiawen Wu
In-Reply-To: <20260701072357.33984-1-jiawenwu@trustnetic.com>

Implement wx->do_reset() so that the libwx library module can call it.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
 .../net/ethernet/wangxun/ngbe/ngbe_ethtool.c  |  1 -
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c | 37 ++++++++++++++++++-
 drivers/net/ethernet/wangxun/ngbe/ngbe_type.h |  1 +
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_ethtool.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_ethtool.c
index b2e191982803..1960f7154151 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_ethtool.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_ethtool.c
@@ -59,7 +59,6 @@ static int ngbe_set_ringparam(struct net_device *netdev,
 	wx_set_ring(wx, new_tx_count, new_rx_count, temp_ring);
 	kvfree(temp_ring);
 
-	wx_configure(wx);
 	ngbe_up(wx);
 
 clear_reset:
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
index a16221995909..bbbec9b43bc2 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -133,6 +133,7 @@ static int ngbe_sw_init(struct wx *wx)
 
 	wx->mbx.size = WX_VXMAILBOX_SIZE;
 	wx->setup_tc = ngbe_setup_tc;
+	wx->do_reset = ngbe_do_reset;
 	set_bit(0, &wx->fwd_bitmask);
 
 	return 0;
@@ -423,7 +424,7 @@ void ngbe_down(struct wx *wx)
 	wx_clean_all_rx_rings(wx);
 }
 
-void ngbe_up(struct wx *wx)
+static void ngbe_up_complete(struct wx *wx)
 {
 	wx_configure_vectors(wx);
 
@@ -490,7 +491,7 @@ static int ngbe_open(struct net_device *netdev)
 
 	wx_ptp_init(wx);
 
-	ngbe_up(wx);
+	ngbe_up_complete(wx);
 
 	return 0;
 err_dis_phy:
@@ -503,6 +504,12 @@ static int ngbe_open(struct net_device *netdev)
 	return err;
 }
 
+void ngbe_up(struct wx *wx)
+{
+	wx_configure(wx);
+	ngbe_up_complete(wx);
+}
+
 /**
  * ngbe_close - Disables a network interface
  * @netdev: network interface device structure
@@ -590,6 +597,8 @@ int ngbe_setup_tc(struct net_device *dev, u8 tc)
 	 */
 	if (netif_running(dev))
 		ngbe_close(dev);
+	else
+		ngbe_reset(wx);
 
 	wx_clear_interrupt_scheme(wx);
 
@@ -606,6 +615,30 @@ int ngbe_setup_tc(struct net_device *dev, u8 tc)
 	return 0;
 }
 
+static void ngbe_reinit_locked(struct wx *wx)
+{
+	netif_trans_update(wx->netdev);
+
+	mutex_lock(&wx->reset_lock);
+	set_bit(WX_STATE_RESETTING, wx->state);
+
+	ngbe_down(wx);
+	ngbe_up(wx);
+
+	clear_bit(WX_STATE_RESETTING, wx->state);
+	mutex_unlock(&wx->reset_lock);
+}
+
+void ngbe_do_reset(struct net_device *netdev)
+{
+	struct wx *wx = netdev_priv(netdev);
+
+	if (netif_running(netdev))
+		ngbe_reinit_locked(wx);
+	else
+		ngbe_reset(wx);
+}
+
 static const struct net_device_ops ngbe_netdev_ops = {
 	.ndo_open               = ngbe_open,
 	.ndo_stop               = ngbe_close,
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h b/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
index 7077a0da4c98..4f648f272c08 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_type.h
@@ -125,5 +125,6 @@ extern char ngbe_driver_name[];
 void ngbe_down(struct wx *wx);
 void ngbe_up(struct wx *wx);
 int ngbe_setup_tc(struct net_device *dev, u8 tc);
+void ngbe_do_reset(struct net_device *netdev);
 
 #endif /* _NGBE_TYPE_H_ */
-- 
2.51.0



* [PATCH net-next v9 2/5] net: wangxun: add Tx timeout process
From: Jiawen Wu @ 2026-07-01  7:23 UTC (permalink / raw)
  To: netdev
  Cc: Mengyuan Lou, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Richard Cochran, Russell King,
	Aleksandr Loktionov, Jacob Keller, Michal Swiatkowski,
	Simon Horman, Kees Cook, Larysa Zaremba, Greg Kroah-Hartman,
	Thomas Gleixner, Breno Leitao, Rongguang Wei,
	Uwe Kleine-König (The Capable Hub), Fabio Baltieri,
	Jiawen Wu
In-Reply-To: <20260701072357.33984-1-jiawenwu@trustnetic.com>

Implement .ndo_tx_timeout to handle Tx timeout events. When a Tx
timeout occurs, it triggers the driver's reset process. A separate
work queue is allocated for the reset process.

The WX_HANG_CHECK_ARMED bit is set to indicate a potential hang. It will
be cleared if a pause frame is received to avoid false hang detection
caused by pause frames.
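
A rough userspace model of such an "armed" check follows. It is an
assumption based on the two-strike pattern commonly used for Tx hang
detection, not a copy of this patch: struct tx_ring_model, its fields
and model_check_tx_hang() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ring model; not the driver's struct wx_ring. */
struct tx_ring_model {
	unsigned long done;	/* descriptors completed so far */
	unsigned long done_old;	/* snapshot from the previous check */
	unsigned long pending;	/* descriptors still queued */
	bool armed;		/* stands in for WX_HANG_CHECK_ARMED */
};

/*
 * Report a hang only when no progress was seen on two consecutive
 * checks. A received pause frame disarms the check, so a stall caused
 * by flow control needs two fresh stalled checks before it is reported.
 */
static bool model_check_tx_hang(struct tx_ring_model *r, bool pause_seen)
{
	assert(r);
	if (pause_seen)
		r->armed = false;

	if (r->done_old == r->done && r->pending) {
		bool was_armed = r->armed;

		r->armed = true;	/* first strike arms the check */
		return was_armed;	/* second strike reports the hang */
	}

	/* Progress was made: take a new snapshot and disarm. */
	r->done_old = r->done;
	r->armed = false;
	return false;
}
```

In this model a single stalled check never reports a hang, and a pause
frame restarts the two-check window.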

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
---
 drivers/net/ethernet/wangxun/libwx/Makefile   |   2 +-
 drivers/net/ethernet/wangxun/libwx/wx_err.c   | 175 ++++++++++++++++++
 drivers/net/ethernet/wangxun/libwx/wx_err.h   |  16 ++
 drivers/net/ethernet/wangxun/libwx/wx_hw.c    |  17 +-
 drivers/net/ethernet/wangxun/libwx/wx_lib.c   |  37 ++++
 drivers/net/ethernet/wangxun/libwx/wx_type.h  |  19 +-
 drivers/net/ethernet/wangxun/ngbe/ngbe_main.c |  14 ++
 .../net/ethernet/wangxun/txgbe/txgbe_main.c   |  14 ++
 8 files changed, 289 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/ethernet/wangxun/libwx/wx_err.c
 create mode 100644 drivers/net/ethernet/wangxun/libwx/wx_err.h

diff --git a/drivers/net/ethernet/wangxun/libwx/Makefile b/drivers/net/ethernet/wangxun/libwx/Makefile
index a71b0ad77de3..c8724bb129aa 100644
--- a/drivers/net/ethernet/wangxun/libwx/Makefile
+++ b/drivers/net/ethernet/wangxun/libwx/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_LIBWX) += libwx.o
 
-libwx-objs := wx_hw.o wx_lib.o wx_ethtool.o wx_ptp.o wx_mbx.o wx_sriov.o
+libwx-objs := wx_hw.o wx_lib.o wx_ethtool.o wx_ptp.o wx_mbx.o wx_sriov.o wx_err.o
 libwx-objs += wx_vf.o wx_vf_lib.o wx_vf_common.o
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_err.c b/drivers/net/ethernet/wangxun/libwx/wx_err.c
new file mode 100644
index 000000000000..b6e2d16d4a16
--- /dev/null
+++ b/drivers/net/ethernet/wangxun/libwx/wx_err.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2015 - 2026 Beijing WangXun Technology Co., Ltd. */
+/* Copyright (c) 1999 - 2026 Intel Corporation. */
+
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+
+#include "wx_type.h"
+#include "wx_lib.h"
+#include "wx_err.h"
+
+static void wx_pf_reset_subtask(struct wx *wx)
+{
+	if (!test_and_clear_bit(WX_FLAG_NEED_PF_RESET, wx->flags))
+		return;
+
+	wx_warn(wx, "Reset adapter.\n");
+	if (wx->do_reset)
+		wx->do_reset(wx->netdev);
+}
+
+static void wx_reset_task(struct work_struct *work)
+{
+	struct wx *wx = container_of(work, struct wx, reset_task);
+
+	rtnl_lock();
+
+	if (test_bit(WX_STATE_DOWN, wx->state) ||
+	    test_bit(WX_STATE_RESETTING, wx->state))
+		goto out;
+
+	wx_pf_reset_subtask(wx);
+
+out:
+	rtnl_unlock();
+}
+
+void wx_check_err_subtask(struct wx *wx)
+{
+	if (test_bit(WX_FLAG_NEED_PF_RESET, wx->flags))
+		queue_work(wx->reset_wq, &wx->reset_task);
+}
+EXPORT_SYMBOL(wx_check_err_subtask);
+
+int wx_init_err_task(struct wx *wx)
+{
+	wx->reset_wq = alloc_workqueue("%s_reset_wq_%x", WQ_UNBOUND | WQ_HIGHPRI,
+				       1, wx->driver_name, pci_dev_id(wx->pdev));
+	if (!wx->reset_wq) {
+		wx_err(wx, "Failed to create wx_reset_wq workqueue\n");
+		return -ENOMEM;
+	}
+
+	INIT_WORK(&wx->reset_task, wx_reset_task);
+	return 0;
+}
+EXPORT_SYMBOL(wx_init_err_task);
+
+static bool wx_ring_tx_pending(struct wx *wx)
+{
+	int i;
+
+	for (i = 0; i < wx->num_tx_queues; i++) {
+		struct wx_ring *tx_ring = wx->tx_ring[i];
+
+		if (tx_ring->next_to_use != tx_ring->next_to_clean)
+			return true;
+	}
+
+	return false;
+}
+
+static bool wx_vf_tx_pending(struct wx *wx)
+{
+	struct wx_ring_feature *vmdq = &wx->ring_feature[RING_F_VMDQ];
+	u32 q_per_pool = __ALIGN_MASK(1, ~vmdq->mask);
+	u32 i, j;
+
+	if (!wx->num_vfs)
+		return false;
+
+	for (i = 0; i < wx->num_vfs; i++) {
+		for (j = 0; j < q_per_pool; j++) {
+			u32 h, t;
+
+			h = rd32(wx, WX_PX_TR_RP_PV(q_per_pool, i, j));
+			t = rd32(wx, WX_PX_TR_WP_PV(q_per_pool, i, j));
+
+			if (h != t)
+				return true;
+		}
+	}
+
+	return false;
+}
+
+static void wx_watchdog_flush_tx(struct wx *wx)
+{
+	if (!netif_running(wx->netdev))
+		return;
+	if (netif_carrier_ok(wx->netdev))
+		return;
+
+	if (wx_ring_tx_pending(wx) || wx_vf_tx_pending(wx)) {
+		/* We've lost link, so the controller stops DMA,
+		 * but we've got queued Tx work that's never going
+		 * to get done, so reset controller to flush Tx.
+		 * (Do the reset outside of interrupt context).
+		 */
+		wx_warn(wx, "initiating reset due to lost link with pending Tx work\n");
+		set_bit(WX_FLAG_NEED_PF_RESET, wx->flags);
+	}
+}
+
+static void wx_detect_tx_hang(struct wx *wx)
+{
+	int i;
+
+	/* If we're down or resetting, just bail */
+	if (!netif_running(wx->netdev) ||
+	    test_bit(WX_STATE_RESETTING, wx->state))
+		return;
+
+	/* Force detection of hung controller */
+	if (netif_carrier_ok(wx->netdev)) {
+		for (i = 0; i < wx->num_tx_queues; i++)
+			set_bit(WX_TX_DETECT_HANG, wx->tx_ring[i]->state);
+	}
+}
+
+void wx_check_hang_subtask(struct wx *wx)
+{
+	if (test_bit(WX_STATE_DOWN, wx->state) ||
+	    test_bit(WX_STATE_RESETTING, wx->state))
+		return;
+
+	wx_watchdog_flush_tx(wx);
+	wx_detect_tx_hang(wx);
+}
+EXPORT_SYMBOL(wx_check_hang_subtask);
+
+static void wx_tx_timeout_reset(struct wx *wx)
+{
+	if (test_bit(WX_STATE_DOWN, wx->state))
+		return;
+
+	set_bit(WX_FLAG_NEED_PF_RESET, wx->flags);
+	wx_warn(wx, "initiating reset due to tx timeout\n");
+	wx_service_event_schedule(wx);
+}
+
+void wx_tx_timeout(struct net_device *netdev, unsigned int __always_unused txqueue)
+{
+	struct wx *wx = netdev_priv(netdev);
+
+	wx_tx_timeout_reset(wx);
+}
+EXPORT_SYMBOL(wx_tx_timeout);
+
+void wx_handle_tx_hang(struct wx_ring *tx_ring, unsigned int next)
+{
+	struct wx *wx = netdev_priv(tx_ring->netdev);
+
+	wx_warn(wx,
+		"Detected Tx Unit Hang: Queue %d, TDH %x, TDT %x, ntu %x, ntc %x, ntc.time_stamp %lx, jiffies %lx\n",
+		tx_ring->queue_index,
+		rd32(wx, WX_PX_TR_RP(tx_ring->reg_idx)),
+		rd32(wx, WX_PX_TR_WP(tx_ring->reg_idx)),
+		tx_ring->next_to_use, next,
+		tx_ring->tx_buffer_info[next].time_stamp, jiffies);
+
+	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
+
+	wx_tx_timeout_reset(wx);
+}
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_err.h b/drivers/net/ethernet/wangxun/libwx/wx_err.h
new file mode 100644
index 000000000000..1eed13e48095
--- /dev/null
+++ b/drivers/net/ethernet/wangxun/libwx/wx_err.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * WangXun Gigabit PCI Express Linux driver
+ * Copyright (c) 2015 - 2026 Beijing WangXun Technology Co., Ltd.
+ */
+
+#ifndef _WX_ERR_H_
+#define _WX_ERR_H_
+
+void wx_check_err_subtask(struct wx *wx);
+int wx_init_err_task(struct wx *wx);
+void wx_check_hang_subtask(struct wx *wx);
+void wx_tx_timeout(struct net_device *netdev, unsigned int txqueue);
+void wx_handle_tx_hang(struct wx_ring *tx_ring, unsigned int next);
+
+#endif /* _WX_ERR_H_ */
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_hw.c b/drivers/net/ethernet/wangxun/libwx/wx_hw.c
index 260e14d5d541..122c4952d203 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_hw.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_hw.c
@@ -1932,6 +1932,7 @@ static void wx_configure_tx_ring(struct wx *wx,
 	else
 		ring->atr_sample_rate = 0;
 
+	bitmap_zero(ring->state, WX_RING_STATE_NBITS);
 	/* reinitialize tx_buffer_info */
 	memset(ring->tx_buffer_info, 0,
 	       sizeof(struct wx_tx_buffer) * ring->count);
@@ -2851,16 +2852,26 @@ EXPORT_SYMBOL(wx_fc_enable);
 static void wx_update_xoff_rx_lfc(struct wx *wx)
 {
 	struct wx_hw_stats *hwstats = &wx->stats;
+	u64 data;
+	int i;
 
 	if (wx->fc.mode != wx_fc_full &&
 	    wx->fc.mode != wx_fc_rx_pause)
 		return;
 
 	if (wx->mac.type >= wx_mac_aml)
-		hwstats->lxoffrxc += rd32_wrap(wx, WX_MAC_LXOFFRXC_AML,
-					       &wx->last_stats.lxoffrxc);
+		data = rd32_wrap(wx, WX_MAC_LXOFFRXC_AML,
+				 &wx->last_stats.lxoffrxc);
 	else
-		hwstats->lxoffrxc += rd64(wx, WX_MAC_LXOFFRXC);
+		data = rd64(wx, WX_MAC_LXOFFRXC);
+	hwstats->lxoffrxc += data;
+
+	/* refill credits (no tx hang) if we received xoff */
+	if (!data)
+		return;
+
+	for (i = 0; i < wx->num_tx_queues; i++)
+		clear_bit(WX_HANG_CHECK_ARMED, wx->tx_ring[i]->state);
 }
 
 /**
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
index d042567b8128..da4d9e229c9e 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
@@ -14,6 +14,7 @@
 
 #include "wx_type.h"
 #include "wx_lib.h"
+#include "wx_err.h"
 #include "wx_ptp.h"
 #include "wx_hw.h"
 #include "wx_vf_lib.h"
@@ -742,6 +743,37 @@ static struct netdev_queue *wx_txring_txq(const struct wx_ring *ring)
 	return netdev_get_tx_queue(ring->netdev, ring->queue_index);
 }
 
+static u32 wx_get_tx_pending(struct wx_ring *ring)
+{
+	unsigned int head, tail;
+
+	head = ring->next_to_clean;
+	tail = ring->next_to_use;
+
+	return ((head <= tail) ? tail : tail + ring->count) - head;
+}
+
+static bool wx_check_tx_hang(struct wx_ring *ring)
+{
+	u32 tx_done_old = ring->tx_stats.tx_done_old;
+	u32 tx_pending = wx_get_tx_pending(ring);
+	u32 tx_done = ring->stats.packets;
+
+	if (!test_and_clear_bit(WX_TX_DETECT_HANG, ring->state))
+		return false;
+
+	if (tx_done_old == tx_done && tx_pending)
+		/* make sure it is true for two checks in a row */
+		return test_and_set_bit(WX_HANG_CHECK_ARMED, ring->state);
+
+	/* update completed stats and continue */
+	ring->tx_stats.tx_done_old = tx_done;
+	/* reset the countdown */
+	clear_bit(WX_HANG_CHECK_ARMED, ring->state);
+
+	return false;
+}
+
 /**
  * wx_clean_tx_irq - Reclaim resources after transmit completes
  * @q_vector: structure containing interrupt and ring information
@@ -866,6 +898,11 @@ static bool wx_clean_tx_irq(struct wx_q_vector *q_vector,
 	netdev_tx_completed_queue(wx_txring_txq(tx_ring),
 				  total_packets, total_bytes);
 
+	if (wx_check_tx_hang(tx_ring)) {
+		wx_handle_tx_hang(tx_ring, i);
+		return true;
+	}
+
 #define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
 	if (unlikely(total_packets && netif_carrier_ok(tx_ring->netdev) &&
 		     (wx_desc_unused(tx_ring) >= TX_WAKE_THRESHOLD))) {
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
index c7befe4cdfe9..75d74ca2e259 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
+++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
@@ -450,6 +450,11 @@ enum WX_MSCA_CMD_value {
 #define WX_PX_TR_CFG_THRE_SHIFT      8
 #define WX_PX_TR_CFG_HEAD_WB         BIT(27)
 
+#define WX_PX_TR_RP_PV(q_per_pool, vf_number, vf_q_index) \
+		(WX_PX_TR_RP((q_per_pool) * (vf_number) + (vf_q_index)))
+#define WX_PX_TR_WP_PV(q_per_pool, vf_number, vf_q_index) \
+		(WX_PX_TR_WP((q_per_pool) * (vf_number) + (vf_q_index)))
+
 /* Receive DMA Registers */
 #define WX_PX_RR_BAL(_i)             (0x01000 + ((_i) * 0x40))
 #define WX_PX_RR_BAH(_i)             (0x01004 + ((_i) * 0x40))
@@ -1039,6 +1044,7 @@ struct wx_queue_stats {
 struct wx_tx_queue_stats {
 	u64 restart_queue;
 	u64 tx_busy;
+	u32 tx_done_old;
 };
 
 struct wx_rx_queue_stats {
@@ -1054,6 +1060,12 @@ struct wx_rx_queue_stats {
 #define wx_for_each_ring(posm, headm) \
 	for (posm = (headm).ring; posm; posm = posm->next)
 
+enum wx_ring_state {
+	WX_TX_DETECT_HANG,
+	WX_HANG_CHECK_ARMED,
+	WX_RING_STATE_NBITS
+};
+
 struct wx_ring_container {
 	struct wx_ring *ring;           /* pointer to linked list of rings */
 	unsigned int total_bytes;       /* total bytes processed this int */
@@ -1073,6 +1085,7 @@ struct wx_ring {
 		struct wx_tx_buffer *tx_buffer_info;
 		struct wx_rx_buffer *rx_buffer_info;
 	};
+	DECLARE_BITMAP(state, WX_RING_STATE_NBITS);
 	u8 __iomem *tail;
 	dma_addr_t dma;                 /* phys. address of descriptor ring */
 	dma_addr_t headwb_dma;
@@ -1274,6 +1287,7 @@ enum wx_pf_flags {
 	WX_FLAG_NEED_DO_RESET,
 	WX_FLAG_RX_MERGE_ENABLED,
 	WX_FLAG_TXHEAD_WB_ENABLED,
+	WX_FLAG_NEED_PF_RESET,
 	WX_PF_FLAGS_NBITS               /* must be last */
 };
 
@@ -1422,6 +1436,8 @@ struct wx {
 
 	struct timer_list service_timer;
 	struct work_struct service_task;
+	struct work_struct reset_task;
+	struct workqueue_struct *reset_wq;
 	struct mutex reset_lock; /* mutex for reset */
 };
 
@@ -1504,7 +1520,8 @@ rd32_wrap(struct wx *wx, u32 reg, u32 *last)
 
 #define wx_err(wx, fmt, arg...) \
 	dev_err(&(wx)->pdev->dev, fmt, ##arg)
-
+#define wx_warn(wx, fmt, arg...) \
+	dev_warn(&(wx)->pdev->dev, fmt, ##arg)
 #define wx_dbg(wx, fmt, arg...) \
 	dev_dbg(&(wx)->pdev->dev, fmt, ##arg)
 
diff --git a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
index bbbec9b43bc2..996c48da52d7 100644
--- a/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
+++ b/drivers/net/ethernet/wangxun/ngbe/ngbe_main.c
@@ -14,6 +14,7 @@
 #include "../libwx/wx_type.h"
 #include "../libwx/wx_hw.h"
 #include "../libwx/wx_lib.h"
+#include "../libwx/wx_err.h"
 #include "../libwx/wx_ptp.h"
 #include "../libwx/wx_mbx.h"
 #include "../libwx/wx_sriov.h"
@@ -148,6 +149,8 @@ static void ngbe_service_task(struct work_struct *work)
 	struct wx *wx = container_of(work, struct wx, service_task);
 
 	wx_update_stats(wx);
+	wx_check_hang_subtask(wx);
+	wx_check_err_subtask(wx);
 
 	wx_service_event_complete(wx);
 }
@@ -393,6 +396,7 @@ static void ngbe_disable_device(struct wx *wx)
 	netif_tx_stop_all_queues(netdev);
 	netif_tx_disable(netdev);
 
+	clear_bit(WX_FLAG_NEED_PF_RESET, wx->flags);
 	timer_delete_sync(&wx->service_timer);
 	cancel_work_sync(&wx->service_task);
 
@@ -644,6 +648,7 @@ static const struct net_device_ops ngbe_netdev_ops = {
 	.ndo_stop               = ngbe_close,
 	.ndo_change_mtu         = wx_change_mtu,
 	.ndo_start_xmit         = wx_xmit_frame,
+	.ndo_tx_timeout         = wx_tx_timeout,
 	.ndo_set_rx_mode        = wx_set_rx_mode,
 	.ndo_set_features       = wx_set_features,
 	.ndo_fix_features       = wx_fix_features,
@@ -733,6 +738,7 @@ static int ngbe_probe(struct pci_dev *pdev,
 	wx->driver_name = ngbe_driver_name;
 	ngbe_set_ethtool_ops(netdev);
 	netdev->netdev_ops = &ngbe_netdev_ops;
+	netdev->watchdog_timeo = 5 * HZ;
 
 	netdev->features = NETIF_F_SG | NETIF_F_IP_CSUM |
 			   NETIF_F_TSO | NETIF_F_TSO6 |
@@ -829,6 +835,10 @@ static int ngbe_probe(struct pci_dev *pdev,
 	eth_hw_addr_set(netdev, wx->mac.perm_addr);
 	wx_mac_set_default_filter(wx, wx->mac.perm_addr);
 
+	err = wx_init_err_task(wx);
+	if (err)
+		goto err_free_mac_table;
+
 	ngbe_init_service(wx);
 
 	err = wx_init_interrupt_scheme(wx);
@@ -856,6 +866,8 @@ static int ngbe_probe(struct pci_dev *pdev,
 err_cancel_service:
 	timer_delete_sync(&wx->service_timer);
 	cancel_work_sync(&wx->service_task);
+	cancel_work_sync(&wx->reset_task);
+	destroy_workqueue(wx->reset_wq);
 err_free_mac_table:
 	kfree(wx->rss_key);
 	kfree(wx->mac_table);
@@ -887,6 +899,8 @@ static void ngbe_remove(struct pci_dev *pdev)
 
 	timer_shutdown_sync(&wx->service_timer);
 	cancel_work_sync(&wx->service_task);
+	cancel_work_sync(&wx->reset_task);
+	destroy_workqueue(wx->reset_wq);
 
 	phylink_destroy(wx->phylink);
 	pci_release_selected_regions(pdev,
diff --git a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
index 20c5a295c6c2..b1615f82a265 100644
--- a/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
+++ b/drivers/net/ethernet/wangxun/txgbe/txgbe_main.c
@@ -14,6 +14,7 @@
 
 #include "../libwx/wx_type.h"
 #include "../libwx/wx_lib.h"
+#include "../libwx/wx_err.h"
 #include "../libwx/wx_ptp.h"
 #include "../libwx/wx_hw.h"
 #include "../libwx/wx_mbx.h"
@@ -123,6 +124,8 @@ static void txgbe_service_task(struct work_struct *work)
 	txgbe_module_detection_subtask(wx);
 	txgbe_link_config_subtask(wx);
 	wx_update_stats(wx);
+	wx_check_hang_subtask(wx);
+	wx_check_err_subtask(wx);
 
 	wx_service_event_complete(wx);
 }
@@ -224,6 +227,7 @@ static void txgbe_disable_device(struct wx *wx)
 	wx_irq_disable(wx);
 	wx_napi_disable_all(wx);
 
+	clear_bit(WX_FLAG_NEED_PF_RESET, wx->flags);
 	timer_delete_sync(&wx->service_timer);
 	cancel_work_sync(&wx->service_task);
 
@@ -654,6 +658,7 @@ static const struct net_device_ops txgbe_netdev_ops = {
 	.ndo_stop               = txgbe_close,
 	.ndo_change_mtu         = wx_change_mtu,
 	.ndo_start_xmit         = wx_xmit_frame,
+	.ndo_tx_timeout         = wx_tx_timeout,
 	.ndo_set_rx_mode        = wx_set_rx_mode,
 	.ndo_set_features       = wx_set_features,
 	.ndo_fix_features       = wx_fix_features,
@@ -745,6 +750,7 @@ static int txgbe_probe(struct pci_dev *pdev,
 	wx->driver_name = txgbe_driver_name;
 	txgbe_set_ethtool_ops(netdev);
 	netdev->netdev_ops = &txgbe_netdev_ops;
+	netdev->watchdog_timeo = 5 * HZ;
 	netdev->udp_tunnel_nic_info = &txgbe_udp_tunnels;
 
 	/* setup the private structure */
@@ -814,6 +820,10 @@ static int txgbe_probe(struct pci_dev *pdev,
 	eth_hw_addr_set(netdev, wx->mac.perm_addr);
 	wx_mac_set_default_filter(wx, wx->mac.perm_addr);
 
+	err = wx_init_err_task(wx);
+	if (err)
+		goto err_free_mac_table;
+
 	txgbe_init_service(wx);
 
 	err = wx_init_interrupt_scheme(wx);
@@ -916,6 +926,8 @@ static int txgbe_probe(struct pci_dev *pdev,
 err_cancel_service:
 	timer_delete_sync(&wx->service_timer);
 	cancel_work_sync(&wx->service_task);
+	cancel_work_sync(&wx->reset_task);
+	destroy_workqueue(wx->reset_wq);
 err_free_mac_table:
 	kfree(wx->rss_key);
 	kfree(wx->mac_table);
@@ -948,6 +960,8 @@ static void txgbe_remove(struct pci_dev *pdev)
 
 	timer_shutdown_sync(&wx->service_timer);
 	cancel_work_sync(&wx->service_task);
+	cancel_work_sync(&wx->reset_task);
+	destroy_workqueue(wx->reset_wq);
 
 	txgbe_remove_phy(txgbe);
 	wx_free_isb_resources(wx);
-- 
2.51.0
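
For readers following the hang detection added to wx_lib.c above, the
wrap-around arithmetic in wx_get_tx_pending() and the "two checks in a
row" rule in wx_check_tx_hang() can be modeled in plain userspace C. This
is only a sketch under stated assumptions: struct ring_model and its
boolean fields are hypothetical stand-ins for struct wx_ring and the
WX_TX_DETECT_HANG / WX_HANG_CHECK_ARMED bits, and the atomic bit
operations of the driver are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-ring fields used by the hang check. */
struct ring_model {
	uint32_t next_to_clean;	/* head: next descriptor to reclaim */
	uint32_t next_to_use;	/* tail: next descriptor to fill */
	uint32_t count;		/* ring size in descriptors */
	uint32_t packets;	/* packets completed so far */
	uint32_t tx_done_old;	/* completed count seen by the last check */
	bool detect_hang;	/* models WX_TX_DETECT_HANG */
	bool armed;		/* models WX_HANG_CHECK_ARMED */
};

/* Descriptors queued but not yet cleaned, allowing for ring wrap. */
static uint32_t ring_pending(const struct ring_model *r)
{
	uint32_t head = r->next_to_clean;
	uint32_t tail = r->next_to_use;

	return (head <= tail ? tail : tail + r->count) - head;
}

/* Report a hang only when two consecutive requested checks see no progress. */
static bool ring_check_hang(struct ring_model *r)
{
	bool was_armed;

	/* A check runs only if one was requested; the request is consumed. */
	if (!r->detect_hang)
		return false;
	r->detect_hang = false;

	if (r->tx_done_old == r->packets && ring_pending(r)) {
		/* First stalled check arms; a second one reports the hang. */
		was_armed = r->armed;
		r->armed = true;
		return was_armed;
	}

	/* Progress was made (or nothing is pending): restart the countdown. */
	r->tx_done_old = r->packets;
	r->armed = false;
	return false;
}
```

With a 512-entry ring, next_to_clean = 510 and next_to_use = 2, the model
counts four pending descriptors. The first stalled check only arms the
ring, the second one reports the hang, and any completed packet in
between disarms it again.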



* [PATCH v3] ptp: ocp: add CPLD ISP support for ADVA TimeCard
From: Sagi Maimon @ 2026-07-01  7:22 UTC (permalink / raw)
  To: jonathan.lemon, vadim.fedorenko, richardcochran, andrew+netdev,
	davem, edumazet, kuba, pabeni
  Cc: linux-kernel, netdev, Sagi Maimon

The ADVA TimeCard uses a Lattice MachXO3 CPLD that is programmed over
I2C using in-system programming (ISP).

The CPLD is connected to a secondary I2C bus controlled by the onboard
MicroBlaze. Add support for taking ownership of this bus and exposing
the required interfaces through sysfs, allowing userspace tools to
perform CPLD programming.

To limit the scope of this functionality, sysfs-based I2C access is
available only on ADVA TimeCard boards and only for the CPLD-related
I2C slave addresses used during ISP.

Add sysfs support to:
  - control ownership of the secondary I2C bus
  - access the CPLD-related I2C slave addresses required for ISP

Signed-off-by: Sagi Maimon <maimon.sagi@gmail.com>
---

Addressed comments from:
 - Vadim Fedorenko : https://lore.kernel.org/all/c6aff5f7-e087-4bd9-b159-7adeb82e19f4@linux.dev/

Changes since version 2:
 - Allow I2C access via sysfs only on ADVA TimeCard boards.
 - Allow access only to the ADVA TimeCard-specific I2C slave
   addresses.
   
 drivers/ptp/ptp_ocp.c | 180 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 179 insertions(+), 1 deletion(-)

diff --git a/drivers/ptp/ptp_ocp.c b/drivers/ptp/ptp_ocp.c
index 35e911f1ad78..86c341ea4062 100644
--- a/drivers/ptp/ptp_ocp.c
+++ b/drivers/ptp/ptp_ocp.c
@@ -416,6 +416,10 @@ struct ptp_ocp {
 	dpll_tracker tracker;
 	int signals_nr;
 	int freq_in_nr;
+	/* cpld_i2c_xfer sysfs (adva_x1) */
+	struct mutex		tap_i2c_lock;
+	u8			tap_i2c_rsp[21]; /* [status, read_data...] */
+	size_t			tap_i2c_rsp_len;
 };
 
 #define OCP_REQ_TIMESTAMP	BIT(0)
@@ -3188,6 +3192,8 @@ ptp_ocp_adva_board_init(struct ptp_ocp *bp, struct ocp_resource *r)
 	ptp_ocp_nmea_out_init(bp);
 	ptp_ocp_signal_init(bp);
 
+	mutex_init(&bp->tap_i2c_lock);
+
 	err = ptp_ocp_attr_group_add(bp, info->attr_groups);
 	if (err)
 		return err;
@@ -4224,6 +4230,171 @@ static const struct ocp_attr_group art_timecard_groups[] = {
 	{ },
 };
 
+static ssize_t
+i2c_bus_ctrl_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct ptp_ocp *bp = dev_get_drvdata(dev);
+
+	if (!bp->pps_select)
+		return -ENODEV;
+	return sysfs_emit(buf, "0x%08x\n",
+			  ioread32(&bp->pps_select->__pad1));
+}
+
+static ssize_t
+i2c_bus_ctrl_store(struct device *dev, struct device_attribute *attr,
+		   const char *buf, size_t count)
+{
+	struct ptp_ocp *bp = dev_get_drvdata(dev);
+	u32 val;
+
+	if (!bp->pps_select)
+		return -ENODEV;
+	if (kstrtou32(buf, 0, &val))
+		return -EINVAL;
+	iowrite32(val, &bp->pps_select->__pad1);
+	return count;
+}
+
+static DEVICE_ATTR_RW(i2c_bus_ctrl);
+
+/*
+ * cpld_i2c_xfer - sysfs binary I2C passthrough for adva_x1 TAP CPLD.
+ *
+ * write: [addr][write_len][read_len][flags][write_data...]
+ *   flags bit 0: I2C_M_NOSTART on the read segment
+ * read:  [status][read_data...]
+ *   status 0 = success, else positive errno
+ *
+ * Only addresses 0x40 (CPLD) and 0x74 (mux) are permitted.
+ */
+#define TAP_I2C_ALLOWED_ADDRS_NUM  2
+static const u8 tap_i2c_allowed_addrs[TAP_I2C_ALLOWED_ADDRS_NUM] = {
+	0x40, /* CPLD */
+	0x74, /* mux  */
+};
+
+#define TAP_I2C_REQ_HDR_LEN   4
+#define TAP_I2C_MAX_WRITE_LEN 67
+#define TAP_I2C_MAX_READ_LEN  20
+#define TAP_I2C_FLAG_NOSTART  BIT(0)
+
+static int
+ptp_ocp_tap_i2c_find_adapter(struct device *dev, void *data)
+{
+	struct i2c_adapter **adap = data;
+
+	*adap = i2c_verify_adapter(dev);
+	return (*adap) ? 1 : 0;
+}
+
+static ssize_t
+ptp_ocp_cpld_i2c_write(struct file *file, struct kobject *kobj,
+		       const struct bin_attribute *attr,
+		       char *buf, loff_t off, size_t count)
+{
+	struct ptp_ocp *bp = dev_get_drvdata(kobj_to_dev(kobj));
+	const u8 *req = (const u8 *)buf;
+	u8 addr, write_len, read_len, flags;
+	struct i2c_adapter *adap = NULL;
+	struct i2c_msg msgs[2];
+	u8 rdbuf[TAP_I2C_MAX_READ_LEN];
+	int nmsgs, ret, i;
+
+	if (count < TAP_I2C_REQ_HDR_LEN || count > TAP_I2C_REQ_HDR_LEN + TAP_I2C_MAX_WRITE_LEN)
+		return -EINVAL;
+
+	addr      = req[0];
+	write_len = req[1];
+	read_len  = req[2];
+	flags     = req[3];
+
+	/* Validate */
+	for (i = 0; i < TAP_I2C_ALLOWED_ADDRS_NUM; i++)
+		if (addr == tap_i2c_allowed_addrs[i])
+			break;
+	if (i == TAP_I2C_ALLOWED_ADDRS_NUM)
+		return -EPERM;
+
+	if (write_len > TAP_I2C_MAX_WRITE_LEN)
+		return -EINVAL;
+	if (read_len > TAP_I2C_MAX_READ_LEN)
+		return -EINVAL;
+	if (write_len + TAP_I2C_REQ_HDR_LEN > count)
+		return -EINVAL;
+	if (write_len == 0 && read_len == 0)
+		return -EINVAL;
+
+	if (!bp->i2c_ctrl)
+		return -ENODEV;
+	device_for_each_child(&bp->i2c_ctrl->dev, &adap,
+			      ptp_ocp_tap_i2c_find_adapter);
+	if (!adap)
+		return -ENODEV;
+
+	nmsgs = 0;
+	if (write_len > 0) {
+		msgs[nmsgs].addr  = addr;
+		msgs[nmsgs].flags = 0;
+		msgs[nmsgs].len   = write_len;
+		msgs[nmsgs].buf   = (u8 *)req + TAP_I2C_REQ_HDR_LEN;
+		nmsgs++;
+	}
+	if (read_len > 0) {
+		u16 rd_flags = I2C_M_RD;
+
+		if (flags & TAP_I2C_FLAG_NOSTART)
+			rd_flags |= I2C_M_NOSTART;
+		msgs[nmsgs].addr  = addr;
+		msgs[nmsgs].flags = rd_flags;
+		msgs[nmsgs].len   = read_len;
+		msgs[nmsgs].buf   = rdbuf;
+		nmsgs++;
+	}
+
+	ret = i2c_transfer(adap, msgs, nmsgs);
+
+	mutex_lock(&bp->tap_i2c_lock);
+	if (ret == nmsgs) {
+		bp->tap_i2c_rsp[0] = 0;
+		if (read_len > 0)
+			memcpy(&bp->tap_i2c_rsp[1], rdbuf, read_len);
+		bp->tap_i2c_rsp_len = 1 + read_len;
+		ret = count;
+	} else {
+		bp->tap_i2c_rsp[0]  = (u8)(ret < 0 ? -ret : EIO);
+		bp->tap_i2c_rsp_len = 1;
+		ret = (ret < 0) ? ret : -EIO;
+	}
+	mutex_unlock(&bp->tap_i2c_lock);
+
+	return ret;
+}
+
+static ssize_t
+ptp_ocp_cpld_i2c_read(struct file *file, struct kobject *kobj,
+		      const struct bin_attribute *attr,
+		      char *buf, loff_t off, size_t count)
+{
+	struct ptp_ocp *bp = dev_get_drvdata(kobj_to_dev(kobj));
+	ssize_t ret;
+
+	if (off > 0)
+		return 0;
+
+	mutex_lock(&bp->tap_i2c_lock);
+	ret = min(count, bp->tap_i2c_rsp_len);
+	memcpy(buf, bp->tap_i2c_rsp, ret);
+	mutex_unlock(&bp->tap_i2c_lock);
+	return ret;
+}
+
+static const struct bin_attribute tap_i2c_bin_attr = {
+	.attr  = { .name = "cpld_i2c_xfer", .mode = 0600 },
+	.write = ptp_ocp_cpld_i2c_write,
+	.read  = ptp_ocp_cpld_i2c_read,
+};
+
 static struct attribute *adva_timecard_attrs[] = {
 	&dev_attr_serialnum.attr,
 	&dev_attr_gnss_sync.attr,
@@ -4272,11 +4443,18 @@ static struct attribute *adva_timecard_x1_attrs[] = {
 	&dev_attr_ts_window_adjust.attr,
 	&dev_attr_utc_tai_offset.attr,
 	&dev_attr_tod_correction.attr,
+	&dev_attr_i2c_bus_ctrl.attr,
+	NULL,
+};
+
+static const struct bin_attribute *const bin_adva_x1_timecard_attrs[] = {
+	&tap_i2c_bin_attr,
 	NULL,
 };
 
 static const struct attribute_group adva_timecard_x1_group = {
-	.attrs = adva_timecard_x1_attrs,
+	.attrs     = adva_timecard_x1_attrs,
+	.bin_attrs = bin_adva_x1_timecard_attrs,
 };
 
 static const struct ocp_attr_group adva_timecard_x1_groups[] = {
-- 
2.47.0
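
The request layout documented above for cpld_i2c_xfer
([addr][write_len][read_len][flags][write_data...]) can be illustrated
with a small userspace helper that builds one request and applies the
same limits the kernel side checks. This is a sketch, not part of the
patch: tap_build_req() and tap_addr_allowed() are hypothetical names, the
constants mirror the TAP_I2C_* defines in the diff, and the sysfs path of
the attribute is deliberately left out because it is not given here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirror the limits from the patch (TAP_I2C_* defines). */
#define REQ_HDR_LEN   4
#define MAX_WRITE_LEN 67
#define MAX_READ_LEN  20
#define FLAG_NOSTART  0x01	/* bit 0: I2C_M_NOSTART on the read segment */

/* Only the CPLD (0x40) and the mux (0x74) are accepted by the driver. */
static int tap_addr_allowed(uint8_t addr)
{
	return addr == 0x40 || addr == 0x74;
}

/*
 * Hypothetical helper: fill 'out' with one request frame.
 * Returns the number of bytes to write to the attribute, or -1 if the
 * request would be rejected by the checks shown in the patch.
 */
static int tap_build_req(uint8_t *out, size_t out_sz, uint8_t addr,
			 const uint8_t *wdata, uint8_t wlen,
			 uint8_t rlen, uint8_t flags)
{
	if (!tap_addr_allowed(addr))
		return -1;
	if (wlen > MAX_WRITE_LEN || rlen > MAX_READ_LEN)
		return -1;
	if (wlen == 0 && rlen == 0)
		return -1;
	if (out_sz < (size_t)REQ_HDR_LEN + wlen)
		return -1;

	out[0] = addr;
	out[1] = wlen;
	out[2] = rlen;
	out[3] = flags;
	if (wlen)
		memcpy(out + REQ_HDR_LEN, wdata, wlen);

	return REQ_HDR_LEN + wlen;
}
```

A caller would write the returned number of bytes to the attribute in a
single write() and then read back [status][read_data...], where status 0
means success.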



* Re: [PATCH v8 02/14] firmware: qcom_scm: Migrate to generic PAS service
From: Sumit Garg @ 2026-07-01  7:21 UTC (permalink / raw)
  To: Julian Braha
  Cc: andersson, linux-arm-msm, dri-devel, freedreno, linux-media,
	netdev, linux-wireless, ath12k, linux-remoteproc, konradybcio,
	robh, krzk+dt, conor+dt, robin.clark, sean, akhilpo, lumag,
	abhinav.kumar, jesszhan0024, marijn.suijten, airlied, simona,
	vikash.garodia, bod, mchehab, elder, andrew+netdev, davem,
	edumazet, kuba, pabeni, jjohnson, mathieu.poirier,
	trilokkumar.soni, mukesh.ojha, pavan.kondeti, jorge.ramirez,
	tonyh, vignesh.viswanathan, srinivas.kandagatla, amirreza.zarrabi,
	jens.wiklander, op-tee, apurupa, skare, linux-kernel, Sumit Garg,
	Harshal Dev
In-Reply-To: <ac8c92cb-21f2-4274-8fe6-f771fe48eec7@gmail.com>

On Fri, Jun 26, 2026 at 06:05:54PM +0100, Julian Braha wrote:
> Hi Sumit,
> 
> On 6/26/26 14:34, Sumit Garg wrote:
> 
> >  config QCOM_SCM
> > +	tristate "Qualcomm PAS SCM interface driver"
> > +	select QCOM_PAS
> >  	select QCOM_TZMEM
> > -	tristate
> I think QCOM_SCM is missing a 'select IRQ_DOMAIN'. Right now I get a
> build error without it:
> 
> drivers/firmware/qcom/qcom_scm.c: In function ‘qcom_scm_get_waitq_irq’:
>   drivers/firmware/qcom/qcom_scm.c:2512:16: error: implicit declaration
> of function ‘irq_create_fwspec_mapping’; did you mean
> ‘irq_create_of_mapping’? [-Wimplicit-function-declaration]
>    2512 |         return irq_create_fwspec_mapping(&fwspec);
>         |                ^~~~~~~~~~~~~~~~~~~~~~~~~
>         |                irq_create_of_mapping
>

This issue should be independent of this patch-set. Please submit a
standalone fix for this.

-Sumit


* Re: [PATCH v8 01/14] firmware: qcom: Add a generic PAS service
From: Sumit Garg @ 2026-07-01  7:17 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: andersson, linux-arm-msm, dri-devel, freedreno, linux-media,
	netdev, linux-wireless, ath12k, linux-remoteproc, konradybcio,
	robh, krzk+dt, conor+dt, robin.clark, sean, akhilpo, lumag,
	abhinav.kumar, jesszhan0024, marijn.suijten, airlied, simona,
	vikash.garodia, bod, mchehab, elder, andrew+netdev, davem,
	edumazet, kuba, pabeni, jjohnson, mathieu.poirier,
	trilokkumar.soni, mukesh.ojha, pavan.kondeti, jorge.ramirez,
	tonyh, vignesh.viswanathan, srinivas.kandagatla, amirreza.zarrabi,
	jens.wiklander, op-tee, apurupa, skare, linux-kernel, Sumit Garg,
	Harshal Dev
In-Reply-To: <dc7e58d3-4383-4d93-a38e-699888bff903@oss.qualcomm.com>

On Tue, Jun 30, 2026 at 02:14:30PM +0200, Konrad Dybcio wrote:
> On 6/26/26 3:34 PM, Sumit Garg wrote:
> > From: Sumit Garg <sumit.garg@oss.qualcomm.com>
> > 
> > Qcom platforms have a legacy of using non-standard SCM calls
> > splintered over the various kernel drivers. These SCM calls aren't
> > compliant with the standard SMC calling conventions which is a
> > prerequisite to enable migration to the FF-A specifications from Arm.
> 
> [...]
> 
> > +bool qcom_pas_is_available(void)
> 
> This is the most important function, for which I would expect
> kerneldoc to be present. I think it also wouldn't hurt to add a
> footnote in every other function's kerneldoc saying that this must
> be called first.

Will add in the next spin.

-Sumit


* Re: [PATCH v3 0/3] net: stmmac: L3/L4 filter bug fixes
From: Maxime Chevallier @ 2026-07-01  7:06 UTC (permalink / raw)
  To: muhammad.nazim.amirul.nazle.asmade, netdev
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, rmk+kernel,
	Jose.Abreu, linux-kernel
In-Reply-To: <20260630115622.9426-1-muhammad.nazim.amirul.nazle.asmade@altera.com>

Hi,

On 6/30/26 13:56, muhammad.nazim.amirul.nazle.asmade@altera.com wrote:
> From: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
> 
> This series fixes three bugs in the stmmac L3/L4 TC flower filter
> implementation for the XGMAC2 core. All three patches target net.

A quick note on that: I noticed all your recent series on stmmac are
missing the tree tag in the subject line. It should be something like

 [PATCH net v3 0/3] net: stmmac: L3/L4 filter bug fixes

You can add it when generating patches with:

git format-patch --subject-prefix='PATCH net-next' start..finish

cf https://docs.kernel.org/process/maintainer-netdev.html#indicating-target-tree

You can also use b4 for this.

Thanks,

Maxime



* RE: [PATCH net v2] net: qualcomm: rmnet: validate MAP frame length before ingress parsing
From: subash.a.kasiviswanathan @ 2026-07-01  7:03 UTC (permalink / raw)
  To: 'Xiang Mei', sean.tranchetti, netdev
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-kernel,
	bestswngs
In-Reply-To: <20260630174110.2003121-1-xmei5@asu.edu>

> -----Original Message-----
> From: Xiang Mei <xmei5@asu.edu>
> Sent: Tuesday, June 30, 2026 11:41 AM
> To: subash.a.kasiviswanathan@oss.qualcomm.com;
> sean.tranchetti@oss.qualcomm.com; netdev@vger.kernel.org
> Cc: andrew+netdev@lunn.ch; davem@davemloft.net;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com;
> linux-kernel@vger.kernel.org; bestswngs@gmail.com; Xiang Mei <xmei5@asu.edu>
> Subject: [PATCH net v2] net: qualcomm: rmnet: validate MAP frame length before ingress parsing
> 
> When ingress deaggregation is disabled, rmnet_map_ingress_handler() passes
> the skb straight to __rmnet_map_ingress_handler(), skipping the length
> validation that rmnet_map_deaggregate() performs on the aggregated path.
> The parser then dereferences the MAP header and csum header/trailer based on
> the on-wire pkt_len without checking skb->len, so a short frame is read out
> of bounds:
> 
>   BUG: KASAN: slab-out-of-bounds in rmnet_map_checksum_downlink_packet
>   Read of size 1 at addr ffff88801118ed00 by task exploit/147
>   Call Trace:
>    ...
>    rmnet_map_checksum_downlink_packet (drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:413)
>    __rmnet_map_ingress_handler (drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c:96)
>    rmnet_rx_handler (drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c:129)
>    __netif_receive_skb_core.constprop.0 (net/core/dev.c:6089)
>    netif_receive_skb (net/core/dev.c:6460)
>    tun_get_user (drivers/net/tun.c:1955)
>    tun_chr_write_iter (drivers/net/tun.c:2001)
>    vfs_write (fs/read_write.c:688)
>    ksys_write (fs/read_write.c:740)
>    do_syscall_64 (arch/x86/entry/syscall_64.c:94)
>    ...
> 
> Factor that validation out of rmnet_map_deaggregate() into
> rmnet_map_validate_packet_len() and run it on the no-aggregation path too.
> The MAP header is bounds-checked first, since this path can receive a frame
> shorter than the header.
> 
> Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Suggested-by: Subash Abhinov Kasiviswanathan <subash.a.kasiviswanathan@oss.qualcomm.com>
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> ---
> v2: Validate on the no-aggregation path by reusing the deaggregation
>     length checks (factored into rmnet_map_validate_packet_len()) instead
>     of adding separate pskb_may_pull() guards in
>     __rmnet_map_ingress_handler().
> 
>  .../ethernet/qualcomm/rmnet/rmnet_handlers.c  |  5 +-
>  .../net/ethernet/qualcomm/rmnet/rmnet_map.h   |  1 +
>  .../ethernet/qualcomm/rmnet/rmnet_map_data.c  | 72 ++++++++++---------
>  3 files changed, 45 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
> index 9f3479500f85..d055a2628d8c 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
> @@ -126,7 +126,10 @@ rmnet_map_ingress_handler(struct sk_buff *skb,
> 
>  		consume_skb(skb);
>  	} else {
> -		__rmnet_map_ingress_handler(skb, port);
> +		if (rmnet_map_validate_packet_len(skb, port))
> +			__rmnet_map_ingress_handler(skb, port);
> +		else
> +			kfree_skb(skb);
>  	}
>  }
> 
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
> index b70284095568..60ca8b780c88 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
> @@ -59,5 +59,6 @@ void rmnet_map_tx_aggregate_init(struct rmnet_port *port);
>  void rmnet_map_tx_aggregate_exit(struct rmnet_port *port);
>  void rmnet_map_update_ul_agg_config(struct rmnet_port *port, u32 size,
>  				    u32 count, u32 time);
> +u32 rmnet_map_validate_packet_len(struct sk_buff *skb, struct rmnet_port *port);
> 
>  #endif /* _RMNET_MAP_H_ */
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
> index 8b4640c5d61e..305ae15ae8f3 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
> @@ -333,54 +333,62 @@ struct rmnet_map_header *rmnet_map_add_map_header(struct sk_buff *skb,
>  	return map_header;
>  }
> 
> -/* Deaggregates a single packet
> - * A whole new buffer is allocated for each portion of an aggregated frame.
> - * Caller should keep calling deaggregate() on the source skb until 0 is
> - * returned, indicating that there are no more packets to deaggregate. Caller
> - * is responsible for freeing the original skb.
> - */
> -struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
> -				      struct rmnet_port *port)
> +u32 rmnet_map_validate_packet_len(struct sk_buff *skb, struct rmnet_port *port)
>  {
>  	struct rmnet_map_v5_csum_header *next_hdr = NULL;
>  	struct rmnet_map_header *maph;
>  	void *data = skb->data;
> -	struct sk_buff *skbn;
> -	u8 nexthdr_type;
>  	u32 packet_len;
> 
> -	if (skb->len == 0)
> -		return NULL;
> +	if (skb->len < sizeof(*maph))
> +		return 0;
> 
>  	maph = (struct rmnet_map_header *)skb->data;
> +
> +	/* Some hardware can send us empty frames. Catch them */
> +	if (!maph->pkt_len)
> +		return 0;
> +
>  	packet_len = ntohs(maph->pkt_len) + sizeof(*maph);
> 
>  	if (port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV4) {
>  		packet_len += sizeof(struct rmnet_map_dl_csum_trailer);
> -	} else if (port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV5) {
> -		if (!(maph->flags & MAP_CMD_FLAG)) {
> -			packet_len += sizeof(*next_hdr);
> -			if (maph->flags & MAP_NEXT_HEADER_FLAG)
> -				next_hdr = data + sizeof(*maph);
> -			else
> -				/* Mapv5 data pkt without csum hdr is invalid */
> -				return NULL;
> -		}
> +	} else if ((port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV5) &&
> +		   !(maph->flags & MAP_CMD_FLAG)) {
> +		/* Mapv5 data pkt without csum hdr is invalid */
> +		if (!(maph->flags & MAP_NEXT_HEADER_FLAG))
> +			return 0;
> +
> +		packet_len += sizeof(*next_hdr);
> +		next_hdr = data + sizeof(*maph);
>  	}
> 
> -	if (((int)skb->len - (int)packet_len) < 0)
> -		return NULL;
> +	if (skb->len < packet_len)
> +		return 0;
> 
> -	/* Some hardware can send us empty frames. Catch them */
> -	if (!maph->pkt_len)
> -		return NULL;
> +	if (next_hdr &&
> +	    u8_get_bits(next_hdr->header_info, MAPV5_HDRINFO_HDR_TYPE_FMASK) !=
> +	    RMNET_MAP_HEADER_TYPE_CSUM_OFFLOAD)
> +		return 0;
> 
> -	if (next_hdr) {
> -		nexthdr_type = u8_get_bits(next_hdr->header_info,
> -					   MAPV5_HDRINFO_HDR_TYPE_FMASK);
> -		if (nexthdr_type != RMNET_MAP_HEADER_TYPE_CSUM_OFFLOAD)
> -			return NULL;
> -	}
> +	return packet_len;
> +}
> +
> +/* Deaggregates a single packet
> + * A whole new buffer is allocated for each portion of an aggregated frame.
> + * Caller should keep calling deaggregate() on the source skb until 0 is
> + * returned, indicating that there are no more packets to deaggregate. Caller
> + * is responsible for freeing the original skb.
> + * is responsible for freeing the original skb.
> + */
> +struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
> +				      struct rmnet_port *port)
> +{
> +	struct sk_buff *skbn;
> +	u32 packet_len;
> +
> +	packet_len = rmnet_map_validate_packet_len(skb, port);
> +	if (!packet_len)
> +		return NULL;
> 
>  	skbn = alloc_skb(packet_len + RMNET_MAP_DEAGGR_SPACING, GFP_ATOMIC);
>  	if (!skbn)
> --
> 2.43.0

Reviewed-by: Subash Abhinov Kasiviswanathan <subash.a.kasiviswanathan@oss.qualcomm.com>
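
The ordering of checks in the quoted rmnet_map_validate_packet_len() can
be sketched in userspace C as follows. This is an illustration rather
than the kernel code: map_validate_len() is a hypothetical name, the
4-byte MAP header with a big-endian pkt_len in bytes 2..3 is an
assumption about the on-wire layout, the checksum trailer/header size is
passed in as 'extra' instead of being derived from the port flags, and
the MAPv5 next-header type check is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_HDR_LEN 4	/* assumed: flags, mux_id, 16-bit big-endian pkt_len */

/*
 * Return the full frame length (header + payload + extra) if the buffer
 * holds a complete frame, or 0 if the frame must be dropped.
 */
static uint32_t map_validate_len(const uint8_t *data, size_t len,
				 uint32_t extra)
{
	uint16_t pkt_len;
	uint32_t total;

	/* The header itself must be present before pkt_len is read. */
	if (len < MAP_HDR_LEN)
		return 0;

	pkt_len = (uint16_t)((data[2] << 8) | data[3]);

	/* Empty frames are rejected up front. */
	if (!pkt_len)
		return 0;

	total = (uint32_t)pkt_len + MAP_HDR_LEN + extra;

	/* The buffer must cover everything the parser will dereference. */
	if (len < total)
		return 0;

	return total;
}
```

A zero return maps to the kfree_skb() branch on the no-aggregation path
and to the NULL return in rmnet_map_deaggregate().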



* [PATCH net] ipvs: fix PMTU for GUE/GRE tunnel ICMP errors
From: Yizhou Zhao @ 2026-07-01  6:59 UTC (permalink / raw)
  To: netdev
  Cc: Yizhou Zhao, Simon Horman, Julian Anastasov, Pablo Neira Ayuso,
	Florian Westphal, Phil Sutter, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, lvs-devel, netfilter-devel, coreteam,
	linux-kernel, stable, Yuxiang Yang, Ao Wang, Xuewei Feng, Qi Li,
	Ke Xu

When an ICMP Fragmentation Needed error is received for a tunneled IPVS
connection, ip_vs_in_icmp() recomputes the MTU that the original packet
can use by subtracting the tunnel overhead from the reported next-hop
MTU.

The current code always subtracts sizeof(struct iphdr), which is only
the IPIP overhead. For GUE and GRE tunnels, ipvs_udp_decap() and
ipvs_gre_decap() already compute the additional tunnel header length,
but that value is scoped to the decapsulation block and is lost before
the ICMP_FRAG_NEEDED handling. As a result, the ICMP error sent back to
the client advertises an MTU that is too large, so PMTUD can fail to
converge for GUE/GRE-tunneled real servers.

With a reported next-hop MTU of 1400, a GUE tunnel currently returns
1380 to the client. The correct value is 1368:

  1400 - sizeof(struct iphdr) - sizeof(struct udphdr) -
  sizeof(struct guehdr)

Hoist the tunnel header length into the main ip_vs_in_icmp() scope and
subtract sizeof(struct iphdr) + ulen in the Fragmentation Needed path.
The IPIP path keeps ulen as 0, so its existing 1400 - 20 = 1380 result
is unchanged.

Fixes: 508f744c0de3 ("ipvs: strip udp tunnel headers from icmp errors")
Cc: stable@vger.kernel.org
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Assisted-by: Claude Code:GLM-5.2
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
---
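For reference, the MTU arithmetic from the commit message can be checked
with a small standalone C sketch. The header sizes are hard-coded
assumptions (20-byte IPv4 header without options, 8-byte UDP header,
4-byte GUE header), and adjusted_mtu() is a hypothetical stand-in for the
patched Fragmentation Needed branch, not the kernel code itself.

```c
/* Assumed header sizes, hard-coded for illustration only. */
#define SKETCH_IPHDR_LEN	20u	/* IPv4 header, no options */
#define SKETCH_UDPHDR_LEN	8u
#define SKETCH_GUEHDR_LEN	4u
#define SKETCH_MIN_MTU		68u	/* minimum IPv4 MTU used by the check */

/* Hypothetical model of the patched branch: subtract the outer IP header
 * plus the extra tunnel header length (ulen), but only when the result
 * stays above the minimum MTU. ulen is 0 for plain IPIP.
 */
static unsigned int adjusted_mtu(unsigned int mtu, unsigned int ulen)
{
	if (mtu > SKETCH_MIN_MTU + SKETCH_IPHDR_LEN + ulen)
		mtu -= SKETCH_IPHDR_LEN + ulen;
	return mtu;
}
```

With these assumptions, adjusted_mtu(1400, 0) models the IPIP case and
adjusted_mtu(1400, SKETCH_UDPHDR_LEN + SKETCH_GUEHDR_LEN) models the GUE
case from the commit message.
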
 net/netfilter/ipvs/ip_vs_core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index d40b404c1bf62..74c5bd8b5f48 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1765,8 +1765,9 @@ ip_vs_in_icmp(struct netns_ipvs *ipvs, struct sk_buff *skb, int *related,
 	struct ip_vs_proto_data *pd;
 	unsigned int offset, offset2, ihl, verdict;
 	bool tunnel, new_cp = false;
 	union nf_inet_addr *raddr;
 	char *outer_proto = "IPIP";
+	int ulen = 0;
 
 	*related = 1;
 
@@ -1831,7 +1832,6 @@ ip_vs_in_icmp(struct netns_ipvs *ipvs, struct sk_buff *skb, int *related,
 		   /* Error for our tunnel must arrive at LOCAL_IN */
 		   (skb_rtable(skb)->rt_flags & RTCF_LOCAL)) {
 		__u8 iproto;
-		int ulen;
 
 		/* Non-first fragment has no UDP/GRE header */
 		if (unlikely(cih->frag_off & htons(IP_OFFSET)))
@@ -1936,8 +1936,8 @@ ip_vs_in_icmp(struct netns_ipvs *ipvs, struct sk_buff *skb, int *related,
 				if (dest_dst)
 					mtu = dst_mtu(dest_dst->dst_cache);
 			}
-			if (mtu > 68 + sizeof(struct iphdr))
-				mtu -= sizeof(struct iphdr);
+			if (mtu > 68 + sizeof(struct iphdr) + ulen)
+				mtu -= sizeof(struct iphdr) + ulen;
 			info = htonl(mtu);
 		}
 		/* Strip outer IP, ICMP and IPIP/UDP/GRE, go to IP header of


^ permalink raw reply related

* Re: [PATCH net] net: usb: lan78xx: disable VLAN filter in promiscuous mode
From: Nicolai Buchwitz @ 2026-07-01  6:57 UTC (permalink / raw)
  To: enrico.pozzobon
  Cc: Thangaraj Samynathan, Rengarajan Sundararajan, UNGLinuxDriver,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Woojung.Huh, netdev, linux-usb, linux-kernel
In-Reply-To: <20260630-lan78xx-vlan-promisc-v1-1-fbf0f903bd8f@dissecto.com>

Hi Enrico

On 30.6.2026 16:15, Enrico Pozzobon via B4 Relay wrote:
> From: Enrico Pozzobon <enrico.pozzobon@dissecto.com>
> 
> The hardware VLAN filter (RFE_CTL_VLAN_FILTER_) drops VLAN-tagged frames
> whose VID has not been registered via lan78xx_vlan_rx_add_vid(). It is
> left enabled in promiscuous mode, so packet capture (e.g. tcpdump or
> Wireshark) does not see tagged frames for unregistered VIDs.
> 
> Clear the filter while the interface is promiscuous and restore it from
> NETIF_F_HW_VLAN_CTAG_FILTER otherwise. Enforce the same condition in
> lan78xx_set_features() so netdev_update_features() cannot re-enable the
> filter while promiscuous.
> 
> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
> Signed-off-by: Enrico Pozzobon <enrico.pozzobon@dissecto.com>
> ---
> Currently, on microchip lan7801, enabling promiscuous mode does not
> result in VLAN tagged packets being captured. This patch fixes this,
> forcing the RFE_CTL_VLAN_FILTER_ flag to be off when promiscuous mode is
> enabled.
> ---
>  drivers/net/usb/lan78xx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index c4cebacabcb5..a1a53ef85cb9 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -1525,7 +1525,14 @@ static void lan78xx_set_multicast(struct net_device *netdev)
>  	if (dev->net->flags & IFF_PROMISC) {
>  		netif_dbg(dev, drv, dev->net, "promiscuous mode enabled");
>  		pdata->rfe_ctl |= RFE_CTL_MCAST_EN_ | RFE_CTL_UCAST_EN_;
> +		/* bypass VLAN filter so all tagged frames are captured */
> +		pdata->rfe_ctl &= ~RFE_CTL_VLAN_FILTER_;
>  	} else {
> +		if (dev->net->features & NETIF_F_HW_VLAN_CTAG_FILTER)
> +			pdata->rfe_ctl |= RFE_CTL_VLAN_FILTER_;
> +		else
> +			pdata->rfe_ctl &= ~RFE_CTL_VLAN_FILTER_;
> +
>  		if (dev->net->flags & IFF_ALLMULTI) {
>  			netif_dbg(dev, drv, dev->net,
>  				  "receive all multicast enabled");
> @@ -3074,7 +3081,9 @@ static int lan78xx_set_features(struct net_device *netdev,
>  	else
>  		pdata->rfe_ctl &= ~RFE_CTL_VLAN_STRIP_;
> 
> -	if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
> +	/* keep VLAN filter off while promiscuous */
> +	if ((features & NETIF_F_HW_VLAN_CTAG_FILTER) &&
> +	    !(netdev->flags & IFF_PROMISC))
>  		pdata->rfe_ctl |= RFE_CTL_VLAN_FILTER_;
>  	else
>  		pdata->rfe_ctl &= ~RFE_CTL_VLAN_FILTER_;

nit: Could this be factored into a helper shared by both callers?
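
A rough, untested sketch of what such a helper might look like, written as
standalone C so it can be compiled outside the driver. The helper name and
the SKETCH_* bit values are placeholders chosen for illustration; the real
definitions come from the kernel headers and the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only. */
#define SKETCH_IFF_PROMISC		0x100u
#define SKETCH_F_HW_VLAN_CTAG_FILTER	(1ull << 9)
#define SKETCH_RFE_CTL_VLAN_FILTER	0x20u

/* Hypothetical helper: return rfe_ctl with the VLAN filter bit set only
 * when the feature is enabled and the interface is not promiscuous.
 * Both callers would pass their own view of features and flags.
 */
static uint32_t sketch_vlan_filter_ctl(uint32_t rfe_ctl, uint64_t features,
				       unsigned int flags)
{
	bool enable = (features & SKETCH_F_HW_VLAN_CTAG_FILTER) &&
		      !(flags & SKETCH_IFF_PROMISC);

	if (enable)
		rfe_ctl |= SKETCH_RFE_CTL_VLAN_FILTER;
	else
		rfe_ctl &= ~SKETCH_RFE_CTL_VLAN_FILTER;

	return rfe_ctl;
}
```
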

> [...]

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>

Thanks
Nicolai

^ permalink raw reply

* [PATCH v2 6.6.y/6.12.y/6.18.y] af_unix: Set gc_in_progress to true in unix_gc().
From: Igor Ushakov @ 2026-07-01  6:53 UTC (permalink / raw)
  To: stable; +Cc: sashal, kuniyu, kuba, pabeni, davem, edumazet, netdev, sysroot314

From: Kuniyuki Iwashima <kuniyu@google.com>

[ Upstream commit d82ba05263c69fa2437fe93e4e561cc40f4c03af ]

Igor Ushakov reported that unix_gc() could run with gc_in_progress
being false if the work is scheduled while running:

  Thread 1         Thread 2                     Thread 3
  --------         --------                     --------
                   unix_schedule_gc()           unix_schedule_gc()
                   `- if (!gc_in_progress)      `- if (!gc_in_progress)
                      |- gc_in_progress = true     |
                      `- queue_work()              |
  unix_gc() <----------------/                     |
  |                                                |- gc_in_progress = true
  ...                                              `- queue_work()
  |                                                       |
  `- gc_in_progress = false                               |
                                                          |
  unix_gc() <---------------------------------------------'
  |
  ... /* gc_in_progress == false */
  |
  `- gc_in_progress = false

unix_peek_fpl() relies on gc_in_progress not to confuse GC
by MSG_PEEK.

Let's set gc_in_progress to true in unix_gc().

Fixes: 8b90a9f819dc ("af_unix: Run GC on only one CPU.")
Reported-by: Igor Ushakov <sysroot314@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260501073945.1884564-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Add setting gc_in_progress in __unix_gc(). Keep the existing
  set in unix_gc() for wait_for_unix_gc() over-limit throttling. ]
Signed-off-by: Igor Ushakov <sysroot314@gmail.com>
---
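For reference, the interleaving in the diagram above can be replayed
step by step in a single-threaded C model. This is only a sketch: it
reproduces the ordering of reads and writes from the diagram, not the
workqueue or locking behaviour, and second_run_sees_flag() is a
hypothetical name. The set_in_worker parameter models whether the GC
worker sets the flag itself, as this patch does.

```c
#include <stdbool.h>

/* Replays the diagram's ordering and returns the value of the flag that
 * the second GC run observes while it is working.
 */
static bool second_run_sees_flag(bool set_in_worker)
{
	bool gc_in_progress = false;
	bool t2_passed, t3_passed;

	/* Threads 2 and 3 both pass the !gc_in_progress check. */
	t2_passed = !gc_in_progress;
	t3_passed = !gc_in_progress;

	/* Thread 2 sets the flag and queues the work. */
	if (t2_passed)
		gc_in_progress = true;

	/* First GC run starts. */
	if (set_in_worker)
		gc_in_progress = true;

	/* Thread 3 sets the flag and queues the work again. */
	if (t3_passed)
		gc_in_progress = true;

	/* First GC run finishes and clears the flag. */
	gc_in_progress = false;

	/* Second GC run starts. */
	if (set_in_worker)
		gc_in_progress = true;

	return gc_in_progress;
}
```

In this model the second run sees the flag as false unless the worker
sets it on entry.
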
 net/unix/garbage.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 1cdb54c616..fa6983dc31 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -583,6 +583,8 @@ static void __unix_gc(struct work_struct *work)
 	struct sk_buff_head hitlist;
 	struct sk_buff *skb;
 
+	WRITE_ONCE(gc_in_progress, true);
+
 	spin_lock(&unix_gc_lock);
 
 	if (!unix_graph_maybe_cyclic) {
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH] net: airoha: fix MIB stats collection to be lossless
From: Lorenzo Bianconi @ 2026-07-01  6:51 UTC (permalink / raw)
  To: Aniket Negi
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev,
	linux-kernel, aniket.negi
In-Reply-To: <20260701063823.239783-1-aniket.negi03@gmail.com>

> Hi Lorenzo,
> 
> Thank you for the detailed review and suggestions!
> 
> > > +	/* TX - 64-bit H+L registers: hw accumulates the total, read directly. */
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_OK_PKT_CNT_H(port->id));
> > > -	dev->stats.tx_ok_pkts += ((u64)val << 32);
> > > -	val = airoha_fe_rr(eth, REG_FE_GDM_TX_OK_PKT_CNT_L(port->id));
> > > -	dev->stats.tx_ok_pkts += val;
> > > +	dev->stats.tx_ok_pkts = (u64)val << 32;
> > 
> > I guess it is more readable to store REG_FE_GDM_TX_OK_PKT_CNT_L() read in val
> > here. Something like:
> > 
> > 	val = airoha_fe_rr(eth, REG_FE_GDM_TX_OK_PKT_CNT_L(port->id));
> > 	dev->stats.tx_ok_pkts += val;
> > 
> > This applies to the occurrences below as well.
> Agreed. I'll store CNT_L read in val first to improve readability.
> 
> > > +	dev->stats.tx_ok_pkts += airoha_fe_rr(eth, REG_FE_GDM_TX_OK_PKT_CNT_L(port->id));
> > >
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_OK_BYTE_CNT_H(port->id));
> > > -	dev->stats.tx_ok_bytes += ((u64)val << 32);
> > > -	val = airoha_fe_rr(eth, REG_FE_GDM_TX_OK_BYTE_CNT_L(port->id));
> > > -	dev->stats.tx_ok_bytes += val;
> > > +	dev->stats.tx_ok_bytes = (u64)val << 32;
> > > +	dev->stats.tx_ok_bytes += airoha_fe_rr(eth, REG_FE_GDM_TX_OK_BYTE_CNT_L(port->id));
> > >
> > > +	/* TX - 32-bit registers: accumulate delta to handle wrap-around. */
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_ETH_DROP_CNT(port->id));
> > > -	dev->stats.tx_drops += val;
> > > +	dev->stats.tx_drops += (u32)(val - dev->stats.hw_prev_stats.tx_drops);
> > > +	dev->stats.hw_prev_stats.tx_drops = val;
> > >
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_ETH_BC_CNT(port->id));
> > > -	dev->stats.tx_broadcast += val;
> > > +	dev->stats.tx_broadcast += (u32)(val - dev->stats.hw_prev_stats.tx_broadcast);
> > > +	dev->stats.hw_prev_stats.tx_broadcast = val;
> > >
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_ETH_MC_CNT(port->id));
> > > -	dev->stats.tx_multicast += val;
> > > +	dev->stats.tx_multicast += (u32)(val - dev->stats.hw_prev_stats.tx_multicast);
> > > +	dev->stats.hw_prev_stats.tx_multicast = val;
> > >
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_ETH_RUNT_CNT(port->id));
> > > -	dev->stats.tx_len[i] += val;
> > > +	dev->stats.tx_len[i] += (u32)(val - dev->stats.hw_prev_stats.tx_len[i]);
> > > +	dev->stats.hw_prev_stats.tx_len[i] = val;
> > >
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_TX_ETH_E64_CNT_H(port->id));
> > > -	dev->stats.tx_len[i] += ((u64)val << 32);
> > > -	val = airoha_fe_rr(eth, REG_FE_GDM_TX_ETH_E64_CNT_L(port->id));
> > > -	dev->stats.tx_len[i++] += val;
> > > +	dev->stats.tx_len[i] += (u64)val << 32;
> >  
> > Since now we do not reset MIB counters, this is wrong, you can't use "+="
> 
> You are absolutely right: since the MIB counters are no longer cleared, using "+=" for the E64 counter would double-count on each iteration. The patch missed this for the case where the runt count (32-bit) and the E64 counter (64-bit) are combined into the same counter.
> 
> I'll fix this by adding separate accumulator fields, tx_runt_accum and rx_runt_accum, to track the runt deltas, and then compute tx_len[i] = tx_runt_accum + E64_CNT (H+L).
> 
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_RX_ETH_RUNT_CNT(port->id));
> > > -	dev->stats.rx_len[i] += val;
> > > +	dev->stats.rx_len[i] += (u32)(val - dev->stats.hw_prev_stats.rx_len[i]);
> > > +	dev->stats.hw_prev_stats.rx_len[i] = val;
> > >
> > >  	val = airoha_fe_rr(eth, REG_FE_GDM_RX_ETH_E64_CNT_H(port->id));
> > > -	dev->stats.rx_len[i] += ((u64)val << 32);
> > > -	val = airoha_fe_rr(eth, REG_FE_GDM_RX_ETH_E64_CNT_L(port->id));
> > > -	dev->stats.rx_len[i++] += val;
> > > +	dev->stats.rx_len[i] += (u64)val << 32;
> > 
> > same here.
> 
> Acked. The same approach above will be applied to rx_len[i]. 
> 
> > > +
> > > +	struct {
> > > +	/* Previous HW register values for 32-bit counter delta tracking.
> > > +	 * Storing the last seen value and accumulating (u32)(curr - prev)
> > > +	 * in 64-bit software counter & handles wrap-around transparently
> > > +	 * via unsigned arithmetic. These fields are never reported to
> > > +	 * userspace.
> > > +	 */
> > 
> > can you please align the comment here?
> 
> Will fix the comment alignment.
> 
> > 
> > > +		u32 tx_drops;
> > > +		u32 tx_broadcast;
> > > +		u32 tx_multicast;
> > > +		u32 tx_len[7];
> > > +		u32 rx_drops;
> > > +		u32 rx_broadcast;
> > > +		u32 rx_multicast;
> > > +		u32 rx_errors;
> > > +		u32 rx_crc_error;
> > > +		u32 rx_over_errors;
> > > +		u32 rx_fragment;
> > > +		u32 rx_jabber;
> > > +		u32 rx_len[7];
> > > +	} hw_prev_stats;
> > 
> > Maybe something like "prev_val32" ?
> > 
> > Will update the name of struct to hold prev counter from hw_pre_stats to prev_val32.
> 
> Good suggestion. However, since the struct hw_prev_stats now contains both u32 (previous register value) and u64 (runt accumulator) fields, I'll rename it to "prev_mib_state" to better reflect its dual purpose: storing previous register values for delta calculation and accumulators for combined counters.

Maybe better mib_prev?

Since now we do not reset the MIB counters in airoha_update_hw_stats(), we can
get rid of the for loop there and just call airoha_dev_get_hw_stats() with the
provided dev pointer. Even better, just rename airoha_dev_get_hw_stats() to
airoha_update_hw_stats() and move the spinlock there. What do you think?
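
For reference, the wrap-safe delta accumulation discussed above, and the
direct read of a 64-bit H+L counter, can be modelled in standalone C. This
is only a sketch under assumptions: the struct and function names are
hypothetical, and register reads are replaced by plain arguments.

```c
#include <stdint.h>

/* Hypothetical per-counter state: last raw 32-bit register value seen and
 * the 64-bit software total built from the deltas.
 */
struct mib_counter32 {
	uint32_t prev;
	uint64_t total;
};

/* Add the unsigned 32-bit difference between the current and previous raw
 * values. Unsigned wrap-around makes the delta correct across a single
 * register overflow between two reads.
 */
static void mib_accumulate32(struct mib_counter32 *c, uint32_t cur)
{
	c->total += (uint32_t)(cur - c->prev);
	c->prev = cur;
}

/* A 64-bit H+L counter that is never cleared is simply read and assigned,
 * not accumulated.
 */
static uint64_t mib_read64(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A combined bucket such as tx_len[i] would then be computed as the runt
accumulator's total plus mib_read64() of the E64 registers, rather than
with "+=" on every poll.
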

Regards,
Lorenzo

>   
> Regards,
> Aniket Negi
> 


^ permalink raw reply

* [PATCH net] bnx2x: fix null pointer dereference in bnx2x_free_mem_bp()
From: Abdun Nihaal @ 2026-07-01  6:50 UTC (permalink / raw)
  To: skalluru
  Cc: Abdun Nihaal, manishc, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel, horms, stable

In one of the error paths in bnx2x_alloc_mem_bp(), bnx2x_free_mem_bp()
may be called with bp->fp still NULL, which leads to a null pointer
dereference in bnx2x_free_mem_bp(). Fix that by adding a null check
before the only dereference of bp->fp in the function.

The issue was reported by Sashiko AI review.

Fixes: c3146eb676e7 ("bnx2x: Correct memory preparation and release")
Cc: stable@vger.kernel.org
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
---
Compile tested only.
Thanks to Simon Horman for pointing out the Sashiko review.
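
For reference, the guarded free pattern can be modelled in standalone C.
This is a sketch only: struct sketch_adapter, struct sketch_fastpath and
sketch_free_mem() are hypothetical userspace stand-ins, with free() in
place of kfree().

```c
#include <stdlib.h>

struct sketch_fastpath {
	void *tpa_info;
};

struct sketch_adapter {
	struct sketch_fastpath *fp;	/* may still be NULL on an early error path */
	int fp_array_size;
};

/* Only walk the array when it was actually allocated. free(NULL) is a
 * no-op, so the unconditional free of the array itself stays as is.
 */
static void sketch_free_mem(struct sketch_adapter *a)
{
	int i;

	if (a->fp)
		for (i = 0; i < a->fp_array_size; i++)
			free(a->fp[i].tpa_info);
	free(a->fp);
	a->fp = NULL;
}
```
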

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 5b2640bd31c3..25ee45cb7f3f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4712,8 +4712,9 @@ void bnx2x_free_mem_bp(struct bnx2x *bp)
 {
 	int i;
 
-	for (i = 0; i < bp->fp_array_size; i++)
-		kfree(bp->fp[i].tpa_info);
+	if (bp->fp)
+		for (i = 0; i < bp->fp_array_size; i++)
+			kfree(bp->fp[i].tpa_info);
 	kfree(bp->fp);
 	kfree(bp->sp_objs);
 	kfree(bp->fp_stats);
-- 
2.43.0


^ permalink raw reply related

