Re: [PATCH net-next V9 11/14] net/mlx5: qos: Remove qos domains and use shd lock

public inbox for linux-doc@vger.kernel.org
 help / color / mirror / Atom feed

From: Cosmin Ratiu <cratiu@nvidia.com>
To: Tariq Toukan <tariqt@nvidia.com>, "kuba@kernel.org" <kuba@kernel.org>
Cc: "allison.henderson@oracle.com" <allison.henderson@oracle.com>,
	Moshe Shemesh <moshe@nvidia.com>,
	"jiri@resnulli.us" <jiri@resnulli.us>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"daniel.zahka@gmail.com" <daniel.zahka@gmail.com>,
	"donald.hunter@gmail.com" <donald.hunter@gmail.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"matttbe@kernel.org" <matttbe@kernel.org>,
	"pabeni@redhat.com" <pabeni@redhat.com>,
	"horms@kernel.org" <horms@kernel.org>,
	Parav Pandit <parav@nvidia.com>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"kees@kernel.org" <kees@kernel.org>,
	"willemb@google.com" <willemb@google.com>,
	Dragos Tatulea <dtatulea@nvidia.com>,
	"razor@blackwall.org" <razor@blackwall.org>,
	Adithya Jayachandran <ajayachandra@nvidia.com>,
	Dan Jurgens <danielj@nvidia.com>,
	"leon@kernel.org" <leon@kernel.org>,
	"vadim.fedorenko@linux.dev" <vadim.fedorenko@linux.dev>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	"shuah@kernel.org" <shuah@kernel.org>,
	"andrew+netdev@lunn.ch" <andrew+netdev@lunn.ch>,
	Mark Bloch <mbloch@nvidia.com>, Jiri Pirko <jiri@nvidia.com>,
	Carolina Jubran <cjubran@nvidia.com>,
	Shahar Shitrit <shshitrit@nvidia.com>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"minhquangbui99@gmail.com" <minhquangbui99@gmail.com>,
	Nimrod Oren <noren@nvidia.com>, "dw@davidwei.uk" <dw@davidwei.uk>,
	"jacob.e.keller@intel.com" <jacob.e.keller@intel.com>,
	"skhan@linuxfoundation.org" <skhan@linuxfoundation.org>,
	Petr Machata <petrm@nvidia.com>,
	"edumazet@google.com" <edumazet@google.com>,
	"antonio@openvpn.net" <antonio@openvpn.net>,
	"mst@redhat.com" <mst@redhat.com>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	Shay Drori <shayd@nvidia.com>,
	"sdf@fomichev.me" <sdf@fomichev.me>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
	Gal Pressman <gal@nvidia.com>, "joe@dama.to" <joe@dama.to>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
Subject: Re: [PATCH net-next V9 11/14] net/mlx5: qos: Remove qos domains and use shd lock
Date: Tue, 31 Mar 2026 12:53:47 +0000	[thread overview]
Message-ID: <3bd3caead46d1965d3a7a151d0ef0a54ac78332a.camel@nvidia.com> (raw)
In-Reply-To: <20260331020817.3525089-1-kuba@kernel.org>

On Mon, 2026-03-30 at 19:08 -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
>     net/mlx5: qos: Remove qos domains and use shd lock
> 
> This commit removes QoS domains and switches to using the shd lock to
> protect against concurrent QoS modifications. It also enables the
> supported_cross_device_rate_nodes devlink ops attribute.
> 
> Link:
> https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/
> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > index 6698ac55a4bfa..c051605fecd23 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
> > @@ -385,6 +385,7 @@ static const struct devlink_ops
> > mlx5_devlink_ops = {
> >  	.rate_node_del = mlx5_esw_devlink_rate_node_del,
> >  	.rate_leaf_parent_set =
> > mlx5_esw_devlink_rate_leaf_parent_set,
> >  	.rate_node_parent_set =
> > mlx5_esw_devlink_rate_node_parent_set,
> > +	.supported_cross_device_rate_nodes = true,
> 
> The commit message states that enabling
> supported_cross_device_rate_nodes
> now is safe because mlx5_esw_qos_vport_update_parent rejects cross-
> esw
> parent updates.
> 
> When a user provides a DEVLINK_ATTR_PARENT_DEV belonging to an
> unrelated
> device, the devlink core (devlink_nl_rate_parent_node_set) drops the
> lock on
> the parent devlink and calls devlink_rate_node_get_by_name(). This
> function
> traverses the rate_list of the unlocked parent devlink instance.
> 
> Does this leave the devlink core exposed to data races and potential
> use-after-free vulnerabilities, since the driver's rejection callback
> executes only after this lockless list traversal has occurred?

I will address this by adding common ancestor validation in devlink
rate core, which was missing.

> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > index 0be516003bcde..f67f99428959f 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
> > @@ -1110,7 +1068,7 @@ void mlx5_esw_qos_vport_disable(struct
> > mlx5_vport *vport)
> >  	struct mlx5_esw_sched_node *parent;
> >  
> >  	lockdep_assert_held(&esw->state_lock);
> > -	esw_qos_lock(esw);
> > +	devl_lock(esw->dev->shd);
> 
> Could esw->dev->shd be NULL here?
> 
> In mlx5_shd_init(), initialization is explicitly skipped for non-PF
> devices,
> meaning Sub-Functions (SFs) acting as E-Switch managers will have a
> NULL
> shd pointer. If the device lacks VPD data, pci_vpd_alloc() returns -
> ENODEV,
> causing mlx5_shd_init() to return 0 but leaving dev->shd as NULL.
> 
> When QoS operations are attempted on these devices, calling
> devl_lock()
> dereferences the pointer, which could cause an immediate NULL pointer
> dereference and kernel panic.

Right. This is a race with Jiri's fix ([1]), which changed the
assumption this code was architected with that dev->shd is always
initialized. That is no longer the case, and the consequence is that
this needs to 1) not unconditionally use dev->shd and 2) use a fallback
lock when it isn't available.

> 
> Additionally, this acquires the devlink lock while already holding
> esw->state_lock, establishing the lock order:
> state_lock -> devlink->lock.
> 
> Conversely, the devlink core invokes driver callbacks with the
> devlink
> instance lock held, and the driver subsequently acquires esw-
> >state_lock
> inside these callbacks, creating the reverse order:
> devlink->lock -> state_lock.
> 
> Does this introduce a circular locking dependency splat between
> state_lock
> and the devlink instance lock?

This claim is based on the incorrect assumption that devlink instances
share the same lock class, while in fact each is unique. The
established lock order is devlink->lock -> state_lock and this doesn't
conflict with state_lock -> shd->lock.

> 
> > @@ -1159,26 +1117,25 @@ int mlx5_esw_qos_set_vport_rate(struct
> > mlx5_vport *vport, u32 max_rate, u32 min_
> >  	struct mlx5_eswitch *esw = vport->dev->priv.eswitch;
> >  	int err;
> >  
> > -	esw_qos_lock(esw);
> > +	devl_lock(esw->dev->shd);
> 
> This function is invoked by mlx5_eswitch_set_vport_rate(), which acts
> as the
> handler for the ndo_set_vf_rate netdev callback. The core network
> stack
> executes ndo_set_vf_rate with the rtnl_lock held, establishing the
> lock
> order: rtnl_lock -> devlink->lock.
> 
> However, the devlink subsystem enforces the opposite locking order
> (devlink->lock -> rtnl_lock). Because all devlink locks share the
> same
> lockdep class, does this inversion trigger a circular locking
> dependency
> lockdep splat and pose an ABBA deadlock risk?

Same as above, there's no ABBA here since devlink instances use
different lock classes.
Perhaps Sashiko (I think) could be taught this to reduce false
positives?

> 
> [ ... ]
> 
> > @@ -1859,13 +1763,15 @@ mlx5_esw_qos_vport_update_parent(struct
> > mlx5_vport *vport,
> >  						  parent ? : esw-
> > >qos.root,
> >  						  extack);
> >  	}
> > -	esw_qos_unlock(esw);
> > +
> >  	return err;
> >  }
> >  
> >  void mlx5_esw_qos_vport_clear_parent(struct mlx5_vport *vport)
> >  {
> > +	devl_lock(vport->dev->shd);
> >  	mlx5_esw_qos_vport_update_parent(vport, NULL, NULL);
> > +	devl_unlock(vport->dev->shd);
> >  }
> 
> During device teardown, execution paths such as
> mlx5_unload_one_devl_locked()
> hold the primary devlink instance lock. This path invokes
> mlx5_eswitch_unload_vf_vports() ->
> mlx5_esw_offloads_devlink_port_unregister()
> -> mlx5_esw_qos_vport_clear_parent().
> 
> Because both the primary devlink instance lock and the shd devlink
> instance
> lock are initialized from the same site (devlink_alloc), they share
> the same
> lockdep class.
> 
> Does calling devl_lock() directly (instead of using the devlink
> core's
> mutex_lock_nested capability) while a lock of the same class is
> already held
> trigger a recursive locking lockdep warning?

Again, same false positive. Devlinks use unique lock classes.

Cosmin.

next prev parent reply	other threads:[~2026-03-31 12:54 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-26  6:59 [PATCH net-next V9 00/14] devlink and mlx5: Support cross-function rate scheduling Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 01/14] devlink: Update nested instance locking comment Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 02/14] devlink: Add helpers to lock nested-in instances Tariq Toukan
2026-03-31  2:08   ` Jakub Kicinski
2026-03-31 12:20     ` Cosmin Ratiu
2026-03-31 23:55       ` Jacob Keller
2026-04-01 10:22         ` Cosmin Ratiu
2026-04-01 20:18           ` Jacob Keller
2026-03-26  6:59 ` [PATCH net-next V9 03/14] devlink: Migrate from info->user_ptr to info->ctx Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 04/14] devlink: Decouple rate storage from associated devlink object Tariq Toukan
2026-03-31  2:08   ` Jakub Kicinski
2026-03-31 12:28     ` Cosmin Ratiu
2026-03-26  6:59 ` [PATCH net-next V9 05/14] devlink: Add parent dev to devlink API Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 06/14] devlink: Allow parent dev for rate-set and rate-new Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 07/14] devlink: Allow rate node parents from other devlinks Tariq Toukan
2026-03-31  2:08   ` Jakub Kicinski
2026-03-31 12:44     ` Cosmin Ratiu
2026-03-26  6:59 ` [PATCH net-next V9 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 09/14] net/mlx5: qos: Expose a function to clear a vport's parent Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 10/14] net/mlx5: qos: Model the root node in the scheduling hierarchy Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 11/14] net/mlx5: qos: Remove qos domains and use shd lock Tariq Toukan
2026-03-31  2:08   ` Jakub Kicinski
2026-03-31 12:53     ` Cosmin Ratiu [this message]
2026-03-31 16:37       ` Cosmin Ratiu
2026-03-26  6:59 ` [PATCH net-next V9 12/14] net/mlx5: qos: Support cross-device tx scheduling Tariq Toukan
2026-03-31  2:08   ` Jakub Kicinski
2026-03-31 12:57     ` Cosmin Ratiu
2026-03-26  6:59 ` [PATCH net-next V9 13/14] selftests: drv-net: Add test for cross-esw rate scheduling Tariq Toukan
2026-03-26  6:59 ` [PATCH net-next V9 14/14] net/mlx5: Document devlink rates Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3bd3caead46d1965d3a7a151d0ef0a54ac78332a.camel@nvidia.com \
    --to=cratiu@nvidia.com \
    --cc=ajayachandra@nvidia.com \
    --cc=allison.henderson@oracle.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=antonio@openvpn.net \
    --cc=chuck.lever@oracle.com \
    --cc=cjubran@nvidia.com \
    --cc=corbet@lwn.net \
    --cc=daniel.zahka@gmail.com \
    --cc=daniel@iogearbox.net \
    --cc=danielj@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=donald.hunter@gmail.com \
    --cc=dtatulea@nvidia.com \
    --cc=dw@davidwei.uk \
    --cc=edumazet@google.com \
    --cc=gal@nvidia.com \
    --cc=horms@kernel.org \
    --cc=jacob.e.keller@intel.com \
    --cc=jiri@nvidia.com \
    --cc=jiri@resnulli.us \
    --cc=joe@dama.to \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=matttbe@kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=minhquangbui99@gmail.com \
    --cc=moshe@nvidia.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=noren@nvidia.com \
    --cc=pabeni@redhat.com \
    --cc=parav@nvidia.com \
    --cc=petrm@nvidia.com \
    --cc=razor@blackwall.org \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=shayd@nvidia.com \
    --cc=shshitrit@nvidia.com \
    --cc=shuah@kernel.org \
    --cc=skhan@linuxfoundation.org \
    --cc=tariqt@nvidia.com \
    --cc=vadim.fedorenko@linux.dev \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox