public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>
Cc: Donald Hunter <donald.hunter@gmail.com>,
	Simon Horman <horms@kernel.org>, Jiri Pirko <jiri@resnulli.us>,
	Jonathan Corbet <corbet@lwn.net>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	"Leon Romanovsky" <leon@kernel.org>,
	Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	"Matthieu Baerts (NGI0)" <matttbe@kernel.org>,
	Cosmin Ratiu <cratiu@nvidia.com>,
	"Carolina Jubran" <cjubran@nvidia.com>,
	Daniel Zahka <daniel.zahka@gmail.com>,
	"Shay Drory" <shayd@nvidia.com>, Kees Cook <kees@kernel.org>,
	Daniel Jurgens <danielj@nvidia.com>,
	Moshe Shemesh <moshe@nvidia.com>,
	Adithya Jayachandran <ajayachandra@nvidia.com>,
	Willem de Bruijn <willemb@google.com>, David Wei <dw@davidwei.uk>,
	Petr Machata <petrm@nvidia.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Vadim Fedorenko <vadim.fedorenko@linux.dev>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-doc@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
	<linux-kselftest@vger.kernel.org>, Gal Pressman <gal@nvidia.com>,
	Jiri Pirko <jiri@nvidia.com>
Subject: [PATCH net-next V8 00/14] devlink and mlx5: Support cross-function rate scheduling
Date: Tue, 24 Mar 2026 14:28:34 +0200	[thread overview]
Message-ID: <20260324122848.36731-1-tariqt@nvidia.com> (raw)

Hi,

This series by Cosmin adds support for cross-function rate scheduling in
devlink and mlx5.
See detailed explanation by Cosmin below [0].

Regards,
Tariq

[0]
devlink objects support rate management for TX scheduling, which
involves maintaining a tree of rate nodes that corresponds to TX
schedulers in hardware. 'man devlink-rate' has the full details.

The tree of rate nodes is maintained per devlink object, protected by
the devlink lock.

There exists hardware capable of instantiating TX scheduling trees
spanning multiple functions of the same physical device (and thus
devlink objects) and therefore the current API and locking scheme is
insufficient.

This patch series changes the devlink rate implementation and API to
allow supporting such hardware and managing TX scheduling trees across
multiple functions of a physical device.

Modeling this requires having devlink rate nodes with parents in other
devlink objects. A naive approach that relies on the current
one-lock-per-devlink model is impossible, as it would require in some
cases acquiring multiple devlink locks in the correct order.

The solution proposed in this patch series makes use of the recently
introduced shared devlink instance [1] to manage rate hierarchy changes
across multiple functions.

V1 of this patch series was sent a long time ago [2], using a different
approach of storing rates in a shared rate domain with special locking
rules. This new approach uses standard devlink instances and nesting.

The first part of the series adds support to devlink rates for
maintaining the rate tree across multiple functions.

The second part changes the mlx5 implementation to make use of this (and
cleans up remnants of the previous approach, involving rate domains).

The neat part about using the shared devlink object is that it works for
SFs as well, which are already nested in their parent PF instances. So
with this series, complex scheduling trees spanning multiple SFs across
multiple PFs of the same NIC can now be supported.

Patches:

devlink rate changes for cross-device TX scheduling:
devlink: Add helpers to lock nested-in instances
devlink: Migrate from info->user_ptr to info->ctx
devlink: Decouple rate storage from associated devlink object
devlink: Add parent dev to devlink API
devlink: Allow parent dev for rate-set and rate-new
devlink: Allow rate node parents from other devlinks

mlx5 support for cross-device TX scheduling:
net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed
net/mlx5: qos: Expose a function to clear a vport's parent
net/mlx5: qos: Model the root node in the scheduling hierarchy
net/mlx5: qos: Remove qos domains and use shd lock
net/mlx5: qos: Support cross-device tx scheduling
selftests: drv-net: Add test for cross-esw rate scheduling
net/mlx5: Document devlink rates

[1] https://lore.kernel.org/all/20260312100407.551173-1-jiri@resnulli.us/T/#u
[2] https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/

V8:
- Link to V7: https://lore.kernel.org/all/20260128112544.1661250-1-tariqt@nvidia.com/
- 2 patches accepted in v7, so they're gone from the series.
- Jiri's devlink changes sent as a separate series [1].
- added a hw selftest for cross-esw devlink rates.
- migrated devlink from info->user_ptr to info->ctx.
- simplified LAG speed locking in qos.c.
- found and fixed a qos cleanup bug.
- new patch simplifying parent handling in qos.c
- removed unnecessary exports for devl_rate_{,un}lock

Cosmin Ratiu (14):
  devlink: Update nested instance locking comment
  devlink: Add helpers to lock nested-in instances
  devlink: Migrate from info->user_ptr to info->ctx
  devlink: Decouple rate storage from associated devlink object
  devlink: Add parent dev to devlink API
  devlink: Allow parent dev for rate-set and rate-new
  devlink: Allow rate node parents from other devlinks
  net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed
  net/mlx5: qos: Expose a function to clear a vport's parent
  net/mlx5: qos: Model the root node in the scheduling hierarchy
  net/mlx5: qos: Remove qos domains and use shd lock
  net/mlx5: qos: Support cross-device tx scheduling
  selftests: drv-net: Add test for cross-esw rate scheduling
  net/mlx5: Document devlink rates

 Documentation/netlink/specs/devlink.yaml      |  24 +-
 .../networking/devlink/devlink-port.rst       |   2 +
 Documentation/networking/devlink/index.rst    |   4 +-
 Documentation/networking/devlink/mlx5.rst     |  33 ++
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   1 +
 .../mellanox/mlx5/core/esw/devlink_port.c     |   2 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 524 +++++++-----------
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |   3 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   8 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  15 +-
 include/net/devlink.h                         |   5 +
 include/uapi/linux/devlink.h                  |   2 +
 net/devlink/core.c                            |  42 ++
 net/devlink/dev.c                             |  16 +-
 net/devlink/devl_internal.h                   |  19 +
 net/devlink/dpipe.c                           |  14 +-
 net/devlink/health.c                          |  12 +-
 net/devlink/linecard.c                        |   4 +-
 net/devlink/netlink.c                         |  82 ++-
 net/devlink/netlink_gen.c                     |  24 +-
 net/devlink/netlink_gen.h                     |   8 +
 net/devlink/param.c                           |   4 +-
 net/devlink/port.c                            |  18 +-
 net/devlink/rate.c                            | 290 ++++++++--
 net/devlink/region.c                          |   6 +-
 net/devlink/resource.c                        |   6 +-
 net/devlink/sb.c                              |  22 +-
 net/devlink/trap.c                            |  12 +-
 .../testing/selftests/drivers/net/hw/Makefile |   1 +
 .../drivers/net/hw/devlink_rate_cross_esw.py  | 300 ++++++++++
 30 files changed, 1028 insertions(+), 475 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_cross_esw.py


base-commit: b50a48b65cb0c955728894b24ba5ecc3cd4b8c7d
-- 
2.44.0


             reply	other threads:[~2026-03-24 12:29 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-24 12:28 Tariq Toukan [this message]
2026-03-24 12:28 ` [PATCH net-next V8 01/14] devlink: Update nested instance locking comment Tariq Toukan
2026-03-24 13:05   ` Jiri Pirko
2026-03-24 12:28 ` [PATCH net-next V8 02/14] devlink: Add helpers to lock nested-in instances Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 03/14] devlink: Migrate from info->user_ptr to info->ctx Tariq Toukan
2026-03-24 13:05   ` Jiri Pirko
2026-03-24 12:28 ` [PATCH net-next V8 04/14] devlink: Decouple rate storage from associated devlink object Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 05/14] devlink: Add parent dev to devlink API Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 06/14] devlink: Allow parent dev for rate-set and rate-new Tariq Toukan
2026-03-24 19:16   ` Cosmin Ratiu
2026-03-24 12:28 ` [PATCH net-next V8 07/14] devlink: Allow rate node parents from other devlinks Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 08/14] net/mlx5: qos: Use mlx5_lag_query_bond_speed to query LAG speed Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 09/14] net/mlx5: qos: Expose a function to clear a vport's parent Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 10/14] net/mlx5: qos: Model the root node in the scheduling hierarchy Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 11/14] net/mlx5: qos: Remove qos domains and use shd lock Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 12/14] net/mlx5: qos: Support cross-device tx scheduling Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 13/14] selftests: drv-net: Add test for cross-esw rate scheduling Tariq Toukan
2026-03-24 12:28 ` [PATCH net-next V8 14/14] net/mlx5: Document devlink rates Tariq Toukan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260324122848.36731-1-tariqt@nvidia.com \
    --to=tariqt@nvidia.com \
    --cc=ajayachandra@nvidia.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=chuck.lever@oracle.com \
    --cc=cjubran@nvidia.com \
    --cc=corbet@lwn.net \
    --cc=cratiu@nvidia.com \
    --cc=daniel.zahka@gmail.com \
    --cc=danielj@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=donald.hunter@gmail.com \
    --cc=dw@davidwei.uk \
    --cc=edumazet@google.com \
    --cc=gal@nvidia.com \
    --cc=horms@kernel.org \
    --cc=jiri@nvidia.com \
    --cc=jiri@resnulli.us \
    --cc=kees@kernel.org \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=matttbe@kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=moshe@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=petrm@nvidia.com \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=shayd@nvidia.com \
    --cc=skhan@linuxfoundation.org \
    --cc=vadim.fedorenko@linux.dev \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox