From: Tariq Toukan <tariqt@nvidia.com>
To: Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Donald Hunter <donald.hunter@gmail.com>,
Jiri Pirko <jiri@resnulli.us>, Jonathan Corbet <corbet@lwn.net>,
Saeed Mahameed <saeedm@nvidia.com>,
"Leon Romanovsky" <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-doc@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
Gal Pressman <gal@nvidia.com>, Moshe Shemesh <moshe@nvidia.com>,
Carolina Jubran <cjubran@nvidia.com>,
Cosmin Ratiu <cratiu@nvidia.com>, Jiri Pirko <jiri@nvidia.com>,
Randy Dunlap <rdunlap@infradead.org>,
Simon Horman <horms@kernel.org>,
Krzysztof Kozlowski <krzk@kernel.org>
Subject: [PATCH net-next V5 00/15] devlink and mlx5: Support cross-function rate scheduling
Date: Tue, 20 Jan 2026 09:57:43 +0200
Message-ID: <1768895878-1637182-1-git-send-email-tariqt@nvidia.com>
Hi,
This series by Cosmin and Jiri adds support for cross-function rate
scheduling in devlink and mlx5.
This is V5; V4 can be found here:
https://lore.kernel.org/all/1764101173-1312171-1-git-send-email-tariqt@nvidia.com/
Regards,
Tariq
[1]
devlink objects support rate management for TX scheduling, which
involves maintaining a tree of rate nodes that corresponds to TX
schedulers in hardware. 'man devlink-rate' has the full details.
The tree of rate nodes is maintained per devlink object, protected by
the devlink lock.
There exists hardware capable of instantiating TX scheduling trees
spanning multiple functions of the same physical device (and thus
multiple devlink objects), so the current API and locking scheme are
insufficient.
This patch series changes the devlink rate implementation and API to
allow supporting such hardware and managing TX scheduling trees across
multiple functions of a physical device.
Modeling this requires having devlink rate nodes with parents in other
devlink objects. A naive approach that relies on the current
one-lock-per-devlink model is not workable, as it would in some cases
require acquiring multiple devlink locks in the correct order.
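The ordering problem can be illustrated with a small userspace model.
This is only a sketch and not devlink code: fake_devlink,
try_cross_instance_step() and replay_abba() are hypothetical names, and
pthread mutexes stand in for the per-instance devlink locks. It replays
the interleaving in which two operations each hold their own instance
lock and then need the other one, and counts how many of them can make
progress.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a devlink instance and its lock. */
struct fake_devlink {
	pthread_mutex_t lock;
};

/*
 * Model of an operation that already holds its own instance lock and now
 * needs 'other' to attach a rate node to a parent in that instance.
 * Returns true if the second lock could be taken without waiting.
 */
static bool try_cross_instance_step(struct fake_devlink *other)
{
	if (pthread_mutex_trylock(&other->lock) != 0)
		return false;
	pthread_mutex_unlock(&other->lock);
	return true;
}

/*
 * Replay the problematic interleaving: op1 holds A and wants B while
 * op2 holds B and wants A. Returns how many operations could proceed.
 */
static int replay_abba(void)
{
	struct fake_devlink a, b;
	int progressed = 0;

	pthread_mutex_init(&a.lock, NULL);
	pthread_mutex_init(&b.lock, NULL);

	pthread_mutex_lock(&a.lock);	/* op1 locks its own instance */
	pthread_mutex_lock(&b.lock);	/* op2 locks its own instance */

	progressed += try_cross_instance_step(&b);	/* op1 wants B */
	progressed += try_cross_instance_step(&a);	/* op2 wants A */

	pthread_mutex_unlock(&b.lock);
	pthread_mutex_unlock(&a.lock);
	pthread_mutex_destroy(&b.lock);
	pthread_mutex_destroy(&a.lock);
	return progressed;
}
```

With blocking locks instead of trylock, neither operation in this
interleaving would ever proceed, which is why a single lock covering
the whole scheduling tree is attractive.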
The solution proposed in this patch series consists of two parts:
1. Representing the underlying physical NIC as a shared devlink object
on the faux bus and nesting all its PF devlink instances in it.
2. Changing the devlink rate implementation to store rates in this
shared devlink object, if it exists, and use its lock to protect
against concurrent changes of the scheduling tree.
With these in place, cross-esw scheduling support is added to mlx5.
The neat part about this approach is that it works for SFs as well,
which are already nested in their parent PF instances. So with this
series, complex scheduling trees spanning multiple SFs across multiple
PFs of the same NIC can now be supported.
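The two parts above can be pictured with a simplified model. The sketch
below is an assumption-laden illustration, not the code in this series:
fake_devlink, fake_rate, rate_root() and rate_set_parent() are
hypothetical names. It resolves the instance whose lock and storage
would cover a given devlink's rates (the shared ancestor if one exists,
otherwise the instance itself) and only allows a rate parent when both
nodes resolve to the same root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a devlink instance. */
struct fake_devlink {
	const char *name;
	struct fake_devlink *nested_in;	/* instance this one is nested in, or NULL */
	bool shared;			/* true for the per-NIC shared instance */
};

/* Hypothetical rate node: owned by one instance, with an optional parent. */
struct fake_rate {
	struct fake_devlink *owner;
	struct fake_rate *parent;
};

/*
 * Return the instance whose lock and storage would cover dl's rates:
 * the shared ancestor if one exists, otherwise dl itself.
 */
static struct fake_devlink *rate_root(struct fake_devlink *dl)
{
	struct fake_devlink *cur;

	for (cur = dl; cur; cur = cur->nested_in)
		if (cur->shared)
			return cur;
	return dl;
}

/* Allow a parent only when both nodes resolve to the same rate root. */
static bool rate_set_parent(struct fake_rate *child, struct fake_rate *parent)
{
	if (rate_root(child->owner) != rate_root(parent->owner))
		return false;
	child->parent = parent;
	return true;
}
```

In this model an SF nested in one PF and a group node owned by another
PF of the same NIC both resolve to the shared instance, so one lock
covers the whole tree, while a node from an unrelated devlink instance
is rejected as a parent or child.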
V1 of this patch series was sent a long time ago [2], using a different
approach of storing rates in a shared rate domain with special locking
rules. This new approach uses standard devlink instances and nesting.
Patches:
devlink rate changes for cross-device TX scheduling:
  devlink: Reverse locking order for nested instances
  documentation: networking: add shared devlink documentation
  devlink: Add helpers to lock nested-in instances
  devlink: Refactor devlink_rate_nodes_check
  devlink: Decouple rate storage from associated devlink object
  devlink: Add parent dev to devlink API
  devlink: Allow parent dev for rate-set and rate-new
  devlink: Allow rate node parents from other devlinks
  devlink: introduce shared devlink instance for PFs on same chip
mlx5 support for cross-device TX scheduling:
  net/mlx5: Add a shared devlink instance for PFs on same chip
  net/mlx5: Expose a function to clear a vport's parent
  net/mlx5: Store QoS sched nodes in the sh_devlink
  net/mlx5: qos: Support cross-device tx scheduling
  net/mlx5: qos: Enable cross-device scheduling
  net/mlx5: Document devlink rates
[2] https://lore.kernel.org/netdev/20250213180134.323929-1-tariqt@nvidia.com/
V5:
- Made parts of shd generic devlink infra (Jakub).
- Stopped using __free (Krzysztof Kozlowski).
- Moved some generated netlink code into the correct patch (Simon Horman).
- Addressed cleanup bug (Jakub).
- Clarified uses of shared devlink in documentation.
V4:
- Fix typo in documentation (Randy Dunlap).
V3:
- Remove mistakenly repeated devlink interface in docs (Jakub).
- Add Jiri's review tags on ML.
V2:
- Rebase.
- Add Jiri's review tags on ML.
Cosmin Ratiu (12):
  devlink: Reverse locking order for nested instances
  devlink: Add helpers to lock nested-in instances
  devlink: Refactor devlink_rate_nodes_check
  devlink: Decouple rate storage from associated devlink object
  devlink: Add parent dev to devlink API
  devlink: Allow parent dev for rate-set and rate-new
  devlink: Allow rate node parents from other devlinks
  net/mlx5: Expose a function to clear a vport's parent
  net/mlx5: Store QoS sched nodes in the sh_devlink
  net/mlx5: qos: Support cross-device tx scheduling
  net/mlx5: qos: Enable cross-device scheduling
  net/mlx5: Document devlink rates

Jiri Pirko (3):
  documentation: networking: add shared devlink documentation
  devlink: introduce shared devlink instance for PFs on same chip
  net/mlx5: Add a shared devlink instance for PFs on same chip
 Documentation/netlink/specs/devlink.yaml      |  22 +-
 .../networking/devlink/devlink-port.rst       |   2 +
 .../networking/devlink/devlink-shared.rst     |  95 +++++
 Documentation/networking/devlink/index.rst    |   1 +
 Documentation/networking/devlink/mlx5.rst     |  33 ++
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   5 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   1 +
 .../mellanox/mlx5/core/esw/devlink_port.c     |   2 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 332 ++++++++----------
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |   3 -
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |   9 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  14 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  17 +
 .../ethernet/mellanox/mlx5/core/sh_devlink.c  |  91 +++++
 .../ethernet/mellanox/mlx5/core/sh_devlink.h  |  14 +
 include/linux/mlx5/driver.h                   |   1 +
 include/net/devlink.h                         |  13 +
 include/uapi/linux/devlink.h                  |   2 +
 net/devlink/Makefile                          |   2 +-
 net/devlink/core.c                            |  48 ++-
 net/devlink/dev.c                             |   7 +-
 net/devlink/devl_internal.h                   |  11 +-
 net/devlink/netlink.c                         |  67 +++-
 net/devlink/netlink_gen.c                     |  23 +-
 net/devlink/netlink_gen.h                     |   8 +
 net/devlink/rate.c                            | 290 +++++++++++----
 net/devlink/sh_dev.c                          | 163 +++++++++
 27 files changed, 978 insertions(+), 298 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-shared.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sh_devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/sh_devlink.h
 create mode 100644 net/devlink/sh_dev.c
base-commit: c5e7b1d1cc8a6cb8b709eef34c93a9458427ab2e
--
2.44.0