* [PATCH net-next v11 0/8] Support rate management on traffic classes in devlink and mlx5
From: Mark Bloch @ 2025-06-25 18:30 UTC
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Simon Horman
  Cc: saeedm, gal, leonro, tariqt, Donald Hunter, Jiri Pirko,
	Jonathan Corbet, Leon Romanovsky, Chuck Lever, Jeff Layton,
	NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Shuah Khan,
	netdev, linux-kernel, linux-doc, linux-rdma, linux-nfs,
	linux-kselftest, Mark Bloch

V11:
- Refactored the devlink code to accept relative TC bandwidth share
  values instead of percentages.
- Updated documentation to clarify that values are interpreted as
  relative shares.
- Refactored the logic in mlx5 to support proportional scaling for
  tc-bw values.
- Switched to `nlmsg_for_each_attr_type()` for cleaner attribute
  parsing.
- Added a hardware selftest to validate TC bandwidth behavior.
- Refactored esw_qos_is_node_empty for readability.

V10:
- Added netdevsim selftest for tc-bw ops.
- Dropped header: field as it’s unnecessary for local constants in
  devlink.yaml.

V9:
- Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h.
- Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout
  the code.
- Updated devlink-rate-tc-index-max spec to reference the correct UAPI
  header.

V8:
- Limited line width to 80 characters in mlx5 changes instead of 100.
- Increased the scheduling node levels to support TC arbitration.
- Ensured parent nodes are set correctly in all code paths that extend
  the hierarchy depth for TC arbitration.
- Extended the cover letter with the ongoing discussion on devlink-rate
  and net-shapers.
- Extended the cover letter with the Netdev talk link on this series.

V7:
- Fixed disabling tc-bw on leaf nodes that did not have tc-bw
  configured.
- Fixed an issue where tc-bw was disabled on a node with assigned
  vports, ensuring that vport->qos.sched_node->parent is correctly
  updated with the cloned node.
- Declared a constant for the maximum allowed Traffic Class index in
  devlink rate.
- Added a range check to validate rate-tc-index.
- Added documentation for the tc-bw argument.
- Added a validation check to ensure that the total bandwidth assigned
  to all traffic classes sums to 100.

V6:
- Addressed comments on devlink patch #3.
- Removed first 4 IFC patches, to be pulled from mlx5-next.

V5:
- Fix warning in devlink_nl_rate_tc_bw_set().
- Fix target branch of patch #4.

V4:
- Renamed the nested attribute for traffic class bandwidth to
  DEVLINK_ATTR_RATE_TC_BWS.
- Changed the order of the attributes in `devlink.h`.
- Refactored the initialization of the tc-bw array in
  devlink_nl_rate_tc_bw_set().
- Added extack messages to provide clear feedback on issues with tc-bw
  arguments.
- Updated `rate-tc-bws` to support a multi-attr set, where each
  attribute includes an index and the corresponding bandwidth for that
  traffic class.
- Handled the issue where the user could provide
  DEVLINK_ATTR_RATE_TC_BWS with duplicate indices.
- Provided ynl examples in patch [1/5] commit message.
- Moved the IFC patches to the beginning of the series, targeted for
  mlx5-next.

V3:
- Dropped rate-tc-index, using tc-bw array index instead.
- Renamed rate-bw to rate-tc-bw.
- Documented what the rate-tc-bw represents and added a range check for
  validation.
- Introduced devlink_nl_rate_tc_bw_set() to parse and set the TC
  bandwidth values.
- Updated the user API in the commit message of patch 1/6 to ensure
  that the bandwidths sum to 100.
- Fixed missing filling of rate-parent in devlink_nl_rate_fill().

V2:
- Included <linux/dcbnl.h> in devlink.h to resolve missing
  IEEE_8021QAZ_MAX_TCS definition.
- Refactored the rate-tc-bw attribute structure to use a separate
  rate-tc-index.
- Updated patch 2/6 title.

This patch series extends the devlink-rate API to support traffic class
(TC) bandwidth management, enabling more granular control over traffic
shaping and rate limiting across multiple TCs. The API now allows users
to specify bandwidth proportions for different traffic classes in a
single command. This is particularly useful for managing Enhanced
Transmission Selection (ETS) for groups of Virtual Functions (VFs),
allowing precise bandwidth allocation across traffic classes.
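
As a rough illustration of the relative-share semantics mentioned in the
V11 notes, the sketch below validates a set of per-TC entries and turns
them into fractions of the available bandwidth. It is not the devlink or
mlx5 code: normalize_tc_shares() and its input format are hypothetical,
and treating unlisted TCs as share 0 is an assumption. Only the limit of
8 traffic classes and the rejection of out-of-range or duplicate indices
are taken from the changelog.

```python
from fractions import Fraction

# Value stated in the V9 changelog entry (DEVLINK_RATE_TCS_MAX).
DEVLINK_RATE_TCS_MAX = 8


def normalize_tc_shares(entries):
    """Hypothetical helper: map (tc_index, share) pairs to per-TC fractions.

    Shares are relative weights, not percentages, so (1, 3) and (25, 75)
    describe the same split.
    """
    shares = [0] * DEVLINK_RATE_TCS_MAX
    seen = set()
    for index, share in entries:
        if not 0 <= index < DEVLINK_RATE_TCS_MAX:
            raise ValueError(f"TC index {index} out of range")
        if index in seen:
            raise ValueError(f"duplicate TC index {index}")
        if share < 0:
            raise ValueError(f"negative share for TC {index}")
        seen.add(index)
        shares[index] = share

    total = sum(shares)
    if total == 0:
        # Assumption: nothing configured means no TC gets a share.
        return [Fraction(0)] * DEVLINK_RATE_TCS_MAX
    return [Fraction(share, total) for share in shares]
```

With this model, normalize_tc_shares([(0, 1), (1, 3)]) gives TC 0 one
quarter and TC 1 three quarters of the bandwidth, regardless of the
absolute magnitude of the shares.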

Additionally, the series refines the QoS handling in net/mlx5 to support
TC arbitration and bandwidth management on vports and rate nodes.

Discussions on traffic class shaping in net-shapers began in V5 [1],
where we discussed with maintainers whether net-shapers should support
traffic classes and how this could be implemented.

Later, after further conversations with Paolo Abeni and Simon Horman,
Cosmin provided an update [2], confirming that net-shapers' tree-based
hierarchy aligns well with traffic classes when treated as distinct
subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX
queues and traffic classes, this approach seems feasible, though some
open questions remain regarding queue reconfiguration and certain mlx5
scheduling behaviors.

Building on that discussion, Cosmin has now shared a concrete
implementation plan on the netdev mailing list [3]. The plan, developed
in collaboration with Paolo and Simon, outlines how net-shapers can be
extended to support the same use cases currently covered by
devlink-rate, with the eventual goal of aligning both and simplifying
the shaping infrastructure in the kernel.

This work was presented at Netdev 0x19 in Zagreb [4].
There we presented how TC scheduling is enforced in mlx5 hardware,
which led to discussions on the mailing list.

A summary of how things work:

Classification means labeling a packet with a traffic class based on
the packet's DSCP or VLAN PCP field, then treating packets with
different traffic classes differently during transmit processing.

In a virtualized setup, VFs are untrusted and do not control
classification or shaping. Classification is done by the hardware using
a prio-to-TC mapping set by the hypervisor. VFs only select which send
queue to use and are expected to respect the classification logic by
sending each traffic class on its dedicated queue. As stated in the
net-shapers plan [3], each transmit queue should carry only a single
traffic class. Mixing classes in a single queue can lead to HOL
blocking.
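
The classification step can be pictured with the small model below. It
is only a sketch: the function names and table layouts are hypothetical,
and routing DSCP through a DSCP-to-priority table before the prio-to-TC
lookup is an assumption about how the two fields share one mapping. The
field widths (3-bit PCP, 6-bit DSCP) are the standard ones.

```python
def tc_from_pcp(pcp, prio_to_tc):
    """Look up the traffic class for a VLAN PCP value (3 bits, 0..7)."""
    if not 0 <= pcp <= 7:
        raise ValueError(f"PCP {pcp} out of range")
    return prio_to_tc[pcp]


def tc_from_dscp(dscp, dscp_to_prio, prio_to_tc):
    """Look up the traffic class for a DSCP value (6 bits, 0..63).

    Assumption: DSCP is first mapped to a priority, which then goes
    through the same prio-to-TC table as PCP-classified traffic.
    """
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP {dscp} out of range")
    return prio_to_tc[dscp_to_prio[dscp]]


# Example table as the hypervisor might set it: two priorities per TC.
example_prio_to_tc = [0, 0, 1, 1, 2, 2, 3, 3]
```

In this model the VF never touches example_prio_to_tc; it only chooses
which send queue a packet is posted on.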

In the mlx5 implementation, if the queue used does not match the
classified traffic class, the hardware moves the queue to the correct
TC scheduler. This movement is not a reclassification; it’s a necessary
enforcement step to ensure traffic class isolation is maintained.
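
A toy model of that enforcement step, assuming a queue is attached to
exactly one TC scheduler at a time. SendQueue and enforce_tc() are
hypothetical names used for illustration; they do not correspond to
mlx5 driver or firmware interfaces.

```python
from dataclasses import dataclass


@dataclass
class SendQueue:
    """Hypothetical send queue, attached to one TC scheduler at a time."""
    qid: int
    tc: int


def enforce_tc(queue, classified_tc):
    """Re-attach the queue if its scheduler does not match the packet's TC.

    The packet keeps the traffic class assigned by classification; only
    the queue's scheduler attachment changes. Returns True if the queue
    was moved.
    """
    if queue.tc == classified_tc:
        return False
    queue.tc = classified_tc
    return True
```

The classified TC is an input here and is never modified, which mirrors
the point that moving the queue is not a reclassification.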

Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class
  bandwidth management

Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects

Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS

Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for
  tc-bw

[1]
https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/
[2]
https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/
[3]
https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.camel@nvidia.com/T/
[4]
https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-with-ets-and-traffic-classes.html

Carolina Jubran (8):
  netlink: introduce type-checking attribute iteration for nlmsg
  devlink: Extend devlink rate API with traffic classes bandwidth management
  selftest: netdevsim: Add devlink rate tc-bw test
  net/mlx5: Add no-op implementation for setting tc-bw on rate objects
  net/mlx5: Add support for setting tc-bw on nodes
  net/mlx5: Add traffic class scheduling support for vport QoS
  net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw
  selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution

 Documentation/netlink/specs/devlink.yaml      |   32 +-
 .../networking/devlink/devlink-port.rst       |    8 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |    2 +
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |    8 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   14 +-
 drivers/net/netdevsim/dev.c                   |   43 +
 drivers/net/netdevsim/netdevsim.h             |    1 +
 drivers/net/vxlan/vxlan_vnifilter.c           |   13 +-
 fs/nfsd/nfsctl.c                              |   36 +-
 include/net/devlink.h                         |    8 +
 include/net/netlink.h                         |   14 +
 include/uapi/linux/devlink.h                  |    9 +
 net/devlink/netlink_gen.c                     |   15 +-
 net/devlink/netlink_gen.h                     |    1 +
 net/devlink/rate.c                            |  129 ++
 .../drivers/net/hw/devlink_rate_tc_bw.py      |  466 ++++++++
 .../drivers/net/netdevsim/devlink.sh          |   51 +
 .../testing/selftests/net/lib/py/__init__.py  |    2 +-
 tools/testing/selftests/net/lib/py/ynl.py     |    5 +
 20 files changed, 1823 insertions(+), 71 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py


base-commit: 8dacfd92dbefee829ca555a860e86108fdd1d55b
-- 
2.34.1

