* net-shapers plan
@ 2025-03-06 14:03 Cosmin Ratiu
2025-03-11 1:42 ` Samudrala, Sridhar
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Cosmin Ratiu @ 2025-03-06 14:03 UTC (permalink / raw)
To: netdev@vger.kernel.org
Cc: horms@kernel.org, andrew+netdev@lunn.ch, davem@davemloft.net,
Tariq Toukan, Gal Pressman, jiri@resnulli.us, Leon Romanovsky,
edumazet@google.com, kuba@kernel.org, Saeed Mahameed,
Carolina Jubran, pabeni@redhat.com
Hello,
This (long) email presents a plan agreed with Simon and Paolo for
extending net-shapers with use cases currently serviced by devlink-
rate. The goal is to get net-shapers to feature parity with devlink-
rate so that the amount of code dedicated to traffic shaping in the
kernel could eventually be reduced significantly.
This is in response to Jakub's concerns raised in [3] and [4].
Context
-------
devlink-rate ([1]) can control traffic shaping for a VF / VF group and
is currently implemented by the Intel ice and NVIDIA mlx5 drivers. It
operates either on devlink ports (for VF rates) or on devlink objects
(for group rates). Rate objects are owned by the devlink object.
net-shapers ([2]) is a recently added API for shaping traffic for a
netdev tx queue / queue group / entire netdev. It is more granular than
devlink-rate but cannot currently control shaping for groups of
netdevs. It operates on netdev handles and stores the shaping hierarchy
in the netdevice.
[3] & [4] add support to devlink-rate for traffic-class shaping, which
controls the shaping hierarchy in hardware to determine the
bandwidth allocation that different traffic classes get. The question is how
to represent traffic classes in net-shapers.
In [5], Jiri expressed a desire to eventually convert devlink-rate to
net-shapers.
Finally, in [6] I sent an update outlining a snapshot of discussions
that took place trying to figure things out.
Putting these pieces together, the following plan takes shape.
Plan, in short
--------------
1. Extend the net-shapers hierarchy with the ability to define 8 traffic
class roots for a net device instead of a single root like today. There
is no need for a new scope; the NETDEV scope with a different id to
differentiate TCs should be enough.
This is needed to allow backpressure from the hierarchy to the txq
level and proper TC selection.
The goal is to have either a hierarchy like today, with one netdev-
level root containing nodes and with txqs as leaves, or a TC-enabled
hierarchy with 8 roots (one for each traffic class), with nodes and
txqs as leaves.
2. Extend the semantics of NET_SHAPER_SCOPE_NODE to be able to group
multiple netdevs, similar to devlink-rate nodes.
3. Add a new DEVLINK binding type for the hierarchy, to be able to
represent netdev groups. That part of the hierarchy would be stored in
the devlink object instead of the netdev. This allows separation
between the VM and the hypervisor parts of the hierarchy.
These together should make net-shapers a strict superset of devlink-
rate and would allow the devlink-rate implementation to be converted to
net-shapers. This allows traffic shaping to be operated independently
from a VM (limited to its own VF/netdev) and from the hypervisor (which
can rate limit traffic classes and groups of VFs, like devlink-rate).
Plan, in detail
---------------
1. Packet classification
Packet classification is outside the scope of net-shapers, but it is
worth discussing here.
Packet classification is done based on either:
a. TOS field in the IP header (known as DSCP) or
b. VLAN priority in the VLAN header (known as PCP).
c. Arbitrary rules based on DPI (not standard, but possible).
Classification means labeling a packet with a traffic class based on
the above rules, then treating packets with different traffic classes
differently during tx processing.
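As a concrete illustration of how such labels are typically applied
from user space today (a hedged sketch; the interface name and the
mapping values below are examples only):

  # mark DSCP 46 (EF) traffic as priority 5 using the iproute2 dcb tool
  dcb app add dev eth0 dscp-prio 46:5

  # or carry the label in the VLAN header: skb priority N -> PCP N
  ip link add link eth0 name eth0.100 type vlan id 100 \
      egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7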
The first moment when classification matters is when choosing a txq.
Since the goal is to be able to treat different traffic classes
differently, it is necessary to have a txq only output a single traffic
class. If that condition doesn't hold, a txq sending a mixture of
traffic classes might suffer from head-of-line blocking. Imagine a
scenario in which low-volume, high-priority TC 7 traffic is sent on a
txq alongside high-volume, low-priority TC 0 traffic.
Backpressure on TC 0 from further up the shaping hierarchy would only
be able to manifest itself by blocking the entire txq, affecting both
traffic classes.
It is not important which entity (kernel or hw) classifies packets as
long as the condition that a given txq only sends traffic for a single
traffic class holds.
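As an example of the kernel-side option, mqprio can already enforce
such a per-TC txq split today (illustrative values; this assumes the
device exposes 32 txqs):

  # 8 TCs, skb priority i -> TC i, 4 dedicated txqs per TC, hw offload on
  tc qdisc replace dev eth0 root mqprio num_tc 8 \
      map 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 \
      queues 4@0 4@4 4@8 4@12 4@16 4@20 4@24 4@28 \
      hw 1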
2. New net-shapers netdev TC roots
A new netdev TC root would therefore logically identify a disjoint
subset of txqs that service that TC. The union of all 8 roots would
encompass all device txqs.
The primary reason to define separate roots for each TC is that
backpressure from the hierarchy on one of the traffic classes must not
affect other traffic classes, meaning only txqs servicing the
blocked traffic class should be affected.
Furthermore, this cannot be done by simply grouping txqs for a given TC
with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
known to the kernel and might only be known to the driver or the NIC.
With the new roots, net-shapers can relay the intent to shape traffic
for a particular TC to the driver without having knowledge of which
txqs service a TC. The association between txqs and TCs they service
doesn't need to be known to the kernel.
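A very rough sketch of what this could look like through the YNL CLI,
assuming the existing net_shaper spec is extended so that a NETDEV-scope
handle can carry a TC index as its id (all attribute names and values
below are illustrative, not an agreed uAPI):

  # hypothetical: cap traffic class 5 of ifindex 2 at 10 Gbps
  ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/net_shaper.yaml \
      --do set \
      --json '{"ifindex": 2,
               "handle": {"scope": "netdev", "id": 5},
               "bw-max": 10000000000}'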
3. Extend NODE scope to group multiple netdevs and new DEVLINK binding
Today, all net-shaper objects are owned by a netdevice. Who should own
a net shaper that represents a group of netdevices? It needs to be a
stable object that isn't affected by group membership changes and
therefore cannot be any netdev from the group. The only sensible option
would be to pick an object corresponding to the eswitch to own such
groups, which neatly corresponds to the devlink object today.
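For comparison, this is roughly how devlink-rate expresses such a group
today (device names, port indices and rates are examples only):

  # create a rate node on the eswitch and attach two VF ports to it
  devlink port function rate add pci/0000:03:00.0/vfs_group tx_max 10gbit
  devlink port function rate set pci/0000:03:00.0/1 parent vfs_group
  devlink port function rate set pci/0000:03:00.0/2 parent vfs_group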
4. VM/hypervisor considerations
A great deal of discussion happened about the split of shaping
responsibilities between the VM and the hypervisor. With devlink today,
the shaping hierarchy and traffic class bw split is decided entirely by
the hypervisor; the VMs have no influence on shaping.
But net-shapers offers more precise, queue-level shaping granularity,
so perhaps there are valid use cases for allowing VMs to control
their part of the hierarchy. In the end, what we think makes sense is
this model:
VMs can control the shaping of txqs, queue groups and the VFs they own.
On top of that, the hypervisor can take the netdev root of the VM
hierarchy and plug it into its own hierarchy, imposing additional
constraints. The VM has no influence on that. So for example the VM can
decide that "my VF should be limited to 10Gbps", but the hypervisor can
then add another shaping node saying "that VF is limited to 1Gbps" and
the latter should be the limit.
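A sketch of that split, with all names, indices and rates illustrative
(the VM-side command assumes the net-shaper YNL spec as it exists
today):

  # inside the VM: the VM caps its own netdev at 10 Gbps via net-shapers
  ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/net_shaper.yaml \
      --do set \
      --json '{"ifindex": 2,
               "handle": {"scope": "netdev"},
               "bw-max": 10000000000}'

  # on the hypervisor: the corresponding VF is further capped to 1 Gbps;
  # the stricter limit wins
  devlink port function rate set pci/0000:03:00.0/1 tx_max 1gbit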
With traffic classes, the VM can send out tc-labeled traffic on
different txqs, but the hypervisor decides whether to take the VM TC
roots and group them under an arbiter node (== a shaping node
arbitrating between different traffic classes), or to group TC roots
from multiple VMs before applying arbitration settings. This is similar
to devlink-rate today. The VM itself should have no control over TC
bandwidth settings.
Cosmin.
[1] https://man7.org/linux/man-pages/man8/devlink-rate.8.html
[2]
https://lore.kernel.org/netdev/cover.1728460186.git.pabeni@redhat.com/
[3] https://lore.kernel.org/netdev/20241206181345.3eccfca4@kernel.org/
[4]
https://lore.kernel.org/netdev/20250209101716.112774-1-tariqt@nvidia.com/
[5] https://lore.kernel.org/netdev/ZwP8OWtMfCH0_ikc@nanopsycho.orion/
[6]
https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/
* Re: net-shapers plan
2025-03-06 14:03 net-shapers plan Cosmin Ratiu
@ 2025-03-11 1:42 ` Samudrala, Sridhar
2025-03-12 11:02 ` Carolina Jubran
2025-03-11 13:38 ` Jiri Pirko
2025-03-28 12:13 ` Jakub Kicinski
2 siblings, 1 reply; 16+ messages in thread
From: Samudrala, Sridhar @ 2025-03-11 1:42 UTC (permalink / raw)
To: Cosmin Ratiu, netdev@vger.kernel.org
Cc: horms@kernel.org, andrew+netdev@lunn.ch, davem@davemloft.net,
Tariq Toukan, Gal Pressman, jiri@resnulli.us, Leon Romanovsky,
edumazet@google.com, kuba@kernel.org, Saeed Mahameed,
Carolina Jubran, pabeni@redhat.com, Madhu Chittim, Zaki, Ahmed
On 3/6/2025 6:03 AM, Cosmin Ratiu wrote:
> Hello,
>
> This (long) email presents a plan agreed with Simon and Paolo for
> extending net-shapers with use cases currently serviced by devlink-
> rate. The goal is to get net-shapers to feature parity with devlink-
> rate so that the amount of code dedicated to traffic shaping in the
> kernel could eventually be reduced significantly.
>
> This is in response to Jakub's concerns raised in [3] and [4].
>
> Context
> -------
> devlink-rate ([1]) can control traffic shaping for a VF / VF group and
> is currently implemented by the Intel ice and NVIDIA mlx5 drivers. It
> operates either on devlink ports (for VF rates) or on devlink objects
> (for group rates). Rate objects are owned by the devlink object.
>
> net-shapers ([2]) is a recently added API for shaping traffic for a
> netdev tx queue / queue group / entire netdev. It is more granular than
> devlink-rate but cannot currently control shaping for groups of
> netdevs. It operates with netdev handles. Stores the shaping hierarchy
> in the netdevice.
>
> [3] & [4] add support to devlink-rate for traffic-class shaping, which
> is controlling the shaping hierarchy in hardware to control the
> bandwidth allocation different traffic classes get. The question is how
> to represent traffic classes in net-shapers.
>
> In [5], Jiri expressed a desire to eventually convert devlink-rate to
> net-shapers.
>
> Finally, in [6] I sent an update outlining a snapshot of discussions
> that took place trying to figure things out.
>
> Putting these pieces together, the following plan takes shape.
>
> Plan, in short
> --------------
> 1. Extend net-shapers hierarchy with the ability to define 8 traffic
> class roots for a net device instead of a single root like today. There
> is no need for a new scope, the NETDEV scope with a different id to
> differentiate TCs should be enough.
> This is needed to allow backpressure from the hierarchy to the txq
> level and proper TC selection.
>
> The goal is to either have a hierarchy like today, with one netdev-
> level root containing nodes and leaves being txqs or to have a TC-
> enabled hierarchy with 8 roots (one for each traffic class), with nodes
> and txqs as leaves.
>
> 2. Extend the semantics of NET_SHAPER_SCOPE_NODE to be able to group
> multiple netdevs, similar to devlink-rate nodes.
>
> 3. Add a new DEVLINK binding type for the hierarchy, to be able to
> represent netdev groups. That part of the hierarchy would be stored in
> the devlink object instead of the netdev. This allows separation
> between the VM and the hypervisor parts of the hierarchy.
>
> These together should make net-shapers a strict superset of devlink-
> rate and would allow the devlink-rate implementation to be converted to
> net-shapers. It allows independently operating traffic shaping from a
> VM (limited to its own VF/netdev) and from the hypervisor (being able
> to rate limit traffic classes and groups of VFs, like devlink-rate).
>
> Plan, in detail
> ---------------
> 1. Packet classification
> It is outside the scope of net-shapers, but it's worth talking about
> it.
> Packet classification is done based on either:
> a. TOS field in the IP header (known as DSCP) or
> b. VLAN priority in the VLAN header (known as PCP).
> c. Arbitrary rules based on DPI (not standard, but possible).
>
> Classification means labeling a packet with a traffic class based on
> the above rules, then treating packets with different traffic classes
> differently during tx processing.
>
> The first moment when classification matters is when choosing a txq.
> Since the goal is to be able to treat different traffic classes
> differently, it is necessary to have a txq only output a single traffic
> class. If that condition doesn't hold, a txq sending a mixture of
> traffic classes might suffer from head-of-line blocking. Imagine a
> scenario with a txq on which low volume high priority TC 7 for example
> is sent alongside high volume low priority TC 0.
> Backpressure on TC 0 from further up the shaping hierarchy would only
> be able to manifest itself by blocking the entire txq, affecting both
> traffic classes.
>
> It is not important which entity (kernel or hw) classifies packets as
> long as the condition that a given txq only sends traffic for a single
> traffic class holds.
>
> 2. New net-shapers netdev TC roots
> A new netdev TC root would therefore logically identify a disjoint
> subset of txqs that service that TC. The union of all 8 roots would
> encompass all device txqs.
Are these TC roots configured on the VF/SF netdev, or are they on the
corresponding port representor netdevs?
>
> The primary reason to define separate roots for each TC is that
> backpressure from the hierarchy on one of the traffic classes needs to
> not affect other traffic classes, meaning only txqs servicing the
> blocked traffic class should be affected.
>
> Furthermore, this cannot be done by simply grouping txqs for a given TC
> with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
> known to the kernel and might only be known to the driver or the NIC.
> With the new roots, net-shapers can relay the intent to shape traffic
> for a particular TC to the driver without having knowledge of which
> txqs service a TC. The association between txqs and TCs they service
> doesn't need to be known to the kernel.
>
> 3. Extend NODE scope to group multiple netdevs and new DEVLINK binding
> Today, all net-shapers objects are owned by a netdevice. Who should own
> a net shaper that represents a group of netdevices? It needs to be a
> stable object that isn't affected by group membership changes and
> therefore cannot be any netdev from the group. The only sensible option
> would be to pick an object corresponding to the eswitch to own such
> groups, which neatly corresponds to the devlink object today.
When you are referring to grouping multiple netdevs, I am assuming these
are port representor netdevs. Is this correct?
>
> 4. VM/hypervisor considerations
> A great deal of discussion happened about the split of shaping
> responsibilities between the VM and the hypervisor. With devlink today,
> the shaping hierarchy and traffic class bw split is decided entirely by
> the hypervisor, the VMs have no influence on shaping.
>
> But net-shapers has more precise granularity for shaping at queue
> level, so perhaps there are valid use cases for allowing VMs to control
> their part of the hierarchy. In the end, what we think makes sense is
> this model:
>
> VMs can control the shaping of txqs, queue groups and the VFs they own.
> On top of that, the hypervisor can take the netdev root of the VM
> hierarchy and plug it into its own hierarchy, imposing additional
> constraints. The VM has no influence on that. So for example the VM can
> decide that "my VF should be limited to 10Gbps", but the hypervisor can
> then add another shaping node saying "that VF is limited to 1Gbps" and
> the latter should be the limit.
Isn't it sufficient to enable rate limiting at a VF/SF's queue or
queue-group granularity from the VF/SF netdev? The hypervisor should be
able to rate limit at VF granularity.
>
> With traffic classes, the VM can send out tc-labeled traffic on
> different txqs, but the hypervisor decides to take the VM TC roots and
> group them in an arbiter node (== a shaping node arbitrating between
> different traffic classes), or to group TC roots from multiple VMs
> before applying arbitration settings. This is similar to devlink-rate
> today. The VM itself should have no control over TC bandwidth settings.
It is not clear if TC roots are configured by the VF driver or by the PF
drivers supporting switchdev. Can you share an example configuration,
with steps, on how to configure hierarchical traffic shaping of VF queues/TCs?
>
> Cosmin.
>
> [1] https://man7.org/linux/man-pages/man8/devlink-rate.8.html
> [2]
> https://lore.kernel.org/netdev/cover.1728460186.git.pabeni@redhat.com/
> [3] https://lore.kernel.org/netdev/20241206181345.3eccfca4@kernel.org/
> [4]
> https://lore.kernel.org/netdev/20250209101716.112774-1-tariqt@nvidia.com/
> [5] https://lore.kernel.org/netdev/ZwP8OWtMfCH0_ikc@nanopsycho.orion/
> [6]
> https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/
>
* Re: net-shapers plan
2025-03-11 1:42 ` Samudrala, Sridhar
@ 2025-03-12 11:02 ` Carolina Jubran
0 siblings, 0 replies; 16+ messages in thread
From: Carolina Jubran @ 2025-03-12 11:02 UTC (permalink / raw)
To: Samudrala, Sridhar, Cosmin Ratiu, netdev@vger.kernel.org
Cc: horms@kernel.org, andrew+netdev@lunn.ch, davem@davemloft.net,
Tariq Toukan, Gal Pressman, jiri@resnulli.us, Leon Romanovsky,
edumazet@google.com, kuba@kernel.org, Saeed Mahameed,
pabeni@redhat.com, Madhu Chittim, Zaki, Ahmed
On 11/03/2025 3:42, Samudrala, Sridhar wrote:
>
>
> On 3/6/2025 6:03 AM, Cosmin Ratiu wrote:
>> Hello,
>>
>> This (long) email presents a plan agreed with Simon and Paolo for
>> extending net-shapers with use cases currently serviced by devlink-
>> rate. The goal is to get net-shapers to feature parity with devlink-
>> rate so that the amount of code dedicated to traffic shaping in the
>> kernel could eventually be reduced significantly.
>>
>> This is in response to Jakub's concerns raised in [3] and [4].
>>
>> Context
>> -------
>> devlink-rate ([1]) can control traffic shaping for a VF / VF group and
>> is currently implemented by the Intel ice and NVIDIA mlx5 drivers. It
>> operates either on devlink ports (for VF rates) or on devlink objects
>> (for group rates). Rate objects are owned by the devlink object.
>>
>> net-shapers ([2]) is a recently added API for shaping traffic for a
>> netdev tx queue / queue group / entire netdev. It is more granular than
>> devlink-rate but cannot currently control shaping for groups of
>> netdevs. It operates with netdev handles. Stores the shaping hierarchy
>> in the netdevice.
>>
>> [3] & [4] add support to devlink-rate for traffic-class shaping, which
>> is controlling the shaping hierarchy in hardware to control the
>> bandwidth allocation different traffic classes get. The question is how
>> to represent traffic classes in net-shapers.
>> In [5], Jiri expressed a desire to eventually convert devlink-rate to
>> net-shapers.
>> Finally, in [6] I sent an update outlining a snapshot of discussions
>> that took place trying to figure things out.
>> Putting these pieces together, the following plan takes shape.
>> Plan, in short
>> --------------
>> 1. Extend net-shapers hierarchy with the ability to define 8 traffic
>> class roots for a net device instead of a single root like today. There
>> is no need for a new scope, the NETDEV scope with a different id to
>> differentiate TCs should be enough.
>> This is needed to allow backpressure from the hierarchy to the txq
>> level and proper TC selection.
>>
>> The goal is to either have a hierarchy like today, with one netdev-
>> level root containing nodes and leaves being txqs or to have a TC-
>> enabled hierarchy with 8 roots (one for each traffic class), with nodes
>> and txqs as leaves.
>>
>> 2. Extend the semantics of NET_SHAPER_SCOPE_NODE to be able to group
>> multiple netdevs, similar to devlink-rate nodes.
>>
>> 3. Add a new DEVLINK binding type for the hierarchy, to be able to
>> represent netdev groups. That part of the hierarchy would be stored in
>> the devlink object instead of the netdev. This allows separation
>> between the VM and the hypervisor parts of the hierarchy.
>>
>> These together should make net-shapers a strict superset of devlink-
>> rate and would allow the devlink-rate implementation to be converted to
>> net-shapers. It allows independently operating traffic shaping from a
>> VM (limited to its own VF/netdev) and from the hypervisor (being able
>> to rate limit traffic classes and groups of VFs, like devlink-rate).
>>
>> Plan, in detail
>> ---------------
>> 1. Packet classification
>> It is outside the scope of net-shapers, but it's worth talking about
>> it.
>> Packet classification is done based on either:
>> a. TOS field in the IP header (known as DSCP) or
>> b. VLAN priority in the VLAN header (known as PCP).
>> c. Arbitrary rules based on DPI (not standard, but possible).
>>
>> Classification means labeling a packet with a traffic class based on
>> the above rules, then treating packets with different traffic classes
>> differently during tx processing.
>>
>> The first moment when classification matters is when choosing a txq.
>> Since the goal is to be able to treat different traffic classes
>> differently, it is necessary to have a txq only output a single traffic
>> class. If that condition doesn't hold, a txq sending a mixture of
>> traffic classes might suffer from head-of-line blocking. Imagine a
>> scenario with a txq on which low volume high priority TC 7 for example
>> is sent alongside high volume low priority TC 0.
>> Backpressure on TC 0 from further up the shaping hierarchy would only
>> be able to manifest itself by blocking the entire txq, affecting both
>> traffic classes.
>>
>> It is not important which entity (kernel or hw) classifies packets as
>> long as the condition that a given txq only sends traffic for a single
>> traffic class holds.
>>
>> 2. New net-shapers netdev TC roots
>> A new netdev TC root would therefore logically identify a disjoint
>> subset of txqs that service that TC. The union of all 8 roots would
>> encompass all device txqs.
>
> Are these TC roots configured on the VF/SF netdev? OR are these on the
> corresponding Port representor netdevs?
The answer is: it depends. If the user wants to achieve tc bandwidth
allocation inside the VF, these TC roots are configured on the VF
netdev. If we are trying to achieve this tc-bw on an intermediate node
that groups multiple devlink ports, this will happen on the devlink
port.
>
>>
>> The primary reason to define separate roots for each TC is that
>> backpressure from the hierarchy on one of the traffic classes needs to
>> not affect other traffic classes, meaning only txqs servicing the
>> blocked traffic class should be affected.
>>
>> Furthermore, this cannot be done by simply grouping txqs for a given TC
>> with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
>> known to the kernel and might only be known to the driver or the NIC.
>> With the new roots, net-shapers can relay the intent to shape traffic
>> for a particular TC to the driver without having knowledge of which
>> txqs service a TC. The association between txqs and TCs they service
>> doesn't need to be known to the kernel.
>>
>> 3. Extend NODE scope to group multiple netdevs and new DEVLINK binding
>> Today, all net-shapers objects are owned by a netdevice. Who should own
>> a net shaper that represents a group of netdevices? It needs to be a
>> stable object that isn't affected by group membership changes and
>> therefore cannot be any netdev from the group. The only sensible option
>> would be to pick an object corresponding to the eswitch to own such
>> groups, which neatly corresponds to the devlink object today.
>
> When you are referring to grouping multiple netdevs, I am assuming these
> are port representor netdevs. Is this correct?
>
When grouping multiple netdevs, these are devlink ports, not netdevices.
>>
>> 4. VM/hypervisor considerations
>> A great deal of discussion happened about the split of shaping
>> responsibilities between the VM and the hypervisor. With devlink today,
>> the shaping hierarchy and traffic class bw split is decided entirely by
>> the hypervisor, the VMs have no influence on shaping.
>>
>> But net-shapers has more precise granularity for shaping at queue
>> level, so perhaps there are valid use cases for allowing VMs to control
>> their part of the hierarchy. In the end, what we think makes sense is
>> this model:
>>
>> VMs can control the shaping of txqs, queue groups and the VFs they own.
>> On top of that, the hypervisor can take the netdev root of the VM
>> hierarchy and plug it into its own hierarchy, imposing additional
>> constraints. The VM has no influence on that. So for example the VM can
>> decide that "my VF should be limited to 10Gbps", but the hypervisor can
>> then add another shaping node saying "that VF is limited to 1Gbps" and
>> the latter should be the limit.
>
> Isn't it sufficient to enable rate limit at a VF/SF's queue or queue-
> group granularity from the VF/SF netdev? The hypervisor should be able
> to rate limit at VF granularity.
You can do that; it depends on the requirement. If we have two VFs
that share the same physical NIC, one VF could fully utilize the link.
To prevent that, you should limit each function from the outside (on the
hypervisor). Otherwise, there is no enforcement to make the VFs fairly
share the link capacity.
>>
>> With traffic classes, the VM can send out tc-labeled traffic on
>> different txqs, but the hypervisor decides to take the VM TC roots and
>> group them in an arbiter node (== a shaping node arbitrating between
>> different traffic classes), or to group TC roots from multiple VMs
>> before applying arbitration settings. This is similar to devlink-rate
>> today. The VM itself should have no control over TC bandwidth settings.
>
>
> It is not clear if TC roots are configured by the VF driver or the PF
> drivers supporting switchdev. Can you share an example configuration
> with steps on how to configure hierarchical traffic shaping of VFs
> queues/TCs
>
From the inside, the TC roots are configured by the VF driver.
From the outside, this will happen on the devlink port corresponding to
the eswitch port that represents the PF.
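As a rough end-to-end illustration (names, indices and rates are
examples only, and the net-shaper part assumes today's spec; the
TC-root/arbiter extensions proposed in this thread are not shown since
their uAPI is not settled yet):

  # hypervisor side: group two VF eswitch ports under one rate node
  devlink port function rate add pci/0000:03:00.0/vm_group tx_max 25gbit
  devlink port function rate set pci/0000:03:00.0/1 parent vm_group
  devlink port function rate set pci/0000:03:00.0/2 parent vm_group

  # VF side (inside the VM): keep txqs per-TC with mqprio,
  # then shape one of its own queues with net-shapers
  tc qdisc replace dev eth0 root mqprio num_tc 8 \
      map 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 \
      queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
  ./tools/net/ynl/cli.py \
      --spec Documentation/netlink/specs/net_shaper.yaml \
      --do set \
      --json '{"ifindex": 2,
               "handle": {"scope": "queue", "id": 5},
               "bw-max": 5000000000}'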
>>
>> Cosmin.
>>
>> [1] https://man7.org/linux/man-pages/man8/devlink-rate.8.html
>> [2]
>> https://lore.kernel.org/netdev/cover.1728460186.git.pabeni@redhat.com/
>> [3] https://lore.kernel.org/netdev/20241206181345.3eccfca4@kernel.org/
>> [4]
>> https://lore.kernel.org/netdev/20250209101716.112774-1-tariqt@nvidia.com/
>> [5] https://lore.kernel.org/netdev/ZwP8OWtMfCH0_ikc@nanopsycho.orion/
>> [6]
>> https://lore.kernel.org/
>> netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/
>>
>
* Re: net-shapers plan
2025-03-06 14:03 net-shapers plan Cosmin Ratiu
2025-03-11 1:42 ` Samudrala, Sridhar
@ 2025-03-11 13:38 ` Jiri Pirko
2025-03-26 15:55 ` Cosmin Ratiu
2025-03-28 12:13 ` Jakub Kicinski
2 siblings, 1 reply; 16+ messages in thread
From: Jiri Pirko @ 2025-03-11 13:38 UTC (permalink / raw)
To: Cosmin Ratiu
Cc: netdev@vger.kernel.org, horms@kernel.org, andrew+netdev@lunn.ch,
davem@davemloft.net, Tariq Toukan, Gal Pressman, Leon Romanovsky,
edumazet@google.com, kuba@kernel.org, Saeed Mahameed,
Carolina Jubran, pabeni@redhat.com
Thu, Mar 06, 2025 at 03:03:54PM +0100, cratiu@nvidia.com wrote:
[...]
>
>3. Add a new DEVLINK binding type for the hierarchy, to be able to
>represent netdev groups. That part of the hierarchy would be stored in
>the devlink object instead of the netdev. This allows separation
>between the VM and the hypervisor parts of the hierarchy.
[...]
>
>3. Extend NODE scope to group multiple netdevs and new DEVLINK binding
>Today, all net-shapers objects are owned by a netdevice. Who should own
>a net shaper that represents a group of netdevices? It needs to be a
>stable object that isn't affected by group membership changes and
>therefore cannot be any netdev from the group. The only sensible option
>would be to pick an object corresponding to the eswitch to own such
>groups, which neatly corresponds to the devlink object today.
Could you be a little bit more descriptive about this? I don't understand
why you need a group of netdevices. I understand that for devlink binding,
you have a use case for a group (devlink rate node). But do you have a
use case for a group of netdevices? Perhaps I'm missing something.
[...]
* Re: net-shapers plan
2025-03-11 13:38 ` Jiri Pirko
@ 2025-03-26 15:55 ` Cosmin Ratiu
0 siblings, 0 replies; 16+ messages in thread
From: Cosmin Ratiu @ 2025-03-26 15:55 UTC (permalink / raw)
To: jiri@resnulli.us
Cc: andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, pabeni@redhat.com, Leon Romanovsky,
horms@kernel.org, edumazet@google.com, netdev@vger.kernel.org,
Saeed Mahameed, Carolina Jubran, kuba@kernel.org
On Tue, 2025-03-11 at 14:38 +0100, Jiri Pirko wrote:
> Thu, Mar 06, 2025 at 03:03:54PM +0100, cratiu@nvidia.com wrote:
>
> [...]
>
> >
> > 3. Add a new DEVLINK binding type for the hierarchy, to be able to
> > represent netdev groups. That part of the hierarchy would be stored
> > in
> > the devlink object instead of the netdev. This allows separation
> > between the VM and the hypervisor parts of the hierarchy.
>
> [...]
>
> >
> > 3. Extend NODE scope to group multiple netdevs and new DEVLINK
> > binding
> > Today, all net-shapers objects are owned by a netdevice. Who should
> > own
> > a net shaper that represents a group of netdevices? It needs to be
> > a
> > stable object that isn't affected by group membership changes and
> > therefore cannot be any netdev from the group. The only sensible
> > option
> > would be to pick an object corresponding to the eswitch to own such
> > groups, which neatly corresponds to the devlink object today.
>
> Could you be a little bit more descriptive about this? I don't
> understand
> why you need group of netdevices. I understand that for devlink
> binding,
> you have usecase for group (devlink rate node). But do you have a
> usecase for group of netdevices? Perhaps I'm missing something.
>
> [...]
The goal is for net-shapers to be able to manipulate the hw scheduling
object corresponding to a devlink rate node which groups multiple
devlink leaves (one per devlink port) today. At this level, the
netdevice is just a proxy for the devlink port. So yes, I meant
grouping devlink ports or whatever entity feels appropriate at that
level (maybe representor devices as handles for the corresponding
ports?).
Without this ability, you cannot fully represent the devlink hierarchy
with net-shapers.
Cosmin.
* Re: net-shapers plan
2025-03-06 14:03 net-shapers plan Cosmin Ratiu
2025-03-11 1:42 ` Samudrala, Sridhar
2025-03-11 13:38 ` Jiri Pirko
@ 2025-03-28 12:13 ` Jakub Kicinski
2025-04-01 8:35 ` Carolina Jubran
2 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-03-28 12:13 UTC (permalink / raw)
To: Cosmin Ratiu
Cc: netdev@vger.kernel.org, horms@kernel.org, andrew+netdev@lunn.ch,
davem@davemloft.net, Tariq Toukan, Gal Pressman, jiri@resnulli.us,
Leon Romanovsky, edumazet@google.com, Saeed Mahameed,
Carolina Jubran, pabeni@redhat.com
On Thu, 6 Mar 2025 14:03:54 +0000 Cosmin Ratiu wrote:
> It is not important which entity (kernel or hw) classifies packets as
> long as the condition that a given txq only sends traffic for a single
> traffic class holds.
> Furthermore, this cannot be done by simply grouping txqs for a given TC
> with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
> known to the kernel and might only be known to the driver or the NIC.
> With the new roots, net-shapers can relay the intent to shape traffic
> for a particular TC to the driver without having knowledge of which
> txqs service a TC. The association between txqs and TCs they service
> doesn't need to be known to the kernel.
As mentioned in Zagreb, the part about HW reclassifying traffic does not
make sense to me. Is this a real user scenario you have or more of
an attempt to "maximize flexibility"?
* Re: net-shapers plan
2025-03-28 12:13 ` Jakub Kicinski
@ 2025-04-01 8:35 ` Carolina Jubran
2025-04-01 14:50 ` Jakub Kicinski
0 siblings, 1 reply; 16+ messages in thread
From: Carolina Jubran @ 2025-04-01 8:35 UTC (permalink / raw)
To: Jakub Kicinski, Cosmin Ratiu
Cc: netdev@vger.kernel.org, horms@kernel.org, andrew+netdev@lunn.ch,
davem@davemloft.net, Tariq Toukan, Gal Pressman, jiri@resnulli.us,
Leon Romanovsky, edumazet@google.com, Saeed Mahameed,
pabeni@redhat.com
On 28/03/2025 15:13, Jakub Kicinski wrote:
> On Thu, 6 Mar 2025 14:03:54 +0000 Cosmin Ratiu wrote:
>> It is not important which entity (kernel or hw) classifies packets as
>> long as the condition that a given txq only sends traffic for a single
>> traffic class holds.
>
>> Furthermore, this cannot be done by simply grouping txqs for a given TC
>> with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
>> known to the kernel and might only be known to the driver or the NIC.
>> With the new roots, net-shapers can relay the intent to shape traffic
>> for a particular TC to the driver without having knowledge of which
>> txqs service a TC. The association between txqs and TCs they service
>> doesn't need to be known to the kernel.
>
> As mentioned in Zagreb the part of HW reclassifying traffic does not
> make sense to me. Is this a real user scenario you have or more of
> an attempt to "maximize flexibility"?
I don't believe there's a specific real-world scenario. It's really
about maximizing flexibility. Essentially, if a user sets things up in a
less-than-optimal way, the hardware can ensure that traffic is
classified and managed properly.
Carolina
* Re: net-shapers plan
2025-04-01 8:35 ` Carolina Jubran
@ 2025-04-01 14:50 ` Jakub Kicinski
2025-04-08 14:43 ` Carolina Jubran
0 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-04-01 14:50 UTC (permalink / raw)
To: Carolina Jubran
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, Leon Romanovsky,
edumazet@google.com, Saeed Mahameed, pabeni@redhat.com
On Tue, 1 Apr 2025 11:35:56 +0300 Carolina Jubran wrote:
> > As mentioned in Zagreb the part of HW reclassifying traffic does not
> > make sense to me. Is this a real user scenario you have or more of
> > an attempt to "maximize flexibility"?
>
> I don't believe there's a specific real-world scenario. It's really
> about maximizing flexibility. Essentially, if a user sets things up in a
> less-than-optimal way, the hardware can ensure that traffic is
> classified and managed properly.
I see. If you could turn it off and leave it out, at least until a clear
user appears, that'd be great. Reclassifying packets on Tx slightly goes
against the netdev recommendation to limit any packet parsing and
interpretation on Tx.
* Re: net-shapers plan
2025-04-01 14:50 ` Jakub Kicinski
@ 2025-04-08 14:43 ` Carolina Jubran
2025-04-09 22:06 ` Jakub Kicinski
0 siblings, 1 reply; 16+ messages in thread
From: Carolina Jubran @ 2025-04-08 14:43 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, Leon Romanovsky,
edumazet@google.com, Saeed Mahameed, pabeni@redhat.com
On 01/04/2025 17:50, Jakub Kicinski wrote:
> On Tue, 1 Apr 2025 11:35:56 +0300 Carolina Jubran wrote:
>>> As mentioned in Zagreb the part of HW reclassifying traffic does not
>>> make sense to me. Is this a real user scenario you have or more of
>>> an attempt to "maximize flexibility"?
>>
>> I don't believe there's a specific real-world scenario. It's really
>> about maximizing flexibility. Essentially, if a user sets things up in a
>> less-than-optimal way, the hardware can ensure that traffic is
>> classified and managed properly.
>
> I see. If you could turn it off and leave it out, at least until clear
> user appears that'd be great. Reclassifying packets on Tx slightly goes
> against the netdev recommendation to limit any packet parsing and
> interpretation on Tx.
The hardware enforces a match between the packet’s priority and the
scheduling queue’s configured priority. If they match, the packet is
transmitted without further processing. If not, the hardware moves the
Tx queue to the right scheduling queue to ensure proper traffic class
separation.
This check is always active and cannot currently be disabled. Even when
the queue is configured with the correct priority, the hardware still
verifies the match before sending.
* Re: net-shapers plan
2025-04-08 14:43 ` Carolina Jubran
@ 2025-04-09 22:06 ` Jakub Kicinski
2025-04-10 15:23 ` Carolina Jubran
0 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-04-09 22:06 UTC (permalink / raw)
To: Carolina Jubran
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On Tue, 8 Apr 2025 17:43:19 +0300 Carolina Jubran wrote:
> >> I don't believe there's a specific real-world scenario. It's really
> >> about maximizing flexibility. Essentially, if a user sets things up in a
> >> less-than-optimal way, the hardware can ensure that traffic is
> >> classified and managed properly.
> >
> > I see. If you could turn it off and leave it out, at least until clear
> > user appears that'd be great. Reclassifying packets on Tx slightly goes
> > against the netdev recommendation to limit any packet parsing and
> > interpretation on Tx.
>
> The hardware enforces a match between the packet’s priority and the
> scheduling queue’s configured priority. If they match, the packet is
> transmitted without further processing. If not, the hardware moves the
> Tx queue to the right scheduling queue to ensure proper traffic class
> separation.
> This check is always active and cannot currently be disabled. Even when
> the queue is configured with the correct priority, the hardware still
> verifies the match before sending.
It needs to work as intended :( so you probably need to enforce
the correct mapping in the FW or the driver.
* Re: net-shapers plan
2025-04-09 22:06 ` Jakub Kicinski
@ 2025-04-10 15:23 ` Carolina Jubran
2025-04-10 23:16 ` Jakub Kicinski
0 siblings, 1 reply; 16+ messages in thread
From: Carolina Jubran @ 2025-04-10 15:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On 10/04/2025 1:06, Jakub Kicinski wrote:
> On Tue, 8 Apr 2025 17:43:19 +0300 Carolina Jubran wrote:
>>>> I don't believe there's a specific real-world scenario. It's really
>>>> about maximizing flexibility. Essentially, if a user sets things up in a
>>>> less-than-optimal way, the hardware can ensure that traffic is
>>>> classified and managed properly.
>>>
>>> I see. If you could turn it off and leave it out, at least until clear
>>> user appears that'd be great. Reclassifying packets on Tx slightly goes
>>> against the netdev recommendation to limit any packet parsing and
>>> interpretation on Tx.
>>
>> The hardware enforces a match between the packet’s priority and the
>> scheduling queue’s configured priority. If they match, the packet is
>> transmitted without further processing. If not, the hardware moves the
>> Tx queue to the right scheduling queue to ensure proper traffic class
>> separation.
>> This check is always active and cannot currently be disabled. Even when
>> the queue is configured with the correct priority, the hardware still
>> verifies the match before sending.
>
> It needs to work as intended :( so you probably need to enforce
> the correct mapping in the FW or the driver.
We do configure the correct priority-to-queue mapping in the driver when
mqprio is used in DCB mode. In this setup, each traffic class has its
own dedicated Tx queue(s), and the driver programs the mapping
accordingly. The hardware performs its default priority check, sees that
the packet matches the configured queue, and proceeds to transmit
without taking any further action — everything behaves as expected.
When DCB mode is not enabled, there is no fixed mapping between traffic
classes and Tx queues. In this case, the hardware still performs the
check, and if it detects a mismatch, it moves the send queue to the
appropriate scheduling queue to maintain proper traffic class behavior.
The priority check is always active by default, but when the mapping is
configured properly, it’s followed by a noop.
* Re: net-shapers plan
2025-04-10 15:23 ` Carolina Jubran
@ 2025-04-10 23:16 ` Jakub Kicinski
2025-04-14 8:27 ` Carolina Jubran
0 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-04-10 23:16 UTC (permalink / raw)
To: Carolina Jubran
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On Thu, 10 Apr 2025 18:23:56 +0300 Carolina Jubran wrote:
> We do configure the correct priority-to-queue mapping in the driver when
> mqprio is used in DCB mode. In this setup, each traffic class has its
> own dedicated Tx queue(s), and the driver programs the mapping
> accordingly. The hardware performs its default priority check, sees that
> the packet matches the configured queue, and proceeds to transmit
> without taking any further action — everything behaves as expected.
>
> When DCB mode is not enabled, there is no fixed mapping between traffic
> classes and Tx queues. In this case, the hardware still performs the
> check, and if it detects a mismatch, it moves the send queue to the
> appropriate scheduling queue to maintain proper traffic class behavior.
> The priority check is always active by default, but when the mapping is
> configured properly, it’s followed by a noop.
I hope you understand my concern, tho. Since you're providing the first
implementation, if the users can grow dependent on such behavior we'd
be in no position to explain later that it's just a quirk of mlx5 and
not how the API is intended to operate.
* Re: net-shapers plan
2025-04-10 23:16 ` Jakub Kicinski
@ 2025-04-14 8:27 ` Carolina Jubran
2025-04-14 16:27 ` Jakub Kicinski
0 siblings, 1 reply; 16+ messages in thread
From: Carolina Jubran @ 2025-04-14 8:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On 11/04/2025 2:16, Jakub Kicinski wrote:
> On Thu, 10 Apr 2025 18:23:56 +0300 Carolina Jubran wrote:
>> We do configure the correct priority-to-queue mapping in the driver when
>> mqprio is used in DCB mode. In this setup, each traffic class has its
>> own dedicated Tx queue(s), and the driver programs the mapping
>> accordingly. The hardware performs its default priority check, sees that
>> the packet matches the configured queue, and proceeds to transmit
>> without taking any further action — everything behaves as expected.
>>
>> When DCB mode is not enabled, there is no fixed mapping between traffic
>> classes and Tx queues. In this case, the hardware still performs the
>> check, and if it detects a mismatch, it moves the send queue to the
>> appropriate scheduling queue to maintain proper traffic class behavior.
>> The priority check is always active by default, but when the mapping is
>> configured properly, it’s followed by a noop.
>
> I hope you understand my concern, tho. Since you're providing the first
> implementation, if the users can grow dependent on such behavior we'd
> be in no position to explain later that it's just a quirk of mlx5 and
> not how the API is intended to operate.
Thanks for bringing this up. I want to make it clear that traffic
classes must be properly matched to queues. We don’t rely on the
hardware fallback behavior in mlx5. If the driver or firmware isn’t
configured correctly, traffic class bandwidth control won’t work as
expected — the user will suffer from constant switching of the TX queue
between scheduling queues and head-of-line blocking. As a result, users
shouldn’t expect reliable performance or correct bandwidth allocation.
We don’t encourage configuring this without proper TX queue mapping, so
users won’t grow dependent on behavior that only happens to work without it.
We tried to highlight this in the plan section discussing queue
selection and head-of-line blocking: To make traffic class shaping work,
we must keep traffic classes separate for each transmit queue.
* Re: net-shapers plan
2025-04-14 8:27 ` Carolina Jubran
@ 2025-04-14 16:27 ` Jakub Kicinski
2025-04-23 6:50 ` Carolina Jubran
0 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-04-14 16:27 UTC (permalink / raw)
To: Carolina Jubran
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On Mon, 14 Apr 2025 11:27:00 +0300 Carolina Jubran wrote:
> > I hope you understand my concern, tho. Since you're providing the first
> > implementation, if the users can grow dependent on such behavior we'd
> > be in no position to explain later that it's just a quirk of mlx5 and
> > not how the API is intended to operate.
>
> Thanks for bringing this up. I want to make it clear that traffic
> classes must be properly matched to queues. We don’t rely on the
> hardware fallback behavior in mlx5. If the driver or firmware isn’t
> configured correctly, traffic class bandwidth control won’t work as
> expected — the user will suffer from constant switching of the TX queue
> between scheduling queues and head-of-line blocking. As a result, users
> shouldn’t expect reliable performance or correct bandwidth allocation.
> We don’t encourage configuring this without proper TX queue mapping, so
> users won’t grow dependent on behavior that only happens to work without it.
> We tried to highlight this in the plan section discussing queue
> selection and head-of-line blocking: To make traffic class shaping work,
> we must keep traffic classes separate for each transmit queue.
Right, my concern is more that there is no requirement for explicit
configuration of the queues, as long as traffic arrives silo'ed WRT
DSCP markings. As long as a VF sorts the traffic it does not have
to explicitly say (or even know) that queue A will land in TC N.
BTW the classification is before all rewrites? IOW flower or any other
forwarding rules cannot affect scheduling?
* Re: net-shapers plan
2025-04-14 16:27 ` Jakub Kicinski
@ 2025-04-23 6:50 ` Carolina Jubran
2025-04-30 12:12 ` Carolina Jubran
0 siblings, 1 reply; 16+ messages in thread
From: Carolina Jubran @ 2025-04-23 6:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On 14/04/2025 19:27, Jakub Kicinski wrote:
> On Mon, 14 Apr 2025 11:27:00 +0300 Carolina Jubran wrote:
>>> I hope you understand my concern, tho. Since you're providing the first
>>> implementation, if the users can grow dependent on such behavior we'd
>>> be in no position to explain later that it's just a quirk of mlx5 and
>>> not how the API is intended to operate.
>>
>> Thanks for bringing this up. I want to make it clear that traffic
>> classes must be properly matched to queues. We don’t rely on the
>> hardware fallback behavior in mlx5. If the driver or firmware isn’t
>> configured correctly, traffic class bandwidth control won’t work as
>> expected — the user will suffer from constant switching of the TX queue
>> between scheduling queues and head-of-line blocking. As a result, users
>> shouldn’t expect reliable performance or correct bandwidth allocation.
>> We don’t encourage configuring this without proper TX queue mapping, so
>> users won’t grow dependent on behavior that only happens to work without it.
>> We tried to highlight this in the plan section discussing queue
>> selection and head-of-line blocking: To make traffic class shaping work,
>> we must keep traffic classes separate for each transmit queue.
>
> Right, my concern is more that there is no requirement for explicit
> configuration of the queues, as long as traffic arrives silo'ed WRT
> DSCP markings. As long as a VF sorts the traffic it does not have
> to explicitly say (or even know) that queue A will land in TC N.
>
Even if the VF sends DSCP marked traffic, the packet's classification
into a traffic class still depends on the prio-to-TC mapping set by the
hypervisor. Without that mapping, the hardware can't reliably classify
packets, and traffic may not land in the intended TC.
Overall, for traffic class separation and scheduling to work as
intended, the VF and hypervisor need to be in sync. The VF provides the
markings, but the hypervisor owns the classification logic.
The hypervisor sets up the classification mechanism; it’s up to the VFs
to use it correctly; otherwise, packets will be misclassified. In a
virtualized setup, VFs are untrusted and don’t control classification or
shaping; they just select which queue to transmit on.
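For reference, the prio-to-TC mapping described above is the kind of
thing the hypervisor would typically program via DCB ETS on the uplink
(illustrative values):

  # priorities 0-3 -> TC 0, priorities 4-7 -> TC 1
  dcb ets set dev eth0 prio-tc 0:0 1:0 2:0 3:0 4:1 5:1 6:1 7:1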
> BTW the classification is before all rewrites? IOW flower or any other
> forwarding rules cannot affect scheduling?
The classification happens after forwarding actions. So yes, if the user
modifies DSCP or VLAN priority as part of a TC rule, that rewritten
value is what we use for classification and scheduling. The
classification reflects how the packet will look on the wire.
* Re: net-shapers plan
2025-04-23 6:50 ` Carolina Jubran
@ 2025-04-30 12:12 ` Carolina Jubran
0 siblings, 0 replies; 16+ messages in thread
From: Carolina Jubran @ 2025-04-30 12:12 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Cosmin Ratiu, netdev@vger.kernel.org, horms@kernel.org,
andrew+netdev@lunn.ch, davem@davemloft.net, Tariq Toukan,
Gal Pressman, jiri@resnulli.us, edumazet@google.com,
Saeed Mahameed, pabeni@redhat.com
On 23/04/2025 9:50, Carolina Jubran wrote:
>
>
> On 14/04/2025 19:27, Jakub Kicinski wrote:
>> On Mon, 14 Apr 2025 11:27:00 +0300 Carolina Jubran wrote:
>>>> I hope you understand my concern, tho. Since you're providing the first
>>>> implementation, if the users can grow dependent on such behavior we'd
>>>> be in no position to explain later that it's just a quirk of mlx5 and
>>>> not how the API is intended to operate.
>>>
>>> Thanks for bringing this up. I want to make it clear that traffic
>>> classes must be properly matched to queues. We don’t rely on the
>>> hardware fallback behavior in mlx5. If the driver or firmware isn’t
>>> configured correctly, traffic class bandwidth control won’t work as
>>> expected — the user will suffer from constant switching of the TX queue
>>> between scheduling queues and head-of-line blocking. As a result, users
>>> shouldn’t expect reliable performance or correct bandwidth allocation.
>>> We don’t encourage configuring this without proper TX queue mapping, so
>>> users won’t grow dependent on behavior that only happens to work
>>> without it.
>>> We tried to highlight this in the plan section discussing queue
>>> selection and head-of-line blocking: To make traffic class shaping work,
>>> we must keep traffic classes separate for each transmit queue.
>>
>> Right, my concern is more that there is no requirement for explicit
>> configuration of the queues, as long as traffic arrives silo'ed WRT
>> DSCP markings. As long as a VF sorts the traffic it does not have
>> to explicitly say (or even know) that queue A will land in TC N.
>>
>
> Even if the VF sends DSCP marked traffic, the packet's classification
> into a traffic class still depends on the prio-to-TC mapping set by the
> hypervisor. Without that mapping, the hardware can't reliably classify
> packets, and traffic may not land in the intended TC.
>
> Overall, for traffic class separation and scheduling to work as
> intended, the VF and hypervisor need to be in sync. The VF provides the
> markings, but the hypervisor owns the classification logic.
>
> The hypervisor sets up the classification mechanism; it’s up to the VFs
> to use it correctly, otherwise, packets will be misclassified. In a
> virtualized setup, VFs are untrusted and don’t control classification or
> shaping, they just select which queue to transmit on.
>
>> BTW the classification is before all rewrites? IOW flower or any other
>> forwarding rules cannot affect scheduling?
>
> The classification happens after forwarding actions. So yes, if the user
> modifies DSCP or VLAN priority as part of a TC rule, that rewritten
> value is what we use for classification and scheduling. The
> classification reflects how the packet will look on the wire.
>
Just to add a clarification on top of my previous reply:
The hardware does not reclassify packets. The packet's priority (from
DSCP or VLAN PCP) is interpreted based on the prio-to-TC mapping set by
the hypervisor, and that classification remains unchanged.
What actually happens is that if the packet’s traffic class differs from
the TC associated with the current scheduler of the SQ, the SQ is moved
to the correct TC scheduler to maintain traffic separation. This SQ
movement does not change the packet’s classification.
This is necessary to avoid sending traffic through the wrong TC
scheduler. Otherwise, packets would bypass the intended shaping
hierarchy, and traffic isolation between classes would break.
In particular, without this queue movement, backpressure applied to a
traffic class would incorrectly stall packets from other classes,
leading to HOL blocking, exactly the kind of behavior we want to prevent
by keeping queues bound to a single TC.
So this is not a reclassification of the packet itself, but a necessary
mechanism to enforce correct scheduling and maintain class-based
isolation. Smart SQ selection helps improve performance by avoiding
scheduler transitions, but it's just an optimization, not something that
affects classification.
Thread overview (16 messages):
2025-03-06 14:03 net-shapers plan Cosmin Ratiu
2025-03-11 1:42 ` Samudrala, Sridhar
2025-03-12 11:02 ` Carolina Jubran
2025-03-11 13:38 ` Jiri Pirko
2025-03-26 15:55 ` Cosmin Ratiu
2025-03-28 12:13 ` Jakub Kicinski
2025-04-01 8:35 ` Carolina Jubran
2025-04-01 14:50 ` Jakub Kicinski
2025-04-08 14:43 ` Carolina Jubran
2025-04-09 22:06 ` Jakub Kicinski
2025-04-10 15:23 ` Carolina Jubran
2025-04-10 23:16 ` Jakub Kicinski
2025-04-14 8:27 ` Carolina Jubran
2025-04-14 16:27 ` Jakub Kicinski
2025-04-23 6:50 ` Carolina Jubran
2025-04-30 12:12 ` Carolina Jubran