* [RFC] HW TX Rate Limiting Driver API
@ 2024-04-05 10:23 Simon Horman
2024-04-05 13:33 ` Jamal Hadi Salim
` (4 more replies)
0 siblings, 5 replies; 21+ messages in thread
From: Simon Horman @ 2024-04-05 10:23 UTC (permalink / raw)
To: netdev
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala,
Paolo Abeni
Hi,
This is a follow-up to the ongoing discussion started by Intel to extend the
support for TX shaping H/W offload [1].
The goal is to allow user-space to configure TX shaping offload on a
per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
VF device.
In the past few months several different solutions were attempted and
discussed, without finding a perfect fit:
- devlink_rate APIs are not appropriate to control TX shaping on netdevs
- No existing TC qdisc offload covers the required feature set
- HTB does not allow direct queue configuration
- MQPRIO imposes constraints on the maximum number of TX queues
- TBF does not support max B/W limit
- ndo_set_tx_maxrate() only controls the max B/W limit
A new H/W offload API is needed, but offload API proliferation should be
avoided.
The following proposal intends to cover the requirements specified above and
provide a possible base to unify all the shaping offload APIs mentioned above.
The following only defines the in-kernel interface between the core and
drivers. The intention is to expose the feature to user-space via Netlink.
Hopefully the latter part should be straightforward after agreement
on the in-kernel interface.
All feedback and comments are more than welcome!
[1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
Regards,
Simon with much assistance from Paolo
---
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifndef _NET_SHAPER_H_
#define _NET_SHAPER_H_
/**
* enum shaper_metric - the metric of the shaper
* @SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
* @SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
*/
enum shaper_metric {
SHAPER_METRIC_PPS,
SHAPER_METRIC_BPS,
};
#define SHAPER_ROOT_ID 0
#define SHAPER_NONE_ID UINT_MAX
/**
* struct shaper_info - represent a node of the shaper hierarchy
* @id: Unique identifier inside the shaper tree.
* @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
* no parent. Only the root shaper has no parent.
* @metric: Specifies whether the bw limits refer to PPS or BPS
* @bw_min: Minimum guaranteed rate for this shaper
* @bw_max: Maximum peak bw allowed for this shaper
* @burst: Maximum burst for the peak rate of this shaper
* @priority: Scheduling priority for this shaper
* @weight: Scheduling weight for this shaper
*
* The full shaper hierarchy is maintained only by the
* NIC driver (or firmware), possibly in a NIC-specific format
* and/or in H/W tables.
* The kernel uses this representation and the shaper_ops to
* access, traverse, and update it.
*/
struct shaper_info {
/* The following fields allow the full traversal of the whole
* hierarchy.
*/
u32 id;
u32 parent_id;
/* The following fields define the behavior of the shaper. */
enum shaper_metric metric;
u64 bw_min;
u64 bw_max;
u32 burst;
u32 priority;
u32 weight;
};
/**
* enum shaper_lookup_mode - Lookup method used to access a shaper
* @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is unused
* @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network device,
* @id is unused
* @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
* @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
* @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside the
* shaper hierarchy as in shaper_info.id
*
* SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF and SHAPER_LOOKUP_BY_TREE_ID are
* only available on PF devices, usually inside the host/hypervisor.
* SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but
* on VFs only when they are privileged.
* The same shaper can be reached with different lookup mode/id pairs,
* mapping network visible objects (devices, VFs, queues) to the scheduler
* hierarchy and vice-versa.
*/
enum shaper_lookup_mode {
SHAPER_LOOKUP_BY_PORT,
SHAPER_LOOKUP_BY_NETDEV,
SHAPER_LOOKUP_BY_VF,
SHAPER_LOOKUP_BY_QUEUE,
SHAPER_LOOKUP_BY_TREE_ID,
};
/**
* struct shaper_ops - Operations on shaper hierarchy
* @get: Access the specified shaper.
* @set: Modify the specified shaper.
* @move: Move the specified shaper inside the hierarchy.
* @add: Add a shaper inside the shaper hierarchy.
* @delete: Delete the specified shaper.
*
* The netdevice exposes a pointer to these ops.
*
* It's up to the driver or firmware to create the default shaper hierarchy,
* according to the H/W capabilities.
*/
struct shaper_ops {
/* get - Fetch the specified shaper, if it exists
* @dev: Netdevice to operate on.
* @lookup_mode: How to perform the shaper lookup
* @id: ID of the specified shaper,
* relative to the specified @lookup_mode.
* @shaper: Object to return shaper.
* @extack: Netlink extended ACK for reporting errors.
*
* Multiple lookup mode/id pairs can refer to the same shaper.
* And multiple entities (e.g. VF and PF) can try to access the same
* shaper concurrently.
*
* Values of @id depend on @lookup_mode:
* * If @lookup_mode is SHAPER_LOOKUP_BY_PORT or
* SHAPER_LOOKUP_BY_NETDEV, then @id is unused.
* * If @lookup_mode is SHAPER_LOOKUP_BY_VF,
* then @id is a virtual function number, relative to @dev,
* which must be a physical function.
* * If @lookup_mode is SHAPER_LOOKUP_BY_QUEUE,
* then @id is the queue number, relative to @dev.
* * If @lookup_mode is SHAPER_LOOKUP_BY_TREE_ID,
* then @id is a @shaper_info.id and any shaper inside the
* hierarchy can be accessed directly.
*
* Return:
* * %0 - Success
* * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
* or core for any reason. @extack should be set
* to text describing the reason.
* * Other negative error value on failure.
*/
int (*get)(struct net_device *dev,
enum shaper_lookup_mode lookup_mode, u32 id,
struct shaper_info *shaper, struct netlink_ext_ack *extack);
/* set - Update the specified shaper, if it exists
* @dev: Netdevice to operate on.
* @lookup_mode: How to perform the shaper lookup
* @id: ID of the specified shaper,
* relative to the specified @lookup_mode.
* @shaper: Configuration of shaper.
* @extack: Netlink extended ACK for reporting errors.
*
* Configure the parameters of @shaper according to values supplied
* in the following fields:
* * @shaper.metric
* * @shaper.bw_min
* * @shaper.bw_max
* * @shaper.burst
* * @shaper.priority
* * @shaper.weight
* Values supplied in other fields of @shaper must be zero and,
* other than verifying that, are ignored.
*
* Return:
* * %0 - Success
* * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
* or core for any reason. @extack should be set
* to text describing the reason.
* * Other negative error values on failure.
*/
int (*set)(struct net_device *dev,
enum shaper_lookup_mode lookup_mode, u32 id,
const struct shaper_info *shaper,
struct netlink_ext_ack *extack);
/* move - Change the parent of the specified shaper
* @dev: Netdevice to operate on.
* @lookup_mode: How to perform the shaper lookup
* @id: ID of the specified shaper,
* relative to the specified @lookup_mode.
* @new_parent_id: ID of the new parent shaper,
* always relative to the SHAPER_LOOKUP_BY_TREE_ID
* lookup mode
* @extack: Netlink extended ACK for reporting errors.
*
* Move the specified shaper within the hierarchy, replacing its
* current parent shaper with @new_parent_id.
*
* Return:
* * %0 - Success
* * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
* or core for any reason. @extack should be set
* to text describing the reason.
* * Other negative error values on failure.
*/
int (*move)(struct net_device *dev,
enum shaper_lookup_mode lookup_mode, u32 id,
u32 new_parent_id, struct netlink_ext_ack *extack);
/* add - Add a shaper inside the shaper hierarchy
* @dev: netdevice to operate on.
* @shaper: configuration of shaper.
* @extack: Netlink extended ACK for reporting errors.
*
* @shaper.id must be set to SHAPER_NONE_ID as
* the id for the shaper will be automatically allocated.
* @shaper.parent_id determines where inside the shaper's tree
* this node is inserted.
*
* Return:
* * non-negative shaper id on success
* * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
* or core for any reason. @extack should be set
* to text describing the reason.
* * Other negative error values on failure.
*
* Examples of reasons this operation may fail include:
* * H/W resource limits.
* * The parent is a 'leaf' node - attached to a queue.
* * The requested bw limits cannot be respected.
*/
int (*add)(struct net_device *dev, const struct shaper_info *shaper,
struct netlink_ext_ack *extack);
/* delete - Remove the specified shaper from the hierarchy
* @dev: Netdevice to operate on.
* @lookup_mode: How to perform the shaper lookup
* @id: ID of the specified shaper,
* relative to the specified @lookup_mode.
* @shaper: Object to return the deleted shaper configuration.
* Ignored if NULL.
* @extack: Netlink extended ACK for reporting errors.
*
* Return:
* * %0 - Success
* * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
* or core for any reason. @extack should be set
* to text describing the reason.
* * Other negative error values on failure.
*/
int (*delete)(struct net_device *dev,
enum shaper_lookup_mode lookup_mode,
u32 id, struct shaper_info *shaper,
struct netlink_ext_ack *extack);
};
/*
* Examples:
* - set shaping on a given queue
* struct shaper_info info;
* // fill the relevant fields of info...
* dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, NULL);
*
* - create a queue group with a queue group shaping limits.
* Assuming the following topology already exists:
* < netdev shaper >
* / \
* <queue 0 shaper> . . . <queue N shaper>
*
* struct shaper_info pinfo, ginfo;
* dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_NETDEV, 0, &pinfo, NULL);
*
* ginfo.id = SHAPER_NONE_ID;
* ginfo.parent_id = pinfo.id;
* // fill-in other shaper params...
* new_node_id = dev->shaper_ops->add(dev, &ginfo, NULL);
*
* // now topology is:
* // < netdev shaper >
* // / | \
* // / | <newly created shaper>
* // / |
* // <queue 0 shaper> . . . <queue N shaper>
*
* // move the shapers for queues 0..2 under the new queue group
* for (i = 0; i <= 2; ++i)
* dev->shaper_ops->move(dev, SHAPER_LOOKUP_BY_QUEUE, i, new_node_id, NULL);
*
* // now topology is:
* // < netdev shaper >
* // / \
* // <newly created shaper> <queue 3 shaper> ... <queue n shaper>
* // / \
* // <queue 0 shaper> ... <queue 2 shaper>
*/
#endif
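For concreteness, the add/move flow from the examples above can be mocked in
plain userspace C. Everything below other than the copied type definitions -
the array-backed store, mock_add() and mock_move() - is invented purely for
illustration and is not part of the proposal; kernel types are replaced with
stdint equivalents:

```c
#include <limits.h>
#include <stdint.h>

/* Stand-ins for the kernel types used by the proposal. */
typedef uint32_t u32;
typedef uint64_t u64;

enum shaper_metric {
	SHAPER_METRIC_PPS,
	SHAPER_METRIC_BPS,
};

#define SHAPER_NONE_ID UINT_MAX

struct shaper_info {
	u32 id;
	u32 parent_id;
	enum shaper_metric metric;
	u64 bw_min;
	u64 bw_max;
	u32 burst;
	u32 priority;
	u32 weight;
};

/* Invented array-backed store, mimicking the bookkeeping a driver
 * (or its firmware) might keep for the hierarchy. */
#define MOCK_MAX_SHAPERS 16
static struct shaper_info mock_tree[MOCK_MAX_SHAPERS];
static int mock_used[MOCK_MAX_SHAPERS];

/* Mimics shaper_ops->add(): @shaper->id must be SHAPER_NONE_ID; the
 * new id is allocated and returned, or -1 when no slot is free. */
static int mock_add(const struct shaper_info *shaper)
{
	if (shaper->id != SHAPER_NONE_ID)
		return -1;
	for (u32 i = 0; i < MOCK_MAX_SHAPERS; i++) {
		if (!mock_used[i]) {
			mock_tree[i] = *shaper;
			mock_tree[i].id = i;
			mock_used[i] = 1;
			return (int)i;
		}
	}
	return -1;
}

/* Mimics shaper_ops->move(): re-parent an existing shaper. */
static int mock_move(u32 id, u32 new_parent_id)
{
	if (id >= MOCK_MAX_SHAPERS || !mock_used[id])
		return -1;
	mock_tree[id].parent_id = new_parent_id;
	return 0;
}
```

Creating a netdev root shaper, three queue shapers, then a 2 Gbps group
shaper and moving the queues under it reproduces the queue-group example
from the header comment.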
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-05 10:23 [RFC] HW TX Rate Limiting Driver API Simon Horman
@ 2024-04-05 13:33 ` Jamal Hadi Salim
2024-04-05 17:06 ` Paolo Abeni
2024-04-05 14:34 ` John Fastabend
` (3 subsequent siblings)
4 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2024-04-05 13:33 UTC (permalink / raw)
To: Simon Horman
Cc: netdev, Jakub Kicinski, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala, Paolo Abeni
On Fri, Apr 5, 2024 at 6:25 AM Simon Horman <horms@kernel.org> wrote:
>
> Hi,
>
> This is follow-up to the ongoing discussion started by Intel to extend the
> support for TX shaping H/W offload [1].
>
> The goal is allowing the user-space to configure TX shaping offload on a
> per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> VF device.
>
>
> In the past few months several different solutions were attempted and
> discussed, without finding a perfect fit:
>
> - devlink_rate APIs are not appropriate for to control TX shaping on netdevs
> - No existing TC qdisc offload covers the required feature set
> - HTB does not allow direct queue configuration
> - MQPRIO imposes constraint on the maximum number of TX queues
> - TBF does not support max B/W limit
> - ndo_set_tx_maxrate() only controls the max B/W limit
>
> A new H/W offload API is needed, but offload API proliferation should be
> avoided.
>
> The following proposal intends to cover the above specified requirement and
> provide a possible base to unify all the shaping offload APIs mentioned above.
>
> The following only defines the in-kernel interface between the core and
> drivers. The intention is to expose the feature to user-space via Netlink.
> Hopefully the latter part should be straight-forward after agreement
> on the in-kernel interface.
>
> All feedback and comment is more then welcome!
>
> [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
>
My 2 cents:
I did peruse the quoted lore thread but I am likely to have missed something.
It sounds like the requirement is for egress-from-host (which to a
device internal looks like ingress-from-host on the device). Doesn't
existing HTB offload already support this? I didn't see this being
discussed in the thread. Also, IIUC, there is no hierarchy
requirement. That is something you can teach HTB but there's probably
something I missed because I didn't understand the context of "HTB does
not allow direct queue configuration". If HTB is confusing from a
config pov then it seems what Paolo was describing in the thread on
TBF is a reasonable approach too. I couldn't grok why that TBF
extension for max bw was considered a bad idea.
On config:
Could we not introduce skip_sw/hw semantics for qdiscs? IOW, skip_sw
means the config is only subjected to hw and you have DIRECT
semantics, etc.
I understand the mlnx implementation of HTB does a lot of things in
the driver but the one nice thing they had was ability to use classid
X:Y to select a egress h/w queue. The driver resolution of all the
hierarchy is not needed at all here if i understood the requirement
above.
You still need to have a classifier in s/w (which could be attached to
clsact egress) to select the queue. That is something the mlnx
implementation allowed. So there is no "double queueing"
If this is about totally bypassing s/w config then it's a different ballgame..
cheers,
jamal
> [...]
* RE: [RFC] HW TX Rate Limiting Driver API
2024-04-05 10:23 [RFC] HW TX Rate Limiting Driver API Simon Horman
2024-04-05 13:33 ` Jamal Hadi Salim
@ 2024-04-05 14:34 ` John Fastabend
2024-04-05 16:25 ` Paolo Abeni
2024-04-09 22:32 ` Jakub Kicinski
` (2 subsequent siblings)
4 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2024-04-05 14:34 UTC (permalink / raw)
To: Simon Horman, netdev
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala,
Paolo Abeni
Simon Horman wrote:
> Hi,
>
> This is follow-up to the ongoing discussion started by Intel to extend the
> support for TX shaping H/W offload [1].
>
> The goal is allowing the user-space to configure TX shaping offload on a
> per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> VF device.
>
>
> In the past few months several different solutions were attempted and
> discussed, without finding a perfect fit:
>
> - devlink_rate APIs are not appropriate for to control TX shaping on netdevs
> - No existing TC qdisc offload covers the required feature set
> - HTB does not allow direct queue configuration
> - MQPRIO imposes constraint on the maximum number of TX queues
iirc the TX queue limitation was somewhat artificial. Probably we could
extend it.
> - TBF does not support max B/W limit
> - ndo_set_tx_maxrate() only controls the max B/W limit
>
> A new H/W offload API is needed, but offload API proliferation should be
> avoided.
Did you consider the dcbnl_rtnl_ops? These have the advantage of being
implemented in intel, mlx, amd, broadcom and a few more drivers. There
is an IEEE spec as well fwiw.
This has a getmaxrate, setmaxrate API. Adding a getminrate and setminrate
would be relatively straightforward. The spec defines how to do WRR and
NICs support it.
The main limitation is that the spec only defined 8 TCs, so the interface
followed the spec. But I think we could make this as large as needed to
map TCs 1:1 to queues if you want. By the way, someone probably already
thought of it, but having queue sets rate limited proved very useful in
the past. For example, a 40Gbps limit on a 100Gbps NIC likely needs
multiple queues (a TC in DCB language).
Just an idea I don't have the full context, but looks like an easy fit
to me. Assuming the mqprio mapping can be easily extended.
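For reference, the dcbnl maxrate hook referred to above is built around
struct ieee_maxrate and the ieee_setmaxrate/ieee_getmaxrate callbacks in
mainline. The sketch below is a userspace approximation of that layout
(kernel __u64 swapped for stdint); the cap_all_tcs() helper is invented
for illustration and is not part of the existing API:

```c
#include <stdint.h>

/* IEEE 802.1Qaz defines 8 traffic classes, hence the fixed array. */
#define IEEE_8021QAZ_MAX_TCS 8

/* Approximates the uapi struct ieee_maxrate: one tx cap per TC. */
struct ieee_maxrate {
	uint64_t tc_maxrate[IEEE_8021QAZ_MAX_TCS];
};

/* Hypothetical helper: cap every TC at the same rate, as a driver's
 * ieee_setmaxrate() callback would then program into H/W. */
static void cap_all_tcs(struct ieee_maxrate *mr, uint64_t rate)
{
	for (int i = 0; i < IEEE_8021QAZ_MAX_TCS; i++)
		mr->tc_maxrate[i] = rate;
}
```

A hypothetical setminrate counterpart would presumably mirror this shape
with a per-TC minimum instead.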
>
> The following proposal intends to cover the above specified requirement and
> provide a possible base to unify all the shaping offload APIs mentioned above.
>
> The following only defines the in-kernel interface between the core and
> drivers. The intention is to expose the feature to user-space via Netlink.
> Hopefully the latter part should be straight-forward after agreement
> on the in-kernel interface.
>
> All feedback and comment is more then welcome!
>
> [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
>
> Regards,
> Simon with much assistance from Paolo
Cool to see some hw offloads.
>
> ---
> /* SPDX-License-Identifier: GPL-2.0-or-later */
>
> #ifndef _NET_SHAPER_H_
> #define _NET_SHAPER_H_
>
> /**
> * enum shaper_metric - the metric of the shaper
> * @SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
> * @SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> */
> enum shaper_metric {
> SHAPER_METRIC_PPS;
> SHAPER_METRIC_BPS;
> };
Interesting hw has a PPS limiter?
>
> #define SHAPER_ROOT_ID 0
> #define SHAPER_NONE_ID UINT_MAX
>
> /**
> * struct shaper_info - represent a node of the shaper hierarchy
> * @id: Unique identifier inside the shaper tree.
> * @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
> * no parent. Only the root shaper has no parent.
> * @metric: Specify if the bw limits refers to PPS or BPS
> * @bw_min: Minimum guaranteed rate for this shaper
> * @bw_max: Maximum peak bw allowed for this shaper
> * @burst: Maximum burst for the peek rate of this shaper
> * @priority: Scheduling priority for this shaper
> * @weight: Scheduling weight for this shaper
> *
> * The full shaper hierarchy is maintained only by the
> * NIC driver (or firmware), possibly in a NIC-specific format
> * and/or in H/W tables.
Is the hw actually implementing hierarchy? For some reason I
imagined different scopes, PF, VF, QueueSet, Queue. And if it
has more hierarchy maybe FW is just translating this? If that
is the case, perhaps instead of a hierarchy we expose a
enum hw_rate_limit_scope scope
and
enum hw_rate_limit_scope {
SHAPER_LOOKUP_BY_PORT,
SHAPER_LOOKUP_BY_NETDEV,
SHAPER_LOOKUP_BY_VF,
SHAPER_LOOKUP_BY_QUEUE_SET,
SHAPER_LOOKUP_BY_QUEUE,
}
My preference is almost always to push logic out of firmware
and into OS if possible.
> * The kernel uses this representation and the shaper_ops to
> * access, traverse, and update it.
> */
> struct shaper_info {
> /* The following fields allow the full traversal of the whole
> * hierarchy.
> */
> u32 id;
> u32 parent_id;
>
> /* The following fields define the behavior of the shaper. */
> enum shaper_metric metric;
> u64 bw_min;
> u64 bw_max;
> u32 burst;
Any details on how burst is implemented in HW?
> u32 priority;
What is priority?
> u32 weight;
Weight to me is a reference to a WRR algorithm? Is there some other
notion here?
> };
>
> [...]
> /* move - Change the parent of the specified shaper
> * @dev: netdevice to operate on.
> * @lookup_mode: how to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @new_parent_id: ID of the new parent shaper,
> * always relative to the SHAPER_LOOKUP_BY_TREE_ID
> * lookup mode
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Move the specified shaper within the hierarchy, replacing its
> * current parent with the shaper identified by @new_parent_id
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
Some heavy firmware or onchip CPU managing this move operation? Is this
a reset operation as well? Is there a need for an atomic move like this
in initial API? Maybe start with just set and push down an entire config
in one hit. If hw can really move things around dynamically I think it
would make sense though.
> int (*move)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> u32 new_parent_id, struct netlink_ext_ack *extack);
>
> /* add - Add a shaper inside the shaper hierarchy
> * @dev: netdevice to operate on.
> * @shaper: configuration of shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * @shaper.id must be set to SHAPER_NONE_ID as
> * the id for the shaper will be automatically allocated.
> * @shaper.parent_id determines where inside the shaper's tree
> * this node is inserted.
> *
> * Return:
> * * non-negative shaper id on success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> *
> * Examples of reasons this operation may fail include:
> * * H/W resource limits.
> * * The parent is a 'leaf' node - attached to a queue.
> * * Can't respect the requested bw limits.
> */
> int (*add)(struct net_device *dev, const struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
>
> /* delete - Delete the specified shaper from the hierarchy
> * @dev: netdevice to operate on.
> * @lookup_mode: how to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Object to return the deleted shaper configuration.
> * Ignored if NULL.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*delete)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode,
> u32 id, struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
One thought I have about exposing hierarchy like this is user will need
to have nic user manual in hand to navigate hardware limitation I presume.
If hw is this flexible then let's do something. But this is why I started
to think more about scopes (nic, pf, vf, queueSet, queue) than arbitrary
hierarchy. Perhaps this is going to target some DPU though with lots of
flexibility here.
> };
>
> /*
> * Examples:
> * - set shaping on a given queue
> * struct shaper_info info = { // fill this };
> * dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, NULL);
> *
> * - create a queue group with a queue group shaping limits.
> * Assuming the following topology already exists:
> * < netdev shaper >
> * / \
> * <queue 0 shaper> . . . <queue N shaper>
> *
> * struct shaper_info pinfo, ginfo;
> * dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_NETDEV, 0, &pinfo, NULL);
> *
> * ginfo.parent_id = pinfo.id;
> * // fill-in other shaper params...
> * new_node_id = dev->shaper_ops->add(dev, &ginfo, NULL);
> *
> * // now topology is:
> * // < netdev shaper >
> * // / | \
> * // / | <newly created shaper>
> * // / |
> * // <queue 0 shaper> . . . <queue N shaper>
I think we need a queue set shaper as part of this for larger bw limits.
> *
> * // move the shapers for queues 0..2 into the new queue group
> * for (i = 0; i <= 2; ++i)
> * dev->shaper_ops->move(dev, SHAPER_LOOKUP_BY_QUEUE, i, new_node_id, NULL);
> *
> * // now topology is:
> * // < netdev shaper >
> * // / \
> * // <newly created shaper> <queue 3 shaper> ... <queue n shaper>
> * // / \
> * // <queue 0 shaper> ... <queue 2 shaper>
> */
> #endif
>
>
Thanks,
John
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-05 14:34 ` John Fastabend
@ 2024-04-05 16:25 ` Paolo Abeni
0 siblings, 0 replies; 21+ messages in thread
From: Paolo Abeni @ 2024-04-05 16:25 UTC (permalink / raw)
To: John Fastabend, Simon Horman, netdev
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala,
Jamal Hadi Salim
On Fri, 2024-04-05 at 07:34 -0700, John Fastabend wrote:
> Simon Horman wrote:
> > This is a follow-up to the ongoing discussion started by Intel to extend the
> > support for TX shaping H/W offload [1].
> >
> > The goal is allowing the user-space to configure TX shaping offload on a
> > per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> > VF device.
> >
> >
> > In the past few months several different solutions were attempted and
> > discussed, without finding a perfect fit:
> >
> > - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> > - No existing TC qdisc offload covers the required feature set
> > - HTB does not allow direct queue configuration
> > - MQPRIO imposes constraint on the maximum number of TX queues
>
> iirc the TX queue limitation was somewhat artificial. Probably we could
> extend it.
Indeed MQPRIO was one of the preferred candidates, but removing the
limits is blocked by uAPI constraints.
Even ignoring the uAPI problem, it would require some offload changes
to get full support of the features required here.
In practice that will be quite similar to creating a new offload type,
and will not be 'future proof'.
Since here we are adding a new offload type, the idea is to try to
cover the capabilities of existing (and possibly foreseeable) H/W to
avoid being in the same situation some time from now.
Being 'future proof' is one of the requirements (sorry, not stated in
this thread yet) that materialized in the past discussion.
> > - TBF does not support max B/W limit
> > - ndo_set_tx_maxrate() only controls the max B/W limit
> >
> > A new H/W offload API is needed, but offload API proliferation should be
> > avoided.
>
> Did you consider the dcbnl_rtnl_ops? These have the advantage of being
> implemented in intel, mlx, amd, broadcom and a few more drivers. There
> is an IEEE spec as well fwiw.
>
> This has a getmaxrate, setmaxrate API. Adding a getminrate and setminrate
> would be relatively straightforward. The spec defines how to do WRR and
> NICs support it.
It would be similar to ndo_set_tx_maxrate(). Extending it would be
simple, and is implemented by some different vendors already.
But we will face the same points mentioned above: will be quite similar
to creating a new offload type, and will not be 'future proof'.
> > The following proposal intends to cover the above specified requirement and
> > provide a possible base to unify all the shaping offload APIs mentioned above.
> >
> > The following only defines the in-kernel interface between the core and
> > drivers. The intention is to expose the feature to user-space via Netlink.
> > Hopefully the latter part should be straight-forward after agreement
> > on the in-kernel interface.
> >
> > All feedback and comments are more than welcome!
> >
> > [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
> >
> > Regards,
> > Simon with much assistance from Paolo
>
> Cool to see some hw offloads.
:)
The planning phase was less cool :)
> > ---
> > /* SPDX-License-Identifier: GPL-2.0-or-later */
> >
> > #ifndef _NET_SHAPER_H_
> > #define _NET_SHAPER_H_
> >
> > /**
> > * enum shaper_metric - the metric of the shaper
> > * @SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
> > * @SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> > */
> > enum shaper_metric {
> > SHAPER_METRIC_PPS,
> > SHAPER_METRIC_BPS,
> > };
>
> Interesting hw has a PPS limiter?
AFAIK at least the intel ice driver has this kind of support. I've been
told this is a requirement to access the telco market. I guess other
vendors will follow - assuming they don't have it already.
>
> >
> > #define SHAPER_ROOT_ID 0
> > #define SHAPER_NONE_ID UINT_MAX
> >
> > /**
> > * struct shaper_info - represent a node of the shaper hierarchy
> > * @id: Unique identifier inside the shaper tree.
> > * @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
> > * no parent. Only the root shaper has no parent.
> > * @metric: Specify if the bw limits refers to PPS or BPS
> > * @bw_min: Minimum guaranteed rate for this shaper
> > * @bw_max: Maximum peak bw allowed for this shaper
> > * @burst: Maximum burst for the peak rate of this shaper
> > * @priority: Scheduling priority for this shaper
> > * @weight: Scheduling weight for this shaper
> > *
> > * The full shaper hierarchy is maintained only by the
> > * NIC driver (or firmware), possibly in a NIC-specific format
> > * and/or in H/W tables.
>
> Is the hw actually implementing hierarchy? For some reason I
> imagined different scopes, PF, VF, QueueSet, Queue. And if it
> has more hierarchy maybe FW is just translating this? If that
> is the case perhaps instead of hierarchy we expose a
>
> enum hw_rate_limit_scope scope
>
> and
>
> enum hw_rate_limit_scope {
> SHAPER_LOOKUP_BY_PORT,
> SHAPER_LOOKUP_BY_NETDEV,
> SHAPER_LOOKUP_BY_VF,
> SHAPER_LOOKUP_BY_QUEUE_SET,
> SHAPER_LOOKUP_BY_QUEUE,
> }
>
> My preference is almost always to push logic out of firmware
> and into OS if possible.
I think this actually overlaps with what is described below? This API
will allow access by the above 'scopes' (except by QUEUE_SET, but we can add
it). See shaper_lookup_mode below.
>
> > * The kernel uses this representation and the shaper_ops to
> > * access, traverse, and update it.
> > */
> > struct shaper_info {
> > /* The following fields allow the full traversal of the whole
> > * hierarchy.
> > */
> > u32 id;
> > u32 parent_id;
> >
> > /* The following fields define the behavior of the shaper. */
> > enum shaper_metric metric;
> > u64 bw_min;
> > u64 bw_max;
> > u32 burst;
>
> Any details on how burst is implemented in HW?
I guess I could describe some specific H/W implementation, but would
that be too restrictive? What about just:
maximum bw_max burst in bytes
or the like?
>
> > u32 priority;
>
> What is priority?
It's the strict scheduling priority, within the relevant group/set.
>
> > u32 weight;
>
> Weight to me is a reference to a WRR algorithm? Is there some other
> notion here?
Yes, as in WRR.
The general idea is to try to expose as many features as possible from
the existing H/W.
[...]
> > /* set - Update the specified shaper, if it exists
> > * @dev: Netdevice to operate on.
> > * @lookup_mode: How to perform the shaper lookup
> > * @id: ID of the specified shaper,
> > * relative to the specified @lookup_mode.
> > * @shaper: Configuration of shaper.
> > * @extack: Netlink extended ACK for reporting errors.
> > *
> > * Configure the parameters of @shaper according to values supplied
> > * in the following fields:
> > * * @shaper.metric
> > * * @shaper.bw_min
> > * * @shaper.bw_max
> > * * @shaper.burst
> > * * @shaper.priority
> > * * @shaper.weight
> > * Values supplied in other fields of @shaper must be zero and,
> > * other than verifying that, are ignored.
> > *
> > * Return:
> > * * %0 - Success
> > * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > * or core for any reason. @extack should be set
> > * to text describing the reason.
> > * * Other negative error values on failure.
> > */
> > int (*set)(struct net_device *dev,
> > enum shaper_lookup_mode lookup_mode, u32 id,
> > const struct shaper_info *shaper,
> > struct netlink_ext_ack *extack);
> >
> > /* move - Change the parent of the specified shaper
> > * @dev: netdevice to operate on.
> > * @lookup_mode: how to perform the shaper lookup
> > * @id: ID of the specified shaper,
> > * relative to the specified @lookup_mode.
> > * @new_parent_id: ID of the new parent shaper,
> > * always relative to the SHAPER_LOOKUP_BY_TREE_ID
> > * lookup mode
> > * @extack: Netlink extended ACK for reporting errors.
> > *
> > * Move the specified shaper in the hierarchy replacing its
> > * current parent shaper with @new_parent_id
> > *
> > * Return:
> > * * %0 - Success
> > * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > * or core for any reason. @extack should be set
> > * to text describing the reason.
> > * * Other negative error values on failure.
> > */
>
> Some heavy firmware or onchip CPU managing this move operation? Is this
> a reset operation as well? Is there a need for an atomic move like this
> in initial API? Maybe start with just set and push down an entire config
> in one hit. If hw can really move things around dynamically I think it
> would make sense though.
This 'move' op was a last minute addendum. It's used by the example
below, useful in non-trivial scenarios. Reading the datasheet 'ice'
supports it, and I think even mlx5 - guessing on devlink_rate api.
It can be dropped or added later, without it the queue group example
below will need some other additional op.
I guess h/w not supporting it (or posing some kind of constraint) could
error out with some descriptive extack.
> > int (*move)(struct net_device *dev,
> > enum shaper_lookup_mode lookup_mode, u32 id,
> > u32 new_parent_id, struct netlink_ext_ack *extack);
> >
> > /* add - Add a shaper inside the shaper hierarchy
> > * @dev: netdevice to operate on.
> > * @shaper: configuration of shaper.
> > * @extack: Netlink extended ACK for reporting errors.
> > *
> > * @shaper.id must be set to SHAPER_NONE_ID as
> > * the id for the shaper will be automatically allocated.
> > * @shaper.parent_id determines where inside the shaper's tree
> > * this node is inserted.
> > *
> > * Return:
> > * * non-negative shaper id on success
> > * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > * or core for any reason. @extack should be set
> > * to text describing the reason.
> > * * Other negative error values on failure.
> > *
> > * Examples of reasons this operation may fail include:
> > * * H/W resource limits.
> > * * The parent is a 'leaf' node - attached to a queue.
> > * * Can't respect the requested bw limits.
> > */
> > int (*add)(struct net_device *dev, const struct shaper_info *shaper,
> > struct netlink_ext_ack *extack);
> >
> > /* delete - Delete the specified shaper from the hierarchy
> > * @dev: netdevice to operate on.
> > * @lookup_mode: how to perform the shaper lookup
> > * @id: ID of the specified shaper,
> > * relative to the specified @lookup_mode.
> > * @shaper: Object to return the deleted shaper configuration.
> > * Ignored if NULL.
> > * @extack: Netlink extended ACK for reporting errors.
> > *
> > * Return:
> > * * %0 - Success
> > * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > * or core for any reason. @extack should be set
> > * to text describing the reason.
> > * * Other negative error values on failure.
> > */
> > int (*delete)(struct net_device *dev,
> > enum shaper_lookup_mode lookup_mode,
> > u32 id, struct shaper_info *shaper,
> > struct netlink_ext_ack *extack);
>
> One thought I have about exposing hierarchy like this is user will need
> to have nic user manual in hand to navigate hardware limitation I presume.
> If hw is this flexible then let's do something. But this is why I started
> to think more about scopes (nic, pf, vf, queueSet, queue) than arbitrary
> hierarchy. Perhaps this is going to target some DPU though with lots of
> flexibility here.
I fear we were too terse describing 'enum shaper_lookup_mode
lookup_mode/id access'. I think it actually allows the scope lookup
you are looking for, and even full hierarchy access - if the H/W
supports it.
My understanding is that at least intel and mellanox already support
this kind of features, and likely marvell.
> > };
> >
> > /*
> > * Examples:
> > * - set shaping on a given queue
> > * struct shaper_info info = { // fill this };
> > * dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, NULL);
> > *
> > * - create a queue group with a queue group shaping limits.
> > * Assuming the following topology already exists:
> > * < netdev shaper >
> > * / \
> > * <queue 0 shaper> . . . <queue N shaper>
> > *
> > * struct shaper_info pinfo, ginfo;
> > * dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_NETDEV, 0, &pinfo, NULL);
> > *
> > * ginfo.parent_id = pinfo.id;
> > * // fill-in other shaper params...
> > * new_node_id = dev->shaper_ops->add(dev, &ginfo, NULL);
> > *
> > * // now topology is:
> > * // < netdev shaper >
> > * // / | \
> > * // / | <newly created shaper>
> > * // / |
> > * // <queue 0 shaper> . . . <queue N shaper>
>
> I think we need a queue set shaper as part of this for larger bw limits.
I'm sorry, I'm unsure I understand the above, could you please
rephrase?
In this example the 'newly created shaper' is a 'queue set' shaper.
Thanks for the feedback!
Paolo
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-05 13:33 ` Jamal Hadi Salim
@ 2024-04-05 17:06 ` Paolo Abeni
2024-04-06 13:48 ` Jamal Hadi Salim
0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2024-04-05 17:06 UTC (permalink / raw)
To: Jamal Hadi Salim, Simon Horman
Cc: netdev, Jakub Kicinski, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala, John Fastabend
On Fri, 2024-04-05 at 09:33 -0400, Jamal Hadi Salim wrote:
> On Fri, Apr 5, 2024 at 6:25 AM Simon Horman <horms@kernel.org> wrote:
> > This is a follow-up to the ongoing discussion started by Intel to extend the
> > support for TX shaping H/W offload [1].
> >
> > The goal is allowing the user-space to configure TX shaping offload on a
> > per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> > VF device.
> >
> >
> > In the past few months several different solutions were attempted and
> > discussed, without finding a perfect fit:
> >
> > - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> > - No existing TC qdisc offload covers the required feature set
> > - HTB does not allow direct queue configuration
> > - MQPRIO imposes constraint on the maximum number of TX queues
> > - TBF does not support max B/W limit
> > - ndo_set_tx_maxrate() only controls the max B/W limit
> >
> > A new H/W offload API is needed, but offload API proliferation should be
> > avoided.
> >
> > The following proposal intends to cover the above specified requirement and
> > provide a possible base to unify all the shaping offload APIs mentioned above.
> >
> > The following only defines the in-kernel interface between the core and
> > drivers. The intention is to expose the feature to user-space via Netlink.
> > Hopefully the latter part should be straight-forward after agreement
> > on the in-kernel interface.
> >
> > All feedback and comments are more than welcome!
> >
> > [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
> >
>
> My 2 cents:
> I did peruse the lore quoted thread but i am likely to have missed something.
> It sounds like the requirement is for egress-from-host (which to a
> device internal looks like ingress-from-host on the device). Doesn't
> existing HTB offload already support this? I didnt see this being
> discussed in the thread.
Yes, HTB has been one of the possible options discussed, but not in that
thread, let me find the reference:
https://lore.kernel.org/netdev/131da9645be5ef6ea584da27ecde795c52dfbb00.camel@redhat.com/
it turns out that HTB does not allow configuring TX shaping on a per
(existing, direct) queue basis. It could, with some small functional
changes, but then we would be in the suboptimal scenario I mentioned in
my previous email: quite similar to creating a new offload type,
and will not be 'future proof'.
> Also, IIUC, there is no hierarchy
> requirement. That is something you can teach HTB but there's probably
> something i missed because i didnt understand the context of "HTB does
> not allow direct queue configuration". If HTB is confusing from a
> config pov then it seems what Paolo was describing in the thread on
> TBF is a reasonable approach too. I couldnt grok why that TBF
> extension for max bw was considered a bad idea.
TBF was also in the category 'near enough but not a 100% fit'
> On config:
> Could we not introduce skip_sw/hw semantics for qdiscs? IOW, skip_sw
> means the config is only subjected to hw and you have DIRECT
> semantics, etc.
> I understand the mlnx implementation of HTB does a lot of things in
> the driver but the one nice thing they had was ability to use classid
> X:Y to select a egress h/w queue. The driver resolution of all the
> hierarchy is not needed at all here if i understood the requirement
> above.
> You still need to have a classifier in s/w (which could be attached to
> clsact egress) to select the queue. That is something the mlnx
> implementation allowed. So there is no "double queueing"
AFAICS the current status of qdisc H/W offload implementation is a bit
mixed-up. e.g. HTB requires explicit syntax on the command line to
enable H/W offload, TBF doesn't.
H/W offload enabled on MQPRIO implies skipping the software path, while
for HTB and TBF it doesn't.
> If this is about totally bypassing s/w config then its a different ballgame..
Yes, this does not have a s/w counterpart. It limits itself to
configuring/exposing H/W features.
My take is that by configuring the shapers on a queue/device/queue
group/VF group basis, the admin is enforcing shared resource
reservations: we don't strictly need a software counterpart.
Thanks for the feedback!
Paolo
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-05 17:06 ` Paolo Abeni
@ 2024-04-06 13:48 ` Jamal Hadi Salim
2024-04-10 9:41 ` Paolo Abeni
0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2024-04-06 13:48 UTC (permalink / raw)
To: Paolo Abeni
Cc: Simon Horman, netdev, Jakub Kicinski, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala, John Fastabend
On Fri, Apr 5, 2024 at 1:06 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2024-04-05 at 09:33 -0400, Jamal Hadi Salim wrote:
> > On Fri, Apr 5, 2024 at 6:25 AM Simon Horman <horms@kernel.org> wrote:
> > > This is a follow-up to the ongoing discussion started by Intel to extend the
> > > support for TX shaping H/W offload [1].
> > >
> > > The goal is allowing the user-space to configure TX shaping offload on a
> > > per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> > > VF device.
> > >
> > >
> > > In the past few months several different solutions were attempted and
> > > discussed, without finding a perfect fit:
> > >
> > > - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> > > - No existing TC qdisc offload covers the required feature set
> > > - HTB does not allow direct queue configuration
> > > - MQPRIO imposes constraint on the maximum number of TX queues
> > > - TBF does not support max B/W limit
> > > - ndo_set_tx_maxrate() only controls the max B/W limit
> > >
> > > A new H/W offload API is needed, but offload API proliferation should be
> > > avoided.
> > >
> > > The following proposal intends to cover the above specified requirement and
> > > provide a possible base to unify all the shaping offload APIs mentioned above.
> > >
> > > The following only defines the in-kernel interface between the core and
> > > drivers. The intention is to expose the feature to user-space via Netlink.
> > > Hopefully the latter part should be straight-forward after agreement
> > > on the in-kernel interface.
> > >
> > > All feedback and comments are more than welcome!
> > >
> > > [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
> > >
> >
> > My 2 cents:
> > I did peruse the lore quoted thread but i am likely to have missed something.
> > It sounds like the requirement is for egress-from-host (which to a
> > device internal looks like ingress-from-host on the device). Doesn't
> > existing HTB offload already support this? I didnt see this being
> > discussed in the thread.
>
> Yes, HTB has been one of the possible options discussed, but not in that
> thread, let me find the reference:
>
> https://lore.kernel.org/netdev/131da9645be5ef6ea584da27ecde795c52dfbb00.camel@redhat.com/
>
> it turns out that HTB does not allow configuring TX shaping on a per
> (existing, direct) queue basis. It could, with some small functional
> changes, but then we would be in the suboptimal scenario I mentioned in
> my previous email: quite similar to creating a new offload type,
> and will not be 'future proof'.
>
> > Also, IIUC, there is no hierarchy
> > requirement. That is something you can teach HTB but there's probably
> > something i missed because i didnt understand the context of "HTB does
> > not allow direct queue configuration". If HTB is confusing from a
> > config pov then it seems what Paolo was describing in the thread on
> > TBF is a reasonable approach too. I couldnt grok why that TBF
> > extension for max bw was considered a bad idea.
>
> TBF was also in the category 'near enough but not a 100% fit'
>
> > On config:
> > Could we not introduce skip_sw/hw semantics for qdiscs? IOW, skip_sw
> > means the config is only subjected to hw and you have DIRECT
> > semantics, etc.
> > I understand the mlnx implementation of HTB does a lot of things in
> > the driver but the one nice thing they had was ability to use classid
> > X:Y to select a egress h/w queue. The driver resolution of all the
> > hierarchy is not needed at all here if i understood the requirement
> > above.
> > You still need to have a classifier in s/w (which could be attached to
> > clsact egress) to select the queue. That is something the mlnx
> > implementation allowed. So there is no "double queueing"
>
> AFAICS the current status of qdisc H/W offload implementation is a bit
> mixed-up. e.g. HTB requires explicit syntax on the command line to
> enable H/W offload, TBF doesn't.
>
> H/W offload enabled on MQPRIO implies skipping the software path, while
> for HTB and TBF it doesn't.
>
> > If this is about totally bypassing s/w config then its a different ballgame..
>
> Yes, this does not have a s/w counterpart. It limits itself to
> configuring/exposing H/W features.
>
I think this changes the dynamics altogether. Would IDPF[1] be a fit for this?
> My take is that by configuring the shapers on a queue/device/queue
> group/VF group basis, the admin is enforcing shared resource
> reservations: we don't strictly need a software counterpart.
>
I am assuming then, if the hw allows it, one could run offloaded TC/htb
on the queue/device/queue group
cheers,
jamal
[1] https://netdevconf.info/0x16/sessions/workshop/infrastructure-datapath-functionidpf-workshop.html
> Thanks for the feedback!
>
> Paolo
>
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-05 10:23 [RFC] HW TX Rate Limiting Driver API Simon Horman
2024-04-05 13:33 ` Jamal Hadi Salim
2024-04-05 14:34 ` John Fastabend
@ 2024-04-09 22:32 ` Jakub Kicinski
2024-04-10 8:33 ` Paolo Abeni
2024-04-11 23:51 ` Vinicius Costa Gomes
2024-04-22 11:30 ` Sunil Kovvuri Goutham
4 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2024-04-09 22:32 UTC (permalink / raw)
To: Simon Horman
Cc: netdev, Jiri Pirko, Madhu Chittim, Sridhar Samudrala, Paolo Abeni
On Fri, 5 Apr 2024 11:23:13 +0100 Simon Horman wrote:
> /**
> * enum shaper_lookup_mode - Lookup method used to access a shaper
> * @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is unused
> * @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network device,
> * @id is unused
> * @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
> * @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
> * @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside the
> * shaper hierarchy as in shaper_info.id
> *
> * SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF and SHAPER_LOOKUP_BY_TREE_ID are
> * only available on PF devices, usually inside the host/hypervisor.
> * SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but
> * only if the latter are privileged ones.
The privileged part is an implementation limitation. Limiting oneself
or subqueues should not require elevated privileges, in principle,
right?
> * The same shaper can be reached with different lookup mode/id pairs,
> * mapping network visible objects (devices, VFs, queues) to the scheduler
> * hierarchy and vice-versa.
:o
> enum shaper_lookup_mode {
> SHAPER_LOOKUP_BY_PORT,
> SHAPER_LOOKUP_BY_NETDEV,
> SHAPER_LOOKUP_BY_VF,
> SHAPER_LOOKUP_BY_QUEUE,
> SHAPER_LOOKUP_BY_TREE_ID,
Two questions.
1) are these looking up real nodes or some special kind of node which
can't have settings assigned directly? IOW if I want to rate limit
the port do I get + set the port object or create a node above it and
link it up?
Or do those special nodes exist implicitly (from the example it
seems like they do?)
2) the objects themselves seem rather different. I'm guessing we need
port/netdev/vf because either some users don't want to use switchdev
(vf = repr netdev) or drivers don't implement it "correctly" (port !=
netdev?!)?
Also feels like some of these are root nodes, some are leaf nodes?
> };
>
>
> /**
> * struct shaper_ops - Operations on shaper hierarchy
> * @get: Access the specified shaper.
> > * @set: Modify the specified shaper.
> > * @move: Move the specified shaper inside the hierarchy.
> > * @add: Add a shaper inside the shaper hierarchy.
> > * @delete: Delete the specified shaper.
> *
> * The netdevice exposes a pointer to these ops.
> *
> * It’s up to the driver or firmware to create the default shapers hierarchy,
> * according to the H/W capabilities.
> */
> struct shaper_ops {
> /* get - Fetch the specified shaper, if it exists
> * @dev: Netdevice to operate on.
> * @lookup_mode: How to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Object to return shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> > * Multiple lookup mode/id pairs can refer to the same shaper.
> > * And multiple entities (e.g. VF and PF) can try to access the same
> > * shaper concurrently.
> > *
> > * Values of @id depend on @lookup_mode:
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_PORT or
> > * SHAPER_LOOKUP_BY_NETDEV, then @id is unused.
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_VF,
> > * then @id is a virtual function number, relative to @dev,
> > * which must be a physical function.
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_QUEUE,
> > * then @id is the queue number, relative to @dev.
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_TREE_ID,
> > * then @id is a @shaper_info.id and any shaper inside the
> > * hierarchy can be accessed directly.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error value on failure.
> */
> int (*get)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> struct shaper_info *shaper, struct netlink_ext_ack *extack);
How about we store the hierarchy in the core?
Assume core has the source of truth, no need to get?
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-09 22:32 ` Jakub Kicinski
@ 2024-04-10 8:33 ` Paolo Abeni
2024-04-10 14:57 ` Jakub Kicinski
0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2024-04-10 8:33 UTC (permalink / raw)
To: Jakub Kicinski, Simon Horman
Cc: netdev, Jiri Pirko, Madhu Chittim, Sridhar Samudrala
On Tue, 2024-04-09 at 15:32 -0700, Jakub Kicinski wrote:
> On Fri, 5 Apr 2024 11:23:13 +0100 Simon Horman wrote:
> > /**
> > * enum shaper_lookup_mode - Lookup method used to access a shaper
> > * @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is unused
> > * @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network device,
> > * @id is unused
> > * @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
> > * @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
> > * @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside the
> > * shaper hierarchy as in shaper_info.id
> > *
> > * SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF, and SHAPER_LOOKUP_BY_TREE_ID are
> > * only available on PF devices, usually inside the host/hypervisor.
> > * SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but
> > * only if the latter are privileged ones.
>
> The privileged part is an implementation limitation.
> Limiting oneself
> or subqueues should not require elevated privileges, in principle,
> right?
The reference to privileged functions is here to try to ensure proper
isolation when required.
E.g. let's suppose the admin on the host wants to restrict/limit
the B/W for a given VF (the VF itself, not the representor! See below
WRT shaper_lookup_mode) to some rate; he/she likely also intends to
prevent the guest from relaxing that setting by configuring such a
rate on the guest device.
> > * The same shaper can be reached with different lookup mode/id pairs,
> > * mapping network visible objects (devices, VFs, queues) to the scheduler
> > * hierarchy and vice-versa.
>
> :o
>
> > enum shaper_lookup_mode {
> > SHAPER_LOOKUP_BY_PORT,
> > SHAPER_LOOKUP_BY_NETDEV,
> > SHAPER_LOOKUP_BY_VF,
> > SHAPER_LOOKUP_BY_QUEUE,
> > SHAPER_LOOKUP_BY_TREE_ID,
>
> Two questions.
>
> 1) are these looking up real nodes or some special kind of node which
> can't have settings assigned directly?
> IOW if I want to rate limit
> the port do I get + set the port object or create a node above it and
> link it up?
There is no concept of such special kind of nodes. Probably the above
enum needs a better/clearer definition of each element.
How to reach a specific configuration for the port shaper depends on
the NIC defaults - whatever hierarchy it provides/creates at
initialization time.
The NIC/firmware can either provide a default shaper for the port level
- in such case the user/admin just need to set it. Otherwise user/admin
will have to create the shaper and link it.
I guess the first option should be more common/the expected one.
This proposal allows both cases.
> Or do those special nodes not exit implicitly (from the example it
> seems like they do?)
Could you please re-phrase the above?
> 2) the objects themselves seem rather different. I'm guessing we need
> port/netdev/vf because either some users don't want to use switchdev
> (vf = repr netdev) or drivers don't implement it "correctly" (port !=
> netdev?!)?
Yes, the nodes inside the hierarchy can be linked to very different
objects. The different lookup modes are there just to provide easy
access to the relevant objects.
Likely a much better description is needed: 'port' here is really the
cable plug level, 'netdev' refers to the Linux network device. There
could be multiple netdevs for the same port, as 'netdev' could refer
either to a PF or to a VF. Finally, VF is really the virtual function
device, not the representor, so that the host can configure/limit the
guest tx rate.
The same shaper can be reached/looked-up with different ids.
e.g. the device level shaper for a virtual function can be reached
with:
- SHAPER_LOOKUP_BY_TREE_ID + unique tree id (every node is reachable
this way) from any host device in the same hierarchy
- SHAPER_LOOKUP_BY_VF + virtual function id, from the host PF device
- SHAPER_LOOKUP_BY_NETDEV, from the guest VF device
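[Editor's note: the lookup mode/id mapping described above can be sketched in plain C. This is an illustrative sketch only; apart from the enum values, every name below (the map struct, the resolver, the base-id scheme) is assumed, not part of the RFC.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: how a driver could map the different lookup
 * mode/id pairs onto a single internal tree id.  Only the enum values
 * come from the RFC; everything else is illustrative.
 */
enum shaper_lookup_mode {
	SHAPER_LOOKUP_BY_PORT,
	SHAPER_LOOKUP_BY_NETDEV,
	SHAPER_LOOKUP_BY_VF,
	SHAPER_LOOKUP_BY_QUEUE,
	SHAPER_LOOKUP_BY_TREE_ID,
};

/* Assumed per-device table: tree ids for the well-known nodes. */
struct shaper_dev_map {
	int port_tree_id;	/* -1 if not visible from this device */
	int netdev_tree_id;
	int vf_tree_id_base;	/* VF n maps to vf_tree_id_base + n */
	int queue_tree_id_base;	/* queue n maps to queue_tree_id_base + n */
};

/* Resolve a lookup mode/id pair to the unique tree id, or -1. */
static int shaper_resolve(const struct shaper_dev_map *map,
			  enum shaper_lookup_mode mode, unsigned int id)
{
	switch (mode) {
	case SHAPER_LOOKUP_BY_PORT:
		return map->port_tree_id;	/* @id unused */
	case SHAPER_LOOKUP_BY_NETDEV:
		return map->netdev_tree_id;	/* @id unused */
	case SHAPER_LOOKUP_BY_VF:
		return map->vf_tree_id_base + (int)id;
	case SHAPER_LOOKUP_BY_QUEUE:
		return map->queue_tree_id_base + (int)id;
	case SHAPER_LOOKUP_BY_TREE_ID:
		return (int)id;			/* direct access */
	}
	return -1;
}
```

The point the sketch makes is that BY_VF/BY_QUEUE/BY_NETDEV lookups are just convenience aliases for tree ids: the same node can be reached via several mode/id pairs.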
> Also feels like some of these are root nodes, some are leaf nodes?
There is a single root node (the port's parent), possibly many internal
nodes (port, netdev, vf, possibly more intermediate levels depending on
the actual configuration [e.g. the firmware or the admin could create
'device groups' or 'queue groups']) and likely many leaf nodes (queue
level).
My personal take is that, from an API point of view, differentiating
between leaves and internal nodes makes the API more complex with no
practical advantage for the API users.
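[Editor's note: a minimal sketch of the "no special node kinds" argument above. All names are assumed for illustration; the RFC does not define these structs. Every node has the same type, "leaf" vs "internal" is purely positional, and a limit set anywhere along the path to the root applies.]

```c
#include <assert.h>
#include <stddef.h>

/* Sketch (assumed names): one uniform node type for the whole
 * hierarchy.  Queues, netdevs, VFs and the port are all the same
 * struct, so the API surface is identical for every node.
 */
struct shaper_node {
	unsigned int id;		/* unique tree id */
	struct shaper_node *parent;	/* NULL only for the root */
	unsigned long bw_max;		/* 0 == unlimited */
};

/* Depth of a node: the root is 0, the port (root's child) is 1, etc. */
static int shaper_node_depth(const struct shaper_node *n)
{
	int depth = 0;

	while (n->parent) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/* Effective max B/W is the tightest limit along the path to the root;
 * no node is special, so the same walk works for leaves and internal
 * nodes alike.
 */
static unsigned long shaper_effective_max(const struct shaper_node *n)
{
	unsigned long max = 0;	/* 0 == unlimited */

	for (; n; n = n->parent)
		if (n->bw_max && (!max || n->bw_max < max))
			max = n->bw_max;
	return max;
}
```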
> > };
> >
> >
> > /**
> > * struct shaper_ops - Operations on shaper hierarchy
> > * @get: Access the specified shaper.
> > * @set: Modify the specified shaper.
> > * @move: Move the specified shaper inside the hierarchy.
> > * @add: Add a shaper inside the shaper hierarchy.
> > * @delete: Delete the specified shaper.
> > *
> > * The netdevice exposes a pointer to these ops.
> > *
> > * It’s up to the driver or firmware to create the default shapers hierarchy,
> > * according to the H/W capabilities.
> > */
> > struct shaper_ops {
> > /* get - Fetch the specified shaper, if it exists
> > * @dev: Netdevice to operate on.
> > * @lookup_mode: How to perform the shaper lookup
> > * @id: ID of the specified shaper,
> > * relative to the specified @lookup_mode.
> > * @shaper: Object to return shaper.
> > * @extack: Netlink extended ACK for reporting errors.
> > *
> > * Multiple lookup mode/id pairs can refer to the same shaper.
> > * And multiple entities (e.g. VF and PF) can try to access the same
> > * shaper concurrently.
> > *
> > * Values of @id depend on the @lookup_mode:
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_PORT or
> > * SHAPER_LOOKUP_BY_NETDEV, then @id is unused.
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_VF,
> > * then @id is a virtual function number, relative to @dev,
> > * which should be a physical function.
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_QUEUE,
> > * then @id represents the queue number, relative to @dev.
> > * * If @lookup_mode is SHAPER_LOOKUP_BY_TREE_ID,
> > * then @id is a @shaper_info.id and any shaper inside the
> > * hierarchy can be accessed directly.
> > *
> > * Return:
> > * * %0 - Success
> > * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > * or core for any reason. @extack should be set
> > * to text describing the reason.
> > * * Other negative error value on failure.
> > */
> > int (*get)(struct net_device *dev,
> > enum shaper_lookup_mode lookup_mode, u32 id,
> > struct shaper_info *shaper, struct netlink_ext_ack *extack);
>
> How about we store the hierarchy in the core?
> Assume core has the source of truth, no need to get?
One design choice was _not_ duplicating the hierarchy in the core: the
full hierarchy is maintained only by the NIC/firmware. The NIC/firmware
can perform some changes "automatically" e.g. when adding or deleting
VFs or queues it will likely create/delete the paired shaper. The
synchronization looks cumbersome and error prone.
The hierarchy could be exposed to both host and guests, I guess only
the host core could be the source of truth, right?
Thanks for the feedback,
Paolo
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-06 13:48 ` Jamal Hadi Salim
@ 2024-04-10 9:41 ` Paolo Abeni
0 siblings, 0 replies; 21+ messages in thread
From: Paolo Abeni @ 2024-04-10 9:41 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Simon Horman, netdev, Jakub Kicinski, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala, John Fastabend
On Sat, 2024-04-06 at 09:48 -0400, Jamal Hadi Salim wrote:
> On Fri, Apr 5, 2024 at 1:06 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Fri, 2024-04-05 at 09:33 -0400, Jamal Hadi Salim wrote:
> > > On Fri, Apr 5, 2024 at 6:25 AM Simon Horman <horms@kernel.org> wrote:
> > > > This is follow-up to the ongoing discussion started by Intel to extend the
> > > > support for TX shaping H/W offload [1].
> > > >
> > > > The goal is allowing the user-space to configure TX shaping offload on a
> > > > per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> > > > VF device.
> > > >
> > > >
> > > > In the past few months several different solutions were attempted and
> > > > discussed, without finding a perfect fit:
> > > >
> > > > - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> > > > - No existing TC qdisc offload covers the required feature set
> > > > - HTB does not allow direct queue configuration
> > > > - MQPRIO imposes constraint on the maximum number of TX queues
> > > > - TBF does not support max B/W limit
> > > > - ndo_set_tx_maxrate() only controls the max B/W limit
> > > >
> > > > A new H/W offload API is needed, but offload API proliferation should be
> > > > avoided.
> > > >
> > > > The following proposal intends to cover the above specified requirement and
> > > > provide a possible base to unify all the shaping offload APIs mentioned above.
> > > >
> > > > The following only defines the in-kernel interface between the core and
> > > > drivers. The intention is to expose the feature to user-space via Netlink.
> > > > Hopefully the latter part should be straight-forward after agreement
> > > > on the in-kernel interface.
> > > >
> > > > All feedback and comments are more than welcome!
> > > >
> > > > [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
> > > >
> > >
> > > My 2 cents:
> > > I did peruse the lore quoted thread but i am likely to have missed something.
> > > It sounds like the requirement is for egress-from-host (which to a
> > > device internal looks like ingress-from-host on the device). Doesn't
> > > existing HTB offload already support this? I didnt see this being
> > > discussed in the thread.
> >
> > Yes, HTB has been one of the possible option discussed, but not in that
> > thread, let me find the reference:
> >
> > https://lore.kernel.org/netdev/131da9645be5ef6ea584da27ecde795c52dfbb00.camel@redhat.com/
> >
> > it turns out that HTB does not allow configuring TX shaping on a per
> > (existing, direct) queue basis. It could, with some small functional
> > changes, but then we would be in the suboptimal scenario I mentioned in
> > my previous email: quite similar to creating a new offload type,
> > and it would not be 'future proof'.
> >
> > > Also, IIUC, there is no hierarchy
> > > requirement. That is something you can teach HTB but there's probably
> > > something i missed because i didnt understand the context of "HTB does
> > > not allow direct queue configuration". If HTB is confusing from a
> > > config pov then it seems what Paolo was describing in the thread on
> > > TBF is a reasonable approach too. I couldnt grok why that TBF
> > > extension for max bw was considered a bad idea.
> >
> > TBF was also in the category 'near enough but not a 100% fit'
> >
> > > On config:
> > > Could we not introduce skip_sw/hw semantics for qdiscs? IOW, skip_sw
> > > means the config is only subjected to hw and you have DIRECT
> > > semantics, etc.
> > > I understand the mlnx implementation of HTB does a lot of things in
> > > the driver but the one nice thing they had was ability to use classid
> > > X:Y to select a egress h/w queue. The driver resolution of all the
> > > hierarchy is not needed at all here if i understood the requirement
> > > above.
> > > You still need to have a classifier in s/w (which could be attached to
> > > clsact egress) to select the queue. That is something the mlnx
> > > implementation allowed. So there is no "double queueing"
> >
> > AFAICS the current status of qdisc H/W offload implementation is a bit
> > mixed-up. e.g. HTB requires explicit syntax on the command line to
> > enable H/W offload, TBF doesn't.
> >
> > H/W offload enabled on MQPRIO implies skipping the software path, while
> > for HTB and TBF it doesn't.
> >
> > > If this is about totally bypassing s/w config then its a different ballgame..
> >
> > Yes, this does not have a s/w counterpart. It limits itself to
> > configuring/exposing H/W features.
> >
>
> I think this changes the dynamics altogether. Would IDPF[1] be a fit for this?
Thanks for the pointer, I admit we did not look closely at that before.
Anyway it looks like quite a different beast: we are looking for a way
to configure H/W shaping that could be applied to most/all/as much as
possible of the existing H/W, while AFAICS the above dictates a
specification for 'standard' high-perf netdev H/W. Only matching H/W
will be affected.
> > My take is that by configuring the shapers on a queue/device/queue
> > group/VF group basis, the admin is enforcing shared resource
> > reservation: we don't strictly need a software counterpart.
>
> I am assuming then, if the hw allows it one could run offloaded TC/htb
> on the queue/device/queue
HTB is one of the things we looked at. The existing offload interface
proved not to be enough for the use-case at hand.
We could extend it - or others, notably ndo_set_tx_maxrate() would be
simpler to adapt - but one request here is to avoid a delta to the H/W
offload dictated by a specific request and covering only/exactly that
specific request, which would lead to exactly this same process again
in some (little?) time.
The idea is to cover a reasonably large spectrum of the existing
capabilities related to H/W shaping - "once for all" I would say, if I
were not sure that such a statement would fire back in the most painful
way ;)
Thanks,
Paolo
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-10 8:33 ` Paolo Abeni
@ 2024-04-10 14:57 ` Jakub Kicinski
2024-04-11 15:58 ` Paolo Abeni
0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2024-04-10 14:57 UTC (permalink / raw)
To: Paolo Abeni
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Wed, 10 Apr 2024 10:33:50 +0200 Paolo Abeni wrote:
> On Tue, 2024-04-09 at 15:32 -0700, Jakub Kicinski wrote:
> > On Fri, 5 Apr 2024 11:23:13 +0100 Simon Horman wrote:
> > > /**
> > > * enum shaper_lookup_mode - Lookup method used to access a shaper
> > > * @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is unused
> > > * @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network device,
> > > * @id is unused
> > > * @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
> > > * @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
> > > * @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside the
> > > * shaper hierarchy as in shaper_info.id
> > > *
> > > * SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF, and SHAPER_LOOKUP_BY_TREE_ID are
> > > * only available on PF devices, usually inside the host/hypervisor.
> > > * SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but
> > > * only if the latter are privileged ones.
> >
> > The privileged part is an implementation limitation.
> > Limiting oneself
> > or subqueues should not require elevated privileges, in principle,
> > right?
>
> The reference to privileged functions is here to try to ensure proper
> isolation when required.
>
> E.g. let's suppose the admin on the host wants to restrict/limit
> the B/W for a given VF (the VF itself, not the representor! See below
> WRT shaper_lookup_mode) to some rate; he/she likely also intends to
> prevent the guest from relaxing that setting by configuring such a
> rate on the guest device.
1) representor is just a control path manifestation of the VF
any offload on the representor is for the VF - this is how forwarding
works, this is how "switchdev qdisc offload" works
2) it's a hierarchy, we can delegate one layer to the VF and the layer
under that to the eswitch manager. VF can set its limit to inf
but the eswitch layer should still enforce its limit
The driver implementation may constrain the number of layers,
delegation or form of the hierarchy, sure, but as I said, that's an
implementation limitation (which reminds me -- remember to add extack
to the "write" ops)
> > > enum shaper_lookup_mode {
> > > SHAPER_LOOKUP_BY_PORT,
> > > SHAPER_LOOKUP_BY_NETDEV,
> > > SHAPER_LOOKUP_BY_VF,
> > > SHAPER_LOOKUP_BY_QUEUE,
> > > SHAPER_LOOKUP_BY_TREE_ID,
> >
> > Two questions.
> >
> > 1) are these looking up real nodes or some special kind of node which
> > can't have settings assigned directly?
> > IOW if I want to rate limit
> > the port do I get + set the port object or create a node above it and
> > link it up?
>
> There is no concept of such special kind of nodes. Probably the above
> enum needs a better/clearer definition of each element.
> How to reach a specific configuration for the port shaper depends on
> the NIC defaults - whatever hierarchy it provides/creates at
> initialization time.
>
> The NIC/firmware can either provide a default shaper for the port level
> - in such case the user/admin just need to set it. Otherwise user/admin
> will have to create the shaper and link it.
What's a default limit for a port? Line rate?
I understand that the scheduler can't be "disabled" but that doesn't mean
we must expose "noop" schedulers as if the user had created them.
Let me step back. The goal, AFAIU, was to create an internal API into
which we can render existing APIs. We can combine settings from mqprio
and sysfs etc. and create a desired abstract hierarchy to present to
the driver. That's doable.
Transforming arbitrary pre-existing driver hierarchy to achieve what
the uAPI wants.. would be an NP hard problem, no?
> I guess the first option should be more common/the expected one.
>
> This proposal allows both cases.
>
> > Or do those special nodes not exit implicitly (from the example it
> > seems like they do?)
s/exit/exist/
> Could you please re-phrase the above?
Basically, whether a dump of the hierarchy is empty at the start.
> > 2) the objects themselves seem rather different. I'm guessing we need
> > port/netdev/vf because either some users don't want to use switchdev
> > (vf = repr netdev) or drivers don't implement it "correctly" (port !=
> > netdev?!)?
>
> Yes, the nodes inside the hierarchy can be linked to very different
> objects. The different lookup modes are there just to provide easy
> access to the relevant objects.
>
> Likely a much better description is needed: 'port' here is really the
> cable plug level, 'netdev' refers to the Linux network device. There
> could be multiple netdevs for the same port, as 'netdev' could refer
> either to a PF or to a VF. Finally, VF is really the virtual function
> device, not the representor, so that the host can configure/limit the
> guest tx rate.
>
> The same shaper can be reached/looked-up with different ids.
>
> e.g. the device level shaper for a virtual function can be reached
> with:
>
> - SHAPER_LOOKUP_BY_TREE_ID + unique tree id (every node is reachable
> this way) from any host device in the same hierarchy
> - SHAPER_LOOKUP_BY_VF + virtual function id, from the host PF device
> - SHAPER_LOOKUP_BY_NETDEV, from the guest VF device
>
> > Also feels like some of these are root nodes, some are leaf nodes?
>
> There is a single root node (the port's parent), possibly many internal
> nodes (port, netdev, vf, possibly more intermediate levels depending on
> the actual configuration [e.g. the firmware or the admin could create
> 'device groups' or 'queue groups']) and likely many leaf nodes (queue
> level).
>
> My personal take is that, from an API point of view, differentiating
> between leaves and internal nodes makes the API more complex with no
> practical advantage for the API users.
If you allow them to be configured.
What if we consider netdev/queue node as "exit points" of the tree,
to which a layer of actual scheduling nodes can be attached, rather
than doing scheduling by themselves?
> > > int (*get)(struct net_device *dev,
> > > enum shaper_lookup_mode lookup_mode, u32 id,
> > > struct shaper_info *shaper, struct netlink_ext_ack *extack);
> >
> > How about we store the hierarchy in the core?
> > Assume core has the source of truth, no need to get?
>
> One design choice was _not_ duplicating the hierarchy in the core: the
> full hierarchy is maintained only by the NIC/firmware. The NIC/firmware
> can perform some changes "automatically" e.g. when adding or deleting
> VFs or queues it will likely create/delete the paired shaper. The
> synchronization looks cumbersome and error prone.
The core only needs to maintain the hierarchy of what's in its domain of
control.
> The hierarchy could be exposed to both host and guests, I guess only
> the host core could be the source of truth, right?
I guess you're saying that the VF configuration may be implemented by asking
the PF driver to perform the actions? Perhaps we can let the driver allocate
and maintain its own hierarchy but yes, we don't have much gain from holding
that in the core.
This is the picture in my mind:
PF / eswitch domain
Q0 ]-> [50G] | RR | >-[ PF netdev = pf repr ] - |
Q1 ]-> [50G] | | |
|
----------------------------------------- |
VF1 domain | |
| |
Q0 ]-> | RR | - [35G] >-[ VF netdev x vf repr ]-> [50G]- | RR | >- port
Q1 ]-> | | | |
| |
----------------------------------------- |
|
------------------------------------------- |
VF2 domain | |
| |
Q0 ]-> [100G] -> |0 SP | >-[ VF net.. x vf r ]-> [50G] - |
Q1 ]-> [200G] -> |1 | | |
-------------------------------------------
In this contrived example we have VF1 which limited itself to 35G.
VF2 limited each queue to 100G and 200G (ignored, eswitch limit is lower)
and set strict priority between queues.
PF limits each of its queues, and VFs to 50G, no rate limit on the port.
"x" means we cross domains, "=" purely splices one hierarchy with another.
The hierarchy for netdevs always starts with a queue and ends in a netdev.
The hierarchy for eswitch has just netdevs at each end (hierarchy is
shared by all netdevs with the same switchdev id).
If the eswitch implementation is not capable of having a proper repr for PFs
the PF queues feed directly into the port.
The final RR node may be implicit (if the hierarchy has loose ends, they
are assumed to RR at the last possible point before egress).
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-10 14:57 ` Jakub Kicinski
@ 2024-04-11 15:58 ` Paolo Abeni
2024-04-11 16:03 ` Jakub Kicinski
0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2024-04-11 15:58 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Wed, 2024-04-10 at 07:57 -0700, Jakub Kicinski wrote:
> On Wed, 10 Apr 2024 10:33:50 +0200 Paolo Abeni wrote:
> >
> > The reference to privileged functions is here to try to ensure proper
> > isolation when required.
> >
> > E.g. let's suppose the admin on the host wants to restrict/limit
> > the B/W for a given VF (the VF itself, not the representor! See below
> > WRT shaper_lookup_mode) to some rate; he/she likely also intends to
> > prevent the guest from relaxing that setting by configuring such a
> > rate on the guest device.
>
> 1) representor is just a control path manifestation of the VF
> any offload on the representor is for the VF - this is how forwarding
> works, this is how "switchdev qdisc offload" works
>
> 2) it's a hierarchy, we can delegate one layer to the VF and the layer
> under that to the eswitch manager. VF can set its limit to inf
> but the eswitch layer should still enforce its limit
> The driver implementation may constrain the number of layers,
> delegation or form of the hierarchy, sure, but as I said, that's an
> implementation limitation (which reminds me -- remember to add extack
> to the "write" ops)
>
> > > > enum shaper_lookup_mode {
> > > > SHAPER_LOOKUP_BY_PORT,
> > > > SHAPER_LOOKUP_BY_NETDEV,
> > > > SHAPER_LOOKUP_BY_VF,
> > > > SHAPER_LOOKUP_BY_QUEUE,
> > > > SHAPER_LOOKUP_BY_TREE_ID,
> > >
> > > Two questions.
> > >
> > > 1) are these looking up real nodes or some special kind of node which
> > > can't have settings assigned directly?
> > > IOW if I want to rate limit
> > > the port do I get + set the port object or create a node above it and
> > > link it up?
> >
> > There is no concept of such special kind of nodes. Probably the above
> > enum needs a better/clearer definition of each element.
> > How to reach a specific configuration for the port shaper depends on
> > the NIC defaults - whatever hierarchy it provides/creates at
> > initialization time.
> >
> > The NIC/firmware can either provide a default shaper for the port level
> > - in such case the user/admin just need to set it. Otherwise user/admin
> > will have to create the shaper and link it.
>
> What's a default limit for a port? Line rate?
> I understand that the scheduler can't be "disabled" but that doesn't mean
> we must expose "noop" schedulers as if user has created them.
>
> Let me step back. The goal, AFAIU, was to create an internal API into
> which we can render existing APIs. We can combine settings from mqprio
> and sysfs etc. and create a desired abstract hierarchy to present to
> the driver. That's doable.
> Transforming arbitrary pre-existing driver hierarchy to achieve what
> the uAPI wants.. would be an NP hard problem, no?
>
> > I guess the first option should be more common/the expected one.
> >
> > This proposal allows both cases.
> >
> > > Or do those special nodes not exit implicitly (from the example it
> > > seems like they do?)
>
> s/exit/exist/
>
> > Could you please re-phrase the above?
>
> Basically whether dump of the hierarchy is empty at the start.
>
> > > 2) the objects themselves seem rather different. I'm guessing we need
> > > port/netdev/vf because either some users don't want to use switchdev
> > > (vf = repr netdev) or drivers don't implement it "correctly" (port !=
> > > netdev?!)?
> >
> > Yes, the nodes inside the hierarchy can be linked to very different
> > objects. The different lookup modes are there just to provide easy
> > access to the relevant objects.
> >
> > Likely a much better description is needed: 'port' here is really the
> > cable plug level, 'netdev' refers to the Linux network device. There
> > could be multiple netdevs for the same port, as 'netdev' could refer
> > either to a PF or to a VF. Finally, VF is really the virtual function
> > device, not the representor, so that the host can configure/limit the
> > guest tx rate.
> >
> > The same shaper can be reached/looked-up with different ids.
> >
> > e.g. the device level shaper for a virtual function can be reached
> > with:
> >
> > - SHAPER_LOOKUP_BY_TREE_ID + unique tree id (every node is reachable
> > this way) from any host device in the same hierarchy
> > - SHAPER_LOOKUP_BY_VF + virtual function id, from the host PF device
> > - SHAPER_LOOKUP_BY_NETDEV, from the guest VF device
> >
> > > Also feels like some of these are root nodes, some are leaf nodes?
> >
> > There is a single root node (the port's parent), possibly many internal
> > nodes (port, netdev, vf, possibly more intermediate levels depending on
> > the actual configuration [e.g. the firmware or the admin could create
> > 'device groups' or 'queue groups']) and likely many leaf nodes (queue
> > level).
> >
> > My personal take is that, from an API point of view, differentiating
> > between leaves and internal nodes makes the API more complex with no
> > practical advantage for the API users.
>
> If you allow them to be configured.
>
> What if we consider netdev/queue node as "exit points" of the tree,
> to which a layer of actual scheduling nodes can be attached, rather
> than doing scheduling by themselves?
>
> > > > int (*get)(struct net_device *dev,
> > > > enum shaper_lookup_mode lookup_mode, u32 id,
> > > > struct shaper_info *shaper, struct netlink_ext_ack *extack);
> > >
> > > How about we store the hierarchy in the core?
> > > Assume core has the source of truth, no need to get?
> >
> > One design choice was _not_ duplicating the hierarchy in the core: the
> > full hierarchy is maintained only by the NIC/firmware. The NIC/firmware
> > can perform some changes "automatically" e.g. when adding or deleting
> > VFs or queues it will likely create/delete the paired shaper. The
> > synchronization looks cumbersome and error prone.
>
> The core only needs to maintain the hierarchy of what's in its domain of
> control.
>
> > The hierarchy could be exposed to both host and guests, I guess only
> > the host core could be the source of truth, right?
>
> I guess you're saying that the VF configuration may be implemented by asking
> the PF driver to perform the actions? Perhaps we can let the driver allocate
> and maintain its own hierarchy but yes, we don't have much gain from holding
> that in the core.
>
> This is the picture in my mind:
>
> PF / eswitch domain
>
> Q0 ]-> [50G] | RR | >-[ PF netdev = pf repr ] - |
> Q1 ]-> [50G] | | |
> |
> ----------------------------------------- |
> VF1 domain | |
> | |
> Q0 ]-> | RR | - [35G] >-[ VF netdev x vf repr ]-> [50G]- | RR | >- port
> Q1 ]-> | | | |
> | |
> ----------------------------------------- |
> |
> ------------------------------------------- |
> VF2 domain | |
> | |
> Q0 ]-> [100G] -> |0 SP | >-[ VF net.. x vf r ]-> [50G] - |
> Q1 ]-> [200G] -> |1 | | |
> -------------------------------------------
>
> In this contrived example we have VF1 which limited itself to 35G.
> VF2 limited each queue to 100G and 200G (ignored, eswitch limit is lower)
> and set strict priority between queues.
> PF limits each of its queues, and VFs to 50G, no rate limit on the port.
>
> "x" means we cross domains, "=" purely splices one hierarchy with another.
>
> The hierarchy for netdevs always starts with a queue and ends in a netdev.
> The hierarchy for eswitch has just netdevs at each end (hierarchy is
> shared by all netdevs with the same switchdev id).
>
> If the eswitch implementation is not capable of having a proper repr for PFs
> the PF queues feed directly into the port.
>
> The final RR node may be implicit (if the hierarchy has loose ends, they
> are assumed to RR at the last possible point before egress).
Let me try to wrap-up all the changes suggested above:
- we need to clearly define the initial/default status (possibly no b/w
limits and all the objects on the same level doing RR)
- The hierarchy controlled by the API should show only
non-default/user-configured nodes
- We need to drop the references to privileged VFs.
- The core should maintain the full status of the user-provided
configuration changes (say, the 'delta' hierarchy)
Am I missing something?
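[Editor's note: the 'delta' hierarchy point above can be sketched in a few lines of C. All names and the fixed-size table are assumed for illustration; the sketch only shows the semantics that the core records exclusively what the user configured, so a dump of an untouched device is empty.]

```c
#include <assert.h>

/* Sketch (assumed names): the core keeps only the 'delta' hierarchy,
 * i.e. the nodes the user actually configured.  Everything else stays
 * implicit in the NIC/firmware defaults.
 */
#define DELTA_MAX 16

struct delta_entry {
	unsigned int tree_id;
	unsigned long bw_max;
};

struct delta_hierarchy {
	struct delta_entry e[DELTA_MAX];
	int n;
};

/* Record a user-provided setting; overwrite if already present. */
static int delta_set(struct delta_hierarchy *d, unsigned int tree_id,
		     unsigned long bw_max)
{
	int i;

	for (i = 0; i < d->n; i++) {
		if (d->e[i].tree_id == tree_id) {
			d->e[i].bw_max = bw_max;
			return 0;
		}
	}
	if (d->n == DELTA_MAX)
		return -1;	/* table full, e.g. -ENOSPC */
	d->e[d->n].tree_id = tree_id;
	d->e[d->n].bw_max = bw_max;
	d->n++;
	return 0;
}

/* A dump shows only non-default nodes: just the entry count here. */
static int delta_dump_size(const struct delta_hierarchy *d)
{
	return d->n;
}
```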
Also it's not 110% clear to me the implication of:
> consider netdev/queue node as "exit points" of the tree,
> to which a layer of actual scheduling nodes can be attached
could you please rephrase a bit?
I have the feeling that the points above should not require significant
changes to the API defined here, mainly clearer documentation, but
I'll have a better look.
Thanks!
Paolo
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-11 15:58 ` Paolo Abeni
@ 2024-04-11 16:03 ` Jakub Kicinski
2024-04-19 11:53 ` Paolo Abeni
0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2024-04-11 16:03 UTC (permalink / raw)
To: Paolo Abeni
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Thu, 11 Apr 2024 17:58:59 +0200 Paolo Abeni wrote:
> > In this contrived example we have VF1 which limited itself to 35G.
> > VF2 limited each queue to 100G and 200G (ignored, eswitch limit is lower)
> > and set strict priority between queues.
> > PF limits each of its queues, and VFs to 50G, no rate limit on the port.
> >
> > "x" means we cross domains, "=" purely splices one hierarchy with another.
> >
> > The hierarchy for netdevs always starts with a queue and ends in a netdev.
> > The hierarchy for eswitch has just netdevs at each end (hierarchy is
> > shared by all netdevs with the same switchdev id).
> >
> > If the eswitch implementation is not capable of having a proper repr for PFs
> > the PF queues feed directly into the port.
> >
> > The final RR node may be implicit (if the hierarchy has loose ends, they
> > are assumed to RR at the last possible point before egress).
>
> Let me try to wrap up all the changes suggested above:
>
> - we need to clearly define the initial/default status (possibly no b/w
> limits and all the objects on the same level doing RR)
>
> - The hierarchy controlled by the API should show only non-default/
> user-configured nodes
>
> - We need to drop the references to privileged VFs.
>
> - The core should maintain the full status of the user-provided
> configuration changes (say, the 'delta' hierarchy)
>
> Am I missing something?
LG
> Also it's not 110% clear to me the implication of:
>
> > consider netdev/queue node as "exit points" of the tree,
> > to which a layer of actual scheduling nodes can be attached
>
> could you please rephrase a bit?
>
> I have the feeling that the points above should not require significant
> changes to the API defined here, mainly clearer documentation, but
> I'll have a better look.
They don't have to be nodes. They can appear as parent or child of
a real node, but they don't themselves carry any configuration.
IOW you can represent them as a special encoding of the ID field,
rather than a real node.
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-05 10:23 [RFC] HW TX Rate Limiting Driver API Simon Horman
` (2 preceding siblings ...)
2024-04-09 22:32 ` Jakub Kicinski
@ 2024-04-11 23:51 ` Vinicius Costa Gomes
2024-04-12 4:39 ` John Fastabend
2024-04-22 11:30 ` Sunil Kovvuri Goutham
4 siblings, 1 reply; 21+ messages in thread
From: Vinicius Costa Gomes @ 2024-04-11 23:51 UTC (permalink / raw)
To: Simon Horman, netdev
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala,
Paolo Abeni
Hi,
Simon Horman <horms@kernel.org> writes:
> Hi,
>
> This is a follow-up to the ongoing discussion started by Intel to extend the
> support for TX shaping H/W offload [1].
>
> The goal is allowing the user-space to configure TX shaping offload on a
> per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> VF device.
>
What about non-VF cases? Would they be out of scope?
>
> In the past few months several different solutions were attempted and
> discussed, without finding a perfect fit:
>
> - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> - No existing TC qdisc offload covers the required feature set
> - HTB does not allow direct queue configuration
> - MQPRIO imposes constraint on the maximum number of TX queues
> - TBF does not support max B/W limit
> - ndo_set_tx_maxrate() only controls the max B/W limit
>
Another question is how "to plug" in different shaper algorithms. For
example, the TSN world defines the Credit Based Shaper (IEEE 802.1Q-2018
Annex L gives a good overview), which tries to be accurate over
sub-millisecond intervals.
(sooner or later, some NIC with lots of queues will appear with TSN
features, and I guess some people would like to know that they are using
the expected shaper)
> A new H/W offload API is needed, but offload API proliferation should be
> avoided.
>
> The following proposal intends to cover the above specified requirement and
> provide a possible base to unify all the shaping offload APIs mentioned above.
>
> The following only defines the in-kernel interface between the core and
> drivers. The intention is to expose the feature to user-space via Netlink.
> Hopefully the latter part should be straight-forward after agreement
> on the in-kernel interface.
>
Another thing that MQPRIO (indirectly) gives is the ability for userspace
applications to have some amount of control over which queue their packets
will end up in, via skb->priority.
Would this new shaper hierarchy have something that would fill this
role? (if this is for VF-only use cases, then the answer would be "no" I
guess)
(I tried to read the whole thread, sorry if I missed something)
> All feedback and comments are more than welcome!
>
> [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
>
> Regards,
> Simon with much assistance from Paolo
>
> ---
> /* SPDX-License-Identifier: GPL-2.0-or-later */
>
> #ifndef _NET_SHAPER_H_
> #define _NET_SHAPER_H_
>
> /**
> * enum shaper_metric - the metric of the shaper
> * @SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
> * @SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> */
> enum shaper_metric {
> SHAPER_METRIC_PPS,
> SHAPER_METRIC_BPS,
> };
>
> #define SHAPER_ROOT_ID 0
> #define SHAPER_NONE_ID UINT_MAX
>
> /**
> * struct shaper_info - represent a node of the shaper hierarchy
> * @id: Unique identifier inside the shaper tree.
> * @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
> * no parent. Only the root shaper has no parent.
> * @metric: Specify if the bw limits refers to PPS or BPS
> * @bw_min: Minimum guaranteed rate for this shaper
> * @bw_max: Maximum peak bw allowed for this shaper
> * @burst: Maximum burst for the peak rate of this shaper
> * @priority: Scheduling priority for this shaper
> * @weight: Scheduling weight for this shaper
> *
> * The full shaper hierarchy is maintained only by the
> * NIC driver (or firmware), possibly in a NIC-specific format
> * and/or in H/W tables.
> * The kernel uses this representation and the shaper_ops to
> * access, traverse, and update it.
> */
> struct shaper_info {
> /* The following fields allow the full traversal of the whole
> * hierarchy.
> */
> u32 id;
> u32 parent_id;
>
> /* The following fields define the behavior of the shaper. */
> enum shaper_metric metric;
> u64 bw_min;
> u64 bw_max;
> u32 burst;
> u32 priority;
> u32 weight;
> };
>
> /**
> * enum shaper_lookup_mode - Lookup method used to access a shaper
> * @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is unused
> * @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network device,
> * @id is unused
> * @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
> * @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
> * @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside the
> * shaper hierarchy as in shaper_info.id
> *
> * SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF and SHAPER_LOOKUP_BY_TREE_ID
> * are only available on PF devices, usually inside the host/hypervisor.
> * SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but
> * only if the latter are privileged ones.
> * The same shaper can be reached with different lookup mode/id pairs,
> * mapping network visible objects (devices, VFs, queues) to the scheduler
> * hierarchy and vice-versa.
> */
> enum shaper_lookup_mode {
> SHAPER_LOOKUP_BY_PORT,
> SHAPER_LOOKUP_BY_NETDEV,
> SHAPER_LOOKUP_BY_VF,
> SHAPER_LOOKUP_BY_QUEUE,
> SHAPER_LOOKUP_BY_TREE_ID,
> };
>
>
> /**
> * struct shaper_ops - Operations on shaper hierarchy
> * @get: Access the specified shaper.
> * @set: Modify the specified shaper.
> * @move: Move the specified shaper inside the hierarchy.
> * @add: Add a shaper inside the shaper hierarchy.
> * @delete: Delete the specified shaper.
> *
> * The netdevice exposes a pointer to these ops.
> *
> * It’s up to the driver or firmware to create the default shapers hierarchy,
> * according to the H/W capabilities.
> */
> struct shaper_ops {
> /* get - Fetch the specified shaper, if it exists
> * @dev: Netdevice to operate on.
> * @lookup_mode: How to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Object to return shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Multiple lookup mode/id pairs can refer to the same shaper.
> * And multiple entities (e.g. VF and PF) can try to access the same
> * shaper concurrently.
> *
> * Values of @id depend on @lookup_mode:
> * * If @lookup_mode is SHAPER_LOOKUP_BY_PORT or
> * SHAPER_LOOKUP_BY_NETDEV, then @id is unused.
> * * If @lookup_mode is SHAPER_LOOKUP_BY_VF,
> * then @id is a virtual function number, relative to @dev,
> * which should be a physical function.
> * * If @lookup_mode is SHAPER_LOOKUP_BY_QUEUE,
> * then @id represents the queue number, relative to @dev.
> * * If @lookup_mode is SHAPER_LOOKUP_BY_TREE_ID,
> * then @id is a @shaper_info.id and any shaper inside the
> * hierarchy can be accessed directly.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error value on failure.
> */
> int (*get)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> struct shaper_info *shaper, struct netlink_ext_ack *extack);
>
> /* set - Update the specified shaper, if it exists
> * @dev: Netdevice to operate on.
> * @lookup_mode: How to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Configuration of shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Configure the parameters of @shaper according to values supplied
> * in the following fields:
> * * @shaper.metric
> * * @shaper.bw_min
> * * @shaper.bw_max
> * * @shaper.burst
> * * @shaper.priority
> * * @shaper.weight
> * Values supplied in other fields of @shaper must be zero and,
> * other than verifying that, are ignored.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*set)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> const struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
>
> /* move - Change the parent id of the specified shaper
> * @dev: netdevice to operate on.
> * @lookup_mode: how to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @new_parent_id: new ID of the parent shaper,
> * always relative to the SHAPER_LOOKUP_BY_TREE_ID
> * lookup mode
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Move the specified shaper in the hierarchy replacing its
> * current parent shaper with @new_parent_id
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*move)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> u32 new_parent_id, struct netlink_ext_ack *extack);
>
> /* add - Add a shaper inside the shaper hierarchy
> * @dev: netdevice to operate on.
> * @shaper: configuration of shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * @shaper.id must be set to SHAPER_NONE_ID as
> * the id for the shaper will be automatically allocated.
> * @shaper.parent_id determines where inside the shaper's tree
> * this node is inserted.
> *
> * Return:
> * * non-negative shaper id on success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> *
> * Examples of reasons this operation may fail include:
> * * H/W resources limits.
> * * The parent is a ‘leaf’ node - attached to a queue.
> * * Can’t respect the requested bw limits.
> */
> int (*add)(struct net_device *dev, const struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
>
> /* delete - Remove the specified shaper from the hierarchy
> * @dev: netdevice to operate on.
> * @lookup_mode: how to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Object to return the deleted shaper configuration.
> * Ignored if NULL.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*delete)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode,
> u32 id, struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
> };
>
> /*
> * Examples:
> * - set shaping on a given queue
> * struct shaper_info info = { // fill this };
> * dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, NULL);
> *
> * - create a queue group with a queue group shaping limits.
> * Assuming the following topology already exists:
> * < netdev shaper >
> * / \
> * <queue 0 shaper> . . . <queue N shaper>
> *
> * struct shaper_info pinfo, ginfo;
> * dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_NETDEV, 0, &pinfo, NULL);
> *
> * ginfo.parent_id = pinfo.id;
> * // fill-in other shaper params...
> * new_node_id = dev->shaper_ops->add(dev, &ginfo, NULL);
> *
> * // now topology is:
> * // < netdev shaper >
> * // / | \
> * // / | <newly created shaper>
> * // / |
> * // <queue 0 shaper> . . . <queue N shaper>
> *
> * // move the shapers for queues 0..2 into the new queue group
> * for (i = 0; i <= 2; ++i)
> * dev->shaper_ops->move(dev, SHAPER_LOOKUP_BY_QUEUE, i, new_node_id, NULL);
> *
> * // now topology is:
> * // < netdev shaper >
> * // / \
> * // <newly created shaper> <queue 3 shaper> ... <queue n shaper>
> * // / \
> * // <queue 0 shaper> ... <queue 2 shaper>
> */
> #endif
>
>
--
Vinicius
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-11 23:51 ` Vinicius Costa Gomes
@ 2024-04-12 4:39 ` John Fastabend
0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2024-04-12 4:39 UTC (permalink / raw)
To: Vinicius Costa Gomes, Simon Horman, netdev
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala,
Paolo Abeni
Vinicius Costa Gomes wrote:
> Hi,
>
> Simon Horman <horms@kernel.org> writes:
>
> > Hi,
> >
> > This is a follow-up to the ongoing discussion started by Intel to extend the
> > support for TX shaping H/W offload [1].
> >
> > The goal is allowing the user-space to configure TX shaping offload on a
> > per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> > VF device.
> >
>
> What about non-VF cases? Would it be out of scope?
>
> >
> > In the past few months several different solutions were attempted and
> > discussed, without finding a perfect fit:
> >
> > - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> > - No existing TC qdisc offload covers the required feature set
> > - HTB does not allow direct queue configuration
> > - MQPRIO imposes constraint on the maximum number of TX queues
> > - TBF does not support max B/W limit
> > - ndo_set_tx_maxrate() only controls the max B/W limit
> >
>
> Another question is how "to plug" in different shaper algorithms. For
> example, the TSN world defines the Credit Based Shaper (IEEE 802.1Q-2018
> Annex L gives a good overview), which tries to be accurate over
> sub-millisecond intervals.
>
> (sooner or later, some NIC with lots of queues will appear with TSN
> features, and I guess some people would like to know that they are using
> the expected shaper)
>
> > A new H/W offload API is needed, but offload API proliferation should be
> > avoided.
> >
> > The following proposal intends to cover the above specified requirement and
> > provide a possible base to unify all the shaping offload APIs mentioned above.
> >
> > The following only defines the in-kernel interface between the core and
> > drivers. The intention is to expose the feature to user-space via Netlink.
> > Hopefully the latter part should be straight-forward after agreement
> > on the in-kernel interface.
> >
>
> Another thing that MQPRIO (indirectly) gives is the ability for userspace
> applications to have some amount of control over which queue their packets
> will end up in, via skb->priority.
You can attach a BPF program now to set the queue_mapping. So one way
to do this would be to have a simple BPF program that maps priority
to a set of queues. We could likely include it somewhere in the source
or tooling to make it easily available for folks.
I agree having a way to map applications/packets to QOS is useful. The
nice bit about allowing BPF to do this is you don't have to figure out
how to set the priority on the socket and can use whatever policy makes
sense.
>
> Would this new shaper hierarchy have something that would fill this
> role? (if this is for VF-only use cases, then the answer would be "no" I
> guess)
>
> (I tried to read the whole thread, sorry if I missed something)
>
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-11 16:03 ` Jakub Kicinski
@ 2024-04-19 11:53 ` Paolo Abeni
2024-04-22 18:06 ` Jakub Kicinski
0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2024-04-19 11:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Thu, 2024-04-11 at 09:03 -0700, Jakub Kicinski wrote:
> On Thu, 11 Apr 2024 17:58:59 +0200 Paolo Abeni wrote:
> >
> > Also it's not 110% clear to me the implication of:
> >
> > > consider netdev/queue node as "exit points" of the tree,
> > > to which a layer of actual scheduling nodes can be attached
> >
> > could you please rephrase a bit?
> >
> > I have the feeling that the points above should not require significant
> > changes to the API defined here, mainly clearer documentation, but
> > I'll have a better look.
>
> They don't have to be nodes. They can appear as parent or child of
> a real node, but they don't themselves carry any configuration.
>
> IOW you can represent them as a special encoding of the ID field,
> rather than a real node.
I'm sorry for the latency, I got distracted elsewhere.
The benefit of introducing this 'attach points' concept is not clear to me.
With the current proposal, configuring a queue shaper would be:
info.bw_min = ...
dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
and restoring the default could be either:
info.bw_min = 0;
dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
or:
dev->shaper_ops->delete(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
With the 'attach points' I guess it will be something alike the
following (am not defining a different node type here just to keep the
example short):
# configure a queue shaper
struct shaper_info attach_info;
dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &attach_info, &ack);
info.parent_id = attach_info.id;
info.bw_min = ...
new_node_id = dev->shaper_ops->add(dev, &info, &ack);
# restore defaults:
dev->shaper_ops->delete(dev, SHAPER_LOOKUP_BY_TREE_ID, new_node_id, &info, &ack);
Likely some additional operation would be needed to directly
traverse/fetch the actual shaper present at the attach points?
I think the operations would be simpler without the 'attach points'; am
I missing something?
A clear convention/definition of the semantics of deleting shapers
at specific locations (e.g. restoring the default behaviour) should
suffice, together with the current schema.
Thanks,
Paolo
* RE: [RFC] HW TX Rate Limiting Driver API
2024-04-05 10:23 [RFC] HW TX Rate Limiting Driver API Simon Horman
` (3 preceding siblings ...)
2024-04-11 23:51 ` Vinicius Costa Gomes
@ 2024-04-22 11:30 ` Sunil Kovvuri Goutham
2024-04-23 14:07 ` Paolo Abeni
4 siblings, 1 reply; 21+ messages in thread
From: Sunil Kovvuri Goutham @ 2024-04-22 11:30 UTC (permalink / raw)
To: Simon Horman, netdev@vger.kernel.org
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala,
Paolo Abeni
> -----Original Message-----
> From: Simon Horman <horms@kernel.org>
> Sent: Friday, April 5, 2024 3:53 PM
> To: netdev@vger.kernel.org
> Cc: Jakub Kicinski <kuba@kernel.org>; Jiri Pirko <jiri@resnulli.us>; Madhu
> Chittim <madhu.chittim@intel.com>; Sridhar Samudrala
> <sridhar.samudrala@intel.com>; Paolo Abeni <pabeni@redhat.com>
> Subject: [EXTERNAL] [RFC] HW TX Rate Limiting Driver API
>
> Hi,
>
> This is a follow-up to the ongoing discussion started by Intel to extend the
> support for TX shaping H/W offload [1].
>
> The goal is allowing the user-space to configure TX shaping offload on a per-
> queue basis with min guaranteed B/W, max B/W limit and burst size on a VF
> device.
>
>
> In the past few months several different solutions were attempted and
> discussed, without finding a perfect fit:
>
> - devlink_rate APIs are not appropriate to control TX shaping on netdevs
> - No existing TC qdisc offload covers the required feature set
> - HTB does not allow direct queue configuration
> - MQPRIO imposes constraint on the maximum number of TX queues
> - TBF does not support max B/W limit
> - ndo_set_tx_maxrate() only controls the max B/W limit
>
> A new H/W offload API is needed, but offload API proliferation should be
> avoided.
>
> The following proposal intends to cover the above specified requirement and
> provide a possible base to unify all the shaping offload APIs mentioned above.
>
> The following only defines the in-kernel interface between the core and
> drivers. The intention is to expose the feature to user-space via Netlink.
> Hopefully the latter part should be straight-forward after agreement on the in-
> kernel interface.
>
> All feedback and comments are more than welcome!
>
> [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
>
> Regards,
> Simon with much assistance from Paolo
>
> ---
> /* SPDX-License-Identifier: GPL-2.0-or-later */
>
> #ifndef _NET_SHAPER_H_
> #define _NET_SHAPER_H_
>
> /**
> * enum shaper_metric - the metric of the shaper
> * @SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
> * @SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> */
> enum shaper_metric {
> SHAPER_METRIC_PPS,
> SHAPER_METRIC_BPS,
> };
>
> #define SHAPER_ROOT_ID 0
> #define SHAPER_NONE_ID UINT_MAX
>
> /**
> * struct shaper_info - represent a node of the shaper hierarchy
> * @id: Unique identifier inside the shaper tree.
> * @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
> * no parent. Only the root shaper has no parent.
> * @metric: Specify if the bw limits refers to PPS or BPS
> * @bw_min: Minimum guaranteed rate for this shaper
> * @bw_max: Maximum peak bw allowed for this shaper
> * @burst: Maximum burst for the peak rate of this shaper
> * @priority: Scheduling priority for this shaper
> * @weight: Scheduling weight for this shaper
> *
> * The full shaper hierarchy is maintained only by the
> * NIC driver (or firmware), possibly in a NIC-specific format
> * and/or in H/W tables.
> * The kernel uses this representation and the shaper_ops to
> * access, traverse, and update it.
> */
> struct shaper_info {
> /* The following fields allow the full traversal of the whole
> * hierarchy.
> */
> u32 id;
> u32 parent_id;
>
> /* The following fields define the behavior of the shaper. */
> enum shaper_metric metric;
> u64 bw_min;
> u64 bw_max;
> u32 burst;
> u32 priority;
> u32 weight;
> };
>
'bw_min'/'bw_max' are u64 while 'burst' and 'weight' are u32, any specific reason?
> /**
> * enum shaper_lookup_mode - Lookup method used to access a shaper
> * @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is
> unused
> * @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network
> device,
> * @id is unused
> * @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
> * @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
> * @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside
> the
> * shaper hierarchy as in shaper_info.id
> *
> * SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF and SHAPER_LOOKUP_BY_TREE_ID
> * are only available on PF devices, usually inside the host/hypervisor.
> * SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but
> * only if the latter are privileged ones.
> * The same shaper can be reached with different lookup mode/id pairs,
> * mapping network visible objects (devices, VFs, queues) to the scheduler
> * hierarchy and vice-versa.
> */
> enum shaper_lookup_mode {
> SHAPER_LOOKUP_BY_PORT,
> SHAPER_LOOKUP_BY_NETDEV,
> SHAPER_LOOKUP_BY_VF,
> SHAPER_LOOKUP_BY_QUEUE,
> SHAPER_LOOKUP_BY_TREE_ID,
> };
>
>
> /**
> * struct shaper_ops - Operations on shaper hierarchy
> * @get: Access the specified shaper.
> * @set: Modify the specified shaper.
> * @move: Move the specified shaper inside the hierarchy.
> * @add: Add a shaper inside the shaper hierarchy.
> * @delete: Delete the specified shaper.
> *
> * The netdevice exposes a pointer to these ops.
> *
> * It’s up to the driver or firmware to create the default shapers hierarchy,
> * according to the H/W capabilities.
> */
> struct shaper_ops {
> /* get - Fetch the specified shaper, if it exists
> * @dev: Netdevice to operate on.
> * @lookup_mode: How to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Object to return shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Multiple lookup mode/id pairs can refer to the same shaper.
> * And multiple entities (e.g. VF and PF) can try to access the same
> * shaper concurrently.
> *
> * Values of @id depend on @lookup_mode:
> * * If @lookup_mode is SHAPER_LOOKUP_BY_PORT or
> * SHAPER_LOOKUP_BY_NETDEV, then @id is unused.
> * * If @lookup_mode is SHAPER_LOOKUP_BY_VF,
> * then @id is a virtual function number, relative to @dev,
> * which should be a physical function.
> * * If @lookup_mode is SHAPER_LOOKUP_BY_QUEUE,
> * then @id represents the queue number, relative to @dev.
> * * If @lookup_mode is SHAPER_LOOKUP_BY_TREE_ID,
> * then @id is a @shaper_info.id and any shaper inside the
> * hierarchy can be accessed directly.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error value on failure.
> */
> int (*get)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> struct shaper_info *shaper, struct netlink_ext_ack *extack);
>
> /* set - Update the specified shaper, if it exists
> * @dev: Netdevice to operate on.
> * @lookup_mode: How to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Configuration of shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Configure the parameters of @shaper according to values supplied
> * in the following fields:
> * * @shaper.metric
> * * @shaper.bw_min
> * * @shaper.bw_max
> * * @shaper.burst
> * * @shaper.priority
> * * @shaper.weight
> * Values supplied in other fields of @shaper must be zero and,
> * other than verifying that, are ignored.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*set)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> const struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
>
> /* move - Change the parent id of the specified shaper
> * @dev: netdevice to operate on.
> * @lookup_mode: how to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @new_parent_id: new ID of the parent shaper,
> * always relative to the SHAPER_LOOKUP_BY_TREE_ID
> * lookup mode
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Move the specified shaper in the hierarchy replacing its
> * current parent shaper with @new_parent_id
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*move)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode, u32 id,
> u32 new_parent_id, struct netlink_ext_ack *extack);
>
> /* add - Add a shaper inside the shaper hierarchy
> * @dev: netdevice to operate on.
> * @shaper: configuration of shaper.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * @shaper.id must be set to SHAPER_NONE_ID as
> * the id for the shaper will be automatically allocated.
> * @shaper.parent_id determines where inside the shaper's tree
> * this node is inserted.
> *
> * Return:
> * * non-negative shaper id on success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> *
> * Examples of reasons this operation may fail include:
> * * H/W resources limits.
> * * The parent is a ‘leaf’ node - attached to a queue.
> * * Can’t respect the requested bw limits.
> */
> int (*add)(struct net_device *dev, const struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
>
> /* delete - Remove the specified shaper from the hierarchy
> * @dev: netdevice to operate on.
> * @lookup_mode: how to perform the shaper lookup
> * @id: ID of the specified shaper,
> * relative to the specified @lookup_mode.
> * @shaper: Object to return the deleted shaper configuration.
> * Ignored if NULL.
> * @extack: Netlink extended ACK for reporting errors.
> *
> * Return:
> * * %0 - Success
> * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> * or core for any reason. @extack should be set
> * to text describing the reason.
> * * Other negative error values on failure.
> */
> int (*delete)(struct net_device *dev,
> enum shaper_lookup_mode lookup_mode,
> u32 id, struct shaper_info *shaper,
> struct netlink_ext_ack *extack);
> };
>
> /*
> * Examples:
> * - set shaping on a given queue
> * struct shaper_info info = { // fill this };
> * dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id,
> &info, NULL);
> *
> * - create a queue group with a queue group shaping limits.
> * Assuming the following topology already exists:
> * < netdev shaper >
> * / \
> * <queue 0 shaper> . . . <queue N shaper>
> *
> * struct shaper_info pinfo, ginfo;
> * dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_NETDEV, 0, &pinfo, NULL);
> *
> * ginfo.parent_id = pinfo.id;
> * // fill-in other shaper params...
> * new_node_id = dev->shaper_ops->add(dev, &ginfo, NULL);
> *
> * // now topology is:
> * // < netdev shaper >
> * // / | \
> * // / | <newly created shaper>
> * // / |
> * // <queue 0 shaper> . . . <queue N shaper>
> *
> * // move the shapers for queues 0..2 into the new queue group
> * for (i = 0; i <= 2; ++i)
> * dev->shaper_ops->move(dev, SHAPER_LOOKUP_BY_QUEUE, i,
> * new_node_id, NULL);
> *
> * // now topology is:
> * // < netdev shaper >
> * // / \
> * // <newly created shaper> <queue 3 shaper> ... <queue n shaper>
> * // / \
> * // <queue 0 shaper> ... <queue 2 shaper>
> */
> #endif
>
Simon / Paolo,
Can you please confirm below
1. Does privileged VF mean trusted VF ?
2. With this we can setup below as well, from PF ?
-- Per-VF Quantum
-- Per-VF Strict priority
-- Per-VF ratelimit
3. Wondering if it would be beneficial to apply this to ingress traffic as well. All modes may or may not apply,
but at the least it could be possible to apply a PPS/BPS rate limit, so that the configuration methodology
remains the same across egress and ingress.
Thanks,
Sunil.
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-19 11:53 ` Paolo Abeni
@ 2024-04-22 18:06 ` Jakub Kicinski
2024-04-23 17:25 ` Paolo Abeni
0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2024-04-22 18:06 UTC (permalink / raw)
To: Paolo Abeni
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Fri, 19 Apr 2024 13:53:53 +0200 Paolo Abeni wrote:
> > They don't have to be nodes. They can appear as parent or child of
> > a real node, but they don't themselves carry any configuration.
> >
> > IOW you can represent them as a special encoding of the ID field,
> > rather than a real node.
>
> I'm sorry for the latency, I got distracted elsewhere.
>
> It's not clear the benefit of introducing this 'attach points' concept.
>
> With the current proposal, configuring a queue shaper would be:
>
> info.bw_min = ...
> dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
>
> and restoring the default could be either:
>
> info.bw_min = 0;
> dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
And presumably bw_max = 0 also means "delete", or will it be bw_max
= ~0?
> or:
>
> dev->shaper_ops->delete(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
Which confusingly will not actually delete the node, subsequent get()
will still return it.
> With the 'attach points' I guess it will be something alike the
> following (am not defining a different node type here just to keep the
> example short):
>
> # configure a queue shaper
> struct shaper_info attach_info;
> dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &attach_info, &ack);
> info.parent_id = attach_info.id;
> info.bw_min = ...
> new_node_id = dev->shaper_ops->add(dev, &info, &ack);
>
> # restore defaults:
> dev->shaper_ops->delete(dev, SHAPER_LOOKUP_BY_TREE_ID, new_node_id, &info, &ack);
>
> likely some additional operation would be needed to traverse/fetch
> directly the actual shaper present at the attach points???
Whether type + ID (here SHAPER_LOOKUP_BY_QUEUE, queue_id) identifies
the node sufficiently to avoid the get is orthogonal. Your ->set
example assumes you don't have to do a get first to find the exact
(synthetic) node ID. The same can be true for an ->add, if you prefer.
> I think the operations will be simpler without the 'attach points', am
> I missing something?
>
> A clear conventions/definition about the semantic of deleting shapers
> at specific locations (e.g. restoring the default behaviour) should
> suffice, together with the current schema.
I guess. I do find it odd that we have objects in multiple places of
the hierarchy when there is no configuration intended. Especially that
the HW may actually not support such configuration (say there is always
a DRR before the egress, now we insert a shaping stage there).
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-22 11:30 ` Sunil Kovvuri Goutham
@ 2024-04-23 14:07 ` Paolo Abeni
2024-04-23 15:56 ` Sunil Kovvuri Goutham
0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2024-04-23 14:07 UTC (permalink / raw)
To: Sunil Kovvuri Goutham, Simon Horman, netdev@vger.kernel.org
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala
On Mon, 2024-04-22 at 11:30 +0000, Sunil Kovvuri Goutham wrote:
>
> On Friday, April 5, 2024 3:53 PM Simon Horman <horms@kernel.org> wrote:
[...]
> > /**
> > * struct shaper_info - represent a node of the shaper hierarchy
> > * @id: Unique identifier inside the shaper tree.
> > * @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
> > * no parent. Only the root shaper has no parent.
> > * @metric: Specify if the bw limits refer to PPS or BPS
> > * @bw_min: Minimum guaranteed rate for this shaper
> > * @bw_max: Maximum peak bw allowed for this shaper
> > * @burst: Maximum burst for the peak rate of this shaper
> > * @priority: Scheduling priority for this shaper
> > * @weight: Scheduling weight for this shaper
> > *
> > * The full shaper hierarchy is maintained only by the
> > * NIC driver (or firmware), possibly in a NIC-specific format
> > * and/or in H/W tables.
> > * The kernel uses this representation and the shaper_ops to
> > * access, traverse, and update it.
> > */
> > struct shaper_info {
> > /* The following fields allow the full traversal of the whole
> > * hierarchy.
> > */
> > u32 id;
> > u32 parent_id;
> >
> > /* The following fields define the behavior of the shaper. */
> > enum shaper_metric metric;
> > u64 bw_min;
> > u64 bw_max;
> > u32 burst;
> > u32 priority;
> > u32 weight;
> > };
> >
>
> 'bw_min/max' is u64 and 'burst' / 'weight' are u32, any specific reason ?
A NIC can exceed UINT32_MAX bps, while UINT32_MAX different values look
more than enough to cope with different weights.
No strong opinions; we can increase the weight range, but it looks a
bit overkill to me.
[...]
> Can you please confirm below
> 1. Does privileged VF mean trusted VF ?
Yes. But keep in mind that given the current status of the discussion we
are going to drop the privileged/trusted VF reference.
> 2. With this we can setup below as well, from PF ?
> -- Per-VF Quantum
> -- Per-VF Strict priority
> -- Per-VF ratelimit
Assuming the NIC H/W supports it, yes: the host will have full control
over all the available shapers.
> 3. Wondering if it would be beneficial to apply this to ingress traffic as well, all modes may or may not apply,
> but at the least it could be possible to apply PPS/BPS ratelimit. So that the configuration methodology
> remains same across egress and ingress.
I think the API could be extended to the RX side, too, as a follow-
up/next step. This relatively simple feature has been in progress for a
significant time; I think it would be nice to have it in place first.
We will try to ensure there will be no naming clash for such extension.
Cheers,
Paolo
* RE: Re: [RFC] HW TX Rate Limiting Driver API
2024-04-23 14:07 ` Paolo Abeni
@ 2024-04-23 15:56 ` Sunil Kovvuri Goutham
0 siblings, 0 replies; 21+ messages in thread
From: Sunil Kovvuri Goutham @ 2024-04-23 15:56 UTC (permalink / raw)
To: Paolo Abeni, Simon Horman, netdev@vger.kernel.org
Cc: Jakub Kicinski, Jiri Pirko, Madhu Chittim, Sridhar Samudrala
> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, April 23, 2024 7:38 PM
> To: Sunil Kovvuri Goutham <sgoutham@marvell.com>; Simon Horman
> <horms@kernel.org>; netdev@vger.kernel.org
> Cc: Jakub Kicinski <kuba@kernel.org>; Jiri Pirko <jiri@resnulli.us>; Madhu
> Chittim <madhu.chittim@intel.com>; Sridhar Samudrala
> <sridhar.samudrala@intel.com>
> Subject: [EXTERNAL] Re: [RFC] HW TX Rate Limiting Driver API
>
> On Mon, 2024-04-22 at 11:30 +0000, Sunil Kovvuri Goutham wrote:
> >
> > On Friday, April 5, 2024 3:53 PM Simon Horman <horms@kernel.org>
> wrote:
> [...]
> > > [...]
> >
> > 'bw_min/max' is u64 and 'burst' / 'weight' are u32, any specific reason ?
>
> A NIC can exceed UINT32_MAX bps, while UINT32_MAX different values look
> more than enough to cope with different weights.
>
> No strong opinions; we can increase the weight range, but it looks a bit overkill
> to me.
>
> [...]
> > Can you please confirm below
> > 1. Does privileged VF mean trusted VF ?
>
> Yes. But keep in mind that given the current status of discussion we are going
> to drop privileged/trusted VF reference.
>
> > 2. With this we can setup below as well, from PF ?
> > -- Per-VF Quantum
> > -- Per-VF Strict priority
> > -- Per-VF ratelimit
>
> Assuming the NIC H/W supports it, yes: the host will have full control over all
> the available shapers.
Thanks for the clarification.
>
> > 3. Wondering if it would be beneficial to apply this to ingress traffic as well,
> all modes may or may not apply,
> > but at the least it could be possible to apply PPS/BPS ratelimit. So that the
> configuration methodology
> > remains same across egress and ingress.
>
> I think the API could be extended to the RX side, too, as a follow-up/next
> step. This relatively simple feature has been in progress for a significant time; I think
> it would be nice to have it in place first.
>
> We will try to ensure there will be no naming clash for such extension.
>
Makes sense and agree.
Thanks,
Sunil.
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-22 18:06 ` Jakub Kicinski
@ 2024-04-23 17:25 ` Paolo Abeni
2024-04-24 23:57 ` Jakub Kicinski
0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2024-04-23 17:25 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Mon, 2024-04-22 at 11:06 -0700, Jakub Kicinski wrote:
> On Fri, 19 Apr 2024 13:53:53 +0200 Paolo Abeni wrote:
> > > They don't have to be nodes. They can appear as parent or child of
> > > a real node, but they don't themselves carry any configuration.
> > >
> > > IOW you can represent them as a special encoding of the ID field,
> > > rather than a real node.
> >
> > I'm sorry for the latency, I got distracted elsewhere.
> >
> > It's not clear the benefit of introducing this 'attach points' concept.
> >
> > With the current proposal, configuring a queue shaper would be:
> >
> > info.bw_min = ...
> > dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
> >
> > and restoring the default could be either:
> >
> > info.bw_min = 0;
> > dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
>
> And presumably also bw_max = 0 also means "delete" or will it be bw_max
> = ~0 ?
I would say just bw_min = 0, all other fields are ignored in such a
case. But not very relevant since...
>
> > or:
> >
> > dev->shaper_ops->delete(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
>
> Which confusingly will not actually delete the node, subsequent get()
> will still return it.
>
> > With the 'attach points' I guess it will be something alike the
> > following (am not defining a different node type here just to keep the
> > example short):
> >
> > # configure a queue shaper
> > struct shaper_info attach_info;
> > dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &attach_info, &ack);
> > info.parent_id = attach_info.id;
> > info.bw_min = ...
> > new_node_id = dev->shaper_ops->add(dev, &info, &ack);
> >
> > # restore defaults:
> > dev->shaper_ops->delete(dev, SHAPER_LOOKUP_BY_TREE_ID, new_node_id, &info, &ack);
> >
> > likely some additional operation would be needed to traverse/fetch
> > directly the actual shaper present at the attach points???
>
> Whether type + ID (here SHAPER_LOOKUP_BY_QUEUE, queue_id) identifies
> the node sufficiently to avoid the get is orthogonal. Your ->set
> example assumes you don't have to do a get first to find exact
> (synthetic) node ID. The same can be true for an ->add, if you prefer.
... my understanding is that you have a strong preference for the
'attach points' variant.
I think in the end it's mostly a matter of clearly defining the
expected behavior and the initial status.
Assuming the initial tree is empty, and the kernel tracks all changes
vs the default (empty tree) configuration, I guess it's probably better
to change the ->add() slightly:
int (*add)(struct net_device *dev,
enum shaper_lookup_mode lookup_mode, // for the parent
u32 parent_id, // relative to lookup_mode
const struct shaper_info *shaper, // 'id' and 'parent_id' fields
// are unused
struct netlink_ext_ack *extack);
so we can easily add/configure a queue shaper with a single op:
struct shaper_info info;
info.bw_min = ...
dev->shaper_ops->add(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, &ack);
And possibly even the ->move() op:
int (*move)(struct net_device *dev,
u32 id, // shaper to be moved; id is relative to SHAPER_LOOKUP_BY_TREE_ID
enum shaper_lookup_mode lookup_mode, // for the new parent
u32 new_parent_id, // relative to 'lookup_mode'
struct netlink_ext_ack *extack);
Since it does not make sense to move the attach points around, the
kernel knows the IDs of the created/modified nodes, and it would be
useful to be able to move a shaper directly under an arbitrary attach
point.
Finally, what about renaming the SHAPER_LOOKUP_* values to
SHAPER_LOOKUP_TX_*, so we can later easily extend this to RX?
> I do find it odd that we have objects in multiple places of
> the hierarchy when there is no configuration intended. Especially
> that
> the HW may actually not support such configuration (say there is
> always
> a DRR before the egress, now we insert a shaping stage there).
Regardless of the model we choose (with 'attach points' or without), if
the H/W does not support a given configuration, the relevant op must
fail. No shapers will be created matching an unsupported configuration.
Thanks,
Paolo
* Re: [RFC] HW TX Rate Limiting Driver API
2024-04-23 17:25 ` Paolo Abeni
@ 2024-04-24 23:57 ` Jakub Kicinski
0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2024-04-24 23:57 UTC (permalink / raw)
To: Paolo Abeni
Cc: Simon Horman, netdev, Jiri Pirko, Madhu Chittim,
Sridhar Samudrala
On Tue, 23 Apr 2024 19:25:37 +0200 Paolo Abeni wrote:
> I would say just bw_min = 0, all others fields are ignored in such
> case. But not very relevant since...
> ... my understanding is that you have strong preference over the
> 'attach points' variant.
>
> I think in the end is mostly a matter of clearly define
> expectation/behavior and initial status.
Agreed, no strong preference, but also no strong argument either way?
Maybe my main worry was that we have 4 "lookup modes"; if every one
of them has fake nodes, that's a bit messy. With fewer modes it's more
palatable.
end of thread, other threads:[~2024-04-24 23:57 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-05 10:23 [RFC] HW TX Rate Limiting Driver API Simon Horman
2024-04-05 13:33 ` Jamal Hadi Salim
2024-04-05 17:06 ` Paolo Abeni
2024-04-06 13:48 ` Jamal Hadi Salim
2024-04-10 9:41 ` Paolo Abeni
2024-04-05 14:34 ` John Fastabend
2024-04-05 16:25 ` Paolo Abeni
2024-04-09 22:32 ` Jakub Kicinski
2024-04-10 8:33 ` Paolo Abeni
2024-04-10 14:57 ` Jakub Kicinski
2024-04-11 15:58 ` Paolo Abeni
2024-04-11 16:03 ` Jakub Kicinski
2024-04-19 11:53 ` Paolo Abeni
2024-04-22 18:06 ` Jakub Kicinski
2024-04-23 17:25 ` Paolo Abeni
2024-04-24 23:57 ` Jakub Kicinski
2024-04-11 23:51 ` Vinicius Costa Gomes
2024-04-12 4:39 ` John Fastabend
2024-04-22 11:30 ` Sunil Kovvuri Goutham
2024-04-23 14:07 ` Paolo Abeni
2024-04-23 15:56 ` Sunil Kovvuri Goutham