* [RFC net-next v2 0/2] devlink: whole-device, resource .occ_set()
@ 2025-02-19 16:32 Przemek Kitszel
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
2025-02-19 16:32 ` [RFC net-next v2 2/2] devlink: give user option to allocate resources Przemek Kitszel
0 siblings, 2 replies; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-19 16:32 UTC
To: intel-wired-lan, Tony Nguyen, Jiri Pirko, Jakub Kicinski,
Cosmin Ratiu, Tariq Toukan
Cc: netdev, Konrad Knitter, Jacob Keller, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran, Przemek Kitszel
I'm working on ice+iavf changes that utilize the two devlink patches
of this RFC series. The two are related enough that I will group them
together for my actual submission anyway.

Patch 1: add an option for drivers like ice (devlink instance per PF)
to add a whole-device devlink instance that wraps them together.
Patch 2: add a resource occupation setter and a (better) mode of
resource control for users.
Przemek Kitszel (2):
devlink: add whole device devlink instance
devlink: give user option to allocate resources
net/devlink/devl_internal.h | 14 +++---
net/devlink/core.c | 58 ++++++++++++++++++-----
net/devlink/netlink.c | 4 +-
net/devlink/port.c | 4 +-
net/devlink/resource.c | 94 +++++++++++++++++++++++++++++--------
5 files changed, 132 insertions(+), 42 deletions(-)
--
2.46.0
* [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-19 16:32 [RFC net-next v2 0/2] devlink: whole-device, resource .occ_set() Przemek Kitszel
@ 2025-02-19 16:32 ` Przemek Kitszel
2025-02-19 22:11 ` Jacob Keller
` (3 more replies)
2025-02-19 16:32 ` [RFC net-next v2 2/2] devlink: give user option to allocate resources Przemek Kitszel
1 sibling, 4 replies; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-19 16:32 UTC
To: intel-wired-lan, Tony Nguyen, Jiri Pirko, Jakub Kicinski,
Cosmin Ratiu, Tariq Toukan
Cc: netdev, Konrad Knitter, Jacob Keller, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran, Przemek Kitszel
Add support for a whole-device devlink instance, intended as an entity
over all PF devices on a given physical device.

In the case of the ice driver we have multiple PF devices (with their
devlink dev representation) that have separate drivers loaded. However,
those still share lots of resources due to being on the same HW.
Examples include the PTP clock and the RSS LUT. Historically such
resources were assigned to PF0, but that was neither clear nor working
well. Now they are moved into struct ice_adapter, of which there is
just one instance per HW.

This patch adds a devlink instance that corresponds to that ice_adapter,
to allow arbitration over resources (such as the RSS LUT) via it
(further in the series (RFC NOTE: stripped out so far)).
Thanks to Wojciech Drewek for very nice naming of the devlink instance:
PF0: pci/0000:00:18.0
whole-dev: pci/0000:00:18
But I made this a param for now (driver is free to pass just "whole-dev").
$ devlink dev # (Interesting part of output only)
pci/0000:af:00:
nested_devlink:
pci/0000:af:00.0
pci/0000:af:00.1
pci/0000:af:00.2
pci/0000:af:00.3
pci/0000:af:00.4
pci/0000:af:00.5
pci/0000:af:00.6
pci/0000:af:00.7
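
A minimal sketch of the intended driver-side usage (illustrative only;
ice_adapter_devlink_ops is a placeholder, error handling is simplified,
and the per-PF instances would presumably be attached via
devl_nested_devlink_set() to produce the output above):

	struct devlink *ice_adapter_devlink_create(struct pci_dev *pdev)
	{
		struct devlink *dl;
		const char *name;

		/* "0000:00:18"-style name: the BDF with the function
		 * dropped; the string must outlive the instance, since
		 * devlink_alloc_wrapper() keeps the pointer
		 */
		name = kasprintf(GFP_KERNEL, "%04x:%02x:%02x",
				 pci_domain_nr(pdev->bus),
				 pdev->bus->number,
				 PCI_SLOT(pdev->devfn));
		if (!name)
			return NULL;

		dl = devlink_alloc_wrapper(&ice_adapter_devlink_ops, 0,
					   "pci", name);
		if (!dl) {
			kfree(name);
			return NULL;
		}
		devlink_register(dl);
		return dl;
	}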
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
---
net/devlink/devl_internal.h | 14 +++++----
net/devlink/core.c | 58 +++++++++++++++++++++++++++++--------
net/devlink/netlink.c | 4 +--
net/devlink/port.c | 4 +--
4 files changed, 58 insertions(+), 22 deletions(-)
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 14eaad9cfe35..073afe02ce2f 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -49,6 +49,8 @@ struct devlink {
struct xarray snapshot_ids;
struct devlink_dev_stats stats;
struct device *dev;
+ const char *dev_name;
+ const char *bus_name;
possible_net_t _net;
/* Serializes access to devlink instance specific objects such as
* port, sb, dpipe, resource, params, region, traps and more.
@@ -104,15 +106,15 @@ static inline bool devl_is_registered(struct devlink *devlink)
static inline void devl_dev_lock(struct devlink *devlink, bool dev_lock)
{
- if (dev_lock)
+ if (dev_lock && devlink->dev)
device_lock(devlink->dev);
devl_lock(devlink);
}
static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
{
devl_unlock(devlink);
- if (dev_lock)
+ if (dev_lock && devlink->dev)
device_unlock(devlink->dev);
}
@@ -174,9 +176,9 @@ devlink_dump_state(struct netlink_callback *cb)
static inline int
devlink_nl_put_handle(struct sk_buff *msg, struct devlink *devlink)
{
- if (nla_put_string(msg, DEVLINK_ATTR_BUS_NAME, devlink->dev->bus->name))
+ if (nla_put_string(msg, DEVLINK_ATTR_BUS_NAME, devlink->bus_name))
return -EMSGSIZE;
- if (nla_put_string(msg, DEVLINK_ATTR_DEV_NAME, dev_name(devlink->dev)))
+ if (nla_put_string(msg, DEVLINK_ATTR_DEV_NAME, devlink->dev_name))
return -EMSGSIZE;
return 0;
}
@@ -209,8 +211,8 @@ static inline void devlink_nl_obj_desc_init(struct devlink_obj_desc *desc,
struct devlink *devlink)
{
memset(desc, 0, sizeof(*desc));
- desc->bus_name = devlink->dev->bus->name;
- desc->dev_name = dev_name(devlink->dev);
+ desc->bus_name = devlink->bus_name;
+ desc->dev_name = devlink->dev_name;
}
static inline void devlink_nl_obj_desc_port_set(struct devlink_obj_desc *desc,
diff --git a/net/devlink/core.c b/net/devlink/core.c
index f49cd83f1955..f4960074b845 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -397,26 +397,25 @@ void devlink_unregister(struct devlink *devlink)
EXPORT_SYMBOL_GPL(devlink_unregister);
/**
- * devlink_alloc_ns - Allocate new devlink instance resources
- * in specific namespace
+ * devlink_alloc_wrapper - Allocate new devlink instance resources
+ * for a SW wrapper over multiple HW devlink instances
*
* @ops: ops
* @priv_size: size of user private data
- * @net: net namespace
- * @dev: parent device
+ * @bus_name: user visible bus name
+ * @dev_name: user visible device name
*
- * Allocate new devlink instance resources, including devlink index
- * and name.
+ * Allocate new devlink instance resources, including devlink index.
*/
-struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
- size_t priv_size, struct net *net,
- struct device *dev)
+struct devlink *devlink_alloc_wrapper(const struct devlink_ops *ops,
+ size_t priv_size, const char *bus_name,
+ const char *dev_name)
{
struct devlink *devlink;
static u32 last_id;
int ret;
- WARN_ON(!ops || !dev);
+ WARN_ON(!ops);
if (!devlink_reload_actions_valid(ops))
return NULL;
@@ -429,13 +428,14 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
if (ret < 0)
goto err_xa_alloc;
- devlink->dev = get_device(dev);
devlink->ops = ops;
+ devlink->bus_name = bus_name;
+ devlink->dev_name = dev_name;
xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
xa_init_flags(&devlink->params, XA_FLAGS_ALLOC);
xa_init_flags(&devlink->snapshot_ids, XA_FLAGS_ALLOC);
xa_init_flags(&devlink->nested_rels, XA_FLAGS_ALLOC);
- write_pnet(&devlink->_net, net);
+ write_pnet(&devlink->_net, &init_net);
INIT_LIST_HEAD(&devlink->rate_list);
INIT_LIST_HEAD(&devlink->linecard_list);
INIT_LIST_HEAD(&devlink->sb_list);
@@ -458,6 +458,40 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
kvfree(devlink);
return NULL;
}
+EXPORT_SYMBOL_GPL(devlink_alloc_wrapper);
+
+/**
+ * devlink_alloc_ns - Allocate new devlink instance resources
+ * in specific namespace
+ *
+ * @ops: ops
+ * @priv_size: size of user private data
+ * @net: net namespace
+ * @dev: parent device
+ *
+ * Allocate new devlink instance resources, including devlink index
+ * and name.
+ */
+struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
+ size_t priv_size, struct net *net,
+ struct device *dev)
+{
+ struct devlink *devlink;
+
+ if (WARN_ON(!dev))
+ return NULL;
+
+ dev = get_device(dev);
+ devlink = devlink_alloc_wrapper(ops, priv_size, dev->bus->name,
+ dev_name(dev));
+ if (!devlink) {
+ put_device(dev);
+ return NULL;
+ }
+ devlink->dev = dev;
+ write_pnet(&devlink->_net, net);
+ return devlink;
+}
EXPORT_SYMBOL_GPL(devlink_alloc_ns);
/**
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 593605c1b1ef..3f73ced2d879 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -193,8 +193,8 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
devname = nla_data(attrs[DEVLINK_ATTR_DEV_NAME]);
devlinks_xa_for_each_registered_get(net, index, devlink) {
- if (strcmp(devlink->dev->bus->name, busname) == 0 &&
- strcmp(dev_name(devlink->dev), devname) == 0) {
+ if (strcmp(devlink->bus_name, busname) == 0 &&
+ strcmp(devlink->dev_name, devname) == 0) {
devl_dev_lock(devlink, dev_lock);
if (devl_is_registered(devlink))
return devlink;
diff --git a/net/devlink/port.c b/net/devlink/port.c
index 939081a0e615..508ecf34d41a 100644
--- a/net/devlink/port.c
+++ b/net/devlink/port.c
@@ -220,8 +220,8 @@ size_t devlink_nl_port_handle_size(struct devlink_port *devlink_port)
{
struct devlink *devlink = devlink_port->devlink;
- return nla_total_size(strlen(devlink->dev->bus->name) + 1) /* DEVLINK_ATTR_BUS_NAME */
- + nla_total_size(strlen(dev_name(devlink->dev)) + 1) /* DEVLINK_ATTR_DEV_NAME */
+ return nla_total_size(strlen(devlink->bus_name) + 1) /* DEVLINK_ATTR_BUS_NAME */
+ + nla_total_size(strlen(devlink->dev_name) + 1) /* DEVLINK_ATTR_DEV_NAME */
+ nla_total_size(4); /* DEVLINK_ATTR_PORT_INDEX */
}
--
2.46.0
* [RFC net-next v2 2/2] devlink: give user option to allocate resources
2025-02-19 16:32 [RFC net-next v2 0/2] devlink: whole-device, resource .occ_set() Przemek Kitszel
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
@ 2025-02-19 16:32 ` Przemek Kitszel
1 sibling, 0 replies; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-19 16:32 UTC
To: intel-wired-lan, Tony Nguyen, Jiri Pirko, Jakub Kicinski,
Cosmin Ratiu, Tariq Toukan
Cc: netdev, Konrad Knitter, Jacob Keller, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran, Przemek Kitszel
Current devlink resources are designed as something the user can
limit, but not much else can be done with them. Perhaps that's the
reason there is not much adoption despite the API being there for
multiple years.

Add a new mode of operation, where the user can allocate/assign
resources (from a common pool) to specific devices.
That requires "occ set" support, triggered by the user.

To support that mode, "occ get" is (only then) turned into a simple
"get/show" operation, as opposed to "ask the driver about current
occupation" in the "legacy" mode.

Naming advice welcome; for now the modes are referred to as:
legacy/static-occ/mlx vs new/ice/dynamic-occ
Perhaps "user-settable" for the new mode and "driver-only" for the
legacy one? It does not matter much, as this will only be embedded in
the net/devlink/resource.c file as names/comments for clarity.
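
A rough sketch of the driver-side registration (illustrative only: the
resource id, field names and the limit are made up, and the occ_set
signature is inferred from the call site in this patch):

	static int ice_rss_lut_occ_set(u64 size,
				       struct netlink_ext_ack *extack,
				       void *priv)
	{
		struct ice_adapter *adapter = priv;

		if (size > ICE_RSS_LUT_TOTAL) {	/* made-up limit */
			NL_SET_ERR_MSG(extack, "not enough RSS LUT entries");
			return -ENOSPC;
		}
		adapter->rss_lut_occ = size;	/* made-up field */
		return 0;
	}

	static u64 ice_rss_lut_occ_get(void *priv)
	{
		struct ice_adapter *adapter = priv;

		return adapter->rss_lut_occ;
	}

	/* in the resource setup path, under devl_lock(),
	 * after devl_resource_register()
	 */
	devl_resource_occ_set_get_register(devlink, ICE_RES_RSS_LUT,
					   ice_rss_lut_occ_set,
					   ice_rss_lut_occ_get, adapter);

The user would then drive the allocation with the existing command,
e.g. "devlink resource set pci/0000:af:00 path /rss/lut_512 size 8",
which in this mode takes effect immediately via occ_set() instead of
staging size_new for a reload.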
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
---
net/devlink/resource.c | 94 +++++++++++++++++++++++++++++++++---------
1 file changed, 74 insertions(+), 20 deletions(-)
diff --git a/net/devlink/resource.c b/net/devlink/resource.c
index 2d6324f3d91f..c81d05427e12 100644
--- a/net/devlink/resource.c
+++ b/net/devlink/resource.c
@@ -14,25 +14,30 @@
* @size_new: updated size of the resource, reload is needed
* @size_valid: valid in case the total size of the resource is valid
* including its children
+ * @occ_mode: false for the static (legacy, mlx-like) occ mode,
+ *	      true for the dynamic (new, Intel) occ mode
* @parent: parent resource
* @size_params: size parameters
* @list: parent list
* @resource_list: list of child resources
* @occ_get: occupancy getter callback
- * @occ_get_priv: occupancy getter callback priv
+ * @occ_set: occupancy setter callback
+ * @occ_priv: occupancy callbacks priv
*/
struct devlink_resource {
const char *name;
u64 id;
u64 size;
u64 size_new;
bool size_valid;
+ bool occ_mode;
struct devlink_resource *parent;
struct devlink_resource_size_params size_params;
struct list_head list;
struct list_head resource_list;
devlink_resource_occ_get_t *occ_get;
- void *occ_get_priv;
+ devlink_resource_occ_set_t *occ_set;
+ void *occ_priv;
};
static struct devlink_resource *
@@ -127,6 +132,9 @@ int devlink_nl_resource_set_doit(struct sk_buff *skb, struct genl_info *info)
if (err)
return err;
+ if (resource->occ_set)
+ return resource->occ_set(size, info->extack, resource->occ_priv);
+
resource->size_new = size;
devlink_resource_validate_children(resource);
if (resource->parent)
@@ -152,13 +160,46 @@ devlink_resource_size_params_put(struct devlink_resource *resource,
return 0;
}
-static int devlink_resource_occ_put(struct devlink_resource *resource,
- struct sk_buff *skb)
+static
+int devlink_resource_occ_size_put_legacy(struct devlink_resource *resource,
+ struct sk_buff *skb)
+{
+ int err;
+
+ if (resource->occ_get) {
+ err = devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_OCC,
+ resource->occ_get(resource->occ_priv));
+ if (err)
+ return err;
+ }
+
+ if (resource->size != resource->size_new) {
+ err = devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_SIZE_NEW,
+ resource->size_new);
+ if (err)
+ return err;
+ }
+
+ err = nla_put_u8(skb, DEVLINK_ATTR_RESOURCE_SIZE_VALID,
+ resource->size_valid);
+ if (err)
+ return err;
+
+
+ return devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_SIZE,
+ resource->size);
+}
+
+static int devlink_resource_occ_size_put(struct devlink_resource *resource,
+ struct sk_buff *skb)
{
- if (!resource->occ_get)
- return 0;
- return devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_OCC,
- resource->occ_get(resource->occ_get_priv));
+ if (!resource->occ_get || !resource->occ_set)
+ return devlink_resource_occ_size_put_legacy(resource, skb);
+
+ nla_put_u8(skb, DEVLINK_ATTR_RESOURCE_SIZE_VALID, true);
+
+ return devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_SIZE,
+ resource->occ_get(resource->occ_priv));
}
static int devlink_resource_put(struct devlink *devlink, struct sk_buff *skb,
@@ -173,23 +214,16 @@ static int devlink_resource_put(struct devlink *devlink, struct sk_buff *skb,
return -EMSGSIZE;
if (nla_put_string(skb, DEVLINK_ATTR_RESOURCE_NAME, resource->name) ||
- devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_SIZE, resource->size) ||
devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_ID, resource->id))
goto nla_put_failure;
- if (resource->size != resource->size_new &&
- devlink_nl_put_u64(skb, DEVLINK_ATTR_RESOURCE_SIZE_NEW,
- resource->size_new))
- goto nla_put_failure;
- if (devlink_resource_occ_put(resource, skb))
- goto nla_put_failure;
if (devlink_resource_size_params_put(resource, skb))
goto nla_put_failure;
+ if (devlink_resource_occ_size_put(resource, skb))
+ goto nla_put_failure;
+
if (list_empty(&resource->resource_list))
goto out;
- if (nla_put_u8(skb, DEVLINK_ATTR_RESOURCE_SIZE_VALID,
- resource->size_valid))
- goto nla_put_failure;
child_resource_attr = nla_nest_start_noflag(skb,
DEVLINK_ATTR_RESOURCE_LIST);
@@ -476,7 +510,7 @@ void devl_resource_occ_get_register(struct devlink *devlink,
WARN_ON(resource->occ_get);
resource->occ_get = occ_get;
- resource->occ_get_priv = occ_get_priv;
+ resource->occ_priv = occ_get_priv;
}
EXPORT_SYMBOL_GPL(devl_resource_occ_get_register);
@@ -499,6 +533,26 @@ void devl_resource_occ_get_unregister(struct devlink *devlink,
WARN_ON(!resource->occ_get);
resource->occ_get = NULL;
- resource->occ_get_priv = NULL;
}
EXPORT_SYMBOL_GPL(devl_resource_occ_get_unregister);
+
+void devl_resource_occ_set_get_register(struct devlink *devlink,
+ u64 resource_id,
+ devlink_resource_occ_set_t *occ_set,
+ devlink_resource_occ_get_t *occ_get,
+ void *occ_priv)
+{
+ struct devlink_resource *resource;
+
+ lockdep_assert_held(&devlink->lock);
+
+ resource = devlink_resource_find(devlink, NULL, resource_id);
+ if (WARN_ON(!resource))
+ return;
+ WARN_ON(resource->occ_get || resource->occ_set);
+
+ resource->occ_set = occ_set;
+ resource->occ_get = occ_get;
+ resource->occ_priv = occ_priv;
+}
+EXPORT_SYMBOL_GPL(devl_resource_occ_set_get_register);
--
2.46.0
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
@ 2025-02-19 22:11 ` Jacob Keller
2025-02-21 1:45 ` Jakub Kicinski
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Jacob Keller @ 2025-02-19 22:11 UTC
To: Przemek Kitszel, intel-wired-lan, Tony Nguyen, Jiri Pirko,
Jakub Kicinski, Cosmin Ratiu, Tariq Toukan
Cc: netdev, Konrad Knitter, davem, Eric Dumazet, Paolo Abeni,
Andrew Lunn, linux-kernel, ITP Upstream, Carolina Jubran
On 2/19/2025 8:32 AM, Przemek Kitszel wrote:
> Add support for a whole-device devlink instance, intended as an entity
> over all PF devices on a given physical device.
>
> In the case of the ice driver we have multiple PF devices (with their
> devlink dev representation) that have separate drivers loaded. However,
> those still share lots of resources due to being on the same HW.
> Examples include the PTP clock and the RSS LUT. Historically such
> resources were assigned to PF0, but that was neither clear nor working
> well. Now they are moved into struct ice_adapter, of which there is
> just one instance per HW.
>
> This patch adds a devlink instance that corresponds to that ice_adapter,
> to allow arbitration over resources (such as the RSS LUT) via it
> (further in the series (RFC NOTE: stripped out so far)).
>
> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
> PF0: pci/0000:00:18.0
> whole-dev: pci/0000:00:18
> But I made this a param for now (driver is free to pass just "whole-dev").
>
> $ devlink dev # (Interesting part of output only)
> pci/0000:af:00:
> nested_devlink:
> pci/0000:af:00.0
> pci/0000:af:00.1
> pci/0000:af:00.2
> pci/0000:af:00.3
> pci/0000:af:00.4
> pci/0000:af:00.5
> pci/0000:af:00.6
> pci/0000:af:00.7
>
This adds an additional devlink interface instead of replacing the
existing scheme? Seems reasonable to avoid compatibility issues with
older driver versions. I had wanted to use a single instance pretty much
from my initial attempts at flash update, but ended up giving up at the
time.
I do like that we can see the nesting so it's clear which ones are connected.
One downside to this approach is in dealing with something like direct
assignment with virtualization. In practice, I think that already exists
because of HW limitations, and I would expect most such setups want to
assign the entire device rather than just one of its functions.
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
2025-02-19 22:11 ` Jacob Keller
@ 2025-02-21 1:45 ` Jakub Kicinski
2025-02-21 22:50 ` Jacob Keller
2025-02-24 13:03 ` Jiri Pirko
2025-02-24 16:14 ` Jiri Pirko
2025-03-18 15:42 ` Jiri Pirko
3 siblings, 2 replies; 19+ messages in thread
From: Jakub Kicinski @ 2025-02-21 1:45 UTC
To: Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jiri Pirko, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
> Add support for a whole-device devlink instance, intended as an entity
> over all PF devices on a given physical device.
>
> In the case of the ice driver we have multiple PF devices (with their
> devlink dev representation) that have separate drivers loaded. However,
> those still share lots of resources due to being on the same HW.
> Examples include the PTP clock and the RSS LUT. Historically such
> resources were assigned to PF0, but that was neither clear nor working
> well. Now they are moved into struct ice_adapter, of which there is
> just one instance per HW.
>
> This patch adds a devlink instance that corresponds to that ice_adapter,
> to allow arbitration over resources (such as the RSS LUT) via it
> (further in the series (RFC NOTE: stripped out so far)).
>
> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
> PF0: pci/0000:00:18.0
> whole-dev: pci/0000:00:18
> But I made this a param for now (driver is free to pass just "whole-dev").
Which only works nicely if you're talking about functions not full
separate links :) When I was thinking about it a while back my
intuition was that we should have a single instance, just accessible
under multiple names. But I'm not married to that direction if there
are problems with it.
> $ devlink dev # (Interesting part of output only)
> pci/0000:af:00:
> nested_devlink:
> pci/0000:af:00.0
> pci/0000:af:00.1
> pci/0000:af:00.2
> pci/0000:af:00.3
> pci/0000:af:00.4
> pci/0000:af:00.5
> pci/0000:af:00.6
> pci/0000:af:00.7
Could you go into more details on what stays on the "nested" instances
and what moves to the "whole-dev"? Jiri recently pointed out to y'all
cases where stuff that should be a port attribute was an instance
attribute.
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-21 1:45 ` Jakub Kicinski
@ 2025-02-21 22:50 ` Jacob Keller
2025-02-24 10:15 ` Przemek Kitszel
2025-02-24 13:03 ` Jiri Pirko
1 sibling, 1 reply; 19+ messages in thread
From: Jacob Keller @ 2025-02-21 22:50 UTC
To: Jakub Kicinski, Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jiri Pirko, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran
On 2/20/2025 5:45 PM, Jakub Kicinski wrote:
> On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
>> Add support for a whole-device devlink instance, intended as an entity
>> over all PF devices on a given physical device.
>>
>> In the case of the ice driver we have multiple PF devices (with their
>> devlink dev representation) that have separate drivers loaded. However,
>> those still share lots of resources due to being on the same HW.
>> Examples include the PTP clock and the RSS LUT. Historically such
>> resources were assigned to PF0, but that was neither clear nor working
>> well. Now they are moved into struct ice_adapter, of which there is
>> just one instance per HW.
>>
>> This patch adds a devlink instance that corresponds to that ice_adapter,
>> to allow arbitration over resources (such as the RSS LUT) via it
>> (further in the series (RFC NOTE: stripped out so far)).
>>
>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>> PF0: pci/0000:00:18.0
>> whole-dev: pci/0000:00:18
>> But I made this a param for now (driver is free to pass just "whole-dev").
>
> Which only works nicely if you're talking about functions not full
> separate links :) When I was thinking about it a while back my
> intuition was that we should have a single instance, just accessible
> under multiple names. But I'm not married to that direction if there
> are problems with it.
>
I would also prefer to see a single devlink instance + one port for each
function. I think that's the most natural fit to how devlink works, and
it gives us a natural entry point for "whole device" configuration. It
also limits the amount of duplicate data, for example "devlink dev info"
reports once for each function.
The main things I think this causes problems for are:
1) PCIe direct assignment with IOV
This could be an issue in cases where someone assigns only one function
to a VM. The VM would only see one function and the functions outside
the VM would not interact with it. IMHO this is not a big deal as I
think simply assigning the entire device into the VM is preferable.
We also already have this issue with ice_adapter, and we've seen that we
need to do this in order to make the device and software function
properly. Assigning single functions does not make much sense to me. In
addition, there is SR-IOV if you want to assign a portion of the device
to a VM.
2) locking may get complicated
If we have an entry point which needs to interact with ice_pf data, the
locking could get a little complicated, but I think this is also an
issue we can solve with ice_adapter, as a natural place to put
whole-device functionality.
I have also investigated in the past if it was possible to make the PCI
bus subsystem wrap the functions together somehow to represent them to
the host as a sort of pseudo "single-function" even though the hardware
is multi-function. This seemed like a natural way to prevent direct
assignment of the whole device... but I was never able to figure out how
to even start on such a path.
>> $ devlink dev # (Interesting part of output only)
>> pci/0000:af:00:
>> nested_devlink:
>> pci/0000:af:00.0
>> pci/0000:af:00.1
>> pci/0000:af:00.2
>> pci/0000:af:00.3
>> pci/0000:af:00.4
>> pci/0000:af:00.5
>> pci/0000:af:00.6
>> pci/0000:af:00.7
>
> Could you go into more details on what stays on the "nested" instances
> and what moves to the "whole-dev"? Jiri recently pointed out to y'all
> cases where stuff that should be a port attribute was an instance
> attribute.
I suspect this is a case of "we have separate devlink instances per
function, so we just put it in the devlink".
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-21 22:50 ` Jacob Keller
@ 2025-02-24 10:15 ` Przemek Kitszel
0 siblings, 0 replies; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-24 10:15 UTC
To: Jacob Keller, Jakub Kicinski, Jiri Pirko
Cc: intel-wired-lan, Tony Nguyen, Cosmin Ratiu, Tariq Toukan, netdev,
Konrad Knitter, davem, Eric Dumazet, Paolo Abeni, Andrew Lunn,
linux-kernel, ITP Upstream, Carolina Jubran
On 2/21/25 23:50, Jacob Keller wrote:
>
>
> On 2/20/2025 5:45 PM, Jakub Kicinski wrote:
>> On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
>>> Add support for a whole-device devlink instance, intended as an entity
>>> over all PF devices on a given physical device.
>>>
>>> In the case of the ice driver we have multiple PF devices (with their
>>> devlink dev representation) that have separate drivers loaded. However,
>>> those still share lots of resources due to being on the same HW.
>>> Examples include the PTP clock and the RSS LUT. Historically such
>>> resources were assigned to PF0, but that was neither clear nor working
>>> well. Now they are moved into struct ice_adapter, of which there is
>>> just one instance per HW.
>>>
>>> This patch adds a devlink instance that corresponds to that ice_adapter,
>>> to allow arbitration over resources (such as the RSS LUT) via it
>>> (further in the series (RFC NOTE: stripped out so far)).
>>>
>>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>>> PF0: pci/0000:00:18.0
>>> whole-dev: pci/0000:00:18
>>> But I made this a param for now (driver is free to pass just "whole-dev").
>>
>> Which only works nicely if you're talking about functions not full
>> separate links :) When I was thinking about it a while back my
that's why I made the name a param, instead of letting devlink
infer it.

I admit that with ice/e800 we could have done better with splitting
into devlink and devlink port parts at the beginning, but with the
growing complexity of devices, we are going to hit the "we need
something above" issue anyway.
>> intuition was that we should have a single instance, just accessible
>> under multiple names. But I'm not married to that direction if there
>> are problems with it.
>>
>
> I would also prefer to see a single devlink instance + one port for each
> function. I think that's the most natural fit to how devlink works, and
> it gives us a natural entry point for "whole device" configuration. It
> also limits the amount of duplicate data, for example "devlink dev info"
> reports once for each function.
>
>
> The main things I think this causes problems for are:
>
> 1) PCIe direct assignment with IOV
>
> This could be an issue in cases where someone assigns only one function
> to a VM. The VM would only see one function and the functions outside
> the VM would not interact with it. IMHO this is not a big deal as I
> think simply assigning the entire device into the VM is preferable.
>
> We also already have this issue with ice_adapter, and we've seen that we
> need to do this in order to make the device and software function
> properly. Assigning single functions does not make much sense to me. In
> addition, there is SR-IOV if you want to assign a portion of the device
> to a VM.
>
> 2) locking may get complicated
>
if any driver would go the "plain devlink" + "port devlink" route,
the devl_lock(devlink) and devl_lock(devlink_port) should be enough
> If we have an entry point which needs to interact with ice_pf data, the
> locking could get a little complicated, but I think this is also an
> issue we can solve with ice_adapter, as a natural place to put
> whole-device functionality.
>
> I have also investigated in the past if it was possible to make the PCI
> bus subsystem wrap the functions together somehow to represent them to
> the host as a sort of pseudo "single-function" even though the hardware
> is multi-function. This seemed like a natural way to prevent direct
> assignment of the whole device... but I was never able to figure out how
> to even start on such a path.
I get that the general sentiment is to "leave the complexities to the
driver/other layers", but that is based on reading only a limited
amount of internal (non-networking) mailing lists.

It will be great when devlink is finally used by non-networking
drivers :)
>
>>> $ devlink dev # (Interesting part of output only)
>>> pci/0000:af:00:
>>> nested_devlink:
>>> pci/0000:af:00.0
>>> pci/0000:af:00.1
BTW, I have a local version that adds SR-IOV VFs' devlink instances
as nested ones under the PF ones:
pci/0000:af:00.1:
nested_devlink:
pci/0000:af:05.0
>>> pci/0000:af:00.2
>>> pci/0000:af:00.3
>>> pci/0000:af:00.4
>>> pci/0000:af:00.5
>>> pci/0000:af:00.6
>>> pci/0000:af:00.7
>>
>> Could you go into more details on what stays on the "nested" instances
>> and what moves to the "whole-dev"? Jiri recently pointed out to y'all
>> cases where stuff that should be a port attribute was an instance
>> attribute.
My initial assumption was that everything stays as-is
>
> I suspect this is a case of "we have separate devlink instances per
> function, so we just put it in the devlink".
It's true that our 1:1 mapping of PF:devlink made this naively obvious.

Moving almost all the stuff we now have under the "PF" devlink into
devlink port instances is likely too much of a uAPI change, but let's
ask the question out loud: should we?
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-21 1:45 ` Jakub Kicinski
2025-02-21 22:50 ` Jacob Keller
@ 2025-02-24 13:03 ` Jiri Pirko
2025-02-24 22:09 ` Jacob Keller
1 sibling, 1 reply; 19+ messages in thread
From: Jiri Pirko @ 2025-02-24 13:03 UTC
To: Jakub Kicinski
Cc: Przemek Kitszel, intel-wired-lan, Tony Nguyen, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
Fri, Feb 21, 2025 at 02:45:12AM +0100, kuba@kernel.org wrote:
>On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
>> Add support for a whole-device devlink instance, intended as an entity
>> over all PF devices on a given physical device.
>>
>> In the case of the ice driver we have multiple PF devices (with their
>> devlink dev representation) that have separate drivers loaded. However,
>> those still share lots of resources due to being on the same HW.
>> Examples include the PTP clock and the RSS LUT. Historically such
>> resources were assigned to PF0, but that was neither clear nor working
>> well. Now they are moved into struct ice_adapter, of which there is
>> just one instance per HW.
>>
>> This patch adds a devlink instance that corresponds to that ice_adapter,
>> to allow arbitration over resources (such as the RSS LUT) via it
>> (further in the series (RFC NOTE: stripped out so far)).
>>
>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>> PF0: pci/0000:00:18.0
>> whole-dev: pci/0000:00:18
>> But I made this a param for now (driver is free to pass just "whole-dev").
>
>Which only works nicely if you're talking about functions not full
>separate links :) When I was thinking about it a while back my
>intuition was that we should have a single instance, just accessible
>under multiple names. But I'm not married to that direction if there
>are problems with it.
I kind of agree. Like multiple channels to one entity, each labeled by a
different name (handle in devlink case).
>
>> $ devlink dev # (Interesting part of output only)
>> pci/0000:af:00:
>> nested_devlink:
>> pci/0000:af:00.0
>> pci/0000:af:00.1
>> pci/0000:af:00.2
>> pci/0000:af:00.3
>> pci/0000:af:00.4
>> pci/0000:af:00.5
>> pci/0000:af:00.6
>> pci/0000:af:00.7
>
>Could you go into more details on what stays on the "nested" instances
>and what moves to the "whole-dev"? Jiri recently pointed out to y'all
>cases where stuff that should be a port attribute was an instance
>attribute.
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
2025-02-19 22:11 ` Jacob Keller
2025-02-21 1:45 ` Jakub Kicinski
@ 2025-02-24 16:14 ` Jiri Pirko
2025-02-24 22:12 ` Jacob Keller
2025-02-25 11:30 ` Przemek Kitszel
2025-03-18 15:42 ` Jiri Pirko
3 siblings, 2 replies; 19+ messages in thread
From: Jiri Pirko @ 2025-02-24 16:14 UTC
To: Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
Wed, Feb 19, 2025 at 05:32:54PM +0100, przemyslaw.kitszel@intel.com wrote:
>Add support for a whole-device devlink instance, intended as an entity
>over all PF devices on a given physical device.
>
>In the case of the ice driver we have multiple PF devices (with their
>devlink dev representation) that have separate drivers loaded. However,
>those still share lots of resources due to being on the same HW.
>Examples include the PTP clock and the RSS LUT. Historically such
>resources were assigned to PF0, but that was neither clear nor working
>well. Now they are moved into struct ice_adapter, of which there is
>just one instance per HW.
>
>This patch adds a devlink instance that corresponds to that ice_adapter,
>to allow arbitration over resources (such as the RSS LUT) via it
>(further in the series (RFC NOTE: stripped out so far)).
>
>Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>PF0: pci/0000:00:18.0
>whole-dev: pci/0000:00:18
>But I made this a param for now (driver is free to pass just "whole-dev").
>
>$ devlink dev # (Interesting part of output only)
>pci/0000:af:00:
> nested_devlink:
> pci/0000:af:00.0
> pci/0000:af:00.1
> pci/0000:af:00.2
> pci/0000:af:00.3
> pci/0000:af:00.4
> pci/0000:af:00.5
> pci/0000:af:00.6
> pci/0000:af:00.7
In general, I like this approach. In fact, I have a quite similar
patch/set in my sandbox git.
The problem I didn't figure out how to handle was a backing entity
for the parent devlink.

You use part of the PCI BDF, which is obviously wrong:
1) the user expects bus_name/dev_name to be the backing device's bus and
address on it (pci/usb/i2c). By using part of the BDF, you break this
assumption.
2) 2 PFs can have totally different BDFs (in a VM for example). Then your
approach is broken.
I was thinking about having an auxiliary device created for the parent,
but auxiliary assumes it is a child. That is upside-down.

I was thinking about having some sort of made-up per-driver bus, like
"ice" or "mlx5", with something like the DSN that would act as a "dev_name".
I have a patch that introduces:
struct devlink_shared_inst;
struct devlink *devlink_shared_alloc(const struct devlink_ops *ops,
				     size_t priv_size, struct net *net,
				     struct module *module, u64 per_module_id,
				     void *inst_priv,
				     struct devlink_shared_inst **p_inst);
void devlink_shared_free(struct devlink *devlink,
			 struct devlink_shared_inst *inst);
I took a stab at it here:
https://github.com/jpirko/linux_mlxsw/commits/wip_dl_pfs_parent/
The work is not finished.
Also, I was thinking about having some made-up bus, like "pci_ids",
where instead of BDFs as addresses, there would be DSN for example.
None of these 3 is nice.
The shared parent entity for PFs (and other Fs) is always reference
counted: the first creates it, the last removes it. I feel like this is
something missing in the PCI spec. If such a beast existed, it would be
very easy to implement this in devlink. We have all we need in place
already.
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-24 13:03 ` Jiri Pirko
@ 2025-02-24 22:09 ` Jacob Keller
0 siblings, 0 replies; 19+ messages in thread
From: Jacob Keller @ 2025-02-24 22:09 UTC
To: Jiri Pirko, Jakub Kicinski
Cc: Przemek Kitszel, intel-wired-lan, Tony Nguyen, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran
On 2/24/2025 5:03 AM, Jiri Pirko wrote:
> Fri, Feb 21, 2025 at 02:45:12AM +0100, kuba@kernel.org wrote:
>> On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
>>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>>> PF0: pci/0000:00:18.0
>>> whole-dev: pci/0000:00:18
>>> But I made this a param for now (driver is free to pass just "whole-dev").
>>
>> Which only works nicely if you're talking about functions not full
>> separate links :) When I was thinking about it a while back my
>> intuition was that we should have a single instance, just accessible
>> under multiple names. But I'm not married to that direction if there
>> are problems with it.
>
> I kind of agree. Like multiple channels to one entity, each labeled by a
> different name (handle in devlink case).
>
This might actually help alleviate some of the uAPI concerns too,
since the original names would still access the instance.
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-24 16:14 ` Jiri Pirko
@ 2025-02-24 22:12 ` Jacob Keller
2025-02-25 11:30 ` Przemek Kitszel
1 sibling, 0 replies; 19+ messages in thread
From: Jacob Keller @ 2025-02-24 22:12 UTC
To: Jiri Pirko, Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran
On 2/24/2025 8:14 AM, Jiri Pirko wrote:
>
> The shared parent entity for PFs (and other Fs) is always reference
> counted: the first creates it, the last removes it. I feel like this is
> something missing in the PCI spec. If such a beast existed, it would be
> very easy to implement this in devlink. We have all we need in place
> already.
>
>
This is basically what I was thinking too. It does feel like it fits in
the PCI layer better than in the devlink layer... but Przemek also
mentioned the following from reading the other lists:
>
> I get that the general sentiment is to "leave the complexities to the
> driver/other layers", but that is based on reading only a limited
> amount of internal (non-networking) mailing lists.
Which makes me think going the PCI route might be tricky. I am not sure
if there is another way to get that model without it, though...

There was also something I saw recently merged, faux_bus? Not sure if
that would be just as ugly as your other 3 options though...
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-24 16:14 ` Jiri Pirko
2025-02-24 22:12 ` Jacob Keller
@ 2025-02-25 11:30 ` Przemek Kitszel
2025-02-25 14:35 ` Jiri Pirko
1 sibling, 1 reply; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-25 11:30 UTC
To: Jiri Pirko
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>> PF0: pci/0000:00:18.0
>> whole-dev: pci/0000:00:18
>> But I made this a param for now (driver is free to pass just "whole-dev").
>>
>> $ devlink dev # (Interesting part of output only)
>> pci/0000:af:00:
>> nested_devlink:
>> pci/0000:af:00.0
>> pci/0000:af:00.1
>> pci/0000:af:00.2
>> pci/0000:af:00.3
>> pci/0000:af:00.4
>> pci/0000:af:00.5
>> pci/0000:af:00.6
>> pci/0000:af:00.7
>
>
> In general, I like this approach. In fact, I have quite similar
> patch/set in my sandbox git.
>
> The problem I didn't figure out how to handle was a backing entity
> for the parent devlink.
>
> You use part of the PCI BDF, which is obviously wrong:
> 1) the user expects bus_name/dev_name to be the backing device's bus and
> address on it (pci/usb/i2c). By using part of the BDF, you break this
> assumption.
> 2) 2 PFs can have totally different BDFs (in a VM for example). Then your
> approach is broken.
To make the hard part of it easy, I'd like the name to be provided
from what the PF/driver has available (whichever of the given device's
PFs comes first); as of now, we resolve this issue (and provide roughly
what your devlink_shared does) via ice_adapter.
Making it a devlink instance gives the user an easy way to see the whole
picture of all resources handled as "shared per device", my current
output, for all PFs and VFs on a given device:
pci/0000:af:00:
name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
resources:
name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
What contributes to the hardness is that this is not just one for all
ice PFs, but one per device, which we distinguish via the PCI BDF.
>
> I was thinking about having an auxiliary device created for the parent,
> but auxiliary assumes it is a child. That is upside-down.
>
> I was thinking about having some sort of made-up per-driver bus, like
> "ice" of "mlx5" with some thing like DSN that would act as a "dev_name".
> I have a patch that introduces:
>
> struct devlink_shared_inst;
>
> struct devlink *devlink_shared_alloc(const struct devlink_ops *ops,
> size_t priv_size, struct net *net,
> struct module *module, u64 per_module_id,
> void *inst_priv,
> struct devlink_shared_inst **p_inst);
> void devlink_shared_free(struct devlink *devlink,
> struct devlink_shared_inst *inst);
>
> I took a stab at it here:
> https://github.com/jpirko/linux_mlxsw/commits/wip_dl_pfs_parent/
> The work is not finished.
>
>
> Also, I was thinking about having some made-up bus, like "pci_ids",
> where instead of BDFs as addresses, there would be DSN for example.
>
> None of these 3 is nice.
how one would invent/infer/allocate the DSN?
faux_bus mentioned by Jake would be about the same level of "fakeness"
as simply allocating a new instance of devlink by the first PF, IMO :)
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-25 11:30 ` Przemek Kitszel
@ 2025-02-25 14:35 ` Jiri Pirko
2025-02-25 15:40 ` Przemek Kitszel
0 siblings, 1 reply; 19+ messages in thread
From: Jiri Pirko @ 2025-02-25 14:35 UTC
To: Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>
>> > Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>> > PF0: pci/0000:00:18.0
>> > whole-dev: pci/0000:00:18
>> > But I made this a param for now (driver is free to pass just "whole-dev").
>> >
>> > $ devlink dev # (Interesting part of output only)
>> > pci/0000:af:00:
>> > nested_devlink:
>> > pci/0000:af:00.0
>> > pci/0000:af:00.1
>> > pci/0000:af:00.2
>> > pci/0000:af:00.3
>> > pci/0000:af:00.4
>> > pci/0000:af:00.5
>> > pci/0000:af:00.6
>> > pci/0000:af:00.7
>>
>>
>> In general, I like this approach. In fact, I have quite similar
>> patch/set in my sandbox git.
>>
>> The problem I didn't figure out how to handle was a backing entity
>> for the parent devlink.
>>
>> You use part of the PCI BDF, which is obviously wrong:
>> 1) the user expects bus_name/dev_name to be the backing device's bus and
>> address on it (pci/usb/i2c). By using part of the BDF, you break this
>> assumption.
>> 2) 2 PFs can have totally different BDFs (in a VM for example). Then your
>> approach is broken.
>
>To make the hard part of it easy, I'd like the name to be provided
>from what the PF/driver has available (whichever of the given device's
>PFs comes first); as of now, we resolve this issue (and provide roughly
>what your devlink_shared does) via ice_adapter.
I don't understand. Can you provide some examples please?
>
>Making it a devlink instance gives the user an easy way to see the whole
>picture of all resources handled as "shared per device", my current
>output, for all PFs and VFs on a given device:
>
>pci/0000:af:00:
> name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
> resources:
> name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
> name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>
>What contributes to the hardness is that this is not just one for all
>ice PFs, but one per device, which we distinguish via the PCI BDF.
How?
>
>>
>> I was thinking about having an auxiliary device created for the parent,
>> but auxiliary assumes it is a child. That is upside-down.
>>
>> I was thinking about having some sort of made-up per-driver bus, like
>> "ice" of "mlx5" with some thing like DSN that would act as a "dev_name".
>> I have a patch that introduces:
>>
>> struct devlink_shared_inst;
>>
>> struct devlink *devlink_shared_alloc(const struct devlink_ops *ops,
>> size_t priv_size, struct net *net,
>> struct module *module, u64 per_module_id,
>> void *inst_priv,
>> struct devlink_shared_inst **p_inst);
>> void devlink_shared_free(struct devlink *devlink,
>> struct devlink_shared_inst *inst);
>>
>> I took a stab at it here:
>> https://github.com/jpirko/linux_mlxsw/commits/wip_dl_pfs_parent/
>> The work is not finished.
>>
>>
>> Also, I was thinking about having some made-up bus, like "pci_ids",
>> where instead of BDFs as addresses, there would be DSN for example.
>>
>> None of these 3 is nice.
>
>how one would invent/infer/allocate the DSN?
The driver knows the DSN; it can obtain it from the PCI layer.
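For example (just a sketch; the error handling and the fallback policy
are up to the driver):

	/* pci_get_dsn() reads the Device Serial Number extended
	 * capability and returns 0 when the device has none
	 */
	u64 dsn = pci_get_dsn(pdev);

	if (!dsn)
		return -ENODEV;	/* or fall back to a BDF-based key */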
>
>faux_bus mentioned by Jake would be about the same level of "fakeness"
>as simply allocating a new instance of devlink by the first PF, IMO :)
Hmm, briefly looking at faux, this looks like it fills the gap I missed
in auxdev. Will try to use it in my patchset.
Thanks!
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-25 14:35 ` Jiri Pirko
@ 2025-02-25 15:40 ` Przemek Kitszel
2025-02-25 18:16 ` Jacob Keller
2025-02-26 14:48 ` Jiri Pirko
0 siblings, 2 replies; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-25 15:40 UTC
To: Jiri Pirko
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
On 2/25/25 15:35, Jiri Pirko wrote:
> Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>>
>>>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>>>> PF0: pci/0000:00:18.0
>>>> whole-dev: pci/0000:00:18
>>>> But I made this a param for now (driver is free to pass just "whole-dev").
>>>>
>>>> $ devlink dev # (Interesting part of output only)
>>>> pci/0000:af:00:
>>>> nested_devlink:
>>>> pci/0000:af:00.0
>>>> pci/0000:af:00.1
>>>> pci/0000:af:00.2
>>>> pci/0000:af:00.3
>>>> pci/0000:af:00.4
>>>> pci/0000:af:00.5
>>>> pci/0000:af:00.6
>>>> pci/0000:af:00.7
>>>
>>>
>>> In general, I like this approach. In fact, I have quite similar
>>> patch/set in my sandbox git.
>>>
>>> The problem I didn't figure out how to handle was a backing entity
>>> for the parent devlink.
>>>
>>> You use part of the PCI BDF, which is obviously wrong:
>>> 1) the user expects bus_name/dev_name to be the backing device's bus and
>>> address on it (pci/usb/i2c). By using part of the BDF, you break this
>>> assumption.
>>> 2) 2 PFs can have totally different BDFs (in a VM for example). Then your
>>> approach is broken.
>>
>> To make the hard part of it easy, I'd like the name to be provided
>> from what the PF/driver has available (whichever of the given device's
>> PFs comes first); as of now, we resolve this issue (and provide roughly
>> what your devlink_shared does) via ice_adapter.
>
> I don't understand. Can you provide some examples please?
Right now we have one object of struct ice_adapter per device/card;
it is refcounted and freed after the last PF put()s its copy.
In the struct one could have a mutex or spinlock to guard shared stuff;
an existing example is ptp_gltsyn_time_lock of the ice driver.
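
The pattern, roughly (a sketch; the real struct and helpers in the
driver's ice_adapter code carry more state):

	struct ice_adapter {
		refcount_t refcount;
		/* shared state guarded here, e.g.: */
		spinlock_t ptp_gltsyn_time_lock;
	};

	/* the first PF allocates the object, later PFs take a reference */
	adapter = ice_adapter_get(pdev);

	/* the last PF to put() frees it */
	ice_adapter_put(pdev);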
>
>
>>
>> Making it a devlink instance gives the user an easy way to see the whole
>> picture of all resources handled as "shared per device", my current
This part is what is missing in the current devlink implementation and
likely would still be after your series. I would still like to have it :)
(And the rest is sugar coating for me.)
>> output, for all PFs and VFs on a given device:
>>
>> pci/0000:af:00:
>> name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>> resources:
>> name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>> name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>>
>> What contributes to the hardness is that this is not just one for all
>> ice PFs, but one per device, which we distinguish via the PCI BDF.
>
> How?
The code is in ice_adapter_index().
Now I get what DSN is; it looks like it could be used equally well
instead of the PCI BDF.

Still, we need more instances; each card has its own PTP clock, its
own "global RSS LUT" pool, etc.
>
>
>>
>>>
>>> I was thinking about having an auxiliary device created for the parent,
>>> but auxiliary assumes it is a child. That is upside-down.
>>>
>>> I was thinking about having some sort of made-up per-driver bus, like
>>> "ice" of "mlx5" with some thing like DSN that would act as a "dev_name".
>>> I have a patch that introduces:
>>>
>>> struct devlink_shared_inst;
>>>
>>> struct devlink *devlink_shared_alloc(const struct devlink_ops *ops,
>>> size_t priv_size, struct net *net,
>>> struct module *module, u64 per_module_id,
>>> void *inst_priv,
>>> struct devlink_shared_inst **p_inst);
>>> void devlink_shared_free(struct devlink *devlink,
>>> struct devlink_shared_inst *inst);
>>>
>>> I took a stab at it here:
>>> https://github.com/jpirko/linux_mlxsw/commits/wip_dl_pfs_parent/
>>> The work is not finished.
>>>
>>>
>>> Also, I was thinking about having some made-up bus, like "pci_ids",
>>> where instead of BDFs as addresses, there would be DSN for example.
>>>
>>> None of these 3 is nice.
>>
>> how one would invent/infer/allocate the DSN?
>
> The driver knows the DSN; it can obtain it from the PCI layer.
Aaach, I got the abbreviation wrong, pci_get_dsn() does the thing, thank
you. BTW, again, by Jake :D
>
>
>>
>> faux_bus mentioned by Jake would be about the same level of "fakeness"
>> as simply allocating a new instance of devlink by the first PF, IMO :)
>
> Hmm, briefly looking at faux, this looks like it fills the gap I missed
> in auxdev. Will try to use it in my patchset.
>
> Thanks!
>
>
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-25 15:40 ` Przemek Kitszel
@ 2025-02-25 18:16 ` Jacob Keller
2025-02-26 14:48 ` Jiri Pirko
1 sibling, 0 replies; 19+ messages in thread
From: Jacob Keller @ 2025-02-25 18:16 UTC
To: Przemek Kitszel, Jiri Pirko
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, davem, Eric Dumazet,
Paolo Abeni, Andrew Lunn, linux-kernel, ITP Upstream,
Carolina Jubran
On 2/25/2025 7:40 AM, Przemek Kitszel wrote:
> On 2/25/25 15:35, Jiri Pirko wrote:
>> Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>>>
>>>>> Thanks to Wojciech Drewek for very nice naming of the devlink instance:
>>>>> PF0: pci/0000:00:18.0
>>>>> whole-dev: pci/0000:00:18
>>>>> But I made this a param for now (driver is free to pass just "whole-dev").
>>>>>
>>>>> $ devlink dev # (Interesting part of output only)
>>>>> pci/0000:af:00:
>>>>> nested_devlink:
>>>>> pci/0000:af:00.0
>>>>> pci/0000:af:00.1
>>>>> pci/0000:af:00.2
>>>>> pci/0000:af:00.3
>>>>> pci/0000:af:00.4
>>>>> pci/0000:af:00.5
>>>>> pci/0000:af:00.6
>>>>> pci/0000:af:00.7
>>>>
>>>>
>>>> In general, I like this approach. In fact, I have quite similar
>>>> patch/set in my sandbox git.
>>>>
>>>> The problem I didn't figure out how to handle was a backing entity
>>>> for the parent devlink.
>>>>
>>>> You use part of the PCI BDF, which is obviously wrong:
>>>> 1) the user expects bus_name/dev_name to be the backing device's bus and
>>>> address on it (pci/usb/i2c). By using part of the BDF, you break this
>>>> assumption.
>>>> 2) 2 PFs can have totally different BDFs (in a VM for example). Then your
>>>> approach is broken.
>>>
>>> To make the hard part of it easy, I'd like the name to be provided
>>> from what the PF/driver has available (whichever of the given device's
>>> PFs comes first); as of now, we resolve this issue (and provide roughly
>>> what your devlink_shared does) via ice_adapter.
>>
>> I don't understand. Can you provide some examples please?
>
> Right now we have one object of struct ice_adapter per device/card;
> it is refcounted and freed after the last PF put()s its copy.
> In the struct one could have a mutex or spinlock to guard shared stuff;
> an existing example is ptp_gltsyn_time_lock of the ice driver.
>
>>
>>
>>>
>>> Making it a devlink instance gives the user an easy way to see the whole
>>> picture of all resources handled as "shared per device", my current
>
> This part is what is missing in the current devlink implementation and
> likely would still be after your series. I would still like to have it :)
> (And the rest is sugar coating for me.)
>
>>> output, for all PFs and VFs on a given device:
>>>
>>> pci/0000:af:00:
>>> name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>>> resources:
>>> name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>>> name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>>>
>>> What contributes to the hardness is that this is not just one for all
>>> ice PFs, but one per device, which we distinguish via the PCI BDF.
>>
>> How?
>
> The code is in ice_adapter_index().
> Now I get what DSN is; it looks like it could be used equally well
> instead of the PCI BDF.
>
> Still, we need more instances; each card has its own PTP clock, its
> own "global RSS LUT" pool, etc.
>
>>
>>
>>>
>>>>
>>>> I was thinking about having an auxiliary device created for the parent,
>>>> but auxiliary assumes it is a child. That is upside-down.
>>>>
>>>> I was thinking about having some sort of made-up per-driver bus, like
>>>> "ice" of "mlx5" with some thing like DSN that would act as a "dev_name".
>>>> I have a patch that introduces:
>>>>
>>>> struct devlink_shared_inst;
>>>>
>>>> struct devlink *devlink_shared_alloc(const struct devlink_ops *ops,
>>>> size_t priv_size, struct net *net,
>>>> struct module *module, u64 per_module_id,
>>>> void *inst_priv,
>>>> struct devlink_shared_inst **p_inst);
>>>> void devlink_shared_free(struct devlink *devlink,
>>>> struct devlink_shared_inst *inst);
>>>>
>>>> I took a stab at it here:
>>>> https://github.com/jpirko/linux_mlxsw/commits/wip_dl_pfs_parent/
>>>> The work is not finished.
>>>>
>>>>
>>>> Also, I was thinking about having some made-up bus, like "pci_ids",
>>>> where instead of BDFs as addresses, there would be DSN for example.
>>>>
>>>> None of these 3 is nice.
>>>
>>> how one would invent/infer/allocate the DSN?
>>
>> The driver knows the DSN; it can obtain it from the PCI layer.
>
> Aaach, I got the abbreviation wrong; pci_get_dsn() does the thing, thank
> you. BTW, again, by Jake :D
>
I agree DSN is a good choice, but I will point out one potential issue,
at least for early development: a lot of the pre-production cards I've
worked with in the past fail to have a unique DSN. At least for Intel
cards, it is typically stored in the NVM flash. A normal flash update
keeps the same DSN, but if you have to use something like a Dediprog
programmer to recover the card, the DSN can be erased. That can make
the DSN problematic as a unique identifier if your test system has
multiple pre-production cards.

Not a deal breaker, just a warning in case you run into it while
testing/working on an implementation.
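If the DSN does end up as the key, here is a hedged sketch of a fallback
for exactly this case; ice_adapter_key() is a made-up helper, and
pci_get_dsn() returns 0 when the device exposes no DSN capability:

#include <linux/pci.h>

/* Prefer the DSN as the per-card key; fall back to domain+BDF when no
 * usable DSN is present (e.g. erased NVM on pre-production cards).
 * The fallback reintroduces the VM/BDF ambiguity for such cards only.
 */
static u64 ice_adapter_key(struct pci_dev *pdev)
{
	u64 dsn = pci_get_dsn(pdev);

	if (dsn)
		return dsn;
	return (u64)pci_domain_nr(pdev->bus) << 16 |
	       PCI_DEVID(pdev->bus->number, pdev->devfn);
}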
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-25 15:40 ` Przemek Kitszel
2025-02-25 18:16 ` Jacob Keller
@ 2025-02-26 14:48 ` Jiri Pirko
2025-02-26 15:06 ` Przemek Kitszel
1 sibling, 1 reply; 19+ messages in thread
From: Jiri Pirko @ 2025-02-26 14:48 UTC (permalink / raw)
To: Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
Tue, Feb 25, 2025 at 04:40:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>On 2/25/25 15:35, Jiri Pirko wrote:
>> Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kitszel@intel.com wrote:
[...]
>> > output, for all PFs and VFs on a given device:
>> >
>> > pci/0000:af:00:
>> > name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>> > resources:
>> > name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>> > name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>> >
>> > What contributes to the hardness: this is not just one instance for
>> > all ice PFs, but one per device, which we distinguish via the PCI BDF.
>>
>> How?
>
>The code is in ice_adapter_index().
If you pass 2 PFs of the same device to a VM with random BDFs, you get 2
ice_adapters, correct?
[...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-26 14:48 ` Jiri Pirko
@ 2025-02-26 15:06 ` Przemek Kitszel
2025-02-26 15:25 ` Jiri Pirko
0 siblings, 1 reply; 19+ messages in thread
From: Przemek Kitszel @ 2025-02-26 15:06 UTC (permalink / raw)
To: Jiri Pirko
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
On 2/26/25 15:48, Jiri Pirko wrote:
> Tue, Feb 25, 2025 at 04:40:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>> On 2/25/25 15:35, Jiri Pirko wrote:
>>> Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>
> [...]
>
>>>> output, for all PFs and VFs on a given device:
>>>>
>>>> pci/0000:af:00:
>>>> name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>>>> resources:
>>>> name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>>>> name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>>>>
>>>> What contributes to the hardness: this is not just one instance for
>>>> all ice PFs, but one per device, which we distinguish via the PCI BDF.
>>>
>>> How?
>>
>> The code is in ice_adapter_index().
>
> If you pass 2 PFs of the same device to a VM with random BDFs, you get 2
> ice_adapters, correct?
Right now, yes
>
> [...]
What I want is to keep two ice_adapters for two actual devices (DSNs).
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-26 15:06 ` Przemek Kitszel
@ 2025-02-26 15:25 ` Jiri Pirko
0 siblings, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2025-02-26 15:25 UTC (permalink / raw)
To: Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
Wed, Feb 26, 2025 at 04:06:19PM +0100, przemyslaw.kitszel@intel.com wrote:
>On 2/26/25 15:48, Jiri Pirko wrote:
>> Tue, Feb 25, 2025 at 04:40:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>> > On 2/25/25 15:35, Jiri Pirko wrote:
>> > > Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kitszel@intel.com wrote:
>>
>> [...]
>>
>> > > > output, for all PFs and VFs on a given device:
>> > > >
>> > > > pci/0000:af:00:
>> > > > name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>> > > > resources:
>> > > > name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>> > > > name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>> > > >
>> > > > What contributes to the hardness: this is not just one instance for
>> > > > all ice PFs, but one per device, which we distinguish via the PCI BDF.
>> > >
>> > > How?
>> >
>> > The code is in ice_adapter_index().
>>
>> If you pass 2 PFs of the same device to a VM with random BDFs, you get 2
>> ice_adapters, correct?
>
>Right now, yes
That is a bug.
>
>>
>> [...]
>
>What I want is to keep two ice_adapters for two actual devices (DSNs).
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
` (2 preceding siblings ...)
2025-02-24 16:14 ` Jiri Pirko
@ 2025-03-18 15:42 ` Jiri Pirko
3 siblings, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2025-03-18 15:42 UTC (permalink / raw)
To: Przemek Kitszel
Cc: intel-wired-lan, Tony Nguyen, Jakub Kicinski, Cosmin Ratiu,
Tariq Toukan, netdev, Konrad Knitter, Jacob Keller, davem,
Eric Dumazet, Paolo Abeni, Andrew Lunn, linux-kernel,
ITP Upstream, Carolina Jubran
Wed, Feb 19, 2025 at 05:32:54PM +0100, przemyslaw.kitszel@intel.com wrote:
>Add support for a whole-device devlink instance. Intended as an entity
>over all PF devices on a given physical device.
>
>In the case of the ice driver we have multiple PF devices (each with its
>own devlink dev representation) that have separate driver instances
>loaded. However, those still share lots of resources due to being on the
>same HW. Examples include the PTP clock and the RSS LUT. Historically
>such state was assigned to PF0, but that was neither clear nor working
>well. Now such state is moved into struct ice_adapter; there is just one
>instance of it per HW.
>
>This patch adds a devlink instance that corresponds to that ice_adapter,
>to allow arbitration over resources (such as the RSS LUT) via it
>(further in the series (RFC NOTE: stripped out so far)).
>
>Thanks to Wojciech Drewek for the very nice naming of the devlink
>instance:
>PF0: pci/0000:00:18.0
>whole-dev: pci/0000:00:18
>But I made this a parameter for now (a driver is free to pass just
>"whole-dev").
>
>$ devlink dev # (Interesting part of output only)
>pci/0000:af:00:
> nested_devlink:
> pci/0000:af:00.0
> pci/0000:af:00.1
> pci/0000:af:00.2
> pci/0000:af:00.3
> pci/0000:af:00.4
> pci/0000:af:00.5
> pci/0000:af:00.6
> pci/0000:af:00.7
Please check my RFC attempt to solve this for mlx5:
https://lore.kernel.org/all/20250318124706.94156-1-jiri@resnulli.us/
I believe that the same could work for you too.
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread (newest message: 2025-03-18 15:43 UTC)
Thread overview: 19+ messages
2025-02-19 16:32 [RFC net-next v2 0/2] devlink: whole-device, resource .occ_set() Przemek Kitszel
2025-02-19 16:32 ` [RFC net-next v2 1/2] devlink: add whole device devlink instance Przemek Kitszel
2025-02-19 22:11 ` Jacob Keller
2025-02-21 1:45 ` Jakub Kicinski
2025-02-21 22:50 ` Jacob Keller
2025-02-24 10:15 ` Przemek Kitszel
2025-02-24 13:03 ` Jiri Pirko
2025-02-24 22:09 ` Jacob Keller
2025-02-24 16:14 ` Jiri Pirko
2025-02-24 22:12 ` Jacob Keller
2025-02-25 11:30 ` Przemek Kitszel
2025-02-25 14:35 ` Jiri Pirko
2025-02-25 15:40 ` Przemek Kitszel
2025-02-25 18:16 ` Jacob Keller
2025-02-26 14:48 ` Jiri Pirko
2025-02-26 15:06 ` Przemek Kitszel
2025-02-26 15:25 ` Jiri Pirko
2025-03-18 15:42 ` Jiri Pirko
2025-02-19 16:32 ` [RFC net-next v2 2/2] devlink: give user option to allocate resources Przemek Kitszel