* [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
@ 2024-05-28 9:11 Shay Drory
2024-05-28 9:11 ` [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs Shay Drory
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Shay Drory @ 2024-05-28 9:11 UTC (permalink / raw)
To: netdev, pabeni, davem, kuba, edumazet, gregkh, david.m.ertman
Cc: rafael, ira.weiny, linux-rdma, leon, tariqt, Shay Drory
Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
on the auxiliary bus. However, these PCI SFs lack such IRQ information
on the auxiliary bus, leaving users without visibility into which IRQs
are used by the SFs. This absence makes it impossible to debug
situations and to understand the source of interrupts/SFs for
performance tuning and debug.
Additionally, the SFs are multifunctional devices supporting RDMA,
network devices, clocks, and more, similar to their peer PCI PFs and
VFs. Therefore, it is desirable to have SFs' IRQ information available
at the bus/device level.
To overcome the above limitations, this short series extends the
auxiliary bus to display IRQ information in sysfs, similar to that of
PFs and VFs.
It adds an 'irqs' directory under the auxiliary device and includes an
<irq_num> sysfs file within it. Sometimes, the PCI SF auxiliary devices
share the IRQ with other SFs, a detail that is also not available to the
users. Consequently, this <irq_num> file indicates whether the IRQ is
'exclusive' or 'shared'. This 'irqs' directory extenstion is optional,
i.e. only for PCI SFs the sysfs irq information is optionally exposed.
For example:
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
50 51 52 53 54 55 56 57 58
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
exclusive
Patch summary:
==============
patch-1 adds auxiliary bus to support irqs used by auxiliary device
patch-2 mlx5 driver using exposing irqs for PCI SF devices via auxiliary
bus
---
v4->v5:
- addressed comments from Greg in patch #1.
v3->4:
- addressed comments from Przemek in patch #1.
v2->v3:
- addressed comments from Parav and Przemek in patch #1.
- fixed a bug in patch #2.
v1->v2:
- addressed comments from Greg, Simon H and kernel test boot in patch #1.
Shay Drory (2):
driver core: auxiliary bus: show auxiliary device IRQs
net/mlx5: Expose SFs IRQs
Documentation/ABI/testing/sysfs-bus-auxiliary | 14 ++
drivers/base/auxiliary.c | 165 +++++++++++++++++-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 6 +-
.../mellanox/mlx5/core/irq_affinity.c | 15 +-
.../ethernet/mellanox/mlx5/core/mlx5_core.h | 6 +
.../ethernet/mellanox/mlx5/core/mlx5_irq.h | 12 +-
.../net/ethernet/mellanox/mlx5/core/pci_irq.c | 12 +-
.../ethernet/mellanox/mlx5/core/sf/dev/dev.c | 16 +-
include/linux/auxiliary_bus.h | 24 ++-
9 files changed, 247 insertions(+), 23 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
--
2.38.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs
2024-05-28 9:11 [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Shay Drory
@ 2024-05-28 9:11 ` Shay Drory
2024-05-28 14:43 ` Przemek Kitszel
2024-05-28 18:00 ` Greg KH
2024-05-28 9:11 ` [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs Shay Drory
2024-05-28 17:57 ` [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Greg KH
2 siblings, 2 replies; 18+ messages in thread
From: Shay Drory @ 2024-05-28 9:11 UTC (permalink / raw)
To: netdev, pabeni, davem, kuba, edumazet, gregkh, david.m.ertman
Cc: rafael, ira.weiny, linux-rdma, leon, tariqt, Shay Drory,
Parav Pandit
Some PCI subfunctions (SF) are anchored on the auxiliary bus. PCI
physical and virtual functions are anchored on the PCI bus. The irq
information of each such function is visible to users via sysfs
directory "msi_irqs" containing file for each irq entry. However, for
PCI SFs such information is unavailable. Due to this users have no
visibility on IRQs used by the SFs.
Secondly, an SF can be multi function device supporting rdma, netdevice
and more. Without irq information at the bus level, the user is unable
to view or use the affinity of the SF IRQs.
Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
for supporting auxiliary devices, containing file for each irq entry.
Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
information is also not available to the users. To overcome this
limitation, each irq sysfs entry shows if irq is exclusive or shared.
For example:
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
50 51 52 53 54 55 56 57 58
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
exclusive
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
---
v4-v5:
- restore global mutex and replace refcount_t with simple integer (Greg)
v3->4:
- remove global mutex (Przemek)
v2->v3:
- fix function declaration in case SYSFS isn't defined
v1->v2:
- move #ifdefs from drivers/base/auxiliary.c to
include/linux/auxiliary_bus.h (Greg)
- use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
- Fix kzalloc(ref) to kzalloc(*ref) (Simon)
- Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
- Fix auxiliary_irq_mode_show doc (kernel test boot)
---
Documentation/ABI/testing/sysfs-bus-auxiliary | 14 ++
drivers/base/auxiliary.c | 165 +++++++++++++++++-
include/linux/auxiliary_bus.h | 24 ++-
3 files changed, 200 insertions(+), 3 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
new file mode 100644
index 000000000000..3b8299d49d9e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
@@ -0,0 +1,14 @@
+What: /sys/bus/auxiliary/devices/.../irqs/
+Date: April, 2024
+Contact: Shay Drory <shayd@nvidia.com>
+Description:
+ The /sys/devices/.../irqs directory contains a variable set of
+ files, with each file is named as irq number similar to PCI PF
+ or VF's irq number located in msi_irqs directory.
+
+What: /sys/bus/auxiliary/devices/.../irqs/<N>
+Date: April, 2024
+Contact: Shay Drory <shayd@nvidia.com>
+Description:
+ auxiliary devices can share IRQs. This attribute indicates if
+ the irq is shared with other SFs or exclusively used by the SF.
diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
index d3a2c40c2f12..579d755dcbee 100644
--- a/drivers/base/auxiliary.c
+++ b/drivers/base/auxiliary.c
@@ -158,6 +158,163 @@
* };
*/
+#ifdef CONFIG_SYSFS
+/* Xarray of irqs to determine if irq is exclusive or shared. */
+static DEFINE_XARRAY(irqs);
+/* Protects insertions into the irqs xarray. */
+static DEFINE_MUTEX(irqs_lock);
+
+struct auxiliary_irq_info {
+ struct device_attribute sysfs_attr;
+ int irq;
+};
+
+static struct attribute *auxiliary_irq_attrs[] = {
+ NULL
+};
+
+static const struct attribute_group auxiliary_irqs_group = {
+ .name = "irqs",
+ .attrs = auxiliary_irq_attrs,
+};
+
+static const struct attribute_group *auxiliary_irqs_groups[] = {
+ &auxiliary_irqs_group,
+ NULL
+};
+
+/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
+ * shared or exclusive.
+ */
+static ssize_t auxiliary_irq_mode_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct auxiliary_irq_info *info =
+ container_of(attr, struct auxiliary_irq_info, sysfs_attr);
+ int ref = xa_to_value(xa_load(&irqs, info->irq));
+
+ if (!ref)
+ return -ENOENT;
+ if (ref > 1)
+ return sysfs_emit(buf, "%s\n", "shared");
+ else
+ return sysfs_emit(buf, "%s\n", "exclusive");
+}
+
+static void auxiliary_irq_destroy(int irq)
+{
+ int ref;
+
+ mutex_lock(&irqs_lock);
+ ref = xa_to_value(xa_load(&irqs, irq));
+ if (!(--ref))
+ xa_erase(&irqs, irq);
+ else
+ xa_store(&irqs, irq, xa_mk_value(ref), GFP_KERNEL);
+ mutex_unlock(&irqs_lock);
+}
+
+static int auxiliary_irq_create(int irq)
+{
+ int ret = 0;
+ int ref;
+
+ mutex_lock(&irqs_lock);
+ ref = xa_to_value(xa_load(&irqs, irq));
+ if (ref) {
+ ref++;
+ xa_store(&irqs, irq, xa_mk_value(ref), GFP_KERNEL);
+ goto out;
+ }
+
+ ret = xa_insert(&irqs, irq, xa_mk_value(1), GFP_KERNEL);
+
+out:
+ mutex_unlock(&irqs_lock);
+ return ret;
+}
+
+/**
+ * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
+ * @auxdev: auxiliary bus device to add the sysfs entry.
+ * @irq: The associated Linux interrupt number.
+ *
+ * This function should be called after auxiliary device have successfully
+ * received the irq.
+ *
+ * Return: zero on success or an error code on failure.
+ */
+int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
+{
+ struct device *dev = &auxdev->dev;
+ struct auxiliary_irq_info *info;
+ int ret;
+
+ ret = auxiliary_irq_create(irq);
+ if (ret)
+ return ret;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info) {
+ ret = -ENOMEM;
+ goto info_err;
+ }
+
+ sysfs_attr_init(&info->sysfs_attr.attr);
+ info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
+ if (!info->sysfs_attr.attr.name) {
+ ret = -ENOMEM;
+ goto name_err;
+ }
+ info->irq = irq;
+ info->sysfs_attr.attr.mode = 0444;
+ info->sysfs_attr.show = auxiliary_irq_mode_show;
+
+ ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
+ if (ret)
+ goto auxdev_xa_err;
+
+ ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
+ auxiliary_irqs_group.name);
+ if (ret)
+ goto sysfs_add_err;
+
+ return 0;
+
+sysfs_add_err:
+ xa_erase(&auxdev->irqs, irq);
+auxdev_xa_err:
+ kfree(info->sysfs_attr.attr.name);
+name_err:
+ kfree(info);
+info_err:
+ auxiliary_irq_destroy(irq);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_add);
+
+/**
+ * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
+ * @auxdev: auxiliary bus device to add the sysfs entry.
+ * @irq: the IRQ to remove.
+ *
+ * This function should be called to remove an IRQ sysfs entry.
+ */
+void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
+{
+ struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
+ struct device *dev = &auxdev->dev;
+
+ sysfs_remove_file_from_group(&dev->kobj, &info->sysfs_attr.attr,
+ auxiliary_irqs_group.name);
+ xa_erase(&auxdev->irqs, irq);
+ kfree(info->sysfs_attr.attr.name);
+ kfree(info);
+ auxiliary_irq_destroy(irq);
+}
+EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_remove);
+#endif
+
static const struct auxiliary_device_id *auxiliary_match_id(const struct auxiliary_device_id *id,
const struct auxiliary_device *auxdev)
{
@@ -295,6 +452,7 @@ EXPORT_SYMBOL_GPL(auxiliary_device_init);
* __auxiliary_device_add - add an auxiliary bus device
* @auxdev: auxiliary bus device to add to the bus
* @modname: name of the parent device's driver module
+ * @irqs_sysfs_enable: whether to enable IRQs sysfs
*
* This is the third step in the three-step process to register an
* auxiliary_device.
@@ -310,7 +468,8 @@ EXPORT_SYMBOL_GPL(auxiliary_device_init);
* parameter. Only if a user requires a custom name would this version be
* called directly.
*/
-int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname)
+int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname,
+ bool irqs_sysfs_enable)
{
struct device *dev = &auxdev->dev;
int ret;
@@ -325,6 +484,10 @@ int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname)
dev_err(dev, "auxiliary device dev_set_name failed: %d\n", ret);
return ret;
}
+ if (irqs_sysfs_enable) {
+ dev->groups = auxiliary_irqs_groups;
+ xa_init(&auxdev->irqs);
+ }
ret = device_add(dev);
if (ret)
diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
index de21d9d24a95..760fadb26620 100644
--- a/include/linux/auxiliary_bus.h
+++ b/include/linux/auxiliary_bus.h
@@ -58,6 +58,7 @@
* in
* @name: Match name found by the auxiliary device driver,
* @id: unique identitier if multiple devices of the same name are exported,
+ * @irqs: irqs xarray contains irq indices which are used by the device,
*
* An auxiliary_device represents a part of its parent device's functionality.
* It is given a name that, combined with the registering drivers
@@ -138,6 +139,7 @@
struct auxiliary_device {
struct device dev;
const char *name;
+ struct xarray irqs;
u32 id;
};
@@ -209,8 +211,26 @@ static inline struct auxiliary_driver *to_auxiliary_drv(struct device_driver *dr
}
int auxiliary_device_init(struct auxiliary_device *auxdev);
-int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname);
-#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev, KBUILD_MODNAME)
+int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname,
+ bool irqs_sysfs_enable);
+#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev, KBUILD_MODNAME, false)
+#define auxiliary_device_add_with_irqs(auxdev) \
+ __auxiliary_device_add(auxdev, KBUILD_MODNAME, true)
+
+#ifdef CONFIG_SYSFS
+int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq);
+void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev,
+ int irq);
+#else /* CONFIG_SYSFS */
+static inline int
+auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
+{
+ return 0;
+}
+
+static inline void
+auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq) {}
+#endif
static inline void auxiliary_device_uninit(struct auxiliary_device *auxdev)
{
--
2.38.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs
2024-05-28 9:11 [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Shay Drory
2024-05-28 9:11 ` [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs Shay Drory
@ 2024-05-28 9:11 ` Shay Drory
2024-05-28 14:48 ` Przemek Kitszel
2024-05-28 17:57 ` [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Greg KH
2 siblings, 1 reply; 18+ messages in thread
From: Shay Drory @ 2024-05-28 9:11 UTC (permalink / raw)
To: netdev, pabeni, davem, kuba, edumazet, gregkh, david.m.ertman
Cc: rafael, ira.weiny, linux-rdma, leon, tariqt, Shay Drory,
Parav Pandit
Expose the sysfs files for the IRQs that the mlx5 PCI SFs are using.
These entries are similar to PCI PFs and VFs in 'msi_irqs' directory.
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
---
v2->v3:
- fix mlx5 sfnum SF sysfs
---
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 6 +++---
.../ethernet/mellanox/mlx5/core/irq_affinity.c | 15 ++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 6 ++++++
.../net/ethernet/mellanox/mlx5/core/mlx5_irq.h | 12 ++++++++----
.../net/ethernet/mellanox/mlx5/core/pci_irq.c | 12 +++++++++---
.../net/ethernet/mellanox/mlx5/core/sf/dev/dev.c | 16 +++++++---------
6 files changed, 47 insertions(+), 20 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 5693986ae656..5661f047702e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -714,7 +714,7 @@ static int create_async_eqs(struct mlx5_core_dev *dev)
err1:
mlx5_cmd_allowed_opcode(dev, CMD_ALLOWED_OPCODE_ALL);
mlx5_eq_notifier_unregister(dev, &table->cq_err_nb);
- mlx5_ctrl_irq_release(table->ctrl_irq);
+ mlx5_ctrl_irq_release(dev, table->ctrl_irq);
return err;
}
@@ -730,7 +730,7 @@ static void destroy_async_eqs(struct mlx5_core_dev *dev)
cleanup_async_eq(dev, &table->cmd_eq, "cmd");
mlx5_cmd_allowed_opcode(dev, CMD_ALLOWED_OPCODE_ALL);
mlx5_eq_notifier_unregister(dev, &table->cq_err_nb);
- mlx5_ctrl_irq_release(table->ctrl_irq);
+ mlx5_ctrl_irq_release(dev, table->ctrl_irq);
}
struct mlx5_eq *mlx5_get_async_eq(struct mlx5_core_dev *dev)
@@ -918,7 +918,7 @@ static int comp_irq_request_sf(struct mlx5_core_dev *dev, u16 vecidx)
af_desc.is_managed = 1;
cpumask_copy(&af_desc.mask, cpu_online_mask);
cpumask_andnot(&af_desc.mask, &af_desc.mask, &table->used_cpus);
- irq = mlx5_irq_affinity_request(pool, &af_desc);
+ irq = mlx5_irq_affinity_request(dev, pool, &af_desc);
if (IS_ERR(irq))
return PTR_ERR(irq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c b/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c
index 612e666ec263..5c36aa3c57e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c
@@ -112,15 +112,18 @@ irq_pool_find_least_loaded(struct mlx5_irq_pool *pool, const struct cpumask *req
/**
* mlx5_irq_affinity_request - request an IRQ according to the given mask.
+ * @dev: mlx5 core device which is requesting the IRQ.
* @pool: IRQ pool to request from.
* @af_desc: affinity descriptor for this IRQ.
*
* This function returns a pointer to IRQ, or ERR_PTR in case of error.
*/
struct mlx5_irq *
-mlx5_irq_affinity_request(struct mlx5_irq_pool *pool, struct irq_affinity_desc *af_desc)
+mlx5_irq_affinity_request(struct mlx5_core_dev *dev, struct mlx5_irq_pool *pool,
+ struct irq_affinity_desc *af_desc)
{
struct mlx5_irq *least_loaded_irq, *new_irq;
+ int ret;
mutex_lock(&pool->lock);
least_loaded_irq = irq_pool_find_least_loaded(pool, &af_desc->mask);
@@ -152,6 +155,13 @@ mlx5_irq_affinity_request(struct mlx5_irq_pool *pool, struct irq_affinity_desc *
mlx5_irq_get_index(least_loaded_irq)), pool->name,
mlx5_irq_read_locked(least_loaded_irq) / MLX5_EQ_REFS_PER_IRQ);
unlock:
+ if (mlx5_irq_pool_is_sf_pool(pool)) {
+ ret = auxiliary_device_sysfs_irq_add(mlx5_sf_coredev_to_adev(dev),
+ mlx5_irq_get_irq(least_loaded_irq));
+ if (ret)
+ mlx5_core_err(dev, "Failed to create sysfs entry for irq %d, ret = %d\n",
+ mlx5_irq_get_irq(least_loaded_irq), ret);
+ }
mutex_unlock(&pool->lock);
return least_loaded_irq;
}
@@ -164,6 +174,9 @@ void mlx5_irq_affinity_irq_release(struct mlx5_core_dev *dev, struct mlx5_irq *i
cpu = cpumask_first(mlx5_irq_get_affinity_mask(irq));
synchronize_irq(pci_irq_vector(pool->dev->pdev,
mlx5_irq_get_index(irq)));
+ if (mlx5_irq_pool_is_sf_pool(pool))
+ auxiliary_device_sysfs_irq_remove(mlx5_sf_coredev_to_adev(dev),
+ mlx5_irq_get_irq(irq));
if (mlx5_irq_put(irq))
if (pool->irqs_per_cpu)
cpu_put(pool, cpu);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index c38342b9f320..e764b720d9b2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -320,6 +320,12 @@ static inline bool mlx5_core_is_sf(const struct mlx5_core_dev *dev)
return dev->coredev_type == MLX5_COREDEV_SF;
}
+static inline struct auxiliary_device *
+mlx5_sf_coredev_to_adev(struct mlx5_core_dev *mdev)
+{
+ return container_of(mdev->device, struct auxiliary_device, dev);
+}
+
int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx);
void mlx5_mdev_uninit(struct mlx5_core_dev *dev);
int mlx5_init_one(struct mlx5_core_dev *dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
index 1088114e905d..0881e961d8b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_irq.h
@@ -25,7 +25,7 @@ int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev);
-void mlx5_ctrl_irq_release(struct mlx5_irq *ctrl_irq);
+void mlx5_ctrl_irq_release(struct mlx5_core_dev *dev, struct mlx5_irq *ctrl_irq);
struct mlx5_irq *mlx5_irq_request(struct mlx5_core_dev *dev, u16 vecidx,
struct irq_affinity_desc *af_desc,
struct cpu_rmap **rmap);
@@ -36,13 +36,15 @@ int mlx5_irq_attach_nb(struct mlx5_irq *irq, struct notifier_block *nb);
int mlx5_irq_detach_nb(struct mlx5_irq *irq, struct notifier_block *nb);
struct cpumask *mlx5_irq_get_affinity_mask(struct mlx5_irq *irq);
int mlx5_irq_get_index(struct mlx5_irq *irq);
+int mlx5_irq_get_irq(const struct mlx5_irq *irq);
struct mlx5_irq_pool;
#ifdef CONFIG_MLX5_SF
struct mlx5_irq *mlx5_irq_affinity_irq_request_auto(struct mlx5_core_dev *dev,
struct cpumask *used_cpus, u16 vecidx);
-struct mlx5_irq *mlx5_irq_affinity_request(struct mlx5_irq_pool *pool,
- struct irq_affinity_desc *af_desc);
+struct mlx5_irq *
+mlx5_irq_affinity_request(struct mlx5_core_dev *dev, struct mlx5_irq_pool *pool,
+ struct irq_affinity_desc *af_desc);
void mlx5_irq_affinity_irq_release(struct mlx5_core_dev *dev, struct mlx5_irq *irq);
#else
static inline
@@ -53,7 +55,8 @@ struct mlx5_irq *mlx5_irq_affinity_irq_request_auto(struct mlx5_core_dev *dev,
}
static inline struct mlx5_irq *
-mlx5_irq_affinity_request(struct mlx5_irq_pool *pool, struct irq_affinity_desc *af_desc)
+mlx5_irq_affinity_request(struct mlx5_core_dev *dev, struct mlx5_irq_pool *pool,
+ struct irq_affinity_desc *af_desc)
{
return ERR_PTR(-EOPNOTSUPP);
}
@@ -61,6 +64,7 @@ mlx5_irq_affinity_request(struct mlx5_irq_pool *pool, struct irq_affinity_desc *
static inline
void mlx5_irq_affinity_irq_release(struct mlx5_core_dev *dev, struct mlx5_irq *irq)
{
+ mlx5_irq_release_vector(irq);
}
#endif
#endif /* __MLX5_IRQ_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index fb8787e30d3f..ac7c3a76b4cf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -367,6 +367,11 @@ struct cpumask *mlx5_irq_get_affinity_mask(struct mlx5_irq *irq)
return irq->mask;
}
+int mlx5_irq_get_irq(const struct mlx5_irq *irq)
+{
+ return irq->map.virq;
+}
+
int mlx5_irq_get_index(struct mlx5_irq *irq)
{
return irq->map.index;
@@ -440,11 +445,12 @@ static void _mlx5_irq_release(struct mlx5_irq *irq)
/**
* mlx5_ctrl_irq_release - release a ctrl IRQ back to the system.
+ * @dev: mlx5 device that releasing the IRQ.
* @ctrl_irq: ctrl IRQ to be released.
*/
-void mlx5_ctrl_irq_release(struct mlx5_irq *ctrl_irq)
+void mlx5_ctrl_irq_release(struct mlx5_core_dev *dev, struct mlx5_irq *ctrl_irq)
{
- _mlx5_irq_release(ctrl_irq);
+ mlx5_irq_affinity_irq_release(dev, ctrl_irq);
}
/**
@@ -473,7 +479,7 @@ struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
/* Allocate the IRQ in index 0. The vector was already allocated */
irq = irq_pool_request_vector(pool, 0, &af_desc, NULL);
} else {
- irq = mlx5_irq_affinity_request(pool, &af_desc);
+ irq = mlx5_irq_affinity_request(dev, pool, &af_desc);
}
return irq;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
index 99219ea52c4b..27dfa56c27db 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c
@@ -60,11 +60,6 @@ static const struct attribute_group sf_attr_group = {
.attrs = sf_device_attrs,
};
-static const struct attribute_group *sf_attr_groups[2] = {
- &sf_attr_group,
- NULL
-};
-
static void mlx5_sf_dev_release(struct device *device)
{
struct auxiliary_device *adev = container_of(device, struct auxiliary_device, dev);
@@ -111,7 +106,6 @@ static void mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u16 fn_id,
sf_dev->adev.name = MLX5_SF_DEV_ID_NAME;
sf_dev->adev.dev.release = mlx5_sf_dev_release;
sf_dev->adev.dev.parent = &pdev->dev;
- sf_dev->adev.dev.groups = sf_attr_groups;
sf_dev->sfnum = sfnum;
sf_dev->parent_mdev = dev;
sf_dev->fn_id = fn_id;
@@ -127,18 +121,22 @@ static void mlx5_sf_dev_add(struct mlx5_core_dev *dev, u16 sf_index, u16 fn_id,
goto add_err;
}
- err = auxiliary_device_add(&sf_dev->adev);
+ err = auxiliary_device_add_with_irqs(&sf_dev->adev);
if (err) {
auxiliary_device_uninit(&sf_dev->adev);
goto add_err;
}
+ err = devm_device_add_group(&sf_dev->adev.dev, &sf_attr_group);
+ if (err)
+ goto add_group_err;
+
err = xa_insert(&table->devices, sf_index, sf_dev, GFP_KERNEL);
if (err)
- goto xa_err;
+ goto add_group_err;
return;
-xa_err:
+add_group_err:
mlx5_sf_dev_remove_aux(dev, sf_dev);
add_err:
mlx5_core_err(dev, "SF DEV: fail device add for index=%d sfnum=%d err=%d\n",
--
2.38.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs
2024-05-28 9:11 ` [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs Shay Drory
@ 2024-05-28 14:43 ` Przemek Kitszel
2024-05-29 6:58 ` Shay Drori
2024-05-28 18:00 ` Greg KH
1 sibling, 1 reply; 18+ messages in thread
From: Przemek Kitszel @ 2024-05-28 14:43 UTC (permalink / raw)
To: Shay Drory, gregkh
Cc: rafael, ira.weiny, edumazet, kuba, davem, pabeni, netdev,
david.m.ertman, linux-rdma, leon, tariqt, Parav Pandit
On 5/28/24 11:11, Shay Drory wrote:
> Some PCI subfunctions (SF) are anchored on the auxiliary bus. PCI
> physical and virtual functions are anchored on the PCI bus. The irq
> information of each such function is visible to users via sysfs
> directory "msi_irqs" containing file for each irq entry. However, for
> PCI SFs such information is unavailable. Due to this users have no
> visibility on IRQs used by the SFs.
> Secondly, an SF can be multi function device supporting rdma, netdevice
> and more. Without irq information at the bus level, the user is unable
> to view or use the affinity of the SF IRQs.
>
> Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
> for supporting auxiliary devices, containing file for each irq entry.
>
> Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
> information is also not available to the users. To overcome this
> limitation, each irq sysfs entry shows if irq is exclusive or shared.
>
> For example:
> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
> 50 51 52 53 54 55 56 57 58
> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
> exclusive
>
> Reviewed-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Shay Drory <shayd@nvidia.com>
>
> ---
> v4-v5:
> - restore global mutex and replace refcount_t with simple integer (Greg)
> v3->4:
> - remove global mutex (Przemek)
> v2->v3:
> - fix function declaration in case SYSFS isn't defined
> v1->v2:
> - move #ifdefs from drivers/base/auxiliary.c to
> include/linux/auxiliary_bus.h (Greg)
> - use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
> - Fix kzalloc(ref) to kzalloc(*ref) (Simon)
> - Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
> - Fix auxiliary_irq_mode_show doc (kernel test boot)
> ---
> Documentation/ABI/testing/sysfs-bus-auxiliary | 14 ++
> drivers/base/auxiliary.c | 165 +++++++++++++++++-
> include/linux/auxiliary_bus.h | 24 ++-
> 3 files changed, 200 insertions(+), 3 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
> new file mode 100644
> index 000000000000..3b8299d49d9e
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
> @@ -0,0 +1,14 @@
> +What: /sys/bus/auxiliary/devices/.../irqs/
> +Date: April, 2024
> +Contact: Shay Drory <shayd@nvidia.com>
> +Description:
> + The /sys/devices/.../irqs directory contains a variable set of
> + files, with each file is named as irq number similar to PCI PF
> + or VF's irq number located in msi_irqs directory.
> +
> +What: /sys/bus/auxiliary/devices/.../irqs/<N>
> +Date: April, 2024
> +Contact: Shay Drory <shayd@nvidia.com>
> +Description:
> + auxiliary devices can share IRQs. This attribute indicates if
> + the irq is shared with other SFs or exclusively used by the SF.
> diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
> index d3a2c40c2f12..579d755dcbee 100644
> --- a/drivers/base/auxiliary.c
> +++ b/drivers/base/auxiliary.c
> @@ -158,6 +158,163 @@
> * };
> */
>
> +#ifdef CONFIG_SYSFS
> +/* Xarray of irqs to determine if irq is exclusive or shared. */
> +static DEFINE_XARRAY(irqs);
> +/* Protects insertions into the irqs xarray. */
> +static DEFINE_MUTEX(irqs_lock);
> +
> +struct auxiliary_irq_info {
> + struct device_attribute sysfs_attr;
> + int irq;
> +};
> +
> +static struct attribute *auxiliary_irq_attrs[] = {
> + NULL
> +};
> +
> +static const struct attribute_group auxiliary_irqs_group = {
> + .name = "irqs",
> + .attrs = auxiliary_irq_attrs,
> +};
> +
> +static const struct attribute_group *auxiliary_irqs_groups[] = {
> + &auxiliary_irqs_group,
> + NULL
> +};
> +
> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
> + * shared or exclusive.
> + */
> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct auxiliary_irq_info *info =
> + container_of(attr, struct auxiliary_irq_info, sysfs_attr);
> + int ref = xa_to_value(xa_load(&irqs, info->irq));
just a note that you forgot to take the global lock here
> +
> + if (!ref)
> + return -ENOENT;
> + if (ref > 1)
> + return sysfs_emit(buf, "%s\n", "shared");
> + else
> + return sysfs_emit(buf, "%s\n", "exclusive");
> +}
> +
> +static void auxiliary_irq_destroy(int irq)
> +{
> + int ref;
> +
> + mutex_lock(&irqs_lock);
> + ref = xa_to_value(xa_load(&irqs, irq));
> + if (!(--ref))
> + xa_erase(&irqs, irq);
Global lock makes it indeed simpler to support xa_erase()-on-zero.
There are simple solutions without erasing zero elements (you could
have non-allocating store), but let's say we are leaving "the simplest"
room then :)
> + else
> + xa_store(&irqs, irq, xa_mk_value(ref), GFP_KERNEL);
> + mutex_unlock(&irqs_lock);
> +}
> +
> +static int auxiliary_irq_create(int irq)
> +{
> + int ret = 0;
> + int ref;
> +
> + mutex_lock(&irqs_lock);
> + ref = xa_to_value(xa_load(&irqs, irq));
> + if (ref) {
> + ref++;
> + xa_store(&irqs, irq, xa_mk_value(ref), GFP_KERNEL);
> + goto out;
> + }
> +
> + ret = xa_insert(&irqs, irq, xa_mk_value(1), GFP_KERNEL);
make code simpler by one common variant of ref++ & store
> +
> +out:
> + mutex_unlock(&irqs_lock);
> + return ret;
> +}
> +
> +/**
> + * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
> + * @auxdev: auxiliary bus device to add the sysfs entry.
> + * @irq: The associated Linux interrupt number.
> + *
> + * This function should be called after auxiliary device have successfully
> + * received the irq.
> + *
> + * Return: zero on success or an error code on failure.
> + */
> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
> +{
> + struct device *dev = &auxdev->dev;
> + struct auxiliary_irq_info *info;
> + int ret;
> +
> + ret = auxiliary_irq_create(irq);
> + if (ret)
> + return ret;
> +
> + info = kzalloc(sizeof(*info), GFP_KERNEL);
> + if (!info) {
> + ret = -ENOMEM;
> + goto info_err;
> + }
> +
> + sysfs_attr_init(&info->sysfs_attr.attr);
> + info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
> + if (!info->sysfs_attr.attr.name) {
> + ret = -ENOMEM;
> + goto name_err;
> + }
> + info->irq = irq;
> + info->sysfs_attr.attr.mode = 0444;
> + info->sysfs_attr.show = auxiliary_irq_mode_show;
> +
> + ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
> + if (ret)
> + goto auxdev_xa_err;
> +
> + ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
> + auxiliary_irqs_group.name);
> + if (ret)
> + goto sysfs_add_err;
> +
> + return 0;
> +
> +sysfs_add_err:
> + xa_erase(&auxdev->irqs, irq);
> +auxdev_xa_err:
> + kfree(info->sysfs_attr.attr.name);
> +name_err:
> + kfree(info);
> +info_err:
> + auxiliary_irq_destroy(irq);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_add);
> +
> +/**
> + * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
> + * @auxdev: auxiliary bus device to add the sysfs entry.
> + * @irq: the IRQ to remove.
> + *
> + * This function should be called to remove an IRQ sysfs entry.
> + */
> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
> +{
> + struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
> + struct device *dev = &auxdev->dev;
> +
> + sysfs_remove_file_from_group(&dev->kobj, &info->sysfs_attr.attr,
> + auxiliary_irqs_group.name);
> + xa_erase(&auxdev->irqs, irq);
> + kfree(info->sysfs_attr.attr.name);
> + kfree(info);
> + auxiliary_irq_destroy(irq);
> +}
> +EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_remove);
> +#endif
> +
> static const struct auxiliary_device_id *auxiliary_match_id(const struct auxiliary_device_id *id,
> const struct auxiliary_device *auxdev)
> {
> @@ -295,6 +452,7 @@ EXPORT_SYMBOL_GPL(auxiliary_device_init);
> * __auxiliary_device_add - add an auxiliary bus device
> * @auxdev: auxiliary bus device to add to the bus
> * @modname: name of the parent device's driver module
> + * @irqs_sysfs_enable: whether to enable IRQs sysfs
> *
> * This is the third step in the three-step process to register an
> * auxiliary_device.
> @@ -310,7 +468,8 @@ EXPORT_SYMBOL_GPL(auxiliary_device_init);
> * parameter. Only if a user requires a custom name would this version be
> * called directly.
> */
> -int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname)
> +int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname,
> + bool irqs_sysfs_enable)
> {
> struct device *dev = &auxdev->dev;
> int ret;
> @@ -325,6 +484,10 @@ int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname)
> dev_err(dev, "auxiliary device dev_set_name failed: %d\n", ret);
> return ret;
> }
> + if (irqs_sysfs_enable) {
> + dev->groups = auxiliary_irqs_groups;
> + xa_init(&auxdev->irqs);
> + }
>
> ret = device_add(dev);
> if (ret)
> diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
> index de21d9d24a95..760fadb26620 100644
> --- a/include/linux/auxiliary_bus.h
> +++ b/include/linux/auxiliary_bus.h
> @@ -58,6 +58,7 @@
> * in
> * @name: Match name found by the auxiliary device driver,
> * @id: unique identitier if multiple devices of the same name are exported,
> + * @irqs: irqs xarray contains irq indices which are used by the device,
> *
> * An auxiliary_device represents a part of its parent device's functionality.
> * It is given a name that, combined with the registering drivers
> @@ -138,6 +139,7 @@
> struct auxiliary_device {
> struct device dev;
> const char *name;
> + struct xarray irqs;
> u32 id;
> };
>
> @@ -209,8 +211,26 @@ static inline struct auxiliary_driver *to_auxiliary_drv(struct device_driver *dr
> }
>
> int auxiliary_device_init(struct auxiliary_device *auxdev);
> -int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname);
> -#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev, KBUILD_MODNAME)
> +int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname,
> + bool irqs_sysfs_enable);
> +#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev, KBUILD_MODNAME, false)
> +#define auxiliary_device_add_with_irqs(auxdev) \
> + __auxiliary_device_add(auxdev, KBUILD_MODNAME, true)
> +
> +#ifdef CONFIG_SYSFS
> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq);
> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev,
> + int irq);
> +#else /* CONFIG_SYSFS */
> +static inline int
> +auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
> +{
> + return 0;
> +}
> +
> +static inline void
> +auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq) {}
> +#endif
>
> static inline void auxiliary_device_uninit(struct auxiliary_device *auxdev)
> {
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs
2024-05-28 9:11 ` [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs Shay Drory
@ 2024-05-28 14:48 ` Przemek Kitszel
2024-05-28 14:51 ` Parav Pandit
0 siblings, 1 reply; 18+ messages in thread
From: Przemek Kitszel @ 2024-05-28 14:48 UTC (permalink / raw)
To: Shay Drory, netdev, pabeni, davem, kuba, edumazet, gregkh,
david.m.ertman
Cc: rafael, ira.weiny, linux-rdma, leon, tariqt, Parav Pandit
On 5/28/24 11:11, Shay Drory wrote:
> Expose the sysfs files for the IRQs that the mlx5 PCI SFs are using.
> These entries are similar to PCI PFs and VFs in 'msi_irqs' directory.
>
> Reviewed-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Shay Drory <shayd@nvidia.com>
>
> ---
> v2->v3:
> - fix mlx5 sfnum SF sysfs
> ---
> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 6 +++---
> .../ethernet/mellanox/mlx5/core/irq_affinity.c | 15 ++++++++++++++-
> .../net/ethernet/mellanox/mlx5/core/mlx5_core.h | 6 ++++++
> .../net/ethernet/mellanox/mlx5/core/mlx5_irq.h | 12 ++++++++----
> .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 12 +++++++++---
> .../net/ethernet/mellanox/mlx5/core/sf/dev/dev.c | 16 +++++++---------
> 6 files changed, 47 insertions(+), 20 deletions(-)
>
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c b/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c
> index 612e666ec263..5c36aa3c57e0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/irq_affinity.c
> @@ -112,15 +112,18 @@ irq_pool_find_least_loaded(struct mlx5_irq_pool *pool, const struct cpumask *req
>
> /**
> * mlx5_irq_affinity_request - request an IRQ according to the given mask.
> + * @dev: mlx5 core device which is requesting the IRQ.
> * @pool: IRQ pool to request from.
> * @af_desc: affinity descriptor for this IRQ.
> *
> * This function returns a pointer to IRQ, or ERR_PTR in case of error.
> */
> struct mlx5_irq *
> -mlx5_irq_affinity_request(struct mlx5_irq_pool *pool, struct irq_affinity_desc *af_desc)
> +mlx5_irq_affinity_request(struct mlx5_core_dev *dev, struct mlx5_irq_pool *pool,
> + struct irq_affinity_desc *af_desc)
> {
> struct mlx5_irq *least_loaded_irq, *new_irq;
> + int ret;
>
> mutex_lock(&pool->lock);
> least_loaded_irq = irq_pool_find_least_loaded(pool, &af_desc->mask);
> @@ -152,6 +155,13 @@ mlx5_irq_affinity_request(struct mlx5_irq_pool *pool, struct irq_affinity_desc *
> mlx5_irq_get_index(least_loaded_irq)), pool->name,
> mlx5_irq_read_locked(least_loaded_irq) / MLX5_EQ_REFS_PER_IRQ);
> unlock:
> + if (mlx5_irq_pool_is_sf_pool(pool)) {
> + ret = auxiliary_device_sysfs_irq_add(mlx5_sf_coredev_to_adev(dev),
> + mlx5_irq_get_irq(least_loaded_irq));
> + if (ret)
> + mlx5_core_err(dev, "Failed to create sysfs entry for irq %d, ret = %d\n",
> + mlx5_irq_get_irq(least_loaded_irq), ret);
you are handling the error by logging a message, then ignoring it
this is clearly not an ERROR, just a WARN or INFO.
> + }
> mutex_unlock(&pool->lock);
> return least_loaded_irq;
> }
[...]
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs
2024-05-28 14:48 ` Przemek Kitszel
@ 2024-05-28 14:51 ` Parav Pandit
2024-05-29 11:13 ` Shay Drori
0 siblings, 1 reply; 18+ messages in thread
From: Parav Pandit @ 2024-05-28 14:51 UTC (permalink / raw)
To: Przemek Kitszel, Shay Drori, netdev@vger.kernel.org,
pabeni@redhat.com, davem@davemloft.net, kuba@kernel.org,
edumazet@google.com, gregkh@linuxfoundation.org,
david.m.ertman@intel.com
Cc: rafael@kernel.org, ira.weiny@intel.com,
linux-rdma@vger.kernel.org, leon@kernel.org, Tariq Toukan
> From: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Sent: Tuesday, May 28, 2024 8:18 PM
[..]
> mlx5_irq_get_index(least_loaded_irq)), pool->name,
> > mlx5_irq_read_locked(least_loaded_irq) /
> MLX5_EQ_REFS_PER_IRQ);
> > unlock:
> > + if (mlx5_irq_pool_is_sf_pool(pool)) {
> > + ret =
> auxiliary_device_sysfs_irq_add(mlx5_sf_coredev_to_adev(dev),
> > +
> mlx5_irq_get_irq(least_loaded_irq));
> > + if (ret)
> > + mlx5_core_err(dev, "Failed to create sysfs entry for irq
> %d, ret = %d\n",
> > + mlx5_irq_get_irq(least_loaded_irq), ret);
>
> you are handling the error by logging a message, then ignoring it this is clearly
> not an ERROR, just a WARN or INFO.
>
> > + }
> > mutex_unlock(&pool->lock);
> > return least_loaded_irq;
> > }
>
> [...]
I clearly remember discussing/reviewing this internally to error out.
Without it, we didn’t add the entry, but we will try to remove it where the remove function does not expect an error.
Shay,
Error unwinding should happen when fail to create the sysfs entry.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-05-28 9:11 [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Shay Drory
2024-05-28 9:11 ` [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs Shay Drory
2024-05-28 9:11 ` [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs Shay Drory
@ 2024-05-28 17:57 ` Greg KH
2024-05-29 6:19 ` Shay Drori
2 siblings, 1 reply; 18+ messages in thread
From: Greg KH @ 2024-05-28 17:57 UTC (permalink / raw)
To: Shay Drory
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt
On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
> IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
> subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
> on the auxiliary bus. However, these PCI SFs lack such IRQ information
> on the auxiliary bus, leaving users without visibility into which IRQs
> are used by the SFs. This absence makes it impossible to debug
> situations and to understand the source of interrupts/SFs for
> performance tuning and debug.
Wait, again, this feels wrong. You should be able to walk back up the
tree see the irq for the device, and vf, right? Why would the value be
different down in the aux device? Does the msi irq somehow not actually
show anywhere for the real pci device in sysfs at all today?
What does sysfs look like today exactly for this information?
And what about /proc/irq/ and /proc/interrupts/ doesn't that show you
the needed information? Why are aux devices somehow "special" here?
> Additionally, the SFs are multifunctional devices supporting RDMA,
> network devices, clocks, and more, similar to their peer PCI PFs and
> VFs. Therefore, it is desirable to have SFs' IRQ information available
> at the bus/device level.
But it should be as part of the pci device, as that's where that
information lives and is "bound" to, not the aux device on its own.
> To overcome the above limitations, this short series extends the
> auxiliary bus to display IRQ information in sysfs, similar to that of
> PFs and VFs.
Again, examples of what it looks like today, and what it will look like
with this patch set is needed in order to justify why this really is
needed as it seems that the information should already be there for you.
> It adds an 'irqs' directory under the auxiliary device and includes an
> <irq_num> sysfs file within it. Sometimes, the PCI SF auxiliary devices
> share the IRQ with other SFs, a detail that is also not available to the
> users. Consequently, this <irq_num> file indicates whether the IRQ is
> 'exclusive' or 'shared'. This 'irqs' directory extenstion is optional,
> i.e. only for PCI SFs the sysfs irq information is optionally exposed.
Why does userspace care about "shared" or not? What can they do with
that, and why?
> For example:
> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
> 50 51 52 53 54 55 56 57 58
> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
> exclusive
"exclusive" for now, but again, why? Who cares? These are msi irqs it
shouldn't matter if they are shared or not.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs
2024-05-28 9:11 ` [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs Shay Drory
2024-05-28 14:43 ` Przemek Kitszel
@ 2024-05-28 18:00 ` Greg KH
2024-05-29 6:29 ` Shay Drori
1 sibling, 1 reply; 18+ messages in thread
From: Greg KH @ 2024-05-28 18:00 UTC (permalink / raw)
To: Shay Drory
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt, Parav Pandit
On Tue, May 28, 2024 at 12:11:43PM +0300, Shay Drory wrote:
> +#ifdef CONFIG_SYSFS
> +/* Xarray of irqs to determine if irq is exclusive or shared. */
> +static DEFINE_XARRAY(irqs);
> +/* Protects insertions into the irqs xarray. */
> +static DEFINE_MUTEX(irqs_lock);
You access the irq xarray without grabbing the lock in places :(
But again, I fail to see why the xarray is needed at all, why isn't the
needed information here:
> +struct auxiliary_irq_info {
> + struct device_attribute sysfs_attr;
> + int irq;
> +};
Right there^ should contain everything you need, NOT a global array and
lock at all.
> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
> + * shared or exclusive.
Why are you using networking comment style here? :)
> diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
> index de21d9d24a95..760fadb26620 100644
> --- a/include/linux/auxiliary_bus.h
> +++ b/include/linux/auxiliary_bus.h
> @@ -58,6 +58,7 @@
> * in
> * @name: Match name found by the auxiliary device driver,
> * @id: unique identitier if multiple devices of the same name are exported,
> + * @irqs: irqs xarray contains irq indices which are used by the device,
> *
> * An auxiliary_device represents a part of its parent device's functionality.
> * It is given a name that, combined with the registering drivers
> @@ -138,6 +139,7 @@
> struct auxiliary_device {
> struct device dev;
> const char *name;
> + struct xarray irqs;
wait, why is an xarray added here too? That feels wrong, or odd, or
something as you seem to have multiple xarrays here when it feels like
you need none.
confused,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-05-28 17:57 ` [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Greg KH
@ 2024-05-29 6:19 ` Shay Drori
2024-05-30 14:13 ` Shay Drori
2024-05-30 14:50 ` Greg KH
0 siblings, 2 replies; 18+ messages in thread
From: Shay Drori @ 2024-05-29 6:19 UTC (permalink / raw)
To: Greg KH
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt
On 28/05/2024 20:57, Greg KH wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
>> Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
>> IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
>> subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
>> on the auxiliary bus. However, these PCI SFs lack such IRQ information
>> on the auxiliary bus, leaving users without visibility into which IRQs
>> are used by the SFs. This absence makes it impossible to debug
>> situations and to understand the source of interrupts/SFs for
>> performance tuning and debug.
>
> Wait, again, this feels wrong. You should be able to walk back up the
> tree see the irq for the device, and vf, right? Why would the value be
> different down in the aux device?
you are correct, the IRQs of the aux device are subset of the IRQs of
the parent device. more detailed answer bellow.
> Does the msi irq somehow not actually show anywhere for the real pci device in sysfs at all today?
>
> What does sysfs look like today exactly for this information?
The IRQ information of all the children SFs of a PF is found in the PF
device as one single list.
There is no sane way to distinguish which IRQ is used by which SFs.
For example, in a machine with a single child SF of the PF 00:0b.0 we
currently have the bellow:
$ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
the above are IRQs of both the PF and its child SF. from here we cannot
tell which IRQ is used by the child SF.
>
> And what about /proc/irq/ and /proc/interrupts/ doesn't that show you
> the needed information? Why are aux devices somehow "special" here?
/proc/interrupts interrupt name is some random driver choice. There is
no sane naming convention and some small few bytes of upper limit of
what the IRQ name.
>
>> Additionally, the SFs are multifunctional devices supporting RDMA,
>> network devices, clocks, and more, similar to their peer PCI PFs and
>> VFs. Therefore, it is desirable to have SFs' IRQ information available
>> at the bus/device level.
>
> But it should be as part of the pci device, as that's where that
> information lives and is "bound" to, not the aux device on its own.
Auxiliary bus level SF device is using that IRQ too. So it is
additionally shown at auxiliary device level too.
>
>> To overcome the above limitations, this short series extends the
>> auxiliary bus to display IRQ information in sysfs, similar to that of
>> PFs and VFs.
>
> Again, examples of what it looks like today, and what it will look like
> with this patch set is needed in order to justify why this really is
> needed as it seems that the information should already be there for you.
full example:
in a machine with a single child SF of the PF 00:0b.0 we currently have
the bellow:
$ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
with this patch-set we will also have:
$ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
50 51 52 53 54 55 56 57 58
which means that IRQs 50-58, which are shown on the PF "msi_irqs" are
used by the child SF.
>
>> It adds an 'irqs' directory under the auxiliary device and includes an
>> <irq_num> sysfs file within it. Sometimes, the PCI SF auxiliary devices
>> share the IRQ with other SFs, a detail that is also not available to the
>> users. Consequently, this <irq_num> file indicates whether the IRQ is
>> 'exclusive' or 'shared'. This 'irqs' directory extenstion is optional,
>> i.e. only for PCI SFs the sysfs irq information is optionally exposed.
>
> Why does userspace care about "shared" or not? What can they do with
> that, and why?
If IRQ is shared, userspace needs to take it into account when looking
at IRQ data and counters such as interrupts/sec.
Also, If IRQ is shared, user better not mess with the irq affinity of
such irq it as it can affect multiple SFs.
>
>> For example:
>> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
>> 50 51 52 53 54 55 56 57 58
>> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
>> exclusive
>
> "exclusive" for now, but again, why? Who cares? These are msi irqs it
> shouldn't matter if they are shared or not.
hope I explained the current limitation and the proposed solution above
>
> thanks,
>
> greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs
2024-05-28 18:00 ` Greg KH
@ 2024-05-29 6:29 ` Shay Drori
2024-05-30 14:46 ` Greg KH
0 siblings, 1 reply; 18+ messages in thread
From: Shay Drori @ 2024-05-29 6:29 UTC (permalink / raw)
To: Greg KH
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt, Parav Pandit
On 28/05/2024 21:00, Greg KH wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, May 28, 2024 at 12:11:43PM +0300, Shay Drory wrote:
>> +#ifdef CONFIG_SYSFS
>> +/* Xarray of irqs to determine if irq is exclusive or shared. */
>> +static DEFINE_XARRAY(irqs);
>> +/* Protects insertions into the irqs xarray. */
>> +static DEFINE_MUTEX(irqs_lock);
>
> You access the irq xarray without grabbing the lock in places :(
>
> But again, I fail to see why the xarray is needed at all, why isn't the
> needed information here:
>
>> +struct auxiliary_irq_info {
>> + struct device_attribute sysfs_attr;
>> + int irq;
>> +};
>
> Right there^ should contain everything you need, NOT a global array and
> lock at all.
1) one xarray is per aux device that indicates which IRQs irqs are used
by this device. this xarray is holding the info above.
2) second xarray is global that tracks if irq share between multiple aux
devices or exclusive to aux device.
>
>> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
>> + * shared or exclusive.
>
> Why are you using networking comment style here? :)
correct, will fix in next version
>
>> diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
>> index de21d9d24a95..760fadb26620 100644
>> --- a/include/linux/auxiliary_bus.h
>> +++ b/include/linux/auxiliary_bus.h
>> @@ -58,6 +58,7 @@
>> * in
>> * @name: Match name found by the auxiliary device driver,
>> * @id: unique identitier if multiple devices of the same name are exported,
>> + * @irqs: irqs xarray contains irq indices which are used by the device,
>> *
>> * An auxiliary_device represents a part of its parent device's functionality.
>> * It is given a name that, combined with the registering drivers
>> @@ -138,6 +139,7 @@
>> struct auxiliary_device {
>> struct device dev;
>> const char *name;
>> + struct xarray irqs;
>
> wait, why is an xarray added here too? That feels wrong, or odd, or
> something as you seem to have multiple xarrays here when it feels like
> you need none.
>
please look the answer above
> confused,
>
> greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs
2024-05-28 14:43 ` Przemek Kitszel
@ 2024-05-29 6:58 ` Shay Drori
0 siblings, 0 replies; 18+ messages in thread
From: Shay Drori @ 2024-05-29 6:58 UTC (permalink / raw)
To: Przemek Kitszel, gregkh
Cc: rafael, ira.weiny, edumazet, kuba, davem, pabeni, netdev,
david.m.ertman, linux-rdma, leon, tariqt, Parav Pandit
On 28/05/2024 17:43, Przemek Kitszel wrote:
> External email: Use caution opening links or attachments
>
>
> On 5/28/24 11:11, Shay Drory wrote:
>> Some PCI subfunctions (SF) are anchored on the auxiliary bus. PCI
>> physical and virtual functions are anchored on the PCI bus. The irq
>> information of each such function is visible to users via sysfs
>> directory "msi_irqs" containing file for each irq entry. However, for
>> PCI SFs such information is unavailable. Due to this users have no
>> visibility on IRQs used by the SFs.
>> Secondly, an SF can be multi function device supporting rdma, netdevice
>> and more. Without irq information at the bus level, the user is unable
>> to view or use the affinity of the SF IRQs.
>>
>> Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
>> for supporting auxiliary devices, containing file for each irq entry.
>>
>> Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
>> information is also not available to the users. To overcome this
>> limitation, each irq sysfs entry shows if irq is exclusive or shared.
>>
>> For example:
>> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
>> 50 51 52 53 54 55 56 57 58
>> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
>> exclusive
>>
>> Reviewed-by: Parav Pandit <parav@nvidia.com>
>> Signed-off-by: Shay Drory <shayd@nvidia.com>
>>
>> ---
>> v4-v5:
>> - restore global mutex and replace refcount_t with simple integer (Greg)
>> v3->4:
>> - remove global mutex (Przemek)
>> v2->v3:
>> - fix function declaration in case SYSFS isn't defined
>> v1->v2:
>> - move #ifdefs from drivers/base/auxiliary.c to
>> include/linux/auxiliary_bus.h (Greg)
>> - use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
>> - Fix kzalloc(ref) to kzalloc(*ref) (Simon)
>> - Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
>> - Fix auxiliary_irq_mode_show doc (kernel test boot)
>> ---
>> Documentation/ABI/testing/sysfs-bus-auxiliary | 14 ++
>> drivers/base/auxiliary.c | 165 +++++++++++++++++-
>> include/linux/auxiliary_bus.h | 24 ++-
>> 3 files changed, 200 insertions(+), 3 deletions(-)
>> create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary
>> b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> new file mode 100644
>> index 000000000000..3b8299d49d9e
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> @@ -0,0 +1,14 @@
>> +What: /sys/bus/auxiliary/devices/.../irqs/
>> +Date: April, 2024
>> +Contact: Shay Drory <shayd@nvidia.com>
>> +Description:
>> + The /sys/devices/.../irqs directory contains a variable
>> set of
>> + files, with each file is named as irq number similar to
>> PCI PF
>> + or VF's irq number located in msi_irqs directory.
>> +
>> +What: /sys/bus/auxiliary/devices/.../irqs/<N>
>> +Date: April, 2024
>> +Contact: Shay Drory <shayd@nvidia.com>
>> +Description:
>> + auxiliary devices can share IRQs. This attribute
>> indicates if
>> + the irq is shared with other SFs or exclusively used by
>> the SF.
>> diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
>> index d3a2c40c2f12..579d755dcbee 100644
>> --- a/drivers/base/auxiliary.c
>> +++ b/drivers/base/auxiliary.c
>> @@ -158,6 +158,163 @@
>> * };
>> */
>>
>> +#ifdef CONFIG_SYSFS
>> +/* Xarray of irqs to determine if irq is exclusive or shared. */
>> +static DEFINE_XARRAY(irqs);
>> +/* Protects insertions into the irqs xarray. */
>> +static DEFINE_MUTEX(irqs_lock);
>> +
>> +struct auxiliary_irq_info {
>> + struct device_attribute sysfs_attr;
>> + int irq;
>> +};
>> +
>> +static struct attribute *auxiliary_irq_attrs[] = {
>> + NULL
>> +};
>> +
>> +static const struct attribute_group auxiliary_irqs_group = {
>> + .name = "irqs",
>> + .attrs = auxiliary_irq_attrs,
>> +};
>> +
>> +static const struct attribute_group *auxiliary_irqs_groups[] = {
>> + &auxiliary_irqs_group,
>> + NULL
>> +};
>> +
>> +/* Auxiliary devices can share IRQs. Expose to user whether the
>> provided IRQ is
>> + * shared or exclusive.
>> + */
>> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
>> + struct device_attribute *attr,
>> char *buf)
>> +{
>> + struct auxiliary_irq_info *info =
>> + container_of(attr, struct auxiliary_irq_info, sysfs_attr);
>> + int ref = xa_to_value(xa_load(&irqs, info->irq));
>
> just a note that you forgot to take the global lock here
correct, will fix in next version
>
>> +
>> + if (!ref)
>> + return -ENOENT;
>> + if (ref > 1)
>> + return sysfs_emit(buf, "%s\n", "shared");
>> + else
>> + return sysfs_emit(buf, "%s\n", "exclusive");
>> +}
>> +
>> +static void auxiliary_irq_destroy(int irq)
>> +{
>> + int ref;
>> +
>> + mutex_lock(&irqs_lock);
>> + ref = xa_to_value(xa_load(&irqs, irq));
>> + if (!(--ref))
>> + xa_erase(&irqs, irq);
>
> Global lock makes it indeed simpler to support xa_erase()-on-zero.
> There are simple solutions without erasing zero elements (you could
> have non-allocating store), but let's say we are leaving "the simplest"
> room then :)
>
>> + else
>> + xa_store(&irqs, irq, xa_mk_value(ref), GFP_KERNEL);
>> + mutex_unlock(&irqs_lock);
>> +}
>> +
>> +static int auxiliary_irq_create(int irq)
>> +{
>> + int ret = 0;
>> + int ref;
>> +
>> + mutex_lock(&irqs_lock);
>> + ref = xa_to_value(xa_load(&irqs, irq));
>> + if (ref) {
>> + ref++;
>> + xa_store(&irqs, irq, xa_mk_value(ref), GFP_KERNEL);
>> + goto out;
>> + }
>> +
>> + ret = xa_insert(&irqs, irq, xa_mk_value(1), GFP_KERNEL);
>
> make code simpler by one common variant of ref++ & store
Nice :)
will change in next version.
>
>> +
>> +out:
>> + mutex_unlock(&irqs_lock);
>> + return ret;
>> +}
>> +
>> +/**
>> + * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: The associated Linux interrupt number.
>> + *
>> + * This function should be called after auxiliary device have
>> successfully
>> + * received the irq.
>> + *
>> + * Return: zero on success or an error code on failure.
>> + */
>> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev,
>> int irq)
>> +{
>> + struct device *dev = &auxdev->dev;
>> + struct auxiliary_irq_info *info;
>> + int ret;
>> +
>> + ret = auxiliary_irq_create(irq);
>> + if (ret)
>> + return ret;
>> +
>> + info = kzalloc(sizeof(*info), GFP_KERNEL);
>> + if (!info) {
>> + ret = -ENOMEM;
>> + goto info_err;
>> + }
>> +
>> + sysfs_attr_init(&info->sysfs_attr.attr);
>> + info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
>> + if (!info->sysfs_attr.attr.name) {
>> + ret = -ENOMEM;
>> + goto name_err;
>> + }
>> + info->irq = irq;
>> + info->sysfs_attr.attr.mode = 0444;
>> + info->sysfs_attr.show = auxiliary_irq_mode_show;
>> +
>> + ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
>> + if (ret)
>> + goto auxdev_xa_err;
>> +
>> + ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
>> + auxiliary_irqs_group.name);
>> + if (ret)
>> + goto sysfs_add_err;
>> +
>> + return 0;
>> +
>> +sysfs_add_err:
>> + xa_erase(&auxdev->irqs, irq);
>> +auxdev_xa_err:
>> + kfree(info->sysfs_attr.attr.name);
>> +name_err:
>> + kfree(info);
>> +info_err:
>> + auxiliary_irq_destroy(irq);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_add);
>> +
>> +/**
>> + * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the
>> given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: the IRQ to remove.
>> + *
>> + * This function should be called to remove an IRQ sysfs entry.
>> + */
>> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device
>> *auxdev, int irq)
>> +{
>> + struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
>> + struct device *dev = &auxdev->dev;
>> +
>> + sysfs_remove_file_from_group(&dev->kobj, &info->sysfs_attr.attr,
>> + auxiliary_irqs_group.name);
>> + xa_erase(&auxdev->irqs, irq);
>> + kfree(info->sysfs_attr.attr.name);
>> + kfree(info);
>> + auxiliary_irq_destroy(irq);
>> +}
>> +EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_remove);
>> +#endif
>> +
>> static const struct auxiliary_device_id *auxiliary_match_id(const
>> struct auxiliary_device_id *id,
>> const struct
>> auxiliary_device *auxdev)
>> {
>> @@ -295,6 +452,7 @@ EXPORT_SYMBOL_GPL(auxiliary_device_init);
>> * __auxiliary_device_add - add an auxiliary bus device
>> * @auxdev: auxiliary bus device to add to the bus
>> * @modname: name of the parent device's driver module
>> + * @irqs_sysfs_enable: whether to enable IRQs sysfs
>> *
>> * This is the third step in the three-step process to register an
>> * auxiliary_device.
>> @@ -310,7 +468,8 @@ EXPORT_SYMBOL_GPL(auxiliary_device_init);
>> * parameter. Only if a user requires a custom name would this
>> version be
>> * called directly.
>> */
>> -int __auxiliary_device_add(struct auxiliary_device *auxdev, const
>> char *modname)
>> +int __auxiliary_device_add(struct auxiliary_device *auxdev, const
>> char *modname,
>> + bool irqs_sysfs_enable)
>> {
>> struct device *dev = &auxdev->dev;
>> int ret;
>> @@ -325,6 +484,10 @@ int __auxiliary_device_add(struct
>> auxiliary_device *auxdev, const char *modname)
>> dev_err(dev, "auxiliary device dev_set_name failed:
>> %d\n", ret);
>> return ret;
>> }
>> + if (irqs_sysfs_enable) {
>> + dev->groups = auxiliary_irqs_groups;
>> + xa_init(&auxdev->irqs);
>> + }
>>
>> ret = device_add(dev);
>> if (ret)
>> diff --git a/include/linux/auxiliary_bus.h
>> b/include/linux/auxiliary_bus.h
>> index de21d9d24a95..760fadb26620 100644
>> --- a/include/linux/auxiliary_bus.h
>> +++ b/include/linux/auxiliary_bus.h
>> @@ -58,6 +58,7 @@
>> * in
>> * @name: Match name found by the auxiliary device driver,
>> * @id: unique identitier if multiple devices of the same name are
>> exported,
>> + * @irqs: irqs xarray contains irq indices which are used by the device,
>> *
>> * An auxiliary_device represents a part of its parent device's
>> functionality.
>> * It is given a name that, combined with the registering drivers
>> @@ -138,6 +139,7 @@
>> struct auxiliary_device {
>> struct device dev;
>> const char *name;
>> + struct xarray irqs;
>> u32 id;
>> };
>>
>> @@ -209,8 +211,26 @@ static inline struct auxiliary_driver
>> *to_auxiliary_drv(struct device_driver *dr
>> }
>>
>> int auxiliary_device_init(struct auxiliary_device *auxdev);
>> -int __auxiliary_device_add(struct auxiliary_device *auxdev, const
>> char *modname);
>> -#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev,
>> KBUILD_MODNAME)
>> +int __auxiliary_device_add(struct auxiliary_device *auxdev, const
>> char *modname,
>> + bool irqs_sysfs_enable);
>> +#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev,
>> KBUILD_MODNAME, false)
>> +#define auxiliary_device_add_with_irqs(auxdev) \
>> + __auxiliary_device_add(auxdev, KBUILD_MODNAME, true)
>> +
>> +#ifdef CONFIG_SYSFS
>> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev,
>> int irq);
>> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev,
>> + int irq);
>> +#else /* CONFIG_SYSFS */
>> +static inline int
>> +auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline void
>> +auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev,
>> int irq) {}
>> +#endif
>>
>> static inline void auxiliary_device_uninit(struct auxiliary_device
>> *auxdev)
>> {
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs
2024-05-28 14:51 ` Parav Pandit
@ 2024-05-29 11:13 ` Shay Drori
0 siblings, 0 replies; 18+ messages in thread
From: Shay Drori @ 2024-05-29 11:13 UTC (permalink / raw)
To: Parav Pandit, Przemek Kitszel, netdev@vger.kernel.org,
pabeni@redhat.com, davem@davemloft.net, kuba@kernel.org,
edumazet@google.com, gregkh@linuxfoundation.org,
david.m.ertman@intel.com
Cc: rafael@kernel.org, ira.weiny@intel.com,
linux-rdma@vger.kernel.org, leon@kernel.org, Tariq Toukan
On 28/05/2024 17:51, Parav Pandit wrote:
>
>> From: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>> Sent: Tuesday, May 28, 2024 8:18 PM
>
> [..]
>
>> mlx5_irq_get_index(least_loaded_irq)), pool->name,
>>> mlx5_irq_read_locked(least_loaded_irq) /
>> MLX5_EQ_REFS_PER_IRQ);
>>> unlock:
>>> + if (mlx5_irq_pool_is_sf_pool(pool)) {
>>> + ret =
>> auxiliary_device_sysfs_irq_add(mlx5_sf_coredev_to_adev(dev),
>>> +
>> mlx5_irq_get_irq(least_loaded_irq));
>>> + if (ret)
>>> + mlx5_core_err(dev, "Failed to create sysfs entry for irq
>> %d, ret = %d\n",
>>> + mlx5_irq_get_irq(least_loaded_irq), ret);
>>
>> you are handling the error by logging a message, then ignoring it this is clearly
>> not an ERROR, just a WARN or INFO.
>>
>>> + }
>>> mutex_unlock(&pool->lock);
>>> return least_loaded_irq;
>>> }
>>
>> [...]
>
> I clearly remember discussing/reviewing this internally to error out.
> Without it, we didn’t add the entry, but we will try to remove it where the remove function does not expect an error.
>
> Shay,
> Error unwinding should happen when fail to create the sysfs entry.
correct, will fix in next version
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-05-29 6:19 ` Shay Drori
@ 2024-05-30 14:13 ` Shay Drori
2024-05-30 14:50 ` Greg KH
1 sibling, 0 replies; 18+ messages in thread
From: Shay Drori @ 2024-05-30 14:13 UTC (permalink / raw)
To: Greg KH
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt
Hi Greg,
Did you get a chance to see my reply? Is it ok with you? I would like to
submit the fixes that Parav/Przemek suggested if no further inputs.
Thanks
On 29/05/2024 9:19, Shay Drori wrote:
>
>
> On 28/05/2024 20:57, Greg KH wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
>>> Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
>>> IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
>>> subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
>>> on the auxiliary bus. However, these PCI SFs lack such IRQ information
>>> on the auxiliary bus, leaving users without visibility into which IRQs
>>> are used by the SFs. This absence makes it impossible to debug
>>> situations and to understand the source of interrupts/SFs for
>>> performance tuning and debug.
>>
>> Wait, again, this feels wrong. You should be able to walk back up the
>> tree see the irq for the device, and vf, right? Why would the value be
>> different down in the aux device?
>
>
> you are correct, the IRQs of the aux device are subset of the IRQs of
> the parent device. more detailed answer bellow.
>
>
>> Does the msi irq somehow not actually show anywhere for the real pci
>> device in sysfs at all today?
>>
>> What does sysfs look like today exactly for this information?
>
>
> The IRQ information of all the children SFs of a PF is found in the PF
> device as one single list.
> There is no sane way to distinguish which IRQ is used by which SFs.
> For example, in a machine with a single child SF of the PF 00:0b.0 we
> currently have the bellow:
> $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
>
> the above are IRQs of both the PF and its child SF. from here we cannot
> tell which IRQ is used by the child SF.
>
>
>>
>> And what about /proc/irq/ and /proc/interrupts/ doesn't that show you
>> the needed information? Why are aux devices somehow "special" here?
>
>
> /proc/interrupts interrupt name is some random driver choice. There is
> no sane naming convention and some small few bytes of upper limit of
> what the IRQ name.
>
>>
>>> Additionally, the SFs are multifunctional devices supporting RDMA,
>>> network devices, clocks, and more, similar to their peer PCI PFs and
>>> VFs. Therefore, it is desirable to have SFs' IRQ information available
>>> at the bus/device level.
>>
>> But it should be as part of the pci device, as that's where that
>> information lives and is "bound" to, not the aux device on its own.
>
>
> Auxiliary bus level SF device is using that IRQ too. So it is
> additionally shown at auxiliary device level too.
>
>
>>
>>> To overcome the above limitations, this short series extends the
>>> auxiliary bus to display IRQ information in sysfs, similar to that of
>>> PFs and VFs.
>>
>> Again, examples of what it looks like today, and what it will look like
>> with this patch set is needed in order to justify why this really is
>> needed as it seems that the information should already be there for you.
>
>
> full example:
> in a machine with a single child SF of the PF 00:0b.0 we currently have
> the bellow:
> $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
>
> with this patch-set we will also have:
> $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> 50 51 52 53 54 55 56 57 58
>
> which means that IRQs 50-58, which are shown on the PF "msi_irqs" are
> used by the child SF.
>
>
>>
>>> It adds an 'irqs' directory under the auxiliary device and includes an
>>> <irq_num> sysfs file within it. Sometimes, the PCI SF auxiliary devices
>>> share the IRQ with other SFs, a detail that is also not available to the
>>> users. Consequently, this <irq_num> file indicates whether the IRQ is
>>> 'exclusive' or 'shared'. This 'irqs' directory extenstion is optional,
>>> i.e. only for PCI SFs the sysfs irq information is optionally exposed.
>>
>> Why does userspace care about "shared" or not? What can they do with
>> that, and why?
>
>
> If IRQ is shared, userspace needs to take it into account when looking
> at IRQ data and counters such as interrupts/sec.
> Also, If IRQ is shared, user better not mess with the irq affinity of
> such irq it as it can affect multiple SFs.
>
>
>>
>>> For example:
>>> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
>>> 50 51 52 53 54 55 56 57 58
>>> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
>>> exclusive
>>
>> "exclusive" for now, but again, why? Who cares? These are msi irqs it
>> shouldn't matter if they are shared or not.
>
> hope I explained the current limitation and the proposed solution above
>
>>
>> thanks,
>>
>> greg k-h
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs
2024-05-29 6:29 ` Shay Drori
@ 2024-05-30 14:46 ` Greg KH
0 siblings, 0 replies; 18+ messages in thread
From: Greg KH @ 2024-05-30 14:46 UTC (permalink / raw)
To: Shay Drori
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt, Parav Pandit
On Wed, May 29, 2024 at 09:29:33AM +0300, Shay Drori wrote:
>
>
> On 28/05/2024 21:00, Greg KH wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Tue, May 28, 2024 at 12:11:43PM +0300, Shay Drory wrote:
> > > +#ifdef CONFIG_SYSFS
> > > +/* Xarray of irqs to determine if irq is exclusive or shared. */
> > > +static DEFINE_XARRAY(irqs);
> > > +/* Protects insertions into the irqs xarray. */
> > > +static DEFINE_MUTEX(irqs_lock);
> >
> > You access the irq xarray without grabbing the lock in places :(
> >
> > But again, I fail to see why the xarray is needed at all, why isn't the
> > needed information here:
> >
> > > +struct auxiliary_irq_info {
> > > + struct device_attribute sysfs_attr;
> > > + int irq;
> > > +};
> >
> > Right there^ should contain everything you need, NOT a global array and
> > lock at all.
>
>
> 1) one xarray is per aux device that indicates which IRQs irqs are used
> by this device. this xarray is holding the info above.
Ok, please document that better, it's not obvious.
> 2) second xarray is global that tracks if irq share between multiple aux
> devices or exclusive to aux device.
That should not be a "global" thing, as now you are getting into what
the msi irq core should be handling, NOT the aux device.
Userspace should be able to determine, just by the number, if it is
"shared" or not by looking at them all, so why need to add complex logic
here to attempt to also mirror this information?
Doesn't the irq layer track this sufficiently? And it wouldn't even be
correct if an irq was "shared" by a device that was NOT controlled by an
aux device so it could be incorrect.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-05-29 6:19 ` Shay Drori
2024-05-30 14:13 ` Shay Drori
@ 2024-05-30 14:50 ` Greg KH
2024-06-03 7:10 ` Parav Pandit
1 sibling, 1 reply; 18+ messages in thread
From: Greg KH @ 2024-05-30 14:50 UTC (permalink / raw)
To: Shay Drori
Cc: netdev, pabeni, davem, kuba, edumazet, david.m.ertman, rafael,
ira.weiny, linux-rdma, leon, tariqt
On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
>
>
> On 28/05/2024 20:57, Greg KH wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
> > > IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
> > > subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
> > > on the auxiliary bus. However, these PCI SFs lack such IRQ information
> > > on the auxiliary bus, leaving users without visibility into which IRQs
> > > are used by the SFs. This absence makes it impossible to debug
> > > situations and to understand the source of interrupts/SFs for
> > > performance tuning and debug.
> >
> > Wait, again, this feels wrong. You should be able to walk back up the
> > tree see the irq for the device, and vf, right? Why would the value be
> > different down in the aux device?
>
>
> you are correct, the IRQs of the aux device are subset of the IRQs of
> the parent device. more detailed answer bellow.
But isn't that already shown in /proc/interrupts? Why show it here as
well?
> > Does the msi irq somehow not actually show anywhere for the real pci device in sysfs at all today?
> >
> > What does sysfs look like today exactly for this information?
>
>
> The IRQ information of all the children SFs of a PF is found in the PF
> device as one single list.
> There is no sane way to distinguish which IRQ is used by which SFs.
> For example, in a machine with a single child SF of the PF 00:0b.0 we
> currently have the bellow:
> $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
>
> the above are IRQs of both the PF and its child SF. from here we cannot
> tell which IRQ is used by the child SF.
Does that matter? Seriously, what is userspace going to do with that
that it can not already determine from /proc/interrupts/ or /proc/irq/?
I know it's /proc and not /sys but duplicating information might not be
needed as I think you already have this information. If not, what is
missing in /proc/irq today for this?
> > And what about /proc/irq/ and /proc/interrupts/ doesn't that show you
> > the needed information? Why are aux devices somehow "special" here?
>
>
> /proc/interrupts interrupt name is some random driver choice.
That's the same name you use here.
> There is
> no sane naming convention and some small few bytes of upper limit of
> what the IRQ name.
That's up to the driver to name, so wouldn't it be the same here?
> > > Additionally, the SFs are multifunctional devices supporting RDMA,
> > > network devices, clocks, and more, similar to their peer PCI PFs and
> > > VFs. Therefore, it is desirable to have SFs' IRQ information available
> > > at the bus/device level.
> >
> > But it should be as part of the pci device, as that's where that
> > information lives and is "bound" to, not the aux device on its own.
>
>
> Auxiliary bus level SF device is using that IRQ too. So it is
> additionally shown at auxiliary device level too.
Your aux bus devices are using that, it's not generic to ALL aux bus
devices. Along those lines, why not just put this info in the aux bus
devices that your driver is bound to?
> > > To overcome the above limitations, this short series extends the
> > > auxiliary bus to display IRQ information in sysfs, similar to that of
> > > PFs and VFs.
> >
> > Again, examples of what it looks like today, and what it will look like
> > with this patch set is needed in order to justify why this really is
> > needed as it seems that the information should already be there for you.
>
>
> full example:
> in a machine with a single child SF of the PF 00:0b.0 we currently have
> the bellow:
> $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
>
> with this patch-set we will also have:
> $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> 50 51 52 53 54 55 56 57 58
>
> which means that IRQs 50-58, which are shown on the PF "msi_irqs" are
> used by the child SF.
Please document this in the changelog if you wish to persist with it.
> > > It adds an 'irqs' directory under the auxiliary device and includes an
> > > <irq_num> sysfs file within it. Sometimes, the PCI SF auxiliary devices
> > > share the IRQ with other SFs, a detail that is also not available to the
> > > users. Consequently, this <irq_num> file indicates whether the IRQ is
> > > 'exclusive' or 'shared'. This 'irqs' directory extenstion is optional,
> > > i.e. only for PCI SFs the sysfs irq information is optionally exposed.
> >
> > Why does userspace care about "shared" or not? What can they do with
> > that, and why?
>
>
> If IRQ is shared, userspace needs to take it into account when looking
> at IRQ data and counters such as interrupts/sec.
> Also, If IRQ is shared, user better not mess with the irq affinity of
> such irq it as it can affect multiple SFs.
But the logic of "shared" is at the irq level, not at the aux bus level.
That's where that should be shown, not in a random bus subsystem,
otherwise you would have to look across ALL busses to properly determine
if an irq is shared or not, right?
thanks,
greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-05-30 14:50 ` Greg KH
@ 2024-06-03 7:10 ` Parav Pandit
2024-06-06 2:42 ` Parav Pandit
0 siblings, 1 reply; 18+ messages in thread
From: Parav Pandit @ 2024-06-03 7:10 UTC (permalink / raw)
To: Greg KH, Shay Drori
Cc: netdev@vger.kernel.org, pabeni@redhat.com, davem@davemloft.net,
kuba@kernel.org, edumazet@google.com, david.m.ertman@intel.com,
rafael@kernel.org, ira.weiny@intel.com,
linux-rdma@vger.kernel.org, leon@kernel.org, Tariq Toukan
Hi Greg,
An emergency has turned up for Shay. So, he is going to be out of office for few days.
I will attempt to answer below questions and spin v6 based on your guidance.
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Thursday, May 30, 2024 8:21 PM
>
> On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
> >
> >
> > On 28/05/2024 20:57, Greg KH wrote:
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > > Today, PCI PFs and VFs, which are anchored on the PCI bus, display
> > > > their IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs
> > > > files. PCI subfunctions (SFs) are similar to PFs and VFs and these
> > > > SFs are anchored on the auxiliary bus. However, these PCI SFs lack
> > > > such IRQ information on the auxiliary bus, leaving users without
> > > > visibility into which IRQs are used by the SFs. This absence makes
> > > > it impossible to debug situations and to understand the source of
> > > > interrupts/SFs for performance tuning and debug.
> > >
> > > Wait, again, this feels wrong. You should be able to walk back up
> > > the tree see the irq for the device, and vf, right? Why would the
> > > value be different down in the aux device?
> >
> >
> > you are correct, the IRQs of the aux device are subset of the IRQs of
> > the parent device. more detailed answer bellow.
>
> But isn't that already shown in /proc/interrupts? Why show it here as well?
>
Because /proc/interrupts cannot distinguish which irqs are for which SFs.
More below.
> > > Does the msi irq somehow not actually show anywhere for the real pci
> device in sysfs at all today?
> > >
> > > What does sysfs look like today exactly for this information?
> >
> >
> > The IRQ information of all the children SFs of a PF is found in the PF
> > device as one single list.
> > There is no sane way to distinguish which IRQ is used by which SFs.
> > For example, in a machine with a single child SF of the PF 00:0b.0 we
> > currently have the bellow:
> > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
> >
> > the above are IRQs of both the PF and its child SF. from here we
> > cannot tell which IRQ is used by the child SF.
>
> Does that matter?
Yes it matters. A user needs to see if a given IRQ is for SF or a PF and if it is for SF, is it for which SF.
It needs to know this to set/modify the irq affinity via /proc/irq/<irq_num>/affinity_files.
> Seriously, what is userspace going to do with that that it
> can not already determine from /proc/interrupts/ or /proc/irq/?
>
User space cannot determine if a given IRQ is for SF or not from /proc/interrupts and /proc/irq either.
Let me explain with example.
It cannot figure out for two reasons.
1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below.
mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0.
Format is: "purpose, event q number @ device_name".
Purpose = mlx5_compl (completions)
Event q number = 62
Device = pci:0000:06.00.0.
Can we do same for SFs?
For example, as,
mlx5_compl62@auxiliary:mlx5_core.sf.2
Will it work when SFs share the IRQ? No. it won't be accurate name when a vector is shared among many SFs.
Also updating IRQ names dynamically to arbitrary long name won't be any good either.
2. mlx5 do not use IRQF_SHARED.
This is the main reason due to which, /proc/irq/<irq_num>/devices_list is missing SFs auxiliary device entries.
We cannot share the NIC interrupt with any other arbitrary system device because nic SF needs to have reasonably predictable latency.
Shared interrupts also complicate the mlx5 driver flow even further as it can fire right away.
And we want to avoid the already complex path further, with device recovery flow in this area.
Secondly, a while back also when team played with IRQF_SHARED there was 8% or so perf drop too.
So mlx5 uses IRQ sharing among its own interrupt generators.
This is uniformly across PF, VF, SFs.
It also has certain good thresholds where it avoids sharing the IRQ and allocates new dynamic one.
Thanks to Thomas's work.
It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ subsystem can offer APIs to restrict IRQs sharing to certain devices.
Sounds too complicated for the sake of showing SFs affinity.
> I know it's /proc and not /sys but duplicating information might not be
> needed as I think you already have this information.
Like I explained above, we don’t have that info. I think IRQF_SHARED probably was missing piece to you previously.
PCI duplicates this info and widely used by irqbalance [1] which reads from /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>.
The useful part of irq files under sys is, a use space can directly see the irq files under the desired device directly.
It would be extended for auxiliary bus SFs.
[1] https://github.com/Irqbalance/irqbalance
> If not, what is missing in /proc/irq today for this?
>
If IRQF_SHARED is used, nothing is missing. But we don’t find a good way to use it.
> > > And what about /proc/irq/ and /proc/interrupts/ doesn't that show
> > > you the needed information? Why are aux devices somehow "special"
> here?
> >
> >
> > /proc/interrupts interrupt name is some random driver choice.
>
> That's the same name you use here.
>
Yes, like I explained above. We can come up with some name for /proc/interrupts.
But it won't tell which all SFs are sharing it, for example 64 SFs sharing a vector won't be visible in some giant irq name in /proc/interrupts.
So, to summarize, there are three places.
1. /proc/interrupts.
Limitations:
a. One line per interrupt does not fit the case.
b. interrupt name also not enough to show sharing SFs.
2. /proc/irq/<irq_number>/{device_list}
Limited only to IRQF_SHARED users.
3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name>
Pros:
a. Matches with pci PF and VFs
b. will be usable by irqbalance in similar way as PCI bus
c. works regardless of IRQF_SHARED used or not used.
d. Does not require complicated IRQ subsystem changes to limit IRQF_SHARING to only a specific device type.
> > There is
> > no sane naming convention and some small few bytes of upper limit of
> > what the IRQ name.
>
> That's up to the driver to name, so wouldn't it be the same here?
>
> > > > Additionally, the SFs are multifunctional devices supporting RDMA,
> > > > network devices, clocks, and more, similar to their peer PCI PFs
> > > > and VFs. Therefore, it is desirable to have SFs' IRQ information
> > > > available at the bus/device level.
> > >
> > > But it should be as part of the pci device, as that's where that
> > > information lives and is "bound" to, not the aux device on its own.
> >
> >
> > Auxiliary bus level SF device is using that IRQ too. So it is
> > additionally shown at auxiliary device level too.
>
> Your aux bus devices are using that, it's not generic to ALL aux bus devices.
> Along those lines, why not just put this info in the aux bus devices that your
> driver is bound to?
>
mlx5 driver uses the generic aux bus for SFs.
mlx5 driver can create some sysfs files directly into this SF device, however we strongly want to avoid doing
such vendor specific files based on our experience in netdev and rdma subsystems.
There are two main reasons to not put in mlx5 driver code.
1. In past several times it happened that a drivers tend to abuse this sysfs interface to put any
sort of debug information or other things and it becomes hard to get rid of.
An example of this is fw_pages in /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹
I cleaned a lot in rdma now to avoid such abuse.
Want to avoid the same for the auxiliary bus too.
2. Intel sf driver is just getting started up. If mlx5 driver shows "irqs", and intel driver will show this as
"foo_irqs", the user will not have same experience across two drivers.
For example, irqbalance [1] will not be able to work for these two vendors.
So, it would be wiser to use aux bus extension for SF to show the directory, like how pci bus does.
Vendor driver is just feeding as what irq is used by semi-virtual bus.
> > > > To overcome the above limitations, this short series extends the
> > > > auxiliary bus to display IRQ information in sysfs, similar to that
> > > > of PFs and VFs.
> > >
> > > Again, examples of what it looks like today, and what it will look
> > > like with this patch set is needed in order to justify why this
> > > really is needed as it seems that the information should already be there
> for you.
> >
> >
> > full example:
> > in a machine with a single child SF of the PF 00:0b.0 we currently
> > have the bellow:
> > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
> >
> > with this patch-set we will also have:
> > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> > 50 51 52 53 54 55 56 57 58
> >
> > which means that IRQs 50-58, which are shown on the PF "msi_irqs" are
> > used by the child SF.
>
> Please document this in the changelog if you wish to persist with it.
>
Sure. Its already part of the commit log of patch 2.
Will add the same example in cover letter too.
> > > > It adds an 'irqs' directory under the auxiliary device and
> > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI SF
> > > > auxiliary devices share the IRQ with other SFs, a detail that is
> > > > also not available to the users. Consequently, this <irq_num> file
> > > > indicates whether the IRQ is 'exclusive' or 'shared'. This 'irqs'
> > > > directory extenstion is optional, i.e. only for PCI SFs the sysfs irq
> information is optionally exposed.
> > >
> > > Why does userspace care about "shared" or not? What can they do
> > > with that, and why?
> >
> >
> > If IRQ is shared, userspace needs to take it into account when looking
> > at IRQ data and counters such as interrupts/sec.
> > Also, If IRQ is shared, user better not mess with the irq affinity of
> > such irq it as it can affect multiple SFs.
>
> But the logic of "shared" is at the irq level, not at the aux bus level.
> That's where that should be shown, not in a random bus subsystem,
> otherwise you would have to look across ALL busses to properly determine if
> an irq is shared or not, right?
>
Yes, that is a good point. Showing shared or exclusive irq in
/sys/irq/<irq_num>/device_name file is right think to do.
I agree with you.
With that, are you ok, if we just add the /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs used by this SF?
And let user space deal to figure out sharing or exclusive.
This will eliminate the global xarray in the patch-1 and "shared", "exclusive" strings part of all the refcount code.
> thanks,
>
> greg k-h
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-06-03 7:10 ` Parav Pandit
@ 2024-06-06 2:42 ` Parav Pandit
2024-06-10 13:01 ` Parav Pandit
0 siblings, 1 reply; 18+ messages in thread
From: Parav Pandit @ 2024-06-06 2:42 UTC (permalink / raw)
To: Greg KH, Shay Drori
Cc: netdev@vger.kernel.org, pabeni@redhat.com, davem@davemloft.net,
kuba@kernel.org, edumazet@google.com, david.m.ertman@intel.com,
rafael@kernel.org, ira.weiny@intel.com,
linux-rdma@vger.kernel.org, leon@kernel.org, Tariq Toukan
Hi Greg,
> From: Parav Pandit
> Sent: Monday, June 3, 2024 12:41 PM
>
> Hi Greg,
>
> An emergency has turned up for Shay. So, he is going to be out of office for
> few days.
> I will attempt to answer below questions and spin v6 based on your
> guidance.
>
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Thursday, May 30, 2024 8:21 PM
> >
> > On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
> > >
> > >
> > > On 28/05/2024 20:57, Greg KH wrote:
> > > > External email: Use caution opening links or attachments
> > > >
> > > >
> > > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > > > Today, PCI PFs and VFs, which are anchored on the PCI bus,
> > > > > display their IRQ information in the
> > > > > <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions
> > > > > (SFs) are similar to PFs and VFs and these SFs are anchored on
> > > > > the auxiliary bus. However, these PCI SFs lack such IRQ
> > > > > information on the auxiliary bus, leaving users without
> > > > > visibility into which IRQs are used by the SFs. This absence
> > > > > makes it impossible to debug situations and to understand the source
> of interrupts/SFs for performance tuning and debug.
> > > >
> > > > Wait, again, this feels wrong. You should be able to walk back up
> > > > the tree see the irq for the device, and vf, right? Why would the
> > > > value be different down in the aux device?
> > >
> > >
> > > you are correct, the IRQs of the aux device are subset of the IRQs
> > > of the parent device. more detailed answer bellow.
> >
> > But isn't that already shown in /proc/interrupts? Why show it here as well?
> >
> Because /proc/interrupts cannot distinguish which irqs are for which SFs.
> More below.
>
> > > > Does the msi irq somehow not actually show anywhere for the real
> > > > pci
> > device in sysfs at all today?
> > > >
> > > > What does sysfs look like today exactly for this information?
> > >
> > >
> > > The IRQ information of all the children SFs of a PF is found in the
> > > PF device as one single list.
> > > There is no sane way to distinguish which IRQ is used by which SFs.
> > > For example, in a machine with a single child SF of the PF 00:0b.0
> > > we currently have the bellow:
> > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
> > > 58
> > >
> > > the above are IRQs of both the PF and its child SF. from here we
> > > cannot tell which IRQ is used by the child SF.
> >
> > Does that matter?
> Yes it matters. A user needs to see if a given IRQ is for SF or a PF and if it is for
> SF, is it for which SF.
> It needs to know this to set/modify the irq affinity via
> /proc/irq/<irq_num>/affinity_files.
>
> > Seriously, what is userspace going to do with that that it can not
> > already determine from /proc/interrupts/ or /proc/irq/?
> >
> User space cannot determine if a given IRQ is for SF or not from
> /proc/interrupts and /proc/irq either.
> Let me explain with example.
> It cannot figure out for two reasons.
>
> 1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below.
> mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0.
> Format is: "purpose, event q number @ device_name".
> Purpose = mlx5_compl (completions)
> Event q number = 62
> Device = pci:0000:06.00.0.
>
> Can we do same for SFs?
> For example, as,
> mlx5_compl62@auxiliary:mlx5_core.sf.2
>
> Will it work when SFs share the IRQ? No. it won't be accurate name when a
> vector is shared among many SFs.
> Also updating IRQ names dynamically to arbitrary long name won't be any
> good either.
>
> 2. mlx5 do not use IRQF_SHARED.
> This is the main reason due to which, /proc/irq/<irq_num>/devices_list is
> missing SFs auxiliary device entries.
>
> We cannot share the NIC interrupt with any other arbitrary system device
> because nic SF needs to have reasonably predictable latency.
>
> Shared interrupts also complicate the mlx5 driver flow even further as it can
> fire right away.
> And we want to avoid the already complex path further, with device recovery
> flow in this area.
>
> Secondly, a while back also when team played with IRQF_SHARED there was
> 8% or so perf drop too.
> So mlx5 uses IRQ sharing among its own interrupt generators.
> This is uniformly across PF, VF, SFs.
> It also has certain good thresholds where it avoids sharing the IRQ and
> allocates new dynamic one.
> Thanks to Thomas's work.
>
> It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ subsystem
> can offer APIs to restrict IRQs sharing to certain devices.
> Sounds too complicated for the sake of showing SFs affinity.
>
> > I know it's /proc and not /sys but duplicating information might not
> > be needed as I think you already have this information.
>
> Like I explained above, we don’t have that info. I think IRQF_SHARED
> probably was missing piece to you previously.
> PCI duplicates this info and widely used by irqbalance [1] which reads from
> /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>.
>
> The useful part of irq files under sys is, a use space can directly see the irq
> files under the desired device directly.
> It would be extended for auxiliary bus SFs.
>
> [1] https://github.com/Irqbalance/irqbalance
>
> > If not, what is missing in /proc/irq today for this?
> >
> If IRQF_SHARED is used, nothing is missing. But we don’t find a good way to
> use it.
>
> > > > And what about /proc/irq/ and /proc/interrupts/ doesn't that show
> > > > you the needed information? Why are aux devices somehow "special"
> > here?
> > >
> > >
> > > /proc/interrupts interrupt name is some random driver choice.
> >
> > That's the same name you use here.
> >
> Yes, like I explained above. We can come up with some name for
> /proc/interrupts.
> But it won't tell which all SFs are sharing it, for example 64 SFs sharing a
> vector won't be visible in some giant irq name in /proc/interrupts.
>
> So, to summarize, there are three places.
>
> 1. /proc/interrupts.
> Limitations:
> a. One line per interrupt does not fit the case.
> b. interrupt name also not enough to show sharing SFs.
>
> 2. /proc/irq/<irq_number>/{device_list}
> Limited only to IRQF_SHARED users.
>
> 3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name>
> Pros:
> a. Matches with pci PF and VFs
> b. will be usable by irqbalance in similar way as PCI bus c. works regardless of
> IRQF_SHARED used or not used.
> d. Does not require complicated IRQ subsystem changes to limit
> IRQF_SHARING to only a specific device type.
>
> > > There is
> > > no sane naming convention and some small few bytes of upper limit of
> > > what the IRQ name.
> >
> > That's up to the driver to name, so wouldn't it be the same here?
> >
> > > > > Additionally, the SFs are multifunctional devices supporting
> > > > > RDMA, network devices, clocks, and more, similar to their peer
> > > > > PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ
> > > > > information available at the bus/device level.
> > > >
> > > > But it should be as part of the pci device, as that's where that
> > > > information lives and is "bound" to, not the aux device on its own.
> > >
> > >
> > > Auxiliary bus level SF device is using that IRQ too. So it is
> > > additionally shown at auxiliary device level too.
> >
> > Your aux bus devices are using that, it's not generic to ALL aux bus devices.
> > Along those lines, why not just put this info in the aux bus devices
> > that your driver is bound to?
> >
> mlx5 driver uses the generic aux bus for SFs.
> mlx5 driver can create some sysfs files directly into this SF device, however
> we strongly want to avoid doing such vendor specific files based on our
> experience in netdev and rdma subsystems.
>
> There are two main reasons to not put in mlx5 driver code.
>
> 1. In past several times it happened that a drivers tend to abuse this sysfs
> interface to put any sort of debug information or other things and it becomes
> hard to get rid of.
> An example of this is fw_pages in
> /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹ I cleaned a lot in rdma
> now to avoid such abuse.
> Want to avoid the same for the auxiliary bus too.
>
> 2. Intel sf driver is just getting started up. If mlx5 driver shows "irqs", and intel
> driver will show this as "foo_irqs", the user will not have same experience
> across two drivers.
> For example, irqbalance [1] will not be able to work for these two vendors.
>
> So, it would be wiser to use aux bus extension for SF to show the directory,
> like how pci bus does.
> Vendor driver is just feeding as what irq is used by semi-virtual bus.
>
> > > > > To overcome the above limitations, this short series extends the
> > > > > auxiliary bus to display IRQ information in sysfs, similar to
> > > > > that of PFs and VFs.
> > > >
> > > > Again, examples of what it looks like today, and what it will look
> > > > like with this patch set is needed in order to justify why this
> > > > really is needed as it seems that the information should already
> > > > be there
> > for you.
> > >
> > >
> > > full example:
> > > in a machine with a single child SF of the PF 00:0b.0 we currently
> > > have the bellow:
> > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
> > > 58
> > >
> > > with this patch-set we will also have:
> > > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> > > 50 51 52 53 54 55 56 57 58
> > >
> > > which means that IRQs 50-58, which are shown on the PF "msi_irqs"
> > > are used by the child SF.
> >
> > Please document this in the changelog if you wish to persist with it.
> >
> Sure. Its already part of the commit log of patch 2.
> Will add the same example in cover letter too.
>
> > > > > It adds an 'irqs' directory under the auxiliary device and
> > > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI
> > > > > SF auxiliary devices share the IRQ with other SFs, a detail that
> > > > > is also not available to the users. Consequently, this <irq_num>
> > > > > file indicates whether the IRQ is 'exclusive' or 'shared'. This 'irqs'
> > > > > directory extenstion is optional, i.e. only for PCI SFs the
> > > > > sysfs irq
> > information is optionally exposed.
> > > >
> > > > Why does userspace care about "shared" or not? What can they do
> > > > with that, and why?
> > >
> > >
> > > If IRQ is shared, userspace needs to take it into account when
> > > looking at IRQ data and counters such as interrupts/sec.
> > > Also, If IRQ is shared, user better not mess with the irq affinity
> > > of such irq it as it can affect multiple SFs.
> >
> > But the logic of "shared" is at the irq level, not at the aux bus level.
> > That's where that should be shown, not in a random bus subsystem,
> > otherwise you would have to look across ALL busses to properly
> > determine if an irq is shared or not, right?
> >
> Yes, that is a good point. Showing shared or exclusive irq in
> /sys/irq/<irq_num>/device_name file is right think to do.
> I agree with you.
>
> With that, are you ok, if we just add the
> /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs used
> by this SF?
> And let user space deal to figure out sharing or exclusive.
>
> This will eliminate the global xarray in the patch-1 and "shared", "exclusive"
> strings part of all the refcount code.
>
Did you get a chance to review above?
Are you ok to add the sysfs files by the SF auxiliary bus driver without the "shared"/"exclusive" complexity?
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs
2024-06-06 2:42 ` Parav Pandit
@ 2024-06-10 13:01 ` Parav Pandit
0 siblings, 0 replies; 18+ messages in thread
From: Parav Pandit @ 2024-06-10 13:01 UTC (permalink / raw)
To: Greg KH, Shay Drori
Cc: netdev@vger.kernel.org, pabeni@redhat.com, davem@davemloft.net,
kuba@kernel.org, edumazet@google.com, david.m.ertman@intel.com,
rafael@kernel.org, ira.weiny@intel.com,
linux-rdma@vger.kernel.org, leon@kernel.org, Tariq Toukan
Hi Greg,
> From: Parav Pandit
> Sent: Thursday, June 6, 2024 8:13 AM
>
> Hi Greg,
>
>
> > From: Parav Pandit
> > Sent: Monday, June 3, 2024 12:41 PM
> >
> > Hi Greg,
> >
> > An emergency has turned up for Shay. So, he is going to be out of
> > office for few days.
> > I will attempt to answer below questions and spin v6 based on your
> > guidance.
> >
> > > From: Greg KH <gregkh@linuxfoundation.org>
> > > Sent: Thursday, May 30, 2024 8:21 PM
> > >
> > > On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
> > > >
> > > >
> > > > On 28/05/2024 20:57, Greg KH wrote:
> > > > > External email: Use caution opening links or attachments
> > > > >
> > > > >
> > > > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > > > > Today, PCI PFs and VFs, which are anchored on the PCI bus,
> > > > > > display their IRQ information in the
> > > > > > <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions
> > > > > > (SFs) are similar to PFs and VFs and these SFs are anchored on
> > > > > > the auxiliary bus. However, these PCI SFs lack such IRQ
> > > > > > information on the auxiliary bus, leaving users without
> > > > > > visibility into which IRQs are used by the SFs. This absence
> > > > > > makes it impossible to debug situations and to understand the
> > > > > > source
> > of interrupts/SFs for performance tuning and debug.
> > > > >
> > > > > Wait, again, this feels wrong. You should be able to walk back
> > > > > up the tree see the irq for the device, and vf, right? Why
> > > > > would the value be different down in the aux device?
> > > >
> > > >
> > > > you are correct, the IRQs of the aux device are subset of the IRQs
> > > > of the parent device. more detailed answer bellow.
> > >
> > > But isn't that already shown in /proc/interrupts? Why show it here as
> well?
> > >
> > Because /proc/interrupts cannot distinguish which irqs are for which SFs.
> > More below.
> >
> > > > > Does the msi irq somehow not actually show anywhere for the real
> > > > > pci
> > > device in sysfs at all today?
> > > > >
> > > > > What does sysfs look like today exactly for this information?
> > > >
> > > >
> > > > The IRQ information of all the children SFs of a PF is found in
> > > > the PF device as one single list.
> > > > There is no sane way to distinguish which IRQ is used by which SFs.
> > > > For example, in a machine with a single child SF of the PF 00:0b.0
> > > > we currently have the bellow:
> > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
> > > > 58
> > > >
> > > > the above are IRQs of both the PF and its child SF. from here we
> > > > cannot tell which IRQ is used by the child SF.
> > >
> > > Does that matter?
> > Yes it matters. A user needs to see if a given IRQ is for SF or a PF
> > and if it is for SF, is it for which SF.
> > It needs to know this to set/modify the irq affinity via
> > /proc/irq/<irq_num>/affinity_files.
> >
> > > Seriously, what is userspace going to do with that that it can not
> > > already determine from /proc/interrupts/ or /proc/irq/?
> > >
> > User space cannot determine if a given IRQ is for SF or not from
> > /proc/interrupts and /proc/irq either.
> > Let me explain with example.
> > It cannot figure out for two reasons.
> >
> > 1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below.
> > mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0.
> > Format is: "purpose, event q number @ device_name".
> > Purpose = mlx5_compl (completions)
> > Event q number = 62
> > Device = pci:0000:06.00.0.
> >
> > Can we do same for SFs?
> > For example, as,
> > mlx5_compl62@auxiliary:mlx5_core.sf.2
> >
> > Will it work when SFs share the IRQ? No. it won't be accurate name
> > when a vector is shared among many SFs.
> > Also updating IRQ names dynamically to arbitrary long name won't be
> > any good either.
> >
> > 2. mlx5 do not use IRQF_SHARED.
> > This is the main reason due to which, /proc/irq/<irq_num>/devices_list
> > is missing SFs auxiliary device entries.
> >
> > We cannot share the NIC interrupt with any other arbitrary system
> > device because nic SF needs to have reasonably predictable latency.
> >
> > Shared interrupts also complicate the mlx5 driver flow even further as
> > it can fire right away.
> > And we want to avoid the already complex path further, with device
> > recovery flow in this area.
> >
> > Secondly, a while back also when team played with IRQF_SHARED there
> > was 8% or so perf drop too.
> > So mlx5 uses IRQ sharing among its own interrupt generators.
> > This is uniformly across PF, VF, SFs.
> > It also has certain good thresholds where it avoids sharing the IRQ
> > and allocates new dynamic one.
> > Thanks to Thomas's work.
> >
> > It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ
> > subsystem can offer APIs to restrict IRQs sharing to certain devices.
> > Sounds too complicated for the sake of showing SFs affinity.
> >
> > > I know it's /proc and not /sys but duplicating information might not
> > > be needed as I think you already have this information.
> >
> > Like I explained above, we don’t have that info. I think IRQF_SHARED
> > probably was missing piece to you previously.
> > PCI duplicates this info and widely used by irqbalance [1] which reads
> > from /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>.
> >
> > The useful part of irq files under sys is, a use space can directly
> > see the irq files under the desired device directly.
> > It would be extended for auxiliary bus SFs.
> >
> > [1] https://github.com/Irqbalance/irqbalance
> >
> > > If not, what is missing in /proc/irq today for this?
> > >
> > If IRQF_SHARED is used, nothing is missing. But we don’t find a good
> > way to use it.
> >
> > > > > And what about /proc/irq/ and /proc/interrupts/ doesn't that
> > > > > show you the needed information? Why are aux devices somehow
> "special"
> > > here?
> > > >
> > > >
> > > > /proc/interrupts interrupt name is some random driver choice.
> > >
> > > That's the same name you use here.
> > >
> > Yes, like I explained above. We can come up with some name for
> > /proc/interrupts.
> > But it won't tell which all SFs are sharing it, for example 64 SFs
> > sharing a vector won't be visible in some giant irq name in /proc/interrupts.
> >
> > So, to summarize, there are three places.
> >
> > 1. /proc/interrupts.
> > Limitations:
> > a. One line per interrupt does not fit the case.
> > b. interrupt name also not enough to show sharing SFs.
> >
> > 2. /proc/irq/<irq_number>/{device_list}
> > Limited only to IRQF_SHARED users.
> >
> > 3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name>
> > Pros:
> > a. Matches with pci PF and VFs
> > b. will be usable by irqbalance in similar way as PCI bus c. works
> > regardless of IRQF_SHARED used or not used.
> > d. Does not require complicated IRQ subsystem changes to limit
> > IRQF_SHARING to only a specific device type.
> >
> > > > There is
> > > > no sane naming convention and some small few bytes of upper limit
> > > > of what the IRQ name.
> > >
> > > That's up to the driver to name, so wouldn't it be the same here?
> > >
> > > > > > Additionally, the SFs are multifunctional devices supporting
> > > > > > RDMA, network devices, clocks, and more, similar to their peer
> > > > > > PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ
> > > > > > information available at the bus/device level.
> > > > >
> > > > > But it should be as part of the pci device, as that's where that
> > > > > information lives and is "bound" to, not the aux device on its own.
> > > >
> > > >
> > > > Auxiliary bus level SF device is using that IRQ too. So it is
> > > > additionally shown at auxiliary device level too.
> > >
> > > Your aux bus devices are using that, it's not generic to ALL aux bus devices.
> > > Along those lines, why not just put this info in the aux bus devices
> > > that your driver is bound to?
> > >
> > mlx5 driver uses the generic aux bus for SFs.
> > mlx5 driver can create some sysfs files directly into this SF device,
> > however we strongly want to avoid doing such vendor specific files
> > based on our experience in netdev and rdma subsystems.
> >
> > There are two main reasons to not put in mlx5 driver code.
> >
> > 1. In past several times it happened that a drivers tend to abuse this
> > sysfs interface to put any sort of debug information or other things
> > and it becomes hard to get rid of.
> > An example of this is fw_pages in
> > /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹ I cleaned a lot
> > in rdma now to avoid such abuse.
> > Want to avoid the same for the auxiliary bus too.
> >
> > 2. Intel sf driver is just getting started up. If mlx5 driver shows
> > "irqs", and intel driver will show this as "foo_irqs", the user will
> > not have same experience across two drivers.
> > For example, irqbalance [1] will not be able to work for these two vendors.
> >
> > So, it would be wiser to use aux bus extension for SF to show the
> > directory, like how pci bus does.
> > Vendor driver is just feeding as what irq is used by semi-virtual bus.
> >
> > > > > > To overcome the above limitations, this short series extends
> > > > > > the auxiliary bus to display IRQ information in sysfs, similar
> > > > > > to that of PFs and VFs.
> > > > >
> > > > > Again, examples of what it looks like today, and what it will
> > > > > look like with this patch set is needed in order to justify why
> > > > > this really is needed as it seems that the information should
> > > > > already be there
> > > for you.
> > > >
> > > >
> > > > full example:
> > > > in a machine with a single child SF of the PF 00:0b.0 we currently
> > > > have the bellow:
> > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
> > > > 58
> > > >
> > > > with this patch-set we will also have:
> > > > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> > > > 50 51 52 53 54 55 56 57 58
> > > >
> > > > which means that IRQs 50-58, which are shown on the PF "msi_irqs"
> > > > are used by the child SF.
> > >
> > > Please document this in the changelog if you wish to persist with it.
> > >
> > Sure. Its already part of the commit log of patch 2.
> > Will add the same example in cover letter too.
> >
> > > > > > It adds an 'irqs' directory under the auxiliary device and
> > > > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI
> > > > > > SF auxiliary devices share the IRQ with other SFs, a detail
> > > > > > that is also not available to the users. Consequently, this
> > > > > > <irq_num> file indicates whether the IRQ is 'exclusive' or 'shared'. This
> 'irqs'
> > > > > > directory extenstion is optional, i.e. only for PCI SFs the
> > > > > > sysfs irq
> > > information is optionally exposed.
> > > > >
> > > > > Why does userspace care about "shared" or not? What can they do
> > > > > with that, and why?
> > > >
> > > >
> > > > If IRQ is shared, userspace needs to take it into account when
> > > > looking at IRQ data and counters such as interrupts/sec.
> > > > Also, If IRQ is shared, user better not mess with the irq affinity
> > > > of such irq it as it can affect multiple SFs.
> > >
> > > But the logic of "shared" is at the irq level, not at the aux bus level.
> > > That's where that should be shown, not in a random bus subsystem,
> > > otherwise you would have to look across ALL busses to properly
> > > determine if an irq is shared or not, right?
> > >
> > Yes, that is a good point. Showing shared or exclusive irq in
> > /sys/irq/<irq_num>/device_name file is right think to do.
> > I agree with you.
> >
>
> > With that, are you ok, if we just add the
> > /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs
> > used by this SF?
> > And let user space deal to figure out sharing or exclusive.
> >
> > This will eliminate the global xarray in the patch-1 and "shared", "exclusive"
> > strings part of all the refcount code.
> >
>
> Did you get a chance to review above?
> Are you ok to add the sysfs files by the SF auxiliary bus driver without the
> "shared"/"exclusive" complexity?
Can you please suggest? Are you ok with this approach?
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-06-10 13:01 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-28 9:11 [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Shay Drory
2024-05-28 9:11 ` [PATCH net-next v5 1/2] driver core: auxiliary bus: show auxiliary device IRQs Shay Drory
2024-05-28 14:43 ` Przemek Kitszel
2024-05-29 6:58 ` Shay Drori
2024-05-28 18:00 ` Greg KH
2024-05-29 6:29 ` Shay Drori
2024-05-30 14:46 ` Greg KH
2024-05-28 9:11 ` [PATCH net-next v5 2/2] net/mlx5: Expose SFs IRQs Shay Drory
2024-05-28 14:48 ` Przemek Kitszel
2024-05-28 14:51 ` Parav Pandit
2024-05-29 11:13 ` Shay Drori
2024-05-28 17:57 ` [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs Greg KH
2024-05-29 6:19 ` Shay Drori
2024-05-30 14:13 ` Shay Drori
2024-05-30 14:50 ` Greg KH
2024-06-03 7:10 ` Parav Pandit
2024-06-06 2:42 ` Parav Pandit
2024-06-10 13:01 ` Parav Pandit
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).