* [PATCH v2] net/mlx5: don't printk garbage when transceiver overheats
@ 2026-05-12 7:32 Will Mortensen
2026-05-14 11:32 ` Leon Romanovsky
0 siblings, 1 reply; 2+ messages in thread
From: Will Mortensen @ 2026-05-12 7:32 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, netdev
Cc: Shahar Shitrit, Carolina Jubran, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Will Mortensen,
Jeremy Royal
When the mlx5 driver processes a temperature warning event, in events.c
and hwmon.c, temp_warn() calls print_sensor_names_in_bit_set(), which
calls hwmon_get_sensor_name() to get the NUL-terminated name of the
relevant sensor, and then prints it to dmesg. In particular,
print_sensor_names_in_bit_set() passes the bit index ("sensor index")
within the 128-bit vector in the warning event to
hwmon_get_sensor_name(). But hwmon_get_sensor_name() was expecting the
index of the hwmon channel, and the driver registers hwmon channels for
at most only two sensors: the ASIC sensor (sensor index 0) and the
module sensor (sensor index 64 or 65 if we're on a 2-port NIC). So when
the warning event concerned a module, hwmon_get_sensor_name() took the
64th or 65th element of the likely 2-element temp_channel_desc array and
thus returned a pointer to some other kernel memory past the end of it,
which was printed to dmesg up to the first NUL byte.
A further difficulty is that, at least in testing on our CX-8 C8240 with
firmware 40.47.1088, the warning event can have bits set for other
modules, e.g. if this PCI physical function is associated with
port/module 0, we might expect bit 64 to be set, but bit 65 (for port/
module 1) can also be set.
Fix this by clarifying that the argument to hwmon_get_sensor_name() is
the raw sensor index, and correctly converting it to the hwmon channel
index. Return NULL if the sensor index doesn't correspond to a hwmon
channel (e.g. because it's for the other port's module).
Fixes: 46fd50cfcc12 ("net/mlx5: Add sensor name to temperature event message")
Signed-off-by: Will Mortensen <will@extrahop.com>
Reviewed-by: Jeremy Royal <jeremyr@extrahop.com>
---
| 2 ++
| 15 ++++++++++++++-
| 2 +-
3 files changed, 17 insertions(+), 2 deletions(-)
--git a/drivers/net/ethernet/mellanox/mlx5/core/events.c b/drivers/net/ethernet/mellanox/mlx5/core/events.c
index 4d7f35b96876..9372551c7f90 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/events.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/events.c
@@ -165,6 +165,8 @@ static void print_sensor_names_in_bit_set(struct mlx5_core_dev *dev, struct mlx5
for_each_set_bit(i, bit_set_ptr, num_bits) {
const char *sensor_name = hwmon_get_sensor_name(hwmon, i + bit_set_offset);
+ if (!sensor_name)
+ continue;
mlx5_core_warn(dev, "Sensor name[%d]: %s\n", i + bit_set_offset, sensor_name);
}
}
--git a/drivers/net/ethernet/mellanox/mlx5/core/hwmon.c b/drivers/net/ethernet/mellanox/mlx5/core/hwmon.c
index afcdebac9c4f..747ff30362f1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/hwmon.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/hwmon.c
@@ -417,7 +417,20 @@ void mlx5_hwmon_dev_unregister(struct mlx5_core_dev *mdev)
mdev->hwmon = NULL;
}
-const char *hwmon_get_sensor_name(struct mlx5_hwmon *hwmon, int channel)
+const char *hwmon_get_sensor_name(struct mlx5_hwmon *hwmon, int sensor_idx)
{
+ int channel;
+
+ if (sensor_idx >= 64) {
+ if (hwmon->module_scount == 0)
+ return NULL;
+ channel = hwmon->asic_platform_scount;
+ if (sensor_idx != hwmon->temp_channel_desc[channel].sensor_index)
+ return NULL;
+ } else {
+ if (sensor_idx >= hwmon->asic_platform_scount)
+ return NULL;
+ channel = sensor_idx;
+ }
return hwmon->temp_channel_desc[channel].sensor_name;
}
--git a/drivers/net/ethernet/mellanox/mlx5/core/hwmon.h b/drivers/net/ethernet/mellanox/mlx5/core/hwmon.h
index f38271c22c10..b2476bf54ce5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/hwmon.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/hwmon.h
@@ -10,7 +10,7 @@
int mlx5_hwmon_dev_register(struct mlx5_core_dev *mdev);
void mlx5_hwmon_dev_unregister(struct mlx5_core_dev *mdev);
-const char *hwmon_get_sensor_name(struct mlx5_hwmon *hwmon, int channel);
+const char *hwmon_get_sensor_name(struct mlx5_hwmon *hwmon, int sensor_idx);
#else
static inline int mlx5_hwmon_dev_register(struct mlx5_core_dev *mdev)
---
base-commit: 70390501d1944d4e5b8f7352be180fceb3a44132
change-id: 20260508-b4-mlx5-sensor-fix-043d4a22f641
Best regards,
--
Will Mortensen <will@extrahop.com>
^ permalink raw reply related [flat|nested] 2+ messages in thread* Re: [PATCH v2] net/mlx5: don't printk garbage when transceiver overheats
2026-05-12 7:32 [PATCH v2] net/mlx5: don't printk garbage when transceiver overheats Will Mortensen
@ 2026-05-14 11:32 ` Leon Romanovsky
0 siblings, 0 replies; 2+ messages in thread
From: Leon Romanovsky @ 2026-05-14 11:32 UTC (permalink / raw)
To: Will Mortensen
Cc: Saeed Mahameed, Tariq Toukan, Mark Bloch, netdev, Shahar Shitrit,
Carolina Jubran, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jeremy Royal
On Tue, May 12, 2026 at 12:32:38AM -0700, Will Mortensen wrote:
> When the mlx5 driver processes a temperature warning event, in events.c
> and hwmon.c, temp_warn() calls print_sensor_names_in_bit_set(), which
> calls hwmon_get_sensor_name() to get the NUL-terminated name of the
> relevant sensor, and then prints it to dmesg. In particular,
> print_sensor_names_in_bit_set() passes the bit index ("sensor index")
> within the 128-bit vector in the warning event to
> hwmon_get_sensor_name(). But hwmon_get_sensor_name() was expecting the
> index of the hwmon channel, and the driver registers hwmon channels for
> at most only two sensors: the ASIC sensor (sensor index 0) and the
> module sensor (sensor index 64 or 65 if we're on a 2-port NIC). So when
> the warning event concerned a module, hwmon_get_sensor_name() took the
> 64th or 65th element of the likely 2-element temp_channel_desc array and
> thus returned a pointer to some other kernel memory past the end of it,
> which was printed to dmesg up to the first NUL byte.
>
> A further difficulty is that, at least in testing on our CX-8 C8240 with
> firmware 40.47.1088, the warning event can have bits set for other
> modules, e.g. if this PCI physical function is associated with
> port/module 0, we might expect bit 64 to be set, but bit 65 (for port/
> module 1) can also be set.
>
> Fix this by clarifying that the argument to hwmon_get_sensor_name() is
> the raw sensor index, and correctly converting it to the hwmon channel
> index. Return NULL if the sensor index doesn't correspond to a hwmon
> channel (e.g. because it's for the other port's module).
>
> Fixes: 46fd50cfcc12 ("net/mlx5: Add sensor name to temperature event message")
> Signed-off-by: Will Mortensen <will@extrahop.com>
Your Signed-off-by needs to be last.
> Reviewed-by: Jeremy Royal <jeremyr@extrahop.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/events.c | 2 ++
> drivers/net/ethernet/mellanox/mlx5/core/hwmon.c | 15 ++++++++++++++-
> drivers/net/ethernet/mellanox/mlx5/core/hwmon.h | 2 +-
> 3 files changed, 17 insertions(+), 2 deletions(-)
I would take a simpler approach by removing this extra complexity and
using a static temp_channel_desc[64] array, avoiding any dynamic
allocation. But this change is acceptable as well.
Thanks
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2026-05-14 11:33 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-12 7:32 [PATCH v2] net/mlx5: don't printk garbage when transceiver overheats Will Mortensen
2026-05-14 11:32 ` Leon Romanovsky
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox