* [PATCH net] mlxsw: core_thermal: Fix fan speed in maximum cooling state
@ 2023-03-17 15:32 Petr Machata
2023-03-19 15:30 ` patchwork-bot+netdevbpf
0 siblings, 1 reply; 2+ messages in thread
From: Petr Machata @ 2023-03-17 15:32 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
netdev
Cc: Ido Schimmel, Petr Machata, Jiri Pirko, Vadim Pasternak, mlxsw,
Vadim Pasternak
From: Ido Schimmel <idosch@nvidia.com>
The cooling levels array is supposed to prevent the system fans from
being configured below a 20% duty cycle as otherwise some of them get
stuck at 0 RPM.
Due to an off-by-one error, the last element in the array was not
initialized, causing it to be set to zero, which in turn lead to fans
being configured with a 0% duty cycle in maximum cooling state.
Since commit 332fdf951df8 ("mlxsw: thermal: Fix out-of-bounds memory
accesses") the contents of the array are static. Therefore, instead of
fixing the initialization of the array, simply remove it and adjust
thermal_cooling_device_ops::set_cur_state() so that the configured duty
cycle is never set below 20%.
Before:
# cat /sys/class/thermal/thermal_zone0/cdev0/type
mlxsw_fan
# echo 10 > /sys/class/thermal/thermal_zone0/cdev0/cur_state
# cat /sys/class/hwmon/hwmon0/name
mlxsw
# cat /sys/class/hwmon/hwmon0/pwm1
0
After:
# cat /sys/class/thermal/thermal_zone0/cdev0/type
mlxsw_fan
# echo 10 > /sys/class/thermal/thermal_zone0/cdev0/cur_state
# cat /sys/class/hwmon/hwmon0/name
mlxsw
# cat /sys/class/hwmon/hwmon0/pwm1
255
This bug was uncovered when the thermal subsystem repeatedly tried to
configure the cooling devices to their maximum state due to another
issue [1]. This resulted in the fans being stuck at 0 RPM, which
eventually lead to the system undergoing thermal shutdown.
[1] https://lore.kernel.org/netdev/ZA3CFNhU4AbtsP4G@shredder/
Fixes: a421ce088ac8 ("mlxsw: core: Extend cooling device with cooling levels")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index c5240d38c9db..09ed6e5fa6c3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -105,7 +105,6 @@ struct mlxsw_thermal {
struct thermal_zone_device *tzdev;
int polling_delay;
struct thermal_cooling_device *cdevs[MLXSW_MFCR_PWMS_MAX];
- u8 cooling_levels[MLXSW_THERMAL_MAX_STATE + 1];
struct thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
struct mlxsw_cooling_states cooling_states[MLXSW_THERMAL_NUM_TRIPS];
struct mlxsw_thermal_area line_cards[];
@@ -468,7 +467,7 @@ static int mlxsw_thermal_set_cur_state(struct thermal_cooling_device *cdev,
return idx;
/* Normalize the state to the valid speed range. */
- state = thermal->cooling_levels[state];
+ state = max_t(unsigned long, MLXSW_THERMAL_MIN_STATE, state);
mlxsw_reg_mfsc_pack(mfsc_pl, idx, mlxsw_state_to_duty(state));
err = mlxsw_reg_write(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
if (err) {
@@ -859,10 +858,6 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
}
}
- /* Initialize cooling levels per PWM state. */
- for (i = 0; i < MLXSW_THERMAL_MAX_STATE; i++)
- thermal->cooling_levels[i] = max(MLXSW_THERMAL_MIN_STATE, i);
-
thermal->polling_delay = bus_info->low_frequency ?
MLXSW_THERMAL_SLOW_POLL_INT :
MLXSW_THERMAL_POLL_INT;
--
2.39.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH net] mlxsw: core_thermal: Fix fan speed in maximum cooling state
2023-03-17 15:32 [PATCH net] mlxsw: core_thermal: Fix fan speed in maximum cooling state Petr Machata
@ 2023-03-19 15:30 ` patchwork-bot+netdevbpf
0 siblings, 0 replies; 2+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-03-19 15:30 UTC (permalink / raw)
To: Petr Machata
Cc: davem, edumazet, kuba, pabeni, netdev, idosch, jiri, vadimp,
mlxsw, vadimp
Hello:
This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:
On Fri, 17 Mar 2023 16:32:59 +0100 you wrote:
> From: Ido Schimmel <idosch@nvidia.com>
>
> The cooling levels array is supposed to prevent the system fans from
> being configured below a 20% duty cycle as otherwise some of them get
> stuck at 0 RPM.
>
> Due to an off-by-one error, the last element in the array was not
> initialized, causing it to be set to zero, which in turn lead to fans
> being configured with a 0% duty cycle in maximum cooling state.
>
> [...]
Here is the summary with links:
- [net] mlxsw: core_thermal: Fix fan speed in maximum cooling state
https://git.kernel.org/netdev/net/c/6d206b1ea9f4
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2023-03-19 15:30 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-17 15:32 [PATCH net] mlxsw: core_thermal: Fix fan speed in maximum cooling state Petr Machata
2023-03-19 15:30 ` patchwork-bot+netdevbpf
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.