public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Subject: thermal: Fix potential race condition in suspend/resume
@ 2023-09-16 11:33 Bo Ye
  2023-10-12  7:35 ` [PATCH] " Bo Ye (叶波)
  2023-10-12 15:39 ` [PATCH] Subject: thermal: Fix potential " Daniel Lezcano
  0 siblings, 2 replies; 11+ messages in thread
From: Bo Ye @ 2023-09-16 11:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Daniel Lezcano, Amit Kucheria, Zhang Rui,
	Matthias Brugger, AngeloGioacchino Del Regno
  Cc: yugang.wang, yongdong.zhang, browse.zhang, Bo Ye, linux-pm,
	linux-kernel, linux-arm-kernel, linux-mediatek

From: "yugang.wang" <yugang.wang@mediatek.com>

Body:
This patch fixes a race condition during system resume. It occurs if
the system is exiting a suspend state and a user is trying to
register/unregister a thermal zone concurrently. The root cause is
that both actions access the `thermal_tz_list`.

In detail:

1. At PM_POST_SUSPEND during the resume, the system reads all thermal
   zones in `thermal_tz_list`, then resets and updates their
   temperatures.
2. When registering/unregistering a thermal zone, the
   `thermal_tz_list` gets manipulated.

These two actions might occur concurrently, causing a race condition.
To solve this issue, we introduce a mutex lock to protect
`thermal_tz_list` from being modified while it's being read and
updated during the resume from suspend.

Kernel oops excerpt related to this fix:

[ 5201.869845] [T316822] pc: [0xffffffeb7d4876f0] mutex_lock+0x34/0x170
[ 5201.869856] [T316822] lr: [0xffffffeb7ca98a84] thermal_pm_notify+0xd4/0x26c
[... cut for brevity ...]
[ 5201.871061] [T316822]  suspend_prepare+0x150/0x470
[ 5201.871067] [T316822]  enter_state+0x84/0x6f4
[ 5201.871076] [T316822]  state_store+0x15c/0x1e8

Change-Id: Ifdbdecba17093f91eab7e36ce04b46d311ca6568
Signed-off-by: yugang.wang <yugang.wang@mediatek.com>
Signed-off-by: Bo Ye <bo.ye@mediatek.com>
---
 drivers/thermal/thermal_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 8717a3343512..a7a18ed57b6d 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1529,12 +1529,14 @@ static int thermal_pm_notify(struct notifier_block *nb,
 	case PM_POST_HIBERNATION:
 	case PM_POST_RESTORE:
 	case PM_POST_SUSPEND:
+		mutex_lock(&thermal_list_lock);
 		atomic_set(&in_suspend, 0);
 		list_for_each_entry(tz, &thermal_tz_list, node) {
 			thermal_zone_device_init(tz);
 			thermal_zone_device_update(tz,
 						   THERMAL_EVENT_UNSPECIFIED);
 		}
+		mutex_unlock(&thermal_list_lock);
 		break;
 	default:
 		break;
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread
* [PATCH] thermal: fix race condition in suspend/resume
@ 2023-12-18 16:23 Bo Ye
  2023-12-18 16:30 ` Rafael J. Wysocki
  0 siblings, 1 reply; 11+ messages in thread
From: Bo Ye @ 2023-12-18 16:23 UTC (permalink / raw)
  To: Rafael J. Wysocki, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Matthias Brugger, AngeloGioacchino Del Regno
  Cc: yongdong.zhang, yugang.wang, Bo Ye, linux-pm, linux-kernel,
	linux-arm-kernel, linux-mediatek

From: "yugang.wang" <yugang.wang@mediatek.com>

  Firstly, it needs to be clarified that this issue occurs in a real-
world environment. By analyzing the logs, we inferred that the
issue occurred just as the system was entering suspend mode, and the user
was switching the thermal policy (this action causes all thermal zones
to unregister/register). In addition, we conducted degradation tests
and also reproduced this issue. The specific method is to first switch
the thermal policy through a command, and then immediately put the
system into suspend state through another command. This method can also
reproduce the issue.

Body:
  This patch fixes a race condition during system resume. It occurs
  if the system is exiting a suspend state and a user is trying to
  register/unregister a thermal zone concurrently. The root cause is
  that both actions access the `thermal_tz_list`.

In detail:

  1. At PM_POST_SUSPEND during the resume, the system reads all
  thermal
     zones in `thermal_tz_list`, then resets and updates their
     temperatures.
  2. When registering/unregistering a thermal zone, the
     `thermal_tz_list` gets manipulated.

  These two actions might occur concurrently, causing a race condition.
  To solve this issue, we introduce a mutex lock to protect
  `thermal_tz_list` from being modified while it's being read and
  updated during the resume from suspend.

  Kernel oops excerpt related to this fix:

  [ 5201.869845] [T316822] pc: [0xffffffeb7d4876f0]
  mutex_lock+0x34/0x170
  [ 5201.869856] [T316822] lr: [0xffffffeb7ca98a84]
  thermal_pm_notify+0xd4/0x26c
  [... cut for brevity ...]
  [ 5201.871061] [T316822]  suspend_prepare+0x150/0x470
  [ 5201.871067] [T316822]  enter_state+0x84/0x6f4
  [ 5201.871076] [T316822]  state_store+0x15c/0x1e8

  3.Enable thermal policy operation will unregister/register all thermal zones
     10-21 06:13:59.280   854   922 I libMtcLoader: enable thermal policy thermal_policy_09.

  4.System suspend entry time is 2023-10-20 22:13:59.242
     [ 4106.364175][T609387] binder:534_2: [name:spm&][SPM] PM: suspend entry 2023-10-20 22:13:59.242898243 UTC
     [ 4106.366185][T609387] binder:534_2: PM: [name:wakeup&]PM: Pending Wakeup Sources: NETLINK

  5. It can be proven that the absence of a switch strategy will perform
     unregister/register operations on thermal zones (android time is 2023-10-20 22:13:59.282),
     Because the logs for other thermal zones switching are not enabled by
     default, we cannot see the logs related to other thermal zones.
     [ 4106.404167][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK
     [ 4106.404215][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK
     [ 4106.404225][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK
     [ 4106.404504][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 0
     [ 4106.404545][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 2
     [ 4106.404566][T600922] mtkPowerMsgHdl:[name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 1

  6. thermal_pm_notify trigger KE(android time:  2023-10-20 22:13:59.315894)
     [ 4106.437171][T209387] binder:534_2: [name:mrdump&]Kernel Offset:0x289cc80000 from 0xffffffc008000000
     [ 4106.437182][T209387] binder:534_2: [name:mrdump&]PHYS_OFFSET:0x40000000
     [ 4106.437191][T209387] binder:534_2: [name:mrdump&]pstate: 80400005(Nzcv daif +PAN -UAO)
     [ 4106.437204][T209387] binder:534_2: [name:mrdump&]pc :[0xffffffe8a6688200] mutex_lock+0x34/0x184
     [ 4106.437214][T209387] binder:534_2: [name:mrdump&]lr :[0xffffffe8a5ce66bc] thermal_pm_notify+0xd4/0x26c
     [ 4106.437220][T209387] binder:534_2: [name:mrdump&]sp :ffffffc01bab3ae0
     [ 4106.437226][T209387] binder:534_2: [name:mrdump&]x29:ffffffc01bab3af0 x28: 0000000000000001

Signed-off-by: Yugang Wang <yugang.wang@mediatek.com>
Signed-off-by: Bo Ye <bo.ye@mediatek.com>
---
 drivers/thermal/thermal_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 9c17d35ccbbd..73d6b820c8b5 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1520,12 +1520,14 @@ static int thermal_pm_notify(struct notifier_block *nb,
 	case PM_POST_HIBERNATION:
 	case PM_POST_RESTORE:
 	case PM_POST_SUSPEND:
+		mutex_lock(&thermal_list_lock);
 		atomic_set(&in_suspend, 0);
 		list_for_each_entry(tz, &thermal_tz_list, node) {
 			thermal_zone_device_init(tz);
 			thermal_zone_device_update(tz,
 						   THERMAL_EVENT_UNSPECIFIED);
 		}
+		mutex_unlock(&thermal_list_lock);
 		break;
 	default:
 		break;
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-12-18 16:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-16 11:33 [PATCH] Subject: thermal: Fix potential race condition in suspend/resume Bo Ye
2023-10-12  7:35 ` [PATCH] " Bo Ye (叶波)
2023-10-23  1:19   ` Bo Ye (叶波)
2023-10-25 18:21     ` Rafael J. Wysocki
2023-11-01 14:58       ` [PATCH] thermal: Fix " Bo Ye (叶波)
2023-11-16 15:06         ` Bo Ye (叶波)
2023-12-06 14:32           ` Bo Ye (叶波)
2023-10-12 15:39 ` [PATCH] Subject: thermal: Fix potential " Daniel Lezcano
2023-10-12 17:03   ` Rafael J. Wysocki
  -- strict thread matches above, loose matches on Subject: below --
2023-12-18 16:23 [PATCH] thermal: fix " Bo Ye
2023-12-18 16:30 ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox