From: "Bo Ye (叶波)" <Bo.Ye@mediatek.com>
To: "rafael@kernel.org" <rafael@kernel.org>
Cc: "linux-mediatek@lists.infradead.org"
<linux-mediatek@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"rui.zhang@intel.com" <rui.zhang@intel.com>,
"Browse Zhang (张磊)" <Browse.Zhang@mediatek.com>,
"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>,
"daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>,
"Yongdong Zhang (张永东)" <Yongdong.Zhang@mediatek.com>,
"amitk@kernel.org" <amitk@kernel.org>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"Yugang Wang (王玉刚)" <Yugang.Wang@mediatek.com>,
"matthias.bgg@gmail.com" <matthias.bgg@gmail.com>,
"angelogioacchino.delregno@collabora.com"
<angelogioacchino.delregno@collabora.com>
Subject: Re: [PATCH] thermal: Fix race condition in suspend/resume
Date: Thu, 16 Nov 2023 15:06:58 +0000
Message-ID: <2f6530d551665e87cdf0331ecb223bdf4feb2435.camel@mediatek.com>
In-Reply-To: <c8d305f8b46287d86a49a887983ff2198cfbc297.camel@mediatek.com>
Hi Rafael,
Could you please help review this patch? Thanks a lot.
Best regards,
Bo Ye
On Wed, 2023-11-01 at 22:58 +0800, Bo Ye wrote:
> On Wed, 2023-10-25 at 20:21 +0200, Rafael J. Wysocki wrote:
> >
> > On Mon, Oct 23, 2023 at 3:20 AM Bo Ye (叶波) <Bo.Ye@mediatek.com>
> > wrote:
> > >
> > > > Yes, it is an observed issue.
> >
> > It does happen, so it's not just "potential" and the subject of the
> > patch is slightly misleading. Please adjust it.
>
> Done
>
> >
> > > > Firstly, it needs to be clarified that this issue occurs in a
> > > > real-world environment. By analyzing the logs, we inferred that
> > > > the issue occurred just as the system was entering suspend mode
> > > > while the user was switching the thermal policy (this action
> > > > causes all thermal zones to unregister/register). In addition, we
> > > > conducted degradation tests and also reproduced this issue. The
> > > > specific method is to first switch the thermal policy through a
> > > > command, and then immediately put the system into the suspend
> > > > state through another command.
> >
> > OK, so please add this information to the patch changelog.
> >
> > > On Thu, 2023-10-12 at 07:35 +0000, Bo Ye (叶波) wrote:
> > > > On Sat, 2023-09-16 at 19:33 +0800, Bo Ye wrote:
> > > >
> > > > Correct mail title format: remove "Subject:" from mail title.
> > > >
> > > > > From: "yugang.wang" <yugang.wang@mediatek.com>
> > > > >
> > > > > Body:
> > > > > > This patch fixes a race condition during system resume. It
> > > > > > occurs if the system is exiting a suspend state and a user is
> > > > > > trying to register/unregister a thermal zone concurrently. The
> > > > > > root cause is that both actions access the `thermal_tz_list`.
> > > > > >
> > > > > > In detail:
> > > > > >
> > > > > > 1. At PM_POST_SUSPEND during the resume, the system reads all
> > > > > >    thermal zones in `thermal_tz_list`, then resets and updates
> > > > > >    their temperatures.
> > > > > > 2. When registering/unregistering a thermal zone, the
> > > > > >    `thermal_tz_list` gets manipulated.
> > > > > >
> > > > > > These two actions might occur concurrently, causing a race
> > > > > > condition. To solve this issue, we introduce a mutex lock to
> > > > > > protect `thermal_tz_list` from being modified while it's being
> > > > > > read and updated during the resume from suspend.
> > > > >
> > > > > Kernel oops excerpt related to this fix:
> > > > >
> > > > > [ 5201.869845] [T316822] pc: [0xffffffeb7d4876f0]
> > > > > mutex_lock+0x34/0x170
> > > > > [ 5201.869856] [T316822] lr: [0xffffffeb7ca98a84]
> > > > > thermal_pm_notify+0xd4/0x26c
> > > > > [... cut for brevity ...]
> > > > > [ 5201.871061] [T316822] suspend_prepare+0x150/0x470
> > > > > [ 5201.871067] [T316822] enter_state+0x84/0x6f4
> > > > > [ 5201.871076] [T316822] state_store+0x15c/0x1e8
> >
> > Well, the connection between the above log snippet and the issue
> > addressed by the patch is rather hard to establish. Please include
> > more of the oops information.
>
> Thank you very much for reviewing the additional explanations.
>
> 1. Enabling a thermal policy unregisters and re-registers all thermal
> zones:
> 10-21 06:13:59.280 854 922 I libMtcLoader: enable thermal policy thermal_policy_09.
>
> 2. System suspend entry time is 2023-10-20 22:13:59.242:
> [ 4106.364175][T609387] binder:534_2: [name:spm&][SPM] PM: suspend entry 2023-10-20 22:13:59.242898243 UTC
> [ 4106.366185][T609387] binder:534_2: PM: [name:wakeup&]PM: Pending Wakeup Sources: NETLINK
>
> 3. The logs show that switching the policy performs unregister/register
> operations on thermal zones (Android time is 2023-10-20 22:13:59.282).
> Because the logs for other thermal zones switching are not enabled by
> default, we cannot see the logs related to other thermal zones.
> [ 4106.404167][T600922] mtkPowerMsgHdl: [name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK
> [ 4106.404215][T600922] mtkPowerMsgHdl: [name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK
> [ 4106.404225][T600922] mtkPowerMsgHdl: [name:thermal_monitor&][Thermal/TZ/CPU]tscpu_unbind unbinding OK
> [ 4106.404504][T600922] mtkPowerMsgHdl: [name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 0
> [ 4106.404545][T600922] mtkPowerMsgHdl: [name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 2
> [ 4106.404566][T600922] mtkPowerMsgHdl: [name:thermal_monitor&][Thermal/TZ/CPU]tscpu_bind binding OK, 1
>
> 4. thermal_pm_notify triggers the KE (Android time: 2023-10-20
> 22:13:59.315894):
> [ 4106.437171][T209387] binder:534_2: [name:mrdump&]Kernel Offset: 0x289cc80000 from 0xffffffc008000000
> [ 4106.437182][T209387] binder:534_2: [name:mrdump&]PHYS_OFFSET: 0x40000000
> [ 4106.437191][T209387] binder:534_2: [name:mrdump&]pstate: 80400005 (Nzcv daif +PAN -UAO)
> [ 4106.437204][T209387] binder:534_2: [name:mrdump&]pc : [0xffffffe8a6688200] mutex_lock+0x34/0x184
> [ 4106.437214][T209387] binder:534_2: [name:mrdump&]lr : [0xffffffe8a5ce66bc] thermal_pm_notify+0xd4/0x26c
> [ 4106.437220][T209387] binder:534_2: [name:mrdump&]sp : ffffffc01bab3ae0
> [ 4106.437226][T209387] binder:534_2: [name:mrdump&]x29: ffffffc01bab3af0 x28: 0000000000000001
>
> >
> > > > >
> > > > > Change-Id: Ifdbdecba17093f91eab7e36ce04b46d311ca6568
> > > > > Signed-off-by: yugang.wang <yugang.wang@mediatek.com>
> > > > > Signed-off-by: Bo Ye <bo.ye@mediatek.com>
> > > > > ---
> > > > > drivers/thermal/thermal_core.c | 2 ++
> > > > > 1 file changed, 2 insertions(+)
> > > > >
> > > > > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > > > > > index 8717a3343512..a7a18ed57b6d 100644
> > > > > > --- a/drivers/thermal/thermal_core.c
> > > > > > +++ b/drivers/thermal/thermal_core.c
> > > > > > @@ -1529,12 +1529,14 @@ static int thermal_pm_notify(struct notifier_block *nb,
> > > > > case PM_POST_HIBERNATION:
> > > > > case PM_POST_RESTORE:
> > > > > case PM_POST_SUSPEND:
> > > > > + mutex_lock(&thermal_list_lock);
> > > > > atomic_set(&in_suspend, 0);
> >
> > It is not clear to me why the above statement needs to be under the
> > lock.
> >
> > > > > list_for_each_entry(tz, &thermal_tz_list, node) {
> > > > > thermal_zone_device_init(tz);
> > > > > > thermal_zone_device_update(tz,
> > > > > > THERMAL_EVENT_UNSPECIFIED);
> > > > > }
> > > > > + mutex_unlock(&thermal_list_lock);
> > > > > break;
> > > > > default:
> > > > > break;