From: Peter Zijlstra <peterz@infradead.org>
To: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Cc: "dietmar.eggemann@arm.com" <dietmar.eggemann@arm.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mediatek@lists.infradead.org"
<linux-mediatek@lists.infradead.org>,
"rostedt@goodmis.org" <rostedt@goodmis.org>,
wsd_upstream <wsd_upstream@mediatek.com>,
"vschneid@redhat.com" <vschneid@redhat.com>,
"bristot@redhat.com" <bristot@redhat.com>,
"juri.lelli@redhat.com" <juri.lelli@redhat.com>,
"mingo@redhat.com" <mingo@redhat.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"bsegall@google.com" <bsegall@google.com>,
"mgorman@suse.de" <mgorman@suse.de>,
"matthias.bgg@gmail.com" <matthias.bgg@gmail.com>,
"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
"angelogioacchino.delregno@collabora.com"
<angelogioacchino.delregno@collabora.com>
Subject: Re: [PATCH 1/1] sched/core: Fix stuck on completion for affine_move_task() when stopper disable
Date: Thu, 28 Sep 2023 17:16:16 +0200
Message-ID: <20230928151616.GD27245@noisy.programming.kicks-ass.net>
In-Reply-To: <b9def8f3d9426bc158b302f4474b6e643b46d206.camel@mediatek.com>

On Wed, Sep 27, 2023 at 03:57:35PM +0000, Kuyo Chang (張建文) wrote:
> On Wed, 2023-09-27 at 10:08 +0200, Peter Zijlstra wrote:
> >
> > External email : Please do not click links or open attachments until
> > you have verified the sender or the content.
> > On Wed, Sep 27, 2023 at 11:34:28AM +0800, Kuyo Chang wrote:
> > > From: kuyo chang <kuyo.chang@mediatek.com>
> > >
> > > [Symptom] The hung-task detector shows the warning below:
> > > [ 4320.666557] [ T56] khungtaskd: [name:hung_task&]INFO: task stressapptest:17803 blocked for more than 3600 seconds.
> > > [ 4320.666589] [ T56] khungtaskd: [name:core&]task:stressapptest state:D stack:0 pid:17803 ppid:17579 flags:0x04000008
> > > [ 4320.666601] [ T56] khungtaskd: Call trace:
> > > [ 4320.666607] [ T56] khungtaskd: __switch_to+0x17c/0x338
> > > [ 4320.666642] [ T56] khungtaskd: __schedule+0x54c/0x8ec
> > > [ 4320.666651] [ T56] khungtaskd: schedule+0x74/0xd4
> > > [ 4320.666656] [ T56] khungtaskd: schedule_timeout+0x34/0x108
> > > [ 4320.666672] [ T56] khungtaskd: do_wait_for_common+0xe0/0x154
> > > [ 4320.666678] [ T56] khungtaskd: wait_for_completion+0x44/0x58
> > > [ 4320.666681] [ T56] khungtaskd: __set_cpus_allowed_ptr_locked+0x344/0x730
> > > [ 4320.666702] [ T56] khungtaskd: __sched_setaffinity+0x118/0x160
> > > [ 4320.666709] [ T56] khungtaskd: sched_setaffinity+0x10c/0x248
> > > [ 4320.666715] [ T56] khungtaskd: __arm64_sys_sched_setaffinity+0x15c/0x1c0
> > > [ 4320.666719] [ T56] khungtaskd: invoke_syscall+0x3c/0xf8
> > > [ 4320.666743] [ T56] khungtaskd: el0_svc_common+0xb0/0xe8
> > > [ 4320.666749] [ T56] khungtaskd: do_el0_svc+0x28/0xa8
> > > [ 4320.666755] [ T56] khungtaskd: el0_svc+0x28/0x9c
> > > [ 4320.666761] [ T56] khungtaskd: el0t_64_sync_handler+0x7c/0xe4
> > > [ 4320.666766] [ T56] khungtaskd: el0t_64_sync+0x18c/0x190
> > >
> > > [Analysis]
> > >
> > > After adding some debug footprint messages, this issue was found to
> > > happen in the stopper-disabled case.
> > > migration_cpu_stop() cannot be executed to complete the migration,
> > > which leaves the task stuck in wait_for_completion().
> >
> > How did you get in this situation?
> >
>
> This issue occurs during a CPU hotplug/set_affinity stress test.
> The reproduction rate is very low (about once a week).
>
> So I added some debug messages to snapshot the task status while it
> is stuck in wait_for_completion().
>
> Below is the snapshot taken when the issue happened:
>
> cpu_active_mask is 0xFC
> new_mask is 0x8
> pending->arg.dest_cpu is 0x3
> task_on_cpu(rq,p) is 1
> task_cpu is 0x2
> p__state = TASK_RUNNING
> flag is SCA_CHECK|SCA_USER
> stop_one_cpu_nowait(stopper->enabled) return value is false.
>
> I also recorded a footprint in migration_cpu_stop().
> It shows that migration_cpu_stop() was not executed.

AFAICT this is migrate_enable(), which acts on current, so how can the
CPU that current runs on go away?

That is completely unexplained. You've not given a proper description of
the race scenario. And because you've not, we can't even begin to talk
about how best to address the issue.
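
For reference, the masks in the quoted snapshot can be decoded with a
small userspace sketch. This is not kernel code, and mask_has_cpu() and
check_snapshot() are hypothetical names used only for illustration.
Read this way, both task_cpu (2) and dest_cpu (3) are still set in
cpu_active_mask, so the snapshot by itself does not show an inactive
CPU:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: is bit `cpu` set in a CPU bitmask? */
static bool mask_has_cpu(unsigned long mask, unsigned int cpu)
{
	return (mask >> cpu) & 1UL;
}

/* Values copied from the quoted snapshot. */
void check_snapshot(void)
{
	const unsigned long cpu_active_mask = 0xFC; /* 0b11111100: CPUs 2-7 */
	const unsigned long new_mask = 0x8;         /* 0b00001000: CPU 3 only */

	assert(!mask_has_cpu(cpu_active_mask, 0)); /* CPUs 0 and 1 are not active */
	assert(!mask_has_cpu(cpu_active_mask, 1));
	assert(mask_has_cpu(cpu_active_mask, 2));  /* task_cpu == 2 is active */
	assert(mask_has_cpu(new_mask, 3));         /* matches dest_cpu == 3 */
	assert(mask_has_cpu(cpu_active_mask, 3));  /* destination is active too */
}
```
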
> > struct task_struct *p, struct rq_flag
> > > task_rq_unlock(rq, p, rf);
> > >
> > > if (!stop_pending) {
> > > -stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > > - &pending->arg, &pending->stop_work);
> > > +if (!stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
> > > + &pending->arg, &pending->stop_work))
> > > +return -ENOENT;
> >
> > And -ENOENT is the right return code for when the target CPU is not
> > available?
> >
> > I suspect you're missing more than half the picture and this is a
> > band-aid solution at best. Please try harder.
> >
>
> I think -ENOENT means the stopper was not executed?
> Perhaps the error code is misused; could you kindly give me some
> suggestions?

Well, at this point you're leaving the whole affine_move_task()
machinery in an undefined state, which is a much bigger problem than the
weird return value.

Please read through that function and its comments a number of times. If
you're not a little nervous, you've not understood the thing.

Your patch has at least one very obvious resource leak.
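
A deliberately simplified userspace model of why an early return after
a failed stop_one_cpu_nowait() is dangerous. It is not kernel code:
every name below is hypothetical, and the bookkeeping is an assumption
made for illustration, not the actual affine_move_task() logic. The
only point is that whatever was set up for the stopper callback to
undo stays set up when the callback is never queued:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a pending-migration record. */
struct pending_model {
	int  refs;         /* references held by waiters */
	bool stop_pending; /* stopper work is believed to be in flight */
	bool done;         /* set when the stopper callback completes */
};

/* Stand-in for stop_one_cpu_nowait(): false means nothing was queued. */
static bool queue_stop_work(bool stopper_enabled)
{
	return stopper_enabled;
}

/* Stand-in for the stopper callback: completes and clears the flag. */
static void stop_callback(struct pending_model *p)
{
	p->stop_pending = false;
	p->done = true;
}

/* Models the patched path with its early return on queueing failure. */
int patched_path(struct pending_model *p, bool stopper_enabled)
{
	p->refs++;
	p->stop_pending = true;

	if (!queue_stop_work(stopper_enabled))
		return -ENOENT; /* nothing undoes refs or stop_pending */

	stop_callback(p); /* in this model the callback runs at once */
	p->refs--;        /* waiter drops its reference after completion */
	return 0;
}

/* Usage: the failure case leaves state behind, the success case does not. */
void demo(void)
{
	struct pending_model failed = {0}, ok = {0};

	assert(patched_path(&failed, false) == -ENOENT);
	assert(failed.refs == 1 && failed.stop_pending && !failed.done);

	assert(patched_path(&ok, true) == 0);
	assert(ok.refs == 0 && !ok.stop_pending && ok.done);
}
```

In this model a later caller that sees stop_pending still set would
assume a callback is on its way and wait for a completion that never
arrives.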
Thread overview:
2023-09-27 3:34 [PATCH 1/1] sched/core: Fix stuck on completion for affine_move_task() when stopper disable Kuyo Chang
2023-09-27 8:08 ` Peter Zijlstra
2023-09-27 15:57 ` Kuyo Chang (張建文)
2023-09-28 15:16 ` Peter Zijlstra [this message]
2023-09-28 15:19 ` Peter Zijlstra
2023-09-29 10:21 ` Peter Zijlstra
2023-10-01 15:15 ` Kuyo Chang (張建文)
2023-10-10 14:40 ` Kuyo Chang (張建文)
2023-10-10 14:57 ` Peter Zijlstra
2023-10-10 20:04 ` [PATCH] sched: Fix stop_one_cpu_nowait() vs hotplug Peter Zijlstra
2023-10-11 3:24 ` Kuyo Chang (張建文)
2023-10-11 13:26 ` Peter Zijlstra
2023-10-13 8:06 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra