* [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
@ 2024-07-09 7:22 kernel test robot
2024-07-09 10:03 ` Vincent Guittot
0 siblings, 1 reply; 5+ messages in thread
From: kernel test robot @ 2024-07-09 7:22 UTC (permalink / raw)
To: Vincent Guittot
Cc: oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef,
linux-doc, aubrey.li, yu.c.chen, oliver.sang
Hello,
kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
[test failed on linus/master 22f902dfc51eb3602ff9b505ac3980f6ff77b1df]
[test failed on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
in testcase: rcutorture
version:
with following parameters:
runtime: 300s
test: cpuhotplug
torture_type: tasks
compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
we noticed the issue is random, 86 times out of 500 runs as below, while
keeping clean on parent.
d4dbc991714eefcb 97450eb909658573dcacc1063b0
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:500 17% 86:500 dmesg.INFO:task_blocked_for_more_than#seconds
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407091527.bb0be229-lkp@intel.com
[ 996.963402][ T17] INFO: task swapper:1 blocked for more than 491 seconds.
[ 996.973637][ T17] Tainted: G W 6.9.0-rc1-00051-g97450eb90965 #1
[ 996.983009][ T17] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 996.992369][ T17] task:swapper state:D stack:5680 pid:1 tgid:1 ppid:0 flags:0x00004000
[ 997.003585][ T17] Call Trace:
[ 997.022071][ T17] __schedule (kernel/sched/core.c:5412 kernel/sched/core.c:6746)
[ 997.032795][ T17] ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
[ 997.051350][ T17] schedule (arch/x86/include/asm/preempt.h:84 (discriminator 13) kernel/sched/core.c:6824 (discriminator 13) kernel/sched/core.c:6838 (discriminator 13))
[ 997.061394][ T17] async_synchronize_cookie_domain (kernel/async.c:317 (discriminator 9) kernel/async.c:310 (discriminator 9))
[ 997.071656][ T17] ? add_wait_queue (kernel/sched/wait.c:383)
[ 997.084201][ T17] wait_for_initramfs (init/initramfs.c:757)
[ 997.093158][ T17] ? do_header (init/initramfs.c:761)
[ 997.104204][ T17] populate_rootfs (init/initramfs.c:768)
[ 997.113736][ T17] do_one_initcall (init/main.c:1238)
[ 997.122668][ T17] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1))
[ 997.133272][ T17] ? rdinit_setup (init/main.c:1286)
[ 997.142176][ T17] ? parse_args (kernel/params.c:142 (discriminator 1) kernel/params.c:186 (discriminator 1))
[ 997.172210][ T17] ? do_initcalls (init/main.c:1298 init/main.c:1316)
[ 997.183570][ T17] do_initcalls (init/main.c:1299 (discriminator 1) init/main.c:1316 (discriminator 1))
[ 997.194241][ T17] ? rest_init (init/main.c:1429)
[ 997.204290][ T17] kernel_init_freeable (init/main.c:1552)
[ 997.214432][ T17] kernel_init (init/main.c:1439)
[ 997.223979][ T17] ret_from_fork (arch/x86/kernel/process.c:153)
[ 997.232870][ T17] ? rest_init (init/main.c:1429)
[ 997.244227][ T17] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[ 997.254412][ T17] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
[ 997.831274][ T17]
[ 997.831274][ T17] Showing all locks held in the system:
[ 997.840770][ T17] 4 locks held by kworker/u4:1/16:
[ 997.844292][ T17] 1 lock held by khungtaskd/17:
[ 997.853205][ T17] #0: c26e50c8 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6612)
[ 997.880625][ T17]
[ 997.883881][ T17] =============================================
[ 997.883881][ T17]
BUG: kernel hang in boot stage
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240709/202407091527.bb0be229-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
2024-07-09 7:22 [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds kernel test robot
@ 2024-07-09 10:03 ` Vincent Guittot
2024-07-12 16:41 ` Chen Yu
0 siblings, 1 reply; 5+ messages in thread
From: Vincent Guittot @ 2024-07-09 10:03 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef,
linux-doc, aubrey.li, yu.c.chen
On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
>
> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
First, I'm surprised that an Intel platform is impacted by this patch
because Intel doesn't use it AFAIK.
Then, this patch mainly remove a right shift i.e.:
instead of:
return rq_clock_task(rq) >> sched_hw_decay_shift
we are now doing:
return rq_clock_task(rq)
Could it be a false positive ?
>
> [test failed on linus/master 22f902dfc51eb3602ff9b505ac3980f6ff77b1df]
> [test failed on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
> runtime: 300s
> test: cpuhotplug
> torture_type: tasks
>
>
>
> compiler: gcc-13
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> we noticed the issue is random, 86 times out of 500 runs as below, while
> keeping clean on parent.
>
>
> d4dbc991714eefcb 97450eb909658573dcacc1063b0
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :500 17% 86:500 dmesg.INFO:task_blocked_for_more_than#seconds
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202407091527.bb0be229-lkp@intel.com
>
>
> [ 996.963402][ T17] INFO: task swapper:1 blocked for more than 491 seconds.
> [ 996.973637][ T17] Tainted: G W 6.9.0-rc1-00051-g97450eb90965 #1
> [ 996.983009][ T17] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 996.992369][ T17] task:swapper state:D stack:5680 pid:1 tgid:1 ppid:0 flags:0x00004000
> [ 997.003585][ T17] Call Trace:
> [ 997.022071][ T17] __schedule (kernel/sched/core.c:5412 kernel/sched/core.c:6746)
> [ 997.032795][ T17] ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
> [ 997.051350][ T17] schedule (arch/x86/include/asm/preempt.h:84 (discriminator 13) kernel/sched/core.c:6824 (discriminator 13) kernel/sched/core.c:6838 (discriminator 13))
> [ 997.061394][ T17] async_synchronize_cookie_domain (kernel/async.c:317 (discriminator 9) kernel/async.c:310 (discriminator 9))
> [ 997.071656][ T17] ? add_wait_queue (kernel/sched/wait.c:383)
> [ 997.084201][ T17] wait_for_initramfs (init/initramfs.c:757)
> [ 997.093158][ T17] ? do_header (init/initramfs.c:761)
> [ 997.104204][ T17] populate_rootfs (init/initramfs.c:768)
> [ 997.113736][ T17] do_one_initcall (init/main.c:1238)
> [ 997.122668][ T17] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1))
> [ 997.133272][ T17] ? rdinit_setup (init/main.c:1286)
> [ 997.142176][ T17] ? parse_args (kernel/params.c:142 (discriminator 1) kernel/params.c:186 (discriminator 1))
> [ 997.172210][ T17] ? do_initcalls (init/main.c:1298 init/main.c:1316)
> [ 997.183570][ T17] do_initcalls (init/main.c:1299 (discriminator 1) init/main.c:1316 (discriminator 1))
> [ 997.194241][ T17] ? rest_init (init/main.c:1429)
> [ 997.204290][ T17] kernel_init_freeable (init/main.c:1552)
> [ 997.214432][ T17] kernel_init (init/main.c:1439)
> [ 997.223979][ T17] ret_from_fork (arch/x86/kernel/process.c:153)
> [ 997.232870][ T17] ? rest_init (init/main.c:1429)
> [ 997.244227][ T17] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
> [ 997.254412][ T17] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
> [ 997.831274][ T17]
> [ 997.831274][ T17] Showing all locks held in the system:
> [ 997.840770][ T17] 4 locks held by kworker/u4:1/16:
> [ 997.844292][ T17] 1 lock held by khungtaskd/17:
> [ 997.853205][ T17] #0: c26e50c8 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6612)
> [ 997.880625][ T17]
> [ 997.883881][ T17] =============================================
> [ 997.883881][ T17]
> BUG: kernel hang in boot stage
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240709/202407091527.bb0be229-lkp@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
2024-07-09 10:03 ` Vincent Guittot
@ 2024-07-12 16:41 ` Chen Yu
2024-07-24 11:34 ` Dietmar Eggemann
0 siblings, 1 reply; 5+ messages in thread
From: Chen Yu @ 2024-07-12 16:41 UTC (permalink / raw)
To: Vincent Guittot
Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar,
Lukasz Luba, Qais Yousef
On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote:
> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
> >
> > commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> First, I'm surprised that an Intel platform is impacted by this patch
> because Intel doesn't use it AFAIK.
> Then, this patch mainly remove a right shift i.e.:
> instead of:
> return rq_clock_task(rq) >> sched_hw_decay_shift
> we are now doing:
> return rq_clock_task(rq)
>
> Could it be a false positive ?
Before trying to reproduce it locally, one question is that, should we use
rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is
actually calculated by rq_clock_pelt(rq)?
thanks,
Chenyu
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d34f6d5b11b5..17ec0c51b29d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
- update_hw_load_avg(now, rq, hw_pressure) |
+ update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
update_irq_load_avg(rq, 0);
if (others_have_blocked(rq))
--
2.25.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
2024-07-12 16:41 ` Chen Yu
@ 2024-07-24 11:34 ` Dietmar Eggemann
2024-07-24 15:09 ` Chen Yu
0 siblings, 1 reply; 5+ messages in thread
From: Dietmar Eggemann @ 2024-07-24 11:34 UTC (permalink / raw)
To: Chen Yu, Vincent Guittot
Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar,
Lukasz Luba, Qais Yousef
On 12/07/2024 18:41, Chen Yu wrote:
> On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote:
>> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
>>>
>>> Hello,
>>>
>>> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
>>>
>>> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> First, I'm surprised that an Intel platform is impacted by this patch
>> because Intel doesn't use it AFAIK.
>> Then, this patch mainly remove a right shift i.e.:
>> instead of:
>> return rq_clock_task(rq) >> sched_hw_decay_shift
>> we are now doing:
>> return rq_clock_task(rq)
>>
>> Could it be a false positive ?
>
> Before trying to reproduce it locally, one question is that, should we use
> rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is
> actually calculated by rq_clock_pelt(rq)?
>
> thanks,
> Chenyu
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d34f6d5b11b5..17ec0c51b29d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>
> decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> - update_hw_load_avg(now, rq, hw_pressure) |
> + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
> update_irq_load_avg(rq, 0);
>
> if (others_have_blocked(rq))
Yes, update_hw_load_avg() should be driven entirely by
rq_clock_task(rq). But IMHO this PELT signal is only used on some arm64
platforms. So you won't detect any misbehavior running your tests on Intel.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
2024-07-24 11:34 ` Dietmar Eggemann
@ 2024-07-24 15:09 ` Chen Yu
0 siblings, 0 replies; 5+ messages in thread
From: Chen Yu @ 2024-07-24 15:09 UTC (permalink / raw)
To: Dietmar Eggemann, Vincent Guittot
Cc: Vincent Guittot, kernel test robot, oe-lkp, lkp, linux-kernel,
Ingo Molnar, Lukasz Luba, Qais Yousef, Yujie Liu
Hi Dietmar,
On 2024-07-24 at 13:34:31 +0200, Dietmar Eggemann wrote:
> On 12/07/2024 18:41, Chen Yu wrote:
> > On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote:
> >> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
> >>>
> >>> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>
> >> First, I'm surprised that an Intel platform is impacted by this patch
> >> because Intel doesn't use it AFAIK.
> >> Then, this patch mainly remove a right shift i.e.:
> >> instead of:
> >> return rq_clock_task(rq) >> sched_hw_decay_shift
> >> we are now doing:
> >> return rq_clock_task(rq)
> >>
> >> Could it be a false positive ?
> >
> > Before trying to reproduce it locally, one question is that, should we use
> > rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is
> > actually calculated by rq_clock_pelt(rq)?
> >
> > thanks,
> > Chenyu
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d34f6d5b11b5..17ec0c51b29d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> >
> > decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > - update_hw_load_avg(now, rq, hw_pressure) |
> > + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
> > update_irq_load_avg(rq, 0);
> >
> > if (others_have_blocked(rq))
>
> Yes, update_hw_load_avg() should be driven entirely by
> rq_clock_task(rq). But IMHO this PELT signal is only used on some arm64
> platforms. So you won't detect any misbehavior running your tests on Intel.
Yes, on Intel platform the arch_scale_hw_pressure() should return 0. Unfortunately
I could not reproduce the issue locally.
Since the commit replace rq_clock_task() with rq_clock_pelt() in __update_blocked_others(),
while keeps it rq_clock_task() in sched_tick(), I wonder if the following
code would cause inconsistence in ___update_load_sum():
u64 delta = now - sa->last_update_time
'now' could be calculated by rq_clock_pelt(), and last_update_time was
calculated by rq_clock_task(), and usually the former chases after the
latter, it cause a very large 'delta' and bring unexpected behavior.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-07-24 15:10 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-09 7:22 [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds kernel test robot
2024-07-09 10:03 ` Vincent Guittot
2024-07-12 16:41 ` Chen Yu
2024-07-24 11:34 ` Dietmar Eggemann
2024-07-24 15:09 ` Chen Yu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox