* [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
@ 2024-07-09 7:22 kernel test robot
2024-07-09 10:03 ` Vincent Guittot
0 siblings, 1 reply; 5+ messages in thread
From: kernel test robot @ 2024-07-09 7:22 UTC (permalink / raw)
To: Vincent Guittot
Cc: oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef,
linux-doc, aubrey.li, yu.c.chen, oliver.sang
Hello,
kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
[test failed on linus/master 22f902dfc51eb3602ff9b505ac3980f6ff77b1df]
[test failed on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
in testcase: rcutorture
version:
with following parameters:
runtime: 300s
test: cpuhotplug
torture_type: tasks
compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
we noticed the issue is random, 86 times out of 500 runs as below, while
keeping clean on parent.
d4dbc991714eefcb 97450eb909658573dcacc1063b0
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:500 17% 86:500 dmesg.INFO:task_blocked_for_more_than#seconds
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407091527.bb0be229-lkp@intel.com
[ 996.963402][ T17] INFO: task swapper:1 blocked for more than 491 seconds.
[ 996.973637][ T17] Tainted: G W 6.9.0-rc1-00051-g97450eb90965 #1
[ 996.983009][ T17] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 996.992369][ T17] task:swapper state:D stack:5680 pid:1 tgid:1 ppid:0 flags:0x00004000
[ 997.003585][ T17] Call Trace:
[ 997.022071][ T17] __schedule (kernel/sched/core.c:5412 kernel/sched/core.c:6746)
[ 997.032795][ T17] ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
[ 997.051350][ T17] schedule (arch/x86/include/asm/preempt.h:84 (discriminator 13) kernel/sched/core.c:6824 (discriminator 13) kernel/sched/core.c:6838 (discriminator 13))
[ 997.061394][ T17] async_synchronize_cookie_domain (kernel/async.c:317 (discriminator 9) kernel/async.c:310 (discriminator 9))
[ 997.071656][ T17] ? add_wait_queue (kernel/sched/wait.c:383)
[ 997.084201][ T17] wait_for_initramfs (init/initramfs.c:757)
[ 997.093158][ T17] ? do_header (init/initramfs.c:761)
[ 997.104204][ T17] populate_rootfs (init/initramfs.c:768)
[ 997.113736][ T17] do_one_initcall (init/main.c:1238)
[ 997.122668][ T17] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1))
[ 997.133272][ T17] ? rdinit_setup (init/main.c:1286)
[ 997.142176][ T17] ? parse_args (kernel/params.c:142 (discriminator 1) kernel/params.c:186 (discriminator 1))
[ 997.172210][ T17] ? do_initcalls (init/main.c:1298 init/main.c:1316)
[ 997.183570][ T17] do_initcalls (init/main.c:1299 (discriminator 1) init/main.c:1316 (discriminator 1))
[ 997.194241][ T17] ? rest_init (init/main.c:1429)
[ 997.204290][ T17] kernel_init_freeable (init/main.c:1552)
[ 997.214432][ T17] kernel_init (init/main.c:1439)
[ 997.223979][ T17] ret_from_fork (arch/x86/kernel/process.c:153)
[ 997.232870][ T17] ? rest_init (init/main.c:1429)
[ 997.244227][ T17] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[ 997.254412][ T17] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
[ 997.831274][ T17]
[ 997.831274][ T17] Showing all locks held in the system:
[ 997.840770][ T17] 4 locks held by kworker/u4:1/16:
[ 997.844292][ T17] 1 lock held by khungtaskd/17:
[ 997.853205][ T17] #0: c26e50c8 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6612)
[ 997.880625][ T17]
[ 997.883881][ T17] =============================================
[ 997.883881][ T17]
BUG: kernel hang in boot stage
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240709/202407091527.bb0be229-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds 2024-07-09 7:22 [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds kernel test robot @ 2024-07-09 10:03 ` Vincent Guittot 2024-07-12 16:41 ` Chen Yu 0 siblings, 1 reply; 5+ messages in thread From: Vincent Guittot @ 2024-07-09 10:03 UTC (permalink / raw) To: kernel test robot Cc: oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef, linux-doc, aubrey.li, yu.c.chen On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote: > > > > Hello, > > kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on: > > commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock") > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master First, I'm surprised that an Intel platform is impacted by this patch because Intel doesn't use it AFAIK. Then, this patch mainly remove a right shift i.e.: instead of: return rq_clock_task(rq) >> sched_hw_decay_shift we are now doing: return rq_clock_task(rq) Could it be a false positive ? > > [test failed on linus/master 22f902dfc51eb3602ff9b505ac3980f6ff77b1df] > [test failed on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233] > > in testcase: rcutorture > version: > with following parameters: > > runtime: 300s > test: cpuhotplug > torture_type: tasks > > > > compiler: gcc-13 > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > we noticed the issue is random, 86 times out of 500 runs as below, while > keeping clean on parent. > > > d4dbc991714eefcb 97450eb909658573dcacc1063b0 > ---------------- --------------------------- > fail:runs %reproduction fail:runs > | | | > :500 17% 86:500 dmesg.INFO:task_blocked_for_more_than#seconds > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <oliver.sang@intel.com> > | Closes: https://lore.kernel.org/oe-lkp/202407091527.bb0be229-lkp@intel.com > > > [ 996.963402][ T17] INFO: task swapper:1 blocked for more than 491 seconds. > [ 996.973637][ T17] Tainted: G W 6.9.0-rc1-00051-g97450eb90965 #1 > [ 996.983009][ T17] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 996.992369][ T17] task:swapper state:D stack:5680 pid:1 tgid:1 ppid:0 flags:0x00004000 > [ 997.003585][ T17] Call Trace: > [ 997.022071][ T17] __schedule (kernel/sched/core.c:5412 kernel/sched/core.c:6746) > [ 997.032795][ T17] ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837) > [ 997.051350][ T17] schedule (arch/x86/include/asm/preempt.h:84 (discriminator 13) kernel/sched/core.c:6824 (discriminator 13) kernel/sched/core.c:6838 (discriminator 13)) > [ 997.061394][ T17] async_synchronize_cookie_domain (kernel/async.c:317 (discriminator 9) kernel/async.c:310 (discriminator 9)) > [ 997.071656][ T17] ? add_wait_queue (kernel/sched/wait.c:383) > [ 997.084201][ T17] wait_for_initramfs (init/initramfs.c:757) > [ 997.093158][ T17] ? do_header (init/initramfs.c:761) > [ 997.104204][ T17] populate_rootfs (init/initramfs.c:768) > [ 997.113736][ T17] do_one_initcall (init/main.c:1238) > [ 997.122668][ T17] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1)) > [ 997.133272][ T17] ? rdinit_setup (init/main.c:1286) > [ 997.142176][ T17] ? parse_args (kernel/params.c:142 (discriminator 1) kernel/params.c:186 (discriminator 1)) > [ 997.172210][ T17] ? do_initcalls (init/main.c:1298 init/main.c:1316) > [ 997.183570][ T17] do_initcalls (init/main.c:1299 (discriminator 1) init/main.c:1316 (discriminator 1)) > [ 997.194241][ T17] ? rest_init (init/main.c:1429) > [ 997.204290][ T17] kernel_init_freeable (init/main.c:1552) > [ 997.214432][ T17] kernel_init (init/main.c:1439) > [ 997.223979][ T17] ret_from_fork (arch/x86/kernel/process.c:153) > [ 997.232870][ T17] ? rest_init (init/main.c:1429) > [ 997.244227][ T17] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) > [ 997.254412][ T17] entry_INT80_32 (arch/x86/entry/entry_32.S:944) > [ 997.831274][ T17] > [ 997.831274][ T17] Showing all locks held in the system: > [ 997.840770][ T17] 4 locks held by kworker/u4:1/16: > [ 997.844292][ T17] 1 lock held by khungtaskd/17: > [ 997.853205][ T17] #0: c26e50c8 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6612) > [ 997.880625][ T17] > [ 997.883881][ T17] ============================================= > [ 997.883881][ T17] > BUG: kernel hang in boot stage > > > > The kernel config and materials to reproduce are available at: > https://download.01.org/0day-ci/archive/20240709/202407091527.bb0be229-lkp@intel.com > > > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds 2024-07-09 10:03 ` Vincent Guittot @ 2024-07-12 16:41 ` Chen Yu 2024-07-24 11:34 ` Dietmar Eggemann 0 siblings, 1 reply; 5+ messages in thread From: Chen Yu @ 2024-07-12 16:41 UTC (permalink / raw) To: Vincent Guittot Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote: > On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote: > > > > > > > > Hello, > > > > kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on: > > > > commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock") > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > > First, I'm surprised that an Intel platform is impacted by this patch > because Intel doesn't use it AFAIK. > Then, this patch mainly remove a right shift i.e.: > instead of: > return rq_clock_task(rq) >> sched_hw_decay_shift > we are now doing: > return rq_clock_task(rq) > > Could it be a false positive ? Before trying to reproduce it locally, one question is that, should we use rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is actually calculated by rq_clock_pelt(rq)? thanks, Chenyu diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d34f6d5b11b5..17ec0c51b29d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | - update_hw_load_avg(now, rq, hw_pressure) | + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) | update_irq_load_avg(rq, 0); if (others_have_blocked(rq)) -- 2.25.1 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds 2024-07-12 16:41 ` Chen Yu @ 2024-07-24 11:34 ` Dietmar Eggemann 2024-07-24 15:09 ` Chen Yu 0 siblings, 1 reply; 5+ messages in thread From: Dietmar Eggemann @ 2024-07-24 11:34 UTC (permalink / raw) To: Chen Yu, Vincent Guittot Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef On 12/07/2024 18:41, Chen Yu wrote: > On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote: >> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote: >>> >>> Hello, >>> >>> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on: >>> >>> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock") >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >> >> First, I'm surprised that an Intel platform is impacted by this patch >> because Intel doesn't use it AFAIK. >> Then, this patch mainly remove a right shift i.e.: >> instead of: >> return rq_clock_task(rq) >> sched_hw_decay_shift >> we are now doing: >> return rq_clock_task(rq) >> >> Could it be a false positive ? > > Before trying to reproduce it locally, one question is that, should we use > rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is > actually calculated by rq_clock_pelt(rq)? > > thanks, > Chenyu > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index d34f6d5b11b5..17ec0c51b29d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) > > decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | > update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | > - update_hw_load_avg(now, rq, hw_pressure) | > + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) | > update_irq_load_avg(rq, 0); > > if (others_have_blocked(rq)) Yes, update_hw_load_avg() should be driven entirely by rq_clock_task(rq). But IMHO this PELT signal is only used on some arm64 platforms. So you won't detect any misbehavior running your tests on Intel. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds 2024-07-24 11:34 ` Dietmar Eggemann @ 2024-07-24 15:09 ` Chen Yu 0 siblings, 0 replies; 5+ messages in thread From: Chen Yu @ 2024-07-24 15:09 UTC (permalink / raw) To: Dietmar Eggemann, Vincent Guittot Cc: Vincent Guittot, kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef, Yujie Liu Hi Dietmar, On 2024-07-24 at 13:34:31 +0200, Dietmar Eggemann wrote: > On 12/07/2024 18:41, Chen Yu wrote: > > On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote: > >> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote: > >>> > >>> Hello, > >>> > >>> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on: > >>> > >>> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock") > >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > >> > >> First, I'm surprised that an Intel platform is impacted by this patch > >> because Intel doesn't use it AFAIK. > >> Then, this patch mainly remove a right shift i.e.: > >> instead of: > >> return rq_clock_task(rq) >> sched_hw_decay_shift > >> we are now doing: > >> return rq_clock_task(rq) > >> > >> Could it be a false positive ? > > > > Before trying to reproduce it locally, one question is that, should we use > > rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is > > actually calculated by rq_clock_pelt(rq)? > > > > thanks, > > Chenyu > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index d34f6d5b11b5..17ec0c51b29d 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) > > > > decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | > > update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | > > - update_hw_load_avg(now, rq, hw_pressure) | > > + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) | > > update_irq_load_avg(rq, 0); > > > > if (others_have_blocked(rq)) > > Yes, update_hw_load_avg() should be driven entirely by > rq_clock_task(rq). But IMHO this PELT signal is only used on some arm64 > platforms. So you won't detect any misbehavior running your tests on Intel. Yes, on Intel platform the arch_scale_hw_pressure() should return 0. Unfortunately I could not reproduce the issue locally. Since the commit replace rq_clock_task() with rq_clock_pelt() in __update_blocked_others(), while keeps it rq_clock_task() in sched_tick(), I wonder if the following code would cause inconsistence in ___update_load_sum(): u64 delta = now - sa->last_update_time 'now' could be calculated by rq_clock_pelt(), and last_update_time was calculated by rq_clock_task(), and usually the former chases after the latter, it cause a very large 'delta' and bring unexpected behavior. thanks, Chenyu ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-07-24 15:10 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-07-09 7:22 [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds kernel test robot 2024-07-09 10:03 ` Vincent Guittot 2024-07-12 16:41 ` Chen Yu 2024-07-24 11:34 ` Dietmar Eggemann 2024-07-24 15:09 ` Chen Yu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox