public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [linus:master] [sched/pelt]  97450eb909: INFO:task_blocked_for_more_than#seconds
@ 2024-07-09  7:22 kernel test robot
  2024-07-09 10:03 ` Vincent Guittot
  0 siblings, 1 reply; 5+ messages in thread
From: kernel test robot @ 2024-07-09  7:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef,
	linux-doc, aubrey.li, yu.c.chen, oliver.sang



Hello,

kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:

commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master      22f902dfc51eb3602ff9b505ac3980f6ff77b1df]
[test failed on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]

in testcase: rcutorture
version: 
with following parameters:

	runtime: 300s
	test: cpuhotplug
	torture_type: tasks



compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


we noticed the issue is random, 86 times out of 500 runs as below, while
keeping clean on parent.


d4dbc991714eefcb 97450eb909658573dcacc1063b0
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :500         17%          86:500   dmesg.INFO:task_blocked_for_more_than#seconds



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407091527.bb0be229-lkp@intel.com


[  996.963402][   T17] INFO: task swapper:1 blocked for more than 491 seconds.
[  996.973637][   T17]       Tainted: G        W          6.9.0-rc1-00051-g97450eb90965 #1
[  996.983009][   T17] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  996.992369][   T17] task:swapper         state:D stack:5680  pid:1     tgid:1     ppid:0      flags:0x00004000
[  997.003585][   T17] Call Trace:
[ 997.022071][ T17] __schedule (kernel/sched/core.c:5412 kernel/sched/core.c:6746) 
[ 997.032795][ T17] ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837) 
[ 997.051350][ T17] schedule (arch/x86/include/asm/preempt.h:84 (discriminator 13) kernel/sched/core.c:6824 (discriminator 13) kernel/sched/core.c:6838 (discriminator 13)) 
[ 997.061394][ T17] async_synchronize_cookie_domain (kernel/async.c:317 (discriminator 9) kernel/async.c:310 (discriminator 9)) 
[ 997.071656][ T17] ? add_wait_queue (kernel/sched/wait.c:383) 
[ 997.084201][ T17] wait_for_initramfs (init/initramfs.c:757) 
[ 997.093158][ T17] ? do_header (init/initramfs.c:761) 
[ 997.104204][ T17] populate_rootfs (init/initramfs.c:768) 
[ 997.113736][ T17] do_one_initcall (init/main.c:1238) 
[ 997.122668][ T17] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1)) 
[ 997.133272][ T17] ? rdinit_setup (init/main.c:1286) 
[ 997.142176][ T17] ? parse_args (kernel/params.c:142 (discriminator 1) kernel/params.c:186 (discriminator 1)) 
[ 997.172210][ T17] ? do_initcalls (init/main.c:1298 init/main.c:1316) 
[ 997.183570][ T17] do_initcalls (init/main.c:1299 (discriminator 1) init/main.c:1316 (discriminator 1)) 
[ 997.194241][ T17] ? rest_init (init/main.c:1429) 
[ 997.204290][ T17] kernel_init_freeable (init/main.c:1552) 
[ 997.214432][ T17] kernel_init (init/main.c:1439) 
[ 997.223979][ T17] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 997.232870][ T17] ? rest_init (init/main.c:1429) 
[ 997.244227][ T17] ret_from_fork_asm (arch/x86/entry/entry_32.S:737) 
[ 997.254412][ T17] entry_INT80_32 (arch/x86/entry/entry_32.S:944) 
[  997.831274][   T17]
[  997.831274][   T17] Showing all locks held in the system:
[  997.840770][   T17] 4 locks held by kworker/u4:1/16:
[  997.844292][   T17] 1 lock held by khungtaskd/17:
[ 997.853205][ T17] #0: c26e50c8 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6612) 
[  997.880625][   T17]
[  997.883881][   T17] =============================================
[  997.883881][   T17]
BUG: kernel hang in boot stage



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240709/202407091527.bb0be229-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
  2024-07-09  7:22 [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds kernel test robot
@ 2024-07-09 10:03 ` Vincent Guittot
  2024-07-12 16:41   ` Chen Yu
  0 siblings, 1 reply; 5+ messages in thread
From: Vincent Guittot @ 2024-07-09 10:03 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Ingo Molnar, Lukasz Luba, Qais Yousef,
	linux-doc, aubrey.li, yu.c.chen

On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
>
> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

First, I'm surprised that an Intel platform is impacted by this patch
because Intel doesn't use it AFAIK.
Then, this patch mainly remove a right shift i.e.:
instead of:
  return rq_clock_task(rq) >> sched_hw_decay_shift
we are now doing:
  return rq_clock_task(rq)

Could it be a false positive ?


>
> [test failed on linus/master      22f902dfc51eb3602ff9b505ac3980f6ff77b1df]
> [test failed on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
>         runtime: 300s
>         test: cpuhotplug
>         torture_type: tasks
>
>
>
> compiler: gcc-13
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> we noticed the issue is random, 86 times out of 500 runs as below, while
> keeping clean on parent.
>
>
> d4dbc991714eefcb 97450eb909658573dcacc1063b0
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :500         17%          86:500   dmesg.INFO:task_blocked_for_more_than#seconds
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202407091527.bb0be229-lkp@intel.com
>
>
> [  996.963402][   T17] INFO: task swapper:1 blocked for more than 491 seconds.
> [  996.973637][   T17]       Tainted: G        W          6.9.0-rc1-00051-g97450eb90965 #1
> [  996.983009][   T17] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  996.992369][   T17] task:swapper         state:D stack:5680  pid:1     tgid:1     ppid:0      flags:0x00004000
> [  997.003585][   T17] Call Trace:
> [ 997.022071][ T17] __schedule (kernel/sched/core.c:5412 kernel/sched/core.c:6746)
> [ 997.032795][ T17] ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
> [ 997.051350][ T17] schedule (arch/x86/include/asm/preempt.h:84 (discriminator 13) kernel/sched/core.c:6824 (discriminator 13) kernel/sched/core.c:6838 (discriminator 13))
> [ 997.061394][ T17] async_synchronize_cookie_domain (kernel/async.c:317 (discriminator 9) kernel/async.c:310 (discriminator 9))
> [ 997.071656][ T17] ? add_wait_queue (kernel/sched/wait.c:383)
> [ 997.084201][ T17] wait_for_initramfs (init/initramfs.c:757)
> [ 997.093158][ T17] ? do_header (init/initramfs.c:761)
> [ 997.104204][ T17] populate_rootfs (init/initramfs.c:768)
> [ 997.113736][ T17] do_one_initcall (init/main.c:1238)
> [ 997.122668][ T17] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1))
> [ 997.133272][ T17] ? rdinit_setup (init/main.c:1286)
> [ 997.142176][ T17] ? parse_args (kernel/params.c:142 (discriminator 1) kernel/params.c:186 (discriminator 1))
> [ 997.172210][ T17] ? do_initcalls (init/main.c:1298 init/main.c:1316)
> [ 997.183570][ T17] do_initcalls (init/main.c:1299 (discriminator 1) init/main.c:1316 (discriminator 1))
> [ 997.194241][ T17] ? rest_init (init/main.c:1429)
> [ 997.204290][ T17] kernel_init_freeable (init/main.c:1552)
> [ 997.214432][ T17] kernel_init (init/main.c:1439)
> [ 997.223979][ T17] ret_from_fork (arch/x86/kernel/process.c:153)
> [ 997.232870][ T17] ? rest_init (init/main.c:1429)
> [ 997.244227][ T17] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
> [ 997.254412][ T17] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
> [  997.831274][   T17]
> [  997.831274][   T17] Showing all locks held in the system:
> [  997.840770][   T17] 4 locks held by kworker/u4:1/16:
> [  997.844292][   T17] 1 lock held by khungtaskd/17:
> [ 997.853205][ T17] #0: c26e50c8 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks (kernel/locking/lockdep.c:6612)
> [  997.880625][   T17]
> [  997.883881][   T17] =============================================
> [  997.883881][   T17]
> BUG: kernel hang in boot stage
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240709/202407091527.bb0be229-lkp@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
  2024-07-09 10:03 ` Vincent Guittot
@ 2024-07-12 16:41   ` Chen Yu
  2024-07-24 11:34     ` Dietmar Eggemann
  0 siblings, 1 reply; 5+ messages in thread
From: Chen Yu @ 2024-07-12 16:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar,
	Lukasz Luba, Qais Yousef

On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote:
> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
> >
> > commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> First, I'm surprised that an Intel platform is impacted by this patch
> because Intel doesn't use it AFAIK.
> Then, this patch mainly remove a right shift i.e.:
> instead of:
>   return rq_clock_task(rq) >> sched_hw_decay_shift
> we are now doing:
>   return rq_clock_task(rq)
> 
> Could it be a false positive ?

Before trying to reproduce it locally, one question is that, should we use
rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is
actually calculated by rq_clock_pelt(rq)? 

thanks,
Chenyu

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d34f6d5b11b5..17ec0c51b29d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
 
 	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
 		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
-		  update_hw_load_avg(now, rq, hw_pressure) |
+		  update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
 		  update_irq_load_avg(rq, 0);
 
 	if (others_have_blocked(rq))
-- 
2.25.1





^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
  2024-07-12 16:41   ` Chen Yu
@ 2024-07-24 11:34     ` Dietmar Eggemann
  2024-07-24 15:09       ` Chen Yu
  0 siblings, 1 reply; 5+ messages in thread
From: Dietmar Eggemann @ 2024-07-24 11:34 UTC (permalink / raw)
  To: Chen Yu, Vincent Guittot
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Ingo Molnar,
	Lukasz Luba, Qais Yousef

On 12/07/2024 18:41, Chen Yu wrote:
> On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote:
>> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
>>>
>>> Hello,
>>>
>>> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
>>>
>>> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> First, I'm surprised that an Intel platform is impacted by this patch
>> because Intel doesn't use it AFAIK.
>> Then, this patch mainly remove a right shift i.e.:
>> instead of:
>>   return rq_clock_task(rq) >> sched_hw_decay_shift
>> we are now doing:
>>   return rq_clock_task(rq)
>>
>> Could it be a false positive ?
> 
> Before trying to reproduce it locally, one question is that, should we use
> rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is
> actually calculated by rq_clock_pelt(rq)? 
> 
> thanks,
> Chenyu
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d34f6d5b11b5..17ec0c51b29d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>  
>  	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
>  		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> -		  update_hw_load_avg(now, rq, hw_pressure) |
> +		  update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
>  		  update_irq_load_avg(rq, 0);
>  
>  	if (others_have_blocked(rq))

Yes, update_hw_load_avg() should be driven entirely by
rq_clock_task(rq). But IMHO this PELT signal is only used on some arm64
platforms. So you won't detect any misbehavior running your tests on Intel.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds
  2024-07-24 11:34     ` Dietmar Eggemann
@ 2024-07-24 15:09       ` Chen Yu
  0 siblings, 0 replies; 5+ messages in thread
From: Chen Yu @ 2024-07-24 15:09 UTC (permalink / raw)
  To: Dietmar Eggemann, Vincent Guittot
  Cc: Vincent Guittot, kernel test robot, oe-lkp, lkp, linux-kernel,
	Ingo Molnar, Lukasz Luba, Qais Yousef, Yujie Liu

Hi Dietmar,

On 2024-07-24 at 13:34:31 +0200, Dietmar Eggemann wrote:
> On 12/07/2024 18:41, Chen Yu wrote:
> > On 2024-07-09 at 12:03:42 +0200, Vincent Guittot wrote:
> >> On Tue, 9 Jul 2024 at 09:22, kernel test robot <oliver.sang@intel.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> kernel test robot noticed "INFO:task_blocked_for_more_than#seconds" on:
> >>>
> >>> commit: 97450eb909658573dcacc1063b06d3d08642c0c1 ("sched/pelt: Remove shift of thermal clock")
> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>
> >> First, I'm surprised that an Intel platform is impacted by this patch
> >> because Intel doesn't use it AFAIK.
> >> Then, this patch mainly remove a right shift i.e.:
> >> instead of:
> >>   return rq_clock_task(rq) >> sched_hw_decay_shift
> >> we are now doing:
> >>   return rq_clock_task(rq)
> >>
> >> Could it be a false positive ?
> > 
> > Before trying to reproduce it locally, one question is that, should we use
> > rq_clock_task(rq) in __update_blocked_others() rather than 'now', which is
> > actually calculated by rq_clock_pelt(rq)? 
> > 
> > thanks,
> > Chenyu
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d34f6d5b11b5..17ec0c51b29d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9432,7 +9432,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> >  
> >  	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> >  		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > -		  update_hw_load_avg(now, rq, hw_pressure) |
> > +		  update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
> >  		  update_irq_load_avg(rq, 0);
> >  
> >  	if (others_have_blocked(rq))
> 
> Yes, update_hw_load_avg() should be driven entirely by
> rq_clock_task(rq). But IMHO this PELT signal is only used on some arm64
> platforms. So you won't detect any misbehavior running your tests on Intel.

Yes, on Intel platform the arch_scale_hw_pressure() should return 0. Unfortunately
I could not reproduce the issue locally.

Since the commit replace rq_clock_task() with rq_clock_pelt() in __update_blocked_others(),
while keeps it rq_clock_task() in sched_tick(), I wonder if the following
code would cause inconsistence in ___update_load_sum():

u64 delta = now - sa->last_update_time

'now' could be calculated by rq_clock_pelt(), and last_update_time was
calculated by rq_clock_task(), and usually the former chases after the
latter, it cause a very large 'delta' and bring unexpected behavior.


thanks,
Chenyu

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-07-24 15:10 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-09  7:22 [linus:master] [sched/pelt] 97450eb909: INFO:task_blocked_for_more_than#seconds kernel test robot
2024-07-09 10:03 ` Vincent Guittot
2024-07-12 16:41   ` Chen Yu
2024-07-24 11:34     ` Dietmar Eggemann
2024-07-24 15:09       ` Chen Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox