sched: update_entity_lag does not handle corner case with task in PI chain

* sched: update_entity_lag does not handle corner case with task in PI chain
@ 2025-10-18 11:34 Luis Claudio R. Goncalves
  2025-10-18 19:57 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Luis Claudio R. Goncalves @ 2025-10-18 11:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Phil Auld,
	Valentin Schneider, Steven Rostedt, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Shizhao Chen,
	linux-kernel, Omar Sandoval, Xuewen Yan

Hello!

The underlying question here is what is the expected behavior of
update_entity_lag() in the context explained below...


--[ Short Description:

While running sched_group_migration test from CKI repository[1], which
migrates tasks between cpusets, Shizhao Chen reports hitting the warning
in update_entity_lag():

    WARN_ON_ONCE(!se->on_rq);

In short, update_entity_lag() is acting on a task that is waiting on a lock,
sleeping, with both on_rq and se->on_rq equal to zero.

When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
rt_mutex is involved in the process, rt_mutex_setprio() is called and may
eventually walk down a Priority Inheritance chain, adjusting the priorities
of the waiters in the chain. In such cases update_entity_lag() may be called.

What is the expected behavior for this case, to bail out of update_entity_lag()
or avoid calling the function entirely?


--[ Additional Notes:

Reproducing the Problem:

  - Install sched_group_migration[1] and run it in a loop.
    (while : ;  do runtest.sh; done)
  - In my experience, running the test with 4 CPUs reproduces the problem
    within 15 minutes. Booting with "nr_cpus=4 maxcpus=4" does the trick.


The scenario below is a simplification of the cases I observed while
investigating the problem:

    CPUn					CPUx

    task01 has rcu-state lock
    contends on another lock		
    (goes to sleep)
    --> on_rq=0 se.on_rq=0
					rcub/x contends on rcu-state lock
					  rcu_boost_kthread()
					    rt_set_prio()
					      update_entity_lag(task01->se)
					        WARNING()


It could be that task01 and the task holding the lock wanted by task01 are
being migrated from one cpuset to another at that point. In any case, that
is not an error, so the problem seems to be update_entity_lag() being called
on a task that violates a basic requirement (se->on_rq must be nonzero).


The resulting backtrace is:

[ 1805.450470] ------------[ cut here ]------------
[ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
[ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
[ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
[ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
[ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
[ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
[ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
[ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
[ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
[ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
[ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
[ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
[ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
[ 1805.626985] PKRU: 55555554
[ 1805.629696] Call Trace:
[ 1805.632150]  <TASK>
[ 1805.634258]  dequeue_entity+0x90/0x4f0
[ 1805.638012]  dequeue_entities+0xc9/0x6b0
[ 1805.641935]  dequeue_task_fair+0x8a/0x190
[ 1805.645949]  ? sched_clock+0x10/0x30
[ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
[ 1805.653541]  rt_mutex_adjust_prio_chain+0x71c/0xa40
[ 1805.658421]  task_blocks_on_rt_mutex.constprop.0+0x20c/0x4a0
[ 1805.664081]  __rt_mutex_slowlock.constprop.0+0x53/0x1d0
[ 1805.669305]  __rt_mutex_slowlock_locked.constprop.0+0x48/0x70
[ 1805.675051]  rt_mutex_slowlock.constprop.0+0x4d/0xd0
[ 1805.680016]  rcu_boost_kthread+0xd5/0x2d0
[ 1805.684030]  ? __pfx_rcu_boost_kthread+0x10/0x10
[ 1805.688646]  kthread+0x108/0x250
[ 1805.691880]  ? migrate_enable+0xd1/0xf0
[ 1805.695719]  ? __pfx_kthread+0x10/0x10
[ 1805.699473]  ret_from_fork+0x116/0x130
[ 1805.703226]  ? __pfx_kthread+0x10/0x10
[ 1805.706978]  ret_from_fork_asm+0x1a/0x30
[ 1805.710908]  </TASK>


Please let me know if what I reported above is enough to understand the problem
and design/suggest a solution. I tried to organize the scattered information
bits as well as possible.

Best regards,
Luis

[1] https://gitlab.com/cki-project/kernel-tests/-/archive/main/kernel-tests-main.zip#general/scheduler/sched_group_migration/



* Re: sched: update_entity_lag does not handle corner case with task in PI chain
  2025-10-18 11:34 sched: update_entity_lag does not handle corner case with task in PI chain Luis Claudio R. Goncalves
@ 2025-10-18 19:57 ` Peter Zijlstra
  2025-10-20 11:00   ` Luis Claudio R. Goncalves
  2025-10-21  7:08   ` K Prateek Nayak
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2025-10-18 19:57 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: Ingo Molnar, Juri Lelli, Phil Auld, Valentin Schneider,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Shizhao Chen, linux-kernel, Omar Sandoval, Xuewen Yan

On Sat, Oct 18, 2025 at 08:34:52AM -0300, Luis Claudio R. Goncalves wrote:
> Hello!
> 

> While running sched_group_migration test from CKI repository[1], which

What's a CKI ?

> migrates tasks between cpusets, Shizhao Chen reports hitting the warning
> in update_entity_lag():
> 
>     WARN_ON_ONCE(!se->on_rq);
> 
> In short, update_entity_lag() is acting on a task that is waiting on a lock,
> sleeping, with both on_rq and se->on_rq equal to zero.

You can't get to where you are with p->on_rq being zero.

> When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
> rt_mutex is involved in the process, rt_mutex_setprio() is called and may
> eventually walk down a Priority Inheritance chain, adjusting the priorities
> of the waiters in the chain. In such cases update_entity_lag() may be called.
> 
> What is the expected behavior for this case, to bail out of update_entity_lag()
> or avoid calling the function entirely?
> 
> 
> --[ Additional Notes:
> 
> Reproducing the Problem:
> 
>   - Install sched_group_migration[1] and run it in a loop.
>     (while : ;  do runtest.sh; done)
>   - In my experience, running the test with 4 CPUs reproduces the problem
>     within 15 minutes. Booting with "nr_cpus=4 maxcpus=4" does the trick.
> 
> 
> The scenario below is a simplification of the cases I observed while
> investigating the problem:
> 
>     CPUn					CPUx
> 
>     task01 has rcu-state lock
>     contends on another lock		
>     (goes to sleep)
>     --> on_rq=0 se.on_rq=0
> 					rcub/x contends on rcu-state lock
> 					  rcu_boost_kthread()
> 					    rt_set_prio()
> 					      update_entity_lag(task01->se)
> 					        WARNING()

There is a whole lot wrong with this, firstly there is no rt_set_prio()
function, and update_entity_lag() isn't directly called by it.
Additionally, you should never get to update_entity_lag() if !p->on_rq,
see below:

> [ 1805.450470] ------------[ cut here ]------------
> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
> [ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
> [ 1805.626985] PKRU: 55555554
> [ 1805.629696] Call Trace:
> [ 1805.632150]  <TASK>
> [ 1805.634258]  dequeue_entity+0x90/0x4f0
> [ 1805.638012]  dequeue_entities+0xc9/0x6b0
> [ 1805.641935]  dequeue_task_fair+0x8a/0x190
> [ 1805.645949]  ? sched_clock+0x10/0x30
> [ 1805.649527]  rt_mutex_setprio+0x318/0x4b0

So we have:

rt_mutex_setprio()

  rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held

  ...

  queued = task_on_rq_queued(p); // basically reads p->on_rq
  running = task_current_donor()
  if (queued)
    dequeue_task(rq, p, queue_flags);
      dequeue_task_fair()
        dequeue_entities()
	  dequeue_entity()
	    update_entity_lag()
	      WARN_ON_ONCE(!se->on_rq);

So the only way to get here is if: p->on_rq is in fact !0 *and*
se->on_rq is zero.

And I'm not at all sure how one would get into such a state.
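
The gating described above can be sketched in userspace C. This is only an
illustration under stated assumptions: struct sketch_task,
sketch_setprio_dequeue() and the SKETCH_* names are hypothetical stand-ins,
not kernel code, and the model ignores locking, scheduling classes and the
migrating state.

```c
#include <stdbool.h>

#define SKETCH_ON_RQ_QUEUED 1 /* plays the role of TASK_ON_RQ_QUEUED */

/* Hypothetical stand-in for the two fields discussed above. */
struct sketch_task {
	int on_rq;      /* models p->on_rq */
	bool se_on_rq;  /* models p->se.on_rq */
};

enum sketch_result {
	SKETCH_SKIPPED,   /* not queued: the dequeue path is never entered */
	SKETCH_DEQUEUED,  /* queued and entity on its runqueue: normal dequeue */
	SKETCH_WARNED,    /* queued but entity already off: the warning fires */
};

/* Models the "if (queued) dequeue_task(...)" gate in the outline above. */
static enum sketch_result sketch_setprio_dequeue(struct sketch_task *p)
{
	bool warn;

	if (p->on_rq != SKETCH_ON_RQ_QUEUED)
		return SKETCH_SKIPPED;

	warn = !p->se_on_rq; /* the WARN_ON_ONCE(!se->on_rq) condition */
	p->se_on_rq = false;
	return warn ? SKETCH_WARNED : SKETCH_DEQUEUED;
}
```

In this model a task with on_rq == 0 never reaches the warning; only the
combination on_rq == 1 with se_on_rq == false does.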




* Re: sched: update_entity_lag does not handle corner case with task in PI chain
  2025-10-18 19:57 ` Peter Zijlstra
@ 2025-10-20 11:00   ` Luis Claudio R. Goncalves
  2025-10-21  7:08   ` K Prateek Nayak
  1 sibling, 0 replies; 6+ messages in thread
From: Luis Claudio R. Goncalves @ 2025-10-20 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Phil Auld, Valentin Schneider,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Shizhao Chen, linux-kernel, Omar Sandoval, Xuewen Yan

On Sat, Oct 18, 2025 at 09:57:30PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 18, 2025 at 08:34:52AM -0300, Luis Claudio R. Goncalves wrote:
> > Hello!
> > 
> 
> > While running sched_group_migration test from CKI repository[1], which
> 
> What's a CKI ?

Continuous Kernel Integration (https://cki-project.org/). That's not
relevant to the problem, I just wanted to mention the source of the test in
case there were other tests with the same name or other versions in different
repositories.

> > migrates tasks between cpusets, Shizhao Chen reports hitting the warning
> > in update_entity_lag():
> > 
> >     WARN_ON_ONCE(!se->on_rq);
> > 
> > In short, update_entity_lag() is acting on a task that is waiting on a lock,
> > sleeping, with both on_rq and se->on_rq equal to zero.
> 
> You can't get to where you are with p->on_rq being zero.

I can check the other vmcores I have, but I am certain that I saw this in
at least two of the vmcores I analyzed:


   crash> task -R pi_blocked_on,prio,rt_priority,on_rq,se.on_rq 0xffff8955813c2180 0xffff895926cb4300
   PID: 19       TASK: ffff8955813c2180  CPU: 2    COMMAND: "rcub/0"
     pi_blocked_on = 0xffffcc9e802f7de0,       <------- held by the thread below
     prio = 98,
     rt_priority = 1,
     on_rq = 1,
     se.on_rq = 0 '\000',

   PID: 445515   TASK: ffff895926cb4300  CPU: 0    COMMAND: "bz1738415-test"
     pi_blocked_on = 0xffffcc9ea19f7b70,       <------- waiting on a lock
     prio = 98,
     rt_priority = 0,
     on_rq = 0,                                <------- 
     se.on_rq = 0 '\000',                      <-------


In the vmcores I collected, the thread blocking rcub/X (holding the lock) is
itself blocked on another test thread, waiting on a different lock. In some of
the vmcores this second lock has no owner, which suggests that the lock had
just been released. The timing was perfect to hit this apparent corner case.

> > When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
> > rt_mutex is involved in the process, rt_mutex_setprio() is called and may
> > eventually walk down a Priority Inheritance chain, adjusting the priorities
> > of the waiters in the chain. In such cases update_entity_lag() may be called.
> > 
> > What is the expected behavior for this case, to bail out of update_entity_lag()
> > or avoid calling the function entirely?
> > 
> > 
> > --[ Additional Notes:
> > 
> > Reproducing the Problem:
> > 
> >   - Install sched_group_migration[1] and run it in a loop.
> >     (while : ;  do runtest.sh; done)
> >   - In my experience, running the test with 4 CPUs reproduces the problem
> >     within 15 minutes. Booting with "nr_cpus=4 maxcpus=4" does the trick.
> > 
> > 
> > The scenario below is a simplification of the cases I observed while
> > investigating the problem:
> > 
> >     CPUn					CPUx
> > 
> >     task01 has rcu-state lock
> >     contends on another lock		
> >     (goes to sleep)
> >     --> on_rq=0 se.on_rq=0
> > 					rcub/x contends on rcu-state lock
> > 					  rcu_boost_kthread()
> > 					    rt_set_prio()
> > 					      update_entity_lag(task01->se)
> > 					        WARNING()
> 
> There is a whole lot wrong with this, firstly there is no rt_set_prio()
> function, and update_entity_lag() isn't directly called by it.
> Additionally, you should never get to update_entity_lag() if !p->on_rq,
> see below:

Sorry, I meant rt_mutex_setprio(). My mistake. And yes, I oversimplified
the note. The idea is that rt_mutex_setprio() could eventually result in
a call to update_entity_lag() down the chain.

> > [ 1805.450470] ------------[ cut here ]------------
> > [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
> > [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
> > [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
> > [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
> > [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
> > [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
> > [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
> > [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
> > [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
> > [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
> > [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
> > [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
> > [ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
> > [ 1805.626985] PKRU: 55555554
> > [ 1805.629696] Call Trace:
> > [ 1805.632150]  <TASK>
> > [ 1805.634258]  dequeue_entity+0x90/0x4f0
> > [ 1805.638012]  dequeue_entities+0xc9/0x6b0
> > [ 1805.641935]  dequeue_task_fair+0x8a/0x190
> > [ 1805.645949]  ? sched_clock+0x10/0x30
> > [ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
> 
> So we have:
> 
> rt_mutex_setprio()
> 
>   rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held
> 
>   ...
> 
> >   queued = task_on_rq_queued(p); // basically reads p->on_rq
> >   running = task_current_donor()
> >   if (queued)
> >     dequeue_task(rq, p, queue_flags);
> >       dequeue_task_fair()
> >         dequeue_entities()
> > 	  dequeue_entity()
> > 	    update_entity_lag()
> > 	      WARN_ON_ONCE(!se->on_rq);
> > 
> > So the only way to get here is if: p->on_rq is in fact !0 *and*
> > se->on_rq is zero.

Assuming the vmcores I collected are not damaged and that the simple crash
command I used earlier to display the threads' on_rq and se.on_rq fields is
correct, is there a chance that the sequence above could be disturbed by
the thread being concurrently moved from one cpuset to another?

> And I'm not at all sure how one would get into such a state.

Sorry again for the convoluted report, but I am also trying to make sense
of the results observed. Is there anything I could do in terms of tests or
information that could help shed some light here?

Best,
Luis



* Re: sched: update_entity_lag does not handle corner case with task in PI chain
  2025-10-18 19:57 ` Peter Zijlstra
  2025-10-20 11:00   ` Luis Claudio R. Goncalves
@ 2025-10-21  7:08   ` K Prateek Nayak
  2025-10-22  0:35     ` Luis Claudio R. Goncalves
  1 sibling, 1 reply; 6+ messages in thread
From: K Prateek Nayak @ 2025-10-21  7:08 UTC (permalink / raw)
  To: Peter Zijlstra, Luis Claudio R. Goncalves
  Cc: Ingo Molnar, Juri Lelli, Phil Auld, Valentin Schneider,
	Steven Rostedt, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Shizhao Chen, linux-kernel, Omar Sandoval, Xuewen Yan

Hello Peter, Luis,

On 10/19/2025 1:27 AM, Peter Zijlstra wrote:
>> [ 1805.450470] ------------[ cut here ]------------
>> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
>> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
>> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
>> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
>> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
>> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
>> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
>> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
>> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
>> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
>> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
>> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
>> [ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
>> [ 1805.626985] PKRU: 55555554
>> [ 1805.629696] Call Trace:
>> [ 1805.632150]  <TASK>
>> [ 1805.634258]  dequeue_entity+0x90/0x4f0
>> [ 1805.638012]  dequeue_entities+0xc9/0x6b0
>> [ 1805.641935]  dequeue_task_fair+0x8a/0x190
>> [ 1805.645949]  ? sched_clock+0x10/0x30
>> [ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
> 
> So we have:
> 
> rt_mutex_setprio()
> 
>   rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held
> 
>   ...
> 
>   queued = task_on_rq_queued(p); // basically reads p->on_rq
>   running = task_current_donor()
>   if (queued)
>     dequeue_task(rq, p, queue_flags);
>       dequeue_task_fair()
>         dequeue_entities()
> 	  dequeue_entity()
> 	    update_entity_lag()
> 	      WARN_ON_ONCE(!se->on_rq);
> 
> So the only way to get here is if: p->on_rq is in fact !0 *and*
> se->on_rq is zero.
> 
> And I'm not at all sure how one would get into such a state.

This looks like something that can happen when a delayed task is
dequeued from a throttled hierarchy. Matt had reported similar
problem with wait_task_inactive() in
https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/

rt_mutex_setprio()
  ...
  if (prev_class != next_class && p->se.sched_delayed)
    dequeue_task(rq, p, DEQUEUE_DELAYED)
      dequeue_entities(se = &p->se)
        dequeue_entity(se)
          se->on_rq = 0; /* se->on_rq turns 0 here */
        ...
        if (cfs_rq_throttled(cfs_rq))
          return 0; /* Early return before __block_task() */
  ...

  /* __block_task() not called; task_on_rq_queued() returns true. */
  queued = task_on_rq_queued(p);
  ...

  if (queued)
    dequeue_task(rq, p, queue_flag)
      dequeue_entities(se = &p->se)
        dequeue_entity(se)
          update_entity_lag(se)
            WARN_ON_ONCE(!se->on_rq)


v6.18 kernels get rid of this issue as part of the per-task throttle
feature, and stable should pick up the fix from that thread soon.
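
For readers who want to step through the trace above, here is a minimal
userspace model of it. It is a sketch under stated assumptions, not kernel
code: struct model_se, struct model_task and the model_*() helpers are
hypothetical stand-ins, the scheduling-class comparison is left out, and the
throttled early return is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins: field names echo the kernel, nothing else does. */
struct model_se {
	bool on_rq;         /* entity is on its cfs_rq */
	bool sched_delayed; /* a delayed dequeue is still pending */
};

struct model_task {
	int on_rq;          /* 1 plays the role of TASK_ON_RQ_QUEUED */
	struct model_se se;
};

/* Counts how often the modeled WARN_ON_ONCE(!se->on_rq) would fire. */
static int model_warnings;

/* Stand-in for dequeue_entity(): warn if the entity is already off. */
static void model_dequeue_entity(struct model_se *se)
{
	if (!se->on_rq)
		model_warnings++;
	se->on_rq = false;
}

/*
 * Stand-in for the DEQUEUE_DELAYED path. On a throttled hierarchy it
 * returns before the step that would clear p->on_rq (__block_task()
 * in the trace above).
 */
static void model_dequeue_delayed(struct model_task *p, bool throttled)
{
	assert(p->se.sched_delayed);
	model_dequeue_entity(&p->se);
	p->se.sched_delayed = false;
	if (throttled)
		return;     /* p->on_rq is left at 1 */
	p->on_rq = 0;
}

/*
 * Stand-in for the two dequeue steps in rt_mutex_setprio().
 * Returns the number of warnings raised by this call.
 */
static int model_setprio(struct model_task *p, bool throttled)
{
	int before = model_warnings;

	if (p->se.sched_delayed)
		model_dequeue_delayed(p, throttled);
	if (p->on_rq == 1)  /* task_on_rq_queued(p) */
		model_dequeue_entity(&p->se);
	return model_warnings - before;
}
```

In this model, calling model_setprio() on a delayed task with throttled set
leaves on_rq at 1 with se.on_rq at 0 and raises one warning on the second
dequeue; with throttled clear the task is blocked and no warning is raised.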

-- 
Thanks and Regards,
Prateek



* Re: sched: update_entity_lag does not handle corner case with task in PI chain
  2025-10-21  7:08   ` K Prateek Nayak
@ 2025-10-22  0:35     ` Luis Claudio R. Goncalves
  2025-10-24  4:00       ` K Prateek Nayak
  0 siblings, 1 reply; 6+ messages in thread
From: Luis Claudio R. Goncalves @ 2025-10-22  0:35 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Phil Auld,
	Valentin Schneider, Steven Rostedt, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Shizhao Chen,
	linux-kernel, Omar Sandoval, Xuewen Yan

On Tue, Oct 21, 2025 at 12:38:17PM +0530, K Prateek Nayak wrote:
> Hello Peter, Luis,
> 
> On 10/19/2025 1:27 AM, Peter Zijlstra wrote:
> >> [ 1805.450470] ------------[ cut here ]------------
> >> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
> >> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
> >> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
> >> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
> >> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
> >> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
> >> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
> >> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
> >> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
> >> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
> >> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
> >> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >> [ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
> >> [ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
> >> [ 1805.626985] PKRU: 55555554
> >> [ 1805.629696] Call Trace:
> >> [ 1805.632150]  <TASK>
> >> [ 1805.634258]  dequeue_entity+0x90/0x4f0
> >> [ 1805.638012]  dequeue_entities+0xc9/0x6b0
> >> [ 1805.641935]  dequeue_task_fair+0x8a/0x190
> >> [ 1805.645949]  ? sched_clock+0x10/0x30
> >> [ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
> > 
> > So we have:
> > 
> > rt_mutex_setprio()
> > 
> >   rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held
> > 
> >   ...
> > 
> >   queued = task_on_rq_queued(p); // basically reads p->on_rq
> >   running = task_current_donor()
> >   if (queued)
> >     dequeue_task(rq, p, queue_flags);
> >       dequeue_task_fair()
> >         dequeue_entities()
> > 	  dequeue_entity()
> > 	    update_entity_lag()
> > 	      WARN_ON_ONCE(!se->on_rq);
> > 
> > So the only way to get here is if: p->on_rq is in fact !0 *and*
> > se->on_rq is zero.
> > 
> > And I'm not at all sure how one would get into such a state.
> 
> This looks like something that can happen when a delayed task is
> dequeued from a throttled hierarchy. Matt had reported similar
> problem with wait_task_inactive() in
> https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/
> 
> rt_mutex_setprio()
>   ...
>   if (prev_class != next_class && p->se.sched_delayed)
>     dequeue_task(rq, p, DEQUEUE_DELAYED)
>       dequeue_entities(se = &p->se)
>         dequeue_entity(se)
>           se->on_rq = 0; /* se->on_rq turns 0 here */
>         ...
>         if (cfs_rq_throttled(cfs_rq))
>           return 0; /* Early return before __block_task() */
>   ...
> 
>   /* __block_task() not called; task_on_rq_queued() returns true. */
>   queued = task_on_rq_queued(p);
>   ...
> 
>   if (queued)
>     dequeue_task(rq, p, queue_flag)
>       dequeue_entities(se = &p->se)
>         dequeue_entity(se)
>           update_entity_lag(se)
>             WARN_ON_ONCE(!se->on_rq)
> 
> 
> v6.18 kernels get rid of this issue as part of the per-task throttle
> feature, and stable should pick up the fix from that thread soon.

Thank you! You were right, your patch in that thread seems to have fixed
the issue I reported.

I read the thread you mentioned, built a test kernel with the patch and have
been running tests for more than 6h now without a single backtrace. As reported
earlier, I was able to hit the bug within 15 minutes without the patch.

Best regards,
Luis

> 
> -- 
> Thanks and Regards,
> Prateek
> 
---end quoted text---



* Re: sched: update_entity_lag does not handle corner case with task in PI chain
  2025-10-22  0:35     ` Luis Claudio R. Goncalves
@ 2025-10-24  4:00       ` K Prateek Nayak
  0 siblings, 0 replies; 6+ messages in thread
From: K Prateek Nayak @ 2025-10-24  4:00 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Phil Auld,
	Valentin Schneider, Steven Rostedt, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Shizhao Chen,
	linux-kernel, Omar Sandoval, Xuewen Yan

Hello Luis,

On 10/22/2025 6:05 AM, Luis Claudio R. Goncalves wrote:
>> v6.18 kernels get rid of this issue as part of the per-task throttle
>> feature, and stable should pick up the fix from that thread soon.
> 
> Thank you! You were right, your patch in that thread seems to have fixed
> the issue I reported.
> 
> I read the thread you mentioned, built a test kernel with the patch and have
> been running tests for more than 6h now without a single backtrace. As reported
> earlier, I was able to hit the bug within 15 minutes without the patch.

Thank you for confirming! Greg has picked the patches for both
v6.17 stable and v6.12 stable so the upcoming stable releases and the
RT releases based on them should solve this issue. Thank you again
for testing the fix.

-- 
Thanks and Regards,
Prateek

