* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload [not found] <CAB8ipk9N5_pO1Awp6PLnWt6hf1Bu_XtY3qGMJKqz=Uf6eZQejw@mail.gmail.com> @ 2026-04-01 4:25 ` John Stultz 2026-04-01 6:04 ` Xuewen Yan 2026-04-08 12:19 ` David Laight 2026-04-09 21:39 ` John Stultz 2 siblings, 1 reply; 19+ messages in thread From: John Stultz @ 2026-04-01 4:25 UTC (permalink / raw) To: Xuewen Yan Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > Dear Linux maintainers and reviewers, > > I am writing to report a severe scheduling latency issue we recently > discovered on Linux Kernel 6.12. > > Issue Description > > We observed that when running a specific background workload pattern, > certain tasks experience excessive scheduling latency. The delay from > the runnable state to running on the CPU exceeds 10 seconds, and in > extreme cases, it reaches up to 100 seconds. > > Environment Details > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > Architecture: ARM64 > Hardware: T7300 > Config: gki_defconfig > > RT-app's workload pattern: > > { > "tasks" : { > "t0" : { > "instance" : 40, > "priority" : 0, > "cpus" : [ 0, 1, 2, 3 ], > "taskgroup" : "/background", > "loop" : -1, > "run" : 200, > "sleep" : 50 > } > } > } > > And we have applied the following patches: > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > Could you please advise if there are known changes in EEVDF in > 6.12 that might affect this specific workload pattern?
> Could you maybe instead point to some source for the runqslower binary you attached? I don't think folks will run random binaries. Also, it looks like the RT-app description uses the background cgroup, can you share the cgroup configuration you have set for that? Also, did you try to reproduce this against vanilla 6.12-stable ? I'm not sure the audience here is going to pay much attention to GKI based reports. Were you using any vendorhooks? thanks -john ^ permalink raw reply [flat|nested] 19+ messages in thread
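As a side note on the scale of the rt-app pattern quoted above: assuming rt-app's default microsecond units for "run" and "sleep", 40 tasks each running 200us and sleeping 50us amount to far more runnable work than the four allowed CPUs can serve. A back-of-envelope sketch (illustrative helper, not part of the thread's tooling):

```python
# Estimate the aggregate CPU demand of the quoted rt-app pattern.
# Assumes rt-app's default microsecond units for "run"/"sleep".
def rtapp_demand(instances, run_us, sleep_us):
    """Steady-state CPUs' worth of runnable work for one rt-app task spec."""
    duty_cycle = run_us / (run_us + sleep_us)
    return instances * duty_cycle

demand = rtapp_demand(instances=40, run_us=200, sleep_us=50)
print(demand)      # 40 * 0.8 = 32.0 CPUs' worth of work
print(demand / 4)  # ~8x oversubscription of the 4 allowed CPUs
```

So CPUs 0-3 are roughly 8x oversubscribed, which is the regime where pathological wake-up latencies are easiest to hit.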
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 4:25 ` [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload John Stultz @ 2026-04-01 6:04 ` Xuewen Yan 2026-04-01 10:05 ` Vincent Guittot 0 siblings, 1 reply; 19+ messages in thread From: Xuewen Yan @ 2026-04-01 6:04 UTC (permalink / raw) To: John Stultz Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > Dear Linux maintainers and reviewers, > > > > I am writing to report a severe scheduling latency issue we recently > > discovered on Linux Kernel 6.12. > > > > Issue Description > > > > We observed that when running a specific background workload pattern, > > certain tasks experience excessive scheduling latency. The delay from > > the runnable state to running on the CPU exceeds 10 seconds, and in > > extreme cases, it reaches up to 100 seconds. 
> > > > Environment Details > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > > Architecture: [ ARM64] > > Hardware: T7300 > > Config: gki_defconfig > > > > RT-app‘s workload Pattern: > > > > { > > "tasks" : { > > "t0" : { > > "instance" : 40, > > "priority" : 0, > > "cpus" : [ 0, 1, 2, 3 ], > > "taskgroup" : "/background", > > "loop" : -1, > > "run" : 200, > > "sleep" : 50 > > } > > } > > } > > > > And we have applied the following patchs: > > > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > > > > Could you please advise if there are known changes in the eevdf in > > 6.12 that might affect this specific workload pattern? > > > Thanks for the quick response! > Could you maybe instead point to some source for the runqslower binary > you attached? I don't think folks will run random binaries. We use the code in kernel "tools/bpf/runqslower". > > Also, it looks like the RT-app description uses the background cgroup, > can you share the cgroup configuration you have set for that? Our "background" cgroup does not have any special configurations applied. cpu.shares: Set to 1024, which is consistent with other cgroups on the system. Bandwidth Control: It is disabled (no cpu.cfs_quota_us limits set). > > Also, did you try to reproduce this against vanilla 6.12-stable ? I'm > not sure the audience here is going to pay much attention to GKI based > reports. Were you using any vendorhooks? We have verified this on a GKI kernel with all vendor hooks removed. The issue still reproduces in this environment. This suggests the problem is not directly caused by our vendor-specific modifications. We conducted an experiment by disabling the DELAY_DEQUEUE feature. After turning it off, we observed a significant increase in threads with extremely long runnable times. 
Even kworkers started showing these timeouts. Thanks! --- xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 6:04 ` Xuewen Yan @ 2026-04-01 10:05 ` Vincent Guittot 2026-04-01 10:48 ` Xuewen Yan 0 siblings, 1 reply; 19+ messages in thread From: Vincent Guittot @ 2026-04-01 10:05 UTC (permalink / raw) To: Xuewen Yan Cc: John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: > > > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > Dear Linux maintainers and reviewers, > > > > > > I am writing to report a severe scheduling latency issue we recently > > > discovered on Linux Kernel 6.12. > > > > > > Issue Description > > > > > > We observed that when running a specific background workload pattern, > > > certain tasks experience excessive scheduling latency. The delay from > > > the runnable state to running on the CPU exceeds 10 seconds, and in > > > extreme cases, it reaches up to 100 seconds. 
> > > > > > Environment Details > > > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > > > Architecture: [ ARM64] > > > Hardware: T7300 > > > Config: gki_defconfig > > > > > > RT-app‘s workload Pattern: > > > > > > { > > > "tasks" : { > > > "t0" : { > > > "instance" : 40, > > > "priority" : 0, > > > "cpus" : [ 0, 1, 2, 3 ], > > > "taskgroup" : "/background", > > > "loop" : -1, > > > "run" : 200, > > > "sleep" : 50 > > > } > > > } > > > } > > > > > > And we have applied the following patchs: > > > > > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > > > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > > > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > > > > > > > Could you please advise if there are known changes in the eevdf in > > > 6.12 that might affect this specific workload pattern? > > > > > > Thanks for the quick response! > > > Could you maybe instead point to some source for the runqslower binary > > you attached? I don't think folks will run random binaries. > > We use the code in kernel "tools/bpf/runqslower". > > > > > Also, it looks like the RT-app description uses the background cgroup, > > can you share the cgroup configuration you have set for that? > > Our "background" cgroup does not have any special configurations applied. > > cpu.shares: Set to 1024, which is consistent with other cgroups on the system. > Bandwidth Control: It is disabled (no cpu.cfs_quota_us limits set). > > > > > Also, did you try to reproduce this against vanilla 6.12-stable ? I'm > > not sure the audience here is going to pay much attention to GKI based > > reports. Were you using any vendorhooks? > > We have verified this on a GKI kernel with all vendor hooks removed. > The issue still reproduces in this environment. This suggests the > problem is not directly caused by our vendor-specific modifications. 
Did you try on the latest android mainline kernel which is based on v6.19 ? This would help determine if the issue only happens on v6.12 or on more recent kernels too I ran your rt-app json file on the latest tip/sched/core but I don't see any scheduling issue > > We conducted an experiment by disabling the DELAY_DEQUEUE feature. > After turning it off, we observed a significant increase in threads > with extremely long runnable times. Even kworkers started exhibiting > timeout phenomena. Just to make sure, the problem happens even if you don't disable DELAY_DEQUEUE ? > > Thanks! > > --- > xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 10:05 ` Vincent Guittot @ 2026-04-01 10:48 ` Xuewen Yan 2026-04-01 13:00 ` Dietmar Eggemann 2026-04-01 14:01 ` Vincent Guittot 0 siblings, 2 replies; 19+ messages in thread From: Xuewen Yan @ 2026-04-01 10:48 UTC (permalink / raw) To: Vincent Guittot Cc: John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: > > > > > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > Dear Linux maintainers and reviewers, > > > > > > > > I am writing to report a severe scheduling latency issue we recently > > > > discovered on Linux Kernel 6.12. > > > > > > > > Issue Description > > > > > > > > We observed that when running a specific background workload pattern, > > > > certain tasks experience excessive scheduling latency. The delay from > > > > the runnable state to running on the CPU exceeds 10 seconds, and in > > > > extreme cases, it reaches up to 100 seconds. 
> > > > > > > > Environment Details > > > > > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > > > > Architecture: [ ARM64] > > > > Hardware: T7300 > > > > Config: gki_defconfig > > > > > > > > RT-app‘s workload Pattern: > > > > > > > > { > > > > "tasks" : { > > > > "t0" : { > > > > "instance" : 40, > > > > "priority" : 0, > > > > "cpus" : [ 0, 1, 2, 3 ], > > > > "taskgroup" : "/background", > > > > "loop" : -1, > > > > "run" : 200, > > > > "sleep" : 50 > > > > } > > > > } > > > > } > > > > > > > > And we have applied the following patchs: > > > > > > > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > > > > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > > > > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > > > > > > > > > > Could you please advise if there are known changes in the eevdf in > > > > 6.12 that might affect this specific workload pattern? > > > > > > > > > Thanks for the quick response! > > > > > Could you maybe instead point to some source for the runqslower binary > > > you attached? I don't think folks will run random binaries. > > > > We use the code in kernel "tools/bpf/runqslower". > > > > > > > > Also, it looks like the RT-app description uses the background cgroup, > > > can you share the cgroup configuration you have set for that? > > > > Our "background" cgroup does not have any special configurations applied. > > > > cpu.shares: Set to 1024, which is consistent with other cgroups on the system. > > Bandwidth Control: It is disabled (no cpu.cfs_quota_us limits set). > > > > > > > > Also, did you try to reproduce this against vanilla 6.12-stable ? I'm > > > not sure the audience here is going to pay much attention to GKI based > > > reports. Were you using any vendorhooks? > > > > We have verified this on a GKI kernel with all vendor hooks removed. > > The issue still reproduces in this environment. 
This suggests the > > problem is not directly caused by our vendor-specific modifications. > > Did you try on the latest android mainline kernel which is based on > v6.19 ? This would help determine if the issue only happens on v6.12 > or on more recent kernels too We also tested this case on android kernel 6.18. The issue is still reproducible, although the probability of occurrence is significantly lower compared to 6.12. > > I ran your rt-app json file on the latest tip/sched/core but I don't > see any scheduling issue > > > > > We conducted an experiment by disabling the DELAY_DEQUEUE feature. > > After turning it off, we observed a significant increase in threads > > with extremely long runnable times. Even kworkers started exhibiting > > timeout phenomena. > > Just to make sure, the problem happens even if you don't disable DELAY_DEQUEUE ? Yes, we see this problem with both DELAY_DEQUEUE on and off. Additionally, we noticed that the tasks suffering from long scheduling latencies frequently belong to different cgroups (e.g., foreground), rather than the background cgroup where the rt-app load is running. This unexpected cross-group interference is quite puzzling to us... Thanks! --- xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
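For reference on why the cross-group interference is surprising: with cpu.shares set to the same 1024 for every group, hierarchical fair scheduling should split CPU time evenly between runnable groups regardless of how many threads each contains. A rough sketch of the expected steady-state shares (illustrative only; the task counts are hypothetical and load balancing, EEVDF placement, etc. are ignored):

```python
# Expected per-task CPU share under hierarchical fair scheduling when
# all groups have equal weight (cpu.shares=1024). Illustrative sketch.
def per_task_share(ncpus, groups):
    """groups: dict of group name -> number of runnable tasks.
    Each group gets an equal slice of total CPU; tasks split it evenly."""
    slice_per_group = ncpus / len(groups)
    return {name: slice_per_group / ntasks for name, ntasks in groups.items()}

shares = per_task_share(ncpus=4, groups={"background": 40, "foreground": 4})
print(shares["background"])  # 0.05 of a CPU per background task
print(shares["foreground"])  # 0.5 of a CPU per foreground task
```

Under this model the foreground tasks should be well served even while the background group is heavily oversubscribed, so multi-second foreground latency points at something beyond plain weight arithmetic.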
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 10:48 ` Xuewen Yan @ 2026-04-01 13:00 ` Dietmar Eggemann 2026-04-02 5:16 ` Xuewen Yan 2026-04-01 14:01 ` Vincent Guittot 1 sibling, 1 reply; 19+ messages in thread From: Dietmar Eggemann @ 2026-04-01 13:00 UTC (permalink / raw) To: Xuewen Yan, Vincent Guittot Cc: John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On 01.04.26 12:48, Xuewen Yan wrote: > On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> >> On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: >>> >>> On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: >>>> >>>> On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: >>>>> >>>>> Dear Linux maintainers and reviewers, >>>>> >>>>> I am writing to report a severe scheduling latency issue we recently >>>>> discovered on Linux Kernel 6.12. >>>>> >>>>> Issue Description >>>>> >>>>> We observed that when running a specific background workload pattern, >>>>> certain tasks experience excessive scheduling latency. The delay from >>>>> the runnable state to running on the CPU exceeds 10 seconds, and in >>>>> extreme cases, it reaches up to 100 seconds. >>>>> >>>>> Environment Details >>>>> >>>>> Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k >>>>> Architecture: [ ARM64] >>>>> Hardware: T7300 Is this 4 big & 4 little CPUs? 
>>>>> Config: gki_defconfig >>>>> >>>>> RT-app‘s workload Pattern: >>>>> >>>>> { >>>>> "tasks" : { >>>>> "t0" : { >>>>> "instance" : 40, >>>>> "priority" : 0, >>>>> "cpus" : [ 0, 1, 2, 3 ], >>>>> "taskgroup" : "/background", >>>>> "loop" : -1, >>>>> "run" : 200, >>>>> "sleep" : 50 >>>>> } >>>>> } >>>>> } >>>>> >>>>> And we have applied the following patchs: >>>>> >>>>> https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ >>>>> https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ >>>>> https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ Does the issue happen on v6.12.58 plain (android) or only when those 3 additional patches are applied on top? d5843e1530d8 - sched/fair: Forfeit vruntime on yield (2025-12-18 Fernand Sieber) v6.12.63 bddd95054e33 - sched/eevdf: Fix min_vruntime vs avg_vruntime (2026-01-08 Peter Zijlstra) v6.12.64 d2fc2dcfce47 - sched/fair: Fix zero_vruntime tracking (2026-03-25 Peter Zijlstra) v6.12.78 [...] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 13:00 ` Dietmar Eggemann @ 2026-04-02 5:16 ` Xuewen Yan 2026-04-02 14:58 ` Dietmar Eggemann 0 siblings, 1 reply; 19+ messages in thread From: Xuewen Yan @ 2026-04-02 5:16 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Wed, Apr 1, 2026 at 9:00 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 01.04.26 12:48, Xuewen Yan wrote: > > On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > >> > >> On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > >>> > >>> On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: > >>>> > >>>> On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > >>>>> > >>>>> Dear Linux maintainers and reviewers, > >>>>> > >>>>> I am writing to report a severe scheduling latency issue we recently > >>>>> discovered on Linux Kernel 6.12. > >>>>> > >>>>> Issue Description > >>>>> > >>>>> We observed that when running a specific background workload pattern, > >>>>> certain tasks experience excessive scheduling latency. The delay from > >>>>> the runnable state to running on the CPU exceeds 10 seconds, and in > >>>>> extreme cases, it reaches up to 100 seconds. > >>>>> > >>>>> Environment Details > >>>>> > >>>>> Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > >>>>> Architecture: [ ARM64] > >>>>> Hardware: T7300 > > Is this 4 big & 4 little CPUs? 6 little + 2big. On our devices, background tasks are bound to cores 0-3. To mimic the behavior of these background tasks, we also bound rt-app to cores 0-3. 
> > >>>>> Config: gki_defconfig > >>>>> > >>>>> RT-app‘s workload Pattern: > >>>>> > >>>>> { > >>>>> "tasks" : { > >>>>> "t0" : { > >>>>> "instance" : 40, > >>>>> "priority" : 0, > >>>>> "cpus" : [ 0, 1, 2, 3 ], > >>>>> "taskgroup" : "/background", > >>>>> "loop" : -1, > >>>>> "run" : 200, > >>>>> "sleep" : 50 > >>>>> } > >>>>> } > >>>>> } > >>>>> > >>>>> And we have applied the following patchs: > >>>>> > >>>>> https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > >>>>> https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > >>>>> https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > Does the issue happen on v6.12.58 plain (android) or only when those 3 > additional patches are applied on top? The issue was discovered on android16-6.12.58. We applied the following three patches, but the issue is still reproducible. > > d5843e1530d8 - sched/fair: Forfeit vruntime on yield (2025-12-18 Fernand > Sieber) v6.12.63 > > bddd95054e33 - sched/eevdf: Fix min_vruntime vs avg_vruntime (2026-01-08 > Peter Zijlstra) v6.12.64 > > d2fc2dcfce47 - sched/fair: Fix zero_vruntime tracking (2026-03-25 Peter > Zijlstra) v6.12.78 Thanks! ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-02 5:16 ` Xuewen Yan @ 2026-04-02 14:58 ` Dietmar Eggemann 2026-04-08 2:50 ` Xuewen Yan 0 siblings, 1 reply; 19+ messages in thread From: Dietmar Eggemann @ 2026-04-02 14:58 UTC (permalink / raw) To: Xuewen Yan Cc: Vincent Guittot, John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On 02.04.26 07:16, Xuewen Yan wrote: > On Wed, Apr 1, 2026 at 9:00 PM Dietmar Eggemann > <dietmar.eggemann@arm.com> wrote: >> >> On 01.04.26 12:48, Xuewen Yan wrote: >>> On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot >>> <vincent.guittot@linaro.org> wrote: >>>> >>>> On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: >>>>> >>>>> On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: >>>>>> >>>>>> On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: >>>>>>> >>>>>>> Dear Linux maintainers and reviewers, >>>>>>> >>>>>>> I am writing to report a severe scheduling latency issue we recently >>>>>>> discovered on Linux Kernel 6.12. >>>>>>> >>>>>>> Issue Description >>>>>>> >>>>>>> We observed that when running a specific background workload pattern, >>>>>>> certain tasks experience excessive scheduling latency. The delay from >>>>>>> the runnable state to running on the CPU exceeds 10 seconds, and in >>>>>>> extreme cases, it reaches up to 100 seconds. >>>>>>> >>>>>>> Environment Details >>>>>>> >>>>>>> Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k >>>>>>> Architecture: [ ARM64] >>>>>>> Hardware: T7300 >> >> Is this 4 big & 4 little CPUs? > > 6 little + 2big. > On our devices, background tasks are bound to cores 0-3. To mimic the > behavior of these background tasks, we also bound rt-app to cores 0-3. 
> >> >>>>>>> Config: gki_defconfig >>>>>>> >>>>>>> RT-app‘s workload Pattern: >>>>>>> >>>>>>> { >>>>>>> "tasks" : { >>>>>>> "t0" : { >>>>>>> "instance" : 40, >>>>>>> "priority" : 0, >>>>>>> "cpus" : [ 0, 1, 2, 3 ], >>>>>>> "taskgroup" : "/background", >>>>>>> "loop" : -1, >>>>>>> "run" : 200, >>>>>>> "sleep" : 50 >>>>>>> } >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> And we have applied the following patchs: >>>>>>> >>>>>>> https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ >>>>>>> https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ >>>>>>> https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ >> >> Does the issue happen on v6.12.58 plain (android) or only when those 3 >> additional patches are applied on top? > > The issue was discovered on android16-6.12.58. We applied the > following three patches, but the issue is still reproducible. > >> >> d5843e1530d8 - sched/fair: Forfeit vruntime on yield (2025-12-18 Fernand >> Sieber) v6.12.63 >> >> bddd95054e33 - sched/eevdf: Fix min_vruntime vs avg_vruntime (2026-01-08 >> Peter Zijlstra) v6.12.64 >> >> d2fc2dcfce47 - sched/fair: Fix zero_vruntime tracking (2026-03-25 Peter >> Zijlstra) v6.12.78 > > Thanks! I tried to recreate your env as much as possible on qemu and ran your rt-app file but I can't spot anything suspicious either. This is with defconfig and cgroupv2. 
$ cat /sys/devices/system/cpu/cpu*/cpu_capacity 512 512 512 512 512 512 1024 1024 10 highest wu_lat values: v6.6 0.024601000 task0-9:881 0.019151000 task0-13:885 0.018344000 task0-27:899 0.017332000 task0-5:876 0.010613000 task0-21:893 0.010356000 task0-20:892 0.007796000 task0-15:887 0.007550000 task0-13:885 0.007292000 task0-2:872 0.006718000 task0-15:887 6.12.58 0.029507000 task0-32:1211 0.027374000 task0-37:1216 0.027294000 task0-12:1191 0.027063000 task0-11:1190 0.026612000 task0-28:1207 0.024829000 task0-38:1217 0.024472000 task0-18:1197 0.024396000 task0-34:1213 0.024303000 task0-10:1189 0.023317000 task0-26:1205 tip sched/core (7.0.0-rc4-00030-g265439eb88fd) 0.025000000 task0-32:851 0.020467000 task0-5:824 0.017190000 task0-16:835 0.015365000 task0-8:827 0.011591000 task0-32:851 0.010153000 task0-34:853 0.009932000 task0-4:823 0.008972000 task0-24:843 0.008564000 task0-39:858 0.007591000 task0-25:844 ^ permalink raw reply [flat|nested] 19+ messages in thread
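The ranked wu_lat lists above can be reproduced from raw samples in the `<latency-seconds> <comm>:<pid>` form shown with a short helper like this (hypothetical script, not the tooling used in the thread):

```python
# Rank raw wake-up latency samples of the form "<seconds> <comm>:<pid>"
# and keep the N worst. Hypothetical helper for the lists shown above.
def top_wu_lat(lines, n=10):
    samples = []
    for line in lines:
        lat, task = line.split()
        samples.append((float(lat), task))
    # Largest wake-up latency first.
    return sorted(samples, key=lambda s: s[0], reverse=True)[:n]

raw = [
    "0.024601000 task0-9:881",
    "0.019151000 task0-13:885",
    "0.564159000 t2-43:684",
]
worst = top_wu_lat(raw, n=2)
print(worst[0])  # (0.564159, 't2-43:684')
```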
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-02 14:58 ` Dietmar Eggemann @ 2026-04-08 2:50 ` Xuewen Yan 2026-04-10 11:13 ` Dietmar Eggemann 0 siblings, 1 reply; 19+ messages in thread From: Xuewen Yan @ 2026-04-08 2:50 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan Hi Dietmar and Vincent, > > I tried to recreate your env as much as possible on qemu and ran your > rt-app file but I can't spot anything suspicious either. This is with > defconfig and cgroupv2. Could you please try the following configuration? To rule out Android's influence, I created two new cgroups: foreground_test and background_test. I then placed only rt-app threads into these groups. Even with this setup, we can still observe high scheduling latency for tasks in foreground_test. { "tasks" : { "t0" : { "instance" : 40, "priority" : 0, "cpus" : [ 0, 1, 2, 3 ], "taskgroup" : "/background_test", "loop" : -1, "run" : 200, "sleep" : 50 }, "t1" : { "instance" : 2, "priority" : 19, "cpus" : [ 0, 1, 2, 3 ], "taskgroup" : "/foreground_test", "loop" : -1, "run" : 60000, "sleep" : 100000 }, "t2" : { "instance" : 2, "priority" : 10, "cpus" : [ 0, 1, 2, 3 ], "taskgroup" : "/foreground_test", "loop" : -1, "run" : 5000, "sleep" : 100000 } } } Thanks! 
BR --- xuewen > > $ cat /sys/devices/system/cpu/cpu*/cpu_capacity > 512 > 512 > 512 > 512 > 512 > 512 > 1024 > 1024 > > 10 highest wu_lat values: > > v6.6 > > 0.024601000 task0-9:881 > 0.019151000 task0-13:885 > 0.018344000 task0-27:899 > 0.017332000 task0-5:876 > 0.010613000 task0-21:893 > 0.010356000 task0-20:892 > 0.007796000 task0-15:887 > 0.007550000 task0-13:885 > 0.007292000 task0-2:872 > 0.006718000 task0-15:887 > > 6.12.58 > > 0.029507000 task0-32:1211 > 0.027374000 task0-37:1216 > 0.027294000 task0-12:1191 > 0.027063000 task0-11:1190 > 0.026612000 task0-28:1207 > 0.024829000 task0-38:1217 > 0.024472000 task0-18:1197 > 0.024396000 task0-34:1213 > 0.024303000 task0-10:1189 > 0.023317000 task0-26:1205 > > tip sched/core (7.0.0-rc4-00030-g265439eb88fd) > > 0.025000000 task0-32:851 > 0.020467000 task0-5:824 > 0.017190000 task0-16:835 > 0.015365000 task0-8:827 > 0.011591000 task0-32:851 > 0.010153000 task0-34:853 > 0.009932000 task0-4:823 > 0.008972000 task0-24:843 > 0.008564000 task0-39:858 > 0.007591000 task0-25:844 ^ permalink raw reply [flat|nested] 19+ messages in thread
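For scale, and again assuming rt-app's default microsecond units for "run"/"sleep", the extended pattern quoted above keeps the four allowed CPUs saturated by t0 while t1/t2 add less than one CPU's worth of foreground work (illustrative sketch):

```python
# Aggregate demand of the extended rt-app pattern, assuming rt-app's
# default microsecond units for "run"/"sleep". Illustrative only.
def demand(instances, run_us, sleep_us):
    return instances * run_us / (run_us + sleep_us)

background = demand(40, 200, 50)                                  # t0
foreground = demand(2, 60000, 100000) + demand(2, 5000, 100000)   # t1 + t2
print(round(background, 2))  # 32.0 -> saturates the 4 allowed CPUs
print(round(foreground, 2))  # ~0.85 of a CPU of foreground work
```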
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-08 2:50 ` Xuewen Yan @ 2026-04-10 11:13 ` Dietmar Eggemann 2026-04-14 5:18 ` Xuewen Yan 0 siblings, 1 reply; 19+ messages in thread From: Dietmar Eggemann @ 2026-04-10 11:13 UTC (permalink / raw) To: Xuewen Yan Cc: Vincent Guittot, John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On 08.04.26 03:50, Xuewen Yan wrote: > Hi Dietmar and Vincent, > >> I tried to recreate your env as much as possible on qemu and ran your >> rt-app file but I can't spot anything suspicious either. This is with >> defconfig and cgroupv2. > > Could you please try the following configuration? > To rule out Android's influence, I created two new cgroups: > foreground_test and background_test. > I then placed only rt-app threads into these groups. Even with this > setup, we can still observe high scheduling latency for tasks in > foreground_test. Is this still on an Android (vendor hooks, etc.) or mainline 6.12.58 kernel/device? > { > "tasks" : { > "t0" : { > "instance" : 40, > "priority" : 0, > "cpus" : [ 0, 1, 2, 3 ], > "taskgroup" : "/background_test", > "loop" : -1, > "run" : 200, > "sleep" : 50 > }, > "t1" : { > "instance" : 2, > "priority" : 19, > "cpus" : [ 0, 1, 2, 3 ], > "taskgroup" : "/foreground_test", > "loop" : -1, > "run" : 60000, > "sleep" : 100000 > }, > "t2" : { > "instance" : 2, > "priority" : 10, > "cpus" : [ 0, 1, 2, 3 ], > "taskgroup" : "/foreground_test", > "loop" : -1, > "run" : 5000, > "sleep" : 100000 > } > } > } With your rt-app file and moving the tasks into cgroupv2 taskgroups manually: t0-0 679 0 > /sys/fs/cgroup/A t0-1 680 0 > /sys/fs/cgroup/A t0-2 681 0 > /sys/fs/cgroup/A ... 
t0-37 716 0 > /sys/fs/cgroup/A t0-38 717 0 > /sys/fs/cgroup/A t0-39 718 0 > /sys/fs/cgroup/A t1-40 719 19 > /sys/fs/cgroup/B t1-41 720 19 > /sys/fs/cgroup/B t2-42 721 10 > /sys/fs/cgroup/B t2-43 722 10 > /sys/fs/cgroup/B 10 highest wu_lat values on Arm64 qemu (-accel hvf): v6.6 0.564159000 t2-43:684 0.306418000 t2-42:683 0.237134000 t2-43:684 0.166982000 t2-42:683 0.166674000 t2-42:683 0.161856000 t2-42:683 0.098879000 t2-43:684 0.097746000 t2-43:684 0.083329000 t2-42:683 0.082943000 t2-42:683 6.12.58 0.368566000 t2-43:939 0.228139000 t2-42:938 0.212454000 t2-43:939 0.207144000 t2-42:938 0.177373000 t2-43:939 0.148268000 t2-42:938 0.147619000 t2-42:938 0.125988000 t2-43:939 0.091564000 t2-42:938 0.088160000 t2-42:938 tip sched/core (7.0.0-rc6-00050-g985215804dcb) 0.395585000 t2-43:697 0.203889000 t2-43:697 0.101130000 t2-42:696 0.098782000 t2-42:696 0.084523000 t2-43:697 0.033895000 t2-42:696 0.031881000 t2-43:697 0.021958000 t2-42:696 0.018123000 t2-42:696 0.013132000 t0-7:661 Could you specify which tasks had those > 1s wu_lat values? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-10 11:13 ` Dietmar Eggemann @ 2026-04-14 5:18 ` Xuewen Yan 0 siblings, 0 replies; 19+ messages in thread From: Xuewen Yan @ 2026-04-14 5:18 UTC (permalink / raw) To: Dietmar Eggemann Cc: Vincent Guittot, John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan Thanks Dietmar! On Fri, Apr 10, 2026 at 7:14 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 08.04.26 03:50, Xuewen Yan wrote: > > Hi Dietmar and Vincent, > > > >> I tried to recreate your env as much as possible on qemu and ran your > >> rt-app file but I can't spot anything suspicious either. This is with > >> defconfig and cgroupv2. > > > > Could you please try the following configuration? > > To rule out Android's influence, I created two new cgroups: > > foreground_test and background_test. > > I then placed only rt-app threads into these groups. Even with this > > setup, we can still observe high scheduling latency for tasks in > > foreground_test. > > Is this still on an Android (vendor hooks, etc.) or mainline 6.12.58 > kernel/device? We tested on the Android tree without any vendor hooks. Just as John said, after we reverted commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") in android16-6.12, the latency disappeared, because that commit does not exist in the stable tree. On the other hand, we also tested android17-6.18 without any revert, and the latency still exists.
> > > { > > "tasks" : { > > "t0" : { > > "instance" : 40, > > "priority" : 0, > > "cpus" : [ 0, 1, 2, 3 ], > > "taskgroup" : "/background_test", > > "loop" : -1, > > "run" : 200, > > "sleep" : 50 > > }, > > "t1" : { > > "instance" : 2, > > "priority" : 19, > > "cpus" : [ 0, 1, 2, 3 ], > > "taskgroup" : "/foreground_test", > > "loop" : -1, > > "run" : 60000, > > "sleep" : 100000 > > }, > > "t2" : { > > "instance" : 2, > > "priority" : 10, > > "cpus" : [ 0, 1, 2, 3 ], > > "taskgroup" : "/foreground_test", > > "loop" : -1, > > "run" : 5000, > > "sleep" : 100000 > > } > > } > > } > > With your rt-app file and moving the tasks into cgroupv2 taskgroups > manually: > > t0-0 679 0 > /sys/fs/cgroup/A > t0-1 680 0 > /sys/fs/cgroup/A > t0-2 681 0 > /sys/fs/cgroup/A > ... > t0-37 716 0 > /sys/fs/cgroup/A > t0-38 717 0 > /sys/fs/cgroup/A > t0-39 718 0 > /sys/fs/cgroup/A > t1-40 719 19 > /sys/fs/cgroup/B > t1-41 720 19 > /sys/fs/cgroup/B > t2-42 721 10 > /sys/fs/cgroup/B > t2-43 722 10 > /sys/fs/cgroup/B > > 10 highest wu_lat values on Arm64 qemu (-accel hvf): > > v6.6 > > 0.564159000 t2-43:684 > 0.306418000 t2-42:683 > 0.237134000 t2-43:684 > 0.166982000 t2-42:683 > 0.166674000 t2-42:683 > 0.161856000 t2-42:683 > 0.098879000 t2-43:684 > 0.097746000 t2-43:684 > 0.083329000 t2-42:683 > 0.082943000 t2-42:683 > > 6.12.58 > > 0.368566000 t2-43:939 > 0.228139000 t2-42:938 > 0.212454000 t2-43:939 > 0.207144000 t2-42:938 > 0.177373000 t2-43:939 > 0.148268000 t2-42:938 > 0.147619000 t2-42:938 > 0.125988000 t2-43:939 > 0.091564000 t2-42:938 > 0.088160000 t2-42:938 > > tip sched/core (7.0.0-rc6-00050-g985215804dcb) > > 0.395585000 t2-43:697 > 0.203889000 t2-43:697 > 0.101130000 t2-42:696 > 0.098782000 t2-42:696 > 0.084523000 t2-43:697 > 0.033895000 t2-42:696 > 0.031881000 t2-43:697 > 0.021958000 t2-42:696 > 0.018123000 t2-42:696 > 0.013132000 t0-7:661 > > Could you specify which tasks had those > 1s wu_lat values? > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 10:48 ` Xuewen Yan 2026-04-01 13:00 ` Dietmar Eggemann @ 2026-04-01 14:01 ` Vincent Guittot 2026-04-02 5:11 ` Xuewen Yan 1 sibling, 1 reply; 19+ messages in thread From: Vincent Guittot @ 2026-04-01 14:01 UTC (permalink / raw) To: Xuewen Yan Cc: John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Wed, 1 Apr 2026 at 12:49, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: > > > > > > > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > Dear Linux maintainers and reviewers, > > > > > > > > > > I am writing to report a severe scheduling latency issue we recently > > > > > discovered on Linux Kernel 6.12. > > > > > > > > > > Issue Description > > > > > > > > > > We observed that when running a specific background workload pattern, > > > > > certain tasks experience excessive scheduling latency. The delay from > > > > > the runnable state to running on the CPU exceeds 10 seconds, and in > > > > > extreme cases, it reaches up to 100 seconds. 
> > > > > > > > > > Environment Details > > > > > > > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > > > > > Architecture: [ ARM64] > > > > > Hardware: T7300 > > > > > Config: gki_defconfig > > > > > > > > > > RT-app‘s workload Pattern: > > > > > > > > > > { > > > > > "tasks" : { > > > > > "t0" : { > > > > > "instance" : 40, > > > > > "priority" : 0, > > > > > "cpus" : [ 0, 1, 2, 3 ], > > > > > "taskgroup" : "/background", > > > > > "loop" : -1, > > > > > "run" : 200, > > > > > "sleep" : 50 > > > > > } > > > > > } > > > > > } > > > > > > > > > > And we have applied the following patchs: > > > > > > > > > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > > > > > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > > > > > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > > > > > > > > > > > > > Could you please advise if there are known changes in the eevdf in > > > > > 6.12 that might affect this specific workload pattern? > > > > > > > > > > > > Thanks for the quick response! > > > > > > > Could you maybe instead point to some source for the runqslower binary > > > > you attached? I don't think folks will run random binaries. > > > > > > We use the code in kernel "tools/bpf/runqslower". > > > > > > > > > > > Also, it looks like the RT-app description uses the background cgroup, > > > > can you share the cgroup configuration you have set for that? > > > > > > Our "background" cgroup does not have any special configurations applied. > > > > > > cpu.shares: Set to 1024, which is consistent with other cgroups on the system. > > > Bandwidth Control: It is disabled (no cpu.cfs_quota_us limits set). > > > > > > > > > > > Also, did you try to reproduce this against vanilla 6.12-stable ? I'm > > > > not sure the audience here is going to pay much attention to GKI based > > > > reports. Were you using any vendorhooks? 
> > > > > > We have verified this on a GKI kernel with all vendor hooks removed. > > > The issue still reproduces in this environment. This suggests the > > > problem is not directly caused by our vendor-specific modifications. > > > > Did you try on the latest android mainline kernel which is based on > > v6.19 ? This would help determine if the issue only happens on v6.12 > > or on more recent kernels too > > We also tested this case on android kernel 6.18. The issue is still > reproducible, although the probability of occurrence is significantly > lower compared to 6.12. > > > > > > I ran your rt-app json file on the latest tip/sched/core but I don't > > see any scheduling issue > > > > > > > > We conducted an experiment by disabling the DELAY_DEQUEUE feature. > > > After turning it off, we observed a significant increase in threads > > > with extremely long runnable times. Even kworkers started exhibiting > > > timeout phenomena. > > > > Just to make sure, the problem happens even if you don't disable DELAY_DEQUEUE ? > > Yes, we see this problem with both DELAY_DEQUEUE on and off. > > Additionally, we noticed that the tasks suffering from long scheduling > latencies frequently belong to different cgroups (e.g., foreground), > rather than the background cgroup where the rt-app load is running. > This unexpected cross-group interference is quite puzzling to us... Do you have more details about what is running at the same time as rt-app? I thought the problem occurred on the rt-app threads but it seems to happen on other threads running simultaneously. I tried your rt-app JSON file in one cgroup with another rt-app running small tasks in a different cgroup on tip/sched/core and v6.12.79 kernel, and I can't trigger any scheduling latency bigger than 35ms which is not that far from the theoretical 30ms = 10 tasks per cpu * 3ms (2.8ms slices with 1ms tick) Vincent > > Thanks! > --- > xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
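Vincent's back-of-the-envelope bound can be written out explicitly (a sketch of the arithmetic only; the 2.8ms base slice and 1ms tick are the figures from his message above):

```python
import math

n_tasks = 40         # rt-app "instance" : 40
n_cpus = 4           # "cpus" : [0, 1, 2, 3]
base_slice_ms = 2.8  # scheduler slice cited above
tick_ms = 1.0        # tick granularity cited above

tasks_per_cpu = n_tasks // n_cpus  # 10 runnable tasks per CPU
# A 2.8ms slice rounds up to 3ms with a 1ms tick.
effective_slice_ms = math.ceil(base_slice_ms / tick_ms) * tick_ms
# Each waking task waits at most one slice per competitor on its CPU.
worst_case_ms = tasks_per_cpu * effective_slice_ms

print(f"expected worst-case wake-up latency: ~{worst_case_ms:.0f} ms")
```

The observed 35ms is close to this ~30ms bound, which is why latencies in the tens of seconds point at something other than plain slice contention.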
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-01 14:01 ` Vincent Guittot @ 2026-04-02 5:11 ` Xuewen Yan 2026-04-02 5:24 ` Xuewen Yan 0 siblings, 1 reply; 19+ messages in thread From: Xuewen Yan @ 2026-04-02 5:11 UTC (permalink / raw) To: Vincent Guittot Cc: John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan Hi Vincent, On Wed, Apr 1, 2026 at 10:02 PM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Wed, 1 Apr 2026 at 12:49, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > On Wed, Apr 1, 2026 at 6:05 PM Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > > > > > > On Wed, 1 Apr 2026 at 08:04, Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > On Wed, Apr 1, 2026 at 12:25 PM John Stultz <jstultz@google.com> wrote: > > > > > > > > > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > > > > > > > > > Dear Linux maintainers and reviewers, > > > > > > > > > > > > I am writing to report a severe scheduling latency issue we recently > > > > > > discovered on Linux Kernel 6.12. > > > > > > > > > > > > Issue Description > > > > > > > > > > > > We observed that when running a specific background workload pattern, > > > > > > certain tasks experience excessive scheduling latency. The delay from > > > > > > the runnable state to running on the CPU exceeds 10 seconds, and in > > > > > > extreme cases, it reaches up to 100 seconds. 
> > > > > > > > > > > > Environment Details > > > > > > > > > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > > > > > > Architecture: [ ARM64] > > > > > > Hardware: T7300 > > > > > > Config: gki_defconfig > > > > > > > > > > > > RT-app‘s workload Pattern: > > > > > > > > > > > > { > > > > > > "tasks" : { > > > > > > "t0" : { > > > > > > "instance" : 40, > > > > > > "priority" : 0, > > > > > > "cpus" : [ 0, 1, 2, 3 ], > > > > > > "taskgroup" : "/background", > > > > > > "loop" : -1, > > > > > > "run" : 200, > > > > > > "sleep" : 50 > > > > > > } > > > > > > } > > > > > > } > > > > > > > > > > > > And we have applied the following patchs: > > > > > > > > > > > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > > > > > > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > > > > > > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > > > > > > > > > > > > > > > > Could you please advise if there are known changes in the eevdf in > > > > > > 6.12 that might affect this specific workload pattern? > > > > > > > > > > > > > > > Thanks for the quick response! > > > > > > > > > Could you maybe instead point to some source for the runqslower binary > > > > > you attached? I don't think folks will run random binaries. > > > > > > > > We use the code in kernel "tools/bpf/runqslower". > > > > > > > > > > > > > > Also, it looks like the RT-app description uses the background cgroup, > > > > > can you share the cgroup configuration you have set for that? > > > > > > > > Our "background" cgroup does not have any special configurations applied. > > > > > > > > cpu.shares: Set to 1024, which is consistent with other cgroups on the system. > > > > Bandwidth Control: It is disabled (no cpu.cfs_quota_us limits set). > > > > > > > > > > > > > > Also, did you try to reproduce this against vanilla 6.12-stable ? 
I'm > > > > > not sure the audience here is going to pay much attention to GKI based > > > > > reports. Were you using any vendorhooks? > > > > > > > > We have verified this on a GKI kernel with all vendor hooks removed. > > > > The issue still reproduces in this environment. This suggests the > > > > problem is not directly caused by our vendor-specific modifications. > > > > > > Did you try on the latest android mainline kernel which is based on > > > v6.19 ? This would help determine if the issue only happens on v6.12 > > > or on more recent kernels too > > > > We also tested this case on android kernel 6.18. The issue is still > > reproducible, although the probability of occurrence is significantly > > lower compared to 6.12. > > > > > > > > > > I ran your rt-app json file on the latest tip/sched/core but I don't > > > see any scheduling issue > > > > > > > > > > > We conducted an experiment by disabling the DELAY_DEQUEUE feature. > > > > After turning it off, we observed a significant increase in threads > > > > with extremely long runnable times. Even kworkers started exhibiting > > > > timeout phenomena. > > > > > > Just to make sure, the problem happens even if you don't disable DELAY_DEQUEUE ? > > > > Yes, we see this problem with both DELAY_DEQUEUE on and off. > > > > Additionally, we noticed that the tasks suffering from long scheduling > > latencies frequently belong to different cgroups (e.g., foreground), > > rather than the background cgroup where the rt-app load is running. > > This unexpected cross-group interference is quite puzzling to us... > > Do you have more details about what is running at the same time as > rt-app? I thought the problem occurred on the rt-app threads but it > seems to happen on other threads running simultaneously. 
> > I tried your rt-app JSON file in one cgroup with another rt-app > running small tasks in a different cgroup on tip/sched/core and > v6.12.79 kernel, and I can't trigger any scheduling latency bigger > than 35ms which is not that far from the theoretical 30ms = 10 tasks > per cpu * 3ms (2.8ms slices with 1ms tick) To reproduce the issue, simply operate the phone to launch a few apps while rt-app is running in the background, and then turn off the screen. After a short wait, we will observe tasks with scheduling latencies in the order of seconds. Below is a captured instance where the latency exceeded 100 seconds. We dumped the runqueue (rq) at that moment, and the thread information is as follows:(CachedAppOptimi and android.bg went unscheduled for more than 100 seconds.) (enqueue, sleep, wake: The timestamps indicating when the task was enqueued, when it slept, and when it woke up.): CPU0 9 process is running |--[status: curr] pid: 773 tsk: 0xffffff80b2a54500 comm: tcpdump stack: 0xffffffc08e710000 prio: 120 aff: 0f delayed: 0 enqueue: 129701710483481 sleep: 122565444420050 wake: 122565465620396 vrun: 258051036097647 deadline: 258051038773377 exec_start: 124562605999432 sum_ex: 110797759941549 CFS 9 process(2 groups) is pending |--[status: curr] pid: 773 comm: tcpdump preempt: 0x100010000 |--system group--pend: 8 process is in group | |--curr : None(0) | |--[status: pend] pid: 7716 tsk: 0xffffff80b6be2e00 comm: t0-26 stack: 0xffffffc09a168000 prio: 120 aff: 0f delayed: 0 enqueue: 129749977667220 sleep: 129749977540912 wake: 129749977666643 vrun: 25100617833006 deadline: 25100618030934 exec_start: 124562594098546 sum_ex: 7975676204 | |--[status: pend] pid: 7703 tsk: 0xffffff80a7f01700 comm: t0-23 stack: 0xffffffc0a0368000 prio: 120 aff: 0f delayed: 0 enqueue: 129749985123489 sleep: 129749984999989 wake: 129749985123066 vrun: 25100617824531 deadline: 25100618072759 exec_start: 124562601471047 sum_ex: 7577607107 | |--[status: pend] pid: 7677 tsk: 
0xffffff808b3d1700 comm: t0-13 stack: 0xffffffc09ae28000 prio: 120 aff: 0f delayed: 0 enqueue: 129749985426220 sleep: 129749985303143 wake: 129749985425797 vrun: 25100617862037 deadline: 25100618288191 exec_start: 124562605521239 sum_ex: 9375620632 | |--[status: pend] pid: 7687 tsk: 0xffffff809c5bae00 comm: t0-16 stack: 0xffffffc0a0388000 prio: 120 aff: 0f delayed: 0 enqueue: 129749985815873 sleep: 129749985691489 wake: 129749985815450 vrun: 25100617809713 deadline: 25100619864676 exec_start: 124562602128508 sum_ex: 9878306928 | |--[status: pend] pid: 7711 tsk: 0xffffff812a39dc00 comm: t0-25 stack: 0xffffffc09a070000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989667489 sleep: 129749989541450 wake: 129749989667066 vrun: 25100618056222 deadline: 25100620120645 exec_start: 124562605875162 sum_ex: 7549612132 | |--[status: pend] pid: 7770 tsk: 0xffffff8098b15c00 comm: t0-34 stack: 0xffffffc09c430000 prio: 120 aff: 0f delayed: 0 enqueue: 129749957114066 sleep: 129749956986989 wake: 129749957113681 vrun: 25100617430414 deadline: 25100620230414 exec_start: 124562573816509 sum_ex: 6653557883 | |--[status: pend] pid: 7663 tsk: 0xffffff808b3cae00 comm: t0-7 stack: 0xffffffc09ee10000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989169873 sleep: 129749989040643 wake: 129749989162489 vrun: 25100617739245 deadline: 25100620539245 exec_start: 124562605396047 sum_ex: 8631500166 | |--[status: pend] pid: 7750 tsk: 0xffffff80f0409700 comm: t0-28 stack: 0xffffffc09af48000 prio: 120 aff: 0f delayed: 0 enqueue: 129749978263912 sleep: 129749978138797 wake: 129749978263527 vrun: 25100617871842 deadline: 25100620671842 exec_start: 124562594675585 sum_ex: 10184100935 RT 0 process is pending CPU1 12 process is running |--[status: curr] pid: 7761 tsk: 0xffffff8025db9700 comm: t0-30 stack: 0xffffffc09aee0000 prio: 120 aff: 0f delayed: 0 enqueue: 129749985754797 sleep: 129749985632335 wake: 129749985754412 vrun: 23922316736109 deadline: 23922317701147 exec_start: 125273240674085 sum_ex: 
8544979285 CFS 12 process(1 groups) is pending |--system group--curr: 12 process is in group | |--[status: curr] pid: 7761 comm: t0-30 preempt: 0x100010000 | |--[status: pend] pid: 7652 tsk: 0xffffff8127964500 comm: t0-1 stack: 0xffffffc09b288000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988875450 sleep: 129749988752720 wake: 129749988875104 vrun: 23922317069000 deadline: 23922317662501 exec_start: 125273239582124 sum_ex: 8796331458 | |--[status: pend] pid: 7773 tsk: 0xffffff8098b10000 comm: t0-36 stack: 0xffffffc09f5c8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989256027 sleep: 129749989133373 wake: 129749989255681 vrun: 23922317057443 deadline: 23922317664214 exec_start: 125273239946854 sum_ex: 8600842892 | |--[status: pend] pid: 7766 tsk: 0xffffff80818f8000 comm: t0-33 stack: 0xffffffc09adf0000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989636143 sleep: 129749989511181 wake: 129749989635758 vrun: 23922317077817 deadline: 23922317673818 exec_start: 125273240309354 sum_ex: 9189242817 | |--[status: pend] pid: 7764 tsk: 0xffffff80818fc500 comm: t0-32 stack: 0xffffffc09ae70000 prio: 120 aff: 0f delayed: 1 enqueue: 129749985375412 sleep: 129749989893143 wake: 129749985375066 vrun: 23922317087233 deadline: 23922317696809 exec_start: 125273240674085 sum_ex: 8867997747 | |--[status: pend] pid: 7658 tsk: 0xffffff8018185c00 comm: t0-4 stack: 0xffffffc09b240000 prio: 120 aff: 0f delayed: 0 enqueue: 129749986134873 sleep: 129749986012143 wake: 129749986134527 vrun: 23922316750208 deadline: 23922317714055 exec_start: 125273236997511 sum_ex: 9299251939 | |--[status: pend] pid: 7668 tsk: 0xffffff814f61c500 comm: t0-9 stack: 0xffffffc09e9f8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749986515489 sleep: 129749986392873 wake: 129749986515066 vrun: 23922316748606 deadline: 23922317720145 exec_start: 125273237362049 sum_ex: 8568645989 | |--[status: pend] pid: 7651 tsk: 0xffffff8012128000 comm: t0-0 stack: 0xffffffc09eea8000 prio: 120 aff: 0f delayed: 0 enqueue: 
129749986898027 sleep: 129749986771758 wake: 129749986897643 vrun: 23922316823052 deadline: 23922317784939 exec_start: 125273237726049 sum_ex: 9400079830 | |--[status: pend] pid: 7697 tsk: 0xffffff80b6581700 comm: t0-19 stack: 0xffffffc09b0e8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749987282835 sleep: 129749987162335 wake: 129749987282412 vrun: 23922316848563 deadline: 23922317797946 exec_start: 125273238098280 sum_ex: 9244661640 | |--[status: pend] pid: 7670 tsk: 0xffffff814f618000 comm: t0-10 stack: 0xffffffc09dd40000 prio: 120 aff: 0f delayed: 0 enqueue: 129749987674412 sleep: 129749987549373 wake: 129749987674066 vrun: 23922316839247 deadline: 23922317798978 exec_start: 125273238470048 sum_ex: 10064701654 | |--[status: pend] pid: 7660 tsk: 0xffffff80b415c500 comm: t0-6 stack: 0xffffffc09ee50000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988071912 sleep: 129749987943489 wake: 129749988064950 vrun: 23922316196954 deadline: 23922318996954 exec_start: 125273238843202 sum_ex: 8121803041 | |--[status: pend] pid: 7700 tsk: 0xffffff817007dc00 comm: t0-22 stack: 0xffffffc09afd0000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988458643 sleep: 129749988331027 wake: 129749988452412 vrun: 23922316230410 deadline: 23922319030410 exec_start: 125273239210278 sum_ex: 8217864682 RT 0 process is pending CPU2 12 process is running |--[status: curr] pid: 7681 tsk: 0xffffff809c5b8000 comm: t0-15 stack: 0xffffffc09c5f8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749985836527 sleep: 129749985713873 wake: 129749985836104 vrun: 23237215363675 deadline: 23237217433790 exec_start: 124989542777297 sum_ex: 9227596667 CFS 12 process(1 groups) is pending |--system group--curr: 12 process is in group | |--[status: curr] pid: 7681 comm: t0-15 preempt: 0x100010000 | |--[status: pend] pid: 7777 tsk: 0xffffff80e00d9700 comm: t0-38 stack: 0xffffffc099df0000 prio: 120 aff: 0f delayed: 0 enqueue: 129749984668373 sleep: 129749984545950 wake: 129749984668027 vrun: 23237215480433 deadline: 
23237217374857 exec_start: 124989541645989 sum_ex: 8141721727 | |--[status: pend] pid: 7679 tsk: 0xffffff802b0dae00 comm: t0-14 stack: 0xffffffc09b1a8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989585835 sleep: 129749989462066 wake: 129749989585450 vrun: 23237215692450 deadline: 23237217397027 exec_start: 124989542414874 sum_ex: 8523879757 | |--[status: pend] pid: 7696 tsk: 0xffffff810f735c00 comm: t0-18 stack: 0xffffffc09b130000 prio: 120 aff: 0f delayed: 1 enqueue: 129749985457027 sleep: 129749989840797 wake: 129749985456643 vrun: 23237215692746 deadline: 23237217400592 exec_start: 124989542777297 sum_ex: 8330286966 | |--[status: pend] pid: 7672 tsk: 0xffffff8080938000 comm: t0-11 stack: 0xffffffc09c668000 prio: 120 aff: 0f delayed: 0 enqueue: 129749986216643 sleep: 129749986093450 wake: 129749986216258 vrun: 23237215367417 deadline: 23237217438148 exec_start: 124989539255414 sum_ex: 8766621990 | |--[status: pend] pid: 7657 tsk: 0xffffff80b4159700 comm: t0-3 stack: 0xffffffc09ee58000 prio: 120 aff: 0f delayed: 0 enqueue: 129749986599258 sleep: 129749986476297 wake: 129749986598873 vrun: 23237215394091 deadline: 23237217459628 exec_start: 124989539622415 sum_ex: 10357036283 | |--[status: pend] pid: 7706 tsk: 0xffffff809323dc00 comm: t0-24 stack: 0xffffffc09a208000 prio: 120 aff: 0f delayed: 0 enqueue: 129749986979181 sleep: 129749986857066 wake: 129749986978797 vrun: 23237215445452 deadline: 23237217513876 exec_start: 124989539987607 sum_ex: 9653857861 | |--[status: pend] pid: 7666 tsk: 0xffffff808b3cc500 comm: t0-8 stack: 0xffffffc09ee08000 prio: 120 aff: 0f delayed: 0 enqueue: 129749987363450 sleep: 129749987240297 wake: 129749987363066 vrun: 23237215453893 deadline: 23237217522123 exec_start: 124989540355992 sum_ex: 8568372234 | |--[status: pend] pid: 7675 tsk: 0xffffff809c5b9700 comm: t0-12 stack: 0xffffffc09fc30000 prio: 120 aff: 0f delayed: 0 enqueue: 129749987747066 sleep: 129749987624566 wake: 129749987746643 vrun: 23237215493101 deadline: 
23237217557948 exec_start: 124989540724722 sum_ex: 8606555185 | |--[status: pend] pid: 7699 tsk: 0xffffff8170079700 comm: t0-21 stack: 0xffffffc09afd8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988153989 sleep: 129749988020758 wake: 129749988146373 vrun: 23237215173170 deadline: 23237217973170 exec_start: 124989541098798 sum_ex: 7654163294 | |--[status: pend] pid: 7656 tsk: 0xffffff80c0649700 comm: t0-2 stack: 0xffffffc09b278000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988547066 sleep: 129749988415835 wake: 129749988540873 vrun: 23237215176152 deadline: 23237217976152 exec_start: 124989541468567 sum_ex: 9647417361 | |--[status: pend] pid: 7725 tsk: 0xffffff8130439700 comm: t0-27 stack: 0xffffffc0a0d60000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989203450 sleep: 129749989074950 wake: 129749989196758 vrun: 23237215183429 deadline: 23237217983429 exec_start: 124989542049104 sum_ex: 8366970334 RT 0 process is pending CPU3 12 process is running |--[status: curr] pid: 7752 tsk: 0xffffff812e6f0000 comm: t0-29 stack: 0xffffffc09af40000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988017297 sleep: 129749987894412 wake: 129749988016873 vrun: 21935407414521 deadline: 21935409112944 exec_start: 125371011025153 sum_ex: 10673494974 CFS 12 process(2 groups) is pending |--system group--curr: 8 process is in group | |--[status: curr] pid: 7752 comm: t0-29 preempt: 0x100010001 | |--[status: pend] pid: 7778 tsk: 0xffffff800f9c1700 comm: t0-39 stack: 0xffffffc09ac10000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989685373 sleep: 129749989562797 wake: 129749989684989 vrun: 21935407785235 deadline: 21935408052773 exec_start: 125371010656577 sum_ex: 8042858876 | |--[status: pend] pid: 7659 tsk: 0xffffff815c4d0000 comm: t0-5 stack: 0xffffffc09b1b0000 prio: 120 aff: 0f delayed: 1 enqueue: 129749987629912 sleep: 129749989947489 wake: 129749987629566 vrun: 21935407769505 deadline: 21935409105235 exec_start: 125371011025153 sum_ex: 12135216345 | |--[status: pend] pid: 7772 tsk: 
0xffffff8098b11700 comm: t0-35 stack: 0xffffffc09c420000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988693412 sleep: 129749988571104 wake: 129749988693104 vrun: 21935407456083 deadline: 21935409979699 exec_start: 125371009734078 sum_ex: 8723848416 | |--[status: pend] pid: 7776 tsk: 0xffffff800f9c2e00 comm: t0-37 stack: 0xffffffc09add0000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988928104 sleep: 129749988805258 wake: 129749988927758 vrun: 21935407461644 deadline: 21935410065530 exec_start: 125371009930192 sum_ex: 9006979288 | |--[status: pend] pid: 7698 tsk: 0xffffff802a954500 comm: t0-20 stack: 0xffffffc09b0b8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749986262566 sleep: 129749986139758 wake: 129749986262143 vrun: 21935407338508 deadline: 21935410138508 exec_start: 125371007400155 sum_ex: 10200000722 | |--[status: pend] pid: 7763 tsk: 0xffffff8025dbdc00 comm: t0-31 stack: 0xffffffc09aed8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749988403104 sleep: 129749988275489 wake: 129749988396450 vrun: 21935407357657 deadline: 21935410157657 exec_start: 125371009457694 sum_ex: 11933014206 | |--[status: pend] pid: 7688 tsk: 0xffffff80a7f00000 comm: t0-17 stack: 0xffffffc09c5d8000 prio: 120 aff: 0f delayed: 0 enqueue: 129749989307912 sleep: 129749989185066 wake: 129749989307566 vrun: 21935407751480 deadline: 21935410551480 exec_start: 125371010294230 sum_ex: 11220342066 |--foreground group--pend: 4 process is in group | |--curr : None(0) | |--[status: pend] pid: 1544 tsk: 0xffffff810f4d9700 comm: CachedAppOptimi stack: 0xffffffc085f28000 prio: 122 aff: 0f delayed: 0 enqueue: 129626712825690 sleep: 129626010455462 wake: 129626660645229 vrun: 3911890666624 deadline: 3911895044028 exec_start: 125254578215081 sum_ex: 2426485755565 | |--[status: pend] pid: 1371 tsk: 0xffffff80f17ec500 comm: android.bg stack: 0xffffffc083b70000 prio: 130 aff: 0f delayed: 0 enqueue: 129627837194378 sleep: 129627518391341 wake: 129627571130110 vrun: 3911886852543 deadline: 3911911449312 
exec_start: 125254574848159 sum_ex: 135874975341 | |--[status: pend] pid: 5550 tsk: 0xffffff8009005c00 comm: aiai-cc-datasha stack: 0xffffffc08ce28000 prio: 139 aff: 0f delayed: 0 enqueue: 129627692105571 sleep: 129616484841921 wake: 129627692097994 vrun: 3911885383858 deadline: 3912076530524 exec_start: 125243602228529 sum_ex: 1191863852 | |--[status: pend] pid: 2864 tsk: 0xffffff8160ec8000 comm: aiai-vc-0 stack: 0xffffffc084de8000 prio: 139 aff: 0f delayed: 0 enqueue: 129627692070340 sleep: 129616487551075 wake: 129627692056417 vrun: 3911893877800 deadline: 3912076530524 exec_start: 125254572184850 sum_ex: 17830231835 RT 0 process is pending CPU4 0 process is running |--[status: curr] pid: 0 tsk: 0xffffff8080360000 comm: swapper/4 stack: 0xffffffc080258000 prio: 120 aff: 10 delayed: 0 enqueue: 0 sleep: 0 wake: 0 vrun: 0 deadline: 8742091 exec_start: 127879762650582 sum_ex: 0 CFS 0 process(0 groups) is pending |--curr : None(0) RT 0 process is pending CPU5 1 process is running |--[status: curr] pid: 114 tsk: 0xffffff8089259700 comm: native_hang_det stack: 0xffffffc080e30000 prio: 0 aff: ff delayed: 0 enqueue: 129749988839758 sleep: 129729508682798 wake: 129749988830566 vrun: 97734703 deadline: 2774423 exec_start: 128230900434207 sum_ex: 2725543785 CFS 0 process(0 groups) is pending |--curr : None(0) RT 1 process is pending |--[status: curr] pid: 114 comm: native_hang_det preempt: 0x100000001 CPU6 0 process is running |--[status: curr] pid: 0 tsk: 0xffffff80803f4500 comm: swapper/6 stack: 0xffffffc080268000 prio: 120 aff: 40 delayed: 0 enqueue: 0 sleep: 0 wake: 0 vrun: 0 deadline: 8742091 exec_start: 127969263022680 sum_ex: 0 CFS 0 process(0 groups) is pending |--curr : None(0) RT 0 process is pending CPU7 0 process is running |--[status: curr] pid: 0 tsk: 0xffffff80803f5c00 comm: swapper/7 stack: 0xffffffc080270000 prio: 120 aff: 80 delayed: 0 enqueue: 0 sleep: 0 wake: 0 vrun: 0 deadline: 8742091 exec_start: 128201870087905 sum_ex: 0 CFS 0 process(0 groups) is 
pending |--curr : None(0) RT 0 process is pending Thanks! --- xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
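The ">100 seconds" claim can be sanity-checked directly against the dump, assuming the enqueue timestamps are monotonic nanoseconds (a hypothetical check, not part of the report; "now" is approximated by the most recently enqueued t0 worker since the dump carries no capture timestamp):

```python
NSEC_PER_SEC = 1_000_000_000

# Timestamps copied from the CPU3 foreground-group dump above.
cachedapp_enqueue_ns = 129626712825690   # CachedAppOptimi, pid 1544
android_bg_enqueue_ns = 129627837194378  # android.bg, pid 1371
# Approximate capture time: enqueue of t0-39 (pid 7778) on CPU3.
now_ns = 129749989685373

delays = {}
for name, enq in [("CachedAppOptimi", cachedapp_enqueue_ns),
                  ("android.bg", android_bg_enqueue_ns)]:
    delays[name] = (now_ns - enq) / NSEC_PER_SEC
    print(f"{name}: runnable for ~{delays[name]:.1f} s")
```

Both values come out above 120 s, consistent with the statement that these two foreground tasks went unscheduled for more than 100 seconds.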
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-02 5:11 ` Xuewen Yan @ 2026-04-02 5:24 ` Xuewen Yan 0 siblings, 0 replies; 19+ messages in thread From: Xuewen Yan @ 2026-04-02 5:24 UTC (permalink / raw) To: Vincent Guittot Cc: John Stultz, Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Thu, Apr 2, 2026 at 1:11 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > Hi Vincent, > > On Wed, Apr 1, 2026 at 10:02 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > > Additionally, we noticed that the tasks suffering from long scheduling > > > latencies frequently belong to different cgroups (e.g., foreground), > > > rather than the background cgroup where the rt-app load is running. > > > This unexpected cross-group interference is quite puzzling to us... > > > > Do you have more details about what is running at the same time as > > rt-app? I thought the problem occurred on the rt-app threads but it > > seems to happen on other threads running simultaneously. > > > > I tried your rt-app JSON file in one cgroup with another rt-app > > running small tasks in a different cgroup on tip/sched/core and > > v6.12.79 kernel, and I can't trigger any scheduling latency bigger > > than 35ms which is not that far from the theoretical 30ms = 10 tasks > > per cpu * 3ms (2.8ms slices with 1ms tick) > > To reproduce the issue, simply operate the phone to launch a few apps > while rt-app is running in the background, and then turn off the > screen. After a short wait, we will observe tasks with scheduling > latencies in the order of seconds. By the way, we could not reproduce this issue on Kernel 6.6 with the same Android 16 environment. > > Below is a captured instance where the latency exceeded 100 seconds. 
> We dumped the runqueue (rq) at that moment, and the thread information > is as follows:(CachedAppOptimi and android.bg went unscheduled for > more than 100 seconds.) > (enqueue, sleep, wake: The timestamps indicating when the task was > enqueued, when it slept, and when it woke up.): > > CPU0 9 process is running > |--[status: curr] pid: 773 tsk: 0xffffff80b2a54500 comm: tcpdump > stack: 0xffffffc08e710000 prio: 120 aff: 0f delayed: 0 enqueue: > 129701710483481 sleep: 122565444420050 wake: 122565465620396 vrun: > 258051036097647 deadline: 258051038773377 exec_start: 124562605999432 > sum_ex: 110797759941549 > CFS 9 process(2 groups) is pending > |--[status: curr] pid: 773 comm: tcpdump preempt: 0x100010000 > |--system group--pend: 8 process is in group > | |--curr : None(0) > | |--[status: pend] pid: 7716 tsk: 0xffffff80b6be2e00 comm: t0-26 > stack: 0xffffffc09a168000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749977667220 sleep: 129749977540912 wake: 129749977666643 vrun: > 25100617833006 deadline: 25100618030934 exec_start: 124562594098546 > sum_ex: 7975676204 > | |--[status: pend] pid: 7703 tsk: 0xffffff80a7f01700 comm: t0-23 > stack: 0xffffffc0a0368000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749985123489 sleep: 129749984999989 wake: 129749985123066 vrun: > 25100617824531 deadline: 25100618072759 exec_start: 124562601471047 > sum_ex: 7577607107 > | |--[status: pend] pid: 7677 tsk: 0xffffff808b3d1700 comm: t0-13 > stack: 0xffffffc09ae28000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749985426220 sleep: 129749985303143 wake: 129749985425797 vrun: > 25100617862037 deadline: 25100618288191 exec_start: 124562605521239 > sum_ex: 9375620632 > | |--[status: pend] pid: 7687 tsk: 0xffffff809c5bae00 comm: t0-16 > stack: 0xffffffc0a0388000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749985815873 sleep: 129749985691489 wake: 129749985815450 vrun: > 25100617809713 deadline: 25100619864676 exec_start: 124562602128508 > sum_ex: 9878306928 > | |--[status: pend] pid: 7711 tsk: 
0xffffff812a39dc00 comm: t0-25 > stack: 0xffffffc09a070000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989667489 sleep: 129749989541450 wake: 129749989667066 vrun: > 25100618056222 deadline: 25100620120645 exec_start: 124562605875162 > sum_ex: 7549612132 > | |--[status: pend] pid: 7770 tsk: 0xffffff8098b15c00 comm: t0-34 > stack: 0xffffffc09c430000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749957114066 sleep: 129749956986989 wake: 129749957113681 vrun: > 25100617430414 deadline: 25100620230414 exec_start: 124562573816509 > sum_ex: 6653557883 > | |--[status: pend] pid: 7663 tsk: 0xffffff808b3cae00 comm: t0-7 > stack: 0xffffffc09ee10000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989169873 sleep: 129749989040643 wake: 129749989162489 vrun: > 25100617739245 deadline: 25100620539245 exec_start: 124562605396047 > sum_ex: 8631500166 > | |--[status: pend] pid: 7750 tsk: 0xffffff80f0409700 comm: t0-28 > stack: 0xffffffc09af48000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749978263912 sleep: 129749978138797 wake: 129749978263527 vrun: > 25100617871842 deadline: 25100620671842 exec_start: 124562594675585 > sum_ex: 10184100935 > RT 0 process is pending > > CPU1 12 process is running > |--[status: curr] pid: 7761 tsk: 0xffffff8025db9700 comm: t0-30 > stack: 0xffffffc09aee0000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749985754797 sleep: 129749985632335 wake: 129749985754412 vrun: > 23922316736109 deadline: 23922317701147 exec_start: 125273240674085 > sum_ex: 8544979285 > CFS 12 process(1 groups) is pending > |--system group--curr: 12 process is in group > | |--[status: curr] pid: 7761 comm: t0-30 preempt: 0x100010000 > | |--[status: pend] pid: 7652 tsk: 0xffffff8127964500 comm: t0-1 > stack: 0xffffffc09b288000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988875450 sleep: 129749988752720 wake: 129749988875104 vrun: > 23922317069000 deadline: 23922317662501 exec_start: 125273239582124 > sum_ex: 8796331458 > | |--[status: pend] pid: 7773 tsk: 0xffffff8098b10000 comm: t0-36 
> stack: 0xffffffc09f5c8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989256027 sleep: 129749989133373 wake: 129749989255681 vrun: > 23922317057443 deadline: 23922317664214 exec_start: 125273239946854 > sum_ex: 8600842892 > | |--[status: pend] pid: 7766 tsk: 0xffffff80818f8000 comm: t0-33 > stack: 0xffffffc09adf0000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989636143 sleep: 129749989511181 wake: 129749989635758 vrun: > 23922317077817 deadline: 23922317673818 exec_start: 125273240309354 > sum_ex: 9189242817 > | |--[status: pend] pid: 7764 tsk: 0xffffff80818fc500 comm: t0-32 > stack: 0xffffffc09ae70000 prio: 120 aff: 0f delayed: 1 enqueue: > 129749985375412 sleep: 129749989893143 wake: 129749985375066 vrun: > 23922317087233 deadline: 23922317696809 exec_start: 125273240674085 > sum_ex: 8867997747 > | |--[status: pend] pid: 7658 tsk: 0xffffff8018185c00 comm: t0-4 > stack: 0xffffffc09b240000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986134873 sleep: 129749986012143 wake: 129749986134527 vrun: > 23922316750208 deadline: 23922317714055 exec_start: 125273236997511 > sum_ex: 9299251939 > | |--[status: pend] pid: 7668 tsk: 0xffffff814f61c500 comm: t0-9 > stack: 0xffffffc09e9f8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986515489 sleep: 129749986392873 wake: 129749986515066 vrun: > 23922316748606 deadline: 23922317720145 exec_start: 125273237362049 > sum_ex: 8568645989 > | |--[status: pend] pid: 7651 tsk: 0xffffff8012128000 comm: t0-0 > stack: 0xffffffc09eea8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986898027 sleep: 129749986771758 wake: 129749986897643 vrun: > 23922316823052 deadline: 23922317784939 exec_start: 125273237726049 > sum_ex: 9400079830 > | |--[status: pend] pid: 7697 tsk: 0xffffff80b6581700 comm: t0-19 > stack: 0xffffffc09b0e8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749987282835 sleep: 129749987162335 wake: 129749987282412 vrun: > 23922316848563 deadline: 23922317797946 exec_start: 125273238098280 > sum_ex: 9244661640 > | 
|--[status: pend] pid: 7670 tsk: 0xffffff814f618000 comm: t0-10 > stack: 0xffffffc09dd40000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749987674412 sleep: 129749987549373 wake: 129749987674066 vrun: > 23922316839247 deadline: 23922317798978 exec_start: 125273238470048 > sum_ex: 10064701654 > | |--[status: pend] pid: 7660 tsk: 0xffffff80b415c500 comm: t0-6 > stack: 0xffffffc09ee50000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988071912 sleep: 129749987943489 wake: 129749988064950 vrun: > 23922316196954 deadline: 23922318996954 exec_start: 125273238843202 > sum_ex: 8121803041 > | |--[status: pend] pid: 7700 tsk: 0xffffff817007dc00 comm: t0-22 > stack: 0xffffffc09afd0000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988458643 sleep: 129749988331027 wake: 129749988452412 vrun: > 23922316230410 deadline: 23922319030410 exec_start: 125273239210278 > sum_ex: 8217864682 > RT 0 process is pending > > CPU2 12 process is running > |--[status: curr] pid: 7681 tsk: 0xffffff809c5b8000 comm: t0-15 > stack: 0xffffffc09c5f8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749985836527 sleep: 129749985713873 wake: 129749985836104 vrun: > 23237215363675 deadline: 23237217433790 exec_start: 124989542777297 > sum_ex: 9227596667 > CFS 12 process(1 groups) is pending > |--system group--curr: 12 process is in group > | |--[status: curr] pid: 7681 comm: t0-15 preempt: 0x100010000 > | |--[status: pend] pid: 7777 tsk: 0xffffff80e00d9700 comm: t0-38 > stack: 0xffffffc099df0000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749984668373 sleep: 129749984545950 wake: 129749984668027 vrun: > 23237215480433 deadline: 23237217374857 exec_start: 124989541645989 > sum_ex: 8141721727 > | |--[status: pend] pid: 7679 tsk: 0xffffff802b0dae00 comm: t0-14 > stack: 0xffffffc09b1a8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989585835 sleep: 129749989462066 wake: 129749989585450 vrun: > 23237215692450 deadline: 23237217397027 exec_start: 124989542414874 > sum_ex: 8523879757 > | |--[status: pend] pid: 7696 
tsk: 0xffffff810f735c00 comm: t0-18 > stack: 0xffffffc09b130000 prio: 120 aff: 0f delayed: 1 enqueue: > 129749985457027 sleep: 129749989840797 wake: 129749985456643 vrun: > 23237215692746 deadline: 23237217400592 exec_start: 124989542777297 > sum_ex: 8330286966 > | |--[status: pend] pid: 7672 tsk: 0xffffff8080938000 comm: t0-11 > stack: 0xffffffc09c668000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986216643 sleep: 129749986093450 wake: 129749986216258 vrun: > 23237215367417 deadline: 23237217438148 exec_start: 124989539255414 > sum_ex: 8766621990 > | |--[status: pend] pid: 7657 tsk: 0xffffff80b4159700 comm: t0-3 > stack: 0xffffffc09ee58000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986599258 sleep: 129749986476297 wake: 129749986598873 vrun: > 23237215394091 deadline: 23237217459628 exec_start: 124989539622415 > sum_ex: 10357036283 > | |--[status: pend] pid: 7706 tsk: 0xffffff809323dc00 comm: t0-24 > stack: 0xffffffc09a208000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986979181 sleep: 129749986857066 wake: 129749986978797 vrun: > 23237215445452 deadline: 23237217513876 exec_start: 124989539987607 > sum_ex: 9653857861 > | |--[status: pend] pid: 7666 tsk: 0xffffff808b3cc500 comm: t0-8 > stack: 0xffffffc09ee08000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749987363450 sleep: 129749987240297 wake: 129749987363066 vrun: > 23237215453893 deadline: 23237217522123 exec_start: 124989540355992 > sum_ex: 8568372234 > | |--[status: pend] pid: 7675 tsk: 0xffffff809c5b9700 comm: t0-12 > stack: 0xffffffc09fc30000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749987747066 sleep: 129749987624566 wake: 129749987746643 vrun: > 23237215493101 deadline: 23237217557948 exec_start: 124989540724722 > sum_ex: 8606555185 > | |--[status: pend] pid: 7699 tsk: 0xffffff8170079700 comm: t0-21 > stack: 0xffffffc09afd8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988153989 sleep: 129749988020758 wake: 129749988146373 vrun: > 23237215173170 deadline: 23237217973170 exec_start: 
124989541098798 > sum_ex: 7654163294 > | |--[status: pend] pid: 7656 tsk: 0xffffff80c0649700 comm: t0-2 > stack: 0xffffffc09b278000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988547066 sleep: 129749988415835 wake: 129749988540873 vrun: > 23237215176152 deadline: 23237217976152 exec_start: 124989541468567 > sum_ex: 9647417361 > | |--[status: pend] pid: 7725 tsk: 0xffffff8130439700 comm: t0-27 > stack: 0xffffffc0a0d60000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989203450 sleep: 129749989074950 wake: 129749989196758 vrun: > 23237215183429 deadline: 23237217983429 exec_start: 124989542049104 > sum_ex: 8366970334 > RT 0 process is pending > > CPU3 12 process is running > |--[status: curr] pid: 7752 tsk: 0xffffff812e6f0000 comm: t0-29 > stack: 0xffffffc09af40000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988017297 sleep: 129749987894412 wake: 129749988016873 vrun: > 21935407414521 deadline: 21935409112944 exec_start: 125371011025153 > sum_ex: 10673494974 > CFS 12 process(2 groups) is pending > |--system group--curr: 8 process is in group > | |--[status: curr] pid: 7752 comm: t0-29 preempt: 0x100010001 > | |--[status: pend] pid: 7778 tsk: 0xffffff800f9c1700 comm: t0-39 > stack: 0xffffffc09ac10000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989685373 sleep: 129749989562797 wake: 129749989684989 vrun: > 21935407785235 deadline: 21935408052773 exec_start: 125371010656577 > sum_ex: 8042858876 > | |--[status: pend] pid: 7659 tsk: 0xffffff815c4d0000 comm: t0-5 > stack: 0xffffffc09b1b0000 prio: 120 aff: 0f delayed: 1 enqueue: > 129749987629912 sleep: 129749989947489 wake: 129749987629566 vrun: > 21935407769505 deadline: 21935409105235 exec_start: 125371011025153 > sum_ex: 12135216345 > | |--[status: pend] pid: 7772 tsk: 0xffffff8098b11700 comm: t0-35 > stack: 0xffffffc09c420000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988693412 sleep: 129749988571104 wake: 129749988693104 vrun: > 21935407456083 deadline: 21935409979699 exec_start: 125371009734078 > sum_ex: 
8723848416 > | |--[status: pend] pid: 7776 tsk: 0xffffff800f9c2e00 comm: t0-37 > stack: 0xffffffc09add0000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988928104 sleep: 129749988805258 wake: 129749988927758 vrun: > 21935407461644 deadline: 21935410065530 exec_start: 125371009930192 > sum_ex: 9006979288 > | |--[status: pend] pid: 7698 tsk: 0xffffff802a954500 comm: t0-20 > stack: 0xffffffc09b0b8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749986262566 sleep: 129749986139758 wake: 129749986262143 vrun: > 21935407338508 deadline: 21935410138508 exec_start: 125371007400155 > sum_ex: 10200000722 > | |--[status: pend] pid: 7763 tsk: 0xffffff8025dbdc00 comm: t0-31 > stack: 0xffffffc09aed8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749988403104 sleep: 129749988275489 wake: 129749988396450 vrun: > 21935407357657 deadline: 21935410157657 exec_start: 125371009457694 > sum_ex: 11933014206 > | |--[status: pend] pid: 7688 tsk: 0xffffff80a7f00000 comm: t0-17 > stack: 0xffffffc09c5d8000 prio: 120 aff: 0f delayed: 0 enqueue: > 129749989307912 sleep: 129749989185066 wake: 129749989307566 vrun: > 21935407751480 deadline: 21935410551480 exec_start: 125371010294230 > sum_ex: 11220342066 > |--foreground group--pend: 4 process is in group > | |--curr : None(0) > | |--[status: pend] pid: 1544 tsk: 0xffffff810f4d9700 comm: > CachedAppOptimi stack: 0xffffffc085f28000 prio: 122 aff: 0f delayed: 0 > enqueue: 129626712825690 sleep: 129626010455462 wake: 129626660645229 > vrun: 3911890666624 deadline: 3911895044028 exec_start: > 125254578215081 sum_ex: 2426485755565 > | |--[status: pend] pid: 1371 tsk: 0xffffff80f17ec500 comm: > android.bg stack: 0xffffffc083b70000 prio: 130 aff: 0f delayed: 0 > enqueue: 129627837194378 sleep: 129627518391341 wake: 129627571130110 > vrun: 3911886852543 deadline: 3911911449312 exec_start: > 125254574848159 sum_ex: 135874975341 > | |--[status: pend] pid: 5550 tsk: 0xffffff8009005c00 comm: > aiai-cc-datasha stack: 0xffffffc08ce28000 prio: 139 aff: 0f delayed: 
0 > enqueue: 129627692105571 sleep: 129616484841921 wake: 129627692097994 > vrun: 3911885383858 deadline: 3912076530524 exec_start: > 125243602228529 sum_ex: 1191863852 > | |--[status: pend] pid: 2864 tsk: 0xffffff8160ec8000 comm: > aiai-vc-0 stack: 0xffffffc084de8000 prio: 139 aff: 0f delayed: 0 > enqueue: 129627692070340 sleep: 129616487551075 wake: 129627692056417 > vrun: 3911893877800 deadline: 3912076530524 exec_start: > 125254572184850 sum_ex: 17830231835 > RT 0 process is pending > > CPU4 0 process is running > |--[status: curr] pid: 0 tsk: 0xffffff8080360000 comm: swapper/4 > stack: 0xffffffc080258000 prio: 120 aff: 10 delayed: 0 enqueue: 0 > sleep: 0 wake: 0 vrun: 0 deadline: 8742091 exec_start: 127879762650582 > sum_ex: 0 > CFS 0 process(0 groups) is pending > |--curr : None(0) > RT 0 process is pending > > CPU5 1 process is running > |--[status: curr] pid: 114 tsk: 0xffffff8089259700 comm: > native_hang_det stack: 0xffffffc080e30000 prio: 0 aff: ff delayed: 0 > enqueue: 129749988839758 sleep: 129729508682798 wake: 129749988830566 > vrun: 97734703 deadline: 2774423 exec_start: 128230900434207 sum_ex: > 2725543785 > CFS 0 process(0 groups) is pending > |--curr : None(0) > RT 1 process is pending > |--[status: curr] pid: 114 comm: native_hang_det preempt: 0x100000001 > > CPU6 0 process is running > |--[status: curr] pid: 0 tsk: 0xffffff80803f4500 comm: swapper/6 > stack: 0xffffffc080268000 prio: 120 aff: 40 delayed: 0 enqueue: 0 > sleep: 0 wake: 0 vrun: 0 deadline: 8742091 exec_start: 127969263022680 > sum_ex: 0 > CFS 0 process(0 groups) is pending > |--curr : None(0) > RT 0 process is pending > > CPU7 0 process is running > |--[status: curr] pid: 0 tsk: 0xffffff80803f5c00 comm: swapper/7 > stack: 0xffffffc080270000 prio: 120 aff: 80 delayed: 0 enqueue: 0 > sleep: 0 wake: 0 vrun: 0 deadline: 8742091 exec_start: 128201870087905 > sum_ex: 0 > CFS 0 process(0 groups) is pending > |--curr : None(0) > RT 0 process is pending > > > Thanks! 
> --- > xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload [not found] <CAB8ipk9N5_pO1Awp6PLnWt6hf1Bu_XtY3qGMJKqz=Uf6eZQejw@mail.gmail.com> 2026-04-01 4:25 ` [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload John Stultz @ 2026-04-08 12:19 ` David Laight 2026-04-09 21:39 ` John Stultz 2 siblings, 0 replies; 19+ messages in thread From: David Laight @ 2026-04-08 12:19 UTC (permalink / raw) To: Xuewen Yan Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan, John Stultz On Wed, 1 Apr 2026 10:32:30 +0800 Xuewen Yan <xuewen.yan94@gmail.com> wrote: > Dear Linux maintainers and reviewers, > > I am writing to report a severe scheduling latency issue we recently > discovered on Linux Kernel 6.12. > > Issue Description > > We observed that when running a specific background workload pattern, > certain tasks experience excessive scheduling latency. The delay from > the runnable state to running on the CPU exceeds 10 seconds, and in > extreme cases, it reaches up to 100 seconds. Have you managed to get a low priority process spinning in kernel on the cpu the RT process last ran on? The RT process will always run on the same cpu it ran on provided it is higher priority than the current process running on that cpu. ftrace logging scheduler events is your 'friend' here. 
David > > Environment Details > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > Architecture: [ ARM64] > Hardware: T7300 > Config: gki_defconfig > > RT-app‘s workload Pattern: > > { > "tasks" : { > "t0" : { > "instance" : 40, > "priority" : 0, > "cpus" : [ 0, 1, 2, 3 ], > "taskgroup" : "/background", > "loop" : -1, > "run" : 200, > "sleep" : 50 > } > } > } > > And we have applied the following patchs: > > https://lore.kernel.org/all/20251216111321.966709786@linuxfoundation.org/ > https://lore.kernel.org/all/20260106170509.413636243@linuxfoundation.org/ > https://lore.kernel.org/all/20260323134533.805879358@linuxfoundation.org/ > > > Could you please advise if there are known changes in the eevdf in > 6.12 that might affect this specific workload pattern? > > Thanks! > > BR > --- > xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
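To make the ftrace suggestion above concrete: once sched_switch events have been captured (e.g. from /sys/kernel/tracing/trace after enabling events/sched/sched_switch), totalling per-task CPU occupancy on the RT task's last CPU is a quick way to test the "low priority process spinning in kernel" hypothesis. The sketch below is illustrative only: the sample trace lines and task names (`spin`, `rtask`) are made up, and the regex targets the default ftrace sched_switch line format, so adapt it to your tracer's output.

```python
import re
from collections import defaultdict

# Matches the default ftrace sched_switch format:
#   comm-pid [cpu] flags timestamp: sched_switch: prev_comm=... prev_pid=... ==> next_comm=... next_pid=...
SWITCH_RE = re.compile(
    r"\[(?P<cpu>\d+)\].*?(?P<ts>\d+\.\d+): sched_switch: "
    r"prev_comm=(?P<prev>\S+) prev_pid=(?P<ppid>\d+).*?"
    r"next_comm=(?P<next>\S+) next_pid=(?P<npid>\d+)")

def cpu_occupancy(lines, cpu):
    """Sum, per (comm, pid), how long each task held the given CPU."""
    busy = defaultdict(float)
    last = {}  # cpu -> (comm, pid, timestamp the task was switched in)
    for line in lines:
        m = SWITCH_RE.search(line)
        if not m or int(m["cpu"]) != cpu:
            continue
        ts = float(m["ts"])
        if cpu in last:
            comm, pid, t0 = last[cpu]
            busy[(comm, pid)] += ts - t0
        last[cpu] = (m["next"], int(m["npid"]), ts)
    return dict(busy)

# Hypothetical snippet: a spinner holds CPU0 for ~2s before anything else runs.
sample = [
    "spin-1234 [000] d..2. 100.000000: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=spin next_pid=1234 next_prio=120",
    "spin-1234 [000] d..2. 102.000000: sched_switch: prev_comm=spin prev_pid=1234 prev_prio=120 prev_state=R ==> next_comm=rtask next_pid=99 next_prio=10",
]
print(cpu_occupancy(sample, 0))  # → {('spin', 1234): 2.0}
```

A task dominating the occupancy totals on the RT task's last CPU would point at the scenario David describes.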
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload [not found] <CAB8ipk9N5_pO1Awp6PLnWt6hf1Bu_XtY3qGMJKqz=Uf6eZQejw@mail.gmail.com> 2026-04-01 4:25 ` [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload John Stultz 2026-04-08 12:19 ` David Laight @ 2026-04-09 21:39 ` John Stultz 2026-04-10 3:31 ` Xuewen Yan 2 siblings, 1 reply; 19+ messages in thread From: John Stultz @ 2026-04-09 21:39 UTC (permalink / raw) To: Xuewen Yan Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > I am writing to report a severe scheduling latency issue we recently > discovered on Linux Kernel 6.12. > > Issue Description > > We observed that when running a specific background workload pattern, > certain tasks experience excessive scheduling latency. The delay from > the runnable state to running on the CPU exceeds 10 seconds, and in > extreme cases, it reaches up to 100 seconds. > > Environment Details > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > Architecture: [ ARM64] > Hardware: T7300 > Config: gki_defconfig > > RT-app‘s workload Pattern: > > { > "tasks" : { > "t0" : { > "instance" : 40, > "priority" : 0, > "cpus" : [ 0, 1, 2, 3 ], > "taskgroup" : "/background", > "loop" : -1, > "run" : 200, > "sleep" : 50 > } > } > } > So, with this config I think I may have reproduced it on a device (using android16-6.12). I've not quite seen 10+ seconds, but I have seen >2second delays for kworker threads (though usually the max seems to be around 600ms). 
Unfortunately trying to reproduce using the same (android16-6.12) kernel branch with qemu initially hasn't been successful (and has been a bit of a yak shaving adventure: rt-app needs cgroupv1, which newer debian/systemd no longer supports, so I installed a debian11 image and had to build rt-app and its dependencies from source - then found the perfetto binaries require a newer glibc, so had to fetch and build perfetto from scratch as well). I can't see any similarly sized delays there. Out of curiosity, what are you using to detect the problem when you have rt-app running in the background? I've been tinkering with using cyclictest (-m -t -a --policy=SCHED_OTHER -b 1000000) to try to catch > 1sec latencies, but curious if you had something better? thanks -john ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-09 21:39 ` John Stultz @ 2026-04-10 3:31 ` Xuewen Yan 2026-04-10 4:01 ` John Stultz 0 siblings, 1 reply; 19+ messages in thread From: Xuewen Yan @ 2026-04-10 3:31 UTC (permalink / raw) To: John Stultz Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan Hi John, On Fri, Apr 10, 2026 at 5:39 AM John Stultz <jstultz@google.com> wrote: > > On Tue, Mar 31, 2026 at 7:32 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > > > I am writing to report a severe scheduling latency issue we recently > > discovered on Linux Kernel 6.12. > > > > Issue Description > > > > We observed that when running a specific background workload pattern, > > certain tasks experience excessive scheduling latency. The delay from > > the runnable state to running on the CPU exceeds 10 seconds, and in > > extreme cases, it reaches up to 100 seconds. > > > > Environment Details > > > > Kernel Version: 6.12.58-android16-6-g3835fd28159d-ab000018-4k > > Architecture: [ ARM64] > > Hardware: T7300 > > Config: gki_defconfig > > > > RT-app‘s workload Pattern: > > > > { > > "tasks" : { > > "t0" : { > > "instance" : 40, > > "priority" : 0, > > "cpus" : [ 0, 1, 2, 3 ], > > "taskgroup" : "/background", > > "loop" : -1, > > "run" : 200, > > "sleep" : 50 > > } > > } > > } > > > > So, with this config I think I may have reproduced it on a device > (using android16-6.12). I've not quite seen 10+ seconds, but I have > seen >2second delays for kworker threads (though usually the max seems > to be around 600ms). Thanks for the detailed update! It’s great to hear that you’ve managed to reproduce the issue on a real device. Even though the latency is around 2 seconds (instead of 10+), that still significantly confirms the problem exists. 
The difference in magnitude might just be due to specific background load conditions. > > Unfortunately trying to reproduce using the same (andorid16-6.12) > kernel branch with qemu initially hasn't been successful (and has been > a bit of a yak shaving adventure: rt-app needs cgroupv1, which newer > debian/systemd doesn't support anylonger, so installed a debian11 > image and had to build rt-app and its dependencies from source - then > found perfetto binaries require a newer glibc so had to fetch and > build perfetto from scratch as well). I can't see any similarly > sized delays there. > > Out of curiosity, what are you using to detect the problem when you > have rt-app running in the background? I've been tinkering with using > cyclictest (-m -t -a --policy=SCHED_OTHER -b 1000000) to try to catch > > 1sec latencies, but curious if you had something better? > We use the runqslower eBPF tool; the code is in the kernel tree at "tools/bpf/runqslower". Thanks! BR --- xuewen ^ permalink raw reply [flat|nested] 19+ messages in thread
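For readers unfamiliar with runqslower: conceptually it records the timestamp at which a task becomes runnable (sched_wakeup) and reports the delta when the task is actually switched in (sched_switch), flagging deltas above a threshold. The real tool does this in-kernel with BPF; the sketch below is a simplified user-space model of the same measurement, and the event tuples are hypothetical.

```python
def runq_latencies(events, threshold_us):
    """events: iterable of (ts_us, kind, pid), kind in {'wakeup', 'switch_in'}.
    Returns [(pid, latency_us)] for runnable->running delays >= threshold."""
    runnable_since = {}
    slow = []
    for ts, kind, pid in events:
        if kind == "wakeup":
            # Keep the first wakeup timestamp; later wakeups while still
            # runnable don't reset the clock.
            runnable_since.setdefault(pid, ts)
        elif kind == "switch_in" and pid in runnable_since:
            lat = ts - runnable_since.pop(pid)
            if lat >= threshold_us:
                slow.append((pid, lat))
    return slow

# Hypothetical stream: pid 42 waits 1.5s for a CPU, pid 7 only 100us.
events = [
    (1_000_000, "wakeup", 42),
    (1_000_050, "wakeup", 7),
    (1_000_150, "switch_in", 7),
    (2_500_000, "switch_in", 42),
]
print(runq_latencies(events, threshold_us=10_000))  # → [(42, 1500000)]
```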
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-10 3:31 ` Xuewen Yan @ 2026-04-10 4:01 ` John Stultz 2026-04-10 12:24 ` Dietmar Eggemann 2026-04-13 20:44 ` John Stultz 0 siblings, 2 replies; 19+ messages in thread From: John Stultz @ 2026-04-10 4:01 UTC (permalink / raw) To: Xuewen Yan Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Thu, Apr 9, 2026 at 8:31 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > On Fri, Apr 10, 2026 at 5:39 AM John Stultz <jstultz@google.com> wrote: >> > > So, with this config I think I may have reproduced it on a device > > (using android16-6.12). I've not quite seen 10+ seconds, but I have > > seen >2second delays for kworker threads (though usually the max seems > > to be around 600ms). > > Thanks for the detailed update! It’s great to hear that you’ve managed > to reproduce the issue on a real device. Even though the latency is > around 2 seconds (instead of 10+), that still significantly confirms > the problem exists. The difference in magnitude might just be due to > specific background load conditions. So, I believe there is an issue in the android16-6.12 tree where some changes from upstream that are not in 6.12-stable were backported, and then some fixes from 6.12-stable were reverted, and the mix of the two is causing problems. I've tried to align the tree more closely to 6.12-stable and I'm seeing the results improve, so now it's just a matter of trying to de-tangle things in the android tree. So, for the community folks: I don't think upstream or the -stable tree is connected to what is being reported. thanks -john ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-10 4:01 ` John Stultz @ 2026-04-10 12:24 ` Dietmar Eggemann 2026-04-13 20:44 ` John Stultz 1 sibling, 0 replies; 19+ messages in thread From: Dietmar Eggemann @ 2026-04-10 12:24 UTC (permalink / raw) To: John Stultz, Xuewen Yan Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On 10.04.26 05:01, John Stultz wrote: > On Thu, Apr 9, 2026 at 8:31 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: >> On Fri, Apr 10, 2026 at 5:39 AM John Stultz <jstultz@google.com> wrote: [...] > So, I believe there is an issue in the android16-6.12 tree where some > changes from upstream, not in 6.12-stable were backported, and then > some fixes from 6.12-stable were reverted, and the mix of the two is > causing problems. I've tried to align the tree more closely to > 6.12-stable and I'm seeing the results improve, so now its just a > matter of trying to de-tangle things in the android tree. > > So, for the community folks: I don't think upstream or the -stable > tree is connected to what is being reported. Good to hear ;-) Just did the tests again with the new rt-app file but still there wasn't anything spectacular to see. https://lore.kernel.org/r/3ed0139b-9e34-491d-80cf-e818fadadd06@arm.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload 2026-04-10 4:01 ` John Stultz 2026-04-10 12:24 ` Dietmar Eggemann @ 2026-04-13 20:44 ` John Stultz 1 sibling, 0 replies; 19+ messages in thread From: John Stultz @ 2026-04-13 20:44 UTC (permalink / raw) To: Xuewen Yan Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Benjamin Segall, Mel Gorman, Valentin Schneider, linux-kernel, 王科 (Ke Wang), Xuewen Yan, hongyu.jin@unisoc.com, guohua.yan On Thu, Apr 9, 2026 at 9:01 PM John Stultz <jstultz@google.com> wrote: > > On Thu, Apr 9, 2026 at 8:31 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote: > > On Fri, Apr 10, 2026 at 5:39 AM John Stultz <jstultz@google.com> wrote: > >> > > > So, with this config I think I may have reproduced it on a device > > > (using android16-6.12). I've not quite seen 10+ seconds, but I have > > > seen >2second delays for kworker threads (though usually the max seems > > > to be around 600ms). > > > > Thanks for the detailed update! It’s great to hear that you’ve managed > > to reproduce the issue on a real device. Even though the latency is > > around 2 seconds (instead of 10+), that still significantly confirms > > the problem exists. The difference in magnitude might just be due to > > specific background load conditions. > > So, I believe there is an issue in the android16-6.12 tree where some > changes from upstream, not in 6.12-stable were backported, and then > some fixes from 6.12-stable were reverted, and the mix of the two is > causing problems. I've tried to align the tree more closely to > 6.12-stable and I'm seeing the results improve, so now its just a > matter of trying to de-tangle things in the android tree. > > So, for the community folks: I don't think upstream or the -stable > tree is connected to what is being reported. So unfortunately, it looks like I'm going to have to eat my words here. 
:( In the android16-6.12 tree, the behavior has been isolated down to a backport of commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag"). https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d71a9c6160479899ee744d2c6d6602a191deb1f Specifically, the `place_entity(cfs_rq, se, 0);` addition in `reweight_entity()`. I was assuming android16-6.12 was missing other related changes causing the bad behavior, but Xuewen pointed out similar problems could be seen on android17-6.18. I aligned that tree to the 6.18-stable branch, and could also reproduce it. Additionally, I tested with our android-mainline branch (as of 6.19 the base) and it showed the same issue. So this does "look" to be upstream related. Removing the place_entity() line added in commit 6d71a9c61604 from reweight_entity() seems to prevent the behavior. I've been able to trigger this "bad behavior" on devices using the rt-app with the configuration[1] Xuewen first provided (putting 10 spinners per cpu on the bottom 4 cpus in a background v1 cpu cgroup). Then I run `cyclictest -m -t -a --policy=SCHED_OTHER -b 1000000 -D 120s` in the root cgroup, and (usually well) within two minutes I'll see > 1second delays in cyclictest. If I remove the cgroup from the rt-app config, the issue doesn't reproduce. Unfortunately I've only been able to reproduce this on device, which requires the android kernel tree. I've installed an old debian11 image on x86 QEMU to be able to utilize cpu v1 cgroup support, but I haven't reproduced the issue there, which is a bit confounding. The issue doesn't immediately trigger, but usually after a few seconds of normal behavior I'll start to see cyclictest on one or two of the cpus start to trip larger 100ms+ latencies until it hits the 1second boundary. Late last week I went digging into place_entity() to try to understand why it was tripping, but wasn't very successful in narrowing down what might be going wrong. 
I did see that the place_entity() call seems to always be on a non-task se, and it's almost always exiting at the `if (se->rel_deadline)` case. It does go through the lag calculation conditional, but not always. Anyway, I'm going to continue digging into this, but I just wanted to give folks a heads up in case there were any ideas to explore. thanks -john [1] Xuewen's rt-app config: { "tasks" : { "t0" : { "instance" : 40, "priority" : 0, "cpus" : [ 0, 1, 2, 3 ], "taskgroup" : "/background", "loop" : -1, "run" : 200, "sleep" : 50 } } } ^ permalink raw reply [flat|nested] 19+ messages in thread
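For anyone following along with the place_entity() lag handling discussed above: on (re)insertion, upstream scales an entity's stored virtual lag by roughly (W + w)/W, where W is the queue's weight before insertion and w the entity's weight, because adding the entity pulls the weighted-average vruntime toward it. Below is a toy floating-point model of that adjustment only; the kernel works in fixed-point with scaled load weights and also folds the current task's weight into W, so treat this as a sketch of the arithmetic, not the kernel code.

```python
def place_lag(stored_vlag, rq_weight, se_weight):
    """Scale stored virtual lag for (re)insertion: vl' = vl * (W + w) / W.

    After insertion the average vruntime shifts, so the entity's realized
    lag becomes vl' * W / (W + w); inflating by (W + w)/W at placement
    makes the realized lag come out equal to the stored value."""
    if rq_weight == 0:
        return stored_vlag  # empty queue: nothing to compensate for
    return stored_vlag * (rq_weight + se_weight) / rq_weight

# Toy numbers: an entity with 1.0 units of positive lag rejoining a queue
# whose total weight equals its own weight is placed with 2.0 units, so
# that after the average shifts it ends up with exactly 1.0 again.
print(place_lag(1.0, rq_weight=1024, se_weight=1024))  # → 2.0
```

Working the algebra backwards (realized lag = vl' * W / (W + w)) is one way to sanity-check whether a suspect placement produced the lag you expected from a trace.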
end of thread, other threads:[~2026-04-14 5:18 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CAB8ipk9N5_pO1Awp6PLnWt6hf1Bu_XtY3qGMJKqz=Uf6eZQejw@mail.gmail.com>
2026-04-01 4:25 ` [Question] Sched: Severe scheduling latency (>10s) observed on kernel 6.12 with specific workload John Stultz
2026-04-01 6:04 ` Xuewen Yan
2026-04-01 10:05 ` Vincent Guittot
2026-04-01 10:48 ` Xuewen Yan
2026-04-01 13:00 ` Dietmar Eggemann
2026-04-02 5:16 ` Xuewen Yan
2026-04-02 14:58 ` Dietmar Eggemann
2026-04-08 2:50 ` Xuewen Yan
2026-04-10 11:13 ` Dietmar Eggemann
2026-04-14 5:18 ` Xuewen Yan
2026-04-01 14:01 ` Vincent Guittot
2026-04-02 5:11 ` Xuewen Yan
2026-04-02 5:24 ` Xuewen Yan
2026-04-08 12:19 ` David Laight
2026-04-09 21:39 ` John Stultz
2026-04-10 3:31 ` Xuewen Yan
2026-04-10 4:01 ` John Stultz
2026-04-10 12:24 ` Dietmar Eggemann
2026-04-13 20:44 ` John Stultz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox