BUG Report: Fork benchmark drop by 30% on aarch64

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* BUG Report: Fork benchmark drop by 30% on aarch64
@ 2025-02-05 15:10 Hagar Hemdan
  2025-02-07  9:14 ` Dietmar Eggemann
  0 siblings, 1 reply; 14+ messages in thread
From: Hagar Hemdan @ 2025-02-05 15:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Hagar Hemdan, wuchi, linux-kernel, Mohamed, Abuelfotoh, Hazem

Hi,

There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
drop on x86_64 using kernel v6.13.1.

Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
of sched_move_task when config autogroup") which merged starting
v6.4-rc1.

The regression only happens when number of CPUs is equal to number
of threads [2] that fork test is creating which means it's only visible
under CPU contention.

I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.

I noticed this regression exists only when autogroup config is enabled.

Run the fork test with these combinations and autogroup is enabled:

Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
----------+---------------------+--------------------+------------------
aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
----------+---------------------+--------------------+------------------

It seems that the commit is capping the amount of CPU resources that can
be utilized leaving around 18% idle in case of aarch64 and 3% idle in
x86_64 case which is likely the main reason behind the reported fork
regression.

When autogroup is disabled:

Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
----------+---------------------+--------------------+------------------
aarch64   | without             | 19877.8            |  2.2 us, 80.1 sy  
aarch64   | with                | 20086.3 (~same)    |  1.9 us, 80.2 sy
x86_64    | without             | 24974.2            |  4.9 us, 92.5 sy 
x86_64    | with                | 24921.5 (~same)    |  4.9 us, 92.4 sy
----------+---------------------+--------------------+------------------

So when autogroup disabled, I still see the amount of idle CPU resources 
18%, 3% on aarch64 and x86_64 regardless of commit.

Is this performance drop an expected of this commit when autogroup is
enabled?

Thanks,
Hagar

[1] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench
[2] Used command: ./Run -c 4 spawn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-05 15:10 BUG Report: Fork benchmark drop by 30% on aarch64 Hagar Hemdan
@ 2025-02-07  9:14 ` Dietmar Eggemann
  2025-02-07 11:07   ` Hagar Hemdan
  0 siblings, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2025-02-07  9:14 UTC (permalink / raw)
  To: Hagar Hemdan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: wuchi, linux-kernel, Mohamed, Abuelfotoh, Hazem

Hi Hagar,

On 05/02/2025 16:10, Hagar Hemdan wrote:
> Hi,
> 
> There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
> drop on x86_64 using kernel v6.13.1.
> 
> Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
> of sched_move_task when config autogroup") which merged starting
> v6.4-rc1.
> 
> The regression only happens when number of CPUs is equal to number
> of threads [2] that fork test is creating which means it's only visible
> under CPU contention.
> 
> I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
> and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
> 
> I noticed this regression exists only when autogroup config is enabled.

So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have: 

static inline void sched_autogroup_exit_task(struct task_struct *p) { }

I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
shows this issue?

> 
> Run the fork test with these combinations and autogroup is enabled:
> 
> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> ----------+---------------------+--------------------+------------------
> aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
> aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
> x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
> x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
> ----------+---------------------+--------------------+------------------

Can you rerun with:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f9..62cc50c79a78 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
         * group changes.
         */
        group = sched_get_task_group(tsk);
-       if (group == tsk->sched_task_group)
+       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
                return;

> 
> It seems that the commit is capping the amount of CPU resources that can
> be utilized leaving around 18% idle in case of aarch64 and 3% idle in
> x86_64 case which is likely the main reason behind the reported fork
> regression.
> 
> When autogroup is disabled:
> 
> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> ----------+---------------------+--------------------+------------------
> aarch64   | without             | 19877.8            |  2.2 us, 80.1 sy  
> aarch64   | with                | 20086.3 (~same)    |  1.9 us, 80.2 sy
> x86_64    | without             | 24974.2            |  4.9 us, 92.5 sy 
> x86_64    | with                | 24921.5 (~same)    |  4.9 us, 92.4 sy
> ----------+---------------------+--------------------+------------------
> 
> So when autogroup disabled, I still see the amount of idle CPU resources 
> 18%, 3% on aarch64 and x86_64 regardless of commit.
> 
> Is this performance drop an expected of this commit when autogroup is
> enabled?
> 
> Thanks,
> Hagar
> 
> [1] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench
> [2] Used command: ./Run -c 4 spawn
> 


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-07  9:14 ` Dietmar Eggemann
@ 2025-02-07 11:07   ` Hagar Hemdan
  2025-02-10 10:38     ` Dietmar Eggemann
  0 siblings, 1 reply; 14+ messages in thread
From: Hagar Hemdan @ 2025-02-07 11:07 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, wuchi,
	linux-kernel, Mohamed, Abuelfotoh, Hazem

On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> Hi Hagar,
> 
> On 05/02/2025 16:10, Hagar Hemdan wrote:
> > Hi,
> > 
> > There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
> > drop on x86_64 using kernel v6.13.1.
> > 
> > Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
> > of sched_move_task when config autogroup") which merged starting
> > v6.4-rc1.
> > 
> > The regression only happens when number of CPUs is equal to number
> > of threads [2] that fork test is creating which means it's only visible
> > under CPU contention.
> > 
> > I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
> > and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
> > 
> > I noticed this regression exists only when autogroup config is enabled.
> 
> So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have: 
> 
> static inline void sched_autogroup_exit_task(struct task_struct *p) { }
> 
> I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
> shows this issue?
yes, when I do 'echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled',
It behaves like the disable 'CONFIG_SCHED_AUTOGROUP'.
> 
> > 
> > Run the fork test with these combinations and autogroup is enabled:
> > 
> > Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> > ----------+---------------------+--------------------+------------------
> > aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
> > aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
> > x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
> > x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
> > ----------+---------------------+--------------------+------------------
> 
> Can you rerun with:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..62cc50c79a78 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
>          * group changes.
>          */
>         group = sched_get_task_group(tsk);
> -       if (group == tsk->sched_task_group)
> +       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
>                 return;
I tried that and I see it fixed the regression and the cpu utilization
is 100% with it.

I'd like to ask if this like reverting the patch in case of exit path
and also means the enqueue/dequeue are needed in case of task exiting,
right?

Thanks for replying :)
> 
> > 
> > It seems that the commit is capping the amount of CPU resources that can
> > be utilized leaving around 18% idle in case of aarch64 and 3% idle in
> > x86_64 case which is likely the main reason behind the reported fork
> > regression.
> > 
> > When autogroup is disabled:
> > 
> > Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> > ----------+---------------------+--------------------+------------------
> > aarch64   | without             | 19877.8            |  2.2 us, 80.1 sy  
> > aarch64   | with                | 20086.3 (~same)    |  1.9 us, 80.2 sy
> > x86_64    | without             | 24974.2            |  4.9 us, 92.5 sy 
> > x86_64    | with                | 24921.5 (~same)    |  4.9 us, 92.4 sy
> > ----------+---------------------+--------------------+------------------
> > 
> > So when autogroup disabled, I still see the amount of idle CPU resources 
> > 18%, 3% on aarch64 and x86_64 regardless of commit.
> > 
> > Is this performance drop an expected of this commit when autogroup is
> > enabled?
> > 
> > Thanks,
> > Hagar
> > 
> > [1] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench
> > [2] Used command: ./Run -c 4 spawn
> > 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-07 11:07   ` Hagar Hemdan
@ 2025-02-10 10:38     ` Dietmar Eggemann
  2025-02-10 21:31       ` Hagar Hemdan
  0 siblings, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2025-02-10 10:38 UTC (permalink / raw)
  To: Hagar Hemdan
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, wuchi,
	linux-kernel, Mohamed, Abuelfotoh, Hazem

On 07/02/2025 12:07, Hagar Hemdan wrote:
> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>> Hi Hagar,
>>
>> On 05/02/2025 16:10, Hagar Hemdan wrote:
>>> Hi,
>>>
>>> There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
>>> drop on x86_64 using kernel v6.13.1.
>>>
>>> Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
>>> of sched_move_task when config autogroup") which merged starting
>>> v6.4-rc1.
>>>
>>> The regression only happens when number of CPUs is equal to number
>>> of threads [2] that fork test is creating which means it's only visible
>>> under CPU contention.
>>>
>>> I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
>>> and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
>>>
>>> I noticed this regression exists only when autogroup config is enabled.
>>
>> So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have: 
>>
>> static inline void sched_autogroup_exit_task(struct task_struct *p) { }
>>
>> I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
>> shows this issue?
> yes, when I do 'echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled',
> It behaves like the disable 'CONFIG_SCHED_AUTOGROUP'.

OK.

>>> Run the fork test with these combinations and autogroup is enabled:
>>>
>>> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
>>> ----------+---------------------+--------------------+------------------
>>> aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
>>> aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
>>> x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
>>> x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
>>> ----------+---------------------+--------------------+------------------
>>
>> Can you rerun with:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3e5a6bf587f9..62cc50c79a78 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
>>          * group changes.
>>          */
>>         group = sched_get_task_group(tsk);
>> -       if (group == tsk->sched_task_group)
>> +       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
>>                 return;
> I tried that and I see it fixed the regression and the cpu utilization
> is 100% with it.

W/ this one, we force the 'spawn' tasks into: 

  sched_change_group() -> task_change_group_fair()

> I'd like to ask if this like reverting the patch in case of exit path
> and also means the enqueue/dequeue are needed in case of task exiting,
> right?

The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
call dequeue_task(), put_prev_task(), enqueue_task() and
set_next_task().

I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
case of root tg) update in:

  task_change_group_fair() -> detach_task_cfs_rq() -> ...,
  attach_task_cfs_rq() -> ...

since this is used for WF_FORK, WF_EXEC handling in wakeup:

  select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
  sched_balance_find_dst_group_cpu()

in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.

You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
then they match "/" == "/".

I assume you run Ubuntu on your AWS instances? What kind of
'cgroup/taskgroup' related setup are you using?

Can you run w/ this debug snippet w/ and w/o AG enabled?

-->8--

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f9..c696740177b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9035,6 +9035,8 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group
                set_task_rq(tsk, task_cpu(tsk));
 }
 
+extern void task_group_path(struct task_group *tg, char *path, int plen);
+
 /*
  * Change task's runqueue when it moves between groups.
  *
@@ -9048,6 +9050,8 @@ void sched_move_task(struct task_struct *tsk)
                DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
        struct task_group *group;
        struct rq *rq;
+       int cpu = cpu_of(rq_of(cfs_rq_of(&tsk->se)));
+       char buf_1[64], buf_2[64];
 
        CLASS(task_rq_lock, rq_guard)(tsk);
        rq = rq_guard.rq;
@@ -9057,6 +9061,13 @@ void sched_move_task(struct task_struct *tsk)
         * group changes.
         */
        group = sched_get_task_group(tsk);
+
+       task_group_path(group, buf_1, sizeof(buf_1));
+       task_group_path(tsk->sched_task_group, buf_2, sizeof(buf_2));
+
+       trace_printk("%s %d cpu=%d group=%s tsk->sched_task_group=%s\n",
+                    tsk->comm, tsk->pid, cpu, buf_1, buf_2);
+
        if (group == tsk->sched_task_group)
                return;
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a1be00a988bf..c967d55f971f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -701,7 +701,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
 
-static void task_group_path(struct task_group *tg, char *path, int plen)
+extern void task_group_path(struct task_group *tg, char *path, int plen);
+
+void task_group_path(struct task_group *tg, char *path, int plen)
 {
        if (autogroup_path(tg, path, plen))
                return;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26958431deb7..f90d28be9695 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13198,9 +13198,16 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 }
 
+extern void task_group_path(struct task_group *tg, char *path, int plen);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_change_group_fair(struct task_struct *p)
 {
+       int cpu = cpu_of(rq_of(cfs_rq_of(&p->se)));
+       struct task_group *tg = task_group(p);
+       unsigned long cfs_rq_load_avg_pre, cfs_rq_load_avg_post;
+       char buf[64];
+
        /*
         * We couldn't detach or attach a forked task which
         * hasn't been woken up by wake_up_new_task().
@@ -13208,14 +13215,33 @@ static void task_change_group_fair(struct task_struct *p)
        if (READ_ONCE(p->__state) == TASK_NEW)
                return;
 
+       task_group_path(tg, buf, sizeof(buf));
+       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
+
        detach_task_cfs_rq(p);
 
+       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
+
+       trace_printk("%s %d (d) cpu=%d tg=%s load_avg=%lu->%lu\n",
+                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
+                    cfs_rq_load_avg_post);
+
 #ifdef CONFIG_SMP
        /* Tell se's cfs_rq has been changed -- migrated */
        p->se.avg.last_update_time = 0;
 #endif
        set_task_rq(p, task_cpu(p));
+
+       task_group_path(tg, buf, sizeof(buf));
+       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
+
        attach_task_cfs_rq(p);
+
+       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
+
+       trace_printk("%s %d (a) cpu=%d tg=%s load_avg=%lu->%lu\n",
+                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
+                    cfs_rq_load_avg_post);
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-10 10:38     ` Dietmar Eggemann
@ 2025-02-10 21:31       ` Hagar Hemdan
  2025-02-11 16:27         ` Dietmar Eggemann
  0 siblings, 1 reply; 14+ messages in thread
From: Hagar Hemdan @ 2025-02-10 21:31 UTC (permalink / raw)
  To: Dietmar Eggemann, hagarhem; +Cc: abuehaze, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 8099 bytes --]

On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> On 07/02/2025 12:07, Hagar Hemdan wrote:
> > On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >> Hi Hagar,
> >>
> >> On 05/02/2025 16:10, Hagar Hemdan wrote:
> >>> Hi,
> >>>
> >>> There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
> >>> drop on x86_64 using kernel v6.13.1.
> >>>
> >>> Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
> >>> of sched_move_task when config autogroup") which merged starting
> >>> v6.4-rc1.
> >>>
> >>> The regression only happens when number of CPUs is equal to number
> >>> of threads [2] that fork test is creating which means it's only visible
> >>> under CPU contention.
> >>>
> >>> I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
> >>> and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
> >>>
> >>> I noticed this regression exists only when autogroup config is enabled.
> >>
> >> So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have: 
> >>
> >> static inline void sched_autogroup_exit_task(struct task_struct *p) { }
> >>
> >> I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
> >> shows this issue?
> > yes, when I do 'echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled',
> > It behaves like the disable 'CONFIG_SCHED_AUTOGROUP'.
> 
> OK.
> 
> >>> Run the fork test with these combinations and autogroup is enabled:
> >>>
> >>> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> >>> ----------+---------------------+--------------------+------------------
> >>> aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
> >>> aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
> >>> x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
> >>> x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
> >>> ----------+---------------------+--------------------+------------------
> >>
> >> Can you rerun with:
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 3e5a6bf587f9..62cc50c79a78 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
> >>          * group changes.
> >>          */
> >>         group = sched_get_task_group(tsk);
> >> -       if (group == tsk->sched_task_group)
> >> +       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
> >>                 return;
> > I tried that and I see it fixed the regression and the cpu utilization
> > is 100% with it.
> 
> W/ this one, we force the 'spawn' tasks into: 
> 
>   sched_change_group() -> task_change_group_fair()
> 
> > I'd like to ask if this like reverting the patch in case of exit path
> > and also means the enqueue/dequeue are needed in case of task exiting,
> > right?
> 
> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> call dequeue_task(), put_prev_task(), enqueue_task() and
> set_next_task().
> 
> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> case of root tg) update in:
> 
>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>   attach_task_cfs_rq() -> ...
> 
> since this is used for WF_FORK, WF_EXEC handling in wakeup:
> 
>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>   sched_balance_find_dst_group_cpu()
> 
> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
> 
> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> then they match "/" == "/".
> 
> I assume you run Ubuntu on your AWS instances? What kind of
> 'cgroup/taskgroup' related setup are you using?

I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
AL2023 uses cgroupv2 by default.
> 
> Can you run w/ this debug snippet w/ and w/o AG enabled?

I have run that and have attached the trace files to this email.
> 
> -->8--
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..c696740177b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9035,6 +9035,8 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group
>                 set_task_rq(tsk, task_cpu(tsk));
>  }
>  
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  /*
>   * Change task's runqueue when it moves between groups.
>   *
> @@ -9048,6 +9050,8 @@ void sched_move_task(struct task_struct *tsk)
>                 DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
>         struct task_group *group;
>         struct rq *rq;
> +       int cpu = cpu_of(rq_of(cfs_rq_of(&tsk->se)));
> +       char buf_1[64], buf_2[64];
>  
>         CLASS(task_rq_lock, rq_guard)(tsk);
>         rq = rq_guard.rq;
> @@ -9057,6 +9061,13 @@ void sched_move_task(struct task_struct *tsk)
>          * group changes.
>          */
>         group = sched_get_task_group(tsk);
> +
> +       task_group_path(group, buf_1, sizeof(buf_1));
> +       task_group_path(tsk->sched_task_group, buf_2, sizeof(buf_2));
> +
> +       trace_printk("%s %d cpu=%d group=%s tsk->sched_task_group=%s\n",
> +                    tsk->comm, tsk->pid, cpu, buf_1, buf_2);
> +
>         if (group == tsk->sched_task_group)
>                 return;
>  
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index a1be00a988bf..c967d55f971f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -701,7 +701,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  static DEFINE_SPINLOCK(sched_debug_lock);
>  static char group_path[PATH_MAX];
>  
> -static void task_group_path(struct task_group *tg, char *path, int plen)
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
> +void task_group_path(struct task_group *tg, char *path, int plen)
>  {
>         if (autogroup_path(tg, path, plen))
>                 return;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26958431deb7..f90d28be9695 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13198,9 +13198,16 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #endif
>  }
>  
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void task_change_group_fair(struct task_struct *p)
>  {
> +       int cpu = cpu_of(rq_of(cfs_rq_of(&p->se)));
> +       struct task_group *tg = task_group(p);
> +       unsigned long cfs_rq_load_avg_pre, cfs_rq_load_avg_post;
> +       char buf[64];
> +
>         /*
>          * We couldn't detach or attach a forked task which
>          * hasn't been woken up by wake_up_new_task().
> @@ -13208,14 +13215,33 @@ static void task_change_group_fair(struct task_struct *p)
>         if (READ_ONCE(p->__state) == TASK_NEW)
>                 return;
>  
> +       task_group_path(tg, buf, sizeof(buf));
> +       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>         detach_task_cfs_rq(p);
>  
> +       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +       trace_printk("%s %d (d) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                    cfs_rq_load_avg_post);
> +
>  #ifdef CONFIG_SMP
>         /* Tell se's cfs_rq has been changed -- migrated */
>         p->se.avg.last_update_time = 0;
>  #endif
>         set_task_rq(p, task_cpu(p));
> +
> +       task_group_path(tg, buf, sizeof(buf));
> +       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>         attach_task_cfs_rq(p);
> +
> +       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +       trace_printk("%s %d (a) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                    cfs_rq_load_avg_post);
>  }

[-- Attachment #2: fork_traces.tar.gz --]
[-- Type: application/gzip, Size: 446011 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-10 21:31       ` Hagar Hemdan
@ 2025-02-11 16:27         ` Dietmar Eggemann
  2025-02-11 21:40           ` Hagar Hemdan
  0 siblings, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2025-02-11 16:27 UTC (permalink / raw)
  To: Hagar Hemdan; +Cc: abuehaze, linux-kernel

On 10/02/2025 22:31, Hagar Hemdan wrote:
> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>>>> Hi Hagar,
>>>>
>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:

[...]

>> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
>> call dequeue_task(), put_prev_task(), enqueue_task() and
>> set_next_task().
>>
>> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
>> case of root tg) update in:
>>
>>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>>   attach_task_cfs_rq() -> ...
>>
>> since this is used for WF_FORK, WF_EXEC handling in wakeup:
>>
>>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>>   sched_balance_find_dst_group_cpu()
>>
>> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
>>
>> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
>> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
>> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
>> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
>> then they match "/" == "/".
>>
>> I assume you run Ubuntu on your AWS instances? What kind of
>> 'cgroup/taskgroup' related setup are you using?
> 
> I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
> AL2023 uses cgroupv2 by default.
>>
>> Can you run w/ this debug snippet w/ and w/o AG enabled?
> 
> I have run that and have attached the trace files to this email.

Thanks!

So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
'/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
w/o doing the 'cfs_rq->avg.load_avg' update I described above.

You said that there is no issue w/o AG. Unfortunately your 'w/o AG'
trace does not contain any evidence that you ran UnixBench's './Run -c 4
spawn' since there are no lines for tasks with p->comm='spawn'. Could
you rerun this please. My hunch is that 'group' and
'tsk->sched_task_group' differ w/o AG?

[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-11 16:27         ` Dietmar Eggemann
@ 2025-02-11 21:40           ` Hagar Hemdan
  2025-02-13 18:55             ` Dietmar Eggemann
  0 siblings, 1 reply; 14+ messages in thread
From: Hagar Hemdan @ 2025-02-11 21:40 UTC (permalink / raw)
  To: Dietmar Eggemann, Hagar Hemdan; +Cc: abuehaze, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3395 bytes --]

On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
> On 10/02/2025 22:31, Hagar Hemdan wrote:
> > On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> >> On 07/02/2025 12:07, Hagar Hemdan wrote:
> >>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >>>> Hi Hagar,
> >>>>
> >>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
> 
> [...]
> 
> >> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> >> call dequeue_task(), put_prev_task(), enqueue_task() and
> >> set_next_task().
> >>
> >> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> >> case of root tg) update in:
> >>
> >>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
> >>   attach_task_cfs_rq() -> ...
> >>
> >> since this is used for WF_FORK, WF_EXEC handling in wakeup:
> >>
> >>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
> >>   sched_balance_find_dst_group_cpu()
> >>
> >> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
> >>
> >> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
> >> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> >> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> >> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> >> then they match "/" == "/".
> >>
> >> I assume you run Ubuntu on your AWS instances? What kind of
> >> 'cgroup/taskgroup' related setup are you using?
> > 
> > I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
> > AL2023 uses cgroupv2 by default.
> >>
> >> Can you run w/ this debug snippet w/ and w/o AG enabled?
> > 
> > I have run that and have attached the trace files to this email.
> 
> Thanks!
> 
> So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
> '/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
> w/o doing the 'cfs_rq->avg.load_avg' update I described above.

yes, both groups are identical so it returns from sched_move_task()
without {de|en}queue and without call task_change_group_fair().
> 
> You said that there is no issue w/o AG. 

To clarify, I meant by there's no regression when autogroup is disabled,
that the fork results w/o AG remain consistent with or without the commit 
"sched/core: Reduce cost of sched_move_task when config autogroup". However,
the fork results are consistently lower when AG disabled compared to when
it's enabled (without commit applied). This is illustrated in the tables
provided in the report.

>Unfortunately your 'w/o AG'
> trace does not contain any evidence that you ran UnixBench's './Run -c 4
> spawn' since there are no lines for tasks with p->comm='spawn'.

You're right, the trace doesn't show tasks with p->comm='spawn' because I ran
the fork benchmark with CONFIG_SCHED_AUTOGROUP disabled and 
sched_autogroup_exit_task() is empty in this case.

> Could you rerun this please. My hunch is that 'group' and
> 'tsk->sched_task_group' differ w/o AG?

I've re-run the fork test with CONFIG_SCHED_AUTOGROUP enabled, but with 
autogroup disabled at runtime using 'echo 0 | sudo tee /proc/sys/kernel
/sched_autogroup_enabled' to avoid empty sched_autogroup_exit_task().
I've attached the new trace file and I observed that both groups still
don't differ.
Have you seen different results when testing this on a Debian 12 
Juno-r0 Arm64 board? 
> [...]

[-- Attachment #2: fork_traces_without_AG.tar.gz --]
[-- Type: application/gzip, Size: 447503 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-11 21:40           ` Hagar Hemdan
@ 2025-02-13 18:55             ` Dietmar Eggemann
  2025-02-17 22:51               ` Dietmar Eggemann
  0 siblings, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2025-02-13 18:55 UTC (permalink / raw)
  To: Hagar Hemdan; +Cc: abuehaze, linux-kernel

On 11/02/2025 22:40, Hagar Hemdan wrote:
> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
>> On 10/02/2025 22:31, Hagar Hemdan wrote:
>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>>>>>> Hi Hagar,
>>>>>>
>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
>>
>> [...]
>>
>>>> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
>>>> call dequeue_task(), put_prev_task(), enqueue_task() and
>>>> set_next_task().
>>>>
>>>> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
>>>> case of root tg) update in:
>>>>
>>>>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>>>>   attach_task_cfs_rq() -> ...
>>>>
>>>> since this is used for WF_FORK, WF_EXEC handling in wakeup:
>>>>
>>>>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>>>>   sched_balance_find_dst_group_cpu()
>>>>
>>>> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
>>>>
>>>> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
>>>> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
>>>> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
>>>> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
>>>> then they match "/" == "/".
>>>>
>>>> I assume you run Ubuntu on your AWS instances? What kind of
>>>> 'cgroup/taskgroup' related setup are you using?
>>>
>>> I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
>>> AL2023 uses cgroupv2 by default.
>>>>
>>>> Can you run w/ this debug snippet w/ and w/o AG enabled?
>>>
>>> I have run that and have attached the trace files to this email.
>>
>> Thanks!
>>
>> So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
>> '/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
>> w/o doing the 'cfs_rq->avg.load_avg' update I described above.
> 
> yes, both groups are identical so it returns from sched_move_task()
> without {de|en}queue and without call task_change_group_fair().

OK.

>> You said that there is no issue w/o AG. 
> 
> To clarify, I meant by there's no regression when autogroup is disabled,
> that the fork results w/o AG remain consistent with or without the commit 
> "sched/core: Reduce cost of sched_move_task when config autogroup". However,
> the fork results are consistently lower when AG disabled compared to when
> it's enabled (without commit applied). This is illustrated in the tables
> provided in the report.

OK, but I don't quite get yet why w/o AG the results are lower even w/o
eff6c8ce8d4d? Have to dig further I guess. Maybe there is more than this
p->se.avg.load_avg update when we go via task_change_group_fair()?

>> Unfortunately your 'w/o AG'
>> trace does not contain any evidence that you ran UnixBench's './Run -c 4
>> spawn' since there are no lines for tasks with p->comm='spawn'.
> 
> You're right, the trace doesn't show tasks with p->comm='spawn' because I ran
> the fork benchmark with CONFIG_SCHED_AUTOGROUP disabled and 
> sched_autogroup_exit_task() is empty in this case.

Makes sense.

>> Could you rerun this please. My hunch is that 'group' and
>> 'tsk->sched_task_group' differ w/o AG?
> 
> I've re-run the fork test with CONFIG_SCHED_AUTOGROUP enabled, but with 
> autogroup disabled at runtime using 'echo 0 | sudo tee /proc/sys/kernel
> /sched_autogroup_enabled' to avoid empty sched_autogroup_exit_task().
> I've attached the new trace file and I observed that both groups still
> don't differ.

So with 'sched_autogroup_enabled: 0' you run:

group                  /user.slice/user-1000.slice/session-3.scope
tsk->sched_task_group  /user.slice/user-1000.slice/session-3.scope

Looks the same as with 'sched_autogroup_enabled: 1'.

> Have you seen different results when testing this on a Debian 12 
> Juno-r0 Arm64 board? 

I have on 4 (little) CPUs with './Run -c 4 spawn'

sched_autogroup_enabled: 1

group                 /
tsk->sched_task_group /autogroup-21

sched_autogroup_enabled: 0

group                 /
tsk->sched_task_group /

(1) v6.13:

CONFIG_SCHED_AUTOGROUP is not set

  5250 lps

sched_autogroup_enabled: 1

  4900 lps

This +1200 compared to 'sched_autogroup_enabled: 0' could fit into this
'p->se.avg.load_avg update when we go via task_change_group_fair()' story.

sched_autogroup_enabled: 0

  3700 lps


(2) v6.13 w/o eff6c8ce8d4d

CONFIG_SCHED_AUTOGROUP is not set

  5250 lps

sched_autogroup_enabled: 1

  5250 lps

sched_autogroup_enabled: 0

  5250 lps

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-13 18:55             ` Dietmar Eggemann
@ 2025-02-17 22:51               ` Dietmar Eggemann
  2025-02-21  6:44                 ` Hagar Hemdan
  2025-02-28 19:39                 ` Hagar Hemdan
  0 siblings, 2 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2025-02-17 22:51 UTC (permalink / raw)
  To: Hagar Hemdan; +Cc: abuehaze, linux-kernel

On 13/02/2025 19:55, Dietmar Eggemann wrote:
> On 11/02/2025 22:40, Hagar Hemdan wrote:
>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>>>>>>> Hi Hagar,
>>>>>>>
>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
>>>
>>> [...]
>>>
>>>>> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
>>>>> call dequeue_task(), put_prev_task(), enqueue_task() and
>>>>> set_next_task().
>>>>>
>>>>> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
>>>>> case of root tg) update in:
>>>>>
>>>>>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>>>>>   attach_task_cfs_rq() -> ...
>>>>>
>>>>> since this is used for WF_FORK, WF_EXEC handling in wakeup:
>>>>>
>>>>>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>>>>>   sched_balance_find_dst_group_cpu()
>>>>>
>>>>> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
>>>>>
>>>>> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
>>>>> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
>>>>> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
>>>>> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
>>>>> then they match "/" == "/".
>>>>>
>>>>> I assume you run Ubuntu on your AWS instances? What kind of
>>>>> 'cgroup/taskgroup' related setup are you using?
>>>>
>>>> I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
>>>> AL2023 uses cgroupv2 by default.
>>>>>
>>>>> Can you run w/ this debug snippet w/ and w/o AG enabled?
>>>>
>>>> I have run that and have attached the trace files to this email.
>>>
>>> Thanks!
>>>
>>> So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
>>> '/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
>>> w/o doing the 'cfs_rq->avg.load_avg' update I described above.
>>
>> yes, both groups are identical so it returns from sched_move_task()
>> without {de|en}queue and without call task_change_group_fair().
> 
> OK.
> 
>>> You said that there is no issue w/o AG. 
>>
>> To clarify, I meant by there's no regression when autogroup is disabled,
>> that the fork results w/o AG remain consistent with or without the commit 
>> "sched/core: Reduce cost of sched_move_task when config autogroup". However,
>> the fork results are consistently lower when AG disabled compared to when
>> it's enabled (without commit applied). This is illustrated in the tables
>> provided in the report.
> 
> OK, but I don't quite get yet why w/o AG the results are lower even w/o
> eff6c8ce8d4d? Have to dig further I guess. Maybe there is more than this
> p->se.avg.load_avg update when we go via task_change_group_fair()?

'./Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':

CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)

   	y	             1		   y            21005 (27120 **)
	y		     0		   y            21059 (27012 **)
	n		     -		   y            21299
	y		     1		   n	        27745 *
	y		     0		   n	        27493 *
	n		     -		   n	        20928

(*) So here the higher numbers are only achieved when
'sched_autogroup_exit_task() -> sched_move_task() ->
sched_change_group() is called for the 'spawn' tasks.

(**) When I apply the fix from
https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.

These results support the story that we need:

  task_change_group_fair() -> detach_task_cfs_rq() -> ...,
  attach_task_cfs_rq() -> ...

i.e. the related 'cfs_rq->avg.load_avg' update during do_exit() so that
WF_FORK handling in wakeup:

  select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
  sched_balance_find_dst_group_cpu()

can use more recent 'load = cpu_load(cpu_rq(i)' values to get a better
'least_loaded_cpu'.

The AWS instance runs systemd so shell and test run in a taskgroup other
than root which trumps autogroups:

  task_wants_autogroup()

     if (tg != &root_task_group)
       return false;

     ...

That's why 'group == tsk->sched_task_group' in sched_move_task() is
true, which is different on my Juno: the shell from which I launch the
tests runs in '/' so that the test ends up in an autogroup, i.e. 'group
!= tsk->sched_task_group'.

[...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-17 22:51               ` Dietmar Eggemann
@ 2025-02-21  6:44                 ` Hagar Hemdan
  2025-03-03 10:05                   ` Dietmar Eggemann
  2025-02-28 19:39                 ` Hagar Hemdan
  1 sibling, 1 reply; 14+ messages in thread
From: Hagar Hemdan @ 2025-02-21  6:44 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: hagarhem, abuehaze, wuchi.zero, linux-kernel

On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
> On 13/02/2025 19:55, Dietmar Eggemann wrote:
> > On 11/02/2025 22:40, Hagar Hemdan wrote:
> >> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
> >>> On 10/02/2025 22:31, Hagar Hemdan wrote:
> >>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> >>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
> >>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >>>>>>> Hi Hagar,
> >>>>>>>
> >>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
> >>>
> >>> [...]
> >>>
> >>>>> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> >>>>> call dequeue_task(), put_prev_task(), enqueue_task() and
> >>>>> set_next_task().
> >>>>>
> >>>>> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> >>>>> case of root tg) update in:
> >>>>>
> >>>>>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
> >>>>>   attach_task_cfs_rq() -> ...
> >>>>>
> >>>>> since this is used for WF_FORK, WF_EXEC handling in wakeup:
> >>>>>
> >>>>>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
> >>>>>   sched_balance_find_dst_group_cpu()
> >>>>>
> >>>>> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
> >>>>>
> >>>>> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
> >>>>> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> >>>>> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> >>>>> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> >>>>> then they match "/" == "/".
> >>>>>
> >>>>> I assume you run Ubuntu on your AWS instances? What kind of
> >>>>> 'cgroup/taskgroup' related setup are you using?
> >>>>
> >>>> I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
> >>>> AL2023 uses cgroupv2 by default.
> >>>>>
> >>>>> Can you run w/ this debug snippet w/ and w/o AG enabled?
> >>>>
> >>>> I have run that and have attached the trace files to this email.
> >>>
> >>> Thanks!
> >>>
> >>> So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
> >>> '/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
> >>> w/o doing the 'cfs_rq->avg.load_avg' update I described above.
> >>
> >> yes, both groups are identical so it returns from sched_move_task()
> >> without {de|en}queue and without call task_change_group_fair().
> > 
> > OK.
> > 
> >>> You said that there is no issue w/o AG. 
> >>
> >> To clarify, I meant by there's no regression when autogroup is disabled,
> >> that the fork results w/o AG remain consistent with or without the commit 
> >> "sched/core: Reduce cost of sched_move_task when config autogroup". However,
> >> the fork results are consistently lower when AG disabled compared to when
> >> it's enabled (without commit applied). This is illustrated in the tables
> >> provided in the report.
> > 
> > OK, but I don't quite get yet why w/o AG the results are lower even w/o
> > eff6c8ce8d4d? Have to dig further I guess. Maybe there is more than this
> > p->se.avg.load_avg update when we go via task_change_group_fair()?
> 
> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
> 
> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
> 
>    	y	             1		   y            21005 (27120 **)
> 	y		     0		   y            21059 (27012 **)
> 	n		     -		   y            21299
> 	y		     1		   n	        27745 *
> 	y		     0		   n	        27493 *
> 	n		     -		   n	        20928
> 
> (*) So here the higher numbers are only achieved when
> 'sched_autogroup_exit_task() -> sched_move_task() ->
> sched_change_group() is called for the 'spawn' tasks.
> 
> (**) When I apply the fix from
> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
Thanks!
Will you submit that fix upstream?
Do you think that this fix is the same as reverting commit eff6c8ce8d4d and
its follow up commit fa614b4feb5a? I mean what does commit eff6c8ce8d4d 
actually improve?
> 
> These results support the story that we need:
> 
>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>   attach_task_cfs_rq() -> ...
> 
> i.e. the related 'cfs_rq->avg.load_avg' update during do_exit() so that
> WF_FORK handling in wakeup:
> 
>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>   sched_balance_find_dst_group_cpu()
> 
> can use more recent 'load = cpu_load(cpu_rq(i)' values to get a better
> 'least_loaded_cpu'.
> 
> The AWS instance runs systemd so shell and test run in a taskgroup other
> than root which trumps autogroups:
> 
>   task_wants_autogroup()
> 
>      if (tg != &root_task_group)
>        return false;
> 
>      ...
> 
> That's why 'group == tsk->sched_task_group' in sched_move_task() is
> true, which is different on my Juno: the shell from which I launch the
> tests runs in '/' so that the test ends up in an autogroup, i.e. 'group
> != tsk->sched_task_group'.
Thanks for the explanation
> 
> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-21  6:44                 ` Hagar Hemdan
@ 2025-03-03 10:05                   ` Dietmar Eggemann
  2025-03-03 13:57                     ` Hagar Hemdan
  0 siblings, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2025-03-03 10:05 UTC (permalink / raw)
  To: Hagar Hemdan; +Cc: abuehaze, wuchi.zero, linux-kernel

On 21/02/2025 07:44, Hagar Hemdan wrote:
> On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
>> On 13/02/2025 19:55, Dietmar Eggemann wrote:
>>> On 11/02/2025 22:40, Hagar Hemdan wrote:
>>>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
>>>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
>>>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>>>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>>>>>>>>> Hi Hagar,
>>>>>>>>>
>>>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:

[...]

>> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
>> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
>>
>> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
>>
>>    	y	             1		   y            21005 (27120 **)
>> 	y		     0		   y            21059 (27012 **)
>> 	n		     -		   y            21299
>> 	y		     1		   n	        27745 *
>> 	y		     0		   n	        27493 *
>> 	n		     -		   n	        20928
>>
>> (*) So here the higher numbers are only achieved when
>> 'sched_autogroup_exit_task() -> sched_move_task() ->
>> sched_change_group() is called for the 'spawn' tasks.
>>
>> (**) When I apply the fix from
>> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
> Thanks!
> Will you submit that fix upstream?

I will, I just had to understand in detail why this regression happens.

Looks like the issue is rather related to 'sgs->group_util' in
group_is_overloaded() and group_has_capacity(). If we don't
'deqeue/detach + attach/enqueue' (1) the task in sched_move_task() then
sgs->group_util is ~900 (you run 4 CPUs flat in a single MC sched domain
so sgs->group_capacity = 1024 and this leads to group_is_overloaded()
returning true and group_has_capacity() false much more often as if
we would do (1).

I.e. we have much more cases of 'group_is_overloaded' and
'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
then (a) returns much more often a CPU != smp_processor_id() (which
isn't good for these extremely short running tasks (FORK + EXIT)) and
also involves calling sched_balance_find_dst_group_cpu() unnecessary
(since we deal with single CPU sched domains). 

select_task_rq_fair(..., wake_flags = WF_FORK)

  cpu = smp_processor_id()

  new_cpu = sched_balance_find_dst_group(..., cpu, ...)

    do {

      update_sg_wakeup_stats()

        sgs->group_type = group_classify()   
							w/o patch 	w/ patch                   
          if group_is_overloaded() (*)
            return group_overloaded /* 6 */		457,141		394

          if !group_has_capacity() (**)
            return group_fully_busy /* 1 */ 	  	816,629		714

          return group_has_spare    /* 0 */		1,158,890	3,157,472

    } while group 

    if local_sgs.group_type > idlest_sgs.group_type	
      return idlest					351,598		273

    case group_has_spare:

      if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
        return NULL 					156,760		788,462


(*)

  if sgs->group_capacity * 100) <			
		sgs->group_util * imbalance_pct		951,705		856
    return true

  sgs->group_util ~ 900 and sgs->group_capacity = 1024 (1 CPU per sched group)


(**)

 if sgs->group_capacity * 100 >
		sgs->group_util * imbalance_pct
   return true						1,087,555	3,163,152

 return false						1,332,974	882


(*) and (**) are for 'wakeup' and 'load-balance' so they don't
match the only wakeup numbers above!

In this test run I got 608,092 new wakeups w/o and 789,572 (~+ 30%)
w/ the patch when running './Run -c 4 -i 1 spawn' on AWS instance
(m7gd.16xlarge) with v6.13, 'mem=16G maxcpus=4 nr_cpus=4' and
Ubuntu '22.04.5 LTS'

> Do you think that this fix is the same as reverting commit eff6c8ce8d4d and
> its follow up commit fa614b4feb5a? I mean what does commit eff6c8ce8d4d 
> actually improve?

There are occurrences in which 'group == tsk->sched_task_group' and
'!(tsk->flags & PF_EXITING)' so there the early bail might help w/o
the negative impact on sched benchmarks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-03-03 10:05                   ` Dietmar Eggemann
@ 2025-03-03 13:57                     ` Hagar Hemdan
  0 siblings, 0 replies; 14+ messages in thread
From: Hagar Hemdan @ 2025-03-03 13:57 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: hagarhem, abuehaze, linux-kernel

On Mon, Mar 03, 2025 at 11:05:01AM +0100, Dietmar Eggemann wrote:
> On 21/02/2025 07:44, Hagar Hemdan wrote:
> > On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
> >> On 13/02/2025 19:55, Dietmar Eggemann wrote:
> >>> On 11/02/2025 22:40, Hagar Hemdan wrote:
> >>>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
> >>>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
> >>>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> >>>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
> >>>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >>>>>>>>> Hi Hagar,
> >>>>>>>>>
> >>>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
> 
> [...]
> 
> >> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
> >> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
> >>
> >> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
> >>
> >>    	y	             1		   y            21005 (27120 **)
> >> 	y		     0		   y            21059 (27012 **)
> >> 	n		     -		   y            21299
> >> 	y		     1		   n	        27745 *
> >> 	y		     0		   n	        27493 *
> >> 	n		     -		   n	        20928
> >>
> >> (*) So here the higher numbers are only achieved when
> >> 'sched_autogroup_exit_task() -> sched_move_task() ->
> >> sched_change_group() is called for the 'spawn' tasks.
> >>
> >> (**) When I apply the fix from
> >> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
> > Thanks!
> > Will you submit that fix upstream?
> 
> I will, I just had to understand in detail why this regression happens.
> 
> Looks like the issue is rather related to 'sgs->group_util' in
> group_is_overloaded() and group_has_capacity(). If we don't
> 'deqeue/detach + attach/enqueue' (1) the task in sched_move_task() then
> sgs->group_util is ~900 (you run 4 CPUs flat in a single MC sched domain
> so sgs->group_capacity = 1024 and this leads to group_is_overloaded()
> returning true and group_has_capacity() false much more often as if
> we would do (1).
> 
> I.e. we have much more cases of 'group_is_overloaded' and
> 'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
> then (a) returns much more often a CPU != smp_processor_id() (which
> isn't good for these extremely short running tasks (FORK + EXIT)) and
> also involves calling sched_balance_find_dst_group_cpu() unnecessary
> (since we deal with single CPU sched domains). 
> 
> select_task_rq_fair(..., wake_flags = WF_FORK)
> 
>   cpu = smp_processor_id()
> 
>   new_cpu = sched_balance_find_dst_group(..., cpu, ...)
> 
>     do {
> 
>       update_sg_wakeup_stats()
> 
>         sgs->group_type = group_classify()   
> 							w/o patch 	w/ patch                   
>           if group_is_overloaded() (*)
>             return group_overloaded /* 6 */		457,141		394
> 
>           if !group_has_capacity() (**)
>             return group_fully_busy /* 1 */ 	  	816,629		714
> 
>           return group_has_spare    /* 0 */		1,158,890	3,157,472
> 
>     } while group 
> 
>     if local_sgs.group_type > idlest_sgs.group_type	
>       return idlest					351,598		273
> 
>     case group_has_spare:
> 
>       if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
>         return NULL 					156,760		788,462
> 
> 
> (*)
> 
>   if sgs->group_capacity * 100) <			
> 		sgs->group_util * imbalance_pct		951,705		856
>     return true
> 
>   sgs->group_util ~ 900 and sgs->group_capacity = 1024 (1 CPU per sched group)
> 
> 
> (**)
> 
>  if sgs->group_capacity * 100 >
> 		sgs->group_util * imbalance_pct
>    return true						1,087,555	3,163,152
> 
>  return false						1,332,974	882
> 
> 
> (*) and (**) are for 'wakeup' and 'load-balance' so they don't
> match the only wakeup numbers above!

Thank you for the detailed explanation. We appreciate your effort and
will await the fix.
> 
> In this test run I got 608,092 new wakeups w/o and 789,572 (~+ 30%)
> w/ the patch when running './Run -c 4 -i 1 spawn' on AWS instance
> (m7gd.16xlarge) with v6.13, 'mem=16G maxcpus=4 nr_cpus=4' and
> Ubuntu '22.04.5 LTS'
> 
> > Do you think that this fix is the same as reverting commit eff6c8ce8d4d and
> > its follow up commit fa614b4feb5a? I mean what does commit eff6c8ce8d4d 
> > actually improve?
> 
> There are occurrences in which 'group == tsk->sched_task_group' and
> '!(tsk->flags & PF_EXITING)' so there the early bail might help w/o
> the negative impact on sched benchmarks.
ok, thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-17 22:51               ` Dietmar Eggemann
  2025-02-21  6:44                 ` Hagar Hemdan
@ 2025-02-28 19:39                 ` Hagar Hemdan
  2025-03-03 10:06                   ` Dietmar Eggemann
  1 sibling, 1 reply; 14+ messages in thread
From: Hagar Hemdan @ 2025-02-28 19:39 UTC (permalink / raw)
  To: Dietmar Eggemann; +Cc: abuehaze, linux-kernel, hagarhem, wuchi.zero

On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
> On 13/02/2025 19:55, Dietmar Eggemann wrote:
> > On 11/02/2025 22:40, Hagar Hemdan wrote:
> >> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
> >>> On 10/02/2025 22:31, Hagar Hemdan wrote:
> >>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> >>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
> >>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >>>>>>> Hi Hagar,
> >>>>>>>
> >>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
> >>>
> >>> [...]
> >>>
> >>>>> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> >>>>> call dequeue_task(), put_prev_task(), enqueue_task() and
> >>>>> set_next_task().
> >>>>>
> >>>>> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> >>>>> case of root tg) update in:
> >>>>>
> >>>>>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
> >>>>>   attach_task_cfs_rq() -> ...
> >>>>>
> >>>>> since this is used for WF_FORK, WF_EXEC handling in wakeup:
> >>>>>
> >>>>>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
> >>>>>   sched_balance_find_dst_group_cpu()
> >>>>>
> >>>>> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
> >>>>>
> >>>>> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
> >>>>> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> >>>>> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> >>>>> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> >>>>> then they match "/" == "/".
> >>>>>
> >>>>> I assume you run Ubuntu on your AWS instances? What kind of
> >>>>> 'cgroup/taskgroup' related setup are you using?
> >>>>
> >>>> I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
> >>>> AL2023 uses cgroupv2 by default.
> >>>>>
> >>>>> Can you run w/ this debug snippet w/ and w/o AG enabled?
> >>>>
> >>>> I have run that and have attached the trace files to this email.
> >>>
> >>> Thanks!
> >>>
> >>> So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
> >>> '/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
> >>> w/o doing the 'cfs_rq->avg.load_avg' update I described above.
> >>
> >> yes, both groups are identical so it returns from sched_move_task()
> >> without {de|en}queue and without call task_change_group_fair().
> > 
> > OK.
> > 
> >>> You said that there is no issue w/o AG. 
> >>
> >> To clarify, I meant by there's no regression when autogroup is disabled,
> >> that the fork results w/o AG remain consistent with or without the commit 
> >> "sched/core: Reduce cost of sched_move_task when config autogroup". However,
> >> the fork results are consistently lower when AG disabled compared to when
> >> it's enabled (without commit applied). This is illustrated in the tables
> >> provided in the report.
> > 
> > OK, but I don't quite get yet why w/o AG the results are lower even w/o
> > eff6c8ce8d4d? Have to dig further I guess. Maybe there is more than this
> > p->se.avg.load_avg update when we go via task_change_group_fair()?
> 
> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
> 
> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
> 
>    	y	             1		   y            21005 (27120 **)
> 	y		     0		   y            21059 (27012 **)
> 	n		     -		   y            21299
> 	y		     1		   n	        27745 *
> 	y		     0		   n	        27493 *
> 	n		     -		   n	        20928
> 
> (*) So here the higher numbers are only achieved when
> 'sched_autogroup_exit_task() -> sched_move_task() ->
> sched_change_group() is called for the 'spawn' tasks.
> 
> (**) When I apply the fix from
> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.

This is currently impacting our kernel, do you
have any concerns to submit this fix upstream?
Thanks,
Hagar
> 
> These results support the story that we need:
> 
>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>   attach_task_cfs_rq() -> ...
> 
> i.e. the related 'cfs_rq->avg.load_avg' update during do_exit() so that
> WF_FORK handling in wakeup:
> 
>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>   sched_balance_find_dst_group_cpu()
> 
> can use more recent 'load = cpu_load(cpu_rq(i)' values to get a better
> 'least_loaded_cpu'.
> 
> The AWS instance runs systemd so shell and test run in a taskgroup other
> than root which trumps autogroups:
> 
>   task_wants_autogroup()
> 
>      if (tg != &root_task_group)
>        return false;
> 
>      ...
> 
> That's why 'group == tsk->sched_task_group' in sched_move_task() is
> true, which is different on my Juno: the shell from which I launch the
> tests runs in '/' so that the test ends up in an autogroup, i.e. 'group
> != tsk->sched_task_group'.
> 
> [...]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BUG Report: Fork benchmark drop by 30% on aarch64
  2025-02-28 19:39                 ` Hagar Hemdan
@ 2025-03-03 10:06                   ` Dietmar Eggemann
  0 siblings, 0 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2025-03-03 10:06 UTC (permalink / raw)
  To: Hagar Hemdan; +Cc: abuehaze, linux-kernel, wuchi.zero

On 28/02/2025 20:39, Hagar Hemdan wrote:
> On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
>> On 13/02/2025 19:55, Dietmar Eggemann wrote:
>>> On 11/02/2025 22:40, Hagar Hemdan wrote:
>>>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
>>>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
>>>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>>>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:

[...]

>> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
>> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
>>
>> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
>>
>>    	y	             1		   y            21005 (27120 **)
>> 	y		     0		   y            21059 (27012 **)
>> 	n		     -		   y            21299
>> 	y		     1		   n	        27745 *
>> 	y		     0		   n	        27493 *
>> 	n		     -		   n	        20928
>>
>> (*) So here the higher numbers are only achieved when
>> 'sched_autogroup_exit_task() -> sched_move_task() ->
>> sched_change_group() is called for the 'spawn' tasks.
>>
>> (**) When I apply the fix from
>> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
> 
> This is currently impacting our kernel, do you
> have any concerns to submit this fix upstream?

Will send it out after the analysis now done. See the other email from
today.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-03-03 13:57 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-05 15:10 BUG Report: Fork benchmark drop by 30% on aarch64 Hagar Hemdan
2025-02-07  9:14 ` Dietmar Eggemann
2025-02-07 11:07   ` Hagar Hemdan
2025-02-10 10:38     ` Dietmar Eggemann
2025-02-10 21:31       ` Hagar Hemdan
2025-02-11 16:27         ` Dietmar Eggemann
2025-02-11 21:40           ` Hagar Hemdan
2025-02-13 18:55             ` Dietmar Eggemann
2025-02-17 22:51               ` Dietmar Eggemann
2025-02-21  6:44                 ` Hagar Hemdan
2025-03-03 10:05                   ` Dietmar Eggemann
2025-03-03 13:57                     ` Hagar Hemdan
2025-02-28 19:39                 ` Hagar Hemdan
2025-03-03 10:06                   ` Dietmar Eggemann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox