All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hagar Hemdan <hagarhem@amazon.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>, <hagarhem@amazon.com>
Cc: <abuehaze@amazon.com>, <linux-kernel@vger.kernel.org>
Subject: Re: BUG Report: Fork benchmark drop by 30% on aarch64
Date: Mon, 10 Feb 2025 21:31:55 +0000	[thread overview]
Message-ID: <20250210213155.GA649@amazon.com> (raw)
In-Reply-To: <1ca758c7-b6ab-4880-9cc7-217093a30bbb@arm.com>

[-- Attachment #1: Type: text/plain, Size: 8099 bytes --]

On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> On 07/02/2025 12:07, Hagar Hemdan wrote:
> > On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >> Hi Hagar,
> >>
> >> On 05/02/2025 16:10, Hagar Hemdan wrote:
> >>> Hi,
> >>>
> >>> There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
> >>> drop on x86_64 using kernel v6.13.1.
> >>>
> >>> Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
> >>> of sched_move_task when config autogroup") which merged starting
> >>> v6.4-rc1.
> >>>
> >>> The regression only happens when number of CPUs is equal to number
> >>> of threads [2] that fork test is creating which means it's only visible
> >>> under CPU contention.
> >>>
> >>> I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
> >>> and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
> >>>
> >>> I noticed this regression exists only when autogroup config is enabled.
> >>
> >> So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have: 
> >>
> >> static inline void sched_autogroup_exit_task(struct task_struct *p) { }
> >>
> >> I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
> >> shows this issue?
> > yes, when I do 'echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled',
> > It behaves like the disable 'CONFIG_SCHED_AUTOGROUP'.
> 
> OK.
> 
> >>> Run the fork test with these combinations and autogroup is enabled:
> >>>
> >>> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> >>> ----------+---------------------+--------------------+------------------
> >>> aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
> >>> aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
> >>> x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
> >>> x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
> >>> ----------+---------------------+--------------------+------------------
> >>
> >> Can you rerun with:
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 3e5a6bf587f9..62cc50c79a78 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
> >>          * group changes.
> >>          */
> >>         group = sched_get_task_group(tsk);
> >> -       if (group == tsk->sched_task_group)
> >> +       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
> >>                 return;
> > I tried that and I see it fixed the regression and the cpu utilization
> > is 100% with it.
> 
> W/ this one, we force the 'spawn' tasks into: 
> 
>   sched_change_group() -> task_change_group_fair()
> 
> > I'd like to ask if this like reverting the patch in case of exit path
> > and also means the enqueue/dequeue are needed in case of task exiting,
> > right?
> 
> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> call dequeue_task(), put_prev_task(), enqueue_task() and
> set_next_task().
> 
> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> case of root tg) update in:
> 
>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>   attach_task_cfs_rq() -> ...
> 
> since this is used for WF_FORK, WF_EXEC handling in wakeup:
> 
>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>   sched_balance_find_dst_group_cpu()
> 
> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
> 
> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> then they match "/" == "/".
> 
> I assume you run Ubuntu on your AWS instances? What kind of
> 'cgroup/taskgroup' related setup are you using?

I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
AL2023 uses cgroupv2 by default.
> 
> Can you run w/ this debug snippet w/ and w/o AG enabled?

I have run that and have attached the trace files to this email.
> 
> -->8--
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..c696740177b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9035,6 +9035,8 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group
>                 set_task_rq(tsk, task_cpu(tsk));
>  }
>  
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  /*
>   * Change task's runqueue when it moves between groups.
>   *
> @@ -9048,6 +9050,8 @@ void sched_move_task(struct task_struct *tsk)
>                 DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
>         struct task_group *group;
>         struct rq *rq;
> +       int cpu = cpu_of(rq_of(cfs_rq_of(&tsk->se)));
> +       char buf_1[64], buf_2[64];
>  
>         CLASS(task_rq_lock, rq_guard)(tsk);
>         rq = rq_guard.rq;
> @@ -9057,6 +9061,13 @@ void sched_move_task(struct task_struct *tsk)
>          * group changes.
>          */
>         group = sched_get_task_group(tsk);
> +
> +       task_group_path(group, buf_1, sizeof(buf_1));
> +       task_group_path(tsk->sched_task_group, buf_2, sizeof(buf_2));
> +
> +       trace_printk("%s %d cpu=%d group=%s tsk->sched_task_group=%s\n",
> +                    tsk->comm, tsk->pid, cpu, buf_1, buf_2);
> +
>         if (group == tsk->sched_task_group)
>                 return;
>  
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index a1be00a988bf..c967d55f971f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -701,7 +701,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  static DEFINE_SPINLOCK(sched_debug_lock);
>  static char group_path[PATH_MAX];
>  
> -static void task_group_path(struct task_group *tg, char *path, int plen)
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
> +void task_group_path(struct task_group *tg, char *path, int plen)
>  {
>         if (autogroup_path(tg, path, plen))
>                 return;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26958431deb7..f90d28be9695 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13198,9 +13198,16 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #endif
>  }
>  
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void task_change_group_fair(struct task_struct *p)
>  {
> +       int cpu = cpu_of(rq_of(cfs_rq_of(&p->se)));
> +       struct task_group *tg = task_group(p);
> +       unsigned long cfs_rq_load_avg_pre, cfs_rq_load_avg_post;
> +       char buf[64];
> +
>         /*
>          * We couldn't detach or attach a forked task which
>          * hasn't been woken up by wake_up_new_task().
> @@ -13208,14 +13215,33 @@ static void task_change_group_fair(struct task_struct *p)
>         if (READ_ONCE(p->__state) == TASK_NEW)
>                 return;
>  
> +       task_group_path(tg, buf, sizeof(buf));
> +       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>         detach_task_cfs_rq(p);
>  
> +       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +       trace_printk("%s %d (d) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                    cfs_rq_load_avg_post);
> +
>  #ifdef CONFIG_SMP
>         /* Tell se's cfs_rq has been changed -- migrated */
>         p->se.avg.last_update_time = 0;
>  #endif
>         set_task_rq(p, task_cpu(p));
> +
> +       task_group_path(tg, buf, sizeof(buf));
> +       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>         attach_task_cfs_rq(p);
> +
> +       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +       trace_printk("%s %d (a) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                    cfs_rq_load_avg_post);
>  }

[-- Attachment #2: fork_traces.tar.gz --]
[-- Type: application/gzip, Size: 446011 bytes --]

  reply	other threads:[~2025-02-10 21:32 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-05 15:10 BUG Report: Fork benchmark drop by 30% on aarch64 Hagar Hemdan
2025-02-07  9:14 ` Dietmar Eggemann
2025-02-07 11:07   ` Hagar Hemdan
2025-02-10 10:38     ` Dietmar Eggemann
2025-02-10 21:31       ` Hagar Hemdan [this message]
2025-02-11 16:27         ` Dietmar Eggemann
2025-02-11 21:40           ` Hagar Hemdan
2025-02-13 18:55             ` Dietmar Eggemann
2025-02-17 22:51               ` Dietmar Eggemann
2025-02-21  6:44                 ` Hagar Hemdan
2025-03-03 10:05                   ` Dietmar Eggemann
2025-03-03 13:57                     ` Hagar Hemdan
2025-02-28 19:39                 ` Hagar Hemdan
2025-03-03 10:06                   ` Dietmar Eggemann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250210213155.GA649@amazon.com \
    --to=hagarhem@amazon.com \
    --cc=abuehaze@amazon.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.