On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> On 07/02/2025 12:07, Hagar Hemdan wrote:
> > On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >> Hi Hagar,
> >>
> >> On 05/02/2025 16:10, Hagar Hemdan wrote:
> >>> Hi,
> >>>
> >>> There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
> >>> drop on x86_64 using kernel v6.13.1.
> >>>
> >>> Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
> >>> of sched_move_task when config autogroup") which merged starting
> >>> v6.4-rc1.
> >>>
> >>> The regression only happens when number of CPUs is equal to number
> >>> of threads [2] that fork test is creating which means it's only visible
> >>> under CPU contention.
> >>>
> >>> I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
> >>> and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
> >>>
> >>> I noticed this regression exists only when autogroup config is enabled.
> >>
> >> So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have:
> >>
> >> static inline void sched_autogroup_exit_task(struct task_struct *p) { }
> >>
> >> I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
> >> shows this issue?
> > yes, when I do 'echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled',
> > It behaves like the disable 'CONFIG_SCHED_AUTOGROUP'.
>
> OK.
>
> >>> Run the fork test with these combinations and autogroup is enabled:
> >>>
> >>> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  | %Cpu(s)
> >>> ----------+---------------------+--------------------+------------------
> >>> aarch64   | without             | 28677.0            | 3.2 us, 96.7 sy
> >>> aarch64   | with                | 19860.7 (30% drop) | 2.7 us, 79.4 sy
> >>> x86_64    | without             | 27776.2            | 3.1 us, 96.9 sy
> >>> x86_64    | with                | 25020.6 (10% drop) | 4.1 us, 93.2 sy
> >>> ----------+---------------------+--------------------+------------------
> >>
> >> Can you rerun with:
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 3e5a6bf587f9..62cc50c79a78 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
> >>           * group changes.
> >>           */
> >>          group = sched_get_task_group(tsk);
> >> -        if (group == tsk->sched_task_group)
> >> +        if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
> >>                  return;
> > I tried that and I see it fixed the regression and the cpu utilization
> > is 100% with it.
>
> W/ this one, we force the 'spawn' tasks into:
>
> sched_change_group() -> task_change_group_fair()
>
> > I'd like to ask if this like reverting the patch in case of exit path
> > and also means the enqueue/dequeue are needed in case of task exiting,
> > right?
>
> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> call dequeue_task(), put_prev_task(), enqueue_task() and
> set_next_task().
>
> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> case of root tg) update in:
>
> task_change_group_fair() -> detach_task_cfs_rq() -> ...,
> attach_task_cfs_rq() -> ...
>
> since this is used for WF_FORK, WF_EXEC handling in wakeup:
>
> select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
> sched_balance_find_dst_group_cpu()
>
> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
>
> You mentioned AutoGroups (AG).
> I don't see this issue on my Debian 12
> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> then they match "/" == "/".
>
> I assume you run Ubuntu on your AWS instances? What kind of
> 'cgroup/taskgroup' related setup are you using?

I'm running AL2023 with a vanilla 6.13.1 kernel on an m6g.xlarge AWS
instance. AL2023 uses cgroupv2 by default.

>
> Can you run w/ this debug snippet w/ and w/o AG enabled?

I have run that and have attached the trace files to this email.

>
> -->8--
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..c696740177b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9035,6 +9035,8 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group
>          set_task_rq(tsk, task_cpu(tsk));
>  }
>
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  /*
>   * Change task's runqueue when it moves between groups.
>   *
> @@ -9048,6 +9050,8 @@ void sched_move_task(struct task_struct *tsk)
>                  DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
>          struct task_group *group;
>          struct rq *rq;
> +        int cpu = cpu_of(rq_of(cfs_rq_of(&tsk->se)));
> +        char buf_1[64], buf_2[64];
>
>          CLASS(task_rq_lock, rq_guard)(tsk);
>          rq = rq_guard.rq;
> @@ -9057,6 +9061,13 @@ void sched_move_task(struct task_struct *tsk)
>           * group changes.
>           */
>          group = sched_get_task_group(tsk);
> +
> +        task_group_path(group, buf_1, sizeof(buf_1));
> +        task_group_path(tsk->sched_task_group, buf_2, sizeof(buf_2));
> +
> +        trace_printk("%s %d cpu=%d group=%s tsk->sched_task_group=%s\n",
> +                     tsk->comm, tsk->pid, cpu, buf_1, buf_2);
> +
>          if (group == tsk->sched_task_group)
>                  return;
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index a1be00a988bf..c967d55f971f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -701,7 +701,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  static DEFINE_SPINLOCK(sched_debug_lock);
>  static char group_path[PATH_MAX];
>
> -static void task_group_path(struct task_group *tg, char *path, int plen)
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
> +void task_group_path(struct task_group *tg, char *path, int plen)
>  {
>          if (autogroup_path(tg, path, plen))
>                  return;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26958431deb7..f90d28be9695 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13198,9 +13198,16 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #endif
>  }
>
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void task_change_group_fair(struct task_struct *p)
>  {
> +        int cpu = cpu_of(rq_of(cfs_rq_of(&p->se)));
> +        struct task_group *tg = task_group(p);
> +        unsigned long cfs_rq_load_avg_pre, cfs_rq_load_avg_post;
> +        char buf[64];
> +
>          /*
>           * We couldn't detach or attach a forked task which
>           * hasn't been woken up by wake_up_new_task().
> @@ -13208,14 +13215,33 @@ static void task_change_group_fair(struct task_struct *p)
>          if (READ_ONCE(p->__state) == TASK_NEW)
>                  return;
>
> +        task_group_path(tg, buf, sizeof(buf));
> +        cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>          detach_task_cfs_rq(p);
>
> +        cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +        trace_printk("%s %d (d) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                     p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                     cfs_rq_load_avg_post);
> +
>  #ifdef CONFIG_SMP
>          /* Tell se's cfs_rq has been changed -- migrated */
>          p->se.avg.last_update_time = 0;
>  #endif
>          set_task_rq(p, task_cpu(p));
> +
> +        task_group_path(tg, buf, sizeof(buf));
> +        cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>          attach_task_cfs_rq(p);
> +
> +        cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +        trace_printk("%s %d (a) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                     p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                     cfs_rq_load_avg_post);
>  }
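
For anyone following the thread, here is a minimal user-space sketch (not
kernel code) of how the one-line test patch quoted earlier changes the early
return in sched_move_task(). The struct, the integer group ids and the
function names are made up for illustration. PF_EXITING is given the value I
believe include/linux/sched.h uses, but only the bit test matters here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the kernel's PF_EXITING; used only as an illustrative bit. */
#define PF_EXITING 0x00000004u

/* Hypothetical stand-in for the two task_struct fields the check reads. */
struct task_model {
    unsigned int flags;
    int sched_task_group;   /* id of the group the task is attached to */
};

/* Early return as in eff6c8ce8d4d: skip the move whenever the group is unchanged. */
bool skips_move_original(const struct task_model *t, int new_group)
{
    return new_group == t->sched_task_group;
}

/* Early return with the test patch: an exiting task never takes the shortcut. */
bool skips_move_patched(const struct task_model *t, int new_group)
{
    return new_group == t->sched_task_group && !(t->flags & PF_EXITING);
}

/* Walks through the case discussed above: same group ("/" == "/"), task exiting. */
void check_exit_path_model(void)
{
    struct task_model exiting = { .flags = PF_EXITING, .sched_task_group = 0 };
    struct task_model running = { .flags = 0, .sched_task_group = 0 };

    assert(skips_move_original(&exiting, 0));   /* shortcut taken, no detach/attach */
    assert(!skips_move_patched(&exiting, 0));   /* falls through to the group change */
    assert(skips_move_patched(&running, 0));    /* non-exiting tasks keep the shortcut */
}
```

In this model the two variants differ only for a task that is exiting while
its group is unchanged, which is the case where the patched kernel goes on to
sched_change_group() -> task_change_group_fair().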