All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
@ 2023-12-15  5:26 Imran Khan
  2023-12-15  9:11 ` Vincent Guittot
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Imran Khan @ 2023-12-15  5:26 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid
  Cc: linux-kernel

It has been found that sometimes a task_group has some residual
load_avg even though the load average at each of its owned queues
i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
tg_load_avg_contrib have become 0 for a long time.
Under this scenario if another task starts running in this task_group,
it does not get proper time share on CPU since pre-existing
load average of task group inversely impacts the new task's CPU share
on each CPU.

This change looks for the condition when a task_group has no running
tasks and sets the task_group's load average to 0 in such cases, so
that tasks that run in future under this task_group get the CPU time
in accordance with the current load.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---

So far I have encountered only one way of reproducing this issue which is
as follows:

1. Take a kvm guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and 
create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/
cpu/test_group_2

2. Assign a CPU intensive workload to each of these cgroups and start the
workload.

For my tests I am using following app:

int main(int argc, char *argv[])
{
        unsigned long count, i, val;
        if (argc != 2) {
              printf("usage: ./a.out <number of random nums to generate> \n");
              return 0;
        }

        count = strtoul(argv[1], NULL, 10);

        printf("Generating %lu random numbers \n", count);
        for (i = 0; i < count; i++) {
                val = rand();
                val = val % 2;
                //usleep(1);
        }
        printf("Generated %lu random numbers \n", count);
        return 0;
}
Also since the system is booted with 4 CPUs, in order to completely load the
system I am also launching 4 instances of same test app under:
/sys/fs/cgroup/cpu/

3. We can see that both of the cgroups get similar CPU time:

# systemd-cgtop --depth 1
Path                                 Tasks    %CPU  Memory  Input/s    Output/s
/                                      659      -     5.5G        -        -
/system.slice                            -      -     5.7G        -        -
/test_group_1                            4      -        -        -        -
/test_group_2                            3      -        -        -        -
/user.slice                             31      -    56.5M        -        -

Path                                 Tasks   %CPU   Memory  Input/s    Output/s
/                                      659  394.6     5.5G        -        -
/test_group_2                            3   65.7        -        -        -
/user.slice                             29   55.1    48.0M        -        -
/test_group_1                            4   47.3        -        -        -
/system.slice                            -    2.2     5.7G        -        -

Path                                 Tasks  %CPU    Memory  Input/s    Output/s
/                                      659  394.8     5.5G        -        -
/test_group_1                            4   62.9        -        -        -
/user.slice                             28   44.9    54.2M        -        -
/test_group_2                            3   44.7        -        -        -
/system.slice                            -    0.9     5.7G        -        -

Path                                 Tasks  %CPU    Memory  Input/s     Output/s
/                                      659  394.4     5.5G        -        -
/test_group_2                            3   58.8        -        -        -
/test_group_1                            4   51.9        -        -        -
/user.slice                              30   39.3    59.6M        -        -
/system.slice                            -    1.9     5.7G        -        -

Path                                 Tasks  %CPU     Memory  Input/s    Output/s
/                                      659  394.7     5.5G        -        -
/test_group_1                            4   60.9        -        -        -
/test_group_2                            3   57.9        -        -        -
/user.slice                             28   43.5    36.9M        -        -
/system.slice                            -    3.0     5.7G        -        -

Path                                 Tasks  %CPU     Memory  Input/s     Output/s
/                                      659  395.0     5.5G        -        -
/test_group_1                            4   66.8        -        -        -
/test_group_2                            3   56.3        -        -        -
/user.slice                             29   43.1    51.8M        -        -
/system.slice                            -    0.7     5.7G        -        -

4. Now move systemd-udevd to one of these test groups, say test_group_1, and 
perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
host side.

5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
and one instance of  CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
/sys/fs/cgroup/test_group_2.
It is seen that test_group_1 (the one where systemd-udevd was moved) is getting
much less CPU than the test_group_2, even though at this point of time both of
these groups have only CPU hogger running.

#systemd-cgtop --depth 1
Path                                   Tasks   %CPU   Memory  Input/s   Output/s
/                                      1219     -     5.4G        -        -
/system.slice                           -       -     5.6G        -        -
/test_group_1                           4       -        -        -        -
/test_group_2                           3       -        -        -        -
/user.slice                            26       -    91.3M        -        -

Path                                   Tasks  %CPU     Memory  Input/s   Output/s
/                                      1221  394.3     5.4G        -        -
/test_group_2                             3   82.7        -        -        -
/test_group_1                             4   14.3        -        -        -
/system.slice                             -    0.8     5.6G        -        -
/user.slice                              26    0.4    91.2M        -        -

Path                                   Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  394.6     5.4G        -        -
/test_group_2                             3   67.4        -        -        -
/system.slice                             -   24.6     5.6G        -        -
/test_group_1                             4   12.5        -        -        -
/user.slice                              26    0.4    91.2M        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                     1221  395.2     5.4G        -        -
/test_group_2                            3   60.9        -        -        -
/system.slice                            -   27.9     5.6G        -        -
/test_group_1                            4   12.2        -        -        -
/user.slice                             26    0.4    91.2M        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                     1221  395.2     5.4G        -        -
/test_group_2                            3   69.4        -        -        -
/test_group_1                            4   13.9        -        -        -
/user.slice                             28    1.6    92.0M        -        -
/system.slice                            -    1.0     5.6G        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  395.6     5.4G        -        -
/test_group_2                             3   59.3        -        -        -
/test_group_1                             4   14.1        -        -        -
/user.slice                              28    1.3    92.2M        -        -
/system.slice                             -    0.7     5.6G        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  395.5     5.4G        -        -
/test_group_2                            3   67.2        -        -        -
/test_group_1                            4   11.5        -        -        -
/user.slice                             28    1.3    92.5M        -        -
/system.slice                            -    0.6     5.6G        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  395.1     5.4G        -        -
/test_group_2                             3   76.8        -        -        -
/test_group_1                             4   12.9        -        -        -
/user.slice                              28    1.3    92.8M        -        -
/system.slice                             -    1.2     5.6G        -        -

From sched_debug data it is seen that in bad case the load.weight of per-cpu
sched entities corresponding to test_group_1 has reduced significantly and
also load_avg of test_group_1 remains much higher than that of test_group_2,
even though systemd-udevd stopped running long time back and at this point of
time both cgroups just have the cpu hogger app as running entity. 

I put some traces in update_tg_load_avg to check why load_avg was much higher
for test_group_1 and from there, I saw that even after systemd-udevd has
stopped and we are yet to launch the CPU hoggers, test_group_1 has got some
load average like shown in following snippet (I have changed the format of 
trace to limit the number of columns here):

  <idle>-0   [000] d.s.  1824.372593: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.380572: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1 
  <idle>-0   [000] d.s.  1824.380573: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.384563: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.384563: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.391564: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.391564: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.398569: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.398569: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
.....................
.....................
  xfsaild/dm-3-1623    [001] d... 1824.569615: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  xfsaild/dm-3-1623    [001] d... 1824.569617: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
...................
...................
  sh-195050  [000] d.s.  1824.756563: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.757570: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.757571: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.761566: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.761566: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.765567: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.765568: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.769562: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.769563: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664

Now when CPU hogger is launched again in test_group_1, this pre-existing
tg->load_avg gets added to load contribution of cpu hogger and results in a
situation where both cgroups are running the same work load but very different
task_group.load_avg which in turn results in these cgroups getting very
different amount of time on CPU.

Apologies for this long description of the issue, but I have found only one way
of reproducing this issue and using this way issue can be reproduced with 100%
strike rate, so I thought of providing as much info as possible.

I have kept the change as RFC as I was not sure if the load_avg should be
dragged down to 0 in one go (as being done in this change) or should it be
pulled back in steps like making it half of the previous value each time.


 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..c34d9e7abb96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3572,6 +3572,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 		account_numa_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
+		atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
 	}
 #endif
 	cfs_rq->nr_running++;
@@ -3587,6 +3588,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+		atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
 	}
 #endif
 	cfs_rq->nr_running--;
@@ -4104,6 +4106,20 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
+	/*
+	 * If number of running tasks for a task_group is 0, pull back its
+	 * load_avg to 0 as well.
+	 * This avoids the situation where a freshly forked task in a cgroup,
+	 * with some residual load_avg but with no running tasks, gets less
+	 * CPU time because of pre-existing load_avg of task_group.
+	 */
+	if (!atomic_long_read(&cfs_rq->tg->cfs_nr_running)
+	    && atomic_long_read(&cfs_rq->tg->load_avg)) {
+		atomic_long_set(&cfs_rq->tg->load_avg, 0);
+		cfs_rq->last_update_tg_load_avg = now;
+		return;
+	}
+
 	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
@@ -12808,6 +12824,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	atomic_long_set(&tg->cfs_nr_running, 0);
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..c3390bdb8400 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -370,6 +370,7 @@ struct task_group {
 	 */
 	atomic_long_t		load_avg ____cacheline_aligned;
 #endif
+	atomic_long_t           cfs_nr_running ____cacheline_aligned;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-15  5:26 [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks Imran Khan
@ 2023-12-15  9:11 ` Vincent Guittot
  2023-12-15  9:59   ` Imran Khan
  2023-12-17  6:10 ` kernel test robot
  2023-12-17  7:14 ` kernel test robot
  2 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2023-12-15  9:11 UTC (permalink / raw)
  To: Imran Khan
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel

On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@oracle.com> wrote:
>
> It has been found that sometimes a task_group has some residual
> load_avg even though the load average at each of its owned queues
> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> tg_load_avg_contrib have become 0 for a long time.
> Under this scenario if another task starts running in this task_group,
> it does not get proper time share on CPU since pre-existing
> load average of task group inversely impacts the new task's CPU share
> on each CPU.
>
> This change looks for the condition when a task_group has no running
> tasks and sets the task_group's load average to 0 in such cases, so
> that tasks that run in future under this task_group get the CPU time
> in accordance with the current load.
>
> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> ---
>

[...]

>
> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> host side.

Could it be the root cause of your problem ?

The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
then unplugged,  have not been correctly removed from tg->load_avg. If
the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
tg->load_avg should be 0 too.

Could you track that the cfs_rq->tg_load_avg_contrib is correctly
removed from tg->load_avg when you unplug the CPUs ? I can easily
imagine that the rate limit can skip some update of tg- >load_avg
while offlining the cpu

> 5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
> and one instance of  CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
> /sys/fs/cgroup/test_group_2.
> It is seen that test_group_1 (the one where systemd-udevd was moved) is getting
> much less CPU than the test_group_2, even though at this point of time both of
> these groups have only CPU hogger running.
>

[...]

> Apologies for this long description of the issue, but I have found only one way
> of reproducing this issue and using this way issue can be reproduced with 100%
> strike rate, so I thought of providing as much info as possible.
>
> I have kept the change as RFC as I was not sure if the load_avg should be
> dragged down to 0 in one go (as being done in this change) or should it be
> pulled back in steps like making it half of the previous value each time.
>
>
>  kernel/sched/fair.c  | 17 +++++++++++++++++
>  kernel/sched/sched.h |  1 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..c34d9e7abb96 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3572,6 +3572,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
>                 account_numa_enqueue(rq, task_of(se));
>                 list_add(&se->group_node, &rq->cfs_tasks);
> +               atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);

We have rate limited the access to the atomic tg->load_avg because it
was impacting significantly the performance.  We don't want to add a
new one at every enqueue/dequeue


>         }
>  #endif
>         cfs_rq->nr_running++;
> @@ -3587,6 +3588,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         if (entity_is_task(se)) {
>                 account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>                 list_del_init(&se->group_node);
> +               atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
>         }
>  #endif
>         cfs_rq->nr_running--;
> @@ -4104,6 +4106,20 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>         if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
>                 return;
>
> +       /*
> +        * If number of running tasks for a task_group is 0, pull back its
> +        * load_avg to 0 as well.
> +        * This avoids the situation where a freshly forked task in a cgroup,
> +        * with some residual load_avg but with no running tasks, gets less
> +        * CPU time because of pre-existing load_avg of task_group.
> +        */
> +       if (!atomic_long_read(&cfs_rq->tg->cfs_nr_running)
> +           && atomic_long_read(&cfs_rq->tg->load_avg)) {
> +               atomic_long_set(&cfs_rq->tg->load_avg, 0);
> +               cfs_rq->last_update_tg_load_avg = now;
> +               return;
> +       }
> +
>         delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>         if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
> @@ -12808,6 +12824,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>                 goto err;
>
>         tg->shares = NICE_0_LOAD;
> +       atomic_long_set(&tg->cfs_nr_running, 0);
>
>         init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e5a95486a42..c3390bdb8400 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -370,6 +370,7 @@ struct task_group {
>          */
>         atomic_long_t           load_avg ____cacheline_aligned;
>  #endif
> +       atomic_long_t           cfs_nr_running ____cacheline_aligned;
>  #endif
>
>  #ifdef CONFIG_RT_GROUP_SCHED
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-15  9:11 ` Vincent Guittot
@ 2023-12-15  9:59   ` Imran Khan
  2023-12-19  6:41     ` Imran Khan
  0 siblings, 1 reply; 11+ messages in thread
From: Imran Khan @ 2023-12-15  9:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel

Hello Vincent,
Thanks a lot for having a look and getting back.

On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@oracle.com> wrote:
>>
>> It has been found that sometimes a task_group has some residual
>> load_avg even though the load average at each of its owned queues
>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
>> tg_load_avg_contrib have become 0 for a long time.
>> Under this scenario if another task starts running in this task_group,
>> it does not get proper time share on CPU since pre-existing
>> load average of task group inversely impacts the new task's CPU share
>> on each CPU.
>>
>> This change looks for the condition when a task_group has no running
>> tasks and sets the task_group's load average to 0 in such cases, so
>> that tasks that run in future under this task_group get the CPU time
>> in accordance with the current load.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>> ---
>>
> 
> [...]
> 
>>
>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>> host side.
> 
> Could it be the root cause of your problem ?
> 
> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> then unplugged,  have not been correctly removed from tg->load_avg. If
> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> tg->load_avg should be 0 too.
> 
Agree and this was my understanding as well. The issue only happens
with large number of CPUs. For example if I go from 4 to 8 and back to
4 , the issue does not happen and even if it happens the residual load
avg is very little.

> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> removed from tg->load_avg when you unplug the CPUs ? I can easily
> imagine that the rate limit can skip some update of tg- >load_avg
> while offlining the cpu
> 

I will try to trace it but just so you know this issue is happening on other
kernel versions (which don't have rate limit feature) as well. I started
with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.

Thanks,
Imran


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-15  5:26 [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks Imran Khan
  2023-12-15  9:11 ` Vincent Guittot
@ 2023-12-17  6:10 ` kernel test robot
  2023-12-17  7:14 ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2023-12-17  6:10 UTC (permalink / raw)
  To: Imran Khan; +Cc: llvm, oe-kbuild-all

Hi Imran,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on peterz-queue/sched/core linus/master v6.7-rc5 next-20231215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Imran-Khan/sched-fair-reset-task_group-load_avg-when-there-are-no-running-tasks/20231215-132912
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20231215052652.917741-1-imran.f.khan%40oracle.com
patch subject: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
config: i386-buildonly-randconfig-001-20231217 (https://download.01.org/0day-ci/archive/20231217/202312171440.Nj64lUjd-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231217/202312171440.Nj64lUjd-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312171440.Nj64lUjd-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/sched/fair.c:3579:45: error: no member named 'cfs_nr_running' in 'struct task_group'
                   atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
                                    ~~~~~~~~~~~~~~~~~~~~~~~  ^
   kernel/sched/fair.c:3595:45: error: no member named 'cfs_nr_running' in 'struct task_group'
                   atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
                                    ~~~~~~~~~~~~~~~~~~~~~~~  ^
   2 errors generated.


vim +3579 kernel/sched/fair.c

  3568	
  3569	static void
  3570	account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
  3571	{
  3572		update_load_add(&cfs_rq->load, se->load.weight);
  3573	#ifdef CONFIG_SMP
  3574		if (entity_is_task(se)) {
  3575			struct rq *rq = rq_of(cfs_rq);
  3576	
  3577			account_numa_enqueue(rq, task_of(se));
  3578			list_add(&se->group_node, &rq->cfs_tasks);
> 3579			atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
  3580		}
  3581	#endif
  3582		cfs_rq->nr_running++;
  3583		if (se_is_idle(se))
  3584			cfs_rq->idle_nr_running++;
  3585	}
  3586	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-15  5:26 [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks Imran Khan
  2023-12-15  9:11 ` Vincent Guittot
  2023-12-17  6:10 ` kernel test robot
@ 2023-12-17  7:14 ` kernel test robot
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2023-12-17  7:14 UTC (permalink / raw)
  To: Imran Khan; +Cc: llvm, oe-kbuild-all

Hi Imran,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on peterz-queue/sched/core linus/master v6.7-rc5 next-20231215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Imran-Khan/sched-fair-reset-task_group-load_avg-when-there-are-no-running-tasks/20231215-132912
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20231215052652.917741-1-imran.f.khan%40oracle.com
patch subject: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
config: i386-buildonly-randconfig-002-20231217 (https://download.01.org/0day-ci/archive/20231217/202312171425.MFabjrns-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231217/202312171425.MFabjrns-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312171425.MFabjrns-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/sched/fair.c:3579:43: error: incomplete definition of type 'struct task_group'
                   atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
                                    ~~~~~~~~~~~~~~~~~~~~~~~^
   include/linux/sched.h:71:8: note: forward declaration of 'struct task_group'
   struct task_group;
          ^
   kernel/sched/fair.c:3595:43: error: incomplete definition of type 'struct task_group'
                   atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
                                    ~~~~~~~~~~~~~~~~~~~~~~~^
   include/linux/sched.h:71:8: note: forward declaration of 'struct task_group'
   struct task_group;
          ^
   kernel/sched/fair.c:13091:6: warning: no previous prototype for function 'free_fair_sched_group' [-Wmissing-prototypes]
   void free_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13091:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void free_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:13093:5: warning: no previous prototype for function 'alloc_fair_sched_group' [-Wmissing-prototypes]
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
       ^
   kernel/sched/fair.c:13093:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
   ^
   static 
   kernel/sched/fair.c:13098:6: warning: no previous prototype for function 'online_fair_sched_group' [-Wmissing-prototypes]
   void online_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13098:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void online_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   kernel/sched/fair.c:13100:6: warning: no previous prototype for function 'unregister_fair_sched_group' [-Wmissing-prototypes]
   void unregister_fair_sched_group(struct task_group *tg) { }
        ^
   kernel/sched/fair.c:13100:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void unregister_fair_sched_group(struct task_group *tg) { }
   ^
   static 
   4 warnings and 2 errors generated.


vim +3579 kernel/sched/fair.c

  3568	
  3569	static void
  3570	account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
  3571	{
  3572		update_load_add(&cfs_rq->load, se->load.weight);
  3573	#ifdef CONFIG_SMP
  3574		if (entity_is_task(se)) {
  3575			struct rq *rq = rq_of(cfs_rq);
  3576	
  3577			account_numa_enqueue(rq, task_of(se));
  3578			list_add(&se->group_node, &rq->cfs_tasks);
> 3579			atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
  3580		}
  3581	#endif
  3582		cfs_rq->nr_running++;
  3583		if (se_is_idle(se))
  3584			cfs_rq->idle_nr_running++;
  3585	}
  3586	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-15  9:59   ` Imran Khan
@ 2023-12-19  6:41     ` Imran Khan
  2023-12-19 18:49       ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Imran Khan @ 2023-12-19  6:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel

Hello Vincent,


On 15/12/2023 8:59 pm, Imran Khan wrote:
> Hello Vincent,
> Thanks a lot for having a look and getting back.
> 
> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@oracle.com> wrote:
>>>
>>> It has been found that sometimes a task_group has some residual
>>> load_avg even though the load average at each of its owned queues
>>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
>>> tg_load_avg_contrib have become 0 for a long time.
>>> Under this scenario if another task starts running in this task_group,
>>> it does not get proper time share on CPU since pre-existing
>>> load average of task group inversely impacts the new task's CPU share
>>> on each CPU.
>>>
>>> This change looks for the condition when a task_group has no running
>>> tasks and sets the task_group's load average to 0 in such cases, so
>>> that tasks that run in future under this task_group get the CPU time
>>> in accordance with the current load.
>>>
>>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>>> ---
>>>
>>
>> [...]
>>
>>>
>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>>> host side.
>>
>> Could it be the root cause of your problem ?
>>
>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
>> then unplugged,  have not been correctly removed from tg->load_avg. If
>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
>> tg->load_avg should be 0 too.
>>
> Agree and this was my understanding as well. The issue only happens
> with large number of CPUs. For example if I go from 4 to 8 and back to
> 4 , the issue does not happen and even if it happens the residual load
> avg is very little.
> 
>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
>> removed from tg->load_avg when you unplug the CPUs ? I can easily
>> imagine that the rate limit can skip some update of tg- >load_avg
>> while offlining the cpu
>>
> 
> I will try to trace it but just so you know this issue is happening on other
> kernel versions (which don't have rate limit feature) as well. I started
> with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
> 
I collected some debug trace to understand the missing load avg
context better. From the traces it looks like during scale down,
the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
properly for CPU(s) being hotplugged out.

For example if we look at following snippet (I have kept
only the relevant portion of trace in the mail), we can see that,
in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
both the load avg and contribution of this cfs_rq were 1024.
So delta was zero and this contribution eventually remains undeducted.
In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
offlined.


cpuhp/15-131605  [015] d...  6112.350658: update_tg_load_avg.constprop.124:
   cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 0 delta = 0 ###
 systemd-udevd-894 [005] d...  6112.351096: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 0 delta = 1024 ###
 systemd-udevd-894 [005] d...  6112.351165: update_tg_load_avg.constprop.124:
   cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 1024 delta = 10 ###

.........................
.........................
 cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 3085 delta = 0 ###
.........................
 sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 4041 delta = 1024 ###
.........................
 cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 3085 delta = 0 ###
..........................
 sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 4041 delta = 1024 ###
..........................
 systemd-run-142416  [011] d.h.  6112.506547: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
   tg->load_avg = 3010 delta = 0 ###
..........................
 systemd-run-142416  [011] d.h.  6112.507546: update_tg_load_avg.constprop.124:
   cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
   tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]

..........................
..........................
<idle>-0  [001] d.s.  6113.868542: update_tg_load_avg.constprop.124:
   cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 1027 delta = 0 ###
<idle>-0  [001] d.s.  6113.869542: update_tg_load_avg.constprop.124:
   cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 1027 delta = 0 ###
 <idle>-0 [001] d.s.  6113.870541: update_tg_load_avg.constprop.124:
   cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
   tg->load_avg = 1027 delta = 0 ###


If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
the contribution of this cfs_rq will get deducted from tg->load_avg.
It looks like during hotplug load of one or more tasks, being migrated are
not getting accounted in the source cfs_rq and this is ending up as residual
load_avg at task_group (if these tasks are members of a task_group).

Moreover this looks racy and dependent on number of CPUs or some delay.
For example for scale down from 124 to 4 CPUs I always hit this issue but
for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
Also for the cases when residual load_avg at task group is low (like < 10),
I can see that both of my test cgroups get similar CPU times which further
proves that the unaccounted load avg ending up in a task_group is eventually
leading to uneven CPU allotment between task groups.


I am debugging it further but in the mean time if you have some suggestions or
need traces from some specific portion of sched code, please let me know.

Thanks,
Imran

> Thanks,
> Imran
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-19  6:41     ` Imran Khan
@ 2023-12-19 18:49       ` Vincent Guittot
  2023-12-20  2:00         ` Imran Khan
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2023-12-19 18:49 UTC (permalink / raw)
  To: Imran Khan
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel

Hi Imram,

On Tue, 19 Dec 2023 at 07:42, Imran Khan <imran.f.khan@oracle.com> wrote:
>
> Hello Vincent,
>
>
> On 15/12/2023 8:59 pm, Imran Khan wrote:
> > Hello Vincent,
> > Thanks a lot for having a look and getting back.
> >
> > On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> >> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@oracle.com> wrote:
> >>>
> >>> It has been found that sometimes a task_group has some residual
> >>> load_avg even though the load average at each of its owned queues
> >>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> >>> tg_load_avg_contrib have become 0 for a long time.
> >>> Under this scenario if another task starts running in this task_group,
> >>> it does not get proper time share on CPU since pre-existing
> >>> load average of task group inversely impacts the new task's CPU share
> >>> on each CPU.
> >>>
> >>> This change looks for the condition when a task_group has no running
> >>> tasks and sets the task_group's load average to 0 in such cases, so
> >>> that tasks that run in future under this task_group get the CPU time
> >>> in accordance with the current load.
> >>>
> >>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> >>> ---
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> >>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> >>> host side.
> >>
> >> Could it be the root cause of your problem ?
> >>
> >> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> >> then unplugged,  have not been correctly removed from tg->load_avg. If
> >> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> >> tg->load_avg should be 0 too.
> >>
> > Agree and this was my understanding as well. The issue only happens
> > with large number of CPUs. For example if I go from 4 to 8 and back to
> > 4 , the issue does not happen and even if it happens the residual load
> > avg is very little.
> >
> >> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> >> removed from tg->load_avg when you unplug the CPUs ? I can easily
> >> imagine that the rate limit can skip some update of tg- >load_avg
> >> while offlining the cpu
> >>
> >
> > I will try to trace it but just so you know this issue is happening on other
> > kernel versions (which don't have rate limit feature) as well. I started
> > with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
> >
> I collected some debug trace to understand the missing load avg
> context better. From the traces it looks like during scale down,
> the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
> properly for CPU(s) being hotplugged out.

Your traces are interesting and I think that
task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
call update_tg_load_avg() to reflect that in tg->load_avg.

Could you try the patch below ? It forces the scheduler to clear the
contribution of all cfs_rq of a CPU that becomes offline.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 466e01b2131f..e5da5eaab6ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct
cfs_rq *cfs_rq)
        if (cfs_rq->tg == &root_task_group)
                return;

+
+       /* rq has been offline and doesn't contribute anymore to the share */
+       if (!cpu_active(cpu_of(rq_of(cfs_rq))))
+               return;
+
        /*
         * For migration heavy workloads, access to tg->load_avg can be
         * unbound. Limit the update rate to at most once per ms.
@@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct
cfs_rq *cfs_rq)
        }
 }

+static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
+{
+       long delta;
+       u64 now;
+
+       /*
+        * No need to update load_avg for root_task_group as it is not used.
+        */
+       if (cfs_rq->tg == &root_task_group)
+               return;
+
+       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+       delta = 0 - cfs_rq->tg_load_avg_contrib;
+       atomic_long_add(delta, &cfs_rq->tg->load_avg);
+       cfs_rq->tg_load_avg_contrib = 0;
+       cfs_rq->last_update_tg_load_avg = now;
+}
+
+/* cpu offline callback */
+static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
+{
+       struct task_group *tg;
+
+       lockdep_assert_rq_held(rq);
+
+       /*
+        * The rq clock has already been updated in the
+        * set_rq_offline(), so we should skip updating
+        * the rq clock again in unthrottle_cfs_rq().
+        */
+       rq_clock_start_loop_update(rq);
+
+       rcu_read_lock();
+       list_for_each_entry_rcu(tg, &task_groups, list) {
+               struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+               clear_tg_load_avg(cfs_rq);
+       }
+       rcu_read_unlock();
+
+       rq_clock_stop_loop_update(rq);
+}
+
 /*
  * Called within set_task_rq() right before setting a task's CPU. The
  * caller only guarantees p->pi_lock is held; no other assumptions,
@@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)

        /* Ensure any throttled groups are reachable by pick_next_task */
        unthrottle_offline_cfs_rqs(rq);
+
+       /* Ensure that we remove rq contribution to group share */
+       clear_tg_offline_cfs_rqs(rq);
 }

 #endif /* CONFIG_SMP */


>
> For example if we look at following snippet (I have kept
> only the relevant portion of trace in the mail), we can see that,
> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
> both the load avg and contribution of this cfs_rq were 1024.
> So delta was zero and this contribution eventually remains undeducted.
> In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
> offlined.
>
>
> cpuhp/15-131605  [015] d...  6112.350658: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 0 delta = 0 ###
>  systemd-udevd-894 [005] d...  6112.351096: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 0 delta = 1024 ###
>  systemd-udevd-894 [005] d...  6112.351165: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 1024 delta = 10 ###
>
> .........................
> .........................
>  cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 3085 delta = 0 ###
> .........................
>  sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 4041 delta = 1024 ###
> .........................
>  cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 3085 delta = 0 ###
> ..........................
>  sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 4041 delta = 1024 ###
> ..........................
>  systemd-run-142416  [011] d.h.  6112.506547: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
>    tg->load_avg = 3010 delta = 0 ###
> ..........................
>  systemd-run-142416  [011] d.h.  6112.507546: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
>    tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
>
> ..........................
> ..........................
> <idle>-0  [001] d.s.  6113.868542: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 1027 delta = 0 ###
> <idle>-0  [001] d.s.  6113.869542: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 1027 delta = 0 ###
>  <idle>-0 [001] d.s.  6113.870541: update_tg_load_avg.constprop.124:
>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>    tg->load_avg = 1027 delta = 0 ###
>
>
> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
> drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
> the contribution of this cfs_rq will get deducted from tg->load_avg.
> It looks like during hotplug load of one or more tasks, being migrated are
> not getting accounted in the source cfs_rq and this is ending up as residual
> load_avg at task_group (if these tasks are members of a task_group).
>
> Moreover this looks racy and dependent on number of CPUs or some delay.
> For example for scale down from 124 to 4 CPUs I always hit this issue but
> for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
> Also for the cases when residual load_avg at task group is low (like < 10),
> I can see that both of my test cgroups get similar CPU times which further
> proves that the unaccounted load avg ending up in a task_group is eventually
> leading to uneven CPU allotment between task groups.
>
>
> I am debugging it further but in the mean time if you have some suggestions or
> need traces from some specific portion of sched code, please let me know.
>
> Thanks,
> Imran
>
> > Thanks,
> > Imran
> >

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-19 18:49       ` Vincent Guittot
@ 2023-12-20  2:00         ` Imran Khan
  2023-12-20 17:31           ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Imran Khan @ 2023-12-20  2:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel

Hello Vincent,

On 20/12/2023 5:49 am, Vincent Guittot wrote:
> Hi Imram,
> 
> On Tue, 19 Dec 2023 at 07:42, Imran Khan <imran.f.khan@oracle.com> wrote:
>>
>> Hello Vincent,
>>
>>
>> On 15/12/2023 8:59 pm, Imran Khan wrote:
>>> Hello Vincent,
>>> Thanks a lot for having a look and getting back.
>>>
>>> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
>>>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@oracle.com> wrote:
>>>>>
>>>>> It has been found that sometimes a task_group has some residual
>>>>> load_avg even though the load average at each of its owned queues
>>>>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
>>>>> tg_load_avg_contrib have become 0 for a long time.
>>>>> Under this scenario if another task starts running in this task_group,
>>>>> it does not get proper time share on CPU since pre-existing
>>>>> load average of task group inversely impacts the new task's CPU share
>>>>> on each CPU.
>>>>>
>>>>> This change looks for the condition when a task_group has no running
>>>>> tasks and sets the task_group's load average to 0 in such cases, so
>>>>> that tasks that run in future under this task_group get the CPU time
>>>>> in accordance with the current load.
>>>>>
>>>>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
>>>>> ---
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>>>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>>>>> host side.
>>>>
>>>> Could it be the root cause of your problem ?
>>>>
>>>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
>>>> then unplugged,  have not been correctly removed from tg->load_avg. If
>>>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
>>>> tg->load_avg should be 0 too.
>>>>
>>> Agree and this was my understanding as well. The issue only happens
>>> with large number of CPUs. For example if I go from 4 to 8 and back to
>>> 4 , the issue does not happen and even if it happens the residual load
>>> avg is very little.
>>>
>>>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
>>>> removed from tg->load_avg when you unplug the CPUs ? I can easily
>>>> imagine that the rate limit can skip some update of tg- >load_avg
>>>> while offlining the cpu
>>>>
>>>
>>> I will try to trace it but just so you know this issue is happening on other
>>> kernel versions (which don't have rate limit feature) as well. I started
>>> with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
>>>
>> I collected some debug trace to understand the missing load avg
>> context better. From the traces it looks like during scale down,
>> the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
>> properly for CPU(s) being hotplugged out.
> 
> Your traces are interesting and I think that
> task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
> call update_tg_load_avg() to reflect that in tg->load_avg.
> 
> Could you try the patch below ? It forces the scheduler to clear the
> contribution of all cfs_rq of a CPU that becomes offline.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 466e01b2131f..e5da5eaab6ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct
> cfs_rq *cfs_rq)
>         if (cfs_rq->tg == &root_task_group)
>                 return;
> 
> +
> +       /* rq has been offline and doesn't contribute anymore to the share */
> +       if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> +               return;
> +
>         /*
>          * For migration heavy workloads, access to tg->load_avg can be
>          * unbound. Limit the update rate to at most once per ms.
> @@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct
> cfs_rq *cfs_rq)
>         }
>  }
> 
> +static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
> +{
> +       long delta;
> +       u64 now;
> +
> +       /*
> +        * No need to update load_avg for root_task_group as it is not used.
> +        */
> +       if (cfs_rq->tg == &root_task_group)
> +               return;
> +
> +       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +       delta = 0 - cfs_rq->tg_load_avg_contrib;
> +       atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +       cfs_rq->tg_load_avg_contrib = 0;
> +       cfs_rq->last_update_tg_load_avg = now;
> +}
> +
> +/* cpu offline callback */
> +static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
> +{
> +       struct task_group *tg;
> +
> +       lockdep_assert_rq_held(rq);
> +
> +       /*
> +        * The rq clock has already been updated in the
> +        * set_rq_offline(), so we should skip updating
> +        * the rq clock again in unthrottle_cfs_rq().
> +        */
> +       rq_clock_start_loop_update(rq);
> +
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(tg, &task_groups, list) {
> +               struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +               clear_tg_load_avg(cfs_rq);
> +       }
> +       rcu_read_unlock();
> +
> +       rq_clock_stop_loop_update(rq);
> +}
> +
>  /*
>   * Called within set_task_rq() right before setting a task's CPU. The
>   * caller only guarantees p->pi_lock is held; no other assumptions,
> @@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)
> 
>         /* Ensure any throttled groups are reachable by pick_next_task */
>         unthrottle_offline_cfs_rqs(rq);
> +
> +       /* Ensure that we remove rq contribution to group share */
> +       clear_tg_offline_cfs_rqs(rq);
>  }
> 
>  #endif /* CONFIG_SMP */
> 
> 

Thanks a lot for this suggestion. I have tested it in my local setup and it is
fixing the issue.  With a little tweak (remove usage of
last_update_tg_load_avg), it works well for older kernel (v5.4.x) as well.
I have not yet tested for v4.14.x but should be fine there as well.

Thanks,
Imran

>>
>> For example if we look at following snippet (I have kept
>> only the relevant portion of trace in the mail), we can see that,
>> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
>> both the load avg and contribution of this cfs_rq were 1024.
>> So delta was zero and this contribution eventually remains undeducted.
>> In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
>> offlined.
>>
>>
>> cpuhp/15-131605  [015] d...  6112.350658: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 0 delta = 0 ###
>>  systemd-udevd-894 [005] d...  6112.351096: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 0 delta = 1024 ###
>>  systemd-udevd-894 [005] d...  6112.351165: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 1024 delta = 10 ###
>>
>> .........................
>> .........................
>>  cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 3085 delta = 0 ###
>> .........................
>>  sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 4041 delta = 1024 ###
>> .........................
>>  cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 3085 delta = 0 ###
>> ..........................
>>  sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 4041 delta = 1024 ###
>> ..........................
>>  systemd-run-142416  [011] d.h.  6112.506547: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
>>    tg->load_avg = 3010 delta = 0 ###
>> ..........................
>>  systemd-run-142416  [011] d.h.  6112.507546: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
>>    tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
>>
>> ..........................
>> ..........................
>> <idle>-0  [001] d.s.  6113.868542: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 1027 delta = 0 ###
>> <idle>-0  [001] d.s.  6113.869542: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 1027 delta = 0 ###
>>  <idle>-0 [001] d.s.  6113.870541: update_tg_load_avg.constprop.124:
>>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>>    tg->load_avg = 1027 delta = 0 ###
>>
>>
>> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
>> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
>> drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
>> the contribution of this cfs_rq will get deducted from tg->load_avg.
>> It looks like during hotplug load of one or more tasks, being migrated are
>> not getting accounted in the source cfs_rq and this is ending up as residual
>> load_avg at task_group (if these tasks are members of a task_group).
>>
>> Moreover this looks racy and dependent on number of CPUs or some delay.
>> For example for scale down from 124 to 4 CPUs I always hit this issue but
>> for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
>> Also for the cases when residual load_avg at task group is low (like < 10),
>> I can see that both of my test cgroups get similar CPU times which further
>> proves that the unaccounted load avg ending up in a task_group is eventually
>> leading to uneven CPU allotment between task groups.
>>
>>
>> I am debugging it further but in the mean time if you have some suggestions or
>> need traces from some specific portion of sched code, please let me know.
>>
>> Thanks,
>> Imran
>>
>>> Thanks,
>>> Imran
>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-20  2:00         ` Imran Khan
@ 2023-12-20 17:31           ` Vincent Guittot
  2023-12-21  8:21             ` Aaron Lu
  0 siblings, 1 reply; 11+ messages in thread
From: Vincent Guittot @ 2023-12-20 17:31 UTC (permalink / raw)
  To: Imran Khan, Aaron Lu
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel

On Wed, 20 Dec 2023 at 03:01, Imran Khan <imran.f.khan@oracle.com> wrote:
>
> Hello Vincent,
>
> On 20/12/2023 5:49 am, Vincent Guittot wrote:
> > Hi Imram,
> >
> > On Tue, 19 Dec 2023 at 07:42, Imran Khan <imran.f.khan@oracle.com> wrote:
> >>
> >> Hello Vincent,
> >>
> >>
> >> On 15/12/2023 8:59 pm, Imran Khan wrote:
> >>> Hello Vincent,
> >>> Thanks a lot for having a look and getting back.
> >>>
> >>> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> >>>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@oracle.com> wrote:
> >>>>>
> >>>>> It has been found that sometimes a task_group has some residual
> >>>>> load_avg even though the load average at each of its owned queues
> >>>>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> >>>>> tg_load_avg_contrib have become 0 for a long time.
> >>>>> Under this scenario if another task starts running in this task_group,
> >>>>> it does not get proper time share on CPU since pre-existing
> >>>>> load average of task group inversely impacts the new task's CPU share
> >>>>> on each CPU.
> >>>>>
> >>>>> This change looks for the condition when a task_group has no running
> >>>>> tasks and sets the task_group's load average to 0 in such cases, so
> >>>>> that tasks that run in future under this task_group get the CPU time
> >>>>> in accordance with the current load.
> >>>>>
> >>>>> Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
> >>>>> ---
> >>>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>>
> >>>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> >>>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> >>>>> host side.
> >>>>
> >>>> Could it be the root cause of your problem ?
> >>>>
> >>>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> >>>> then unplugged,  have not been correctly removed from tg->load_avg. If
> >>>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> >>>> tg->load_avg should be 0 too.
> >>>>
> >>> Agree and this was my understanding as well. The issue only happens
> >>> with large number of CPUs. For example if I go from 4 to 8 and back to
> >>> 4 , the issue does not happen and even if it happens the residual load
> >>> avg is very little.
> >>>
> >>>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> >>>> removed from tg->load_avg when you unplug the CPUs ? I can easily
> >>>> imagine that the rate limit can skip some update of tg- >load_avg
> >>>> while offlining the cpu
> >>>>
> >>>
> >>> I will try to trace it but just so you know this issue is happening on other
> >>> kernel versions (which don't have rate limit feature) as well. I started
> >>> with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
> >>>
> >> I collected some debug trace to understand the missing load avg
> >> context better. From the traces it looks like during scale down,
> >> the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
> >> properly for CPU(s) being hotplugged out.
> >
> > Your traces are interesting and I think that
> > task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
> > call update_tg_load_avg() to reflect that in tg->load_avg.
> >
> > Could you try the patch below ? It forces the scheduler to clear the
> > contribution of all cfs_rq of a CPU that becomes offline.
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 466e01b2131f..e5da5eaab6ce 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct
> > cfs_rq *cfs_rq)
> >         if (cfs_rq->tg == &root_task_group)
> >                 return;
> >
> > +
> > +       /* rq has been offline and doesn't contribute anymore to the share */
> > +       if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> > +               return;
> > +
> >         /*
> >          * For migration heavy workloads, access to tg->load_avg can be
> >          * unbound. Limit the update rate to at most once per ms.
> > @@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct
> > cfs_rq *cfs_rq)
> >         }
> >  }
> >
> > +static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
> > +{
> > +       long delta;
> > +       u64 now;
> > +
> > +       /*
> > +        * No need to update load_avg for root_task_group as it is not used.
> > +        */
> > +       if (cfs_rq->tg == &root_task_group)
> > +               return;
> > +
> > +       now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > +       delta = 0 - cfs_rq->tg_load_avg_contrib;
> > +       atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > +       cfs_rq->tg_load_avg_contrib = 0;
> > +       cfs_rq->last_update_tg_load_avg = now;
> > +}
> > +
> > +/* cpu offline callback */
> > +static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
> > +{
> > +       struct task_group *tg;
> > +
> > +       lockdep_assert_rq_held(rq);
> > +
> > +       /*
> > +        * The rq clock has already been updated in the
> > +        * set_rq_offline(), so we should skip updating
> > +        * the rq clock again in unthrottle_cfs_rq().
> > +        */
> > +       rq_clock_start_loop_update(rq);
> > +
> > +       rcu_read_lock();
> > +       list_for_each_entry_rcu(tg, &task_groups, list) {
> > +               struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +               clear_tg_load_avg(cfs_rq);
> > +       }
> > +       rcu_read_unlock();
> > +
> > +       rq_clock_stop_loop_update(rq);
> > +}
> > +
> >  /*
> >   * Called within set_task_rq() right before setting a task's CPU. The
> >   * caller only guarantees p->pi_lock is held; no other assumptions,
> > @@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)
> >
> >         /* Ensure any throttled groups are reachable by pick_next_task */
> >         unthrottle_offline_cfs_rqs(rq);
> > +
> > +       /* Ensure that we remove rq contribution to group share */
> > +       clear_tg_offline_cfs_rqs(rq);
> >  }
> >
> >  #endif /* CONFIG_SMP */
> >
> >
>
> Thanks a lot for this suggestion. I have tested it in my local setup and it is
> fixing the issue.  With a little tweak (remove usage of
> last_update_tg_load_avg), it works well for older kernel (v5.4.x) as well.
> I have not yet tested for v4.14.x but should be fine there as well.

Ok, so we have the root cause. My only concern with this fix is that
we have recently limited the access of the "global" tg->load_avg that
was creating perf regression because of contention when accessing it
too often and this patch adds a test on the "global" cpu_active_mask
at the same place. The cpu_active_mask is almost read only in this
case whereas tg->load_avg is written so that's maybe not a problem

Aaron,
Could you run the tests that you run for testing "Ratelimit update to
tg->load_avg" and check if this patch impacts your performance ?

Vincent

>
> Thanks,
> Imran
>
> >>
> >> For example if we look at following snippet (I have kept
> >> only the relevant portion of trace in the mail), we can see that,
> >> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
> >> both the load avg and contribution of this cfs_rq were 1024.
> >> So delta was zero and this contribution eventually remains undeducted.
> >> In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
> >> offlined.
> >>
> >>
> >> cpuhp/15-131605  [015] d...  6112.350658: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 0 delta = 0 ###
> >>  systemd-udevd-894 [005] d...  6112.351096: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 0 delta = 1024 ###
> >>  systemd-udevd-894 [005] d...  6112.351165: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 1024 delta = 10 ###
> >>
> >> .........................
> >> .........................
> >>  cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 3085 delta = 0 ###
> >> .........................
> >>  sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 4041 delta = 1024 ###
> >> .........................
> >>  cat-128667  [006] d...  6112.504633: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 3085 delta = 0 ###
> >> ..........................
> >>  sh-142414  [006] d...  6112.505392: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 4041 delta = 1024 ###
> >> ..........................
> >>  systemd-run-142416  [011] d.h.  6112.506547: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> >>    tg->load_avg = 3010 delta = 0 ###
> >> ..........................
> >>  systemd-run-142416  [011] d.h.  6112.507546: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> >>    tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
> >>
> >> ..........................
> >> ..........................
> >> <idle>-0  [001] d.s.  6113.868542: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 1027 delta = 0 ###
> >> <idle>-0  [001] d.s.  6113.869542: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 1027 delta = 0 ###
> >>  <idle>-0 [001] d.s.  6113.870541: update_tg_load_avg.constprop.124:
> >>    cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >>    tg->load_avg = 1027 delta = 0 ###
> >>
> >>
> >> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
> >> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
> >> drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
> >> the contribution of this cfs_rq will get deducted from tg->load_avg.
> >> It looks like during hotplug load of one or more tasks, being migrated are
> >> not getting accounted in the source cfs_rq and this is ending up as residual
> >> load_avg at task_group (if these tasks are members of a task_group).
> >>
> >> Moreover this looks racy and dependent on number of CPUs or some delay.
> >> For example for scale down from 124 to 4 CPUs I always hit this issue but
> >> for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
> >> Also for the cases when residual load_avg at task group is low (like < 10),
> >> I can see that both of my test cgroups get similar CPU times which further
> >> proves that the unaccounted load avg ending up in a task_group is eventually
> >> leading to uneven CPU allotment between task groups.
> >>
> >>
> >> I am debugging it further but in the mean time if you have some suggestions or
> >> need traces from some specific portion of sched code, please let me know.
> >>
> >> Thanks,
> >> Imran
> >>
> >>> Thanks,
> >>> Imran
> >>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-20 17:31           ` Vincent Guittot
@ 2023-12-21  8:21             ` Aaron Lu
  2023-12-21 15:26               ` Vincent Guittot
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Lu @ 2023-12-21  8:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Imran Khan, mingo, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, linux-kernel

Hi Vincent,

Thanks for the heads up.

On Wed, Dec 20, 2023 at 06:31:08PM +0100, Vincent Guittot wrote:
> Aaron,
> Could you run the tests that you run for testing "Ratelimit update to
> tg->load_avg" and check if this patch impacts your performance ?

I run hackbench/netperf/postgres_sysbench with nr_thread=nr_cpu on a 2
sockets/120cores/240cpus Intel server and didn't notice any performance
change after applying your diff so I think it's not a problem.

Regards,
Aaron

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
  2023-12-21  8:21             ` Aaron Lu
@ 2023-12-21 15:26               ` Vincent Guittot
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Guittot @ 2023-12-21 15:26 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Imran Khan, mingo, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vschneid, linux-kernel

Hi Aaron,

On Thu, 21 Dec 2023 at 09:21, Aaron Lu <aaron.lu@intel.com> wrote:
>
> Hi Vincent,
>
> Thanks for the heads up.
>
> On Wed, Dec 20, 2023 at 06:31:08PM +0100, Vincent Guittot wrote:
> > Aaron,
> > Could you run the tests that you run for testing "Ratelimit update to
> > tg->load_avg" and check if this patch impacts your performance ?
>
> I run hackbench/netperf/postgres_sysbench with nr_thread=nr_cpu on a 2
> sockets/120cores/240cpus Intel server and didn't notice any performance
> change after applying your diff so I think it's not a problem.

Thanks for the tests.

I'm going to prepare a proper patch

>
> Regards,
> Aaron

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-12-21 15:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-15  5:26 [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks Imran Khan
2023-12-15  9:11 ` Vincent Guittot
2023-12-15  9:59   ` Imran Khan
2023-12-19  6:41     ` Imran Khan
2023-12-19 18:49       ` Vincent Guittot
2023-12-20  2:00         ` Imran Khan
2023-12-20 17:31           ` Vincent Guittot
2023-12-21  8:21             ` Aaron Lu
2023-12-21 15:26               ` Vincent Guittot
2023-12-17  6:10 ` kernel test robot
2023-12-17  7:14 ` kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.