[PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg

cgroups.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
@ 2025-08-04 13:03 xupengbo
  2025-08-05  9:08 ` Aaron Lu
  0 siblings, 1 reply; 12+ messages in thread
From: xupengbo @ 2025-08-04 13:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel
  Cc: xupengbo, cgroups

We added a function named update_tg_load_avg_immediately() that mimics
update_tg_load_avg(). In this function we remove the update interval
restriction from update_tg_load_avg() in order to update tg->load
immediately when the function is called. This function is only called in
update_load_avg(). In update_load_avg(), we should call
update_tg_load_avg_immediately() if flag & DO_DETACH == true and the task
is the last task in cfs_rq, otherwise we call update_tg_load_avg(). The
reason is as follows.

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
is a possibility that the reduced load_avg is not updated to tg->load_avg
when a task migrates out.
2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
function cfs_rq_is_decayed() does not check whether
cfs->tg_load_avg_contrib is null. Consequently, in some cases,
__update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
updated to tg->load_avg.

When these two events occur within the 1ms window (defined by
NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
the CPU due to the cpumask constraints, the corresponding portion of
load_avg will never be subtracted from tg->load_avg. This results in an
inflated tg->load_avg and reduced scheduling entity (se) weight for the
task group. If the migrating task had a large weight, the task group's
share may deviate significantly from its expected value. This issue is
easily reproducible in task migration scenarios.

Initially, I discovered this bug on Android 16 (running kernel v6.12), and
was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
version that defines both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
I wrote a short C program which just does 3 things:
  1. call sched_setaffinity() to bound itself to cpu 1.
  2. call sched_setaffinity() to bound itself to cpu 2.
  3. endless loop.

Here is the source code.
```
\#define _GNU_SOURCE
\#include <sched.h>
\#include <unistd.h>
int main() {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(1, &cpuset);
  pid_t pid = gettid();

  if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
    return 1;
  }

  CPU_ZERO(&cpuset);
  CPU_SET(2, &cpuset);

  if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
    return 1;
  }
  while (1)
    ;
  return 0;
}
```

Then I made a test script to add tasks into groups.
(Forgive me for just copying and pasting those lines but not using
a for-loop)

```
\#!/usr/bin/bash

shares=100
pkill 'bug_test'
sleep 2
rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}

echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares

nohup ./bug_test &
proc1=$!
echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
nohup ./bug_test &
proc2=$!
echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
nohup ./bug_test &
proc3=$!
echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
nohup ./bug_test &
proc4=$!
echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs

```

After several repetitions of the script, we can find that some
processes have a smaller share of the cpu, while others have twice
that. This state is stable until the end of the process.

$ ps u -C bug_test
USER    PID   %CPU %MEM    VSZ   RSS TTY    STAT START   TIME COMMAND
root    13924 33.3  0.0   2556  1196 ?     R    18:55   0:56 ./bug_test
root    13925 16.6  0.0   2556  1196 ?     R    18:55   0:28 ./bug_test
root    13926 33.2  0.0   2556  1196 ?     R    18:55   0:56 ./bug_test
root    13927 16.6  0.0   2556  1200 ?     R    18:55   0:28 ./bug_test

Link: https://lore.kernel.org/all/20210501141950.23622-2-odin@uged.al/
This phenomenon is very much like the one mentioned in the above bug fix
patch.

Link: https://lore.kernel.org/cgroups/CAFpoUr1zGNf9vTbWjwsfY9E8YBjyE5xJ0SwzLebPiS7b=xz_Zw@mail.gmail.com/
Link: https://lore.kernel.org/cgroups/CAKfTPtA6AyL2f-KqHXecZrYKmZ9r9mT=Ks6BeNLjV9dfbSZJxQ@mail.gmail.com/
And there are also hints of this in the above responses. Maybe there's
nothing wrong with this patch, except that it doesn't take into account
the case that cfs_rq->tg_load_avg_contrib != 0 but
cfs_rq_is_decayed() == true when the last cfs task is migrated from this
CPU.

It is also easy to reproduce when autogroup is turned on. We just need to
repeat the following command over and over again.

$ sudo nohup ./bug_test 2>&1 &

PS: You must use sudo or the new process will be added to the same
autogroup.

$ ps u -C bug_test

USER      PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root    14444 18.8  0.0   2556  1196 pts/5    RN   19:30   0:16 ./bug_test
root    14458 17.9  0.0   2556  1196 pts/6    RN   19:30   0:15 ./bug_test
root    14472 17.0  0.0   2556  1196 pts/7    RN   19:30   0:14 ./bug_test
root    14486 16.7  0.0   2556  1196 pts/8    RN   19:30   0:13 ./bug_test
root    14500 33.2  0.0   2556  1196 pts/9    RN   19:30   0:27 ./bug_test

As we can see, each process was supposed to get a share of 20, but 4 out
of 5 encountered the bug.

Same environment but another experiment:
In /sys/kernel/debug/sched/
$ cat debug | grep 'bug_test'|  awk '{print $2, $3, $4, $16}'
bug_test 20245 21805.633289 /autogroup-891
bug_test 20256 10453.702263 /autogroup-892
bug_test 20270 20333.813653 /autogroup-893
bug_test 20284 19965.294506 /autogroup-894
bug_test 20298 9781.445557 /autogroup-895
rker:bug_test.c 13189 10.359087 /autogroup-515
rker:bug_test.c 13190 11.425550 /autogroup-515

$ cat debug | grep -A 28 'cfs_rq\[2\]:/autogroup' | \
egrep "(tg_load_avg|cfs_rq)"
cfs_rq[2]:/autogroup-891
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 335
cfs_rq[2]:/autogroup-892
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 670
cfs_rq[2]:/autogroup-893
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 335
cfs_rq[2]:/autogroup-894
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 335
cfs_rq[2]:/autogroup-895
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 670

$ cat debug | grep -A 28 '\]:/autogroup-892' | \
egrep '(tg_load_avg|cfs_rq|\.nr_queued|\.load)'
cfs_rq[2]:/autogroup-892
  .nr_queued                     : 1
  .load                          : 343040
  .load_avg                      : 335
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 670
  .se->avg.load_avg              : 511

There is only one task in autogroup-892, even though it has doubled
tg_load_avg. As far as I know, this should not be a feature of autogroup.

The above experiments were all done on ubuntu24.04 with kernel v6.16-RC7.
On Android, I instrumented trace points at these two code locations for
the update_tg_load_avg() and __update_blocked_fair() functions.
```
static bool __update_blocked_fair(struct rq *rq, bool *done)
{
  ......
    if (cfs_rq_is_decayed(cfs_rq)) {
      //trace here
      list_del_leaf_cfs_rq(cfs_rq);
    }
  ......
}

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
  ......
  now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
  if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) {
    //trace here
    return;
  }
  ......
}
```
I'm sorry I can't provide these traces. But It's easy to see with
perfetto, that the abnormal cfs_rq hitting both if-branches within 1ms.
At first, the only task in the cgroup was moved into cpu 1, called
update_tg_load_avg and updated its timestamp. It was then migrated out
within 1ms, but when it was migrated out, the update failed due to the
update interval limit. At this point cfs_rq is still in the
leaf_cfs_rq_list, but in some cases cfs_rq_is_decay() returns true within
1ms, and the value of cfs_rq->tg_load_avg_contrib approximates the task's
load_avg which is not null.

When a task is migrated from cfs_rq, dequeue_load_avg() will subtract its
avg.load_sum and avg.load_avg. Sometimes its load_sum is reduced to null
sometimes not. If load_sum is reduced to null, then this cfs_rq will be
removed from the leaf_cfs_rq_list soon. So __update_blocked_fair() can not
update it anymore.

Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
In this patch, Odin proposed adding a check in cfs_rq_is_decayed() to
determine whether cfs_rq->tg_load_avg_contrib is null. However, it appears
that this patch was not merged. In fact, if there were a check in
cfs_rq_is_decayed() similar to the one in update_tg_load_avg() regarding
the size of the _delta_ value (see update_tg_load_avg()), this issue
could also be effectively resolved. This solution would block (2.),
because if delta is too large, cfs_rq_is_decayed() returns false, and the
cfs_rq remains in leaf_cfs_rq_list, ultimately causing
__update_blocked_fair() to update it outside the 1ms limit. The only
consideration is whether to add a check for cfs_rq->tg_load_avg_contrib in
cfs_rq_is_decayed(), which may increase coupling.

Signed-off-by: xupengbo <xupengbo@oppo.com>
---
 kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..97feba367be9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	return true;
 }

+/* only called in update_load_avg() */
+static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
+{
+	long delta;
+	u64 now;
+
+	/*
+	 * No need to update load_avg for root_task_group as it is not used.
+	 */
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
+	/* rq has been offline and doesn't contribute to the share anymore: */
+	if (!cpu_active(cpu_of(rq_of(cfs_rq))))
+		return;
+
+	/*
+	 * Under normal circumstances, for migration heavy workloads, access
+	 * to tg->load_avg can be unbound. Limit the update rate to at most
+	 * once per ms.
+	 * However when the last task is migrating from this cpu, we must
+	 * update tg->load_avg immediately. Otherwise, if this cfs_rq becomes
+	 * idle forever due to cpumask and is removed from leaf_cfs_rq_list,
+	 * the huge mismatch between cfs_rq->avg.load_avg(which may be zero)
+	 * and cfs_rq->tg_load_avg_contrib(stalled load_avg of last task)
+	 * can never be corrected, which will lead to a significant value
+	 * error in se.weight for this group.
+	 * We retain this value filter below because it is not the main cause
+	 * of this bug, so we are being conservative.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
+	}
+}
+
 /**
  * update_tg_load_avg - update the tg's load avg
  * @cfs_rq: the cfs_rq whose avg changed
@@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)

 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}

+static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
+
 static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}

 static inline int propagate_entity_load_avg(struct sched_entity *se)
@@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		/*
 		 * DO_DETACH means we're here from dequeue_entity()
 		 * and we are migrating task out of the CPU.
+		 *
+		 * At this point, we have not subtracted nr_queued.
+		 * If cfs_rq->nr_queued ==1, the last cfs task is being
+		 * migrated from this cfs_rq.
 		 */
 		detach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq);
+		if (cfs_rq->nr_queued == 1)
+			update_tg_load_avg_immediately(cfs_rq);
+		else
+			update_tg_load_avg(cfs_rq);
 	} else if (decayed) {
 		cfs_rq_util_change(cfs_rq, 0);

-- 
2.43.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-04 13:03 [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out xupengbo
@ 2025-08-05  9:08 ` Aaron Lu
  2025-08-05  9:17   ` Vincent Guittot
  0 siblings, 1 reply; 12+ messages in thread
From: Aaron Lu @ 2025-08-05  9:08 UTC (permalink / raw)
  To: xupengbo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, cgroups

On Mon, Aug 04, 2025 at 09:03:26PM +0800, xupengbo wrote:
> We added a function named update_tg_load_avg_immediately() that mimics
> update_tg_load_avg(). In this function we remove the update interval
> restriction from update_tg_load_avg() in order to update tg->load
> immediately when the function is called. This function is only called in
> update_load_avg(). In update_load_avg(), we should call
> update_tg_load_avg_immediately() if flag & DO_DETACH == true and the task
> is the last task in cfs_rq, otherwise we call update_tg_load_avg(). The
> reason is as follows.
> 
> 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> is a possibility that the reduced load_avg is not updated to tg->load_avg
> when a task migrates out.
> 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> function cfs_rq_is_decayed() does not check whether
> cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> updated to tg->load_avg.
> 
> When these two events occur within the 1ms window (defined by
> NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
> the CPU due to the cpumask constraints, the corresponding portion of
> load_avg will never be subtracted from tg->load_avg. This results in an
> inflated tg->load_avg and reduced scheduling entity (se) weight for the
> task group. If the migrating task had a large weight, the task group's
> share may deviate significantly from its expected value. This issue is
> easily reproducible in task migration scenarios.
>

Agree this is a problem.

> Initially, I discovered this bug on Android 16 (running kernel v6.12), and
> was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
> kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
> version that defines both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
> I wrote a short C program which just does 3 things:
>   1. call sched_setaffinity() to bound itself to cpu 1.
>   2. call sched_setaffinity() to bound itself to cpu 2.
>   3. endless loop.
> 
> Here is the source code.
> ```
> \#define _GNU_SOURCE
> \#include <sched.h>
> \#include <unistd.h>
> int main() {
>   cpu_set_t cpuset;
>   CPU_ZERO(&cpuset);
>   CPU_SET(1, &cpuset);
>   pid_t pid = gettid();
> 
>   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
>     return 1;
>   }
> 
>   CPU_ZERO(&cpuset);
>   CPU_SET(2, &cpuset);
> 
>   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
>     return 1;
>   }
>   while (1)
>     ;
>   return 0;
> }
> ```
> 
> Then I made a test script to add tasks into groups.
> (Forgive me for just copying and pasting those lines but not using
> a for-loop)
> 
> ```
> \#!/usr/bin/bash
> 
> shares=100
> pkill 'bug_test'
> sleep 2
> rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> 
> echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares
> 
> nohup ./bug_test &
> proc1=$!
> echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
> nohup ./bug_test &
> proc2=$!
> echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
> nohup ./bug_test &
> proc3=$!
> echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
> nohup ./bug_test &
> proc4=$!
> echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs
> 
> ```
>
> After several repetitions of the script, we can find that some
> processes have a smaller share of the cpu, while others have twice
> that. This state is stable until the end of the process.
 
I can reproduce it using your test code.
 
> When a task is migrated from cfs_rq, dequeue_load_avg() will subtract its
> avg.load_sum and avg.load_avg. Sometimes its load_sum is reduced to null
> sometimes not. If load_sum is reduced to null, then this cfs_rq will be
> removed from the leaf_cfs_rq_list soon. So __update_blocked_fair() can not
> update it anymore.
> 
> Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
> In this patch, Odin proposed adding a check in cfs_rq_is_decayed() to
> determine whether cfs_rq->tg_load_avg_contrib is null. However, it appears
> that this patch was not merged. In fact, if there were a check in
> cfs_rq_is_decayed() similar to the one in update_tg_load_avg() regarding
> the size of the _delta_ value (see update_tg_load_avg()), this issue
> could also be effectively resolved. This solution would block (2.),
> because if delta is too large, cfs_rq_is_decayed() returns false, and the
> cfs_rq remains in leaf_cfs_rq_list, ultimately causing
> __update_blocked_fair() to update it outside the 1ms limit. The only
> consideration is whether to add a check for cfs_rq->tg_load_avg_contrib in
> cfs_rq_is_decayed(), which may increase coupling.
>

Performance wise, doing an immediate update to tg->load_avg on detach
path should be OK because last time when I did those tests, the migration
path that leads to updates to tg->load_avg is mostly from task wakeup path.
I also did some quick tests using hackbench and netperf on an Intel EMR
and didn't notice anything problematic regarding performance with your
change here.

With this said, I think adding cfs_rq->tg_load_avg_contrib check in
cfs_rq_is_decayed() makes more sense to me, because if a cfs_rq still has
contrib to its tg but its load_avg is 0, it should stay in that list and
have its contrib synced with its load_avg to zero when that 1ms window
has passed by __update_blocked_fair() path.

> Signed-off-by: xupengbo <xupengbo@oppo.com>

Maybe add a fix tag for commit 1528c661c24b("sched/fair: Ratelimit
update to tg->load_avg")?

Thanks,
Aaron

> ---
>  kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..97feba367be9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	return true;
>  }
>  
> +/* only called in update_load_avg() */
> +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
> +{
> +	long delta;
> +	u64 now;
> +
> +	/*
> +	 * No need to update load_avg for root_task_group as it is not used.
> +	 */
> +	if (cfs_rq->tg == &root_task_group)
> +		return;
> +
> +	/* rq has been offline and doesn't contribute to the share anymore: */
> +	if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> +		return;
> +
> +	/*
> +	 * Under normal circumstances, for migration heavy workloads, access
> +	 * to tg->load_avg can be unbound. Limit the update rate to at most
> +	 * once per ms.
> +	 * However when the last task is migrating from this cpu, we must
> +	 * update tg->load_avg immediately. Otherwise, if this cfs_rq becomes
> +	 * idle forever due to cpumask and is removed from leaf_cfs_rq_list,
> +	 * the huge mismatch between cfs_rq->avg.load_avg(which may be zero)
> +	 * and cfs_rq->tg_load_avg_contrib(stalled load_avg of last task)
> +	 * can never be corrected, which will lead to a significant value
> +	 * error in se.weight for this group.
> +	 * We retain this value filter below because it is not the main cause
> +	 * of this bug, so we are being conservative.
> +	 */
> +	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> +		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +		cfs_rq->last_update_tg_load_avg = now;
> +	}
> +}
> +
>  /**
>   * update_tg_load_avg - update the tg's load avg
>   * @cfs_rq: the cfs_rq whose avg changed
> @@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
>  
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>  
> +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
> +
>  static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
>  
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
> @@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		/*
>  		 * DO_DETACH means we're here from dequeue_entity()
>  		 * and we are migrating task out of the CPU.
> +		 *
> +		 * At this point, we have not subtracted nr_queued.
> +		 * If cfs_rq->nr_queued ==1, the last cfs task is being
> +		 * migrated from this cfs_rq.
>  		 */
>  		detach_entity_load_avg(cfs_rq, se);
> -		update_tg_load_avg(cfs_rq);
> +		if (cfs_rq->nr_queued == 1)
> +			update_tg_load_avg_immediately(cfs_rq);
> +		else
> +			update_tg_load_avg(cfs_rq);
>  	} else if (decayed) {
>  		cfs_rq_util_change(cfs_rq, 0);
>  
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-05  9:08 ` Aaron Lu
@ 2025-08-05  9:17   ` Vincent Guittot
  2025-08-06  6:31     ` [PATCH v2] " xupengbo
  0 siblings, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2025-08-05  9:17 UTC (permalink / raw)
  To: Aaron Lu
  Cc: xupengbo, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, cgroups

On Tue, 5 Aug 2025 at 11:08, Aaron Lu <ziqianlu@bytedance.com> wrote:
>
> On Mon, Aug 04, 2025 at 09:03:26PM +0800, xupengbo wrote:
> > We added a function named update_tg_load_avg_immediately() that mimics
> > update_tg_load_avg(). In this function we remove the update interval
> > restriction from update_tg_load_avg() in order to update tg->load
> > immediately when the function is called. This function is only called in
> > update_load_avg(). In update_load_avg(), we should call
> > update_tg_load_avg_immediately() if flag & DO_DETACH == true and the task
> > is the last task in cfs_rq, otherwise we call update_tg_load_avg(). The
> > reason is as follows.
> >
> > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > when a task migrates out.
> > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > function cfs_rq_is_decayed() does not check whether
> > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > updated to tg->load_avg.
> >
> > When these two events occur within the 1ms window (defined by
> > NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
> > the CPU due to the cpumask constraints, the corresponding portion of
> > load_avg will never be subtracted from tg->load_avg. This results in an
> > inflated tg->load_avg and reduced scheduling entity (se) weight for the
> > task group. If the migrating task had a large weight, the task group's
> > share may deviate significantly from its expected value. This issue is
> > easily reproducible in task migration scenarios.
> >
>
> Agree this is a problem.
>
> > Initially, I discovered this bug on Android 16 (running kernel v6.12), and
> > was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
> > kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
> > version that defines both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
> > I wrote a short C program which just does 3 things:
> >   1. call sched_setaffinity() to bound itself to cpu 1.
> >   2. call sched_setaffinity() to bound itself to cpu 2.
> >   3. endless loop.
> >
> > Here is the source code.
> > ```
> > \#define _GNU_SOURCE
> > \#include <sched.h>
> > \#include <unistd.h>
> > int main() {
> >   cpu_set_t cpuset;
> >   CPU_ZERO(&cpuset);
> >   CPU_SET(1, &cpuset);
> >   pid_t pid = gettid();
> >
> >   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
> >     return 1;
> >   }
> >
> >   CPU_ZERO(&cpuset);
> >   CPU_SET(2, &cpuset);
> >
> >   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
> >     return 1;
> >   }
> >   while (1)
> >     ;
> >   return 0;
> > }
> > ```
> >
> > Then I made a test script to add tasks into groups.
> > (Forgive me for just copying and pasting those lines but not using
> > a for-loop)
> >
> > ```
> > \#!/usr/bin/bash
> >
> > shares=100
> > pkill 'bug_test'
> > sleep 2
> > rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> > mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> >
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares
> >
> > nohup ./bug_test &
> > proc1=$!
> > echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
> > nohup ./bug_test &
> > proc2=$!
> > echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
> > nohup ./bug_test &
> > proc3=$!
> > echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
> > nohup ./bug_test &
> > proc4=$!
> > echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs
> >
> > ```
> >
> > After several repetitions of the script, we can find that some
> > processes have a smaller share of the cpu, while others have twice
> > that. This state is stable until the end of the process.
>
> I can reproduce it using your test code.
>
> > When a task is migrated from cfs_rq, dequeue_load_avg() will subtract its
> > avg.load_sum and avg.load_avg. Sometimes its load_sum is reduced to null
> > sometimes not. If load_sum is reduced to null, then this cfs_rq will be
> > removed from the leaf_cfs_rq_list soon. So __update_blocked_fair() can not
> > update it anymore.
> >
> > Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
> > In this patch, Odin proposed adding a check in cfs_rq_is_decayed() to
> > determine whether cfs_rq->tg_load_avg_contrib is null. However, it appears
> > that this patch was not merged. In fact, if there were a check in
> > cfs_rq_is_decayed() similar to the one in update_tg_load_avg() regarding
> > the size of the _delta_ value (see update_tg_load_avg()), this issue
> > could also be effectively resolved. This solution would block (2.),
> > because if delta is too large, cfs_rq_is_decayed() returns false, and the
> > cfs_rq remains in leaf_cfs_rq_list, ultimately causing
> > __update_blocked_fair() to update it outside the 1ms limit. The only
> > consideration is whether to add a check for cfs_rq->tg_load_avg_contrib in
> > cfs_rq_is_decayed(), which may increase coupling.
> >
>
> Performance wise, doing an immediate update to tg->load_avg on detach
> path should be OK because last time when I did those tests, the migration
> path that leads to updates to tg->load_avg is mostly from task wakeup path.
> I also did some quick tests using hackbench and netperf on an Intel EMR
> and didn't notice anything problematic regarding performance with your
> change here.
>
> With this said, I think adding cfs_rq->tg_load_avg_contrib check in
> cfs_rq_is_decayed() makes more sense to me, because if a cfs_rq still has

Adding a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed()
makes more sense for me too because it doesn't bypass the ratelimit

> contrib to its tg but its load_avg is 0, it should stay in that list and
> have its contrib synced with its load_avg to zero when that 1ms window
> has passed by __update_blocked_fair() path.
>
> > Signed-off-by: xupengbo <xupengbo@oppo.com>
>
> Maybe add a fix tag for commit 1528c661c24b("sched/fair: Ratelimit
> update to tg->load_avg")?
>
> Thanks,
> Aaron
>
> > ---
> >  kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c..97feba367be9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >       return true;
> >  }
> >
> > +/* only called in update_load_avg() */
> > +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
> > +{
> > +     long delta;
> > +     u64 now;
> > +
> > +     /*
> > +      * No need to update load_avg for root_task_group as it is not used.
> > +      */
> > +     if (cfs_rq->tg == &root_task_group)
> > +             return;
> > +
> > +     /* rq has been offline and doesn't contribute to the share anymore: */
> > +     if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> > +             return;
> > +
> > +     /*
> > +      * Under normal circumstances, for migration heavy workloads, access
> > +      * to tg->load_avg can be unbound. Limit the update rate to at most
> > +      * once per ms.
> > +      * However when the last task is migrating from this cpu, we must
> > +      * update tg->load_avg immediately. Otherwise, if this cfs_rq becomes
> > +      * idle forever due to cpumask and is removed from leaf_cfs_rq_list,
> > +      * the huge mismatch between cfs_rq->avg.load_avg(which may be zero)
> > +      * and cfs_rq->tg_load_avg_contrib(stalled load_avg of last task)
> > +      * can never be corrected, which will lead to a significant value
> > +      * error in se.weight for this group.
> > +      * We retain this value filter below because it is not the main cause
> > +      * of this bug, so we are being conservative.
> > +      */
> > +     now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > +     delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +     if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > +             atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > +             cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +             cfs_rq->last_update_tg_load_avg = now;
> > +     }
> > +}
> > +
> >  /**
> >   * update_tg_load_avg - update the tg's load avg
> >   * @cfs_rq: the cfs_rq whose avg changed
> > @@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> >
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> >
> > +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
> > +
> >  static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
> >
> >  static inline int propagate_entity_load_avg(struct sched_entity *se)
> > @@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >               /*
> >                * DO_DETACH means we're here from dequeue_entity()
> >                * and we are migrating task out of the CPU.
> > +              *
> > +              * At this point, we have not subtracted nr_queued.
> > +              * If cfs_rq->nr_queued ==1, the last cfs task is being
> > +              * migrated from this cfs_rq.
> >                */
> >               detach_entity_load_avg(cfs_rq, se);
> > -             update_tg_load_avg(cfs_rq);
> > +             if (cfs_rq->nr_queued == 1)
> > +                     update_tg_load_avg_immediately(cfs_rq);
> > +             else
> > +                     update_tg_load_avg(cfs_rq);
> >       } else if (decayed) {
> >               cfs_rq_util_change(cfs_rq, 0);
> >
> > --
> > 2.43.0
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-05  9:17   ` Vincent Guittot
@ 2025-08-06  6:31     ` xupengbo
  2025-08-06  7:33       ` Aaron Lu
  0 siblings, 1 reply; 12+ messages in thread
From: xupengbo @ 2025-08-06  6:31 UTC (permalink / raw)
  To: vincent.guittot
  Cc: bsegall, cgroups, dietmar.eggemann, juri.lelli, linux-kernel,
	mgorman, mingo, peterz, rostedt, vschneid, xupengbo, ziqianlu

> >On Tue, 5 Aug 2025 at 16:42, xupengbo <xupengbo@oppo.com> wrote:
> >
> > When a task is migrated out, there is a probability that the tg->load_avg
> > value will become abnormal. The reason is as follows.
> >
> > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > when a task migrates out.
> > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > function cfs_rq_is_decayed() does not check whether
> > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > updated to tg->load_avg.
> >
> > I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > which blocks the case (2.) mentioned above. I follow the condition in
> > update_tg_load_avg() instead of directly checking if
> > cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
> > condition consistent in both places, otherwise unexpected problems may
> > occur.
> >
> > Thanks for your comments,
> > Xu Pengbo
> >
> > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > Signed-off-by: xupengbo <xupengbo@oppo.com>
> > ---
> > Changes:
> > v1 -> v2:
> > - Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in
> > cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
> > - Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u
> >
> >  kernel/sched/fair.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c..a35083a2d006 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >         if (child_cfs_rq_on_list(cfs_rq))
> >                 return false;
> >
> > +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +
> > +       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> 
>I don't understand why you use the above condition instead of if
>(!cfs_rq->tg_load_avg_contrib). Can you elaborate ?
> 
>strictly speaking we want to keep the cfs_rq in the list if
>(cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg) and
>cfs_rq->avg.load_avg == 0 when we test this condition

I use this condition primarily based on the function update_tg_load_avg().
I want to absolutely avoid a situation where cfs_rq_is_decay() returns 
false but update_tg_load_avg() cannot update its value due to the delta 
check, which may cause the cfs_rq to remain on the list permanently. 
Honestly, I am not sure if this will happen, so I took this conservative 
approach. 

In fact, in the second if-condition of cfs_rq_is_decay(), the comment in 
the load_avg_is_decayed() function states:"_avg must be null when _sum is 
null because _avg = _sum / divider". Therefore, when we check this newly 
added condition, cfs_rq->avg.load_avg should already be 0, right?

After reading your comments, I carefully considered the differences 
between these two approaches. Here, my condition is similar
to cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg but weaker. In 
fact, when cfs_rq->avg.load_avg is already 0, 
abs(delta) > cfs_rq->tg_load_avg_contrib / 64 is equivalent to 
cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64,
Further reasoning leads to the condition cfs_rq->tg_load_avg_contrib > 0.
However if cfs_rq->avg.load_avg is not necessarily 0 at this point, then
the condition you propose is obviously more accurate, simpler than the
delta check, and requires fewer calculations.

I think our perspectives differ. From the perspective of 
update_tg_load_avg(), the semantics of this condition are as follows: if
there is no 1ms update limit, and update_tg_load_avg() can continue 
updating after checking the delta, then in cfs_rq_is_decayed() we should
return false to keep the cfs_rq in the list for subsequent updates. As 
mentioned in the first paragraph, this avoids that tricky situation. From
the perspective of cfs_rq_is_decayed(), the semantics of the condition you
proposed are that if cfs_rq->avg.load_avg is already 0, then it cannot be
removed from the list before all load_avg are updated to tg. That makes 
sense to me, but I still feel like there's a little bit of a risk. Am I 
being paranoid?

How do you view these two lines of thinking?

It's a pleasure to discuss this with you, 
xupengbo.

> > +               return false;
> > +
> >         return true;
> >  }
> >
> > --
> > 2.43.0
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-06  6:31     ` [PATCH v2] " xupengbo
@ 2025-08-06  7:33       ` Aaron Lu
  2025-08-06  7:58         ` xupengbo
  2025-08-06  8:38         ` xupengbo
  0 siblings, 2 replies; 12+ messages in thread
From: Aaron Lu @ 2025-08-06  7:33 UTC (permalink / raw)
  To: xupengbo
  Cc: vincent.guittot, bsegall, cgroups, dietmar.eggemann, juri.lelli,
	linux-kernel, mgorman, mingo, peterz, rostedt, vschneid

On Wed, Aug 06, 2025 at 02:31:58PM +0800, xupengbo wrote:
> > >On Tue, 5 Aug 2025 at 16:42, xupengbo <xupengbo@oppo.com> wrote:
> > >
> > > When a task is migrated out, there is a probability that the tg->load_avg
> > > value will become abnormal. The reason is as follows.
> > >
> > > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > > when a task migrates out.
> > > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > > function cfs_rq_is_decayed() does not check whether
> > > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > > updated to tg->load_avg.
> > >
> > > I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > > which blocks the case (2.) mentioned above. I follow the condition in
> > > update_tg_load_avg() instead of directly checking if
> > > cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
> > > condition consistent in both places, otherwise unexpected problems may
> > > occur.
> > >
> > > Thanks for your comments,
> > > Xu Pengbo
> > >
> > > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > > Signed-off-by: xupengbo <xupengbo@oppo.com>
> > > ---
> > > Changes:
> > > v1 -> v2:
> > > - Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in
> > > cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
> > > - Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u
> > >
> > >  kernel/sched/fair.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b173a059315c..a35083a2d006 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > >         if (child_cfs_rq_on_list(cfs_rq))
> > >                 return false;
> > >
> > > +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > > +
> > > +       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> > 
> >I don't understand why you use the above condition instead of if
> >(!cfs_rq->tg_load_avg_contrib). Can you elaborate ?
> > 
> >strictly speaking we want to keep the cfs_rq in the list if
> >(cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg) and
> >cfs_rq->avg.load_avg == 0 when we test this condition
> 
> 
> I use this condition primarily based on the function update_tg_load_avg().
> I want to absolutely avoid a situation where cfs_rq_is_decay() returns 
> false but update_tg_load_avg() cannot update its value due to the delta 
> check, which may cause the cfs_rq to remain on the list permanently. 
> Honestly, I am not sure if this will happen, so I took this conservative 
> approach.

Hmm...it doesn't seem we need worry about this situation.

Because when cfs_rq->load_avg is 0, abs(delta) will be
cfs_rq->tg_load_avg_contrib and the following condition:

	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
becomes:
	if (cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64)

which should always be true, right?

Thanks,
Aaron

> 
> In fact, in the second if-condition of cfs_rq_is_decay(), the comment in 
> the load_avg_is_decayed() function states:"_avg must be null when _sum is 
> null because _avg = _sum / divider". Therefore, when we check this newly 
> added condition, cfs_rq->avg.load_avg should already be 0, right?
> 
> After reading your comments, I carefully considered the differences 
> between these two approaches. Here, my condition is similar
> to cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg but weaker. In 
> fact, when cfs_rq->avg.load_avg is already 0, 
> abs(delta) > cfs_rq->tg_load_avg_contrib / 64 is equivalent to 
> cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64,
> Further reasoning leads to the condition cfs_rq->tg_load_avg_contrib > 0.
> However if cfs_rq->avg.load_avg is not necessarily 0 at this point, then
> the condition you propose is obviously more accurate, simpler than the
> delta check, and requires fewer calculations.
> 
> I think our perspectives differ. From the perspective of 
> update_tg_load_avg(), the semantics of this condition are as follows: if
> there is no 1ms update limit, and update_tg_load_avg() can continue 
> updating after checking the delta, then in cfs_rq_is_decayed() we should
> return false to keep the cfs_rq in the list for subsequent updates. As 
> mentioned in the first paragraph, this avoids that tricky situation. From
> the perspective of cfs_rq_is_decayed(), the semantics of the condition you
> proposed are that if cfs_rq->avg.load_avg is already 0, then it cannot be
> removed from the list before all load_avg are updated to tg. That makes 
> sense to me, but I still feel like there's a little bit of a risk. Am I 
> being paranoid?
> 
> How do you view these two lines of thinking?
> 
> It's a pleasure to discuss this with you, 
> xupengbo.
> 
> > > +               return false;
> > > +
> > >         return true;
> > >  }
> > >
> > > --
> > > 2.43.0
> > >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-06  7:33       ` Aaron Lu
@ 2025-08-06  7:58         ` xupengbo
  2025-08-06  8:38         ` xupengbo
  1 sibling, 0 replies; 12+ messages in thread
From: xupengbo @ 2025-08-06  7:58 UTC (permalink / raw)
  To: ziqianlu
  Cc: bsegall, cgroups, dietmar.eggemann, juri.lelli, linux-kernel,
	mgorman, mingo, peterz, rostedt, vincent.guittot, vschneid,
	xupengbo

On Wed, Aug 06, 2025 at 02:31:58PM +0800, xupengbo wrote:
> > >On Tue, 5 Aug 2025 at 16:42, xupengbo <xupengbo@oppo.com> wrote:
> > > >
> > > > When a task is migrated out, there is a probability that the tg->load_avg
> > > > value will become abnormal. The reason is as follows.
> > > >
> > > > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > > > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > > > when a task migrates out.
> > > > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > > > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > > > function cfs_rq_is_decayed() does not check whether
> > > > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > > > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > > > updated to tg->load_avg.
> > > >
> > > > I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > > > which blocks the case (2.) mentioned above. I follow the condition in
> > > > update_tg_load_avg() instead of directly checking if
> > > > cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
> > > > condition consistent in both places, otherwise unexpected problems may
> > > > occur.
> > > >
> > > > Thanks for your comments,
> > > > Xu Pengbo
> > > >
> > > > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > > > Signed-off-by: xupengbo <xupengbo@oppo.com>
> > > > ---
> > > > Changes:
> > > > v1 -> v2:
> > > > - Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in
> > > > cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
> > > > - Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u
> > > >
> > > >  kernel/sched/fair.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index b173a059315c..a35083a2d006 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > >         if (child_cfs_rq_on_list(cfs_rq))
> > > >                 return false;
> > > >
> > > > +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > > > +
> > > > +       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> > > 
> > >I don't understand why you use the above condition instead of if
> > >(!cfs_rq->tg_load_avg_contrib). Can you elaborate ?
> > > 
> > >strictly speaking we want to keep the cfs_rq in the list if
> > >(cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg) and
> > >cfs_rq->avg.load_avg == 0 when we test this condition
> > 
> > 
> > I use this condition primarily based on the function update_tg_load_avg().
> > I want to absolutely avoid a situation where cfs_rq_is_decay() returns 
> > false but update_tg_load_avg() cannot update its value due to the delta 
> > check, which may cause the cfs_rq to remain on the list permanently. 
> > Honestly, I am not sure if this will happen, so I took this conservative 
> > approach.
> 
> Hmm...it doesn't seem we need worry about this situation.

yeah, I am worried about this situation, but I can't find any evidence 
that it exists.

> Because when cfs_rq->load_avg is 0, abs(delta) will be
> cfs_rq->tg_load_avg_contrib and the following condition:
> 
> 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> becomes:
> 	if (cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64)
> 
> which should always be true, right?


It actually becomes:
    if (cfs_rq->tg_load_avg_contrib > 0)
if cfs_rq->tg_load_avg_contrib == 0 , it will be false. As it is an unsigned
long, this condition is equivalent to :
    if (!cfs_rq->tg_load_avg_contrib)

Thanks,
Xupengbo

> Thanks,
> Aaron
> 
> > 
> > In fact, in the second if-condition of cfs_rq_is_decay(), the comment in 
> > the load_avg_is_decayed() function states:"_avg must be null when _sum is 
> > null because _avg = _sum / divider". Therefore, when we check this newly 
> > added condition, cfs_rq->avg.load_avg should already be 0, right?
> > 
> > After reading your comments, I carefully considered the differences 
> > between these two approaches. Here, my condition is similar
> > to cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg but weaker. In 
> > fact, when cfs_rq->avg.load_avg is already 0, 
> > abs(delta) > cfs_rq->tg_load_avg_contrib / 64 is equivalent to 
> > cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64,
> > Further reasoning leads to the condition cfs_rq->tg_load_avg_contrib > 0.
> > However if cfs_rq->avg.load_avg is not necessarily 0 at this point, then
> > the condition you propose is obviously more accurate, simpler than the
> > delta check, and requires fewer calculations.
> > 
> > I think our perspectives differ. From the perspective of 
> > update_tg_load_avg(), the semantics of this condition are as follows: if
> > there is no 1ms update limit, and update_tg_load_avg() can continue 
> > updating after checking the delta, then in cfs_rq_is_decayed() we should
> > return false to keep the cfs_rq in the list for subsequent updates. As 
> > mentioned in the first paragraph, this avoids that tricky situation. From
> > the perspective of cfs_rq_is_decayed(), the semantics of the condition you
> > proposed are that if cfs_rq->avg.load_avg is already 0, then it cannot be
> > removed from the list before all load_avg are updated to tg. That makes 
> > sense to me, but I still feel like there's a little bit of a risk. Am I 
> > being paranoid?
> > 
> > How do you view these two lines of thinking?
> > 
> > It's a pleasure to discuss this with you, 
> > xupengbo.
> > 
> > > > +               return false;
> > > > +
> > > >         return true;
> > > >  }
> > > >
> > > > --
> > > > 2.43.0
> > > >


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-06  7:33       ` Aaron Lu
  2025-08-06  7:58         ` xupengbo
@ 2025-08-06  8:38         ` xupengbo
  2025-08-06  9:22           ` Vincent Guittot
  2025-08-25  9:50           ` Aaron Lu
  1 sibling, 2 replies; 12+ messages in thread
From: xupengbo @ 2025-08-06  8:38 UTC (permalink / raw)
  To: ziqianlu
  Cc: bsegall, cgroups, dietmar.eggemann, juri.lelli, linux-kernel,
	mgorman, mingo, peterz, rostedt, vincent.guittot, vschneid,
	xupengbo

On Wed, Aug 06, 2025 at 02:31:58PM +0800, xupengbo wrote:
> > >On Tue, 5 Aug 2025 at 16:42, xupengbo <xupengbo@oppo.com> wrote:
> > > >
> > > > When a task is migrated out, there is a probability that the tg->load_avg
> > > > value will become abnormal. The reason is as follows.
> > > >
> > > > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > > > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > > > when a task migrates out.
> > > > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > > > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > > > function cfs_rq_is_decayed() does not check whether
> > > > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > > > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > > > updated to tg->load_avg.
> > > >
> > > > I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > > > which blocks the case (2.) mentioned above. I follow the condition in
> > > > update_tg_load_avg() instead of directly checking if
> > > > cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
> > > > condition consistent in both places, otherwise unexpected problems may
> > > > occur.
> > > >
> > > > Thanks for your comments,
> > > > Xu Pengbo
> > > >
> > > > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > > > Signed-off-by: xupengbo <xupengbo@oppo.com>
> > > > ---
> > > > Changes:
> > > > v1 -> v2:
> > > > - Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in
> > > > cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
> > > > - Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u
> > > >
> > > >  kernel/sched/fair.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index b173a059315c..a35083a2d006 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > >         if (child_cfs_rq_on_list(cfs_rq))
> > > >                 return false;
> > > >
> > > > +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > > > +
> > > > +       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> > > 
> > >I don't understand why you use the above condition instead of if
> > >(!cfs_rq->tg_load_avg_contrib). Can you elaborate ?

Sorry I was misled here, I think it should be if (cfs_rq->tg_load_avg_contrib ! = 0)

> > > 
> > >strictly speaking we want to keep the cfs_rq in the list if
> > >(cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg) and
> > >cfs_rq->avg.load_avg == 0 when we test this condition
> > 
> > 
> > I use this condition primarily based on the function update_tg_load_avg().
> > I want to absolutely avoid a situation where cfs_rq_is_decay() returns 
> > false but update_tg_load_avg() cannot update its value due to the delta 
> > check, which may cause the cfs_rq to remain on the list permanently. 
> > Honestly, I am not sure if this will happen, so I took this conservative 
> > approach.
> 
> Hmm...it doesn't seem we need worry about this situation.

yeah, I am worried about this situation, but I can't find any evidence 
that it exists.

> Because when cfs_rq->load_avg is 0, abs(delta) will be
> cfs_rq->tg_load_avg_contrib and the following condition:
> 
> 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> becomes:
> 	if (cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64)
> 
> which should always be true, right?


It actually becomes:
    if (cfs_rq->tg_load_avg_contrib > 0)
if cfs_rq->tg_load_avg_contrib == 0 , it will be false. As it is an unsigned
long, this condition is equivalent to :
    if (cfs_rq->tg_load_avg_contrib)

Sorry I just made a mistake. 
Thanks,
Xupengbo

> Thanks,
> Aaron
> 
> > 
> > In fact, in the second if-condition of cfs_rq_is_decay(), the comment in 
> > the load_avg_is_decayed() function states:"_avg must be null when _sum is 
> > null because _avg = _sum / divider". Therefore, when we check this newly 
> > added condition, cfs_rq->avg.load_avg should already be 0, right?
> > 
> > After reading your comments, I carefully considered the differences 
> > between these two approaches. Here, my condition is similar
> > to cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg but weaker. In 
> > fact, when cfs_rq->avg.load_avg is already 0, 
> > abs(delta) > cfs_rq->tg_load_avg_contrib / 64 is equivalent to 
> > cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64,
> > Further reasoning leads to the condition cfs_rq->tg_load_avg_contrib > 0.
> > However if cfs_rq->avg.load_avg is not necessarily 0 at this point, then
> > the condition you propose is obviously more accurate, simpler than the
> > delta check, and requires fewer calculations.
> > 
> > I think our perspectives differ. From the perspective of 
> > update_tg_load_avg(), the semantics of this condition are as follows: if
> > there is no 1ms update limit, and update_tg_load_avg() can continue 
> > updating after checking the delta, then in cfs_rq_is_decayed() we should
> > return false to keep the cfs_rq in the list for subsequent updates. As 
> > mentioned in the first paragraph, this avoids that tricky situation. From
> > the perspective of cfs_rq_is_decayed(), the semantics of the condition you
> > proposed are that if cfs_rq->avg.load_avg is already 0, then it cannot be
> > removed from the list before all load_avg are updated to tg. That makes 
> > sense to me, but I still feel like there's a little bit of a risk. Am I 
> > being paranoid?
> > 
> > How do you view these two lines of thinking?
> > 
> > It's a pleasure to discuss this with you, 
> > xupengbo.
> > 
> > > > +               return false;
> > > > +
> > > >         return true;
> > > >  }
> > > >
> > > > --
> > > > 2.43.0
> > > >


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-06  8:38         ` xupengbo
@ 2025-08-06  9:22           ` Vincent Guittot
  2025-08-25  9:50           ` Aaron Lu
  1 sibling, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2025-08-06  9:22 UTC (permalink / raw)
  To: xupengbo
  Cc: ziqianlu, bsegall, cgroups, dietmar.eggemann, juri.lelli,
	linux-kernel, mgorman, mingo, peterz, rostedt, vschneid

On Wed, 6 Aug 2025 at 10:38, xupengbo <xupengbo@oppo.com> wrote:
>
> On Wed, Aug 06, 2025 at 02:31:58PM +0800, xupengbo wrote:
> > > >On Tue, 5 Aug 2025 at 16:42, xupengbo <xupengbo@oppo.com> wrote:
> > > > >
> > > > > When a task is migrated out, there is a probability that the tg->load_avg
> > > > > value will become abnormal. The reason is as follows.
> > > > >
> > > > > 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> > > > > is a possibility that the reduced load_avg is not updated to tg->load_avg
> > > > > when a task migrates out.
> > > > > 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> > > > > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> > > > > function cfs_rq_is_decayed() does not check whether
> > > > > cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> > > > > __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> > > > > updated to tg->load_avg.
> > > > >
> > > > > I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> > > > > which blocks the case (2.) mentioned above. I follow the condition in
> > > > > update_tg_load_avg() instead of directly checking if
> > > > > cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
> > > > > condition consistent in both places, otherwise unexpected problems may
> > > > > occur.
> > > > >
> > > > > Thanks for your comments,
> > > > > Xu Pengbo
> > > > >
> > > > > Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> > > > > Signed-off-by: xupengbo <xupengbo@oppo.com>
> > > > > ---
> > > > > Changes:
> > > > > v1 -> v2:
> > > > > - Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in
> > > > > cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
> > > > > - Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u
> > > > >
> > > > >  kernel/sched/fair.c | 5 +++++
> > > > >  1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index b173a059315c..a35083a2d006 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > > >         if (child_cfs_rq_on_list(cfs_rq))
> > > > >                 return false;
> > > > >
> > > > > +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > > > > +
> > > > > +       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> > > >
> > > >I don't understand why you use the above condition instead of if
> > > >(!cfs_rq->tg_load_avg_contrib). Can you elaborate ?
>
> Sorry I was misled here, I think it should be if (cfs_rq->tg_load_avg_contrib ! = 0)

yes I made a mistake. It should be
if (cfs_rq->tg_load_avg_contrib ! = 0)
or
if (cfs_rq->tg_load_avg_contrib)

>
> > > >
> > > >strictly speaking we want to keep the cfs_rq in the list if
> > > >(cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg) and
> > > >cfs_rq->avg.load_avg == 0 when we test this condition
> > >
> > >
> > > I use this condition primarily based on the function update_tg_load_avg().
> > > I want to absolutely avoid a situation where cfs_rq_is_decay() returns
> > > false but update_tg_load_avg() cannot update its value due to the delta
> > > check, which may cause the cfs_rq to remain on the list permanently.
> > > Honestly, I am not sure if this will happen, so I took this conservative
> > > approach.
> >
> > Hmm...it doesn't seem we need worry about this situation.
>
> yeah, I am worried about this situation, but I can't find any evidence
> that it exists.
>
> > Because when cfs_rq->load_avg is 0, abs(delta) will be
> > cfs_rq->tg_load_avg_contrib and the following condition:
> >
> >       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
> > becomes:
> >       if (cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64)
> >
> > which should always be true, right?
>
>
> It actually becomes:
>     if (cfs_rq->tg_load_avg_contrib > 0)
> if cfs_rq->tg_load_avg_contrib == 0 , it will be false. As it is an unsigned
> long, this condition is equivalent to :
>     if (cfs_rq->tg_load_avg_contrib)
>
> Sorry I just made a mistake.
> Thanks,
> Xupengbo
>
> > Thanks,
> > Aaron
> >
> > >
> > > In fact, in the second if-condition of cfs_rq_is_decay(), the comment in
> > > the load_avg_is_decayed() function states:"_avg must be null when _sum is
> > > null because _avg = _sum / divider". Therefore, when we check this newly
> > > added condition, cfs_rq->avg.load_avg should already be 0, right?
> > >
> > > After reading your comments, I carefully considered the differences
> > > between these two approaches. Here, my condition is similar
> > > to cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg but weaker. In
> > > fact, when cfs_rq->avg.load_avg is already 0,
> > > abs(delta) > cfs_rq->tg_load_avg_contrib / 64 is equivalent to
> > > cfs_rq->tg_load_avg_contrib > cfs_rq->tg_load_avg_contrib / 64,
> > > Further reasoning leads to the condition cfs_rq->tg_load_avg_contrib > 0.
> > > However if cfs_rq->avg.load_avg is not necessarily 0 at this point, then
> > > the condition you propose is obviously more accurate, simpler than the
> > > delta check, and requires fewer calculations.
> > >
> > > I think our perspectives differ. From the perspective of
> > > update_tg_load_avg(), the semantics of this condition are as follows: if
> > > there is no 1ms update limit, and update_tg_load_avg() can continue
> > > updating after checking the delta, then in cfs_rq_is_decayed() we should
> > > return false to keep the cfs_rq in the list for subsequent updates. As
> > > mentioned in the first paragraph, this avoids that tricky situation. From
> > > the perspective of cfs_rq_is_decayed(), the semantics of the condition you
> > > proposed are that if cfs_rq->avg.load_avg is already 0, then it cannot be
> > > removed from the list before all load_avg are updated to tg. That makes
> > > sense to me, but I still feel like there's a little bit of a risk. Am I
> > > being paranoid?
> > >
> > > How do you view these two lines of thinking?
> > >
> > > It's a pleasure to discuss this with you,
> > > xupengbo.
> > >
> > > > > +               return false;
> > > > > +
> > > > >         return true;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.43.0
> > > > >
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-06  8:38         ` xupengbo
  2025-08-06  9:22           ` Vincent Guittot
@ 2025-08-25  9:50           ` Aaron Lu
  1 sibling, 0 replies; 12+ messages in thread
From: Aaron Lu @ 2025-08-25  9:50 UTC (permalink / raw)
  To: xupengbo
  Cc: bsegall, cgroups, dietmar.eggemann, juri.lelli, linux-kernel,
	mgorman, mingo, peterz, rostedt, vincent.guittot, vschneid

Hi xupengbo,

On Wed, Aug 06, 2025 at 04:38:10PM +0800, xupengbo wrote:
... ... 
> 
> It actually becomes:
>     if (cfs_rq->tg_load_avg_contrib > 0)
> if cfs_rq->tg_load_avg_contrib == 0 , it will be false. As it is an unsigned
> long, this condition is equivalent to :
>     if (cfs_rq->tg_load_avg_contrib)

I suppose we have reached a conclusion that the right fix is to add a
check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed()? Something
like below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af33d107d8034..3ebcb683063f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4056,6 +4056,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->tg_load_avg_contrib)
+		return false;
+
 	return true;
 }
 
If you also agree, can you send an updated patch to fix this problem?
Thanks.

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
@ 2025-08-05 14:41 xupengbo
  2025-08-05 16:10 ` Vincent Guittot
  0 siblings, 1 reply; 12+ messages in thread
From: xupengbo @ 2025-08-05 14:41 UTC (permalink / raw)
  To: ziqianlu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Aaron Lu, Mathieu Desnoyers,
	linux-kernel
  Cc: xupengbo, cgroups

When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows.

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
is a possibility that the reduced load_avg is not updated to tg->load_avg
when a task migrates out.
2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
function cfs_rq_is_decayed() does not check whether
cfs->tg_load_avg_contrib is null. Consequently, in some cases,
__update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
updated to tg->load_avg.

I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which blocks the case (2.) mentioned above. I follow the condition in
update_tg_load_avg() instead of directly checking if
cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
condition consistent in both places, otherwise unexpected problems may
occur.

Thanks for your comments,
Xu Pengbo

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
---
Changes:
v1 -> v2: 
- Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in 
cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
- Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u

 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..a35083a2d006 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;

+	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+
+	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)
+		return false;
+
 	return true;
 }

-- 
2.43.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-05 14:41 xupengbo
@ 2025-08-05 16:10 ` Vincent Guittot
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2025-08-05 16:10 UTC (permalink / raw)
  To: xupengbo
  Cc: ziqianlu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Mathieu Desnoyers, linux-kernel,
	cgroups

On Tue, 5 Aug 2025 at 16:42, xupengbo <xupengbo@oppo.com> wrote:
>
> When a task is migrated out, there is a probability that the tg->load_avg
> value will become abnormal. The reason is as follows.
>
> 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> is a possibility that the reduced load_avg is not updated to tg->load_avg
> when a task migrates out.
> 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> function cfs_rq_is_decayed() does not check whether
> cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> updated to tg->load_avg.
>
> I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
> which blocks the case (2.) mentioned above. I follow the condition in
> update_tg_load_avg() instead of directly checking if
> cfs_rq->tg_load_avg_contrib is null. I think it's necessary to keep the
> condition consistent in both places, otherwise unexpected problems may
> occur.
>
> Thanks for your comments,
> Xu Pengbo
>
> Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
> Signed-off-by: xupengbo <xupengbo@oppo.com>
> ---
> Changes:
> v1 -> v2:
> - Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in
> cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
> - Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/T/#u
>
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..a35083a2d006 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4062,6 +4062,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         if (child_cfs_rq_on_list(cfs_rq))
>                 return false;
>
> +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +
> +       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64)

I don't understand why you use the above condition instead of if
(!cfs_rq->tg_load_avg_contrib). Can you elaborate ?

strictly speaking we want to keep the cfs_rq in the list if
(cfs_rq->tg_load_avg_contrib != cfs_rq->avg.load_avg) and
cfs_rq->avg.load_avg == 0 when we test this condition


> +               return false;
> +
>         return true;
>  }
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
@ 2025-08-26  7:57 xupengbo
  2025-08-26  8:17 ` [PATCH v2] " xupengbo
  0 siblings, 1 reply; 12+ messages in thread
From: xupengbo @ 2025-08-26  7:57 UTC (permalink / raw)
  To: ziqianlu, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, David Vernet, Aaron Lu,
	linux-kernel
  Cc: xupengbo, cgroups, xupengbo1029

When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows.

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
is a possibility that the reduced load_avg is not updated to tg->load_avg
when a task migrates out.
2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
function cfs_rq_is_decayed() does not check whether
cfs->tg_load_avg_contrib is null. Consequently, in some cases,
__update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
updated to tg->load_avg.

I added a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which blocks the case (2.) mentioned above.
After some preliminary discussion and analysis, I think it is feasible to
directly check if cfs_rq->tg_load_avg_contrib is 0 in cfs_rq_is_decay().
So patch v3 was submitted.

Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
---
Changes:
v1 -> v2: 
- Another option to fix the bug. Check cfs_rq->tg_load_avg_contrib in 
cfs_rq_is_decayed() to avoid early removal from the leaf_cfs_rq_list.
- Link to v1 : https://lore.kernel.org/cgroups/20250804130326.57523-1-xupengbo@oppo.com/
v2 -> v3:
- Check if cfs_rq->tg_load_avg_contrib is 0 derectly.
- Link to v2 : https://lore.kernel.org/cgroups/20250805144121.14871-1-xupengbo@oppo.com/

Please send emails to a different email address <xupengbo1029@163.com>
after September 3, 2025, after that date <xupengbo@oppo.com> will expire
for personal reasons.

Thanks,
Xu Pengbo

 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..df348cb6a5f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;

+	if (cfs_rq->tg_laod_avg_contrib)
+		return false;
+
 	return true;
 }

base-commit: fab1beda7597fac1cecc01707d55eadb6bbe773c
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
  2025-08-26  7:57 [PATCH v3] " xupengbo
@ 2025-08-26  8:17 ` xupengbo
  0 siblings, 0 replies; 12+ messages in thread
From: xupengbo @ 2025-08-26  8:17 UTC (permalink / raw)
  To: xupengbo
  Cc: aaron.lu, bsegall, cgroups, dietmar.eggemann, juri.lelli,
	linux-kernel, mgorman, mingo, peterz, rostedt, vincent.guittot,
	void, vschneid, xupengbo1029, ziqianlu

>Hi xupengbo,
>
>On Wed, Aug 06, 2025 at 04:38:10PM +0800, xupengbo wrote:
>... ... 
>> 
>> It actually becomes:
>>     if (cfs_rq->tg_load_avg_contrib > 0)
>> if cfs_rq->tg_load_avg_contrib == 0 , it will be false. As it is an unsigned
>> long, this condition is equivalent to :
>>     if (cfs_rq->tg_load_avg_contrib)
>
>I suppose we have reached a conclusion that the right fix is to add a
>check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed()? Something
>like below:
>
>diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>index af33d107d8034..3ebcb683063f0 100644
>--- a/kernel/sched/fair.c
>+++ b/kernel/sched/fair.c
>@@ -4056,6 +4056,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> 	if (child_cfs_rq_on_list(cfs_rq))
> 		return false;
> 
>+	if (cfs_rq->tg_load_avg_contrib)
>+		return false;
>+
> 	return true;
> }
> 
>If you also agree, can you send an updated patch to fix this problem?
>Thanks.

I already sent an updated patch v3.
Link: https://lore.kernel.org/cgroups/20250826075743.19106-1-xupengbo@oppo.com/

Thanks,
xupengbo

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-26  8:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-04 13:03 [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out xupengbo
2025-08-05  9:08 ` Aaron Lu
2025-08-05  9:17   ` Vincent Guittot
2025-08-06  6:31     ` [PATCH v2] " xupengbo
2025-08-06  7:33       ` Aaron Lu
2025-08-06  7:58         ` xupengbo
2025-08-06  8:38         ` xupengbo
2025-08-06  9:22           ` Vincent Guittot
2025-08-25  9:50           ` Aaron Lu
  -- strict thread matches above, loose matches on Subject: below --
2025-08-05 14:41 xupengbo
2025-08-05 16:10 ` Vincent Guittot
2025-08-26  7:57 [PATCH v3] " xupengbo
2025-08-26  8:17 ` [PATCH v2] " xupengbo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).