Re: [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Aaron Lu <ziqianlu@bytedance.com>
To: xupengbo <xupengbo@oppo.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.
Date: Tue, 5 Aug 2025 17:08:40 +0800	[thread overview]
Message-ID: <20250805090517.GA687566@bytedance> (raw)
In-Reply-To: <20250804130326.57523-1-xupengbo@oppo.com>

On Mon, Aug 04, 2025 at 09:03:26PM +0800, xupengbo wrote:
> We added a function named update_tg_load_avg_immediately() that mimics
> update_tg_load_avg(). In this function we remove the update interval
> restriction from update_tg_load_avg() in order to update tg->load
> immediately when the function is called. This function is only called in
> update_load_avg(). In update_load_avg(), we should call
> update_tg_load_avg_immediately() if flag & DO_DETACH == true and the task
> is the last task in cfs_rq, otherwise we call update_tg_load_avg(). The
> reason is as follows.
> 
> 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> is a possibility that the reduced load_avg is not updated to tg->load_avg
> when a task migrates out.
> 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> function cfs_rq_is_decayed() does not check whether
> cfs->tg_load_avg_contrib is null. Consequently, in some cases,
> __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> updated to tg->load_avg.
> 
> When these two events occur within the 1ms window (defined by
> NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
> the CPU due to the cpumask constraints, the corresponding portion of
> load_avg will never be subtracted from tg->load_avg. This results in an
> inflated tg->load_avg and reduced scheduling entity (se) weight for the
> task group. If the migrating task had a large weight, the task group's
> share may deviate significantly from its expected value. This issue is
> easily reproducible in task migration scenarios.
>

Agree this is a problem.

> Initially, I discovered this bug on Android 16 (running kernel v6.12), and
> was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
> kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
> version that defines both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
> I wrote a short C program which just does 3 things:
>   1. call sched_setaffinity() to bound itself to cpu 1.
>   2. call sched_setaffinity() to bound itself to cpu 2.
>   3. endless loop.
> 
> Here is the source code.
> ```
> \#define _GNU_SOURCE
> \#include <sched.h>
> \#include <unistd.h>
> int main() {
>   cpu_set_t cpuset;
>   CPU_ZERO(&cpuset);
>   CPU_SET(1, &cpuset);
>   pid_t pid = gettid();
> 
>   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
>     return 1;
>   }
> 
>   CPU_ZERO(&cpuset);
>   CPU_SET(2, &cpuset);
> 
>   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
>     return 1;
>   }
>   while (1)
>     ;
>   return 0;
> }
> ```
> 
> Then I made a test script to add tasks into groups.
> (Forgive me for just copying and pasting those lines but not using
> a for-loop)
> 
> ```
> \#!/usr/bin/bash
> 
> shares=100
> pkill 'bug_test'
> sleep 2
> rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> 
> echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares
> 
> nohup ./bug_test &
> proc1=$!
> echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
> nohup ./bug_test &
> proc2=$!
> echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
> nohup ./bug_test &
> proc3=$!
> echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
> nohup ./bug_test &
> proc4=$!
> echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs
> 
> ```
>
> After several repetitions of the script, we can find that some
> processes have a smaller share of the cpu, while others have twice
> that. This state is stable until the end of the process.
 
I can reproduce it using your test code.
 
> When a task is migrated from cfs_rq, dequeue_load_avg() will subtract its
> avg.load_sum and avg.load_avg. Sometimes its load_sum is reduced to null
> sometimes not. If load_sum is reduced to null, then this cfs_rq will be
> removed from the leaf_cfs_rq_list soon. So __update_blocked_fair() can not
> update it anymore.
> 
> Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
> In this patch, Odin proposed adding a check in cfs_rq_is_decayed() to
> determine whether cfs_rq->tg_load_avg_contrib is null. However, it appears
> that this patch was not merged. In fact, if there were a check in
> cfs_rq_is_decayed() similar to the one in update_tg_load_avg() regarding
> the size of the _delta_ value (see update_tg_load_avg()), this issue
> could also be effectively resolved. This solution would block (2.),
> because if delta is too large, cfs_rq_is_decayed() returns false, and the
> cfs_rq remains in leaf_cfs_rq_list, ultimately causing
> __update_blocked_fair() to update it outside the 1ms limit. The only
> consideration is whether to add a check for cfs_rq->tg_load_avg_contrib in
> cfs_rq_is_decayed(), which may increase coupling.
>

Performance wise, doing an immediate update to tg->load_avg on detach
path should be OK because last time when I did those tests, the migration
path that leads to updates to tg->load_avg is mostly from task wakeup path.
I also did some quick tests using hackbench and netperf on an Intel EMR
and didn't notice anything problematic regarding performance with your
change here.

With this said, I think adding cfs_rq->tg_load_avg_contrib check in
cfs_rq_is_decayed() makes more sense to me, because if a cfs_rq still has
contrib to its tg but its load_avg is 0, it should stay in that list and
have its contrib synced with its load_avg to zero when that 1ms window
has passed by __update_blocked_fair() path.

> Signed-off-by: xupengbo <xupengbo@oppo.com>

Maybe add a fix tag for commit 1528c661c24b("sched/fair: Ratelimit
update to tg->load_avg")?

Thanks,
Aaron

> ---
>  kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..97feba367be9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	return true;
>  }
>  
> +/* only called in update_load_avg() */
> +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
> +{
> +	long delta;
> +	u64 now;
> +
> +	/*
> +	 * No need to update load_avg for root_task_group as it is not used.
> +	 */
> +	if (cfs_rq->tg == &root_task_group)
> +		return;
> +
> +	/* rq has been offline and doesn't contribute to the share anymore: */
> +	if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> +		return;
> +
> +	/*
> +	 * Under normal circumstances, for migration heavy workloads, access
> +	 * to tg->load_avg can be unbound. Limit the update rate to at most
> +	 * once per ms.
> +	 * However when the last task is migrating from this cpu, we must
> +	 * update tg->load_avg immediately. Otherwise, if this cfs_rq becomes
> +	 * idle forever due to cpumask and is removed from leaf_cfs_rq_list,
> +	 * the huge mismatch between cfs_rq->avg.load_avg(which may be zero)
> +	 * and cfs_rq->tg_load_avg_contrib(stalled load_avg of last task)
> +	 * can never be corrected, which will lead to a significant value
> +	 * error in se.weight for this group.
> +	 * We retain this value filter below because it is not the main cause
> +	 * of this bug, so we are being conservative.
> +	 */
> +	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> +		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +		cfs_rq->last_update_tg_load_avg = now;
> +	}
> +}
> +
>  /**
>   * update_tg_load_avg - update the tg's load avg
>   * @cfs_rq: the cfs_rq whose avg changed
> @@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
>  
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>  
> +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
> +
>  static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
>  
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
> @@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  		/*
>  		 * DO_DETACH means we're here from dequeue_entity()
>  		 * and we are migrating task out of the CPU.
> +		 *
> +		 * At this point, we have not subtracted nr_queued.
> +		 * If cfs_rq->nr_queued ==1, the last cfs task is being
> +		 * migrated from this cfs_rq.
>  		 */
>  		detach_entity_load_avg(cfs_rq, se);
> -		update_tg_load_avg(cfs_rq);
> +		if (cfs_rq->nr_queued == 1)
> +			update_tg_load_avg_immediately(cfs_rq);
> +		else
> +			update_tg_load_avg(cfs_rq);
>  	} else if (decayed) {
>  		cfs_rq_util_change(cfs_rq, 0);
>  
> -- 
> 2.43.0
>

next prev parent reply	other threads:[~2025-08-05  9:08 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-04 13:03 [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out xupengbo
2025-08-05  9:08 ` Aaron Lu [this message]
2025-08-05  9:17   ` Vincent Guittot
2025-08-06  6:31     ` [PATCH v2] " xupengbo
2025-08-06  7:33       ` Aaron Lu
2025-08-06  7:58         ` xupengbo
2025-08-06  8:38         ` xupengbo
2025-08-06  9:22           ` Vincent Guittot
2025-08-25  9:50           ` Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250805090517.GA687566@bytedance \
    --to=ziqianlu@bytedance.com \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xupengbo@oppo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.