From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from casper.infradead.org (casper.infradead.org [90.155.50.34])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6EDA63C3BFE
	for <linux-kernel@vger.kernel.org>; Thu, 18 Jun 2026 09:18:17 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781774301; cv=none; b=fnnLfpgjIjSdps+4tLYmJRV08uZ5bfFT7WjpoR+EZWwmw+dCE2KVwicdqnKS67cmDpy7ir7Su+IcmQ+FmPrrvIIL4QCciZ/+s+oBHtv8z3cjJpKmay3n3RzGNbE+t2mnSqdx1N79fbDg4hluxOOUO9xKgdASbxTxR8wmh6SpQAo=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781774301; c=relaxed/simple;
	bh=2wSIVu6FKlzH65Ov6bSLkvM5TnHHjx3yXRcdp/tdmzw=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=jKC7IfIUdRus2X22EP7bqBpCyzSbSgt7UsmPsempzhE6wDEw5qFu8AVgooWSQimpVdQvwbfOef6LKv8MDTs6Pkh61JPBdKNUK+/IWoJuyxMJK41OCKKXpe0J6nDnnCf/HlG4fqoTrX+QM1bFNADxwBxaWa0Q/Ri8Fa8KYy0ehX0=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org; spf=pass smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=nMUj/xWH; arc=none smtp.client-ip=90.155.50.34
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=infradead.org
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=infradead.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="nMUj/xWH"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=G9xjjTSIj9H+0PVMVEWsKqidvbWb6RQs5oWtTQDglR0=; b=nMUj/xWH89ZRQdWMf8toCIZu4o
	eBPFw5n8rbMbnEdurxXv+P4WY77gJy0XnnpgM07jnjCrChv26yI9ySm9E8xqRUvQDWqb+JVd62fX+
	0wk+os0V0F8d/ow9p4LHT7XAWHHpQSPCKQ5WLAhuyGXStew4DtgwBnj7aT3JGk2fi7PnFdRu6tCf1
	wBuuAGrmLrxpUiHmn0dxk4+lH/zN+iv1rFq3nnVwgl8fjBCaqdePbjTzWKIxdhPvui/7VaLLbfyc7
	ML1HRt8KYDEuGzpbjh7iGIYKLpRUxTMWEGFETmWHZdNGBGbHG0RQjnAf28VpaR4F705tyIJhXAVwN
	wTccLxeQ==;
Received: from 77-249-17-252.cable.dynamic.v4.ziggo.nl ([77.249.17.252] helo=noisy.programming.kicks-ass.net)
	by casper.infradead.org with esmtpsa (Exim 4.99.1 #2 (Red Hat Linux))
	id 1wa8sv-0000000ECJR-24Ps;
	Thu, 18 Jun 2026 09:18:06 +0000
Received: by noisy.programming.kicks-ass.net (Postfix, from userid 1000)
	id 9FD6C300578; Thu, 18 Jun 2026 11:18:04 +0200 (CEST)
Date: Thu, 18 Jun 2026 11:18:04 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Xin Zhao <jackzxcui1989@163.com>
Cc: vschneid@redhat.com, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	kprateek.nayak@amd.com, pauld@redhat.com, aiqun.yu@oss.qualcomm.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 1/2] sched/fair: Don't trigger active lb if
 src_rq->curr is not on_rq
Message-ID: <20260618091804.GM42921@noisy.programming.kicks-ass.net>
References: <20260617072151.1173416-1-jackzxcui1989@163.com>
 <20260617072151.1173416-2-jackzxcui1989@163.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260617072151.1173416-2-jackzxcui1989@163.com>

On Wed, Jun 17, 2026 at 03:21:50PM +0800, Xin Zhao wrote:
> Active balancing needs the help by migration threads which will interrupt
> task on src_rq. It has a certain impact on overall performance. Active
> balancing often fails, there is a check to determine whether the current
> task(say it 'curr') on src_rq can run on dst_rq. We have observed that
> even that, if curr is a CFS task and on_rq is 0, the failure rate of
> active balancing is very high. Below are the test data from a certain
> fillback task scenario executed on a platform with 18 CPUs over 300
> seconds:

<snip table for brevity>

> In __schedule(), before setting curr to next, during the execution of
> pick_next_task(), sched_balance_rq() is called. It will unlock and then
> re-lock the rq, creating "holes" during which other CPUs may see zero
> rq->curr->on_rq. try_to_block_task() sets curr->on_rq to 0, and during the
> rq lock "hole" in pick_next_task(), rq->curr has not yet been assigned to
> next, resulting in curr->on_rq being seen as 0.
> 
> We do not need to perform active balancing when src_rq->curr is CFS task
> but on_rq is 0, as other CFS tasks have been probably checked just before.
> For cases where src_rq->curr is a non-CFS task, we retain the affinity
> check for dst_rq to trigger active balancing because such task is likely
> to wake-up or woken-by src_rq CFS task which has similar affinity
> characteristics to migrate. Also, after executing detach_tasks(), rq lock
> is released. Tasks on the rq awakened during detach_tasks() may preempt
> the previous CFS task. Based on my test(though not shown above), success
> rate of active balancing under the condition of !fair && on_rq is 98.4%.
> This scenario does not require the use of stop work, but need to add
> another path to detach attach task(s). It seems not necessary enough to
> add it, Valentin and Vincent have already discussed about it, see [1].
> 
> Additionally, sched_class field is a bit far from on_cpu in task_struct.
> The previous traversal of cfs_tasks checks on_cpu in can_migrate_task(),
> so the additional check for on_rq will not incur much cpu cycle loss, due
> to cache locality.
> 
> Two reasons why not check sched_class and on_rq of busiest->curr with the
> cpumask_test_cpu() check:
> 1. Let the PATCH not introduce new cases that skip logic for resetting
> balance_interval to min_interval.
> 2. The check of whether busiest cpu has been just triggered active balance
> filters a bit more cases than the check of sched_class and on_rq.
> 
> [1]: https://lore.kernel.org/lkml/20190815145107.5318-5-valentin.schneider@arm.com/
> 
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>

Perhaps because it is early and I've not had enough wake-up juice, or
perhaps because it is too damn warm already and my brain is
pre-emptively shutting down already, I found it very hard to read your
Changelog.

I asked one of these fancy AI things to rephrase things, and then did a
manual edit on it and ended up with the below. Notably, the changelog
talks about a sched_class check, while the actual patch has none of
that.

Does this work for you?

---
Subject: sched/fair: Don't trigger active lb if src_rq->curr is not on_rq
From: Xin Zhao <jackzxcui1989@163.com>
Date: Wed, 17 Jun 2026 15:21:50 +0800

From: Xin Zhao <jackzxcui1989@163.com>

Active load balancing relies on migration threads, which temporarily preempt
tasks on the source runqueue (src_rq). This preemption can negatively impact
overall system performance. The active balancing logic includes a check to
verify whether the current task (curr) on src_rq can actually run on the
destination runqueue (dst_rq). We have observed that when curr is a CFS task
and its on_rq flag is 0, the active balancing failure rate is exceptionally
high. The following table summarizes test data collected over 300 seconds on an
18-CPU platform under a specific fillback task scenario:

  fair: busiest->curr->sched_class == &fair_sched_class
  on_rq: busiest->curr->on_rq
  total: active balance count triggered of correspondent type
  fail: fail to migrate one task in active_load_balance_cpu_stop()

                   fair && !on_rq   !fair && !on_rq
         domain       total    fail   total    fail
  cpu0   0x00003          0       0       0       0
  cpu0   0x3ffff         33      33       1       1
  cpu1   0x00003          0       0       0       0
  cpu1   0x3ffff         42      42       0       0
  cpu2   0x0003c          4       4       0       0
  cpu2   0x3ffff         12      12       0       0
  cpu3   0x0003c          3       3       0       0
  cpu3   0x3ffff          8       7       0       0
  cpu4   0x0003c          2       2       0       0
  cpu4   0x3ffff          5       4       0       0
  cpu5   0x0003c          4       4       0       0
  cpu5   0x3ffff          8       8       0       0
  cpu6   0x003c0         60      60       0       0
  cpu6   0x3ffff         28      27       0       0
  cpu7   0x003c0        194     184       0       0
  cpu7   0x3ffff         35      35       1       1
  cpu8   0x003c0        240     228       0       0
  cpu8   0x3ffff         28      28       0       0
  cpu9   0x003c0          0       0       0       0
  cpu9   0x3ffff         10      10       0       0
  cpu10  0x03c00         52      50       0       0
  cpu10  0x3ffff          0       0       0       0
  cpu11  0x03c00         70      68       0       0
  cpu11  0x3ffff          1       1       0       0
  cpu12  0x03c00         73      72       0       0
  cpu12  0x3ffff          0       0       0       0
  cpu13  0x03c00         79      76       0       0
  cpu13  0x3ffff          0       0       0       0
  cpu14  0x3c000          0       0       0       0
  cpu14  0x3ffff         57      55       1       0
  cpu15  0x3c000         53      52       1       0
  cpu15  0x3ffff         30      29       0       0
  cpu16  0x3c000        344     341      10       6
  cpu16  0x3ffff        103     100       2       1
  cpu17  0x3c000        183     179       2       2
  cpu17  0x3ffff         78      77       0       0
  sum                  1839    1791      18      11

In __schedule(), before curr is updated to next, pick_next_task() invokes
sched_balance_rq(). This function temporarily unlocks and relocks the runqueue,
creating a window where other CPUs may observe rq->curr->on_rq as 0.

We can safely skip active balancing when src_rq->curr->on_rq == 0, as other
eligible tasks have likely already been evaluated. 

We retain the affinity check on dst_rq to trigger active balancing, since such
tasks are often woken by (or wake up) tasks on src_rq that share similar
affinity constraints. Furthermore, detach_tasks() releases the runqueue lock;
any tasks awakened during this window may preempt the previous CFS task. My
testing (data not shown) indicates that active balancing succeeds in 98.4% of
cases where !fair && on_rq.

This scenario does not require a stop-work callback, but would necessitate an
additional detach/attach path. As Valentin and Vincent have already discussed,
this addition does not appear justified at this time (see [1]).

Since can_migrate_task() already checks on_cpu during the cfs_tasks traversal,
adding an on_rq check will have negligible performance overhead due to cache
locality.

There are two reasons for not combining the on_rq check with the
cpumask_test_cpu() check:

 - Avoiding new scenarios that would skip the logic for resetting
   balance_interval to min_interval.

 - The existing check for whether the busiest CPU recently triggered active
   load balancing already filters more cases than the on_rq check.

[1]: https://lore.kernel.org/lkml/20190815145107.5318-5-valentin.schneider@arm.com/

Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20260617072151.1173416-2-jackzxcui1989@163.com
---
 kernel/sched/fair.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13482,12 +13482,21 @@ static int sched_balance_rq(int this_cpu
 			 * ->active_balance_work.  Once set, it's cleared
 			 * only after active load balance is finished.
 			 */
-			if (!busiest->active_balance) {
-				busiest->active_balance = 1;
-				busiest->push_cpu = this_cpu;
-				active_balance = 1;
-			}
+			if (busiest->active_balance)
+				goto no_active_balance;
 
+			/*
+			 * @busiest dropped its rq_lock in the middle of
+			 * scheduling out its ->curr task (->on_rq := 0), no
+			 * need to forcefully punt it away with active balance.
+			 */
+			if (!busiest->curr->on_rq)
+				goto no_active_balance;
+
+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+no_active_balance:
 			preempt_disable();
 			raw_spin_rq_unlock_irqrestore(busiest, flags);
 			if (active_balance) {