From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy, Vincent Guittot
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [Patch v4 00/22] Cache aware scheduling
Date: Wed, 1 Apr 2026 14:52:12 -0700

This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain.
By improving cache locality, the scheduler can reduce cache bouncing
and cache misses, ultimately improving data access efficiency.

The design builds on the initial prototype from Peter [1]. This
initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.

Most of the feedback received on v3 has been addressed. Some aspects
could be enhanced later, after the basic cache-aware portion has
landed:

There were discussions around grouping tasks using mechanisms other
than process membership. While we agree that more flexible grouping is
desirable, this series intentionally focuses on establishing basic
process-based grouping first, with alternative grouping mechanisms to
be explored in a follow-on series.

There was also discussion in v3 about performing cache-aware
scheduling in the task wakeup path. According to previous test
results, performing task aggregation in the wakeup path introduced
task migration bouncing, primarily because the wakeup path does not
have up-to-date LLC load information. That led to over-aggregation
that then had to be corrected by load balancing. The load balancing
path was therefore chosen as the conservative place to perform task
aggregation; the task wakeup path will be investigated as a future
enhancement.

Furthermore, there were also requests to make cache-aware scheduling
benefit systems with small LLCs. Peter suggested using an LLC mask
instead of a single LLC value for preferences [2]. This could also be
implemented as a future enhancement.

The cache-aware load balancing logic remains largely unchanged. The
significant changes in v4 are:

1. LLC ID management: the calculation of the LLC ID switches to
   bitmap-based allocation rather than maintaining a static value.

2. A new patch [2/22] limits the CPU scan span to the preferred NUMA
   node when NUMA balancing is enabled.

3. Tweaks to what counts as a load balance failure: keeping a load
   imbalance at low load, or declining to pull a task away from its
   preferred LLC, is no longer treated as a balance failure.

Other changes are described in each patch.

Test results:

The patch series was applied and tested on v7.0-rc3. The git tree can
be found here:
https://github.com/timcchen1298/linux/tree/cache_aware_v4

The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32
CPUs per node. Each node has 2 CCXs, and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]

Sapphire Rapids:
hackbench shows significant improvement when the number of distinct
active threads is below the capacity of an LLC. schbench shows limited
wakeup latency improvement. ChaCha20-xiangshan (a RISC-V simulator
workload) shows good throughput improvement. No obvious difference was
observed in netperf/stream/stress-ng in Hmean.

Genoa:
Significant improvement is observed in hackbench when the number of
active threads is lower than the number of CPUs within 1 LLC. On v2,
Aaron reported improvements in hackbench/redis when the system is
underloaded. ChaCha20-xiangshan shows a large throughput improvement.
Phoronix tested v1 and reported good improvements in 30+ cases [3]. No
obvious difference was observed in netperf/stream/stress-ng in Hmean.

Detail:

To conserve space, data without much difference from baseline is not
presented.
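For reference, the hackbench runs in the tables below vary the fd
count per group. The invocation can be reconstructed roughly as
follows; this is a sketch using the rt-tests hackbench flags, not a
command line quoted from the series, and it assumes the usual
hackbench behavior of spawning one sender and one receiver per fd in
each group:

```shell
# Hypothetical reconstruction of a "threads-pipe-8, 2-groups" run;
# flag names follow the rt-tests hackbench, not a command quoted above.
groups=2
fds=8
# Each group spawns $fds senders and $fds receivers.
tasks=$((2 * groups * fds))
echo "hackbench -g $groups -f $fds --threads --pipe"
echo "total messaging threads: $tasks"   # 2 * 2 * 8 = 32
```

With 1 group and fds=8 that is 16 threads, which fits within a 16-CPU
Genoa CCX, whereas the default fds=20 would not.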
Sapphire Rapids:

[hackbench pipe]
================
case                    load          baseline(std%)  compare%( std%)
threads-pipe-10         1-groups      1.00 (  1.22)   +26.09 (  1.10)
threads-pipe-10         2-groups      1.00 (  4.90)   +22.88 (  0.18)
threads-pipe-10         4-groups      1.00 (  2.07)    +9.00 (  3.49)
threads-pipe-10         8-groups      1.00 (  8.13)    +3.45 (  3.62)
threads-pipe-16         1-groups      1.00 (  2.11)   +26.30 (  0.08)
threads-pipe-16         2-groups      1.00 ( 15.13)    -1.77 ( 11.89)
threads-pipe-16         4-groups      1.00 (  4.37)    +0.58 (  7.99)
threads-pipe-16         8-groups      1.00 (  2.88)    +2.71 (  3.50)
threads-pipe-2          1-groups      1.00 (  9.40)   +22.07 (  0.71)
threads-pipe-2          2-groups      1.00 (  9.99)   +18.01 (  0.95)
threads-pipe-2          4-groups      1.00 (  3.98)   +24.66 (  0.96)
threads-pipe-2          8-groups      1.00 (  7.00)   +21.83 (  0.23)
threads-pipe-20         1-groups      1.00 (  1.03)   +28.84 (  0.21)
threads-pipe-20         2-groups      1.00 (  4.42)   +31.90 (  3.15)
threads-pipe-20         4-groups      1.00 (  9.97)    +4.56 (  1.69)
threads-pipe-20         8-groups      1.00 (  1.87)    +1.25 (  0.74)
threads-pipe-4          1-groups      1.00 (  4.48)   +25.67 (  0.78)
threads-pipe-4          2-groups      1.00 (  9.14)    +4.91 (  2.08)
threads-pipe-4          4-groups      1.00 (  7.68)   +19.36 (  1.53)
threads-pipe-4          8-groups      1.00 ( 10.79)    +7.20 ( 12.20)
threads-pipe-8          1-groups      1.00 (  4.69)   +21.93 (  0.03)
threads-pipe-8          2-groups      1.00 (  1.16)   +25.29 (  0.65)
threads-pipe-8          4-groups      1.00 (  2.23)    -1.27 (  3.62)
threads-pipe-8          8-groups      1.00 (  4.65)    -3.08 (  2.75)

Note: the default number of fds in hackbench was changed from 20 to
various values to ensure that the threads fit within a single LLC,
especially on AMD systems. Take "threads-pipe-8, 2-groups" for
example: the number of fds is 8, and 2 groups are created.

[schbench]
The 99th percentile wakeup latency shows some improvement when the
system is underloaded, while the difference fades as system
utilization increases.
99th Wakeup Latencies   Base (mean  std)    Compare (mean  std)   Change
--------------------------------------------------------------------------------
thread=2                 9.00 (0.00)          9.00 (1.73)           0.00%
thread=4                 7.33 (0.58)          6.33 (0.58)         +13.64%
thread=8                 9.00 (0.00)          7.67 (1.15)         +14.78%
thread=16                8.67 (0.58)          8.67 (1.53)          0.00%
thread=32                9.00 (0.00)          7.00 (0.00)         +22.22%
thread=64                9.33 (0.58)          9.67 (0.58)          -3.64%
thread=128              12.00 (0.00)         12.00 (0.00)          0.00%

[chacha20]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms
Time reduced by 24%

Genoa:

[hackbench pipe]
================
The default number of fds is 20, which exceeds the number of CPUs in
an LLC, so the fd count was adjusted to 2, 4, 6, 8 and 20
respectively. Excluding the results with large run-to-run variance, a
10% ~ 50% improvement is observed when the system is underloaded:

case                    load          baseline(std%)  compare%( std%)
threads-pipe-2          1-groups      1.00 (  2.89)   +47.33 (  1.20)
threads-pipe-2          2-groups      1.00 (  3.88)   +39.82 (  0.61)
threads-pipe-2          4-groups      1.00 (  8.76)    +5.57 ( 13.10)
threads-pipe-20         1-groups      1.00 (  4.61)   +11.72 (  1.06)
threads-pipe-20         2-groups      1.00 (  6.18)   +14.55 (  1.47)
threads-pipe-20         4-groups      1.00 (  2.99)   +10.16 (  4.49)
threads-pipe-4          1-groups      1.00 (  4.23)   +43.70 (  2.14)
threads-pipe-4          2-groups      1.00 (  3.68)    +8.45 (  4.04)
threads-pipe-4          4-groups      1.00 ( 17.72)    +2.42 (  1.14)
threads-pipe-6          1-groups      1.00 (  3.10)    +7.74 (  3.83)
threads-pipe-6          2-groups      1.00 (  3.42)   +14.26 (  4.53)
threads-pipe-6          4-groups      1.00 ( 10.34)   +10.94 (  7.12)
threads-pipe-8          1-groups      1.00 (  4.21)    +9.06 (  4.43)
threads-pipe-8          2-groups      1.00 (  1.88)    +3.74 (  0.58)
threads-pipe-8          4-groups      1.00 (  2.78)   +23.96 (  1.18)

[chacha20]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms
Time reduced by 48%

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://lore.kernel.org/all/20260219165221.GM1395266@noisy.programming.kicks-ass.net/
[3] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin

Change history:

**v4 Changes:**
1. Use a bitmap-based dynamic allocation mechanism for LLC IDs.
2. Introduce a new patch [2/22] to limit the CPU scan depth to the
   preferred NUMA node.
3. Keeping a load imbalance at low load, or not pulling a task from
   its preferred LLC, is no longer counted as a balance failure.
4. Other changes from v3 are detailed in each patch's change log.

**v3 Changes:**
v3 link: https://lore.kernel.org/all/cover.1770760558.git.tim.c.chen@linux.intel.com/
1. Cache-aware scheduling is skipped after repeated load balance
   failures (up to cache_nice_tries). This avoids repeatedly
   attempting cache-aware migrations when no movable tasks prefer the
   destination LLC.
2. The busiest runqueue is no longer sorted to select tasks that
   prefer the destination LLC. This sorting was costly, and equivalent
   behavior can be achieved by skipping tasks that do not prefer the
   destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept
   in the lowest-level sched domain per CPU. This simplifies handling
   of LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.

**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by prioritizing NUMA
   balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
   size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics.
4. Add clarifying comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes addressing feedback from review of the v1 patch set
   (see individual patch change logs).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/

Chen Yu (10):
  sched/cache: Limit the scan number of CPUs when calculating task occupancy
  sched/cache: Record per LLC utilization to guide cache aware scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration policy
  sched/cache: Disable cache aware scheduling for processes with high thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
  sched/cache: Allow the user space to turn on and off cache aware scheduling
  sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (11):
  sched/cache: Make LLC id continuous
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per CPU's tasks LLC preference counter
  sched/cache: Calculate the percpu sd task LLC preference
  sched/cache: Count tasks prefering destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   31 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   43 ++
 include/linux/sched.h          |   32 +
 include/linux/sched/topology.h |   17 +
 include/trace/events/sched.h   |  140 ++++
 init/Kconfig                   |   11 +
 init/init_task.c               |    3 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   13 +
 kernel/sched/debug.c           |   58 +-
 kernel/sched/fair.c            | 1180 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   50 ++
 kernel/sched/topology.c        |  234 ++++++-
 14 files changed, 1810 insertions(+), 29 deletions(-)

-- 
2.32.0