From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy, Vincent Guittot
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [Patch v4 00/22] Cache aware scheduling
Date: Wed, 1 Apr 2026 14:52:12 -0700

This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain.
By improving cache locality, the scheduler can reduce cache bouncing
and cache misses, ultimately improving data access efficiency.

The design builds on the initial prototype from Peter [1]. This
initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.

Most of the feedback received on v3 has been addressed. Some aspects
could be enhanced later, after the basic cache-aware portion has
landed:

There were discussions around grouping tasks using mechanisms other
than process membership. While we agree that more flexible grouping is
desirable, this series intentionally focuses on establishing basic
process-based grouping first, with alternative grouping mechanisms to
be explored in a follow-on series.

There was also discussion in v3 about performing cache-aware
scheduling in the task wakeup path. According to previous test
results, performing task aggregation in the wakeup path introduced
task migration bouncing, primarily because the wakeup path does not
have up-to-date LLC load information. That led to over-aggregation
that then had to be corrected by load balancing. The load balancing
path was therefore chosen as the conservative place to perform task
aggregation; the task wakeup path will be investigated as a future
enhancement.

Furthermore, there were also requests to make cache-aware scheduling
benefit systems with small LLCs. Peter suggested using an LLC mask
instead of a single LLC value for preferences [2]. This could also be
implemented as a future enhancement.

The cache-aware load balancing logic remains largely unchanged. The
significant changes in v4 are:

1. LLC ID management: the calculation of the LLC ID switches to
   bitmap-based allocation rather than maintaining a static value.

2. A new patch [2/22] limits the CPU scan span to the preferred NUMA
   node when NUMA balancing is enabled.

3. Tweaks to what counts as a load balance failure: keeping a load
   imbalance at low load, or declining to pull a task away from its
   preferred LLC, is no longer treated as a balance failure.

Other changes are described in each patch.

Test results:

The patch series was applied and tested on v7.0-rc3. The git tree can
be found here:
https://github.com/timcchen1298/linux/tree/cache_aware_v4

The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32
CPUs per node. Each node has 2 CCXs, and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]

Sapphire Rapids:
hackbench shows significant improvement when the number of distinct
active threads is below the capacity of an LLC. schbench shows limited
wakeup latency improvement. ChaCha20-xiangshan (a RISC-V simulator
workload) shows good throughput improvement. No obvious difference was
observed in netperf/stream/stress-ng in Hmean.

Genoa:
Significant improvement is observed in hackbench when the number of
active threads is lower than the number of CPUs within 1 LLC. On v2,
Aaron reported improvements in hackbench/redis when the system is
underloaded. ChaCha20-xiangshan shows a large throughput improvement.
Phoronix tested v1 and reported good improvements in 30+ cases [3]. No
obvious difference was observed in netperf/stream/stress-ng in Hmean.

Detail:

To conserve space, data without much difference from baseline is not
presented.
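For reference, the hackbench runs in the tables below vary the fd
count per group. The invocation can be reconstructed roughly as
follows; this is a sketch using the rt-tests hackbench flags, not a
command line quoted from the series, and it assumes the usual
hackbench behavior of spawning one sender and one receiver per fd in
each group:

```shell
# Hypothetical reconstruction of a "threads-pipe-8, 2-groups" run;
# flag names follow the rt-tests hackbench, not a command quoted above.
groups=2
fds=8
# Each group spawns $fds senders and $fds receivers.
tasks=$((2 * groups * fds))
echo "hackbench -g $groups -f $fds --threads --pipe"
echo "total messaging threads: $tasks"   # 2 * 2 * 8 = 32
```

With 1 group and fds=8 that is 16 threads, which fits within a 16-CPU
Genoa CCX, whereas the default fds=20 would not.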
Sapphire Rapids:

[hackbench pipe]
================
case                    load          baseline(std%)  compare%( std%)
threads-pipe-10         1-groups      1.00 (  1.22)   +26.09 (  1.10)
threads-pipe-10         2-groups      1.00 (  4.90)   +22.88 (  0.18)
threads-pipe-10         4-groups      1.00 (  2.07)    +9.00 (  3.49)
threads-pipe-10         8-groups      1.00 (  8.13)    +3.45 (  3.62)
threads-pipe-16         1-groups      1.00 (  2.11)   +26.30 (  0.08)
threads-pipe-16         2-groups      1.00 ( 15.13)    -1.77 ( 11.89)
threads-pipe-16         4-groups      1.00 (  4.37)    +0.58 (  7.99)
threads-pipe-16         8-groups      1.00 (  2.88)    +2.71 (  3.50)
threads-pipe-2          1-groups      1.00 (  9.40)   +22.07 (  0.71)
threads-pipe-2          2-groups      1.00 (  9.99)   +18.01 (  0.95)
threads-pipe-2          4-groups      1.00 (  3.98)   +24.66 (  0.96)
threads-pipe-2          8-groups      1.00 (  7.00)   +21.83 (  0.23)
threads-pipe-20         1-groups      1.00 (  1.03)   +28.84 (  0.21)
threads-pipe-20         2-groups      1.00 (  4.42)   +31.90 (  3.15)
threads-pipe-20         4-groups      1.00 (  9.97)    +4.56 (  1.69)
threads-pipe-20         8-groups      1.00 (  1.87)    +1.25 (  0.74)
threads-pipe-4          1-groups      1.00 (  4.48)   +25.67 (  0.78)
threads-pipe-4          2-groups      1.00 (  9.14)    +4.91 (  2.08)
threads-pipe-4          4-groups      1.00 (  7.68)   +19.36 (  1.53)
threads-pipe-4          8-groups      1.00 ( 10.79)    +7.20 ( 12.20)
threads-pipe-8          1-groups      1.00 (  4.69)   +21.93 (  0.03)
threads-pipe-8          2-groups      1.00 (  1.16)   +25.29 (  0.65)
threads-pipe-8          4-groups      1.00 (  2.23)    -1.27 (  3.62)
threads-pipe-8          8-groups      1.00 (  4.65)    -3.08 (  2.75)

Note: the default number of fds in hackbench was changed from 20 to
various values to ensure that the threads fit within a single LLC,
especially on AMD systems. Take "threads-pipe-8, 2-groups" for
example: the number of fds is 8, and 2 groups are created.

[schbench]
The 99th percentile wakeup latency shows some improvement when the
system is underloaded, while the difference fades as system
utilization increases.
99th Wakeup Latencies   Base (mean  std)    Compare (mean  std)   Change
--------------------------------------------------------------------------------
thread=2                 9.00 (0.00)          9.00 (1.73)           0.00%
thread=4                 7.33 (0.58)          6.33 (0.58)         +13.64%
thread=8                 9.00 (0.00)          7.67 (1.15)         +14.78%
thread=16                8.67 (0.58)          8.67 (1.53)          0.00%
thread=32                9.00 (0.00)          7.00 (0.00)         +22.22%
thread=64                9.33 (0.58)          9.67 (0.58)          -3.64%
thread=128              12.00 (0.00)         12.00 (0.00)          0.00%

[chacha20]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms
Time reduced by 24%

Genoa:

[hackbench pipe]
================
The default number of fds is 20, which exceeds the number of CPUs in
an LLC, so the fd count was adjusted to 2, 4, 6, 8 and 20
respectively. Excluding the results with large run-to-run variance, a
10% ~ 50% improvement is observed when the system is underloaded:

case                    load          baseline(std%)  compare%( std%)
threads-pipe-2          1-groups      1.00 (  2.89)   +47.33 (  1.20)
threads-pipe-2          2-groups      1.00 (  3.88)   +39.82 (  0.61)
threads-pipe-2          4-groups      1.00 (  8.76)    +5.57 ( 13.10)
threads-pipe-20         1-groups      1.00 (  4.61)   +11.72 (  1.06)
threads-pipe-20         2-groups      1.00 (  6.18)   +14.55 (  1.47)
threads-pipe-20         4-groups      1.00 (  2.99)   +10.16 (  4.49)
threads-pipe-4          1-groups      1.00 (  4.23)   +43.70 (  2.14)
threads-pipe-4          2-groups      1.00 (  3.68)    +8.45 (  4.04)
threads-pipe-4          4-groups      1.00 ( 17.72)    +2.42 (  1.14)
threads-pipe-6          1-groups      1.00 (  3.10)    +7.74 (  3.83)
threads-pipe-6          2-groups      1.00 (  3.42)   +14.26 (  4.53)
threads-pipe-6          4-groups      1.00 ( 10.34)   +10.94 (  7.12)
threads-pipe-8          1-groups      1.00 (  4.21)    +9.06 (  4.43)
threads-pipe-8          2-groups      1.00 (  1.88)    +3.74 (  0.58)
threads-pipe-8          4-groups      1.00 (  2.78)   +23.96 (  1.18)

[chacha20]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms
Time reduced by 48%

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://lore.kernel.org/all/20260219165221.GM1395266@noisy.programming.kicks-ass.net/
[3] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin

Change history:

**v4 Changes:**
1. Use a bitmap-based dynamic allocation mechanism for LLC IDs.
2. Introduce a new patch [2/22] to limit the CPU scan depth to the
   preferred NUMA node.
3. Keeping a load imbalance at low load, or not pulling a task from
   its preferred LLC, is no longer counted as a balance failure.
4. Other changes from v3 are detailed in each patch's change log.

**v3 Changes:**
v3 link: https://lore.kernel.org/all/cover.1770760558.git.tim.c.chen@linux.intel.com/
1. Cache-aware scheduling is skipped after repeated load balance
   failures (up to cache_nice_tries). This avoids repeatedly
   attempting cache-aware migrations when no movable tasks prefer the
   destination LLC.
2. The busiest runqueue is no longer sorted to select tasks that
   prefer the destination LLC. This sorting was costly, and equivalent
   behavior can be achieved by skipping tasks that do not prefer the
   destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept
   in the lowest-level sched domain per CPU. This simplifies handling
   of LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.

**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by prioritizing NUMA
   balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
   size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics.
4. Add clarifying comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes addressing feedback from review of the v1 patch set
   (see individual patch change logs).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/

Chen Yu (10):
  sched/cache: Limit the scan number of CPUs when calculating task occupancy
  sched/cache: Record per LLC utilization to guide cache aware scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration policy
  sched/cache: Disable cache aware scheduling for processes with high thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
  sched/cache: Allow the user space to turn on and off cache aware scheduling
  sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (11):
  sched/cache: Make LLC id continuous
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per CPU's tasks LLC preference counter
  sched/cache: Calculate the percpu sd task LLC preference
  sched/cache: Count tasks prefering destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   31 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   43 ++
 include/linux/sched.h          |   32 +
 include/linux/sched/topology.h |   17 +
 include/trace/events/sched.h   |  140 ++++
 init/Kconfig                   |   11 +
 init/init_task.c               |    3 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   13 +
 kernel/sched/debug.c           |   58 +-
 kernel/sched/fair.c            | 1180 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   50 ++
 kernel/sched/topology.c        |  234 ++++++-
 14 files changed, 1810 insertions(+), 29 deletions(-)

-- 
2.32.0