From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy, Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel@vger.kernel.org
Subject: [PATCH v3 00/21] Cache Aware Scheduling
Date: Tue, 10 Feb 2026 14:18:40 -0800
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain.
By improving cache locality, the scheduler can reduce cache bouncing
and cache misses, ultimately improving data access efficiency.

The design builds on the initial prototype from Peter [1]. This
initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.

Most of the feedback received on v2 has been addressed. There were
discussions around grouping tasks using mechanisms other than process
membership. While we agree that more flexible grouping is desirable,
this series intentionally focuses on establishing the basic
process-based grouping first, with alternative grouping mechanisms to
be explored in a follow-on series. As a step in that direction, the
cache-aware scheduling statistics have been separated from the mm
structure into a new sched_cache_stats structure.

Thanks for the much useful feedback at LPC 2025 and on v2; we'd like
to start a separate thread to discuss possible user interfaces.

The load balancing algorithms remain largely unchanged. The main
changes in v3 are:

1. Cache-aware scheduling is skipped after repeated load balance
   failures (up to cache_nice_tries). This avoids repeatedly
   attempting cache-aware migrations when no movable tasks prefer the
   destination LLC.

2. The busiest runqueue is no longer sorted to select tasks that
   prefer the destination LLC. This sorting was costly, and
   equivalent behavior can be achieved by skipping tasks that do not
   prefer the destination LLC during cache-aware migrations.

3. The LLC ID is now derived directly from the
   sched_domain_topology_level data, which simplifies the ID
   derivation.

4. Accounting of the number of tasks preferring each LLC is now kept
   in the lowest-level sched domain per CPU. This simplifies handling
   of LLC resizing and changes in the number of LLC domains.
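Changes (1) and (2) above can be pictured as a single gating check in the
migration path. The following is a simplified user-space sketch, not the
actual kernel code: the struct, field, and function names (struct task,
preferred_llc, cache_aware_can_migrate) are illustrative stand-ins for
the real fair.c logic.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the v3 cache-aware migration gate.
 * All identifiers here are illustrative, not kernel names.
 */
struct task {
	int preferred_llc;	/* LLC this task's process prefers */
};

/*
 * Decide whether a cache-aware balance pass may pull @t into @dst_llc.
 * Mirrors change (1): give up after repeated load balance failures,
 * and change (2): skip tasks that do not prefer the destination LLC
 * instead of sorting the busiest runqueue.
 */
static bool cache_aware_can_migrate(const struct task *t, int dst_llc,
				    int nr_balance_failed,
				    int cache_nice_tries)
{
	/* (1) stop attempting cache-aware migration after repeated failures */
	if (nr_balance_failed > cache_nice_tries)
		return false;

	/* (2) only pull tasks whose preferred LLC is the destination */
	return t->preferred_llc == dst_llc;
}
```

With this shape, the per-task check is O(1) during detach, so no O(n log n)
sort of the busiest runqueue is needed.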
Test results:

The patch series was applied and tested on v6.19-rc3. See:
https://github.com/timcchen1298/linux/commits/cache_aware_v3

The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are
60 CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32
CPUs per node. Each node has 2 CCXs, and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]

Sapphire Rapids: hackbench shows significant improvement when the
number of distinct active threads is below the capacity of an LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan (a RISC-V simulator workload) shows good
throughput improvement. No obvious difference was observed in the
Hmean of netperf/stream/stress-ng.

Genoa: Significant improvement is observed in hackbench when the
number of active threads is lower than the number of CPUs within one
LLC. On v2, Aaron reported improvement in hackbench/redis when the
system is underloaded. ChaCha20-xiangshan shows a large throughput
improvement. Phoronix tested v1 and showed good improvements in 30+
cases [2]. No obvious difference was observed in the Hmean of
netperf/stream/stress-ng.

Details:

Due to length constraints, data that shows little difference from the
baseline is not presented.
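The per-LLC CPU counts quoted for the two test platforms can be
cross-checked from sysfs, where
/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list holds a
"cpulist" string (e.g. "0-29,60-89") for each CPU's last level cache.
A minimal user-space helper to count the CPUs in such a string might
look as follows; it is purely illustrative and not part of this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count the CPUs named by a Linux "cpulist" string such as
 * "0-29,60-89", the format used by
 * /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list.
 * Illustrative helper only, not kernel code.
 */
static int cpulist_count(const char *list)
{
	char buf[256];
	int count = 0;

	strncpy(buf, list, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		int lo, hi;

		if (sscanf(tok, "%d-%d", &lo, &hi) == 2)
			count += hi - lo + 1;	/* "lo-hi" range */
		else
			count += 1;		/* single CPU id */
	}
	return count;
}
```

On the Sapphire Rapids machine above, each LLC's shared_cpu_list should
count to 60 CPUs; on Genoa, each CCX's L3 should count to 16.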
Sapphire Rapids:

[hackbench pipe]
case              load       baseline(std%)  compare%( std%)
threads-pipe-2    1-groups   1.00 (  3.19)   +29.06 (  3.31)*
threads-pipe-2    2-groups   1.00 (  9.61)   +19.19 (  0.55)*
threads-pipe-2    4-groups   1.00 (  6.69)   +15.02 (  1.34)*
threads-pipe-2    8-groups   1.00 (  1.83)   +25.59 (  1.46)*
threads-pipe-4    1-groups   1.00 (  3.41)   +28.63 (  1.17)*
threads-pipe-4    2-groups   1.00 ( 15.62)   +19.51 (  0.82)
threads-pipe-4    4-groups   1.00 (  0.19)   +27.05 (  0.74)*
threads-pipe-4    8-groups   1.00 (  4.32)    +5.64 (  3.18)
threads-pipe-8    1-groups   1.00 (  0.44)   +24.68 (  0.49)*
threads-pipe-8    2-groups   1.00 (  2.03)   +23.76 (  0.52)*
threads-pipe-8    4-groups   1.00 (  3.77)    +7.16 (  1.58)
threads-pipe-8    8-groups   1.00 (  4.53)    +6.88 (  2.36)
threads-pipe-16   1-groups   1.00 (  1.71)   +28.46 (  0.68)*
threads-pipe-16   2-groups   1.00 (  4.25)    -0.23 (  0.97)
threads-pipe-16   4-groups   1.00 (  0.64)    -0.95 (  3.74)
threads-pipe-16   8-groups   1.00 (  1.23)    +1.77 (  0.31)

Note: The default number of fds in hackbench is changed from 20 to
various values to ensure that threads fit within a single LLC,
especially on AMD systems. Take "threads-pipe-8, 2-groups" for
example: the number of fds is 8, and 2 groups are created.

[schbench]
The 99th percentile wakeup latency shows overall improvements, while
the 99th percentile request latency exhibits increased run-to-run
variance. The cache-aware scheduling logic, which scans all online
CPUs to identify the hottest LLC, may be the root cause of the
elevated request latency: the costly task_cache_work() delays the
task's return to user space. This issue should be mitigated by
restricting the scan to a limited set of NUMA nodes [3]; that fix is
planned to be integrated once the current version is in good shape.
99th Wakeup Latencies   Base (mean±std)   Compare (mean±std)   Change
---------------------------------------------------------------------
thread = 2              13.33(1.15)       13.00(1.73)          +2.48%
thread = 4              12.33(1.53)        9.67(1.53)          +21.57%
thread = 8              10.00(0.00)       10.67(0.58)          -6.70%
thread = 16             10.00(1.00)        9.33(0.58)          +6.70%
thread = 32             10.33(0.58)        9.67(1.53)          +6.39%
thread = 64             10.33(0.58)        9.33(1.53)          +9.68%
thread = 128            12.67(0.58)       12.00(0.00)          +5.29%

Run-to-run variance regresses at 1 messenger + 8 workers:

Request Latencies 99.0th  3981.33(260.16)  4877.33(1880.57)    -22.51%

[chacha20]
Time reduced by 20%.

Genoa:

[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs in
an LLC, so the fd count is adjusted to 2, 4, 8 and 16 respectively.
Excluding results with large run-to-run variance, 20% ~ 50%
improvement is observed when the system is underloaded:

case              load       baseline(std%)  compare%( std%)
threads-pipe-2    1-groups   1.00 (  4.04)   +47.22 (  4.77)*
threads-pipe-2    2-groups   1.00 (  5.04)   +33.79 (  8.92)*
threads-pipe-2    4-groups   1.00 (  5.82)    +5.93 (  7.97)
threads-pipe-2    8-groups   1.00 ( 16.15)    -4.11 (  6.85)
threads-pipe-4    1-groups   1.00 (  7.28)   +50.43 (  2.39)*
threads-pipe-4    2-groups   1.00 ( 10.77)    -4.31 (  7.71)
threads-pipe-4    4-groups   1.00 ( 11.16)    +8.12 ( 11.21)
threads-pipe-4    8-groups   1.00 ( 12.79)   -10.10 ( 12.92)
threads-pipe-8    1-groups   1.00 (  5.57)    -1.50 (  6.55)
threads-pipe-8    2-groups   1.00 ( 10.72)    +0.69 (  6.38)
threads-pipe-8    4-groups   1.00 (  7.04)   +19.70 (  5.58)*
threads-pipe-8    8-groups   1.00 (  7.11)   +27.46 (  2.34)*
threads-pipe-16   1-groups   1.00 (  2.86)   -12.82 (  8.97)
threads-pipe-16   2-groups   1.00 (  8.55)    +2.96 (  1.65)
threads-pipe-16   4-groups   1.00 (  5.12)   +20.49 (  5.33)*
threads-pipe-16   8-groups   1.00 (  3.23)    +9.06 (  2.87)

[chacha20]
baseline:    Host time spent: 51432ms
sched_cache: Host time spent: 28664ms
Time reduced by 45%.

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
[3]
https://lore.kernel.org/all/865b852e3fdef6561c9e0a5be9a94aec8a68cdea.1760206683.git.tim.c.chen@linux.intel.com/

Change history:

**v3 Changes:**

1. Cache-aware scheduling is skipped after repeated load balance
   failures (up to cache_nice_tries). This avoids repeatedly
   attempting cache-aware migrations when no movable tasks prefer the
   destination LLC.

2. The busiest runqueue is no longer sorted to select tasks that
   prefer the destination LLC. This sorting was costly, and
   equivalent behavior can be achieved by skipping tasks that do not
   prefer the destination LLC during cache-aware migrations.

3. Accounting of the number of tasks preferring each LLC is now kept
   in the lowest-level sched domain per CPU. This simplifies handling
   of LLC resizing and changes in the number of LLC domains.

4. Other changes from v2 are detailed in each patch's change log.

**v2 Changes:**

v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/

1. Align NUMA balancing and cache affinity by prioritizing NUMA
   balancing when their decisions differ.

2. Dynamically resize per-LLC statistics structures based on the LLC
   size.

3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics.

4. Add clarification comments.

5. Add 3 debug patches (not meant for merging).

6. Other changes to address feedback from the review of the v1 patch
   set (see individual patch change logs).
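The contiguous LLC-ID idea from v2 change (3) amounts to remapping a
possibly sparse hardware-derived LLC id into a dense 0..n-1 space that
can index statistics arrays directly. A user-space sketch of that
remapping follows; the names (llc_remap, llc_id_contiguous, nr_llcs)
are hypothetical, not the identifiers used in the patches.

```c
#include <assert.h>

#define MAX_LLCS 64

/*
 * Sketch of remapping sparse LLC ids to a contiguous 0..n-1 space.
 * All names here are illustrative, not kernel identifiers.
 */
static int nr_llcs;			/* contiguous ids handed out so far */
static int llc_remap[MAX_LLCS];		/* value-1 = contiguous id; 0 = unmapped */

/* Return the dense id for @sparse_id, assigning one on first sight. */
static int llc_id_contiguous(int sparse_id)
{
	if (!llc_remap[sparse_id])
		llc_remap[sparse_id] = ++nr_llcs;
	return llc_remap[sparse_id] - 1;
}
```

With dense ids, per-LLC statistics become plain arrays of length
nr_llcs, with no holes and no hashing on the hot path.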
**v1**

v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/

Chen Yu (10):
  sched/cache: Record per LLC utilization to guide cache aware
    scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration
    policy
  sched/cache: Make LLC id continuous
  sched/cache: Disable cache aware scheduling for processes with high
    thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
  sched/cache: Allow the user space to turn on and off cache aware
    scheduling
  sched/cache: Add user control to adjust the aggressiveness of
    cache-aware scheduling
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC
    occupancy for each process via proc fs
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the
    load balance statistics

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (10):
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per CPU's tasks LLC preference counter
  sched/cache: Calculate the percpu sd task LLC preference
  sched/cache: Count tasks prefering destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during
    balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware
    balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   31 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   43 ++
 include/linux/sched.h          |   32 +
 include/linux/sched/topology.h |    8 +
 include/trace/events/sched.h   |   79 +++
 init/Kconfig                   |   11 +
 init/init_task.c               |    3 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   11 +
 kernel/sched/debug.c           |   55 ++
 kernel/sched/fair.c            | 1088 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   44 ++
 kernel/sched/topology.c        |  194 +++++-
 14 files changed, 1598 insertions(+), 28 deletions(-)

-- 
2.32.0