From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R. Shenoy,
    Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
    Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
    Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [PATCH v2 00/23] Cache aware scheduling
Date: Wed, 3 Dec 2025 15:07:19 -0800

This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data on the
same Last Level Cache (LLC) domain. By improving cache locality, the
scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency.

The design builds on the initial prototype from Peter [1]. In this
initial implementation, threads within the same process are treated as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate these threads onto the same LLC domain
whenever possible.
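
As a rough illustration of that idea (the types and helper names below
are invented for this sketch and are not the actual code in this
series): each process keeps a small amount of per-LLC occupancy state,
a preferred LLC is derived from it, and the load balancer treats tasks
whose process prefers the destination LLC as better migration
candidates.

	#define MAX_LLCS 8

	/* Illustrative per-process (per-mm) cache-affinity state. */
	struct mm_sched_info {
		int preferred_llc;                   /* LLC where the process ran the most */
		unsigned long llc_runtime[MAX_LLCS]; /* accumulated runtime per LLC */
	};

	/*
	 * Periodically pick the LLC where the threads of this process have
	 * accumulated the most runtime; that becomes the preferred LLC.
	 */
	static void update_preferred_llc(struct mm_sched_info *mi, int nr_llcs)
	{
		int llc, best = 0;

		for (llc = 1; llc < nr_llcs; llc++)
			if (mi->llc_runtime[llc] > mi->llc_runtime[best])
				best = llc;
		mi->preferred_llc = best;
	}

	/*
	 * At load-balance time, a task whose process prefers the destination
	 * LLC is a better candidate to pull than one that does not.
	 */
	static int task_prefers_dst_llc(const struct mm_sched_info *mi, int dst_llc)
	{
		return mi->preferred_llc == dst_llc;
	}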

We would like to thank everyone who provided feedback on the v1
series [1]. Most of the comments have been addressed in this revision.
Several broader suggestions surfaced during review, and we believe they
are best approached in follow-up work once the foundational cache-aware
scheduling infrastructure is merged:

1. **Generalizing task grouping beyond processes.**
   While v2 focuses on grouping threads within a single process, other
   classes of workloads naturally share data and could benefit from LLC
   co-location, such as:
   a) Tasks from different processes that operate on shared data.
   b) Tasks belonging to the same NUMA group.
   c) Tasks with strong waker/wakee relationships.
   d) User-defined groups via cgroups or other user interfaces.

2. **Configurable cache-aware scheduling policies.**
   The current iteration implements a global cache-aware scheduling
   policy. Future work may introduce per-process or per-task-group
   policies, exposed through prctl() or other mechanisms.

**v2 Changes:**

1. Align NUMA balancing and cache affinity by prioritizing NUMA
   balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
   size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics (a short illustrative
   sketch follows this list).
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from the review of the v1 patch
   set (see individual patch change logs).
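
To make change 3 concrete, here is a minimal illustration (the
structure and function names are made up for this example and are not
the names used in the patches). The kernel's native LLC id (per-CPU
sd_llc_id) is traditionally derived from the first CPU in each LLC's
span, so the id space is sparse; with a contiguous 0..nr_llcs-1 id
space, per-process LLC statistics can live in one dense array and be
looked up by direct indexing:

	#include <stdlib.h>

	/* Hypothetical per-process statistics for one LLC. */
	struct llc_stat {
		unsigned long runtime;  /* time the process's threads ran on this LLC */
		unsigned int  nr_tasks; /* threads currently preferring this LLC */
	};

	/* One dense array, sized by the number of LLCs in the system. */
	static struct llc_stat *alloc_llc_stats(int nr_llcs)
	{
		return calloc(nr_llcs, sizeof(struct llc_stat));
	}

	/* A contiguous LLC id doubles as the array index; no search needed. */
	static struct llc_stat *llc_stat_of(struct llc_stat *stats, int llc_id)
	{
		return &stats[llc_id];
	}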

Test results:

The patch series was applied and tested on v6.18-rc7. See:
https://github.com/timcchen1298/linux/commits/cache_aware_v2

The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]

Sapphire Rapids:
hackbench shows significant improvement when the number of active
threads is below the capacity of an LLC. schbench shows overall wakeup
latency improvement. ChaCha20-xiangshan shows good throughput
improvement.

Genoa:
ChaCha20-xiangshan shows a huge throughput improvement. No obvious
difference is observed in hackbench/schbench/netperf/stream/stress-ng.

Phoronix tested v1 and reported good improvements in 33 cases [2].

Details:

Due to length constraints, only part of the data is presented.

Sapphire Rapids:

hackbench thread pipes
                         baseline            sched_cache
groups
Amean     1         38.8224 (  0.00%)     26.4582 * 31.85%*
Amean     3         38.2358 (  0.00%)     38.0758 (  0.42%)
Amean     5         40.7282 (  0.00%)     41.1568 ( -1.05%)
Amean     7         51.1720 (  0.00%)     50.6646 (  0.99%)
Amean     12        63.1562 (  0.00%)     63.3516 ( -0.31%)
Amean     16        73.9584 (  0.00%)     75.5596 ( -2.17%)
Max       1         39.4140 (  0.00%)     26.7590 ( 32.11%)
Max       3         40.8310 (  0.00%)     39.8000 (  2.53%)
Max       5         42.2150 (  0.00%)     42.4860 ( -0.64%)
Max       7         52.1800 (  0.00%)     51.9370 (  0.47%)
Max       12        63.9430 (  0.00%)     64.2820 ( -0.53%)
Max       16        74.3710 (  0.00%)     76.4170 ( -2.75%)

hackbench was further tested with other numbers of fds (the trailing
number in each case name is the fd count; see the example invocation
after the table):

case                  groups       baseline(std%)   compare%( std%)
threads-pipe-2        1-groups     1.00 (  1.25)    +38.52 (  1.33)
threads-pipe-2        2-groups     1.00 ( 12.52)    +12.74 (  1.31)
threads-pipe-2        4-groups     1.00 (  7.91)    +12.29 (  1.86)
threads-pipe-4        1-groups     1.00 (  0.55)    +34.99 (  0.45)
threads-pipe-4        2-groups     1.00 ( 16.00)    +27.32 (  0.75)
threads-pipe-4        4-groups     1.00 ( 17.37)    +25.75 (  0.20)
threads-pipe-8        1-groups     1.00 (  0.74)    +27.13 (  0.44)
threads-pipe-8        2-groups     1.00 (  8.82)    +23.79 (  0.32)
threads-pipe-8        4-groups     1.00 (  1.30)    +27.64 (  0.51)
threads-pipe-16       1-groups     1.00 (  1.03)    +30.55 (  0.27)
threads-pipe-16       2-groups     1.00 (  6.43)    +29.52 (  0.20)
threads-pipe-16       4-groups     1.00 (  1.36)     -1.85 (  1.43)
threads-pipe-20       1-groups     1.00 (  0.45)    +30.88 (  0.42)
threads-pipe-20       2-groups     1.00 (  1.95)     -0.81 (  5.84)
threads-pipe-20       4-groups     1.00 (  2.09)     -1.77 (  7.57)
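
For readers unfamiliar with the case naming above, a plausible mapping
to a hackbench command line (assuming the common rt-tests hackbench;
the exact commands used for this report are not shown here, so treat
this purely as an illustration) is that "threads-pipe-4, 2-groups"
roughly corresponds to:

	# 2 groups of sender/receiver threads talking over pipes,
	# with 4 file descriptors (4 senders and 4 receivers) per group
	hackbench -g 2 -T -p -f 4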

stream:
                         baseline            sched_cache
GB/sec copy-2       36.48 (  0.00%)       36.55 (  0.18%)
GB/sec scale-2      36.83 (  0.00%)       36.97 (  0.38%)
GB/sec add-2        37.92 (  0.00%)       38.03 (  0.31%)
GB/sec triad-2      37.83 (  0.00%)       37.97 (  0.37%)

stress-ng context switch:
                              baseline             sched_cache
Min       context-1        2957.81 (  0.00%)     2966.17 (  0.28%)
Min       context-2        5931.68 (  0.00%)     5930.17 ( -0.03%)
Min       context-4       11874.20 (  0.00%)    11875.68 (  0.01%)
Min       context-8       23755.30 (  0.00%)    23762.43 (  0.03%)
Min       context-16      47535.14 (  0.00%)    47526.46 ( -0.02%)
Min       context-32      95078.66 (  0.00%)    94356.39 ( -0.76%)
Min       context-64     190074.62 (  0.00%)   190042.93 ( -0.02%)
Min       context-128    371107.12 (  0.00%)   371008.10 ( -0.03%)
Min       context-256    578443.73 (  0.00%)   579037.86 (  0.10%)
Min       context-480    580203.34 (  0.00%)   580499.43 (  0.05%)
Hmean     context-1        2964.59 (  0.00%)     2967.69 (  0.10%)
Hmean     context-2        5936.41 (  0.00%)     5935.51 ( -0.02%)
Hmean     context-4       11879.56 (  0.00%)    11881.70 (  0.02%)
Hmean     context-8       23771.92 (  0.00%)    23770.28 ( -0.01%)
Hmean     context-16      47552.23 (  0.00%)    47538.01 ( -0.03%)
Hmean     context-32      95102.67 (  0.00%)    94969.43 ( -0.14%)
Hmean     context-64     190129.74 (  0.00%)   190088.68 ( -0.02%)
Hmean     context-128    371291.95 (  0.00%)   371114.82 ( -0.05%)
Hmean     context-256    578907.96 (  0.00%)   579338.99 (  0.07%)
Hmean     context-480    580541.78 (  0.00%)   580726.13 (  0.03%)
Max       context-1        2967.93 (  0.00%)     2968.90 (  0.03%)
Max       context-2        5942.37 (  0.00%)     5940.40 ( -0.03%)
Max       context-4       11885.25 (  0.00%)    11886.43 (  0.01%)
Max       context-8       23784.17 (  0.00%)    23783.31 ( -0.00%)
Max       context-16      47576.84 (  0.00%)    47561.42 ( -0.03%)
Max       context-32      95139.03 (  0.00%)    95094.86 ( -0.05%)
Max       context-64     190180.08 (  0.00%)   190123.31 ( -0.03%)
Max       context-128    371451.73 (  0.00%)   371240.25 ( -0.06%)
Max       context-256    579355.24 (  0.00%)   579731.37 (  0.06%)
Max       context-480    580750.44 (  0.00%)   581118.33 (  0.06%)
BHmean-50 context-1        2966.80 (  0.00%)     2968.82 (  0.07%)
BHmean-50 context-2        5939.32 (  0.00%)     5939.49 (  0.00%)
BHmean-50 context-4       11883.02 (  0.00%)    11886.08 (  0.03%)
BHmean-50 context-8       23778.40 (  0.00%)    23775.90 ( -0.01%)
BHmean-50 context-16      47568.31 (  0.00%)    47546.19 ( -0.05%)
BHmean-50 context-32      95125.84 (  0.00%)    95087.06 ( -0.04%)
BHmean-50 context-64     190165.37 (  0.00%)   190117.94 ( -0.02%)
BHmean-50 context-128    371405.28 (  0.00%)   371168.75 ( -0.06%)
BHmean-50 context-256    579137.11 (  0.00%)   579609.35 (  0.08%)
BHmean-50 context-480    580646.72 (  0.00%)   580920.46 (  0.05%)
BHmean-95 context-1        2965.72 (  0.00%)     2967.94 (  0.07%)
BHmean-95 context-2        5937.20 (  0.00%)     5936.40 ( -0.01%)
BHmean-95 context-4       11880.45 (  0.00%)    11882.71 (  0.02%)
BHmean-95 context-8       23774.69 (  0.00%)    23771.59 ( -0.01%)
BHmean-95 context-16      47555.08 (  0.00%)    47539.93 ( -0.03%)
BHmean-95 context-32      95106.67 (  0.00%)    95072.38 ( -0.04%)
BHmean-95 context-64     190138.93 (  0.00%)   190096.30 ( -0.02%)
BHmean-95 context-128    371322.78 (  0.00%)   371132.61 ( -0.05%)
BHmean-95 context-256    578985.41 (  0.00%)   579389.21 (  0.07%)
BHmean-95 context-480    580598.22 (  0.00%)   580763.93 (  0.03%)
BHmean-99 context-1        2965.72 (  0.00%)     2967.94 (  0.07%)
BHmean-99 context-2        5937.20 (  0.00%)     5936.40 ( -0.01%)
BHmean-99 context-4       11880.45 (  0.00%)    11882.71 (  0.02%)
BHmean-99 context-8       23774.69 (  0.00%)    23771.59 ( -0.01%)
BHmean-99 context-16      47555.08 (  0.00%)    47539.93 ( -0.03%)
BHmean-99 context-32      95106.67 (  0.00%)    95072.38 ( -0.04%)
BHmean-99 context-64     190138.93 (  0.00%)   190096.30 ( -0.02%)
BHmean-99 context-128    371322.78 (  0.00%)   371132.61 ( -0.05%)
BHmean-99 context-256    578985.41 (  0.00%)   579389.21 (  0.07%)
BHmean-99 context-480    580598.22 (  0.00%)   580763.93 (  0.03%)

schbench thread = 1
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     10.71(0.76)          9.86(1.46)             +7.94%
Request Latencies 99.0th    4036.00(6.53)        4054.29(10.03)         -0.45%
RPS 50.0th                  267.29(0.49)         266.86(0.38)           -0.16%
Average RPS                 268.42(0.16)         267.86(0.31)           -0.21%

schbench thread = 2
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     11.43(1.13)          8.00(2.00)             +30.01%
Request Latencies 99.0th    4007.43(34.52)       3967.43(70.03)         +1.00%
RPS 50.0th                  536.71(0.76)         536.14(1.57)           -0.11%
Average RPS                 536.59(0.55)         535.33(1.34)           -0.23%

schbench thread = 4
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     9.57(0.79)           6.14(1.46)             +35.84%
Request Latencies 99.0th    3789.14(31.47)       3810.86(48.97)         -0.57%
RPS 50.0th                  1074.00(0.00)        1073.43(2.76)          -0.05%
Average RPS                 1075.03(1.07)        1072.93(2.13)          -0.20%

schbench thread = 8
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     9.29(0.49)           6.57(1.81)             +29.28%
Request Latencies 99.0th    3756.00(19.60)       3769.71(23.87)         -0.37%
RPS 50.0th                  2152.57(4.28)        2152.57(4.28)          0.00%
Average RPS                 2151.07(2.71)        2150.58(3.41)          -0.02%

schbench thread = 16
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     9.43(0.53)           6.86(0.90)             +27.25%
Request Latencies 99.0th    3780.00(32.98)       3774.29(11.04)         +0.15%
RPS 50.0th                  4305.14(8.55)        4307.43(7.81)          +0.05%
Average RPS                 4303.47(5.74)        4301.71(4.35)          -0.04%

schbench thread = 32
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     10.14(0.38)          6.86(0.69)             +32.35%
Request Latencies 99.0th    3764.00(21.66)       3806.29(32.24)         -1.12%
RPS 50.0th                  8624.00(0.00)        8619.43(12.09)         -0.05%
Average RPS                 8607.36(5.29)        8602.69(7.08)          -0.05%

schbench thread = 64
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     11.71(0.49)          8.43(1.81)             +28.01%
Request Latencies 99.0th    3796.00(62.48)       3860.25(147.35)        -1.69%
RPS 50.0th                  17238.86(24.19)      16411.43(88.95)        -4.80%
Average RPS                 17209.02(10.18)      16389.73(100.27)       -4.76%

schbench thread = 128
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     13.29(0.49)          12.00(0.00)            +9.71%
Request Latencies 99.0th    7893.71(11.04)       7909.71(17.10)         -0.20%
RPS 50.0th                  32013.71(194.52)     32068.57(50.35)        +0.17%
Average RPS                 31762.03(238.18)     31884.81(300.85)       +0.39%

schbench thread = 239
Metric                      Base (mean±std)      Compare (mean±std)     Change
-------------------------------------------------------------------------------
Wakeup Latencies 99.0th     13.29(0.49)          14.43(0.53)            -8.58%
Request Latencies 99.0th    8174.86(8.55)        8244.57(12.09)         -0.85%
RPS 50.0th                  30624.00(0.00)       30614.86(24.19)        -0.03%
Average RPS                 30695.86(11.03)      30673.35(17.31)        -0.07%

chacha20:
baseline:    Host time spent: 66,320ms
sched_cache: Host time spent: 53,859ms
Time reduced by 18%, throughput increased by 23%

Genoa:

chacha20:
baseline:    Host time spent: 51,848ms
sched_cache: Host time spent: 28,439ms
Time reduced by 45%, throughput increased by 82%
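
The throughput numbers above follow directly from the host times: for a
fixed amount of work, throughput scales with the inverse of the run
time, so

	Sapphire Rapids: 66,320 / 53,859 ~= 1.23  -> ~23% higher throughput
	                 (66,320 - 53,859) / 66,320 ~= 18.8% less host time
	Genoa:           51,848 / 28,439 ~= 1.82  -> ~82% higher throughput
	                 (51,848 - 28,439) / 51,848 ~= 45.1% less host time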

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin

Chen Yu (10):
  sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration policy
  sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node
  sched/cache: Record the number of active threads per process for cache-aware scheduling
  sched/cache: Disable cache aware scheduling for processes with high thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (12):
  sched/cache: Make LLC id continuous
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per runqueue task LLC preference counter
  sched/cache: Calculate the per runqueue task LLC preference
  sched/cache: Count tasks prefering destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Consider LLC preference when selecting tasks for load balancing
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   22 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   60 ++
 include/linux/sched.h          |   19 +
 include/linux/sched/topology.h |    5 +
 include/trace/events/sched.h   |   31 +
 init/Kconfig                   |   11 +
 init/init_task.c               |    4 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   12 +
 kernel/sched/debug.c           |   62 ++
 kernel/sched/fair.c            | 1034 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   39 ++
 kernel/sched/stats.c           |    5 +-
 kernel/sched/topology.c        |  239 +++++++-
 15 files changed, 1543 insertions(+), 27 deletions(-)

-- 
2.32.0