From mboxrd@z Thu Jan 1 00:00:00 1970
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy",
	Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu,
	Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 18/22] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
Date: Wed, 1 Apr 2026 14:52:30 -0700
Message-Id: <71972e12ab4f08aff422b31e34df09bdbd94de84.1775065312.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Chen Yu

Introduce sched_cache_present to enable cache aware scheduling on NUMA
nodes with multiple LLCs.

Cache-aware load balancing should only be enabled if there is more than
one LLC within a NUMA node. sched_cache_present is introduced to
indicate whether the platform has this topology.

Test results:

The first test platform is a 2 socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so the
system essentially has one NUMA node with two last level caches. There
are 60 CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs, and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]

Sapphire Rapids:
hackbench shows significant improvement when the number of different
active threads is below the capacity of an LLC. schbench shows limited
wakeup latency improvement. ChaCha20-xiangshan (risc-v simulator) shows
good throughput improvement. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.
Genoa:
Significant improvement is observed in hackbench when the number of
active threads is lower than the number of CPUs within one LLC. On v2,
Aaron reported improvement of hackbench/redis when the system is
underloaded. ChaCha20-xiangshan shows huge throughput improvement.
Phoronix has tested v1 and shows good improvements in 30+ cases[3].
No obvious difference was observed in netperf/stream/stress-ng in
Hmean.

Detail:

Due to length constraints, data without much difference from the
baseline is not presented.

Sapphire Rapids:

[hackbench pipe]
================
case                    load          baseline(std%)  compare%( std%)
threads-pipe-10         1-groups      1.00 (  1.22)   +26.09 (  1.10)
threads-pipe-10         2-groups      1.00 (  4.90)   +22.88 (  0.18)
threads-pipe-10         4-groups      1.00 (  2.07)    +9.00 (  3.49)
threads-pipe-10         8-groups      1.00 (  8.13)    +3.45 (  3.62)
threads-pipe-16         1-groups      1.00 (  2.11)   +26.30 (  0.08)
threads-pipe-16         2-groups      1.00 ( 15.13)    -1.77 ( 11.89)
threads-pipe-16         4-groups      1.00 (  4.37)    +0.58 (  7.99)
threads-pipe-16         8-groups      1.00 (  2.88)    +2.71 (  3.50)
threads-pipe-2          1-groups      1.00 (  9.40)   +22.07 (  0.71)
threads-pipe-2          2-groups      1.00 (  9.99)   +18.01 (  0.95)
threads-pipe-2          4-groups      1.00 (  3.98)   +24.66 (  0.96)
threads-pipe-2          8-groups      1.00 (  7.00)   +21.83 (  0.23)
threads-pipe-20         1-groups      1.00 (  1.03)   +28.84 (  0.21)
threads-pipe-20         2-groups      1.00 (  4.42)   +31.90 (  3.15)
threads-pipe-20         4-groups      1.00 (  9.97)    +4.56 (  1.69)
threads-pipe-20         8-groups      1.00 (  1.87)    +1.25 (  0.74)
threads-pipe-4          1-groups      1.00 (  4.48)   +25.67 (  0.78)
threads-pipe-4          2-groups      1.00 (  9.14)    +4.91 (  2.08)
threads-pipe-4          4-groups      1.00 (  7.68)   +19.36 (  1.53)
threads-pipe-4          8-groups      1.00 ( 10.79)    +7.20 ( 12.20)
threads-pipe-8          1-groups      1.00 (  4.69)   +21.93 (  0.03)
threads-pipe-8          2-groups      1.00 (  1.16)   +25.29 (  0.65)
threads-pipe-8          4-groups      1.00 (  2.23)    -1.27 (  3.62)
threads-pipe-8          8-groups      1.00 (  4.65)    -3.08 (  2.75)

Note: The default number of fd in hackbench is changed from 20 to
various values to ensure that threads fit within a single LLC,
especially on AMD systems. Take "threads-pipe-8, 2-groups" for example:
the number of fd is 8, and 2 groups are created.

[schbench]
The 99th percentile wakeup latency shows some improvement when the
system is underloaded, while it makes little difference as system
utilization increases.

99th Wakeup Latencies   Base (mean  std)    Compare (mean  std)   Change
=========================================================================
thread=2                 9.00( 0.00)         9.00( 1.73)            0.00%
thread=4                 7.33( 0.58)         6.33( 0.58)          +13.64%
thread=8                 9.00( 0.00)         7.67( 1.15)          +14.78%
thread=16                8.67( 0.58)         8.67( 1.53)            0.00%
thread=32                9.00( 0.00)         7.00( 0.00)          +22.22%
thread=64                9.33( 0.58)         9.67( 0.58)           -3.64%
thread=128              12.00( 0.00)        12.00( 0.00)            0.00%

[chacha20 on simulated risc-v]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms
Time reduced by 24%

Genoa:

[hackbench pipe]
The default number of fd is 20, which exceeds the number of CPUs in an
LLC, so the fd is adjusted to 2, 4, 6, 8, 20 respectively. Excluding
results with large run-to-run variance, a 10% ~ 50% improvement is
observed when the system is underloaded:

[hackbench pipe]
================
case                    load          baseline(std%)  compare%( std%)
threads-pipe-2          1-groups      1.00 (  2.89)   +47.33 (  1.20)
threads-pipe-2          2-groups      1.00 (  3.88)   +39.82 (  0.61)
threads-pipe-2          4-groups      1.00 (  8.76)    +5.57 ( 13.10)
threads-pipe-20         1-groups      1.00 (  4.61)   +11.72 (  1.06)
threads-pipe-20         2-groups      1.00 (  6.18)   +14.55 (  1.47)
threads-pipe-20         4-groups      1.00 (  2.99)   +10.16 (  4.49)
threads-pipe-4          1-groups      1.00 (  4.23)   +43.70 (  2.14)
threads-pipe-4          2-groups      1.00 (  3.68)    +8.45 (  4.04)
threads-pipe-4          4-groups      1.00 ( 17.72)    +2.42 (  1.14)
threads-pipe-6          1-groups      1.00 (  3.10)    +7.74 (  3.83)
threads-pipe-6          2-groups      1.00 (  3.42)   +14.26 (  4.53)
threads-pipe-6          4-groups      1.00 ( 10.34)   +10.94 (  7.12)
threads-pipe-8          1-groups      1.00 (  4.21)    +9.06 (  4.43)
threads-pipe-8          2-groups      1.00 (  1.88)    +3.74 (  0.58)
threads-pipe-8          4-groups      1.00 (  2.78)   +23.96 (  1.18)

[chacha20 on simulated risc-v]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms
Time reduced by 48%

Suggested-by: Libo Chen
Suggested-by: Adam Li
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
Notes:
    v3->v4: Add test results into commit log.

 kernel/sched/sched.h    |  4 +++-
 kernel/sched/topology.c | 18 ++++++++++++++++--
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 511c85572b96..518c798231ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4035,9 +4035,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 #endif /* !CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_present);
+
 static inline bool sched_cache_enabled(void)
 {
-	return false;
+	return static_branch_unlikely(&sched_cache_present);
 }
 #endif
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 995a42cb4697..0b1fc1b0709d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -809,6 +809,7 @@ enum s_alloc {
 };
 
 #ifdef CONFIG_SCHED_CACHE
+DEFINE_STATIC_KEY_FALSE(sched_cache_present);
 
 static bool alloc_sd_llc(const struct cpumask *cpu_map, struct s_data *d)
 {
@@ -2674,6 +2675,7 @@ static int
 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 {
 	enum s_alloc alloc_state = sa_none;
+	bool has_multi_llcs = false;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq = NULL;
@@ -2784,10 +2786,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 				 * between LLCs and memory channels.
 				 */
 				nr_llcs = sd->span_weight / child->span_weight;
-				if (nr_llcs == 1)
+				if (nr_llcs == 1) {
 					imb = sd->span_weight >> 3;
-				else
+				} else {
 					imb = nr_llcs;
+					has_multi_llcs = true;
+				}
 				imb = max(1U, imb);
 				sd->imb_numa_nr = imb;
 
@@ -2842,6 +2846,16 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	ret = 0;
 error:
+#ifdef CONFIG_SCHED_CACHE
+	/*
+	 * TBD: check before writing to it. sched domain rebuild
+	 * is not in the critical path, leave as-is for now.
+	 */
+	if (!ret && has_multi_llcs)
+		static_branch_enable_cpuslocked(&sched_cache_present);
+	else
+		static_branch_disable_cpuslocked(&sched_cache_present);
+#endif
 	__free_domain_allocs(&d, alloc_state, cpu_map);
 
 	return ret;
-- 
2.32.0