Date: Mon, 29 Jun 2026 15:10:49 +0200
From: Andrea Righi
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
 K Prateek Nayak, Barry Song, "Rafael J. Wysocki", Len Brown,
 ricardo.neri@intel.com, linux-kernel@vger.kernel.org, Vincent Guittot
Subject: Re: [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity
References: <20260622-rneri-fix-cas-clusters-v5-0-19968f2d1497@linux.intel.com>
In-Reply-To: <20260622-rneri-fix-cas-clusters-v5-0-19968f2d1497@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Ricardo,

On Mon, Jun 22, 2026 at 05:05:50PM -0700, Ricardo Neri wrote:
> Hi,
>
> This is v5 of the patch series. The only changes are optimizations to check
> same-capacity cluster and SMT siblings only when needed. Please read the
> changelog for details.
>
> Cluster scheduling aims to maximize performance by spreading load across
> clusters of CPUs that share mid-level resources [2]. It works well on
> uniform systems, but it breaks down on topologies with big and small
> cores arranged in clusters. As a result, it fails on several generations
> of Intel processors already shipped and upcoming.
>
> Consider the topology below of big (B) cores and clusters of small (s)
> cores.
>
>  ------   ------
>  | B  |   | B  |   -----------------   -----------------
>  |    |   |    |   | s | s | s | s |   | s | s | s | s |
>  ------   ------   -----------------   -----------------
>  | L2 |   | L2 |   |      L2       |   |      L2       |
>  -------------------------------------------------------
>  |                         L3                          |
>  -------------------------------------------------------
>
> On a partially busy system (one with idle CPUs; busy CPUs have one task
> each), scheduling for asymmetric capacity ensures that misfit tasks land on
> the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
> When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
> be evenly spread among the small-CPU clusters. Today, this does not
> happen.
>
> Several issues in the load balancer prevent a small CPU in one cluster
> from pulling tasks from another:
>
>  a) update_sd_pick_busiest() may select a fully_busy group with higher
>     per-CPU capacity as the busiest, preventing a subsequent fully_busy
>     group of equal capacity from being correctly selected.
>  b) Misfit-load statistics are used to identify tasks that would benefit
>     from migrating to bigger CPUs. Accounting misfit load is pointless if
>     the destination CPU is equally small, and it also blocks balancing
>     between clusters.
>  c) Due to b), groups that are truly has_spare or fully_busy get
>     misclassified as misfit_task. update_sd_pick_busiest() then skips
>     them, since a small destination CPU cannot help with misfit tasks.
>  d) Once a busiest group has been identified, sched_balance_find_src_rq()
>     will refuse to migrate tasks to CPUs of equal capacity, even when
>     doing so is precisely what is required to balance small-CPU clusters.
>  e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
>     asymmetric capacity, preventing the balancer from equalizing load
>     across sibling small-core clusters.
>
> Together, these issues prevent cluster-level balancing on systems with
> asymmetric CPU capacity.
>
> This series addresses each problem and restores the intended behavior.
> Details, rationale, and code changes are explained in each patch.
>
> I tested these patches on Alder Lake, which has both SMT Pcores and
> clusters of Ecores. I tested with SMT both disabled and enabled. I also
> tested on Lunar Lake and Panther Lake, which have an Ecore cluster not
> connected to the L3 cache. I repeated the same experiment with
> CONFIG_SCHED_CLUSTER disabled. The load balancer behaves as expected.
>
> Christian also tested this patchset on a synthetic arm64 qemu topology and
> the expected behavior [3].
>
> Link: https://lore.kernel.org/all/20260509180955.1840064-1-arighi@nvidia.com/ [1]
> Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@gmail.com/ [2]
> Link: https://lore.kernel.org/all/e08492e0-d9f3-4574-8841-b633db008507@arm.com/ [3]

I looked at this series and did some tests on Vera Rubin (arm64). The
series does not directly target this topology, because the system has no
usable SD_CLUSTER domain. Nevertheless, there are changes to the common
fair load-balancing and SD_ASYM_CPUCAPACITY paths, so I tested it just to
check for any potential regression.

I used DCPerf MediaWiki and a CPU-intensive SGEMM workload based on NVPL.
The results showed no measurable performance differences, with all
observed variations falling within normal run-to-run variability.

Based on these results, the series looks good on this platform as well.

Tested-by: Andrea Righi

Thanks,
-Andrea

>
> Changes in v5:
>  - Added Tested-by tags from Christian. Thanks!
>  - Patch 1 (pre-work): Optimized logic to identify CPUs with busy SMT
>    siblings only when needed. (Prateek, Chen Yu)
>  - Patch 5: Optimized logic to check for architectural capacity only when
>    needed.
>  - Added Reviewed-by tag from Prateek. Thanks!
>  - Link to v4: https://lore.kernel.org/r/20260608-rneri-fix-cas-clusters-v4-0-1526711c944c@linux.intel.com
>
> Changes in v4:
>  - Patch 1 (pre-work): Fixed a bug that would block load balancing on SMT
>    cores with more than one busy sibling.
>  - Patch 2 (pre-work): Fixed a bug that would needlessly update
>    sg_overloaded.
>  - Patch 5: Reworked logic using a local variable for improved
>    readability.
>  - Added Reviewed-by tags from Chen Yu, Tim, and Vincent. Thanks!
>  - Link to v3: https://lore.kernel.org/r/20260514-rneri-fix-cas-clusters-v3-0-0037869554bd@linux.intel.com
>
> Changes in v3:
>  - Patch 3: Reverted the inverted runtime capacity check. The inverted
>    form resulted in migrations to CPUs of slightly lower capacity. Guarded
>    the check for architectural capacity with the sched_cluster_active
>    static key.
>  - Patch 4: Expanded the patch description to explain the behavior of
>    overloaded groups and low-capacity clusters with spare capacity.
>  - Added Reviewed-by tags from Christian. Thanks!
>  - Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@linux.intel.com
>
> Changes in v2:
>  - Patch 1: Rewrote patch description for clarity. Added a note
>    clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
>    exclusive. (Tim)
>  - Patch 2: Fixed a bug where the capacity check inadvertently broke
>    the mutual exclusion of the sched_reduced_capacity() path.
>    Keep marking the root domain as overloaded when misfit tasks are
>    present to allow bigger CPUs to help via newly idle balance. (sashiko)
>    Fixed the description to state that capacity_greater() looks for
>    differences of ~5% or more, not 20%. (Christian)
>  - Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
>    ignore runtime capacity variability. Inverted the capacity check.
>    (Christian)
>  - Patch 4: Reworded the patch description for clarity.
>  - Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@linux.intel.com/
>
> ---
> Ricardo Neri (6):
>       sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
>       sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
>       sched/fair: Check CPU capacity before comparing group types during load balance
>       sched/fair: Skip misfit load accounting when the destination CPU cannot help
>       sched/fair: Allow load balancing between CPUs of identical capacity
>       sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
>
>  include/linux/sched/sd_flags.h |  3 +-
>  kernel/sched/fair.c            | 66 ++++++++++++++++++++++++++++++------------
>  kernel/sched/topology.c        | 14 +++++++--
>  3 files changed, 62 insertions(+), 21 deletions(-)
> ---
> base-commit: 50436392fe2359ea108fd27308f86c8283be1622
> change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152
>
> Best regards,
> --
> Ricardo Neri
>
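
For anyone following the v2 changelog note that capacity_greater() looks
for differences of ~5% or more: below is a minimal user-space sketch of
that kind of margin test. It assumes the comparison scales the two
capacities by 1024 and 1078 (1078/1024 is roughly 1.05), which is how I
read kernel/sched/fair.c; please check the tree rather than trusting this.
The names capacity_greater_sketch() and capacity_margin_selfcheck() are
hypothetical and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space stand-in for a ~5% capacity margin test.
 * Returns true only when cap1 exceeds cap2 by more than 1078/1024,
 * i.e. by roughly 5%. The constants are an assumption, not taken
 * from this thread.
 */
static inline bool capacity_greater_sketch(unsigned long cap1,
					   unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Illustrative self-check around the boundary for a 1024-capacity CPU. */
static inline void capacity_margin_selfcheck(void)
{
	/* Identical capacities are never "greater". */
	assert(!capacity_greater_sketch(1024, 1024));
	/* 973 is within ~5% of 1024, so the difference is ignored. */
	assert(!capacity_greater_sketch(1024, 973));
	/* 972 is just past the margin, so 1024 counts as greater. */
	assert(capacity_greater_sketch(1024, 972));
}
```

With these constants, two small-core clusters of identical capacity never
compare as "greater" than each other, which is consistent with the cover
letter's point that balancing between equal-capacity CPUs needs its own
handling.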