Date: Fri, 3 Jul 2026 16:38:13 +0200
From: Andrea Righi
To: Julia Lawall
Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, Ricardo Neri, Christian Loehle,
 Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
 linux-kernel@vger.kernel.org, jean-pierre.lozi@inria.fr
Subject: Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
References: <20260630152747.128746-1-arighi@nvidia.com>
 <23b0e0d3-1bce-834c-6d91-9b56a10202c@inria.fr>
In-Reply-To: <23b0e0d3-1bce-834c-6d91-9b56a10202c@inria.fr>
Content-Type: text/plain; charset=us-ascii
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Julia,

On Fri, Jul 03, 2026 at 07:20:38AM -0400, Julia Lawall wrote:
> On Fri, 3 Jul 2026, Andrea Righi wrote:
>
> > Hi Prateek,
> >
> > On Fri, Jul 03, 2026 at 11:21:57AM +0530, K Prateek Nayak wrote:
> > > Hello Andrea,
> > >
> > > On 6/30/2026 8:57 PM, Andrea Righi wrote:
> > > > select_idle_capacity() scans all logical CPUs also when it is looking
> > > > for a fully idle SMT core. Two concurrent wakeups can therefore observe
> > > > the same core as idle, encounter different siblings first, and place one
> > > > task on each sibling while another core remains unused.
> > > >
> > > > Make every logical CPU of a selected idle core resolve to the same
> > > > stable CPU representative within the scan's existing affinity and
> > > > scheduling-domain mask. If the first task is enqueued before the next
> > > > scan examines the core, that scan rejects the now-busy core. If both
> > > > scans observe the core as idle, they select the same runqueue even if
> > > > the first enqueue becomes visible before the second scan finishes,
> > > > exposing the imbalance to the load balancer.
> > > >
> > > > The symmetric CPU idle selection path is subject to the same race, but
> > > > normally returns as soon as select_idle_core() finds a fully idle core,
> > > > reducing the conflict window. The per-CPU capacity scan can retain an
> > > > idle-core candidate while evaluating other CPUs, giving concurrent
> > > > wakeups more opportunity to select different siblings of the same SMT
> > > > core. Therefore, limit the normalization to the asym-capacity path,
> > > > where this behavior has a measurable impact.
> > > >
> > > > On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
> > > > CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
> > > > core) showed a consistent 23% increase in mean throughput across
> > > > multiple runs.
> > >
> > > Interesting! This reads like active balance across cores is not aggressive
> > > enough for this workload and, as a result, stacking somehow helps.
> > >
> > > I would have expected balance within the core would trigger first and that
> > > would just lead to the same scenario as both siblings busy but I
> > > guess there is a higher order effect of stacking.
> >
> > I think the key here is that temporary runqueue stacking is preferable to
> > consuming both SMT siblings when fully-idle SMT cores are available, more than
> > having benefits from the stacking itself.
>
> Andrea, did you try changing the clock speed? With ticks every 4ms and an
> EEVDF time slice that rounds up to 4ms, task_hot makes it almost
> impossible for already-idle CPUs to pull tasks.
>
> julia

Oh, I remember you mentioned this. However, the kernel I'm using has
CONFIG_HZ_1000=y, so the scheduler tick is 1 ms rather than 4 ms. I also
tried a few different migration_cost_ns settings, but didn't get much
benefit from that.

I think I have a lead, and the improvement observed with this patch may not
be a scheduler/load-balancing effect at all. In practice, I see different
performance on sibling 0 vs sibling 1: sibling 0 appears to be faster,
despite the firmware advertising identical capacity.

So I think my patch is helping mostly due to the fact that I'm using
cpumask_first_and(), which resolves each core to its lowest-numbered allowed
sibling, rather than due to the aggressive SMT avoidance.

I'm trying to get more details from the hw/firmware. Will keep you updated.

Thanks,
-Andrea
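
P.S. For readers following the thread, here is a minimal userspace sketch of
why resolving every sibling of a core through a "first CPU in the
intersection" lookup ends up favouring sibling 0. It is not the patch itself
and not kernel code: toy_mask, mask_first_and(), smt_siblings() and
core_representative() are hypothetical names, masks are a single 64-bit word,
and the 2-way SMT layout with siblings numbered 2k and 2k+1 is an assumption
made only for this illustration. The empty-intersection case returns -1 here,
whereas the kernel's cpumask_first_and() returns a value >= nr_cpu_ids.

```c
#include <assert.h>
#include <stdint.h>

/* Toy CPU mask: bit n set means CPU n is in the mask (at most 64 CPUs). */
typedef uint64_t toy_mask;

/*
 * Lowest-numbered CPU present in both masks, or -1 if they do not
 * intersect. This imitates the "first CPU in a & b" behaviour only.
 */
static int mask_first_and(toy_mask a, toy_mask b)
{
	toy_mask m = a & b;

	for (int cpu = 0; cpu < 64; cpu++)
		if (m & (1ULL << cpu))
			return cpu;
	return -1;
}

/*
 * ASSUMPTION: 2-way SMT where the siblings of a core are CPUs 2k and
 * 2k+1. Real topologies may number siblings differently.
 */
static toy_mask smt_siblings(int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return 3ULL << (cpu & ~1);
}

/*
 * Map any sibling of a core to one stable representative inside the
 * allowed mask. Two scans that hit different siblings of the same idle
 * core get the same answer, and that answer is the lowest-numbered
 * allowed sibling.
 */
static int core_representative(int cpu, toy_mask allowed)
{
	return mask_first_and(smt_siblings(cpu), allowed);
}
```

With every CPU allowed, core_representative(4, ~0ULL) and
core_representative(5, ~0ULL) both pick CPU 4, so in this model tasks land on
sibling 0 whenever affinity permits. If sibling 0 is genuinely faster on the
hardware, that bias alone could account for part of a throughput difference.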