Date: Tue, 10 Dec 2024 01:14:43 +0100
From: Andrea Righi
To: Yury Norov
Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
References: <20241209104632.718085-1-arighi@nvidia.com>
 <20241209104632.718085-2-arighi@nvidia.com>

On Mon, Dec 09, 2024 at 11:32:56AM -0800, Yury Norov wrote:

...
> > +static struct cpumask *get_idle_cpumask_node(int node)
> > +{
> > +	return idle_masks[node]->cpu;
> > +}
> > +
> > +static struct cpumask *get_idle_smtmask_node(int node)
> > +{
> > +	return idle_masks[node]->smt;
> > +}
>
> This implies that NUMA_NO_NODE, which is -1, will never be passed.
> Is that indeed impossible?

In PATCH 4/4 I added some checks to make sure "node" is valid and
return NULL otherwise, but thinking more about it, it seems better to
return cpu_none_mask.
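For example, something along these lines (an untested sketch, with the
same check applied to the SMT variant):

	static struct cpumask *get_idle_cpumask_node(int node)
	{
		/* An invalid node (e.g. NUMA_NO_NODE) yields an empty mask */
		if (node < 0 || node >= nr_node_ids)
			return cpu_none_mask;

		return idle_masks[node]->cpu;
	}

That way callers wouldn't need to special-case NUMA_NO_NODE at all.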
>
> > +
> > +static struct cpumask *get_curr_idle_cpumask(void)
> > +{
> > +	int node = cpu_to_node(smp_processor_id());
> > +
> > +	return get_idle_cpumask_node(node);
> > +}
> > +
> > +static struct cpumask *get_curr_idle_smtmask(void)
> > +{
> > +	int node = cpu_to_node(smp_processor_id());
> > +
> > +	if (sched_smt_active())
> > +		return get_idle_smtmask_node(node);
> > +	else
> > +		return get_idle_cpumask_node(node);
> > +}
> > +
> > +static void idle_masks_init(void)
> > +{
> > +	int node;
> > +
> > +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> > +	BUG_ON(!idle_masks);
> > +
> > +	for_each_node_state(node, N_POSSIBLE) {
>
> 	for_each_node(node)

Ok.

> > +		idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> > +		BUG_ON(!idle_masks[node]);
> > +
> > +		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> > +		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> > +	}
> > +}
> > +#else	/* !CONFIG_SMP */
> > +static struct cpumask *get_curr_idle_cpumask(void)
> > +{
> > +	return cpu_none_mask;
> > +}
> > +
> > +static struct cpumask *get_curr_idle_smtmask(void)
> > +{
> > +	return cpu_none_mask;
> > +}
> >  #endif	/* CONFIG_SMP */
> >
> >  /* for %SCX_KICK_WAIT */
> > @@ -3166,6 +3218,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> >
> >  static bool test_and_clear_cpu_idle(int cpu)
> >  {
> > +	int node = cpu_to_node(cpu);
> > +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > +
> >  #ifdef CONFIG_SCHED_SMT
> >  	/*
> >  	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> > @@ -3174,29 +3229,34 @@ static bool test_and_clear_cpu_idle(int cpu)
> >  	 */
> >  	if (sched_smt_active()) {
> >  		const struct cpumask *smt = cpu_smt_mask(cpu);
> > +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
> >
> >  		/*
> >  		 * If offline, @cpu is not its own sibling and
> >  		 * scx_pick_idle_cpu() can get caught in an infinite loop as
> > -		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > -		 * is eventually cleared.
> > +		 * @cpu is never cleared from the idle SMT mask. Ensure that
> > +		 * @cpu is eventually cleared.
> > +		 *
> > +		 * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
> > +		 * reduce memory writes, which may help alleviate cache
> > +		 * coherence pressure.
> >  		 */
> > -		if (cpumask_intersects(smt, idle_masks.smt))
> > -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > -		else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > -			__cpumask_clear_cpu(cpu, idle_masks.smt);
> > +		if (cpumask_intersects(smt, idle_smt))
> > +			cpumask_andnot(idle_smt, idle_smt, smt);
> > +		else if (cpumask_test_cpu(cpu, idle_smt))
> > +			__cpumask_clear_cpu(cpu, idle_smt);
> >  	}
> >  #endif
> > -	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> > +	return cpumask_test_and_clear_cpu(cpu, idle_cpu);
> >  }
> >
> > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> >  {
> >  	int cpu;
> >
> >  retry:
> >  	if (sched_smt_active()) {
> > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > +		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
> >  		if (cpu < nr_cpu_ids)
> >  			goto found;
> >
> > @@ -3204,15 +3264,66 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> >  			return -EBUSY;
> >  	}
> >
> > -	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> > -	if (cpu >= nr_cpu_ids)
> > -		return -EBUSY;
> > +	cpu = cpumask_any_and_distribute(get_idle_cpumask_node(node), cpus_allowed);
> > +	if (cpu < nr_cpu_ids)
> > +		goto found;
> > +
> > +	return -EBUSY;
>
> Why did you change the error-handling logic? Can you keep the
> original?

I changed that to follow the same pattern as the one inside
sched_smt_active(); it should be equivalent, but I can keep the
original.

> >
> >  found:
> >  	if (test_and_clear_cpu_idle(cpu))
> >  		return cpu;
> >  	else
> >  		goto retry;
> > +
> > +}
> > +
> > +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > +	const struct cpumask *node_mask;
> > +	s32 cpu;
> > +
> > +	/*
> > +	 * Only node 0 is used if per-node idle cpumasks are disabled.
> > +	 */
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> > +
> > +	/*
> > +	 * Traverse all nodes in order of increasing distance, starting from
> > +	 * prev_cpu's node.
> > +	 */
> > +	rcu_read_lock();
> > +	for_each_numa_hop_mask(node_mask, cpu_to_node(prev_cpu)) {
>
> This 'node_mask' is misleading. It is not a nodemask, it is a cpumask -
> all the CPUs in the hop. Can you name it 'hop_cpus', or similar?

Ack.

> > +		/*
> > +		 * scx_pick_idle_cpu_from_node() can be expensive and redundant
> > +		 * if none of the CPUs in the NUMA node can be used (according
> > +		 * to cpus_allowed).
> > +		 *
> > +		 * Therefore, check if the NUMA node is usable in advance to
> > +		 * save some CPU cycles.
> > +		 */
> > +		if (!cpumask_intersects(node_mask, cpus_allowed))
> > +			continue;
> > +
> > +		/*
> > +		 * It would be nice to have a "node" iterator, instead of the
> > +		 * cpumask, to get rid of the cpumask_first() to determine the
> > +		 * node.
> > +		 */
>
> I'm not aware of such a case. And this one doesn't look like that,
> because you filter against a cpumask (cpus_allowed). Right?

I'm interested in getting the node id here, to use it with
scx_pick_idle_cpu_from_node().

> > +		cpu = cpumask_first(node_mask);
> > +		if (cpu >= nr_cpu_ids)
> > +			continue;
>
> for_each_numa_hop_mask() doesn't traverse per-node CPUs. It traverses
> per-hop CPUs. The difference is that the hop may include more than one
> node, if the distances from a given node to those belonging to the hop
> are the same.

Hm... I was missing this aspect. Thanks for clarifying.

>
> So, imagine we have nodes 1 and 2 in the same hop, but cpus_allowed
> has only CPUs from node 2. With this configuration you pass the
> cpumask_intersects() check, but when picking a CPU you ignore
> cpus_allowed and pick node 1. This is wrong, I guess.

Yeah, that's not what I would like to do.

>
> Instead, I would make a single check:
>
> 	cpu = cpumask_first_and(node_mask, cpus_allowed);
> 	if (cpu >= nr_cpu_ids)
> 		continue;

Ok, but at this point I'm not sure that for_each_numa_hop_mask() is the
most efficient way to iterate NUMA nodes.
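A plain node iterator ordered by distance might be closer to what I
have in mind. Something like this rough, untested sketch (where
next_closest_node() is just a made-up helper name, and the repeated
linear scan is O(N^2) in the number of nodes, so it only makes sense if
nr_node_ids is small):

	/*
	 * Hypothetical helper: return the online node closest to @from
	 * that hasn't been visited yet, marking it as visited, or
	 * NUMA_NO_NODE when all nodes have been consumed.
	 */
	static int next_closest_node(int from, nodemask_t *visited)
	{
		int n, best_node = NUMA_NO_NODE, best_dist = INT_MAX;

		for_each_node_state(n, N_ONLINE) {
			if (node_isset(n, *visited))
				continue;
			if (node_distance(from, n) < best_dist) {
				best_dist = node_distance(from, n);
				best_node = n;
			}
		}
		if (best_node != NUMA_NO_NODE)
			node_set(best_node, *visited);

		return best_node;
	}

That would give per-node traversal in order of increasing distance,
feeding scx_pick_idle_cpu_from_node() directly without going through
the per-hop cpumasks at all.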
>
> > +
> > +		cpu = scx_pick_idle_cpu_from_node(cpu_to_node(cpu), cpus_allowed, flags);
> > +		if (cpu >= 0)
> > +			goto out_unlock;
>
> The code in scx_pick_idle_cpu_from_node() is written such that the
> CPUs it picks are well distributed across the nodes. Your code above
> always picks the 1st node in the hop. I think here you should use
>
> 	cpumask_any_and_distribute()
>
> like scx_pick_idle_cpu_from_node() does with idle_masks.

Right, that's because I was (incorrectly) assuming that each hop was
a single node.

>
> Another problem is that higher-level hops include CPUs from lower-level
> ones. It means that you will traverse the same lower-level CPUs again
> for nothing. So, you need to mask them out.
>
> Yet another problem: if your hop has more than one node, you should
> not jump to the next hop until you have tried every node in the current
> hop. This brings another loop, but doesn't increase complexity if you
> carefully mask out all visited nodes.
>
> > +	}
> > +	cpu = -EBUSY;
>
> This hides the actual error returned from scx_pick_idle_cpu_from_node().

Right, scx_pick_idle_cpu_from_node() can only return -EBUSY at the
moment, but it'd be nicer to pass along the returned error.

>
> > +
> > +out_unlock:
> > +	rcu_read_unlock();
> > +	return cpu;
> >  }
>
> And altogether this should look like:
>
> int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus, struct cpumask *cpus_allowed)
> {
> 	int node, cpu, random_cpu;
>
> 	do {
> 		/* Pick a 'random' CPU in the hop */
> 		random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed);
> 		if (random_cpu >= nr_cpu_ids)
> 			continue;
>
> 		node = cpu_to_node(random_cpu);
>
> 		/* Find an idle CPU in the same node */
> 		cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> 		if (cpu >= 0)
> 			break;
>
> 		/* No luck? Try other nodes */
> 	} while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node)));
>
> 	return cpu;
> }
>
> static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> {
> 	const struct cpumask *next, *prev = cpu_none_mask;
> 	int prev_node = cpu_to_node(prev_cpu);
> 	...
> 	for_each_numa_hop_mask(next, prev_node) {
> 		cpumask_andnot(hop_cpus, next, prev);
> 		cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed);
> 		prev = next;
> 	}
> 	...
> }
>
> Not tested, but should work.

Makes sense to me, I'll do some testing with this.

> > @@ -3636,17 +3748,27 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >
> >  static void reset_idle_masks(void)
> >  {
> > +	int node;
> > +
> >  	/*
> >  	 * Consider all online cpus idle. Should converge to the actual state
> >  	 * quickly.
> >  	 */
> > -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> > +	for_each_node_state(node, N_POSSIBLE) {
>
> 	for_each_node()

Ok.

And I think I haven't missed any comments this time. Sorry about that,
and thanks again for the review!

-Andrea