From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 18 Dec 2024 11:21:30 +0100
From: Andrea Righi
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Yury Norov, Ingo Molnar, Peter Zijlstra,
 Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
Message-ID:
References: <20241217094156.577262-1-arighi@nvidia.com>
 <20241217094156.577262-4-arighi@nvidia.com>
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Hi Tejun,

On Tue, Dec 17, 2024 at 01:22:26PM -1000, Tejun Heo wrote:
> On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> > +static int validate_node(int node)
> > +{
> > +	/* If no node is specified, return the current one */
> > +	if (node == NUMA_NO_NODE)
> > +		return numa_node_id();
> > +
> > +	/* Make sure node is in the range of possible nodes */
> > +	if (node < 0 || node >= num_possible_nodes())
> > +		return -EINVAL;
>
> Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
> nr_node_ids`? Also, should probably add node_possible(node)?

Or even better, add node_online(node): an offline NUMA node shouldn't be
used in this context.

> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat
> > + * cpumask from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
>
> As the masks are allocated separately anyway, the aligned attribute can
> be dropped. There's no reason to align the index array.

Right.

> > +static struct cpumask *get_idle_mask_node(int node, bool smt)
> > +{
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> > +
> > +	node = validate_node(node);
>
> It's odd to validate input node in an internal function.
> If node is being passed from BPF side, we should validate it and
> trigger scx_ops_error() if invalid, but once the node number is inside
> the kernel, we should be able to trust it.

Makes sense, I'll move the validation into the kfuncs and trigger
scx_ops_error() if the validation fails.

> > +static struct cpumask *get_idle_cpumask_node(int node)
> > +{
> > +	return get_idle_mask_node(node, false);
>
> Maybe make the inner function return `struct idle_cpumasks *` so that
> the caller can pick between cpu and smt?

Ok.

> > +static void idle_masks_init(void)
> > +{
> > +	int node;
> > +
> > +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
>
> We probably want to use a variable name which is more qualified for a
> global variable - scx_idle_masks?

Ok.

> > @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> >
> >  static bool test_and_clear_cpu_idle(int cpu)
> >  {
> > +	int node = cpu_to_node(cpu);
> > +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
>
> Can we use plurals for cpumask variables - idle_cpus here?

Ok.

> > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
>
> Do we need "from_node"?
>
> >  {
> >  	int cpu;
> >
> >  retry:
> >  	if (sched_smt_active()) {
> > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > +		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
>
> This too, would s/get_idle_smtmask_node(node)/idle_smtmask(node)/ work?
> There are no node-unaware counterparts to these functions, right?

Correct, we can just get rid of the _from_node() part.
> > +static s32
> > +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > +	nodemask_t hop_nodes = NODE_MASK_NONE;
> > +	int start_node = cpu_to_node(prev_cpu);
> > +	s32 cpu = -EBUSY;
> > +
> > +	/*
> > +	 * Traverse all online nodes in order of increasing distance,
> > +	 * starting from prev_cpu's node.
> > +	 */
> > +	rcu_read_lock();
>
> Is rcu_read_lock() necessary? Does lockdep warn if the explicit
> rcu_read_lock() is dropped?

Good point, the other for_each_numa_hop_mask() iterator requires it, but
only to access the cpumasks via rcu_dereference(). Since we are iterating
node IDs, I think we can get rid of rcu_read_lock/unlock() here. I'll
double-check whether lockdep complains without it.

> > @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >
> >  static void reset_idle_masks(void)
> >  {
> > +	int node;
> > +
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > +		cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> > +		cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> > +		return;
> > +	}
> > +
> >  	/*
> >  	 * Consider all online cpus idle. Should converge to the actual state
> >  	 * quickly.
> >  	 */
> > -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> > +	for_each_node_state(node, N_POSSIBLE) {
> > +		const struct cpumask *node_mask = cpumask_of_node(node);
> > +		struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
> > +
> > +		cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > +		cpumask_copy(idle_smt, idle_cpu);
>
> Can you do the same cpumask_and() here? I don't think it'll cause
> practical problems, but idle_cpus can be updated in between and e.g. we
> can end up with idle_smts that have different idle states between
> siblings.
Makes sense, the state should still converge to the right one in any
case, but I agree that it's more accurate to use cpumask_and() for
idle_smt as well. Will change that.

> >  /**
> >   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> > - * per-CPU cpumask.
> > + * per-CPU cpumask of the current NUMA node.
>
> This is a bit misleading as it can be system-wide too.
>
> It's a bit confusing for scx_bpf_get_idle_cpu/smtmask() to return a
> per-node mask while scx_bpf_pick_idle_cpu() and friends are not scoped
> to the node. Also, scx_bpf_pick_idle_cpu() picking the local node as the
> origin probably doesn't make sense for most use cases as it's usually
> called from ops.select_cpu() and the waker won't necessarily run on the
> same node as the wakee.
>
> Maybe disallow scx_bpf_get_idle_cpu/smtmask() if idle_per_node is
> enabled and add scx_bpf_get_idle_cpu/smtmask_node()? Ditto for
> scx_bpf_pick_idle_cpu(), and we can add a PICK_IDLE flag to
> allow/inhibit CPUs outside the specified node.

Yeah, I also don't like much the idea of implicitly using the current
node when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

I think it's totally reasonable to disallow the system-wide
scx_bpf_get_idle_cpu/smtmask() when the flag is enabled. Ultimately, it's
the scheduler's responsibility to enable or disable this feature, and if
it's enabled, the scheduler is expected to implement NUMA-aware logic.

I'm also fine with adding SCX_PICK_IDLE_NODE (or similar) to restrict the
search for an idle CPU to the specified node.

Thanks!
-Andrea