From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 9 Feb 2026 16:43:20 +0100
From: Andrea Righi
To: Emil Tsalapatis
Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Christian Loehle, Daniel Hodges, sched-ext@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() semantics
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
MIME-Version: 1.0
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 09, 2026 at 10:00:40AM -0500, Emil Tsalapatis wrote:
> On Mon Feb 9, 2026 at 5:20 AM EST, Andrea Righi wrote:
> > On Sun, Feb 08, 2026 at 09:08:38PM +0100, Andrea Righi wrote:
> >> On Sun, Feb 08, 2026 at 12:59:36PM -0500, Emil Tsalapatis wrote:
> >> > On Sun Feb 8, 2026 at 8:55 AM EST, Andrea Righi wrote:
> >> > > On Sun, Feb 08, 2026 at 11:26:13AM +0100, Andrea Righi wrote:
> >> > >> On Sun, Feb 08, 2026 at 10:02:41AM +0100, Andrea Righi wrote:
> >> > >> ...
> >> > >> > > >> >  - From ops.select_cpu():
> >> > >> > > >> >    - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
> >> > >> > > >> >      the BPF scheduler entirely; they never enter BPF custody, so
> >> > >> > > >> >      ops.dequeue() is not called,
> >> > >> > > >> >    - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
> >> > >> > > >> >      bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
> >> > >> > > >> >      not called,
> >> > >> > > >> >    - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
> >> > >> > > >> >      enqueue/dequeue lifecycle tracking and state machine validation
> >> > >> > > >> >      (expects 1:1 enqueue/dequeue pairing).
> >> > >> > > >>
> >> > >> > > >> Could you add a note here about why there's no equivalent to scenario 6?
> >> > >> > > >> The differentiating factor between that and scenario 2 (non-terminal
> >> > >> > > >> queue) is that scx_dsq_insert_commit() is called regardless of whether
> >> > >> > > >> the queue is terminal. And this makes sense, since for non-DSQ queues
> >> > >> > > >> the BPF scheduler can do its own tracking of enqueue/dequeue (plus it
> >> > >> > > >> does not make too much sense to do BPF-internal enqueueing in
> >> > >> > > >> select_cpu).
> >> > >> > > >>
> >> > >> > > >> What do you think? If the above makes sense, maybe we should spell it
> >> > >> > > >> out in the documentation too. Maybe also add that it makes no sense to
> >> > >> > > >> enqueue in an internal BPF structure from select_cpu - the task is not
> >> > >> > > >> yet enqueued, and would have to go through enqueue anyway.
> >> > >> > > >
> >> > >> > > > Oh, I just didn't think about it, we can definitely add to
> >> > >> > > > ops.select_cpu() a scenario equivalent to scenario 6 (push task to the
> >> > >> > > > BPF queue).
> >> > >> > > >
> >> > >> > > > From a practical standpoint the benefits are questionable, but in the
> >> > >> > > > scope of the kselftest I think it makes sense to better validate the
> >> > >> > > > entire state machine in all cases. I'll add this scenario as well.
> >> > >> > >
> >> > >> > > That makes sense! Let's add it for completeness. Even if it doesn't
> >> > >> > > make sense right now, that may change in the future. For example, if we
> >> > >> > > end up finding a good reason to add the task into an internal structure
> >> > >> > > from .select_cpu(), we may allow the task to be explicitly marked as
> >> > >> > > being in the BPF scheduler's custody from a kfunc. Right now we can't
> >> > >> > > do that from select_cpu() unless we direct dispatch IIUC.
> >> > >> >
> >> > >> > Ok, I'll send a new patch later with the new scenario included. It
> >> > >> > should work already (if done properly in the test case); I think we
> >> > >> > don't need to change anything in the kernel.
> >> > >>
> >> > >> Actually, I take that back. The internal BPF queue from ops.select_cpu()
> >> > >> scenario is a bit tricky, because when we return from ops.select_cpu()
> >> > >> without p->scx.ddsp_dsq_id being set, we don't know if the scheduler
> >> > >> added the task to an internal BPF queue or simply did nothing.
> >> > >>
> >> > >> We need to add some special logic here, preferably without introducing
> >> > >> overhead just to handle this particular (really uncommon) case. I'll
> >> > >> take a look.
> >> > >
> >> > > The more I think about this, the more it feels wrong to consider a task
> >> > > as being "in BPF scheduler custody" if it is stored in a BPF internal
> >> > > data structure from ops.select_cpu().
> >> > >
> >> > > At the point where ops.select_cpu() runs, the task has not yet entered
> >> > > the BPF scheduler's queues. While it is technically possible to stash
> >> > > the task in some BPF-managed structure from there, doing so should not
> >> > > imply full scheduler custody.
> >> > >
> >> > > In particular, we should not trigger ops.dequeue(), because the task has
> >> > > not reached the "enqueue" stage of its lifecycle. ops.select_cpu() is
> >> > > effectively a pre-enqueue hook, primarily intended as a fast path to
> >> > > bypass the scheduler altogether. As such, triggering ops.dequeue() in
> >> > > this case would not make sense IMHO.
> >> > >
> >> > > I think it would make more sense to document this behavior explicitly
> >> > > and leave the kselftest as is.
> >> > >
> >> > > Thoughts?
> >> >
> >> > I am going back and forth on this, but I think the problem is that the
> >> > enqueue() and dequeue() BPF callbacks we have are not actually
> >> > symmetrical?
> >> >
> >> > 1) ops.enqueue() is "sched_ext-specific work for the scheduler core's
> >> > enqueue method". This is independent of whether the task ends up in BPF
> >> > custody or not. It could be in a terminal DSQ, a non-terminal DSQ, or a
> >> > BPF data structure.
> >> >
> >> > 2) ops.dequeue() is "remove task from BPF custody". E.g., it is used by
> >> > the BPF scheduler to signal whether it should keep a task within its
> >> > internal tracking structures.
> >> >
> >> > So the edge case of ops.select_cpu() placing the task in BPF custody is
> >> > currently valid. The way I see it, we have two choices in terms of
> >> > semantics:
> >> >
> >> > 1) ops.dequeue() must be the equivalent of ops.enqueue(). If the BPF
> >> > scheduler writer decides to place a task into BPF custody during
> >> > ops.select_cpu(), that's on them. ops.select_cpu() is supposed to be a
> >> > pure function providing a hint, anyway. Using it to place a task into
> >> > BPF is a bit of an abuse, even if allowed.
> >> >
> >> > 2) We interpret ops.dequeue() to mean "dequeue from the BPF scheduler".
> >> > In that case we allow the edge case and interpret ops.dequeue() as "the
> >> > function that must be called to clear the NEEDS_DEQ/IN_BPF flag", not as
> >> > the complement of ops.enqueue(). In most cases both will be true, and in
> >> > the cases where they aren't, it's up to the scheduler writer to
> >> > understand the nuance.
> >> >
> >> > I think while 2) is cleaner, it is more involved and honestly kinda
> >> > speculative. However, I think it's fair game, since once we settle on
> >> > the semantics it will be more difficult to change them. Which one do you
> >> > think makes more sense?
> >>
> >> Yeah, I'm also going back and forth on this.
> >>
> >> Honestly, from a purely theoretical perspective, option (1) feels cleaner
> >> to me: when ops.select_cpu() runs, the task has not entered the BPF
> >> scheduler yet. If we trigger ops.dequeue() in this case, we end up with
> >> tasks that are "leaving" the scheduler without ever having entered it,
> >> which feels like a violation of the lifecycle model.
> >>
> >> However, from a practical perspective, it's probably more convenient to
> >> trigger ops.dequeue() also for tasks that are stored in BPF data
> >> structures or user DSQs from ops.select_cpu(). If we don't allow that, we
> >> can't just silently ignore the behavior, and it's also pretty hard to
> >> reliably detect and trigger an error for this kind of "abuse" at runtime.
> >> That means it could easily turn into a source of subtle bugs in the
> >> future, and I don't think documentation alone would be sufficient to
> >> prevent that (the "don't do that" rules are always fragile).
> >>
> >> Therefore, at the moment I'm more inclined to go with option (2), as it
> >> provides better robustness and gives schedulers more flexibility.
> >
> > I'm running into a number of headaches and corner cases if we go with
> > option (2)... One of them is the following.
> >
> > Assume we push tasks into a BPF queue from ops.select_cpu() and pop them
> > from ops.dispatch(). The following scenario can happen:
> >
> >   CPU0                                   CPU1
> >   ----                                   ----
> >   ops.select_cpu()
> >     bpf_map_push_elem(&queue, &pid, 0)
> >                                          ops.dispatch()
> >                                            bpf_map_pop_elem(&queue, &pid)
> >                                            scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | dst_cpu)
> >                                            ==> ops.dequeue() is not triggered!
> >     p->scx.flags |= SCX_TASK_IN_BPF
> >
> > To fix this, we would need to always set SCX_TASK_IN_BPF before calling
> > ops.select_cpu(), and then clear it again if the task is directly
> > dispatched to a terminal DSQ from ops.select_cpu().
> >
> > However, doing so introduces further problems. In particular, we may end
> > up triggering spurious ops.dequeue() callbacks, which means we would then
> > need to distinguish whether a task entered BPF custody via
> > ops.select_cpu() or via ops.enqueue(), and handle the two cases
> > differently, which is also racy and leads to additional locking and
> > complexity.
> >
> > At that point, it starts to feel like we're over-complicating the design
> > to support a scenario that is both uncommon and of questionable practical
> > value.
> >
> > Given that, I'd suggest proceeding incrementally: for now, we go with
> > option (1), which looks doable without major changes and probably fixes
> > the ops.dequeue() semantics for the majority of use cases (which is
> > already a significant improvement over the current state). Once that is
> > in place, we can revisit the "store tasks in internal BPF data structures
> > from ops.select_cpu()" scenario and see if it's worth supporting it in a
> > cleaner way. WDYT?
>
> I agree with going with option 1.
>
> For the select_cpu() edge case, how about introducing an explicit kfunc
> scx_place_in_bpf_custody() later? Placing a task in BPF custody during
> select_cpu() is already pretty niche, so we can assume the scheduler
> writer knows what they're doing. In that case, let's let _them_ decide
> when in select_cpu() the task is considered "in BPF". They can also do
> their own locking to avoid races with locking on the task context. This
> keeps the state machine clean for the average scheduler while still
> handling the edge case. DYT that would work?

Yeah, I was also considering introducing dedicated kfuncs so that the BPF
scheduler can explicitly manage the "in BPF custody" state, decoupling the
notion of BPF custody from ops.enqueue().

With such an interface, a scheduler could do something like:

  ops.select_cpu()
  {
          s32 pid = p->pid;

          /* Explicitly mark the task as being in BPF custody */
          scx_bpf_enter_custody(p);
          if (!bpf_map_push_elem(&bpf_queue, &pid, 0)) {
                  set_task_state(TASK_ENQUEUED);
          } else {
                  /* Failed to stash the task: leave custody again */
                  scx_bpf_exit_custody(p);
                  set_task_state(TASK_NONE);
          }

          return prev_cpu;
  }

On the implementation side, entering / leaving BPF custody is essentially
setting / clearing SCX_TASK_IN_BPF, with the scheduler taking full
responsibility for ensuring the flag is managed consistently: you set the
flag => ops.dequeue() is called when the task leaves custody; you clear
the flag => we fall back to the default custody behavior.

But I think this is something to explore in the future; for now I'd go
with the easier way first. :)

Thanks,
-Andrea
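
P.S. For completeness, the dispatch-side counterpart of the sketch above
could look something like this (again just a sketch, reusing the same
bpf_queue map and the hypothetical scx_bpf_exit_custody() kfunc from the
example; untested):

  ops.dispatch()
  {
          struct task_struct *p;
          s32 pid;

          /* Nothing stashed from ops.select_cpu() */
          if (bpf_map_pop_elem(&bpf_queue, &pid))
                  return;

          p = bpf_task_from_pid(pid);
          if (!p)
                  return;

          /*
           * Inserting into a terminal DSQ ends BPF custody: the
           * scheduler clears the custody state explicitly (via the
           * hypothetical kfunc), so no spurious ops.dequeue() is
           * triggered.
           */
          scx_bpf_exit_custody(p);
          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
          bpf_task_release(p);
  }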