From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 31 Jan 2026 16:45:59 +0000
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20260126084258.3798129-1-arighi@nvidia.com>
 <20260126084258.3798129-2-arighi@nvidia.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
From: Kuba Piecuch
To: Andrea Righi, Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges,
 Emil Tsalapatis
Content-Type: text/plain; charset="UTF-8"

Hi Andrea,

On Sat Jan 31, 2026 at 6:54 AM UTC, Andrea Righi wrote:
>> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> >> for a task at most once between it being dispatched and taken off the CPU,
>> >> even if its properties are changed multiple times while it's on CPU.
>> >> Is that intentional? I don't see it documented.
>> >>
>> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
>> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> >>
>> >> When a property of p is changed while it runs on the CPU,
>> >> the sequence of calls is:
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>> >> (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>> >> set_next_task_scx(p).
>> >>
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> >>
>> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> >> dequeue_task_scx().
>> >>
>> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> >>
>> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> >> this is not a core-sched pick, but it won't do much because the ops_state is
>> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> >> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
>> >> inserted into.
>> >>
>> >> So, we end up in a state where any subsequent property change while the task is
>> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> >> unset in p->scx.flags.
>> >>
>> >> I really hope I didn't mess anything up when tracing the code, but of course
>> >> I'm happy to be corrected.
>> >
>> > Correct. And the enqueue/dequeue balancing is preserved here. In the
>> > scenario you describe, subsequent property changes while the task remains
>> > running go through ENQUEUE_RESTORE, which intentionally skips
>> > ops.enqueue(). Since no new enqueue cycle is started, there is no
>> > corresponding ops.dequeue() to deliver either.
>> >
>> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
>> > scheduler state established by the last ops.enqueue(), not with every
>> > individual property change. Multiple property changes while the task stays
>> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
>>
>> Ok, I think I understand the logic behind this; here's my understanding:
>>
>> The BPF scheduler is naturally going to have some internal per-task state.
>> That state may be expensive to compute from scratch, so we don't want to
>> completely discard it when the BPF scheduler loses ownership of the task.
>>
>> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
>> "Hey, some scheduling properties of the task are about to change, so you
>> probably should invalidate whatever state you have for that task which
>> depends on these properties."
>
> Correct. And it's also a way to notify that the task has left the BPF
> scheduler, so if the task is stored in any internal queue it can/should be
> removed.

Right, unless the task has already been dispatched, in which case it's just
an invalidation notification.
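To make this concrete, here's roughly the dequeue pattern I have in mind on
the BPF side. Untested sketch: lookup_task_ctx(), the task_ctx fields and
my_remove_from_queue() are made-up scheduler-internal names, and
SCX_DEQ_SCHED_CHANGE is the flag added by this series.

void BPF_STRUCT_OPS(my_dequeue, struct task_struct *p, u64 deq_flags)
{
	/* made-up helper returning this scheduler's per-task state */
	struct task_ctx *tctx = lookup_task_ctx(p);

	if (!tctx)
		return;

	if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
		/*
		 * Scheduling properties are about to change: drop any
		 * cached state derived from them.
		 */
		tctx->state_valid = false;
	}

	/*
	 * Only a task still sitting in one of our internal queues needs
	 * to be physically removed; if it was already dispatched, this
	 * call is purely an invalidation notification.
	 */
	if (tctx->queued)
		my_remove_from_queue(p, tctx);
}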
>
>>
>> That way, the BPF scheduler will know to recompute the invalidated state on
>> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
>> BPF scheduler knows that none of the task's fundamental scheduling properties
>> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
>> the state. Of course, the potential for savings depends on the particular
>> scheduler's policy.
>>
>> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
>> a task is running: for subsequent calls, the BPF scheduler had already been
>> notified to invalidate its state, so there's no use in notifying it again.
>
> Actually I think the proper behavior would be to trigger
> ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> scheduler. While running, tasks are outside the BPF scheduler's ownership,
> so ops.dequeue() shouldn't be triggered at all.
>

I don't think this is what the current implementation does, right?

>>
>> However, I feel like there's a hidden assumption here that the BPF scheduler
>> doesn't recompute its state for the task before the next ops.enqueue().
>
> And that should be the proper behavior. The BPF scheduler should recompute a
> task's state only when the task is re-enqueued after a property change.
>

That would make sense if ops.enqueue() were called immediately after a
property change when a task is running, but I believe that's currently not
the case; see my attempt at tracing the enqueue-dequeue cycle on property
change in my first reply.

>> What if the scheduler wanted to immediately react to the priority of a task
>> being decreased by preempting it? You might say "hook into
>> ops.set_weight()", but then doesn't that obviate the need for
>> ops.dequeue(SCHED_CHANGE)?
>
> If a scheduler wants to implement preemption on property change, it can do
> so in ops.enqueue(): after a property change, the task is re-enqueued,
> triggering ops.enqueue(), at which point the BPF scheduler can decide
> whether and how to preempt currently running tasks.
>
> If a property change does not result in an ops.enqueue() call, it means the
> task is not runnable yet (or does not intend to run), so attempting to
> trigger a preemption at that point would be pointless.
>

IIUC a dequeue-enqueue cycle on a running task during a property change
doesn't result in a call to ops.enqueue(), so if the BPF scheduler recomputed
its state only in ops.enqueue(), then it wouldn't be able to react
immediately.
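Concretely, by "react immediately" I mean something like the sketch below in
ops.set_weight(). Untested again, and lookup_task_ctx() plus the tctx->weight
field are made-up names; scx_bpf_task_running(), scx_bpf_task_cpu() and
scx_bpf_kick_cpu() are the existing kfuncs.

void BPF_STRUCT_OPS(my_set_weight, struct task_struct *p, u32 weight)
{
	/* made-up helper returning this scheduler's per-task state */
	struct task_ctx *tctx = lookup_task_ctx(p);

	if (!tctx)
		return;

	/*
	 * If the priority of a currently-running task just dropped,
	 * preempt it right away by kicking its CPU instead of waiting
	 * for a future ops.enqueue() that may be a long time coming.
	 */
	if (weight < tctx->weight && scx_bpf_task_running(p))
		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_PREEMPT);

	tctx->weight = weight;
}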
>>
>> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
>> changes that happen under ``scoped_guard (sched_change, ...)`` which don't
>> have a dedicated ops callback, but I wasn't able to find any such properties
>> which would be relevant to SCX.
>>
>> Another thought on the design: currently, the exact meaning of
>> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
>> scheduler:
>>
>> * When it's owned, it combines two notifications: that the BPF scheduler is
>> losing ownership AND that it should invalidate its task state.
>> * When it's not owned, it only serves as an "invalidate" notification;
>> the ownership status doesn't change.
>
> When it's not owned I think ops.dequeue() shouldn't be triggered at all.
>
>>
>> Wouldn't it be more elegant to have another callback, say
>> ops.property_change(), which would only serve as the "invalidate"
>> notification, and leave ops.dequeue() only for tracking ownership?
>> That would mean calling ops.dequeue() followed by ops.property_change() when
>> changing properties of a task owned by the BPF scheduler, as opposed to a
>> single call to ops.dequeue(SCHED_CHANGE).
>
> We could provide an ops.property_change(), but honestly I don't see any
> practical use for this callback.
>

Neither do I, I just made it up for the sake of argument :-)

Thanks,
Kuba