From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 88D91258CDC; Tue, 30 Sep 2025 23:49:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759276154; cv=none; b=dmSDB+AZr5BKq76bNSOIMlbIvyQKfJHJtSVwEOgeJwR30lGAA1KGN0GUHmQMUMoR7FESpgOBFkWHerGinLb3ovuPyXYzxQUpIsaSJKO/38F71BbFrCYG49rjUY/an0YR9jPVMI7Saxv7PVz2JKCJslTbzu0qgCH6JklnrTPDilE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1759276154; c=relaxed/simple; bh=7VAol8rQRsLQd4lbBp1qCT4rhceiGs6eyik3FbiRdyE=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=PwaOjit81y+Kv0q7KgiEn+LDQAUv6r/8PdF/eKoqpw0CbV31q82ACjTZXT4V7OAGejK0WsbnL8gcu1WLQix5i/CqqttsV/av38vNT0Tm3n1zwzLvjAnlOlE13QHCvQYhJkiL0fgqKM5jFwz0jxDNq5bDe7dS4MMxSsady+Zn+24= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=lmd0Svoj; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="lmd0Svoj" Received: by smtp.kernel.org (Postfix) with ESMTPSA id DADFDC4CEF0; Tue, 30 Sep 2025 23:49:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1759276154; bh=7VAol8rQRsLQd4lbBp1qCT4rhceiGs6eyik3FbiRdyE=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=lmd0Svoj7Pee3uBFZtW8KHm67+M5ATQAdOhX0uWJxMXalU3c4ApycHGnnXrbdJwhm Cre0ZgRZLQuY61U5uikz9ccsPJ5/ymErxRiYVHWLQrC/+jvPgMHcZl993sKS8h4d25 pkfUMe9JWIRH+2l07dj22igHfNotMqwOiJAkh/g+WF54jCaEoQqEKPl0bSiE2ust4q Gh2uPzbFu4ggKiHDDyAoQQTX91PiZvjIWZcwKrvsnUIh7O6X/5x3BIyBI6eQSasfIb tGMwsxdOLMghyRBFaHxplrg6l5udPhZr/BM9LdpWdvm4+QzgyksmPP9vH2o9IFtPO1 MNWaSBJ5SmUlg== Date: Tue, 30 Sep 2025 13:49:12 -1000 From: Tejun Heo To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, longman@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com, void@manifault.com, arighi@nvidia.com, changwoo@igalia.com, cgroups@vger.kernel.org, sched-ext@lists.linux.dev, liuwenfang@honor.com, tglx@linutronix.de Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Message-ID: References: <20250912115459.GZ3289052@noisy.programming.kicks-ass.net> <20250915083815.GB3289052@noisy.programming.kicks-ass.net> <20250925083533.GW4067720@noisy.programming.kicks-ass.net> <20250926103628.GE4067720@noisy.programming.kicks-ass.net> <20250929100658.GC3245006@noisy.programming.kicks-ass.net> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250929100658.GC3245006@noisy.programming.kicks-ass.net> Hello, Peter. On Mon, Sep 29, 2025 at 12:06:58PM +0200, Peter Zijlstra wrote: ... > > However, thinking more about it. I'm unsure how e.g. the actual migration > > would work. The actual migration is done by: deactivate_task() -> > > set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing > > steps have operations that depend on rq lock - psi updates, uclamp updates > > and so on. How would they work? > > Suppose __task_rq_lock() will take rq->lock and p->sub_lock, in that > order, such that task_rq_lock() will take p->pi_lock, rq->lock and > p->sub_lock. > ... > While at the same time, above you argued p->sub_lock should be inside > dsq->lock. Because: > > __schedule() > rq = this_rq(); > LOCK rq->lock > next = pick_next() := pick_next_scx() > LOCK dsq->lock > p = find_task(dsq); > LOCK p->sub_lock > dequeue(dsq, p); > UNLOCK dsq->lock I was going back and forth with the locking order. Note that even if sub_lock is nested outside DSQ locks, the hot path for the pick_task() path wouldn't be that different - it just needs RCU protected first_task pointer and the DSQ association needs to be verified after grabbing the sub_lock (much like how task_rq_lock() needs to retry). > Because if you did something like: > > __schedule() > rq = this_rq(); > LOCK rq->lock > next = pick_next() := pick_next_scx() > LOCK dsq->lock (or RCU, doesn't matter) > p = find_task(dsq); > UNLOCK dsq->lock > migrate: > LOCK p->pi_lock > rq = task_rq(p) > LOCK rq->lock > (verify bla bla) > LOCK p->sub_lock > LOCK dsq->lock > dequeue(dsq, p) > UNLOCK dsq->lock > set_task_cpu(n); > UNLOCK rq->lock > rq = cpu_rq(n); > LOCK rq->lock (inversion vs p->sub_lock) > LOCK dsq2->lock > enqueue(dsq2, p) > UNLOCK dsq2->lock > > LOCK p->sub_lock > LOCK dsq->lock (whoopsie, p is on dsq2) > dequeue(dsq, p) > set_task_cpu(here); > UNLOCK dsq->lock I suppose the above is showing that p->sub_lock is nested outside dsq->lock, right? If so, the right sequence would be: LOCK p->sub_lock LOCK src_dsq->lock verify p is still on src_dsq, if not retry remove from src_dsq UNLOCK src_dsq->lock LOCK dst_dsq->lock insert into dst_dsq UNLOCK dst_dsq->lock UNLOCK p->sub_lock > That is, either way around: dsq->lock outside, p->sub_lock inside, or > the other way around, I emd up with inversions and race conditions that > are not fun. It's not straightforward for sure but none of these approaches are. They are all complicated in some areas. From sched_ext POV, I think what matter most are providing as much latitude as possible to BPF scheduler implementations and having lower likelihood of really subtle issues. > Also, if you do put p->sub_lock inside dsq->lock, this means > __task_rq_lock() cannot take it and it needs to be pushed deep into scx > (possibly into bpf ?) and that means I'm not sure how to do the change > pattern sanely. I'm not quite following why task_rq_lock() wouldn't be able to take it. Whether p->sub_lock nests under or over DSQ locks should only matter to sched_ext proper. From core's POV, the only thing that matters is that as long as p->sub_lock is held, the task won't be migrating and is safe to change states for (more on this later). > Having __task_rq_lock() take p->dsq->lock solves all these problems, > except for that one weird case where BPF wants to do things their own > way. The longer I'm thinking about it, the more I dislike that. I just > don't see *ANY* upside from allowing BPF to do this while it is making > everything else quite awkward. I'm failing to see the problems you're seeing and have to disagree that allowing more capabilities on BPF side doesn't bring any upsides. DSQs are useful but it quickly becomes too restrictive - e.g. people often want to put the same task on multiple queues and other data structures, which is a lot more straightforward to do if the data structure and locking are managed in BPF. In general, I don't think it's a good direction to be prescriptive about how schedulers should be implemented or behave. Even if we might not be able to think up something neat right now, someone will. > The easy fix is to have these BPF managed things have a single global > lock. That works and is correct. Then if they want something better, > they can use DSQs :-) > > Fundamentally, we need the DSQ->lock to cover all CPUs that will pick > from it, there is no wiggle room there. Also note that while we change > only the attributes of a single task with the change pattern, that > affects the whole RQ, since a runqueue is an aggregate of all tasks. > This is very much why dequeue/enqueue around the change pattern, to keep > the runqueue aggregates updated. > > Use the BPF thing to play with scheduling policies, but leave the > locking to the core code. I have two questions: - Let's say something works, whether that's holding dsq->lock or p->sub_lock. I still don't understand how things would be safe w.r.t. things like PSI and uclamp updates. How would they cope with set_task_cpu() happening without the rq locked? - This all started from two proposed core changes. One additional hook after task pick regardless of the picked task's class (this is a regression that I missed during the pick_task() conversion) and balance() call for deadline server, which I think can be combined with existing special case for sched_ext. While it'd be nice to be able to migrate without holding rq locks, that path seems very invasive and to have a lot of proposed capability impacts. This doesn't seem like a particularly productive direection to me. Thanks. -- tejun