From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 5 Mar 2026 11:52:01 +0100
From: Boris Brezillon
To: Matthew Brost
Cc: Chia-I Wu, ML dri-devel, Steven Price, Liviu Dudau,
 Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
 Simona Vetter, Danilo Krummrich, Philipp Stanner, Christian König,
 Thomas Hellström, Rodrigo Vivi, open list
Subject: Re: drm_sched run_job and scheduling latency
Message-ID: <20260305115201.6fb044f0@fedora>
In-Reply-To: 
References: <20260305092711.20069ca1@fedora>
Organization: Collabora
X-Mailer: Claws Mail 4.3.1 (GTK 3.24.51; x86_64-redhat-linux-gnu)
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On Thu, 5 Mar 2026 02:09:16 -0800
Matthew Brost wrote:

> On Thu, Mar 05, 2026 at 09:27:11AM +0100, Boris Brezillon wrote:
>
> I addressed most of your comments in a chained reply to Philipp, but I
> guess he dropped some of your emails and thus missed those. Responding
> below.
>
> > Hi Matthew,
> >
> > On Wed, 4 Mar 2026 18:04:25 -0800
> > Matthew Brost wrote:
> >
> > > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > > Hi,
> > > >
> > > > Our system compositor (SurfaceFlinger on Android) submits GPU jobs
> > > > from a SCHED_FIFO thread to an RT GPU queue. However, because
> > > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > > to run_job can sometimes cause frame misses. We are seeing this on
> > > > panthor and xe, but the issue should be common to all drm_sched users.
> > > >
> > >
> > > I'm going to assume that since this is a compositor, you do not pass
> > > input dependencies to the page-flip job. Is that correct?
> > >
> > > If so, I believe we could fairly easily build an opt-in DRM sched path
> > > that directly calls run_job in the exec IOCTL context (I assume this is
> > > SCHED_FIFO) if the job has no dependencies.
> >
> > I guess by ::run_job() you mean something slightly more involved that
> > checks whether:
> >
> > - other jobs are pending
> > - enough credits (AKA ring-buffer space) are available
> > - and probably other things I forgot about
> >
> > >
> > > This would likely break some of Xe's submission-backend assumptions
> > > around mutual exclusion and ordering based on the workqueue, but that
> > > seems workable. I don't know how the Panthor code is structured or
> > > whether they have similar issues.
> >
> > Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> > you're describing. There are just so many things we could forget that
> > would lead to races/ordering issues that end up being hard to trigger
> > and debug. Besides, it doesn't solve the problem where your gfx
> > pipeline is fully stuffed and the kernel has to dequeue things
> > asynchronously. I do believe we want RT-prio support in that case too.
> >
>
> My understanding of SurfaceFlinger is that it never waits on input
> dependencies from rendering applications, since those may not signal in
> time for a page flip. Because of that, you can't have the job(s) that
> draw to the screen accept input dependencies. Maybe I have that
> wrong -- but I've spoken to the Google team several times about issues
> with SurfaceFlinger, and that was my takeaway.
>
> So I don't think the kernel should ever have to dequeue things
> asynchronously, at least for SurfaceFlinger.

There's still the contention coming from the ring-buffer size, which can
prevent jobs from being queued directly to the HW. Though, admittedly, if
the HW is not capable of compositing the frame faster than the refresh
rate and guaranteeing an almost-always-empty ring buffer, fixing the
scheduling prio is probably pointless.

> If there is another RT use
> case that requires input dependencies plus the kernel dequeuing things
> asynchronously, I agree this wouldn't help -- but my suggestion also
> isn't mutually exclusive with other RT rework either.

Yeah, dunno. It just feels like another hack on top of the already quite
convoluted design that drm_sched has become.

>
> > >
> > > I can try to hack together a quick PoC to see what this would look
> > > like and give you something to test.
> > >
> > > > Using a WQ_HIGHPRI workqueue helps, but it is still not RT (and
> > > > won't meet future Android requirements). It seems either workqueue
> > > > needs to gain RT support, or drm_sched needs to support
> > > > kthread_worker.
> > >
> > > +Tejun to see if RT workqueue is in the plans.
> >
> > Dunno how feasible that is, but that would be my preferred option.
> >
> > >
> > > >
> > > > I know drm_sched switched from kthread_worker to workqueue for
> > > > better scaling when xe was introduced.
> > > > But if drm_sched can support either
> > > > workqueue or kthread_worker during drm_sched_init, drivers can
> > > > selectively use kthread_worker only for RT GPU queues. And because
> > > > drivers require CAP_SYS_NICE for RT GPU queues, this should not
> > > > cause scaling issues.
> > > >
> > >
> > > I don't think having two paths will ever be acceptable, nor do I
> > > think supporting a kthread would be all that easy. For example, in Xe
> > > we queue additional work items outside of the scheduler on the queue
> > > for ordering reasons -- we'd have to move all of that code down into
> > > DRM sched or completely redesign our submission model to avoid this.
> > > I'm not sure if other drivers also do this, but it is allowed.
> >
> > Panthor doesn't rely on the serialization provided by the
> > single-threaded workqueue; Panfrost might rely on it though (I don't
> > remember). I agree that maintaining both thread- and workqueue-based
> > scheduling is not ideal though.
> >
> > >
> > > > Thoughts? Or perhaps this becomes less of an issue if all drm_sched
> > > > users have concrete plans for userspace submissions..
> > >
> > > Maybe some day....
> >
> > I've yet to see a solution where no dma_fence-based signaling is
> > involved in graphics workloads though (IIRC, Arm's solution still
> > needs the kernel for that). Until that happens, we'll still need the
> > kernel to signal fences asynchronously when the job is done, which I
> > suspect will cause the same kind of latency issue...
> >
>
> I don't think that is the problem here. Doesn't the job that draws the
> frame actually draw it, or does the display wait on the draw job's fence
> to signal and then do something else?

I know close to nothing about SurfaceFlinger and very little about
compositors in general, so I'll let Chia-I answer that one.
What's sure is that, on regular page-flips (I don't remember what async
page-flips do), the display drivers wait for the fences attached to the
buffer to signal before doing the flip.

> (Sorry -- I know next to nothing
> about display.) Either way, fences should be signaled in IRQ handlers,

In Panthor they are not, but that's probably something for us to address.

> which presumably don't have the same latency issues as workqueues, but I
> could be mistaken.

This might have to do with the mental model I had of this "reconcile
usermode queues with dma_fence signaling" idea, where I was imagining a
SW job queue (based on drm_sched too) that would wait for HW fences to
signal and, as a result, signal the dma_fence attached to the job. The
queueing/dequeuing of these jobs would still happen through drm_sched,
with the same scheduling-prio issue. This being said, those jobs would
likely be dependency-free, so more likely to hit your
fast-path-run-job.