The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* Re: [PATCH 07/10] drm/panthor: Automate CSG IRQ processing at group unbind time
       [not found]   ` <e5c9763b-87c1-405d-9fda-f3f324d6bdfe@arm.com>
@ 2026-05-04 15:00     ` Boris Brezillon
  0 siblings, 0 replies; 7+ messages in thread
From: Boris Brezillon @ 2026-05-04 15:00 UTC (permalink / raw)
  To: Steven Price
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, dri-devel, linux-kernel

On Fri, 1 May 2026 14:53:22 +0100
Steven Price <steven.price@arm.com> wrote:

> > @@ -2970,8 +2963,6 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> >  
> >  			if (flush_caches_failed)
> >  				csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
> > -			else
> > -				csg_slot_sync_update_locked(ptdev, csg_id);  
> 
> The justification for this change doesn't seem to be included in the
> commit message and looks suspicious.

Hm, right. I somehow confused csg_slot_sync_update_locked() and
sched_process_csg_irq_locked().

> Although AFAICT the events_lock
> wouldn't be held here so it could trigger a lockdep assert before this
> change...

Yeah, the guard(spinlock_irqsave)(&sched->events_lock); should have
been added to patch 4.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 00/10] drm/panthor: Reduce dma_fence signalling latency
       [not found] ` <20260429123607.7a8c7051@fedora>
@ 2026-05-05  8:54   ` Boris Brezillon
  2026-05-05 16:12     ` Liviu Dudau
  0 siblings, 1 reply; 7+ messages in thread
From: Boris Brezillon @ 2026-05-05  8:54 UTC (permalink / raw)
  To: Steven Price, Liviu Dudau
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, dri-devel, linux-kernel

On Wed, 29 Apr 2026 12:36:07 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Wed, 29 Apr 2026 11:38:27 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > Right now, panthor is one of the rare drivers to signal fences
> > from work items (not even from the threaded IRQ handler). We
> > could move that to the threaded handler, but that would still
> > leave the latency caused by the scheduling of the IRQ thread.
> > 
> > Instead, this patchset moves all the job IRQ processing to
> > the raw IRQ handler, which is fine because what the current
> > code does is demux the interrupts and deferring actual handling
> > to sub work items. The only bits we keep in the IRQ path is
> > the dma_fence signalling, which should be acceptable, in term
> > of CPU cycles spent in the IRQ context.
> > 
> > Pretty much all the patches except the last two are just
> > preparing the ground to get there. The second to last one
> > does the thread -> IRQ transition, and the last one is some
> > experimental interrupt coalescing support that I've added
> > because I noticed moving job IRQ handling to the raw handler
> > generates quite a lot of interrupts in some case, and having
> > the system constantly interrupted like that can be
> > detrimental.
> >   
> 
> Forgot to post some preliminary numbers I collected during my,
> admittedly, very basic testing :-). What this shows is that IRQ
> coalescing provides small but noticeable improvements only in some
> of the glmark scenes (terrain, refract),

I think I found the problem on the "refract" regression. Turns out if I
set cpufreq governor to performance instead of schedutil, I get those
~20% back, which is not surprising given schedutil uses scheduling stats
to decide to bump/lower the frequency, and the extra IRQ_WAKE_THREAD
indirection does add regular thread activity.

So, this basically leaves terrain where interrupt coalescing helps a
bit, but it's not clear that the improvement justifies the added
complexity or the extra CPU load incurred by the polling done in the
threaded handler.

If everyone is fine with that, I'd be tempted to drop patch 10 for now,
and revisit it if/when we have an actual use case where it's deemed
essential to limit IRQs coming from the GPU.

Regards,

Boris

> the rest of the variations
> stay in the noise of what we see between regular glmark runs. BTW,
> those relatively small improvements (~5%) aren't even reflected in the
> final score, because many tests have high FPS scores, and any variation
> on those might actually have more impact on the final score (which is
> just a average FPS IIUC) than any improvement on the lower-FPS scenes.
> 
> It's also worth noting that the refract scenes seems to suffer from
> this threaded -> raw-IRQ transition, and that coalescing gets us back
> to where we were.
> 
> TLDR; As always, there's no simple answer to this 'latency vs throughput'
> issue, and it's not surprising one approach helps some cases and
> regresses others.
> 
> ---------- Before this series ---------------
> 
> =======================================================
>     glmark2 2023.01
> =======================================================
>     OpenGL Information
>     GL_VENDOR:      Mesa
>     GL_RENDERER:    Mali-G610 MC4 (Panfrost)
>     GL_VERSION:     OpenGL ES 3.1 Mesa 26.2.0-devel (git-c71664cfbc)
>     Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
>     Surface Size:   800x600 windowed
> =======================================================
> [build] use-vbo=false: FPS: 2708 FrameTime: 0.369 ms
> [build] use-vbo=true: FPS: 4209 FrameTime: 0.238 ms
> [texture] texture-filter=nearest: FPS: 5211 FrameTime: 0.192 ms
> [texture] texture-filter=linear: FPS: 5224 FrameTime: 0.191 ms
> [texture] texture-filter=mipmap: FPS: 5255 FrameTime: 0.190 ms
> [shading] shading=gouraud: FPS: 3395 FrameTime: 0.295 ms
> [shading] shading=blinn-phong-inf: FPS: 3329 FrameTime: 0.300 ms
> [shading] shading=phong: FPS: 2990 FrameTime: 0.335 ms
> [shading] shading=cel: FPS: 2916 FrameTime: 0.343 ms
> [bump] bump-render=high-poly: FPS: 1879 FrameTime: 0.532 ms
> [bump] bump-render=normals: FPS: 5242 FrameTime: 0.191 ms
> [bump] bump-render=height: FPS: 4997 FrameTime: 0.200 ms
> [effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 3725 FrameTime: 0.268 ms
> [effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 1906 FrameTime: 0.525 ms
> [pulsar] light=false:quads=5:texture=false: FPS: 4863 FrameTime: 0.206 ms
> [desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 706 FrameTime: 1.417 ms
> [desktop] effect=shadow:windows=4: FPS: 2621 FrameTime: 0.382 ms
> [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 411 FrameTime: 2.435 ms
> [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 402 FrameTime: 2.489 ms
> [buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 490 FrameTime: 2.043 ms
> [ideas] speed=duration: FPS: 1008 FrameTime: 0.992 ms
> [jellyfish] <default>: FPS: 2722 FrameTime: 0.367 ms
> [terrain] <default>: FPS: 120 FrameTime: 8.339 ms
> [shadow] <default>: FPS: 2086 FrameTime: 0.479 ms
> [refract] <default>: FPS: 312 FrameTime: 3.209 ms
> [conditionals] fragment-steps=0:vertex-steps=0: FPS: 4877 FrameTime: 0.205 ms
> [conditionals] fragment-steps=5:vertex-steps=0: FPS: 4118 FrameTime: 0.243 ms
> [conditionals] fragment-steps=0:vertex-steps=5: FPS: 4845 FrameTime: 0.206 ms
> [function] fragment-complexity=low:fragment-steps=5: FPS: 4444 FrameTime: 0.225 ms
> [function] fragment-complexity=medium:fragment-steps=5: FPS: 3722 FrameTime: 0.269 ms
> [loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4468 FrameTime: 0.224 ms
> [loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4442 FrameTime: 0.225 ms
> [loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 3847 FrameTime: 0.260 ms
> =======================================================
>                                   glmark2 Score: 3135 
> =======================================================
> 
> ---------- After transitioning to job event processing in the IRQ context ------------
> 
> =======================================================
>     glmark2 2023.01
> =======================================================
>     OpenGL Information
>     GL_VENDOR:      Mesa
>     GL_RENDERER:    Mali-G610 MC4 (Panfrost)
>     GL_VERSION:     OpenGL ES 3.1 Mesa 26.2.0-devel (git-c71664cfbc)
>     Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
>     Surface Size:   800x600 windowed
> =======================================================
> [build] use-vbo=false: FPS: 2703 FrameTime: 0.370 ms
> [build] use-vbo=true: FPS: 4630 FrameTime: 0.216 ms
> [texture] texture-filter=nearest: FPS: 5406 FrameTime: 0.185 ms
> [texture] texture-filter=linear: FPS: 5429 FrameTime: 0.184 ms
> [texture] texture-filter=mipmap: FPS: 5408 FrameTime: 0.185 ms
> [shading] shading=gouraud: FPS: 3678 FrameTime: 0.272 ms
> [shading] shading=blinn-phong-inf: FPS: 3587 FrameTime: 0.279 ms
> [shading] shading=phong: FPS: 3221 FrameTime: 0.311 ms
> [shading] shading=cel: FPS: 3119 FrameTime: 0.321 ms
> [bump] bump-render=high-poly: FPS: 1977 FrameTime: 0.506 ms
> [bump] bump-render=normals: FPS: 5488 FrameTime: 0.182 ms
> [bump] bump-render=height: FPS: 5323 FrameTime: 0.188 ms
> [effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 4003 FrameTime: 0.250 ms
> [effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 2008 FrameTime: 0.498 ms
> [pulsar] light=false:quads=5:texture=false: FPS: 4961 FrameTime: 0.202 ms
> [desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 852 FrameTime: 1.174 ms
> [desktop] effect=shadow:windows=4: FPS: 2649 FrameTime: 0.378 ms
> [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 412 FrameTime: 2.429 ms
> [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 392 FrameTime: 2.554 ms
> [buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 482 FrameTime: 2.075 ms
> [ideas] speed=duration: FPS: 1021 FrameTime: 0.980 ms
> [jellyfish] <default>: FPS: 2939 FrameTime: 0.340 ms
> [terrain] <default>: FPS: 126 FrameTime: 7.979 ms
> [shadow] <default>: FPS: 2273 FrameTime: 0.440 ms
> [refract] <default>: FPS: 251 FrameTime: 3.999 ms
> [conditionals] fragment-steps=0:vertex-steps=0: FPS: 5148 FrameTime: 0.194 ms
> [conditionals] fragment-steps=5:vertex-steps=0: FPS: 4555 FrameTime: 0.220 ms
> [conditionals] fragment-steps=0:vertex-steps=5: FPS: 5245 FrameTime: 0.191 ms
> [function] fragment-complexity=low:fragment-steps=5: FPS: 4880 FrameTime: 0.205 ms
> [function] fragment-complexity=medium:fragment-steps=5: FPS: 4042 FrameTime: 0.247 ms
> [loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4846 FrameTime: 0.206 ms
> [loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4854 FrameTime: 0.206 ms
> [loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 4207 FrameTime: 0.238 ms
> =======================================================
>                                   glmark2 Score: 3335 
> =======================================================
> 
> ---- With IRQ coalescing enabled (max_us=100 poll_period_us=5 inbounds_cnt_threshold=5) ---
> 
> =======================================================
>     glmark2 2023.01
> =======================================================
>     OpenGL Information
>     GL_VENDOR:      Mesa
>     GL_RENDERER:    Mali-G610 MC4 (Panfrost)
>     GL_VERSION:     OpenGL ES 3.1 Mesa 26.2.0-devel (git-c71664cfbc)
>     Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
>     Surface Size:   800x600 windowed
> =======================================================
> [build] use-vbo=false: FPS: 2663 FrameTime: 0.376 ms
> [build] use-vbo=true: FPS: 4640 FrameTime: 0.216 ms
> [texture] texture-filter=nearest: FPS: 5335 FrameTime: 0.187 ms
> [texture] texture-filter=linear: FPS: 5442 FrameTime: 0.184 ms
> [texture] texture-filter=mipmap: FPS: 5434 FrameTime: 0.184 ms
> [shading] shading=gouraud: FPS: 3683 FrameTime: 0.272 ms
> [shading] shading=blinn-phong-inf: FPS: 3580 FrameTime: 0.279 ms
> [shading] shading=phong: FPS: 3211 FrameTime: 0.312 ms
> [shading] shading=cel: FPS: 3093 FrameTime: 0.323 ms
> [bump] bump-render=high-poly: FPS: 1969 FrameTime: 0.508 ms
> [bump] bump-render=normals: FPS: 5368 FrameTime: 0.186 ms
> [bump] bump-render=height: FPS: 5273 FrameTime: 0.190 ms
> [effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 4038 FrameTime: 0.248 ms
> [effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 2001 FrameTime: 0.500 ms
> [pulsar] light=false:quads=5:texture=false: FPS: 4961 FrameTime: 0.202 ms
> [desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 842 FrameTime: 1.188 ms
> [desktop] effect=shadow:windows=4: FPS: 2681 FrameTime: 0.373 ms
> [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 412 FrameTime: 2.430 ms
> [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 408 FrameTime: 2.452 ms
> [buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 483 FrameTime: 2.072 ms
> [ideas] speed=duration: FPS: 1005 FrameTime: 0.995 ms
> [jellyfish] <default>: FPS: 2945 FrameTime: 0.340 ms
> [terrain] <default>: FPS: 131 FrameTime: 7.663 ms
> [shadow] <default>: FPS: 2276 FrameTime: 0.440 ms
> [refract] <default>: FPS: 328 FrameTime: 3.050 ms
> [conditionals] fragment-steps=0:vertex-steps=0: FPS: 5099 FrameTime: 0.196 ms
> [conditionals] fragment-steps=5:vertex-steps=0: FPS: 4538 FrameTime: 0.220 ms
> [conditionals] fragment-steps=0:vertex-steps=5: FPS: 5152 FrameTime: 0.194 ms
> [function] fragment-complexity=low:fragment-steps=5: FPS: 4818 FrameTime: 0.208 ms
> [function] fragment-complexity=medium:fragment-steps=5: FPS: 4035 FrameTime: 0.248 ms
> [loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4855 FrameTime: 0.206 ms
> [loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4812 FrameTime: 0.208 ms
> [loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 4150 FrameTime: 0.241 ms
> =======================================================
>                                   glmark2 Score: 3322 
> =======================================================
> 
> 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > Boris Brezillon (10):
> >       drm/panthor: Make panthor_irq::state a non-atomic field
> >       drm/panthor: Move the register accessors before the IRQ helpers
> >       drm/panthor: Replace the panthor_irq macro machinery by inline helpers
> >       drm/panthor: Extend the IRQ logic to allow fast/raw IRQ handlers
> >       drm/panthor: Make panthor_fw_{update,toggle}_reqs() callable from IRQ context
> >       drm/panthor: Prepare the scheduler logic for FW events in IRQ context
> >       drm/panthor: Automate CSG IRQ processing at group unbind time
> >       drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks()
> >       drm/panthor: Process FW events in IRQ context
> >       drm/panthor: Introduce interrupt coalescing support for job IRQs
> > 
> >  drivers/gpu/drm/panthor/panthor_device.h | 358 ++++++++++++++---------
> >  drivers/gpu/drm/panthor/panthor_drv.c    |   1 +
> >  drivers/gpu/drm/panthor/panthor_fw.c     | 226 +++++++++++++--
> >  drivers/gpu/drm/panthor/panthor_fw.h     |  11 +-
> >  drivers/gpu/drm/panthor/panthor_gpu.c    |  27 +-
> >  drivers/gpu/drm/panthor/panthor_mmu.c    |  38 +--
> >  drivers/gpu/drm/panthor/panthor_pwr.c    |  21 +-
> >  drivers/gpu/drm/panthor/panthor_sched.c  | 475 ++++++++++++++-----------------
> >  8 files changed, 698 insertions(+), 459 deletions(-)
> > ---
> > base-commit: 7455a0583a906533041a80e48c6a2e3230cce96e
> > change-id: 20260429-panthor-signal-from-irq-d33684f4d292
> > prerequisite-message-id: <20260427155934.416502-1-karunika.choo@arm.com>
> > prerequisite-patch-id: 70905a2eb09ab2b31d242a5ed5af3b42fb6a464c
> > prerequisite-patch-id: aa4c22669f80328039762f25c0b3942bbadbdc89
> > prerequisite-patch-id: 7f61bcee3c4bb5703900b18d5b6e0f52e622f29d
> > prerequisite-patch-id: 3402f4d60aa526d40113fc3d9b3e599f8f89e705
> > prerequisite-patch-id: 00ddbd3d455891f6950609614c1acd2baa78b0db
> > prerequisite-patch-id: 6a9928f609e3757cadebb2df6795d0da55745f4e
> > prerequisite-patch-id: fd91f68f25d4bc93eec405f0131f5ae4284bfaf2
> > prerequisite-patch-id: 553958a10a0ca2f20f7883ad4c752cfc7485c5a8
> > 
> > Best regards,  
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 00/10] drm/panthor: Reduce dma_fence signalling latency
  2026-05-05  8:54   ` [PATCH 00/10] drm/panthor: Reduce dma_fence signalling latency Boris Brezillon
@ 2026-05-05 16:12     ` Liviu Dudau
  0 siblings, 0 replies; 7+ messages in thread
From: Liviu Dudau @ 2026-05-05 16:12 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Steven Price, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, dri-devel, linux-kernel

On Tue, May 05, 2026 at 10:54:53AM +0200, Boris Brezillon wrote:
> On Wed, 29 Apr 2026 12:36:07 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > On Wed, 29 Apr 2026 11:38:27 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> > 
> > > Right now, panthor is one of the rare drivers to signal fences
> > > from work items (not even from the threaded IRQ handler). We
> > > could move that to the threaded handler, but that would still
> > > leave the latency caused by the scheduling of the IRQ thread.
> > > 
> > > Instead, this patchset moves all the job IRQ processing to
> > > the raw IRQ handler, which is fine because what the current
> > > code does is demux the interrupts and deferring actual handling
> > > to sub work items. The only bits we keep in the IRQ path is
> > > the dma_fence signalling, which should be acceptable, in term
> > > of CPU cycles spent in the IRQ context.
> > > 
> > > Pretty much all the patches except the last two are just
> > > preparing the ground to get there. The second to last one
> > > does the thread -> IRQ transition, and the last one is some
> > > experimental interrupt coalescing support that I've added
> > > because I noticed moving job IRQ handling to the raw handler
> > > generates quite a lot of interrupts in some case, and having
> > > the system constantly interrupted like that can be
> > > detrimental.
> > >   
> > 
> > Forgot to post some preliminary numbers I collected during my,
> > admittedly, very basic testing :-). What this shows is that IRQ
> > coalescing provides small but noticeable improvements only in some
> > of the glmark scenes (terrain, refract),
> 
> I think I found the problem on the "refract" regression. Turns out if I
> set cpufreq governor to performance instead of schedutil, I get those
> ~20% back, which is not surprising given schedutil uses scheduling stats
> to decide to bump/lower the frequency, and the extra IRQ_WAKE_THREAD
> indirection does add regular thread activity.

Nice find! I've completely ignore that the choice of scheduler could play a
part in the latency of handling raw IRQs if it ends up lowering the CPU freq.

> 
> So, this basically leaves terrain where interrupt coalescing helps a
> bit, but it's not clear that the improvement justifies the added
> complexity or the extra CPU load incurred by the polling done in the
> threaded handler.
> 
> If everyone is fine with that, I'd be tempted to drop patch 10 for now,
> and revisit it if/when we have an actual use case where it's deemed
> essential to limit IRQs coming from the GPU.

Yes, I agree with the idea of dropping the coalescing patch for now. Doing
more in the raw handler and deferring only slow work to the threaded handler
makes sense, so we should keep that.

Best regards,
Liviu

> 
> Regards,
> 
> Boris
> 
> > the rest of the variations
> > stay in the noise of what we see between regular glmark runs. BTW,
> > those relatively small improvements (~5%) aren't even reflected in the
> > final score, because many tests have high FPS scores, and any variation
> > on those might actually have more impact on the final score (which is
> > just a average FPS IIUC) than any improvement on the lower-FPS scenes.
> > 
> > It's also worth noting that the refract scenes seems to suffer from
> > this threaded -> raw-IRQ transition, and that coalescing gets us back
> > to where we were.
> > 
> > TLDR; As always, there's no simple answer to this 'latency vs throughput'
> > issue, and it's not surprising one approach helps some cases and
> > regresses others.
> > 
> > ---------- Before this series ---------------
> > 
> > =======================================================
> >     glmark2 2023.01
> > =======================================================
> >     OpenGL Information
> >     GL_VENDOR:      Mesa
> >     GL_RENDERER:    Mali-G610 MC4 (Panfrost)
> >     GL_VERSION:     OpenGL ES 3.1 Mesa 26.2.0-devel (git-c71664cfbc)
> >     Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
> >     Surface Size:   800x600 windowed
> > =======================================================
> > [build] use-vbo=false: FPS: 2708 FrameTime: 0.369 ms
> > [build] use-vbo=true: FPS: 4209 FrameTime: 0.238 ms
> > [texture] texture-filter=nearest: FPS: 5211 FrameTime: 0.192 ms
> > [texture] texture-filter=linear: FPS: 5224 FrameTime: 0.191 ms
> > [texture] texture-filter=mipmap: FPS: 5255 FrameTime: 0.190 ms
> > [shading] shading=gouraud: FPS: 3395 FrameTime: 0.295 ms
> > [shading] shading=blinn-phong-inf: FPS: 3329 FrameTime: 0.300 ms
> > [shading] shading=phong: FPS: 2990 FrameTime: 0.335 ms
> > [shading] shading=cel: FPS: 2916 FrameTime: 0.343 ms
> > [bump] bump-render=high-poly: FPS: 1879 FrameTime: 0.532 ms
> > [bump] bump-render=normals: FPS: 5242 FrameTime: 0.191 ms
> > [bump] bump-render=height: FPS: 4997 FrameTime: 0.200 ms
> > [effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 3725 FrameTime: 0.268 ms
> > [effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 1906 FrameTime: 0.525 ms
> > [pulsar] light=false:quads=5:texture=false: FPS: 4863 FrameTime: 0.206 ms
> > [desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 706 FrameTime: 1.417 ms
> > [desktop] effect=shadow:windows=4: FPS: 2621 FrameTime: 0.382 ms
> > [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 411 FrameTime: 2.435 ms
> > [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 402 FrameTime: 2.489 ms
> > [buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 490 FrameTime: 2.043 ms
> > [ideas] speed=duration: FPS: 1008 FrameTime: 0.992 ms
> > [jellyfish] <default>: FPS: 2722 FrameTime: 0.367 ms
> > [terrain] <default>: FPS: 120 FrameTime: 8.339 ms
> > [shadow] <default>: FPS: 2086 FrameTime: 0.479 ms
> > [refract] <default>: FPS: 312 FrameTime: 3.209 ms
> > [conditionals] fragment-steps=0:vertex-steps=0: FPS: 4877 FrameTime: 0.205 ms
> > [conditionals] fragment-steps=5:vertex-steps=0: FPS: 4118 FrameTime: 0.243 ms
> > [conditionals] fragment-steps=0:vertex-steps=5: FPS: 4845 FrameTime: 0.206 ms
> > [function] fragment-complexity=low:fragment-steps=5: FPS: 4444 FrameTime: 0.225 ms
> > [function] fragment-complexity=medium:fragment-steps=5: FPS: 3722 FrameTime: 0.269 ms
> > [loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4468 FrameTime: 0.224 ms
> > [loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4442 FrameTime: 0.225 ms
> > [loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 3847 FrameTime: 0.260 ms
> > =======================================================
> >                                   glmark2 Score: 3135 
> > =======================================================
> > 
> > ---------- After transitioning to job event processing in the IRQ context ------------
> > 
> > =======================================================
> >     glmark2 2023.01
> > =======================================================
> >     OpenGL Information
> >     GL_VENDOR:      Mesa
> >     GL_RENDERER:    Mali-G610 MC4 (Panfrost)
> >     GL_VERSION:     OpenGL ES 3.1 Mesa 26.2.0-devel (git-c71664cfbc)
> >     Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
> >     Surface Size:   800x600 windowed
> > =======================================================
> > [build] use-vbo=false: FPS: 2703 FrameTime: 0.370 ms
> > [build] use-vbo=true: FPS: 4630 FrameTime: 0.216 ms
> > [texture] texture-filter=nearest: FPS: 5406 FrameTime: 0.185 ms
> > [texture] texture-filter=linear: FPS: 5429 FrameTime: 0.184 ms
> > [texture] texture-filter=mipmap: FPS: 5408 FrameTime: 0.185 ms
> > [shading] shading=gouraud: FPS: 3678 FrameTime: 0.272 ms
> > [shading] shading=blinn-phong-inf: FPS: 3587 FrameTime: 0.279 ms
> > [shading] shading=phong: FPS: 3221 FrameTime: 0.311 ms
> > [shading] shading=cel: FPS: 3119 FrameTime: 0.321 ms
> > [bump] bump-render=high-poly: FPS: 1977 FrameTime: 0.506 ms
> > [bump] bump-render=normals: FPS: 5488 FrameTime: 0.182 ms
> > [bump] bump-render=height: FPS: 5323 FrameTime: 0.188 ms
> > [effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 4003 FrameTime: 0.250 ms
> > [effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 2008 FrameTime: 0.498 ms
> > [pulsar] light=false:quads=5:texture=false: FPS: 4961 FrameTime: 0.202 ms
> > [desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 852 FrameTime: 1.174 ms
> > [desktop] effect=shadow:windows=4: FPS: 2649 FrameTime: 0.378 ms
> > [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 412 FrameTime: 2.429 ms
> > [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 392 FrameTime: 2.554 ms
> > [buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 482 FrameTime: 2.075 ms
> > [ideas] speed=duration: FPS: 1021 FrameTime: 0.980 ms
> > [jellyfish] <default>: FPS: 2939 FrameTime: 0.340 ms
> > [terrain] <default>: FPS: 126 FrameTime: 7.979 ms
> > [shadow] <default>: FPS: 2273 FrameTime: 0.440 ms
> > [refract] <default>: FPS: 251 FrameTime: 3.999 ms
> > [conditionals] fragment-steps=0:vertex-steps=0: FPS: 5148 FrameTime: 0.194 ms
> > [conditionals] fragment-steps=5:vertex-steps=0: FPS: 4555 FrameTime: 0.220 ms
> > [conditionals] fragment-steps=0:vertex-steps=5: FPS: 5245 FrameTime: 0.191 ms
> > [function] fragment-complexity=low:fragment-steps=5: FPS: 4880 FrameTime: 0.205 ms
> > [function] fragment-complexity=medium:fragment-steps=5: FPS: 4042 FrameTime: 0.247 ms
> > [loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4846 FrameTime: 0.206 ms
> > [loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4854 FrameTime: 0.206 ms
> > [loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 4207 FrameTime: 0.238 ms
> > =======================================================
> >                                   glmark2 Score: 3335 
> > =======================================================
> > 
> > ---- With IRQ coalescing enabled (max_us=100 poll_period_us=5 inbounds_cnt_threshold=5) ---
> > 
> > =======================================================
> >     glmark2 2023.01
> > =======================================================
> >     OpenGL Information
> >     GL_VENDOR:      Mesa
> >     GL_RENDERER:    Mali-G610 MC4 (Panfrost)
> >     GL_VERSION:     OpenGL ES 3.1 Mesa 26.2.0-devel (git-c71664cfbc)
> >     Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
> >     Surface Size:   800x600 windowed
> > =======================================================
> > [build] use-vbo=false: FPS: 2663 FrameTime: 0.376 ms
> > [build] use-vbo=true: FPS: 4640 FrameTime: 0.216 ms
> > [texture] texture-filter=nearest: FPS: 5335 FrameTime: 0.187 ms
> > [texture] texture-filter=linear: FPS: 5442 FrameTime: 0.184 ms
> > [texture] texture-filter=mipmap: FPS: 5434 FrameTime: 0.184 ms
> > [shading] shading=gouraud: FPS: 3683 FrameTime: 0.272 ms
> > [shading] shading=blinn-phong-inf: FPS: 3580 FrameTime: 0.279 ms
> > [shading] shading=phong: FPS: 3211 FrameTime: 0.312 ms
> > [shading] shading=cel: FPS: 3093 FrameTime: 0.323 ms
> > [bump] bump-render=high-poly: FPS: 1969 FrameTime: 0.508 ms
> > [bump] bump-render=normals: FPS: 5368 FrameTime: 0.186 ms
> > [bump] bump-render=height: FPS: 5273 FrameTime: 0.190 ms
> > [effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 4038 FrameTime: 0.248 ms
> > [effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 2001 FrameTime: 0.500 ms
> > [pulsar] light=false:quads=5:texture=false: FPS: 4961 FrameTime: 0.202 ms
> > [desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 842 FrameTime: 1.188 ms
> > [desktop] effect=shadow:windows=4: FPS: 2681 FrameTime: 0.373 ms
> > [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 412 FrameTime: 2.430 ms
> > [buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 408 FrameTime: 2.452 ms
> > [buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 483 FrameTime: 2.072 ms
> > [ideas] speed=duration: FPS: 1005 FrameTime: 0.995 ms
> > [jellyfish] <default>: FPS: 2945 FrameTime: 0.340 ms
> > [terrain] <default>: FPS: 131 FrameTime: 7.663 ms
> > [shadow] <default>: FPS: 2276 FrameTime: 0.440 ms
> > [refract] <default>: FPS: 328 FrameTime: 3.050 ms
> > [conditionals] fragment-steps=0:vertex-steps=0: FPS: 5099 FrameTime: 0.196 ms
> > [conditionals] fragment-steps=5:vertex-steps=0: FPS: 4538 FrameTime: 0.220 ms
> > [conditionals] fragment-steps=0:vertex-steps=5: FPS: 5152 FrameTime: 0.194 ms
> > [function] fragment-complexity=low:fragment-steps=5: FPS: 4818 FrameTime: 0.208 ms
> > [function] fragment-complexity=medium:fragment-steps=5: FPS: 4035 FrameTime: 0.248 ms
> > [loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4855 FrameTime: 0.206 ms
> > [loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4812 FrameTime: 0.208 ms
> > [loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 4150 FrameTime: 0.241 ms
> > =======================================================
> >                                   glmark2 Score: 3322 
> > =======================================================
> > 
> > 
> > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > ---
> > > Boris Brezillon (10):
> > >       drm/panthor: Make panthor_irq::state a non-atomic field
> > >       drm/panthor: Move the register accessors before the IRQ helpers
> > >       drm/panthor: Replace the panthor_irq macro machinery by inline helpers
> > >       drm/panthor: Extend the IRQ logic to allow fast/raw IRQ handlers
> > >       drm/panthor: Make panthor_fw_{update,toggle}_reqs() callable from IRQ context
> > >       drm/panthor: Prepare the scheduler logic for FW events in IRQ context
> > >       drm/panthor: Automate CSG IRQ processing at group unbind time
> > >       drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks()
> > >       drm/panthor: Process FW events in IRQ context
> > >       drm/panthor: Introduce interrupt coalescing support for job IRQs
> > > 
> > >  drivers/gpu/drm/panthor/panthor_device.h | 358 ++++++++++++++---------
> > >  drivers/gpu/drm/panthor/panthor_drv.c    |   1 +
> > >  drivers/gpu/drm/panthor/panthor_fw.c     | 226 +++++++++++++--
> > >  drivers/gpu/drm/panthor/panthor_fw.h     |  11 +-
> > >  drivers/gpu/drm/panthor/panthor_gpu.c    |  27 +-
> > >  drivers/gpu/drm/panthor/panthor_mmu.c    |  38 +--
> > >  drivers/gpu/drm/panthor/panthor_pwr.c    |  21 +-
> > >  drivers/gpu/drm/panthor/panthor_sched.c  | 475 ++++++++++++++-----------------
> > >  8 files changed, 698 insertions(+), 459 deletions(-)
> > > ---
> > > base-commit: 7455a0583a906533041a80e48c6a2e3230cce96e
> > > change-id: 20260429-panthor-signal-from-irq-d33684f4d292
> > > prerequisite-message-id: <20260427155934.416502-1-karunika.choo@arm.com>
> > > prerequisite-patch-id: 70905a2eb09ab2b31d242a5ed5af3b42fb6a464c
> > > prerequisite-patch-id: aa4c22669f80328039762f25c0b3942bbadbdc89
> > > prerequisite-patch-id: 7f61bcee3c4bb5703900b18d5b6e0f52e622f29d
> > > prerequisite-patch-id: 3402f4d60aa526d40113fc3d9b3e599f8f89e705
> > > prerequisite-patch-id: 00ddbd3d455891f6950609614c1acd2baa78b0db
> > > prerequisite-patch-id: 6a9928f609e3757cadebb2df6795d0da55745f4e
> > > prerequisite-patch-id: fd91f68f25d4bc93eec405f0131f5ae4284bfaf2
> > > prerequisite-patch-id: 553958a10a0ca2f20f7883ad4c752cfc7485c5a8
> > > 
> > > Best regards,  
> > 
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 08/10] drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks()
       [not found]     ` <20260504130215.0222b3bd@fedora>
@ 2026-05-06 14:35       ` Steven Price
  2026-05-06 16:08         ` Boris Brezillon
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Price @ 2026-05-06 14:35 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, dri-devel, linux-kernel

On 04/05/2026 12:02, Boris Brezillon wrote:
> On Fri, 1 May 2026 15:20:17 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 29/04/2026 10:38, Boris Brezillon wrote:
>>> Rather than assuming an interrupt is always expected for request
>>> acks, temporarily enable the relevant interrupts when the polling-wait
>>> failed. This should hopefully reduce the number of interrupts the CPU
>>> has to process.
>>>
>>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>  
>>
>> It seems to work, although I'm lightly uneasy about this because I'm not
>> entirely sure whether the FW will immediately see the updates to
>> ack_irq_mask and therefore whether there's a possibility to miss an
>> event and be stuck waiting for the timeout.
>>
>> Memory models are not my strong point, OpenAI tells me the sequence
>> should be something like:
>>
>>   scoped_guard(spinlock_irqsave, lock) {
>>   	u32 ack_irq_mask = READ_ONCE(*ack_irq_mask_ptr);
>>
>>   	WRITE_ONCE(*ack_irq_mask_ptr, ack_irq_mask | req_mask);
>>   }
> 
> Is this really needed? In which situation would the compiler/CPU decide
> to re-order this read_update_modify sequence?

I think that's the AI being a bit overzealous, but in general WRITE_ONCE
is necessary to avoid some surprising effects. In theory the compiler
can decide to perform multiple writes if it's non-volatile. I.e. a
sequence like:

	u32 old_mask = *ack_irq_mask_ptr;
	if (condition)
		*ack_irq_mask_ptr = 0;
	else
		*ack_irq_mask_ptr |= req_mask;

Can be 'optimised' to:

	u32 old_mask = *ack_irq_mask_ptr;
	*ack_irq_mask_ptr = 0;
	if (!condition)
		*ack_irq_mask_ptr = old_mask | req_mask;

In which the compiler has changed the (!condition) path to do two writes
one of which "should never be seen".

Given that the compiler shouldn't be able to move any of the effects
outside of the scoped_guard(), and since there's only one operation then
I can't see how a compiler would screw it up - but the compiler is
technically free to do so.

>>
>>   /*
>>    * The FW interface can be mapped write-combine/Normal-NC.
> 
> I'm not too sure I see what the non-cached property has to do with it.
> If it was cached we would still need this memory barrier, and in
> addition, we'd need a cache flush if the FW is not IO-coherent.

I *think* the point the AI was making is that the memory isn't Device.
I.e. it's writeback and the write might not have completed.

>> Make sure the
>>    * IRQ mask update is visible to the FW before sleeping waiting for
>> the IRQ.
>>    */
>>   wmb();
>>
>> Which seems plausible. But I've long ago learnt that plausible doesn't
>> mean much when dealing with memory models!
> 
> Yeah, I'm not too sure. I was honestly expecting the spinlock guard to
> act as a memory barrier already, but maybe it's not enough.

So logically it must be enough to enable other CPUs to see writes within
the spinlock - otherwise spinlocks would be completely broken on SMP. I
guess it should be sufficient for the GPU's firmware MCU to see.

This is of course the problem with AI - it's found something 'plausible'
but it's probably completely wrong. I guess we just ignore it unless we
do actually start seeing that timeout happen.

Thanks,
Steve

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 08/10] drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks()
  2026-05-06 14:35       ` [PATCH 08/10] drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks() Steven Price
@ 2026-05-06 16:08         ` Boris Brezillon
  2026-05-13 15:02           ` Steven Price
  0 siblings, 1 reply; 7+ messages in thread
From: Boris Brezillon @ 2026-05-06 16:08 UTC (permalink / raw)
  To: Steven Price
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, dri-devel, linux-kernel

On Wed, 6 May 2026 15:35:18 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/05/2026 12:02, Boris Brezillon wrote:
> > On Fri, 1 May 2026 15:20:17 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 29/04/2026 10:38, Boris Brezillon wrote:  
> >>> Rather than assuming an interrupt is always expected for request
> >>> acks, temporarily enable the relevant interrupts when the polling-wait
> >>> failed. This should hopefully reduce the number of interrupts the CPU
> >>> has to process.
> >>>
> >>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>    
> >>
> >> It seems to work, although I'm lightly uneasy about this because I'm not
> >> entirely sure whether the FW will immediately see the updates to
> >> ack_irq_mask and therefore whether there's a possibility to miss an
> >> event and be stuck waiting for the timeout.
> >>
> >> Memory models are not my strong point, OpenAI tells me the sequence
> >> should be something like:
> >>
> >>   scoped_guard(spinlock_irqsave, lock) {
> >>   	u32 ack_irq_mask = READ_ONCE(*ack_irq_mask_ptr);
> >>
> >>   	WRITE_ONCE(*ack_irq_mask_ptr, ack_irq_mask | req_mask);
> >>   }  
> > 
> > Is this really needed? In which situation would the compiler/CPU decide
> > to re-order this read_update_modify sequence?  
> 
> I think that's the AI being a bit overzealous, but in general WRITE_ONCE
> is necessary to avoid some surprising effects. In theory the compiler
> can decide to perform multiple writes if it's non-volatile. I.e. a
> sequence like:
> 
> 	u32 old_mask = *ack_irq_mask_ptr;
> 	if (condition)
> 		*ack_irq_mask_ptr = 0;
> 	else
> 		*ack_irq_mask_ptr |= req_mask;
> 
> Can be 'optimised' to:
> 
> 	u32 old_mask = *ack_irq_mask_ptr;
> 	*ack_irq_mask_ptr = 0;
> 	if (!condition)
> 		*ack_irq_mask_ptr = old_mask | req_mask;
> 
> In which the compiler has changed the (!condition) path to do two writes
> one of which "should never be seen".
> 
> Given that the compiler shouldn't be able to move any of the effects
> outside of the scoped_guard(), and since there's only one operation then
> I can't see how a compiler would screw it up - but the compiler is
> technically free to do so.

Sure, I'm not saying read_modify_write is atomic per-se (even though
I'd be surprised if the compiler wasn't generating instructions that
are atomic in the end), but it is thread-safe because of the spinlock
covering the read_modify_write op.

> 
> >>
> >>   /*
> >>    * The FW interface can be mapped write-combine/Normal-NC.  
> > 
> > I'm not too sure I see what the non-cached property has to do with it.
> > If it was cached we would still need this memory barrier, and in
> > addition, we'd need a cache flush if the FW is not IO-coherent.  
> 
> I *think* the point the AI was making is that the memory isn't Device.
> I.e. it's writeback and the write might not have completed.

Okay, get it now.

> 
> >> Make sure the
> >>    * IRQ mask update is visible to the FW before sleeping waiting for
> >> the IRQ.
> >>    */
> >>   wmb();
> >>
> >> Which seems plausible. But I've long ago learnt that plausible doesn't
> >> mean much when dealing with memory models!  
> > 
> > Yeah, I'm not too sure. I was honestly expecting the spinlock guard to
> > act as a memory barrier already, but maybe it's not enough.  
> 
> So logically it must be enough to enable other CPUs to see writes within
> the spinlock - otherwise spinlocks would be completely broken on SMP. I
> guess it should be sufficient for the GPU's firmware MCU to see.

For the record, this is currently mapped uncached on both the CPU and
GPU side, because we don't have a way to describe the
shareability properly with the current IOMMU flags.

So, my understanding was that the smp_wb() (DMB(ISH) on arm64) at
the end of a spin_unlock(), would ensure proper store/load instruction
ordering around this barrier, but that it would only wait for the
content to reach the inner shareable domain before returning, not any
further. But maybe I got that wrong from the start, and DMB(ISH)
doesn't even start the transaction if the access is targeting uncached
memory. In which case, AI is right, a full wmb() is needed, otherwise
there's a chance we'll wait indefinitely because the update didn't make
it to the FW interface in the first place.

Also, if that's broken for ack_irq_mask, it's also broken in other
places...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 08/10] drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks()
  2026-05-06 16:08         ` Boris Brezillon
@ 2026-05-13 15:02           ` Steven Price
  2026-05-13 15:42             ` Boris Brezillon
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Price @ 2026-05-13 15:02 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, dri-devel, linux-kernel

On 06/05/2026 17:08, Boris Brezillon wrote:
> On Wed, 6 May 2026 15:35:18 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 04/05/2026 12:02, Boris Brezillon wrote:
>>> On Fri, 1 May 2026 15:20:17 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>>   
>>>> On 29/04/2026 10:38, Boris Brezillon wrote:  
>>>>> Rather than assuming an interrupt is always expected for request
>>>>> acks, temporarily enable the relevant interrupts when the polling-wait
>>>>> failed. This should hopefully reduce the number of interrupts the CPU
>>>>> has to process.
>>>>>
>>>>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>    
>>>>
>>>> It seems to work, although I'm lightly uneasy about this because I'm not
>>>> entirely sure whether the FW will immediately see the updates to
>>>> ack_irq_mask and therefore whether there's a possibility to miss an
>>>> event and be stuck waiting for the timeout.
>>>>
>>>> Memory models are not my strong point, OpenAI tells me the sequence
>>>> should be something like:
>>>>
>>>>   scoped_guard(spinlock_irqsave, lock) {
>>>>   	u32 ack_irq_mask = READ_ONCE(*ack_irq_mask_ptr);
>>>>
>>>>   	WRITE_ONCE(*ack_irq_mask_ptr, ack_irq_mask | req_mask);
>>>>   }  
>>>
>>> Is this really needed? In which situation would the compiler/CPU decide
>>> to re-order this read_update_modify sequence?  
>>
>> I think that's the AI being a bit overzealous, but in general WRITE_ONCE
>> is necessary to avoid some surprising effects. In theory the compiler
>> can decide to perform multiple writes if it's non-volatile. I.e. a
>> sequence like:
>>
>> 	u32 old_mask = *ack_irq_mask_ptr;
>> 	if (condition)
>> 		*ack_irq_mask_ptr = 0;
>> 	else
>> 		*ack_irq_mask_ptr |= req_mask;
>>
>> Can be 'optimised' to:
>>
>> 	u32 old_mask = *ack_irq_mask_ptr;
>> 	*ack_irq_mask_ptr = 0;
>> 	if (!condition)
>> 		*ack_irq_mask_ptr = old_mask | req_mask;
>>
>> In which the compiler has changed the (!condition) path to do two writes
>> one of which "should never be seen".
>>
>> Given that the compiler shouldn't be able to move any of the effects
>> outside of the scoped_guard(), and since there's only one operation then
>> I can't see how a compiler would screw it up - but the compiler is
>> technically free to do so.
> 
> Sure, I'm not saying read_modify_write is atomic per-se (even though
> I'd be surprised if the compiler wasn't generating instructions that
> are atomic in the end), but it is thread-safe because of the spinlock
> covering the read_modify_write op.

But one of the "threads" is the MCU which isn't using the spinlock -
which is why it's a problem if the compiler left the value in a 'random'
state even if it's all fixed up by the time the spinlock is released.

Like you say I would be very surprised if a compiler messed it up in
this case.

>>
>>>>
>>>>   /*
>>>>    * The FW interface can be mapped write-combine/Normal-NC.  
>>>
>>> I'm not too sure I see what the non-cached property has to do with it.
>>> If it was cached we would still need this memory barrier, and in
>>> addition, we'd need a cache flush if the FW is not IO-coherent.  
>>
>> I *think* the point the AI was making is that the memory isn't Device.
>> I.e. it's writeback and the write might not have completed.
> 
> Okay, get it now.
> 
>>
>>>> Make sure the
>>>>    * IRQ mask update is visible to the FW before sleeping waiting for
>>>> the IRQ.
>>>>    */
>>>>   wmb();
>>>>
>>>> Which seems plausible. But I've long ago learnt that plausible doesn't
>>>> mean much when dealing with memory models!  
>>>
>>> Yeah, I'm not too sure. I was honestly expecting the spinlock guard to
>>> act as a memory barrier already, but maybe it's not enough.  
>>
>> So logically it must be enough to enable other CPUs to see writes within
>> the spinlock - otherwise spinlocks would be completely broken on SMP. I
>> guess it should be sufficient for the GPU's firmware MCU to see.
> 
> For the record, this is currently mapped uncached on both the CPU and
> GPU side, because we don't have a way to describe the
> shareability properly with the current IOMMU flags.
> 
> So, my understanding was that the smp_wb() (DMB(ISH) on arm64) at
> the end of a spin_unlock(), would ensure proper store/load instruction
> ordering around this barrier, but that it would only wait for the
> content to reach the inner shareable domain before returning, not any
> further. But maybe I got that wrong from the start, and DMB(ISH)
> doesn't even start the transaction if the access is targeting uncached
> memory. In which case, AI is right, a full wmb() is needed, otherwise
> there's a chance we'll wait indefinitely because the update didn't make
> it to the FW interface in the first place.
> 
> Also, if that's broken for ack_irq_mask, it's also broken in other
> places...

So I've looked at the firmware implementation and I can say that there's
a race condition on the firmware side if we change the mask after
sending the START/RESUME request. The firmware currently samples the
mask before triggering the update to CSG_ACK and then uses the sampled
value to decide whether to trigger an interrupt.

This is all fine if the mask is set before the request (so nothing
broken with the current code), but we'd need a firmware change before we
can safely do what this patch was proposing. And of course we'd have to
get our heads round the barriers needed! ;)

Thanks,
Steve


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 08/10] drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks()
  2026-05-13 15:02           ` Steven Price
@ 2026-05-13 15:42             ` Boris Brezillon
  0 siblings, 0 replies; 7+ messages in thread
From: Boris Brezillon @ 2026-05-13 15:42 UTC (permalink / raw)
  To: Steven Price
  Cc: Liviu Dudau, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, dri-devel, linux-kernel

On Wed, 13 May 2026 16:02:11 +0100
Steven Price <steven.price@arm.com> wrote:

> >>>> It seems to work, although I'm lightly uneasy about this because I'm not
> >>>> entirely sure whether the FW will immediately see the updates to
> >>>> ack_irq_mask and therefore whether there's a possibility to miss an
> >>>> event and be stuck waiting for the timeout.
> >>>>
> >>>> Memory models are not my strong point, OpenAI tells me the sequence
> >>>> should be something like:
> >>>>
> >>>>   scoped_guard(spinlock_irqsave, lock) {
> >>>>   	u32 ack_irq_mask = READ_ONCE(*ack_irq_mask_ptr);
> >>>>
> >>>>   	WRITE_ONCE(*ack_irq_mask_ptr, ack_irq_mask | req_mask);
> >>>>   }    
> >>>
> >>> Is this really needed? In which situation would the compiler/CPU decide
> >>> to re-order this read_update_modify sequence?    
> >>
> >> I think that's the AI being a bit overzealous, but in general WRITE_ONCE
> >> is necessary to avoid some surprising effects. In theory the compiler
> >> can decide to perform multiple writes if it's non-volatile. I.e. a
> >> sequence like:
> >>
> >> 	u32 old_mask = *ack_irq_mask_ptr;
> >> 	if (condition)
> >> 		*ack_irq_mask_ptr = 0;
> >> 	else
> >> 		*ack_irq_mask_ptr |= req_mask;
> >>
> >> Can be 'optimised' to:
> >>
> >> 	u32 old_mask = *ack_irq_mask_ptr;
> >> 	*ack_irq_mask_ptr = 0;
> >> 	if (!condition)
> >> 		*ack_irq_mask_ptr = old_mask | req_mask;
> >>
> >> In which the compiler has changed the (!condition) path to do two writes
> >> one of which "should never be seen".
> >>
> >> Given that the compiler shouldn't be able to move any of the effects
> >> outside of the scoped_guard(), and since there's only one operation then
> >> I can't see how a compiler would screw it up - but the compiler is
> >> technically free to do so.  
> > 
> > Sure, I'm not saying read_modify_write is atomic per-se (even though
> > I'd be surprised if the compiler wasn't generating instructions that
> > are atomic in the end), but it is thread-safe because of the spinlock
> > covering the read_modify_write op.  
> 
> But one of the "threads" is the MCU which isn't using the spinlock -
> which is why it's a problem if the compiler left the value in a 'random'
> state even if it's all fixed up by the time the spinlock is released.

Okay, I see what you mean. I truly hope it's not random values, but if
it goes

	X -> 0 -> X | Y

or

	X -> 0 -> X & ~Y

that's already problematic, because we'd lose events.

> 
> Like you say I would be very surprised if a compiler messed it up in
> this case.

I'll add the READ/WRITE_ONCE() and add a comment to make sure we don't
forget why they are needed (in theory).

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-05-13 15:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260429-panthor-signal-from-irq-v1-0-4b92ae4142d2@collabora.com>
     [not found] ` <20260429-panthor-signal-from-irq-v1-7-4b92ae4142d2@collabora.com>
     [not found]   ` <e5c9763b-87c1-405d-9fda-f3f324d6bdfe@arm.com>
2026-05-04 15:00     ` [PATCH 07/10] drm/panthor: Automate CSG IRQ processing at group unbind time Boris Brezillon
     [not found] ` <20260429123607.7a8c7051@fedora>
2026-05-05  8:54   ` [PATCH 00/10] drm/panthor: Reduce dma_fence signalling latency Boris Brezillon
2026-05-05 16:12     ` Liviu Dudau
     [not found] ` <20260429-panthor-signal-from-irq-v1-8-4b92ae4142d2@collabora.com>
     [not found]   ` <446e9d1f-b6be-42fa-bd2b-f4fcbc130f70@arm.com>
     [not found]     ` <20260504130215.0222b3bd@fedora>
2026-05-06 14:35       ` [PATCH 08/10] drm/panthor: Automatically enable interrupts in panthor_fw_wait_acks() Steven Price
2026-05-06 16:08         ` Boris Brezillon
2026-05-13 15:02           ` Steven Price
2026-05-13 15:42             ` Boris Brezillon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox