[PATCH] drm/doc: Start documenting aspects specific to tile-based renderers


* [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
@ 2025-04-18 12:25 Boris Brezillon
  2025-04-23  9:41 ` Steven Price
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Boris Brezillon @ 2025-04-18 12:25 UTC (permalink / raw)
  To: dri-devel
  Cc: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe,
	lima, Qiang Yu, David Airlie, Simona Vetter, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, Dmitry Osipenko,
	Alyssa Rosenzweig, Christian Koenig, Faith Ekstrand, kernel

Tile-based GPUs come with a set of constraints that are not present
when immediate rendering is used. This new document tries to explain
the differences between tile/immediate rendering, the problems that
come with tilers, and how we plan to address them.

This is just a starting point; this document will be updated with new
material as we refine the libraries we add to help deal with
tilers, and have more drivers converted to follow the rules listed
here.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 Documentation/gpu/drm-tile-based-renderer.rst | 201 ++++++++++++++++++
 Documentation/gpu/index.rst                   |   1 +
 2 files changed, 202 insertions(+)
 create mode 100644 Documentation/gpu/drm-tile-based-renderer.rst

diff --git a/Documentation/gpu/drm-tile-based-renderer.rst b/Documentation/gpu/drm-tile-based-renderer.rst
new file mode 100644
index 000000000000..19b56b9476fc
--- /dev/null
+++ b/Documentation/gpu/drm-tile-based-renderer.rst
@@ -0,0 +1,201 @@
+==================================================
+Infrastructure and tricks for tile-based renderers
+==================================================
+
+Many embedded GPUs use tile-based rendering instead of immediate
+rendering. This mode of rendering has various implications that we try to
+document here along with some hints about how to deal with some of the
+problems that surface with tile-based renderers.
+
+The main idea behind tile-based rendering is to batch processing of nearby
+pixels during the fragment shading phase to limit the traffic on the memory
+bus by making optimal use of the various caches present in the GPU. Unlike
+immediate rendering, where primitives generated by the geometry stages of
+the pipeline are directly consumed by the fragment stage, tilers have to
+record primitives in bins that are somehow attached to tiles (the
+granularity of the tile being GPU-specific). This data is usually stored
+in memory, and pulled back when the fragment stage is executed.
+
+This approach has several issues that most drivers need to handle somehow,
+sometimes with a bit of help from the hardware.
+
+Issues at hand
+==============
+
+Tiler memory
+------------
+
+The amount of memory needed to store primitive data and metadata is hard
+to guess ahead of time, because it depends on various parameters that are
+not in control of the UMD (User Mode Driver). Here is a non-exhaustive list
+of things that may complicate the calculation of the memory needed to store
+primitive information:
+
+- Primitive distribution across tiles is hard to guess: the binning process
+  is about assigning each primitive to the set of tiles it covers. The more
+  tiles are covered, the more memory is needed to record those. We can estimate
+  the worst case scenario by assuming all primitives will cover all tiles but
+  this will lead to over-allocation most of the time, which is not good.
+- Indirect draws: the number of vertices comes from a GPU buffer that might
+  be filled by previous GPU compute jobs. This means we only know the number
+  of vertices when the GPU executes the draw, and thus can't guess how much
+  memory will be needed for those and allocate a GPU buffer that's big enough
+  to hold those.
+- Complex geometry pipelines: if you throw geometry/tessellation/mesh shaders
+  in, it gets even trickier to guess the number of primitives from the number
+  of vertices passed to the vertex shader.
+
+For all these reasons, the tiler usually allocates memory dynamically, but
+DRM has not been designed with this use case in mind. Drivers will address
+these problems differently based on the functionality provided by their
+hardware, but all of them almost certainly have to deal with this somehow.
+
+The easy solution is to statically allocate a huge buffer to pick from when
+tiler memory is needed, and fail the rendering when this buffer is depleted.
+Some drivers try to be smarter to avoid reserving a lot of memory upfront.
+Instead, they start with an almost empty buffer and progressively populate it
+when the GPU faults on an address sitting in the tiler buffer range. This
+works okay most of the time but it falls short when the system is under
+memory pressure, because the memory request is not guaranteed to be satisfied.
+In that case, the driver either fails the rendering, or, if the hardware
+allows it, it tries to flush the primitives that have been processed and
+triggers a fragment job that will consume those primitives and free up some
+memory to be recycled and make further progress on the tiling step. This is
+usually referred to as partial/incremental rendering (other names exist).
+
+Compute-based emulation of geometry stages
+------------------------------------------
+
+More and more hardware vendors don't bother providing hardware support for
+geometry/tessellation/mesh stages, since those can be emulated with compute
+shaders. But the same problem we have with tiler memory exists with those
+intermediate compute-emulated stages, because transient data shared between
+stages needs to be stored in memory for the next stage to consume, and this
+bubbles up until the tiling stage is reached, because ultimately, what the
+tiling stage will need to process is a set of vertices it can turn into
+primitives, as would happen if the application had emulated the geometry,
+tessellation or mesh stages with compute.
+
+Unlike tiling, where the hardware can provide a fallback to recycle memory,
+there is no way the intermediate primitives can be flushed up to the framebuffer,
+because it's a purely software emulation here. This being said, the same
+"start small, grow on-demand" approach can be applied to avoid over-allocating
+memory upfront.
+
+On-demand memory allocation
+---------------------------
+
+As explained in previous sections, on-demand allocation is a central piece
+of tile-based renderers if we don't want to over-allocate, which is bad for
+integrated GPUs that share their memory with the rest of the system.
+
+The problem with on-demand allocation is that suddenly, GPU accesses can
+fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
+were not designed for that. Those assume that buffer memory is
+populated at job submission time, and will stay around for the job lifetime.
+If a GPU fault happens, it's the user's fault, and the context can be flagged
+unusable. On-demand allocation is usually implemented as allocation-on-fault,
+and the dma_fence contract prevents us from blocking on allocations in that
+path (GPU fault handlers are in the dma-fence signalling path). So now we
+have GPU allocations that will be satisfied most of the time, but can fail
+occasionally. And this is not great, because an allocation failure might
+kill the user GPU context (VK_DEVICE_LOST in Vulkan terms), without the
+application having done anything wrong. So, we need something that makes those
+allocation failures rare enough that most users won't experience them, and
+we need a fallback for when this happens to try to avoid them on the next
+user attempt to submit a graphics job.
+
+The plan
+========
+
+On-demand allocation rules
+--------------------------
+
+First of all, all allocations happening in the fault handler path must
+use GFP_NOWAIT. With this flag, low-hanging fruit can be picked
+(clean FS cache will be flushed for instance), but an error will be
+returned if no memory is readily available. GFP_NOWAIT will also trigger
+background reclaim to hopefully free up some memory for our future
+requests.
+
+How to deal with allocation failures
+------------------------------------
+
+The first trick here is to try to guess approximately how much memory
+will be needed, and force-populate on-demand buffers with that amount
+of memory when the job is started. It's not about guessing the worst
+case scenario here, but more the most likely case, probably with a
+reasonable margin, so that the job is likely to succeed when this amount
+of memory is provided by the KMD.
+
+The second trick to try to avoid over-allocation, even with this
+sub-optimistic estimate, is to have a shared pool of memory that can be
+used by all GPU contexts when they need tiler/geometry memory. This
+implies returning chunks to this pool at some point, so other contexts
+can re-use those. Details about what this global memory pool implementation
+would look like are currently undefined, but it needs to be filled to
+guarantee that pre-allocation requests for on-demand buffers used by a
+GPU job can be satisfied in the fault handler path.
+
+As a last resort, we can try to allocate with GFP_ATOMIC if everything
+else fails, but this is a dangerous game, because we would be stealing
+memory from the atomic reserve, so it's not entirely clear if this is
+better than failing the job at this point.
+
+Ideas on how to make allocation failures decrease over time
+-----------------------------------------------------------
+
+When an on-demand allocation fails and the hardware doesn't have a
+flush-primitives fallback, we usually can't do much apart from failing the
+whole job. But it's important to try to avoid future allocation failures
+when the application creates a new context. There's no clear path for
+how to guess the actual size to force-populate on the next attempt. One
+option is to have a simple heuristic, like doubling the current resident size,
+but this has the downside of potentially taking a few attempts before reaching
+the stability point. Another option is to repeatedly map a dummy page at the
+fault addresses, so we can get a sense of how much memory was needed for this
+particular job.
+
+Once userspace gets an idea of what the application needs, it should force
+this to be the minimum populated size on the next context creation. For GL
+drivers, the UMD is in control of the context recreation, so it can easily
+record the next buffer size to use. For Vulkan applications, something should
+be recorded to track that, maybe in the form of some implicit dri-conf
+database that can override the explicit dri-conf.
+
+Various implementation details have been discussed
+`here <https://lore.kernel.org/dri-devel/Z_kEjFjmsumfmbfM@phenom.ffwll.local/>`_
+but nothing has been decided yet.
+
+DRM infrastructure changes for tile-based renderers
+===================================================
+
+As seen in previous sections, allocation for tile-based GPUs can be tricky,
+so we really want to add as many helpers as we can, and document how these
+helpers must be used. This section tries to list the various components and
+how we expect them to work.
+
+GEM SHMEM sparse backing
+------------------------
+
+On-demand allocation is not something the GEM layer has been designed for.
+The idea is to extend the existing GEM and GEM SHMEM helpers to cover the
+concept of sparse backing.
+
+A solution has been proposed
+`here <https://lore.kernel.org/dri-devel/20250404092634.2968115-1-boris.brezillon@collabora.com/>`_.
+
+Fault injection mechanism
+-------------------------
+
+In order to easily test/validate the on-demand allocation logic, we need
+a way to fake GPU faults and trigger on-demand allocation. We also need
+to fake allocation failures at various points.
+
+This part is likely to be driver-specific, and should probably involve
+new debugfs knobs.
+
+Global memory pool for on-demand allocation
+-------------------------------------------
+
+TBD.
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index 7dcb15850afd..186917524854 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -14,6 +14,7 @@ GPU Driver Developer's Guide
    driver-uapi
    drm-client
    drm-compute
+   drm-tile-based-renderer
    drivers
    backlight
    vga-switcheroo
-- 
2.49.0

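For illustration only, and not part of the patch: the "force-populate, then
grow on fault, then double on failure" policy described in the document can
be modelled in plain userspace C. Everything below (struct page_pool, struct
ondemand_heap and the helper names) is hypothetical and is not an existing
DRM or GEM interface. pool_try_take() merely stands in for a non-blocking
GFP_NOWAIT-style allocation that fails immediately when nothing is readily
available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page source; not a kernel API. */
struct page_pool {
	size_t free_pages;
};

/* Hypothetical model of an on-demand (sparsely backed) tiler heap. */
struct ondemand_heap {
	size_t max_pages;      /* size of the reserved GPU VA range, in pages */
	size_t resident_pages; /* pages that currently have backing memory */
};

/*
 * Non-blocking allocation attempt: succeed only if the pages are readily
 * available, never wait. Stands in for a GFP_NOWAIT allocation.
 */
static bool pool_try_take(struct page_pool *pool, size_t npages)
{
	if (pool->free_pages < npages)
		return false;
	pool->free_pages -= npages;
	return true;
}

/*
 * Force-populate the heap with the estimated size before the job starts.
 * In a real driver this step runs outside the fault path; the model uses
 * the same non-blocking helper only to stay small.
 */
static bool heap_prepopulate(struct ondemand_heap *heap,
			     struct page_pool *pool, size_t estimate)
{
	assert(heap->resident_pages == 0);
	if (estimate > heap->max_pages)
		estimate = heap->max_pages;
	if (!pool_try_take(pool, estimate))
		return false;
	heap->resident_pages = estimate;
	return true;
}

/* Fault path: try to back one more page, and fail rather than wait. */
static bool heap_fault_grow(struct ondemand_heap *heap, struct page_pool *pool)
{
	if (heap->resident_pages >= heap->max_pages)
		return false;
	if (!pool_try_take(pool, 1))
		return false;
	heap->resident_pages++;
	return true;
}

/* "Double the current resident size" heuristic, clamped to the VA range. */
static size_t next_estimate(const struct ondemand_heap *heap)
{
	size_t next = heap->resident_pages ? heap->resident_pages * 2 : 1;

	return next > heap->max_pages ? heap->max_pages : next;
}
```

When heap_fault_grow() returns false, the model corresponds to the case
where the job fails and userspace would use next_estimate() as the minimum
populated size for the next context.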

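Also for illustration only: the "repeatedly map a dummy page at the fault
addresses" idea from the document can be modelled the same way. The sketch
below counts how many distinct unbacked pages a failed job touched, which
gives a size to force-populate on the next attempt. struct probe_heap,
PROBE_HEAP_PAGES and the helpers are hypothetical names, not driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PROBE_HEAP_PAGES 64 /* hypothetical size of the heap VA range */

/* Hypothetical per-page state of a heap that is being probed. */
struct probe_heap {
	bool backed[PROBE_HEAP_PAGES]; /* page has real backing memory */
	bool dummy[PROBE_HEAP_PAGES];  /* page aliases the shared dummy page */
	size_t backed_pages;
	size_t dummy_pages;
};

/* Give a page real backing (what normal population would do). */
static void probe_back_page(struct probe_heap *heap, size_t idx)
{
	assert(idx < PROBE_HEAP_PAGES);
	if (!heap->backed[idx]) {
		heap->backed[idx] = true;
		heap->backed_pages++;
	}
}

/*
 * Fault handler once real allocation has failed: alias the dummy page at
 * the faulting index and count it once. Returns false for an address that
 * is outside the heap range, which is a genuine fault.
 */
static bool probe_fault(struct probe_heap *heap, size_t idx)
{
	if (idx >= PROBE_HEAP_PAGES)
		return false;
	if (!heap->backed[idx] && !heap->dummy[idx]) {
		heap->dummy[idx] = true;
		heap->dummy_pages++;
	}
	return true;
}

/* Pages the job would have needed if every access had real backing. */
static size_t probe_needed_pages(const struct probe_heap *heap)
{
	return heap->backed_pages + heap->dummy_pages;
}
```

Because every probed page aliases the same dummy page, the output of such a
job is garbage by construction; the count is only useful as a heuristic.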

* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-18 12:25 [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers Boris Brezillon
@ 2025-04-23  9:41 ` Steven Price
  2025-04-28  8:00   ` Boris Brezillon
  2025-04-23 14:47 ` Alyssa Rosenzweig
  2025-04-28  6:55 ` Iago Toral
  2 siblings, 1 reply; 9+ messages in thread
From: Steven Price @ 2025-04-23  9:41 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Liviu Dudau, Adrián Larumbe, lima, Qiang Yu, David Airlie,
	Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Dmitry Osipenko, Alyssa Rosenzweig,
	Christian Koenig, Faith Ekstrand, kernel

On 18/04/2025 13:25, Boris Brezillon wrote:
> Tile-based GPUs come with a set of constraints that are not present
> when immediate rendering is used. This new document tries to explain
> the differences between tile/immediate rendering, the problems that
> come with tilers, and how we plan to address them.
> 
> This is just a started point, this document will be updated with new
> materials as we refine the libraries we add to help deal with
> tilers, and have more drivers converted to follow the rules listed
> here.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Seems like a good starting point, a few minor comments below. We really
need some non-Mali input too though.

> ---
>  Documentation/gpu/drm-tile-based-renderer.rst | 201 ++++++++++++++++++
>  Documentation/gpu/index.rst                   |   1 +
>  2 files changed, 202 insertions(+)
>  create mode 100644 Documentation/gpu/drm-tile-based-renderer.rst
> 
> diff --git a/Documentation/gpu/drm-tile-based-renderer.rst b/Documentation/gpu/drm-tile-based-renderer.rst
> new file mode 100644
> index 000000000000..19b56b9476fc
> --- /dev/null
> +++ b/Documentation/gpu/drm-tile-based-renderer.rst
> @@ -0,0 +1,201 @@
> +==================================================
> +Infrastructure and tricks for tile-based renderers
> +==================================================
> +
> +All lot of embedded GPUs are using tile-based rendering instead of immediate
> +rendering. This mode of rendering has various implications that we try to
> +document here along with some hints about how to deal with some of the
> +problems that surface with tile-based renderers.
> +
> +The main idea behind tile-based rendering is to batch processing of nearby
> +pixels during the fragment shading phase to limit the traffic on the memory
> +bus by making optimal use of the various caches present in the GPU. Unlike
> +immediate rendering, where primitives generated by the geometry stages of
> +the pipeline are directly consumed by the fragment stage, tilers have to
> +record primitives in bins that are somehow attached to tiles (the
> +granularity of the tile being GPU-specific). This data is usually stored
> +in memory, and pulled back when the fragment stage is executed.
> +
> +This approach has several issues that most drivers need to handle somehow,
> +sometimes with a bit of help from the hardware.
> +
> +Issues at hand
> +==============
> +
> +Tiler memory
> +------------
> +
> +The amount of memory needed to store primitives data and metadata is hard
> +to guess ahead of time, because it depends on various parameters that are
> +not in control of the UMD (UserMode Driver). Here is a non-exhaustive list
> +of things that may complicate the calculation of the memory needed to store
> +primitive information:
> +
> +- Primitives distribution across tiles is hard to guess: the binning process
> +  is about assigning each primitive to the set tiles it covers. The more tiles
> +  being covered the more memory is needed to record those. We can estimate
> +  the worst case scenario by assuming all primitives will cover all tiles but
> +  this will lead to over-allocation most of the time, which is not good
> +- Indirect draws: the number of vertices comes from a GPU buffer that might
> +  be filled by previous GPU compute jobs. This means we only know the number
> +  of vertices when the GPU executes the draw, and thus can't guess how much
> +  memory will be needed for those and allocate a GPU buffer that's big enough
> +  to hold those
> +- Complex geometry pipelines: if you throw geometry/tesselation/mesh shaders
> +  it gets even trickier to guess the number of primitives from the number
> +  of vertices passed to the vertex shader.
> +
> +For all these reasons, the tiler usually allocates memory dynamically, but
> +DRM has not been designed with this use case in mind. Drivers will address
> +these problems differently based on the functionality provided by their
> +hardware, but all of them almost certainly have to deal with this somehow.
> +
> +The easy solution is to statically allocate a huge buffer to pick from when
> +tiler memory is needed, and fail the rendering when this buffer is depleted.
> +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> +Instead, they start with an almost empty buffer and progressively populate it
> +when the GPU faults on an address sitting in the tiler buffer range. This
> +works okay most of the time but it falls short when the system is under
> +memory pressure, because the memory request is not guaranteed to be satisfied.
> +In that case, the driver either fails the rendering, or, if the hardware
> +allows it, it tries to flush the primitives that have been processed and
> +triggers a fragment job that will consume those primitives and free up some
> +memory to be recycled and make further progress on the tiling step. This is
> +usually referred as partial/incremental rendering (it might have other names).
> +
> +Compute based emulation of geometry stages
> +------------------------------------------
> +
> +More and more hardware vendors don't bother providing hardware support for
> +geometry/tesselation/mesh stages, since those can be emulated with compute
> +shaders. But the same problem we have with tiler memory exists with those
> +intermediate compute-emulated stages, because transient data shared between
> +stages need to be stored in memory for the next stage to consume, and this
> +bubbles up until the tiling stage is reached, because ultimately, what the
> +tiling stage will need to process is a set of vertices it can turn into
> +primitives, like would happen if the application had emulated the geometry,
> +tesselation or mesh stages with compute.
> +
> +Unlike tiling, where the hardware can provide a fallback to recycle memory,
> +there is no way the intermediate primitives can be flushed up to the framebuffer,
> +because it's a purely software emulation here. This being said, the same
> +"start small, grow on-demand" can be applied to avoid over-allocating memory
> +upfront.
> +
> +On-demand memory allocation
> +---------------------------
> +
> +As explained in previous sections, on-demand allocation is a central piece
> +of tile-based renderer if we don't want to over-allocate, which is bad for
> +integrated GPUs who share their memory with the rest of the system.
> +
> +The problem with on-demand allocation is that suddenly, GPU accesses can
> +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> +were not designed for that. Those are assuming that buffers memory is

NIT: s/buffers/buffer's/

> +populated at job submission time, and will stay around for the job lifetime.
> +If a GPU fault happens, it's the user fault, and the context can be flagged

NIT: s/user/user's/

> +unusable. On-demand allocation is usually implemented as allocation-on-fault,
> +and the dma_fence contract prevents us from blocking on allocations in that
> +path (GPU fault handlers are in the dma-fence signalling path). So now we
> +have GPU allocations that will be satisfied most of the time, but can fail
> +occasionally. And this is not great, because an allocation failure might
> +kill the user GPU context (VK_DEVICE_LOST in Vulkan terms), without the
> +application having dong anything wrong. So, we need something that makes those
> +allocation failures rare enough that most users won't experience them, and
> +we need a fallback for when this happens to try to avoid them on the next
> +user attempt to submit a graphics job.
> +
> +The plan
> +========
> +
> +On-demand allocation rules
> +--------------------------
> +
> +First of all, all allocations happening in the fault handler path must
> +be using GFP_NOWAIT. With this flag, low-hanging fruit can be picked
> +(clean FS cache will be flushed for instance), but an error will be
> +returned if no memory is readily available. GFP_NOWAIT will also trigger
> +background reclaim to hopefully free-up some memory for our future
> +requests.
> +
> +How to deal with allocation failures
> +------------------------------------
> +
> +The first trick here is to try to guess approximately how much memory
> +will be needed, and force-populate on-demand buffers with that amount
> +of memory when the job is started. It's not about guessing the worst
> +case scenario here, but more the most likely case, probably with a
> +reasonable margin, so that the job is likely to succeed when this amount
> +of memory is provided by the KMD.
> +
> +The second trick to try to avoid over-allocation, even with this
> +sub-optimistic estimate, is to have a shared pool of memory that can be
> +used by all GPU contexts when they need tiler/geometry memory. This
> +implies returning chunks to this pool at some point, so other contexts
> +can re-use those. Details about what this global memory pool implementation
> +would look like is currently undefined, but it needs to be filled to
> +guarantee that pre-allocation requests for on-demand buffers used by a
> +GPU job can be satisfied in the fault handler path.

Note one thing I haven't seen discussed is that across multiple contexts
it's possible to prioritise jobs that free memory. E.g. a fragment job
can be run to free up memory from a tiler heap, allowing pages to be
returned to the global pool. This might imply a uAPI extension allowing
a fragment job to automatically drop memory from a BO so that the kernel
can have confidence that it will actually free up memory.

Sadly I don't think it's plausible to wait in the fault handler for a
fragment job to complete to free up memory - so the best we can do here
is postpone *starting* a vertex+tiler job if we're short on memory and
have fragment jobs to run.

> +
> +As a last resort, we can try to allocate with GFP_ATOMIC if everything
> +else fails, but this is a dangerous game, because we would be stealing
> +memory from the atomic reserve, so it's not entirely clear if this is
> +better than failing the job at this point.
> +
> +Ideas on how to make allocation failures decrease over time
> +-----------------------------------------------------------
> +
> +When an on-demand allocation fails and the hardware doesn't have a
> +flush-primitives fallback, we usually can't do much apart from failing the
> +whole job. But it's important to try to avoid future allocation failures
> +when the application creates a new context. There's no clear path for
> +how to guess the actual size to force-populate on the next attempt. One
> +option is to have a simple heuristics, like double the current resident size,
> +but this has the downside of potentially taking a few attempts before reaching
> +the stability point. Another option is to repeatedly map a dummy page at the
> +fault addresses, so we can get a sense of how much memory was needed for this
> +particular job.

We'd have to double check that we don't cause extra problems with an
aliasing heap like that. The tiler might attempt to read back data which
could cause 'interesting' errors if it's getting clobbered. Given this
is just a heuristic it might be ok, but it definitely needs more research.

> +
> +Once userspace gets an idea of what the application needs, it should force
> +this to be the minimum populated size on the next context creation. For GL
> +drivers, the UMD is in control of the context recreation, so it can easily
> +record the next buffer size to use. For Vulkan applications, something should
> +be recorded to track that, maybe in the form of some implicit dri-conf
> +database that can overload the explicit dri-conf.
> +
> +Various implementation details have been discussed
> +`here <https://lore.kernel.org/dri-devel/Z_kEjFjmsumfmbfM@phenom.ffwll.local/>`_
> +but nothing has been decided yet.
> +
> +DRM infrastructure changes for tile-based renderers
> +===================================================
> +
> +As seen in previous sections, allocation for tile-based GPUs can be tricky,
> +so we really want to add as much facility as we can, and document how these
> +helpers must be used. This section tries to list the various components and
> +how we expect them to work.
> +
> +GEM SHMEM sparse backing
> +------------------------
> +
> +On-demand allocation is not something the GEM layer has been designed for.
> +The idea is to extend the existing GEM and GEM SHMEM helpers to cover the
> +concept of sparse backing.
> +
> +A solution has been proposed
> +`here<https://lore.kernel.org/dri-devel/20250404092634.2968115-1-boris.brezillon@collabora.com/>`_
> +
> +Fault injection mechanism
> +-------------------------
> +
> +In order to easily test/validate the on-demand allocation logic, we need
> +a way to fake GPU faults and trigger on-demand allocation. We also need
> +to fake allocation failures are various points.

NIT: s/are/at/

Thanks,
Steve

> +
> +This part is likely to be driver specific, and should probably involve
> +new debugfs knobs.
> +
> +Global memory pool for on-demand allocation
> +-------------------------------------------
> +
> +TBD.
> diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
> index 7dcb15850afd..186917524854 100644
> --- a/Documentation/gpu/index.rst
> +++ b/Documentation/gpu/index.rst
> @@ -14,6 +14,7 @@ GPU Driver Developer's Guide
>     driver-uapi
>     drm-client
>     drm-compute
> +   drm-tile-based-renderer
>     drivers
>     backlight
>     vga-switcheroo



* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-18 12:25 [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers Boris Brezillon
  2025-04-23  9:41 ` Steven Price
@ 2025-04-23 14:47 ` Alyssa Rosenzweig
  2025-04-28  7:42   ` Boris Brezillon
  2025-04-28  6:55 ` Iago Toral
  2 siblings, 1 reply; 9+ messages in thread
From: Alyssa Rosenzweig @ 2025-04-23 14:47 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: dri-devel, Steven Price, Liviu Dudau, Adrián Larumbe, lima,
	Qiang Yu, David Airlie, Simona Vetter, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, Dmitry Osipenko,
	Christian Koenig, Faith Ekstrand, kernel

Steven wanted non-Mali eyes, so with my Imaginapple hat on...

> +All lot of embedded GPUs are using tile-based rendering instead of immediate

s/All lot of/Many/

> +- Complex geometry pipelines: if you throw geometry/tesselation/mesh shaders
> +  it gets even trickier to guess the number of primitives from the number
> +  of vertices passed to the vertex shader.

Tessellation, yes. Geometry shaders, no. Geometry shaders must declare
the maximum # of vertices they output, so by themselves geometry shaders
don't make the problem much harder - unless you do an indirect draw with
a GS, the emulated GS draw can still be direct.

But I guess "even trickier" is accurate still...

> +For all these reasons, the tiler usually allocates memory dynamically, but
> +DRM has not been designed with this use case in mind. Drivers will address
> +these problems differently based on the functionality provided by their
> +hardware, but all of them almost certainly have to deal with this somehow.
> +
> +The easy solution is to statically allocate a huge buffer to pick from when
> +tiler memory is needed, and fail the rendering when this buffer is depleted.
> +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> +Instead, they start with an almost empty buffer and progressively populate it
> +when the GPU faults on an address sitting in the tiler buffer range.

This all seems very Mali-centric. Imaginapple has had partial renders
since forever.

> +More and more hardware vendors don't bother providing hardware support for
> +geometry/tesselation/mesh stages

I wouldn't say that... Mali is the only relevant hardware that has *no*
hardware support for any of geom/tess/mesh. All the desktop vendors +
Qualcomm have full hardware support, Apple has hardware mesh on m3+,
Broadcom has geom/tess, and I think Imagination has geom/tess on certain
parts.

And I don't know of any vendors (except possibly Imagination) that
removed hardware support, because it turns out having hardware support
for core API features is a good thing actually. It doesn't need to look
like "put the API in hardware" but some sort of hardware acceleration
(like AMD's NGG) solves the problems in this doc and more.

So... just "Some hardware vendors omit hardware support for
geometry/tessellation/mesh stages".

> This being said, the same +"start small, grow on-demand" can be
> applied to avoid over-allocating memory +upfront.

[citation needed], if we overflow that buffer we're screwed and hit
device_loss, and that's unacceptable in normal usage.

> +The problem with on-demand allocation is that suddenly, GPU accesses can
> +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> +were not designed for that.

It's not the common DRM scheduler that causes this problem, it
fundamentally violates the kernel-wide dma_fence contract: signalling a
dma-fence must not block on a fallible memory allocation, full stop.
Nothing we do in DRM will change that contract (and it's not obvious to
me that kbase is actually correct in all the corner cases).

> +The second trick to try to avoid over-allocation, even with this
> +sub-optimistic estimate, is to have a shared pool of memory that can be
> +used by all GPU contexts when they need tiler/geometry memory. This
> +implies returning chunks to this pool at some point, so other contexts
> +can re-use those. Details about what this global memory pool implementation
> +would look like is currently undefined, but it needs to be filled to
> +guarantee that pre-allocation requests for on-demand buffers used by a
> +GPU job can be satisfied in the fault handler path.

How do we clean memory between contexts? This is a security issue.
Either we need to pin physical pages to single processes, or we need to
zero pages when returning pages to the shared pool. Zeroing on the
CPU side is an option but the performance hit may be unacceptable
depending how it's implemented. Alternatively we can require userspace to
clean up after itself on the gpu (with a compute shader) but that's
going to burn memory b/w in the happy path where we have lots of memory
free.

> For GL +drivers, the UMD is in control of the context recreation, so
> it can easily +record the next buffer size to use.

I'm /really/ skeptical of this. Once we hit a device loss in GL, it's
game over, and I'm skeptical of any plan that expects userspace to
magically recover, especially as soon as side effects are introduced
(including transform feedback which is already gles3.0 required).

* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-18 12:25 [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers Boris Brezillon
  2025-04-23  9:41 ` Steven Price
  2025-04-23 14:47 ` Alyssa Rosenzweig
@ 2025-04-28  6:55 ` Iago Toral
  2025-04-28  8:13   ` Boris Brezillon
  2 siblings, 1 reply; 9+ messages in thread
From: Iago Toral @ 2025-04-28  6:55 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Steven Price, Liviu Dudau, Adrián Larumbe, lima, Qiang Yu,
	David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Dmitry Osipenko, Alyssa Rosenzweig,
	Christian Koenig, Faith Ekstrand, kernel

Hi,

Pitching in to describe the situation for v3d:

On Fri, 18-04-2025 at 14:25 +0200, Boris Brezillon wrote:

(...)
> +For all these reasons, the tiler usually allocates memory dynamically, but
> +DRM has not been designed with this use case in mind. Drivers will address
> +these problems differently based on the functionality provided by their
> +hardware, but all of them almost certainly have to deal with this somehow.
> +
> +The easy solution is to statically allocate a huge buffer to pick from when
> +tiler memory is needed, and fail the rendering when this buffer is depleted.
> +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> +Instead, they start with an almost empty buffer and progressively populate it
> +when the GPU faults on an address sitting in the tiler buffer range. This
> +works okay most of the time but it falls short when the system is under
> +memory pressure, because the memory request is not guaranteed to be satisfied.
> +In that case, the driver either fails the rendering, or, if the hardware
> +allows it, it tries to flush the primitives that have been processed and
> +triggers a fragment job that will consume those primitives and free up some
> +memory to be recycled and make further progress on the tiling step. This is
> +usually referred as partial/incremental rendering (it might have other names).

In our case, user space allocates some memory up front hoping to avoid
running out of memory during tiling, but if the tiler does run out of
memory we get an interrupt and the tiler hw will stop and wait for the
kernel driver to write back an address where more memory is made
available (via register write), which we will try to allocate at that
point. This can happen any number of times until the tiler job
completes.

I am not sure that we are handling allocation failure on this path 
nicely at the moment since we don't try to fail and cancel the job,
that's maybe something we should fix, although I don't personally
recall any reports of us running into this situation either.
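
For illustration, the flow described above could be modelled in user space
roughly as follows. This is only a sketch under stated assumptions: every
name below (sys_mem, tiler_job, try_alloc_chunk_nowait, tiler_oom_irq) is
hypothetical and none of it is actual v3d code. It shows the out-of-memory
interrupt being served by a non-blocking allocation, with the job being
failed rather than stalled when no memory is readily available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: none of these names exist in the v3d driver. */
struct sys_mem { size_t free_chunks; };   /* memory available without reclaim */
struct tiler_job { size_t chunks; bool failed; };

/* Stand-in for a GFP_NOWAIT-style allocation: never waits, may fail. */
static bool try_alloc_chunk_nowait(struct sys_mem *mem)
{
	if (mem->free_chunks == 0)
		return false;
	mem->free_chunks--;
	return true;
}

/*
 * Stand-in for the "tiler needs more memory" interrupt handler.
 * Returns true if the tiler can resume, false if the job was failed.
 */
static bool tiler_oom_irq(struct sys_mem *mem, struct tiler_job *job)
{
	assert(!job->failed);	/* a failed job must not ask for more memory */

	if (!try_alloc_chunk_nowait(mem)) {
		job->failed = true;	/* fail and cancel instead of blocking */
		return false;
	}
	job->chunks++;	/* real hw would be given the new address here */
	return true;
}
```

The point of the model is only that the handler has two outcomes, resume or
fail, and never a third "wait for memory" outcome.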


> +
> +Compute based emulation of geometry stages
> +------------------------------------------
> +
> +More and more hardware vendors don't bother providing hardware support for
> +geometry/tesselation/mesh stages, since those can be emulated with compute
> +shaders. But the same problem we have with tiler memory exists with those
> +intermediate compute-emulated stages, because transient data shared between
> +stages need to be stored in memory for the next stage to consume, and this
> +bubbles up until the tiling stage is reached, because ultimately, what the
> +tiling stage will need to process is a set of vertices it can turn into
> +primitives, like would happen if the application had emulated the geometry,
> +tesselation or mesh stages with compute.
> +
> +Unlike tiling, where the hardware can provide a fallback to recycle memory,
> +there is no way the intermediate primitives can be flushed up to the framebuffer,
> +because it's a purely software emulation here. This being said, the same
> +"start small, grow on-demand" can be applied to avoid over-allocating memory
> +upfront.

FWIW, v3d has geometry and tessellation hardware.


> +
> +On-demand memory allocation
> +---------------------------
> +
> +As explained in previous sections, on-demand allocation is a central piece
> +of tile-based renderer if we don't want to over-allocate, which is bad for
> +integrated GPUs who share their memory with the rest of the system.
> +
> +The problem with on-demand allocation is that suddenly, GPU accesses can
> +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> +were not designed for that. Those are assuming that buffers memory is
> +populated at job submission time, and will stay around for the job lifetime.
> +If a GPU fault happens, it's the user fault, and the context can be flagged
> +unusable. On-demand allocation is usually implemented as allocation-on-fault,
> +and the dma_fence contract prevents us from blocking on allocations in that
> +path (GPU fault handlers are in the dma-fence signalling path).

As I described above, v3d is not quite an allocation-on-fault mechanism
but rather, we get a dedicated interrupt from the hw when it needs more
memory, which I believe happens a bit before it completely runs out of
memory actually. Maybe that changes the picture since we don't exactly
use a fault handler?

Iago

* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-23 14:47 ` Alyssa Rosenzweig
@ 2025-04-28  7:42   ` Boris Brezillon
  2025-04-28 13:45     ` Alyssa Rosenzweig
  0 siblings, 1 reply; 9+ messages in thread
From: Boris Brezillon @ 2025-04-28  7:42 UTC (permalink / raw)
  To: Alyssa Rosenzweig
  Cc: dri-devel, Steven Price, Liviu Dudau, Adrián Larumbe, lima,
	Qiang Yu, David Airlie, Simona Vetter, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, Dmitry Osipenko,
	Christian Koenig, Faith Ekstrand, kernel

Hi Alyssa,

On Wed, 23 Apr 2025 10:47:21 -0400
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> Steven wanted non-Mali eyes, so with my Imaginapple hat on...
> 
> > +All lot of embedded GPUs are using tile-based rendering instead of immediate  
> 
> s/All lot of/Many/

Will change that.

> 
> > +- Complex geometry pipelines: if you throw geometry/tesselation/mesh shaders
> > +  it gets even trickier to guess the number of primitives from the number
> > +  of vertices passed to the vertex shader.  
> 
> Tessellation, yes. Geometry shaders, no. Geometry shaders must declare
> the maximum # of vertices they output, so by themselves geometry shaders
> don't make the problem much harder - unless you do an indirect draw with
> a GS, the emulated GS draw can still be direct.

Right, GS is simpler to cap, so I'll distinguish them from tessellation
shaders.

> 
> But I guess "even trickier" is accurate still...
> 
> > +For all these reasons, the tiler usually allocates memory dynamically, but
> > +DRM has not been designed with this use case in mind. Drivers will address
> > +these problems differently based on the functionality provided by their
> > +hardware, but all of them almost certainly have to deal with this somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick from when
> > +tiler memory is needed, and fail the rendering when this buffer is depleted.
> > +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> > +Instead, they start with an almost empty buffer and progressively populate it
> > +when the GPU faults on an address sitting in the tiler buffer range.  
> 
> This all seems very Mali-centric.

How surprising :-).

> Imaginapple has had partial renders
> since forever.

Yep, I've noticed that Imagination had partial renders too, but didn't
know it had been present from the start.

Anyway, I'll make it clear that all sane HW should have a partial
render fallback, but that it's sometimes not the case, and in that
case, we have to work around this HW limitation.

> 
> > +More and more hardware vendors don't bother providing hardware support for
> > +geometry/tesselation/mesh stages  
> 
> I wouldn't say that... Mali is the only relevant hardware that has *no*
> hardware support for any of geom/tess/mesh. All the desktop vendors +
> Qualcomm have full hardware support, Apple has hardware mesh on m3+,
> Broadcom has geom/tess, and I think Imagination has geom/tess on certain
> parts.

Okay. I'll name Arm/Mali specifically here then.

> 
> And I don't know of any vendors (except possibly Imagination) that
> removed hardware support, because it turns out having hardware support
> for core API features is a good thing actually. It doesn't need to look
> like "put the API in hardware" but some sort of hardware acceleration
> (like AMD's NGG) solves the problems in this doc and more.
> 
> So... just "Some hardware vendors omit hardware support for
> geometry/tessellation/mesh stages".

Sounds good.

> 
> > This being said, the same
> > +"start small, grow on-demand" can be applied to avoid over-allocating memory
> > +upfront.
> 
> [citation needed], if we overflow that buffer we're screwed and hit
> device_loss, and that's unacceptable in normal usage.

My bad, I went ahead and mentioned other geom stages here because Sima
seemed to think that the same tricks could be applied to those. I'll
just drop this section.

> 
> > +The problem with on-demand allocation is that suddenly, GPU accesses can
> > +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> > +were not designed for that.  
> 
> It's not the common DRM scheduler that causes this problem, it
> fundamentally violates the kernel-wide dma_fence contract: signalling a
> dma-fence must not block on a fallible memory allocation, full stop.
> Nothing we do in DRM will change that contract (and it's not obvious to
> me that kbase is actually correct in all the corner cases).

Okay, I'll stop blaming DRM and hide behind the dma_fence contract here.

BTW, is there a piece of doc explaining the rationale behind this
dma_fence contract, or is it just the usual informal knowledge shared
among DRM devs over IRC/email threads :-) ?

To be honest, I'm a bit unhappy with this "it's part of the dma_fence
contract" explanation, because I have a hard time remembering all the
details that led to this set of rules myself, so I suspect it's even
harder for newcomers to reason about this. To me, it's one of the
reasons people fail to understand/tend to forget what the
problems/limitations are, and end up ignoring them (intentionally or
not).

FWIW, this is what I remember, but I'm sure there's more:

1. dma_fence must signal in finite time, so unbounded waits in the
   fence signalling path are not good, and that's what happens with
   GFP_KERNEL allocations
2. if you're blocked in your GPU fault handler, that means you can't
   process further faults happening on other contexts
3. GPU drivers are actively participating in the memory reclaim
   process, which leads to deadlocks if the memory allocation in the
   fault handler is waiting on the very same GPU job fence that's
   waiting for its memory allocation to be satisfied

I'd really love it if someone (Sima, Alyssa and/or Christian?) could sum it
up, so I can put the outcome of this discussion in some kernel doc
entry (or maybe it'd be better if this was one of you submitting a
patch for that ;-)). If it's already documented somewhere, I'll just
have to eat my hat and accept your RTFM answer :-).
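
Point 3 in the list above can be pictured as a wait-for cycle. The following
user-space C model is only an illustration: the enum values and the
waits_in_cycle() helper are made up for this sketch and do not correspond to
any kernel API. Each entry says what a given party is waiting on, and the
helper checks whether following those waits leads back to the starting point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical parties involved in the scenario from point 3. */
enum waiter { JOB_FENCE, FAULT_ALLOC, RECLAIM, NR_WAITERS };
#define NO_WAIT (-1)

/*
 * waits_on[a] == b means "a cannot make progress until b does".
 * Returns true if following the waits from 'start' comes back to 'start'.
 */
static bool waits_in_cycle(const int waits_on[NR_WAITERS], int start)
{
	int cur = start;

	assert(start >= 0 && start < NR_WAITERS);

	for (int step = 0; step < NR_WAITERS; step++) {
		cur = waits_on[cur];
		if (cur == NO_WAIT)
			return false;	/* chain ends: someone can make progress */
		if (cur == start)
			return true;	/* chain loops: nobody can make progress */
	}
	return false;
}
```

With a blocking allocation the table would be { FAULT_ALLOC, RECLAIM,
JOB_FENCE } (the fence waits on the allocation, the allocation waits on
reclaim, reclaim waits on the fence), which loops. With a non-blocking
allocation the FAULT_ALLOC entry becomes NO_WAIT and the loop disappears, at
the price of the allocation being allowed to fail.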

> 
> > +The second trick to try to avoid over-allocation, even with this
> > +sub-optimistic estimate, is to have a shared pool of memory that can be
> > +used by all GPU contexts when they need tiler/geometry memory. This
> > +implies returning chunks to this pool at some point, so other contexts
> > +can re-use those. Details about what this global memory pool implementation
> > +would look like is currently undefined, but it needs to be filled to
> > +guarantee that pre-allocation requests for on-demand buffers used by a
> > +GPU job can be satisfied in the fault handler path.  
> 
> How do we clean memory between contexts? This is a security issue.
> Either we need to pin physical pages to single processes, or we need to
> zero pages when returning pages to the shared pool. Zeroing on the
> CPU side is an option but the performance hit may be unacceptable
> depending how it's implemented. Alternatively we can require userspace to
> clean up after itself on the gpu (with a compute shader) but that's
> going to burn memory b/w in the happy path where we have lots of memory
> free.

I would say memset(0) on allocation (when recycling pages returned to
the pool) since that's already where you're taking the hit for regular
allocations anyway (allocating pages is not free).
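
As a rough sketch of that "zero when handing a recycled chunk out" idea, here
is a tiny user-space C model. Everything in it is an assumption made for
illustration (the chunk_pool structure, the sizes, the pool_get()/pool_put()
names); the real pool design is still undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE  64	/* tiny for illustration only */
#define POOL_CHUNKS 4

/* Hypothetical shared pool of fixed-size chunks. */
struct chunk_pool {
	unsigned char mem[POOL_CHUNKS][CHUNK_SIZE];
	int free_list[POOL_CHUNKS];	/* indices of free chunks */
	int nr_free;
};

static void pool_init(struct chunk_pool *p)
{
	p->nr_free = POOL_CHUNKS;
	for (int i = 0; i < POOL_CHUNKS; i++)
		p->free_list[i] = i;
}

/* Hand out a chunk, zeroed so data from the previous owner can't leak. */
static unsigned char *pool_get(struct chunk_pool *p)
{
	unsigned char *c;

	if (p->nr_free == 0)
		return NULL;
	c = p->mem[p->free_list[--p->nr_free]];
	memset(c, 0, CHUNK_SIZE);
	return c;
}

/* Return a chunk untouched; the zeroing cost is paid by the next pool_get(). */
static void pool_put(struct chunk_pool *p, unsigned char *c)
{
	int idx = (int)((c - &p->mem[0][0]) / CHUNK_SIZE);

	assert(idx >= 0 && idx < POOL_CHUNKS && p->nr_free < POOL_CHUNKS);
	p->free_list[p->nr_free++] = idx;
}
```

Zeroing in pool_get() rather than pool_put() means chunks that are returned
but never reused are never cleared, which is the cost argument made above.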

> 
> > For GL
> > +drivers, the UMD is in control of the context recreation, so it can easily
> > +record the next buffer size to use.
> 
> I'm /really/ skeptical of this. Once we hit a device loss in GL, it's
> game over, and I'm skeptical of any plan that expects userspace to
> magically recover, especially as soon as side effects are introduced
> (including transform feedback which is already gles3.0 required).

I can drop that one. I know we've always done that in gallium/panfrost
(that predates CSF/panthor BTW), but I trust you when you say it
doesn't comply with the GL spec.

Thanks a lot for chiming in BTW, that's truly appreciated.

Regards,

Boris

* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-23  9:41 ` Steven Price
@ 2025-04-28  8:00   ` Boris Brezillon
  0 siblings, 0 replies; 9+ messages in thread
From: Boris Brezillon @ 2025-04-28  8:00 UTC (permalink / raw)
  To: Steven Price
  Cc: dri-devel, Liviu Dudau, Adrián Larumbe, lima, Qiang Yu,
	David Airlie, Simona Vetter, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, Dmitry Osipenko, Alyssa Rosenzweig,
	Christian Koenig, Faith Ekstrand, kernel

Hi Steve,

On Wed, 23 Apr 2025 10:41:53 +0100
Steven Price <steven.price@arm.com> wrote:

> On 18/04/2025 13:25, Boris Brezillon wrote:
> > Tile-based GPUs come with a set of constraints that are not present
> > when immediate rendering is used. This new document tries to explain
> > the differences between tile/immediate rendering, the problems that
> > come with tilers, and how we plan to address them.
> > 
> > This is just a started point, this document will be updated with new
> > materials as we refine the libraries we add to help deal with
> > tilers, and have more drivers converted to follow the rules listed
> > here.
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>  
> 
> Seems like a good starting point, a few minor comments below. We really
> need some non-Mali input too though.

Totally agree with that, my view on this problem is certainly biased.

> 
> > ---
> >  Documentation/gpu/drm-tile-based-renderer.rst | 201 ++++++++++++++++++
> >  Documentation/gpu/index.rst                   |   1 +
> >  2 files changed, 202 insertions(+)
> >  create mode 100644 Documentation/gpu/drm-tile-based-renderer.rst
> > 
> > diff --git a/Documentation/gpu/drm-tile-based-renderer.rst b/Documentation/gpu/drm-tile-based-renderer.rst
> > new file mode 100644
> > index 000000000000..19b56b9476fc
> > --- /dev/null
> > +++ b/Documentation/gpu/drm-tile-based-renderer.rst
> > @@ -0,0 +1,201 @@
> > +==================================================
> > +Infrastructure and tricks for tile-based renderers
> > +==================================================
> > +
> > +All lot of embedded GPUs are using tile-based rendering instead of immediate
> > +rendering. This mode of rendering has various implications that we try to
> > +document here along with some hints about how to deal with some of the
> > +problems that surface with tile-based renderers.
> > +
> > +The main idea behind tile-based rendering is to batch processing of nearby
> > +pixels during the fragment shading phase to limit the traffic on the memory
> > +bus by making optimal use of the various caches present in the GPU. Unlike
> > +immediate rendering, where primitives generated by the geometry stages of
> > +the pipeline are directly consumed by the fragment stage, tilers have to
> > +record primitives in bins that are somehow attached to tiles (the
> > +granularity of the tile being GPU-specific). This data is usually stored
> > +in memory, and pulled back when the fragment stage is executed.
> > +
> > +This approach has several issues that most drivers need to handle somehow,
> > +sometimes with a bit of help from the hardware.
> > +
> > +Issues at hand
> > +==============
> > +
> > +Tiler memory
> > +------------
> > +
> > +The amount of memory needed to store primitives data and metadata is hard
> > +to guess ahead of time, because it depends on various parameters that are
> > +not in control of the UMD (UserMode Driver). Here is a non-exhaustive list
> > +of things that may complicate the calculation of the memory needed to store
> > +primitive information:
> > +
> > +- Primitives distribution across tiles is hard to guess: the binning process
> > +  is about assigning each primitive to the set tiles it covers. The more tiles
> > +  being covered the more memory is needed to record those. We can estimate
> > +  the worst case scenario by assuming all primitives will cover all tiles but
> > +  this will lead to over-allocation most of the time, which is not good
> > +- Indirect draws: the number of vertices comes from a GPU buffer that might
> > +  be filled by previous GPU compute jobs. This means we only know the number
> > +  of vertices when the GPU executes the draw, and thus can't guess how much
> > +  memory will be needed for those and allocate a GPU buffer that's big enough
> > +  to hold those
> > +- Complex geometry pipelines: if you throw geometry/tesselation/mesh shaders
> > +  it gets even trickier to guess the number of primitives from the number
> > +  of vertices passed to the vertex shader.
> > +
> > +For all these reasons, the tiler usually allocates memory dynamically, but
> > +DRM has not been designed with this use case in mind. Drivers will address
> > +these problems differently based on the functionality provided by their
> > +hardware, but all of them almost certainly have to deal with this somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick from when
> > +tiler memory is needed, and fail the rendering when this buffer is depleted.
> > +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> > +Instead, they start with an almost empty buffer and progressively populate it
> > +when the GPU faults on an address sitting in the tiler buffer range. This
> > +works okay most of the time but it falls short when the system is under
> > +memory pressure, because the memory request is not guaranteed to be satisfied.
> > +In that case, the driver either fails the rendering, or, if the hardware
> > +allows it, it tries to flush the primitives that have been processed and
> > +triggers a fragment job that will consume those primitives and free up some
> > +memory to be recycled and make further progress on the tiling step. This is
> > +usually referred as partial/incremental rendering (it might have other names).
> > +
> > +Compute based emulation of geometry stages
> > +------------------------------------------
> > +
> > +More and more hardware vendors don't bother providing hardware support for
> > +geometry/tesselation/mesh stages, since those can be emulated with compute
> > +shaders. But the same problem we have with tiler memory exists with those
> > +intermediate compute-emulated stages, because transient data shared between
> > +stages need to be stored in memory for the next stage to consume, and this
> > +bubbles up until the tiling stage is reached, because ultimately, what the
> > +tiling stage will need to process is a set of vertices it can turn into
> > +primitives, like would happen if the application had emulated the geometry,
> > +tesselation or mesh stages with compute.
> > +
> > +Unlike tiling, where the hardware can provide a fallback to recycle memory,
> > +there is no way the intermediate primitives can be flushed up to the framebuffer,
> > +because it's a purely software emulation here. This being said, the same
> > +"start small, grow on-demand" can be applied to avoid over-allocating memory
> > +upfront.
> > +
> > +On-demand memory allocation
> > +---------------------------
> > +
> > +As explained in previous sections, on-demand allocation is a central piece
> > +of tile-based renderer if we don't want to over-allocate, which is bad for
> > +integrated GPUs who share their memory with the rest of the system.
> > +
> > +The problem with on-demand allocation is that suddenly, GPU accesses can
> > +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> > +were not designed for that. Those are assuming that buffers memory is  
> 
> NIT: s/buffers/buffer's/
> 
> > +populated at job submission time, and will stay around for the job lifetime.
> > +If a GPU fault happens, it's the user fault, and the context can be flagged  
> 
> NIT: s/user/user's/
> 
> > +unusable. On-demand allocation is usually implemented as allocation-on-fault,
> > +and the dma_fence contract prevents us from blocking on allocations in that
> > +path (GPU fault handlers are in the dma-fence signalling path). So now we
> > +have GPU allocations that will be satisfied most of the time, but can fail
> > +occasionally. And this is not great, because an allocation failure might
> > +kill the user GPU context (VK_DEVICE_LOST in Vulkan terms), without the
> > +application having dong anything wrong. So, we need something that makes those
> > +allocation failures rare enough that most users won't experience them, and
> > +we need a fallback for when this happens to try to avoid them on the next
> > +user attempt to submit a graphics job.
> > +
> > +The plan
> > +========
> > +
> > +On-demand allocation rules
> > +--------------------------
> > +
> > +First of all, all allocations happening in the fault handler path must
> > +be using GFP_NOWAIT. With this flag, low-hanging fruit can be picked
> > +(clean FS cache will be flushed for instance), but an error will be
> > +returned if no memory is readily available. GFP_NOWAIT will also trigger
> > +background reclaim to hopefully free-up some memory for our future
> > +requests.
> > +
> > +How to deal with allocation failures
> > +------------------------------------
> > +
> > +The first trick here is to try to guess approximately how much memory
> > +will be needed, and force-populate on-demand buffers with that amount
> > +of memory when the job is started. It's not about guessing the worst
> > +case scenario here, but more the most likely case, probably with a
> > +reasonable margin, so that the job is likely to succeed when this amount
> > +of memory is provided by the KMD.
> > +
> > +The second trick to try to avoid over-allocation, even with this
> > +sub-optimistic estimate, is to have a shared pool of memory that can be
> > +used by all GPU contexts when they need tiler/geometry memory. This
> > +implies returning chunks to this pool at some point, so other contexts
> > +can re-use those. Details about what this global memory pool implementation
> > +would look like is currently undefined, but it needs to be filled to
> > +guarantee that pre-allocation requests for on-demand buffers used by a
> > +GPU job can be satisfied in the fault handler path.  
> 
> Note one thing I haven't seen discussed is that across multiple contexts
> it's possible to prioritise jobs that free memory. E.g. a fragment job
> can be run to free up memory from a tiler heap, allowing pages to be
> returned to the global pool. This might imply a uAPI extension allowing
> a fragment job to automatically drop memory from a BO so that the kernel
> can have confidence that it will actually free up memory.
> 
> Sadly I don't think it's plausible to wait in the fault handler for a
> fragment job to complete to free up memory - so the best we can do here
> is postpone *starting* a vertex+tiler job if we're short on memory and
> have fragment jobs to run.

Right, we'll have to do with an internal dma_fence (returned
through drm_sched_ops::prepare_job()) that's controlling access to this
memory pool, so we're sure all currently queued tiler jobs (those
passed to ::run_job()) can have their estimated memory allocation
satisfied. But because it's just an estimate, there's still no guarantee
that the job won't try to allocate more, and thus no guarantee that
the job will always succeed.
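
The accounting behind such a gate could look like the user-space C model
below. It is a sketch under assumptions, not a proposal for the actual
implementation: tiler_pool, pool_try_reserve() and pool_release() are
hypothetical names, and in a real driver a failed reservation would be
expressed as a fence returned from the prepare_job() hook rather than a
boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting for the shared tiler memory pool. */
struct tiler_pool {
	size_t total;		/* chunks owned by the pool */
	size_t reserved;	/* chunks promised to queued tiler jobs */
};

/*
 * Try to set aside the estimated amount before a job is allowed to run.
 * Returns false if the job has to be held back until memory is released.
 */
static bool pool_try_reserve(struct tiler_pool *p, size_t estimate)
{
	assert(p->reserved <= p->total);

	if (estimate > p->total - p->reserved)
		return false;
	p->reserved += estimate;
	return true;
}

/* Drop a job's reservation once it is done with its tiler memory. */
static void pool_release(struct tiler_pool *p, size_t estimate)
{
	assert(estimate <= p->reserved);
	p->reserved -= estimate;
}
```

This only covers the estimate: a job that ends up needing more than it
reserved still hits the on-demand path and can still fail there.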

> 
> > +
> > +As a last resort, we can try to allocate with GFP_ATOMIC if everything
> > +else fails, but this is a dangerous game, because we would be stealing
> > +memory from the atomic reserve, so it's not entirely clear if this is
> > +better than failing the job at this point.
> > +
> > +Ideas on how to make allocation failures decrease over time
> > +-----------------------------------------------------------
> > +
> > +When an on-demand allocation fails and the hardware doesn't have a
> > +flush-primitives fallback, we usually can't do much apart from failing the
> > +whole job. But it's important to try to avoid future allocation failures
> > +when the application creates a new context. There's no clear path for
> > +how to guess the actual size to force-populate on the next attempt. One
> > +option is to have a simple heuristics, like double the current resident size,
> > +but this has the downside of potentially taking a few attempts before reaching
> > +the stability point. Another option is to repeatedly map a dummy page at the
> > +fault addresses, so we can get a sense of how much memory was needed for this
> > +particular job.  
> 
> We'd have to double check that we don't cause extra problems with an
> aliasing heap like that. The tiler might attempt to read back data which
> could cause 'interesting' errors if it's getting clobbered.

Yeah, I thought about that too :-(.

> Given this
> is just a heuristic it might be ok, but it definitely needs more research.

I should probably make it clear that these options are based on
speculations about how the HW works, and they might prove impossible to
implement in practice. The reason I have them listed here is so Sima's
suggestions don't get lost in the original thread.

Thanks for reviewing the piece of doc. I'll leave a bit more time for
others to chime in, and post a v2 addressing your comments.

Boris

* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-28  6:55 ` Iago Toral
@ 2025-04-28  8:13   ` Boris Brezillon
  2025-04-28  8:22     ` Iago Toral
  0 siblings, 1 reply; 9+ messages in thread
From: Boris Brezillon @ 2025-04-28  8:13 UTC (permalink / raw)
  To: Iago Toral
  Cc: dri-devel, Steven Price, Liviu Dudau, Adrián Larumbe, lima,
	Qiang Yu, David Airlie, Simona Vetter, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, Dmitry Osipenko,
	Alyssa Rosenzweig, Christian Koenig, Faith Ekstrand, kernel

Hi Iago,

On Mon, 28 Apr 2025 08:55:07 +0200
Iago Toral <itoral@igalia.com> wrote:

> Hi,
> 
> Pitching in to describe the situation for v3d:

Thanks for chiming in.

> 
> On Fri, 18-04-2025 at 14:25 +0200, Boris Brezillon wrote:
> 
> (...)
> > +For all these reasons, the tiler usually allocates memory dynamically, but
> > +DRM has not been designed with this use case in mind. Drivers will address
> > +these problems differently based on the functionality provided by their
> > +hardware, but all of them almost certainly have to deal with this somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick from when
> > +tiler memory is needed, and fail the rendering when this buffer is depleted.
> > +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> > +Instead, they start with an almost empty buffer and progressively populate it
> > +when the GPU faults on an address sitting in the tiler buffer range. This
> > +works okay most of the time but it falls short when the system is under
> > +memory pressure, because the memory request is not guaranteed to be satisfied.
> > +In that case, the driver either fails the rendering, or, if the hardware
> > +allows it, it tries to flush the primitives that have been processed and
> > +triggers a fragment job that will consume those primitives and free up some
> > +memory to be recycled and make further progress on the tiling step. This is
> > +usually referred as partial/incremental rendering (it might have other names).
> 
> In our case, user space allocates some memory up front hoping to avoid
> running out of memory during tiling, but if the tiler does run out of
> memory we get an interrupt and the tiler hw will stop and wait for the
> kernel driver to write back an address where more memory is made
> available (via register write), which we will try to allocate at that
> point. This can happen any number of times until the tiler job
> completes.

Sounds very much like how new Mali-CSF works, except Mali-CSF also has
a fallback for when the allocation can't be satisfied.

> 
> I am not sure that we are handling allocation failure on this path 
> nicely at the moment since we don't try to fail and cancel the job,
> that's maybe something we should fix, although I don't personally
> recall any reports of us running into this situation either.

Yeah, I'd say you're pretty much in the same place Panfrost/Panthor are
at the moment: we're not playing by the dma_fence rules, but no user
complained so far. BTW, that doesn't necessarily mean the problem
doesn't occur, just that it's not been identified as being a KMD issue
:-).

> 
> 
> > +
> > +Compute based emulation of geometry stages
> > +------------------------------------------
> > +
> > +More and more hardware vendors don't bother providing hardware support for
> > +geometry/tesselation/mesh stages, since those can be emulated with compute
> > +shaders. But the same problem we have with tiler memory exists with those
> > +intermediate compute-emulated stages, because transient data shared between
> > +stages need to be stored in memory for the next stage to consume, and this
> > +bubbles up until the tiling stage is reached, because ultimately, what the
> > +tiling stage will need to process is a set of vertices it can turn into
> > +primitives, like would happen if the application had emulated the geometry,
> > +tesselation or mesh stages with compute.
> > +
> > +Unlike tiling, where the hardware can provide a fallback to recycle memory,
> > +there is no way the intermediate primitives can be flushed up to the framebuffer,
> > +because it's a purely software emulation here. This being said, the same
> > +"start small, grow on-demand" can be applied to avoid over-allocating memory
> > +upfront.
> 
> FWIW, v3d has geometry and tessellation hardware.

Yep, Alyssa mentioned that. I'll change this section to specifically
mention Arm/Mali as being the outlier here.

> 
> 
> > +
> > +On-demand memory allocation
> > +---------------------------
> > +
> > +As explained in previous sections, on-demand allocation is a central piece
> > +of tile-based renderer if we don't want to over-allocate, which is bad for
> > +integrated GPUs who share their memory with the rest of the system.
> > +
> > +The problem with on-demand allocation is that suddenly, GPU accesses can
> > +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> > +were not designed for that. Those are assuming that buffers memory is
> > +populated at job submission time, and will stay around for the job lifetime.
> > +If a GPU fault happens, it's the user fault, and the context can be flagged
> > +unusable. On-demand allocation is usually implemented as allocation-on-fault,
> > +and the dma_fence contract prevents us from blocking on allocations in that
> > +path (GPU fault handlers are in the dma-fence signalling path).
> 
> As I described above, v3d is not quite an allocation-on-fault mechanism
> but rather, we get a dedicated interrupt from the hw when it needs more
> memory, which I believe happens a bit before it completely runs out of
> memory actually. Maybe that changes the picture since we don't exactly
> use a fault handler?

Not really. Any mechanism relying on on-demand allocation in the
dma_fence signalling path is problematic. The fact it's based on a
fault handler might add extra problems on top, but both designs violate
the dma_fence contract stating that no non-fallible allocation should
be done in the dma_fence signalling path (that is, any allocation
happening between the moment the job was queued to the
drm_sched_entity, and the moment the job fence is signalled).
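
To make the constraint concrete, here is a minimal user-space sketch of
what a fallible, non-blocking growth path could look like. It is only a
model: page_pool, tiler_heap, pool_try_take() and heap_grow() are
hypothetical names invented for illustration, not kernel or driver APIs,
and the idea of reserving pages before the job is queued is an assumption
of the sketch, not something this thread prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct page_pool {
	size_t free_pages;	/* pages set aside before the job was queued */
};

struct tiler_heap {
	size_t pages;		/* pages currently backing the heap */
	bool job_failed;	/* set when a growth request could not be met */
};

/* Non-blocking: take pages from the reserve or fail at once, never wait. */
static bool pool_try_take(struct page_pool *pool, size_t n)
{
	if (pool->free_pages < n)
		return false;
	pool->free_pages -= n;
	return true;
}

/*
 * Models the handler run when the GPU asks for more memory (fault or
 * dedicated interrupt). It sits in the fence signalling path, so it
 * fails the job instead of sleeping until memory shows up.
 */
static bool heap_grow(struct tiler_heap *heap, struct page_pool *pool,
		      size_t n)
{
	assert(heap && pool);

	if (!pool_try_take(pool, n)) {
		heap->job_failed = true;
		return false;
	}
	heap->pages += n;
	return true;
}
```

With a reserve of 4 pages, growing a heap by 3 pages succeeds, and a
further request for 2 pages fails immediately and marks the job as failed
rather than waiting for reclaim.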

Given the description you gave, I think we can add v3d to the list of
problematic drivers :-(.

Regards,

Boris


* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-28  8:13   ` Boris Brezillon
@ 2025-04-28  8:22     ` Iago Toral
  0 siblings, 0 replies; 9+ messages in thread
From: Iago Toral @ 2025-04-28  8:22 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: dri-devel, Steven Price, Liviu Dudau, Adrián Larumbe, lima,
	Qiang Yu, David Airlie, Simona Vetter, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, Dmitry Osipenko,
	Alyssa Rosenzweig, Christian Koenig, Faith Ekstrand, kernel

On Mon, 28-04-2025 at 10:13 +0200, Boris Brezillon wrote:
> Hi Iago,
> 
> On Mon, 28 Apr 2025 08:55:07 +0200
> Iago Toral <itoral@igalia.com> wrote:
(...)
> > As I described above, v3d is not quite an allocation-on-fault mechanism
> > but rather, we get a dedicated interrupt from the hw when it needs more
> > memory, which I believe happens a bit before it completely runs out of
> > memory actually. Maybe that changes the picture since we don't exactly
> > use a fault handler?
> 
> Not really. Any mechanism relying on on-demand allocation in the
> dma_fence signalling path is problematic. The fact it's based on a
> fault handler might add extra problems on top, but both designs violate
> the dma_fence contract stating that no non-fallible allocation should
> be done in the dma_fence signalling path (that is, any allocation
> happening between the moment the job was queued to the
> drm_sched_entity, and the moment the job fence is signalled).
> 
> Given the description you gave, I think we can add v3d to the list of
> problematic drivers :-(.

In that case we should add vc4 as well, since it is the same story
there.

Thanks,
Iago


* Re: [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers
  2025-04-28  7:42   ` Boris Brezillon
@ 2025-04-28 13:45     ` Alyssa Rosenzweig
  0 siblings, 0 replies; 9+ messages in thread
From: Alyssa Rosenzweig @ 2025-04-28 13:45 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: dri-devel, Steven Price, Liviu Dudau, Adrián Larumbe, lima,
	Qiang Yu, David Airlie, Simona Vetter, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, Dmitry Osipenko,
	Christian Koenig, Faith Ekstrand, kernel

> BTW, is there a piece of doc explaining the rationale behind this
> dma_fence contract, or is it just the usual informal knowledge shared
> among DRM devs over IRC/email threads :-) ?
> 
> To be honest, I'm a bit unhappy with this "it's part of the dma_fence
> contract" explanation, because I have a hard time remembering all the
> details that led to this set of rules myself, so I suspect it's even
> harder for newcomers to reason about this. To me, it's one of the
> reasons people fail to understand/tend to forget what the
> problems/limitations are, and end up ignoring them (intentionally or
> not).
> 
> FWIW, this is what I remember, but I'm sure there's more:
> 
> 1. dma_fence must signal in finite time, so unbounded waits in the
>    fence signalling path are not good, and that's what happens with
>    GFP_KERNEL allocations
> 2. if you're blocked in your GPU fault handler, that means you can't
>    process further faults happening on other contexts
> 3. GPU drivers are actively participating in the memory reclaim
>    process, which leads to deadlocks if the memory allocation in the
>    fault handler is waiting on the very same GPU job fence that's
>    waiting for its memory allocation to be satisfied
> 
> I'd really love it if someone (Sima, Alyssa and/or Christian?) could sum
> it up, so I can put the outcome of this discussion in some kernel doc
> entry (or maybe it'd be better if this was one of you submitting a
> patch for that ;-)). If it's already documented somewhere, I'll just
> have to eat my hat and accept your RTFM answer :-).

https://www.kernel.org/doc/html/next/driver-api/dma-buf.html#dma-fence-cross-driver-contract

Specifically

  Drivers are allowed to call dma_fence_wait() from their shrinker
  callbacks. This means any code required for fence completion cannot
  allocate memory with GFP_KERNEL.

Concretely:

* Job requires memory allocation to signal a fence
* We're in a low memory situation, so the shrinker is invoked
* The shrinker can't free memory until the job finishes
* Deadlock!
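
The four steps above form a wait-for cycle. As a rough illustration, the
user-space sketch below models it as a tiny graph in which each
participant waits on at most one other, then checks whether a given
participant can ever complete. The enum names and blocked_forever() are
made up for this example and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical participants; NO_WAIT means "can complete on its own". */
enum node { NO_WAIT = -1, JOB_FENCE, ALLOCATION, SHRINKER, NODE_COUNT };

/*
 * waits_on[a] == b means "a cannot complete until b does". Each node has
 * at most one outgoing edge, so if following NODE_COUNT edges from start
 * never reaches NO_WAIT, the walk is trapped in a cycle and start can
 * never complete.
 */
static bool blocked_forever(const int waits_on[NODE_COUNT], int start)
{
	int cur = start;

	assert(start >= 0 && start < NODE_COUNT);

	for (int steps = 0; steps < NODE_COUNT; steps++) {
		cur = waits_on[cur];
		if (cur == NO_WAIT)
			return false;
	}
	return true;
}
```

If the fence waits on the allocation, a blocking allocation waits on the
shrinker, and the shrinker waits on the fence, blocked_forever() reports
a deadlock for the fence. Replacing the allocation's edge with NO_WAIT
(an allocation that fails instead of waiting) breaks the cycle.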

Possibly we could relax the contract to let us reclaim non-graphics
memory, but that's not my department.



Thread overview: 9+ messages
2025-04-18 12:25 [PATCH] drm/doc: Start documenting aspects specific to tile-based renderers Boris Brezillon
2025-04-23  9:41 ` Steven Price
2025-04-28  8:00   ` Boris Brezillon
2025-04-23 14:47 ` Alyssa Rosenzweig
2025-04-28  7:42   ` Boris Brezillon
2025-04-28 13:45     ` Alyssa Rosenzweig
2025-04-28  6:55 ` Iago Toral
2025-04-28  8:13   ` Boris Brezillon
2025-04-28  8:22     ` Iago Toral
