From: Boris Brezillon <boris.brezillon@collabora.com>
To: Simona Vetter <simona.vetter@ffwll.ch>
Cc: "Liviu Dudau" <liviu.dudau@arm.com>,
"Alyssa Rosenzweig" <alyssa@rosenzweig.io>,
"Christian König" <christian.koenig@amd.com>,
"Steven Price" <steven.price@arm.com>,
"Adrián Larumbe" <adrian.larumbe@collabora.com>,
lima@lists.freedesktop.org, "Qiang Yu" <yuq825@gmail.com>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
dri-devel@lists.freedesktop.org,
"Dmitry Osipenko" <dmitry.osipenko@collabora.com>,
kernel@collabora.com,
"Faith Ekstrand" <faith.ekstrand@collabora.com>
Subject: Re: [PATCH v3 0/8] drm: Introduce sparse GEM shmem
Date: Mon, 14 Apr 2025 17:15:01 +0200
Message-ID: <20250414171501.6b5b57f3@collabora.com>
In-Reply-To: <Z_0dBzMjkeWYNSRM@phenom.ffwll.local>
On Mon, 14 Apr 2025 16:34:47 +0200
Simona Vetter <simona.vetter@ffwll.ch> wrote:
> On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote:
> > On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:
> > > Hi Sima,
> > >
> > > On Fri, 11 Apr 2025 14:01:16 +0200
> > > Simona Vetter <simona.vetter@ffwll.ch> wrote:
> > >
> > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > > > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> > > > >
> > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > > > > > with an unhandled GPU fault.
> > > > > > > >
> > > > > > > > To be honest I think that this is enough to mark those two drivers as
> > > > > > > > broken. It's documented that this approach is a no-go for upstream
> > > > > > > > drivers.
> > > > > > > >
> > > > > > > > How widely is that used?
> > > > > > >
> > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > > > > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > > > > > the amount of memory needed to store per-tile primitives (and metadata)
> > > > > > > depends on what the geometry pipeline feeds the tiler with, and that
> > > > > > > can't be predicted. If you over-provision, that's memory the system won't
> > > > > > > be able to use while rendering takes place, even though only a small
> > > > > > > portion might actually be used by the GPU. If your allocation is too
> > > > > > > small, it will either trigger a GPU fault (for HW not supporting an
> > > > > > > "incremental rendering" mode) or under-perform (because flushing
> > > > > > > primitives has a huge cost on tilers).
> > > > > >
> > > > > > Yes and no.
> > > > > >
> > > > > > Although we can't allocate more memory for /this/ frame, we know the
> > > > > > required size is probably constant across its lifetime. That gives a
> > > > > > simple heuristic to manage the tiler heap efficiently without
> > > > > > allocations - even fallible ones - in the fence signal path:
> > > > > >
> > > > > > * Start with a small fixed size tiler heap
> > > > > > * Try to render, let incremental rendering kick in when it's too small.
> > > > > > * When cleaning up the job, check if we used incremental rendering.
> > > > > > * If we did - double the size of the heap the next time we submit work.
> > > > > >
> > > > > > The tiler heap still grows dynamically - it just does so over the span
> > > > > > of a couple frames. In practice that means a tiny hit to startup time as
> > > > > > we dynamically figure out the right size, incurring extra flushing at
> > > > > > the start, without needing any "grow-on-page-fault" heroics.
> > > > > >
> > > > > > This should solve the problem completely for CSF/panthor. So it's only
> > > > > > hardware that architecturally cannot do incremental rendering (older
> > > > > > Mali: panfrost/lima) where we need this mess.
> > > > >
> > > > > OTOH, if we need something
> > > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> > > > > thing for CSF, since CSF is arguably the sanest of all the HW
> > > > > architectures listed above: allocation can fail/be non-blocking,
> > > > > because there's a fallback to incremental rendering when it fails.
> > > >
> > > > So this is a really horrible idea to sort this out for panfrost hardware,
> > > > which doesn't have a tiler cache flush as a fallback. It's roughly three
> > > > stages:
> > > >
> > > > 1. A pile of clever tricks to make the chances of running out of memory
> > > > really low. Most of these also make sense for panthor platforms, just as a
> > > > performance optimization.
> > > >
> > > > 2. A terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> > > > such that the impact should be minimal. This is nasty, and we really want
> > > > to avoid that for panthor.
> > > >
> > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> > > >
> > > > 1. Clever tricks
> > > > ----------------
> > > >
> > > > This is a cascade of tricks we can pull in the gpu fault handler:
> > > >
> > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
> > > > background reclaim, and that might be enough to get us through and free
> > > > some easy caches (like clean fs cache and stuff like that which can just
> > > > be dropped).
> > >
> > > There's no GFP_NORECLAIM, and given the discussions we had before, I
> > > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN |
> > > __GFP_NORETRY I used in this series, and it probably doesn't try hard
> > > enough as pointed out by you and Christian.
> > >
> > > >
> > > > 1b Userspace needs to come up with a good guesstimate of how much we'll need. I'm
> > > > hoping that between render target size and maybe counting the total
> > > > amounts of vertices we can do a decent guesstimate for many workloads.
> > >
> > > There are extra parameters to take into account, like the tile
> > > hierarchy mask (number of binning lists instantiated) and probably
> > > other things I forget, but for simple vertex+fragment pipelines and
> > > direct draws, guessing the worst memory usage case is probably doable.
> > > Throw indirect draws into the mix, and it suddenly becomes a lot more
> > > complicated. Not even talking about GEOM/TESS stages, which makes the
> > > guessing even harder AFAICT.
> > >
> > > > Note that the goal here is not to ensure success, but just to get the rough
> > > > ballpark. The actual starting number here should aim fairly low, so that
> > > > we avoid wasting memory since this is memory wasted on every context
> > > > (that uses a feature which needs dynamic memory allocation, which I
> > > > guess for pan* is everything, but for apple it would be more limited).
> > >
> > > Ack.
> > >
> > > >
> > > > 1c The kernel then keeps an additional global memory pool. Note this would
> > > > not have the same semantics as mempool.h, which is aimed at GFP_NOIO
> > > > forward progress guarantees, but more as a preallocation pool. In every
> > > > CS ioctl we'll make sure the pool is filled, and we probably want to
> > > > size the pool relative to the context with the biggest dynamic memory
> > > > usage. So probably this thing needs a shrinker, so we can reclaim it
> > > > when you don't run an app with a huge buffer need on the gpu anymore.
> > >
> > > Okay, that's a technique Arm has been using in their downstream driver
> > > (it's named JIT-allocation there).
> > >
> > > >
> > > > Note that we're still not sizing this to guarantee success, but together
> > > > with the userspace heuristics it should be big enough to almost always
> > > > work out. And since it's global reserve we can afford to waste a bit
> > > > more memory on this one. We might also want to scale this pool by the
> > > > total memory available, like the watermarks core mm computes. We'll only
> > > > hang onto this memory when the gpu is in active usage, so this should be
> > > > fine.
> > >
> > > Sounds like a good idea.
> > >
> > > >
> > > > Also the preallocation would need to happen without holding the memory
> > > > pool lock, so that we can use GFP_KERNEL.
> > > >
> > > > Up to this point I think it's all tricks that panthor also wants to
> > > > employ.
> > > >
> > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does
> > > > not need to survive a batchbuffer (hopefully the case with vulkan render
> > > > pass) we could steal such memory from other contexts. We could even do
> > > > that for contexts which are queued but not yet running on the hardware
> > > > (might need unloading them to be doable with fw renderers like
> > > > panthor/CSF) as long as we keep such stolen dynamic memory on a separate
> > > > free list. Because if we'd entirely free this, or release it into the
> > > > memory pool we'll make things worse for these other contexts, we need to
> > > > be able to guarantee that any context can always get all the stolen
> > > > dynamic pages back before we start running it on the gpu.
> > >
> > > Actually, CSF stands in the way of re-allocating memory to other
> > > contexts, because once we've allocated memory to a tiler heap, the FW
> > > manages this pool of chunks, and recycles them. Mesa can intercept
> > > the "returned chunks" and collect those chunks instead of re-assigning
> > > them to the tiler heap through a CS instruction (which goes through
> > > the FW internally), but that involves extra collaboration between the
> > > UMD, KMD and FW which we don't have at the moment. Not saying never,
> > > but I'd rather fix things gradually (first the blocking alloc in the
> > > fence-signalling path, then the optimization to share the extra mem
> > > reservation cost among contexts by returning the chunks to the global
> > > kernel pool rather than directly to the heap).
> >
> > The additional issue with borrowing memory from idle contexts is that it will
> > involve MMU operations, as we will have to move the memory into the active
> > context address space. CSF GPUs have a limitation that they can only work with
> > one address space for the active job when it comes to memory used internally
> > by the job, so we either have to map the scratch dynamic memory in all the
> > jobs before we submit them, or we will have to do MMU maintenance operations
> > in the OOM path in order to borrow memory from other contexts.
>
> Hm, this could be tricky. So mmu operations shouldn't be an issue because
> they must work for GFP_NOFS contexts for i/o writeback. You might need to
> manage this much more carefully and make sure the iommu has a big enough
> range of pagetables preallocated. This also holds for the other games
> we're playing here, at least for gpu pagetables. But since pagetables are
> really small overhead it might be good to somewhat aggressively
> preallocate them.
>
> But yeah this is possibly an issue if you need iommu wrangling, I
> have honestly not looked at the exact rules in there.
We already have a mechanism to pre-allocate page tables in panthor (for
async VM_BIND requests where we're not allowed to allocate in the
run_job() path), but as said before, I probably won't try this global
mem pool thing on Panthor, since Panthor can do without it for now.
The page table pre-allocation mechanism is something we can easily
transpose to panfrost though.
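
To illustrate the general shape of such a pre-allocation scheme, here is
a minimal userspace sketch. It is not the panthor code: struct
pt_reserve and the pt_reserve_*() helpers are hypothetical names made up
for this example, and calloc()/free() stand in for the kernel page
allocator. The only point is the split between a context where
allocation may block or fail (filling the reserve) and a context where
it must not allocate at all (taking from the reserve).

```c
#include <stddef.h>
#include <stdlib.h>

#define PT_SIZE 4096 /* assumed page-table page size, for illustration only */

/* Hypothetical per-request reserve of page-table pages. */
struct pt_reserve {
	void **pages;
	size_t count; /* pages still available in the reserve */
};

/*
 * Called where sleeping/failing is acceptable (think: ioctl time).
 * Reserves the worst-case number of page-table pages up front.
 * Returns 0 on success, -1 on failure; on failure the caller is
 * expected to call pt_reserve_release() to drop what was reserved.
 */
static int pt_reserve_fill(struct pt_reserve *r, size_t worst_case)
{
	r->count = 0;
	r->pages = calloc(worst_case, sizeof(*r->pages));
	if (!r->pages)
		return -1;

	while (r->count < worst_case) {
		void *page = calloc(1, PT_SIZE);

		if (!page)
			return -1;
		r->pages[r->count++] = page;
	}
	return 0;
}

/*
 * Called where allocation is forbidden (stand-in for the run_job()
 * path). Never allocates: it hands out a reserved page, or NULL if
 * the worst-case estimate turned out to be too small. Ownership of
 * the returned page moves to the caller.
 */
static void *pt_reserve_take(struct pt_reserve *r)
{
	return r->count ? r->pages[--r->count] : NULL;
}

/* Called once the request is done: frees whatever was not consumed. */
static void pt_reserve_release(struct pt_reserve *r)
{
	while (r->count)
		free(r->pages[--r->count]);
	free(r->pages);
	r->pages = NULL;
}
```

A caller would fill the reserve at submission time, take pages from it
while the job runs, and release the leftovers afterwards. Sizing the
reserve for the worst case trades some transient memory for never
having to allocate in the no-alloc path.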