Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Matthew Brost <matthew.brost@intel.com>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: robdclark@chromium.org, airlied@linux.ie, lina@asahilina.net,
	"Thomas Hellström (Intel)" <thomas_os@shipmail.org>,
	dri-devel@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	boris.brezillon@collabora.com,
	"Christian König" <christian.koenig@amd.com>,
	faith.ekstrand@collabora.com
Subject: Re: [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
Date: Thu, 6 Apr 2023 16:58:33 +0000	[thread overview]
Message-ID: <ZC76OWKHmjwfx7mo@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <ZC5nm73p6SByTXDn@phenom.ffwll.local>

On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > > 
> > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > > Hi, Christian,
> > > > > > > > > 
> > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > > From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > > > > > > 
> > > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > > completion
> > > > > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > > > > 
> > > > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > > > desirable for
> > > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > > handling also with
> > > > > > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > > > > > dma-fence protocol.
> > > > > > > > > > > 
> > > > > > > > > > > Introduce long-running completion fences in form of
> > > > > > > > > > > dma-fences, and add
> > > > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > > > 
> > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > > > > > helper adding
> > > > > > > > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > > > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > > > > > scheduler chaining
> > > > > > > > > > >    a new long-running fence on another one.
> > > > > > > > > > 
> > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > > 
> > > > > > 
> > > > > > I don't think this quite the same, this explictly enforces that we don't
> > > > > > break the dma-fence rules (in path of memory allocations, exported in
> > > > > > any way), essentially this just SW sync point reusing dma-fence the
> > > > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > > > export these fences to user space (admittedly I haven't fully read your
> > > > > > series).
> > > > > > 
> > > > > > In this use case we essentially just want to flow control the ring via
> > > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > > > used for cleanup if LR entity encounters an error. To me this seems
> > > > > > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > > > > > 
> > > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > > in the backend regardless on ring space, maintain a list of jobs pending
> > > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > > route, now all drivers are going to invent ways to handle LR jobs /w the
> > > > > > DRM scheduler.
> > > > > > 
> > > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > > export any fences from the scheduler. If you try to export these fences
> > > > > > a blow up happens.
> > > > > 
> > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > schedulers on an engine/set-of-engines to quiescent or things get
> > > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > > dma_fence guarantees are right out the window.
> > > > > 
> > > > 
> > > > Right, a GT reset on Xe is:
> > > > 
> > > > Stop all schedulers
> > > > Do a reset
> > > > Ban any schedulers which we think caused the GT reset
> > > > Resubmit all schedulers which we think were good
> > > > Restart all schedulers
> > > > 
> > > > None of this flow depends on LR dma-fences, all of this uses the DRM
> > > > sched infrastructure and work very well compared to the i915. Rewriting
> > > > all this with a driver specific implementation is what we are trying to
> > > > avoid.
> > > > 
> > > > Similarly if LR entity hangs on its own (not a GT reset, rather the
> > > > firmware does the reset for us) we use all the DRM scheduler
> > > > infrastructure to handle this. Again this works rather well...
> > > 
> > > Yeah this is why I don't think duplicating everything that long-running
> > > jobs need makes any sense. iow I agree with you.
> > > 
> > 
> > Glad we agree.
> > 
> > > > > But the issue you're having is fairly specific if it's just about
> > > > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > > > out of per-ctx ringspace, and call it a day. This notion that somehow the
> > > > 
> > > > How does that not break the dma-fence rules? A job can publish its
> > > > finished fence after ARM, if the finished fence fence waits on ring
> > > > space that may not free up in a reasonable amount of time we now have
> > > > broken the dma-dence rules. My understanding is any dma-fence must only
> > > > on other dma-fence, Christian seems to agree and NAK'd just blocking if
> > > > no space available [1]. IMO this series ensures we don't break dma-fence
> > > > rules by restricting how the finished fence can be used.
> > > 
> > > Oh I meant in the submit ioctl, _before_ you even call
> > > drm_sched_job_arm(). It's ok to block in there indefinitely.
> > >
> > 
> > Ok, but how do we determine if their is ring space, wait on xe_hw_fence
> > which is a dma-fence. We just move a wait from the scheduler to the exec
> > IOCTL and I realy fail to see the point of that.
> 
> Fill in anything you need into the ring at ioctl time, but don't update
> the tail pointers? If there's no space, then EWOULDBLCK.
> 

Ok, I can maybe buy this approach and this is fairly easy to do. I'm
going to do this for LR jobs only though (non-LR job will still flow
control on the ring via the scheduler + write ring in run_job). A bit of
duplicate code but I can live with this.

> > > > > kernel is supposed to provide a bottomless queue of anything userspace
> > > > > submits simply doesn't hold up in reality (as much as userspace standards
> > > > > committees would like it to), and as long as it doesn't have a real-world
> > > > > perf impact it doesn't really matter why we end up blocking in the submit
> > > > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > > > page reclaim.
> > > > > 
> > > > > > > > > > And the reasons why it was rejected haven't changed.
> > > > > > > > > > 
> > > > > > > > > > Regards,
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > > > > > > tackle this problem while being able to reuse the scheduler for
> > > > > > > > > long-running workloads.
> > > > > > > > > 
> > > > > > > > > I couldn't see any clear decision on your series, though, but one
> > > > > > > > > main difference I see is that this is intended for driver-internal
> > > > > > > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > > > driver-private use). This is by no means a way to try tackle the
> > > > > > > > > indefinite fence problem.
> > > > > > > > 
> > > > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > > > problems are the same as with your approach: When we express such
> > > > > > > > operations as dma_fence there is always the change that we leak that
> > > > > > > > somewhere.
> > > > > > > > 
> > > > > > > > My approach of adding a flag noting that this operation is dangerous and
> > > > > > > > can't be synced with something memory management depends on tried to
> > > > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > > > rejected it (for good reasons I think).
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > > > > should still be annotated in one way or annotated one way or another
> > > > > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > > > > do anything bad.
> > > > > > > > > 
> > > > > > > > >  So any suggestions as to what would be the better solution here
> > > > > > > > > would be appreciated.
> > > > > > > > 
> > > > > > > > Mhm, do we really the the GPU scheduler for that?
> > > > > > > > 
> > > > > > 
> > > > > > I think we need to solve this within the DRM scheduler one way or
> > > > > > another.
> > > > > 
> > > > > Yeah so if we conclude that the queue really must be bottomless then I
> > > > > agree drm-sched should help out sort out the mess. Because I'm guessing
> > > > > that every driver will have this issue. But that's a big if.
> > > > > 
> > > > > I guess if we teach the drm scheduler that some jobs are fairly endless
> > > > > then maybe it wouldn't be too far-fetched to also teach it to wait for a
> > > > > previous one to finish (but not with the dma_fence that preempts, which we
> > > > > put into the dma_resv for memory management, but some other struct
> > > > > completion). The scheduler already has a concept of not stuffing too much
> > > > > stuff into the same queue after all, so this should fit?
> > > > 
> > > > See above, exact same situation as spinning on flow controling the ring,
> > > > this IMO absolutely breaks the dma-fence rules. IMO the correct solution
> > > > is to have a DRM that doesn't export dma-fences, this is exactly what
> > > > this series does as if we try to, boom lockdep / warn on blow up.
> > > 
> > > I dont think it's impossible to do this correctly, but definitely very,
> > > very hard. Which is why neither Christian nor me like the idea :-)
> > > 
> > > Essentially you'd have to make sure that any indefinite way will still
> > > react to drm_sched_job, so that you're not holding up a gt reset or
> > > anything like that, but only ever hold up forward progress for this
> > > specific scheduler/drm_sched_entity. Which you can do as long (and again,
> > > another hugely tricky detail) you still obey the preempt-ctx dma_fence and
> > > manage to preempt the underlying long-running ctx even when the drm/sched
> > > is stuck waiting for an indefinite fence (like waiting for ringspace or
> > > something like that).
> > > 
> > > So I don't think it's impossible, but very far away from "a good idea" :-)
> > > 
> > > Hence to proposal to bail out of this entire mess by throwing EWOULDBLCK
> > > back to userspace directly from the ioctl function, where you still can do
> > > that without breaking any dma_fence rules. Or if it's not a case that
> > > matters in practice, simply block in the ioctl handler instead of
> > > returning EWOULDBLCK.
> > 
> > Returning EWOULDBLCK on a full ring is reasonsible I guess but again
> > without returning a fence in run job the TDR can't be used for clean up
> > on LR entities which will result in duplicate code open coded by each
> > driver. Same goes for checking ring full in exec.
> > 
> > How about this:
> > - We mark xe_hw_fence as LR to ensure it can't be exported, return this
> >   in run_job which gives flow control on the ring + the handy TDR
> >   functionality
> > - When a scheduler is marked as LR, we do not generate finished fences
> >   for jobs
> > - We heavily, heavily scrutinize any usage of the LR fence flag going
> >   foward
> > - We document all of this very loudly
> > 
> > Is this reasonable?
> 
> I'm not seeing why it's needed? If you're worried about TDR duplication
> then I think we need something else. Because for long-running ctx we never
> have a timeout of the ctx itself (by definition). The only thing we time
> out on is the preempt, so I guess what could be done:
> - have the minimal scaffolding to support the preempt-ctx fence in
>   drm_sched_entity
> - when the preempt ctx fence enables signalling a) callback to the driver
>   to start the preempt (which should signal the fence) b) start a timer,
>   which should catch if the preempt takes too long

The GuC does this for us, no need.

> - if the timeout first (importantly we only enable that when the
>   preemption is trigger, not by default), kick of the normal drm/sched tdr
>   flow. maybe needs some adjustements in case there's different handling
>   needed for when a preemption times out compared to just a job timing out
>

The GuC imforms us this and yea we kick the TDR.

> I think this might make sense for sharing timeout handling needs for
> long-running context. What you proposed I don't really follow why it
> should exist, because that kind of timeout handling should not ever happen
> for long-running jobs.

We use just the TDR a as single cleanup point for all entities. In the
case of a LR entity this occurs if the GuC issues a reset on the
entity (liekly preempt timeout), the entity takes a non-recoverable page
fail, or the entity to the root cause of a GT reset. The pending job
list here is handy, that why I wanted to return xe_hw_fence in run_job
to hold the job in the scheduler pending list. The doesn't TDR won't
fire if the pending list is empty.

Based on what you are saying my new proposal:

1. Write ring in exec for LR jobs, return -EWOULDBLCK if no space in
ring
2. Return NULL in run_job (or alternatively a signaled fence)
3. Have specical cased cleanup flow for LR entites (not the TDR, rather
likely a different worker we kick owned by the xe_engine).
4. Document this some that this how drivers are expected to do LR
workloads plus DRM scheduler

1 & 3 are pretty clear duplicates of code but I can live with that if
I can get Ack on the plan + move on. The coding will not be all that
difficult either, I am just being difficult. In the is probably a 100ish
lines of code.

What do you think Daniel, seem like a reasonable plan?

Matt

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

next prev parent reply	other threads:[~2023-04-06 16:58 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-04  0:22 [Intel-xe] [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
2023-06-09  6:58   ` Boris Brezillon
2023-07-31  0:56     ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 02/10] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
2023-04-05 17:37   ` Luben Tuikov
2023-04-05 18:29     ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 03/10] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 04/10] drm/sched: Add generic scheduler message interface Matthew Brost
2023-05-04  5:28   ` Luben Tuikov
2023-07-31  2:42     ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 05/10] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 06/10] drm/sched: Submit job before starting TDR Matthew Brost
2023-05-04  5:23   ` Luben Tuikov
2023-07-31  1:00     ` Matthew Brost
2023-07-31  7:26       ` Boris Brezillon
2023-08-31 19:48         ` Luben Tuikov
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 07/10] drm/sched: Add helper to set TDR timeout Matthew Brost
2023-05-04  5:28   ` Luben Tuikov
2023-07-31  1:09     ` Matthew Brost
2023-08-31 19:52       ` Luben Tuikov
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences Matthew Brost
2023-04-04  9:09   ` Christian König
2023-04-04 12:54     ` Thomas Hellström
2023-04-04 13:10       ` Christian König
2023-04-04 18:14         ` Thomas Hellström (Intel)
2023-04-04 19:02           ` Matthew Brost
2023-04-04 19:25             ` Daniel Vetter
2023-04-04 19:48               ` Matthew Brost
2023-04-05 13:09                 ` Daniel Vetter
2023-04-05 23:58                   ` Matthew Brost
2023-04-06  6:32                     ` Daniel Vetter
2023-04-06 16:58                       ` Matthew Brost [this message]
2023-04-06 17:09                         ` Daniel Vetter
2023-04-05 12:35               ` Thomas Hellström
2023-04-05 12:39                 ` Christian König
2023-04-05 12:45                   ` Daniel Vetter
2023-04-05 14:08                     ` Christian König
2023-04-04 19:00         ` Daniel Vetter
2023-04-04 20:03           ` Matthew Brost
2023-04-04 20:11             ` Daniel Vetter
2023-04-04 20:19               ` Matthew Brost
2023-04-04 20:31                 ` Daniel Vetter
2023-04-04 20:46                   ` Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 09/10] drm/sched: Support long-running sched entities Matthew Brost
2023-04-04  0:22 ` [Intel-xe] [RFC PATCH 10/10] drm/syncobj: Warn on long running dma-fences Matthew Brost
2023-04-04  0:24 ` [Intel-xe] ✗ CI.Patch_applied: failure for Xe DRM scheduler and long running workload plans Patchwork
2023-04-04  1:07 ` [Intel-xe] [RFC PATCH 00/10] " Asahi Lina
2023-04-04  1:58   ` Matthew Brost
2023-04-08  7:05     ` Asahi Lina
2023-04-11 14:07       ` Daniel Vetter
2023-04-12  5:47         ` Asahi Lina
2023-04-12  8:18           ` Daniel Vetter
2023-04-17  0:03       ` Matthew Brost
2023-04-04  9:04 ` Christian König
2023-04-04 13:23   ` Matthew Brost
2023-04-04  9:13 ` Christian König
2023-04-04 13:37   ` Matthew Brost
2023-04-05  7:41     ` Christian König
2023-04-05  8:34       ` Daniel Vetter
2023-04-05  8:53         ` Christian König
2023-04-05  9:07           ` Daniel Vetter
2023-04-05  9:57             ` Christian König
2023-04-05 10:12               ` Daniel Vetter
2023-04-06  2:08                 ` Matthew Brost
2023-04-06  6:37                   ` Daniel Vetter
2023-04-06 10:14                     ` Christian König
2023-04-06 10:32                       ` Daniel Vetter
2023-04-04  9:43 ` Tvrtko Ursulin
2023-04-04  9:48   ` Christian König
2023-04-04 13:43     ` Matthew Brost
2023-04-04 13:52   ` Matthew Brost
2023-04-04 17:29     ` Tvrtko Ursulin
2023-04-04 19:07       ` Daniel Vetter
2023-04-04 18:02 ` Zeng, Oak
2023-04-04 18:08   ` Matthew Brost
2023-04-05  7:30     ` Christian König
2023-04-05  8:42       ` Daniel Vetter
2023-04-05 18:06       ` Zeng, Oak
2023-04-05 18:53         ` Matthew Brost
2023-04-06 10:04           ` Christian König
2023-04-07  0:20           ` Zeng, Oak
2023-04-11  9:02             ` Christian König
2023-04-11 14:13               ` Daniel Vetter
2023-04-17  6:47                 ` Christian König
2023-04-17  8:39                   ` Daniel Vetter
2023-04-18 15:10 ` Liviu Dudau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZC76OWKHmjwfx7mo@DUT025-TGLU.fm.intel.com \
    --to=matthew.brost@intel.com \
    --cc=airlied@linux.ie \
    --cc=boris.brezillon@collabora.com \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=faith.ekstrand@collabora.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lina@asahilina.net \
    --cc=robdclark@chromium.org \
    --cc=thomas_os@shipmail.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox