All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ville Syrjälä" <ville.syrjala@linux.intel.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org,
	"Simona Vetter" <simona.vetter@ffwll.ch>,
	"Jani Nikula" <jani.nikula@intel.com>,
	"Jouni Högander" <jouni.hogander@intel.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Michel Dänzer" <michel.daenzer@mailbox.org>
Subject: Re: [PATCH 5/6] drm/i915/reset: Handle the display vs. GPU reset deadlock using a custom dma-fence
Date: Mon, 13 Apr 2026 12:11:35 +0300	[thread overview]
Message-ID: <adyzR6ISo_idi38p@intel.com> (raw)
In-Reply-To: <cc51ea63-19b0-4352-9299-e5ab63f57ce4@amd.com>

On Thu, Apr 09, 2026 at 02:17:07PM +0200, Christian König wrote:
> On 4/9/26 13:19, Ville Syrjälä wrote:
> > On Thu, Apr 09, 2026 at 12:46:11PM +0200, Christian König wrote:
> >> On 4/9/26 01:34, Ville Syrjala wrote:
> >>> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >>>
> >>> The old display vs. GPU reset deadlock is back more or less.
> >>> The old (working) solution to the problem was originally
> >>> introduced in commit 9db529aac938 ("drm/i915: More surgically
> >>> unbreak the modeset vs reset deadlock"), but it got nuked with
> >>> commit d59cf7bb73f3 ("drm/i915/display: Use dma_fence interfaces
> >>> instead of i915_sw_fence").
> >>>
> >>> Apparently no one looked hard enough to see that things didn't
> >>> work quite properly anymore. What is still saving us for the most
> >>> part is that we have a timeout on the fence wait
> >>> (CONFIG_DRM_I915_FENCE_TIMEOUT, 10 seconds by default). But
> >>> people are perhaps trying to get rid of that so we may need
> >>> another solution, and 10 seconds is a bit slow.
> >>
> >> Yeah agree that approach with the timeout is usually a big no no.
> >>
> >>> Re-solve the problem yet again with a custom dma-fence that gets
> >>> signaled just prior to a GPU reset, and have the atomic commit wait
> >>> for either that or the real fence using dma_fence_wait_any_timeout().
> >>
> >> Hui? I don't fully understand what the source of the problem is, but of hand that approach of solving it doesn't sound like a good idea either.
> >>
> >>> Whichever signals first will let the commit proceed. We create a new
> >>> "reset fence" whenever someone needs one, and keep it until the next
> >>> GPU reset has completed. After that the next guy will again get a
> >>> fresh unsignaled "reset fence".
> >>
> >> And that sounds even worse. A dma_fence which waits for the next GPU reset without triggering it itself would be an indefinite dma_fence which is not allowed.
> > 
> > This is a purely internal thing to the i915 display code. This fence
> > is never shared with anyone. And we never wait for just this fence,
> > the wait always happens alongside a real fence using
> > dma_fence_wait_any_timeout().
> > 
> >>
> >> I think something is missing in my picture how that is supposed to work.
> > 
> > The problem stems from the fact that on old platforms a GPU reset
> > also resets the display hardware,
> 
> Which is true for at least AMD GPUs and I think pretty much everybody else as well, but that wasn't so much of a problem so far.
> 
> > and to do that safely we need:
> > 1. shut down display
> > 2. perform the GPU reset
> > 3. restore the display hardware to its orignal state
> 
> Mhm, I've recently talked with Michel about it and we confirmed that this is perfectly possible without issues. Adding Michel as well.
> 
> > We just do that with essentially with a normal atomic commit.
> 
> I think that is the source of the problem.
> 
> I'm not an expert on that topic but amdgpu and tons of other drivers seem to just use drm_atomic_helper_shutdown() for that.

drm_atomic_helper_shutdown() is definitely not the thing to use
for this as it would clobber the stored kms state, leaving everything
permanently disabled. The drm_atomic_helper_commit_duplicated_state()
stuff i915 uses is the correct thing here.

But for this problem it doesn't even matter which gets used. Either
would get equally stuck behind a previous atomic commit waiting for
its fences.

> 
> What is i915 doing differently?

I see zero code for any display reset stuff in any other driver. If
amdgpu does anything it must be something completely custom, hidden
somewhere deep.

> 
> > But a 
> > previous atomic commit may already be waiting for a fence, which
> > won't signal until the GPU reset happens, and the GPU reset is now
> > waiting for that previous atomic commit to finish so that it can do
> > its own atomic commit. In order to break the deadlock we need to
> > abort the fence waits in the atomic commit, and that's what this
> > "reset fence" achieves.
> 
> As far as I can see that approach looks strongly like a no-go.
> 
> You essentially have a lock inversion here and it is documented that it should *never* be resolved by a timeout, the approach you take now is not much better.

Unless someone wants to add some kind of extra abort mechanism to
dma_fence_wait*() then I think this is probably the best solution.
And given this is only a thing for one driver on old hardware,
adding extra stuff to dma_fence_wait*() doesn't really seem worth
the hassle.

-- 
Ville Syrjälä
Intel

  reply	other threads:[~2026-04-13  9:11 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-08 23:34 [PATCH 0/6] drm/i915/reset: Solve display vs. GPU reset deadlock, again Ville Syrjala
2026-04-08 23:34 ` [PATCH 1/6] dma-buf: Remove old lies about dma_fence_wait_any_timeout() not accepting some fences Ville Syrjala
2026-04-09  8:09   ` Jani Nikula
2026-04-09 10:39   ` Christian König
2026-04-08 23:34 ` [PATCH 2/6] drm/i915/reset: Reorganize display reset code Ville Syrjala
2026-04-09  8:13   ` Jani Nikula
2026-04-08 23:34 ` [PATCH 3/6] drm/i915/reset: Move pending_fb_pin handling to i915 Ville Syrjala
2026-04-09  8:17   ` Jani Nikula
2026-04-08 23:34 ` [PATCH 4/6] drm/xe/display: Add init_clock_gating.h stubs Ville Syrjala
2026-04-09  8:19   ` Jani Nikula
2026-04-08 23:34 ` [PATCH 5/6] drm/i915/reset: Handle the display vs. GPU reset deadlock using a custom dma-fence Ville Syrjala
2026-04-09 10:37   ` Jani Nikula
2026-04-09 10:46   ` Christian König
2026-04-09 11:19     ` Ville Syrjälä
2026-04-09 12:17       ` Christian König
2026-04-13  9:11         ` Ville Syrjälä [this message]
2026-04-13  9:35           ` Christian König
2026-04-13  9:58             ` Ville Syrjälä
2026-04-13 10:08               ` Ville Syrjälä
2026-04-13 11:24               ` Christian König
2026-04-13 11:49                 ` Ville Syrjälä
2026-04-08 23:34 ` [PATCH 6/6] drm/i915/display: Make fence timeout infinite Ville Syrjala
2026-04-09 10:51   ` Jani Nikula
2026-04-08 23:42 ` ✗ CI.KUnit: failure for drm/i915/reset: Solve display vs. GPU reset deadlock, again Patchwork
2026-04-09  3:27 ` ✓ i915.CI.BAT: success " Patchwork
2026-04-09 12:06 ` ✓ i915.CI.Full: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=adyzR6ISo_idi38p@intel.com \
    --to=ville.syrjala@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=jani.nikula@intel.com \
    --cc=jouni.hogander@intel.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=michel.daenzer@mailbox.org \
    --cc=simona.vetter@ffwll.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.