From: "Christian König" <christian.koenig@amd.com>
To: "Ville Syrjälä" <ville.syrjala@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org,
dri-devel@lists.freedesktop.org,
"Simona Vetter" <simona.vetter@ffwll.ch>,
"Jani Nikula" <jani.nikula@intel.com>,
"Jouni Högander" <jouni.hogander@intel.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Michel Dänzer" <michel.daenzer@mailbox.org>
Subject: Re: [PATCH 5/6] drm/i915/reset: Handle the display vs. GPU reset deadlock using a custom dma-fence
Date: Thu, 9 Apr 2026 14:17:07 +0200 [thread overview]
Message-ID: <cc51ea63-19b0-4352-9299-e5ab63f57ce4@amd.com> (raw)
In-Reply-To: <adeLRHQ2omAv93yM@intel.com>
On 4/9/26 13:19, Ville Syrjälä wrote:
> On Thu, Apr 09, 2026 at 12:46:11PM +0200, Christian König wrote:
>> On 4/9/26 01:34, Ville Syrjala wrote:
>>> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>
>>> The old display vs. GPU reset deadlock is back more or less.
>>> The old (working) solution to the problem was originally
>>> introduced in commit 9db529aac938 ("drm/i915: More surgically
>>> unbreak the modeset vs reset deadlock"), but it got nuked with
>>> commit d59cf7bb73f3 ("drm/i915/display: Use dma_fence interfaces
>>> instead of i915_sw_fence").
>>>
>>> Apparently no one looked hard enough to see that things didn't
>>> work quite properly anymore. What is still saving us for the most
>>> part is that we have a timeout on the fence wait
>>> (CONFIG_DRM_I915_FENCE_TIMEOUT, 10 seconds by default). But
>>> people are perhaps trying to get rid of that so we may need
>>> another solution, and 10 seconds is a bit slow.
>>
>> Yeah agree that approach with the timeout is usually a big no no.
>>
>>> Re-solve the problem yet again with a custom dma-fence that gets
>>> signaled just prior to a GPU reset, and have the atomic commit wait
>>> for either that or the real fence using dma_fence_wait_any_timeout().
>>
>> Hui? I don't fully understand what the source of the problem is, but of hand that approach of solving it doesn't sound like a good idea either.
>>
>>> Whichever signals first will let the commit proceed. We create a new
>>> "reset fence" whenever someone needs one, and keep it until the next
>>> GPU reset has completed. After that the next guy will again get a
>>> fresh unsignaled "reset fence".
>>
>> And that sounds even worse. A dma_fence which waits for the next GPU reset without triggering it itself would be an indefinite dma_fence which is not allowed.
>
> This is a purely internal thing to the i915 display code. This fence
> is never shared with anyone. And we never wait for just this fence,
> the wait always happens alongside a real fence using
> dma_fence_wait_any_timeout().
>
>>
>> I think something is missing in my picture how that is supposed to work.
>
> The problem stems from the fact that on old platforms a GPU reset
> also resets the display hardware,
Which is true for at least AMD GPUs and I think pretty much everybody else as well, but that wasn't so much of a problem so far.
> and to do that safely we need:
> 1. shut down display
> 2. perform the GPU reset
> 3. restore the display hardware to its orignal state
Mhm, I've recently talked with Michel about it and we confirmed that this is perfectly possible without issues. Adding Michel as well.
> We just do that with essentially with a normal atomic commit.
I think that is the source of the problem.
I'm not an expert on that topic but amdgpu and tons of other drivers seem to just use drm_atomic_helper_shutdown() for that.
What is i915 doing differently?
> But a
> previous atomic commit may already be waiting for a fence, which
> won't signal until the GPU reset happens, and the GPU reset is now
> waiting for that previous atomic commit to finish so that it can do
> its own atomic commit. In order to break the deadlock we need to
> abort the fence waits in the atomic commit, and that's what this
> "reset fence" achieves.
As far as I can see that approach looks strongly like a no-go.
You essentially have a lock inversion here and it is documented that it should *never* be resolved by a timeout, the approach you take now is not much better.
Regards,
Christian.
>
>>
>> Thanks,
>> Christian.
>>
>>>
>>> Cc: Simona Vetter <simona.vetter@ffwll.ch>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Jani Nikula <jani.nikula@intel.com>
>>> Cc: Jouni Högander <jouni.hogander@intel.com>
>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/display/intel_display.c | 34 +++++---
>>> .../gpu/drm/i915/display/intel_display_core.h | 6 ++
>>> .../drm/i915/display/intel_display_driver.c | 5 ++
>>> .../drm/i915/display/intel_display_reset.c | 77 +++++++++++++++++++
>>> .../drm/i915/display/intel_display_reset.h | 4 +
>>> drivers/gpu/drm/xe/Makefile | 1 +
>>> 6 files changed, 117 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
>>> index 58a654ca0d20..83ccf13c4b16 100644
>>> --- a/drivers/gpu/drm/i915/display/intel_display.c
>>> +++ b/drivers/gpu/drm/i915/display/intel_display.c
>>> @@ -72,6 +72,7 @@
>>> #include "intel_display_driver.h"
>>> #include "intel_display_power.h"
>>> #include "intel_display_regs.h"
>>> +#include "intel_display_reset.h"
>>> #include "intel_display_rpm.h"
>>> #include "intel_display_types.h"
>>> #include "intel_display_utils.h"
>>> @@ -7149,22 +7150,35 @@ static void skl_commit_modeset_enables(struct intel_atomic_state *state)
>>>
>>> static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_state)
>>> {
>>> - struct drm_plane *plane;
>>> + struct intel_display *display = to_intel_display(intel_state);
>>> struct drm_plane_state *new_plane_state;
>>> - long ret;
>>> + struct dma_fence *reset_fence;
>>> + struct drm_plane *plane;
>>> int i;
>>>
>>> + reset_fence = intel_display_reset_fence_get(display);
>>> +
>>> for_each_new_plane_in_state(&intel_state->base, plane, new_plane_state, i) {
>>> - if (new_plane_state->fence) {
>>> - ret = dma_fence_wait_timeout(new_plane_state->fence, false,
>>> - i915_fence_timeout());
>>> - if (ret <= 0)
>>> - break;
>>> + struct dma_fence *fences[2] = {
>>> + [0] = new_plane_state->fence,
>>> + [1] = reset_fence,
>>> + };
>>> + long ret;
>>>
>>> - dma_fence_put(new_plane_state->fence);
>>> - new_plane_state->fence = NULL;
>>> - }
>>> + if (!new_plane_state->fence)
>>> + continue;
>>> +
>>> + ret = dma_fence_wait_any_timeout(fences, reset_fence ? 2 : 1, false,
>>> + i915_fence_timeout(), NULL);
>>> + if (ret <= 0)
>>> + break;
>>> +
>>> + dma_fence_put(new_plane_state->fence);
>>> + new_plane_state->fence = NULL;
>>> }
>>> +
>>> + if (reset_fence)
>>> + dma_fence_put(reset_fence);
>>> }
>>>
>>> static void intel_atomic_dsb_wait_commit(struct intel_crtc_state *crtc_state)
>>> diff --git a/drivers/gpu/drm/i915/display/intel_display_core.h b/drivers/gpu/drm/i915/display/intel_display_core.h
>>> index 9e77003addd0..6687b658c51d 100644
>>> --- a/drivers/gpu/drm/i915/display/intel_display_core.h
>>> +++ b/drivers/gpu/drm/i915/display/intel_display_core.h
>>> @@ -556,6 +556,12 @@ struct intel_display {
>>> unsigned long mask;
>>> } quirks;
>>>
>>> + struct {
>>> + /* protects reset.fence */
>>> + struct mutex mutex;
>>> + struct dma_fence *fence;
>>> + } reset;
>>> +
>>> struct {
>>> /* restore state for suspend/resume and display reset */
>>> struct drm_atomic_state *modeset_state;
>>> diff --git a/drivers/gpu/drm/i915/display/intel_display_driver.c b/drivers/gpu/drm/i915/display/intel_display_driver.c
>>> index 23bfecc983e8..fcd31722c731 100644
>>> --- a/drivers/gpu/drm/i915/display/intel_display_driver.c
>>> +++ b/drivers/gpu/drm/i915/display/intel_display_driver.c
>>> @@ -34,6 +34,7 @@
>>> #include "intel_display_driver.h"
>>> #include "intel_display_irq.h"
>>> #include "intel_display_power.h"
>>> +#include "intel_display_reset.h"
>>> #include "intel_display_types.h"
>>> #include "intel_display_utils.h"
>>> #include "intel_display_wa.h"
>>> @@ -257,6 +258,8 @@ int intel_display_driver_probe_noirq(struct intel_display *display)
>>>
>>> intel_mode_config_init(display);
>>>
>>> + intel_display_reset_fence_init(display);
>>> +
>>> ret = intel_cdclk_init(display);
>>> if (ret)
>>> goto cleanup_wq_unordered;
>>> @@ -584,6 +587,8 @@ void intel_display_driver_remove(struct intel_display *display)
>>> if (!HAS_DISPLAY(display))
>>> return;
>>>
>>> + intel_display_reset_fence_discard(display);
>>> +
>>> flush_workqueue(display->wq.flip);
>>> flush_workqueue(display->wq.modeset);
>>> flush_workqueue(display->wq.cleanup);
>>> diff --git a/drivers/gpu/drm/i915/display/intel_display_reset.c b/drivers/gpu/drm/i915/display/intel_display_reset.c
>>> index ca15dc18ef0f..80dd2ea8a0c2 100644
>>> --- a/drivers/gpu/drm/i915/display/intel_display_reset.c
>>> +++ b/drivers/gpu/drm/i915/display/intel_display_reset.c
>>> @@ -3,6 +3,8 @@
>>> * Copyright © 2023 Intel Corporation
>>> */
>>>
>>> +#include <linux/dma-fence.h>
>>> +
>>> #include <drm/drm_atomic_helper.h>
>>> #include <drm/drm_print.h>
>>>
>>> @@ -16,6 +18,72 @@
>>> #include "intel_hotplug.h"
>>> #include "intel_pps.h"
>>>
>>> +static const char *intel_display_reset_fence_get_driver_name(struct dma_fence *fence)
>>> +{
>>> + return "intel_display";
>>> +}
>>> +
>>> +static const char *intel_display_reset_fence_get_timeline_name(struct dma_fence *fence)
>>> +{
>>> + return "reset";
>>> +}
>>> +
>>> +static const struct dma_fence_ops intel_display_reset_fence_ops = {
>>> + .get_driver_name = intel_display_reset_fence_get_driver_name,
>>> + .get_timeline_name = intel_display_reset_fence_get_timeline_name,
>>> +};
>>> +
>>> +static void intel_display_reset_create(struct intel_display *display)
>>> +{
>>> + struct dma_fence *fence;
>>> +
>>> + fence = kzalloc_obj(*fence);
>>> + if (!fence)
>>> + return;
>>> +
>>> + dma_fence_init(fence, &intel_display_reset_fence_ops, NULL, 0, 0);
>>> +
>>> + display->reset.fence = fence;
>>> +}
>>> +
>>> +struct dma_fence *intel_display_reset_fence_get(struct intel_display *display)
>>> +{
>>> + struct dma_fence *fence;
>>> +
>>> + mutex_lock(&display->reset.mutex);
>>> +
>>> + if (!display->reset.fence)
>>> + intel_display_reset_create(display);
>>> +
>>> + fence = display->reset.fence;
>>> + if (fence)
>>> + dma_fence_get(fence);
>>> +
>>> + mutex_unlock(&display->reset.mutex);
>>> +
>>> + return fence;
>>> +}
>>> +
>>> +void intel_display_reset_fence_discard(struct intel_display *display)
>>> +{
>>> + struct dma_fence *fence;
>>> +
>>> + mutex_lock(&display->reset.mutex);
>>> +
>>> + fence = display->reset.fence;
>>> + if (fence)
>>> + dma_fence_put(fence);
>>> +
>>> + display->reset.fence = NULL;
>>> +
>>> + mutex_unlock(&display->reset.mutex);
>>> +}
>>> +
>>> +void intel_display_reset_fence_init(struct intel_display *display)
>>> +{
>>> + mutex_init(&display->reset.mutex);
>>> +}
>>> +
>>> bool intel_display_reset_supported(struct intel_display *display)
>>> {
>>> return HAS_DISPLAY(display);
>>> @@ -31,8 +99,15 @@ void intel_display_reset_prepare(struct intel_display *display)
>>> {
>>> struct drm_modeset_acquire_ctx *ctx = &display->restore.reset_ctx;
>>> struct drm_atomic_state *state;
>>> + struct dma_fence *reset_fence;
>>> int ret;
>>>
>>> + reset_fence = intel_display_reset_fence_get(display);
>>> + if (reset_fence) {
>>> + dma_fence_signal(reset_fence);
>>> + dma_fence_put(reset_fence);
>>> + }
>>> +
>>> /*
>>> * Need mode_config.mutex so that we don't
>>> * trample ongoing ->detect() and whatnot.
>>> @@ -110,6 +185,8 @@ void intel_display_reset_finish(struct intel_display *display, bool test_only)
>>>
>>> drm_atomic_state_put(state);
>>> unlock:
>>> + intel_display_reset_fence_discard(display);
>>> +
>>> drm_modeset_drop_locks(ctx);
>>> drm_modeset_acquire_fini(ctx);
>>> mutex_unlock(&display->drm->mode_config.mutex);
>>> diff --git a/drivers/gpu/drm/i915/display/intel_display_reset.h b/drivers/gpu/drm/i915/display/intel_display_reset.h
>>> index a8aa7729d33f..c36a075c6b4d 100644
>>> --- a/drivers/gpu/drm/i915/display/intel_display_reset.h
>>> +++ b/drivers/gpu/drm/i915/display/intel_display_reset.h
>>> @@ -10,6 +10,10 @@
>>>
>>> struct intel_display;
>>>
>>> +struct dma_fence *intel_display_reset_fence_get(struct intel_display *display);
>>> +void intel_display_reset_fence_discard(struct intel_display *display);
>>> +void intel_display_reset_fence_init(struct intel_display *display);
>>> +
>>> bool intel_display_reset_supported(struct intel_display *display);
>>> bool intel_display_reset_test(struct intel_display *display);
>>> void intel_display_reset_prepare(struct intel_display *display);
>>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>>> index 110fef511fe2..1a85dfe457f0 100644
>>> --- a/drivers/gpu/drm/xe/Makefile
>>> +++ b/drivers/gpu/drm/xe/Makefile
>>> @@ -262,6 +262,7 @@ xe-$(CONFIG_DRM_XE_DISPLAY) += \
>>> i915-display/intel_display_power.o \
>>> i915-display/intel_display_power_map.o \
>>> i915-display/intel_display_power_well.o \
>>> + i915-display/intel_display_reset.o \
>>> i915-display/intel_display_rpm.o \
>>> i915-display/intel_display_rps.o \
>>> i915-display/intel_display_trace.o \
>
next prev parent reply other threads:[~2026-04-09 12:17 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-08 23:34 [PATCH 0/6] drm/i915/reset: Solve display vs. GPU reset deadlock, again Ville Syrjala
2026-04-08 23:34 ` [PATCH 1/6] dma-buf: Remove old lies about dma_fence_wait_any_timeout() not accepting some fences Ville Syrjala
2026-04-09 8:09 ` Jani Nikula
2026-04-09 10:39 ` Christian König
2026-04-08 23:34 ` [PATCH 2/6] drm/i915/reset: Reorganize display reset code Ville Syrjala
2026-04-09 8:13 ` Jani Nikula
2026-04-08 23:34 ` [PATCH 3/6] drm/i915/reset: Move pending_fb_pin handling to i915 Ville Syrjala
2026-04-09 8:17 ` Jani Nikula
2026-04-08 23:34 ` [PATCH 4/6] drm/xe/display: Add init_clock_gating.h stubs Ville Syrjala
2026-04-09 8:19 ` Jani Nikula
2026-04-08 23:34 ` [PATCH 5/6] drm/i915/reset: Handle the display vs. GPU reset deadlock using a custom dma-fence Ville Syrjala
2026-04-09 10:37 ` Jani Nikula
2026-04-09 10:46 ` Christian König
2026-04-09 11:19 ` Ville Syrjälä
2026-04-09 12:17 ` Christian König [this message]
2026-04-08 23:34 ` [PATCH 6/6] drm/i915/display: Make fence timeout infinite Ville Syrjala
2026-04-09 10:51 ` Jani Nikula
2026-04-09 3:27 ` ✓ i915.CI.BAT: success for drm/i915/reset: Solve display vs. GPU reset deadlock, again Patchwork
2026-04-09 12:06 ` ✓ i915.CI.Full: " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cc51ea63-19b0-4352-9299-e5ab63f57ce4@amd.com \
--to=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=jani.nikula@intel.com \
--cc=jouni.hogander@intel.com \
--cc=maarten.lankhorst@linux.intel.com \
--cc=michel.daenzer@mailbox.org \
--cc=simona.vetter@ffwll.ch \
--cc=ville.syrjala@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox