From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 764BCEBFD22 for ; Mon, 13 Apr 2026 09:11:43 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id E384910E398; Mon, 13 Apr 2026 09:11:42 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="RmbmWuz0"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14]) by gabe.freedesktop.org (Postfix) with ESMTPS id E94E610E389; Mon, 13 Apr 2026 09:11:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1776071502; x=1807607502; h=date:from:to:cc:subject:message-id:references: mime-version:content-transfer-encoding:in-reply-to; bh=IxvGRtzD4DwJ9oWyjEEu23bDs4r8q04RyUD+jpLeV/I=; b=RmbmWuz0YnDLFpOy0Or2tArbgLWizQs58b7Mzwn96sJr5LhE3a57JUxx VX9ErQcMGw50WQVgl/Cr6wD+d78M+cRoO9zC6iliXaJqf7WrOIrEVaANI vmQ4KqXzzQ8C80PLKSJzBZJ64vcF6zydtYMnQNufWzGgXx7+UYl88rp9Z nGGwhDZIFfGG6+wpK4sikNK7J4VVK4gARNCmwNiKrAP9vwu7b1swTmsk5 /+iwYtZ8Coh6R2rTzUWfKYEZ2V7njovalswb2ypDB6V2BnoBzpqcTiMR7 qIsbsm6z4abOjmrNXE8xCK5yD3a78Nebzv+z02Y4CZ/RDkz0tLRwAzgNM w==; X-CSE-ConnectionGUID: kE/fWliHT5mEMgP42taxiQ== X-CSE-MsgGUID: ZmDwSGBkQ0m9Q57Mzwhatg== X-IronPort-AV: E=McAfee;i="6800,10657,11757"; a="77071068" X-IronPort-AV: E=Sophos;i="6.23,176,1770624000"; d="scan'208";a="77071068" Received: from fmviesa005.fm.intel.com ([10.60.135.145]) by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Apr 2026 02:11:41 -0700 X-CSE-ConnectionGUID: UjOu0DyeSOaX+XeuJ1m1mA== X-CSE-MsgGUID: hJAlFyspRBO01OOymtJ5DQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,176,1770624000"; d="scan'208";a="234656597" Received: from fpallare-mobl4.ger.corp.intel.com (HELO localhost) ([10.245.244.235]) by fmviesa005-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Apr 2026 02:11:38 -0700 Date: Mon, 13 Apr 2026 12:11:35 +0300 From: Ville =?iso-8859-1?Q?Syrj=E4l=E4?= To: Christian =?iso-8859-1?Q?K=F6nig?= Cc: intel-gfx@lists.freedesktop.org, intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Simona Vetter , Jani Nikula , Jouni =?iso-8859-1?Q?H=F6gander?= , Maarten Lankhorst , Michel =?iso-8859-1?Q?D=E4nzer?= Subject: Re: [PATCH 5/6] drm/i915/reset: Handle the display vs. GPU reset deadlock using a custom dma-fence Message-ID: References: <20260408233458.22666-1-ville.syrjala@linux.intel.com> <20260408233458.22666-6-ville.syrjala@linux.intel.com> <44fa373c-6216-4cc4-a605-94776b3873ad@amd.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Patchwork-Hint: comment Organization: Intel Finland Oy - BIC 0357606-4 - c/o Alberga Business Park, 6 krs Bertel Jungin Aukio 5, 02600 Espoo, Finland X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Thu, Apr 09, 2026 at 02:17:07PM +0200, Christian König wrote: > On 4/9/26 13:19, Ville Syrjälä wrote: > > On Thu, Apr 09, 2026 at 12:46:11PM +0200, Christian König wrote: > >> On 4/9/26 01:34, Ville Syrjala wrote: > >>> From: Ville Syrjälä > >>> > >>> The old display vs. GPU reset deadlock is back more or less. > >>> The old (working) solution to the problem was originally > >>> introduced in commit 9db529aac938 ("drm/i915: More surgically > >>> unbreak the modeset vs reset deadlock"), but it got nuked with > >>> commit d59cf7bb73f3 ("drm/i915/display: Use dma_fence interfaces > >>> instead of i915_sw_fence"). > >>> > >>> Apparently no one looked hard enough to see that things didn't > >>> work quite properly anymore. What is still saving us for the most > >>> part is that we have a timeout on the fence wait > >>> (CONFIG_DRM_I915_FENCE_TIMEOUT, 10 seconds by default). But > >>> people are perhaps trying to get rid of that so we may need > >>> another solution, and 10 seconds is a bit slow. > >> > >> Yeah agree that approach with the timeout is usually a big no no. > >> > >>> Re-solve the problem yet again with a custom dma-fence that gets > >>> signaled just prior to a GPU reset, and have the atomic commit wait > >>> for either that or the real fence using dma_fence_wait_any_timeout(). > >> > >> Hui? I don't fully understand what the source of the problem is, but of hand that approach of solving it doesn't sound like a good idea either. > >> > >>> Whichever signals first will let the commit proceed. We create a new > >>> "reset fence" whenever someone needs one, and keep it until the next > >>> GPU reset has completed. After that the next guy will again get a > >>> fresh unsignaled "reset fence". > >> > >> And that sounds even worse. A dma_fence which waits for the next GPU reset without triggering it itself would be an indefinite dma_fence which is not allowed. > > > > This is a purely internal thing to the i915 display code. This fence > > is never shared with anyone. And we never wait for just this fence, > > the wait always happens alongside a real fence using > > dma_fence_wait_any_timeout(). > > > >> > >> I think something is missing in my picture how that is supposed to work. > > > > The problem stems from the fact that on old platforms a GPU reset > > also resets the display hardware, > > Which is true for at least AMD GPUs and I think pretty much everybody else as well, but that wasn't so much of a problem so far. > > > and to do that safely we need: > > 1. shut down display > > 2. perform the GPU reset > > 3. restore the display hardware to its orignal state > > Mhm, I've recently talked with Michel about it and we confirmed that this is perfectly possible without issues. Adding Michel as well. > > > We just do that with essentially with a normal atomic commit. > > I think that is the source of the problem. > > I'm not an expert on that topic but amdgpu and tons of other drivers seem to just use drm_atomic_helper_shutdown() for that. drm_atomic_helper_shutdown() is definitely not the thing to use for this as it would clobber the stored kms state, leaving everything permanently disabled. The drm_atomic_helper_commit_duplicated_state() stuff i915 uses is the correct thing here. But for this problem it doesn't even matter which gets used. Either would get equally stuck behind a previous atomic commit waiting for its fences. > > What is i915 doing differently? I see zero code for any display reset stuff in any other driver. If amdgpu does anything it must be something completely custom, hidden somewhere deep. > > > But a > > previous atomic commit may already be waiting for a fence, which > > won't signal until the GPU reset happens, and the GPU reset is now > > waiting for that previous atomic commit to finish so that it can do > > its own atomic commit. In order to break the deadlock we need to > > abort the fence waits in the atomic commit, and that's what this > > "reset fence" achieves. > > As far as I can see that approach looks strongly like a no-go. > > You essentially have a lock inversion here and it is documented that it should *never* be resolved by a timeout, the approach you take now is not much better. Unless someone wants to add some kind of extra abort mechanism to dma_fence_wait*() then I think this is probably the best solution. And given this is only a thing for one driver on old hardware, adding extra stuff to dma_fence_wait*() doesn't really seem worth the hassle. -- Ville Syrjälä Intel