From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>,
<intel-xe@lists.freedesktop.org>, <matthew.brost@intel.com>,
<michal.wajdeczko@intel.com>, <badal.nilawar@intel.com>,
<karthik.poosa@intel.com>, <dev@lankhorst.se>
Subject: Re: [PATCH v1] drm/xe/pm: Handle GT resume failure
Date: Fri, 19 Dec 2025 13:53:20 -0500
Message-ID: <aUWfIIUsp_aisILz@intel.com>
In-Reply-To: <aUWSovxtNUyZ0Yys@black.igk.intel.com>
On Fri, Dec 19, 2025 at 07:00:02PM +0100, Raag Jadav wrote:
> On Fri, Dec 19, 2025 at 11:08:25AM -0500, Rodrigo Vivi wrote:
> > On Fri, Dec 19, 2025 at 06:04:42AM +0100, Raag Jadav wrote:
> > > On Thu, Dec 18, 2025 at 10:46:10AM -0800, Matt Roper wrote:
> > > > On Thu, Dec 18, 2025 at 12:12:59PM +0100, Raag Jadav wrote:
> > > > > On Wed, Dec 17, 2025 at 09:38:34AM -0800, Matt Roper wrote:
> > > > > > On Wed, Dec 17, 2025 at 12:25:32PM -0500, Rodrigo Vivi wrote:
> > > > > > > On Wed, Dec 17, 2025 at 06:49:09PM +0530, Raag Jadav wrote:
> > > > > > > > We've been historically ignoring GT resume failure. Since the function
> > > > > > > > can return error, handle it properly.
> > > > > > >
> > > > > > > I probably had a reason for it, but since I didn't document it and
> > > > > > > cannot remember it, let's go ahead and clean up the flow.
> > > > > > >
> > > > > > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > > > ---
> > > > > > > > drivers/gpu/drm/xe/xe_pm.c | 14 ++++++++++----
> > > > > > > > 1 file changed, 10 insertions(+), 4 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > > > index 4390ba69610d..a8b50091d62e 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > > > @@ -260,8 +260,11 @@ int xe_pm_resume(struct xe_device *xe)
> > > > > > > >
> > > > > > > >  	xe_irq_resume(xe);
> > > > > > > >
> > > > > > > > -	for_each_gt(gt, xe, id)
> > > > > > > > -		xe_gt_resume(gt);
> > > > > > > > +	for_each_gt(gt, xe, id) {
> > > > > > > > +		err = xe_gt_resume(gt);
> > > > > > > > +		if (err)
> > > > > > > > +			goto err;
> > > > > >
> > > > > > When we propagate these errors upward, what's the end result / where
> > > > > > does it eventually get handled? If the device is still [partially]
> > > > > > usable after an error, wouldn't it be better to not bail out of the loop
> > > > > > immediately, but rather at least try to resume the other GTs, the
> > > > > > display, etc. before returning the error at the end to indicate
> > > > > > something failed? Then you might still have a partially functioning
> > > > > > device and have a better chance of at least having your screen turn back
> > > > > > on to show the relevant error messages?
> > > > >
> > > > > I had a similar question when I came across xe_device_probe(), but as
> > > > > Lucas mentioned[1], the expectation here is pretty much "all or
> > > > > nothing". Again, not my call, but I think we should be consistent.
> > > >
> > > > I think device probe is a bit different --- if you can't bring up the
> > > > hardware successfully at the very beginning then something is pretty
> > > > wrong and it's best to just not enable and start using the device at
> > > > all. But the resume paths are different --- the device is already bound
> > > > and in use, and was working properly previously. If we intentionally
> > > > don't even try to power up other parts of the device that might still
> > > > work (display, other GTs, etc.), then we're making the situation worse
> > > > and that could be the difference between the user having a functional UI
> > > > that gives them a chance to save their work and shutdown/recover
> > > > gracefully vs having to just power off the machine because their monitor
> > > > is black and they don't have any idea what's going on. Powering up
> > > > other units like display also makes it more likely that we can get
> > > > useful debugging information out of the machine to figure out what
> > > > actually went wrong.
> > >
> > > Fair, but this also means the existing error handling in the resume path is
> > > redundant and should be removed.
> >
> > Not necessarily. Otherwise resume itself wouldn't have a failure path.
> > I believe what Matt is suggesting is that we need to scrutinize them all
> > and handle each with care, without a one-rule-fits-all approach.
> >
> > If there's a chance of getting display back even without some engines
> > for instance, perhaps we should try it.
> >
> > Imagine that the media GT failed to come back, but you still have everything
> > else. The user will open the lid of their laptop and just get a
> > blank screen. We need to avoid this scenario and work toward a more
> > reliable platform with more granular and contained error handling.
>
> Or perhaps we can reconsider the ordering to make sure we have essential
> parts up and running before we move on to non-essentials?
Indeed, but be careful with chicken-and-egg cases...
>
> I'm not a fan of hiding errors, but it's up to you.
Nobody is in favor of hiding errors. It's just that this patch, as is, might
result in a worse user experience. So someone needs to pick this up,
study the resume paths, and design a flow that gives a more reliable
experience and better error handling.
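
For illustration only, the "try everything, report the failure at the end"
flow Matt described could be sketched roughly as below. This is a standalone
userspace mock, not driver code: mock_gt, mock_dev, mock_gt_resume(),
mock_display_resume() and mock_pm_resume() are hypothetical names, and the
real xe_pm_resume() has more steps and its own unwind labels that would need
to be studied first.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the driver structures (assumption, not xe code). */
struct mock_gt {
	int id;
	int resume_err;	/* error this GT should report on resume, 0 for none */
	int resumed;
};

struct mock_dev {
	struct mock_gt gt[2];
	int ngt;
	int display_on;
};

/* Pretend GT resume: fails with the configured error, otherwise succeeds. */
static int mock_gt_resume(struct mock_gt *gt)
{
	if (gt->resume_err)
		return gt->resume_err;
	gt->resumed = 1;
	return 0;
}

static void mock_display_resume(struct mock_dev *dev)
{
	dev->display_on = 1;
}

/*
 * Keep going after a GT failure: log it, remember the first error,
 * still resume the remaining GTs and the display, then return the error.
 */
static int mock_pm_resume(struct mock_dev *dev)
{
	int first_err = 0;

	for (int i = 0; i < dev->ngt; i++) {
		int err = mock_gt_resume(&dev->gt[i]);

		if (err) {
			fprintf(stderr, "GT%d failed to resume: %d\n",
				dev->gt[i].id, err);
			if (!first_err)
				first_err = err;
		}
	}

	mock_display_resume(dev);

	return first_err;
}
```

With this shape the error is still propagated to the caller, so nothing is
hidden, but a failing GT no longer prevents the other GTs and the display
from being powered up. Whether that is safe for each real resume step is
exactly what would need to be checked case by case.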
>
> Raag
>
> > > > > [1] https://lore.kernel.org/intel-xe/lliho4ci6gi5spxxelttgqntbh7rxr4utg4dgfevlrdy54phrh@2k4mjuofaqye/
> > > > >
> > > > > > > > +	}
> > > > > > > >
> > > > > > > >  	xe_display_pm_resume(xe);
> > > > > > > >
> > > > > > > > @@ -656,8 +659,11 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> > > > > > > >
> > > > > > > >  	xe_irq_resume(xe);
> > > > > > > >
> > > > > > > > -	for_each_gt(gt, xe, id)
> > > > > > > > -		xe->d3cold.allowed ? xe_gt_resume(gt) : xe_gt_runtime_resume(gt);
> > > > > > > > +	for_each_gt(gt, xe, id) {
> > > > > > > > +		err = xe->d3cold.allowed ? xe_gt_resume(gt) : xe_gt_runtime_resume(gt);
> > > > > > > > +		if (err)
> > > > > > > > +			goto out;
> > > > > > > > +	}
> > > > > > > >
> > > > > > > >  	xe_display_pm_runtime_resume(xe);
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.43.0
> > > > > > > >
Thread overview: 12+ messages
2025-12-17 13:19 [PATCH v1] drm/xe/pm: Handle GT resume failure Raag Jadav
2025-12-17 15:00 ` ✓ CI.KUnit: success for " Patchwork
2025-12-17 15:37 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-17 17:25 ` [PATCH v1] " Rodrigo Vivi
2025-12-17 17:38 ` Matt Roper
2025-12-18 11:12 ` Raag Jadav
2025-12-18 18:46 ` Matt Roper
2025-12-19 5:04 ` Raag Jadav
2025-12-19 16:08 ` Rodrigo Vivi
2025-12-19 18:00 ` Raag Jadav
2025-12-19 18:53 ` Rodrigo Vivi [this message]
2025-12-18 12:59 ` ✗ Xe.CI.Full: failure for " Patchwork