Re: [PATCH v1] drm/xe/pm: Handle GT resume failure

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>,
	<intel-xe@lists.freedesktop.org>,  <matthew.brost@intel.com>,
	<michal.wajdeczko@intel.com>, <badal.nilawar@intel.com>,
	<karthik.poosa@intel.com>, <dev@lankhorst.se>
Subject: Re: [PATCH v1] drm/xe/pm: Handle GT resume failure
Date: Fri, 19 Dec 2025 13:53:20 -0500
Message-ID: <aUWfIIUsp_aisILz@intel.com>
In-Reply-To: <aUWSovxtNUyZ0Yys@black.igk.intel.com>

On Fri, Dec 19, 2025 at 07:00:02PM +0100, Raag Jadav wrote:
> On Fri, Dec 19, 2025 at 11:08:25AM -0500, Rodrigo Vivi wrote:
> > On Fri, Dec 19, 2025 at 06:04:42AM +0100, Raag Jadav wrote:
> > > On Thu, Dec 18, 2025 at 10:46:10AM -0800, Matt Roper wrote:
> > > > On Thu, Dec 18, 2025 at 12:12:59PM +0100, Raag Jadav wrote:
> > > > > On Wed, Dec 17, 2025 at 09:38:34AM -0800, Matt Roper wrote:
> > > > > > On Wed, Dec 17, 2025 at 12:25:32PM -0500, Rodrigo Vivi wrote:
> > > > > > > On Wed, Dec 17, 2025 at 06:49:09PM +0530, Raag Jadav wrote:
> > > > > > > > We've been historically ignoring GT resume failure. Since the function
> > > > > > > > can return error, handle it properly.
> > > > > > > 
> > > > > > > I probably had a reason for it, but since I didn't document and
> > > > > > > cannot remember it, let's go forward and make the clean flow.
> > > > > > > 
> > > > > > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/gpu/drm/xe/xe_pm.c | 14 ++++++++++----
> > > > > > > >  1 file changed, 10 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > > > index 4390ba69610d..a8b50091d62e 100644
> > > > > > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > > > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > > > @@ -260,8 +260,11 @@ int xe_pm_resume(struct xe_device *xe)
> > > > > > > >  
> > > > > > > >  	xe_irq_resume(xe);
> > > > > > > >  
> > > > > > > > -	for_each_gt(gt, xe, id)
> > > > > > > > -		xe_gt_resume(gt);
> > > > > > > > +	for_each_gt(gt, xe, id) {
> > > > > > > > +		err = xe_gt_resume(gt);
> > > > > > > > +		if (err)
> > > > > > > > +			goto err;
> > > > > > 
> > > > > > When we propagate these errors upward, what's the end result / where
> > > > > > does it eventually get handled?  If the device is still [partially]
> > > > > > usable after an error, wouldn't it be better to not bail out of the loop
> > > > > > immediately, but rather at least try to resume the other GTs, the
> > > > > > display, etc. before returning the error at the end to indicate
> > > > > > something failed?  Then you might still have a partially functioning
> > > > > > device and have a better chance of at least having your screen turn back
> > > > > > on to show the relevant error messages?
> > > > > 
> > > > > > I had a similar question when I came across xe_device_probe(), but as
> > > > > > Lucas mentioned[1], the expectation here is pretty much "all or
> > > > > > nothing". Again, not my call, but I think we should be consistent.
> > > > 
> > > > I think device probe is a bit different --- if you can't bring up the
> > > > hardware successfully at the very beginning then something is pretty
> > > > wrong and it's best to just not enable and start using the device at
> > > > all.  But the resume paths are different --- the device is already bound
> > > > and in use, and was working properly previously.  If we intentionally
> > > > don't even try to power up other parts of the device that might still
> > > > work (display, other GTs, etc.), then we're making the situation worse
> > > > and that could be the difference between the user having a functional UI
> > > > that gives them a chance to save their work and shutdown/recover
> > > > gracefully vs having to just power off the machine because their monitor
> > > > is black and they don't have any idea what's going on.  Powering up
> > > > other units like display also makes it more likely that we can get
> > > > useful debugging information out of the machine to figure out what
> > > > actually went wrong.
> > > 
> > > > Fair, but this also means the existing error handling in the resume path
> > > > is redundant and should be removed.
> > 
> > > Not necessarily. Otherwise resume itself wouldn't have a failure path.
> > > I believe what Matt is suggesting is that we need to scrutinize them all
> > > and handle each with care, without a one-rule-fits-all approach.
> > 
> > If there's a chance of getting display back even without some engines
> > for instance, perhaps we should try it.
> > 
> > > Imagine that the media GT failed to come back, but you still have
> > > everything else. The user opens the lid on their laptop and just gets a
> > > blank screen. We need to avoid this scenario and work toward a more
> > > reliable platform with more granular and contained error handling.
> 
> Or perhaps we can reconsider the ordering to make sure we have essential
> parts up and running before we move on to non-essentials?

Indeed, but be careful with chicken-and-egg cases...

> 
> I'm not a fan of hiding errors, but up to you.

Nobody is in favor of hiding errors. It's just that this patch, as is, might
result in a worse user experience. So someone needs to pick this up, study
the resume paths, and design a flow that gives a more reliable experience
and better error handling.
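
To make the idea concrete, here is a minimal sketch of the "keep going, report at the end" approach Matt described. It is not the xe code: the mock_gt and mock_device types and the mock_* helpers are hypothetical stand-ins, and whether the display can actually be resumed after a GT failure is exactly the part that still needs study.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver's GT and device objects. */
struct mock_gt {
	int id;
	int resume_err;	/* 0 on success, negative errno to simulate a failure */
	int resumed;
};

struct mock_device {
	struct mock_gt gts[2];
	int n_gts;
	int display_resumed;
};

/* Pretend to resume one GT; fails if the mock was configured to fail. */
static int mock_gt_resume(struct mock_gt *gt)
{
	if (gt->resume_err)
		return gt->resume_err;
	gt->resumed = 1;
	return 0;
}

static void mock_display_resume(struct mock_device *dev)
{
	dev->display_resumed = 1;
}

/*
 * Try every GT and the display even if one GT fails, then return the
 * first error seen so the caller still learns that something went wrong.
 */
static int mock_pm_resume(struct mock_device *dev)
{
	int first_err = 0;

	for (int i = 0; i < dev->n_gts; i++) {
		int err = mock_gt_resume(&dev->gts[i]);

		if (err) {
			fprintf(stderr, "GT%d resume failed: %d\n",
				dev->gts[i].id, err);
			if (!first_err)
				first_err = err;
		}
	}

	/* Assumption: display does not depend on the GT that failed. */
	mock_display_resume(dev);

	return first_err;
}
```

With one GT configured to fail, mock_pm_resume() still resumes the other GT and the display and then returns the failing GT's error code, so the error is not hidden, only deferred.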

> 
> Raag
> 
> > > > > [1] https://lore.kernel.org/intel-xe/lliho4ci6gi5spxxelttgqntbh7rxr4utg4dgfevlrdy54phrh@2k4mjuofaqye/
> > > > > 
> > > > > > > > +	}
> > > > > > > >  
> > > > > > > >  	xe_display_pm_resume(xe);
> > > > > > > >  
> > > > > > > > @@ -656,8 +659,11 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> > > > > > > >  
> > > > > > > >  	xe_irq_resume(xe);
> > > > > > > >  
> > > > > > > > -	for_each_gt(gt, xe, id)
> > > > > > > > -		xe->d3cold.allowed ? xe_gt_resume(gt) : xe_gt_runtime_resume(gt);
> > > > > > > > +	for_each_gt(gt, xe, id) {
> > > > > > > > +		err = xe->d3cold.allowed ? xe_gt_resume(gt) : xe_gt_runtime_resume(gt);
> > > > > > > > +		if (err)
> > > > > > > > +			goto out;
> > > > > > > > +	}
> > > > > > > >  
> > > > > > > >  	xe_display_pm_runtime_resume(xe);
> > > > > > > >  
> > > > > > > > -- 
> > > > > > > > 2.43.0
> > > > > > > >

Thread overview: 12+ messages
2025-12-17 13:19 [PATCH v1] drm/xe/pm: Handle GT resume failure Raag Jadav
2025-12-17 15:00 ` ✓ CI.KUnit: success for " Patchwork
2025-12-17 15:37 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-17 17:25 ` [PATCH v1] " Rodrigo Vivi
2025-12-17 17:38   ` Matt Roper
2025-12-18 11:12     ` Raag Jadav
2025-12-18 18:46       ` Matt Roper
2025-12-19  5:04         ` Raag Jadav
2025-12-19 16:08           ` Rodrigo Vivi
2025-12-19 18:00             ` Raag Jadav
2025-12-19 18:53               ` Rodrigo Vivi [this message]
2025-12-18 12:59 ` ✗ Xe.CI.Full: failure for " Patchwork
