Re: [PATCH v1] drm/xe/pm: Handle GT resume failure

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Raag Jadav <raag.jadav@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>,
	<intel-xe@lists.freedesktop.org>,  <matthew.brost@intel.com>,
	<michal.wajdeczko@intel.com>, <badal.nilawar@intel.com>,
	<karthik.poosa@intel.com>, <dev@lankhorst.se>
Subject: Re: [PATCH v1] drm/xe/pm: Handle GT resume failure
Date: Fri, 19 Dec 2025 11:08:25 -0500
Message-ID: <aUV4eX7XXpApdnCN@intel.com>
In-Reply-To: <aUTc6kypmsbHNWC1@black.igk.intel.com>

On Fri, Dec 19, 2025 at 06:04:42AM +0100, Raag Jadav wrote:
> On Thu, Dec 18, 2025 at 10:46:10AM -0800, Matt Roper wrote:
> > On Thu, Dec 18, 2025 at 12:12:59PM +0100, Raag Jadav wrote:
> > > On Wed, Dec 17, 2025 at 09:38:34AM -0800, Matt Roper wrote:
> > > > On Wed, Dec 17, 2025 at 12:25:32PM -0500, Rodrigo Vivi wrote:
> > > > > On Wed, Dec 17, 2025 at 06:49:09PM +0530, Raag Jadav wrote:
> > > > > > We've been historically ignoring GT resume failure. Since the function
> > > > > > can return error, handle it properly.
> > > > > 
> > > > > I probably had a reason for it, but since I didn't document and
> > > > > cannot remember it, let's go forward and make the clean flow.
> > > > > 
> > > > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/xe_pm.c | 14 ++++++++++----
> > > > > >  1 file changed, 10 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > index 4390ba69610d..a8b50091d62e 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_pm.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pm.c
> > > > > > @@ -260,8 +260,11 @@ int xe_pm_resume(struct xe_device *xe)
> > > > > >  
> > > > > >  	xe_irq_resume(xe);
> > > > > >  
> > > > > > -	for_each_gt(gt, xe, id)
> > > > > > -		xe_gt_resume(gt);
> > > > > > +	for_each_gt(gt, xe, id) {
> > > > > > +		err = xe_gt_resume(gt);
> > > > > > +		if (err)
> > > > > > +			goto err;
> > > > 
> > > > When we propagate these errors upward, what's the end result / where
> > > > does it eventually get handled?  If the device is still [partially]
> > > > usable after an error, wouldn't it be better to not bail out of the loop
> > > > immediately, but rather at least try to resume the other GTs, the
> > > > display, etc. before returning the error at the end to indicate
> > > > something failed?  Then you might still have a partially functioning
> > > > device and have a better chance of at least having your screen turn back
> > > > on to show the relevant error messages?
> > > 
> > > I had a similar question when I came across xe_device_probe(), but as
> > > Lucas mentioned[1] that the expectation here is pretty much "all or
> > > nothing". Again, not my call but I think we should be consistent.
> > 
> > I think device probe is a bit different --- if you can't bring up the
> > hardware successfully at the very beginning then something is pretty
> > wrong and it's best to just not enable and start using the device at
> > all.  But the resume paths are different --- the device is already bound
> > and in use, and was working properly previously.  If we intentionally
> > don't even try to power up other parts of the device that might still
> > work (display, other GTs, etc.), then we're making the situation worse
> > and that could be the difference between the user having a functional UI
> > that gives them a chance to save their work and shutdown/recover
> > gracefully vs having to just power off the machine because their monitor
> > is black and they don't have any idea what's going on.  Powering up
> > other units like display also makes it more likely that we can get
> > useful debugging information out of the machine to figure out what
> > actually went wrong.
> 
> Fair, but this also means the existing error handling in the resume path
> is redundant and should be removed.

Not necessarily. Otherwise resume itself wouldn't have a failure path.
I believe what Matt is suggesting is that we need to scrutinize them all
and handle each one with care, without a one-rule-fits-all approach.

If there's a chance of getting the display back even without some
engines, for instance, perhaps we should try it.

Imagine that the media GT failed to come back, but you still have
everything else. The user opens the lid on their laptop and just gets a
blank screen. We need to avoid this scenario and work toward a more
reliable platform with more granular and contained error handling.
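
To make the "keep going and report at the end" idea concrete, here is a
minimal standalone sketch. It is not the xe code: every mock_* name and
struct is a hypothetical stand-in, and it assumes that resuming the
remaining GTs and the display after one GT fails is actually safe, which
is exactly the part that needs case-by-case scrutiny.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver types; not the real xe structures. */
#define MOCK_MAX_GT 2

struct mock_gt {
	int resume_err;		/* error this mock GT reports on resume, 0 on success */
	bool resumed;
};

struct mock_device {
	struct mock_gt gt[MOCK_MAX_GT];
	bool display_resumed;
};

/* Mock GT resume: fails with the configured error, otherwise marks the GT up. */
static int mock_gt_resume(struct mock_gt *gt)
{
	if (gt->resume_err)
		return gt->resume_err;

	gt->resumed = true;
	return 0;
}

/* Mock display resume: always succeeds in this sketch. */
static void mock_display_resume(struct mock_device *dev)
{
	dev->display_resumed = true;
}

/*
 * Try every GT and the display, remember the first failure and report it
 * only after all units have had their chance to resume.
 */
static int mock_pm_resume(struct mock_device *dev)
{
	int err = 0, ret;
	size_t id;

	assert(dev);

	for (id = 0; id < MOCK_MAX_GT; id++) {
		ret = mock_gt_resume(&dev->gt[id]);
		if (ret && !err)
			err = ret;	/* keep the first error, do not bail out */
	}

	/* Display is attempted even if a GT failed to come back. */
	mock_display_resume(dev);

	return err;
}
```

With a device whose second GT fails with -EIO, mock_pm_resume() returns
-EIO, but the first GT and the display are still marked as resumed. The
caller still sees the failure; it just arrives after the best-effort
pass instead of cutting it short.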

> 
> Raag
> 
> > > [1] https://lore.kernel.org/intel-xe/lliho4ci6gi5spxxelttgqntbh7rxr4utg4dgfevlrdy54phrh@2k4mjuofaqye/
> > > 
> > > > > > +	}
> > > > > >  
> > > > > >  	xe_display_pm_resume(xe);
> > > > > >  
> > > > > > @@ -656,8 +659,11 @@ int xe_pm_runtime_resume(struct xe_device *xe)
> > > > > >  
> > > > > >  	xe_irq_resume(xe);
> > > > > >  
> > > > > > -	for_each_gt(gt, xe, id)
> > > > > > -		xe->d3cold.allowed ? xe_gt_resume(gt) : xe_gt_runtime_resume(gt);
> > > > > > +	for_each_gt(gt, xe, id) {
> > > > > > +		err = xe->d3cold.allowed ? xe_gt_resume(gt) : xe_gt_runtime_resume(gt);
> > > > > > +		if (err)
> > > > > > +			goto out;
> > > > > > +	}
> > > > > >  
> > > > > >  	xe_display_pm_runtime_resume(xe);
> > > > > >  
> > > > > > -- 
> > > > > > 2.43.0
> > > > > >
