Re: [PATCH 1/1] drm/xe: Don't short circuit TDR on jobs not started

Intel-XE Archive on lore.kernel.org

From: "Zanoni, Paulo R" <paulo.r.zanoni@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Justen, Jordan L" <jordan.l.justen@intel.com>,
	"Briano, Ivan" <ivan.briano@intel.com>
Subject: Re: [PATCH 1/1] drm/xe: Don't short circuit TDR on jobs not started
Date: Fri, 25 Oct 2024 19:32:33 +0000
Message-ID: <e25b8e02ceba708c71ff783c547744b7dc15cbae.camel@intel.com>
In-Reply-To: <Zxk1XqHRi90GG89g@DUT025-TGLU.fm.intel.com>

On Wed, 2024-10-23 at 17:41 +0000, Matthew Brost wrote:
> On Wed, Oct 23, 2024 at 10:47:05AM -0600, Zanoni, Paulo R wrote:
> > On Tue, 2024-10-22 at 16:27 -0700, Matthew Brost wrote:
> > > Short circuiting TDR on jobs not started is an optimization which is not
> > > required. On LNL we are facing an issue where jobs do not get scheduled
> > > by the GuC for an unknown reason. Removing this optimization allows jobs
> > > to get scheduled after TDR fire once which is a big improvement. Remove
> > > this optimization for now while root causing job scheduling issue on
> > > LNL.
> > 
> > I just tested it and it seems to do what it promises. Thanks! Having a
> > 5 second hiccup is still horribly bad, but it is - checks math notes -
> > infinitely better than waiting forever for a syncobj that will never be
> > signaled.
> > 
> > This patch will *tremendously* help Mesa CI, since we can reproduce
> > this bug all the time with Vulkan CTS tests.
> > 
> > Suggestions:
> > 
> > - Can we get a message on dmesg every time this hiccup happens? We're
> > not sure if it's happening on real workloads on people's machines, so
> > maybe having some sort of indication "oops, we just unstuck the batch
> > you submitted 300 frames ago!" would help.
> > 
> 
> We will add 'notice' level message if this occurs.

I may be wrong, but from what I understand, 'notice' level is something
that will *not* show up in people's dmesg if they are using their
distro's default config. This message signals that a bug is happening,
so we need to make sure it appears in dmesg by default. The whole point
is to be able to figure out if this is happening in the wild. Can we
promote this to KERN_WARNING?
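
To make the level ordering concrete, here is a minimal user-space sketch
(not driver code). The numeric values follow the kernel's KERN_*/LOGLEVEL_*
convention, where a lower number is more severe. The helper name
reaches_console() and the thresholds used with it are hypothetical; the
sketch models only a console-style threshold filter and makes no claim
about what any particular distro configures.

```c
#include <stdbool.h>

/* Values follow the kernel's log level numbering: lower is more severe. */
enum loglevel {
	LOGLEVEL_EMERG   = 0,
	LOGLEVEL_ALERT   = 1,
	LOGLEVEL_CRIT    = 2,
	LOGLEVEL_ERR     = 3,
	LOGLEVEL_WARNING = 4,
	LOGLEVEL_NOTICE  = 5,
	LOGLEVEL_INFO    = 6,
	LOGLEVEL_DEBUG   = 7,
};

/*
 * Hypothetical helper: a message passes the filter only if its level is
 * numerically below the threshold. The threshold is an assumption supplied
 * by the caller, not a value read from a real system.
 */
static bool reaches_console(enum loglevel msg, int threshold)
{
	return (int)msg < threshold;
}
```

With an assumed threshold of 5, reaches_console(LOGLEVEL_WARNING, 5) is
true while reaches_console(LOGLEVEL_NOTICE, 5) is false, which is the kind
of gap between the two levels I am worried about.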

> 
> > - Since we don't know how long until the real fix, can this be tagged
> > for stable? If it turns out this requires special GuC, it would be even
> > more valuable to have this in stable since those tend to take more to
> > propagate to people's machines.
> 
> I don't see any reason why this can't be backported, will include required tags.
> 
> Matt
> 
> > 
> > Thanks a lot!
> > 
> > > 
> > > Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_guc_submit.c | 4 ----
> > >  1 file changed, 4 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index 0b81972ff651..25ab675e9c7d 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -1052,10 +1052,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> > >  		exec_queue_killed_or_banned_or_wedged(q) ||
> > >  		exec_queue_destroyed(q);
> > >  
> > > -	/* Job hasn't started, can't be timed out */
> > > -	if (!skip_timeout_check && !xe_sched_job_started(job))
> > > -		goto rearm;
> > > -
> > >  	/*
> > >  	 * If devcoredump not captured and GuC capture for the job is not ready
> > >  	 * do manual capture first and decide later if we need to use it
> >
