Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: "Nilawar, Badal" <badal.nilawar@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <anshuman.gupta@intel.com>,
	<john.c.harrison@intel.com>, <vinay.belgaumkar@intel.com>,
	<vivi.rodrigo@intel.com>
Subject: Re: [PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out
Date: Fri, 4 Oct 2024 06:50:15 +0000	[thread overview]
Message-ID: <Zv+QJ4HVR0pJbNsX@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <ab52e413-3dc4-494c-b50c-5456c81502eb@intel.com>

On Thu, Oct 03, 2024 at 03:24:12PM +0530, Nilawar, Badal wrote:
> 
> 
> On 02-10-2024 19:34, Matthew Brost wrote:
> > On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
> > > 
> > > 
> > > On 28-09-2024 02:57, Matthew Brost wrote:
> > > > On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
> > > > > It is observed that for a GuC CT request, the G2H IRQ triggered and the
> > > > > g2h_worker was queued, but it did not get an opportunity to execute before
> > > > > the timeout occurred. To address this, the g2h_worker is flushed.
> > > > > 
> > > > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > > > Signed-off-by: Badal Nilawar <badal.nilawar@intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
> > > > >    1 file changed, 11 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > > index 4b95f75b1546..4a5d7f85d1a0 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > > @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
> > > > >    	}
> > > > >    	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> > > > > +
> > > > > +	/*
> > > > > +	 * It is observed that for above GuC CT request G2H IRQ triggered
> > > > 
> > > > Where is this observed? 1 second is a long time to wait for a worker...
> > > 
> > > Please see this log.
> > > 
> > 
> > Logs are good, but explaining the test case is also helpful so I don't
> > have to reverse engineer things. Having platform information would be
> > helpful as well. So what is the test case here, and what platform?
> 
> Sorry, my bad, I should have added the issue id in the commit message:
> https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620.
> 
> This issue is reported on LNL for the xe_gt_freq@freq_reset_multiple test
> and the xe_pm@* tests during the resume flow.
> 
> > > [  176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
> > > GuC PC status query
> > > [  176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]
> > > [  176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> > > running GT[0]
> > > [  176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
> > > GuC PC status query
> > > [  176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]
> > > [  176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> > > running GT[0]
> > > [  176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
> > 
> > This looks like we are doing a GT reset and that is causing problems. This
> > patch is likely papering over an issue with our GT reset flows, so it
> > doesn't seem correct to me. Let's try to figure out what is going wrong in
> > the reset flow.
> 
> This is seen for the SLPC query after "reset done" as well.
> 
> > > [  176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query    //GuC PC
> > > check request
> > > [  176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]  // IRQ
> > > [  177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
> > > fence 1311, action 3003  //Timeout
> > > [  177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
> > > failed: -ETIME
> > > [  177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
> > 
> > This is almost 1 second after 'trying reset', and I'm unsure how that
> > could happen looking at the upstream source code. 'xe_uc_reset_prepare'
> > is called between 'trying reset' and 'reset queued', but that doesn't
> > wait anywhere; rather, it resolves to the below function:
> > 
> > int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> > {
> >         int ret;
> > 
> >         /*
> >          * Using an atomic here rather than submission_state.lock as this
> >          * function can be called while holding the CT lock (engine reset
> >          * failure). submission_state.lock needs the CT lock to resubmit jobs.
> >          * Atomic is not ideal, but it works to prevent against concurrent reset
> >          * and releasing any TDRs waiting on guc->submission_state.stopped.
> >          */
> >         ret = atomic_fetch_or(1, &guc->submission_state.stopped);
> >         smp_wmb();
> >         wake_up_all(&guc->ct.wq);
> > 
> >         return ret;
> > }
> 
> And the CT is not disabled yet, so the SLPC query will go through.
> 

I agree the CT should not be disabled at this point.

> > 
> > Is this log from an internal repo or something? This looks like some
> > sort of circular dependency where a GT reset starts and the G2H handler
> > doesn't get queued because the CT channel is disabled, the G2H times
> > out, and the reset stalls waiting for the timeout.
> 
> This log is captured on LNL, with debug prints added, by running
> xe_gt_freq@freq_reset_multiple.
> 
> If the CT channel were disabled, then we would not see "G2H fence (1311)
> not found!".
> 
> During the xe PM resume flow, this is seen in guc_pc_start->pc_init_freqs().
>

Ok, I was spitballing ideas - if this is upstream then the CT should be
alive, but somehow it appears the worker that processes the CT is getting
stalled.

Also, the time gap between 'trying reset' and 'reset queued' is very
suspect.

This patch doesn't look like the solution. Can you look into the hints
I've given here?

Matt
 
> 
> > 
> > > [  177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
> > > GuC PC status query
> > > [  177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]
> > > [  178.632672] xe 0000:00:02.0: [drm] GT0: reset started
> > > [  178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> > > running GT[0] // Worker ran
> > > [  178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
> > > 
> > > > 
> > > > > +	 * and g2h_worker queued, but it didn't get opportunity to execute
> > > > > +	 * and timeout occurred. To address the g2h_worker is being flushed.
> > > > > +	 */
> > > > > +	if (!ret) {
> > > > > +		flush_work(&ct->g2h_worker);
> > > > > +		ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> > > > 
> > > > If this is needed, I wouldn't wait 1 second; if the flush worked,
> > > > 'g2h_fence.done' should be signaled immediately. Maybe wait 1 ms?
> > > 
> > > In the config, HZ is set to 250, which is 4 ms I think.
> > > 
> > 
> > A timeout of HZ jiffies should always be one second [1].
> > 
> > [1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
> > 
> > > CONFIG_HZ_250=y
> > > # CONFIG_HZ_300 is not set
> > > # CONFIG_HZ_1000 is not set
> > > CONFIG_HZ=250
> > > 
> > 
> > I'm a little confused about how this Kconfig works [2], but I don't think
> > it actually changes the duration that HZ jiffies represents; rather, it
> > changes how many jiffies are in one second.
> > 
> > [2] https://lwn.net/Articles/56378/
> 
> Oh ok, thanks for the clarification.
> 
> Regards,
> Badal
> 
> > 
> > Matt
> > 
> > > Regards,
> > > Badal
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > > +	}
> > > > > +
> > > > >    	if (!ret) {
> > > > >    		xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
> > > > >    			  g2h_fence.seqno, action[0]);
> > > > > -- 
> > > > > 2.34.1
> > > > > 
> > > 
> 

