Intel-XE Archive on lore.kernel.org
From: Matthew Auld <matthew.auld@intel.com>
To: Nirmoy Das <nirmoy.das@intel.com>, intel-xe@lists.freedesktop.org
Cc: Badal Nilawar <badal.nilawar@intel.com>,
	Matthew Brost <matthew.brost@intel.com>,
	John Harrison <John.C.Harrison@Intel.com>,
	Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>,
	Lucas De Marchi <lucas.demarchi@intel.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v4 3/3] drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
Date: Tue, 29 Oct 2024 10:59:08 +0000	[thread overview]
Message-ID: <b50daa2b-45b7-434b-88e9-d5f40bcc6542@intel.com> (raw)
In-Reply-To: <20241029095416.3919218-3-nirmoy.das@intel.com>

On 29/10/2024 09:54, Nirmoy Das wrote:
> Flush the g2h worker explicitly if a TLB timeout happens, which is
> observed on LNL and points to the recent scheduling issue with
> E-cores on LNL.
> 
> This is similar to the recent fix:
> commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
> response timeout") and should be removed once there is an E-core
> scheduling fix.
> 
> v2: Add platform check (Himal)
> v3: Remove gfx platform check as the issue is related to the cpu
>      platform (John)
>      Use the common WA macro (John) and print when the flush
>      resolves the timeout (Matt B)
> 
> Cc: Badal Nilawar <badal.nilawar@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: <stable@vger.kernel.org> # v6.11+
> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> index 773de1f08db9..0bdb3ba5220a 100644
> --- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> +++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
> @@ -81,6 +81,15 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
>   		if (msecs_to_jiffies(since_inval_ms) < tlb_timeout_jiffies(gt))
>   			break;
>   
> +		LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);

I think here we are holding the pending lock, and the g2h worker also
wants to grab that same lock, so this smells like a potential deadlock.
Also, flush_work() can sleep, so I don't think it is allowed under a
spinlock.
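
To make the concern concrete, a rough sketch of the suspected pattern (structure and names approximated from the xe code, not a verified trace):

```c
/* Timeout worker (sketch): iterates pending fences under a spinlock. */
spin_lock_irq(&gt->tlb_invalidation.pending_lock);
list_for_each_entry_safe(fence, next,
			 &gt->tlb_invalidation.pending_fences, link) {
	...
	/* flush_work() blocks until the g2h worker finishes, and it can
	 * sleep -- not allowed while holding a spinlock. Worse, the g2h
	 * worker itself takes pending_lock to signal and unlink fences,
	 * so waiting for it here while holding that lock can deadlock. */
	LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);
	...
}
spin_unlock_irq(&gt->tlb_invalidation.pending_lock);
```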

> +		since_inval_ms = ktime_ms_delta(ktime_get(),
> +						fence->invalidation_time);

I think invalidation_time is rather the time when we sent off the
invalidation req, and we already check that above, so if we get here
then we know the timeout has expired for this fence; checking again
after the flush doesn't really help AFAICT.

I think we can just move the flush to before the loop and outside the
lock; then, if the fence(s) get signalled, they will be removed from the
list and also won't be considered for timeout?
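
Roughly something like this (untested sketch of the suggestion, details approximated):

```c
static void xe_gt_tlb_fence_timeout(struct work_struct *work)
{
	...
	/* Flush once, before taking the lock: any fence the g2h worker
	 * signals during the flush is unlinked from pending_fences and
	 * so never reaches the timeout check below. */
	LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);

	spin_lock_irq(&gt->tlb_invalidation.pending_lock);
	list_for_each_entry_safe(fence, next,
				 &gt->tlb_invalidation.pending_fences, link) {
		/* only genuine timeouts remain on the list here */
		...
	}
	spin_unlock_irq(&gt->tlb_invalidation.pending_lock);
}
```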

> +		if (msecs_to_jiffies(since_inval_ms) < tlb_timeout_jiffies(gt)) {
> +			xe_gt_dbg(gt, "LNL_FLUSH_WORK resolved TLB invalidation fence timeout, seqno=%d recv=%d",
> +				  fence->seqno, gt->tlb_invalidation.seqno_recv);
> +			break;
> +		}
> +
>   		trace_xe_gt_tlb_invalidation_fence_timeout(xe, fence);
>   		xe_gt_err(gt, "TLB invalidation fence timeout, seqno=%d recv=%d",
>   			  fence->seqno, gt->tlb_invalidation.seqno_recv);
