[PATCH 2/4] drm/i915/gt: Wait for CSB entries on Tigerlake

* [PATCH 2/4] drm/i915/gt: Wait for CSB entries on Tigerlake
       [not found] <20200915124150.12045-1-chris@chris-wilson.co.uk>
@ 2020-09-15 12:41 ` Chris Wilson
  2020-09-16  6:33   ` Greg KH
  0 siblings, 1 reply; 4+ messages in thread
From: Chris Wilson @ 2020-09-15 12:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson, Bruce Chang, Mika Kuoppala, stable

On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
Forcibly evict stale csb entries") where, presumably, due to a missing
Global Observation Point synchronisation, the write pointer of the CSB
ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
we see the GPU report more context-switch entries for us to parse, but
those entries have not been written, leading us to process stale events,
and eventually report a hung GPU.

However, this effect appears to be much more severe than we previously
saw on Icelake (though it might be best if we try the same approach
there as well and measure), and Bruce suggested the good idea of resetting
the CSB entry after use so that we can detect when it has been updated by
the GPU. By instrumenting how long that may be, we can set a reliable
upper bound for how long we should wait for:

    513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
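
As a rough userspace illustration of the scheme described above (not the
driver code itself), the reset-after-use idea can be sketched with C11
atomics. The names SLOT_EMPTY and consume_slot() are hypothetical, and a
plain retry counter stands in for the timed wait; the 50us limit chosen in
the patch is roughly five times the worst delay measured above (10099 ns).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel marking a slot the producer has not yet written.
 * Hypothetical name; it mirrors the -1 stored by the patch. */
#define SLOT_EMPTY UINT64_MAX

/*
 * Hypothetical helper: read one ring slot, re-reading up to max_retries
 * times while it still holds the sentinel. On success the value is stored
 * in *out and the slot is re-armed with the sentinel so that the next
 * write by the producer can be detected. Returns false if the slot never
 * changed within the retry budget.
 */
static bool consume_slot(_Atomic uint64_t *slot, unsigned int max_retries,
			 uint64_t *out)
{
	uint64_t entry;
	unsigned int tries = 0;

	assert(slot && out);

	entry = atomic_load_explicit(slot, memory_order_acquire);
	while (entry == SLOT_EMPTY && tries++ < max_retries)
		entry = atomic_load_explicit(slot, memory_order_acquire);

	if (entry == SLOT_EMPTY)
		return false; /* producer never wrote the slot in time */

	/* Re-arm the slot after use. */
	atomic_store_explicit(slot, SLOT_EMPTY, memory_order_relaxed);
	*out = entry;
	return true;
}
```

A caller would initialise every slot to SLOT_EMPTY before the producer
starts, which corresponds to the memset() added to reset_csb_pointers()
in the diff below.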

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index d6e0f62337b4..d75712a503b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* HSD#22011248461 */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
+
+	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
+	new_queue =
 		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;
 
 	/*
@@ -4004,6 +4017,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 	WRITE_ONCE(*execlists->csb_write, reset_value);
 	wmb(); /* Make sure this is visible to HW (paranoia?) */
 
+	/* Check that the GPU does indeed update the CSB entries! */
+	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
 	invalidate_csb_entries(&execlists->csb_status[0],
 			       &execlists->csb_status[reset_value]);
 
-- 
2.20.1


* Re: [PATCH 2/4] drm/i915/gt: Wait for CSB entries on Tigerlake
  2020-09-15 12:41 ` [PATCH 2/4] drm/i915/gt: Wait for CSB entries on Tigerlake Chris Wilson
@ 2020-09-16  6:33   ` Greg KH
  2020-09-16  8:26     ` [Intel-gfx] " Chris Wilson
  0 siblings, 1 reply; 4+ messages in thread
From: Greg KH @ 2020-09-16  6:33 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, Bruce Chang, Mika Kuoppala, stable

On Tue, Sep 15, 2020 at 01:41:48PM +0100, Chris Wilson wrote:
> On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
> Forcibly evict stale csb entries") where, presumably, due to a missing
> Global Observation Point synchronisation, the write pointer of the CSB
> ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
> we see the GPU report more context-switch entries for us to parse, but
> those entries have not been written, leading us to process stale events,
> and eventually report a hung GPU.
> 
> However, this effect appears to be much more severe than we previously
> saw on Icelake (though it might be best if we try the same approach
> there as well and measure), and Bruce suggested the good idea of resetting
> the CSB entry after use so that we can detect when it has been updated by
> the GPU. By instrumenting how long that may be, we can set a reliable
> upper bound for how long we should wait for:
> 
>     513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")

What does "References:" mean?  Should that be "Fixes:"?

thanks,

greg k-h

* Re: [Intel-gfx] [PATCH 2/4] drm/i915/gt: Wait for CSB entries on Tigerlake
  2020-09-16  6:33   ` Greg KH
@ 2020-09-16  8:26     ` Chris Wilson
  2020-09-16  8:35       ` Greg KH
  0 siblings, 1 reply; 4+ messages in thread
From: Chris Wilson @ 2020-09-16  8:26 UTC (permalink / raw)
  To: Greg KH; +Cc: intel-gfx, stable

Quoting Greg KH (2020-09-16 07:33:58)
> On Tue, Sep 15, 2020 at 01:41:48PM +0100, Chris Wilson wrote:
> > On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
> > Forcibly evict stale csb entries") where, presumably, due to a missing
> > Global Observation Point synchronisation, the write pointer of the CSB
> > ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
> > we see the GPU report more context-switch entries for us to parse, but
> > those entries have not been written, leading us to process stale events,
> > and eventually report a hung GPU.
> > 
> > However, this effect appears to be much more severe than we previously
> > saw on Icelake (though it might be best if we try the same approach
> > there as well and measure), and Bruce suggested the good idea of resetting
> > the CSB entry after use so that we can detect when it has been updated by
> > the GPU. By instrumenting how long that may be, we can set a reliable
> > upper bound for how long we should wait for:
> > 
> >     513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
> > 
> > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> > References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
> 
> What does "References:" mean?  Should that be "Fixes:"?

It's a reference to an earlier w/a for a previous generation for the
same symptoms. This patch should supplement that w/a.
-Chris

* Re: [Intel-gfx] [PATCH 2/4] drm/i915/gt: Wait for CSB entries on Tigerlake
  2020-09-16  8:26     ` [Intel-gfx] " Chris Wilson
@ 2020-09-16  8:35       ` Greg KH
  0 siblings, 0 replies; 4+ messages in thread
From: Greg KH @ 2020-09-16  8:35 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, stable

On Wed, Sep 16, 2020 at 09:26:58AM +0100, Chris Wilson wrote:
> Quoting Greg KH (2020-09-16 07:33:58)
> > On Tue, Sep 15, 2020 at 01:41:48PM +0100, Chris Wilson wrote:
> > > On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl:
> > > Forcibly evict stale csb entries") where, presumably, due to a missing
> > > Global Observation Point synchronisation, the write pointer of the CSB
> > > ringbuffer is updated _prior_ to the contents of the ringbuffer. That is
> > > we see the GPU report more context-switch entries for us to parse, but
> > > those entries have not been written, leading us to process stale events,
> > > and eventually report a hung GPU.
> > > 
> > > However, this effect appears to be much more severe than we previously
> > > saw on Icelake (though it might be best if we try the same approach
> > > there as well and measure), and Bruce suggested the good idea of resetting
> > > the CSB entry after use so that we can detect when it has been updated by
> > > the GPU. By instrumenting how long that may be, we can set a reliable
> > > upper bound for how long we should wait for:
> > > 
> > >     513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
> > > 
> > > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> > > References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
> > 
> > What does "References:" mean?  Should that be "Fixes:"?
> 
> It's a reference to an earlier w/a for a previous generation for the
> same symptoms. This patch should supplement that w/a.

I see no such "reference" to that tag in
Documentation/process/submitting-patches.rst, so how were we supposed to
know this?  :)

thanks,

greg k-h
