All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michel Thierry <michel.thierry@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/i915: Disable per-engine reset for Broxton
Date: Thu, 06 Jul 2017 16:55:13 -0007	[thread overview]
Message-ID: <1499360533.29930.1@smtp.intel.com> (raw)
In-Reply-To: <149932507620.26368.11755811720157230179@mail.alporthouse.com>

On Thu, Jul 6, 2017 at 12:11 AM, Chris Wilson 
<chris@chris-wilson.co.uk> wrote:
> Quoting Michel Thierry (2017-07-06 02:24:26)
>>  On 04/07/17 09:09, Chris Wilson wrote:
>>  > Triggering a GPU reset for one engine affects another, notably
>>  > corrupting the context status buffer (CSB) effectively losing 
>> track of
>>  > inflight requests.
>>  >
>>  > Adding a few printks:
>>  > diff --git a/drivers/gpu/drm/i915/i915_drv.c 
>> b/drivers/gpu/drm/i915/i915_drv.c
>>  > index ad41836fa5e5..a969456bc0fa 100644
>>  > --- a/drivers/gpu/drm/i915/i915_drv.c
>>  > +++ b/drivers/gpu/drm/i915/i915_drv.c
>>  > @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct 
>> intel_engine_cs *engine)
>>  >                 goto out;
>>  >         }
>>  >
>>  > +       pr_err("Resetting %s\n", engine->name);
>>  >         ret = intel_gpu_reset(engine->i915, 
>> intel_engine_flag(engine));
>>  >         if (ret) {
>>  >                 /* If we fail here, we expect to fallback to a 
>> global reset */
>>  > diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
>> b/drivers/gpu/drm/i915/intel_lrc.c
>>  > index 716e5c9ea222..a72bc35d0870 100644
>>  > --- a/drivers/gpu/drm/i915/intel_lrc.c
>>  > +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>  > @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct 
>> intel_engine_cs *engine)
>>  >                                 
>> execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>>  >                         port_set(&port[n], port_pack(rq, count));
>>  >                         desc = execlists_update_context(rq);
>>  > +                       pr_err("%s: in (rq=%x) ctx=%d\n", 
>> engine->name, rq->global_seqno, upper_32_bits(desc));
>>  >                         GEM_DEBUG_EXEC(port[n].context_id = 
>> upper_32_bits(desc));
>>  >                 } else {
>>  >                         GEM_BUG_ON(!n);
>>  > @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned 
>> long data)
>>  >                         if (!(status & 
>> GEN8_CTX_STATUS_COMPLETED_MASK))
>>  >                                 continue;
>>  >
>>  > +                       pr_err("%s: out CSB (%x head=%d, 
>> tail=%d), ctx=%d, rq=%d\n",
>>  > +                                       engine->name,
>>  > +                                       readl(csb_mmio),
>>  > +                                       head, tail,
>>  > +                                       readl(buf+2*head+1),
>>  > +                                       port->context_id);
>>  > +
>>  >                         /* Check the context/desc id for this 
>> event matches */
>>  > -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 
>> 1) !=
>>  > -                                        port->context_id);
>>  > +                       if (readl(buf + 2 * head + 1) != 
>> port->context_id) {
>>  > +                               pr_err("%s: BUG CSB (%x head=%d, 
>> tail=%d), ctx=%d, rq=%d\n",
>>  > +                                               engine->name,
>>  > +                                               readl(csb_mmio),
>>  > +                                               head, tail,
>>  > +                                               
>> readl(buf+2*head+1),
>>  > +                                               port->context_id);
>>  > +                               BUG();
>>  > +                       }
>>  >
>>  >                         rq = port_unpack(port, &count);
>>  >                         GEM_BUG_ON(count == 0);
>>  >
>>  > Results in:
>>  >
>>  > [ 6423.006602] Resetting rcs0
>>  > [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
>>  > [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
>>  > [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
>>  > [ 6423.009619] Resetting bcs0
>>  > [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
>>  >
>>  
>>  It took me a while, but  I was able to replicate this (with your 
>>  igt_reset_active_engines) a couple of times. I also captured the 
>> value 
>>  of the CSB events at that point and it looked like this.
> 
> Ah, that's a separate issue that definitely isn't limited to bxt. In 
> the
> bug on my machine the CSB is distinctly zeroed.

Oh, if you saw them being zeroed, then there's not much I can argue (do 
you remember if the CSB write pointer was reset to '7' too?).

Hopefully is only limited to bxt, so sadly 

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

> 
>  
>>  [   55.134393] Resetting rcs0
>>  [   55.134747] bcs0: BUG CSB (3 head=1, tail=2), ctx=10, rq=4
>>  [   55.134755]  bcs0: HWSP[16-17] Execlist CSB[0]:   0x00000018 _ 
>> 0x0000000a
>>  [   55.134759]  bcs0: HWSP[18-19] Execlist CSB[1]:   0x00000012 _ 
>> 0x0000000a
>>  [   55.134762]  bcs0: HWSP[20-21] Execlist CSB[2]:   0x00008002 _ 
>> 0x00000004
>>  [   55.134765]  bcs0: HWSP[22-23] Execlist CSB[3]:   0x00000014 _ 
>> 0x00000004
>>  [   55.134767]  bcs0: HWSP[24-25] Execlist CSB[4]:   0x00000018 _ 
>> 0x0000000a
>>  [   55.134770]  bcs0: HWSP[26-27] Execlist CSB[5]: 0x00000001 _ 
>> 0x00000000
>>  
>>  The problem is ctx 10 finished in CSB[0] (ctx_complete & 
>>  active_to_idle), but then somehow CSB[1] has the same ctx 10 with 
>>  'preempted' & ctx_complete.
>>  
>>  To make things worse, in CSB[2], the hw claims to have 
>> lite-restored ctx 4.
>>  
>>  So it seems, we could ignore events like CSB[1], i.e. preempted && 
>>  ctx_complete && !lite_restore.
> 
> I think this bug is fixed in the series I sent last week, this is just
> the race in reset vs the tasklet.

I'll try with those patches.

Thanks,

> 
> -Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2017-07-06 17:02 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-04 16:09 [PATCH] drm/i915: Disable per-engine reset for Broxton Chris Wilson
2017-07-06  1:24 ` Michel Thierry
2017-07-06  7:11   ` Chris Wilson
2017-07-06 17:02     ` Michel Thierry [this message]
2017-07-06 19:55       ` Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1499360533.29930.1@smtp.intel.com \
    --to=michel.thierry@intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.