From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ville Syrjälä
Subject: Re: [PATCH] drm/i915: Decouple GPU error reporting from ring initialisation
Date: Fri, 24 Jan 2014 14:06:12 +0200
Message-ID: <20140124120612.GC9454@intel.com>
References: <1390513783-27726-1-git-send-email-chris@chris-wilson.co.uk> <20140124115025.GB9454@intel.com> <20140124115521.GE25529@nuc-i3427.alporthouse.com>
In-Reply-To: <20140124115521.GE25529@nuc-i3427.alporthouse.com>
To: Chris Wilson, intel-gfx@lists.freedesktop.org, Ben Widawsky
List-Id: intel-gfx@lists.freedesktop.org

On Fri, Jan 24, 2014 at 11:55:21AM +0000, Chris Wilson wrote:
> On Fri, Jan 24, 2014 at 01:50:25PM +0200, Ville Syrjälä wrote:
> > On Thu, Jan 23, 2014 at 09:49:43PM +0000, Chris Wilson wrote:
> > > Currently we report through our error state only the rings that have
> > > been initialised (as detected by ring->obj). This check is done after
> > > the GPU reset and ring re-initialisation, which means that the software
> > > state may not be the same as when we captured the hardware error and we
> > > may not print out any of the vital information for debugging the hang.
> > >
> > > This (and the implied object leak) is a regression from
> > >
> > > commit 3d57e5bd1284f44e325f3a52d966259ed42f9e05
> > > Author: Ben Widawsky
> > > Date:   Mon Oct 14 10:01:36 2013 -0700
> > >
> > >     drm/i915: Do a fuller init after reset
> > >
> > > Signed-off-by: Chris Wilson
> > > Cc: Ben Widawsky
> > > ---
> > >  drivers/gpu/drm/i915/i915_drv.h       |  1 +
> > >  drivers/gpu/drm/i915/i915_gpu_error.c | 19 +++++++++++++------
> > >  2 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > > index c45cbbecd66a..64a1aca7804d 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.h
> > > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > > @@ -334,6 +334,7 @@ struct drm_i915_error_state {
> > >  	struct timeval time;
> > >
> > >  	struct drm_i915_error_ring {
> > > +		int valid;
> >
> > bool
>
> in a struct? I tend to think it leads to laziness not to coalesce them
> into bitfields.

bool valid:1; then ;)

>
> > > -		obj = error->ring[i].ctx;
> > > -		if (obj) {
> > > +		if ((obj = error->ring[i].ctx)) {
> >
> > Unrelated change. Although it does make this more consistent w/ the
> > surrouding code. But I admit to not being a fan of assignments inside
> > if statements.
>
> The inconsistency was uglier.
>
> > >  			err_printf(m, "%s --- HW Context = 0x%08x\n",
> > >  				   dev_priv->ring[i].name,
> > >  				   obj->gtt_offset);
> > > @@ -826,11 +827,17 @@ static void i915_gem_record_rings(struct drm_device *dev,
> > >  				  struct drm_i915_error_state *error)
> > >  {
> > >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > > -	struct intel_ring_buffer *ring;
> > >  	struct drm_i915_gem_request *request;
> > >  	int i, count;
> > >
> > > -	for_each_ring(ring, dev_priv, i) {
> > > +	for (i = 0; i < I915_NUM_RINGS; i++) {
> > > +		struct intel_ring_buffer *ring = &dev_priv->ring[i];
> > > +
> > > +		if (ring->dev == NULL)
> > > +			continue;
> > > +
> > > +		error->ring[i].valid = true;
> > > +
> >
> > The code here runs before the reset, and it would actually oops if
> > ring->obj==NULL, so using for_each_ring() here looks appropriate.
>
> No, we need to record that ring->obj is NULL, especially if the ring
> registers are still set...

OK so we just need to actually fix the scratch.obj==NULL case, and
then I guess it's fine.

-- 
Ville Syrjälä
Intel OTC
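
For reference, a minimal standalone sketch of the idea the thread converges on: mark each ring valid at capture time, so the later print pass does not depend on whether ring->obj survived the reset. Everything below (fake_ring, fake_error_ring, record_rings, count_reported, NUM_RINGS, had_obj) is a hypothetical stand-in for illustration, not the i915 structures or the patch itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_RINGS 3 /* hypothetical; stands in for I915_NUM_RINGS */

/* Hypothetical stand-in for struct intel_ring_buffer. */
struct fake_ring {
	void *dev; /* non-NULL once the ring slot has been set up at all */
	void *obj; /* backing object; may be NULL or change across a reset */
};

/* Hypothetical stand-in for struct drm_i915_error_ring. */
struct fake_error_ring {
	bool valid;   /* set at capture time, as in the patch */
	bool had_obj; /* whether ring->obj was non-NULL at capture time */
};

/* Capture pass: runs before the reset and walks every ring slot. */
static void record_rings(const struct fake_ring *rings,
			 struct fake_error_ring *err)
{
	for (size_t i = 0; i < NUM_RINGS; i++) {
		err[i].valid = false;
		err[i].had_obj = false;

		if (rings[i].dev == NULL)
			continue;

		err[i].valid = true;
		/* Record a NULL obj instead of skipping the ring. */
		err[i].had_obj = rings[i].obj != NULL;
	}
}

/* Print pass: runs after the reset and consults only the snapshot. */
static int count_reported(const struct fake_error_ring *err)
{
	int n = 0;

	for (size_t i = 0; i < NUM_RINGS; i++)
		if (err[i].valid)
			n++;
	return n;
}
```

Because count_reported() reads only the snapshot, clearing or replacing obj in the live ring array after record_rings() has run does not change which rings get reported.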