From mboxrd@z Thu Jan  1 00:00:00 1970
From: Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH] drm/i915: add interface to simulate gpu
	hangs
Date: Sat, 03 Dec 2011 01:33:59 +0000
Message-ID: <e0d58a$2f04f7@orsmga002.jf.intel.com>
References: <1320942887-6919-1-git-send-email-daniel.vetter@ffwll.ch>
	<1322864509-4130-1-git-send-email-daniel.vetter@ffwll.ch>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org>
Received: from mga02.intel.com (mga02.intel.com [134.134.136.20])
	by gabe.freedesktop.org (Postfix) with ESMTP id 38A519E746
	for <intel-gfx@lists.freedesktop.org>;
	Fri,  2 Dec 2011 17:34:20 -0800 (PST)
In-Reply-To: <1322864509-4130-1-git-send-email-daniel.vetter@ffwll.ch>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
To: intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
List-Id: intel-gfx@lists.freedesktop.org

On Fri,  2 Dec 2011 23:21:49 +0100, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> gpu reset is a very important piece of our infrastructure.
> Unfortunately we only really test it by actually hanging the gpu,
> which often has bad side-effects for the entire system. And the gpu
> hang handling code is one of the rather complicated pieces of code we
> have, consisting of
> - hang detection
> - error capture
> - actual gpu reset
> - reset of all the gem bookkeeping
> - reinitialization of the entire gpu
> 
> This patch adds a debugfs file to selectively stop rings by ceasing to
> update the hw tail pointer, which will result in the gpu no longer
> updating its head pointer and eventually in the hangcheck firing.
> This way we can exercise the gpu hang code under controlled conditions
> without a dying gpu taking down the entire system.
> 
> Patch motivated by me forgetting to properly reinitialize ppgtt after
> a gpu reset.
> 
> Usage:
> 
> echo $((1 << $ringnum)) > i915_ring_stop # stops one ring
> 
> echo 0xffffffff > i915_ring_stop # stops all, future-proof version
> 
> then run whatever testload is desired. i915_ring_stop automatically
> resets after a gpu hang is detected to avoid hanging the gpu too fast
> and declaring it wedged.
> 
> v2: Incorporate feedback from Chris Wilson.
> 
> v3: Add the missing cleanup.

I think I've made my peace with this patch. I'm still not completely
sold on its value, but if Daniel found it useful then it has merit.
> 
> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     |   65 +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_drv.c         |    2 +
>  drivers/gpu/drm/i915/i915_drv.h         |    2 +
>  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ++
>  4 files changed, 73 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index db83552..85328f7 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1397,6 +1397,64 @@ static const struct file_operations i915_wedged_fops = {
>  };
>  
>  static ssize_t
> +i915_ring_stop_read(struct file *filp,
> +		    char __user *ubuf,
> +		    size_t max,
> +		    loff_t *ppos)
> +{
> +	struct drm_device *dev = filp->private_data;
> +	drm_i915_private_t *dev_priv = dev->dev_private;
> +	char buf[80];
> +	int len;
> +
> +	len = snprintf(buf, sizeof(buf),
> +		       "%d\n", dev_priv->stop_rings);
%08x since it is a flags value, though 8 may be overkill! 
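
To illustrate the point about flags values, here is a minimal userspace
sketch, not the driver code. format_stop_rings() is a hypothetical helper
that stands in for the snprintf() call in the read handler; it assumes the
mask is held in an unsigned int and clamps the length the way the quoted
code intends.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for the read handler's formatting step.
 * Renders the ring-stop mask as zero-padded hex so each set bit is easy to
 * pick out. Returns the number of characters stored in buf, excluding the
 * terminating NUL.
 */
static int format_stop_rings(char *buf, size_t size, unsigned int stop_rings)
{
	int len;

	if (size == 0)
		return 0;

	len = snprintf(buf, size, "%08x\n", stop_rings);
	if (len < 0)
		return 0;

	/* snprintf reports the untruncated length, so clamp to what fits. */
	if ((size_t)len >= size)
		len = (int)(size - 1);

	return len;
}
```

With rings 0 and 2 stopped the mask is 5, which this renders as 00000005.
For the "stops all" value from the usage notes, 0xffffffff, the hex form is
ffffffff, whereas a signed %d rendering would typically show -1 on a
32-bit int.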

> +
> +	if (len > sizeof(buf))
> +		len = sizeof(buf);
> +
> +	return simple_read_from_buffer(ubuf, max, ppos, buf, len);
> +}
> +
> +static ssize_t
> +i915_ring_stop_write(struct file *filp,
> +		     const char __user *ubuf,
> +		     size_t cnt,
> +		     loff_t *ppos)
> +{
> +	struct drm_device *dev = filp->private_data;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	char buf[20];
> +	int val = 0;
> +
> +	if (cnt > 0) {
> +		if (cnt > sizeof(buf) - 1)
> +			return -EINVAL;
> +
> +		if (copy_from_user(buf, ubuf, cnt))
> +			return -EFAULT;
> +		buf[cnt] = 0;
> +
> +		val = simple_strtoul(buf, NULL, 0);
> +	}
> +
> +	DRM_DEBUG_DRIVER("Stopping rings %u\n", val);
%x here as well
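
A rough userspace sketch of the write side with the hex debug message,
assuming an unsigned mask. parse_ring_mask() and describe_ring_mask() are
hypothetical names, strtoul() stands in for the kernel's simple_strtoul(),
and the 0x prefix and zero padding are one possible rendering rather than
what the patch must use.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text written to the debugfs file.
 * Base 0 lets strtoul() accept decimal ("4"), hex ("0xffffffff") and
 * octal input; parsing stops at the trailing newline that echo appends.
 */
static unsigned int parse_ring_mask(const char *text)
{
	return (unsigned int)strtoul(text, NULL, 0);
}

/*
 * Hypothetical helper: build the debug message with the mask in hex.
 * Returns snprintf()'s result, i.e. the untruncated message length.
 */
static int describe_ring_mask(char *buf, size_t size, unsigned int mask)
{
	return snprintf(buf, size, "Stopping rings 0x%08x\n", mask);
}
```

Feeding the usage strings from the commit message through these helpers,
"4" is reported as 0x00000004 and "0xffffffff" as 0xffffffff, so the log
line matches what was written regardless of the input base.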

-- 
Chris Wilson, Intel Open Source Technology Centre