All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ben Widawsky <ben@bwidawsk.net>
To: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: intel-gfx <intel-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 3/3] [RFC] drm/i915: add interface to simulate gpu hangs
Date: Sat, 5 Nov 2011 15:07:45 -0700	[thread overview]
Message-ID: <20111105150745.2a1fbb33@bwidawsk.net> (raw)
In-Reply-To: <1320234396-4605-3-git-send-email-daniel.vetter@ffwll.ch>

On Wed,  2 Nov 2011 12:46:36 +0100
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> GPU reset is a very important piece of our infrastructure.
> Unfortunately we only really test it by actually hanging the gpu,
> which often has bad side-effects for the entire system. And the gpu
> hang handling code is one of the rather complicated pieces of code we
> have, consisting of
> - hang detection
> - error capture
> - actual gpu reset
> - reset of all the gem bookkeeping
> - reinitialition of the entire gpu
> 
> This patch adds a debugfs to selectively stopping rings by ceasing to
> update the hw tail pointer. This way we can exercise the gpu hang code
> under controlled conditions without a dying gpu taking down the entire
> systems.
> 
> Patch motivated by me forgetting to properly reinitialize ppgtt after
> a gpu reset.
> 
> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I like the concept here very much. One thing which I thought might be
interesting would to allow hanging rings temporarily instead of
automatically hanging until reset on the write of the debugfs entry. I
think if we could start dramatically altering the time it takes to get
commands into the ring would really test our buffer dependency tracking.
The obvious target here to me is hang the media ring for say 33ms, and
see what happens to the system.

Also, maybe I missed something but could you explain how you use this
differently than i915_wedged?

> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     |   63 +++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_drv.c         |    3 +
>  drivers/gpu/drm/i915/i915_drv.h         |    2 +
>  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ++
>  4 files changed, 72 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 8476441..9821a3b 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1392,6 +1392,64 @@ static const struct file_operations i915_wedged_fops = {
>  };
>  
>  static ssize_t
> +i915_ring_stop_read(struct file *filp,
> +		    char __user *ubuf,
> +		    size_t max,
> +		    loff_t *ppos)
> +{
> +	struct drm_device *dev = filp->private_data;
> +	drm_i915_private_t *dev_priv = dev->dev_private;
> +	char buf[80];
> +	int len;
> +
> +	len = snprintf(buf, sizeof(buf),
> +		       "%d\n", dev_priv->stop_rings);
> +
> +	if (len > sizeof(buf))
> +		len = sizeof(buf);
> +
> +	return simple_read_from_buffer(ubuf, max, ppos, buf, len);
> +}

buf[80] seems a little excessive.

> +
> +static ssize_t
> +i915_ring_stop_write(struct file *filp,
> +		     const char __user *ubuf,
> +		     size_t cnt,
> +		     loff_t *ppos)
> +{
> +	struct drm_device *dev = filp->private_data;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	char buf[20];
> +	int val = 0;
> +
> +	if (cnt > 0) {
> +		if (cnt > sizeof(buf) - 1)
> +			return -EINVAL;
> +
> +		if (copy_from_user(buf, ubuf, cnt))
> +			return -EFAULT;
> +		buf[cnt] = 0;
> +
> +		val = simple_strtoul(buf, NULL, 0);
> +	}
> +
> +	DRM_DEBUG_DRIVER("Stopping rings %u\n", val);
> +
> +	mutex_lock(&dev->struct_mutex);
> +	dev_priv->stop_rings = val;
> +	mutex_unlock(&dev->struct_mutex);
> +
> +	return cnt;
> +}
> +

I think an atomic takes away the need for struct_mutex, unless you plan
to do more.

> +static const struct file_operations i915_ring_stop_fops = {
> +	.owner = THIS_MODULE,
> +	.open = i915_debugfs_common_open,
> +	.read = i915_ring_stop_read,
> +	.write = i915_ring_stop_write,
> +	.llseek = default_llseek,
> +};
> +static ssize_t
>  i915_max_freq_read(struct file *filp,
>  		   char __user *ubuf,
>  		   size_t max,
> @@ -1691,6 +1749,11 @@ int i915_debugfs_init(struct drm_minor *minor)
>  				  &i915_cache_sharing_fops);
>  	if (ret)
>  		return ret;
> +	ret = i915_debugfs_create(minor->debugfs_root, minor,
> +				  "i915_ring_stop",
> +				  &i915_ring_stop_fops);
> +	if (ret)
> +		return ret;
>  
>  	return drm_debugfs_create_files(i915_debugfs_list,
>  					I915_DEBUGFS_ENTRIES,

I think the fact that you're stopping rings is an implementation detail,
and the name doesn't sounds nearly dangerous enough. I'd just call this
i915_hang_gpu, or something like that. Unless of course you plan
something like what I mentioned at the top about being able to restart
the ring.

> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 548e04b..566cc1e 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -611,6 +611,9 @@ int i915_reset(struct drm_device *dev, u8 flags)
>  	if (!mutex_trylock(&dev->struct_mutex))
>  		return -EBUSY;
>  
> +	printk("reenabling rings\n");
> +	dev_priv->stop_rings = 0;
> +
>  	i915_gem_reset(dev);
>  
>  	ret = -ENODEV;

Probably shouldn't be using printk, just to keep things consistent with
everything else. DRM_DEBUG_DRIVER

> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index bd98fb3..503ae8c 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -330,6 +330,8 @@ typedef struct drm_i915_private {
>  	uint32_t last_instdone;
>  	uint32_t last_instdone1;
>  
> +	unsigned int stop_rings;
> +
>  	unsigned long cfb_size;
>  	unsigned int cfb_fb;
>  	enum plane cfb_plane;

bool? Also I think you could consider treating this per ring.

> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 3c30dba..ef7a1ca 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1179,7 +1179,11 @@ int intel_ring_begin(struct intel_ring_buffer *ring,
>  
>  void intel_ring_advance(struct intel_ring_buffer *ring)
>  {
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +
>  	ring->tail &= ring->size - 1;
> +	if (dev_priv->stop_rings & intel_ring_flag(ring))
> +		return;
>  	ring->write_tail(ring, ring->tail);
>  }
>  

Maybe wrap the if with "unlikely?"

  parent reply	other threads:[~2011-11-05 22:08 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-11-02 11:46 [PATCH 1/3] drm/i915: refactor debugfs open function Daniel Vetter
2011-11-02 11:46 ` [PATCH 2/3] drm/i915: refactor debugfs create functions Daniel Vetter
2011-11-02 11:46 ` [PATCH 3/3] [RFC] drm/i915: add interface to simulate gpu hangs Daniel Vetter
2011-11-03 15:56   ` Eugeni Dodonov
2011-11-03 16:10     ` Daniel Vetter
2011-11-04 16:23       ` Eugeni Dodonov
2011-11-04 16:31         ` Chris Wilson
2011-11-04 16:42           ` Eugeni Dodonov
2011-11-05 22:07   ` Ben Widawsky [this message]
2011-11-05 22:30     ` Daniel Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111105150745.2a1fbb33@bwidawsk.net \
    --to=ben@bwidawsk.net \
    --cc=daniel.vetter@ffwll.ch \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.