From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ben Widawsky <ben@bwidawsk.net>
Subject: Re: [PATCH 2/5] drm/i915: don't bail out of
 intel_wait_ring_buffer too early
Date: Sun, 30 Oct 2011 18:29:13 -0700
Message-ID: <20111030182913.511d8d53@bwidawsk.net>
References: <1320001932-1846-1-git-send-email-daniel.vetter@ffwll.ch>
	<1320001932-1846-2-git-send-email-daniel.vetter@ffwll.ch>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org>
Received: from cloud01.chad-versace.us (184-106-247-128.static.cloud-ips.com
	[184.106.247.128])
	by gabe.freedesktop.org (Postfix) with ESMTP id 2B4579E8BC
	for <intel-gfx@lists.freedesktop.org>;
	Sun, 30 Oct 2011 18:29:23 -0700 (PDT)
In-Reply-To: <1320001932-1846-2-git-send-email-daniel.vetter@ffwll.ch>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
To: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: intel-gfx <intel-gfx@lists.freedesktop.org>
List-Id: intel-gfx@lists.freedesktop.org

On Sun, 30 Oct 2011 20:12:09 +0100
Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> In the pre-gem days with non-existing hangcheck and gpu reset code,
> this timeout of 3 seconds was pretty important to avoid stuck
> processes.
> 
> But now we have the hangcheck code in gem that goes to great length
> to ensure that the gpu is really dead before declaring it wedged.
> 
> So there's no need for this timeout anymore. Actually it's even harmful
> because we can bail out too early (e.g. with xscreensaver slip)
> when running giant batchbuffers. And our code isn't robust enough
> to properly unroll any state-changes, we pretty much rely on the gpu
> reset code cleaning up the mess (like cache tracking, fencing state,
> active list/request tracking, ...).
> 
> With this change intel_begin_ring can only fail when the gpu is
> wedged, and it will return -EAGAIN (like wait_request in case the
> gpu reset is still outstanding).
> 
> v2: Chris Wilson noted that on resume timers aren't running and hence
> we won't ever get kicked out of this loop by the hangcheck code. Use
> an insanely large timeout instead for the HAS_GEM case to prevent
> resume bugs from totally hanging the machine.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/intel_ringbuffer.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index ca70e2f..6e28301 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -1119,7 +1119,16 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
>  	}
>  
>  	trace_i915_ring_wait_begin(ring);
> -	end = jiffies + 3 * HZ;
> +	if (drm_core_check_feature(dev, DRIVER_GEM))
> +		/* With GEM the hangcheck timer should kick us out of the loop,
> +		 * leaving it early runs the risk of corrupting GEM state (due
> +		 * to running on almost untested codepaths). But on resume
> +		 * timers don't work yet, so prevent a complete hang in that
> +		 * case by choosing an insanely large timeout. */
> +		end = jiffies + 60 * HZ;
> +	else
> +		end = jiffies + 3 * HZ;
> +
>  	do {
>  		ring->head = I915_READ_HEAD(ring);
>  		ring->space = ring_space(ring);

I didn't really check to see if there is actually an issue here, but
instead of 60, do you want to play nice with timeouts such as
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT (ie. the min of all the timeouts and
60)?

Acked-by: Ben Widawsky <ben@bwidawsk.net>