From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ben Widawsky Subject: Re: [PATCH 2/5] drm/i915: don't bail out of intel_wait_ring_buffer too early Date: Sun, 30 Oct 2011 18:29:13 -0700 Message-ID: <20111030182913.511d8d53@bwidawsk.net> References: <1320001932-1846-1-git-send-email-daniel.vetter@ffwll.ch> <1320001932-1846-2-git-send-email-daniel.vetter@ffwll.ch> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from cloud01.chad-versace.us (184-106-247-128.static.cloud-ips.com [184.106.247.128]) by gabe.freedesktop.org (Postfix) with ESMTP id 2B4579E8BC for ; Sun, 30 Oct 2011 18:29:23 -0700 (PDT) In-Reply-To: <1320001932-1846-2-git-send-email-daniel.vetter@ffwll.ch> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org To: Daniel Vetter Cc: intel-gfx List-Id: intel-gfx@lists.freedesktop.org On Sun, 30 Oct 2011 20:12:09 +0100 Daniel Vetter wrote: > In the pre-gem days with non-existing hangcheck and gpu reset code, > this timeout of 3 seconds was pretty important to avoid stuck > processes. > > But now we have the hangcheck code in gem that goes to great length > to ensure that the gpu is really dead before declaring it wedged. > > So there's no need for this timeout anymore. Actually it's even harmful > because we can bail out too early (e.g. with xscreensaver slip) > when running giant batchbuffers. And our code isn't robust enough > to properly unroll any state-changes, we pretty much rely on the gpu > reset code cleaning up the mess (like cache tracking, fencing state, > active list/request tracking, ...). > > With this change intel_begin_ring can only fail when the gpu is > wedged, and it will return -EAGAIN (like wait_request in case the > gpu reset is still outstanding). > > v2: Chris Wilson noted that on resume timers aren't running and hence > we won't ever get kicked out of this loop by the hangcheck code. Use > an insanely large timeout instead for the HAS_GEM case to prevent > resume bugs from totally hanging the machine. > > Signed-off-by: Daniel Vetter > Reviewed-by: Chris Wilson > --- > drivers/gpu/drm/i915/intel_ringbuffer.c | 11 ++++++++++- > 1 files changed, 10 insertions(+), 1 deletions(-) > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c > index ca70e2f..6e28301 100644 > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c > @@ -1119,7 +1119,16 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) > } > > trace_i915_ring_wait_begin(ring); > - end = jiffies + 3 * HZ; > + if (drm_core_check_feature(dev, DRIVER_GEM)) > + /* With GEM the hangcheck timer should kick us out of the loop, > + * leaving it early runs the risk of corrupting GEM state (due > + * to running on almost untested codepaths). But on resume > + * timers don't work yet, so prevent a complete hang in that > + * case by choosing an insanely large timeout. */ > + end = jiffies + 60 * HZ; > + else > + end = jiffies + 3 * HZ; > + > do { > ring->head = I915_READ_HEAD(ring); > ring->space = ring_space(ring); I didn't really check to see if there is actually an issue here, but instead of 60, do you want to play nice with timeouts such as CONFIG_DEFAULT_HUNG_TASK_TIMEOUT (ie. the min of all the timeouts and 60)? Acked-by: Ben Widawsky