From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ville =?iso-8859-1?Q?Syrj=E4l=E4?= <ville.syrjala@linux.intel.com>
Subject: Re: [PATCH] drm/i915: Convert hangcheck from a timer
 into a delayed work item
Date: Thu, 4 Sep 2014 19:40:32 +0300
Message-ID: <20140904164032.GN4193@intel.com>
References: <1409832275-14943-1-git-send-email-chris@chris-wilson.co.uk>
 <1409843342-437-1-git-send-email-chris@chris-wilson.co.uk>
 <20140904152503.GM4193@intel.com>
 <20140904153839.GE13230@nuc-i3427.alporthouse.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Return-path: <intel-gfx-bounces@lists.freedesktop.org>
Received: from mga14.intel.com (mga14.intel.com [192.55.52.115])
 by gabe.freedesktop.org (Postfix) with ESMTP id 81C576E1EB
 for <intel-gfx@lists.freedesktop.org>; Thu,  4 Sep 2014 09:40:37 -0700 (PDT)
Content-Disposition: inline
In-Reply-To: <20140904153839.GE13230@nuc-i3427.alporthouse.com>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
 <mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
 <mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org, Jani Nikula <jani.nikula@intel.com>, Daniel Vetter <dnaiel.vetter@ffwll.chm>
List-Id: intel-gfx@lists.freedesktop.org

On Thu, Sep 04, 2014 at 04:38:39PM +0100, Chris Wilson wrote:
> On Thu, Sep 04, 2014 at 06:25:03PM +0300, Ville Syrj=E4l=E4 wrote:
> > On Thu, Sep 04, 2014 at 04:09:02PM +0100, Chris Wilson wrote:
> > > When run as a timer, i915_hangcheck_elapsed() must adhere to all the
> > > rules of running in a softirq context. This is advantageous to us as =
we
> > > want to minimise the risk that a driver bug will prevent us from
> > > detecting a hung GPU. However, that is irrelevant if the driver bug
> > > prevents us from resetting and recovering. Still it is prudent not to
> > > rely on mutexes inside the checker, but given the coarseness of
> > > dev->struct_mutex doing so is extremely hard.
> > > =

> > > Give in and run from a work queue, i.e. outside of softirq.
> > > =

> > > v2:
> > > =

> > > The conversion does have one significant change, from the use of
> > > mod_timer to schedule_delayed_work, means that the time that we execu=
te
> > > the first hangcheck is fixed and not continually deferred by later wo=
rk.
> > > This has the advantage of not allowing userspace to fill the ring bef=
ore
> > > hangcheck can finally run. At the same time, it removes the ability f=
or
> > > the interrupt to defer the hangcheck as well. This is sensible for th=
at
> > > an interrupt is only for a single engine, whereas we perform hangcheck
> > > globally, so whilst one ring may have hung, the other could be running
> > > normally and preventing the hangcheck from firing.
> > =

> > But doesn't this make it so that we may not detect a hang unless more
> > work gets submitted constantly? Eg.
> > =

> > 1. execbuffer batch 1 -> queue hangcheck schedules work
> > 2. execbuffer batch 2 -> queue hangcheck does nothing
> > 3. execbuffer batch 3 -> queue hangcheck does nothing
> > 4. hangcheck expires and sees progress up to batch 2 -> everything is f=
ine
> 4.b hangcheck rearms itself as there is outstanding wrok

Indeed. I should have actually read the code and it would have been
obvious.

> > 5. batch 3 hangs
> 6. hangcheck fires, sees progress, rearms
> 7. hangcheck fires, sees no progress, shoots the user.

Sounds like we need a disclaimer about the dangers of causing a GPU
hang :)

-- =

Ville Syrj=E4l=E4
Intel OTC