From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ville =?iso-8859-1?Q?Syrj=E4l=E4?= Subject: Re: [PATCH] drm/i915: Convert hangcheck from a timer into a delayed work item Date: Thu, 4 Sep 2014 19:40:32 +0300 Message-ID: <20140904164032.GN4193@intel.com> References: <1409832275-14943-1-git-send-email-chris@chris-wilson.co.uk> <1409843342-437-1-git-send-email-chris@chris-wilson.co.uk> <20140904152503.GM4193@intel.com> <20140904153839.GE13230@nuc-i3427.alporthouse.com> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Return-path: Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by gabe.freedesktop.org (Postfix) with ESMTP id 81C576E1EB for ; Thu, 4 Sep 2014 09:40:37 -0700 (PDT) Content-Disposition: inline In-Reply-To: <20140904153839.GE13230@nuc-i3427.alporthouse.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" To: Chris Wilson , intel-gfx@lists.freedesktop.org, Jani Nikula , Daniel Vetter List-Id: intel-gfx@lists.freedesktop.org On Thu, Sep 04, 2014 at 04:38:39PM +0100, Chris Wilson wrote: > On Thu, Sep 04, 2014 at 06:25:03PM +0300, Ville Syrj=E4l=E4 wrote: > > On Thu, Sep 04, 2014 at 04:09:02PM +0100, Chris Wilson wrote: > > > When run as a timer, i915_hangcheck_elapsed() must adhere to all the > > > rules of running in a softirq context. This is advantageous to us as = we > > > want to minimise the risk that a driver bug will prevent us from > > > detecting a hung GPU. However, that is irrelevant if the driver bug > > > prevents us from resetting and recovering. Still it is prudent not to > > > rely on mutexes inside the checker, but given the coarseness of > > > dev->struct_mutex doing so is extremely hard. > > > = > > > Give in and run from a work queue, i.e. outside of softirq. > > > = > > > v2: > > > = > > > The conversion does have one significant change, from the use of > > > mod_timer to schedule_delayed_work, means that the time that we execu= te > > > the first hangcheck is fixed and not continually deferred by later wo= rk. > > > This has the advantage of not allowing userspace to fill the ring bef= ore > > > hangcheck can finally run. At the same time, it removes the ability f= or > > > the interrupt to defer the hangcheck as well. This is sensible for th= at > > > an interrupt is only for a single engine, whereas we perform hangcheck > > > globally, so whilst one ring may have hung, the other could be running > > > normally and preventing the hangcheck from firing. > > = > > But doesn't this make it so that we may not detect a hang unless more > > work gets submitted constantly? Eg. > > = > > 1. execbuffer batch 1 -> queue hangcheck schedules work > > 2. execbuffer batch 2 -> queue hangcheck does nothing > > 3. execbuffer batch 3 -> queue hangcheck does nothing > > 4. hangcheck expires and sees progress up to batch 2 -> everything is f= ine > 4.b hangcheck rearms itself as there is outstanding wrok Indeed. I should have actually read the code and it would have been obvious. > > 5. batch 3 hangs > 6. hangcheck fires, sees progress, rearms > 7. hangcheck fires, sees no progress, shoots the user. Sounds like we need a disclaimer about the dangers of causing a GPU hang :) -- = Ville Syrj=E4l=E4 Intel OTC