From mboxrd@z Thu Jan 1 00:00:00 1970 From: Daniel Vetter Subject: Re: [PATCH 0/4] [RFC] use HW watchdog timer Date: Mon, 16 Jul 2012 22:16:22 +0200 Message-ID: <20120716201622.GE5023@phenom.ffwll.local> References: <1342464719-8790-1-git-send-email-ben@bwidawsk.net> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from mail-wg0-f43.google.com (mail-wg0-f43.google.com [74.125.82.43]) by gabe.freedesktop.org (Postfix) with ESMTP id 7E2369EF51 for ; Mon, 16 Jul 2012 13:16:18 -0700 (PDT) Received: by wgbdr1 with SMTP id dr1so1755002wgb.12 for ; Mon, 16 Jul 2012 13:16:17 -0700 (PDT) Content-Disposition: inline In-Reply-To: <1342464719-8790-1-git-send-email-ben@bwidawsk.net> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org To: Ben Widawsky Cc: Daniel Vetter , intel-gfx@lists.freedesktop.org List-Id: intel-gfx@lists.freedesktop.org On Mon, Jul 16, 2012 at 11:51:55AM -0700, Ben Widawsky wrote: > This was my pet project for the last few days, but I have to take a > break from working on it for now to do some real work ;-). The patches > compile, and pass a basic test, but that's about it. There is still > quite a bit of work left to make this useful. The easiest thing would be > to tie this into error state. > > The idea is pretty simple. It uses the HW watchdog to set a timeout per > batchbuffer, instead of a global software watchdog. > > Pros: > * Potential for per batch, or ring watchdog values. I believe when/if we > get to GPGPU workloads, this is particularly interesting. > * Batch granularity hang detection. This mostly just makes hang > detection and recovery a bit easier IMO. > > Cons: > * Blit ring doesn't have an interrupt. This means we still need the > software watchdog, and it makes hang detection more complex. I've been > led to believe future HW *may* have this interrupt. > * Semaphores > > I'm looking for feedback, mainly for Daniel, and Chris if this is worth > pursuing further when I have more time. The idea would be to eventually > use this to implement much of the ARB robustness requirements instead of > doing a bunch of request list processing. > > Ben Widawsky (4): > drm/i915: Use HW watchdog for each batch > drm/i915: Turn on watchdog interrupts > drm/i915: Add a breadcrumb > drm/i915: Display the failing seqno High-level drive-by review: I think it's very important to separate hangs in the batch itself from hangs in the ringbuffers, e.g. semaphore deadlocks. We should only blame the former on the userspace-submitted batchbuffer. So on that ground I think the following speudo-code check in the hangcheck code would be simpler to implement and more robust: /* Check whether we're outside of the ring. At worst this check presumes * that the hang is in the ring, never the other way round. */ if (ACTHEAD != RING_HEAD) /* Check whether ACTHEAD lies in any active ring. We can't check * the object's gpu_domain, that might have been changed again * already. */ for_each_active_buffer(buffer) if (buffer->gtt_offset + buffer->size < ACTHEAD && buffer->gtt_offset >= ACTHEAD) goto found; Otoh I don't think your patches here are a completely lost cause. This render watchdog could be used rather well for a per-patch short timeout to kill one specific batchbuffer I guess. But right now I don't see an immediate use-case ... Yours, Daniel -- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48