From: Mika Kuoppala
To: Chris Wilson, intel-gfx@lists.freedesktop.org
Cc: Chris Wilson
Subject: Re: [Intel-gfx] [PATCH 01/33] drm/i915/gt: Harden the heartbeat against a stuck driver
Date: Thu, 02 Jul 2020 12:39:43 +0300
Message-ID: <87d05e45nk.fsf@gaia.fi.intel.com>
In-Reply-To: <20200701084053.6086-1-chris@chris-wilson.co.uk>
References: <20200701084053.6086-1-chris@chris-wilson.co.uk>

Chris Wilson writes:

> If the driver get stuck holding the kernel timeline, we cannot issue a
> heartbeat and so fail to discover that the driver is indeed stuck and do
> not issue a GPU reset (which would hopefully unstick the driver!).
> Switch to using a trylock so that we can query if the heartbeat's
> timelin mutex is locked elsewhere, and then use the timer to probe if it

timeline

> remains stuck at the same spot for consecutive heartbeats, indicating
> that the mutex has not been released and the engine has not progressed.
>
> Signed-off-by: Chris Wilson
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 14 ++++++++++++--
>  drivers/gpu/drm/i915/gt/intel_engine_types.h     |  1 +
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 8db7e93abde5..1663ab5c68a5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -65,6 +65,7 @@ static void heartbeat(struct work_struct *wrk)
>  		container_of(wrk, typeof(*engine), heartbeat.work.work);
>  	struct intel_context *ce = engine->kernel_context;
>  	struct i915_request *rq;
> +	unsigned long serial;
>
>  	/* Just in case everything has gone horribly wrong, give it a kick */
>  	intel_engine_flush_submission(engine);
> @@ -122,10 +123,19 @@ static void heartbeat(struct work_struct *wrk)
>  		goto out;
>  	}
>
> -	if (engine->wakeref_serial == engine->serial)
> +	serial = READ_ONCE(engine->serial);
> +	if (engine->wakeref_serial == serial)
>  		goto out;
>
> -	mutex_lock(&ce->timeline->mutex);
> +	if (!mutex_trylock(&ce->timeline->mutex)) {
> +		/* Unable to lock the kernel timeline, is the engine stuck? */
> +		if (xchg(&engine->heartbeat.blocked, serial) == serial)
> +			intel_gt_handle_error(engine->gt, engine->mask,
> +					      I915_ERROR_CAPTURE,
> +					      "stopped heartbeat on %s",
> +					      engine->name);
> +		goto out;
> +	}

This should do the trick. I was worried about the submit signal (fence)
block above in this function being empty, i.e. what happens if we never
manage to submit. But this should cover that case as well.
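
For anyone following along, the trylock-plus-xchg check in the hunk above
can be sketched in userspace C. This is only an illustration under stated
assumptions, not the driver code: struct engine, engine_init(),
engine_progress(), heartbeat_tick() and run_demo() are hypothetical names,
a pthread mutex stands in for ce->timeline->mutex, C11 atomic_exchange()
stands in for xchg(), and returning true stands in for the call to
intel_gt_handle_error(). The serial starts at 1 here so that the initial
blocked value of 0 cannot match on the first failed trylock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the engine state touched by the patch. */
struct engine {
	pthread_mutex_t timeline_mutex;	/* plays the role of ce->timeline->mutex */
	atomic_ulong serial;		/* bumped whenever the engine progresses */
	atomic_ulong blocked;		/* serial seen at the last failed trylock */
};

static void engine_init(struct engine *e)
{
	pthread_mutex_init(&e->timeline_mutex, NULL);
	atomic_init(&e->serial, 1);	/* assumption: start above blocked's 0 */
	atomic_init(&e->blocked, 0);
}

/* Simulate the engine making progress between heartbeats. */
static void engine_progress(struct engine *e)
{
	atomic_fetch_add(&e->serial, 1);
}

/*
 * One heartbeat tick. Returns true when the mutex could not be taken and
 * the serial is the same one recorded by the previous failed attempt,
 * which is where the patch would request a reset.
 */
static bool heartbeat_tick(struct engine *e)
{
	unsigned long serial = atomic_load(&e->serial);

	if (pthread_mutex_trylock(&e->timeline_mutex) != 0) {
		/* Record this serial and compare with the one stored before. */
		return atomic_exchange(&e->blocked, serial) == serial;
	}

	/* Lock taken: a real heartbeat would build its request here. */
	pthread_mutex_unlock(&e->timeline_mutex);
	return false;
}

/* Walk through the scenario; returns the number of unexpected results. */
static int run_demo(void)
{
	struct engine e;
	int bad = 0;

	engine_init(&e);
	pthread_mutex_lock(&e.timeline_mutex);	/* "driver" holds the timeline */

	bad += heartbeat_tick(&e) != false;	/* first miss: only recorded */
	bad += heartbeat_tick(&e) != true;	/* same serial again: stuck */
	engine_progress(&e);
	bad += heartbeat_tick(&e) != false;	/* serial moved: not stuck */

	pthread_mutex_unlock(&e.timeline_mutex);
	bad += heartbeat_tick(&e) != false;	/* lock free: normal heartbeat */

	printf("unexpected results: %d\n", bad);
	return bad;
}
```

Calling run_demo() from main() walks through the four cases. Like the
patch, the sketch never clears blocked after a successful lock; it relies
on the serial advancing to tell a new stall from an old one.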
Reviewed-by: Mika Kuoppala

>
>  	intel_context_enter(ce);
>  	rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 073c3769e8cc..490af81bd6f3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -348,6 +348,7 @@ struct intel_engine_cs {
>  	struct {
>  		struct delayed_work work;
>  		struct i915_request *systole;
> +		unsigned long blocked;
>  	} heartbeat;
>
>  	unsigned long serial;
> --
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx