From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 28 Jun 2022 22:35:13 -0700
Message-ID: <87o7ycowvi.wl-ashutosh.dixit@intel.com>
From: "Dixit, Ashutosh" <ashutosh.dixit@intel.com>
To: intel-gfx@lists.freedesktop.org
In-Reply-To: <20220628191741.28866-1-ashutosh.dixit@intel.com>
References: <20220628191741.28866-1-ashutosh.dixit@intel.com>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue)
 FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0
 Emacs/28.1 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue")
Content-Type: text/plain; charset=US-ASCII
Subject: Re: [Intel-gfx] [PATCH] drm/i915/reset: Handle reset timeouts under
 unrelated kernel hangs
X-BeenThere: intel-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Intel graphics driver community testing & development
 <intel-gfx.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-gfx>,
 <mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
 <mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Cc: dri-devel@lists.freedesktop.org, Chris Wilson <chris.p.wilson@intel.com>,
 Chris Wilson <chris@chris-wilson.co.uk>
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

On Tue, 28 Jun 2022 12:17:41 -0700, Ashutosh Dixit wrote:
>
> From: Chris Wilson <chris@chris-wilson.co.uk>
>
> When resuming after hibernate sometimes we see hangs in unrelated kernel
> subsystems. These hangs often result in the following i915 trace:
>
> i915 0000:00:02.0: [drm] \
>	*ERROR* intel_gt_reset_global timed out, cancelling all in-flight rendering.
>
> implying our reset task has been starved by the hanging kernel subsystem,
> causing us to inappropriately declare the system as wedged beyond recovery.
>
> The trace would be caused by our synchronize_srcu_expedited() taking more
> than the allowed 5s due to the unrelated kernel hang. But we neither need
> to perform that synchronisation inside the reset watchdog, nor do we need
> such a short timeout before declaring the device as unrecoverable.
>
> Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/3575
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index a5338c3fde7a0..e72744f6faedc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1259,12 +1259,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
>	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>
>	/* Use a watchdog to ensure that our reset completes */
> -	intel_wedge_on_timeout(&w, gt, 5 * HZ) {
> +	intel_wedge_on_timeout(&w, gt, 60 * HZ) {

How about we take one step at a time: if we are moving
synchronize_srcu_expedited() out of the reset watchdog, let's leave the
timeout at the previous 5s. With the original timeout restored, this patch
is:

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>

>		intel_display_prepare_reset(gt->i915);
>
> -		/* Flush everyone using a resource about to be clobbered */
> -		synchronize_srcu_expedited(&gt->reset.backoff_srcu);
> -
>		intel_gt_reset(gt, engine_mask, reason);
>
>		intel_display_finish_reset(gt->i915);
> @@ -1373,6 +1370,9 @@ void intel_gt_handle_error(struct intel_gt *gt,
>		}
>	}
>
> +	/* Flush everyone using a resource about to be clobbered */
> +	synchronize_srcu_expedited(&gt->reset.backoff_srcu);
> +
>	intel_gt_reset_global(gt, engine_mask, msg);
>
>	if (!intel_uc_uses_guc_submission(&gt->uc)) {
> --
> 2.36.1
>
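
The reasoning in the quoted commit message can be illustrated with a small
userspace model. This is only a sketch under stated assumptions, not i915
code: struct reset_model, watchdog_fires() and the two wedged_with_*()
helpers are hypothetical names, durations are plain seconds rather than
jiffies, and the watchdog is reduced to a comparison against its budget.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one reset attempt; durations are in seconds. */
struct reset_model {
	int sync_s;	/* time spent waiting in the SRCU-style flush */
	int reset_s;	/* time spent in the reset itself */
};

/* True if the work timed inside the watchdog window exceeds the budget. */
static bool watchdog_fires(int timed_work_s, int timeout_s)
{
	assert(timeout_s > 0);
	return timed_work_s > timeout_s;
}

/* Before the patch: the flush runs inside the watchdog window. */
bool wedged_with_sync_inside(struct reset_model m, int timeout_s)
{
	return watchdog_fires(m.sync_s + m.reset_s, timeout_s);
}

/* After the patch: the flush completes before the watchdog is armed. */
bool wedged_with_sync_outside(struct reset_model m, int timeout_s)
{
	return watchdog_fires(m.reset_s, timeout_s);
}
```

With a flush starved for 7s and a 1s reset, the model exceeds a 5s budget
only when the flush is timed inside the window. Moving the flush out is
sufficient on its own in that case, which is why the two changes can be
made in separate steps.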