Intel-GFX Archive on lore.kernel.org
From: Daniel Vetter <daniel@ffwll.ch>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog
Date: Fri, 26 Mar 2021 10:10:13 +0100
Message-ID: <YF2k9TivGrDdenoE@phenom.ffwll.local>
In-Reply-To: <20210324121335.2307063-1-tvrtko.ursulin@linux.intel.com>

On Wed, Mar 24, 2021 at 12:13:28PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> "Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
> post of a somewhat controversial feature, now upgraded to patch status.
> 
> I quote "watchdog" because in the classical sense a watchdog would allow
> userspace to ping it and so remain alive.
> 
> I quote "restoring hangcheck" because this series, contrary to the old
> hangcheck, does not look at whether the workload is making any progress from
> the kernel side either. (Disclaimer: my memory may be leaky. Daniel suspects
> the old hangcheck had some stricter, more indiscriminate angles to it, but
> apart from it being prone to both false negatives and false positives, I
> can't remember that myself.)
> 
> Short version: the ask is to fail any user submission after a set time
> period. In this RFC that time is twelve seconds.
> 
> Time counts from the moment a user submission is "runnable" (implicit and
> explicit dependencies have been cleared) and keeps counting regardless of the
> GPU contention caused by other users of the system.
> 
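
As a rough illustration of the accounting described above, here is a minimal
sketch of an expiry clock that starts when a request becomes runnable and
never pauses for contention. It is not i915 code: the struct, the field names
and the helper are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not i915 code: every name here is illustrative. */
struct sketch_request {
	uint64_t runnable_at_ms; /* when all dependencies were cleared */
	uint64_t expiry_ms;      /* allowed lifetime once runnable */
};

/*
 * The clock starts when the request becomes runnable and is never paused,
 * so time spent queued behind other clients counts against it as well.
 */
static bool sketch_request_expired(const struct sketch_request *rq,
				   uint64_t now_ms)
{
	return now_ms - rq->runnable_at_ms >= rq->expiry_ms;
}
```

With a 20s expiry, a request that became runnable at t=1s would be failed at
t=21s in this model, whether or not it ever reached the hardware.
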
> So the semantics are really a bit weak, but again, I understand this is very
> much wanted by the DRM core, even if I am not convinced it is a good idea.
> 
> There are some dangers with doing this - text borrowed from a patch in the
> series:
> 
>   This can have the effect that workloads which used to work fine will
>   suddenly start failing. Even workloads composed of short batches but in
>   long dependency chains can be terminated.
> 
>   And because of the lack of agreement on the usefulness and safety of fence
>   error propagation, this partial execution can be invisible to userspace
>   even if it is "listening" to the returned fence status.
> 
>   Another interaction is with hangcheck, where care needs to be taken that
>   the timeout is not set lower than, or close to, three times the heartbeat
>   interval. Otherwise a hang in any application can cause complete
>   termination of all submissions from unrelated clients. Any users modifying
>   the per-engine heartbeat intervals therefore need to be aware of this
>   potential denial of service to avoid inadvertently enabling it.
> 
>   Given all this, I am personally not convinced the scheme is a good idea.
>   Intuitively it feels that object importers would be better positioned to
>   enforce the time they are willing to wait for something to complete.
> 
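
The heartbeat constraint in the quoted text can be sketched as a simple
sanity check. This is an illustration only, not an existing i915 function:
the factor of three is taken from the text above, while the helper name and
the margin parameter (standing in for "close to") are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sanity check, not an i915 function. The factor of three
 * comes from the quoted text; margin_ms is an assumed way of expressing
 * "close to" and has no counterpart in the series.
 */
static bool sketch_expiry_is_safe(uint32_t expiry_ms, uint32_t heartbeat_ms,
				  uint32_t margin_ms)
{
	/* Widen to 64 bits so the multiplication cannot overflow. */
	uint64_t floor_ms = 3ull * heartbeat_ms + margin_ms;

	return expiry_ms > floor_ms;
}
```

For example, with an assumed heartbeat interval of 2500ms and a 1000ms
margin, any expiry at or below 8500ms would be rejected by this check.
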
> v2:
>  * Dropped context param.
>  * Improved commit messages and Kconfig text.
> 
> v3:
>  * Log timeouts.
>  * Bump timeout to 20s to see if it helps Tigerlake.

I think 20s is a bit much, and it seems the problem is still there in igt. I
think we need to look at that and figure out what to do with it, and then
bring the timeout back down somewhat, since 20s is quite a long time. That is
irrespective of all the additional gaps/opens around the watchdog timeout.
-Daniel

>  * Fix sentinel assert.
> 
> v4:
>  * A round of review feedback applied.
> 
> Chris Wilson (1):
>   drm/i915: Individual request cancellation
> 
> Tvrtko Ursulin (6):
>   drm/i915: Extract active lookup engine to a helper
>   drm/i915: Restrict sentinel requests further
>   drm/i915: Handle async cancellation in sentinel assert
>   drm/i915: Request watchdog infrastructure
>   drm/i915: Fail too long user submissions by default
>   drm/i915: Allow configuring default request expiry via modparam
> 
>  drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |  73 ++++---
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
>  drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
>  drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
>  .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
>  .../drm/i915/gt/intel_execlists_submission.c  |  23 +-
>  .../drm/i915/gt/intel_execlists_submission.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>  drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  28 +++
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
>  drivers/gpu/drm/i915/i915_params.c            |   5 +
>  drivers/gpu/drm/i915/i915_params.h            |   1 +
>  drivers/gpu/drm/i915/i915_request.c           | 129 ++++++++++-
>  drivers/gpu/drm/i915/i915_request.h           |  16 +-
>  drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
>  17 files changed, 479 insertions(+), 45 deletions(-)
> 
> -- 
> 2.27.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

Thread overview: 23+ messages
2021-03-24 12:13 [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog Tvrtko Ursulin
2021-03-24 12:13 ` [Intel-gfx] [PATCH 1/7] drm/i915: Extract active lookup engine to a helper Tvrtko Ursulin
2021-03-24 12:21   ` Matthew Auld
2021-03-24 12:13 ` [Intel-gfx] [PATCH 2/7] drm/i915: Individual request cancellation Tvrtko Ursulin
2021-03-24 15:24   ` Matthew Auld
2021-03-24 12:13 ` [Intel-gfx] [PATCH 3/7] drm/i915: Restrict sentinel requests further Tvrtko Ursulin
2021-03-24 15:25   ` Matthew Auld
2021-03-26  0:01     ` Daniel Vetter
2021-03-24 12:13 ` [Intel-gfx] [PATCH 4/7] drm/i915: Handle async cancellation in sentinel assert Tvrtko Ursulin
2021-03-24 17:22   ` Matthew Auld
2021-03-24 12:13 ` [Intel-gfx] [PATCH 5/7] drm/i915: Request watchdog infrastructure Tvrtko Ursulin
2021-03-26  0:00   ` Daniel Vetter
2021-03-26 10:32     ` Tvrtko Ursulin
2021-03-24 12:13 ` [Intel-gfx] [PATCH 6/7] drm/i915: Fail too long user submissions by default Tvrtko Ursulin
2021-03-24 12:13 ` [Intel-gfx] [PATCH 7/7] drm/i915: Allow configuring default request expiry via modparam Tvrtko Ursulin
2021-03-26  0:25   ` Daniel Vetter
2021-03-24 13:16 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Default request/fence expiry + watchdog (rev5) Patchwork
2021-03-24 13:21 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork
2021-03-24 13:48 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2021-03-24 23:29 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-03-26  9:10 ` Daniel Vetter [this message]
2021-03-26 10:31   ` [Intel-gfx] [PATCH v4 0/7] Default request/fence expiry + watchdog Tvrtko Ursulin
2021-04-08 10:18     ` Daniel Vetter
