From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: mika.kuoppala@intel.com
Subject: Re: [PATCH v2 13/22] drm/i915: Update reset path to fix incomplete requests
Date: Mon, 3 Oct 2016 13:44:14 +0100 [thread overview]
Message-ID: <0be5c99b-101b-1132-6199-e81cb28b031b@linux.intel.com> (raw)
In-Reply-To: <20160907144516.29495-13-chris@chris-wilson.co.uk>
On 07/09/2016 15:45, Chris Wilson wrote:
> Update reset path in preparation for engine reset which requires
> identification of incomplete requests and associated context and fixing
> their state so that engine can resume correctly after reset.
>
> The request that caused the hang will be skipped and head is reset to the
> start of breadcrumb. This allows us to resume from where we left-off.
> Since this request didn't complete normally we also need to cleanup elsp
> queue manually. This is vital if we employ nonblocking request
> submission where we may have a web of dependencies upon the hung request
> and so advancing the seqno manually is no longer trivial.
>
> ABI: gem_reset_stats / DRM_IOCTL_I915_GET_RESET_STATS
>
> We change the way we count pending batches. Only the active context
> involved in the reset is marked as either innocent or guilty, and not
> mark the entire world as pending. By inspection this only affects
> igt/gem_reset_stats (which assumes implementation details) and not
> piglit.
>
> ARB_robustness gives this guide on how we expect the user of this
> interface to behave:
>
> * Provide a mechanism for an OpenGL application to learn about
> graphics resets that affect the context. When a graphics reset
> occurs, the OpenGL context becomes unusable and the application
> must create a new context to continue operation. Detecting a
> graphics reset happens through an inexpensive query.
>
> And with regards to the actual meaning of the reset values:
>
> Certain events can result in a reset of the GL context. Such a reset
> causes all context state to be lost. Recovery from such events
> requires recreation of all objects in the affected context. The
> current status of the graphics reset state is returned by
>
> enum GetGraphicsResetStatusARB();
>
> The symbolic constant returned indicates if the GL context has been
> in a reset state at any point since the last call to
> GetGraphicsResetStatusARB. NO_ERROR indicates that the GL context
> has not been in a reset state since the last call.
> GUILTY_CONTEXT_RESET_ARB indicates that a reset has been detected
> that is attributable to the current GL context.
> INNOCENT_CONTEXT_RESET_ARB indicates a reset has been detected that
> is not attributable to the current GL context.
> UNKNOWN_CONTEXT_RESET_ARB indicates a detected graphics reset whose
> cause is unknown.
>
> The language here is explicit in that we must mark up the guilty batch,
> but is loose enough for us to relax the innocent (i.e. pending)
> accounting as only the active batches are involved with the reset.
>
> In the future, we are looking towards single engine resetting (with
> minimal locking), where it seems inappropriate to mark the entire world
> as innocent since the reset occurred on a different engine. Reducing the
> information available means we only have to encounter the pain once, and
> also reduces the information leaking from one context to another.
>
> v2: Legacy ringbuffer submission required a reset following hibernation,
> or else we restore stale values to the RING_HEAD and walked over
> stolen garbage.
>
> v3: GuC requires replaying the requests after a reset.
>
> v4: Restore engine IRQ after reset (so waiters will be woken!)
> Rearm hangcheck if resetting with a waiter.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 8 +-
> drivers/gpu/drm/i915/i915_drv.h | 5 +-
> drivers/gpu/drm/i915/i915_gem.c | 123 +++++++++++++++++------------
> drivers/gpu/drm/i915/i915_gem_context.c | 16 ----
> drivers/gpu/drm/i915/i915_guc_submission.c | 8 +-
> drivers/gpu/drm/i915/intel_engine_cs.c | 15 +++-
> drivers/gpu/drm/i915/intel_lrc.c | 49 ++++++++++--
> drivers/gpu/drm/i915/intel_lrc.h | 3 +-
> drivers/gpu/drm/i915/intel_ringbuffer.c | 47 +++++++----
> drivers/gpu/drm/i915/intel_ringbuffer.h | 7 +-
> 10 files changed, 183 insertions(+), 98 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c1b890dbd6cc..2b0727d1467d 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -559,7 +559,6 @@ static void i915_gem_fini(struct drm_device *dev)
> }
>
> mutex_lock(&dev->struct_mutex);
> - i915_gem_reset(dev);
> i915_gem_cleanup_engines(dev);
> i915_gem_context_fini(dev);
> mutex_unlock(&dev->struct_mutex);
> @@ -1579,7 +1578,7 @@ static int i915_drm_resume(struct drm_device *dev)
> mutex_lock(&dev->struct_mutex);
> if (i915_gem_init_hw(dev)) {
> DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
> - set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
> + i915_gem_set_wedged(dev_priv);
> }
> mutex_unlock(&dev->struct_mutex);
>
> @@ -1756,8 +1755,6 @@ int i915_reset(struct drm_i915_private *dev_priv)
>
> pr_notice("drm/i915: Resetting chip after gpu hang\n");
>
> - i915_gem_reset(dev);
> -
> ret = intel_gpu_reset(dev_priv, ALL_ENGINES);
> if (ret) {
> if (ret != -ENODEV)
> @@ -1767,6 +1764,7 @@ int i915_reset(struct drm_i915_private *dev_priv)
> goto error;
> }
>
> + i915_gem_reset(dev_priv);
> intel_overlay_reset(dev_priv);
>
> /* Ok, now get things going again... */
> @@ -1803,7 +1801,7 @@ out:
> return ret;
>
> error:
> - set_bit(I915_WEDGED, &error->flags);
> + i915_gem_set_wedged(dev_priv);
> goto out;
> }
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 2e2fd8a77233..a63bf820aa8f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2029,6 +2029,7 @@ struct drm_i915_private {
>
> /* Abstract the submission mechanism (legacy ringbuffer or execlists) away */
> struct {
> + void (*resume)(struct drm_i915_private *);
> void (*cleanup_engine)(struct intel_engine_cs *engine);
>
> /**
> @@ -3262,7 +3263,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
> return READ_ONCE(error->reset_count);
> }
>
> -void i915_gem_reset(struct drm_device *dev);
> +void i915_gem_reset(struct drm_i915_private *dev_priv);
> +void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
> bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
> int __must_check i915_gem_init(struct drm_device *dev);
> int __must_check i915_gem_init_hw(struct drm_device *dev);
> @@ -3391,7 +3393,6 @@ void i915_gem_object_save_bit_17_swizzle(struct drm_i915_gem_object *obj);
> int __must_check i915_gem_context_init(struct drm_device *dev);
> void i915_gem_context_lost(struct drm_i915_private *dev_priv);
> void i915_gem_context_fini(struct drm_device *dev);
> -void i915_gem_context_reset(struct drm_device *dev);
> int i915_gem_context_open(struct drm_device *dev, struct drm_file *file);
> void i915_gem_context_close(struct drm_device *dev, struct drm_file *file);
> int i915_switch_context(struct drm_i915_gem_request *req);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 23069a2d2850..65a69bbe021d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2555,29 +2555,83 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
> return NULL;
> }
>
> -static void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
> +static void reset_request(struct drm_i915_gem_request *request)
> +{
> + void *vaddr = request->ring->vaddr;
> + u32 head;
> +
> + /* As this request likely depends on state from the lost
> + * context, clear out all the user operations leaving the
> + * breadcrumb at the end (so we get the fence notifications).
> + */
> + head = request->head;
> + if (request->postfix < head) {
> + memset(vaddr + head, 0, request->ring->size - head);
> + head = 0;
> + }
> + memset(vaddr + head, 0, request->postfix - head);
> +}
> +
> +static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> {
> struct drm_i915_gem_request *request;
> + struct i915_gem_context *incomplete_ctx;
> bool ring_hung;
>
> + /* Ensure irq handler finishes, and not run again. */
> + tasklet_kill(&engine->irq_tasklet);
> +
> request = i915_gem_find_active_request(engine);
> - if (request == NULL)
> + if (!request)
> return;
>
> ring_hung = engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG;
> -
> i915_set_reset_status(request->ctx, ring_hung);
> + if (!ring_hung)
> + return;
> +
> + DRM_DEBUG_DRIVER("reseting %s to start from tail of request 0x%x\n",
> + engine->name, request->fence.seqno);
> +
> + /* Setup the CS to resume from the breadcrumb of the hung request */
> + engine->reset_hw(engine, request);
> +
> + /* Users of the default context do not rely on logical state
> + * preserved between batches. They have to emit full state on
> + * every batch and so it is safe to execute queued requests following
> + * the hang.
> + *
> + * Other contexts preserve state, now corrupt. We want to skip all
> + * queued requests that reference the corrupt context.
> + */
> + incomplete_ctx = request->ctx;
> + if (i915_gem_context_is_default(incomplete_ctx))
> + return;
> +
> list_for_each_entry_continue(request, &engine->request_list, link)
> - i915_set_reset_status(request->ctx, false);
> + if (request->ctx == incomplete_ctx)
> + reset_request(request);
> }
>
> -static void i915_gem_reset_engine_cleanup(struct intel_engine_cs *engine)
> +void i915_gem_reset(struct drm_i915_private *dev_priv)
> {
> - struct drm_i915_gem_request *request;
> - struct intel_ring *ring;
> + struct intel_engine_cs *engine;
>
> - /* Ensure irq handler finishes, and not run again. */
> - tasklet_kill(&engine->irq_tasklet);
> + i915_gem_retire_requests(dev_priv);
> +
> + for_each_engine(engine, dev_priv)
> + i915_gem_reset_engine(engine);
> +
> + i915_gem_restore_fences(&dev_priv->drm);
> +}
> +
> +static void nop_submit_request(struct drm_i915_gem_request *request)
> +{
> +}
> +
> +static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
I am aware this is a late comment, but wanted to say that the name above
is not ideal since we have both i915_gem_cleanup_engines and
dev_priv->gt.cleanup_engine which do completely different type of thing
than this.
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2016-10-03 12:44 UTC|newest]
Thread overview: 58+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-09-05 13:04 Non-blocking, explict fences Chris Wilson
2016-09-05 13:04 ` [PATCH 01/21] drm/i915: Add a sw fence for collecting up dma fences Chris Wilson
2016-09-05 13:04 ` [PATCH 02/21] drm/i915: Only queue requests during execlists submission Chris Wilson
2016-09-05 13:04 ` [PATCH 03/21] drm/i915: Record the position of the workarounds in the tail of the request Chris Wilson
2016-09-05 13:04 ` [PATCH 04/21] drm/i915: Compute the ELSP register location once Chris Wilson
2016-09-05 13:04 ` [PATCH 05/21] drm/i915: Reorder submitting the requests to ELSP Chris Wilson
2016-09-05 13:04 ` [PATCH 06/21] drm/i915: Simplify ELSP queue request tracking Chris Wilson
2016-09-05 13:04 ` [PATCH 07/21] drm/i915: Separate out reset flags from the reset counter Chris Wilson
2016-09-05 13:04 ` [PATCH 08/21] drm/i915: Drop local struct_mutex around intel_init_emon[ilk] Chris Wilson
2016-09-05 13:04 ` [PATCH 09/21] drm/i915: Expand bool interruptible to pass flags to i915_wait_request() Chris Wilson
2016-09-05 13:04 ` [PATCH 10/21] drm/i915: Mark up all locked waiters Chris Wilson
2016-09-06 9:24 ` Mika Kuoppala
2016-09-05 13:04 ` [PATCH 11/21] drm/i915: Perform a direct reset of the GPU from the waiter Chris Wilson
2016-09-05 13:04 ` [PATCH 12/21] drm/i915: Replace wait-on-mutex with wait-on-bit in reset worker Chris Wilson
2016-09-06 9:28 ` Mika Kuoppala
2016-09-05 13:04 ` [PATCH 13/21] drm/i915: Update reset path to fix incomplete requests Chris Wilson
2016-09-05 13:04 ` [PATCH 14/21] drm/i915: Drive request submission through fence callbacks Chris Wilson
2016-09-05 13:04 ` [PATCH 15/21] drm/i915: Reorder i915_add_request to separate the phases better Chris Wilson
2016-09-06 9:36 ` Mika Kuoppala
2016-09-05 13:04 ` [PATCH 16/21] drm/i915: Prepare object synchronisation for asynchronicity Chris Wilson
2016-09-05 13:04 ` [PATCH 17/21] drm/i915/guc: Prepare for nonblocking execbuf submission Chris Wilson
2016-09-05 13:04 ` [PATCH 18/21] drm/i915: Nonblocking request submission Chris Wilson
2016-09-05 13:04 ` [PATCH 19/21] drm/i915: Serialise execbuf operation after a dma-buf reservation object Chris Wilson
2016-09-05 13:04 ` [PATCH 20/21] drm/i915: Enable userspace to opt-out of implicit fencing Chris Wilson
2016-09-05 13:04 ` [PATCH 21/21] drm/i915: Support explicit fencing for execbuf Chris Wilson
2016-09-05 13:59 ` ✗ Fi.CI.BAT: failure for series starting with [01/21] drm/i915: Add a sw fence for collecting up dma fences Patchwork
2016-09-05 14:07 ` Chris Wilson
2016-09-07 14:44 ` [PATCH v2 01/22] " Chris Wilson
2016-09-07 14:44 ` [PATCH v2 02/22] drm/i915: Only queue requests during execlists submission Chris Wilson
2016-09-07 14:44 ` [PATCH v2 03/22] drm/i915: Record the position of the workarounds in the tail of the request Chris Wilson
2016-09-07 14:44 ` [PATCH v2 04/22] drm/i915: Compute the ELSP register location once Chris Wilson
2016-09-07 14:44 ` [PATCH v2 05/22] drm/i915: Reorder submitting the requests to ELSP Chris Wilson
2016-09-07 14:45 ` [PATCH v2 06/22] drm/i915: Simplify ELSP queue request tracking Chris Wilson
2016-09-07 14:45 ` [PATCH v2 07/22] drm/i915: Separate out reset flags from the reset counter Chris Wilson
2016-09-07 14:45 ` [PATCH v2 08/22] drm/i915: Drop local struct_mutex around intel_init_emon[ilk] Chris Wilson
2016-09-07 14:45 ` [PATCH v2 09/22] drm/i915: Expand bool interruptible to pass flags to i915_wait_request() Chris Wilson
2016-09-07 14:45 ` [PATCH v2 10/22] drm/i915: Mark up all locked waiters Chris Wilson
2016-09-07 14:45 ` [PATCH v2 11/22] drm/i915: Perform a direct reset of the GPU from the waiter Chris Wilson
2016-09-08 9:35 ` Mika Kuoppala
2016-09-07 14:45 ` [PATCH v2 12/22] drm/i915: Replace wait-on-mutex with wait-on-bit in reset worker Chris Wilson
2016-09-08 10:52 ` Mika Kuoppala
2016-09-08 13:24 ` Chris Wilson
2016-09-07 14:45 ` [PATCH v2 13/22] drm/i915: Update reset path to fix incomplete requests Chris Wilson
2016-09-08 12:03 ` Mika Kuoppala
2016-10-03 12:44 ` Tvrtko Ursulin [this message]
2016-10-03 12:56 ` Chris Wilson
2016-09-07 14:45 ` [PATCH v2 14/22] drm/i915: Drive request submission through fence callbacks Chris Wilson
2016-09-07 14:45 ` [PATCH v2 15/22] drm/i915: Reorder i915_add_request to separate the phases better Chris Wilson
2016-09-07 14:45 ` [PATCH v2 16/22] drm/i915: Prepare object synchronisation for asynchronicity Chris Wilson
2016-09-07 14:45 ` [PATCH v2 17/22] drm/i915/guc: Prepare for nonblocking execbuf submission Chris Wilson
2016-09-12 14:14 ` Tvrtko Ursulin
2016-09-07 14:45 ` [PATCH v2 18/22] drm/i915: Ignore valid but unknown semaphores Chris Wilson
2016-09-08 6:31 ` Joonas Lahtinen
2016-09-07 14:45 ` [PATCH v2 19/22] drm/i915: Nonblocking request submission Chris Wilson
2016-09-07 14:45 ` [PATCH v2 20/22] drm/i915: Serialise execbuf operation after a dma-buf reservation object Chris Wilson
2016-09-07 14:45 ` [PATCH v2 21/22] drm/i915: Enable userspace to opt-out of implicit fencing Chris Wilson
2016-09-07 14:45 ` [PATCH v2 22/22] drm/i915: Support explicit fencing for execbuf Chris Wilson
2016-09-07 15:49 ` ✗ Fi.CI.BAT: warning for series starting with [01/21] drm/i915: Add a sw fence for collecting up dma fences (rev22) Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0be5c99b-101b-1132-6199-e81cb28b031b@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=intel-gfx@lists.freedesktop.org \
--cc=mika.kuoppala@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.