All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>, stable@vger.kernel.org
Subject: Re: [PATCH 4/4] drm/i915: Late request cancellations are harmful
Date: Mon, 11 Apr 2016 14:50:17 +0100	[thread overview]
Message-ID: <570BAB99.5050402@linux.intel.com> (raw)
In-Reply-To: <1460194059-12447-5-git-send-email-chris@chris-wilson.co.uk>


On 09/04/16 10:27, Chris Wilson wrote:
> Conceptually, each request is a record of a hardware transaction - we
> build up a list of pending commands and then either commit them to
> hardware, or cancel them. However, whilst building up the list of
> pending commands, we may modify state outside of the request and make
> references to the pending request. If we do so and then cancel that
> request, external objects then point to the deleted request leading to
> both graphical and memory corruption.
>
> The easiest example is to consider object/VMA tracking. When we mark an
> object as active in a request, we store a pointer to this, the most
> recent request, in the object. Then we want to free that object, we wait
> for the most recent request to be idle before proceeding (otherwise the
> hardware will write to pages now owned by the system, or we will attempt
> to read from those pages before the hardware is finished writing). If
> the request was cancelled instead, that wait completes immediately. As a
> result, all requests must be committed and not cancelled if the external
> state is unknown.

This was a bit hard to figure out.

So we cannot unwind because once we set last_read_req we lose the data 
on what was the previous one, before this transaction started?

Intuitively I don't like the idea of sending unfinished stuff to the
GPU, when it failed at some random point in ring buffer preparation.

So I am struggling with reviewing this as I have in the previous round.

> All that remains of i915_gem_request_cancel() users are just a couple of
> extremely unlikely allocation failures, so remove the API entirely.

This parts feels extra weird because in the non-execbuf cases we 
actually can cancel the transaction without any issues, correct?

Would middle-ground be to keep the cancellations for in-kernel submits, 
and for execbuf rewind the ringbuf so only request post-amble is sent to 
the GPU?

> A consequence of committing all incomplete requests is that we generate
> excess breadcrumbs and fill the ring much more often with dummy work. We
> have completely undone the outstanding_last_seqno optimisation.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: stable@vger.kernel.org
> ---
>   drivers/gpu/drm/i915/i915_drv.h            |  2 --
>   drivers/gpu/drm/i915/i915_gem.c            | 50 ++++++++++++------------------
>   drivers/gpu/drm/i915/i915_gem_context.c    | 21 ++++++-------
>   drivers/gpu/drm/i915/i915_gem_execbuffer.c | 15 +++------
>   drivers/gpu/drm/i915/intel_display.c       |  2 +-
>   drivers/gpu/drm/i915/intel_lrc.c           |  4 +--
>   drivers/gpu/drm/i915/intel_overlay.c       |  8 ++---
>   7 files changed, 39 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a93e5dd4fa9a..f374db8de673 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2320,7 +2320,6 @@ struct drm_i915_gem_request {
>   struct drm_i915_gem_request * __must_check
>   i915_gem_request_alloc(struct intel_engine_cs *engine,
>   		       struct intel_context *ctx);
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req);
>   void i915_gem_request_free(struct kref *req_ref);
>   int i915_gem_request_add_to_client(struct drm_i915_gem_request *req,
>   				   struct drm_file *file);
> @@ -2872,7 +2871,6 @@ int i915_gem_sw_finish_ioctl(struct drm_device *dev, void *data,
>   			     struct drm_file *file_priv);
>   void i915_gem_execbuffer_move_to_active(struct list_head *vmas,
>   					struct drm_i915_gem_request *req);
> -void i915_gem_execbuffer_retire_commands(struct i915_execbuffer_params *params);
>   int i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
>   				   struct drm_i915_gem_execbuffer2 *args,
>   				   struct list_head *vmas);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1c3ff56594d6..42227495803f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,7 +2753,8 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
>   		 * fully prepared. Thus it can be cleaned up using the proper
>   		 * free code.
>   		 */
> -		i915_gem_request_cancel(req);
> +		intel_ring_reserved_space_cancel(req->ringbuf);
> +		i915_gem_request_unreference(req);
>   		return ret;
>   	}
>
> @@ -2790,13 +2791,6 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
>   	return err ? ERR_PTR(err) : req;
>   }
>
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req)
> -{
> -	intel_ring_reserved_space_cancel(req->ringbuf);
> -
> -	i915_gem_request_unreference(req);
> -}
> -
>   struct drm_i915_gem_request *
>   i915_gem_find_active_request(struct intel_engine_cs *engine)
>   {
> @@ -3410,12 +3404,9 @@ int i915_gpu_idle(struct drm_device *dev)
>   				return PTR_ERR(req);
>
>   			ret = i915_switch_context(req);
> -			if (ret) {
> -				i915_gem_request_cancel(req);
> -				return ret;
> -			}
> -
>   			i915_add_request_no_flush(req);
> +			if (ret)
> +				return ret;

Looks like with this it could execute the context switch on the GPU but 
not update the engine->last_context in do_switch().

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

WARNING: multiple messages have this Message-ID (diff)
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>, stable@vger.kernel.org
Subject: Re: [PATCH 4/4] drm/i915: Late request cancellations are harmful
Date: Mon, 11 Apr 2016 14:50:17 +0100	[thread overview]
Message-ID: <570BAB99.5050402@linux.intel.com> (raw)
In-Reply-To: <1460194059-12447-5-git-send-email-chris@chris-wilson.co.uk>


On 09/04/16 10:27, Chris Wilson wrote:
> Conceptually, each request is a record of a hardware transaction - we
> build up a list of pending commands and then either commit them to
> hardware, or cancel them. However, whilst building up the list of
> pending commands, we may modify state outside of the request and make
> references to the pending request. If we do so and then cancel that
> request, external objects then point to the deleted request leading to
> both graphical and memory corruption.
>
> The easiest example is to consider object/VMA tracking. When we mark an
> object as active in a request, we store a pointer to this, the most
> recent request, in the object. Then we want to free that object, we wait
> for the most recent request to be idle before proceeding (otherwise the
> hardware will write to pages now owned by the system, or we will attempt
> to read from those pages before the hardware is finished writing). If
> the request was cancelled instead, that wait completes immediately. As a
> result, all requests must be committed and not cancelled if the external
> state is unknown.

This was a bit hard to figure out.

So we cannot unwind because once we set last_read_req we lose the data 
on what was the previous one, before this transaction started?

Intuitively I don't like the idea of sending unfinished stuff to the
GPU, when it failed at some random point in ring buffer preparation.

So I am struggling with reviewing this as I have in the previous round.

> All that remains of i915_gem_request_cancel() users are just a couple of
> extremely unlikely allocation failures, so remove the API entirely.

This parts feels extra weird because in the non-execbuf cases we 
actually can cancel the transaction without any issues, correct?

Would middle-ground be to keep the cancellations for in-kernel submits, 
and for execbuf rewind the ringbuf so only request post-amble is sent to 
the GPU?

> A consequence of committing all incomplete requests is that we generate
> excess breadcrumbs and fill the ring much more often with dummy work. We
> have completely undone the outstanding_last_seqno optimisation.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: stable@vger.kernel.org
> ---
>   drivers/gpu/drm/i915/i915_drv.h            |  2 --
>   drivers/gpu/drm/i915/i915_gem.c            | 50 ++++++++++++------------------
>   drivers/gpu/drm/i915/i915_gem_context.c    | 21 ++++++-------
>   drivers/gpu/drm/i915/i915_gem_execbuffer.c | 15 +++------
>   drivers/gpu/drm/i915/intel_display.c       |  2 +-
>   drivers/gpu/drm/i915/intel_lrc.c           |  4 +--
>   drivers/gpu/drm/i915/intel_overlay.c       |  8 ++---
>   7 files changed, 39 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a93e5dd4fa9a..f374db8de673 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2320,7 +2320,6 @@ struct drm_i915_gem_request {
>   struct drm_i915_gem_request * __must_check
>   i915_gem_request_alloc(struct intel_engine_cs *engine,
>   		       struct intel_context *ctx);
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req);
>   void i915_gem_request_free(struct kref *req_ref);
>   int i915_gem_request_add_to_client(struct drm_i915_gem_request *req,
>   				   struct drm_file *file);
> @@ -2872,7 +2871,6 @@ int i915_gem_sw_finish_ioctl(struct drm_device *dev, void *data,
>   			     struct drm_file *file_priv);
>   void i915_gem_execbuffer_move_to_active(struct list_head *vmas,
>   					struct drm_i915_gem_request *req);
> -void i915_gem_execbuffer_retire_commands(struct i915_execbuffer_params *params);
>   int i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
>   				   struct drm_i915_gem_execbuffer2 *args,
>   				   struct list_head *vmas);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 1c3ff56594d6..42227495803f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2753,7 +2753,8 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
>   		 * fully prepared. Thus it can be cleaned up using the proper
>   		 * free code.
>   		 */
> -		i915_gem_request_cancel(req);
> +		intel_ring_reserved_space_cancel(req->ringbuf);
> +		i915_gem_request_unreference(req);
>   		return ret;
>   	}
>
> @@ -2790,13 +2791,6 @@ i915_gem_request_alloc(struct intel_engine_cs *engine,
>   	return err ? ERR_PTR(err) : req;
>   }
>
> -void i915_gem_request_cancel(struct drm_i915_gem_request *req)
> -{
> -	intel_ring_reserved_space_cancel(req->ringbuf);
> -
> -	i915_gem_request_unreference(req);
> -}
> -
>   struct drm_i915_gem_request *
>   i915_gem_find_active_request(struct intel_engine_cs *engine)
>   {
> @@ -3410,12 +3404,9 @@ int i915_gpu_idle(struct drm_device *dev)
>   				return PTR_ERR(req);
>
>   			ret = i915_switch_context(req);
> -			if (ret) {
> -				i915_gem_request_cancel(req);
> -				return ret;
> -			}
> -
>   			i915_add_request_no_flush(req);
> +			if (ret)
> +				return ret;

Looks like with this it could execute the context switch on the GPU but 
not update the engine->last_context in do_switch().

Regards,

Tvrtko

  reply	other threads:[~2016-04-11 13:50 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-09  9:27 GPU and machine hang fixes, take 2 Chris Wilson
2016-04-09  9:27 ` [PATCH 1/4] drm/i915: Prevent machine death on Ivybridge context switching Chris Wilson
2016-04-09  9:27   ` Chris Wilson
2016-04-09  9:27 ` [PATCH 2/4] drm/i915: Force ringbuffers to not be at offset 0 Chris Wilson
2016-04-09  9:39   ` Chris Wilson
2016-04-09  9:39     ` Chris Wilson
2016-04-09  9:27 ` [PATCH 3/4] drm/i915: Move the mb() following release-mmap into release-mmap Chris Wilson
2016-04-09  9:27 ` [PATCH 4/4] drm/i915: Late request cancellations are harmful Chris Wilson
2016-04-11 13:50   ` Tvrtko Ursulin [this message]
2016-04-11 13:50     ` Tvrtko Ursulin
2016-04-11 14:19     ` Chris Wilson
2016-04-11 14:19       ` Chris Wilson
2016-04-12  6:23       ` [PATCH] drm/i915: Reorganise legacy context switch to cope with late failure Chris Wilson
2016-04-09  9:56 ` ✗ Fi.CI.BAT: warning for series starting with [1/4] drm/i915: Prevent machine death on Ivybridge context switching Patchwork
2016-04-09 10:39   ` Chris Wilson
2016-04-12  7:54 ` ✗ Fi.CI.BAT: warning for series starting with [1/4] drm/i915: Prevent machine death on Ivybridge context switching (rev2) Patchwork
     [not found] <1460193615-12359-1-git-send-email-chris@chris-wilson.co.uk>
2016-04-09  9:20 ` [PATCH 4/4] drm/i915: Late request cancellations are harmful Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=570BAB99.5050402@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=daniel.vetter@ffwll.ch \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.