public inbox for intel-gfx@lists.freedesktop.org
* [PATCH resend v2 3/8] drm/i915: Cope with request list state change during error state capture
@ 2015-10-19 14:55 Tomas Elf
  2015-10-19 16:06 ` Chris Wilson
  2015-10-19 16:51 ` [PATCH v3 7/8] " Tomas Elf
  0 siblings, 2 replies; 4+ messages in thread
From: Tomas Elf @ 2015-10-19 14:55 UTC (permalink / raw)
  To: Intel-GFX

Since we're not synchronizing the ring request list during error state
capture, the request list state might change between the time the
corresponding error request list was allocated and dimensioned and the time
the ring request list is actually captured into the error state. If this
happens, exit early and be aware that the captured error state might not be
fully reliable.

* v2:
- Chris Wilson: Removed WARN_ON from size check since having the error state
  request list and the live driver request list diverge like this is a
  legitimate behaviour.

- Tomas Elf: Removed the update of the num_requests field since it made no
  sense. Just exit and move on.

* Resend:
- Responded to the wrong mail thread

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2f04e4f..b08a76b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1071,6 +1071,18 @@ static void i915_gem_record_rings(struct drm_device *dev,
 		list_for_each_entry(request, &ring->request_list, list) {
 			struct drm_i915_error_request *erq;
 
+			if (count >= error->ring[i].num_requests) {
+				/*
+				 * If the ring request list was changed in
+				 * between the point where the error request
+				 * list was created and dimensioned and this
+				 * point then just exit early to avoid crashes.
+				 */
+				DRM_ERROR("Request list changed size since allocation (%u->%u)\n",
+					error->ring[i].num_requests, count);
+				break;
+			}
+
 			erq = &error->ring[i].requests[count++];
 			erq->seqno = request->seqno;
 			erq->jiffies = request->emitted_jiffies;
-- 
1.9.1
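
For readers following the thread, the race the patch guards against can be
reproduced outside the driver. The following is a minimal userspace sketch
under stated assumptions, not the i915 code: `struct request`,
`struct error_request`, `count_requests()` and `capture_requests()` are
hypothetical stand-ins, and a plain singly linked list replaces the kernel's
`list_head`. It shows a buffer being dimensioned in one pass and filled in a
second pass, with the same `count >= capacity` check stopping the copy if the
list grew in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a live request on the ring's request list. */
struct request {
	unsigned int seqno;
	struct request *next;
};

/* Hypothetical stand-in for one captured entry in the error state. */
struct error_request {
	unsigned int seqno;
};

/* First pass: count the live entries so the capture buffer can be sized. */
static size_t count_requests(const struct request *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

/*
 * Second pass: copy at most 'capacity' entries into 'out'.
 * If the live list grew after it was counted, stop quietly instead of
 * writing past the end of the buffer. Returns the number of entries copied.
 */
static size_t capture_requests(const struct request *head,
			       struct error_request *out, size_t capacity)
{
	size_t count = 0;

	assert(out != NULL || capacity == 0);

	for (; head; head = head->next) {
		if (count >= capacity)
			break; /* list changed size since it was dimensioned */
		out[count++].seqno = head->seqno;
	}
	return count;
}
```

To exercise the sketch, build a two-entry list, size a buffer with
`count_requests()`, append a third entry, and then call `capture_requests()`:
only the first two entries are copied and the buffer is never overrun.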

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH resend v2 3/8] drm/i915: Cope with request list state change during error state capture
  2015-10-19 14:55 [PATCH resend v2 3/8] drm/i915: Cope with request list state change during error state capture Tomas Elf
@ 2015-10-19 16:06 ` Chris Wilson
  2015-10-19 16:51 ` [PATCH v3 7/8] " Tomas Elf
  1 sibling, 0 replies; 4+ messages in thread
From: Chris Wilson @ 2015-10-19 16:06 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Mon, Oct 19, 2015 at 03:55:48PM +0100, Tomas Elf wrote:
> Since we're not synchronizing the ring request list during error state
> capture, the request list state might change between the time the
> corresponding error request list was allocated and dimensioned and the time
> the ring request list is actually captured into the error state. If this
> happens, exit early and be aware that the captured error state might not be
> fully reliable.
> 
> * v2:
> - Chris Wilson: Removed WARN_ON from size check since having the error state
>   request list and the live driver request list diverge like this is a
>   legitimate behaviour.
> 
> - Tomas Elf: Removed the update of the num_requests field since it made no
>   sense. Just exit and move on.
> 
> * Resend:
> - Responded to the wrong mail thread
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2f04e4f..b08a76b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1071,6 +1071,18 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		list_for_each_entry(request, &ring->request_list, list) {
>  			struct drm_i915_error_request *erq;
>  
> +			if (count >= error->ring[i].num_requests) {
> +				/*
> +				 * If the ring request list was changed in
> +				 * between the point where the error request
> +				 * list was created and dimensioned and this
> +				 * point then just exit early to avoid crashes.
> +				 */
> +				DRM_ERROR("Request list changed size since allocation (%u->%u)\n",
> +					error->ring[i].num_requests, count);

The error message simply isn't that interesting. That requests were
added after the gpu hang occurred doesn't affect post-mortem debugging
of the hang, and if it were at all interesting, that information should
be stored in the error state itself.
-Chris
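
One way to read the suggestion above is that, if the divergence mattered, it
would be recorded in the captured state rather than logged. The following is
a hedged userspace sketch of that idea only. The `truncated` and `captured`
fields, `struct ring_snapshot`, `struct live_request` and `snapshot_fill()`
are all hypothetical names invented for illustration; none of them are part
of the i915 error state, and the v3 patch in this thread does not add any
such field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a live request on the ring's request list. */
struct live_request {
	unsigned int seqno;
	struct live_request *next;
};

/* Hypothetical per-ring snapshot; not the driver's real error state. */
struct ring_snapshot {
	unsigned int *seqnos;	/* buffer preallocated by the caller */
	size_t num_requests;	/* capacity chosen when the buffer was sized */
	size_t captured;	/* entries actually filled in */
	bool truncated;		/* assumption: set if the live list outgrew it */
};

/*
 * Fill the snapshot from the live list. Instead of printing a message when
 * the list has grown, note the fact in the snapshot itself and stop.
 */
static void snapshot_fill(struct ring_snapshot *snap,
			  const struct live_request *head)
{
	assert(snap != NULL);

	snap->captured = 0;
	snap->truncated = false;

	for (; head; head = head->next) {
		if (snap->captured >= snap->num_requests) {
			snap->truncated = true;
			break;
		}
		snap->seqnos[snap->captured++] = head->seqno;
	}
}
```

With a capacity of two and a three-entry list, `captured` ends up as 2 and
`truncated` is set. With a list that fits, `truncated` stays clear.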

-- 
Chris Wilson, Intel Open Source Technology Centre


* [PATCH v3 7/8] drm/i915: Cope with request list state change during error state capture
  2015-10-19 14:55 [PATCH resend v2 3/8] drm/i915: Cope with request list state change during error state capture Tomas Elf
  2015-10-19 16:06 ` Chris Wilson
@ 2015-10-19 16:51 ` Tomas Elf
  2015-10-22 10:53   ` Daniel Vetter
  1 sibling, 1 reply; 4+ messages in thread
From: Tomas Elf @ 2015-10-19 16:51 UTC (permalink / raw)
  To: Intel-GFX

Since we're not synchronizing the ring request list during error state
capture, the request list state might change between the time the
corresponding error request list was allocated and dimensioned and the time
the ring request list is actually captured into the error state. If this
happens, exit early and be aware that the captured error state might not be
fully reliable.

* v2:
- Chris Wilson: Removed WARN_ON from size check since having the error state
  request list and the live driver request list diverge like this is a
  legitimate behaviour.

- Tomas Elf: Removed the update of the num_requests field since it made no
  sense. Just exit and move on.

* v3:
- Chris Wilson: Removed error message at the point of early exit. The user is
  not interested in any state changes happening during the error state capture,
  only in the state that we're trying to capture at the point of the error.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 2f04e4f..f3dc67b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1071,6 +1071,25 @@ static void i915_gem_record_rings(struct drm_device *dev,
 		list_for_each_entry(request, &ring->request_list, list) {
 			struct drm_i915_error_request *erq;
 
+			if (count >= error->ring[i].num_requests) {
+				/*
+				 * If the ring request list was changed in
+				 * between the point where the error request
+				 * list was created and dimensioned and this
+				 * point then just exit early to avoid crashes.
+				 *
+				 * We don't need to communicate that the
+				 * request list changed state during error
+				 * state capture and that the error state is
+				 * slightly incorrect as a consequence since we
+				 * are typically only interested in the request
+				 * list state at the point of error state
+				 * capture, not in any changes happening during
+				 * the capture.
+				 */
+				break;
+			}
+
 			erq = &error->ring[i].requests[count++];
 			erq->seqno = request->seqno;
 			erq->jiffies = request->emitted_jiffies;
-- 
1.9.1



* Re: [PATCH v3 7/8] drm/i915: Cope with request list state change during error state capture
  2015-10-19 16:51 ` [PATCH v3 7/8] " Tomas Elf
@ 2015-10-22 10:53   ` Daniel Vetter
  0 siblings, 0 replies; 4+ messages in thread
From: Daniel Vetter @ 2015-10-22 10:53 UTC (permalink / raw)
  To: Tomas Elf; +Cc: Intel-GFX

On Mon, Oct 19, 2015 at 05:51:57PM +0100, Tomas Elf wrote:
> Since we're not synchronizing the ring request list during error state
> capture, the request list state might change between the time the
> corresponding error request list was allocated and dimensioned and the time
> the ring request list is actually captured into the error state. If this
> happens, exit early and be aware that the captured error state might not be
> fully reliable.
> 
> * v2:
> - Chris Wilson: Removed WARN_ON from size check since having the error state
>   request list and the live driver request list diverge like this is a
>   legitimate behaviour.
> 
> - Tomas Elf: Removed the update of the num_requests field since it made no
>   sense. Just exit and move on.
> 
> * v3:
> - Chris Wilson: Removed error message at the point of early exit. The user is
>   not interested in any state changes happening during the error state capture,
>   only in the state that we're trying to capture at the point of the error.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>

Queued for -next, thanks for the patch.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_gpu_error.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 2f04e4f..f3dc67b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1071,6 +1071,25 @@ static void i915_gem_record_rings(struct drm_device *dev,
>  		list_for_each_entry(request, &ring->request_list, list) {
>  			struct drm_i915_error_request *erq;
>  
> +			if (count >= error->ring[i].num_requests) {
> +				/*
> +				 * If the ring request list was changed in
> +				 * between the point where the error request
> +				 * list was created and dimensioned and this
> +				 * point then just exit early to avoid crashes.
> +				 *
> +				 * We don't need to communicate that the
> +				 * request list changed state during error
> +				 * state capture and that the error state is
> +				 * slightly incorrect as a consequence since we
> +				 * are typically only interested in the request
> +				 * list state at the point of error state
> +				 * capture, not in any changes happening during
> +				 * the capture.
> +				 */
> +				break;
> +			}
> +
>  			erq = &error->ring[i].requests[count++];
>  			erq->seqno = request->seqno;
>  			erq->jiffies = request->emitted_jiffies;
> -- 
> 1.9.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

