Linux kernel -stable discussions
* [PATCH v3 0/2] Fix error propagation amongst requests
From: Andi Shyti @ 2023-02-28  2:11 UTC
  To: intel-gfx, dri-devel, stable
  Cc: Matthew Auld, Maciej Patelczyk, Chris Wilson, Andi Shyti,
	Andi Shyti

Hi,

This series of two patches fixes an issue introduced in
cf586021642d80 ("drm/i915/gt: Pipelined page migration") where,
as reported by Matt, in a chain of requests an error is reported
only if it happens in the last request.
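
To illustrate, here is a simplified sketch of the submission loop
as it was before this series (not the exact code from
intel_context_migrate_copy()), showing how an intermediate error
gets lost:

	do {
		rq = i915_request_create(ce);
		if (IS_ERR(rq))
			break;

		/* ... emit one chunk of the copy/clear ... */

		if (*out)
			i915_request_put(*out);	/* previous fence dropped */
		*out = i915_request_get(rq);	/* only the last rq is kept */
		i915_request_add(rq);
	} while (more_chunks);

	/* an error on any request but the last is never observed */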

However, Chris noticed that without ensuring exclusivity in the
locking we might end up in a deadlock. That's why patch 1
throttles for ringspace prior to taking the timeline mutex, to
make sure that no one waits for space while holding it.

Version 1 of this series was reviewed by Matt, and this version
adds Chris' exclusive locking.

Thanks, Chris, for this work.

Andi

Changelog
=========
v1 -> v2
 - Add patch 1 for ensuring exclusive locking of the timeline
 - Reword git commit of patch 2.

Andi Shyti (1):
  drm/i915/gt: Make sure that errors are propagated through request
    chains

Chris Wilson (1):
  drm/i915: Throttle for ringspace prior to taking the timeline mutex

 drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
 drivers/gpu/drm/i915/gt/intel_migrate.c | 39 +++++++++++++++++------
 drivers/gpu/drm/i915/i915_request.c     |  3 ++
 4 files changed, 75 insertions(+), 10 deletions(-)

-- 
2.39.1



* [PATCH v3 1/2] drm/i915: Throttle for ringspace prior to taking the timeline mutex
From: Andi Shyti @ 2023-02-28  2:11 UTC
  To: intel-gfx, dri-devel, stable
  Cc: Matthew Auld, Maciej Patelczyk, Chris Wilson, Andi Shyti,
	Andi Shyti

From: Chris Wilson <chris@chris-wilson.co.uk>

Before taking exclusive ownership of the ring for emitting the request,
wait for space in the ring to become available. This allows others to
take the timeline->mutex to make forward progress while userspace is
blocked.

In particular, this allows regular clients to issue requests on the
kernel context, potentially filling the ring, but allows the
higher-priority heartbeats and pulses to still be submitted without
being blocked by the less critical work.

Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
---
Hi,

I'm not sure I need to add the Fixes tag here, as this patch is
more of a preparation for the next one. Together, though, patches
1 and 2 make the fix with a proper locking mechanism.

Andi

 drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
 drivers/gpu/drm/i915/i915_request.c     |  3 ++
 3 files changed, 46 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 2aa63ec521b89..59cd612a23561 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -626,6 +626,47 @@ bool intel_context_revoke(struct intel_context *ce)
 	return ret;
 }
 
+int intel_context_throttle(const struct intel_context *ce)
+{
+	const struct intel_ring *ring = ce->ring;
+	const struct intel_timeline *tl = ce->timeline;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (READ_ONCE(ring->space) >= SZ_1K)
+		return 0;
+
+	rcu_read_lock();
+	list_for_each_entry_reverse(rq, &tl->requests, link) {
+		if (__i915_request_is_complete(rq))
+			break;
+
+		if (rq->ring != ring)
+			continue;
+
+		/* Wait until there will be enough space following that rq */
+		if (__intel_ring_space(rq->postfix,
+				       ring->emit,
+				       ring->size) < ring->size / 2) {
+			if (i915_request_get_rcu(rq)) {
+				rcu_read_unlock();
+
+				if (i915_request_wait(rq,
+						      I915_WAIT_INTERRUPTIBLE,
+						      MAX_SCHEDULE_TIMEOUT) < 0)
+					err = -EINTR;
+
+				rcu_read_lock();
+				i915_request_put(rq);
+			}
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return err;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 0a8d553da3f43..f919a66cebf5b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -226,6 +226,8 @@ static inline void intel_context_exit(struct intel_context *ce)
 		ce->ops->exit(ce);
 }
 
+int intel_context_throttle(const struct intel_context *ce);
+
 static inline struct intel_context *intel_context_get(struct intel_context *ce)
 {
 	kref_get(&ce->ref);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 7503dcb9043bb..a1741c4a8cffd 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1035,6 +1035,9 @@ i915_request_create(struct intel_context *ce)
 	struct i915_request *rq;
 	struct intel_timeline *tl;
 
+	if (intel_context_throttle(ce))
+		return ERR_PTR(-EINTR);
+
 	tl = intel_context_timeline_lock(ce);
 	if (IS_ERR(tl))
 		return ERR_CAST(tl);
-- 
2.39.1



* [PATCH v3 2/2] drm/i915/gt: Make sure that errors are propagated through request chains
From: Andi Shyti @ 2023-02-28  2:11 UTC
  To: intel-gfx, dri-devel, stable
  Cc: Matthew Auld, Maciej Patelczyk, Chris Wilson, Andi Shyti,
	Andi Shyti

Currently, when we perform operations such as clearing or copying
large blocks of memory, we generate multiple requests that are
executed in a chain.

However, if one of these requests fails, we may not realize it
unless it happens to be the last request in the chain. This is
because errors are not properly propagated.

To address this issue, we need to ensure that the chain of fence
notifications is always propagated so that we can reach the final
fence associated with the last request. By doing so, we will be
able to detect any memory operation failure and determine whether
the memory is still invalid.

On copy and clear migration, signal fences upon request
completion to ensure a reliable propagation of the operation's
outcome.

Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
Reported-by: Matthew Auld <matthew.auld@intel.com>
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
---
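As an annotated sketch (my reading of the mechanism; the actual
diff follows), the new pattern used by both loops is:

	out_rq:
		i915_sw_fence_await(&rq->submit);	/* hold this link */
		i915_request_get(rq);
		i915_request_add(rq);
		if (*out) {
			/* successor is queued: release the previous link */
			i915_sw_fence_complete(&(*out)->submit);
			i915_request_put(*out);
		}
		*out = rq;	/* *out always tracks the tail of the chain */

and, once the loop is done:

	/* release the hold on the last request of the chain */
	if (*out)
		i915_sw_fence_complete(&(*out)->submit);
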
 drivers/gpu/drm/i915/gt/intel_migrate.c | 39 ++++++++++++++++++-------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..6b497640d3a0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
 			dst_offset = 2 * CHUNK_SZ;
 	}
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
 		rq = i915_request_create(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
-			i915_request_put(*out);
-		*out = i915_request_get(rq);
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
 		i915_request_add(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
+			i915_request_put(*out);
+		}
+		*out = rq;
 
 		if (err)
 			break;
@@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
 		cond_resched();
 	} while (1);
 
-out_ce:
+	mutex_unlock(&ce->timeline->mutex);
+
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
 
@@ -1005,7 +1018,7 @@ intel_context_migrate_clear(struct intel_context *ce,
 		rq = i915_request_create(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -1056,17 +1069,23 @@ intel_context_migrate_clear(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
-			i915_request_put(*out);
-		*out = i915_request_get(rq);
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
 		i915_request_add(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
+			i915_request_put(*out);
+		}
+		*out = rq;
+
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
 
 		cond_resched();
 	} while (1);
 
-out_ce:
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
 
-- 
2.39.1



* Re: [Intel-gfx] [PATCH v3 0/2] Fix error propagation amongst requests
From: Gwan-gyeong Mun @ 2023-03-07  7:33 UTC
  To: Andi Shyti, intel-gfx, dri-devel, stable
  Cc: Chris Wilson, Matthew Auld, Maciej Patelczyk

Hi Andi,

After applying these two patches, a deadlock is detected in the
call stack below. Please review whether the change to
intel_context_migrate_copy() introduced the deadlock.


https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v1/bat-dg2-8/igt@i915_module_load@load.html#dmesg-warnings1037

<4> [33.070967] ============================================
<4> [33.070968] WARNING: possible recursive locking detected
<4> [33.070969] 6.2.0-Patchwork_114451v1-g8589fd9227ca+ #1 Not tainted
<4> [33.070970] --------------------------------------------
<4> [33.070971] i915_module_loa/948 is trying to acquire lock:
<4> [33.070972] ffff8881127f0478 (migrate){+.+.}-{3:3}, at: 
i915_request_create+0x1c6/0x230 [i915]
<4> [33.071215]
but task is already holding lock:
<4> [33.071235] ffff8881127f0478 (migrate){+.+.}-{3:3}, at: 
intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.071484]
other info that might help us debug this:
<4> [33.071504]  Possible unsafe locking scenario:
<4> [33.071522]        CPU0
<4> [33.071532]        ----
<4> [33.071541]   lock(migrate);
<4> [33.071554]   lock(migrate);
<4> [33.071567]
  *** DEADLOCK ***
<4> [33.071585]  May be due to missing lock nesting notation
<4> [33.071606] 3 locks held by i915_module_loa/948:
<4> [33.071622]  #0: ffffc90001eb7b70 
(reservation_ww_class_acquire){+.+.}-{0:0}, at: 
i915_gem_do_execbuffer+0xae2/0x21c0 [i915]
<4> [33.071893]  #1: ffff8881127b9c28 
(reservation_ww_class_mutex){+.+.}-{3:3}, at: 
__intel_context_do_pin_ww+0x7a/0xa30 [i915]
<4> [33.072133]  #2: ffff8881127f0478 (migrate){+.+.}-{3:3}, at: 
intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.072384]
stack backtrace:
<4> [33.072399] CPU: 7 PID: 948 Comm: i915_module_loa Not tainted 
6.2.0-Patchwork_114451v1-g8589fd9227ca+ #1
<4> [33.072428] Hardware name: Intel Corporation CoffeeLake Client 
Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWR1.R00.X220.B00.2103302221 
03/30/2021
<4> [33.072465] Call Trace:
<4> [33.072475]  <TASK>
<4> [33.072486]  dump_stack_lvl+0x5b/0x85
<4> [33.072503]  __lock_acquire.cold+0x158/0x33b
<4> [33.072524]  lock_acquire+0xd6/0x310
<4> [33.072541]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.072812]  __mutex_lock+0x95/0xf40
<4> [33.072829]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073093]  ? rcu_read_lock_sched_held+0x55/0x80
<4> [33.073112]  ? __mutex_lock+0x133/0xf40
<4> [33.073128]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073388]  ? intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.073619]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073876]  i915_request_create+0x1c6/0x230 [i915]
<4> [33.074135]  intel_context_migrate_copy+0x1d0/0xa80 [i915]
<4> [33.074360]  __i915_ttm_move+0x7a8/0x940 [i915]
<4> [33.074538]  ? _raw_spin_unlock_irqrestore+0x41/0x70
<4> [33.074552]  ? dma_resv_iter_next+0x91/0xb0
<4> [33.074564]  ? dma_resv_iter_first+0x42/0xb0
<4> [33.074576]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
<4> [33.074744]  i915_ttm_move+0x2ac/0x430 [i915]
<4> [33.074910]  ttm_bo_handle_move_mem+0xb5/0x140 [ttm]
<4> [33.074930]  ttm_bo_validate+0xe9/0x1a0 [ttm]
<4> [33.074947]  __i915_ttm_get_pages+0x4e/0x190 [i915]
<4> [33.075112]  i915_ttm_get_pages+0xf3/0x160 [i915]
<4> [33.075280]  ____i915_gem_object_get_pages+0x36/0xb0 [i915]
<4> [33.075446]  __i915_gem_object_get_pages+0x95/0xa0 [i915]
<4> [33.075608]  i915_vma_get_pages+0xfa/0x160 [i915]
<4> [33.075779]  i915_vma_pin_ww+0xdc/0xb50 [i915]
<4> [33.075953]  eb_validate_vmas+0x1c6/0xac0 [i915]
<4> [33.076114]  i915_gem_do_execbuffer+0xb2a/0x21c0 [i915]
<4> [33.076276]  ? __stack_depot_save+0x3f/0x4e0
<4> [33.076292]  ? 0xffffffff81000000
<4> [33.076301]  ? _raw_spin_unlock_irq+0x41/0x50
<4> [33.076312]  ? lockdep_hardirqs_on+0xc3/0x140
<4> [33.076325]  ? set_track_update+0x25/0x50
<4> [33.076338]  ? __lock_acquire+0x5f2/0x2130
<4> [33.076356]  i915_gem_execbuffer2_ioctl+0x123/0x2e0 [i915]
<4> [33.076519]  ? __pfx_i915_gem_execbuffer2_ioctl+0x10/0x10 [i915]
<4> [33.076679]  drm_ioctl_kernel+0xb4/0x150
<4> [33.076692]  drm_ioctl+0x21d/0x420
<4> [33.076703]  ? __pfx_i915_gem_execbuffer2_ioctl+0x10/0x10 [i915]
<4> [33.076864]  ? __vm_munmap+0xd3/0x170
<4> [33.076877]  __x64_sys_ioctl+0x76/0xb0
<4> [33.076889]  do_syscall_64+0x3c/0x90
<4> [33.076900]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
<4> [33.076913] RIP: 0033:0x7f304aa903ab
<4> [33.076923] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 
48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
<4> [33.076957] RSP: 002b:00007fffb1424cf8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
<4> [33.076975] RAX: ffffffffffffffda RBX: 00007fffb1424da0 RCX: 
00007f304aa903ab
<4> [33.076990] RDX: 00007fffb1424da0 RSI: 0000000040406469 RDI: 
0000000000000005
<4> [33.077004] RBP: 0000000040406469 R08: 0000000000000005 R09: 
0000000100003000
<4> [33.077019] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000010000
<4> [33.077034] R13: 0000000000000005 R14: 00000000ffffffff R15: 
00000000000056a0
<4> [33.077052]  </TASK>

Br,

G.G.


* Re: [Intel-gfx] [PATCH v3 0/2] Fix error propagation amongst requests
From: Andi Shyti @ 2023-03-07  9:45 UTC
  To: Gwan-gyeong Mun
  Cc: Andi Shyti, intel-gfx, dri-devel, stable, Chris Wilson,
	Matthew Auld, Maciej Patelczyk

Hi GG,

On Tue, Mar 07, 2023 at 09:33:12AM +0200, Gwan-gyeong Mun wrote:
> Hi Andi,
> 
> After applying these two patches, a deadlock is detected in the
> call stack below. Please review whether the change to
> intel_context_migrate_copy() introduced the deadlock.
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v1/bat-dg2-8/igt@i915_module_load@load.html#dmesg-warnings1037

Thanks for looking into this. Yes, there is a basic locking issue
here coming from migrate: migrate() takes the timeline lock and
then calls i915_request_create(), which tries to take it again,
so we inevitably fall into a deadlock.

The locking of the timeline is quite exotic: it is taken in
request_create() and released in request_add().
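
In other words, the faulty path looks roughly like this
(simplified sketch):

	intel_context_migrate_copy(ce, ...)
	{
		mutex_lock(&ce->timeline->mutex);
		do {
			/*
			 * i915_request_create() calls
			 * intel_context_timeline_lock(), which takes
			 * ce->timeline->mutex again -> AA deadlock.
			 */
			rq = i915_request_create(ce);
			...
		} while (...);
		mutex_unlock(&ce->timeline->mutex);
	}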

It's still in trybot, but this is supposed to be the next
version:

https://patchwork.freedesktop.org/series/114645/

It adds new variants, request_create_locked() and
request_add_locked(), which do not take the timeline lock
themselves.
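
With those, the loop would take roughly this shape (hypothetical
sketch based on the description above; see the series for the
real interface):

	mutex_lock(&ce->timeline->mutex);
	do {
		rq = request_create_locked(ce);	/* no re-locking */
		...
		request_add_locked(rq);		/* mutex stays held */
	} while (...);
	mutex_unlock(&ce->timeline->mutex);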

There are still some selftests that need to be fixed, though.

Andi
