* [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset
@ 2021-09-14 4:24 Matthew Brost
2021-09-14 4:24 ` [Intel-gfx] [PATCH 1/4] drm/i915/guc: Move guc_ids under submission_state sub-structure Matthew Brost
` (6 more replies)
0 siblings, 7 replies; 14+ messages in thread
From: Matthew Brost @ 2021-09-14 4:24 UTC (permalink / raw)
To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio
Rather than allocating an error capture in a nowait context to break a lockdep
splat [1], do the error capture asynchronously from the G2H processing.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
[1] https://patchwork.freedesktop.org/patch/451415/?series=93704&rev=5
John Harrison (1):
drm/i915/guc: Refcount context during error capture
Matthew Brost (3):
drm/i915/guc: Move guc_ids under submission_state sub-structure
drm/i915/guc: Do error capture asynchronously
drm/i915/guc: Flush G2H work queue during reset
drivers/gpu/drm/i915/gt/intel_context.c | 2 +
drivers/gpu/drm/i915/gt/intel_context_types.h | 16 ++-
drivers/gpu/drm/i915/gt/uc/intel_guc.h | 36 +++--
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 133 ++++++++++++------
4 files changed, 128 insertions(+), 59 deletions(-)
--
2.32.0
^ permalink raw reply [flat|nested] 14+ messages in thread* [Intel-gfx] [PATCH 1/4] drm/i915/guc: Move guc_ids under submission_state sub-structure 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost @ 2021-09-14 4:24 ` Matthew Brost 2021-09-14 4:24 ` [Intel-gfx] [PATCH 2/4] drm/i915/guc: Do error capture asynchronously Matthew Brost ` (5 subsequent siblings) 6 siblings, 0 replies; 14+ messages in thread From: Matthew Brost @ 2021-09-14 4:24 UTC (permalink / raw) To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio Move guc_ids under submission_state sub-structure as a future patch will use contexts_lock (global GuC submission lock) to protect more data. Introducing the sub-structure makes ownership of the locking / fields clear. Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/i915/gt/intel_context_types.h | 9 ++-- drivers/gpu/drm/i915/gt/uc/intel_guc.h | 26 ++++++----- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 44 ++++++++++--------- 3 files changed, 43 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h index 930569a1a01f..af43b3c83339 100644 --- a/drivers/gpu/drm/i915/gt/intel_context_types.h +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h @@ -189,18 +189,19 @@ struct intel_context { struct { /** * @id: handle which is used to uniquely identify this context - * with the GuC, protected by guc->contexts_lock + * with the GuC, protected by guc->submission_state.lock */ u16 id; /** * @ref: the number of references to the guc_id, when * transitioning in and out of zero protected by - * guc->contexts_lock + * guc->submission_state.lock */ atomic_t ref; /** - * @link: in guc->guc_id_list when the guc_id has no refs but is - * still valid, protected by guc->contexts_lock + * @link: in guc->submission_state.guc_id_list when the guc_id + * has no refs but is still valid, protected by + * guc->submission_state.lock */ struct list_head link; } guc_id; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 5dd174babf7a..c0292a94f4c3 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -70,17 +70,21 @@ struct intel_guc { void (*disable)(struct intel_guc *guc); } interrupts; - /** - * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and - * ce->guc_id.ref when transitioning in and out of zero - */ - spinlock_t contexts_lock; - /** @guc_ids: used to allocate unique ce->guc_id.id values */ - struct ida guc_ids; - /** - * @guc_id_list: list of intel_context with valid guc_ids but no refs - */ - struct list_head guc_id_list; + struct { + /** + * @lock: protects everything in submission_state and ce->guc_id + */ + spinlock_t lock; + /** + * @guc_ids: used to allocate new guc_ids + */ + struct ida guc_ids; + /** + * @guc_id_list: list of intel_context with valid guc_ids but no + * refs + */ + struct list_head guc_id_list; + } submission_state; /** * @submission_supported: tracks whether we support GuC submission on diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index c7a41802b448..678da915eb9d 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -89,7 +89,7 @@ * sched_engine can be submitting at a time. 
Currently only one sched_engine is * used for all of GuC submission but that could change in the future. * - * guc->contexts_lock + * guc->submission_state.lock * Protects guc_id allocation for the given GuC, i.e. only one context can be * doing guc_id allocation operations at a time for each GuC in the system. * @@ -103,7 +103,7 @@ * * Lock ordering rules: * sched_engine->lock -> ce->guc_state.lock - * guc->contexts_lock -> ce->guc_state.lock + * guc->submission_state.lock -> ce->guc_state.lock * * Reset races: * When a full GT reset is triggered it is assumed that some G2H responses to @@ -1148,9 +1148,9 @@ int intel_guc_submission_init(struct intel_guc *guc) xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ); - spin_lock_init(&guc->contexts_lock); - INIT_LIST_HEAD(&guc->guc_id_list); - ida_init(&guc->guc_ids); + spin_lock_init(&guc->submission_state.lock); + INIT_LIST_HEAD(&guc->submission_state.guc_id_list); + ida_init(&guc->submission_state.guc_ids); return 0; } @@ -1215,7 +1215,7 @@ static void guc_submit_request(struct i915_request *rq) static int new_guc_id(struct intel_guc *guc) { - return ida_simple_get(&guc->guc_ids, 0, + return ida_simple_get(&guc->submission_state.guc_ids, 0, GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); } @@ -1223,7 +1223,8 @@ static int new_guc_id(struct intel_guc *guc) static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce) { if (!context_guc_id_invalid(ce)) { - ida_simple_remove(&guc->guc_ids, ce->guc_id.id); + ida_simple_remove(&guc->submission_state.guc_ids, + ce->guc_id.id); reset_lrc_desc(guc, ce->guc_id.id); set_context_guc_id_invalid(ce); } @@ -1235,9 +1236,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce) { unsigned long flags; - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); __release_guc_id(guc, ce); - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); } static int steal_guc_id(struct intel_guc *guc) @@ -1245,10 +1246,10 @@ static int steal_guc_id(struct intel_guc *guc) struct intel_context *ce; int guc_id; - lockdep_assert_held(&guc->contexts_lock); + lockdep_assert_held(&guc->submission_state.lock); - if (!list_empty(&guc->guc_id_list)) { - ce = list_first_entry(&guc->guc_id_list, + if (!list_empty(&guc->submission_state.guc_id_list)) { + ce = list_first_entry(&guc->submission_state.guc_id_list, struct intel_context, guc_id.link); @@ -1273,7 +1274,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out) { int ret; - lockdep_assert_held(&guc->contexts_lock); + lockdep_assert_held(&guc->submission_state.lock); ret = new_guc_id(guc); if (unlikely(ret < 0)) { @@ -1295,7 +1296,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce) GEM_BUG_ON(atomic_read(&ce->guc_id.ref)); try_again: - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); might_lock(&ce->guc_state.lock); @@ -1310,7 +1311,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce) atomic_inc(&ce->guc_id.ref); out_unlock: - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); /* * -EAGAIN indicates no guc_id are available, let's retire any @@ -1346,11 +1347,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce) if (unlikely(context_guc_id_invalid(ce))) return; - spin_lock_irqsave(&guc->contexts_lock, flags); 
+ spin_lock_irqsave(&guc->submission_state.lock, flags); if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) && !atomic_read(&ce->guc_id.ref)) - list_add_tail(&ce->guc_id.link, &guc->guc_id_list); - spin_unlock_irqrestore(&guc->contexts_lock, flags); + list_add_tail(&ce->guc_id.link, + &guc->submission_state.guc_id_list); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); } static int __guc_action_register_context(struct intel_guc *guc, @@ -1921,16 +1923,16 @@ static void guc_context_destroy(struct kref *kref) * returns indicating this context has been deregistered the guc_id is * returned to the pool of available guc_id. */ - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); if (context_guc_id_invalid(ce)) { - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); __guc_context_destroy(ce); return; } if (!list_empty(&ce->guc_id.link)) list_del_init(&ce->guc_id.link); - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); /* Seal race with Reset */ spin_lock_irqsave(&ce->guc_state.lock, flags); -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
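The locking convention patch 1 establishes is easiest to see in isolation: one spinlock lives inside the sub-structure and owns every field beside it, so the code and the kerneldoc agree by construction. A minimal sketch of that ownership pattern, using simplified stand-in types rather than the real i915 definitions:

#include <linux/idr.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Everything in here is protected by @lock; nothing outside it is. */
struct submission_state {
	spinlock_t lock;
	struct ida guc_ids;		/* allocator for guc_id values */
	struct list_head guc_id_list;	/* contexts with a valid but unreferenced guc_id */
};

static void submission_state_init(struct submission_state *s)
{
	spin_lock_init(&s->lock);
	ida_init(&s->guc_ids);
	INIT_LIST_HEAD(&s->guc_id_list);
}

Grouping the fields this way also keeps the lock ordering rule (submission_state.lock taken before ce->guc_state.lock) visible at the single place the lock is declared.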
* [Intel-gfx] [PATCH 2/4] drm/i915/guc: Do error capture asynchronously 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost 2021-09-14 4:24 ` [Intel-gfx] [PATCH 1/4] drm/i915/guc: Move guc_ids under submission_state sub-structure Matthew Brost @ 2021-09-14 4:24 ` Matthew Brost 2021-09-14 4:24 ` [Intel-gfx] [PATCH 3/4] drm/i915/guc: Flush G2H work queue during reset Matthew Brost ` (4 subsequent siblings) 6 siblings, 0 replies; 14+ messages in thread From: Matthew Brost @ 2021-09-14 4:24 UTC (permalink / raw) To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio An error capture allocates memory, memory allocations depend on resets, and resets need to flush the G2H handlers to seal several races. If the error capture is done from the G2H handler this creates a circular dependency. To work around this, do an error capture in a work queue asynchronously from the G2H handler. This should be fine as (eventually) all register state is put into a buffer by the GuC so it is safe to restart the context before the error capture is complete. Example of lockdep splat below: [ 154.625989] ====================================================== [ 154.632195] WARNING: possible circular locking dependency detected [ 154.638393] 5.14.0-rc5-guc+ #50 Tainted: G U [ 154.643991] ------------------------------------------------------ [ 154.650196] i915_selftest/1673 is trying to acquire lock: [ 154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0 [ 154.665826] but task is already holding lock: [ 154.671682] ffff8881079cbfb8 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915] [ 154.680659] which lock already depends on the new lock.
[ 154.688857] the existing dependency chain (in reverse order) is: [ 154.696365] -> #2 (&gt->reset.mutex){+.+.}-{3:3}: [ 154.702571] lock_acquire+0xd2/0x300 [ 154.706695] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915] [ 154.712959] intel_gt_init_reset+0x61/0x80 [i915] [ 154.718258] intel_gt_init_early+0xe6/0x120 [i915] [ 154.723648] i915_driver_probe+0x592/0xdc0 [i915] [ 154.728942] i915_pci_probe+0x43/0x1c0 [i915] [ 154.733891] pci_device_probe+0x9b/0x110 [ 154.738362] really_probe+0x1a6/0x3a0 [ 154.742568] __driver_probe_device+0xf9/0x170 [ 154.747468] driver_probe_device+0x19/0x90 [ 154.752114] __driver_attach+0x99/0x170 [ 154.756492] bus_for_each_dev+0x73/0xc0 [ 154.760870] bus_add_driver+0x14b/0x1f0 [ 154.765248] driver_register+0x67/0xb0 [ 154.769542] i915_init+0x18/0x8c [i915] [ 154.773964] do_one_initcall+0x53/0x2e0 [ 154.778343] do_init_module+0x56/0x210 [ 154.782639] load_module+0x25fc/0x29f0 [ 154.786934] __do_sys_finit_module+0xae/0x110 [ 154.791835] do_syscall_64+0x38/0xc0 [ 154.795958] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 154.801558] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 154.807241] lock_acquire+0xd2/0x300 [ 154.811361] fs_reclaim_acquire+0x9e/0xd0 [ 154.815914] kmem_cache_alloc_trace+0x30/0x790 [ 154.820899] i915_gpu_coredump_alloc+0x53/0x1a0 [i915] [ 154.826649] i915_gpu_coredump+0x39/0x560 [i915] [ 154.831866] i915_capture_error_state+0xa/0x70 [i915] [ 154.837513] intel_guc_context_reset_process_msg+0x174/0x1f0 [i915] [ 154.844383] ct_incoming_request_worker_func+0x130/0x1b0 [i915] [ 154.850898] process_one_work+0x264/0x590 [ 154.855451] worker_thread+0x4b/0x3a0 [ 154.859655] kthread+0x147/0x170 [ 154.863428] ret_from_fork+0x1f/0x30 [ 154.867548] -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}: [ 154.875747] check_prev_add+0x90/0xc30 [ 154.880042] __lock_acquire+0x1643/0x2110 [ 154.884595] lock_acquire+0xd2/0x300 [ 154.888715] __flush_work+0x373/0x4d0 [ 154.892920] intel_guc_submission_reset_prepare+0xf3/0x340 [i915] [ 154.899606] intel_uc_reset_prepare+0x40/0x50 [i915] [ 154.905166] reset_prepare+0x55/0x60 [i915] [ 154.909946] intel_gt_reset+0x11c/0x300 [i915] [ 154.914984] do_device_reset+0x13/0x20 [i915] [ 154.919936] check_whitelist_across_reset+0x166/0x250 [i915] [ 154.926212] live_reset_whitelist.cold+0x6a/0x7a [i915] [ 154.932037] __i915_subtests.cold+0x20/0x74 [i915] [ 154.937428] __run_selftests.cold+0x96/0xee [i915] [ 154.942816] i915_live_selftests+0x2c/0x60 [i915] [ 154.948125] i915_pci_probe+0x93/0x1c0 [i915] [ 154.953076] pci_device_probe+0x9b/0x110 [ 154.957545] really_probe+0x1a6/0x3a0 [ 154.961749] __driver_probe_device+0xf9/0x170 [ 154.966653] driver_probe_device+0x19/0x90 [ 154.971290] __driver_attach+0x99/0x170 [ 154.975671] bus_for_each_dev+0x73/0xc0 [ 154.980053] bus_add_driver+0x14b/0x1f0 [ 154.984431] driver_register+0x67/0xb0 [ 154.988725] i915_init+0x18/0x8c [i915] [ 154.993149] do_one_initcall+0x53/0x2e0 [ 154.997527] do_init_module+0x56/0x210 [ 155.001822] load_module+0x25fc/0x29f0 [ 155.006118] __do_sys_finit_module+0xae/0x110 [ 155.011019] do_syscall_64+0x38/0xc0 [ 155.015139] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 155.020729] other info that might help us debug this: [ 155.028752] Chain exists of: (work_completion)(&ct->requests.worker) --> fs_reclaim --> &gt->reset.mutex [ 155.041294] Possible unsafe locking scenario: [ 155.047240] CPU0 CPU1 [ 155.051791] ---- ---- [ 155.056344] lock(&gt->reset.mutex); [ 155.060026] lock(fs_reclaim); [ 155.065706] lock(&gt->reset.mutex); [ 155.071912] lock((work_completion)(&ct->requests.worker));
[ 155.077595] *** DEADLOCK *** Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/i915/gt/intel_context.c | 2 + drivers/gpu/drm/i915/gt/intel_context_types.h | 7 +++ drivers/gpu/drm/i915/gt/uc/intel_guc.h | 10 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 47 +++++++++++++++++-- 4 files changed, 62 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c index ff637147b1a9..72ad60e9d908 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine) ce->guc_id.id = GUC_INVALID_LRC_ID; INIT_LIST_HEAD(&ce->guc_id.link); + INIT_LIST_HEAD(&ce->guc_capture_link); + /* * Initialize fence to be complete as this is expected to be complete * unless there is a pending schedule disable outstanding. diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h index af43b3c83339..925a06162e8e 100644 --- a/drivers/gpu/drm/i915/gt/intel_context_types.h +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h @@ -206,6 +206,13 @@ struct intel_context { struct list_head link; } guc_id; + /** + * @guc_capture_link: in guc->submission_state.capture_list when an + * error capture is pending on this context, protected by + * guc->submission_state.lock + */ + struct list_head guc_capture_link; + #ifdef CONFIG_DRM_I915_SELFTEST /** * @drop_schedule_enable: Force drop of schedule enable G2H for selftest diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index c0292a94f4c3..842df190ee67 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -84,6 +84,16 @@ struct intel_guc { * refs */ struct list_head guc_id_list; + /** + * @capture_list: list of intel_context that need to capture + * error state + */ + struct list_head capture_list; + /** + * @capture_worker: worker to do error capture when the GuC + * signals a context has been reset + */ + struct work_struct capture_worker; } submission_state; /** diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 678da915eb9d..ba6838a35a69 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -91,7 +91,8 @@ * * guc->submission_state.lock * Protects guc_id allocation for the given GuC, i.e. only one context can be - * doing guc_id allocation operations at a time for each GuC in the system. + * doing guc_id allocation operations at a time for each GuC in the system. Also + * protects everything else under the guc->submission_state sub-structure. * * ce->guc_state.lock * Protects everything under ce->guc_state. Ensures that a context is in the @@ -1126,6 +1127,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) intel_gt_unpark_heartbeats(guc_to_gt(guc)); } +static void capture_worker_func(struct work_struct *w); + /* * Set up the memory resources to be shared with the GuC (via the GGTT) * at firmware loading time. 
@@ -1151,6 +1154,8 @@ int intel_guc_submission_init(struct intel_guc *guc) spin_lock_init(&guc->submission_state.lock); INIT_LIST_HEAD(&guc->submission_state.guc_id_list); ida_init(&guc->submission_state.guc_ids); + INIT_LIST_HEAD(&guc->submission_state.capture_list); + INIT_WORK(&guc->submission_state.capture_worker, capture_worker_func); return 0; } @@ -2879,17 +2884,51 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc, return 0; } -static void capture_error_state(struct intel_guc *guc, - struct intel_context *ce) +static void capture_worker_func(struct work_struct *w) { + struct intel_guc *guc = container_of(w, struct intel_guc, + submission_state.capture_worker); struct intel_gt *gt = guc_to_gt(guc); struct drm_i915_private *i915 = gt->i915; + struct intel_context *ce = + list_first_entry(&guc->submission_state.capture_list, + struct intel_context, guc_capture_link); struct intel_engine_cs *engine = __context_to_physical_engine(ce); + unsigned long flags; intel_wakeref_t wakeref; + spin_lock_irqsave(&guc->submission_state.lock, flags); + list_del_init(&ce->guc_capture_link); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); + intel_engine_set_hung_context(engine, ce); with_intel_runtime_pm(&i915->runtime_pm, wakeref) - i915_capture_error_state(gt, engine->mask); + i915_capture_error_state(gt, ce->engine->mask); +} + +static void capture_error_state(struct intel_guc *guc, + struct intel_context *ce) +{ + struct intel_gt *gt = guc_to_gt(guc); + struct drm_i915_private *i915 = gt->i915; + struct intel_engine_cs *engine = __context_to_physical_engine(ce); + unsigned long flags; + + spin_lock_irqsave(&guc->submission_state.lock, flags); + list_add_tail(&ce->guc_capture_link, + &guc->submission_state.capture_list); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); + + /* + * We do the error capture async as an error capture can allocate + * memory, the reset path must flush the G2H handler in order to seal + * several races, and the memory allocations depend on the reset path + * (circular dependency if error capture done here in the G2H handler). + * Doing the error capture async should be ok, as (eventually) all + * register state is captured in a buffer by the GuC (i.e. it is ok to + * restart the context before the error capture is complete). + */ + queue_work(system_unbound_wq, &guc->submission_state.capture_worker); atomic_inc(&i915->gpu_error.reset_engine_count[engine->uabi_class]); } -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
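The core of patch 2 is the classic defer-to-worker shape: the G2H handler, which must not sleep or allocate, only queues; the worker does the slow capture. A hedged sketch of that shape with invented, simplified types (it drains the whole list, since queue_work() on an already-queued item is a no-op and several resets can arrive before the worker runs):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct capture_state {
	spinlock_t lock;
	struct list_head capture_list;	/* pending entries, protected by @lock */
	struct work_struct capture_worker;
};

struct pending_ctx {
	struct list_head link;		/* in capture_state.capture_list */
};

static void capture_worker_func(struct work_struct *w)
{
	struct capture_state *s = container_of(w, struct capture_state,
					       capture_worker);
	struct pending_ctx *ctx;
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);
	while (!list_empty(&s->capture_list)) {
		ctx = list_first_entry(&s->capture_list,
				       struct pending_ctx, link);
		list_del_init(&ctx->link);
		spin_unlock_irqrestore(&s->lock, flags);

		/*
		 * Do the slow, allocating capture here, outside the lock. A
		 * real version must also hold a reference on the entry across
		 * this window, which is exactly what patch 4 adds.
		 */

		spin_lock_irqsave(&s->lock, flags);
	}
	spin_unlock_irqrestore(&s->lock, flags);
}

/* Called from the (non-sleeping) G2H handler. */
static void queue_capture(struct capture_state *s, struct pending_ctx *ctx)
{
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);
	list_add_tail(&ctx->link, &s->capture_list);
	spin_unlock_irqrestore(&s->lock, flags);

	queue_work(system_unbound_wq, &s->capture_worker);
}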
* [Intel-gfx] [PATCH 3/4] drm/i915/guc: Flush G2H work queue during reset 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost 2021-09-14 4:24 ` [Intel-gfx] [PATCH 1/4] drm/i915/guc: Move guc_ids under submission_state sub-structure Matthew Brost 2021-09-14 4:24 ` [Intel-gfx] [PATCH 2/4] drm/i915/guc: Do error capture asynchronously Matthew Brost @ 2021-09-14 4:24 ` Matthew Brost 2021-09-14 4:24 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture Matthew Brost ` (3 subsequent siblings) 6 siblings, 0 replies; 14+ messages in thread From: Matthew Brost @ 2021-09-14 4:24 UTC (permalink / raw) To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio It isn't safe to scrub for missing G2H or continue with the reset until all G2H processing is complete. Flush the G2H work queue during reset to ensure it is done running. No need to call the IRQ handler directly either as the scrubbing code can deal with any missing G2H. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index ba6838a35a69..1986a57b52cc 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -800,8 +800,6 @@ static void guc_flush_submissions(struct intel_guc *guc) void intel_guc_submission_reset_prepare(struct intel_guc *guc) { - int i; - if (unlikely(!guc_submission_initialized(guc))) { /* Reset called during driver load? GuC not yet initialised! */ return; @@ -816,21 +814,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc) spin_unlock_irq(&guc_to_gt(guc)->irq_lock); guc_flush_submissions(guc); - - /* - * Handle any outstanding G2Hs before reset. Call IRQ handler directly - * each pass as interrupt have been disabled. We always scrub for - * outstanding G2H as it is possible for outstanding_submission_g2h to - * be incremented after the context state update. - */ - for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) { - intel_guc_to_host_event_handler(guc); -#define wait_for_reset(guc, wait_var) \ - intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20)) - do { - wait_for_reset(guc, &guc->outstanding_submission_g2h); - } while (!list_empty(&guc->ct.requests.incoming)); - } + flush_work(&guc->ct.requests.worker); scrub_guc_desc_for_outstanding_g2h(guc); } -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
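Patch 3 leans entirely on one workqueue guarantee: flush_work() only returns once any queued or running invocation of that work item has finished, so the scrub that follows cannot race with in-flight G2H processing. As a sketch, with invented stand-in names:

#include <linux/workqueue.h>

struct ct_channel {
	struct work_struct g2h_worker;	/* runs the G2H message handlers */
};

static void reset_prepare_flush(struct ct_channel *ct)
{
	/*
	 * Wait for the handler to finish rather than calling it directly:
	 * after this returns, no G2H message is still being processed and
	 * the scrub can safely reconcile any messages that never arrived.
	 */
	flush_work(&ct->g2h_worker);
}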
* [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost ` (2 preceding siblings ...) 2021-09-14 4:24 ` [Intel-gfx] [PATCH 3/4] drm/i915/guc: Flush G2H work queue during reset Matthew Brost @ 2021-09-14 4:24 ` Matthew Brost 2021-09-14 5:03 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Do error capture async, flush G2H processing on reset Patchwork ` (2 subsequent siblings) 6 siblings, 0 replies; 14+ messages in thread From: Matthew Brost @ 2021-09-14 4:24 UTC (permalink / raw) To: intel-gfx, dri-devel; +Cc: john.c.harrison, daniele.ceraolospurio From: John Harrison <John.C.Harrison@Intel.com> When i915 receives a context reset notification from GuC, it triggers an error capture before resetting any outstanding requests of that context. Unfortunately, the error capture is not a time bound operation. In certain situations it can take a long time, particularly when multiple large LMEM buffers must be read back and encoded. If this delay is longer than other timeouts (heartbeat, test recovery, etc.) then a full GT reset can be triggered in the middle. That can result in the context being reset by GuC actually being destroyed before the error capture completes and the GuC submission code resumes. Thus, the GuC side can start dereferencing stale pointers and Bad Things ensue. So add a refcount get of the context during the entire reset operation. That way, the context can't be destroyed part way through no matter what other resets or user interactions occur. v2: (Matthew Brost) - Update patch to work with async error capture Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 1986a57b52cc..02917fc4d4a8 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -2888,6 +2888,8 @@ static void capture_worker_func(struct work_struct *w) intel_engine_set_hung_context(engine, ce); with_intel_runtime_pm(&i915->runtime_pm, wakeref) i915_capture_error_state(gt, ce->engine->mask); + + intel_context_put(ce); } static void capture_error_state(struct intel_guc *guc, @@ -2924,7 +2926,7 @@ static void guc_context_replay(struct intel_context *ce) tasklet_hi_schedule(&sched_engine->tasklet); } -static void guc_handle_context_reset(struct intel_guc *guc, +static bool guc_handle_context_reset(struct intel_guc *guc, struct intel_context *ce) { trace_intel_context_reset(ce); @@ -2937,7 +2939,11 @@ static void guc_handle_context_reset(struct intel_guc *guc, !context_blocked(ce))) { capture_error_state(guc, ce); guc_context_replay(ce); + + return false; } + + return true; } int intel_guc_context_reset_process_msg(struct intel_guc *guc, @@ -2945,6 +2951,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, { struct intel_context *ce; int desc_idx; + unsigned long flags; if (unlikely(len != 1)) { drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len); @@ -2952,11 +2959,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, } desc_idx = msg[0]; + + /* + * The context lookup uses the xarray but lookups only require an RCU lock + * not the full spinlock.
So take the lock explicitly and keep it until the + * context has been reference count locked to ensure it can't be destroyed + * asynchronously until the reset is done. + */ + xa_lock_irqsave(&guc->context_lookup, flags); ce = g2h_context_lookup(guc, desc_idx); + if (ce) + intel_context_get(ce); + xa_unlock_irqrestore(&guc->context_lookup, flags); + if (unlikely(!ce)) return -EPROTO; - guc_handle_context_reset(guc, ce); + if (guc_handle_context_reset(guc, ce)) + intel_context_put(ce); return 0; } -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Do error capture async, flush G2H processing on reset 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost ` (3 preceding siblings ...) 2021-09-14 4:24 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture Matthew Brost @ 2021-09-14 5:03 ` Patchwork 2021-09-14 5:06 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork 2021-09-14 5:31 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork 6 siblings, 0 replies; 14+ messages in thread From: Patchwork @ 2021-09-14 5:03 UTC (permalink / raw) To: Matthew Brost; +Cc: intel-gfx == Series Details == Series: Do error capture async, flush G2H processing on reset URL : https://patchwork.freedesktop.org/series/94642/ State : warning == Summary == $ dim sparse --fast origin/drm-tip Sparse version: v0.6.2 Fast mode used, each commit won't be checked separately. - +drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:27:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:32:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:32:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:49:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:56:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_engine_stats.h:56:9: warning: trying to copy expression type 31 +drivers/gpu/drm/i915/gt/intel_reset.c:1392:5: warning: context imbalance in 'intel_gt_reset_trylock' - different lock contexts for basic block +drivers/gpu/drm/i915/i915_perf.c:1442:15: warning: memset with byte count of 16777216 +drivers/gpu/drm/i915/i915_perf.c:1496:15: warning: memset with byte count of 16777216 +./include/asm-generic/bitops/find.h:112:45: warning: shift count is negative (-262080) +./include/asm-generic/bitops/find.h:32:31: warning: shift count is negative (-262080) +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read64' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_read8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'fwtable_write8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen11_fwtable_read16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen11_fwtable_read32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 
'gen11_fwtable_read64' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen11_fwtable_read8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen11_fwtable_write16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen11_fwtable_write32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen11_fwtable_write8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen12_fwtable_write16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen12_fwtable_write32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen12_fwtable_write8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_read16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_read32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_read64' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_read8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen6_write8' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen8_write16' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen8_write32' - different lock contexts for basic block +./include/linux/spinlock.h:418:9: warning: context imbalance in 'gen8_write8' - different lock contexts for basic block ^ permalink raw reply [flat|nested] 14+ messages in thread
* [Intel-gfx] ✗ Fi.CI.DOCS: warning for Do error capture async, flush G2H processing on reset 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost ` (4 preceding siblings ...) 2021-09-14 5:03 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Do error capture async, flush G2H processing on reset Patchwork @ 2021-09-14 5:06 ` Patchwork 2021-09-14 5:31 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork 6 siblings, 0 replies; 14+ messages in thread From: Patchwork @ 2021-09-14 5:06 UTC (permalink / raw) To: Matthew Brost; +Cc: intel-gfx == Series Details == Series: Do error capture async, flush G2H processing on reset URL : https://patchwork.freedesktop.org/series/94642/ State : warning == Summary == $ make htmldocs 2>&1 > /dev/null | grep i915 ./drivers/gpu/drm/i915/gt/uc/intel_guc.h:155: warning: Function parameter or member 'submission_state' not described in 'intel_guc' ^ permalink raw reply [flat|nested] 14+ messages in thread
* [Intel-gfx] ✗ Fi.CI.BAT: failure for Do error capture async, flush G2H processing on reset 2021-09-14 4:24 [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset Matthew Brost ` (5 preceding siblings ...) 2021-09-14 5:06 ` [Intel-gfx] ✗ Fi.CI.DOCS: " Patchwork @ 2021-09-14 5:31 ` Patchwork 6 siblings, 0 replies; 14+ messages in thread From: Patchwork @ 2021-09-14 5:31 UTC (permalink / raw) To: Matthew Brost; +Cc: intel-gfx [-- Attachment #1: Type: text/plain, Size: 5567 bytes --] == Series Details == Series: Do error capture async, flush G2H processing on reset URL : https://patchwork.freedesktop.org/series/94642/ State : failure == Summary == CI Bug Log - changes from CI_DRM_10576 -> Patchwork_21033 ==================================================== Summary ------- **FAILURE** Serious unknown changes coming with Patchwork_21033 absolutely need to be verified manually. If you think the reported changes have nothing to do with the changes introduced in Patchwork_21033, please notify your bug team to allow them to document this new failure mode, which will reduce false positives in CI. External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/index.html Possible new issues ------------------- Here are the unknown changes that may have been introduced in Patchwork_21033: ### IGT changes ### #### Possible regressions #### * igt@core_hotunplug@unbind-rebind: - fi-skl-6700k2: [PASS][1] -> [INCOMPLETE][2] [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-skl-6700k2/igt@core_hotunplug@unbind-rebind.html [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-skl-6700k2/igt@core_hotunplug@unbind-rebind.html - fi-kbl-7567u: [PASS][3] -> [INCOMPLETE][4] [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-kbl-7567u/igt@core_hotunplug@unbind-rebind.html [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-kbl-7567u/igt@core_hotunplug@unbind-rebind.html * igt@i915_module_load@reload: - fi-kbl-guc: [PASS][5] -> [INCOMPLETE][6] [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-kbl-guc/igt@i915_module_load@reload.html [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-kbl-guc/igt@i915_module_load@reload.html - fi-cfl-8109u: [PASS][7] -> [INCOMPLETE][8] [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-cfl-8109u/igt@i915_module_load@reload.html [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-cfl-8109u/igt@i915_module_load@reload.html * igt@i915_selftest@live@gt_timelines: - fi-cfl-guc: [PASS][9] -> [FAIL][10] [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-cfl-guc/igt@i915_selftest@live@gt_timelines.html [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-cfl-guc/igt@i915_selftest@live@gt_timelines.html * igt@i915_selftest@live@mman: - fi-cfl-8700k: NOTRUN -> [INCOMPLETE][11] [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-cfl-8700k/igt@i915_selftest@live@mman.html Known issues ------------ Here are the changes found in Patchwork_21033 that come from known issues: ### IGT changes ### #### Issues hit #### * igt@runner@aborted: - fi-cfl-8700k: NOTRUN -> [FAIL][12] ([i915#2426] / [i915#3363]) [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-cfl-8700k/igt@runner@aborted.html #### Possible fixes #### * igt@core_hotunplug@unbind-rebind: - fi-cfl-8700k: [INCOMPLETE][13] -> [PASS][14] [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-cfl-8700k/igt@core_hotunplug@unbind-rebind.html [14]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-cfl-8700k/igt@core_hotunplug@unbind-rebind.html * igt@gem_exec_suspend@basic-s3: - fi-tgl-1115g4: [FAIL][15] ([i915#1888]) -> [PASS][16] [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-tgl-1115g4/igt@gem_exec_suspend@basic-s3.html [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-tgl-1115g4/igt@gem_exec_suspend@basic-s3.html #### Warnings #### * igt@i915_pm_rpm@basic-rte: - fi-kbl-guc: [SKIP][17] ([fdo#109271]) -> [FAIL][18] ([i915#579]) [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10576/fi-kbl-guc/igt@i915_pm_rpm@basic-rte.html [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/fi-kbl-guc/igt@i915_pm_rpm@basic-rte.html {name}: This element is suppressed. This means it is ignored when computing the status of the difference (SUCCESS, WARNING, or FAILURE). [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271 [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888 [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426 [i915#2932]: https://gitlab.freedesktop.org/drm/intel/issues/2932 [i915#3363]: https://gitlab.freedesktop.org/drm/intel/issues/3363 [i915#3690]: https://gitlab.freedesktop.org/drm/intel/issues/3690 [i915#579]: https://gitlab.freedesktop.org/drm/intel/issues/579 Participating hosts (43 -> 38) ------------------------------ Missing (5): bat-dg1-6 bat-dg1-5 fi-bsw-cyan bat-adlp-4 fi-bdw-samus Build changes ------------- * Linux: CI_DRM_10576 -> Patchwork_21033 CI-20190529: 20190529 CI_DRM_10576: 881240262b8bd204464f1f5f663348688484b867 @ git://anongit.freedesktop.org/gfx-ci/linux IGT_6209: 07d6594ed02f55b68d64fa6dd7f80cfbc1ce4ef8 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git Patchwork_21033: e709b911bb815391da81856059be315677654b51 @ git://anongit.freedesktop.org/gfx-ci/linux == Linux commits == e709b911bb81 drm/i915/guc: Refcount context during error capture 68def796bf1a drm/i915/guc: Flush G2H work queue during reset 75a9ad29c780 drm/i915/guc: Do error capture asynchronously 1c3146da224f drm/i915/guc: Move guc_ids under submission_state sub-structure == Logs == For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21033/index.html [-- Attachment #2: Type: text/html, Size: 6316 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* [Intel-gfx] [PATCH 0/4] Do error capture async, flush G2H processing on reset @ 2021-09-14 5:09 Matthew Brost 2021-09-14 5:09 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture Matthew Brost 0 siblings, 1 reply; 14+ messages in thread From: Matthew Brost @ 2021-09-14 5:09 UTC (permalink / raw) To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison Rather than allocating an error capture in a nowait context to break a lockdep splat [1], do the error capture asynchronously from the G2H processing. v2: Fix Docs warning Signed-off-by: Matthew Brost <matthew.brost@intel.com> [1] https://patchwork.freedesktop.org/patch/451415/?series=93704&rev=5 John Harrison (1): drm/i915/guc: Refcount context during error capture Matthew Brost (3): drm/i915/guc: Move guc_ids under submission_state sub-structure drm/i915/guc: Do error capture asynchronously drm/i915/guc: Flush G2H work queue during reset drivers/gpu/drm/i915/gt/intel_context.c | 2 + drivers/gpu/drm/i915/gt/intel_context_types.h | 16 ++- drivers/gpu/drm/i915/gt/uc/intel_guc.h | 36 +++-- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 133 ++++++++++++------ 4 files changed, 130 insertions(+), 57 deletions(-) -- 2.32.0 ^ permalink raw reply [flat|nested] 14+ messages in thread
* [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture 2021-09-14 5:09 [Intel-gfx] [PATCH 0/4] " Matthew Brost @ 2021-09-14 5:09 ` Matthew Brost 2021-09-14 14:29 ` Daniel Vetter 0 siblings, 1 reply; 14+ messages in thread From: Matthew Brost @ 2021-09-14 5:09 UTC (permalink / raw) To: intel-gfx, dri-devel; +Cc: daniele.ceraolospurio, john.c.harrison From: John Harrison <John.C.Harrison@Intel.com> When i915 receives a context reset notification from GuC, it triggers an error capture before resetting any outstanding requests of that context. Unfortunately, the error capture is not a time bound operation. In certain situations it can take a long time, particularly when multiple large LMEM buffers must be read back and encoded. If this delay is longer than other timeouts (heartbeat, test recovery, etc.) then a full GT reset can be triggered in the middle. That can result in the context being reset by GuC actually being destroyed before the error capture completes and the GuC submission code resumes. Thus, the GuC side can start dereferencing stale pointers and Bad Things ensue. So add a refcount get of the context during the entire reset operation. That way, the context can't be destroyed part way through no matter what other resets or user interactions occur. v2: (Matthew Brost) - Update patch to work with async error capture Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 1986a57b52cc..02917fc4d4a8 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -2888,6 +2888,8 @@ static void capture_worker_func(struct work_struct *w) intel_engine_set_hung_context(engine, ce); with_intel_runtime_pm(&i915->runtime_pm, wakeref) i915_capture_error_state(gt, ce->engine->mask); + + intel_context_put(ce); } static void capture_error_state(struct intel_guc *guc, @@ -2924,7 +2926,7 @@ static void guc_context_replay(struct intel_context *ce) tasklet_hi_schedule(&sched_engine->tasklet); } -static void guc_handle_context_reset(struct intel_guc *guc, +static bool guc_handle_context_reset(struct intel_guc *guc, struct intel_context *ce) { trace_intel_context_reset(ce); @@ -2937,7 +2939,11 @@ static void guc_handle_context_reset(struct intel_guc *guc, !context_blocked(ce))) { capture_error_state(guc, ce); guc_context_replay(ce); + + return false; } + + return true; } int intel_guc_context_reset_process_msg(struct intel_guc *guc, @@ -2945,6 +2951,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, { struct intel_context *ce; int desc_idx; + unsigned long flags; if (unlikely(len != 1)) { drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len); @@ -2952,11 +2959,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, } desc_idx = msg[0]; + + /* + * The context lookup uses the xarray but lookups only require an RCU lock + * not the full spinlock. So take the lock explicitly and keep it until the + * context has been reference count locked to ensure it can't be destroyed + * asynchronously until the reset is done.
+ */ + xa_lock_irqsave(&guc->context_lookup, flags); ce = g2h_context_lookup(guc, desc_idx); + if (ce) + intel_context_get(ce); + xa_unlock_irqrestore(&guc->context_lookup, flags); + if (unlikely(!ce)) return -EPROTO; - guc_handle_context_reset(guc, ce); + if (guc_handle_context_reset(guc, ce)) + intel_context_put(ce); return 0; } -- 2.32.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture 2021-09-14 5:09 ` [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture Matthew Brost @ 2021-09-14 14:29 ` Daniel Vetter 2021-09-14 23:23 ` John Harrison 2021-09-14 23:36 ` Matthew Brost 0 siblings, 2 replies; 14+ messages in thread From: Daniel Vetter @ 2021-09-14 14:29 UTC (permalink / raw) To: Matthew Brost Cc: intel-gfx, dri-devel, daniele.ceraolospurio, john.c.harrison On Mon, Sep 13, 2021 at 10:09:56PM -0700, Matthew Brost wrote: > From: John Harrison <John.C.Harrison@Intel.com> > > When i915 receives a context reset notification from GuC, it triggers > an error capture before resetting any outstanding requsts of that > context. Unfortunately, the error capture is not a time bound > operation. In certain situations it can take a long time, particularly > when multiple large LMEM buffers must be read back and eoncoded. If > this delay is longer than other timeouts (heartbeat, test recovery, > etc.) then a full GT reset can be triggered in the middle. > > That can result in the context being reset by GuC actually being > destroyed before the error capture completes and the GuC submission > code resumes. Thus, the GuC side can start dereferencing stale > pointers and Bad Things ensue. > > So add a refcount get of the context during the entire reset > operation. That way, the context can't be destroyed part way through > no matter what other resets or user interactions occur. > > v2: > (Matthew Brost) > - Update patch to work with async error capture > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com> > Signed-off-by: Matthew Brost <matthew.brost@intel.com> This sounds like a fundamental issue in our reset/scheduler design. If we have multiple timeout-things working in parallel, then there's going to be an endless whack-a-mole fireworks show. Reset is not a perf critical path (aside from media timeout, which guc handles internally anyway). Simplicity trumps everything else. The fix here is to guarantee that anything related to reset cannot happen in parallel with anything else related to reset/timeout. At least on a per-engine (and really on a per-reset domain) basis. The fix we've developed for drm/sched is that the driver can allocate a single-thread work queue, pass it to each drm/sched instance, and all timeout handling is run in there. For i915 it's more of a mess since we have a ton of random things that time out/reset potentially going on in parallel. But that's the design we should head towards. _not_ sprinkling random refcounts all over the place until most of the oops/splats disappear. That's cargo-culting, not engineering. 
-Daniel > --- > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++-- > 1 file changed, 22 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index 1986a57b52cc..02917fc4d4a8 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -2888,6 +2888,8 @@ static void capture_worker_func(struct work_struct *w) > intel_engine_set_hung_context(engine, ce); > with_intel_runtime_pm(&i915->runtime_pm, wakeref) > i915_capture_error_state(gt, ce->engine->mask); > + > + intel_context_put(ce); > } > > static void capture_error_state(struct intel_guc *guc, > @@ -2924,7 +2926,7 @@ static void guc_context_replay(struct intel_context *ce) > tasklet_hi_schedule(&sched_engine->tasklet); > } > > -static void guc_handle_context_reset(struct intel_guc *guc, > +static bool guc_handle_context_reset(struct intel_guc *guc, > struct intel_context *ce) > { > trace_intel_context_reset(ce); > @@ -2937,7 +2939,11 @@ static void guc_handle_context_reset(struct intel_guc *guc, > !context_blocked(ce))) { > capture_error_state(guc, ce); > guc_context_replay(ce); > + > + return false; > } > + > + return true; > } > > int intel_guc_context_reset_process_msg(struct intel_guc *guc, > @@ -2945,6 +2951,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, > { > struct intel_context *ce; > int desc_idx; > + unsigned long flags; > > if (unlikely(len != 1)) { > drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len); > @@ -2952,11 +2959,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, > } > > desc_idx = msg[0]; > + > + /* > + * The context lookup uses the xarray but lookups only require an RCU lock > + * not the full spinlock. So take the lock explicitly and keep it until the > + * context has been reference count locked to ensure it can't be destroyed > + * asynchronously until the reset is done. > + */ > + xa_lock_irqsave(&guc->context_lookup, flags); > ce = g2h_context_lookup(guc, desc_idx); > + if (ce) > + intel_context_get(ce); > + xa_unlock_irqrestore(&guc->context_lookup, flags); > + > if (unlikely(!ce)) > return -EPROTO; > > - guc_handle_context_reset(guc, ce); > + if (guc_handle_context_reset(guc, ce)) > + intel_context_put(ce); > > return 0; > } > -- > 2.32.0 > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 14+ messages in thread
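The design Daniel is pointing at, one serialized queue per reset domain, can be sketched in a few lines (names invented; drm/sched gets the same effect by letting the driver pass a single-threaded timeout workqueue to each scheduler instance):

#include <linux/errno.h>
#include <linux/workqueue.h>

/*
 * One ordered queue per reset domain: at most one work item executes at
 * a time, so heartbeat timeouts, per-engine resets and full-GT resets
 * can never run concurrently with each other.
 */
static struct workqueue_struct *reset_wq;

static int reset_domain_init(void)
{
	reset_wq = alloc_ordered_workqueue("reset-domain", 0);
	return reset_wq ? 0 : -ENOMEM;
}

/* Every timeout/reset path funnels through here instead of running inline. */
static void reset_domain_queue(struct work_struct *work)
{
	queue_work(reset_wq, work);
}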
* Re: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture 2021-09-14 14:29 ` Daniel Vetter @ 2021-09-14 23:23 ` John Harrison 2021-09-17 12:37 ` Daniel Vetter 2021-09-14 23:36 ` Matthew Brost 1 sibling, 1 reply; 14+ messages in thread From: John Harrison @ 2021-09-14 23:23 UTC (permalink / raw) To: Daniel Vetter, Matthew Brost; +Cc: intel-gfx, dri-devel, daniele.ceraolospurio On 9/14/2021 07:29, Daniel Vetter wrote: > On Mon, Sep 13, 2021 at 10:09:56PM -0700, Matthew Brost wrote: >> From: John Harrison <John.C.Harrison@Intel.com> >> >> When i915 receives a context reset notification from GuC, it triggers >> an error capture before resetting any outstanding requsts of that >> context. Unfortunately, the error capture is not a time bound >> operation. In certain situations it can take a long time, particularly >> when multiple large LMEM buffers must be read back and eoncoded. If >> this delay is longer than other timeouts (heartbeat, test recovery, >> etc.) then a full GT reset can be triggered in the middle. >> >> That can result in the context being reset by GuC actually being >> destroyed before the error capture completes and the GuC submission >> code resumes. Thus, the GuC side can start dereferencing stale >> pointers and Bad Things ensue. >> >> So add a refcount get of the context during the entire reset >> operation. That way, the context can't be destroyed part way through >> no matter what other resets or user interactions occur. >> >> v2: >> (Matthew Brost) >> - Update patch to work with async error capture >> >> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> >> Signed-off-by: Matthew Brost <matthew.brost@intel.com> > This sounds like a fundamental issue in our reset/scheduler design. If we > have multiple timeout-things working in parallel, then there's going to be > an endless whack-a-mole fireworks show. > > Reset is not a perf critical path (aside from media timeout, which guc > handles internally anyway). Simplicity trumps everything else. The fix > here is to guarantee that anything related to reset cannot happen in > parallel with anything else related to reset/timeout. At least on a > per-engine (and really on a per-reset domain) basis. > > The fix we've developed for drm/sched is that the driver can allocate a > single-thread work queue, pass it to each drm/sched instance, and all > timeout handling is run in there. > > For i915 it's more of a mess since we have a ton of random things that > time out/reset potentially going on in parallel. But that's the design we > should head towards. > > _not_ sprinkling random refcounts all over the place until most of the > oops/splats disappear. That's cargo-culting, not engineering. > -Daniel Not sure I follow this. The code pulls an intel_context object out of a structure and proceeds to dereference it in what can be a slow piece of code that is running in a worker thread and is therefore already asynchronous to other activity. Acquiring a reference count on that object while holding its pointer is standard practice, I thought. That's the whole point of reference counting! To be clear, this is not adding a brand new reference count object. It is merely taking the correct lock on an object while accessing that object. It uses the xarray's lock while accessing the xarray and then the ce's lock while accessing the ce and makes sure to overlap the two to prevent any race conditions. To me, that seems like a) correct object access practice and b) it should have been there in the first place. John. 
> >> --- >> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++-- >> 1 file changed, 22 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c >> index 1986a57b52cc..02917fc4d4a8 100644 >> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c >> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c >> @@ -2888,6 +2888,8 @@ static void capture_worker_func(struct work_struct *w) >> intel_engine_set_hung_context(engine, ce); >> with_intel_runtime_pm(&i915->runtime_pm, wakeref) >> i915_capture_error_state(gt, ce->engine->mask); >> + >> + intel_context_put(ce); >> } >> >> static void capture_error_state(struct intel_guc *guc, >> @@ -2924,7 +2926,7 @@ static void guc_context_replay(struct intel_context *ce) >> tasklet_hi_schedule(&sched_engine->tasklet); >> } >> >> -static void guc_handle_context_reset(struct intel_guc *guc, >> +static bool guc_handle_context_reset(struct intel_guc *guc, >> struct intel_context *ce) >> { >> trace_intel_context_reset(ce); >> @@ -2937,7 +2939,11 @@ static void guc_handle_context_reset(struct intel_guc *guc, >> !context_blocked(ce))) { >> capture_error_state(guc, ce); >> guc_context_replay(ce); >> + >> + return false; >> } >> + >> + return true; >> } >> >> int intel_guc_context_reset_process_msg(struct intel_guc *guc, >> @@ -2945,6 +2951,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, >> { >> struct intel_context *ce; >> int desc_idx; >> + unsigned long flags; >> >> if (unlikely(len != 1)) { >> drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len); >> @@ -2952,11 +2959,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, >> } >> >> desc_idx = msg[0]; >> + >> + /* >> + * The context lookup uses the xarray but lookups only require an RCU lock >> + * not the full spinlock. So take the lock explicitly and keep it until the >> + * context has been reference count locked to ensure it can't be destroyed >> + * asynchronously until the reset is done. >> + */ >> + xa_lock_irqsave(&guc->context_lookup, flags); >> ce = g2h_context_lookup(guc, desc_idx); >> + if (ce) >> + intel_context_get(ce); >> + xa_unlock_irqrestore(&guc->context_lookup, flags); >> + >> if (unlikely(!ce)) >> return -EPROTO; >> >> - guc_handle_context_reset(guc, ce); >> + if (guc_handle_context_reset(guc, ce)) >> + intel_context_put(ce); >> >> return 0; >> } >> -- >> 2.32.0 >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture 2021-09-14 23:23 ` John Harrison @ 2021-09-17 12:37 ` Daniel Vetter 0 siblings, 0 replies; 14+ messages in thread From: Daniel Vetter @ 2021-09-17 12:37 UTC (permalink / raw) To: John Harrison Cc: Daniel Vetter, Matthew Brost, intel-gfx, dri-devel, daniele.ceraolospurio On Tue, Sep 14, 2021 at 04:23:26PM -0700, John Harrison wrote: > On 9/14/2021 07:29, Daniel Vetter wrote: > > On Mon, Sep 13, 2021 at 10:09:56PM -0700, Matthew Brost wrote: > > > From: John Harrison <John.C.Harrison@Intel.com> > > > > > > When i915 receives a context reset notification from GuC, it triggers > > > an error capture before resetting any outstanding requsts of that > > > context. Unfortunately, the error capture is not a time bound > > > operation. In certain situations it can take a long time, particularly > > > when multiple large LMEM buffers must be read back and eoncoded. If > > > this delay is longer than other timeouts (heartbeat, test recovery, > > > etc.) then a full GT reset can be triggered in the middle. > > > > > > That can result in the context being reset by GuC actually being > > > destroyed before the error capture completes and the GuC submission > > > code resumes. Thus, the GuC side can start dereferencing stale > > > pointers and Bad Things ensue. > > > > > > So add a refcount get of the context during the entire reset > > > operation. That way, the context can't be destroyed part way through > > > no matter what other resets or user interactions occur. > > > > > > v2: > > > (Matthew Brost) > > > - Update patch to work with async error capture > > > > > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > This sounds like a fundamental issue in our reset/scheduler design. If we > > have multiple timeout-things working in parallel, then there's going to be > > an endless whack-a-mole fireworks show. > > > > Reset is not a perf critical path (aside from media timeout, which guc > > handles internally anyway). Simplicity trumps everything else. The fix > > here is to guarantee that anything related to reset cannot happen in > > parallel with anything else related to reset/timeout. At least on a > > per-engine (and really on a per-reset domain) basis. > > > > The fix we've developed for drm/sched is that the driver can allocate a > > single-thread work queue, pass it to each drm/sched instance, and all > > timeout handling is run in there. > > > > For i915 it's more of a mess since we have a ton of random things that > > time out/reset potentially going on in parallel. But that's the design we > > should head towards. > > > > _not_ sprinkling random refcounts all over the place until most of the > > oops/splats disappear. That's cargo-culting, not engineering. > > -Daniel > Not sure I follow this. > > The code pulls an intel_context object out of a structure and proceeds to > dereference it in what can be a slow piece of code that is running in a > worker thread and is therefore already asynchronous to other activity. > Acquiring a reference count on that object while holding its pointer is > standard practice, I thought. That's the whole point of reference counting! > > To be clear, this is not adding a brand new reference count object. It is > merely taking the correct lock on an object while accessing that object. 
>
> It uses the xarray's lock while accessing the xarray and then the ce's lock
> while accessing the ce, and makes sure to overlap the two to prevent any
> race conditions. To me, that seems like a) correct object access practice
> and b) it should have been there in the first place.

Sure we do reference count. And we reference count intel_context. But we
shouldn't just use a reference count because it's there and looks convenient.

This is reset code. If the intel_context can go away while we process the
reset affecting it, there's a gigantic bug going on. Doing a bit more
reference counting locally just makes the race small enough that it is no
longer easily hit. It doesn't fix a bug anywhere, or if it does, then the
locking looks really, really fragile.

The proper fix here is breaking this back down to the data structures,
figuring out what exactly the invariants are (e.g. it shouldn't be possible
to try processing an intel_context when it's no longer in need of
processing), and then figuring out the locking scheme you need.

For the intel_context refcount we currently have (if I got them all):
- the gem_context -> intel_context refcount
- some temporary references during execbuf
- i915_request->context, so that we don't tear down the context while it's
  still running stuff

The latter should be enough to also make sure the context doesn't disappear
while GuC code is processing it. If that's not enough, then we need to
analyze this, figure out why/where, and rework the rules around
locking/refcounting so that things are clean, simple, understandable and
actually get the job done.

This patch otoh looks a lot like "if we whack this refcount the oops goes
away, therefore it must be the right fix". And that's not how locking works,
at least not maintainable locking.

Cheers, Daniel

> John.
> > > > > > > --- > > > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++-- > > > 1 file changed, 22 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > > > index 1986a57b52cc..02917fc4d4a8 100644 > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > > > @@ -2888,6 +2888,8 @@ static void capture_worker_func(struct work_struct *w) > > > intel_engine_set_hung_context(engine, ce); > > > with_intel_runtime_pm(&i915->runtime_pm, wakeref) > > > i915_capture_error_state(gt, ce->engine->mask); > > > + > > > + intel_context_put(ce); > > > } > > > static void capture_error_state(struct intel_guc *guc, > > > @@ -2924,7 +2926,7 @@ static void guc_context_replay(struct intel_context *ce) > > > tasklet_hi_schedule(&sched_engine->tasklet); > > > } > > > -static void guc_handle_context_reset(struct intel_guc *guc, > > > +static bool guc_handle_context_reset(struct intel_guc *guc, > > > struct intel_context *ce) > > > { > > > trace_intel_context_reset(ce); > > > @@ -2937,7 +2939,11 @@ static void guc_handle_context_reset(struct intel_guc *guc, > > > !context_blocked(ce))) { > > > capture_error_state(guc, ce); > > > guc_context_replay(ce); > > > + > > > + return false; > > > } > > > + > > > + return true; > > > } > > > int intel_guc_context_reset_process_msg(struct intel_guc *guc, > > > @@ -2945,6 +2951,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, > > > { > > > struct intel_context *ce; > > > int desc_idx; > > > + unsigned long flags; > > > if (unlikely(len != 1)) { > > > drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len); > > > @@ -2952,11 +2959,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, > > > } > > > desc_idx = msg[0]; > > > + > > > + /* > > > + * The context lookup uses the xarray but lookups only require an RCU lock > > > + * not the full spinlock. So take the lock explicitly and keep it until the > > > + * context has been reference count locked to ensure it can't be destroyed > > > + * asynchronously until the reset is done. > > > + */ > > > + xa_lock_irqsave(&guc->context_lookup, flags); > > > ce = g2h_context_lookup(guc, desc_idx); > > > + if (ce) > > > + intel_context_get(ce); > > > + xa_unlock_irqrestore(&guc->context_lookup, flags); > > > + > > > if (unlikely(!ce)) > > > return -EPROTO; > > > - guc_handle_context_reset(guc, ce); > > > + if (guc_handle_context_reset(guc, ce)) > > > + intel_context_put(ce); > > > return 0; > > > } > > > -- > > > 2.32.0 > > > > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture
From: Matthew Brost @ 2021-09-14 23:36 UTC
To: Daniel Vetter
Cc: intel-gfx, dri-devel, daniele.ceraolospurio, john.c.harrison

On Tue, Sep 14, 2021 at 04:29:21PM +0200, Daniel Vetter wrote:
> On Mon, Sep 13, 2021 at 10:09:56PM -0700, Matthew Brost wrote:
> > [commit message snipped]
>
> This sounds like a fundamental issue in our reset/scheduler design. If we
> have multiple timeout-things working in parallel, then there's going to be
> an endless whack-a-mole fireworks show.
>

We have two different possible reset paths.

One is initiated from the GuC on a per-context basis. Each of these resets
is executed serially in the order in which they are received, and each
context's reset is protected by a lock.

Another is a full GT reset, typically triggered from the heartbeat or a
selftest. Only one of these can happen at a time, as it is protected by a
reset mutex. The full GT reset should flush all the in-flight per-context
resets before proceeding with the whole GT reset (after patch #3 in this
series). I believe this patch was written before patch #3, so when it was
written there was a race where a per-context reset and a GT reset could be
happening at the same time, but that is no longer the case. The commit
message should be reworded to reflect that. All that being said, I still
believe the patch is correct to reference count the context until after the
error capture completes. As John H said, this is just a standard ref count
here.

> Reset is not a perf critical path (aside from media timeout, which guc
> handles internally anyway). Simplicity trumps everything else. The fix
> here is to guarantee that anything related to reset cannot happen in
> parallel with anything else related to reset/timeout. At least on a
> per-engine (and really on a per-reset-domain) basis.
>
> The fix we've developed for drm/sched is that the driver can allocate a
> single-thread work queue, pass it to each drm/sched instance, and all
> timeout handling is run in there.
>
> For i915 it's more of a mess since we have a ton of random things that
> time out/reset potentially going on in parallel.
> But that's the design we should head towards.
>

See above, the parallel resets are fixed by patch #3 in this series.

> _not_ sprinkling random refcounts all over the place until most of the
> oops/splats disappear. That's cargo-culting, not engineering.

See above, I believe the ref count is still correct.

Matt

> -Daniel
>
> > [patch diff snipped]
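The flush Matthew describes in patch #3 amounts to draining pending G2H work before the full GT reset proceeds. Roughly, and only as a sketch: this assumes the G2H messages are processed from a work item, and both the helper name and the ct.requests.worker field are assumptions here rather than the actual patch contents:

	static void guc_flush_g2h_work(struct intel_guc *guc)
	{
		/*
		 * Wait for any in-flight G2H processing, including a context
		 * reset notification and the error capture it scheduled, to
		 * finish before the GT reset tears the contexts down.
		 */
		flush_work(&guc->ct.requests.worker);
	}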
* Re: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture
From: Daniel Vetter @ 2021-09-17 12:40 UTC
To: Matthew Brost
Cc: Daniel Vetter, intel-gfx, dri-devel, daniele.ceraolospurio, john.c.harrison

On Tue, Sep 14, 2021 at 04:36:54PM -0700, Matthew Brost wrote:
> On Tue, Sep 14, 2021 at 04:29:21PM +0200, Daniel Vetter wrote:
> > On Mon, Sep 13, 2021 at 10:09:56PM -0700, Matthew Brost wrote:
> > > [commit message snipped]
> >
> > This sounds like a fundamental issue in our reset/scheduler design. If we
> > have multiple timeout-things working in parallel, then there's going to be
> > an endless whack-a-mole fireworks show.
> >
>
> We have two different possible reset paths.
>
> One is initiated from the GuC on a per-context basis. Each of these resets
> is executed serially in the order in which they are received, and each
> context's reset is protected by a lock.
>
> Another is a full GT reset, typically triggered from the heartbeat or a
> selftest. Only one of these can happen at a time, as it is protected by a
> reset mutex. The full GT reset should flush all the in-flight per-context
> resets before proceeding with the whole GT reset (after patch #3 in this
> series). I believe this patch was written before patch #3, so when it was
> written there was a race where a per-context reset and a GT reset could be
> happening at the same time, but that is no longer the case. The commit
> message should be reworded to reflect that. All that being said, I still
> believe the patch is correct to reference count the context until after the
> error capture completes. As John H said, this is just a standard ref count
> here.

Yeah, the direction in drm/sched, which we should follow, is that resets
can't happen in parallel. At least not when touching the same structs. So
per-engine reset can proceed as-is, but if you go a level higher (guc
reset) then that needs to block out everything else. And yes, heartbeat and
timeout and all that should follow this pattern too.

If we can have multiple ongoing resets touching the same engine in
parallel, then shit will hit the fan. I'm also involved in a discussion
with amdgpu folks for similar reasons.
You can't fix this with some hacks locally. Wrt "it's just standard
refcounting", see my other reply.

> > Reset is not a perf critical path (aside from media timeout, which guc
> > handles internally anyway). Simplicity trumps everything else. The fix
> > here is to guarantee that anything related to reset cannot happen in
> > parallel with anything else related to reset/timeout. At least on a
> > per-engine (and really on a per-reset-domain) basis.
> >
> > The fix we've developed for drm/sched is that the driver can allocate a
> > single-thread work queue, pass it to each drm/sched instance, and all
> > timeout handling is run in there.
> >
> > For i915 it's more of a mess since we have a ton of random things that
> > time out/reset potentially going on in parallel. But that's the design we
> > should head towards.
> >
>
> See above, the parallel resets are fixed by patch #3 in this series.
>
> > _not_ sprinkling random refcounts all over the place until most of the
> > oops/splats disappear. That's cargo-culting, not engineering.
>
> See above, I believe the ref count is still correct.

So with patch #3 we don't need this patch anymore? If so, then we should
drop it. And document the exact rules we rely on that guarantee the context
doesn't disappear at the wrong time (in the kerneldoc for the involved
structs).
-Daniel

> Matt
>
> > -Daniel
> >
> > > [patch diff snipped]
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
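For the documentation Daniel asks for, a sketch of what such a kerneldoc lifetime rule could look like on the guc_id sub-structure; the wording is illustrative, not taken from any actual patch:

	struct {
		/**
		 * @ref: the number of references to the guc_id. A G2H
		 * context reset notification takes a reference (under the
		 * guc->context_lookup lock) before scheduling the
		 * asynchronous error capture, and drops it only once the
		 * capture has completed, so the intel_context cannot be
		 * destroyed while reset processing still holds a pointer
		 * to it.
		 */
		atomic_t ref;
	} guc_id;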