intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [PATCH 0/3] drm/i915/active: Fix other potential list corruption root causes
@ 2023-03-13 17:24 Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 1/3] drm/i915/active: Serialize preallocation of idle barriers Janusz Krzysztofik
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Janusz Krzysztofik @ 2023-03-13 17:24 UTC (permalink / raw)
  To: intel-gfx
  Cc: Andrzej Hajda, dri-devel, Rodrigo Vivi, Chris Wilson, Nirmoy Das

While performing root cause analyses of fence callback list corruptions,
a couple of other potential, though less likely, root causes have been
identified in addition to ignored results of deletion from barrier tasks
lists.  This series tries to fix those potential issues, also in longterm
stable releases starting from v5.10.  The third patch, while not fixing
any real bug, is believed to make the code more predictable and easier to
understand, and thus easier to debug should other barrier related issues
still exist.

Janusz Krzysztofik (3):
  drm/i915/active: Serialize preallocation of idle barriers
  drm/i915/active: Serialize use of barriers as fence trackers
  drm/i915/active: Simplify llist search-and-delete

 drivers/gpu/drm/i915/i915_active.c | 124 ++++++++++++++++++-----------
 1 file changed, 78 insertions(+), 46 deletions(-)

-- 
2.25.1



* [Intel-gfx] [PATCH 1/3] drm/i915/active: Serialize preallocation of idle barriers
  2023-03-13 17:24 [Intel-gfx] [PATCH 0/3] drm/i915/active: Fix other potential list corruption root causes Janusz Krzysztofik
@ 2023-03-13 17:24 ` Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 2/3] drm/i915/active: Serialize use of barriers as fence trackers Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 3/3] drm/i915/active: Simplify llist search-and-delete Janusz Krzysztofik
  2 siblings, 0 replies; 4+ messages in thread
From: Janusz Krzysztofik @ 2023-03-13 17:24 UTC (permalink / raw)
  To: intel-gfx
  Cc: Andrzej Hajda, dri-devel, Rodrigo Vivi, Chris Wilson, Nirmoy Das

When we collect barriers for preallocating them, we reuse either idle or
non-idle ones, whichever we find.  In case of non-idle barriers, we
depend on their successful deletion from their barrier tasks lists as an
indication of them not being claimed by another thread.  However, in case
of idle barriers, we neither perform any similar checks nor take any
preventive countermeasures against unexpected races with other threads.
We may then end up adding the same barrier to two independent preallocated
lists, and then adding it twice back to the same engine's barrier tasks
list, thus effectively creating a loop of llist nodes.  As a result,
searches through that barrier tasks llist may start spinning indefinitely.
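
The loop described above can be illustrated outside the driver.  The
following is only a minimal user-space sketch: struct node, push() and
has_loop() are hypothetical stand-ins, not i915 or llist code, and a
single push models what the real code does with llist_add_batch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct llist_node. */
struct node {
	struct node *next;
};

/* Push a node onto the front of a singly linked list. */
static void push(struct node **head, struct node *n)
{
	assert(n != NULL);
	n->next = *head;
	*head = n;
}

/* Floyd cycle detection: true if a walk from head would never end. */
static bool has_loop(const struct node *head)
{
	const struct node *slow = head, *fast = head;

	while (fast && fast->next) {
		slow = slow->next;
		fast = fast->next->next;
		if (slow == fast)
			return true;
	}
	return false;
}

/* Add the same node twice to one list and report whether it now loops. */
static bool double_add_creates_loop(void)
{
	struct node a = { NULL }, b = { NULL };
	struct node *head = NULL;

	push(&head, &a);	/* list: a */
	push(&head, &b);	/* list: b -> a */
	push(&head, &a);	/* a->next = b while b->next == a: a loop */

	return has_loop(head);
}
```

The second push of the same node overwrites its next pointer with a node
that already links back to it, which is why a plain walk of such a list
would not terminate.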

Occurrences of that issue have never been observed on CI nor reported by
users.  However, deep code analysis revealed a silent, most probably
unintended workaround that actively breaks those loops by rebuilding
barrier tasks llists in reverse order inside our local implementation of
llist node deletion.  A simple patch that replaces that reverse order
rebuild with just an update of the next pointer of the node preceding the
one to be deleted helps to reproduce the race, though still not easily.
As soon as we have the race fixed, we may want to consider such an update
to make the code clearer and more predictable.

To fix the issue, whenever an idle barrier is selected for preallocation,
mark it immediately as non-idle with our ERR_PTR(-EAGAIN) barrier mark, so
other threads are no longer able to claim it, neither as idle nor as
non-idle, since it is not a member of the respective barrier tasks list.
Serialize that claim operation against other potential concurrent updates
of the active fence pointer, and skip the node in favor of allocating a
new one if it turns out to have been claimed meanwhile by another
competing thread.  Once claimed, increase the active count of its
composite tracker host immediately, as long as we still know that it was
an idle barrier.
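
The claim step can be sketched in user space with C11 atomics.  This is an
illustration under assumptions, not the driver code: struct tracker,
BARRIER_MARK and claim_idle() are hypothetical names, and
atomic_compare_exchange_strong() stands in for the kernel's cmpxchg(),
which returns the old value rather than a success flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ERR_PTR(-EAGAIN) barrier mark. */
static int barrier_mark_storage;
#define BARRIER_MARK ((void *)&barrier_mark_storage)

/* Hypothetical tracker: a NULL fence pointer means "idle". */
struct tracker {
	_Atomic(void *) fence;
};

/*
 * Try to claim an idle tracker.  Only the caller that swaps NULL for the
 * mark wins; any other caller sees a non-NULL value and must skip the
 * node.
 */
static bool claim_idle(struct tracker *t)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&t->fence, &expected,
					      BARRIER_MARK);
}

/* Two back-to-back claim attempts on one idle tracker. */
static int count_successful_claims(void)
{
	struct tracker t;
	int wins = 0;

	atomic_init(&t.fence, NULL);
	wins += claim_idle(&t);		/* finds NULL, installs the mark */
	wins += claim_idle(&t);		/* finds the mark, gives up */
	assert(atomic_load(&t.fence) == BARRIER_MARK);

	return wins;
}
```

Because the test and the update happen in one atomic operation, at most
one of any number of competing callers can observe the NULL value and
install the mark.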

While at it, fortify the still marginally racy check that the
preallocated_barriers llist is still empty when we populate it with
collected proto-barriers (assuming we need that check).

Fixes: 9ff33bbcda25 ("drm/i915: Reduce locking around i915_active_acquire_preallocate_barrier()")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_active.c | 50 +++++++++++++++++-------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index a9fea115f2d26..b2f79f5c257a8 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -788,8 +788,13 @@ static struct active_node *reuse_idle_barrier(struct i915_active *ref, u64 idx)
 	 * node kept alive (as we reuse before parking). We prefer to reuse
 	 * completely idle barriers (less hassle in manipulating the llists),
 	 * but otherwise any will do.
+	 *
+	 * We reuse the request field to mark this as being our proto-node.
 	 */
-	if (ref->cache && is_idle_barrier(ref->cache, idx)) {
+	if (ref->cache && is_idle_barrier(ref->cache, idx) &&
+	    !cmpxchg(__active_fence_slot(&ref->cache->base), NULL,
+		     ERR_PTR(-EAGAIN))) {
+		__i915_active_acquire(ref);
 		p = &ref->cache->node;
 		goto match;
 	}
@@ -800,8 +805,12 @@ static struct active_node *reuse_idle_barrier(struct i915_active *ref, u64 idx)
 		struct active_node *node =
 			rb_entry(p, struct active_node, node);
 
-		if (is_idle_barrier(node, idx))
+		if (is_idle_barrier(node, idx) &&
+		    !cmpxchg(__active_fence_slot(&node->base), NULL,
+			     ERR_PTR(-EAGAIN))) {
+			__i915_active_acquire(ref);
 			goto match;
+		}
 
 		prev = p;
 		if (node->timeline < idx)
@@ -827,8 +836,12 @@ static struct active_node *reuse_idle_barrier(struct i915_active *ref, u64 idx)
 		if (node->timeline < idx)
 			continue;
 
-		if (is_idle_barrier(node, idx))
+		if (is_idle_barrier(node, idx) &&
+		    !cmpxchg(__active_fence_slot(&node->base), NULL,
+			     ERR_PTR(-EAGAIN))) {
+			__i915_active_acquire(ref);
 			goto match;
+		}
 
 		/*
 		 * The list of pending barriers is protected by the
@@ -889,29 +902,24 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
 			if (!node)
 				goto unwind;
 
-			RCU_INIT_POINTER(node->base.fence, NULL);
+			/* Mark this as being our unconnected proto-node */
+			RCU_INIT_POINTER(node->base.fence, ERR_PTR(-EAGAIN));
 			node->base.cb.func = node_retire;
 			node->timeline = idx;
 			node->ref = ref;
-		}
-
-		if (!i915_active_fence_isset(&node->base)) {
-			/*
-			 * Mark this as being *our* unconnected proto-node.
-			 *
-			 * Since this node is not in any list, and we have
-			 * decoupled it from the rbtree, we can reuse the
-			 * request to indicate this is an idle-barrier node
-			 * and then we can use the rb_node and list pointers
-			 * for our tracking of the pending barrier.
-			 */
-			RCU_INIT_POINTER(node->base.fence, ERR_PTR(-EAGAIN));
-			node->base.cb.node.prev = (void *)engine;
 			__i915_active_acquire(ref);
+		} else {
+			GEM_BUG_ON(rcu_access_pointer(node->base.fence) !=
+				   ERR_PTR(-EAGAIN));
 		}
-		GEM_BUG_ON(rcu_access_pointer(node->base.fence) != ERR_PTR(-EAGAIN));
 
-		GEM_BUG_ON(barrier_to_engine(node) != engine);
+		/*
+		 * Since this node is not in any list, we have decoupled it
+		 * from the rbtree, and we reuse the request to indicate
+		 * this is a barrier node, then we can use list pointers
+		 * for our tracking of the pending barrier.
+		 */
+		node->base.cb.node.prev = (void *)engine;
 		first = barrier_to_ll(node);
 		first->next = prev;
 		if (!last)
@@ -920,7 +928,7 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
 	}
 
 	GEM_BUG_ON(!llist_empty(&ref->preallocated_barriers));
-	llist_add_batch(first, last, &ref->preallocated_barriers);
+	GEM_BUG_ON(!llist_add_batch(first, last, &ref->preallocated_barriers));
 
 	return 0;
 
-- 
2.25.1



* [Intel-gfx] [PATCH 2/3] drm/i915/active: Serialize use of barriers as fence trackers
  2023-03-13 17:24 [Intel-gfx] [PATCH 0/3] drm/i915/active: Fix other potential list corruption root causes Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 1/3] drm/i915/active: Serialize preallocation of idle barriers Janusz Krzysztofik
@ 2023-03-13 17:24 ` Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 3/3] drm/i915/active: Simplify llist search-and-delete Janusz Krzysztofik
  2 siblings, 0 replies; 4+ messages in thread
From: Janusz Krzysztofik @ 2023-03-13 17:24 UTC (permalink / raw)
  To: intel-gfx
  Cc: Andrzej Hajda, dri-devel, Rodrigo Vivi, Chris Wilson, Nirmoy Das

When adding a request to a composite tracker, we try to use an existing
fence tracker already registered with that composite.  The tracker we
obtain can already track another fence, can be an idle barrier, or an
active barrier.

When we acquire an idle barrier, we don't claim it in any way until the
__i915_active_fence_set() call we make replaces its NULL fence pointer
with a pointer to our request's fence.  But another thread looking for an
idle barrier can race with us.  If that thread is collecting barriers for
preallocation, it may replace the NULL fence pointer with the
ERR_PTR(-EAGAIN) barrier mark, either before or after we manage to replace
it with our request fence.  It can also corrupt our callback list pointers
when reusing them as an engine pointer (prev) and a preallocated barriers
llist node link (next), or we can corrupt their data.

When we acquire a non-idle barrier in turn, we try to delete that barrier
from the list of barrier tasks it belongs to.  If that deletion succeeds
then we convert the barrier to an idle one by replacing its barrier mark
with NULL and decrementing the active count of its hosting composite
tracker.  But as soon as we do this, we expose that barrier to the idle
barrier race described above.

Claim an acquired idle barrier right away by marking it immediately with
the ERR_PTR(-EAGAIN) barrier mark.  Serialize that operation with other
threads trying to claim a barrier, and go back to pick up another tracker
if some other thread wins the race.

Furthermore, on successful deletion of a non-idle barrier from a barrier
tasks list, don't overwrite the barrier mark with NULL -- that's not
needed at the moment since the barrier, once deleted from its list, can no
longer be acquired by any other thread as long as all threads respect
deletion results.  Also, don't decrease active counter of the hosting
composite tracker, but skip the follow up step that increases it back.

For the above to work correctly, teach __i915_active_fence_set() function
to recognize and handle non-idle barriers correctly when requested.
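
As a rough user-space sketch of that handling, assuming a hypothetical
struct tracker and BARRIER_MARK in place of the driver's types and of
ERR_PTR(-EAGAIN), and leaving out the locking and callback list handling
of the real function: when the caller says the tracker held a barrier,
the exchanged-out mark is reported as "no previous fence".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ERR_PTR(-EAGAIN) barrier mark. */
static int barrier_mark_storage;
#define BARRIER_MARK ((void *)&barrier_mark_storage)

/* Hypothetical tracker holding either NULL, the mark, or a fence. */
struct tracker {
	_Atomic(void *) fence;
};

/*
 * Install a new fence and return the previous one.  If the caller had
 * claimed the tracker as a barrier, the old value must be the mark and
 * it is translated to NULL, so it is never treated as a real fence.
 */
static void *set_fence(struct tracker *t, void *fence, bool barrier)
{
	void *prev = atomic_exchange(&t->fence, fence);

	if (barrier) {
		assert(prev == BARRIER_MARK);
		prev = NULL;
	}
	assert(prev != BARRIER_MARK);

	return prev;
}
```

The second assertion mirrors the intent that a barrier mark must never
leak out to callers that did not ask for barrier handling.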

The issue has never been reproduced cleanly, only identified via code
analysis while working on fence callback list corruptions which turned
out to have a complex root cause, see commit e0e6b416b25e
("drm/i915/active: Fix misuse of non-idle barriers as fence trackers") for
details.  However, the issue is assumed to have become potentially
reproducible as soon as timeline mutex locks around calls to
i915_active_fence_set() were dropped by commit df9f85d8582e ("drm/i915:
Serialise i915_active_fence_set() with itself").

Fixes: df9f85d8582e ("drm/i915: Serialise i915_active_fence_set() with itself")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org # v5.6+
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_active.c | 65 ++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index b2f79f5c257a8..8eb10af7928f4 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -425,11 +425,17 @@ replace_barrier(struct i915_active *ref, struct i915_active_fence *active)
 	return __active_del_barrier(ref, node_from_active(active));
 }
 
+static inline bool is_idle_barrier(struct active_node *node, u64 idx);
+static struct dma_fence *
+____i915_active_fence_set(struct i915_active_fence *active,
+			  struct dma_fence *fence, bool barrier);
+
 int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 {
 	u64 idx = i915_request_timeline(rq)->fence_context;
 	struct dma_fence *fence = &rq->fence;
 	struct i915_active_fence *active;
+	bool replaced;
 	int err;
 
 	/* Prevent reaping in case we malloc/wait while building the tree */
@@ -444,13 +450,18 @@ int i915_active_add_request(struct i915_active *ref, struct i915_request *rq)
 			goto out;
 		}
 
-		if (replace_barrier(ref, active)) {
-			RCU_INIT_POINTER(active->fence, NULL);
-			atomic_dec(&ref->count);
-		}
-	} while (unlikely(is_barrier(active)));
+		replaced = replace_barrier(ref, active);
+		if (replaced)
+			break;
+
+		if (!cmpxchg(__active_fence_slot(active), NULL,
+			     ERR_PTR(-EAGAIN)))
+			break;
 
-	if (!__i915_active_fence_set(active, fence))
+	} while (IS_ERR_OR_NULL(rcu_access_pointer(active->fence)));
+
+	if (!____i915_active_fence_set(active, fence, is_barrier(active)) &&
+	    !replaced)
 		__i915_active_acquire(ref);
 
 out:
@@ -1021,21 +1032,9 @@ void i915_request_add_active_barriers(struct i915_request *rq)
 	spin_unlock_irqrestore(&rq->lock, flags);
 }
 
-/*
- * __i915_active_fence_set: Update the last active fence along its timeline
- * @active: the active tracker
- * @fence: the new fence (under construction)
- *
- * Records the new @fence as the last active fence along its timeline in
- * this active tracker, moving the tracking callbacks from the previous
- * fence onto this one. Returns the previous fence (if not already completed),
- * which the caller must ensure is executed before the new fence. To ensure
- * that the order of fences within the timeline of the i915_active_fence is
- * understood, it should be locked by the caller.
- */
-struct dma_fence *
-__i915_active_fence_set(struct i915_active_fence *active,
-			struct dma_fence *fence)
+static struct dma_fence *
+____i915_active_fence_set(struct i915_active_fence *active,
+			  struct dma_fence *fence, bool barrier)
 {
 	struct dma_fence *prev;
 	unsigned long flags;
@@ -1067,6 +1066,11 @@ __i915_active_fence_set(struct i915_active_fence *active,
 	 */
 	spin_lock_irqsave(fence->lock, flags);
 	prev = xchg(__active_fence_slot(active), fence);
+	if (barrier) {
+		GEM_BUG_ON(!IS_ERR(prev));
+		prev = NULL;
+	}
+	GEM_BUG_ON(IS_ERR(prev));
 	if (prev) {
 		GEM_BUG_ON(prev == fence);
 		spin_lock_nested(prev->lock, SINGLE_DEPTH_NESTING);
@@ -1079,6 +1083,25 @@ __i915_active_fence_set(struct i915_active_fence *active,
 	return prev;
 }
 
+/*
+ * __i915_active_fence_set: Update the last active fence along its timeline
+ * @active: the active tracker
+ * @fence: the new fence (under construction)
+ *
+ * Records the new @fence as the last active fence along its timeline in
+ * this active tracker, moving the tracking callbacks from the previous
+ * fence onto this one. Returns the previous fence (if not already completed),
+ * which the caller must ensure is executed before the new fence. To ensure
+ * that the order of fences within the timeline of the i915_active_fence is
+ * understood, it should be locked by the caller.
+ */
+struct dma_fence *
+__i915_active_fence_set(struct i915_active_fence *active,
+			struct dma_fence *fence)
+{
+	return ____i915_active_fence_set(active, fence, false);
+}
+
 int i915_active_fence_set(struct i915_active_fence *active,
 			  struct i915_request *rq)
 {
-- 
2.25.1



* [Intel-gfx] [PATCH 3/3] drm/i915/active: Simplify llist search-and-delete
  2023-03-13 17:24 [Intel-gfx] [PATCH 0/3] drm/i915/active: Fix other potential list corruption root causes Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 1/3] drm/i915/active: Serialize preallocation of idle barriers Janusz Krzysztofik
  2023-03-13 17:24 ` [Intel-gfx] [PATCH 2/3] drm/i915/active: Serialize use of barriers as fence trackers Janusz Krzysztofik
@ 2023-03-13 17:24 ` Janusz Krzysztofik
  2 siblings, 0 replies; 4+ messages in thread
From: Janusz Krzysztofik @ 2023-03-13 17:24 UTC (permalink / raw)
  To: intel-gfx
  Cc: Andrzej Hajda, dri-devel, Rodrigo Vivi, Chris Wilson, Nirmoy Das

Inside ____active_del_barrier(), while searching for a node to be deleted,
we currently rebuild the barrier_tasks llist content in reverse order.
Theoretically neutral, that method was observed to provide an undocumented
workaround for unexpected loops of llist nodes appearing now and again due
to races, silently breaking those llist node loops and thus protecting
llist_for_each_safe() from spinning indefinitely.

Having all races hopefully fixed, make that function's behavior more
predictable and easier to follow -- switch to an alternative, equally
simple but less invasive algorithm that only updates a link between list
nodes that precede and follow the deleted node.
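
The difference between the two algorithms can be shown on a plain singly
linked list.  This is a simplified sketch with hypothetical names (struct
node, del_by_rebuild(), del_by_unlink()); it leaves out llist_del_all(),
llist_add_batch() and the tail handling of the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct llist_node. */
struct node {
	struct node *next;
};

/* Old approach: relink every kept node, which reverses the list. */
static struct node *del_by_rebuild(struct node *list, struct node *victim,
				   bool *found)
{
	struct node *head = NULL, *pos, *next;

	assert(victim != NULL && found != NULL);
	for (pos = list; pos; pos = next) {
		next = pos->next;
		if (pos == victim) {
			victim = NULL;	/* deleted, stop matching */
			continue;
		}
		pos->next = head;
		head = pos;
	}
	*found = !victim;

	return head;
}

/* New approach: only bridge the link around the deleted node. */
static struct node *del_by_unlink(struct node *list, struct node *victim,
				  bool *found)
{
	struct node *head = NULL, *tail = NULL, *pos, *next;

	assert(victim != NULL && found != NULL);
	for (pos = list; pos; pos = next) {
		next = pos->next;
		if (pos == victim) {
			victim = NULL;	/* deleted, stop matching */
			if (tail)
				tail->next = next;
			continue;
		}
		if (!head)
			head = pos;
		tail = pos;
	}
	*found = !victim;

	return head;
}
```

Deleting b from a -> b -> c yields c -> a with the first function and
a -> c with the second, so the second one preserves the original order and
writes at most one next pointer.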

Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_active.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c
index 8eb10af7928f4..10f52eb4a4592 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -391,13 +391,14 @@ static bool ____active_del_barrier(struct i915_active *ref,
 	llist_for_each_safe(pos, next, llist_del_all(&engine->barrier_tasks)) {
 		if (node == barrier_from_ll(pos)) {
 			node = NULL;
+			if (tail)
+				tail->next = next;
 			continue;
 		}
 
-		pos->next = head;
-		head = pos;
-		if (!tail)
-			tail = pos;
+		if (!head)
+			head = pos;
+		tail = pos;
 	}
 	if (head)
 		llist_add_batch(head, tail, &engine->barrier_tasks);
-- 
2.25.1


