[Intel-xe] [PATCH v2 2/3] drm/xe: fix xe_device_mem_access_get() race

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Matthew Auld <matthew.auld@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Subject: [Intel-xe] [PATCH v2 2/3] drm/xe: fix xe_device_mem_access_get() race
Date: Fri,  5 May 2023 19:03:48 +0100	[thread overview]
Message-ID: <20230505180349.367572-2-matthew.auld@intel.com> (raw)
In-Reply-To: <20230505180349.367572-1-matthew.auld@intel.com>

It looks like there is at least one race here, given that the
pm_runtime_suspended() check looks to return false if we are in the
process of suspending the device (RPM_SUSPENDING vs RPM_SUSPENDED). We
later also do xe_pm_runtime_get_if_active(), but since the device is
suspending or has now suspended, this doesn't do anything either.
Following from this we can potentially return from
xe_device_mem_access_get() with the device suspended or about to be,
leading to broken behaviour.

Attempt to fix this by always grabbing the runtime ref when our internal
ref transitions from 0 -> 1, and then wrap the whole thing with a lock
to ensure callers are serialized.

v2:
  - ct->lock looks to be primed with fs_reclaim, so holding that and
    then allocating memory will cause lockdep to complain. Now that we
    unconditionally grab the mem_access.lock around mem_access_{get,put},
    we need to change the ordering wrt to grabbing the ct->lock, since
    some of the runtime_pm routines can allocate memory (or at least
    that's what lockdep seems to suggest). Hopefully not a big deal.
    It might be that there were already issues with this, just that the
    atomics where "hiding" the potential issues.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c              | 11 +++--------
 drivers/gpu/drm/xe/xe_device_types.h        |  5 ++++-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |  4 ++++
 drivers/gpu/drm/xe/xe_guc_ct.c              | 13 +++++++------
 drivers/gpu/drm/xe/xe_pm.c                  |  9 ++-------
 drivers/gpu/drm/xe/xe_pm.h                  |  2 +-
 6 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 01c497bcf9a5..0a18b41a0e1a 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -406,17 +406,12 @@ u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size)
 
 void xe_device_mem_access_get(struct xe_device *xe)
 {
-	bool resumed = xe_pm_runtime_resume_if_suspended(xe);
-
 	mutex_lock(&xe->mem_access.lock);
-	if (xe->mem_access.ref++ == 0)
-		xe->mem_access.hold_rpm = xe_pm_runtime_get_if_active(xe);
+	if (xe->mem_access.ref == 0)
+		xe->mem_access.hold_rpm = xe_pm_runtime_resume_and_get(xe);
+	xe->mem_access.ref++;
 	mutex_unlock(&xe->mem_access.lock);
 
-	/* The usage counter increased if device was immediately resumed */
-	if (resumed)
-		xe_pm_runtime_put(xe);
-
 	XE_WARN_ON(xe->mem_access.ref == S32_MAX);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 59462933f67a..9e37189d5745 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -256,7 +256,10 @@ struct xe_device {
 	 * triggering additional actions when they occur.
 	 */
 	struct {
-		/** @lock: protect the ref count */
+		/**
+		 * @lock: Serialize xe_device_mem_access users,
+		 * and protect the below internal state, like @ref.
+		 */
 		struct mutex lock;
 		/** @ref: ref count of memory accesses */
 		s32 ref;
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 604f189dbd70..e769af7e7cdb 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -5,6 +5,7 @@
 
 #include "xe_gt_tlb_invalidation.h"
 
+#include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_guc.h"
 #include "xe_guc_ct.h"
@@ -112,6 +113,8 @@ static int send_tlb_invalidation(struct xe_guc *guc,
 	 * in order which they currently are, if that changes the algorithm will
 	 * need to be updated.
 	 */
+
+	xe_device_mem_access_get(gt->xe);
 	mutex_lock(&guc->ct.lock);
 	seqno = gt->tlb_invalidation.seqno;
 	if (fence) {
@@ -140,6 +143,7 @@ static int send_tlb_invalidation(struct xe_guc *guc,
 	if (ret < 0 && fence)
 		invalidation_fence_signal(fence);
 	mutex_unlock(&guc->ct.lock);
+	xe_device_mem_access_put(gt->xe);
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 9055ff133a7c..579d7f341f13 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -494,26 +494,22 @@ static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
 		}
 	}
 
-	xe_device_mem_access_get(ct_to_xe(ct));
 retry:
 	ret = has_room(ct, len + GUC_CTB_HDR_LEN, g2h_len);
 	if (unlikely(ret))
-		goto put_wa;
+		goto out;
 
 	ret = h2g_write(ct, action, len, g2h_fence ? g2h_fence->seqno : 0,
 			!!g2h_fence);
 	if (unlikely(ret)) {
 		if (ret == -EAGAIN)
 			goto retry;
-		goto put_wa;
+		goto out;
 	}
 
 	g2h_reserve_space(ct, g2h_len, num_g2h);
 	xe_guc_notify(ct_to_guc(ct));
-put_wa:
-	xe_device_mem_access_put(ct_to_xe(ct));
 out:
-
 	return ret;
 }
 
@@ -535,6 +531,7 @@ static int guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action, u32 len,
 
 	XE_BUG_ON(g2h_len && g2h_fence);
 	lockdep_assert_held(&ct->lock);
+	xe_device_assert_mem_access(ct_to_xe(ct));
 
 try_again:
 	ret = __guc_ct_send_locked(ct, action, len, g2h_len, num_g2h,
@@ -602,10 +599,14 @@ static int guc_ct_send(struct xe_guc_ct *ct, const u32 *action, u32 len,
 
 	XE_BUG_ON(g2h_len && g2h_fence);
 
+	xe_device_mem_access_get(ct_to_xe(ct));
+
 	mutex_lock(&ct->lock);
 	ret = guc_ct_send_locked(ct, action, len, g2h_len, num_g2h, g2h_fence);
 	mutex_unlock(&ct->lock);
 
+	xe_device_mem_access_put(ct_to_xe(ct));
+
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
index b7b57f10ba25..b2ffa001e6f7 100644
--- a/drivers/gpu/drm/xe/xe_pm.c
+++ b/drivers/gpu/drm/xe/xe_pm.c
@@ -210,14 +210,9 @@ int xe_pm_runtime_put(struct xe_device *xe)
 	return pm_runtime_put_autosuspend(xe->drm.dev);
 }
 
-/* Return true if resume operation happened and usage count was increased */
-bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe)
+bool xe_pm_runtime_resume_and_get(struct xe_device *xe)
 {
-	/* In case we are suspended we need to immediately wake up */
-	if (pm_runtime_suspended(xe->drm.dev))
-		return !pm_runtime_resume_and_get(xe->drm.dev);
-
-	return false;
+	return !pm_runtime_resume_and_get(xe->drm.dev);
 }
 
 int xe_pm_runtime_get_if_active(struct xe_device *xe)
diff --git a/drivers/gpu/drm/xe/xe_pm.h b/drivers/gpu/drm/xe/xe_pm.h
index 6a885585f653..1b4c15b5e71a 100644
--- a/drivers/gpu/drm/xe/xe_pm.h
+++ b/drivers/gpu/drm/xe/xe_pm.h
@@ -19,7 +19,7 @@ int xe_pm_runtime_suspend(struct xe_device *xe);
 int xe_pm_runtime_resume(struct xe_device *xe);
 int xe_pm_runtime_get(struct xe_device *xe);
 int xe_pm_runtime_put(struct xe_device *xe);
-bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe);
+bool xe_pm_runtime_resume_and_get(struct xe_device *xe);
 int xe_pm_runtime_get_if_active(struct xe_device *xe);
 
 #endif
-- 
2.40.0

next prev parent reply	other threads:[~2023-05-05 18:04 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-05 18:03 [Intel-xe] [PATCH v2 1/3] Revert "drm/xe: Use atomic instead of mutex for xe_device_mem_access_ongoing" Matthew Auld
2023-05-05 18:03 ` Matthew Auld [this message]
2023-05-05 18:20   ` [Intel-xe] [PATCH v2 2/3] drm/xe: fix xe_device_mem_access_get() race Rodrigo Vivi
2023-05-05 19:04     ` Matthew Auld
2023-05-05 18:03 ` [Intel-xe] [PATCH v2 3/3] drm/xe: Use atomic for mem_access.ref Matthew Auld
2023-05-05 18:06 ` [Intel-xe] ✓ CI.Patch_applied: success for series starting with [v2,1/3] Revert "drm/xe: Use atomic instead of mutex for xe_device_mem_access_ongoing" Patchwork
2023-05-05 18:07 ` [Intel-xe] ✓ CI.KUnit: " Patchwork
2023-05-05 18:11 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-05-05 18:30 ` [Intel-xe] ○ CI.BAT: info " Patchwork

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:01c497bcf9a dfblob:0a18b41a0e1 dfblob:59462933f67
dfblob:9e37189d574 dfblob:604f189dbd7 dfblob:e769af7e7cd
dfblob:9055ff133a7 dfblob:579d7f341f1 dfblob:b7b57f10ba2
dfblob:b2ffa001e6f dfblob:6a885585f65 dfblob:1b4c15b5e71 )
 OR (
bs:"[Intel-xe] [PATCH v2 2/3] drm/xe: fix xe_device_mem_access_get() race" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230505180349.367572-2-matthew.auld@intel.com \
    --to=matthew.auld@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.