From: Matthew Brost <matthew.brost@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: rodrigo.vivi@intel.com
Subject: [PATCH v3 2/2] drm/xe: Don't suspend device upon wedge
Date: Mon, 15 Jul 2024 23:39:02 -0700
Message-ID: <20240716063902.1390130-2-matthew.brost@intel.com>
In-Reply-To: <20240716063902.1390130-1-matthew.brost@intel.com>
When a device is wedged, it should not be suspended, as the state needed
for debugging would be lost.

Suspending a wedged device also appears not to work: the stack trace
below shows up when trying to resume a wedged device:
[ 304.245044] INFO: task cat:12115 blocked for more than 151 seconds.
[ 304.251333] Tainted: G W 6.10.0-rc7-xe+ #3518
[ 304.257617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 304.265459] task:cat state:D stack:13384 pid:12115 tgid:12115 ppid:3986 flags:0x00000006
[ 304.265465] Call Trace:
[ 304.265467] <TASK>
[ 304.265469] __schedule+0x3c4/0xdf0
[ 304.265478] schedule+0x3c/0x140
[ 304.265481] rpm_resume+0x1cc/0x740
[ 304.265484] ? __pfx_autoremove_wake_function+0x10/0x10
[ 304.265489] __pm_runtime_resume+0x49/0x80
[ 304.265494] guc_info+0x6b/0xb0 [xe]
[ 304.265538] ? __pfx___drm_printfn_seq_file+0x10/0x10
[ 304.265541] ? __pfx___drm_puts_seq_file+0x10/0x10
[ 304.265545] seq_read_iter+0x111/0x4c0
[ 304.265551] seq_read+0xfc/0x140
[ 304.265556] full_proxy_read+0x58/0x80
[ 304.265560] vfs_read+0xa7/0x360
[ 304.265563] ? find_held_lock+0x2b/0x80
[ 304.265568] ksys_read+0x64/0xe0
[ 304.265571] do_syscall_64+0x68/0x140
[ 304.265575] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 304.265578] RIP: 0033:0x7f4254d14992
[ 304.265580] RSP: 002b:00007ffc558666f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 304.265583] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f4254d14992
[ 304.265584] RDX: 0000000000020000 RSI: 00007f4254ebb000 RDI: 0000000000000003
[ 304.265586] RBP: 00007f4254ebb000 R08: 00007f4254eba010 R09: 00007f4254eba010
[ 304.265587] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000
[ 304.265588] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[ 304.265593] </TASK>
[ 304.265594]
Showing all locks held in the system:
[ 304.265598] 1 lock held by khungtaskd/57:
[ 304.265599] #0: ffffffff8273b860 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x1c0
[ 304.265607] 3 locks held by kworker/6:1/90:
[ 304.265610] 1 lock held by in:imklog/547:
[ 304.265611] #0: ffff88810498cd88 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x76/0xc0
[ 304.265620] 1 lock held by dmesg/1310:
Fixes: 8ed9aaae39f3 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
drivers/gpu/drm/xe/xe_device.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 1e3d3a7e74d5..07aedbaf1821 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -893,6 +893,13 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address)
 	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);
 }
 
+static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
+{
+	struct xe_device *xe = arg;
+
+	xe_pm_runtime_put(xe);
+}
+
 /**
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
@@ -911,12 +918,21 @@ void xe_device_declare_wedged(struct xe_device *xe)
 {
 	struct xe_gt *gt;
 	u8 id;
+	int err;
 
 	if (xe->wedged.mode == 0) {
 		drm_dbg(&xe->drm, "Wedged mode is forcibly disabled\n");
 		return;
 	}
 
+	err = drmm_add_action_or_reset(&xe->drm, xe_device_wedged_fini, xe);
+	if (err) {
+		drm_err(&xe->drm, "Failed to register xe_device_wedged_fini clean-up. Although device is wedged.\n");
+		return;
+	}
+
+	xe_pm_runtime_get_noresume(xe);
+
 	if (!atomic_xchg(&xe->wedged.flag, 1)) {
 		xe->needs_flr_on_fini = true;
 		drm_err(&xe->drm,
--
2.34.1
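
A note on the get/put pairing in the second hunk, as a rough userspace
sketch rather than kernel code. drmm_add_action_or_reset() runs the action
immediately if registration fails, so with the ordering in this patch the
failure path calls xe_device_wedged_fini() (and therefore
xe_pm_runtime_put()) before xe_pm_runtime_get_noresume() has been reached.
The model below only counts references to show the difference between the
two orderings. Every model_* name is hypothetical and stands in for the
real helper; it is an illustration under those assumptions, not a proposed
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of the model_* names are kernel APIs. */
struct model_dev {
	int usage_count;                    /* stands in for the runtime PM usage counter */
	void (*action)(struct model_dev *); /* one managed release action slot */
	bool fail_registration;             /* simulate a failed action registration */
};

static void model_pm_get_noresume(struct model_dev *dev) { dev->usage_count++; }
static void model_pm_put(struct model_dev *dev) { dev->usage_count--; }

/* Mimics "add action or reset": on failure the action runs right away. */
static int model_add_action_or_reset(struct model_dev *dev,
				     void (*action)(struct model_dev *))
{
	if (dev->fail_registration) {
		action(dev);
		return -1;
	}
	dev->action = action;
	return 0;
}

/* Counterpart of xe_device_wedged_fini(): drops the wedge reference. */
static void model_wedged_fini(struct model_dev *dev) { model_pm_put(dev); }

/* Ordering as in this patch: register the action, then take the reference. */
static int model_declare_wedged_register_first(struct model_dev *dev)
{
	int err = model_add_action_or_reset(dev, model_wedged_fini);

	if (err)
		return err;
	model_pm_get_noresume(dev);
	return 0;
}

/* Alternative ordering: take the reference, then register the action. */
static int model_declare_wedged_get_first(struct model_dev *dev)
{
	model_pm_get_noresume(dev);
	return model_add_action_or_reset(dev, model_wedged_fini);
}

/* Stands in for the managed release at device teardown. */
static void model_release(struct model_dev *dev)
{
	if (dev->action) {
		dev->action(dev);
		dev->action = NULL;
	}
}

/* Walks the success path and both failure paths, checking the counter. */
static void model_demo(void)
{
	struct model_dev ok = { 0 }, a = { 0 }, b = { 0 };

	/* Success path: reference held while wedged, dropped at release. */
	assert(model_declare_wedged_register_first(&ok) == 0);
	assert(ok.usage_count == 1);
	model_release(&ok);
	assert(ok.usage_count == 0);

	/* Failed registration, register-first: put without a matching get. */
	a.fail_registration = true;
	assert(model_declare_wedged_register_first(&a) != 0);
	assert(a.usage_count == -1);

	/* Failed registration, get-first: counter stays balanced. */
	b.fail_registration = true;
	assert(model_declare_wedged_get_first(&b) != 0);
	assert(b.usage_count == 0);
}
```

In this model the counter only goes negative when registration fails, which
should be rare in practice; whether the reordering is worth a respin is a
question for review.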