From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Michal Wajdeczko" <michal.wajdeczko@intel.com>,
"Piotr Piórkowski" <piotr.piorkowski@intel.com>,
"Jonathan Cavitt" <jonathan.cavitt@intel.com>,
"Sasha Levin" <sashal@kernel.org>,
lucas.demarchi@intel.com, thomas.hellstrom@linux.intel.com,
rodrigo.vivi@intel.com, intel-xe@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/xe/pf: Don't resume device from restart worker
Date: Sat, 25 Oct 2025 11:58:50 -0400
Message-ID: <20251025160905.3857885-299-sashal@kernel.org>
In-Reply-To: <20251025160905.3857885-1-sashal@kernel.org>
From: Michal Wajdeczko <michal.wajdeczko@intel.com>
[ Upstream commit 9fd9f221440024b7451678898facfb34af054310 ]
The PF's restart worker shouldn't attempt to resume the device on
its own, since its goal is to finish PF and VFs reprovisioning on
the recently reset GuC. Take extra RPM reference while scheduling
a work and release it from the worker or when we cancel a work.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://lore.kernel.org/r/20250801142822.180530-4-michal.wajdeczko@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- Bug fixed: Prevents the PF SR-IOV restart worker from forcing a
  runtime PM resume, which can violate PM expectations, cause unwanted
  wakeups, and race with suspend/resume. The worker’s role is to finish
  PF/VF reprovisioning after a GuC reset, not to wake the device.
- Core change: Move the runtime PM ref from the worker body to the
  queueing point.
  - Before: The worker resumes the device via `xe_pm_runtime_get(xe)`
    and later calls `xe_pm_runtime_put(xe)` in `pf_restart()`
    (drivers/gpu/drm/xe/xe_gt_sriov_pf.c:229).
  - After: `pf_queue_restart()` takes a non-resuming RPM reference via
    `xe_pm_runtime_get_noresume(xe)` before `queue_work()`; that
    reference is dropped either in the worker on completion or when
    the work is canceled/disabled.
  - New get: `pf_queue_restart()` adds
    `xe_pm_runtime_get_noresume(xe)`, and if `queue_work()` returns
    false (already queued), it immediately calls
    `xe_pm_runtime_put(xe)` so the extra reference is not leaked
    (drivers/gpu/drm/xe/xe_gt_sriov_pf.c:244).
  - New put on cancel/disable: If `cancel_work_sync()` or
    `disable_work_sync()` returns true, drop the worker’s RPM ref
    (drivers/gpu/drm/xe/xe_gt_sriov_pf.c:206,
    drivers/gpu/drm/xe/xe_gt_sriov_pf.c:55).
  - Worker body: `pf_restart()` no longer resumes; it asserts that the
    device is not suspended and only does the final
    `xe_pm_runtime_put(xe)` to drop the ref held “on its behalf”
    (drivers/gpu/drm/xe/xe_gt_sriov_pf.c:229).
- Correct PM lifetime: This pattern matches established xe driver usage
  for async work (e.g., `xe_vm.c:1751`, `xe_sched_job.c:149`,
  `xe_mocs.c:785`, `xe_pci_sriov.c:171`), where async paths use
  `xe_pm_runtime_get_noresume()` to keep the device from autosuspending
  without performing a resume from the inner worker.
- Rationale and safety:
  - `gt_reset()` already holds a runtime PM ref across reset and restart
    scheduling (`drivers/gpu/drm/xe/xe_gt.c:822` get,
    `drivers/gpu/drm/xe/xe_gt.c:857` put). Taking an additional
    `get_noresume()` before queuing guarantees the device won’t
    autosuspend before the worker executes, but crucially avoids an
    unsolicited resume from the worker itself.
  - The assert in `pf_restart()` (`!xe_pm_runtime_suspended(xe)`) is a
    correctness guard ensuring the worker only runs with the device
    awake; the RPM ref taken at queue time enforces this in practice.
  - The cancellation/disable paths now correctly drop the worker’s PM
    ref, preventing leaks when a pending restart is canceled because a
    subsequent reset is about to happen (this complements the already
    backported reset-cancellation change in this file).
- Scope and risk:
  - Change is small, self-contained, and limited to SR-IOV PF code in
    `drivers/gpu/drm/xe/xe_gt_sriov_pf.c`.
  - No API/ABI or architectural change; just corrects RPM reference
    placement and balances puts on cancel/disable.
  - Reduces risk of unintended device resumes and PM races; aligns with
    driver PM policy.
- Stable backport fit:
  - Fixes a real PM semantics bug affecting SR-IOV PF restart handling
    after GT resets.
  - Minimal, contained, and follows existing patterns; low regression
    risk.
  - Depends only on existing helpers (e.g.,
    `xe_pm_runtime_get_noresume`, `xe_pm_runtime_suspended`), which are
    present in stable branches already carrying the async restart worker
    (see prior “Move VFs reprovisioning to worker” backport).
Given the above, this is a good candidate for stable backport.
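
The reference-balancing rule described above can be modelled in plain
user-space C. This is only a sketch under stated assumptions, not the
driver code: `struct model_dev` and every `model_*` function are
hypothetical stand-ins for the xe device, `xe_pm_runtime_get_noresume()`,
`xe_pm_runtime_put()`, `queue_work()` and `cancel_work_sync()`, and the
"worker" is invoked synchronously instead of from a workqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state relevant to this patch. */
struct model_dev {
	int rpm_refs;		/* outstanding runtime PM references */
	bool work_pending;	/* restart work queued but not yet run */
	int restarts_done;	/* number of completed restarts */
};

/* Stand-in for a non-resuming get: only bumps the count. */
static void model_rpm_get_noresume(struct model_dev *d)
{
	d->rpm_refs++;
}

static void model_rpm_put(struct model_dev *d)
{
	assert(d->rpm_refs > 0);	/* an unbalanced put would be a bug */
	d->rpm_refs--;
}

/* Like queue_work(): returns false if the work was already pending. */
static bool model_queue_work(struct model_dev *d)
{
	if (d->work_pending)
		return false;
	d->work_pending = true;
	return true;
}

/* Like cancel_work_sync(): returns true if pending work was canceled. */
static bool model_cancel_work(struct model_dev *d)
{
	bool was_pending = d->work_pending;

	d->work_pending = false;
	return was_pending;
}

/* Queue side: take the reference on behalf of the worker. */
static void model_queue_restart(struct model_dev *d)
{
	model_rpm_get_noresume(d);
	if (!model_queue_work(d))
		model_rpm_put(d);	/* already queued: drop the extra ref */
}

/* Cancel side: drop the worker's reference only if work was pending. */
static void model_cancel_restart(struct model_dev *d)
{
	if (model_cancel_work(d))
		model_rpm_put(d);
}

/* Worker side: never resumes, only releases the ref held for it. */
static void model_run_restart(struct model_dev *d)
{
	if (!d->work_pending)
		return;
	d->work_pending = false;
	assert(d->rpm_refs > 0);	/* device must already be held awake */
	d->restarts_done++;
	model_rpm_put(d);
}
```

In this model every successful queue operation is paired with exactly
one put, issued either by the worker or by the cancel path, so the
reference count returns to its starting value whichever path wins.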
drivers/gpu/drm/xe/xe_gt_sriov_pf.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_gt_sriov_pf.c b/drivers/gpu/drm/xe/xe_gt_sriov_pf.c
index bdbd15f3afe38..c4dda87b47cc8 100644
--- a/drivers/gpu/drm/xe/xe_gt_sriov_pf.c
+++ b/drivers/gpu/drm/xe/xe_gt_sriov_pf.c
@@ -55,7 +55,12 @@ static void pf_init_workers(struct xe_gt *gt)
static void pf_fini_workers(struct xe_gt *gt)
{
xe_gt_assert(gt, IS_SRIOV_PF(gt_to_xe(gt)));
- disable_work_sync(&gt->sriov.pf.workers.restart);
+
+ if (disable_work_sync(&gt->sriov.pf.workers.restart)) {
+ xe_gt_sriov_dbg_verbose(gt, "pending restart disabled!\n");
+ /* release an rpm reference taken on the worker's behalf */
+ xe_pm_runtime_put(gt_to_xe(gt));
+ }
}
/**
@@ -207,8 +212,11 @@ static void pf_cancel_restart(struct xe_gt *gt)
{
xe_gt_assert(gt, IS_SRIOV_PF(gt_to_xe(gt)));
- if (cancel_work_sync(&gt->sriov.pf.workers.restart))
+ if (cancel_work_sync(&gt->sriov.pf.workers.restart)) {
xe_gt_sriov_dbg_verbose(gt, "pending restart canceled!\n");
+ /* release an rpm reference taken on the worker's behalf */
+ xe_pm_runtime_put(gt_to_xe(gt));
+ }
}
/**
@@ -226,9 +234,12 @@ static void pf_restart(struct xe_gt *gt)
{
struct xe_device *xe = gt_to_xe(gt);
- xe_pm_runtime_get(xe);
+ xe_gt_assert(gt, !xe_pm_runtime_suspended(xe));
+
xe_gt_sriov_pf_config_restart(gt);
xe_gt_sriov_pf_control_restart(gt);
+
+ /* release an rpm reference taken on our behalf */
xe_pm_runtime_put(xe);
xe_gt_sriov_dbg(gt, "restart completed\n");
@@ -247,8 +258,13 @@ static void pf_queue_restart(struct xe_gt *gt)
xe_gt_assert(gt, IS_SRIOV_PF(xe));
- if (!queue_work(xe->sriov.wq, &gt->sriov.pf.workers.restart))
+ /* take an rpm reference on behalf of the worker */
+ xe_pm_runtime_get_noresume(xe);
+
+ if (!queue_work(xe->sriov.wq, &gt->sriov.pf.workers.restart)) {
xe_gt_sriov_dbg(gt, "restart already in queue!\n");
+ xe_pm_runtime_put(xe);
+ }
}
/**
--
2.51.0