Linux kernel -stable discussions
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Christian König" <christian.koenig@amd.com>,
	"Prike Liang" <Prike.Liang@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Sasha Levin" <sashal@kernel.org>,
	airlied@gmail.com, simona@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset
Date: Wed, 20 May 2026 07:18:34 -0400	[thread overview]
Message-ID: <20260520111944.3424570-2-sashal@kernel.org> (raw)
In-Reply-To: <20260520111944.3424570-1-sashal@kernel.org>

From: Christian König <christian.koenig@amd.com>

[ Upstream commit d5971c5c34303a00bf841a902ca00a703602c500 ]

The purpose of a GPU reset is to make sure that fence can be signaled
again and the signal and resume workers can make progress again.

So waiting for the resume worker or any fence in the GPU reset path is
just utterly nonsense.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Decision Summary
This should be backported for stable trees that contain
`amdgpu_userq_pre_reset()`. It removes two blocking waits from the GPU
reset path: `cancel_delayed_work_sync(&uqm->resume_work)` and
`amdgpu_userq_wait_for_last_fence()`, where the latter waits on
`dma_fence_wait_timeout(..., MAX_SCHEDULE_TIMEOUT)`. The reset path is
precisely what is supposed to let those fences/workers make progress, so
waiting there can deadlock GPU recovery.

The patch is one-function, one-file, no new API, and `git apply --check`
succeeds against this `7.0.y` checkout. Main concern: the review thread
confirms broader queue-state locking concerns remain, but those are pre-
existing and the accepted patch was explicitly scoped to removing the
obvious deadlock.

## Phase Walkthrough
Phase 1:
- Record 1.1: subsystem `drm/amdgpu`; action `remove`; intent is to
  remove deadlocks from `amdgpu_userq_pre_reset()`.
- Record 1.2: tags present: `Signed-off-by: Christian König`, `Reviewed-
  by: Prike Liang`, `Signed-off-by: Alex Deucher`; no `Fixes:`,
  `Reported-by:`, `Tested-by:`, `Acked-by:`, `Link:`, or `Cc: stable`.
- Record 1.3: commit body says GPU reset exists to allow fences and
  signal/resume workers to progress; waiting for resume worker or fences
  inside reset is wrong. Failure mode: reset can hang/deadlock.
- Record 1.4: not hidden; it is explicitly a deadlock fix.

Phase 2:
- Record 2.1: one file, `drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c`,
  `+12/-14`; only `amdgpu_userq_pre_reset()` changed; single-file
  surgical fix.
- Record 2.2: before, reset path canceled `resume_work` synchronously
  and waited for `last_fence` before unmapping mapped queues. After, it
  skips non-mapped queues and directly unmaps/marks hung/force-completes
  mapped queues.
- Record 2.3: bug category is deadlock/hang. Mechanism: reset path
  waited on work/fences that may require reset progress to complete.
- Record 2.4: fix is minimal. Regression risk is low-to-medium because
  queue-state locking is still called out by the TODO and review
  discussion, but no new API or broad behavior is added.

Phase 3:
- Record 3.1: `git blame` shows the reset function came from
  `290f46cf57265` (`drm/amdgpu: Implement user queue reset
  functionality`); current textual lines were later touched by
  `473f12f820956` and `dc21e39fd20c7` helper/signature cleanups. `git
  describe --contains 290f46cf...` points to `v6.19-rc1`, so older
  stable trees before v6.19 likely do not contain this exact code.
- Record 3.2: no `Fixes:` tag, so no Fixes target to follow.
- Record 3.3: recent file history shows active userq work and related
  reset/fence changes; this patch applies standalone to current `7.0.y`.
- Record 3.4: `MAINTAINERS` lists Christian König and Alex Deucher as
  AMDGPU DRM maintainers; author/committer are subsystem maintainers.
- Record 3.5: no compile dependency found; patch is part of an 11-patch
  series but applies cleanly by itself.

Phase 4:
- Record 4.1: `b4 dig` failed because the commit object is absent
  locally. Fallback found the original patch at
  `lists.freedesktop.org/archives/amd-gfx/2026-April/143199.html`;
  committed version matches the patch content with `Reviewed-by` added.
- Record 4.2: `b4 -w` failed; lore thread shows Alex Deucher, Prike
  Liang, Sunil Khatri, and `amd-gfx` were involved.
- Record 4.3: no external bug report or syzbot link found.
- Record 4.4: series context exists (`[PATCH 02/11]`), with later
  userq/fence/reset cleanups, but no hard prerequisite was identified
  for this specific deadlock removal.
- Record 4.5: web search did not find stable-list discussion for this
  exact patch.

Phase 5:
- Record 5.1: modified function: `amdgpu_userq_pre_reset()`.
- Record 5.2: callers: local search shows only
  `amdgpu_device_halt_activities()`, called from
  `amdgpu_device_gpu_recover()` and PCI error handling.
- Record 5.3: affected callees include `userq_funcs->unmap()` and
  `amdgpu_userq_fence_driver_force_completion()`; removed callees
  include `cancel_delayed_work_sync()` and
  `amdgpu_userq_wait_for_last_fence()`.
- Record 5.4: reachable through AMDGPU recovery paths, including job
  timeout/GPU recovery paths; exact unprivileged trigger policy was not
  fully verified.
- Record 5.5: similar wait/cancel patterns remain in non-reset paths
  such as destroy/suspend/eviction, but the reset-path instance is the
  one fixed here.

Phase 6:
- Record 6.1: buggy code exists in this `7.0.y` tree and appears
  introduced for `v6.19-rc1`; likely relevant to v6.19+ stable trees
  containing userq reset support.
- Record 6.2: `git apply --check` succeeds against this checkout, so
  current `7.0.y` backport difficulty is clean.
- Record 6.3: no exact related stable replacement fix found locally or
  in web stable search.

Phase 7:
- Record 7.1: subsystem is AMDGPU DRM driver; criticality is IMPORTANT,
  driver-specific but for widely used GPU hardware.
- Record 7.2: subsystem/file is active; recent history contains many
  userq reset/fence fixes and refactors.

Phase 8:
- Record 8.1: affected users are AMDGPU users with user queues enabled
  who hit GPU reset/recovery.
- Record 8.2: trigger is GPU reset while user queues have pending resume
  work or unsignaled fences; commonality depends on userq usage and GPU
  hangs.
- Record 8.3: failure mode is GPU reset deadlock/hang; severity HIGH to
  CRITICAL because recovery can stall.
- Record 8.4: benefit is high for affected systems; risk is low-to-
  medium due small scope but reset-path concurrency sensitivity. Ratio
  favors backport.

Phase 9:
- Record 9.1 FOR: fixes deadlock, reset-path hang, small one-function
  diff, maintainer-authored, reviewed, applies cleanly. AGAINST: no
  reported user bug, no stable tag, part of a larger userq series,
  residual queue-state locking TODO. UNRESOLVED: no runtime
  reproduction/build, exact applicability to every active stable branch
  not verified.
- Record 9.2 stable rules: obviously correct enough by code and
  maintainer review: yes; fixes real bug: yes, deadlock; important
  issue: yes, recovery hang; small/contained: yes; no new API/features:
  yes; can apply to current stable tree: yes.
- Record 9.3 exception category: none.
- Record 9.4 decision: backport to stable trees containing this userq
  reset code.

## Verification
- [Phase 1] Parsed GitHub patch and lore message: confirmed subject,
  body, tags, and lack of `Fixes`/`Cc stable`.
- [Phase 2] Compared diff: confirmed one-file `+12/-14` change removing
  `cancel_delayed_work_sync()` and `amdgpu_userq_wait_for_last_fence()`.
- [Phase 3] `git blame`: confirmed reset code origin/refactor history
  around `amdgpu_userq_pre_reset()`.
- [Phase 3] `git describe --contains`: confirmed introduction is in the
  v6.19 era.
- [Phase 3] `MAINTAINERS`: confirmed Christian König and Alex Deucher
  maintain AMDGPU DRM.
- [Phase 4] `b4 dig -c/-a/-w`: failed because commit object is not in
  this checkout.
- [Phase 4] WebFetch GitHub patch and freedesktop lore: confirmed
  original patch, review discussion, and `Reviewed-by: Prike Liang`.
- [Phase 5] `rg` and file reads: confirmed caller path through
  `amdgpu_device_halt_activities()` and `amdgpu_device_gpu_recover()`.
- [Phase 6] `git apply --check`: confirmed clean application to current
  `7.0.y`.
- UNVERIFIED: no kernel build or runtime reproduction was performed;
  exact applicability to all non-`7.0.y` stable trees was not fully
  checked.

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 26 +++++++++++------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
index 0a1b93259887a..6f49ba3bd0394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
@@ -1583,23 +1583,21 @@ void amdgpu_userq_pre_reset(struct amdgpu_device *adev)
 {
 	const struct amdgpu_userq_funcs *userq_funcs;
 	struct amdgpu_usermode_queue *queue;
-	struct amdgpu_userq_mgr *uqm;
 	unsigned long queue_id;
 
+	/* TODO: We probably need a new lock for the queue state */
 	xa_for_each(&adev->userq_doorbell_xa, queue_id, queue) {
-		uqm = queue->userq_mgr;
-		cancel_delayed_work_sync(&uqm->resume_work);
-		if (queue->state == AMDGPU_USERQ_STATE_MAPPED) {
-			amdgpu_userq_wait_for_last_fence(queue);
-			userq_funcs = adev->userq_funcs[queue->queue_type];
-			userq_funcs->unmap(queue);
-			/* just mark all queues as hung at this point.
-			 * if unmap succeeds, we could map again
-			 * in amdgpu_userq_post_reset() if vram is not lost
-			 */
-			queue->state = AMDGPU_USERQ_STATE_HUNG;
-			amdgpu_userq_fence_driver_force_completion(queue);
-		}
+		if (queue->state != AMDGPU_USERQ_STATE_MAPPED)
+			continue;
+
+		userq_funcs = adev->userq_funcs[queue->queue_type];
+		userq_funcs->unmap(queue);
+		/* just mark all queues as hung at this point.
+		 * if unmap succeeds, we could map again
+		 * in amdgpu_userq_post_reset() if vram is not lost
+		 */
+		queue->state = AMDGPU_USERQ_STATE_HUNG;
+		amdgpu_userq_fence_driver_force_completion(queue);
 	}
 }
 
-- 
2.53.0


  reply	other threads:[~2026-05-20 11:19 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-20 11:18 [PATCH AUTOSEL 7.0-6.12] HID: logitech-hidpp: Add support for newer Bluetooth keyboards Sasha Levin
2026-05-20 11:18 ` Sasha Levin [this message]
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] ALSA: sparc/dbri: add missing fallthrough Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.6] docs: cgroup-v1: Update charge-commit section Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] drm/panel: feiyang-fy07024di26a30d: return display-on error Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.6] ALSA: usb-audio: Add iface reset and delay quirk for TTGK Technology USB-C Audio Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] selftests/cgroup: Fix cg_read_strcmp() empty string comparison Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.1] smb: client: Zero-pad short GSS session keys per MS-SMB2 Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] HID: magicmouse: Prevent out-of-bounds (OOB) read during DOUBLE_REPORT_ID Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0] smb: client: avoid integer overflow in SMB2 READ length check Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] libceph: Fix unnecessarily high ceph_decode_need() for uniform bucket Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.6] ALSA: hda/realtek: fix mic boost on Framework PTL Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.6] io_uring: hold uring_lock when walking link chain in io_wq_free_work() Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.15] wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.12] KVM: arm64: nv: Consider the DS bit when translating TCR_EL2 Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0] docs: hwmon: sy7636a: fix temperature sysfs attribute name Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0] ALSA: hda/realtek: ALC269 fixup for Lenovo Yoga Pro 7 15ASH111 audio Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.6] ipv6: Implement limits on extension header parsing Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.12] net: usb: cdc_ncm: add Apple Mac USB-C direct networking quirk Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.15] net: usb: r8152: add TRENDnet TUC-ET2G v2.0 Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] i2c: dev: prevent integer overflow in I2C_TIMEOUT ioctl Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] powerpc/vmx: avoid KASAN instrumentation in enter_vmx_ops() for kexec Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-6.18] ALSA: usb-audio: add min_mute quirk for Razer Nommo V2 X Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] wifi: libertas: fix integer underflow in process_cmdrequest() Sasha Levin
2026-05-20 20:41   ` James Cameron
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] HID: mcp2221: fix OOB write in mcp2221_raw_event() Sasha Levin
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0] io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER Sasha Levin
2026-05-20 11:40   ` Jens Axboe
2026-05-20 11:18 ` [PATCH AUTOSEL 7.0-5.10] wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] HID: elan: Add support for ELAN SB974D touchpad Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] media: qcom: camss: avoid format string warning Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] HID: i2c-hid: add reset quirk for BLTP7853 touchpad Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] ALSA: hda/realtek: Limit mic boost on Positivo DN50E Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] Documentation: kvm: update links in the references section of AMD Memory Encryption Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] scsi: scsi_dh_alua: Increase default ALUA timeout to maximum spec value Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.1] HID: google: hammer: stop hardware on devres action failure Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] ALSA: doc: cs35l56: Update path to HDA driver source Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] Bluetooth: hci_uart: Fix NULL deref in recv callbacks when priv is uninitialized Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0] ALSA: hda/realtek: Add mute LED fixup for HP Pavilion 15-cs1xxx Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] btrfs: fix check_chunk_block_group_mappings() to iterate all chunk maps Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.15] ALSA: usb-audio: Add quirk flags for AlphaTheta EUPHONIA Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] powerpc/g5: Enable all windfarms by default Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] ALSA: hda/realtek: Add codec SSID quirk for Lenovo Yoga Pro 9 16IMH9 Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] tools/ynl: add missing uapi header deps in Makefile.deps Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] fbdev: ipu-v3: clean up kernel-doc warnings Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.6] ASoC: amd: yc: Add DMI quirk for MSI Bravo 15 C7VE Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.1] powerpc/pasemi: Drop redundant res assignment Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed() Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] drm/amd/ras: Fix CPER ring debugfs read overflow Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.15] scsi: smartpqi: Silence a recursive lock warning Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0] io_uring: defer linked-timeout chain splice out of hrtimer context Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] io_uring: validate user-controlled cq.head in io_cqe_cache_refill() Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8407AA Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] powerpc/pseries/htmdump: Free the global buffers in htmdump module exit Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.6] HID: sony: add missing size validation for SMK-Link remotes Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.15] HID: ft260: validate i2c input report length Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0] io_uring: hold uring_lock across io_kill_timeouts() in cancel path Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0] platform/x86: hp-wmi: Add support for Victus 16-r0xxx (8BC2) Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] i2c: acpi: Add ELAN0678 to i2c_acpi_force_100khz_device_ids Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] KVM: VMX: introduce module parameter to disable CET Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] iommu/amd: Use maximum Event log buffer size when SNP is enabled on Family 0x19 Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-5.10] ALSA: usb-audio: add clock quirk for Motu 1248 Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.18] workqueue: Release PENDING in __queue_work() drain/destroy reject path Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0] ASoC: sdw_utils: avoid the SDCA companion function not supported failure Sasha Levin
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0] Documentation: security-bugs: do not systematically Cc the security team Sasha Levin
2026-05-20 13:07   ` Jonathan Corbet
2026-05-20 11:19 ` [PATCH AUTOSEL 7.0-6.12] io_uring/fdinfo: translate SqThread PID through caller's pid_ns Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260520111944.3424570-2-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Prike.Liang@amd.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=simona@ffwll.ch \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox