From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Lin.Cao" <lincao12@amd.com>,
"Philipp Stanner" <phasta@kernel.org>,
"Christian König" <christian.koenig@amd.com>,
"Sasha Levin" <sashal@kernel.org>,
matthew.brost@intel.com, dakr@kernel.org,
dri-devel@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.6 04/18] drm/scheduler: signal scheduled fence when kill job
Date: Mon, 9 Jun 2025 09:46:38 -0400 [thread overview]
Message-ID: <20250609134652.1344323-4-sashal@kernel.org> (raw)
In-Reply-To: <20250609134652.1344323-1-sashal@kernel.org>
From: "Lin.Cao" <lincao12@amd.com>
[ Upstream commit 471db2c2d4f80ee94225a1ef246e4f5011733e50 ]
When an entity from application B is killed, drm_sched_entity_kill()
removes all jobs belonging to that entity through
drm_sched_entity_kill_jobs_work(). If application A's job depends on a
scheduled fence from application B's job, and that fence is not properly
signaled during the killing process, application A's dependency cannot be
cleared.
This leads to application A hanging indefinitely while waiting for a
dependency that will never be resolved. Fix this issue by ensuring that
scheduled fences are properly signaled when an entity is killed, allowing
dependent applications to continue execution.
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250515020713.1110476-1-lincao12@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
**YES**
This commit should be backported to stable kernel trees. Here's my
extensive analysis:
## Analysis of the Commit
### 1. **This is a Critical Bug Fix**
The commit fixes a serious bug where applications can hang indefinitely
when they depend on scheduled fences from killed jobs. The specific
change adds a single line:
```c
+ drm_sched_fence_scheduled(job->s_fence, NULL);
```
This line signals the scheduled fence when killing a job, which was
previously missing and caused the hang.
### 2. **Clear User-Visible Impact**
From the commit message:
- Application A's job depends on a scheduled fence from Application B's
job
- When Application B is killed, the fence was not properly signaled
- This causes Application A to hang indefinitely waiting for a
dependency that will never resolve
This is a real-world scenario that affects system stability and user
experience.
### 3. **Minimal and Safe Change**
The fix is:
- **One line addition** - extremely minimal change
- **Well-contained** - only affects the job kill path
- **No API/ABI changes** - just adds a missing fence signal
- **No new features** - purely fixes existing broken behavior
### 4. **Comparison with Similar Commits**
Looking at the historical commits provided:
**Similar Commit #1 (Backported: YES)** - "drm/scheduler: fix fence ref
counting"
- Fixed dependency fence leaks when processes were killed
- Added proper fence reference management
- Similar in nature: fixing fence handling in the kill path
This commit shares key characteristics:
- Fixes fence handling bugs in entity kill path
- Prevents resource leaks/hangs
- Minimal, targeted fix
The other similar commits (NOT backported) were either:
- Large architectural changes (Commit #3 - complete rework of dependency
handling)
- Feature additions (Commit #4 - new error handling logic)
- Less critical fixes (Commit #5 - memleak in uncommon path)
### 5. **Risk Assessment**
**Low Risk**:
- The change only affects the error/kill path, not normal operation
- Signaling a fence with NULL is a valid operation indicating no
hardware execution
- The `drm_sched_fence_finished()` call immediately after still signals
completion with error
- No changes to data structures or algorithms
### 6. **Follows Stable Kernel Rules**
According to stable kernel rules, this commit qualifies because it:
- ✓ Fixes a real bug that affects users (application hangs)
- ✓ Is already in Linus's tree (has proper Link: tag)
- ✓ Is small and self-contained (1 line)
- ✓ Has been reviewed (Reviewed-by: tag present)
- ✓ Doesn't introduce new features
- ✓ Has clear problem description and solution
### 7. **Technical Details**
The fix ensures proper fence signaling order:
1. `drm_sched_fence_scheduled(job->s_fence, NULL)` - signals that the
job was "scheduled" (even though it won't run)
2. `drm_sched_fence_finished(job->s_fence, -ESRCH)` - signals completion
with error
This allows dependent jobs waiting on the scheduled fence to proceed and
handle the error appropriately, preventing indefinite hangs.
## Conclusion
This commit is an ideal candidate for stable backporting. It fixes a
serious user-visible bug (application hangs) with a minimal, well-
understood change that follows the established fence signaling pattern
in the DRM scheduler. The fix is similar in nature to previous commits
that were successfully backported, and the risk of regression is very
low since it only affects the error handling path.
drivers/gpu/drm/scheduler/sched_entity.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 53130a50584ca..eed3b8bed9e40 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -167,6 +167,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
{
struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
+ drm_sched_fence_scheduled(job->s_fence, NULL);
drm_sched_fence_finished(job->s_fence, -ESRCH);
WARN_ON(job->s_fence->parent);
job->sched->ops->free_job(job);
--
2.39.5
next prev parent reply other threads:[~2025-06-09 13:46 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-09 13:46 [PATCH AUTOSEL 6.6 01/18] md/md-bitmap: fix dm-raid max_write_behind setting Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 02/18] amd/amdkfd: fix a kfd_process ref leak Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 03/18] bcache: fix NULL pointer in cache_set_flush() Sasha Levin
2025-06-09 13:46 ` Sasha Levin [this message]
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 05/18] iio: pressure: zpa2326: Use aligned_s64 for the timestamp Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 06/18] um: Add cmpxchg8b_emu and checksum functions to asm-prototypes.h Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 07/18] um: use proper care when taking mmap lock during segfault Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 08/18] coresight: Only check bottom two claim bits Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 09/18] usb: dwc2: also exit clock_gating when stopping udc while suspended Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 10/18] iio: adc: ad_sigma_delta: Fix use of uninitialized status_pos Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 11/18] misc: tps6594-pfsm: Add NULL pointer check in tps6594_pfsm_probe() Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 12/18] usb: potential integer overflow in usbg_make_tpg() Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 13/18] tty: serial: uartlite: register uart driver in init Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 14/18] usb: common: usb-conn-gpio: use a unique name for usb connector device Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 15/18] usb: Add checks for snprintf() calls in usb_alloc_dev() Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 16/18] usb: cdc-wdm: avoid setting WDM_READ for ZLP-s Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 17/18] usb: typec: displayport: Receive DP Status Update NAK request exit dp altmode Sasha Levin
2025-06-09 13:46 ` [PATCH AUTOSEL 6.6 18/18] usb: typec: mux: do not return on EOPNOTSUPP in {mux, switch}_set Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250609134652.1344323-4-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=christian.koenig@amd.com \
--cc=dakr@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=lincao12@amd.com \
--cc=matthew.brost@intel.com \
--cc=patches@lists.linux.dev \
--cc=phasta@kernel.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox