On Thu, Feb 19, 2026 at 12:21:55AM +0100, Tomasz Lis wrote:There is a small but non-zero chance that fixups are running on a context during teardown. The chances are decreased by starting the teardown by releasing guc_id, but remain non-zero. On the other hand the sync between fixups and context creation drastically increases chance for such parallel teardown if context creation fails.I don't see how this is possible. xe_exec_queue_contexts_hwsp_rebase only happens if the exec queue is present in &guc->submission_state.exec_queue_lookup. 332 static void __xe_exec_queue_fini(struct xe_exec_queue *q) 333 { 334 int i; 335 336 q->ops->fini(q); 337 338 for (i = 0; i < q->width; ++i) 339 xe_lrc_put(q->lrc[i]); 340 } The removal from &guc->submission_state.exec_queue_lookup happen on line 336 in the above before. Thus a xe_exec_queue_contexts_hwsp_rebase can't be executing on a 'q' after line 336 returns, then we drop the references to the LRC. I agree this lifetime is questionable at best (IIRC my GuC documentation explain this why this works) but if there is a problem it should be fix with this lifetime in mind.
Consider a situation: __xe_exec_queue_init() and xe_exec_queue_contexts_hwsp_rebase() are running at the same time, on a one core VM, switching CPU contexts (each bullet is a context switch).
* __xe_exec_queue_init() passes `q->ops->init(q)` - the queue is added to exec_queue_lookup, then it starts creating LRCs - it's multi-LRC queue
* xe_exec_queue_contexts_hwsp_rebase() is executed on this new queue, starts the loop over LRCs
* __xe_exec_queue_init() fails to create last of the LRCs, and jumps to `err_lrc` where all the finalization is done - removal form exec_queue_lookup and freeing of already created LRCs
* CPU context switches back to __xe_exec_queue_init() which goes through pointers of now freed LRCs, accessing the inside - SEGFAULT.
(I used one CPU core only to simplify the scenario, it could happen on multi-core as well)
Looking at __xe_exec_queue_init, I believe 'err_lrc' label should actually call __xe_exec_queue_fini.
The __xe_exec_queue_fini() currently assumes that all LRC pointers are non-NULL.
Do you mean adding such check there? With it present, we could call that function in `err_lrc`.
I see no issue with such change, so let me know and I'll do it (assuming we will not be adding any wait there, as hinted below).
Prevent LRC teardown in parallel with fixups by getting a reference. Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> --- drivers/gpu/drm/xe/xe_exec_queue.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c index 42849be46166..e9396ad3390a 100644 --- a/drivers/gpu/drm/xe/xe_exec_queue.c +++ b/drivers/gpu/drm/xe/xe_exec_queue.c @@ -1669,10 +1669,11 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch) lrc = READ_ONCE(q->lrc[i]); if (!lrc) continue; - + xe_lrc_get(lrc);This doesn't actually fix anything. The LRC could (current, in error paths) disappear between the read and get.
It is true that this is only narrowing the window rather than providing flawless fix. Though narrowing the window is a substantial improvement over ignoring the issue.
We could use xe_gt_sriov_vf_wait_valid_default_lrc() within `err_lrc:` instead, that would allow a flawless fix. An advantage of the current solution is that it keeps the complication within recovery code, without altering the common flow (by common flow I mean the queue creation flow used for both PF and VF, and regardless whether vf migration is possible). It also allows to free the memory faster - if we've failed LRC creation, it may be important to free resources as soon as possible.
Reading and writing local mem is substantially slower than the local pointer read and refcount increase, so this way we're narrowing the window by definitely more than 95%.
-Tomasz
Mattxe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch); xe_lrc_update_hwctx_regs_with_address(lrc); err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch); + xe_lrc_put(lrc); if (err) break; } -- 2.25.1