From: "Lis, Tomasz" <tomasz.lis@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org,
"Michał Winiarski" <michal.winiarski@intel.com>,
"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
"Piotr Piórkowski" <piotr.piorkowski@intel.com>
Subject: Re: [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups
Date: Fri, 20 Feb 2026 16:20:25 +0100
Message-ID: <ceaa1492-0e88-4110-917e-a93a318a8eb2@intel.com>
In-Reply-To: <aZdds4C5AZyUQfBm@lstrano-desk.jf.intel.com>
On 2/19/2026 8:00 PM, Matthew Brost wrote:
> On Thu, Feb 19, 2026 at 12:21:55AM +0100, Tomasz Lis wrote:
>> There is a small but non-zero chance that fixups are running on
>> a context during teardown. The chance is decreased by starting
>> the teardown with releasing the guc_id, but remains non-zero.
>> On the other hand, the sync between fixups and context creation
>> drastically increases the chance of such a parallel teardown if
>> context creation fails.
>>
>>
> I don't see how this is possible.
>
> xe_exec_queue_contexts_hwsp_rebase only happens if the exec queue is
> present in &guc->submission_state.exec_queue_lookup.
>
> 332 static void __xe_exec_queue_fini(struct xe_exec_queue *q)
> 333 {
> 334         int i;
> 335
> 336         q->ops->fini(q);
> 337
> 338         for (i = 0; i < q->width; ++i)
> 339                 xe_lrc_put(q->lrc[i]);
> 340 }
>
> The removal from &guc->submission_state.exec_queue_lookup happens on
> line 336 in the above. Thus a xe_exec_queue_contexts_hwsp_rebase can't
> be executing on a 'q' after line 336 returns, and only then do we drop
> the references to the LRCs. I agree this lifetime is questionable at
> best (IIRC my GuC documentation explains why this works), but if there
> is a problem it should be fixed with this lifetime in mind.
Consider this situation: __xe_exec_queue_init() and
xe_exec_queue_contexts_hwsp_rebase() are running at the same time on a
single-core VM, switching CPU contexts (each bullet is a context switch):
* __xe_exec_queue_init() passes `q->ops->init(q)` - the queue is added
to exec_queue_lookup, then it starts creating LRCs - it's a multi-LRC queue
* xe_exec_queue_contexts_hwsp_rebase() is executed on this new queue and
starts the loop over LRCs
* __xe_exec_queue_init() fails to create the last of the LRCs and jumps
to `err_lrc`, where all the finalization is done - removal from
exec_queue_lookup and freeing of the already created LRCs
* the CPU context switches back to xe_exec_queue_contexts_hwsp_rebase(),
which walks the pointers of the now-freed LRCs and accesses their
contents - a use-after-free.
(I used one CPU core only to simplify the scenario; it could happen on
multi-core as well.)
> Looking at __xe_exec_queue_init, I believe 'err_lrc' label should
> actually call __xe_exec_queue_fini.
The __xe_exec_queue_fini() currently assumes that all LRC pointers are
non-NULL.
Do you mean adding such a check there? With it present, we could call
that function in `err_lrc`.
I see no issue with such a change, so let me know and I'll do it
(assuming we will not be adding any wait there, as hinted below).
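For illustration, a NULL-tolerant teardown loop could look roughly like
the toy model below. The struct and function names are invented for this
sketch (they are not the real xe structures); the point is only that
slots which were never created stay NULL and get skipped:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the xe structures; names are made up for this sketch. */
struct toy_lrc { int refcount; };

static void toy_lrc_put(struct toy_lrc *lrc)
{
	if (--lrc->refcount == 0)
		free(lrc);
}

struct toy_exec_queue {
	int width;
	struct toy_lrc *lrc[4];
};

/* Fini that tolerates a partially-initialized queue: LRC slots that
 * were never created are NULL and are simply skipped. */
static void toy_exec_queue_fini(struct toy_exec_queue *q)
{
	int i;

	for (i = 0; i < q->width; ++i) {
		if (!q->lrc[i])
			continue;
		toy_lrc_put(q->lrc[i]);
		q->lrc[i] = NULL;
	}
}
```

With such a guard in place, the `err_lrc` path could reuse the common
fini instead of open-coding partial cleanup.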
>
>> Prevent LRC teardown in parallel with fixups by getting a reference.
>>
>> Signed-off-by: Tomasz Lis<tomasz.lis@intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_exec_queue.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>> index 42849be46166..e9396ad3390a 100644
>> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
>> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>> @@ -1669,10 +1669,11 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>> lrc = READ_ONCE(q->lrc[i]);
>> if (!lrc)
>> continue;
>> -
>> + xe_lrc_get(lrc);
> This doesn't actually fix anything. The LRC could (currently, in error
> paths) disappear between the read and the get.
It is true that this only narrows the window rather than providing a
flawless fix. Still, narrowing the window is a substantial improvement
over ignoring the issue.
We could use xe_gt_sriov_vf_wait_valid_default_lrc() within `err_lrc:`
instead; that would allow a flawless fix. An advantage of the current
solution is that it keeps the complication within the recovery code,
without altering the common flow (by common flow I mean the queue
creation flow used for both PF and VF, regardless of whether VF
migration is possible). It also allows freeing the memory faster - if
LRC creation has failed, it may be important to release resources as
soon as possible.
Reading and writing local memory is substantially slower than a local
pointer read plus a refcount increase, so this way we're narrowing the
window by definitely more than 95%.
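The window-narrowing pattern the patch uses - snapshot the pointer, take
a reference right away, do the slow fixup work, then drop the reference -
can be modelled roughly as below. This is toy code with invented names,
not the xe implementation, and it only shows the get/put bracketing, not
the remaining read-to-get race:

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_lrc { int refcount; int fixups_applied; };

static struct toy_lrc *toy_lrc_get(struct toy_lrc *lrc)
{
	lrc->refcount++;
	return lrc;
}

static void toy_lrc_put(struct toy_lrc *lrc)
{
	if (--lrc->refcount == 0)
		free(lrc);
}

/* The slow part: in the real driver this touches LRC state in local
 * memory, far more expensive than the pointer read plus the get. */
static void toy_apply_fixups(struct toy_lrc *lrc)
{
	lrc->fixups_applied = 1;
}

/* Hold a reference across the fixup so a concurrent teardown's final
 * put cannot free the LRC underneath us. */
static void toy_rebase_one(struct toy_lrc **slot)
{
	struct toy_lrc *lrc = *slot;	/* snapshot, as with READ_ONCE() */

	if (!lrc)
		return;
	toy_lrc_get(lrc);
	toy_apply_fixups(lrc);
	toy_lrc_put(lrc);
}
```

The teardown's put then at worst drops its reference while ours is still
held, deferring the free until the fixup loop iteration finishes.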
-Tomasz
>
> Matt
>
>> xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
>> xe_lrc_update_hwctx_regs_with_address(lrc);
>> err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
>> + xe_lrc_put(lrc);
>> if (err)
>> break;
>> }
>> --
>> 2.25.1
>>