From: Matthew Brost <matthew.brost@intel.com>
To: "Lis, Tomasz" <tomasz.lis@intel.com>
Cc: intel-xe@lists.freedesktop.org,
"Michał Winiarski" <michal.winiarski@intel.com>,
"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
"Piotr Piórkowski" <piotr.piorkowski@intel.com>
Subject: Re: [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups
Date: Fri, 20 Feb 2026 08:20:18 -0800 [thread overview]
Message-ID: <aZiJwinZLdmb9AE9@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <ceaa1492-0e88-4110-917e-a93a318a8eb2@intel.com>
On Fri, Feb 20, 2026 at 04:20:25PM +0100, Lis, Tomasz wrote:
>
> On 2/19/2026 8:00 PM, Matthew Brost wrote:
> > On Thu, Feb 19, 2026 at 12:21:55AM +0100, Tomasz Lis wrote:
> > > There is a small but non-zero chance that fixups are running on
> > > a context during teardown. The chances are decreased by starting
> > > the teardown by releasing guc_id, but remain non-zero.
> > > On the other hand the sync between fixups and context creation
> > > drastically increases chance for such parallel teardown if
> > > context creation fails.
> > >
> > I don't see how this is possible.
> >
> > xe_exec_queue_contexts_hwsp_rebase only happens if the exec queue is
> > present in &guc->submission_state.exec_queue_lookup.
> >
> > 332 static void __xe_exec_queue_fini(struct xe_exec_queue *q)
> > 333 {
> > 334 int i;
> > 335
> > 336 q->ops->fini(q);
> > 337
> > 338 for (i = 0; i < q->width; ++i)
> > 339 xe_lrc_put(q->lrc[i]);
> > 340 }
> >
> > The removal from &guc->submission_state.exec_queue_lookup happens on
> > line 336 above. Thus xe_exec_queue_contexts_hwsp_rebase can't be
> > executing on a 'q' after line 336 returns, and only then do we drop the
> > references to the LRCs. I agree this lifetime is questionable at best
> > (IIRC my GuC documentation explains why this works), but if there is a
> > problem it should be fixed with this lifetime in mind.
>
> Consider a situation: __xe_exec_queue_init() and
> xe_exec_queue_contexts_hwsp_rebase() are running at the same time on a
> one-core VM, switching CPU contexts (each bullet is a context switch).
>
> * __xe_exec_queue_init() passes `q->ops->init(q)` - the queue is added to
> exec_queue_lookup, then it starts creating LRCs - it's a multi-LRC queue
>
> * xe_exec_queue_contexts_hwsp_rebase() is executed on this new queue, starts
> the loop over LRCs
>
> * __xe_exec_queue_init() fails to create the last of the LRCs, and jumps to
> `err_lrc` where all the finalization is done - removal from
> exec_queue_lookup and freeing of already created LRCs
>
> * CPU context switches back to xe_exec_queue_contexts_hwsp_rebase(), which
> goes through pointers of the now freed LRCs, accessing their insides - SEGFAULT.
>
> (I used one CPU core only to simplify the scenario, it could happen on
> multi-core as well)
>
Yes.
> > Looking at __xe_exec_queue_init, I believe 'err_lrc' label should
> > actually call __xe_exec_queue_fini.
>
> The __xe_exec_queue_fini() currently assumes that all LRC pointers are
> non-NULL.
>
Oh, yes. I missed that. Either __xe_exec_queue_fini would need a NULL
check, or xe_lrc_put could have a NULL check (e.g., make it like kfree,
dma_fence_put, or xe_bo_put, which can be called with NULL).
> Do you mean adding such check there? With it present, we could call that
> function in `err_lrc`.
>
> I see no issue with such change, so let me know and I'll do it (assuming we
> will not be adding any wait there, as hinted below).
>
See above, a NULL check somewhere. No real preference where.
> >
> > > Prevent LRC teardown in parallel with fixups by getting a reference.
> > >
> > > Signed-off-by: Tomasz Lis<tomasz.lis@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_exec_queue.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > index 42849be46166..e9396ad3390a 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > @@ -1669,10 +1669,11 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
> > > lrc = READ_ONCE(q->lrc[i]);
> > > if (!lrc)
> > > continue;
> > > -
> > > + xe_lrc_get(lrc);
> > This doesn't actually fix anything. The LRC could (currently, in error
> > paths) disappear between the read and the get.
>
> It is true that this is only narrowing the window rather than providing a
> flawless fix. Though narrowing the window is a substantial improvement over
> ignoring the issue.
>
> We could use xe_gt_sriov_vf_wait_valid_default_lrc() within `err_lrc:`
> instead; that would allow a flawless fix. An advantage of the current
> solution is that it keeps the complication within recovery code, without
> altering the common flow (by common flow I mean the queue creation flow used
> for both PF and VF, and regardless of whether VF migration is possible). It
> also allows freeing the memory faster - if we've failed LRC creation, it may
> be important to free resources as soon as possible.
>
> Reading and writing local memory is substantially slower than the local
> pointer read and refcount increment, so this way we're narrowing the window
> by definitely more than 95%.
Yes, but if we are going to fix this, let's make sure it is 100%
correct. Please use __xe_exec_queue_fini in the LRC error path and add a
NULL check somewhere.
Matt
>
> -Tomasz
>
> >
> > Matt
> >
> > > xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
> > > xe_lrc_update_hwctx_regs_with_address(lrc);
> > > err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
> > > + xe_lrc_put(lrc);
> > > if (err)
> > > break;
> > > }
> > > --
> > > 2.25.1
> > >
Thread overview: 19+ messages
2026-02-18 23:21 [PATCH v2 0/5] drm/xe/vf: Fix exec queue creation during post-migration recovery Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 1/5] drm/xe/queue: Call fini on exec queue creation fail Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups Tomasz Lis
2026-02-19 19:00 ` Matthew Brost
2026-02-20 15:20 ` Lis, Tomasz
2026-02-20 16:20 ` Matthew Brost [this message]
2026-02-18 23:21 ` [PATCH v2 3/5] drm/xe/vf: Wait for default LRCs fixups before using Tomasz Lis
2026-02-19 20:16 ` Matthew Brost
2026-02-19 20:40 ` Matthew Brost
2026-02-20 17:20 ` Lis, Tomasz
2026-02-20 18:20 ` Matthew Brost
2026-02-18 23:21 ` [PATCH v2 4/5] drm/xe/vf: Redo LRC creation while in VF fixups Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 5/5] drm/xe/vf: Use marker to catch fixups during LRC creation Tomasz Lis
2026-02-19 20:33 ` Matthew Brost
2026-02-20 16:43 ` Lis, Tomasz
2026-02-20 17:41 ` Matthew Brost
2026-02-18 23:34 ` ✓ CI.KUnit: success for drm/xe/vf: Fix exec queue creation during post-migration recovery (rev2) Patchwork
2026-02-19 0:35 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-19 1:49 ` ✗ Xe.CI.FULL: failure " Patchwork