From: Matthew Brost <matthew.brost@intel.com>
To: "Lis, Tomasz" <tomasz.lis@intel.com>
Cc: intel-xe@lists.freedesktop.org,
	"Michał Winiarski" <michal.winiarski@intel.com>,
	"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
	"Piotr Piórkowski" <piotr.piorkowski@intel.com>
Subject: Re: [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups
Date: Fri, 20 Feb 2026 08:20:18 -0800
Message-ID: <aZiJwinZLdmb9AE9@lstrano-desk.jf.intel.com>
In-Reply-To: <ceaa1492-0e88-4110-917e-a93a318a8eb2@intel.com>

On Fri, Feb 20, 2026 at 04:20:25PM +0100, Lis, Tomasz wrote:
> 
> On 2/19/2026 8:00 PM, Matthew Brost wrote:
> > On Thu, Feb 19, 2026 at 12:21:55AM +0100, Tomasz Lis wrote:
> > > There is a small but non-zero chance that fixups are running on
> > > a context during teardown. The chance is reduced by starting the
> > > teardown with the release of the guc_id, but remains non-zero.
> > > On the other hand, the synchronization between fixups and context
> > > creation drastically increases the chance of such a parallel
> > > teardown if context creation fails.
> > > 
> > I don't see how this is possible.
> > 
> > xe_exec_queue_contexts_hwsp_rebase only happens if the exec queue is
> > present in &guc->submission_state.exec_queue_lookup.
> > 
> >   332 static void __xe_exec_queue_fini(struct xe_exec_queue *q)
> >   333 {
> >   334         int i;
> >   335
> >   336         q->ops->fini(q);
> >   337
> >   338         for (i = 0; i < q->width; ++i)
> >   339                 xe_lrc_put(q->lrc[i]);
> >   340 }
> > 
> > The removal from &guc->submission_state.exec_queue_lookup happens on
> > line 336 in the code above. Thus xe_exec_queue_contexts_hwsp_rebase
> > can't be executing on a 'q' after line 336 returns, and only then do
> > we drop the references to the LRCs. I agree this lifetime is
> > questionable at best (IIRC my GuC documentation explains why this
> > works), but if there is a problem it should be fixed with this
> > lifetime in mind.
> 
> Consider a situation: __xe_exec_queue_init() and
> xe_exec_queue_contexts_hwsp_rebase() are running at the same time on a
> single-core VM, switching CPU contexts (each bullet is a context switch).
> 
> * __xe_exec_queue_init() passes `q->ops->init(q)` - the queue is added to
> exec_queue_lookup, then it starts creating LRCs - it's a multi-LRC queue
> 
> * xe_exec_queue_contexts_hwsp_rebase() is executed on this new queue and
> starts the loop over the LRCs
> 
> * __xe_exec_queue_init() fails to create the last of the LRCs and jumps to
> `err_lrc`, where all the finalization is done - removal from
> exec_queue_lookup and freeing of the already created LRCs
> 
> * CPU context switches back to xe_exec_queue_contexts_hwsp_rebase(), which
> goes through the pointers of the now-freed LRCs and accesses their
> contents - SEGFAULT.
> 
> (I used a single CPU core only to simplify the scenario; it could happen
> on multi-core as well. See the interleaving sketch below.)
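> 
> Roughly, the interleaving looks like this (a sketch only - function
> internals are abbreviated and the exact call sites are assumed):
> 
> 	/*
> 	 * single CPU core, scheduler interleaving:
> 	 *
> 	 * __xe_exec_queue_init(q)
> 	 *   q->ops->init(q)                 <- q visible in exec_queue_lookup
> 	 *   q->lrc[0] = xe_lrc_create(...)  <- ok
> 	 * xe_exec_queue_contexts_hwsp_rebase(q)
> 	 *   lrc = READ_ONCE(q->lrc[0])      <- non-NULL
> 	 * __xe_exec_queue_init(q), resumed
> 	 *   q->lrc[1] = xe_lrc_create(...)  <- fails
> 	 *   goto err_lrc                    <- frees q->lrc[0]
> 	 * xe_exec_queue_contexts_hwsp_rebase(q), resumed
> 	 *   xe_lrc_update_memirq_regs_with_address(lrc, ...)  <- use-after-free
> 	 */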
> 

Yes.

> > Looking at __xe_exec_queue_init, I believe the 'err_lrc' label should
> > actually call __xe_exec_queue_fini.
> 
> The __xe_exec_queue_fini() currently assumes that all LRC pointers are
> non-NULL.
> 

Oh, yes. I missed that. Either __xe_exec_queue_fini would need a NULL
check, or xe_lrc_put could have a NULL check (e.g., make it like kfree,
dma_fence_put, or xe_bo_put, which can be called with NULL).
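
A minimal sketch of the NULL-tolerant variant, assuming xe_lrc_put is a
thin kref_put() wrapper (the refcount field and release callback names
are assumptions):

	void xe_lrc_put(struct xe_lrc *lrc)
	{
		/* Tolerate NULL, like kfree()/dma_fence_put()/xe_bo_put() */
		if (!lrc)
			return;

		kref_put(&lrc->refcount, xe_lrc_destroy);
	}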

> Do you mean adding such a check there? With it present, we could call that
> function in `err_lrc`.
> 
> I see no issue with such a change, so let me know and I'll do it (assuming
> we will not be adding any wait there, as hinted below).
>

See above, a NULL check somewhere. No real preference where.
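
Alternatively, keep xe_lrc_put as-is and add the check in the fini path
(a sketch based on the snippet quoted above):

	static void __xe_exec_queue_fini(struct xe_exec_queue *q)
	{
		int i;

		q->ops->fini(q);	/* removes q from exec_queue_lookup */

		for (i = 0; i < q->width; ++i)
			if (q->lrc[i])	/* may be NULL after a failed init */
				xe_lrc_put(q->lrc[i]);
	}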
 
> > 
> > > Prevent LRC teardown in parallel with fixups by getting a reference.
> > > 
> > > Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_exec_queue.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > index 42849be46166..e9396ad3390a 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > @@ -1669,10 +1669,11 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
> > >   		lrc = READ_ONCE(q->lrc[i]);
> > >   		if (!lrc)
> > >   			continue;
> > > -
> > > +		xe_lrc_get(lrc);
> > This doesn't actually fix anything. The LRC could (currently, in error
> > paths) disappear between the read and the get.
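> > 
> > To make the window explicit (annotated sketch of the patched loop):
> > 
> > 	lrc = READ_ONCE(q->lrc[i]);
> > 	if (!lrc)
> > 		continue;
> > 	/* q->lrc[i] may be freed right here by the error path */
> > 	xe_lrc_get(lrc);	/* then this is a get on freed memory */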
> 
> It is true that this only narrows the window rather than providing a
> flawless fix. Still, narrowing the window is a substantial improvement
> over ignoring the issue.
> 
> We could use xe_gt_sriov_vf_wait_valid_default_lrc() within `err_lrc:`
> instead; that would allow a flawless fix. An advantage of the current
> solution is that it keeps the complication within the recovery code,
> without altering the common flow (by common flow I mean the queue creation
> flow used for both PF and VF, regardless of whether VF migration is
> possible). It also allows freeing the memory faster - if LRC creation has
> failed, it may be important to free resources as soon as possible.
> 
> Reading and writing local mem is substantially slower than a local pointer
> read and refcount increase, so this way we're narrowing the window by well
> over 95%.

Yes, but if we are going to fix this, let's make sure it is 100%
correct. Please use __xe_exec_queue_fini in the LRC error path and add a
NULL check somewhere.
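
I.e. something along these lines in __xe_exec_queue_init (a sketch only;
the exact loop shape and xe_lrc_create() arguments are assumed):

	for (i = 0; i < q->width; ++i) {
		q->lrc[i] = xe_lrc_create(...);
		if (IS_ERR(q->lrc[i])) {
			err = PTR_ERR(q->lrc[i]);
			q->lrc[i] = NULL;	/* let the NULL check skip it */
			goto err_lrc;
		}
	}

	return 0;

err_lrc:
	/*
	 * q->ops->fini(q) inside __xe_exec_queue_fini() removes q from
	 * exec_queue_lookup before the (possibly partial) LRC array is
	 * freed, closing the race with
	 * xe_exec_queue_contexts_hwsp_rebase().
	 */
	__xe_exec_queue_fini(q);
	return err;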

Matt

> 
> -Tomasz
> 
> > 
> > Matt
> > 
> > >   		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
> > >   		xe_lrc_update_hwctx_regs_with_address(lrc);
> > >   		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
> > > +		xe_lrc_put(lrc);
> > >   		if (err)
> > >   			break;
> > >   	}
> > > -- 
> > > 2.25.1
> > > 
