Intel-XE Archive on lore.kernel.org
From: "Lis, Tomasz" <tomasz.lis@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: intel-xe@lists.freedesktop.org,
	"Michał Winiarski" <michal.winiarski@intel.com>,
	"Michał Wajdeczko" <michal.wajdeczko@intel.com>,
	"Piotr Piórkowski" <piotr.piorkowski@intel.com>
Subject: Re: [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups
Date: Fri, 20 Feb 2026 16:20:25 +0100	[thread overview]
Message-ID: <ceaa1492-0e88-4110-917e-a93a318a8eb2@intel.com> (raw)
In-Reply-To: <aZdds4C5AZyUQfBm@lstrano-desk.jf.intel.com>



On 2/19/2026 8:00 PM, Matthew Brost wrote:
> On Thu, Feb 19, 2026 at 12:21:55AM +0100, Tomasz Lis wrote:
>> There is a small but non-zero chance that fixups are running on
>> a context during teardown. The chance is decreased by starting
>> the teardown with releasing the guc_id, but remains non-zero.
>> On the other hand, the sync between fixups and context creation
>> drastically increases the chance of such parallel teardown if
>> context creation fails.
>>
> I don't see how this is possible.
>
> xe_exec_queue_contexts_hwsp_rebase only happens if the exec queue is
> present in &guc->submission_state.exec_queue_lookup.
>
>   332 static void __xe_exec_queue_fini(struct xe_exec_queue *q)
>   333 {
>   334         int i;
>   335
>   336         q->ops->fini(q);
>   337
>   338         for (i = 0; i < q->width; ++i)
>   339                 xe_lrc_put(q->lrc[i]);
>   340 }
>
> The removal from &guc->submission_state.exec_queue_lookup happens on line
> 336 in the above. Thus a xe_exec_queue_contexts_hwsp_rebase can't
> be executing on a 'q' after line 336 returns, and only then do we drop the
> references to the LRCs. I agree this lifetime is questionable at best
> (IIRC my GuC documentation explains why this works), but if there is
> a problem it should be fixed with this lifetime in mind.

Consider a situation: __xe_exec_queue_init() and
xe_exec_queue_contexts_hwsp_rebase() are running at the same time, on a
one-core VM, switching CPU contexts (each bullet is a context switch).

* __xe_exec_queue_init() passes `q->ops->init(q)` - the queue is added
to exec_queue_lookup, then it starts creating LRCs - it's a multi-LRC queue

* xe_exec_queue_contexts_hwsp_rebase() is executed on this new queue, 
starts the loop over LRCs

* __xe_exec_queue_init() fails to create the last of the LRCs, and jumps to
`err_lrc` where all the finalization is done - removal from
exec_queue_lookup and freeing of the already created LRCs

* CPU context switches back to xe_exec_queue_contexts_hwsp_rebase(), which
goes through pointers of the now freed LRCs, accessing their contents -
use-after-free / SEGFAULT.

(I used a single CPU core only to simplify the scenario - it could happen
on multi-core as well.)

> Looking at __xe_exec_queue_init, I believe 'err_lrc' label should
> actually call __xe_exec_queue_fini.

The __xe_exec_queue_fini() currently assumes that all LRC pointers are
non-NULL.

Do you mean adding such a check there? With it present, we could call that
function in `err_lrc`.

I see no issue with such a change, so let me know and I'll do it (assuming
we will not be adding any wait there, as hinted below).

>
>> Prevent LRC teardown in parallel with fixups by getting a reference.
>>
>> Signed-off-by: Tomasz Lis<tomasz.lis@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_exec_queue.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>> index 42849be46166..e9396ad3390a 100644
>> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
>> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>> @@ -1669,10 +1669,11 @@ int xe_exec_queue_contexts_hwsp_rebase(struct xe_exec_queue *q, void *scratch)
>>   		lrc = READ_ONCE(q->lrc[i]);
>>   		if (!lrc)
>>   			continue;
>> -
>> +		xe_lrc_get(lrc);
> This doesn't actually fix anything. The LRC could (currently, in error
> paths) disappear between the read and the get.

It is true that this is only narrowing the window rather than providing a
flawless fix. Still, narrowing the window is a substantial improvement
over ignoring the issue.

We could use xe_gt_sriov_vf_wait_valid_default_lrc() within `err_lrc:`
instead; that would allow a flawless fix. An advantage of the current
solution is that it keeps the complication within the recovery code, without
altering the common flow (by common flow I mean the queue creation flow
used for both PF and VF, regardless of whether VF migration is
possible). It also allows freeing the memory faster - if we've failed
LRC creation, it may be important to free resources as soon as possible.

Reading and writing local memory is substantially slower than a local
pointer read plus a refcount increase, so this way we're narrowing the
window considerably - by well over 95%, I'd estimate.

-Tomasz

>
> Matt
>
>>   		xe_lrc_update_memirq_regs_with_address(lrc, q->hwe, scratch);
>>   		xe_lrc_update_hwctx_regs_with_address(lrc);
>>   		err = xe_lrc_setup_wa_bb_with_scratch(lrc, q->hwe, scratch);
>> +		xe_lrc_put(lrc);
>>   		if (err)
>>   			break;
>>   	}
>> -- 
>> 2.25.1
>>


Thread overview: 19+ messages
2026-02-18 23:21 [PATCH v2 0/5] drm/xe/vf: Fix exec queue creation during post-migration recovery Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 1/5] drm/xe/queue: Call fini on exec queue creation fail Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 2/5] drm/xe/vf: Avoid LRC being freed while applying fixups Tomasz Lis
2026-02-19 19:00   ` Matthew Brost
2026-02-20 15:20     ` Lis, Tomasz [this message]
2026-02-20 16:20       ` Matthew Brost
2026-02-18 23:21 ` [PATCH v2 3/5] drm/xe/vf: Wait for default LRCs fixups before using Tomasz Lis
2026-02-19 20:16   ` Matthew Brost
2026-02-19 20:40     ` Matthew Brost
2026-02-20 17:20       ` Lis, Tomasz
2026-02-20 18:20         ` Matthew Brost
2026-02-18 23:21 ` [PATCH v2 4/5] drm/xe/vf: Redo LRC creation while in VF fixups Tomasz Lis
2026-02-18 23:21 ` [PATCH v2 5/5] drm/xe/vf: Use marker to catch fixups during LRC creation Tomasz Lis
2026-02-19 20:33   ` Matthew Brost
2026-02-20 16:43     ` Lis, Tomasz
2026-02-20 17:41       ` Matthew Brost
2026-02-18 23:34 ` ✓ CI.KUnit: success for drm/xe/vf: Fix exec queue creation during post-migration recovery (rev2) Patchwork
2026-02-19  0:35 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-19  1:49 ` ✗ Xe.CI.FULL: failure " Patchwork
