Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Dong, Zhanjun" <zhanjun.dong@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
	<intel-xe@lists.freedesktop.org>,
	Matthew Brost <matthew.brost@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Subject: Re: [PATCH v4] drm/xe/guc: Set wedged on Xe micro hardware initialization error
Date: Fri, 13 Jun 2025 12:30:56 -0400	[thread overview]
Message-ID: <454b4847-5654-49db-925b-68bba716061b@intel.com> (raw)
In-Reply-To: <48915a0c-f795-4b0a-96a5-1f2eb62cc301@intel.com>

Please see my comments inline below.

Regards,
Zhanjun Dong

On 2025-06-13 10:32 a.m., Michal Wajdeczko wrote:
> Hi,
> 
> maybe title should start with "drm/xe/uc:" instead?
Sure>
> On 13.06.2025 00:09, Zhanjun Dong wrote:
> 
> the commit message should also say "why" not just "what", see [1]
> 
> [1] https://docs.kernel.org/process/submitting-patches.html#explanation-body
> 
>> Declare wedged on Xe micro controller hardware initialization
>> failed.
> 
> failure ?
> error ?
Will change to error>
>>
>> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
>> Reviewed-by: Stuart Summers <stuart.summers@intel.com>
>>
> 
> no empty lines between tags!
> 
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4917
>>
>> ---
>> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> Change list:
>> v4: Fix typo and add new line
>> v3: v2 CI re-run
>> v2: Remove unnecessary jump to err-out
>>      Drop disable ct, switch to set wedge
>> ---
>>   drivers/gpu/drm/xe/xe_uc.c | 21 ++++++++++++++-------
>>   1 file changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
>> index 3a8751a8b92d..77175938fc40 100644
>> --- a/drivers/gpu/drm/xe/xe_uc.c
>> +++ b/drivers/gpu/drm/xe/xe_uc.c
>> @@ -157,19 +157,22 @@ static int vf_uc_init_hw(struct xe_uc *uc)
>>   
>>   	err = xe_guc_enable_communication(&uc->guc);
>>   	if (err)
>> -		return err;
>> +		goto err_out;
>>   
>>   	err = xe_gt_sriov_vf_connect(uc_to_gt(uc));
>>   	if (err)
>> -		return err;
>> +		goto err_out;
>>   
>>   	uc->guc.submission_state.enabled = true;
>>   
>>   	err = xe_gt_record_default_lrcs(uc_to_gt(uc));
>>   	if (err)
>> -		return err;
>> +		goto err_out;
>>   
>>   	return 0;
> 
> add separation line here please
> 
>> +err_out:
>> +	xe_device_declare_wedged(uc_to_xe(uc));
>> +	return err;
>>   }
>>   
>>   /*
>> @@ -197,19 +200,19 @@ int xe_uc_init_hw(struct xe_uc *uc)
>>   
>>   	ret = xe_guc_enable_communication(&uc->guc);
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	ret = xe_gt_record_default_lrcs(uc_to_gt(uc));
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	ret = xe_guc_post_load_init(&uc->guc);
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	ret = xe_guc_pc_start(&uc->guc.pc);
>>   	if (ret)
>> -		return ret;
>> +		goto err_out;
>>   
>>   	xe_guc_engine_activity_enable_stats(&uc->guc);
>>   
>> @@ -221,6 +224,10 @@ int xe_uc_init_hw(struct xe_uc *uc)
>>   	xe_gsc_load_start(&uc->gsc);
>>   
>>   	return 0;
>> +
>> +err_out:
>> +	xe_device_declare_wedged(uc_to_xe(uc));
> 
> hmm, but shouldn't this be really decided at the caller layer?
> 
> this is already done in this way by xe_gt.c:gt_reset() and it looks that
>   xe_gt.c:xe_gt_resume() or xe_device.c:xe_device_probe() are not
> handling that in the same way, so maybe fix should be there?
This function is on hardware initialization level, declare wedged on 
hardware initialization errors is reasonable.
I understand reasons do it at caller level, while if the reason is 
strong enough, then do it here might make it looks clean and straight 
forward.

> 
> and from the issue #4917 it looks that indeed the fix should be
> somewhere in xe_device_probe() or more specifically in the xe_gt_tlb
> code which missed to do proper unwind that would include explicit
> xe_gt_tlb_invalidation_reset() like done by the declare_wedged()
> 
> without that we could see that delayed TLB work hit UAF

Yes, this patch is not a fix for the issue.
For this issue, the solution should be split into 2 parts:
1. This patch, declare wedge on _hw_init error
2. Fix on xe_gt_tlb code.

For part 2, I already sent email to kernel-core team offline, I will 
forward that email to you guys, and might assign the issue (or create 
new issue?) to kernel-core team later to track part 2.

> 
> + Matt
> 
>> +	return ret;
>>   }
>>   
>>   int xe_uc_fini_hw(struct xe_uc *uc)
> 
> 


  reply	other threads:[~2025-06-13 16:31 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-12 22:09 [PATCH v4] drm/xe/guc: Set wedged on Xe micro hardware initialization error Zhanjun Dong
2025-06-13  4:16 ` ✓ CI.KUnit: success for drm/xe/guc: Set wedged on Xe micro hardware initialization error (rev3) Patchwork
2025-06-13  5:13 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-06-13 14:32 ` [PATCH v4] drm/xe/guc: Set wedged on Xe micro hardware initialization error Michal Wajdeczko
2025-06-13 16:30   ` Dong, Zhanjun [this message]
2025-06-14  6:50 ` ✗ Xe.CI.Full: failure for drm/xe/guc: Set wedged on Xe micro hardware initialization error (rev3) Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=454b4847-5654-49db-925b-68bba716061b@intel.com \
    --to=zhanjun.dong@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=stuart.summers@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox