From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
<intel-xe@lists.freedesktop.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Subject: Re: [PATCH] drm/xe/guc: Add support for NO_RESPONSE_BUSY in CTB
Date: Wed, 8 Apr 2026 10:01:59 -0700 [thread overview]
Message-ID: <7248b00f-35f3-4753-976d-952f1c4873a6@intel.com> (raw)
In-Reply-To: <4d27ebeb-c27e-4253-8799-f939754d047b@intel.com>
On 4/8/2026 6:54 AM, Michal Wajdeczko wrote:
>
> On 4/8/2026 12:18 AM, Daniele Ceraolo Spurio wrote:
>>
>> On 4/3/2026 1:44 PM, Michal Wajdeczko wrote:
>>> We only have support for G2H NO_RESPONSE_BUSY messages over MMIO,
>>> but it turned out that GuC also uses that type of messages in CTB.
>>>
>>> The following error was recently observed on BMG after adding VGT
>>> policy updates to the GT restart sequence:
>>>
>>> [] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT1: G2H channel broken on read, type=3, reset required
>>> [] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT1: CT dequeue failed: -95
>>> ...
>>> [] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT1: Timed out wait for G2H, fence 21965, action 5502, done no
>>> [] xe 0000:03:00.0: [drm] PF: Tile0: GT1: Failed to push 1 policy KLV (-ETIME)
>>> [] xe 0000:03:00.0: [drm] Tile0: GT1: { key 0x8004 : no value } # engine_group_config
>>>
>>> where type=3 was this unrecognized NO_RESPONSE_BUSY message.
>>>
>>> Note that GuC might send the real RESPONSE message right after
>>> the BUSY message, so we must be prepared to update our g2h_fence
>>> data twice before sender actually wakes up and clears the flags.
>>>
>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> ---
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>> Link: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-164119v2/shard-bmg-9/igt@xe_exec_reset@gt-reset.html
>>> ---
>>> drivers/gpu/drm/xe/xe_guc_ct.c | 29 +++++++++++++++++++++++++++--
>>> 1 file changed, 27 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>> index a11cff7a20be..19305acb98e4 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>> @@ -186,6 +186,7 @@ static void fast_req_track(struct xe_guc_ct *ct, u16 fence, u16 action) { }
>>> struct g2h_fence {
>>> u32 *response_buffer;
>>> u32 seqno;
>>> + /* fields below this point are setup based on the response */
>>> u32 response_data;
>>> u16 response_len;
>>> u16 error;
>>> @@ -193,6 +194,7 @@ struct g2h_fence {
>>> u16 reason;
>>> bool cancel;
>>> bool retry;
>>> + bool wait;
>>> bool fail;
>>> bool done;
>>> };
>>> @@ -204,6 +206,11 @@ static void g2h_fence_init(struct g2h_fence *g2h_fence, u32 *response_buffer)
>>> g2h_fence->seqno = ~0x0;
>>> }
>>> +static void g2h_fence_void(struct g2h_fence *g2h_fence)
>> I'm not convinced that g2h_fence_void is the correct function name here. Maybe g2h_fence_clear_response or something like that?
> hmm, the 'g2h_fence' name itself is IMO also little questionable ;)
>
> note that everything in this struct is 'response' related, so that
> 'response' in clear_response() may also sound redundant or at least
> mislead about impact of the clear
>
> being non-native, below 'void' meanings were working for me:
>
> "(verb) to remove the legal force from an agreement or contract
> "(verb) discharge or drain away (water, gases, etc.)
>
> but if clear_response() is more welcomed, I can respin the patch
My confusion with the wording was around the fact that normally once you
void a contract you can't re-use it, it's done.
>
> other candidates to consider:
>
> g2h_fence_reinit()
> g2h_fence_reset()
> g2h_fence_prepare()
> g2h_fence_empty()
I considered suggensting reinit, but we keep the seqno and the buffer so
it isn't a full reinit. g2h_fence_reset might be ok.
>
>>> +{
>>> + memset_after(g2h_fence, 0, seqno);
>>> +}
>>> +
>>> static void g2h_fence_cancel(struct g2h_fence *g2h_fence)
>>> {
>>> g2h_fence->cancel = true;
>>> @@ -1331,6 +1338,7 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>> /* READ_ONCEs pairs with WRITE_ONCEs in parse_g2h_response
>>> * and g2h_fence_cancel.
>>> */
>>> +wait_again:
>>> ret = wait_event_timeout(ct->g2h_fence_wq, READ_ONCE(g2h_fence.done), HZ);
>>> if (!ret) {
>>> LNL_FLUSH_WORK(&ct->g2h_worker);
>>> @@ -1356,6 +1364,12 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>> return -ETIME;
>>> }
>>> + if (g2h_fence.wait) {
>>> + xe_gt_dbg(gt, "H2G action %#x busy...\n", action[0]);
>>> + g2h_fence_void(&g2h_fence);
>>> + mutex_unlock(&ct->lock);
>>> + goto wait_again;
>>> + }
>>> if (g2h_fence.retry) {
>>> xe_gt_dbg(gt, "H2G action %#x retrying: reason %#x\n",
>>> action[0], g2h_fence.reason);
>>> @@ -1508,7 +1522,12 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>> return -EPROTO;
>>> }
>>> - g2h_fence = xa_erase(&ct->fence_lookup, fence);
>>> + /* don't erase as we still expect a final response with the same fence */
>>> + if (type == GUC_HXG_TYPE_NO_RESPONSE_BUSY)
>>> + g2h_fence = xa_load(&ct->fence_lookup, fence);
>>> + else
>>> + g2h_fence = xa_erase(&ct->fence_lookup, fence);
>>> +
>>> if (unlikely(!g2h_fence)) {
>> if we hit this error with a NO_RESPONSE_BUSY we'll release the memory with the fence still in the xa, which seems wrong.
> but NULL here would mean that the fence wasn't in the xa already
D'oh, that's true. My bad.
>
> it had to be either removed during earlier processing of some other
> non-BUSY G2H message with the same fence or it was removed by the
> caller due to a timeout or ... the incoming fence is completely unexpected
>
> and IMO we are not quite good in handling this last case and after
> xe_gt_warn() we might release space that was never reserved by us
>
> but that's not related to this patch
>
>>> /* Don't tear down channel, as send could've timed out */
>>> /* CT_DEAD(ct, NULL, PARSE_G2H_UNKNOWN); */
>>> @@ -1518,6 +1537,7 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>> }
>>> xe_gt_assert(gt, fence == g2h_fence->seqno);
>>> + g2h_fence_void(g2h_fence);
>> Is this here because we might be parsing the G2H with the actual response before the waiter has had time to process the initial BUSY response? It might be worth adding a comment to explain that.
> yes, and it's already mentioned in the commit message:
>
> " Note that GuC might send the real RESPONSE message right after
> " the BUSY message, so we must be prepared to update our g2h_fence
> " data twice before sender actually wakes up and clears the flags.
sure, but in-code it isn't immediately clear that this is there to cover
for that case, hence why IMO a comment would help.
Daniele
>
>> Daniele
>>
>>> if (type == GUC_HXG_TYPE_RESPONSE_FAILURE) {
>>> g2h_fence->fail = true;
>>> @@ -1526,6 +1546,9 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>> } else if (type == GUC_HXG_TYPE_NO_RESPONSE_RETRY) {
>>> g2h_fence->retry = true;
>>> g2h_fence->reason = FIELD_GET(GUC_HXG_RETRY_MSG_0_REASON, hxg[0]);
>>> + } else if (type == GUC_HXG_TYPE_NO_RESPONSE_BUSY) {
>>> + g2h_fence->wait = true;
>>> + g2h_fence->reason = FIELD_GET(GUC_HXG_BUSY_MSG_0_COUNTER, hxg[0]);
>>> } else if (g2h_fence->response_buffer) {
>>> g2h_fence->response_len = hxg_len;
>>> memcpy(g2h_fence->response_buffer, hxg, hxg_len * sizeof(u32));
>>> @@ -1533,7 +1556,8 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>> g2h_fence->response_data = FIELD_GET(GUC_HXG_RESPONSE_MSG_0_DATA0, hxg[0]);
>>> }
>>> - g2h_release_space(ct, GUC_CTB_HXG_MSG_MAX_LEN);
>>> + if (!g2h_fence->wait)
>>> + g2h_release_space(ct, GUC_CTB_HXG_MSG_MAX_LEN);
>>> /* WRITE_ONCE pairs with READ_ONCEs in guc_ct_send_recv. */
>>> WRITE_ONCE(g2h_fence->done, true);
>>> @@ -1570,6 +1594,7 @@ static int parse_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>> case GUC_HXG_TYPE_RESPONSE_SUCCESS:
>>> case GUC_HXG_TYPE_RESPONSE_FAILURE:
>>> case GUC_HXG_TYPE_NO_RESPONSE_RETRY:
>>> + case GUC_HXG_TYPE_NO_RESPONSE_BUSY:
>>> ret = parse_g2h_response(ct, msg, len);
>>> break;
>>> default:
prev parent reply other threads:[~2026-04-08 17:02 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-03 20:44 [PATCH] drm/xe/guc: Add support for NO_RESPONSE_BUSY in CTB Michal Wajdeczko
2026-04-03 20:50 ` ✗ CI.checkpatch: warning for " Patchwork
2026-04-03 20:51 ` ✓ CI.KUnit: success " Patchwork
2026-04-03 21:27 ` ✓ Xe.CI.BAT: " Patchwork
2026-04-03 22:31 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-04-07 22:18 ` [PATCH] " Daniele Ceraolo Spurio
2026-04-08 13:54 ` Michal Wajdeczko
2026-04-08 17:01 ` Daniele Ceraolo Spurio [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7248b00f-35f3-4753-976d-952f1c4873a6@intel.com \
--to=daniele.ceraolospurio@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=matthew.brost@intel.com \
--cc=michal.wajdeczko@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox