Intel-XE Archive on lore.kernel.org
From: John Harrison <john.c.harrison@intel.com>
To: Matthew Auld <matthew.auld@intel.com>, <intel-xe@lists.freedesktop.org>
Cc: Matthew Brost <matthew.brost@intel.com>,
	Nirmoy Das <nirmoy.das@intel.com>
Subject: Re: [PATCH] drm/xe/guc_submit: improve schedule disable error logging
Date: Mon, 30 Sep 2024 15:48:07 -0700	[thread overview]
Message-ID: <16c9b1b6-c857-4053-9ec5-6b7096603b04@intel.com> (raw)
In-Reply-To: <7e64c52d-3b38-42eb-8f63-ad6c37ef225f@intel.com>

On 9/30/2024 03:00, Matthew Auld wrote:
> On 28/09/2024 00:05, John Harrison wrote:
>> On 9/27/2024 06:35, Matthew Auld wrote:
>>> A few things here. Make the two prints consistent (and distinct), print
>>> the guc_id, and finally dump the CT queues. It should be possible to
>>> spot the guc_id in the CT queue dump, and for example see that the
>>> host side has yet to process the response for the schedule disable, or
>>> see that the GuC is yet to send it, to help narrow things down if we
>>> trigger the timeout.
>> Where are you seeing these failures? Is there an understanding of
>> why? Or is this patch basically a "we have no idea what is going on,
>> so get better logs out of CI" type thing? In which case what you
>> really want is to generate a devcoredump (with my debug improvements
>> patch set to include the GuC log and such like) and to get CI to give
>> you the core dumps back.
>
> Yeah, patch is "we have no idea what is going on, so get better logs 
> out of CI".
>
> From https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638, one 
> example failure: 
> https://intel-gfx-ci.01.org/tree/intel-xe/xe-1873-c689a348137cb6f8934a9be49438bafe413b97d5/re-bmg-5/igt@xe_exec_fault_mode@many-execqueues-invalid-userptr-fault.html
>
> devcoredump wired up to CI with everything thrown in sounds good.
Just in general, it would probably be worth generating a devcoredump on 
this failure anyway. You actually have access to a 'q' object at this 
point, so just calling the existing devcoredump code is trivial. 
Although we really need to get 
https://patchwork.freedesktop.org/series/134695/ reviewed and merged for 
the dump to be particularly useful in this kind of 'GuC did not respond' 
error.

But if the bug's reported repro rate of 29% can be believed then it
really should be possible to repro this locally and get all the logs
out, and even to try a flush-work fix/hack to see if that is the
problem.

>
>>
>> And maybe this is related to the fix from Badal: "drm/xe/guc: In 
>> guc_ct_send_recv flush g2h worker if g2h resp times out"? We have 
>> seen problems where the worker is simply not getting to run before 
>> the timeout expires.
>
> I don't think the schedule disable is using the guc_ct_send_recv()
> interface, so I don't think it is related, but I'm not 100% sure.
That just means that it won't benefit from the same fix (aka hack). It 
is entirely possible it is still suffering from the worker thread not 
running in a timely manner. But it would need its own explicit flush and 
retry prior to returning the timeout as it is a different code path.

Although, as Matthew B says, if we are seeing the worker being delayed
for a second or more on a regular basis then it suggests that something
is badly wrong somewhere. Linux is not a realtime OS, but that kind of
system burp should not be this frequent!

John.

>
>>
>> John.
>>
>>>
>>> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638
>>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Nirmoy Das <nirmoy.das@intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_guc_submit.c | 17 ++++++++++++++---
>>>   1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 80062e1d3f66..52ed7c0043f9 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -977,7 +977,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>>>                        !exec_queue_pending_disable(q) ||
>>>                        guc_read_stopped(guc), HZ * 5);
>>>           if (!ret) {
>>> -            drm_warn(&xe->drm, "Schedule disable failed to respond");
>>> +            struct xe_gt *gt = guc_to_gt(guc);
>>> +            struct drm_printer p = xe_gt_err_printer(gt);
>>> +
>>> +            xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>>> +                   __func__, q->guc->id);
>>> +            xe_guc_ct_print(&guc->ct, &p, false);
>>>               xe_sched_submission_start(sched);
>>>               xe_gt_reset_async(q->gt);
>>>               return;
>>> @@ -1177,8 +1182,14 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>>                        guc_read_stopped(guc), HZ * 5);
>>>           if (!ret || guc_read_stopped(guc)) {
>>>   trigger_reset:
>>> -            if (!ret)
>>> -                xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
>>> +            if (!ret) {
>>> +                struct xe_gt *gt = guc_to_gt(guc);
>>> +                struct drm_printer p = xe_gt_err_printer(gt);
>>> +
>>> +                xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>>> +                       __func__, q->guc->id);
>>> +                xe_guc_ct_print(&guc->ct, &p, true);
>>> +            }
>>>               set_exec_queue_extra_ref(q);
>>>               xe_exec_queue_get(q);    /* GT reset owns this */
>>>               set_exec_queue_banned(q);
>>


Thread overview: 15+ messages
2024-09-27 13:35 [PATCH] drm/xe/guc_submit: improve schedule disable error logging Matthew Auld
2024-09-27 13:41 ` ✓ CI.Patch_applied: success for " Patchwork
2024-09-27 13:42 ` ✗ CI.checkpatch: warning " Patchwork
2024-09-27 13:43 ` ✓ CI.KUnit: success " Patchwork
2024-09-27 13:54 ` ✓ CI.Build: " Patchwork
2024-09-27 13:56 ` ✓ CI.Hooks: " Patchwork
2024-09-27 13:58 ` ✓ CI.checksparse: " Patchwork
2024-09-27 14:10 ` [PATCH] " Nirmoy Das
2024-09-27 14:16 ` ✓ CI.BAT: success for " Patchwork
2024-09-27 21:30 ` [PATCH] " Matthew Brost
2024-09-27 23:05 ` John Harrison
2024-09-28  2:39   ` Matthew Brost
2024-09-30 10:00   ` Matthew Auld
2024-09-30 22:48     ` John Harrison [this message]
2024-09-28  7:14 ` ✗ CI.FULL: failure for " Patchwork
