Intel-XE Archive on lore.kernel.org
From: Matthew Auld <matthew.auld@intel.com>
To: John Harrison <john.c.harrison@intel.com>,
	intel-xe@lists.freedesktop.org
Cc: Matthew Brost <matthew.brost@intel.com>,
	Nirmoy Das <nirmoy.das@intel.com>
Subject: Re: [PATCH] drm/xe/guc_submit: improve schedule disable error logging
Date: Mon, 30 Sep 2024 11:00:55 +0100
Message-ID: <7e64c52d-3b38-42eb-8f63-ad6c37ef225f@intel.com>
In-Reply-To: <e2f109e3-d34c-4461-bfda-910965a14ce9@intel.com>

On 28/09/2024 00:05, John Harrison wrote:
> On 9/27/2024 06:35, Matthew Auld wrote:
>> A few things here. Make the two prints consistent (and distinct), print
>> the guc_id, and finally dump the CT queues. It should be possible to
>> spot the guc_id in the CT queue dump, and for example see that host side
>> has yet to process the response for the schedule disable, or see that
>> GuC is yet to send it, to help narrow things down if we trigger the
>> timeout.
> Where are you seeing these failures? Is there an understanding of why? 
> Or is this patch basically a "we have no idea what is going on, so get 
> better logs out of CI" type thing? In which case what you really want is to 
> generate a devcoredump (with my debug improvements patch set to include 
> the GuC log and such like) and to get CI to give you the core dumps back.

Yeah, patch is "we have no idea what is going on, so get better logs out 
of CI".

From https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638, one
example failure:
https://intel-gfx-ci.01.org/tree/intel-xe/xe-1873-c689a348137cb6f8934a9be49438bafe413b97d5/re-bmg-5/igt@xe_exec_fault_mode@many-execqueues-invalid-userptr-fault.html

devcoredump wired up to CI with everything thrown in sounds good.
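
For anyone reading along without the driver source open, here is a
rough userspace sketch of the pattern both hunks in the patch touch:
wait for the pending-disable state to clear, treat a zero return as a
timeout, and tag the warning with the caller name and guc_id so the two
call sites can be told apart in the log. Everything below (struct
fake_queue, wait_disable_done(), format_timeout_warning()) is a
hypothetical stand-in for illustration and not xe driver code; only the
message format is taken from the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an exec queue; only what the sketch needs. */
struct fake_queue {
	int guc_id;
	bool pending_disable;
};

/*
 * Poll until pending_disable clears or the poll budget runs out.
 * Returns the remaining budget, so 0 means "timed out". This mirrors
 * the wait_event_timeout() convention of returning 0 on timeout.
 */
static int wait_disable_done(const struct fake_queue *q, int max_polls)
{
	while (max_polls > 0 && q->pending_disable)
		max_polls--;
	return max_polls;
}

/* Build the warning as in the patch: caller name plus guc_id. */
static int format_timeout_warning(char *buf, size_t len,
				  const char *func, int guc_id)
{
	return snprintf(buf, len,
			"%s schedule disable failed to respond guc_id=%d",
			func, guc_id);
}
```

In the real driver the caller name comes from __func__ and the message
is followed by the CT queue dump; the sketch only shows why a guc_id in
the message gives something to search for in that dump.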

> 
> And maybe this is related to the fix from Badal: "drm/xe/guc: In 
> guc_ct_send_recv flush g2h worker if g2h resp times out"? We have seen 
> problems where the worker is simply not getting to run before the 
> timeout expires.

I don't think the schedule disable path uses the guc_ct_send_recv()
interface, so I don't think it is related, but I'm not 100% sure.

> 
> John.
> 
>>
>> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_guc_submit.c | 17 ++++++++++++++---
>>   1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c 
>> b/drivers/gpu/drm/xe/xe_guc_submit.c
>> index 80062e1d3f66..52ed7c0043f9 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>> @@ -977,7 +977,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct 
>> work_struct *w)
>>                        !exec_queue_pending_disable(q) ||
>>                        guc_read_stopped(guc), HZ * 5);
>>           if (!ret) {
>> -            drm_warn(&xe->drm, "Schedule disable failed to respond");
>> +            struct xe_gt *gt = guc_to_gt(guc);
>> +            struct drm_printer p = xe_gt_err_printer(gt);
>> +
>> +            xe_gt_warn(gt, "%s schedule disable failed to respond 
>> guc_id=%d",
>> +                   __func__, ge->id);
>> +            xe_guc_ct_print(&guc->ct, &p, false);
>>               xe_sched_submission_start(sched);
>>               xe_gt_reset_async(q->gt);
>>               return;
>> @@ -1177,8 +1182,14 @@ guc_exec_queue_timedout_job(struct 
>> drm_sched_job *drm_job)
>>                        guc_read_stopped(guc), HZ * 5);
>>           if (!ret || guc_read_stopped(guc)) {
>>   trigger_reset:
>> -            if (!ret)
>> -                xe_gt_warn(guc_to_gt(guc), "Schedule disable failed 
>> to respond");
>> +            if (!ret) {
>> +                struct xe_gt *gt = guc_to_gt(guc);
>> +                struct drm_printer p = xe_gt_err_printer(gt);
>> +
>> +                xe_gt_warn(gt, "%s schedule disable failed to respond 
>> guc_id=%d",
>> +                       __func__, q->guc->id);
>> +                xe_guc_ct_print(&guc->ct, &p, true);
>> +            }
>>               set_exec_queue_extra_ref(q);
>>               xe_exec_queue_get(q);    /* GT reset owns this */
>>               set_exec_queue_banned(q);
> 
