Re: [PATCH v7 07/10] drm/xe/guc: Dead CT helper

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: John Harrison <john.c.harrison@intel.com>
To: Julia Filipchuk <julia.filipchuk@intel.com>,
	<Intel-Xe@Lists.FreeDesktop.Org>
Subject: Re: [PATCH v7 07/10] drm/xe/guc: Dead CT helper
Date: Wed, 11 Sep 2024 13:13:49 -0700	[thread overview]
Message-ID: <b56b7826-55e7-4459-a5e5-8472975cbb3e@intel.com> (raw)
In-Reply-To: <45901fbc-56bf-4f9e-8044-eb83e24f02f4@intel.com>

On 9/11/2024 12:55, Julia Filipchuk wrote:
> On 9/5/2024 1:51 PM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Add a worker function helper for asynchronously dumping state when an
>> internal/fatal error is detected in CT processing. Being asynchronous
>> is required to avoid deadlocks and scheduling-while-atomic or
>> process-stalled-for-too-long issues. Also check for a bunch more error
>> conditions and improve the handling of some existing checks.
>>
>> v2: Use compile time CONFIG check for new (but not directly CT_DEAD
>> related) checks and use unsigned int for a bitmask, rename
>> CT_DEAD_RESET to CT_DEAD_REARM and add some explaining comments,
>> rename 'hxg' macro parameter to 'ctb' - review feedback from Michal W.
>> Drop CT_DEAD_ALIVE as no need for a bitfield define to just set the
>> entire mask to zero.
>> v3: Fix kerneldoc
>> v4: Nullify some floating pointers after free.
>> v5: Add section headings and device info to make the state dump look
>> more like a devcoredump to allow parsing by the same tools (eventual
>> aim is to just call the devcoredump code itself, but that currently
>> requires an xe_sched_job, which is not available in the CT code).
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   .../drm/xe/abi/guc_communication_ctb_abi.h    |   1 +
>>   drivers/gpu/drm/xe/xe_guc.c                   |   2 +-
>>   drivers/gpu/drm/xe/xe_guc_ct.c                | 280 ++++++++++++++++--
>>   drivers/gpu/drm/xe/xe_guc_ct.h                |   2 +-
>>   drivers/gpu/drm/xe/xe_guc_ct_types.h          |  23 ++
>>   5 files changed, 280 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
>> index 8f86a16dc577..f58198cf2cf6 100644
>> --- a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
>> +++ b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
>> @@ -52,6 +52,7 @@ struct guc_ct_buffer_desc {
>>   #define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
>>   #define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
>>   #define GUC_CTB_STATUS_MISMATCH				(1 << 2)
>> +#define GUC_CTB_STATUS_DISABLED				(1 << 3)
>>   	u32 reserved[13];
>>   } __packed;
>>   static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>> index 34cdb08b6e27..3fef24c965c4 100644
>> --- a/drivers/gpu/drm/xe/xe_guc.c
>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>> @@ -1176,7 +1176,7 @@ void xe_guc_print_info(struct xe_guc *guc, struct drm_printer *p)
>>   
>>   	xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>   
>> -	xe_guc_ct_print(&guc->ct, p, false);
>> +	xe_guc_ct_print(&guc->ct, p);
>>   	xe_guc_submit_print(guc, p);
>>   }
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>> index a63fe0a9077a..e31b1f0b855f 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -25,12 +25,57 @@
>>   #include "xe_gt_sriov_pf_monitor.h"
>>   #include "xe_gt_tlb_invalidation.h"
>>   #include "xe_guc.h"
>> +#include "xe_guc_log.h"
>>   #include "xe_guc_relay.h"
>>   #include "xe_guc_submit.h"
>>   #include "xe_map.h"
>>   #include "xe_pm.h"
>>   #include "xe_trace_guc.h"
>>   
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +enum {
>> +	CT_DEAD_REARM,				/* 0x0001 - not an error condition */
>> +	CT_DEAD_SETUP,				/* 0x0002 */
>> +	CT_DEAD_H2G_WRITE,			/* 0x0004 */
>> +	CT_DEAD_H2G_HAS_ROOM,			/* 0x0008 */
>> +	CT_DEAD_G2H_READ,			/* 0x0010 */
>> +	CT_DEAD_G2H_RECV,			/* 0x0020 */
>> +	CT_DEAD_G2H_RELEASE,			/* 0x0040 */
>> +	CT_DEAD_DEADLOCK,			/* 0x0080 */
>> +	CT_DEAD_PROCESS_FAILED,			/* 0x0100 */
>> +	CT_DEAD_FAST_G2H,			/* 0x0200 */
>> +	CT_DEAD_PARSE_G2H_RESPONSE,		/* 0x0400 */
>> +	CT_DEAD_PARSE_G2H_UNKNOWN,		/* 0x0800 */
>> +	CT_DEAD_PARSE_G2H_ORIGIN,		/* 0x1000 */
>> +	CT_DEAD_PARSE_G2H_TYPE,			/* 0x2000 */
>> +};
>> +
>> +static void ct_dead_worker_func(struct work_struct *w);
>> +
>> +#define CT_DEAD(ct, ctb, reason_code)								\
>> +	do {											\
>> +		struct guc_ctb *_ctb = (ctb);							\
>> +		if (_ctb)									\
>> +			_ctb->info.broken = true;						\
>> +		if (!(ct)->dead.reported) {							\
> Do we need to worry about a second dead ct causing conflict with quick
> back-to-back CT_DEAD? Suggest to set reported here and to clear it only
> from worker thread. Snapshot can be used instead of reported to
> determine if it has been printed (in worker_func).
No. We are only really interested in the intial cause of a problem. 
Subsequent failures are likely to be caused by the first. So once a 
report has been printed, we just ignore any further failures until a 
reset has happened.

> Should the snapshot be taken in the "CT_DEAD_REARM" case?
No. That is the reset that turns the reporting system back on. I.e. it 
re-arms the reporting trigger.
>> +			struct xe_guc *guc = ct_to_guc(ct);					\
>> +			spin_lock_irq(&ct->dead.lock);						\
>> +			(ct)->dead.reason |= 1 << CT_DEAD_##reason_code;			\
> Does this field need to be cleared or does this accumulate reasons CT died?
This is to accumulate in the case of multiple errors in rapid succession 
(i.e. before it has managed to do the dump) to account for the 
possibility that the errors might not be noticed in the order they 
really happened. So it accumulates everything up to the first dump and 
then goes quite.

>
>> +			(ct)->dead.snapshot_log = xe_guc_log_snapshot_capture(&guc->log, true);	\
>> +			(ct)->dead.snapshot_ct = xe_guc_ct_snapshot_capture((ct), true);	\
>> +			spin_unlock_irq(&ct->dead.lock);					\
>> +			queue_work(system_unbound_wq, &(ct)->dead.worker);			\
>> +		}										\
>> +	} while (0)
>> +#else
>> +#define CT_DEAD(ct, ctb, reason)			\
>> +	do {						\
>> +		struct guc_ctb *_ctb = (ctb);		\
>> +		if (_ctb)				\
>> +			_ctb->info.broken = true;	\
>> +	} while (0)
>> +#endif
>> +
>>   /* Used when a CT send wants to block and / or receive data */
>>   struct g2h_fence {
>>   	u32 *response_buffer;
>> @@ -183,6 +228,10 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
>>   	xa_init(&ct->fence_lookup);
>>   	INIT_WORK(&ct->g2h_worker, g2h_worker_func);
>>   	INIT_DELAYED_WORK(&ct->safe_mode_worker, safe_mode_worker_func);
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +	spin_lock_init(&ct->dead.lock);
>> +	INIT_WORK(&ct->dead.worker, ct_dead_worker_func);
>> +#endif
>>   	init_waitqueue_head(&ct->wq);
>>   	init_waitqueue_head(&ct->g2h_fence_wq);
>>   
>> @@ -419,10 +468,22 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>>   	if (ct_needs_safe_mode(ct))
>>   		ct_enter_safe_mode(ct);
>>   
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +	/*
>> +	 * The CT has now been reset so the dumper can be re-armed
>> +	 * after any existing dead state has been dumped.
>> +	 */
>> +	spin_lock_irq(&ct->dead.lock);
>> +	if (ct->dead.reason)
>> +		ct->dead.reason |= CT_DEAD_REARM;
>> +	spin_unlock_irq(&ct->dead.lock);
>> +#endif
>> +
>>   	return 0;
>>   
>>   err_out:
>>   	xe_gt_err(gt, "Failed to enable GuC CT (%pe)\n", ERR_PTR(err));
>> +	CT_DEAD(ct, NULL, SETUP);
>>   
>>   	return err;
>>   }
>> @@ -773,8 +895,13 @@ static int guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>   			goto broken;
>>   #undef g2h_avail
>>   
>> -		if (dequeue_one_g2h(ct) < 0)
>> +		ret = dequeue_one_g2h(ct);
>> +		if (ret < 0) {
>> +			if (ret != -ECANCELED)
>> +				xe_gt_err(ct_to_gt(ct), "CTB receive failed (%pe)",
>> +					  ERR_PTR(ret));
> Is it correct there is no success condition here? Would canceled case
> need to route to try_again?
Not sure what you mean. This whole block is inside an EBUSY error path. 
As part of the retry, it is trying to make space by clearing out G2H 
entries. The ECANCELED error is because a reset or something known 
happened, therefore there is no need to re-report it. If the dequeue was 
successful then it continues with the retry by hitting the "goto try_again".

>>   			goto broken;
>> +		}
>>   
>>   		goto try_again;
>>   	}
>
>
>
>
>> +
>> +static void ct_dead_worker_func(struct work_struct *w)
>> +{
>> +	struct xe_guc_ct *ct = container_of(w, struct xe_guc_ct, dead.worker);
>> +
>> +	if (!ct->dead.reported) {
>> +		ct->dead.reported = true;> +		ct_dead_print(&ct->dead);
>> +	}
> Would this guard work against quick back-to-back CT_DEAD calls? This may
> not happen? Suggest to move 'ct->dead.reported = true;' into the CT_DEAD
> macro. Relates to CT_DEAD macro above.
As above, there is no concern about wanting to do back to back dumps. 
Once a dump has happened, there is no point printing anything further 
until a GT reset has happened. At which point everything is re-enabled.

>
>
> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
>
Again, you should not give an r-b when there are many concerns being 
raised. An r-b says you approve of the code as it is and are happy for 
it to be merged.

John.

next prev parent reply	other threads:[~2024-09-11 20:13 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-05 20:50 [PATCH v7 00/10] drm/xe/guc: Improve GuC log dumping and add to devcoredump John.C.Harrison
2024-09-05 20:50 ` [PATCH v7 01/10] drm/xe/guc: Remove spurious line feed in debug print John.C.Harrison
2024-09-10 19:32   ` Julia Filipchuk
2024-09-05 20:50 ` [PATCH v7 02/10] drm/xe/devcoredump: Add a section heading for the submission backend John.C.Harrison
2024-09-10 19:33   ` Julia Filipchuk
2024-09-05 20:50 ` [PATCH v7 03/10] drm/xe/devcoredump: Add ASCII85 dump helper function John.C.Harrison
2024-09-06  1:54   ` Lucas De Marchi
2024-09-06  2:01     ` John Harrison
2024-09-06  3:04       ` Lucas De Marchi
2024-09-07  2:06         ` John Harrison
2024-09-10  1:31           ` John Harrison
2024-09-10 19:43             ` Lucas De Marchi
2024-09-10 20:17               ` John Harrison
2024-09-11 19:12                 ` Lucas De Marchi
2024-09-11 19:30                   ` Souza, Jose
2024-09-11 19:35                     ` John Harrison
2024-09-11 19:54                       ` Souza, Jose
2024-09-11 19:59                         ` John Harrison
2024-09-12 13:57                           ` Rodrigo Vivi
2024-09-12 18:06                             ` John Harrison
2024-09-16 15:32                               ` Rodrigo Vivi
2024-09-16 17:46                                 ` John Harrison
2024-09-11 19:31                   ` John Harrison
2024-09-10 19:33   ` Julia Filipchuk
2024-09-11  1:27     ` John Harrison
2024-09-05 20:50 ` [PATCH v7 04/10] drm/xe/guc: Copy GuC log prior to dumping John.C.Harrison
2024-09-11  0:48   ` Julia Filipchuk
2024-09-05 20:51 ` [PATCH v7 05/10] drm/xe/guc: Use a two stage dump for GuC logs and add more info John.C.Harrison
2024-09-11  0:48   ` Julia Filipchuk
2024-09-11  1:14     ` John Harrison
2024-09-05 20:51 ` [PATCH v7 06/10] drm/print: Introduce drm_line_printer John.C.Harrison
2024-09-05 20:51 ` [PATCH v7 07/10] drm/xe/guc: Dead CT helper John.C.Harrison
2024-09-11  0:09   ` John Harrison
2024-09-11 19:55   ` Julia Filipchuk
2024-09-11 20:13     ` John Harrison [this message]
2024-09-11 20:57       ` Julia Filipchuk
2024-09-05 20:51 ` [PATCH v7 08/10] drm/xe/guc: Dump entire CTB on errors John.C.Harrison
2024-09-11 20:12   ` Julia Filipchuk
2024-09-05 20:51 ` [PATCH v7 09/10] drm/xe/guc: Add GuC log to devcoredump captures John.C.Harrison
2024-09-11 20:25   ` Julia Filipchuk
2024-09-05 20:51 ` [PATCH v7 10/10] drm/xe/guc: Add a helper function for dumping GuC log to dmesg John.C.Harrison
2024-09-11 20:36   ` Julia Filipchuk
2024-09-11 20:41     ` John Harrison
2024-09-05 20:57 ` ✓ CI.Patch_applied: success for drm/xe/guc: Improve GuC log dumping and add to devcoredump (rev2) Patchwork
2024-09-05 20:57 ` ✗ CI.checkpatch: warning " Patchwork
2024-09-05 20:58 ` ✓ CI.KUnit: success " Patchwork
2024-09-05 21:10 ` ✓ CI.Build: " Patchwork
2024-09-05 21:13 ` ✓ CI.Hooks: " Patchwork
2024-09-05 21:14 ` ✗ CI.checksparse: warning " Patchwork
2024-09-05 22:06 ` ✗ CI.BAT: failure " Patchwork
2024-09-08  0:02 ` ✗ CI.FULL: " Patchwork
2024-09-12  9:16 ` [PATCH v7 00/10] drm/xe/guc: Improve GuC log dumping and add to devcoredump Jani Nikula
2024-09-12 18:50   ` John Harrison
2024-09-13  7:26     ` Jani Nikula

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b56b7826-55e7-4459-a5e5-8472975cbb3e@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=Intel-Xe@Lists.FreeDesktop.Org \
    --cc=julia.filipchuk@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox