Intel-XE Archive on lore.kernel.org
From: Julia Filipchuk <julia.filipchuk@intel.com>
To: <John.C.Harrison@Intel.com>, <Intel-Xe@Lists.FreeDesktop.Org>
Subject: Re: [PATCH v7 07/10] drm/xe/guc: Dead CT helper
Date: Wed, 11 Sep 2024 12:55:36 -0700	[thread overview]
Message-ID: <45901fbc-56bf-4f9e-8044-eb83e24f02f4@intel.com> (raw)
In-Reply-To: <20240905205106.1063091-8-John.C.Harrison@Intel.com>

On 9/5/2024 1:51 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Add a worker function helper for asynchronously dumping state when an
> internal/fatal error is detected in CT processing. Being asynchronous
> is required to avoid deadlocks and scheduling-while-atomic or
> process-stalled-for-too-long issues. Also check for a bunch more error
> conditions and improve the handling of some existing checks.
> 
> v2: Use compile time CONFIG check for new (but not directly CT_DEAD
> related) checks and use unsigned int for a bitmask, rename
> CT_DEAD_RESET to CT_DEAD_REARM and add some explaining comments,
> rename 'hxg' macro parameter to 'ctb' - review feedback from Michal W.
> Drop CT_DEAD_ALIVE as no need for a bitfield define to just set the
> entire mask to zero.
> v3: Fix kerneldoc
> v4: Nullify some floating pointers after free.
> v5: Add section headings and device info to make the state dump look
> more like a devcoredump to allow parsing by the same tools (eventual
> aim is to just call the devcoredump code itself, but that currently
> requires an xe_sched_job, which is not available in the CT code).
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  .../drm/xe/abi/guc_communication_ctb_abi.h    |   1 +
>  drivers/gpu/drm/xe/xe_guc.c                   |   2 +-
>  drivers/gpu/drm/xe/xe_guc_ct.c                | 280 ++++++++++++++++--
>  drivers/gpu/drm/xe/xe_guc_ct.h                |   2 +-
>  drivers/gpu/drm/xe/xe_guc_ct_types.h          |  23 ++
>  5 files changed, 280 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
> index 8f86a16dc577..f58198cf2cf6 100644
> --- a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
> +++ b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
> @@ -52,6 +52,7 @@ struct guc_ct_buffer_desc {
>  #define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
>  #define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
>  #define GUC_CTB_STATUS_MISMATCH				(1 << 2)
> +#define GUC_CTB_STATUS_DISABLED				(1 << 3)
>  	u32 reserved[13];
>  } __packed;
>  static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> index 34cdb08b6e27..3fef24c965c4 100644
> --- a/drivers/gpu/drm/xe/xe_guc.c
> +++ b/drivers/gpu/drm/xe/xe_guc.c
> @@ -1176,7 +1176,7 @@ void xe_guc_print_info(struct xe_guc *guc, struct drm_printer *p)
>  
>  	xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>  
> -	xe_guc_ct_print(&guc->ct, p, false);
> +	xe_guc_ct_print(&guc->ct, p);
>  	xe_guc_submit_print(guc, p);
>  }
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index a63fe0a9077a..e31b1f0b855f 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -25,12 +25,57 @@
>  #include "xe_gt_sriov_pf_monitor.h"
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_guc.h"
> +#include "xe_guc_log.h"
>  #include "xe_guc_relay.h"
>  #include "xe_guc_submit.h"
>  #include "xe_map.h"
>  #include "xe_pm.h"
>  #include "xe_trace_guc.h"
>  
> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> +enum {
> +	CT_DEAD_REARM,				/* 0x0001 - not an error condition */
> +	CT_DEAD_SETUP,				/* 0x0002 */
> +	CT_DEAD_H2G_WRITE,			/* 0x0004 */
> +	CT_DEAD_H2G_HAS_ROOM,			/* 0x0008 */
> +	CT_DEAD_G2H_READ,			/* 0x0010 */
> +	CT_DEAD_G2H_RECV,			/* 0x0020 */
> +	CT_DEAD_G2H_RELEASE,			/* 0x0040 */
> +	CT_DEAD_DEADLOCK,			/* 0x0080 */
> +	CT_DEAD_PROCESS_FAILED,			/* 0x0100 */
> +	CT_DEAD_FAST_G2H,			/* 0x0200 */
> +	CT_DEAD_PARSE_G2H_RESPONSE,		/* 0x0400 */
> +	CT_DEAD_PARSE_G2H_UNKNOWN,		/* 0x0800 */
> +	CT_DEAD_PARSE_G2H_ORIGIN,		/* 0x1000 */
> +	CT_DEAD_PARSE_G2H_TYPE,			/* 0x2000 */
> +};
> +
> +static void ct_dead_worker_func(struct work_struct *w);
> +
> +#define CT_DEAD(ct, ctb, reason_code)								\
> +	do {											\
> +		struct guc_ctb *_ctb = (ctb);							\
> +		if (_ctb)									\
> +			_ctb->info.broken = true;						\
> +		if (!(ct)->dead.reported) {							\
Do we need to worry about a second dead CT conflicting with this one when
CT_DEAD fires back-to-back? Suggest setting 'reported' here and clearing
it only from the worker thread. The snapshot pointer could then be used
instead of 'reported' to determine whether the state has been printed
(in ct_dead_worker_func).

Should the snapshot be taken in the "CT_DEAD_REARM" case?
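Roughly the ordering I have in mind, as a minimal userspace sketch (the struct, field names and the snapshot stand-in are illustrative models of the xe code, not the actual implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the suggested CT_DEAD flow: mark 'reported' at detection
 * time so back-to-back CT_DEAD hits become no-ops, and clear it only
 * once the worker has consumed and printed the snapshot. */
struct dead_state {
	bool reported;
	unsigned int reason;
	int snapshots_taken;
};

static void ct_dead(struct dead_state *d, unsigned int reason_code)
{
	if (d->reported)
		return;		/* second report: do not overwrite the snapshot */
	d->reported = true;	/* set here, not in the worker */
	d->reason |= 1u << reason_code;
	d->snapshots_taken++;	/* stands in for the log/CT captures */
}

static void dead_worker(struct dead_state *d)
{
	/* print the snapshot here, then re-arm for the next death */
	d->reported = false;
}
```

With this shape a second CT_DEAD between the first capture and the worker run simply returns early instead of racing on the snapshot pointers.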

> +			struct xe_guc *guc = ct_to_guc(ct);					\
> +			spin_lock_irq(&ct->dead.lock);						\
> +			(ct)->dead.reason |= 1 << CT_DEAD_##reason_code;			\
Does this field need to be cleared at some point, or is it intended to
accumulate the reasons the CT died?
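As written it accumulates: each invocation ORs in another bit and nothing in the macro itself zeroes the mask. An illustrative model of those semantics (helper name is made up):

```c
#include <assert.h>

/* Each CT_DEAD invocation ORs one reason bit into the mask; the mask
 * only grows until something explicitly resets it to zero. */
static unsigned int record_reason(unsigned int mask, unsigned int reason_code)
{
	return mask | (1u << reason_code);
}
```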

> +			(ct)->dead.snapshot_log = xe_guc_log_snapshot_capture(&guc->log, true);	\
> +			(ct)->dead.snapshot_ct = xe_guc_ct_snapshot_capture((ct), true);	\
> +			spin_unlock_irq(&ct->dead.lock);					\
> +			queue_work(system_unbound_wq, &(ct)->dead.worker);			\
> +		}										\
> +	} while (0)
> +#else
> +#define CT_DEAD(ct, ctb, reason)			\
> +	do {						\
> +		struct guc_ctb *_ctb = (ctb);		\
> +		if (_ctb)				\
> +			_ctb->info.broken = true;	\
> +	} while (0)
> +#endif
> +
>  /* Used when a CT send wants to block and / or receive data */
>  struct g2h_fence {
>  	u32 *response_buffer;
> @@ -183,6 +228,10 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
>  	xa_init(&ct->fence_lookup);
>  	INIT_WORK(&ct->g2h_worker, g2h_worker_func);
>  	INIT_DELAYED_WORK(&ct->safe_mode_worker, safe_mode_worker_func);
> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> +	spin_lock_init(&ct->dead.lock);
> +	INIT_WORK(&ct->dead.worker, ct_dead_worker_func);
> +#endif
>  	init_waitqueue_head(&ct->wq);
>  	init_waitqueue_head(&ct->g2h_fence_wq);
>  
> @@ -419,10 +468,22 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
>  	if (ct_needs_safe_mode(ct))
>  		ct_enter_safe_mode(ct);
>  
> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
> +	/*
> +	 * The CT has now been reset so the dumper can be re-armed
> +	 * after any existing dead state has been dumped.
> +	 */
> +	spin_lock_irq(&ct->dead.lock);
> +	if (ct->dead.reason)
> +		ct->dead.reason |= CT_DEAD_REARM;
> +	spin_unlock_irq(&ct->dead.lock);
> +#endif
> +
>  	return 0;
>  
>  err_out:
>  	xe_gt_err(gt, "Failed to enable GuC CT (%pe)\n", ERR_PTR(err));
> +	CT_DEAD(ct, NULL, SETUP);
>  
>  	return err;
>  } 
> @@ -773,8 +895,13 @@ static int guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action, u32 len,
>  			goto broken;
>  #undef g2h_avail
>  
> -		if (dequeue_one_g2h(ct) < 0)
> +		ret = dequeue_one_g2h(ct);
> +		if (ret < 0) {
> +			if (ret != -ECANCELED)
> +				xe_gt_err(ct_to_gt(ct), "CTB receive failed (%pe)",
> +					  ERR_PTR(ret));
Is it intentional that there is no success path here? And should the
-ECANCELED case route back to try_again rather than being treated as
broken?
>  			goto broken;
> +		}
>  
>  		goto try_again;
>  	}
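If -ECANCELED were treated as retryable, the loop might look like this userspace sketch (function names, the callback shape and the retry bound are illustrative, not the actual xe_guc_ct code):

```c
#include <assert.h>
#include <errno.h>

enum send_result { SEND_OK, SEND_BROKEN };

/* Model of the send loop: route -ECANCELED back to the retry path
 * (bounded here for the sketch) instead of declaring the CTB broken;
 * any other error still breaks out. */
static enum send_result send_locked(int (*dequeue_one)(void *), void *ctx,
				    int max_retries)
{
	while (max_retries-- > 0) {
		int ret = dequeue_one(ctx);

		if (ret == -ECANCELED)
			continue;	/* canceled: try again */
		if (ret < 0)
			return SEND_BROKEN;
		return SEND_OK;
	}
	return SEND_BROKEN;
}
```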





> +
> +static void ct_dead_worker_func(struct work_struct *w)
> +{
> +	struct xe_guc_ct *ct = container_of(w, struct xe_guc_ct, dead.worker);
> +
> +	if (!ct->dead.reported) {
> +		ct->dead.reported = true;
> +		ct_dead_print(&ct->dead);
> +	}
Would this guard hold up against quick back-to-back CT_DEAD calls? Maybe
that cannot happen in practice, but suggest moving
'ct->dead.reported = true;' into the CT_DEAD macro anyway. This relates
to my comment on the CT_DEAD macro above.


Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>

