Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: John Harrison <john.c.harrison@intel.com>
Cc: <Intel-Xe@lists.freedesktop.org>
Subject: Re: [PATCH v5 8/8] drm/xe/guc: Add GuC log to devcoredump captures
Date: Mon, 5 Aug 2024 20:46:40 +0000	[thread overview]
Message-ID: <ZrE6MOIGZKk7Xj+5@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <4d692553-36bb-4eed-b884-e7c176e6def3@intel.com>

On Mon, Aug 05, 2024 at 11:41:05AM -0700, John Harrison wrote:
> On 8/4/2024 15:36, Matthew Brost wrote:
> > On Mon, Jul 29, 2024 at 04:17:52PM -0700, John.C.Harrison@Intel.com wrote:
> > > From: John Harrison <John.C.Harrison@Intel.com>
> > > 
> > > Add an option to include the GuC log in devcoredump captures. Note that
> > > this is currently optional and disabled by default. The reason being
> > > that useful GuC logs are large, very large when converted to an ASCII
> > > hex dump! And as they are not always necessary/useful for debugging a
> > > hang, it is not desirable to force all core dump captures to be huge.
> > > 
> > > NB: The intent is to add support for buffer compression to the core
> > > dumps. Then the log can be included as standard without being too
> > > onerous. At that point the module parameter override can be removed.
> > > 
> > Briefly looked at this series and have a few questions / suggestions.
> > 
> > 1. Remove the mod param to include the GuC log in devcoredump and always
> > include it. We had a bug in devcoredump where large devcoredumps took
> > forever; that has been fixed in [1]. Now capturing a devcoredump of
> > around 256M takes 0.3 seconds, so adding another 8M or so seems fine.
> Yeah, that was the main reason for not wanting to always include it just
> yet!
> 
> > 
> > 2. Instead of printing the GuC log as hex, can we just dump it as binary?
> > This is what we do for VM mappings with the dump flag set, and for the LRC.
> > grep 'ascii85_encode' for an example of this. It would be nice to keep
> > binary dumping uniform. Then surely we can update our scripts / tools to
> > parse this output.
> Note, that's not binary. That's ASCII85 encoding. There are some drivers
> which genuinely include binary blobs e.g. ath10k (https://starkeblog.com/firmware/wifi/linux/kernel/2021/08/11/dev-coredump-and-firmware-images.html).
> But yes, it is way more compact than a full hexdump.
> 
> Do you know what userland tool is currently used to decode the ASCII85
> objects? The intel_error_decode tool in IGT does not appear to have been
> updated to support devcoredump files.
> 

No idea if we have userland tools that parse this. Maarten added this to
both the LRC and VM code, so maybe ask him?
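FWIW, decoding shouldn't need much tooling either way. Here is a rough
Python sketch of the scheme the kernel's ascii85_encode() helper
(include/linux/ascii85.h) uses: each u32 word becomes 5 characters in
'!'..'u', with a bare 'z' shorthand for a zero word. Word byte order and
any framing around the blob in the actual dump output are assumptions to
check against the driver, not something this sketch verifies:

```python
# Sketch of the kernel's per-u32 ascii85 scheme (include/linux/ascii85.h).
# Assumes the dump is a plain concatenation of encoded words; any real
# framing/line-wrapping in the devcoredump output must be handled first.

def a85_encode_words(words):
    """Encode an iterable of u32 words the way ascii85_encode() does."""
    out = []
    for w in words:
        if w == 0:
            out.append("z")          # zero words collapse to a single 'z'
            continue
        chunk = []
        for _ in range(5):           # 85**5 > 2**32, so 5 digits suffice
            chunk.append(chr(ord("!") + w % 85))
            w //= 85
        out.append("".join(reversed(chunk)))
    return "".join(out)

def a85_decode_words(text):
    """Inverse: recover the u32 words from an encoded string."""
    words, i = [], 0
    while i < len(text):
        if text[i] == "z":           # 'z' expands back to a zero word
            words.append(0)
            i += 1
            continue
        w = 0
        for c in text[i:i + 5]:
            w = w * 85 + (ord(c) - ord("!"))
        words.append(w)
        i += 5
    return words

if __name__ == "__main__":
    sample = [0xDEADBEEF, 0, 0x12345678]
    enc = a85_encode_words(sample)
    assert a85_decode_words(enc) == sample
    print(enc)
```

So a dedicated IGT decoder would be nice, but a few lines of script can
recover the words in the meantime.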

I don't really want to hold this up, as having the GuC log in devcoredump
is a good idea, but I also want to weigh that against keeping the driver
uniform.

> 
> > 
> > 3. Can we move the GuC log capture in devcoredump to
> > xe_devcoredump_deferred_snap_work and just use kvmalloc? I suppose this
> > makes the debugfs capture a little trickier?
> No idea. I haven't had a chance to look at your changes in detail yet, so
> I'm not sure what is involved. For certain, I would much rather capture the GuC log,
> CTBs, etc at the point of the error rather than in a worker thread some
> random amount of time later. The longer the delay between error and capture,
> the more chance there is for things to move on, wrap, overwrite, etc. And if
> you are using kvmalloc, that must be in some kind of asynchronous worker
> thread?
> 

This was existing code which runs in xe_devcoredump_deferred_snap_work
(a worker). Again, Maarten added this, but from the looks of it the design
is to capture minimal software / hardware state directly at the capture
point, kick the worker, and have the worker capture the larger memory
buffers where kvmalloc can be used. I think the GuC log fits with that
design by being captured in the worker.

But yes, this comes at the cost of an extra delay from the failure point.

Matt

> John.
> 
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/136770/
> > 
> > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_devcoredump.c       | 22 +++++++++++++++-------
> > >   drivers/gpu/drm/xe/xe_devcoredump_types.h | 12 ++++++++----
> > >   drivers/gpu/drm/xe/xe_module.c            |  3 +++
> > >   drivers/gpu/drm/xe/xe_module.h            |  1 +
> > >   4 files changed, 27 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > index 08a0bb3ee7c0..b7c241bd95d5 100644
> > > --- a/drivers/gpu/drm/xe/xe_devcoredump.c
> > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c
> > > @@ -17,8 +17,10 @@
> > >   #include "xe_gt.h"
> > >   #include "xe_gt_printk.h"
> > >   #include "xe_guc_ct.h"
> > > +#include "xe_guc_log.h"
> > >   #include "xe_guc_submit.h"
> > >   #include "xe_hw_engine.h"
> > > +#include "xe_module.h"
> > >   #include "xe_sched_job.h"
> > >   #include "xe_vm.h"
> > > @@ -74,7 +76,7 @@ static void xe_devcoredump_deferred_snap_work(struct work_struct *work)
> > >   	if (xe_force_wake_get(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL))
> > >   		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> > >   	xe_vm_snapshot_capture_delayed(ss->vm);
> > > -	xe_guc_exec_queue_snapshot_capture_delayed(ss->ge);
> > > +	xe_guc_exec_queue_snapshot_capture_delayed(ss->guc.ge);
> > >   	xe_force_wake_put(gt_to_fw(ss->gt), XE_FORCEWAKE_ALL);
> > >   }
> > > @@ -116,9 +118,13 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
> > >   	drm_printf(&p, "Process: %s\n", ss->process_name);
> > >   	xe_device_snapshot_print(xe, &p);
> > > +	if (xe_modparam.enable_guc_log_in_coredump) {
> > > +		drm_printf(&p, "\n**** GuC Log ****\n");
> > > +		xe_guc_log_snapshot_print(xe, coredump->snapshot.guc.log, &p, false);
> > > +	}
> > >   	drm_printf(&p, "\n**** GuC CT ****\n");
> > > -	xe_guc_ct_snapshot_print(xe, coredump->snapshot.ct, &p, false);
> > > -	xe_guc_exec_queue_snapshot_print(coredump->snapshot.ge, &p);
> > > +	xe_guc_ct_snapshot_print(xe, coredump->snapshot.guc.ct, &p, false);
> > > +	xe_guc_exec_queue_snapshot_print(coredump->snapshot.guc.ge, &p);
> > >   	drm_printf(&p, "\n**** Job ****\n");
> > >   	xe_sched_job_snapshot_print(coredump->snapshot.job, &p);
> > > @@ -145,8 +151,9 @@ static void xe_devcoredump_free(void *data)
> > >   	cancel_work_sync(&coredump->snapshot.work);
> > > -	xe_guc_ct_snapshot_free(coredump->snapshot.ct);
> > > -	xe_guc_exec_queue_snapshot_free(coredump->snapshot.ge);
> > > +	xe_guc_log_snapshot_free(coredump->snapshot.guc.log);
> > > +	xe_guc_ct_snapshot_free(coredump->snapshot.guc.ct);
> > > +	xe_guc_exec_queue_snapshot_free(coredump->snapshot.guc.ge);
> > >   	xe_sched_job_snapshot_free(coredump->snapshot.job);
> > >   	for (i = 0; i < XE_NUM_HW_ENGINES; i++)
> > >   		if (coredump->snapshot.hwe[i])
> > > @@ -199,8 +206,9 @@ static void devcoredump_snapshot(struct xe_devcoredump *coredump,
> > >   	if (xe_force_wake_get(gt_to_fw(q->gt), XE_FORCEWAKE_ALL))
> > >   		xe_gt_info(ss->gt, "failed to get forcewake for coredump capture\n");
> > > -	coredump->snapshot.ct = xe_guc_ct_snapshot_capture(&guc->ct, true);
> > > -	coredump->snapshot.ge = xe_guc_exec_queue_snapshot_capture(q);
> > > +	coredump->snapshot.guc.log = xe_guc_log_snapshot_capture(&guc->log, true);
> > > +	coredump->snapshot.guc.ct = xe_guc_ct_snapshot_capture(&guc->ct, true);
> > > +	coredump->snapshot.guc.ge = xe_guc_exec_queue_snapshot_capture(q);
> > >   	coredump->snapshot.job = xe_sched_job_snapshot_capture(job);
> > >   	coredump->snapshot.vm = xe_vm_snapshot_capture(q->vm);
> > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump_types.h b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > index 923cdf72a816..6ac8da1631f9 100644
> > > --- a/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_devcoredump_types.h
> > > @@ -35,10 +35,14 @@ struct xe_devcoredump_snapshot {
> > >   	struct work_struct work;
> > >   	/* GuC snapshots */
> > > -	/** @ct: GuC CT snapshot */
> > > -	struct xe_guc_ct_snapshot *ct;
> > > -	/** @ge: Guc Engine snapshot */
> > > -	struct xe_guc_submit_exec_queue_snapshot *ge;
> > > +	struct {
> > > +		/** @ct: GuC CT snapshot */
> > > +		struct xe_guc_ct_snapshot *ct;
> > > +		/** @log: GuC log snapshot */
> > > +		struct xe_guc_log_snapshot *log;
> > > +		/** @ge: Guc Engine snapshot */
> > > +		struct xe_guc_submit_exec_queue_snapshot *ge;
> > > +	} guc;
> > >   	/** @hwe: HW Engine snapshot array */
> > >   	struct xe_hw_engine_snapshot *hwe[XE_NUM_HW_ENGINES];
> > > diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
> > > index 7bb99e451fcc..dd837125f397 100644
> > > --- a/drivers/gpu/drm/xe/xe_module.c
> > > +++ b/drivers/gpu/drm/xe/xe_module.c
> > > @@ -37,6 +37,9 @@ MODULE_PARM_DESC(vram_bar_size, "Set the vram bar size(in MiB)");
> > >   module_param_named(guc_log_level, xe_modparam.guc_log_level, int, 0600);
> > >   MODULE_PARM_DESC(guc_log_level, "GuC firmware logging level (0=disable, 1..5=enable with verbosity min..max)");
> > > +module_param_named_unsafe(enable_guc_log_in_coredump, xe_modparam.enable_guc_log_in_coredump, bool, 0600);
> > > +MODULE_PARM_DESC(enable_guc_log_in_coredump, "Include a capture of the GuC log in devcoredumps");
> > > +
> > >   module_param_named_unsafe(guc_firmware_path, xe_modparam.guc_firmware_path, charp, 0400);
> > >   MODULE_PARM_DESC(guc_firmware_path,
> > >   		 "GuC firmware path to use instead of the default one");
> > > diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
> > > index 61a0d28a28c8..81be7fb05bd1 100644
> > > --- a/drivers/gpu/drm/xe/xe_module.h
> > > +++ b/drivers/gpu/drm/xe/xe_module.h
> > > @@ -12,6 +12,7 @@
> > >   struct xe_modparam {
> > >   	bool force_execlist;
> > >   	bool enable_display;
> > > +	bool enable_guc_log_in_coredump;
> > >   	u32 force_vram_bar_size;
> > >   	int guc_log_level;
> > >   	char *guc_firmware_path;
> > > -- 
> > > 2.43.2
> > > 
> 

Thread overview: 25+ messages
2024-07-29 23:17 [PATCH v5 0/8] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 1/8] drm/xe/guc: Remove spurious line feed in debug print John.C.Harrison
2024-07-30  9:14   ` Michal Wajdeczko
2024-07-31 19:56     ` John Harrison
2024-07-31 20:14       ` Souza, Jose
2024-08-01 18:14         ` Matthew Brost
2024-08-05 18:18         ` John Harrison
2024-07-29 23:17 ` [PATCH v5 2/8] drm/xe/guc: Copy GuC log prior to dumping John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 3/8] drm/xe/guc: Use a two stage dump for GuC logs and add more info John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 4/8] drm/print: Introduce drm_line_printer John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 5/8] drm/xe/guc: Add a helper function for dumping GuC log to dmesg John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 6/8] drm/xe/guc: Dead CT helper John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 7/8] drm/xe/guc: Dump entire CTB on errors John.C.Harrison
2024-07-29 23:17 ` [PATCH v5 8/8] drm/xe/guc: Add GuC log to devcoredump captures John.C.Harrison
2024-08-04 22:36   ` Matthew Brost
2024-08-05 18:41     ` John Harrison
2024-08-05 20:46       ` Matthew Brost [this message]
2024-07-29 23:24 ` ✓ CI.Patch_applied: success for drm/xe/guc: Improve quality and robustness of GuC log dumping (rev3) Patchwork
2024-07-29 23:24 ` ✗ CI.checkpatch: warning " Patchwork
2024-07-29 23:25 ` ✓ CI.KUnit: success " Patchwork
2024-07-29 23:37 ` ✓ CI.Build: " Patchwork
2024-07-29 23:39 ` ✗ CI.Hooks: failure " Patchwork
2024-07-29 23:41 ` ✗ CI.checksparse: warning " Patchwork
2024-07-30  0:00 ` ✓ CI.BAT: success " Patchwork
2024-07-30  3:36 ` ✗ CI.FULL: failure " Patchwork
