Re: [PATCH 1/2] drm/xe: Improve devcoredump documentation


From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: "John Harrison" <john.c.harrison@intel.com>,
	"Raag Jadav" <raag.jadav@intel.com>,
	intel-xe@lists.freedesktop.org,
	"José Roberto de Souza" <jose.souza@intel.com>
Subject: Re: [PATCH 1/2] drm/xe: Improve devcoredump documentation
Date: Fri, 1 Nov 2024 15:59:30 -0400
Message-ID: <ZyUzIsyGzjHVmyXs@intel.com>
In-Reply-To: <2lm6buuc56u6awcerm4qjjphrhkdha5a4askhjnqsusj727xhu@d3l7xdlecqbt>

On Fri, Nov 01, 2024 at 02:29:58PM -0500, Lucas De Marchi wrote:
> On Fri, Nov 01, 2024 at 02:19:22PM -0500, Lucas De Marchi wrote:
> > On Fri, Nov 01, 2024 at 11:39:59AM -0700, John Harrison wrote:
> > > On 11/1/2024 08:07, Raag Jadav wrote:
> > > > On Fri, Nov 01, 2024 at 07:44:37AM -0500, Lucas De Marchi wrote:
> > > > > On Fri, Nov 01, 2024 at 07:47:54AM +0200, Raag Jadav wrote:
> > > > > > On Thu, Oct 31, 2024 at 11:29:15AM -0700, Lucas De Marchi wrote:
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > > - * Snapshot at hang:
> > > > > > > - * The 'data' file is printed with a drm_printer pointer at devcoredump read
> > > > > > > - * time. For this reason, we need to take snapshots from when the hang has
> > > > > > > - * happened, and not only when the user is reading the file. Otherwise the
> > > > > > > - * information is outdated since the resets might have happened in between.
> > > > > > > + * The following characteristics are observed by xe when creating a device
> > > > > > > + * coredump:
> > > > > > >  *
> > > > > > > - * 'First' failure snapshot:
> > > > > > > - * In general, the first hang is the most critical one since the following hangs
> > > > > > > - * can be a consequence of the initial hang. For this reason we only take the
> > > > > > > - * snapshot of the 'first' failure and ignore subsequent calls of this function,
> > > > > > > - * at least while the coredump device is alive. Dev_coredump has a delayed work
> > > > > > > - * queue that will eventually delete the device and free all the dump
> > > > > > > - * information.
> > > > > > > + * **Snapshot at hang**:
> > > > > > > + *   The 'data' file contains a snapshot of the HW state at the time the hang
> > > > > > > + *   happened. Due to the driver recovering from resets/crashes, it may not
> > > > > > > + *   correspond to the state of when the file is read by userspace.
> > > > > > Does that mean the devcoredump will be present even after a successful recovery?
> > > > > > yes.... if it's not successful then it's moved to the wedged state. Easy
> > > > > way to test is running this:
> > > > > 
> > > > > 	xe_exec_threads --r threads-hang-basic
> > > > > 
> > > > > You should see something like this in your dmesg:
> > > > > 
> > > > > 	[IGT] xe_exec_threads: starting subtest threads-hang-basic
> > > > > 	xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=rcs, logical_mask: 0x1, guc_id=34
> > > > > 	xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x1, guc_id=32
> > > > > 	xe 0000:00:02.0: [drm] GT1: Engine reset: engine_class=vcs, logical_mask: 0x1, guc_id=18
> > > > > 	xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=34, flags=0x0 in xe_exec_threads [2636]
> > > > > 	xe 0000:00:02.0: [drm] GT1: Engine reset: engine_class=vecs, logical_mask: 0x1, guc_id=17
> > > > > 	xe 0000:00:02.0: [drm] GT1: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=18, flags=0x0 in xe_exec_threads [2636]
> > > > > 	xe 0000:00:02.0: [drm] Xe device coredump has been created
> > > > > -->	xe 0000:00:02.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
> > > > > 	xe 0000:00:02.0: [drm] GT1: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=17, flags=0x0 in xe_exec_threads [2636]
> > > > > 	xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=32, flags=0x0 in xe_exec_threads [2636]
> > > > > 	xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=27
> > > > > 	xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=4294967169, lrc_seqno=4294967169, guc_id=27, flags=0x0 in xe_exec_threads [2636]
> > > > > 	[IGT] xe_exec_threads: finished subtest threads-hang-basic, SUCCESS
> > > > > 
> > > > > 
> > > > > > If you run it again, it won't overwrite the previous dump until the
> > > > > > user clears the previous dump or the timeout on the kernel side fires
> > > > > > to release it.
> > > > > Yes, which I think we're covering at a later point in "First failure only".
> > > > So maybe establishing the mechanism itself before explaining reset/recovery
> > > > would be a bit neater...
> > > > 
> > > > > From a distro-integration pov, I think it should have a udev rule that
> > > > > fires when a devcoredump is created so the dump is copied to persistent
> > > > > storage. Just like it happens with cpu coredump (see systemd-coredump)
> > > > > 
> > > > > > Perhaps moving the 'release' part to above paragraph will add required context.
> > > > > not sure I follow. Are you suggesting to swap the order of "First
> > > > > failure only" and "Snapshot at hang" ?
> > > > ... in whichever way you think is best.
> > > Note that 'snapshot at hang' and 'first failure only' are totally
> > > separate concepts. And neither explains the release mechanism.
> > > Reversing the order of the descriptions would be incorrect, IMHO.
> > > 
> > > The point of 'snapshot at hang' is to say that the universe
> > > continues existing after the snapshot is taken. It is not just that
> > > the driver recovers but that it keeps processing new work. In an
> > > active system, it is extremely unlikely the system state (hardware
> > > or software) would match what is in the snapshot by the time the
> > > user is able to read the snapshot out. That has nothing to do with
> > > when or if the snapshot is released, nor with how many snapshots are
> > > taken.
> > > 
> > > The point of 'first failure only' is that only one snapshot is taken
> > > at a time. If there are multiple back to back hangs then only the
> > > first will generate a snapshot. Further snapshots will only be
> > > created for new hangs after the existing snapshot has been
> > > 'released'. And I'm not seeing mention of how to release the
> > > snapshot? It would be good to add a quick comment about that.
> > 
> > does this look better for y'all?

works for me...

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

> 
> trying to paste again, with whitespaces and typo fixed:
> 
> /**
>  * DOC: Xe device coredump
>  *
>  * Xe uses dev_coredump infrastructure for exposing the crash errors in a
>  * standardized way. Once a crash occurs, devcoredump exposes a temporary
>  * node under ``/sys/class/devcoredump/devcd<m>/``. The same node is also
>  * accessible in ``/sys/class/drm/card<n>/device/devcoredump/``. The
>  * ``failing_device`` symlink points to the device that crashed and created the
>  * coredump.
>  *
>  * The following characteristics are observed by xe when creating a device
>  * coredump:
>  *
>  * **Snapshot at hang**:
>  *   The 'data' file contains a snapshot of the HW state at the time the hang
>  *   happened. Due to the driver recovering from resets/crashes, it may not
>  *   correspond to the state at the time the file is read by userspace.
>  *
>  * **Coredump release**:
>  *   After a coredump is generated, it stays in kernel memory until released by
>  *   userspace by writing anything to it, or after an internal timer expires. The
>  *   exact timeout may vary and should not be relied upon. Example to release
>  *   a coredump:
>  *
>  *   .. code-block:: shell
>  *
>  *      $ > /sys/class/drm/card0/device/devcoredump/data
>  *
>  * **First failure only**:
>  *   In general, the first hang is the most critical one since the following
>  *   hangs can be a consequence of the initial hang. For this reason a snapshot
>  *   is taken only for the first failure. Until the devcoredump is released by
>  *   userspace or kernel, all subsequent hangs do not override the snapshot nor
>  *   create new ones. Devcoredump has a delayed work queue that will eventually
>  *   delete the file node and free all the dump information.
>  */
> 
> Lucas De Marchi
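
The "Coredump release" and "First failure only" paragraphs above imply a save-then-release workflow: the snapshot has to be copied out before it is released, and it has to be released before a later hang can produce a new one. A minimal sketch of that workflow follows. The function name, the destination directory and the output file naming are hypothetical and not part of the patch; the only behaviour taken from the text is that writing anything to the 'data' file releases the dump. Something like this could be what a udev rule or service invokes to copy dumps to persistent storage, as suggested earlier in the thread.

```shell
#!/bin/sh
# Sketch only: save a devcoredump to persistent storage, then release it
# so that the next hang can create a new snapshot.
# The function name, destination directory and file naming below are
# assumptions made for illustration.

save_and_release_coredump() {
    data="$1"      # e.g. /sys/class/drm/card0/device/devcoredump/data
    dest_dir="$2"  # hypothetical persistent location chosen by the caller

    # No coredump node present: nothing to save or release.
    [ -e "$data" ] || return 1

    mkdir -p "$dest_dir" || return 1
    out="$dest_dir/xe-devcoredump-$(date +%Y%m%d-%H%M%S).txt"

    # Copy first: the snapshot is freed once the node is released.
    cat "$data" > "$out" || return 1

    # Writing anything to the data file releases the coredump.
    echo 1 > "$data"

    # Report where the copy was stored.
    printf '%s\n' "$out"
}
```

It might be called as `save_and_release_coredump /sys/class/drm/card0/device/devcoredump/data /var/tmp`, where `/var/tmp` is only a placeholder. Copying before writing matters because of the "first failure only" behaviour: until the node is released, later hangs are not captured, and once it is released the earlier snapshot cannot be read back.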

Thread overview: 21+ messages
2024-10-31 18:29 [PATCH 0/2] drm/xe: devcoredump documentation Lucas De Marchi
2024-10-31 18:29 ` [PATCH 1/2] drm/xe: Improve " Lucas De Marchi
2024-11-01  5:47   ` Raag Jadav
2024-11-01 12:44     ` Lucas De Marchi
2024-11-01 15:07       ` Raag Jadav
2024-11-01 18:39         ` John Harrison
2024-11-01 19:19           ` Lucas De Marchi
2024-11-01 19:29             ` Lucas De Marchi
2024-11-01 19:59               ` Rodrigo Vivi [this message]
2024-11-01 21:17               ` John Harrison
2024-10-31 18:29 ` [PATCH 2/2] drm/xe: Wire up devcoredump in documentation Lucas De Marchi
2024-11-01 14:41   ` Matthew Brost
2024-10-31 19:48 ` ✓ CI.Patch_applied: success for drm/xe: devcoredump documentation Patchwork
2024-10-31 19:48 ` ✗ CI.checkpatch: warning " Patchwork
2024-10-31 19:49 ` ✓ CI.KUnit: success " Patchwork
2024-10-31 20:01 ` ✓ CI.Build: " Patchwork
2024-10-31 20:03 ` ✓ CI.Hooks: " Patchwork
2024-10-31 20:04 ` ✓ CI.checksparse: " Patchwork
2024-10-31 20:29 ` ✓ CI.BAT: " Patchwork
2024-10-31 23:10 ` ✗ CI.FULL: failure " Patchwork
2024-11-01  5:49 ` [PATCH 0/2] " Raag Jadav
