Intel-XE Archive on lore.kernel.org
* [PATCH v8 0/6] Maintenence of devcoredump <-> GuC-Err-Capture plumbing
@ 2025-02-13 19:51 Alan Previn
  2025-02-13 19:51 ` [PATCH v8 1/6] drm/xe/guc: Rename __guc_capture_parsed_output Alan Previn
                   ` (13 more replies)
  0 siblings, 14 replies; 22+ messages in thread
From: Alan Previn @ 2025-02-13 19:51 UTC
  To: intel-xe
  Cc: Alan Previn, dri-devel, Daniele Ceraolo Spurio, John Harrison,
	Matthew Brost, Zhanjun Dong, Rodrigo Vivi

GuC-Error-Capture currently reaches into the xe_devcoredump
structure to store its own placeholder snapshot pointer, working
around the race between G2H-Error-Capture-Notification and the
Drm-Scheduler triggering GuC-Submission-exec-queue-timeout/kill.

From a subsystem layering perspective, this isn't scalable as
GuC should not be manipulating contents of a global structure it
does not own when responding to an unrelated thread / callstack.

Also, part of the earlier mentioned workaround includes the
GuC-Error-Capture taking on one of the front-end functions
for xe_hw_engine_snapshot generation because of an orthogonal
debugfs-caller requesting raw dumps of engine registers without
a job. This request is better handled by GuC-Error-Capture since
there is a lot to manage for reading and printing engine
register lists and we want to avoid duplicate code or tables.

However, logically speaking, the GuC-Error-Capture output node
is really a subset of xe_hw_engine_snapshot. This holds regardless
of the fact that the majority of an engine snapshot is the
register dumps that only GuC-Error-Capture can do.

That said, this series refactors the plumbing between
GuC-Error-Capture and xe_devcoredump (including
xe_hw_engine_snapshot) to fix the layering for future
maintenance and scalability. This is done without changing
any functionality or IP-locality (i.e. GuC-Error-Capture still owns
the single point of engine register list definition and printing).
This series ensures 'xe_devcoredump_snapshot' owns
'xe_hw_engine_snapshot' generation and the latter owns
'xe_guc_capture_snapshot' retrieval (with GuC-Error-Capture
as its helper).
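
The ownership chain described above can be illustrated with a
minimal userspace sketch. It is not taken from the series: every
struct and function name below is a hypothetical, simplified stand-in
for the real xe driver types, and the register count of 16 is an
arbitrary assumption. It only shows the intended direction of
ownership: the coredump snapshot creates and frees the engine
snapshot, which in turn asks the capture helper for its node.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; NOT the real xe driver structures. */
struct guc_capture_snapshot {
	unsigned int num_regs;	/* placeholder for the register dump */
};

struct hw_engine_snapshot {
	const char *name;
	struct guc_capture_snapshot *capture;	/* owned by this snapshot */
};

struct devcoredump_snapshot {
	struct hw_engine_snapshot *engine;	/* owned by this snapshot */
};

/* Helper role: the capture code only hands out a node on request. */
static struct guc_capture_snapshot *guc_capture_get_snapshot(void)
{
	struct guc_capture_snapshot *c = calloc(1, sizeof(*c));

	if (c)
		c->num_regs = 16;	/* arbitrary value for illustration */
	return c;
}

/* The engine snapshot owns retrieval (and release) of the capture node. */
static struct hw_engine_snapshot *hw_engine_snapshot_capture(const char *name)
{
	struct hw_engine_snapshot *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->name = name;
	s->capture = guc_capture_get_snapshot();
	return s;
}

static void hw_engine_snapshot_free(struct hw_engine_snapshot *s)
{
	if (!s)
		return;
	free(s->capture);
	free(s);
}

/* The coredump snapshot owns generation (and release) of the engine snapshot. */
static int devcoredump_snapshot_capture(struct devcoredump_snapshot *ss,
					const char *engine_name)
{
	ss->engine = hw_engine_snapshot_capture(engine_name);
	return ss->engine ? 0 : -1;
}

static void devcoredump_snapshot_free(struct devcoredump_snapshot *ss)
{
	hw_engine_snapshot_free(ss->engine);
	ss->engine = NULL;
}
```

In this sketch the capture helper never writes into the coredump
structure; it is only reached through the engine snapshot, which is
the layering the cover letter argues for.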

Alan Previn (6):
  drm/xe/guc: Rename __guc_capture_parsed_output
  drm/xe/guc: Don't store capture nodes in xe_devcoredump_snapshot
  drm/xe/guc: Split engine state print between xe_hw_engine vs
    xe_guc_capture
  drm/xe/guc: Move xe_hw_engine_snapshot creation back to xe_hw_engine.c
  drm/xe/xe_hw_engine: Update xe_hw_engine capture for debugfs/gt_reset
  drm/xe/guc: Update comments on GuC-Err-Capture flows

 drivers/gpu/drm/xe/xe_devcoredump.c           |   7 +-
 drivers/gpu/drm/xe/xe_devcoredump_types.h     |   6 -
 drivers/gpu/drm/xe/xe_guc_capture.c           | 397 ++++++++----------
 drivers/gpu/drm/xe/xe_guc_capture.h           |  16 +-
 .../drm/xe/xe_guc_capture_snapshot_types.h    |  57 +++
 drivers/gpu/drm/xe/xe_guc_submit.c            |  12 +-
 drivers/gpu/drm/xe/xe_hw_engine.c             | 114 +++--
 drivers/gpu/drm/xe/xe_hw_engine.h             |   4 +-
 drivers/gpu/drm/xe/xe_hw_engine_types.h       |  13 +-
 9 files changed, 353 insertions(+), 273 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_snapshot_types.h


base-commit: b7446752e5d3de98bf26b5d3a7ca4fe9165ec779
-- 
2.34.1


* [PATCH v7 0/6] Maintenence of devcoredump <-> GuC-Err-Capture plumbing
@ 2025-02-10 23:32 Alan Previn
  2025-02-11 13:01 ` ✗ Xe.CI.Full: failure for " Patchwork
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Previn @ 2025-02-10 23:32 UTC
  To: intel-xe
  Cc: Alan Previn, dri-devel, Daniele Ceraolo Spurio, John Harrison,
	Matthew Brost, Zhanjun Dong, Rodrigo Vivi

GuC-Error-Capture currently reaches into the xe_devcoredump
structure to store its own placeholder snapshot pointer, working
around the race between G2H-Error-Capture-Notification and the
Drm-Scheduler triggering GuC-Submission-exec-queue-timeout/kill.

From a subsystem layering perspective, this isn't scalable as
GuC should not be manipulating contents of a global structure it
does not own when responding to an unrelated thread / callstack.

Also, part of the earlier mentioned workaround includes the
GuC-Error-Capture taking on one of the front-end functions
for xe_hw_engine_snapshot generation because of an orthogonal
debugfs-caller requesting raw dumps of engine registers without
a job. This request is better handled by GuC-Error-Capture since
there is a lot to manage for reading and printing engine
register lists and we want to avoid duplicate code or tables.

However, logically speaking, the GuC-Error-Capture output node
is really a subset of xe_hw_engine_snapshot. This holds regardless
of the fact that the majority of an engine snapshot is the
register dumps that only GuC-Error-Capture can do.

That said, this series refactors the plumbing between
GuC-Error-Capture and xe_devcoredump (including
xe_hw_engine_snapshot) to fix the layering for future
maintenance and scalability. This is done without changing
any functionality or IP-locality (i.e. GuC-Error-Capture still owns
the single point of engine register list definition and printing).
This series ensures 'xe_devcoredump_snapshot' owns
'xe_hw_engine_snapshot' generation and the latter owns
'xe_guc_capture_snapshot' retrieval (with GuC-Error-Capture
as its helper).

Alan Previn (6):
  drm/xe/guc: Rename __guc_capture_parsed_output
  drm/xe/guc: Don't store capture nodes in xe_devcoredump_snapshot
  drm/xe/guc: Split engine state print between xe_hw_engine vs
    xe_guc_capture
  drm/xe/guc: Move xe_hw_engine_snapshot creation back to xe_hw_engine.c
  drm/xe/xe_hw_engine: Update xe_hw_engine capture for debugfs/gt_reset
  drm/xe/guc: Update comments on GuC-Err-Capture flows

 drivers/gpu/drm/xe/xe_devcoredump.c           |   7 +-
 drivers/gpu/drm/xe/xe_devcoredump_types.h     |   6 -
 drivers/gpu/drm/xe/xe_guc_capture.c           | 381 ++++++++----------
 drivers/gpu/drm/xe/xe_guc_capture.h           |  16 +-
 .../drm/xe/xe_guc_capture_snapshot_types.h    |  57 +++
 drivers/gpu/drm/xe/xe_guc_submit.c            |  12 +-
 drivers/gpu/drm/xe/xe_hw_engine.c             | 114 ++++--
 drivers/gpu/drm/xe/xe_hw_engine.h             |   4 +-
 drivers/gpu/drm/xe/xe_hw_engine_types.h       |  13 +-
 9 files changed, 337 insertions(+), 273 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_snapshot_types.h


base-commit: f74fd53ba34551b7626193fb70c17226f06e9bf1
-- 
2.34.1


* [PATCH v6 0/6] Maintenence of devcoredump <-> GuC-Err-Capture plumbing
@ 2025-01-28 18:36 Alan Previn
  2025-01-29 12:54 ` ✗ Xe.CI.Full: failure for " Patchwork
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Previn @ 2025-01-28 18:36 UTC
  To: intel-xe
  Cc: Alan Previn, dri-devel, Daniele Ceraolo Spurio, John Harrison,
	Matthew Brost, Zhanjun Dong, Rodrigo Vivi

GuC-Error-Capture currently reaches into the xe_devcoredump
structure to store its own placeholder snapshot pointer, working
around the race between G2H-Error-Capture-Notification and the
Drm-Scheduler triggering GuC-Submission-exec-queue-timeout/kill.

From a subsystem layering perspective, this isn't scalable as
GuC should not be manipulating contents of a global structure it
does not own when responding to an unrelated thread / callstack.

Also, part of the earlier mentioned workaround includes the
GuC-Error-Capture taking on one of the front-end functions
for xe_hw_engine_snapshot generation because of an orthogonal
debugfs-caller requesting raw dumps of engine registers without
a job. This request is better handled by GuC-Error-Capture since
there is a lot to manage for reading and printing engine
register lists and we want to avoid duplicate code or tables.

However, logically speaking, the GuC-Error-Capture output node
is really a subset of xe_hw_engine_snapshot. This holds regardless
of the fact that the majority of an engine snapshot is the
register dumps that only GuC-Error-Capture can do.

That said, this series refactors the plumbing between
GuC-Error-Capture and xe_devcoredump (including
xe_hw_engine_snapshot) to fix the layering for future
maintenance and scalability. This is done without changing
any functionality or IP-locality (i.e. GuC-Error-Capture still owns
the single point of engine register list definition and printing).
This series ensures 'xe_devcoredump_snapshot' owns
'xe_hw_engine_snapshot' generation and the latter owns
'xe_guc_capture_snapshot' retrieval (with GuC-Error-Capture
as its helper).

Alan Previn (6):
  drm/xe/guc: Rename __guc_capture_parsed_output
  drm/xe/guc: Don't store capture nodes in xe_devcoredump_snapshot
  drm/xe/guc: Split engine state print between xe_hw_engine vs
    xe_guc_capture
  drm/xe/guc: Move xe_hw_engine_snapshot creation back to xe_hw_engine.c
  drm/xe/xe_hw_engine: Update hw_engine_snapshot_capture for debugfs
  drm/xe/guc: Update comments on GuC-Err-Capture flows

 drivers/gpu/drm/xe/xe_devcoredump.c           |   3 -
 drivers/gpu/drm/xe/xe_devcoredump_types.h     |   6 -
 drivers/gpu/drm/xe/xe_guc_capture.c           | 365 ++++++++----------
 drivers/gpu/drm/xe/xe_guc_capture.h           |  16 +-
 .../drm/xe/xe_guc_capture_snapshot_types.h    |  53 +++
 drivers/gpu/drm/xe/xe_guc_submit.c            |  12 +-
 drivers/gpu/drm/xe/xe_hw_engine.c             | 111 ++++--
 drivers/gpu/drm/xe/xe_hw_engine.h             |   4 +-
 drivers/gpu/drm/xe/xe_hw_engine_types.h       |  13 +-
 9 files changed, 319 insertions(+), 264 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_snapshot_types.h


base-commit: 8b47c9cdb6a78364fe68f8af0abfd6f265577001
-- 
2.34.1


* [PATCH v4 0/1] Maintenence of devcoredump <-> GuC-Err-Capture plumbing
@ 2025-01-21 19:09 Alan Previn
  2025-01-22  4:31 ` ✗ Xe.CI.Full: failure for " Patchwork
  0 siblings, 1 reply; 22+ messages in thread
From: Alan Previn @ 2025-01-21 19:09 UTC
  To: intel-xe
  Cc: Alan Previn, dri-devel, Daniele Ceraolo Spurio, John Harrison,
	Matthew Brost, Zhanjun Dong

GuC-Error-Capture currently reaches into the xe_devcoredump
structure to store its own placeholder snapshot, working around
the race between G2H-Error-Capture-Notification and the
Drm-Scheduler triggering GuC-Submission-exec-queue-timeout/kill.

Part of that race workaround design included GuC-Error-Capture taking
on some of the front-end functions for xe_hw_engine_snapshot
generation because of the orthogonal debugfs path for raw dumps of
engine registers without any job association. We want this to also be
handled, even if indirectly, by GuC-Error-Capture since there is a lot
to manage when it comes to reading and printing the register lists.

However, logically speaking, GuC-Error-Capture node management,
despite being the majority of the work in an engine snapshot, is
still a subset of xe_hw_engine_snapshot.

This series redesigns the plumbing for future
maintenance and scalability, rearranging the layering
back to what it should be (xe_devcoredump_snapshot owns
xe_hw_engine_snapshot, which owns xe_guc_capture_snapshot).

Alan Previn (1):
  drm/xe/guc/capture: Maintenence of devcoredump <-> GuC-Err-Capture
    plumbing

 drivers/gpu/drm/xe/xe_devcoredump.c           |   3 -
 drivers/gpu/drm/xe/xe_devcoredump_types.h     |   6 -
 drivers/gpu/drm/xe/xe_guc_capture.c           | 406 ++++++++----------
 drivers/gpu/drm/xe/xe_guc_capture.h           |  10 +-
 .../drm/xe/xe_guc_capture_snapshot_types.h    |  68 +++
 drivers/gpu/drm/xe/xe_guc_submit.c            |  21 +-
 drivers/gpu/drm/xe/xe_hw_engine.c             | 117 +++--
 drivers/gpu/drm/xe/xe_hw_engine.h             |   4 +-
 drivers/gpu/drm/xe/xe_hw_engine_types.h       |  13 +-
 9 files changed, 359 insertions(+), 289 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_snapshot_types.h


base-commit: cfa9d40db8c30d894171010fe765d96e9bc6a47e
-- 
2.34.1



end of thread, other threads:[~2025-02-14 19:14 UTC | newest]

Thread overview: 22+ messages
2025-02-13 19:51 [PATCH v8 0/6] Maintenence of devcoredump <-> GuC-Err-Capture plumbing Alan Previn
2025-02-13 19:51 ` [PATCH v8 1/6] drm/xe/guc: Rename __guc_capture_parsed_output Alan Previn
2025-02-13 19:51 ` [PATCH v8 2/6] drm/xe/guc: Don't store capture nodes in xe_devcoredump_snapshot Alan Previn
2025-02-13 23:16   ` Dong, Zhanjun
2025-02-13 19:51 ` [PATCH v8 3/6] drm/xe/guc: Split engine state print between xe_hw_engine vs xe_guc_capture Alan Previn
2025-02-13 19:51 ` [PATCH v8 4/6] drm/xe/guc: Move xe_hw_engine_snapshot creation back to xe_hw_engine.c Alan Previn
2025-02-13 19:51 ` [PATCH v8 5/6] drm/xe/xe_hw_engine: Update xe_hw_engine capture for debugfs/gt_reset Alan Previn
2025-02-13 19:51 ` [PATCH v8 6/6] drm/xe/guc: Update comments on GuC-Err-Capture flows Alan Previn
2025-02-13 20:21 ` ✓ CI.Patch_applied: success for Maintenence of devcoredump <-> GuC-Err-Capture plumbing Patchwork
2025-02-13 20:21 ` ✗ CI.checkpatch: warning " Patchwork
2025-02-13 20:22 ` ✓ CI.KUnit: success " Patchwork
2025-02-13 20:39 ` ✓ CI.Build: " Patchwork
2025-02-13 20:41 ` ✓ CI.Hooks: " Patchwork
2025-02-13 20:42 ` ✓ CI.checksparse: " Patchwork
2025-02-13 21:01 ` ✓ Xe.CI.BAT: " Patchwork
2025-02-14 16:42 ` ✗ Xe.CI.Full: failure " Patchwork
2025-02-14 19:14   ` Teres Alexis, Alan Previn
  -- strict thread matches above, loose matches on Subject: below --
2025-02-10 23:32 [PATCH v7 0/6] " Alan Previn
2025-02-11 13:01 ` ✗ Xe.CI.Full: failure for " Patchwork
2025-02-13  0:56   ` Teres Alexis, Alan Previn
2025-01-28 18:36 [PATCH v6 0/6] " Alan Previn
2025-01-29 12:54 ` ✗ Xe.CI.Full: failure for " Patchwork
2025-01-30 17:13   ` Teres Alexis, Alan Previn
2025-01-21 19:09 [PATCH v4 0/1] " Alan Previn
2025-01-22  4:31 ` ✗ Xe.CI.Full: failure for " Patchwork
