Intel-XE Archive on lore.kernel.org
* [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping
@ 2024-06-10 22:56 John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 1/7] drm/xe/guc: Remove spurious line feed in debug print John.C.Harrison
                   ` (7 more replies)
  0 siblings, 8 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

drm/xe/guc: Improve GuC log dumping and add dump on CT failures

There is a debug mechanism for dumping the GuC log as an ASCII hex
stream via dmesg. This is extremely useful for situations where it is
not possible to query the log from debugfs (self tests, bugs that cause
the driver to fail to load, system hangs, etc.). However, dumping via
dmesg is not the most reliable: the dmesg buffer is limited in size,
can be rate limited, and a simple hex stream is hard for tools to parse.

So add extra information to the dump to make it more robust and
parsable. This includes adding start and end tags to delimit the dump,
using longer lines to reduce the per line overhead, adding a rolling
count to check for missing lines and interleaved concurrent dumps and
adding other important information such as the GuC version number and
timestamp offset.
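
As a rough illustration (not part of this series), a tool consuming
such a dump could use the rolling count to detect dropped or
interleaved dmesg lines. The names below are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical tool-side sketch: track the rolling per-line counter
 * parsed from each dump line and detect gaps caused by dropped or
 * rate-limited dmesg output.
 */
struct dump_check {
	unsigned int expect;	/* next counter value we expect */
	unsigned int lost;	/* total lines detected as missing */
	bool started;
};

/* Feed the counter parsed from one dump line; returns false on a gap. */
static bool dump_check_feed(struct dump_check *c, unsigned int counter)
{
	if (!c->started) {
		c->started = true;
		c->expect = counter + 1;
		return true;
	}
	if (counter != c->expect) {
		c->lost += counter - c->expect;
		c->expect = counter + 1;
		return false;
	}
	c->expect++;
	return true;
}
```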

There are various internal error states that the CTB code can check
for. These should never happen but when they do (driver bug, firmware
bug or even hardware bug), they can be a nightmare to debug. So add in
a capture of the GuC log and CT state at the point of error and
subsequent dump from a worker thread.

Note that the ultimate aim is to add the GuC log to the devcoredump
capture. And then to provide a mechanism for generating a devcoredump
at an arbitrary point (such as dead CTB or failed selftest) and
dumping that to dmesg. Unfortunately, there are still issues with
doing that. But this is a good first step along the way.

v2: Remove pm get/put as unnecessary (review feedback from Matthew B).
v3: Add firmware filename and 'wanted' version number.
v4: Use DRM level line printer wrapper from Michal W. Add 'dead CTB'
dump support. Lots of restructuring of capture vs dump for both GuC
log and CTB capture for both the dead CTB dump and for future
inclusion in devcoredump.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (6):
  drm/xe/guc: Remove spurious line feed in debug print
  drm/xe/guc: Copy GuC log prior to dumping
  drm/xe/guc: Use a two stage dump for GuC logs and add more info
  drm/xe/guc: Add a helper function for dumping GuC log to dmesg
  drm/xe/guc: Dead CT helper
  drm/xe/guc: Dump entire CTB on errors

Michal Wajdeczko (1):
  drm/print: Introduce drm_line_printer

 drivers/gpu/drm/drm_print.c                   |  14 +
 .../drm/xe/abi/guc_communication_ctb_abi.h    |   1 +
 drivers/gpu/drm/xe/regs/xe_guc_regs.h         |   1 +
 drivers/gpu/drm/xe/xe_devcoredump.c           |   2 +-
 drivers/gpu/drm/xe/xe_guc_ct.c                | 355 ++++++++++++++----
 drivers/gpu/drm/xe/xe_guc_ct.h                |   9 +-
 drivers/gpu/drm/xe/xe_guc_ct_types.h          |  24 ++
 drivers/gpu/drm/xe/xe_guc_debugfs.c           |   2 +-
 drivers/gpu/drm/xe/xe_guc_log.c               | 227 ++++++++++-
 drivers/gpu/drm/xe/xe_guc_log.h               |  11 +-
 drivers/gpu/drm/xe/xe_guc_log_types.h         |  29 ++
 include/drm/drm_print.h                       |  64 ++++
 12 files changed, 636 insertions(+), 103 deletions(-)

-- 
2.43.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v4 1/7] drm/xe/guc: Remove spurious line feed in debug print
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 2/7] drm/xe/guc: Copy GuC log prior to dumping John.C.Harrison
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison, Michal Wajdeczko

From: John Harrison <John.C.Harrison@Intel.com>

Including line feeds at the start of a debug print messes up the
output when sent to dmesg. The break actually appears between all the
useful prefix information and the actual string being printed. In this
case, each block of data has a very clear start line and an extra
delimiter is really not necessary. So don't do it.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index c1f258348f5c..83ac14b1eade 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1439,7 +1439,7 @@ void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
 		drm_puts(p, "H2G CTB (all sizes in DW):\n");
 		guc_ctb_snapshot_print(&snapshot->h2g, p);
 
-		drm_puts(p, "\nG2H CTB (all sizes in DW):\n");
+		drm_puts(p, "G2H CTB (all sizes in DW):\n");
 		guc_ctb_snapshot_print(&snapshot->g2h, p);
 
 		drm_printf(p, "\tg2h outstanding: %d\n",
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 2/7] drm/xe/guc: Copy GuC log prior to dumping
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 1/7] drm/xe/guc: Remove spurious line feed in debug print John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 3/7] drm/xe/guc: Use a two stage dump for GuC logs and add more info John.C.Harrison
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

Refactor the hexdump code into a separate function ready to be used
for dumps of other objects. Also change to dumping a host memory copy
rather than the live GPU buffer object. Doing so helps prevent
inconsistencies due to the log being updated as it is being dumped. It
also paves the way for decoupling the save from the print to allow
inclusion in error reports such as the devcoredump.

Switch to use the dedicated kernel hexdump helper rather than printf.
The helper makes it easier to print out much wider lines which can
dramatically reduce the total line count of the dump (useful when
dumping to dmesg).
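
As a rough userspace sketch of the wider-line idea (the kernel patch
uses the dedicated hex_dump_to_buffer() helper instead; the names here
are illustrative only), packing many 32-bit words per line cuts the
per-line prefix overhead:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Sketch only: format up to WORDS_PER_LINE 32-bit words into one output
 * line, instead of one printf per four words, to reduce total line count.
 */
#define WORDS_PER_LINE	8

static int format_hex_line(char *buf, size_t buflen,
			   const uint32_t *words, size_t count)
{
	int len = 0;
	size_t i;

	for (i = 0; i < count && i < WORDS_PER_LINE; i++)
		len += snprintf(buf + len, buflen - len, "%s%08" PRIx32,
				i ? " " : "", words[i]);
	return len;	/* characters written, excluding the NUL */
}
```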

Another issue with dumping such a large buffer is that it can be slow,
especially if dumping to dmesg over a serial port. So add a yield to
prevent the 'task has been stuck for 120s' kernel hang check feature
from firing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/xe/xe_guc_debugfs.c |   2 +-
 drivers/gpu/drm/xe/xe_guc_log.c     | 119 ++++++++++++++++++++++++----
 drivers/gpu/drm/xe/xe_guc_log.h     |   2 +-
 3 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_guc_debugfs.c b/drivers/gpu/drm/xe/xe_guc_debugfs.c
index d3822cbea273..68f1f728c22c 100644
--- a/drivers/gpu/drm/xe/xe_guc_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_guc_debugfs.c
@@ -41,7 +41,7 @@ static int guc_log(struct seq_file *m, void *data)
 	struct drm_printer p = drm_seq_file_printer(m);
 
 	xe_pm_runtime_get(xe);
-	xe_guc_log_print(&guc->log, &p);
+	xe_guc_log_print(&guc->log, &p, false);
 	xe_pm_runtime_put(xe);
 
 	return 0;
diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index a37ee3419428..a35309926271 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -9,6 +9,7 @@
 
 #include "xe_bo.h"
 #include "xe_gt.h"
+#include "xe_gt_printk.h"
 #include "xe_map.h"
 #include "xe_module.h"
 
@@ -49,32 +50,120 @@ static size_t guc_log_size(void)
 		CAPTURE_BUFFER_SIZE;
 }
 
-void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p)
+#define BYTES_PER_WORD		sizeof(u32)
+#define WORDS_PER_DUMP		8
+#define DUMPS_PER_LINE		4
+#define LINES_PER_READ		4
+#define WORDS_PER_READ		(WORDS_PER_DUMP * DUMPS_PER_LINE * LINES_PER_READ)
+
+static void xe_hexdump_blob(struct xe_device *xe, const void *blob, size_t size,
+			    struct drm_printer *p, bool atomic)
+{
+	char line_buff[DUMPS_PER_LINE * WORDS_PER_DUMP * 9 + 1];
+	const u32 *blob32 = (const u32 *)blob;
+	int i, j, k;
+
+	if (size % (WORDS_PER_READ * BYTES_PER_WORD)) {
+		u32 remain = size % (WORDS_PER_READ * BYTES_PER_WORD);
+
+		drm_err(&xe->drm, "Invalid size for hexdump: 0x%zX vs 0x%zX (%d * %zu) -> 0x%X\n",
+			size, WORDS_PER_READ * BYTES_PER_WORD,
+			WORDS_PER_READ, BYTES_PER_WORD, remain);
+
+		size -= remain;
+		if (!size)
+			return;
+	}
+
+	for (i = 0; i < size / BYTES_PER_WORD; i += WORDS_PER_READ) {
+		const u32 *src = ((const u32 *)blob32) + i;
+
+		for (j = 0; j < WORDS_PER_READ; ) {
+			u32 done = 0;
+
+			for (k = 0; k < DUMPS_PER_LINE; k++) {
+				line_buff[done++] = ' ';
+				done += hex_dump_to_buffer(src + j,
+							   sizeof(*src) * (WORDS_PER_READ - j),
+							   WORDS_PER_DUMP * BYTES_PER_WORD,
+							   BYTES_PER_WORD,
+							   line_buff + done,
+							   sizeof(line_buff) - done,
+							   false);
+				j += WORDS_PER_DUMP;
+			}
+
+			drm_printf(p, "%s\n", line_buff);
+
+			/*
+			 * If spewing large amounts of data via a serial console,
+			 * this can be a very slow process. So be friendly and try
+			 * not to cause 'softlockup on CPU' problems.
+			 */
+			if (!atomic)
+				cond_resched();
+		}
+	}
+}
+
+#define GUC_LOG_CHUNK_SIZE	SZ_2M
+
+void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic)
 {
 	struct xe_device *xe = log_to_xe(log);
-	size_t size;
-	int i, j;
+	size_t size, remain;
+	void **copy;
+	int num_chunks, i;
 
 	xe_assert(xe, log->bo);
 
+	/*
+	 * NB: kmalloc has a hard limit well below the maximum GuC log buffer size.
+	 * Also, can't use vmalloc as might be called from atomic context. So need
+	 * to break the buffer up into smaller chunks that can be allocated.
+	 */
 	size = log->bo->size;
+	num_chunks = (size + GUC_LOG_CHUNK_SIZE - 1) / GUC_LOG_CHUNK_SIZE;
 
-#define DW_PER_READ		128
-	xe_assert(xe, !(size % (DW_PER_READ * sizeof(u32))));
-	for (i = 0; i < size / sizeof(u32); i += DW_PER_READ) {
-		u32 read[DW_PER_READ];
+	copy = kcalloc(num_chunks, sizeof(*copy), atomic ? GFP_ATOMIC : GFP_KERNEL);
+	if (!copy) {
+		drm_printf(p, "Failed to allocate array x%d", num_chunks);
+		return;
+	}
 
-		xe_map_memcpy_from(xe, read, &log->bo->vmap, i * sizeof(u32),
-				   DW_PER_READ * sizeof(u32));
-#define DW_PER_PRINT		4
-		for (j = 0; j < DW_PER_READ / DW_PER_PRINT; ++j) {
-			u32 *print = read + j * DW_PER_PRINT;
+	remain = size;
+	for (i = 0; i < num_chunks; i++) {
+		size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
 
-			drm_printf(p, "0x%08x 0x%08x 0x%08x 0x%08x\n",
-				   *(print + 0), *(print + 1),
-				   *(print + 2), *(print + 3));
+		copy[i] = kmalloc(size, atomic ? GFP_ATOMIC : GFP_KERNEL);
+		if (!copy[i]) {
+			drm_printf(p, "Failed to allocate %ld at chunk %d of %d",
+				   size, i, num_chunks);
+			goto out;
 		}
+		remain -= size;
 	}
+
+	remain = size;
+	for (i = 0; i < num_chunks; i++) {
+		size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
+
+		xe_map_memcpy_from(xe, copy[i], &log->bo->vmap, i * GUC_LOG_CHUNK_SIZE, size);
+		remain -= size;
+	}
+
+	remain = size;
+	for (i = 0; i < num_chunks; i++) {
+		size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
+
+		xe_hexdump_blob(xe, copy[i], size, p, atomic);
+		remain -= size;
+	}
+
+out:
+	for (i = 0; i < num_chunks; i++)
+		kfree(copy[i]);
+	kfree(copy);
 }
 
 int xe_guc_log_init(struct xe_guc_log *log)
diff --git a/drivers/gpu/drm/xe/xe_guc_log.h b/drivers/gpu/drm/xe/xe_guc_log.h
index 2d25ab28b4b3..5149b492c3b8 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.h
+++ b/drivers/gpu/drm/xe/xe_guc_log.h
@@ -37,7 +37,7 @@ struct drm_printer;
 #define GUC_LOG_LEVEL_MAX GUC_VERBOSITY_TO_LOG_LEVEL(GUC_LOG_VERBOSITY_MAX)
 
 int xe_guc_log_init(struct xe_guc_log *log);
-void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p);
+void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic);
 
 static inline u32
 xe_guc_log_get_level(struct xe_guc_log *log)
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 3/7] drm/xe/guc: Use a two stage dump for GuC logs and add more info
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 1/7] drm/xe/guc: Remove spurious line feed in debug print John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 2/7] drm/xe/guc: Copy GuC log prior to dumping John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 4/7] drm/print: Introduce drm_line_printer John.C.Harrison
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

Split the GuC log dump into a two stage snapshot and print mechanism.
This allows the log to be captured at the point of an error (which may
be in a restricted context) and then dump it out later (from a regular
context such as a worker function or a sysfs file handler).

Also add a bunch of other useful pieces of information that can help
(or are fundamentally required!) to decode and parse the log.
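
The capture/print split can be sketched in miniature like this (a
userspace illustration with hypothetical names, not the driver's API):
capture copies the live buffer at the moment of the error, and printing
can then happen later from a normal context, unaffected by further
changes to the live buffer:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: a frozen host-memory copy of a live log buffer. */
struct log_snapshot {
	size_t size;
	unsigned char *copy;
};

static struct log_snapshot *snapshot_capture(const unsigned char *live,
					     size_t size)
{
	struct log_snapshot *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->copy = malloc(size);
	if (!s->copy) {
		free(s);
		return NULL;
	}
	memcpy(s->copy, live, size);	/* freeze the contents now */
	s->size = size;
	return s;
}

static void snapshot_free(struct log_snapshot *s)
{
	if (!s)
		return;
	free(s->copy);
	free(s);
}
```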

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/xe/regs/xe_guc_regs.h |   1 +
 drivers/gpu/drm/xe/xe_guc_log.c       | 150 ++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_guc_log.h       |   6 ++
 drivers/gpu/drm/xe/xe_guc_log_types.h |  29 +++++
 4 files changed, 155 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_guc_regs.h b/drivers/gpu/drm/xe/regs/xe_guc_regs.h
index a5fd14307f94..b27b73680c12 100644
--- a/drivers/gpu/drm/xe/regs/xe_guc_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_guc_regs.h
@@ -84,6 +84,7 @@
 #define   HUC_LOADING_AGENT_GUC			REG_BIT(1)
 #define   GUC_WOPCM_OFFSET_VALID		REG_BIT(0)
 #define GUC_MAX_IDLE_COUNT			XE_REG(0xc3e4)
+#define GUC_PMTIMESTAMP				XE_REG(0xc3e8)
 
 #define GUC_SEND_INTERRUPT			XE_REG(0xc4c8)
 #define   GUC_SEND_TRIGGER			REG_BIT(0)
diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index a35309926271..a35d3cc0c369 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -7,12 +7,20 @@
 
 #include <drm/drm_managed.h>
 
+#include "regs/xe_guc_regs.h"
 #include "xe_bo.h"
 #include "xe_gt.h"
 #include "xe_gt_printk.h"
 #include "xe_map.h"
+#include "xe_mmio.h"
 #include "xe_module.h"
 
+static struct xe_guc *
+log_to_guc(struct xe_guc_log *log)
+{
+	return container_of(log, struct xe_guc, log);
+}
+
 static struct xe_gt *
 log_to_gt(struct xe_guc_log *log)
 {
@@ -108,62 +116,142 @@ static void xe_hexdump_blob(struct xe_device *xe, const void *blob, size_t size,
 
 #define GUC_LOG_CHUNK_SIZE	SZ_2M
 
-void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic)
+struct xe_guc_log_snapshot *xe_guc_log_snapshot_alloc(struct xe_guc_log *log, bool atomic)
 {
-	struct xe_device *xe = log_to_xe(log);
-	size_t size, remain;
-	void **copy;
-	int num_chunks, i;
+	struct xe_guc_log_snapshot *snapshot;
+	size_t remain;
+	int i;
 
-	xe_assert(xe, log->bo);
+	snapshot = kzalloc(sizeof(*snapshot), atomic ? GFP_ATOMIC : GFP_KERNEL);
+	if (!snapshot)
+		return NULL;
 
 	/*
 	 * NB: kmalloc has a hard limit well below the maximum GuC log buffer size.
 	 * Also, can't use vmalloc as might be called from atomic context. So need
 	 * to break the buffer up into smaller chunks that can be allocated.
 	 */
-	size = log->bo->size;
-	num_chunks = (size + GUC_LOG_CHUNK_SIZE - 1) / GUC_LOG_CHUNK_SIZE;
+	snapshot->size = log->bo->size;
+	snapshot->num_chunks = (snapshot->size + GUC_LOG_CHUNK_SIZE - 1) / GUC_LOG_CHUNK_SIZE;
 
-	copy = kcalloc(num_chunks, sizeof(*copy), atomic ? GFP_ATOMIC : GFP_KERNEL);
-	if (!copy) {
-		drm_printf(p, "Failed to allocate array x%d", num_chunks);
-		return;
-	}
+	snapshot->copy = kcalloc(snapshot->num_chunks, sizeof(*snapshot->copy),
+				 atomic ? GFP_ATOMIC : GFP_KERNEL);
+	if (!snapshot->copy)
+		goto fail_snap;
 
-	remain = size;
-	for (i = 0; i < num_chunks; i++) {
+	remain = snapshot->size;
+	for (i = 0; i < snapshot->num_chunks; i++) {
 		size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
 
-		copy[i] = kmalloc(size, atomic ? GFP_ATOMIC : GFP_KERNEL);
-		if (!copy[i]) {
-			drm_printf(p, "Failed to allocate %ld at chunk %d of %d",
-				   size, i, num_chunks);
-			goto out;
-		}
+		snapshot->copy[i] = kmalloc(size, atomic ? GFP_ATOMIC : GFP_KERNEL);
+		if (!snapshot->copy[i])
+			goto fail_copy;
 		remain -= size;
 	}
 
-	remain = size;
-	for (i = 0; i < num_chunks; i++) {
+	return snapshot;
+
+fail_copy:
+	for (i = 0; i < snapshot->num_chunks; i++)
+		kfree(snapshot->copy[i]);
+	kfree(snapshot->copy);
+fail_snap:
+	kfree(snapshot);
+	return NULL;
+}
+
+void xe_guc_log_snapshot_free(struct xe_guc_log_snapshot *snapshot)
+{
+	int i;
+
+	if (!snapshot)
+		return;
+
+	if (snapshot->copy) {
+		for (i = 0; i < snapshot->num_chunks; i++)
+			kfree(snapshot->copy[i]);
+		kfree(snapshot->copy);
+	}
+
+	kfree(snapshot);
+}
+
+struct xe_guc_log_snapshot *xe_guc_log_snapshot_capture(struct xe_guc_log *log, bool atomic)
+{
+	struct xe_guc_log_snapshot *snapshot;
+	struct xe_device *xe = log_to_xe(log);
+	struct xe_guc *guc = log_to_guc(log);
+	struct xe_gt *gt = log_to_gt(log);
+	size_t remain;
+	int i;
+
+	if (!log->bo) {
+		xe_gt_err(gt, "GuC log not allocated!\n");
+		return NULL;
+	}
+
+	snapshot = xe_guc_log_snapshot_alloc(log, atomic);
+	if (!snapshot) {
+		xe_gt_err(gt, "GuC log snapshot not allocated!\n");
+		return NULL;
+	}
+
+	remain = snapshot->size;
+	for (i = 0; i < snapshot->num_chunks; i++) {
 		size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
 
-		xe_map_memcpy_from(xe, copy[i], &log->bo->vmap, i * GUC_LOG_CHUNK_SIZE, size);
+		xe_map_memcpy_from(xe, snapshot->copy[i], &log->bo->vmap,
+				   i * GUC_LOG_CHUNK_SIZE, size);
 		remain -= size;
 	}
 
-	remain = size;
-	for (i = 0; i < num_chunks; i++) {
+	snapshot->ktime = ktime_get_boottime_ns();
+	snapshot->stamp = xe_mmio_read32(gt, GUC_PMTIMESTAMP);
+	snapshot->ref_clk = gt->info.reference_clock;
+	snapshot->level = log->level;
+	snapshot->ver_found = guc->fw.versions.found[XE_UC_FW_VER_RELEASE];
+	snapshot->ver_want = guc->fw.versions.wanted;
+	snapshot->path = guc->fw.path;
+
+	return snapshot;
+}
+
+void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot *snapshot,
+			       struct drm_printer *p, bool atomic)
+{
+	size_t remain;
+	int i;
+
+	if (!snapshot) {
+		drm_printf(p, "GuC log snapshot not allocated!\n");
+		return;
+	}
+
+	drm_printf(p, "GuC version %u.%u.%u (wanted %u.%u.%u)\n",
+		   snapshot->ver_found.major, snapshot->ver_found.minor, snapshot->ver_found.patch,
+		   snapshot->ver_want.major, snapshot->ver_want.minor, snapshot->ver_want.patch);
+	drm_printf(p, "GuC firmware: %s\n", snapshot->path);
+	drm_printf(p, "Kernel timestamp: 0x%08llX [%llu]\n", snapshot->ktime, snapshot->ktime);
+	drm_printf(p, "GuC timestamp: 0x%08X [%u]\n", snapshot->stamp, snapshot->stamp);
+	drm_printf(p, "CS timestamp frequency: %u Hz\n", snapshot->ref_clk);
+	drm_printf(p, "Log level: %u\n", snapshot->level);
+
+	remain = snapshot->size;
+	for (i = 0; i < snapshot->num_chunks; i++) {
 		size_t size = min(GUC_LOG_CHUNK_SIZE, remain);
 
-		xe_hexdump_blob(xe, copy[i], size, p, atomic);
+		xe_hexdump_blob(xe, snapshot->copy[i], size, p, atomic);
 		remain -= size;
 	}
+}
+
+void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic)
+{
+	struct xe_guc_log_snapshot *snapshot;
 
-out:
-	for (i = 0; i < num_chunks; i++)
-		kfree(copy[i]);
-	kfree(copy);
+	snapshot = xe_guc_log_snapshot_capture(log, atomic);
+	xe_guc_log_snapshot_print(log_to_xe(log), snapshot, p, atomic);
+	xe_guc_log_snapshot_free(snapshot);
 }
 
 int xe_guc_log_init(struct xe_guc_log *log)
diff --git a/drivers/gpu/drm/xe/xe_guc_log.h b/drivers/gpu/drm/xe/xe_guc_log.h
index 5149b492c3b8..6e4d0b369c19 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.h
+++ b/drivers/gpu/drm/xe/xe_guc_log.h
@@ -9,6 +9,7 @@
 #include "xe_guc_log_types.h"
 
 struct drm_printer;
+struct xe_device;
 
 #if IS_ENABLED(CONFIG_DRM_XE_LARGE_GUC_BUFFER)
 #define CRASH_BUFFER_SIZE       SZ_1M
@@ -38,6 +39,11 @@ struct drm_printer;
 
 int xe_guc_log_init(struct xe_guc_log *log);
 void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic);
+struct xe_guc_log_snapshot *xe_guc_log_snapshot_alloc(struct xe_guc_log *log, bool atomic);
+struct xe_guc_log_snapshot *xe_guc_log_snapshot_capture(struct xe_guc_log *log, bool atomic);
+void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot *snapshot,
+			       struct drm_printer *p, bool atomic);
+void xe_guc_log_snapshot_free(struct xe_guc_log_snapshot *snapshot);
 
 static inline u32
 xe_guc_log_get_level(struct xe_guc_log *log)
diff --git a/drivers/gpu/drm/xe/xe_guc_log_types.h b/drivers/gpu/drm/xe/xe_guc_log_types.h
index 125080d138a7..e3a6a44c2f18 100644
--- a/drivers/gpu/drm/xe/xe_guc_log_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_log_types.h
@@ -8,8 +8,37 @@
 
 #include <linux/types.h>
 
+#include "xe_uc_fw_types.h"
+
 struct xe_bo;
 
+/**
+ * struct xe_guc_log_snapshot:
+ * Capture of the GuC log plus various state useful for decoding the log
+ */
+struct xe_guc_log_snapshot {
+	/** @size: Size in bytes of the @copy allocation */
+	size_t size;
+	/** @copy: Host memory copy of the log buffer for later dumping, split into chunks */
+	void **copy;
+	/** @num_chunks: Number of chunks within @copy */
+	int num_chunks;
+	/** @ktime: Kernel time the snapshot was taken */
+	u64 ktime;
+	/** @stamp: GuC timestamp at which the snapshot was taken */
+	u32 stamp;
+	/** @ref_clk: GuC timestamp frequency */
+	u32 ref_clk;
+	/** @level: GuC log verbosity level */
+	u32 level;
+	/** @ver_found: GuC firmware version */
+	struct xe_uc_fw_version ver_found;
+	/** @ver_want: GuC firmware version that driver expected */
+	struct xe_uc_fw_version ver_want;
+	/** @path: Path of GuC firmware blob */
+	const char *path;
+};
+
 /**
  * struct xe_guc_log - GuC log
  */
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 4/7] drm/print: Introduce drm_line_printer
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
                   ` (2 preceding siblings ...)
  2024-06-10 22:56 ` [PATCH v4 3/7] drm/xe/guc: Use a two stage dump for GuC logs and add more info John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 5/7] drm/xe/guc: Add a helper function for dumping GuC log to dmesg John.C.Harrison
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: Michal Wajdeczko, Jani Nikula, John Harrison

From: Michal Wajdeczko <michal.wajdeczko@intel.com>

This drm printer wrapper can be used to increase the robustness of
the captured output generated by any other drm_printer. By adding line
numbers to each output line, it makes sure we can tell if any
intermediate lines of the output were lost. Helpful for capturing some
crash data.
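
A standalone userspace sketch of the same prefix scheme (the real
helper wraps another struct drm_printer; the names here are
illustrative only):

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Sketch: stamp each line with "series.counter" so a consumer can spot
 * dropped or interleaved lines from concurrent dumps.
 */
struct line_printer {
	unsigned int series;	/* 0 means "omit the series id" */
	unsigned int counter;
};

static int line_print(struct line_printer *lp, char *out, size_t outlen,
		      const char *msg)
{
	lp->counter++;
	if (lp->series)
		return snprintf(out, outlen, "%u.%u: %s",
				lp->series, lp->counter, msg);
	return snprintf(out, outlen, "%u: %s", lp->counter, msg);
}
```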

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
---
 drivers/gpu/drm/drm_print.c | 14 ++++++++
 include/drm/drm_print.h     | 64 +++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/drivers/gpu/drm/drm_print.c b/drivers/gpu/drm/drm_print.c
index cf2efb44722c..be9cbebff5b3 100644
--- a/drivers/gpu/drm/drm_print.c
+++ b/drivers/gpu/drm/drm_print.c
@@ -214,6 +214,20 @@ void __drm_printfn_err(struct drm_printer *p, struct va_format *vaf)
 }
 EXPORT_SYMBOL(__drm_printfn_err);
 
+void __drm_printfn_line(struct drm_printer *p, struct va_format *vaf)
+{
+	unsigned int counter = ++p->line.counter;
+	const char *prefix = p->prefix ?: "";
+	const char *pad = p->prefix ? " " : "";
+
+	if (p->line.series)
+		drm_printf(p->arg, "%s%s%u.%u: %pV",
+			   prefix, pad, p->line.series, counter, vaf);
+	else
+		drm_printf(p->arg, "%s%s%u: %pV", prefix, pad, counter, vaf);
+}
+EXPORT_SYMBOL(__drm_printfn_line);
+
 /**
  * drm_puts - print a const string to a &drm_printer stream
  * @p: the &drm printer
diff --git a/include/drm/drm_print.h b/include/drm/drm_print.h
index 089950ad8681..571fed602bd3 100644
--- a/include/drm/drm_print.h
+++ b/include/drm/drm_print.h
@@ -176,6 +176,10 @@ struct drm_printer {
 	void (*puts)(struct drm_printer *p, const char *str);
 	void *arg;
 	const char *prefix;
+	struct {
+		unsigned int series;
+		unsigned int counter;
+	} line;
 	enum drm_debug_category category;
 };
 
@@ -186,6 +190,7 @@ void __drm_puts_seq_file(struct drm_printer *p, const char *str);
 void __drm_printfn_info(struct drm_printer *p, struct va_format *vaf);
 void __drm_printfn_dbg(struct drm_printer *p, struct va_format *vaf);
 void __drm_printfn_err(struct drm_printer *p, struct va_format *vaf);
+void __drm_printfn_line(struct drm_printer *p, struct va_format *vaf);
 
 __printf(2, 3)
 void drm_printf(struct drm_printer *p, const char *f, ...);
@@ -357,6 +362,65 @@ static inline struct drm_printer drm_err_printer(struct drm_device *drm,
 	return p;
 }
 
+/**
+ * drm_line_printer - construct a &drm_printer that prefixes outputs with line numbers
+ * @p: the &struct drm_printer which actually generates the output
+ * @prefix: optional output prefix, or NULL for no prefix
+ * @series: optional unique series identifier, or 0 to omit identifier in the output
+ *
+ * This printer can be used to increase the robustness of the captured output
+ * to make sure we didn't lose any intermediate lines of the output. Helpful
+ * while capturing some crash data.
+ *
+ * Example 1::
+ *
+ *	void crash_dump(struct drm_device *drm)
+ *	{
+ *		static unsigned int id;
+ *		struct drm_printer p = drm_err_printer(drm, "crash");
+ *		struct drm_printer lp = drm_line_printer(&p, "dump", ++id);
+ *
+ *		drm_printf(&lp, "foo");
+ *		drm_printf(&lp, "bar");
+ *	}
+ *
+ * Above code will print into the dmesg something like::
+ *
+ *	[ ] 0000:00:00.0: [drm] *ERROR* crash dump 1.1: foo
+ *	[ ] 0000:00:00.0: [drm] *ERROR* crash dump 1.2: bar
+ *
+ * Example 2::
+ *
+ *	void line_dump(struct device *dev)
+ *	{
+ *		struct drm_printer p = drm_info_printer(dev);
+ *		struct drm_printer lp = drm_line_printer(&p, NULL, 0);
+ *
+ *		drm_printf(&lp, "foo");
+ *		drm_printf(&lp, "bar");
+ *	}
+ *
+ * Above code will print::
+ *
+ *	[ ] 0000:00:00.0: [drm] 1: foo
+ *	[ ] 0000:00:00.0: [drm] 2: bar
+ *
+ * RETURNS:
+ * The &drm_printer object
+ */
+static inline struct drm_printer drm_line_printer(struct drm_printer *p,
+						  const char *prefix,
+						  unsigned int series)
+{
+	struct drm_printer lp = {
+		.printfn = __drm_printfn_line,
+		.arg = p,
+		.prefix = prefix,
+		.line = { .series = series, },
+	};
+	return lp;
+}
+
 /*
  * struct device based logging
  *
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 5/7] drm/xe/guc: Add a helper function for dumping GuC log to dmesg
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
                   ` (3 preceding siblings ...)
  2024-06-10 22:56 ` [PATCH v4 4/7] drm/print: Introduce drm_line_printer John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 6/7] drm/xe/guc: Dead CT helper John.C.Harrison
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

Create a helper function that can be used to dump the GuC log to dmesg
in a manner that is reliable for extraction and decode. The intention
is that calls to this can be added by developers when debugging
specific issues that require a GuC log but do not allow easy capture
of the log - e.g. failures in selftests and failures that lead to
kernel hangs.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/xe/xe_guc_log.c | 14 ++++++++++++++
 drivers/gpu/drm/xe/xe_guc_log.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index a35d3cc0c369..9714ee6f99e0 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -245,6 +245,20 @@ void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot
 	}
 }
 
+void xe_guc_log_print_dmesg(struct xe_guc_log *log, bool atomic)
+{
+	struct xe_gt *gt = log_to_gt(log);
+	static int g_count;
+	struct drm_printer ip = xe_gt_info_printer(gt);
+	struct drm_printer lp = drm_line_printer(&ip, "Capture", ++g_count);
+
+	drm_printf(&lp, "Dumping GuC log for %ps...\n", __builtin_return_address(0));
+
+	xe_guc_log_print(log, &lp, atomic);
+
+	drm_printf(&lp, "Done.\n");
+}
+
 void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic)
 {
 	struct xe_guc_log_snapshot *snapshot;
diff --git a/drivers/gpu/drm/xe/xe_guc_log.h b/drivers/gpu/drm/xe/xe_guc_log.h
index 6e4d0b369c19..29154ab507ff 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.h
+++ b/drivers/gpu/drm/xe/xe_guc_log.h
@@ -39,6 +39,7 @@ struct xe_device;
 
 int xe_guc_log_init(struct xe_guc_log *log);
 void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic);
+void xe_guc_log_print_dmesg(struct xe_guc_log *log, bool atomic);
 struct xe_guc_log_snapshot *xe_guc_log_snapshot_alloc(struct xe_guc_log *log, bool atomic);
 struct xe_guc_log_snapshot *xe_guc_log_snapshot_capture(struct xe_guc_log *log, bool atomic);
 void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot *snapshot,
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 6/7] drm/xe/guc: Dead CT helper
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
                   ` (4 preceding siblings ...)
  2024-06-10 22:56 ` [PATCH v4 5/7] drm/xe/guc: Add a helper function for dumping GuC log to dmesg John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 22:56 ` [PATCH v4 7/7] drm/xe/guc: Dump entire CTB on errors John.C.Harrison
  2024-06-10 23:01 ` ✗ CI.Patch_applied: failure for drm/xe/guc: Improve quality and robustness of GuC log dumping Patchwork
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

Add a worker function helper for asynchronously dumping state when an
internal/fatal error is detected in CT processing. Being asynchronous
is required to avoid deadlocks and scheduling-while-atomic or
process-stalled-for-too-long issues. Also check for a bunch more error
conditions and improve the handling of some existing checks.
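
The reason-code scheme used by the patch can be reduced to this
sketch: each error site gets an enum value, and errors are accumulated
as a bitmask so every distinct failure seen before the asynchronous
dump runs remains visible (only a few of the patch's codes are shown):

```c
/* A few of the reason codes from the patch, as an illustration. */
enum ct_dead_reason {
	CT_DEAD_ALIVE = 0,
	CT_DEAD_RESET,
	CT_DEAD_SETUP,
	CT_DEAD_H2G_WRITE,
};

/* Accumulate one failure reason into the reported bitmask. */
static unsigned int ct_dead_mark(unsigned int mask, enum ct_dead_reason why)
{
	return mask | (1u << why);
}
```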

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../drm/xe/abi/guc_communication_ctb_abi.h    |   1 +
 drivers/gpu/drm/xe/xe_guc_ct.c                | 256 ++++++++++++++++--
 drivers/gpu/drm/xe/xe_guc_ct_types.h          |  22 ++
 3 files changed, 258 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
index 8f86a16dc577..f58198cf2cf6 100644
--- a/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
+++ b/drivers/gpu/drm/xe/abi/guc_communication_ctb_abi.h
@@ -52,6 +52,7 @@ struct guc_ct_buffer_desc {
 #define GUC_CTB_STATUS_OVERFLOW				(1 << 0)
 #define GUC_CTB_STATUS_UNDERFLOW			(1 << 1)
 #define GUC_CTB_STATUS_MISMATCH				(1 << 2)
+#define GUC_CTB_STATUS_DISABLED				(1 << 3)
 	u32 reserved[13];
 } __packed;
 static_assert(sizeof(struct guc_ct_buffer_desc) == 64);
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 83ac14b1eade..fb8469394a33 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -25,12 +25,58 @@
 #include "xe_gt_sriov_pf_monitor.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_guc.h"
+#include "xe_guc_log.h"
 #include "xe_guc_relay.h"
 #include "xe_guc_submit.h"
 #include "xe_map.h"
 #include "xe_pm.h"
 #include "xe_trace.h"
 
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+enum {
+	CT_DEAD_ALIVE = 0,
+	CT_DEAD_RESET,				/* 0x0002 */
+	CT_DEAD_SETUP,				/* 0x0004 */
+	CT_DEAD_H2G_WRITE,			/* 0x0008 */
+	CT_DEAD_H2G_HAS_ROOM,			/* 0x0010 */
+	CT_DEAD_G2H_READ,			/* 0x0020 */
+	CT_DEAD_G2H_RECV,			/* 0x0040 */
+	CT_DEAD_G2H_RELEASE,			/* 0x0080 */
+	CT_DEAD_DEADLOCK,			/* 0x0100 */
+	CT_DEAD_PROCESS_FAILED,			/* 0x0200 */
+	CT_DEAD_FAST_G2H,			/* 0x0400 */
+	CT_DEAD_PARSE_G2H_RESPONSE,		/* 0x0800 */
+	CT_DEAD_PARSE_G2H_UNKNOWN,		/* 0x1000 */
+	CT_DEAD_PARSE_G2H_ORIGIN,		/* 0x2000 */
+	CT_DEAD_PARSE_G2H_TYPE,			/* 0x4000 */
+};
+
+static void ct_dead_worker_func(struct work_struct *w);
+
+#define CT_DEAD(ct, hxg, reason_code) \
+	do { \
+		struct guc_ctb *_hxg = (hxg); \
+		if (_hxg) \
+			_hxg->info.broken = true; \
+		if (!(ct)->dead.reported) { \
+			struct xe_guc *guc = ct_to_guc(ct); \
+			spin_lock_irq(&(ct)->dead.lock); \
+			(ct)->dead.reason |= 1 << CT_DEAD_##reason_code; \
+			(ct)->dead.snapshot_log = xe_guc_log_snapshot_capture(&guc->log, true); \
+			(ct)->dead.snapshot_ct = xe_guc_ct_snapshot_capture((ct), true); \
+			spin_unlock_irq(&(ct)->dead.lock); \
+			queue_work(system_unbound_wq, &(ct)->dead.worker); \
+		} \
+	} while (0)
+#else
+#define CT_DEAD(ct, hxg, reason) \
+	do { \
+		struct guc_ctb *_hxg = (hxg); \
+		if (_hxg) \
+			_hxg->info.broken = true; \
+	} while (0)
+#endif
+
 /* Used when a CT send wants to block and / or receive data */
 struct g2h_fence {
 	u32 *response_buffer;
@@ -155,6 +201,10 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
 	spin_lock_init(&ct->fast_lock);
 	xa_init(&ct->fence_lookup);
 	INIT_WORK(&ct->g2h_worker, g2h_worker_func);
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+	spin_lock_init(&ct->dead.lock);
+	INIT_WORK(&ct->dead.worker, ct_dead_worker_func);
+#endif
 	init_waitqueue_head(&ct->wq);
 	init_waitqueue_head(&ct->g2h_fence_wq);
 
@@ -350,10 +400,17 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
 	wake_up_all(&ct->wq);
 	xe_gt_dbg(gt, "GuC CT communication channel enabled\n");
 
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+	spin_lock_irq(&ct->dead.lock);
+	if (ct->dead.reason)
+		ct->dead.reason |= 1 << CT_DEAD_RESET;
+	spin_unlock_irq(&ct->dead.lock);
+#endif
 	return 0;
 
 err_out:
 	xe_gt_err(gt, "Failed to enable GuC CT (%pe)\n", ERR_PTR(err));
+	CT_DEAD(ct, NULL, SETUP);
 
 	return err;
 }
@@ -396,6 +453,19 @@ static bool h2g_has_room(struct xe_guc_ct *ct, u32 cmd_len)
 
 	if (cmd_len > h2g->info.space) {
 		h2g->info.head = desc_read(ct_to_xe(ct), h2g, head);
+
+		if (h2g->info.head > h2g->info.size) {
+			struct xe_device *xe = ct_to_xe(ct);
+			u32 desc_status = desc_read(xe, h2g, status);
+
+			desc_write(xe, h2g, status, desc_status | GUC_CTB_STATUS_OVERFLOW);
+
+			xe_gt_err(ct_to_gt(ct), "CT: invalid head offset %u > %u\n",
+				  h2g->info.head, h2g->info.size);
+			CT_DEAD(ct, h2g, H2G_HAS_ROOM);
+			return false;
+		}
+
 		h2g->info.space = CIRC_SPACE(h2g->info.tail, h2g->info.head,
 					     h2g->info.size) -
 				  h2g->info.resv_space;
@@ -447,8 +517,16 @@ static void __g2h_reserve_space(struct xe_guc_ct *ct, u32 g2h_len, u32 num_g2h)
 static void __g2h_release_space(struct xe_guc_ct *ct, u32 g2h_len)
 {
 	lockdep_assert_held(&ct->fast_lock);
-	xe_gt_assert(ct_to_gt(ct), ct->ctbs.g2h.info.space + g2h_len <=
-		     ct->ctbs.g2h.info.size - ct->ctbs.g2h.info.resv_space);
+	if (ct->ctbs.g2h.info.space + g2h_len >
+	    ct->ctbs.g2h.info.size - ct->ctbs.g2h.info.resv_space) {
+		xe_gt_err(ct_to_gt(ct), "Invalid G2H release: %d + %d vs %d - %d -> %d vs %d!\n",
+			  ct->ctbs.g2h.info.space, g2h_len,
+			  ct->ctbs.g2h.info.size, ct->ctbs.g2h.info.resv_space,
+			  ct->ctbs.g2h.info.space + g2h_len,
+			  ct->ctbs.g2h.info.size - ct->ctbs.g2h.info.resv_space);
+		CT_DEAD(ct, &ct->ctbs.g2h, G2H_RELEASE);
+		return;
+	}
 
 	ct->ctbs.g2h.info.space += g2h_len;
 	--ct->g2h_outstanding;
@@ -474,12 +552,44 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
 	u32 full_len;
 	struct iosys_map map = IOSYS_MAP_INIT_OFFSET(&h2g->cmds,
 							 tail * sizeof(u32));
+	u32 desc_status;
 
 	full_len = len + GUC_CTB_HDR_LEN;
 
 	lockdep_assert_held(&ct->lock);
 	xe_gt_assert(gt, full_len <= GUC_CTB_MSG_MAX_LEN);
-	xe_gt_assert(gt, tail <= h2g->info.size);
+
+	desc_status = desc_read(xe, h2g, status);
+	if (desc_status) {
+		xe_gt_err(gt, "CT write: non-zero status: %u\n", desc_status);
+		goto corrupted;
+	}
+
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+{
+	u32 desc_tail = desc_read(xe, h2g, tail);
+	u32 desc_head = desc_read(xe, h2g, head);
+
+	if (tail != desc_tail) {
+		desc_write(xe, h2g, status, desc_status | GUC_CTB_STATUS_MISMATCH);
+		xe_gt_err(gt, "CT write: tail was modified %u != %u\n", desc_tail, tail);
+		goto corrupted;
+	}
+
+	if (tail > h2g->info.size) {
+		desc_write(xe, h2g, status, desc_status | GUC_CTB_STATUS_OVERFLOW);
+		xe_gt_err(gt, "CT write: tail out of range: %u vs %u\n", tail, h2g->info.size);
+		goto corrupted;
+	}
+
+	if (desc_head >= h2g->info.size) {
+		desc_write(xe, h2g, status, desc_status | GUC_CTB_STATUS_OVERFLOW);
+		xe_gt_err(gt, "CT write: invalid head offset %u >= %u\n",
+			  desc_head, h2g->info.size);
+		goto corrupted;
+	}
+}
+#endif
 
 	/* Command will wrap, zero fill (NOPs), return and check credits again */
 	if (tail + full_len > h2g->info.size) {
@@ -532,6 +642,10 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
 			     desc_read(xe, h2g, head), h2g->info.tail);
 
 	return 0;
+
+corrupted:
+	CT_DEAD(ct, &ct->ctbs.h2g, H2G_WRITE);
+	return -EPIPE;
 }
 
 /*
@@ -642,7 +756,6 @@ static int guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action, u32 len,
 			      struct g2h_fence *g2h_fence)
 {
 	struct xe_gt *gt = ct_to_gt(ct);
-	struct drm_printer p = xe_gt_info_printer(gt);
 	unsigned int sleep_period_ms = 1;
 	int ret;
 
@@ -695,8 +808,13 @@ static int guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action, u32 len,
 			goto broken;
 #undef g2h_avail
 
-		if (dequeue_one_g2h(ct) < 0)
+		ret = dequeue_one_g2h(ct);
+		if (ret < 0) {
+			if (ret != -ECANCELED)
+				xe_gt_err(ct_to_gt(ct), "CTB receive failed (%pe)\n",
+					  ERR_PTR(ret));
 			goto broken;
+		}
 
 		goto try_again;
 	}
@@ -705,8 +823,7 @@ static int guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action, u32 len,
 
 broken:
 	xe_gt_err(gt, "No forward process on H2G, reset required\n");
-	xe_guc_ct_print(ct, &p, true);
-	ct->ctbs.h2g.info.broken = true;
+	CT_DEAD(ct, &ct->ctbs.h2g, DEADLOCK);
 
 	return -EDEADLK;
 }
@@ -933,6 +1050,7 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
 		else
 			xe_gt_err(gt, "unexpected response %u for FAST_REQ H2G fence 0x%x!\n",
 				  type, fence);
+		CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
 
 		return -EPROTO;
 	}
@@ -941,8 +1059,9 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
 	if (unlikely(!g2h_fence)) {
 		/* Don't tear down channel, as send could've timed out */
 		xe_gt_warn(gt, "G2H fence (%u) not found!\n", fence);
+		CT_DEAD(ct, NULL, PARSE_G2H_UNKNOWN);
 		g2h_release_space(ct, GUC_CTB_HXG_MSG_MAX_LEN);
-		return 0;
+		return -EPROTO;
 	}
 
 	xe_gt_assert(gt, fence == g2h_fence->seqno);
@@ -984,7 +1103,7 @@ static int parse_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
 	if (unlikely(origin != GUC_HXG_ORIGIN_GUC)) {
 		xe_gt_err(gt, "G2H channel broken on read, origin=%u, reset required\n",
 			  origin);
-		ct->ctbs.g2h.info.broken = true;
+		CT_DEAD(ct, &ct->ctbs.g2h, PARSE_G2H_ORIGIN);
 
 		return -EPROTO;
 	}
@@ -1002,7 +1121,7 @@ static int parse_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
 	default:
 		xe_gt_err(gt, "G2H channel broken on read, type=%u, reset required\n",
 			  type);
-		ct->ctbs.g2h.info.broken = true;
+		CT_DEAD(ct, &ct->ctbs.g2h, PARSE_G2H_TYPE);
 
 		ret = -EOPNOTSUPP;
 	}
@@ -1079,9 +1198,11 @@ static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
 		xe_gt_err(gt, "unexpected G2H action 0x%04x\n", action);
 	}
 
-	if (ret)
+	if (ret) {
 		xe_gt_err(gt, "G2H action 0x%04x failed (%pe)\n",
 			  action, ERR_PTR(ret));
+		CT_DEAD(ct, NULL, PROCESS_FAILED);
+	}
 
 	return 0;
 }
@@ -1091,7 +1212,7 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
 	struct xe_device *xe = ct_to_xe(ct);
 	struct xe_gt *gt = ct_to_gt(ct);
 	struct guc_ctb *g2h = &ct->ctbs.g2h;
-	u32 tail, head, len;
+	u32 tail, head, len, desc_status;
 	s32 avail;
 	u32 action;
 	u32 *hxg;
@@ -1110,6 +1231,52 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
 
 	xe_gt_assert(gt, xe_guc_ct_enabled(ct));
 
+	desc_status = desc_read(xe, g2h, status);
+	if (desc_status) {
+		if (desc_status & GUC_CTB_STATUS_DISABLED) {
+			/*
+			 * Potentially valid if a CLIENT_RESET request resulted in
+			 * contexts/engines being reset. But should never happen as
+			 * no contexts should be active when CLIENT_RESET is sent.
+			 */
+			xe_gt_err(gt, "CT read: unexpected G2H after GuC has stopped!\n");
+			desc_status &= ~GUC_CTB_STATUS_DISABLED;
+		}
+
+		if (desc_status) {
+			xe_gt_err(gt, "CT read: non-zero status: %u\n", desc_status);
+			goto corrupted;
+		}
+	}
+
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+{
+	u32 desc_tail = desc_read(xe, g2h, tail);
+	u32 desc_head = desc_read(xe, g2h, head);
+
+	if (g2h->info.head != desc_head) {
+		desc_write(xe, g2h, status, desc_status | GUC_CTB_STATUS_MISMATCH);
+		xe_gt_err(gt, "CT read: head was modified %u != %u\n",
+			  desc_head, g2h->info.head);
+		goto corrupted;
+	}
+
+	if (g2h->info.head > g2h->info.size) {
+		desc_write(xe, g2h, status, desc_status | GUC_CTB_STATUS_OVERFLOW);
+		xe_gt_err(gt, "CT read: head out of range: %u vs %u\n",
+			  g2h->info.head, g2h->info.size);
+		goto corrupted;
+	}
+
+	if (desc_tail >= g2h->info.size) {
+		desc_write(xe, g2h, status, desc_status | GUC_CTB_STATUS_OVERFLOW);
+		xe_gt_err(gt, "CT read: invalid tail offset %u >= %u\n",
+			  desc_tail, g2h->info.size);
+		goto corrupted;
+	}
+}
+#endif
+
 	/* Calculate DW available to read */
 	tail = desc_read(xe, g2h, tail);
 	avail = tail - g2h->info.head;
@@ -1126,9 +1293,7 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
 	if (len > avail) {
 		xe_gt_err(gt, "G2H channel broken on read, avail=%d, len=%d, reset required\n",
 			  avail, len);
-		g2h->info.broken = true;
-
-		return -EPROTO;
+		goto corrupted;
 	}
 
 	head = (g2h->info.head + 1) % g2h->info.size;
@@ -1174,6 +1339,10 @@ static int g2h_read(struct xe_guc_ct *ct, u32 *msg, bool fast_path)
 			     g2h->info.head, tail);
 
 	return len;
+
+corrupted:
+	CT_DEAD(ct, &ct->ctbs.g2h, G2H_READ);
+	return -EPROTO;
 }
 
 static void g2h_fast_path(struct xe_guc_ct *ct, u32 *msg, u32 len)
@@ -1200,9 +1369,11 @@ static void g2h_fast_path(struct xe_guc_ct *ct, u32 *msg, u32 len)
 		xe_gt_warn(gt, "NOT_POSSIBLE");
 	}
 
-	if (ret)
+	if (ret) {
 		xe_gt_err(gt, "G2H action 0x%04x failed (%pe)\n",
 			  action, ERR_PTR(ret));
+		CT_DEAD(ct, NULL, FAST_G2H);
+	}
 }
 
 /**
@@ -1263,7 +1434,6 @@ static int dequeue_one_g2h(struct xe_guc_ct *ct)
 static void g2h_worker_func(struct work_struct *w)
 {
 	struct xe_guc_ct *ct = container_of(w, struct xe_guc_ct, g2h_worker);
-	struct xe_gt *gt = ct_to_gt(ct);
 	bool ongoing;
 	int ret;
 
@@ -1300,9 +1470,8 @@ static void g2h_worker_func(struct work_struct *w)
 		mutex_unlock(&ct->lock);
 
 		if (unlikely(ret == -EPROTO || ret == -EOPNOTSUPP)) {
-			struct drm_printer p = xe_gt_info_printer(gt);
-
-			xe_guc_ct_print(ct, &p, false);
+			xe_gt_err(ct_to_gt(ct), "CT dequeue failed: %d\n", ret);
+			CT_DEAD(ct, NULL, G2H_RECV);
 			kick_reset(ct);
 		}
 	} while (ret == 1);
@@ -1325,7 +1494,7 @@ static void guc_ctb_snapshot_capture(struct xe_device *xe, struct guc_ctb *ctb,
 				       atomic ? GFP_ATOMIC : GFP_KERNEL);
 
 	if (!snapshot->cmds) {
-		drm_err(&xe->drm, "Skipping CTB commands snapshot. Only CTB info will be available.\n");
+		drm_err(&xe->drm, "Skipping CTB commands snapshot. Only CT info will be available.\n");
 		return;
 	}
 
@@ -1483,3 +1652,48 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool atomic)
 	xe_guc_ct_snapshot_print(snapshot, p);
 	xe_guc_ct_snapshot_free(snapshot);
 }
+
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+static void ct_dead_print(struct xe_dead_ct *dead)
+{
+	struct xe_guc_ct *ct = container_of(dead, struct xe_guc_ct, dead);
+	struct xe_gt *gt = ct_to_gt(ct);
+	static int g_count;
+	struct drm_printer ip = xe_gt_info_printer(gt);
+	struct drm_printer lp = drm_line_printer(&ip, "Capture", ++g_count);
+
+	if (!dead->reason) {
+		xe_gt_err(gt, "CTB is dead for no reason!?\n");
+		return;
+	}
+
+	drm_printf(&lp, "CTB is dead - reason=0x%X\n", dead->reason);
+
+	xe_guc_log_snapshot_print(ct_to_xe(ct), dead->snapshot_log, &lp, false);
+	xe_guc_ct_snapshot_print(dead->snapshot_ct, &lp);
+
+	drm_printf(&lp, "Done.\n");
+}
+
+static void ct_dead_worker_func(struct work_struct *w)
+{
+	struct xe_guc_ct *ct = container_of(w, struct xe_guc_ct, dead.worker);
+
+	if (!ct->dead.reported) {
+		ct->dead.reported = true;
+		ct_dead_print(&ct->dead);
+	}
+
+	spin_lock_irq(&ct->dead.lock);
+
+	xe_guc_log_snapshot_free(ct->dead.snapshot_log);
+	xe_guc_ct_snapshot_free(ct->dead.snapshot_ct);
+
+	if (ct->dead.reason & (1 << CT_DEAD_RESET)) {
+		ct->dead.reason = CT_DEAD_ALIVE;
+		ct->dead.reported = false;
+	}
+
+	spin_unlock_irq(&ct->dead.lock);
+}
+#endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/xe/xe_guc_ct_types.h
index fede4c6e93cb..49114868fe5e 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
@@ -86,6 +86,24 @@ enum xe_guc_ct_state {
 	XE_GUC_CT_STATE_ENABLED,
 };
 
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+/** struct xe_dead_ct - Information for debugging a dead CT */
+struct xe_dead_ct {
+	/** @lock: protects memory allocation/free operations, and @reason updates */
+	spinlock_t lock;
+	/** @reason: bit mask of CT_DEAD_* reason codes */
+	int reason;
+	/** @reported: for preventing multiple dumps per error sequence */
+	bool reported;
+	/** @worker: worker thread to get out of interrupt context before dumping */
+	struct work_struct worker;
+	/** @snapshot_ct: copy of CT state and CTB content at point of error */
+	struct xe_guc_ct_snapshot *snapshot_ct;
+	/** @snapshot_log: copy of GuC log at point of error */
+	struct xe_guc_log_snapshot *snapshot_log;
+};
+#endif
+
 /**
  * struct xe_guc_ct - GuC command transport (CT) layer
  *
@@ -126,6 +144,10 @@ struct xe_guc_ct {
 	u32 msg[GUC_CTB_MSG_MAX_LEN];
 	/** @fast_msg: Message buffer */
 	u32 fast_msg[GUC_CTB_MSG_MAX_LEN];
+
+#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
+	/** @dead: information for debugging a dead CT */
+	struct xe_dead_ct dead;
+#endif
 };
 
 #endif
-- 
2.43.2



* [PATCH v4 7/7] drm/xe/guc: Dump entire CTB on errors
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
                   ` (5 preceding siblings ...)
  2024-06-10 22:56 ` [PATCH v4 6/7] drm/xe/guc: Dead CT helper John.C.Harrison
@ 2024-06-10 22:56 ` John.C.Harrison
  2024-06-10 23:01 ` ✗ CI.Patch_applied: failure for drm/xe/guc: Improve quality and robustness of GuC log dumping Patchwork
  7 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-10 22:56 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

The dump of the CT buffers was only showing the unprocessed data, which
is generally not useful for explaining why a hang occurred, because the
hang was most likely caused by the commands that had just been
processed. So save and dump the entire buffer, but in a more compact
dump format. Also, zero fill the buffer on allocation to avoid
confusion over uninitialised data in the dump.
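The compact format described above can be sketched in plain C: instead of printing one "cmd[i]" line per dword, pack several dwords as fixed-width hex into each output line, which cuts the per-line dmesg overhead. This is an illustrative stand-alone model, not the driver's actual `xe_hexdump_blob()` implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format up to 'n' 32-bit words as "%08x " fields packed into one
 * line buffer. Returns the number of characters written. Name and
 * signature are hypothetical, for illustration only.
 */
int hexdump_line(char *out, size_t outlen, const unsigned int *words, int n)
{
	int pos = 0, i;

	for (i = 0; i < n; i++)
		pos += snprintf(out + pos, outlen - pos, "%08x ", words[i]);

	return pos;
}
```

With four words per call, a 1 KiB buffer needs 64 log lines instead of 256, and the fixed 8-digit fields keep the stream trivially parsable by tools.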

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/xe/xe_devcoredump.c  |   2 +-
 drivers/gpu/drm/xe/xe_guc_ct.c       | 101 ++++++++++++---------------
 drivers/gpu/drm/xe/xe_guc_ct.h       |   9 +--
 drivers/gpu/drm/xe/xe_guc_ct_types.h |   2 +
 drivers/gpu/drm/xe/xe_guc_log.c      |   4 +-
 drivers/gpu/drm/xe/xe_guc_log.h      |   2 +
 6 files changed, 55 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index d7f2d19a77c1..eb078265b751 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -114,7 +114,7 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset,
 	xe_device_snapshot_print(xe, &p);
 
 	drm_printf(&p, "\n**** GuC CT ****\n");
-	xe_guc_ct_snapshot_print(coredump->snapshot.ct, &p);
+	xe_guc_ct_snapshot_print(xe, coredump->snapshot.ct, &p, false);
 	xe_guc_exec_queue_snapshot_print(coredump->snapshot.ge, &p);
 
 	drm_printf(&p, "\n**** Job ****\n");
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index fb8469394a33..1dc0d09c0e4d 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -379,6 +379,7 @@ int xe_guc_ct_enable(struct xe_guc_ct *ct)
 
 	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
 
+	xe_map_memset(xe, &ct->bo->vmap, 0, 0, ct->bo->size);
 	guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
 	guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
 
@@ -1480,49 +1481,43 @@ static void g2h_worker_func(struct work_struct *w)
 		xe_pm_runtime_put(ct_to_xe(ct));
 }
 
-static void guc_ctb_snapshot_capture(struct xe_device *xe, struct guc_ctb *ctb,
-				     struct guc_ctb_snapshot *snapshot,
-				     bool atomic)
+static void guc_ctb_snapshot_alloc(struct guc_ctb *ctb, struct guc_ctb_snapshot *snapshot,
+				   bool atomic)
 {
-	u32 head, tail;
+	snapshot->size = ctb->info.size * sizeof(u32);
+	snapshot->cmds = kmalloc(snapshot->size, atomic ? GFP_ATOMIC : GFP_KERNEL);
+}
 
-	xe_map_memcpy_from(xe, &snapshot->desc, &ctb->desc, 0,
-			   sizeof(struct guc_ct_buffer_desc));
-	memcpy(&snapshot->info, &ctb->info, sizeof(struct guc_ctb_info));
+struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_alloc(struct xe_guc_ct *ct, bool atomic)
+{
+	struct xe_guc_ct_snapshot *snapshot;
 
-	snapshot->cmds = kmalloc_array(ctb->info.size, sizeof(u32),
-				       atomic ? GFP_ATOMIC : GFP_KERNEL);
+	snapshot = kzalloc(sizeof(*snapshot), atomic ? GFP_ATOMIC : GFP_KERNEL);
+	if (!snapshot)
+		return NULL;
 
-	if (!snapshot->cmds) {
-		drm_err(&xe->drm, "Skipping CTB commands snapshot. Only CT info will be available.\n");
-		return;
-	}
+	/* Don't give up if the CTB storage fails to allocate */
+	guc_ctb_snapshot_alloc(&ct->ctbs.h2g, &snapshot->h2g, atomic);
+	guc_ctb_snapshot_alloc(&ct->ctbs.g2h, &snapshot->g2h, atomic);
 
-	head = snapshot->desc.head;
-	tail = snapshot->desc.tail;
-
-	if (head != tail) {
-		struct iosys_map map =
-			IOSYS_MAP_INIT_OFFSET(&ctb->cmds, head * sizeof(u32));
-
-		while (head != tail) {
-			snapshot->cmds[head] = xe_map_rd(xe, &map, 0, u32);
-			++head;
-			if (head == ctb->info.size) {
-				head = 0;
-				map = ctb->cmds;
-			} else {
-				iosys_map_incr(&map, sizeof(u32));
-			}
-		}
-	}
+	return snapshot;
 }
 
-static void guc_ctb_snapshot_print(struct guc_ctb_snapshot *snapshot,
-				   struct drm_printer *p)
+static void guc_ctb_snapshot_capture(struct xe_device *xe, struct guc_ctb *ctb,
+				     struct guc_ctb_snapshot *snapshot)
 {
-	u32 head, tail;
+	xe_map_memcpy_from(xe, &snapshot->desc, &ctb->desc, 0,
+			   sizeof(struct guc_ct_buffer_desc));
+	memcpy(&snapshot->info, &ctb->info, sizeof(struct guc_ctb_info));
+
+	if (snapshot->cmds)
+		xe_map_memcpy_from(xe, snapshot->cmds, &ctb->cmds, 0, snapshot->size);
+}
 
+static void guc_ctb_snapshot_print(struct xe_device *xe,
+				   struct guc_ctb_snapshot *snapshot,
+				   struct drm_printer *p, bool atomic)
+{
 	drm_printf(p, "\tsize: %d\n", snapshot->info.size);
 	drm_printf(p, "\tresv_space: %d\n", snapshot->info.resv_space);
 	drm_printf(p, "\thead: %d\n", snapshot->info.head);
@@ -1533,19 +1528,13 @@ static void guc_ctb_snapshot_print(struct guc_ctb_snapshot *snapshot,
 	drm_printf(p, "\ttail (memory): %d\n", snapshot->desc.tail);
 	drm_printf(p, "\tstatus (memory): 0x%x\n", snapshot->desc.status);
 
-	if (!snapshot->cmds)
+	if (!snapshot->cmds) {
+		drm_printf(p, "CT buffer allocation missing!\n");
 		return;
-
-	head = snapshot->desc.head;
-	tail = snapshot->desc.tail;
-
-	while (head != tail) {
-		drm_printf(p, "\tcmd[%d]: 0x%08x\n", head,
-			   snapshot->cmds[head]);
-		++head;
-		if (head == snapshot->info.size)
-			head = 0;
 	}
+
+	drm_printf(p, "CT buffer:\n");
+	xe_hexdump_blob(xe, snapshot->cmds, snapshot->size, p, atomic);
 }
 
 static void guc_ctb_snapshot_free(struct guc_ctb_snapshot *snapshot)
@@ -1571,9 +1560,7 @@ struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct,
 	struct xe_device *xe = ct_to_xe(ct);
 	struct xe_guc_ct_snapshot *snapshot;
 
-	snapshot = kzalloc(sizeof(*snapshot),
-			   atomic ? GFP_ATOMIC : GFP_KERNEL);
-
+	snapshot = xe_guc_ct_snapshot_alloc(ct, atomic);
 	if (!snapshot) {
 		drm_err(&xe->drm, "Skipping CTB snapshot entirely.\n");
 		return NULL;
@@ -1582,10 +1569,8 @@ struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct,
 	if (xe_guc_ct_enabled(ct) || ct->state == XE_GUC_CT_STATE_STOPPED) {
 		snapshot->ct_enabled = true;
 		snapshot->g2h_outstanding = READ_ONCE(ct->g2h_outstanding);
-		guc_ctb_snapshot_capture(xe, &ct->ctbs.h2g,
-					 &snapshot->h2g, atomic);
-		guc_ctb_snapshot_capture(xe, &ct->ctbs.g2h,
-					 &snapshot->g2h, atomic);
+		guc_ctb_snapshot_capture(xe, &ct->ctbs.h2g, &snapshot->h2g);
+		guc_ctb_snapshot_capture(xe, &ct->ctbs.g2h, &snapshot->g2h);
 	}
 
 	return snapshot;
@@ -1598,18 +1583,18 @@ struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct,
  *
  * This function prints out a given GuC CT snapshot object.
  */
-void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
-			      struct drm_printer *p)
+void xe_guc_ct_snapshot_print(struct xe_device *xe, struct xe_guc_ct_snapshot *snapshot,
+			      struct drm_printer *p, bool atomic)
 {
 	if (!snapshot)
 		return;
 
 	if (snapshot->ct_enabled) {
 		drm_puts(p, "H2G CTB (all sizes in DW):\n");
-		guc_ctb_snapshot_print(&snapshot->h2g, p);
+		guc_ctb_snapshot_print(xe, &snapshot->h2g, p, atomic);
 
 		drm_puts(p, "G2H CTB (all sizes in DW):\n");
-		guc_ctb_snapshot_print(&snapshot->g2h, p);
+		guc_ctb_snapshot_print(xe, &snapshot->g2h, p, atomic);
 
 		drm_printf(p, "\tg2h outstanding: %d\n",
 			   snapshot->g2h_outstanding);
@@ -1649,7 +1634,7 @@ void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool atomic)
 	struct xe_guc_ct_snapshot *snapshot;
 
 	snapshot = xe_guc_ct_snapshot_capture(ct, atomic);
-	xe_guc_ct_snapshot_print(snapshot, p);
+	xe_guc_ct_snapshot_print(ct_to_xe(ct), snapshot, p, atomic);
 	xe_guc_ct_snapshot_free(snapshot);
 }
 
@@ -1670,7 +1655,7 @@ static void ct_dead_print(struct xe_dead_ct *dead)
 	drm_printf(&lp, "CTB is dead - reason=0x%X\n", dead->reason);
 
 	xe_guc_log_snapshot_print(ct_to_xe(ct), dead->snapshot_log, &lp, false);
-	xe_guc_ct_snapshot_print(dead->snapshot_ct, &lp);
+	xe_guc_ct_snapshot_print(ct_to_xe(ct), dead->snapshot_ct, &lp, false);
 
 	drm_printf(&lp, "Done.\n");
 }
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.h b/drivers/gpu/drm/xe/xe_guc_ct.h
index 105bb8e99a8d..782e821dae1c 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct.h
@@ -9,6 +9,7 @@
 #include "xe_guc_ct_types.h"
 
 struct drm_printer;
+struct xe_device;
 
 int xe_guc_ct_init(struct xe_guc_ct *ct);
 int xe_guc_ct_enable(struct xe_guc_ct *ct);
@@ -16,10 +17,10 @@ void xe_guc_ct_disable(struct xe_guc_ct *ct);
 void xe_guc_ct_stop(struct xe_guc_ct *ct);
 void xe_guc_ct_fast_path(struct xe_guc_ct *ct);
 
-struct xe_guc_ct_snapshot *
-xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct, bool atomic);
-void xe_guc_ct_snapshot_print(struct xe_guc_ct_snapshot *snapshot,
-			      struct drm_printer *p);
+struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_alloc(struct xe_guc_ct *ct, bool atomic);
+struct xe_guc_ct_snapshot *xe_guc_ct_snapshot_capture(struct xe_guc_ct *ct, bool atomic);
+void xe_guc_ct_snapshot_print(struct xe_device *xe, struct xe_guc_ct_snapshot *snapshot,
+			      struct drm_printer *p, bool atomic);
 void xe_guc_ct_snapshot_free(struct xe_guc_ct_snapshot *snapshot);
 void xe_guc_ct_print(struct xe_guc_ct *ct, struct drm_printer *p, bool atomic);
 
diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/xe/xe_guc_ct_types.h
index 49114868fe5e..4914cc2ae07b 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
+++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
@@ -52,6 +52,8 @@ struct guc_ctb {
 struct guc_ctb_snapshot {
 	/** @desc: snapshot of the CTB descriptor */
 	struct guc_ct_buffer_desc desc;
+	/** @size: size in bytes of the @cmds snapshot buffer */
+	size_t size;
 	/** @cmds: snapshot of the CTB commands */
 	u32 *cmds;
 	/** @info: snapshot of the CTB info */
diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index 9714ee6f99e0..11453de08cc6 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -64,8 +64,8 @@ static size_t guc_log_size(void)
 #define LINES_PER_READ		4
 #define WORDS_PER_READ		(WORDS_PER_DUMP * DUMPS_PER_LINE * LINES_PER_READ)
 
-static void xe_hexdump_blob(struct xe_device *xe, const void *blob, size_t size,
-			    struct drm_printer *p, bool atomic)
+void xe_hexdump_blob(struct xe_device *xe, const void *blob, size_t size,
+		     struct drm_printer *p, bool atomic)
 {
 	char line_buff[DUMPS_PER_LINE * WORDS_PER_DUMP * 9 + 1];
 	const u32 *blob32 = (const u32 *)blob;
diff --git a/drivers/gpu/drm/xe/xe_guc_log.h b/drivers/gpu/drm/xe/xe_guc_log.h
index 29154ab507ff..f204b9c85e65 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.h
+++ b/drivers/gpu/drm/xe/xe_guc_log.h
@@ -45,6 +45,8 @@ struct xe_guc_log_snapshot *xe_guc_log_snapshot_capture(struct xe_guc_log *log,
 void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot *snapshot,
 			       struct drm_printer *p, bool atomic);
 void xe_guc_log_snapshot_free(struct xe_guc_log_snapshot *snapshot);
+void xe_hexdump_blob(struct xe_device *xe, const void *blob, size_t size,
+		     struct drm_printer *p, bool atomic);
 
 static inline u32
 xe_guc_log_get_level(struct xe_guc_log *log)
-- 
2.43.2



* ✗ CI.Patch_applied: failure for drm/xe/guc: Improve quality and robustness of GuC log dumping
  2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
                   ` (6 preceding siblings ...)
  2024-06-10 22:56 ` [PATCH v4 7/7] drm/xe/guc: Dump entire CTB on errors John.C.Harrison
@ 2024-06-10 23:01 ` Patchwork
  7 siblings, 0 replies; 10+ messages in thread
From: Patchwork @ 2024-06-10 23:01 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-xe

== Series Details ==

Series: drm/xe/guc: Improve quality and robustness of GuC log dumping
URL   : https://patchwork.freedesktop.org/series/134695/
State : failure

== Summary ==

=== Applying kernel patches on branch 'drm-tip' with base: ===
Base commit: bf9d891af55d drm-tip: 2024y-06m-10d-19h-36m-52s UTC integration manifest
=== git am output follows ===
error: patch failed: include/drm/drm_print.h:176
error: include/drm/drm_print.h: patch does not apply
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Applying: drm/xe/guc: Remove spurious line feed in debug print
Applying: drm/xe/guc: Copy GuC log prior to dumping
Applying: drm/xe/guc: Use a two stage dump for GuC logs and add more info
Applying: drm/print: Introduce drm_line_printer
Patch failed at 0004 drm/print: Introduce drm_line_printer
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".




* [PATCH v4 5/7] drm/xe/guc: Add a helper function for dumping GuC log to dmesg
  2024-06-11  1:20 [PATCH v4 0/7] " John.C.Harrison
@ 2024-06-11  1:20 ` John.C.Harrison
  0 siblings, 0 replies; 10+ messages in thread
From: John.C.Harrison @ 2024-06-11  1:20 UTC (permalink / raw)
  To: Intel-Xe; +Cc: John Harrison

From: John Harrison <John.C.Harrison@Intel.com>

Create a helper function that can be used to dump the GuC log to dmesg
in a manner that is reliable for extraction and decode. The intention
is that calls to this can be added by developers when debugging
specific issues that require a GuC log but do not allow easy capture
of the log, e.g. failures in selftests and failures that lead to
kernel hangs.
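The "reliable for extraction and decode" part comes from tagging every emitted line so a dump can be recovered from an interleaved, possibly rate-limited dmesg stream. A minimal stand-alone sketch of that line-tagging idea (the tag, per-dump id, and function name here are illustrative, not the real drm_line_printer API):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Prefix one payload line with a tag and a per-dump id, e.g.
 * "Capture 3: <payload>". Tools can then grep for a single dump
 * even when several dumps are interleaved in the same log stream.
 */
int format_dump_line(char *out, size_t outlen, const char *tag,
		     int dump_id, const char *payload)
{
	return snprintf(out, outlen, "%s %d: %s", tag, dump_id, payload);
}
```

Each call site bumps its own dump counter, so two concurrent dumps produce lines with different ids and can be separated after the fact.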

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/xe/xe_guc_log.c | 14 ++++++++++++++
 drivers/gpu/drm/xe/xe_guc_log.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_guc_log.c b/drivers/gpu/drm/xe/xe_guc_log.c
index a35d3cc0c369..9714ee6f99e0 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.c
+++ b/drivers/gpu/drm/xe/xe_guc_log.c
@@ -245,6 +245,20 @@ void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot
 	}
 }
 
+void xe_guc_log_print_dmesg(struct xe_guc_log *log, bool atomic)
+{
+	struct xe_gt *gt = log_to_gt(log);
+	static int g_count;
+	struct drm_printer ip = xe_gt_info_printer(gt);
+	struct drm_printer lp = drm_line_printer(&ip, "Capture", ++g_count);
+
+	drm_printf(&lp, "Dumping GuC log for %ps...\n", __builtin_return_address(0));
+
+	xe_guc_log_print(log, &lp, atomic);
+
+	drm_printf(&lp, "Done.\n");
+}
+
 void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic)
 {
 	struct xe_guc_log_snapshot *snapshot;
diff --git a/drivers/gpu/drm/xe/xe_guc_log.h b/drivers/gpu/drm/xe/xe_guc_log.h
index 6e4d0b369c19..29154ab507ff 100644
--- a/drivers/gpu/drm/xe/xe_guc_log.h
+++ b/drivers/gpu/drm/xe/xe_guc_log.h
@@ -39,6 +39,7 @@ struct xe_device;
 
 int xe_guc_log_init(struct xe_guc_log *log);
 void xe_guc_log_print(struct xe_guc_log *log, struct drm_printer *p, bool atomic);
+void xe_guc_log_print_dmesg(struct xe_guc_log *log, bool atomic);
 struct xe_guc_log_snapshot *xe_guc_log_snapshot_alloc(struct xe_guc_log *log, bool atomic);
 struct xe_guc_log_snapshot *xe_guc_log_snapshot_capture(struct xe_guc_log *log, bool atomic);
 void xe_guc_log_snapshot_print(struct xe_device *xe, struct xe_guc_log_snapshot *snapshot,
-- 
2.43.2



end of thread, other threads:[~2024-06-11  1:20 UTC | newest]

Thread overview: 10+ messages
-- links below jump to the message on this page --
2024-06-10 22:56 [PATCH v4 0/7] drm/xe/guc: Improve quality and robustness of GuC log dumping John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 1/7] drm/xe/guc: Remove spurious line feed in debug print John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 2/7] drm/xe/guc: Copy GuC log prior to dumping John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 3/7] drm/xe/guc: Use a two stage dump for GuC logs and add more info John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 4/7] drm/print: Introduce drm_line_printer John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 5/7] drm/xe/guc: Add a helper function for dumping GuC log to dmesg John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 6/7] drm/xe/guc: Dead CT helper John.C.Harrison
2024-06-10 22:56 ` [PATCH v4 7/7] drm/xe/guc: Dump entire CTB on errors John.C.Harrison
2024-06-10 23:01 ` ✗ CI.Patch_applied: failure for drm/xe/guc: Improve quality and robustness of GuC log dumping Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2024-06-11  1:20 [PATCH v4 0/7] " John.C.Harrison
2024-06-11  1:20 ` [PATCH v4 5/7] drm/xe/guc: Add a helper function for dumping GuC log to dmesg John.C.Harrison
