* [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging
@ 2023-04-18 18:17 John.C.Harrison
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log John.C.Harrison
` (6 more replies)
0 siblings, 7 replies; 13+ messages in thread
From: John.C.Harrison @ 2023-04-18 18:17 UTC (permalink / raw)
To: Intel-GFX; +Cc: DRI-Devel
From: John Harrison <John.C.Harrison@Intel.com>
Sometimes, the only effective way to debug an issue is to dump all the
interesting information at the point of failure. So add support for
doing that.
v2: Extra CONFIG wrapping (review feedback from Rodrigo)
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
John Harrison (2):
drm/i915: Dump error capture to kernel log
drm/i915/guc: Dump error capture to dmesg on CTB error
drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 53 +++++++++
drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h | 6 +
drivers/gpu/drm/i915/i915_gpu_error.c | 132 ++++++++++++++++++++++
drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
4 files changed, 201 insertions(+)
--
2.39.1
* [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
@ 2023-04-18 18:17 ` John.C.Harrison
2023-05-16 19:17 ` Belgaumkar, Vinay
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error John.C.Harrison
` (5 subsequent siblings)
6 siblings, 1 reply; 13+ messages in thread
From: John.C.Harrison @ 2023-04-18 18:17 UTC (permalink / raw)
To: Intel-GFX; +Cc: DRI-Devel
From: John Harrison <John.C.Harrison@Intel.com>
This is useful for getting debug information out in certain
situations, such as failing kernel selftests and CI runs that don't
log error captures. It is especially useful for things like retrieving
GuC logs as GuC operation can't be tracked by adding printk or ftrace
entries.
v2: Add CONFIG_DRM_I915_DEBUG_GEM wrapper (review feedback by Rodrigo).
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
drivers/gpu/drm/i915/i915_gpu_error.c | 132 ++++++++++++++++++++++++++
drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
2 files changed, 142 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index f020c0086fbcd..03d62c250c465 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -2219,3 +2219,135 @@ void i915_disable_error_state(struct drm_i915_private *i915, int err)
i915->gpu_error.first_error = ERR_PTR(err);
spin_unlock_irq(&i915->gpu_error.lock);
}
+
+#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
+void intel_klog_error_capture(struct intel_gt *gt,
+ intel_engine_mask_t engine_mask)
+{
+ static int g_count;
+ struct drm_i915_private *i915 = gt->i915;
+ struct i915_gpu_coredump *error;
+ intel_wakeref_t wakeref;
+ size_t buf_size = PAGE_SIZE * 128;
+ size_t pos_err;
+ char *buf, *ptr, *next;
+ int l_count = g_count++;
+ int line = 0;
+
+ /* Can't allocate memory during a reset */
+ if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
+ drm_err(&gt->i915->drm, "[Capture/%d.%d] Inside GT reset, skipping error capture :(\n",
+ l_count, line++);
+ return;
+ }
+
+ error = READ_ONCE(i915->gpu_error.first_error);
+ if (error) {
+ drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error capture first...\n",
+ l_count, line++);
+ i915_reset_error_state(i915);
+ }
+
+ with_intel_runtime_pm(&i915->runtime_pm, wakeref)
+ error = i915_gpu_coredump(gt, engine_mask, CORE_DUMP_FLAG_NONE);
+
+ if (IS_ERR(error)) {
+ drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error capture: %ld!\n",
+ l_count, line++, PTR_ERR(error));
+ return;
+ }
+
+ buf = kvmalloc(buf_size, GFP_KERNEL);
+ if (!buf) {
+ drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer for error capture!\n",
+ l_count, line++);
+ i915_gpu_coredump_put(error);
+ return;
+ }
+
+ drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture for %ps...\n",
+ l_count, line++, __builtin_return_address(0));
+
+ /* Largest string length safe to print via dmesg */
+# define MAX_CHUNK 800
+
+ pos_err = 0;
+ while (1) {
+ ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf, pos_err, buf_size - 1);
+
+ if (got <= 0)
+ break;
+
+ buf[got] = 0;
+ pos_err += got;
+
+ ptr = buf;
+ while (got > 0) {
+ size_t count;
+ char tag[2];
+
+ next = strnchr(ptr, got, '\n');
+ if (next) {
+ count = next - ptr;
+ *next = 0;
+ tag[0] = '>';
+ tag[1] = '<';
+ } else {
+ count = got;
+ tag[0] = '}';
+ tag[1] = '{';
+ }
+
+ if (count > MAX_CHUNK) {
+ size_t pos;
+ char *ptr2 = ptr;
+
+ for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) {
+ char chr = ptr[pos];
+
+ ptr[pos] = 0;
+ drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n",
+ l_count, line++, ptr2);
+ ptr[pos] = chr;
+ ptr2 = ptr + pos;
+
+ /*
+ * If spewing large amounts of data via a serial console,
+ * this can be a very slow process. So be friendly and try
+ * not to cause 'softlockup on CPU' problems.
+ */
+ cond_resched();
+ }
+
+ if (ptr2 < (ptr + count))
+ drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
+ l_count, line++, tag[0], ptr2, tag[1]);
+ else if (tag[0] == '>')
+ drm_info(&i915->drm, "[Capture/%d.%d] ><\n",
+ l_count, line++);
+ } else {
+ drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
+ l_count, line++, tag[0], ptr, tag[1]);
+ }
+
+ ptr = next;
+ got -= count;
+ if (next) {
+ ptr++;
+ got--;
+ }
+
+ /* As above. */
+ cond_resched();
+ }
+
+ if (got)
+ drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes remaining!\n",
+ l_count, line++, got);
+ }
+
+ kvfree(buf);
+
+ drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n", l_count, line++, pos_err);
+}
+#endif
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index a91932cc65317..a78c061ce26fb 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -258,6 +258,16 @@ static inline u32 i915_reset_engine_count(struct i915_gpu_error *error,
#define CORE_DUMP_FLAG_NONE 0x0
#define CORE_DUMP_FLAG_IS_GUC_CAPTURE BIT(0)
+#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR) && IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
+void intel_klog_error_capture(struct intel_gt *gt,
+ intel_engine_mask_t engine_mask);
+#else
+static inline void intel_klog_error_capture(struct intel_gt *gt,
+ intel_engine_mask_t engine_mask)
+{
+}
+#endif
+
#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
__printf(2, 3)
--
2.39.1
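For readers following the splitting loop in intel_klog_error_capture(): each capture line is cut into pieces of at most MAX_CHUNK characters, continued pieces are wrapped in }...{ and the last piece of a newline-terminated line is wrapped in >...<. Below is a minimal user-space sketch of that idea only. It is not the driver code: split_for_log() and its parameters are hypothetical names, and fprintf() stands in for drm_info().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Illustrative sketch; split_for_log() is a hypothetical helper and is
 * not part of the i915 driver.
 *
 * Prints one logical line as pieces of at most max_chunk characters.
 * Pieces that are continued are wrapped in "}...{". The last piece is
 * wrapped in ">...<" when the source line ended in a newline, and in
 * "}...{" when the buffer ran out first, mirroring the tags in the patch.
 * Returns the number of pieces printed.
 */
static int split_for_log(FILE *out, const char *line, size_t len,
			 size_t max_chunk, int newline_terminated)
{
	const char open = newline_terminated ? '>' : '}';
	const char close = newline_terminated ? '<' : '{';
	size_t pos = 0;
	int pieces = 0;

	assert(max_chunk > 0);

	/* Emit full-size pieces while more than max_chunk characters remain. */
	while (len - pos > max_chunk) {
		fprintf(out, "}%.*s{\n", (int)max_chunk, line + pos);
		pos += max_chunk;
		pieces++;
	}

	/* Emit the remainder (possibly empty) with the end-of-line tags. */
	fprintf(out, "%c%.*s%c\n", open, (int)(len - pos), line + pos, close);
	return pieces + 1;
}
```

With tags like these, a post-processing script should be able to rebuild the original text by stripping the tag characters, joining any piece that ends in { with the piece that follows, and ending a line at each <.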
* [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log John.C.Harrison
@ 2023-04-18 18:17 ` John.C.Harrison
2023-05-16 19:17 ` Belgaumkar, Vinay
2023-04-18 18:58 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Add support for dumping error captures via kernel logging (rev2) Patchwork
` (4 subsequent siblings)
6 siblings, 1 reply; 13+ messages in thread
From: John.C.Harrison @ 2023-04-18 18:17 UTC (permalink / raw)
To: Intel-GFX; +Cc: DRI-Devel
From: John Harrison <John.C.Harrison@Intel.com>
In the past, there have been sporadic CTB failures which proved hard
to reproduce manually. The most effective solution was to dump the GuC
log at the point of failure and let the CI system do the repro. It is
preferable not to dump the GuC log via dmesg for all issues as it is
not always necessary and is not helpful for end users. But rather than
trying to re-invent the code to do this each time it is wanted, commit
the code but for DEBUG_GUC builds only.
v2: Use IS_ENABLED for testing config options.
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 53 +++++++++++++++++++++++
drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h | 6 +++
2 files changed, 59 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
index 1803a633ed648..dc5cd712f1ff5 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
@@ -13,6 +13,30 @@
#include "intel_guc_ct.h"
#include "intel_guc_print.h"
+#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
+enum {
+ CT_DEAD_ALIVE = 0,
+ CT_DEAD_SETUP,
+ CT_DEAD_WRITE,
+ CT_DEAD_DEADLOCK,
+ CT_DEAD_H2G_HAS_ROOM,
+ CT_DEAD_READ,
+ CT_DEAD_PROCESS_FAILED,
+};
+
+static void ct_dead_ct_worker_func(struct work_struct *w);
+
+#define CT_DEAD(ct, reason) \
+ do { \
+ if (!(ct)->dead_ct_reported) { \
+ (ct)->dead_ct_reason |= 1 << CT_DEAD_##reason; \
+ queue_work(system_unbound_wq, &(ct)->dead_ct_worker); \
+ } \
+ } while (0)
+#else
+#define CT_DEAD(ct, reason) do { } while (0)
+#endif
+
static inline struct intel_guc *ct_to_guc(struct intel_guc_ct *ct)
{
return container_of(ct, struct intel_guc, ct);
@@ -93,6 +117,9 @@ void intel_guc_ct_init_early(struct intel_guc_ct *ct)
spin_lock_init(&ct->requests.lock);
INIT_LIST_HEAD(&ct->requests.pending);
INIT_LIST_HEAD(&ct->requests.incoming);
+#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
+ INIT_WORK(&ct->dead_ct_worker, ct_dead_ct_worker_func);
+#endif
INIT_WORK(&ct->requests.worker, ct_incoming_request_worker_func);
tasklet_setup(&ct->receive_tasklet, ct_receive_tasklet_func);
init_waitqueue_head(&ct->wq);
@@ -319,11 +346,16 @@ int intel_guc_ct_enable(struct intel_guc_ct *ct)
ct->enabled = true;
ct->stall_time = KTIME_MAX;
+#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
+ ct->dead_ct_reported = false;
+ ct->dead_ct_reason = CT_DEAD_ALIVE;
+#endif
return 0;
err_out:
CT_PROBE_ERROR(ct, "Failed to enable CTB (%pe)\n", ERR_PTR(err));
+ CT_DEAD(ct, SETUP);
return err;
}
@@ -434,6 +466,7 @@ static int ct_write(struct intel_guc_ct *ct,
corrupted:
CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
desc->head, desc->tail, desc->status);
+ CT_DEAD(ct, WRITE);
ctb->broken = true;
return -EPIPE;
}
@@ -504,6 +537,7 @@ static inline bool ct_deadlocked(struct intel_guc_ct *ct)
CT_ERROR(ct, "Head: %u\n (Dwords)", ct->ctbs.recv.desc->head);
CT_ERROR(ct, "Tail: %u\n (Dwords)", ct->ctbs.recv.desc->tail);
+ CT_DEAD(ct, DEADLOCK);
ct->ctbs.send.broken = true;
}
@@ -552,6 +586,7 @@ static inline bool h2g_has_room(struct intel_guc_ct *ct, u32 len_dw)
head, ctb->size);
desc->status |= GUC_CTB_STATUS_OVERFLOW;
ctb->broken = true;
+ CT_DEAD(ct, H2G_HAS_ROOM);
return false;
}
@@ -908,6 +943,7 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg)
CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
desc->head, desc->tail, desc->status);
ctb->broken = true;
+ CT_DEAD(ct, READ);
return -EPIPE;
}
@@ -1057,6 +1093,7 @@ static bool ct_process_incoming_requests(struct intel_guc_ct *ct)
if (unlikely(err)) {
CT_ERROR(ct, "Failed to process CT message (%pe) %*ph\n",
ERR_PTR(err), 4 * request->size, request->msg);
+ CT_DEAD(ct, PROCESS_FAILED);
ct_free_msg(request);
}
@@ -1233,3 +1270,19 @@ void intel_guc_ct_print_info(struct intel_guc_ct *ct,
drm_printf(p, "Tail: %u\n",
ct->ctbs.recv.desc->tail);
}
+
+#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
+static void ct_dead_ct_worker_func(struct work_struct *w)
+{
+ struct intel_guc_ct *ct = container_of(w, struct intel_guc_ct, dead_ct_worker);
+ struct intel_guc *guc = ct_to_guc(ct);
+
+ if (ct->dead_ct_reported)
+ return;
+
+ ct->dead_ct_reported = true;
+
+ guc_info(guc, "CTB is dead - reason=0x%X\n", ct->dead_ct_reason);
+ intel_klog_error_capture(guc_to_gt(guc), (intel_engine_mask_t)~0U);
+}
+#endif
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
index f709a19c7e214..818415b64f4d1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
@@ -85,6 +85,12 @@ struct intel_guc_ct {
/** @stall_time: time of first time a CTB submission is stalled */
ktime_t stall_time;
+
+#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
+ int dead_ct_reason;
+ bool dead_ct_reported;
+ struct work_struct dead_ct_worker;
+#endif
};
void intel_guc_ct_init_early(struct intel_guc_ct *ct);
--
2.39.1
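Since CT_DEAD() ORs 1 << CT_DEAD_<reason> into dead_ct_reason, the value printed by "CTB is dead - reason=0x%X" is a bitmask and may carry more than one cause. The following is a small user-space sketch for decoding such a value when reading dmesg. The enumerator order is copied from the patch; ct_dead_decode() and the name table are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Enumerator order copied from the patch; everything else is illustrative. */
enum {
	CT_DEAD_ALIVE = 0,
	CT_DEAD_SETUP,
	CT_DEAD_WRITE,
	CT_DEAD_DEADLOCK,
	CT_DEAD_H2G_HAS_ROOM,
	CT_DEAD_READ,
	CT_DEAD_PROCESS_FAILED,
};

static const char * const ct_dead_names[] = {
	[CT_DEAD_ALIVE]          = "ALIVE",
	[CT_DEAD_SETUP]          = "SETUP",
	[CT_DEAD_WRITE]          = "WRITE",
	[CT_DEAD_DEADLOCK]       = "DEADLOCK",
	[CT_DEAD_H2G_HAS_ROOM]   = "H2G_HAS_ROOM",
	[CT_DEAD_READ]           = "READ",
	[CT_DEAD_PROCESS_FAILED] = "PROCESS_FAILED",
};

/*
 * Hypothetical helper: writes a '|'-separated list of the reasons whose
 * bits are set in mask into buf and returns how many names were written.
 * Decoding stops early if buf is too small for the next name.
 */
static int ct_dead_decode(unsigned int mask, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (len)
		buf[0] = '\0';

	/* Bit 0 (ALIVE) is never set by CT_DEAD() in the patch, so skip it. */
	for (int bit = CT_DEAD_SETUP; bit <= CT_DEAD_PROCESS_FAILED; bit++) {
		if (!(mask & (1u << bit)))
			continue;

		int n = snprintf(buf + used, len - used, "%s%s",
				 count ? "|" : "", ct_dead_names[bit]);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output would be truncated */

		used += (size_t)n;
		count++;
	}

	return count;
}
```

For example, under this encoding a logged reason of 0x24 has bits 2 and 5 set, which the sketch decodes as WRITE|READ.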
* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Add support for dumping error captures via kernel logging (rev2)
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log John.C.Harrison
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error John.C.Harrison
@ 2023-04-18 18:58 ` Patchwork
2023-04-18 18:58 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
` (3 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2023-04-18 18:58 UTC (permalink / raw)
To: john.c.harrison; +Cc: intel-gfx
== Series Details ==
Series: Add support for dumping error captures via kernel logging (rev2)
URL : https://patchwork.freedesktop.org/series/116280/
State : warning
== Summary ==
Error: dim checkpatch failed
0fdd3a06a2b6 drm/i915: Dump error capture to kernel log
-:64: WARNING:OOM_MESSAGE: Possible unnecessary 'out of memory' message
#64: FILE: drivers/gpu/drm/i915/i915_gpu_error.c:2262:
+ if (!buf) {
+ drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer for error capture!\n",
total: 0 errors, 1 warnings, 0 checks, 151 lines checked
adfdeee9f311 drm/i915/guc: Dump error capture to dmesg on CTB error
-:39: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ct' - possible side-effects?
#39: FILE: drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c:29:
+#define CT_DEAD(ct, reason) \
+ do { \
+ if (!(ct)->dead_ct_reported) { \
+ (ct)->dead_ct_reason |= 1 << CT_DEAD_##reason; \
+ queue_work(system_unbound_wq, &(ct)->dead_ct_worker); \
+ } \
+ } while (0)
total: 0 errors, 0 warnings, 1 checks, 121 lines checked
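On the MACRO_ARG_REUSE check: every CT_DEAD() call site in this patch passes a plain ct variable, so the repeated expansion looks harmless here. For reference, the usual way to make such a macro safe against arguments with side effects is to evaluate the argument once into a local. The sketch below shows that pattern in user-space C. struct fake_ct, CT_DEAD_ONCE() and get_ct() are hypothetical stand-ins, and a counter replaces queue_work().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of struct intel_guc_ct. */
struct fake_ct {
	bool reported;
	int reason;
	int queued;	/* counts what would be queue_work() calls */
};

/* Evaluates (ct) exactly once by copying it into a local pointer. */
#define CT_DEAD_ONCE(ct, bit) \
	do { \
		struct fake_ct *ct__ = (ct); \
		if (!ct__->reported) { \
			ct__->reason |= 1 << (bit); \
			ct__->queued++; \
		} \
	} while (0)

static int evals;
static struct fake_ct the_ct;

/* An argument expression with a side effect: it counts its evaluations. */
static struct fake_ct *get_ct(void)
{
	evals++;
	return &the_ct;
}

/* Returns how many times the macro argument was evaluated for one use. */
static int evaluations_for_one_use(void)
{
	evals = 0;
	CT_DEAD_ONCE(get_ct(), 2);
	return evals;
}
```

Kernel code would more likely use typeof() for the local so the macro stays type-generic; the fixed struct type here just keeps the sketch in standard C.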
* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Add support for dumping error captures via kernel logging (rev2)
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
` (2 preceding siblings ...)
2023-04-18 18:58 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Add support for dumping error captures via kernel logging (rev2) Patchwork
@ 2023-04-18 18:58 ` Patchwork
2023-04-18 19:08 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
` (2 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2023-04-18 18:58 UTC (permalink / raw)
To: john.c.harrison; +Cc: intel-gfx
== Series Details ==
Series: Add support for dumping error captures via kernel logging (rev2)
URL : https://patchwork.freedesktop.org/series/116280/
State : warning
== Summary ==
Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
* [Intel-gfx] ✓ Fi.CI.BAT: success for Add support for dumping error captures via kernel logging (rev2)
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
` (3 preceding siblings ...)
2023-04-18 18:58 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
@ 2023-04-18 19:08 ` Patchwork
2023-04-18 23:46 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
2023-05-16 18:54 ` [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging Belgaumkar, Vinay
6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2023-04-18 19:08 UTC (permalink / raw)
To: john.c.harrison; +Cc: intel-gfx
== Series Details ==
Series: Add support for dumping error captures via kernel logging (rev2)
URL : https://patchwork.freedesktop.org/series/116280/
State : success
== Summary ==
CI Bug Log - changes from CI_DRM_13026 -> Patchwork_116280v2
====================================================
Summary
-------
**SUCCESS**
No regressions found.
External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/index.html
Participating hosts (38 -> 36)
------------------------------
Missing (2): bat-rpls-2 fi-snb-2520m
Known issues
------------
Here are the changes found in Patchwork_116280v2 that come from known issues:
### IGT changes ###
#### Issues hit ####
* igt@i915_selftest@live@reset:
- bat-rpls-1: [PASS][1] -> [ABORT][2] ([i915#4983] / [i915#7981])
[1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/bat-rpls-1/igt@i915_selftest@live@reset.html
[2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/bat-rpls-1/igt@i915_selftest@live@reset.html
* igt@i915_selftest@live@workarounds:
- bat-adlp-6: [PASS][3] -> [INCOMPLETE][4] ([i915#4983] / [i915#7677] / [i915#7913])
[3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/bat-adlp-6/igt@i915_selftest@live@workarounds.html
[4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/bat-adlp-6/igt@i915_selftest@live@workarounds.html
* igt@kms_pipe_crc_basic@nonblocking-crc@pipe-c-dp-1:
- bat-dg2-8: [PASS][5] -> [FAIL][6] ([i915#7932])
[5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/bat-dg2-8/igt@kms_pipe_crc_basic@nonblocking-crc@pipe-c-dp-1.html
[6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/bat-dg2-8/igt@kms_pipe_crc_basic@nonblocking-crc@pipe-c-dp-1.html
#### Possible fixes ####
* igt@i915_selftest@live@migrate:
- bat-dg2-11: [DMESG-WARN][7] ([i915#7699]) -> [PASS][8]
[7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/bat-dg2-11/igt@i915_selftest@live@migrate.html
[8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/bat-dg2-11/igt@i915_selftest@live@migrate.html
{name}: This element is suppressed. This means it is ignored when computing
the status of the difference (SUCCESS, WARNING, or FAILURE).
[i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
[i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
[i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
[i915#7677]: https://gitlab.freedesktop.org/drm/intel/issues/7677
[i915#7699]: https://gitlab.freedesktop.org/drm/intel/issues/7699
[i915#7913]: https://gitlab.freedesktop.org/drm/intel/issues/7913
[i915#7932]: https://gitlab.freedesktop.org/drm/intel/issues/7932
[i915#7981]: https://gitlab.freedesktop.org/drm/intel/issues/7981
Build changes
-------------
* Linux: CI_DRM_13026 -> Patchwork_116280v2
CI-20190529: 20190529
CI_DRM_13026: 45bed7de41ad8bd909a82382a8fc4cb65e04ad56 @ git://anongit.freedesktop.org/gfx-ci/linux
IGT_7259: 3d3a7f1c041d3f8d84d7457abf96adef0ea071cb @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
Patchwork_116280v2: 45bed7de41ad8bd909a82382a8fc4cb65e04ad56 @ git://anongit.freedesktop.org/gfx-ci/linux
### Linux commits
fd841c544085 drm/i915/guc: Dump error capture to dmesg on CTB error
725a11e64b63 drm/i915: Dump error capture to kernel log
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/index.html
* [Intel-gfx] ✓ Fi.CI.IGT: success for Add support for dumping error captures via kernel logging (rev2)
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
` (4 preceding siblings ...)
2023-04-18 19:08 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
@ 2023-04-18 23:46 ` Patchwork
2023-05-16 18:54 ` [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging Belgaumkar, Vinay
6 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2023-04-18 23:46 UTC (permalink / raw)
To: john.c.harrison; +Cc: intel-gfx
== Series Details ==
Series: Add support for dumping error captures via kernel logging (rev2)
URL : https://patchwork.freedesktop.org/series/116280/
State : success
== Summary ==
CI Bug Log - changes from CI_DRM_13026_full -> Patchwork_116280v2_full
====================================================
Summary
-------
**SUCCESS**
No regressions found.
Participating hosts (7 -> 7)
------------------------------
No changes in participating hosts
Possible new issues
-------------------
Here are the unknown changes that may have been introduced in Patchwork_116280v2_full:
### IGT changes ###
#### Suppressed ####
The following results come from untrusted machines, tests, or statuses.
They do not affect the overall result.
* igt@kms_vblank@pipe-g-query-idle:
- {shard-dg1}: NOTRUN -> [SKIP][1] +43 similar issues
[1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-dg1-14/igt@kms_vblank@pipe-g-query-idle.html
* igt@kms_vblank@pipe-h-wait-idle:
- {shard-tglu}: NOTRUN -> [SKIP][2]
[2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-tglu-8/igt@kms_vblank@pipe-h-wait-idle.html
Known issues
------------
Here are the changes found in Patchwork_116280v2_full that come from known issues:
### IGT changes ###
#### Issues hit ####
* igt@gem_exec_fair@basic-pace-share@rcs0:
- shard-glk: [PASS][3] -> [FAIL][4] ([i915#2842])
[3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-glk8/igt@gem_exec_fair@basic-pace-share@rcs0.html
[4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-glk3/igt@gem_exec_fair@basic-pace-share@rcs0.html
* igt@kms_flip@flip-vs-suspend-interruptible@c-dp1:
- shard-apl: [PASS][5] -> [ABORT][6] ([i915#180])
[5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-apl7/igt@kms_flip@flip-vs-suspend-interruptible@c-dp1.html
[6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-apl3/igt@kms_flip@flip-vs-suspend-interruptible@c-dp1.html
* igt@kms_plane_scaling@planes-downscale-factor-0-5-unity-scaling@pipe-b-vga-1:
- shard-snb: NOTRUN -> [SKIP][7] ([fdo#109271]) +37 similar issues
[7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-snb2/igt@kms_plane_scaling@planes-downscale-factor-0-5-unity-scaling@pipe-b-vga-1.html
* igt@kms_setmode@basic@pipe-a-hdmi-a-1:
- shard-snb: NOTRUN -> [FAIL][8] ([i915#5465]) +1 similar issue
[8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-snb1/igt@kms_setmode@basic@pipe-a-hdmi-a-1.html
#### Possible fixes ####
* igt@gem_barrier_race@remote-request@rcs0:
- {shard-tglu}: [ABORT][9] ([i915#8211]) -> [PASS][10]
[9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-tglu-9/igt@gem_barrier_race@remote-request@rcs0.html
[10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-tglu-8/igt@gem_barrier_race@remote-request@rcs0.html
* igt@gem_exec_fair@basic-none-solo@rcs0:
- shard-apl: [FAIL][11] ([i915#2842]) -> [PASS][12]
[11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-apl1/igt@gem_exec_fair@basic-none-solo@rcs0.html
[12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-apl7/igt@gem_exec_fair@basic-none-solo@rcs0.html
* igt@gem_exec_fair@basic-pace@vecs0:
- {shard-rkl}: [FAIL][13] ([i915#2842]) -> [PASS][14]
[13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-rkl-7/igt@gem_exec_fair@basic-pace@vecs0.html
[14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-rkl-1/igt@gem_exec_fair@basic-pace@vecs0.html
* igt@gem_exec_suspend@basic-s4-devices@lmem0:
- {shard-dg1}: [ABORT][15] ([i915#7975]) -> [PASS][16]
[15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-dg1-14/igt@gem_exec_suspend@basic-s4-devices@lmem0.html
[16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-dg1-17/igt@gem_exec_suspend@basic-s4-devices@lmem0.html
* igt@i915_pm_dc@dc6-dpms:
- {shard-tglu}: [FAIL][17] ([i915#3989] / [i915#454]) -> [PASS][18]
[17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-tglu-7/igt@i915_pm_dc@dc6-dpms.html
[18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-tglu-10/igt@i915_pm_dc@dc6-dpms.html
* igt@i915_pm_rpm@dpms-non-lpsp:
- {shard-rkl}: [SKIP][19] ([i915#1397]) -> [PASS][20] +1 similar issue
[19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-rkl-7/igt@i915_pm_rpm@dpms-non-lpsp.html
[20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-rkl-1/igt@i915_pm_rpm@dpms-non-lpsp.html
* igt@i915_pm_rpm@modeset-non-lpsp:
- {shard-dg1}: [SKIP][21] ([i915#1397]) -> [PASS][22]
[21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-dg1-14/igt@i915_pm_rpm@modeset-non-lpsp.html
[22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-dg1-17/igt@i915_pm_rpm@modeset-non-lpsp.html
* igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions:
- shard-glk: [FAIL][23] ([i915#2346]) -> [PASS][24]
[23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-glk1/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
[24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-glk6/igt@kms_cursor_legacy@flip-vs-cursor-atomic-transitions.html
* igt@kms_cursor_legacy@single-bo@pipe-b:
- {shard-rkl}: [INCOMPLETE][25] ([i915#8011]) -> [PASS][26] +1 similar issue
[25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-rkl-7/igt@kms_cursor_legacy@single-bo@pipe-b.html
[26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-rkl-2/igt@kms_cursor_legacy@single-bo@pipe-b.html
* igt@kms_flip@flip-vs-expired-vblank-interruptible@c-hdmi-a2:
- shard-glk: [FAIL][27] ([i915#79]) -> [PASS][28]
[27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-glk2/igt@kms_flip@flip-vs-expired-vblank-interruptible@c-hdmi-a2.html
[28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-glk5/igt@kms_flip@flip-vs-expired-vblank-interruptible@c-hdmi-a2.html
* igt@kms_plane_scaling@i915-max-src-size@pipe-a-hdmi-a-2:
- {shard-rkl}: [FAIL][29] ([i915#8292]) -> [PASS][30]
[29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13026/shard-rkl-2/igt@kms_plane_scaling@i915-max-src-size@pipe-a-hdmi-a-2.html
[30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/shard-rkl-4/igt@kms_plane_scaling@i915-max-src-size@pipe-a-hdmi-a-2.html
{name}: This element is suppressed. This means it is ignored when computing
the status of the difference (SUCCESS, WARNING, or FAILURE).
[fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
[fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
[fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
[fdo#109283]: https://bugs.freedesktop.org/show_bug.cgi?id=109283
[fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
[fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
[fdo#109314]: https://bugs.freedesktop.org/show_bug.cgi?id=109314
[fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
[fdo#110189]: https://bugs.freedesktop.org/show_bug.cgi?id=110189
[fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
[fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
[fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
[fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
[fdo#112054]: https://bugs.freedesktop.org/show_bug.cgi?id=112054
[fdo#112283]: https://bugs.freedesktop.org/show_bug.cgi?id=112283
[i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
[i915#1397]: https://gitlab.freedesktop.org/drm/intel/issues/1397
[i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
[i915#2346]: https://gitlab.freedesktop.org/drm/intel/issues/2346
[i915#2437]: https://gitlab.freedesktop.org/drm/intel/issues/2437
[i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
[i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
[i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
[i915#2672]: https://gitlab.freedesktop.org/drm/intel/issues/2672
[i915#2681]: https://gitlab.freedesktop.org/drm/intel/issues/2681
[i915#2705]: https://gitlab.freedesktop.org/drm/intel/issues/2705
[i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
[i915#3281]: https://gitlab.freedesktop.org/drm/intel/issues/3281
[i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
[i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
[i915#3299]: https://gitlab.freedesktop.org/drm/intel/issues/3299
[i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
[i915#3458]: https://gitlab.freedesktop.org/drm/intel/issues/3458
[i915#3539]: https://gitlab.freedesktop.org/drm/intel/issues/3539
[i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
[i915#3591]: https://gitlab.freedesktop.org/drm/intel/issues/3591
[i915#3637]: https://gitlab.freedesktop.org/drm/intel/issues/3637
[i915#3638]: https://gitlab.freedesktop.org/drm/intel/issues/3638
[i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
[i915#3742]: https://gitlab.freedesktop.org/drm/intel/issues/3742
[i915#3743]: https://gitlab.freedesktop.org/drm/intel/issues/3743
[i915#3840]: https://gitlab.freedesktop.org/drm/intel/issues/3840
[i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
[i915#3989]: https://gitlab.freedesktop.org/drm/intel/issues/3989
[i915#4036]: https://gitlab.freedesktop.org/drm/intel/issues/4036
[i915#4070]: https://gitlab.freedesktop.org/drm/intel/issues/4070
[i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
[i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
[i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
[i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
[i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
[i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213
[i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
[i915#4349]: https://gitlab.freedesktop.org/drm/intel/issues/4349
[i915#4538]: https://gitlab.freedesktop.org/drm/intel/issues/4538
[i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
[i915#4565]: https://gitlab.freedesktop.org/drm/intel/issues/4565
[i915#4579]: https://gitlab.freedesktop.org/drm/intel/issues/4579
[i915#4771]: https://gitlab.freedesktop.org/drm/intel/issues/4771
[i915#4812]: https://gitlab.freedesktop.org/drm/intel/issues/4812
[i915#4816]: https://gitlab.freedesktop.org/drm/intel/issues/4816
[i915#4833]: https://gitlab.freedesktop.org/drm/intel/issues/4833
[i915#4852]: https://gitlab.freedesktop.org/drm/intel/issues/4852
[i915#4860]: https://gitlab.freedesktop.org/drm/intel/issues/4860
[i915#4881]: https://gitlab.freedesktop.org/drm/intel/issues/4881
[i915#4884]: https://gitlab.freedesktop.org/drm/intel/issues/4884
[i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
[i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
[i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
[i915#5286]: https://gitlab.freedesktop.org/drm/intel/issues/5286
[i915#5288]: https://gitlab.freedesktop.org/drm/intel/issues/5288
[i915#5289]: https://gitlab.freedesktop.org/drm/intel/issues/5289
[i915#5325]: https://gitlab.freedesktop.org/drm/intel/issues/5325
[i915#5461]: https://gitlab.freedesktop.org/drm/intel/issues/5461
[i915#5465]: https://gitlab.freedesktop.org/drm/intel/issues/5465
[i915#5563]: https://gitlab.freedesktop.org/drm/intel/issues/5563
[i915#5723]: https://gitlab.freedesktop.org/drm/intel/issues/5723
[i915#6095]: https://gitlab.freedesktop.org/drm/intel/issues/6095
[i915#6433]: https://gitlab.freedesktop.org/drm/intel/issues/6433
[i915#6524]: https://gitlab.freedesktop.org/drm/intel/issues/6524
[i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
[i915#7116]: https://gitlab.freedesktop.org/drm/intel/issues/7116
[i915#7561]: https://gitlab.freedesktop.org/drm/intel/issues/7561
[i915#7697]: https://gitlab.freedesktop.org/drm/intel/issues/7697
[i915#7711]: https://gitlab.freedesktop.org/drm/intel/issues/7711
[i915#7828]: https://gitlab.freedesktop.org/drm/intel/issues/7828
[i915#79]: https://gitlab.freedesktop.org/drm/intel/issues/79
[i915#7975]: https://gitlab.freedesktop.org/drm/intel/issues/7975
[i915#8011]: https://gitlab.freedesktop.org/drm/intel/issues/8011
[i915#8211]: https://gitlab.freedesktop.org/drm/intel/issues/8211
[i915#8292]: https://gitlab.freedesktop.org/drm/intel/issues/8292
Build changes
-------------
* Linux: CI_DRM_13026 -> Patchwork_116280v2
CI-20190529: 20190529
CI_DRM_13026: 45bed7de41ad8bd909a82382a8fc4cb65e04ad56 @ git://anongit.freedesktop.org/gfx-ci/linux
IGT_7259: 3d3a7f1c041d3f8d84d7457abf96adef0ea071cb @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
Patchwork_116280v2: 45bed7de41ad8bd909a82382a8fc4cb65e04ad56 @ git://anongit.freedesktop.org/gfx-ci/linux
piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_116280v2/index.html
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
` (5 preceding siblings ...)
2023-04-18 23:46 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
@ 2023-05-16 18:54 ` Belgaumkar, Vinay
6 siblings, 0 replies; 13+ messages in thread
From: Belgaumkar, Vinay @ 2023-05-16 18:54 UTC (permalink / raw)
To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel
On 4/18/2023 11:17 AM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> Sometimes, the only effective way to debug an issue is to dump all the
> interesting information at the point of failure. So add support for
> doing that.
>
> v2: Extra CONFIG wrapping (review feedback from Rodrigo)
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
series LGTM,
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>
>
> John Harrison (2):
> drm/i915: Dump error capture to kernel log
> drm/i915/guc: Dump error capture to dmesg on CTB error
>
> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 53 +++++++++
> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h | 6 +
> drivers/gpu/drm/i915/i915_gpu_error.c | 132 ++++++++++++++++++++++
> drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
> 4 files changed, 201 insertions(+)
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log John.C.Harrison
@ 2023-05-16 19:17 ` Belgaumkar, Vinay
2023-05-16 19:21 ` John Harrison
0 siblings, 1 reply; 13+ messages in thread
From: Belgaumkar, Vinay @ 2023-05-16 19:17 UTC (permalink / raw)
To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel
On 4/18/2023 11:17 AM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> This is useful for getting debug information out in certain
> situations, such as failing kernel selftests and CI runs that don't
> log error captures. It is especially useful for things like retrieving
> GuC logs as GuC operation can't be tracked by adding printk or ftrace
> entries.
>
> v2: Add CONFIG_DRM_I915_DEBUG_GEM wrapper (review feedback by Rodrigo).
Do the CI sparse warnings hold water? With that looked at,
LGTM,
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
> drivers/gpu/drm/i915/i915_gpu_error.c | 132 ++++++++++++++++++++++++++
> drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
> 2 files changed, 142 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index f020c0086fbcd..03d62c250c465 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -2219,3 +2219,135 @@ void i915_disable_error_state(struct drm_i915_private *i915, int err)
> i915->gpu_error.first_error = ERR_PTR(err);
> spin_unlock_irq(&i915->gpu_error.lock);
> }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
> +void intel_klog_error_capture(struct intel_gt *gt,
> + intel_engine_mask_t engine_mask)
> +{
> + static int g_count;
> + struct drm_i915_private *i915 = gt->i915;
> + struct i915_gpu_coredump *error;
> + intel_wakeref_t wakeref;
> + size_t buf_size = PAGE_SIZE * 128;
> + size_t pos_err;
> + char *buf, *ptr, *next;
> + int l_count = g_count++;
> + int line = 0;
> +
> + /* Can't allocate memory during a reset */
> + if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
> + drm_err(&gt->i915->drm, "[Capture/%d.%d] Inside GT reset, skipping error capture :(\n",
> + l_count, line++);
> + return;
> + }
> +
> + error = READ_ONCE(i915->gpu_error.first_error);
> + if (error) {
> + drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error capture first...\n",
> + l_count, line++);
> + i915_reset_error_state(i915);
> + }
> +
> + with_intel_runtime_pm(&i915->runtime_pm, wakeref)
> + error = i915_gpu_coredump(gt, engine_mask, CORE_DUMP_FLAG_NONE);
> +
> + if (IS_ERR(error)) {
> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error capture: %ld!\n",
> + l_count, line++, PTR_ERR(error));
> + return;
> + }
> +
> + buf = kvmalloc(buf_size, GFP_KERNEL);
> + if (!buf) {
> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer for error capture!\n",
> + l_count, line++);
> + i915_gpu_coredump_put(error);
> + return;
> + }
> +
> + drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture for %ps...\n",
> + l_count, line++, __builtin_return_address(0));
> +
> + /* Largest string length safe to print via dmesg */
> +# define MAX_CHUNK 800
> +
> + pos_err = 0;
> + while (1) {
> + ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf, pos_err, buf_size - 1);
> +
> + if (got <= 0)
> + break;
> +
> + buf[got] = 0;
> + pos_err += got;
> +
> + ptr = buf;
> + while (got > 0) {
> + size_t count;
> + char tag[2];
> +
> + next = strnchr(ptr, got, '\n');
> + if (next) {
> + count = next - ptr;
> + *next = 0;
> + tag[0] = '>';
> + tag[1] = '<';
> + } else {
> + count = got;
> + tag[0] = '}';
> + tag[1] = '{';
> + }
> +
> + if (count > MAX_CHUNK) {
> + size_t pos;
> + char *ptr2 = ptr;
> +
> + for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) {
> + char chr = ptr[pos];
> +
> + ptr[pos] = 0;
> + drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n",
> + l_count, line++, ptr2);
> + ptr[pos] = chr;
> + ptr2 = ptr + pos;
> +
> + /*
> + * If spewing large amounts of data via a serial console,
> + * this can be a very slow process. So be friendly and try
> + * not to cause 'softlockup on CPU' problems.
> + */
> + cond_resched();
> + }
> +
> + if (ptr2 < (ptr + count))
> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
> + l_count, line++, tag[0], ptr2, tag[1]);
> + else if (tag[0] == '>')
> + drm_info(&i915->drm, "[Capture/%d.%d] ><\n",
> + l_count, line++);
> + } else {
> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
> + l_count, line++, tag[0], ptr, tag[1]);
> + }
> +
> + ptr = next;
> + got -= count;
> + if (next) {
> + ptr++;
> + got--;
> + }
> +
> + /* As above. */
> + cond_resched();
> + }
> +
> + if (got)
> + drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes remaining!\n",
> + l_count, line++, got);
> + }
> +
> + kvfree(buf);
> +
> + drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n", l_count, line++, pos_err);
> +}
> +#endif
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index a91932cc65317..a78c061ce26fb 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -258,6 +258,16 @@ static inline u32 i915_reset_engine_count(struct i915_gpu_error *error,
> #define CORE_DUMP_FLAG_NONE 0x0
> #define CORE_DUMP_FLAG_IS_GUC_CAPTURE BIT(0)
>
> +#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR) && IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
> +void intel_klog_error_capture(struct intel_gt *gt,
> + intel_engine_mask_t engine_mask);
> +#else
> +static inline void intel_klog_error_capture(struct intel_gt *gt,
> + intel_engine_mask_t engine_mask)
> +{
> +}
> +#endif
> +
> #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
>
> __printf(2, 3)
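
For readers following the chunking loop in the patch above, the sketch below restates the idea in plain userspace C. It is not the driver code: emit_tagged_lines(), emit_fn and print_record() are hypothetical names, the chunk limit is a parameter rather than the patch's fixed MAX_CHUNK of 800, and the input is left unmodified instead of being patched with temporary NUL terminators. A piece that completes a line is wrapped in '>' and '<'; a piece cut short by the chunk limit or by the end of the data is wrapped in '}' and '{'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives one tagged record at a time. */
typedef void (*emit_fn)(void *ctx, char open, const char *s, size_t len, char close);

/*
 * Split text into records of at most max_chunk payload bytes.
 * A record that ends a line is tagged ">...<"; a record that stops
 * mid-line (chunk limit or end of data) is tagged "}...{".
 * Returns the number of records emitted.
 */
static int emit_tagged_lines(const char *text, size_t max_chunk, emit_fn emit, void *ctx)
{
	int records = 0;

	assert(max_chunk > 0);

	while (*text) {
		const char *nl = strchr(text, '\n');
		size_t len = nl ? (size_t)(nl - text) : strlen(text);

		/* Full-size pieces that are followed by more data on the same line. */
		while (len > max_chunk) {
			emit(ctx, '}', text, max_chunk, '{');
			text += max_chunk;
			len -= max_chunk;
			records++;
		}

		/* Final piece: closes the line only if a newline was found. */
		emit(ctx, nl ? '>' : '}', text, len, nl ? '<' : '{');
		records++;

		text += len;
		if (nl)
			text++;	/* skip the newline itself */
	}

	return records;
}

/* Example sink: print each record roughly the way a log line might look. */
static void print_record(void *ctx, char open, const char *s, size_t len, char close)
{
	(void)ctx;
	printf("[Capture] %c%.*s%c\n", open, (int)len, s, close);
}
```

Calling emit_tagged_lines("ab\ncdefg", 3, print_record, NULL) emits three records: one complete line and two pieces of the unterminated tail. Keeping the tags distinct is what would let a post-processing script rejoin split lines when rebuilding the capture from dmesg.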
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error John.C.Harrison
@ 2023-05-16 19:17 ` Belgaumkar, Vinay
0 siblings, 0 replies; 13+ messages in thread
From: Belgaumkar, Vinay @ 2023-05-16 19:17 UTC (permalink / raw)
To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel
On 4/18/2023 11:17 AM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> In the past, there have been sporadic CTB failures which proved hard
> to reproduce manually. The most effective solution was to dump the GuC
> log at the point of failure and let the CI system do the repro. It is
> preferable not to dump the GuC log via dmesg for all issues as it is
> not always necessary and is not helpful for end users. But rather than
> trying to re-invent the code to do this each time it is wanted, commit
> the code but for DEBUG_GUC builds only.
>
> v2: Use IS_ENABLED for testing config options.
LGTM,
Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 53 +++++++++++++++++++++++
> drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h | 6 +++
> 2 files changed, 59 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 1803a633ed648..dc5cd712f1ff5 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -13,6 +13,30 @@
> #include "intel_guc_ct.h"
> #include "intel_guc_print.h"
>
> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
> +enum {
> + CT_DEAD_ALIVE = 0,
> + CT_DEAD_SETUP,
> + CT_DEAD_WRITE,
> + CT_DEAD_DEADLOCK,
> + CT_DEAD_H2G_HAS_ROOM,
> + CT_DEAD_READ,
> + CT_DEAD_PROCESS_FAILED,
> +};
> +
> +static void ct_dead_ct_worker_func(struct work_struct *w);
> +
> +#define CT_DEAD(ct, reason) \
> + do { \
> + if (!(ct)->dead_ct_reported) { \
> + (ct)->dead_ct_reason |= 1 << CT_DEAD_##reason; \
> + queue_work(system_unbound_wq, &(ct)->dead_ct_worker); \
> + } \
> + } while (0)
> +#else
> +#define CT_DEAD(ct, reason) do { } while (0)
> +#endif
> +
> static inline struct intel_guc *ct_to_guc(struct intel_guc_ct *ct)
> {
> return container_of(ct, struct intel_guc, ct);
> @@ -93,6 +117,9 @@ void intel_guc_ct_init_early(struct intel_guc_ct *ct)
> spin_lock_init(&ct->requests.lock);
> INIT_LIST_HEAD(&ct->requests.pending);
> INIT_LIST_HEAD(&ct->requests.incoming);
> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
> + INIT_WORK(&ct->dead_ct_worker, ct_dead_ct_worker_func);
> +#endif
> INIT_WORK(&ct->requests.worker, ct_incoming_request_worker_func);
> tasklet_setup(&ct->receive_tasklet, ct_receive_tasklet_func);
> init_waitqueue_head(&ct->wq);
> @@ -319,11 +346,16 @@ int intel_guc_ct_enable(struct intel_guc_ct *ct)
>
> ct->enabled = true;
> ct->stall_time = KTIME_MAX;
> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
> + ct->dead_ct_reported = false;
> + ct->dead_ct_reason = CT_DEAD_ALIVE;
> +#endif
>
> return 0;
>
> err_out:
> CT_PROBE_ERROR(ct, "Failed to enable CTB (%pe)\n", ERR_PTR(err));
> + CT_DEAD(ct, SETUP);
> return err;
> }
>
> @@ -434,6 +466,7 @@ static int ct_write(struct intel_guc_ct *ct,
> corrupted:
> CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
> desc->head, desc->tail, desc->status);
> + CT_DEAD(ct, WRITE);
> ctb->broken = true;
> return -EPIPE;
> }
> @@ -504,6 +537,7 @@ static inline bool ct_deadlocked(struct intel_guc_ct *ct)
> CT_ERROR(ct, "Head: %u\n (Dwords)", ct->ctbs.recv.desc->head);
> CT_ERROR(ct, "Tail: %u\n (Dwords)", ct->ctbs.recv.desc->tail);
>
> + CT_DEAD(ct, DEADLOCK);
> ct->ctbs.send.broken = true;
> }
>
> @@ -552,6 +586,7 @@ static inline bool h2g_has_room(struct intel_guc_ct *ct, u32 len_dw)
> head, ctb->size);
> desc->status |= GUC_CTB_STATUS_OVERFLOW;
> ctb->broken = true;
> + CT_DEAD(ct, H2G_HAS_ROOM);
> return false;
> }
>
> @@ -908,6 +943,7 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg)
> CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n",
> desc->head, desc->tail, desc->status);
> ctb->broken = true;
> + CT_DEAD(ct, READ);
> return -EPIPE;
> }
>
> @@ -1057,6 +1093,7 @@ static bool ct_process_incoming_requests(struct intel_guc_ct *ct)
> if (unlikely(err)) {
> CT_ERROR(ct, "Failed to process CT message (%pe) %*ph\n",
> ERR_PTR(err), 4 * request->size, request->msg);
> + CT_DEAD(ct, PROCESS_FAILED);
> ct_free_msg(request);
> }
>
> @@ -1233,3 +1270,19 @@ void intel_guc_ct_print_info(struct intel_guc_ct *ct,
> drm_printf(p, "Tail: %u\n",
> ct->ctbs.recv.desc->tail);
> }
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
> +static void ct_dead_ct_worker_func(struct work_struct *w)
> +{
> + struct intel_guc_ct *ct = container_of(w, struct intel_guc_ct, dead_ct_worker);
> + struct intel_guc *guc = ct_to_guc(ct);
> +
> + if (ct->dead_ct_reported)
> + return;
> +
> + ct->dead_ct_reported = true;
> +
> + guc_info(guc, "CTB is dead - reason=0x%X\n", ct->dead_ct_reason);
> + intel_klog_error_capture(guc_to_gt(guc), (intel_engine_mask_t)~0U);
> +}
> +#endif
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
> index f709a19c7e214..818415b64f4d1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h
> @@ -85,6 +85,12 @@ struct intel_guc_ct {
>
> /** @stall_time: time of first time a CTB submission is stalled */
> ktime_t stall_time;
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GUC)
> + int dead_ct_reason;
> + bool dead_ct_reported;
> + struct work_struct dead_ct_worker;
> +#endif
> };
>
> void intel_guc_ct_init_early(struct intel_guc_ct *ct);
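
The reporting scheme in the patch above (accumulate failure reasons as bits, then dump once) can be sketched in userspace C as follows. This is an illustration only: struct channel, channel_mark_dead() and channel_report_dead() are hypothetical names, and the workqueue that defers the report in the patch is left out, so the report is called directly here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Reason codes, used as bit positions; names are illustrative only. */
enum dead_reason {
	DEAD_ALIVE = 0,
	DEAD_SETUP,
	DEAD_WRITE,
	DEAD_READ,
};

struct channel {
	unsigned int dead_reason;	/* bitmask of 1 << enum dead_reason */
	bool dead_reported;		/* set once the dump has been produced */
	int dumps;			/* number of dumps produced (demo only) */
};

/* Record a failure reason; ignored once a report has already gone out. */
static void channel_mark_dead(struct channel *ch, enum dead_reason reason)
{
	/* DEAD_ALIVE is the reset value, not a failure to record. */
	assert(reason != DEAD_ALIVE);

	if (!ch->dead_reported)
		ch->dead_reason |= 1u << reason;
}

/* Reporter: in the patch this role is played by a deferred work item. */
static void channel_report_dead(struct channel *ch)
{
	if (ch->dead_reported)
		return;

	ch->dead_reported = true;
	ch->dumps++;
	printf("channel is dead - reason=0x%X\n", ch->dead_reason);
}
```

Several failures that arrive before the report runs are folded into one bitmask, and later failures are dropped until the flags are reset, which in the patch happens when the CT is re-enabled.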
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log
2023-05-16 19:17 ` Belgaumkar, Vinay
@ 2023-05-16 19:21 ` John Harrison
2023-05-16 20:52 ` Rodrigo Vivi
0 siblings, 1 reply; 13+ messages in thread
From: John Harrison @ 2023-05-16 19:21 UTC (permalink / raw)
To: Belgaumkar, Vinay, Intel-GFX; +Cc: DRI-Devel
On 5/16/2023 12:17, Belgaumkar, Vinay wrote:
> On 4/18/2023 11:17 AM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> This is useful for getting debug information out in certain
>> situations, such as failing kernel selftests and CI runs that don't
>> log error captures. It is especially useful for things like retrieving
>> GuC logs as GuC operation can't be tracked by adding printk or ftrace
>> entries.
>>
>> v2: Add CONFIG_DRM_I915_DEBUG_GEM wrapper (review feedback by Rodrigo).
>
> Do the CI sparse warnings hold water? With that looked at,
You mean this one totally fatal and absolutely must be fixed error?
Fast mode used, each commit won't be checked separately.
Does anyone even know what that means or why it (apparently totally
randomly) hits some patch sets and not others?
If you mean the checkpatch warnings, one is about not reporting out of
memory errors (because you are supposed to return -ENOMEM and let the
user handle it instead), but that doesn't apply for an internal kernel
only thing which is already just a debug print. The other is about macro
argument re-use, which is not an issue in this case and not worth
re-writing the code to avoid.
John.
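
As an aside on the macro argument re-use point, and assuming the warning refers to a macro shaped like CT_DEAD() from patch 2/2, the sketch below shows what checkpatch is guarding against. All names here (struct chan, MARK_DEAD, get_chan, macro_reuse_demo) are made up for illustration. The macro expands its first argument twice, which only matters when the argument has side effects; passing a plain pointer variable, as the call sites in the series do, is unaffected.

```c
#include <assert.h>
#include <stdbool.h>

struct chan {
	bool reported;
	unsigned int reason;
};

/* Expands (c) twice: once in the test, once in the update. */
#define MARK_DEAD(c, bit) \
	do { \
		if (!(c)->reported) \
			(c)->reason |= 1u << (bit); \
	} while (0)

static struct chan the_chan;
static int lookups;

/* Hypothetical accessor whose side effect makes double evaluation visible. */
static struct chan *get_chan(void)
{
	lookups++;
	return &the_chan;
}

/* Returns how many times get_chan() ran across the two macro uses below. */
static int macro_reuse_demo(void)
{
	/* Plain address as argument: evaluating it twice changes nothing. */
	MARK_DEAD(&the_chan, 1);
	assert(lookups == 0);

	/* Side-effecting argument: the call happens once per expansion. */
	MARK_DEAD(get_chan(), 2);
	assert(the_chan.reason == ((1u << 1) | (1u << 2)));

	return lookups;
}
```

Here macro_reuse_demo() returns 2, because the single MARK_DEAD(get_chan(), 2) statement calls get_chan() twice.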
>
> LGTM,
>
> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
>
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_gpu_error.c | 132 ++++++++++++++++++++++++++
>> drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
>> 2 files changed, 142 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index f020c0086fbcd..03d62c250c465 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -2219,3 +2219,135 @@ void i915_disable_error_state(struct
>> drm_i915_private *i915, int err)
>> i915->gpu_error.first_error = ERR_PTR(err);
>> spin_unlock_irq(&i915->gpu_error.lock);
>> }
>> +
>> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
>> +void intel_klog_error_capture(struct intel_gt *gt,
>> + intel_engine_mask_t engine_mask)
>> +{
>> + static int g_count;
>> + struct drm_i915_private *i915 = gt->i915;
>> + struct i915_gpu_coredump *error;
>> + intel_wakeref_t wakeref;
>> + size_t buf_size = PAGE_SIZE * 128;
>> + size_t pos_err;
>> + char *buf, *ptr, *next;
>> + int l_count = g_count++;
>> + int line = 0;
>> +
>> + /* Can't allocate memory during a reset */
>> + if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>> + drm_err(&gt->i915->drm, "[Capture/%d.%d] Inside GT reset,
>> skipping error capture :(\n",
>> + l_count, line++);
>> + return;
>> + }
>> +
>> + error = READ_ONCE(i915->gpu_error.first_error);
>> + if (error) {
>> + drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error
>> capture first...\n",
>> + l_count, line++);
>> + i915_reset_error_state(i915);
>> + }
>> +
>> + with_intel_runtime_pm(&i915->runtime_pm, wakeref)
>> + error = i915_gpu_coredump(gt, engine_mask,
>> CORE_DUMP_FLAG_NONE);
>> +
>> + if (IS_ERR(error)) {
>> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error
>> capture: %ld!\n",
>> + l_count, line++, PTR_ERR(error));
>> + return;
>> + }
>> +
>> + buf = kvmalloc(buf_size, GFP_KERNEL);
>> + if (!buf) {
>> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate
>> buffer for error capture!\n",
>> + l_count, line++);
>> + i915_gpu_coredump_put(error);
>> + return;
>> + }
>> +
>> + drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture
>> for %ps...\n",
>> + l_count, line++, __builtin_return_address(0));
>> +
>> + /* Largest string length safe to print via dmesg */
>> +# define MAX_CHUNK 800
>> +
>> + pos_err = 0;
>> + while (1) {
>> + ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf,
>> pos_err, buf_size - 1);
>> +
>> + if (got <= 0)
>> + break;
>> +
>> + buf[got] = 0;
>> + pos_err += got;
>> +
>> + ptr = buf;
>> + while (got > 0) {
>> + size_t count;
>> + char tag[2];
>> +
>> + next = strnchr(ptr, got, '\n');
>> + if (next) {
>> + count = next - ptr;
>> + *next = 0;
>> + tag[0] = '>';
>> + tag[1] = '<';
>> + } else {
>> + count = got;
>> + tag[0] = '}';
>> + tag[1] = '{';
>> + }
>> +
>> + if (count > MAX_CHUNK) {
>> + size_t pos;
>> + char *ptr2 = ptr;
>> +
>> + for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) {
>> + char chr = ptr[pos];
>> +
>> + ptr[pos] = 0;
>> + drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n",
>> + l_count, line++, ptr2);
>> + ptr[pos] = chr;
>> + ptr2 = ptr + pos;
>> +
>> + /*
>> + * If spewing large amounts of data via a serial
>> console,
>> + * this can be a very slow process. So be
>> friendly and try
>> + * not to cause 'softlockup on CPU' problems.
>> + */
>> + cond_resched();
>> + }
>> +
>> + if (ptr2 < (ptr + count))
>> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
>> + l_count, line++, tag[0], ptr2, tag[1]);
>> + else if (tag[0] == '>')
>> + drm_info(&i915->drm, "[Capture/%d.%d] ><\n",
>> + l_count, line++);
>> + } else {
>> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
>> + l_count, line++, tag[0], ptr, tag[1]);
>> + }
>> +
>> + ptr = next;
>> + got -= count;
>> + if (next) {
>> + ptr++;
>> + got--;
>> + }
>> +
>> + /* As above. */
>> + cond_resched();
>> + }
>> +
>> + if (got)
>> + drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes
>> remaining!\n",
>> + l_count, line++, got);
>> + }
>> +
>> + kvfree(buf);
>> +
>> + drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n",
>> l_count, line++, pos_err);
>> +}
>> +#endif
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
>> b/drivers/gpu/drm/i915/i915_gpu_error.h
>> index a91932cc65317..a78c061ce26fb 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
>> @@ -258,6 +258,16 @@ static inline u32 i915_reset_engine_count(struct
>> i915_gpu_error *error,
>> #define CORE_DUMP_FLAG_NONE 0x0
>> #define CORE_DUMP_FLAG_IS_GUC_CAPTURE BIT(0)
>> +#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR) &&
>> IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
>> +void intel_klog_error_capture(struct intel_gt *gt,
>> + intel_engine_mask_t engine_mask);
>> +#else
>> +static inline void intel_klog_error_capture(struct intel_gt *gt,
>> + intel_engine_mask_t engine_mask)
>> +{
>> +}
>> +#endif
>> +
>> #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
>> __printf(2, 3)
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log
2023-05-16 19:21 ` John Harrison
@ 2023-05-16 20:52 ` Rodrigo Vivi
2023-05-16 22:06 ` John Harrison
0 siblings, 1 reply; 13+ messages in thread
From: Rodrigo Vivi @ 2023-05-16 20:52 UTC (permalink / raw)
To: John Harrison; +Cc: Intel-GFX, DRI-Devel
On Tue, May 16, 2023 at 12:21:05PM -0700, John Harrison wrote:
> On 5/16/2023 12:17, Belgaumkar, Vinay wrote:
>
> > On 4/18/2023 11:17 AM, [1]John.C.Harrison@Intel.com wrote:
>
> >> From: John Harrison [2]<John.C.Harrison@Intel.com>
>
> >> This is useful for getting debug information out in certain
> >> situations, such as failing kernel selftests and CI runs that don't
> >> log error captures. It is especially useful for things like retrieving
> >> GuC logs as GuC operation can't be tracked by adding printk or ftrace
> >> entries.
>
> >> v2: Add CONFIG_DRM_I915_DEBUG_GEM wrapper (review feedback by Rodrigo).
Thanks
>
> > Do the CI sparse warnings hold water? With that looked at,
>
> You mean this one totally fatal and absolutely must be fixed error?
>
> Fast mode used, each commit won't be checked separately.
You should never rely on this assumption. There are bisects and autobisects
out there. Also every patch needs to be independently available for backport.
So, if there's any issue it needs to be fixed before we merge.
>
> Does anyone even know what that means or why it (apparently totally
> randomly) hits some patch sets and not others?
>
> If you mean the checkpatch warnings, one is about not reporting out of
> memory errors (because you are supposed to return -ENOMEM and let the user
> handle it instead), but that doesn't apply for an internal kernel only
> thing which is already just a debug print. The other is about macro
> argument re-use, which is not an issue in this case and not worth
> re-writing the code to avoid.
>
> John.
>
> > LGTM,
>
> > Reviewed-by: Vinay Belgaumkar [3]<vinay.belgaumkar@intel.com>
>
> >> Signed-off-by: John Harrison [4]<John.C.Harrison@Intel.com>
> >> ---
> >> drivers/gpu/drm/i915/i915_gpu_error.c | 132
> >> ++++++++++++++++++++++++++
> >> drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
> >> 2 files changed, 142 insertions(+)
>
> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
> >> b/drivers/gpu/drm/i915/i915_gpu_error.c
> >> index f020c0086fbcd..03d62c250c465 100644
> >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> >> @@ -2219,3 +2219,135 @@ void i915_disable_error_state(struct
> >> drm_i915_private *i915, int err)
> >> i915->gpu_error.first_error = ERR_PTR(err);
> >> spin_unlock_irq(&i915->gpu_error.lock);
> >> }
> >> +
> >> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
> >> +void intel_klog_error_capture(struct intel_gt *gt,
> >> + intel_engine_mask_t engine_mask)
> >> +{
> >> + static int g_count;
> >> + struct drm_i915_private *i915 = gt->i915;
> >> + struct i915_gpu_coredump *error;
> >> + intel_wakeref_t wakeref;
> >> + size_t buf_size = PAGE_SIZE * 128;
> >> + size_t pos_err;
> >> + char *buf, *ptr, *next;
> >> + int l_count = g_count++;
> >> + int line = 0;
> >> +
> >> + /* Can't allocate memory during a reset */
> >> + if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
> >> + drm_err(&gt->i915->drm, "[Capture/%d.%d] Inside GT reset,
> >> skipping error capture :(\n",
> >> + l_count, line++);
> >> + return;
> >> + }
> >> +
> >> + error = READ_ONCE(i915->gpu_error.first_error);
> >> + if (error) {
> >> + drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error
> >> capture first...\n",
> >> + l_count, line++);
> >> + i915_reset_error_state(i915);
> >> + }
> >> +
> >> + with_intel_runtime_pm(&i915->runtime_pm, wakeref)
> >> + error = i915_gpu_coredump(gt, engine_mask,
> >> CORE_DUMP_FLAG_NONE);
> >> +
> >> + if (IS_ERR(error)) {
> >> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error
> >> capture: %ld!\n",
> >> + l_count, line++, PTR_ERR(error));
> >> + return;
> >> + }
> >> +
> >> + buf = kvmalloc(buf_size, GFP_KERNEL);
> >> + if (!buf) {
> >> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer
> >> for error capture!\n",
> >> + l_count, line++);
> >> + i915_gpu_coredump_put(error);
> >> + return;
> >> + }
> >> +
> >> + drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture
> >> for %ps...\n",
> >> + l_count, line++, __builtin_return_address(0));
> >> +
> >> + /* Largest string length safe to print via dmesg */
> >> +# define MAX_CHUNK 800
> >> +
> >> + pos_err = 0;
> >> + while (1) {
> >> + ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf,
> >> pos_err, buf_size - 1);
> >> +
> >> + if (got <= 0)
> >> + break;
> >> +
> >> + buf[got] = 0;
> >> + pos_err += got;
> >> +
> >> + ptr = buf;
> >> + while (got > 0) {
> >> + size_t count;
> >> + char tag[2];
> >> +
> >> + next = strnchr(ptr, got, '\n');
> >> + if (next) {
> >> + count = next - ptr;
> >> + *next = 0;
> >> + tag[0] = '>';
> >> + tag[1] = '<';
> >> + } else {
> >> + count = got;
> >> + tag[0] = '}';
> >> + tag[1] = '{';
> >> + }
> >> +
> >> + if (count > MAX_CHUNK) {
> >> + size_t pos;
> >> + char *ptr2 = ptr;
> >> +
> >> + for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) {
> >> + char chr = ptr[pos];
> >> +
> >> + ptr[pos] = 0;
> >> + drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n",
> >> + l_count, line++, ptr2);
> >> + ptr[pos] = chr;
> >> + ptr2 = ptr + pos;
> >> +
> >> + /*
> >> + * If spewing large amounts of data via a serial
> >> console,
> >> + * this can be a very slow process. So be friendly
> >> and try
> >> + * not to cause 'softlockup on CPU' problems.
> >> + */
> >> + cond_resched();
> >> + }
> >> +
> >> + if (ptr2 < (ptr + count))
> >> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
> >> + l_count, line++, tag[0], ptr2, tag[1]);
> >> + else if (tag[0] == '>')
> >> + drm_info(&i915->drm, "[Capture/%d.%d] ><\n",
> >> + l_count, line++);
> >> + } else {
> >> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
> >> + l_count, line++, tag[0], ptr, tag[1]);
> >> + }
> >> +
> >> + ptr = next;
> >> + got -= count;
> >> + if (next) {
> >> + ptr++;
> >> + got--;
> >> + }
> >> +
> >> + /* As above. */
> >> + cond_resched();
> >> + }
> >> +
> >> + if (got)
> >> + drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes
> >> remaining!\n",
> >> + l_count, line++, got);
> >> + }
> >> +
> >> + kvfree(buf);
> >> +
> >> + drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n",
> >> l_count, line++, pos_err);
> >> +}
> >> +#endif
> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
> >> b/drivers/gpu/drm/i915/i915_gpu_error.h
> >> index a91932cc65317..a78c061ce26fb 100644
> >> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> >> @@ -258,6 +258,16 @@ static inline u32 i915_reset_engine_count(struct
> >> i915_gpu_error *error,
> >> #define CORE_DUMP_FLAG_NONE 0x0
> >> #define CORE_DUMP_FLAG_IS_GUC_CAPTURE BIT(0)
> >> +#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR) &&
> >> IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
> >> +void intel_klog_error_capture(struct intel_gt *gt,
> >> + intel_engine_mask_t engine_mask);
> >> +#else
> >> +static inline void intel_klog_error_capture(struct intel_gt *gt,
> >> + intel_engine_mask_t engine_mask)
> >> +{
> >> +}
> >> +#endif
> >> +
> >> #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
> >> __printf(2, 3)
>
> References
>
> Visible links
> 1. mailto:John.C.Harrison@intel.com
> 2. mailto:John.C.Harrison@intel.com
> 3. mailto:vinay.belgaumkar@intel.com
> 4. mailto:John.C.Harrison@intel.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log
2023-05-16 20:52 ` Rodrigo Vivi
@ 2023-05-16 22:06 ` John Harrison
0 siblings, 0 replies; 13+ messages in thread
From: John Harrison @ 2023-05-16 22:06 UTC (permalink / raw)
To: Rodrigo Vivi; +Cc: Intel-GFX, DRI-Devel
On 5/16/2023 13:52, Rodrigo Vivi wrote:
> On Tue, May 16, 2023 at 12:21:05PM -0700, John Harrison wrote:
>> On 5/16/2023 12:17, Belgaumkar, Vinay wrote:
>>
>> > On 4/18/2023 11:17 AM, [1]John.C.Harrison@Intel.com wrote:
>>
>> >> From: John Harrison [2]<John.C.Harrison@Intel.com>
>>
>> >> This is useful for getting debug information out in certain
>> >> situations, such as failing kernel selftests and CI runs that don't
>> >> log error captures. It is especially useful for things like retrieving
>> >> GuC logs as GuC operation can't be tracked by adding printk or ftrace
>> >> entries.
>>
>> >> v2: Add CONFIG_DRM_I915_DEBUG_GEM wrapper (review feedback by Rodrigo).
> Thanks
>
>>
>> > Do the CI sparse warnings hold water? With that looked at,
>>
>> You mean this one totally fatal and absolutely must be fixed error?
>>
>> Fast mode used, each commit won't be checked separately.
> You should never rely on this assumption. There are bisects and autobisects
> out there. Also every patch needs to be independently available for backport.
>
> So, if there's any issue it needs to be fixed before we merge.
What assumption? That sparse claims failure even when there are no
errors at all, just a notice about 'fast mode used'? If there are
errors, please point out where I can find them. AFAICT, the sparse email
is actually saying the patch set is clean despite the fact it has a big
red cross on it.
John.
>
>>
>> Does anyone even know what that means or why it (apparently totally
>> randomly) hits some patch sets and not others?
>>
>> If you mean the checkpatch warnings, one is about not reporting out of
>> memory errors (because you are supposed to return -ENOMEM and let the user
>> handle it instead), but that doesn't apply for an internal kernel only
>> thing which is already just a debug print. The other is about macro
>> argument re-use, which is not an issue in this case and not worth
>> re-writing the code to avoid.
>>
>> John.
>>
>> > LGTM,
>>
>> > Reviewed-by: Vinay Belgaumkar [3]<vinay.belgaumkar@intel.com>
>>
>> >> Signed-off-by: John Harrison [4]<John.C.Harrison@Intel.com>
>> >> ---
>> >> drivers/gpu/drm/i915/i915_gpu_error.c | 132
>> >> ++++++++++++++++++++++++++
>> >> drivers/gpu/drm/i915/i915_gpu_error.h | 10 ++
>> >> 2 files changed, 142 insertions(+)
>>
>> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>> >> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> >> index f020c0086fbcd..03d62c250c465 100644
>> >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> >> @@ -2219,3 +2219,135 @@ void i915_disable_error_state(struct
>> >> drm_i915_private *i915, int err)
>> >> i915->gpu_error.first_error = ERR_PTR(err);
>> >> spin_unlock_irq(&i915->gpu_error.lock);
>> >> }
>> >> +
>> >> +#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
>> >> +void intel_klog_error_capture(struct intel_gt *gt,
>> >> + intel_engine_mask_t engine_mask)
>> >> +{
>> >> + static int g_count;
>> >> + struct drm_i915_private *i915 = gt->i915;
>> >> + struct i915_gpu_coredump *error;
>> >> + intel_wakeref_t wakeref;
>> >> + size_t buf_size = PAGE_SIZE * 128;
>> >> + size_t pos_err;
>> >> + char *buf, *ptr, *next;
>> >> + int l_count = g_count++;
>> >> + int line = 0;
>> >> +
>> >> + /* Can't allocate memory during a reset */
>> >> + if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>> >> + drm_err(&gt->i915->drm, "[Capture/%d.%d] Inside GT reset, skipping error capture :(\n",
>> >> + l_count, line++);
>> >> + return;
>> >> + }
>> >> +
>> >> + error = READ_ONCE(i915->gpu_error.first_error);
>> >> + if (error) {
>> >> + drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error
>> >> capture first...\n",
>> >> + l_count, line++);
>> >> + i915_reset_error_state(i915);
>> >> + }
>> >> +
>> >> + with_intel_runtime_pm(&i915->runtime_pm, wakeref)
>> >> + error = i915_gpu_coredump(gt, engine_mask,
>> >> CORE_DUMP_FLAG_NONE);
>> >> +
>> >> + if (IS_ERR(error)) {
>> >> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error
>> >> capture: %ld!\n",
>> >> + l_count, line++, PTR_ERR(error));
>> >> + return;
>> >> + }
>> >> +
>> >> + buf = kvmalloc(buf_size, GFP_KERNEL);
>> >> + if (!buf) {
>> >> + drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer
>> >> for error capture!\n",
>> >> + l_count, line++);
>> >> + i915_gpu_coredump_put(error);
>> >> + return;
>> >> + }
>> >> +
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture
>> >> for %ps...\n",
>> >> + l_count, line++, __builtin_return_address(0));
>> >> +
>> >> + /* Largest string length safe to print via dmesg */
>> >> +# define MAX_CHUNK 800
>> >> +
>> >> + pos_err = 0;
>> >> + while (1) {
>> >> + ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf,
>> >> pos_err, buf_size - 1);
>> >> +
>> >> + if (got <= 0)
>> >> + break;
>> >> +
>> >> + buf[got] = 0;
>> >> + pos_err += got;
>> >> +
>> >> + ptr = buf;
>> >> + while (got > 0) {
>> >> + size_t count;
>> >> + char tag[2];
>> >> +
>> >> + next = strnchr(ptr, got, '\n');
>> >> + if (next) {
>> >> + count = next - ptr;
>> >> + *next = 0;
>> >> + tag[0] = '>';
>> >> + tag[1] = '<';
>> >> + } else {
>> >> + count = got;
>> >> + tag[0] = '}';
>> >> + tag[1] = '{';
>> >> + }
>> >> +
>> >> + if (count > MAX_CHUNK) {
>> >> + size_t pos;
>> >> + char *ptr2 = ptr;
>> >> +
>> >> + for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) {
>> >> + char chr = ptr[pos];
>> >> +
>> >> + ptr[pos] = 0;
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n",
>> >> + l_count, line++, ptr2);
>> >> + ptr[pos] = chr;
>> >> + ptr2 = ptr + pos;
>> >> +
>> >> + /*
>> >> + * If spewing large amounts of data via a serial
>> >> console,
>> >> + * this can be a very slow process. So be friendly
>> >> and try
>> >> + * not to cause 'softlockup on CPU' problems.
>> >> + */
>> >> + cond_resched();
>> >> + }
>> >> +
>> >> + if (ptr2 < (ptr + count))
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
>> >> + l_count, line++, tag[0], ptr2, tag[1]);
>> >> + else if (tag[0] == '>')
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] ><\n",
>> >> + l_count, line++);
>> >> + } else {
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n",
>> >> + l_count, line++, tag[0], ptr, tag[1]);
>> >> + }
>> >> +
>> >> + ptr = next;
>> >> + got -= count;
>> >> + if (next) {
>> >> + ptr++;
>> >> + got--;
>> >> + }
>> >> +
>> >> + /* As above. */
>> >> + cond_resched();
>> >> + }
>> >> +
>> >> + if (got)
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes
>> >> remaining!\n",
>> >> + l_count, line++, got);
>> >> + }
>> >> +
>> >> + kvfree(buf);
>> >> +
>> >> + drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n",
>> >> l_count, line++, pos_err);
>> >> +}
>> >> +#endif
>> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
>> >> b/drivers/gpu/drm/i915/i915_gpu_error.h
>> >> index a91932cc65317..a78c061ce26fb 100644
>> >> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
>> >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
>> >> @@ -258,6 +258,16 @@ static inline u32 i915_reset_engine_count(struct
>> >> i915_gpu_error *error,
>> >> #define CORE_DUMP_FLAG_NONE 0x0
>> >> #define CORE_DUMP_FLAG_IS_GUC_CAPTURE BIT(0)
>> >> +#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR) &&
>> >> IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
>> >> +void intel_klog_error_capture(struct intel_gt *gt,
>> >> + intel_engine_mask_t engine_mask);
>> >> +#else
>> >> +static inline void intel_klog_error_capture(struct intel_gt *gt,
>> >> + intel_engine_mask_t engine_mask)
>> >> +{
>> >> +}
>> >> +#endif
>> >> +
>> >> #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
>> >> __printf(2, 3)
>>
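The line-splitting loop in the patch above can be hard to follow in flattened form, so here is a simplified user-space sketch of the same idea. It is not the kernel code: dump_chunked() is a hypothetical name, printf() stands in for drm_info(), memchr() stands in for the kernel's strnchr(), and MAX_CHUNK is shrunk from 800 to 8 so the behaviour is easy to see. As in the patch, '>' and '<' bracket text that ended at a newline, while '}' and '{' bracket a piece that was cut short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Tiny value for demonstration only; the patch uses 800. */
#define MAX_CHUNK 8

/*
 * Print 'len' bytes of 'buf' one text line at a time, splitting any line
 * longer than MAX_CHUNK into several pieces. Returns the number of output
 * lines emitted.
 */
static int dump_chunked(const char *buf, size_t len)
{
	int line = 0;

	while (len > 0) {
		const char *nl = memchr(buf, '\n', len);
		size_t count = nl ? (size_t)(nl - buf) : len;
		/* '>' '<' marks a newline-terminated line, '}' '{' a partial one */
		char open = nl ? '>' : '}';
		char close = nl ? '<' : '{';
		size_t pos = 0;

		/* Leading full-size pieces are always tagged as partial. */
		while (count - pos > MAX_CHUNK) {
			printf("[%d] }%.*s{\n", line++, MAX_CHUNK, buf + pos);
			pos += MAX_CHUNK;
		}

		/* The final piece carries the real tag for this line. */
		printf("[%d] %c%.*s%c\n", line++, open,
		       (int)(count - pos), buf + pos, close);

		/* Step past the text just printed and the newline, if any. */
		buf += count;
		len -= count;
		if (nl) {
			buf++;
			len--;
		}
	}

	return line;
}
```

For example, calling dump_chunked("0123456789\nxy", 13) emits three lines: a partial piece holding the first eight characters, the remaining "89" tagged as a completed line, and "xy" tagged as partial because no newline follows it. The tags let a reader of the log stitch the pieces back into the original capture text.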
>> References
>>
>> Visible links
>> 1. mailto:John.C.Harrison@intel.com
>> 2. mailto:John.C.Harrison@intel.com
>> 3. mailto:vinay.belgaumkar@intel.com
>> 4. mailto:John.C.Harrison@intel.com
end of thread, other threads:[~2023-05-16 22:06 UTC | newest]
Thread overview: 13+ messages
2023-04-18 18:17 [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging John.C.Harrison
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 1/2] drm/i915: Dump error capture to kernel log John.C.Harrison
2023-05-16 19:17 ` Belgaumkar, Vinay
2023-05-16 19:21 ` John Harrison
2023-05-16 20:52 ` Rodrigo Vivi
2023-05-16 22:06 ` John Harrison
2023-04-18 18:17 ` [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error John.C.Harrison
2023-05-16 19:17 ` Belgaumkar, Vinay
2023-04-18 18:58 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Add support for dumping error captures via kernel logging (rev2) Patchwork
2023-04-18 18:58 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-04-18 19:08 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-04-18 23:46 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
2023-05-16 18:54 ` [Intel-gfx] [PATCH v2 0/2] Add support for dumping error captures via kernel logging Belgaumkar, Vinay