From: Matthew Brost <matthew.brost@intel.com>
To: intel-xe@lists.freedesktop.org
Subject: [PATCH v5 00/10] Only timeout jobs if they run longer than timeout period
Date: Mon, 10 Jun 2024 07:18:13 -0700
Message-ID: <20240610141823.2605496-1-matthew.brost@intel.com>
Debugging [1] hit a known flaw in the job timeout mechanism: jobs time
out based on how long ago they were submitted to the GuC, not on how
long they have actually been running on the hardware. This series
attempts to fix this.
The algorithm is as follows:
- Copy the ctx timestamp from the LRC to a saved location at the
beginning of every job
- On TDR, kick jobs off the hardware via schedule disable so the ctx
timestamp is updated
- Compare the ctx timestamp to the saved ctx timestamp; if the job has
been running for less than the timeout period, re-enable scheduling and
restart the TDR
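The comparison in the last step can be sketched in plain C. This is only
an illustration, not the code from the series: the function names, the
32-bit width of the timestamp, and the clock frequency passed in are all
assumptions made for the example. It also assumes the ctx timestamp
advances only while the context is executing on the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the driver's xe_gt_clock_interval_to_ms):
 * convert a tick count to milliseconds for a clock running at clock_hz.
 * The result is rounded down.
 */
static uint64_t ticks_to_ms(uint64_t ticks, uint32_t clock_hz)
{
	assert(clock_hz != 0);
	return ticks * 1000u / clock_hz;
}

/*
 * Hypothetical check: has the job actually run for at least timeout_ms?
 * ctx_ts_at_start is the timestamp saved at the beginning of the job,
 * ctx_ts_now is the value read after the job was kicked off the hardware.
 * The timestamp is assumed to be a 32-bit counter, so the unsigned
 * subtraction stays correct across a single wraparound.
 */
static bool job_timed_out(uint32_t ctx_ts_at_start, uint32_t ctx_ts_now,
			  uint32_t clock_hz, uint32_t timeout_ms)
{
	uint32_t delta = ctx_ts_now - ctx_ts_at_start;

	return ticks_to_ms(delta, clock_hz) >= timeout_ms;
}
```

If job_timed_out() returns false, the job has not yet used up its
timeout on the hardware, which corresponds to the "re-enable scheduling
and restart the TDR" branch above.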
A new job-cancel IGT [2] is provided for testing.
v2:
- Promote to non-RFC as issues which I view as blockers have been resolved
- Address Jani and Michal v1 feedback
- Add GT clock timer calculation
v3:
- More testing
- Fix TDR state machine bugs exposed in testing
- Rebase for CI
v4:
- Address a few comments by John H
- Fix CI failure [3]
v5:
- Include all the patches
Matt
[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/799
[2] https://patchwork.freedesktop.org/series/134640/
[3] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-134642v1/shard-dg2-433/igt@xe_exec_threads@threads-hang-fd-rebind.html
Matthew Brost (10):
drm/xe: Add LRC ctx timestamp support functions
drm/xe: Add MI_COPY_MEM_MEM GPU instruction definitions
drm/xe: Emit ctx timestamp copy in ring ops
drm/xe: Add ctx timestamp to LRC snapshot
drm/xe: Add xe_gt_clock_interval_to_ms helper
drm/xe: Improve unexpected state error messages
drm/xe: Add GuC state asserts to deregister_exec_queue
drm/xe: Add pending disable assert to handle_sched_done
drm/xe: Add killed, banned, or wedged as stick bit during GuC reset
drm/xe: Sample ctx timestamp to determine if jobs have timed out
.../gpu/drm/xe/instructions/xe_mi_commands.h | 4 +
drivers/gpu/drm/xe/xe_gt_clock.c | 18 +
drivers/gpu/drm/xe/xe_gt_clock.h | 1 +
drivers/gpu/drm/xe/xe_guc_submit.c | 316 ++++++++++++++----
drivers/gpu/drm/xe/xe_lrc.c | 72 ++++
drivers/gpu/drm/xe/xe_lrc.h | 5 +
drivers/gpu/drm/xe/xe_ring_ops.c | 21 ++
7 files changed, 373 insertions(+), 64 deletions(-)
--
2.34.1
Thread overview: 37+ messages
2024-06-10 14:18 Matthew Brost [this message]
2024-06-10 14:18 ` [PATCH v5 01/10] drm/xe: Add LRC ctx timestamp support functions Matthew Brost
2024-06-10 17:32 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 02/10] drm/xe: Add MI_COPY_MEM_MEM GPU instruction definitions Matthew Brost
2024-06-10 16:39 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 03/10] drm/xe: Emit ctx timestamp copy in ring ops Matthew Brost
2024-06-10 14:44 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 04/10] drm/xe: Add ctx timestamp to LRC snapshot Matthew Brost
2024-06-10 16:40 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 05/10] drm/xe: Add xe_gt_clock_interval_to_ms helper Matthew Brost
2024-06-10 16:34 ` Cavitt, Jonathan
2024-06-12 12:09 ` Ghimiray, Himal Prasad
2024-06-10 14:18 ` [PATCH v5 06/10] drm/xe: Improve unexpected state error messages Matthew Brost
2024-06-10 16:36 ` Cavitt, Jonathan
2024-06-10 16:45 ` Michal Wajdeczko
2024-06-10 17:09 ` Matthew Brost
2024-06-11 0:09 ` John Harrison
2024-06-11 1:43 ` Matthew Brost
2024-06-10 14:18 ` [PATCH v5 07/10] drm/xe: Add GuC state asserts to deregister_exec_queue Matthew Brost
2024-06-10 17:12 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 08/10] drm/xe: Add pending disable assert to handle_sched_done Matthew Brost
2024-06-10 16:35 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 09/10] drm/xe: Add killed, banned, or wedged as stick bit during GuC reset Matthew Brost
2024-06-10 16:35 ` Cavitt, Jonathan
2024-06-10 14:18 ` [PATCH v5 10/10] drm/xe: Sample ctx timestamp to determine if jobs have timed out Matthew Brost
2024-06-10 19:32 ` Cavitt, Jonathan
2024-06-10 20:12 ` Matthew Brost
2024-06-11 0:36 ` John Harrison
2024-06-11 1:35 ` Matthew Brost
2024-06-12 4:21 ` Zbigniew Kempczyński
2024-06-10 14:23 ` ✓ CI.Patch_applied: success for Only timeout jobs if they run longer than timeout period Patchwork
2024-06-10 14:23 ` ✗ CI.checkpatch: warning " Patchwork
2024-06-10 14:24 ` ✓ CI.KUnit: success " Patchwork
2024-06-10 14:36 ` ✓ CI.Build: " Patchwork
2024-06-10 14:38 ` ✗ CI.Hooks: failure " Patchwork
2024-06-10 14:39 ` ✓ CI.checksparse: success " Patchwork
2024-06-10 15:16 ` ✓ CI.BAT: " Patchwork