Intel-XE Archive on lore.kernel.org
* [PATCH v4 0/7] Only timeout jobs if they run longer than timeout period
@ 2024-06-10 13:50 Matthew Brost
From: Matthew Brost @ 2024-06-10 13:50 UTC
  To: intel-xe

Debugging [1] hit a known flaw in the job timeout mechanism: jobs time
out based on how long ago they were submitted to the GuC, not on how
long they have actually been running on the hardware. This series
attempts to fix that.

Algorithm is as follows:
- Copy ctx timestamp from LRC to saved location at beginning of every
  job
- On TDR kick jobs off hardware via schedule disable so ctx timestamp is
  updated
- Compare ctx timestamp to saved ctx timestamp; if the job has been
  running for less than the timeout period, re-enable scheduling and
  restart the TDR
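
As a rough illustration of the comparison step only, here is a minimal
standalone sketch. It is not the code in this series: all function names
are hypothetical, and it assumes the ctx timestamp is a free-running
32-bit tick counter and that the GT clock frequency is known in Hz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers for illustration; none of these names come from
 * the xe driver.
 */

/* Convert a tick count to whole milliseconds (rounds down). */
static uint64_t ticks_to_ms(uint64_t ticks, uint32_t clock_hz)
{
	assert(clock_hz != 0);
	return ticks * 1000u / clock_hz;
}

/*
 * Ticks elapsed between the timestamp saved at job start and the
 * current ctx timestamp. Unsigned 32-bit subtraction gives the right
 * answer across a single counter wrap (assumed counter width).
 */
static uint32_t ctx_ticks_elapsed(uint32_t job_start_ts, uint32_t ctx_ts)
{
	return ctx_ts - job_start_ts;
}

/*
 * True if the job has actually run on hardware for at least timeout_ms.
 * If this returns false, the caller would re-enable scheduling and
 * restart the TDR rather than cancel the job.
 */
static bool job_timed_out(uint32_t job_start_ts, uint32_t ctx_ts,
			  uint32_t clock_hz, uint32_t timeout_ms)
{
	uint64_t ran_ms = ticks_to_ms(ctx_ticks_elapsed(job_start_ts, ctx_ts),
				      clock_hz);

	return ran_ms >= timeout_ms;
}
```

The ctx timestamp only reflects the job's runtime once the context has
been switched out, which is why the TDR has to disable scheduling before
it samples the value.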

New job cancel IGT [2] for testing.

v2:
- Promote to non-RFC as issues which I view as blockers have been resolved
- Address Jani and Michal v1 feedback
- Add GT clock timer calculation
v3:
- More testing
- Fix TDR state machine bugs exposed in testing
- Rebase for CI
v4:
- Address a few comments by John H
- Fix CI failure [3]

Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/799
[2] https://patchwork.freedesktop.org/series/134640/
[3] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-134642v1/shard-dg2-433/igt@xe_exec_threads@threads-hang-fd-rebind.html

Matthew Brost (7):
  drm/xe: Add ctx timestamp to LRC snapshot
  drm/xe: Add xe_gt_clock_interval_to_ms helper
  drm/xe: Improve unexpected state error messages
  drm/xe: Add GuC state asserts to deregister_exec_queue
  drm/xe: Add pending disable assert to handle_sched_done
  drm/xe: Add killed, banned, or wedged as stick bit during GuC reset
  drm/xe: Sample ctx timestamp to determine if jobs have timed out

 drivers/gpu/drm/xe/xe_gt_clock.c   |  18 ++
 drivers/gpu/drm/xe/xe_gt_clock.h   |   1 +
 drivers/gpu/drm/xe/xe_guc_submit.c | 316 +++++++++++++++++++++++------
 drivers/gpu/drm/xe/xe_lrc.c        |   6 +
 4 files changed, 277 insertions(+), 64 deletions(-)

-- 
2.34.1



Thread overview: 11+ messages
2024-06-10 13:50 [PATCH v4 0/7] Only timeout jobs if they run longer than timeout period Matthew Brost
2024-06-10 13:50 ` [PATCH v4 1/7] drm/xe: Add ctx timestamp to LRC snapshot Matthew Brost
2024-06-10 13:50 ` [PATCH v4 2/7] drm/xe: Add xe_gt_clock_interval_to_ms helper Matthew Brost
2024-06-10 13:50 ` [PATCH v4 3/7] drm/xe: Improve unexpected state error messages Matthew Brost
2024-06-10 13:50 ` [PATCH v4 4/7] drm/xe: Add GuC state asserts to deregister_exec_queue Matthew Brost
2024-06-10 13:50 ` [PATCH v4 5/7] drm/xe: Add pending disable assert to handle_sched_done Matthew Brost
2024-06-10 13:50 ` [PATCH v4 6/7] drm/xe: Add killed, banned, or wedged as stick bit during GuC reset Matthew Brost
2024-06-10 13:50 ` [PATCH v4 7/7] drm/xe: Sample ctx timestamp to determine if jobs have timed out Matthew Brost
2024-06-10 13:54 ` ✓ CI.Patch_applied: success for Only timeout jobs if they run longer than timeout period Patchwork
2024-06-10 13:54 ` ✗ CI.checkpatch: warning " Patchwork
2024-06-10 13:55 ` ✗ CI.KUnit: failure " Patchwork
