Intel-XE Archive on lore.kernel.org
* [RFC PATCH 0/5] Only timeout jobs if they run longer than timeout period
@ 2024-06-07  6:52 Matthew Brost
From: Matthew Brost @ 2024-06-07  6:52 UTC
  To: intel-xe

Debugging [1] exposed a known flaw in the job timeout mechanism: jobs
time out based on how long ago they were submitted to the GuC, not on
how long they have actually been running on the hardware. This series
attempts to fix that.

The algorithm is as follows:
- Copy the ctx timestamp from the LRC to a saved location at the
  beginning of every job
- On TDR, kick jobs off the hardware via schedule disable so the ctx
  timestamp is updated
- Compare the ctx timestamp to the saved ctx timestamp; if the job has
  been running for less than the timeout period, re-enable scheduling
  and restart the TDR
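
The comparison step above can be sketched roughly as follows. This is
only an illustration, not code from the series: the struct, enum and
function names are hypothetical, and it assumes the ctx timestamp is a
free-running 32-bit counter and that the timeout has already been
converted to the same tick units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-job sample; these names are not from the xe driver. */
struct job_sample {
	uint32_t start_ts;   /* ctx timestamp saved when the job began */
	uint32_t current_ts; /* ctx timestamp read after schedule disable */
};

/* What the TDR should do once the job has been kicked off the hardware. */
enum tdr_action {
	TDR_RESUME_AND_REARM, /* re-enable scheduling and restart the TDR */
	TDR_TIMED_OUT,        /* job really ran longer than the timeout */
};

/*
 * Ticks the job has actually spent on the hardware. Unsigned 32-bit
 * subtraction gives the right answer across a single counter wrap
 * (assumption: the counter wraps at most once per job).
 */
static uint32_t job_running_ticks(const struct job_sample *s)
{
	return s->current_ts - s->start_ts;
}

/* Decide based on real run time rather than time since submission. */
static enum tdr_action tdr_check(const struct job_sample *s,
				 uint32_t timeout_ticks)
{
	if (job_running_ticks(s) < timeout_ticks)
		return TDR_RESUME_AND_REARM;

	return TDR_TIMED_OUT;
}
```

A job that was queued behind other work for a long time but has only
run briefly would take the TDR_RESUME_AND_REARM path here instead of
being treated as hung.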

The series still needs some work, documented with FIXMEs, hence the
RFC. Let's agree on whether this is the right direction before putting
in more work.

Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/799

Matthew Brost (5):
  drm/xe: Add LRC ctx timestamp support functions
  drm/xe: Add MI_COPY_MEM_MEM GPU instruction definitions
  drm/xe: Emit ctx timestamp copy in ring ops
  drm/xe: Add ctx timestamp to LRC snapshot
  drm/xe: Sample ctx timestamp to determine if jobs have timed out

 .../gpu/drm/xe/instructions/xe_mi_commands.h  |   4 +
 drivers/gpu/drm/xe/xe_guc_submit.c            | 140 +++++++++++++-----
 drivers/gpu/drm/xe/xe_lrc.c                   |  49 ++++++
 drivers/gpu/drm/xe/xe_lrc.h                   |   5 +
 drivers/gpu/drm/xe/xe_ring_ops.c              |  21 +++
 5 files changed, 186 insertions(+), 33 deletions(-)

-- 
2.34.1



Thread overview: 17+ messages
2024-06-07  6:52 [RFC PATCH 0/5] Only timeout jobs if they run longer than timeout period Matthew Brost
2024-06-07  6:52 ` [RFC PATCH 1/5] drm/xe: Add LRC ctx timestamp support functions Matthew Brost
2024-06-07  7:10   ` Jani Nikula
2024-06-07 15:16     ` Matthew Brost
2024-06-07  6:52 ` [RFC PATCH 2/5] drm/xe: Add MI_COPY_MEM_MEM GPU instruction definitions Matthew Brost
2024-06-07 11:04   ` Michal Wajdeczko
2024-06-07 15:22     ` Matthew Brost
2024-06-07  6:52 ` [RFC PATCH 3/5] drm/xe: Emit ctx timestamp copy in ring ops Matthew Brost
2024-06-07  6:52 ` [RFC PATCH 4/5] drm/xe: Add ctx timestamp to LRC snapshot Matthew Brost
2024-06-07  6:52 ` [RFC PATCH 5/5] drm/xe: Sample ctx timestamp to determine if jobs have timed out Matthew Brost
2024-06-07  6:56 ` ✓ CI.Patch_applied: success for Only timeout jobs if they run longer than timeout period Patchwork
2024-06-07  6:56 ` ✗ CI.checkpatch: warning " Patchwork
2024-06-07  6:57 ` ✓ CI.KUnit: success " Patchwork
2024-06-07  7:09 ` ✓ CI.Build: " Patchwork
2024-06-07  7:11 ` ✓ CI.Hooks: " Patchwork
2024-06-07  7:13 ` ✓ CI.checksparse: " Patchwork
2024-06-07 15:27 ` ✗ CI.FULL: failure " Patchwork
