AMD-GFX Archive on lore.kernel.org
* [PATCH v4 0/9] drm/amdgpu: prevent concurrent GPU access during reset
@ 2024-06-05  1:33 Yunxiang Li
From: Yunxiang Li @ 2024-06-05  1:33 UTC
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li

If another thread accesses the GPU while it is being reset, the reset
could fail. This is especially problematic on SRIOV, since the host may
reset the GPU even if the guest is not yet ready.

There is code in place that tries to prevent stray accesses, but over
time bugs have crept in and made it unreliable. This series aims to
fix these bugs.

v4: From testing, it seems that removing the flush from gart enable
    sometimes causes the gart to not be flushed at all. So dropped
      drm/amd/amdgpu: remove unnecessary flush when enable gart
    and replaced it with this patch instead
      drm/amdgpu: call flush_gpu_tlb directly in gfxhub enable

    Split
      drm/amdgpu: fix missing reset domain locks
    into multiple commits
      drm/amdgpu: add lock in amdgpu_gart_invalidate_tlb
      drm/amdgpu: add lock in kfd_process_dequeue_from_device

v3: dropped:
      drm/amdgpu: abort fence poll if reset is started
      Revert "drm/amdgpu: Queue KFD reset workitem in VF FED"
    updated:
      drm/amdgpu: fix sriov host flr handler
      drm/amdgpu: fix missing reset domain locks
     
Yunxiang Li (9):
  drm/amdgpu: add skip_hw_access checks for sriov
  drm/amdgpu: fix sriov host flr handler
  drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover
  drm/amdgpu: use helper in amdgpu_gart_unbind
  drm/amdgpu: call flush_gpu_tlb directly in gfxhub enable
  drm/amdgpu: fix locking scope when flushing tlb
  drm/amdgpu: add lock in amdgpu_gart_invalidate_tlb
  drm/amdgpu: add lock in kfd_process_dequeue_from_device

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 11 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       | 70 ++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c   |  2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      | 23 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         | 39 ++++-----
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         | 39 ++++-----
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  6 --
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
 .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    | 13 ++-
 18 files changed, 154 insertions(+), 157 deletions(-)

-- 
2.34.1


Thread overview: 12+ messages
2024-06-05  1:33 [PATCH v4 0/9] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 1/9] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 2/9] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 3/9] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 4/9] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 5/9] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 6/9] drm/amdgpu: call flush_gpu_tlb directly in gfxhub enable Yunxiang Li
2024-06-05  8:00   ` Christian König
2024-06-05  1:33 ` [PATCH v4 7/9] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 8/9] drm/amdgpu: add lock in amdgpu_gart_invalidate_tlb Yunxiang Li
2024-06-05  1:33 ` [PATCH v4 9/9] drm/amdgpu: add lock in kfd_process_dequeue_from_device Yunxiang Li
2024-06-06 19:01   ` Felix Kuehling
