* [PATCH V7 00/29] Reset improvements
@ 2025-06-06 6:43 Alex Deucher
2025-06-06 6:43 ` [PATCH 01/29] drm/amdgpu: enable legacy enforce isolation by default Alex Deucher
` (28 more replies)
0 siblings, 29 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
This set improves per-queue reset support for a number of IPs.
When we reset a queue, its contents are lost, so we need to
re-emit the unprocessed state from subsequent submissions.
To make sure we actually restore that unprocessed state, we
need to enable legacy enforce isolation so that we can safely
re-emit it. If we don't, multiple jobs can run in parallel
and we may not end up resetting the correct one. This is
similar to how Windows handles queues. It also gives us
correct guilty tracking for GC.
Tested on GC 10 and 11 chips with a game running and
then running hang tests. The game pauses when the
hang happens, then continues after the queue reset.
I tried the same approach on GC 8 and 9, but it
was not as reliable as soft recovery. As such, I've dropped
the KGQ reset code for pre-GC10.
The same approach is extended to SDMA and VCN.
They don't need enforce isolation because those engines
are single-threaded, so they always operate serially.
Rework re-emit to signal the seq number of the bad job and
check that to verify that the reset worked, then re-emit the
rest of the non-guilty state. This way we are not waiting on
the rest of the state to complete, and if the subsequent state
also contains a bad job, we'll end up in queue reset again rather
than adapter reset.
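At a high level, the timeout handler ends up following roughly the
flow sketched below (a simplified composite of patches 7 and 8 in
this series, with names taken from amdgpu_job_timedout(); the per-IP
reset callbacks additionally check the bad job's seq number, force
completion, and re-emit the remaining non-guilty state):

    /* simplified sketch, not the literal code */
    drm_sched_wqueue_stop(&ring->sched);    /* keep workers off the ring */
    if (is_guilty)
            dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
    r = amdgpu_ring_reset(ring, job);       /* per-IP queue reset + re-emit */
    if (!r) {
            if (is_guilty)
                    atomic_inc(&ring->adev->gpu_reset_counter);
            drm_sched_wqueue_start(&ring->sched);
    } else {
            /* queue reset failed, fall back to a full adapter reset */
    }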
Git tree:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads
v4: Drop explicit padding patches
Drop new timeout macro
Rework re-emit sequence
v5: Add a helper for re-emit
Convert VCN, JPEG, SDMA to use new helpers
v6: Update SDMA 4.4.2 to use new helpers
Move ptr tracking to amdgpu_fence
Skip all jobs from the bad context on the ring
v7: Rework the backup logic
Move and clean up the guilty logic for engine resets
Integrate suggestions from Christian
Add JPEG 4.0.5 support
Alex Deucher (28):
drm/amdgpu: enable legacy enforce isolation by default
drm/amdgpu/gfx7: drop reset_kgq
drm/amdgpu/gfx8: drop reset_kgq
drm/amdgpu/gfx9: drop reset_kgq
drm/amdgpu: switch job hw_fence to amdgpu_fence
drm/amdgpu: update ring reset function signature
drm/amdgpu: move force completion into ring resets
drm/amdgpu: move guilty handling into ring resets
drm/amdgpu: track ring state associated with a job
drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
drm/amdgpu/jpeg4.0.5: add queue reset
drm/amdgpu/jpeg5: add queue reset
drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
Christian König (1):
drm/amdgpu: rework queue reset scheduler interaction
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 120 ++++++++++++++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 59 ++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 27 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 35 +++++-
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 66 ++++++-----
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 61 ++++++----
drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 61 ++++++----
drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 71 ------------
drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 71 ------------
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 67 +++--------
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 23 +++-
drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 21 +++-
drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 21 +++-
drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 21 +++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 21 +++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 21 +++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c | 25 ++++
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c | 28 +++++
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 21 +++-
drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 61 +++++-----
drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 33 +++++-
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 35 +++++-
drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 22 +++-
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 22 +++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 19 +++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 20 +++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 20 +++-
drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 20 +++-
32 files changed, 710 insertions(+), 400 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 01/29] drm/amdgpu: enable legacy enforce isolation by default
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 02/29] drm/amdgpu/gfx7: drop reset_kgq Alex Deucher
` (27 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Enable legacy enforce isolation (just serialize kernel
GC submissions). This way we can reset a ring and
only affect the process currently using that ring.
This mirrors what Windows does.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e64969d576a6f..ea565651f7459 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2148,9 +2148,7 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
for (i = 0; i < MAX_XCP; i++) {
switch (amdgpu_enforce_isolation) {
- case -1:
case 0:
- default:
/* disable */
adev->enforce_isolation[i] = AMDGPU_ENFORCE_ISOLATION_DISABLE;
break;
@@ -2159,7 +2157,9 @@ static int amdgpu_device_check_arguments(struct amdgpu_device *adev)
adev->enforce_isolation[i] =
AMDGPU_ENFORCE_ISOLATION_ENABLE;
break;
+ case -1:
case 2:
+ default:
/* enable legacy mode */
adev->enforce_isolation[i] =
AMDGPU_ENFORCE_ISOLATION_ENABLE_LEGACY;
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 02/29] drm/amdgpu/gfx7: drop reset_kgq
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
2025-06-06 6:43 ` [PATCH 01/29] drm/amdgpu: enable legacy enforce isolation by default Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 11:33 ` Christian König
2025-06-06 6:43 ` [PATCH 03/29] drm/amdgpu/gfx8: " Alex Deucher
` (26 subsequent siblings)
28 siblings, 1 reply; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
It doesn't work reliably, and we have soft recovery and
full adapter reset, so drop this.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 71 ---------------------------
1 file changed, 71 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
index da0534ff1271a..2aa323dab34e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
@@ -4884,76 +4884,6 @@ static void gfx_v7_0_emit_mem_sync_compute(struct amdgpu_ring *ring)
amdgpu_ring_write(ring, 0x0000000A); /* poll interval */
}
-static void gfx_v7_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
- int mem_space, int opt, uint32_t addr0,
- uint32_t addr1, uint32_t ref, uint32_t mask,
- uint32_t inv)
-{
- amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
- amdgpu_ring_write(ring,
- /* memory (1) or register (0) */
- (WAIT_REG_MEM_MEM_SPACE(mem_space) |
- WAIT_REG_MEM_OPERATION(opt) | /* wait */
- WAIT_REG_MEM_FUNCTION(3) | /* equal */
- WAIT_REG_MEM_ENGINE(eng_sel)));
-
- if (mem_space)
- BUG_ON(addr0 & 0x3); /* Dword align */
- amdgpu_ring_write(ring, addr0);
- amdgpu_ring_write(ring, addr1);
- amdgpu_ring_write(ring, ref);
- amdgpu_ring_write(ring, mask);
- amdgpu_ring_write(ring, inv); /* poll interval */
-}
-
-static void gfx_v7_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
- uint32_t val, uint32_t mask)
-{
- gfx_v7_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
-}
-
-static int gfx_v7_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
-{
- struct amdgpu_device *adev = ring->adev;
- struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
- struct amdgpu_ring *kiq_ring = &kiq->ring;
- unsigned long flags;
- u32 tmp;
- int r;
-
- if (amdgpu_sriov_vf(adev))
- return -EINVAL;
-
- if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
- return -EINVAL;
-
- spin_lock_irqsave(&kiq->ring_lock, flags);
-
- if (amdgpu_ring_alloc(kiq_ring, 5)) {
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
- return -ENOMEM;
- }
-
- tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
- gfx_v7_0_ring_emit_wreg(kiq_ring, mmCP_VMID_RESET, tmp);
- amdgpu_ring_commit(kiq_ring);
-
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
- r = amdgpu_ring_test_ring(kiq_ring);
- if (r)
- return r;
-
- if (amdgpu_ring_alloc(ring, 7 + 12 + 5))
- return -ENOMEM;
- gfx_v7_0_ring_emit_fence_gfx(ring, ring->fence_drv.gpu_addr,
- ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
- gfx_v7_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
- gfx_v7_0_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
-
- return amdgpu_ring_test_ring(ring);
-}
-
static const struct amd_ip_funcs gfx_v7_0_ip_funcs = {
.name = "gfx_v7_0",
.early_init = gfx_v7_0_early_init,
@@ -5003,7 +4933,6 @@ static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_gfx = {
.emit_wreg = gfx_v7_0_ring_emit_wreg,
.soft_recovery = gfx_v7_0_ring_soft_recovery,
.emit_mem_sync = gfx_v7_0_emit_mem_sync,
- .reset = gfx_v7_0_reset_kgq,
};
static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_compute = {
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 03/29] drm/amdgpu/gfx8: drop reset_kgq
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
2025-06-06 6:43 ` [PATCH 01/29] drm/amdgpu: enable legacy enforce isolation by default Alex Deucher
2025-06-06 6:43 ` [PATCH 02/29] drm/amdgpu/gfx7: drop reset_kgq Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 04/29] drm/amdgpu/gfx9: " Alex Deucher
` (25 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
It doesn't work reliably, and we have soft recovery and
full adapter reset, so drop this.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 71 ---------------------------
1 file changed, 71 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index 5ee2237d8ee8f..68c401ecb3eca 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -6339,34 +6339,6 @@ static void gfx_v8_0_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg,
amdgpu_ring_write(ring, val);
}
-static void gfx_v8_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
- int mem_space, int opt, uint32_t addr0,
- uint32_t addr1, uint32_t ref, uint32_t mask,
- uint32_t inv)
-{
- amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
- amdgpu_ring_write(ring,
- /* memory (1) or register (0) */
- (WAIT_REG_MEM_MEM_SPACE(mem_space) |
- WAIT_REG_MEM_OPERATION(opt) | /* wait */
- WAIT_REG_MEM_FUNCTION(3) | /* equal */
- WAIT_REG_MEM_ENGINE(eng_sel)));
-
- if (mem_space)
- BUG_ON(addr0 & 0x3); /* Dword align */
- amdgpu_ring_write(ring, addr0);
- amdgpu_ring_write(ring, addr1);
- amdgpu_ring_write(ring, ref);
- amdgpu_ring_write(ring, mask);
- amdgpu_ring_write(ring, inv); /* poll interval */
-}
-
-static void gfx_v8_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
- uint32_t val, uint32_t mask)
-{
- gfx_v8_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
-}
-
static void gfx_v8_0_ring_soft_recovery(struct amdgpu_ring *ring, unsigned vmid)
{
struct amdgpu_device *adev = ring->adev;
@@ -6843,48 +6815,6 @@ static void gfx_v8_0_emit_wave_limit(struct amdgpu_ring *ring, bool enable)
}
-static int gfx_v8_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
-{
- struct amdgpu_device *adev = ring->adev;
- struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
- struct amdgpu_ring *kiq_ring = &kiq->ring;
- unsigned long flags;
- u32 tmp;
- int r;
-
- if (amdgpu_sriov_vf(adev))
- return -EINVAL;
-
- if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
- return -EINVAL;
-
- spin_lock_irqsave(&kiq->ring_lock, flags);
-
- if (amdgpu_ring_alloc(kiq_ring, 5)) {
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
- return -ENOMEM;
- }
-
- tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
- gfx_v8_0_ring_emit_wreg(kiq_ring, mmCP_VMID_RESET, tmp);
- amdgpu_ring_commit(kiq_ring);
-
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
- r = amdgpu_ring_test_ring(kiq_ring);
- if (r)
- return r;
-
- if (amdgpu_ring_alloc(ring, 7 + 12 + 5))
- return -ENOMEM;
- gfx_v8_0_ring_emit_fence_gfx(ring, ring->fence_drv.gpu_addr,
- ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
- gfx_v8_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
- gfx_v8_0_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
-
- return amdgpu_ring_test_ring(ring);
-}
-
static const struct amd_ip_funcs gfx_v8_0_ip_funcs = {
.name = "gfx_v8_0",
.early_init = gfx_v8_0_early_init,
@@ -6950,7 +6880,6 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_gfx = {
.emit_wreg = gfx_v8_0_ring_emit_wreg,
.soft_recovery = gfx_v8_0_ring_soft_recovery,
.emit_mem_sync = gfx_v8_0_emit_mem_sync,
- .reset = gfx_v8_0_reset_kgq,
};
static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_compute = {
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 04/29] drm/amdgpu/gfx9: drop reset_kgq
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (2 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 03/29] drm/amdgpu/gfx8: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence Alex Deucher
` (24 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
It doesn't work reliably, and we have soft recovery and
full adapter reset, so drop this.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 46 ---------------------------
1 file changed, 46 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index d377a7c57d5e1..d50e125fd3e0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7152,51 +7152,6 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
amdgpu_ring_insert_nop(ring, num_nop - 1);
}
-static int gfx_v9_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
-{
- struct amdgpu_device *adev = ring->adev;
- struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
- struct amdgpu_ring *kiq_ring = &kiq->ring;
- unsigned long flags;
- u32 tmp;
- int r;
-
- if (amdgpu_sriov_vf(adev))
- return -EINVAL;
-
- if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
- return -EINVAL;
-
- spin_lock_irqsave(&kiq->ring_lock, flags);
-
- if (amdgpu_ring_alloc(kiq_ring, 5)) {
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
- return -ENOMEM;
- }
-
- tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
- gfx_v9_0_ring_emit_wreg(kiq_ring,
- SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), tmp);
- amdgpu_ring_commit(kiq_ring);
-
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
- r = amdgpu_ring_test_ring(kiq_ring);
- if (r)
- return r;
-
- if (amdgpu_ring_alloc(ring, 7 + 7 + 5))
- return -ENOMEM;
- gfx_v9_0_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
- ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
- gfx_v9_0_ring_emit_reg_wait(ring,
- SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffff);
- gfx_v9_0_ring_emit_wreg(ring,
- SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0);
-
- return amdgpu_ring_test_ring(ring);
-}
-
static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
unsigned int vmid)
{
@@ -7477,7 +7432,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
.soft_recovery = gfx_v9_0_ring_soft_recovery,
.emit_mem_sync = gfx_v9_0_emit_mem_sync,
- .reset = gfx_v9_0_reset_kgq,
.emit_cleaner_shader = gfx_v9_0_ring_emit_cleaner_shader,
.begin_use = amdgpu_gfx_enforce_isolation_ring_begin_use,
.end_use = amdgpu_gfx_enforce_isolation_ring_end_use,
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (3 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 04/29] drm/amdgpu/gfx9: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 11:39 ` Christian König
2025-06-06 6:43 ` [PATCH 06/29] drm/amdgpu: update ring reset function signature Alex Deucher
` (23 subsequent siblings)
28 siblings, 1 reply; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Use the amdgpu fence container so we can store additional
data in the fence.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 30 +++++----------------
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 ++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 16 +++++++++++
6 files changed, 32 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 8e626f50b362e..f81608330a3d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1902,7 +1902,7 @@ static void amdgpu_ib_preempt_mark_partial_job(struct amdgpu_ring *ring)
continue;
}
job = to_amdgpu_job(s_job);
- if (preempted && (&job->hw_fence) == fence)
+ if (preempted && (&job->hw_fence.base) == fence)
/* mark the job as preempted */
job->preemption_status |= AMDGPU_IB_PREEMPTED;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ea565651f7459..8298e95e4543e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6375,7 +6375,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
*
* job->base holds a reference to parent fence
*/
- if (job && dma_fence_is_signaled(&job->hw_fence)) {
+ if (job && dma_fence_is_signaled(&job->hw_fence.base)) {
job_signaled = true;
dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
goto skip_hw_reset;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf6..569e0e5373927 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -41,22 +41,6 @@
#include "amdgpu_trace.h"
#include "amdgpu_reset.h"
-/*
- * Fences mark an event in the GPUs pipeline and are used
- * for GPU/CPU synchronization. When the fence is written,
- * it is expected that all buffers associated with that fence
- * are no longer in use by the associated ring on the GPU and
- * that the relevant GPU caches have been flushed.
- */
-
-struct amdgpu_fence {
- struct dma_fence base;
-
- /* RB, DMA, etc. */
- struct amdgpu_ring *ring;
- ktime_t start_timestamp;
-};
-
static struct kmem_cache *amdgpu_fence_slab;
int amdgpu_fence_slab_init(void)
@@ -151,12 +135,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
am_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
if (am_fence == NULL)
return -ENOMEM;
- fence = &am_fence->base;
- am_fence->ring = ring;
} else {
/* take use of job-embedded fence */
- fence = &job->hw_fence;
+ am_fence = &job->hw_fence;
}
+ fence = &am_fence->base;
+ am_fence->ring = ring;
seq = ++ring->fence_drv.sync_seq;
if (job && job->job_run_counter) {
@@ -718,7 +702,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
* it right here or we won't be able to track them in fence_drv
* and they will remain unsignaled during sa_bo free.
*/
- job = container_of(old, struct amdgpu_job, hw_fence);
+ job = container_of(old, struct amdgpu_job, hw_fence.base);
if (!job->base.s_fence && !dma_fence_is_signaled(old))
dma_fence_signal(old);
RCU_INIT_POINTER(*ptr, NULL);
@@ -780,7 +764,7 @@ static const char *amdgpu_fence_get_timeline_name(struct dma_fence *f)
static const char *amdgpu_job_fence_get_timeline_name(struct dma_fence *f)
{
- struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
+ struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
return (const char *)to_amdgpu_ring(job->base.sched)->name;
}
@@ -810,7 +794,7 @@ static bool amdgpu_fence_enable_signaling(struct dma_fence *f)
*/
static bool amdgpu_job_fence_enable_signaling(struct dma_fence *f)
{
- struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
+ struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
if (!timer_pending(&to_amdgpu_ring(job->base.sched)->fence_drv.fallback_timer))
amdgpu_fence_schedule_fallback(to_amdgpu_ring(job->base.sched));
@@ -845,7 +829,7 @@ static void amdgpu_job_fence_free(struct rcu_head *rcu)
struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
/* free job if fence has a parent job */
- kfree(container_of(f, struct amdgpu_job, hw_fence));
+ kfree(container_of(f, struct amdgpu_job, hw_fence.base));
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index acb21fc8b3ce5..ddb9d3269357c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -272,8 +272,8 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
/* Check if any fences where initialized */
if (job->base.s_fence && job->base.s_fence->finished.ops)
f = &job->base.s_fence->finished;
- else if (job->hw_fence.ops)
- f = &job->hw_fence;
+ else if (job->hw_fence.base.ops)
+ f = &job->hw_fence.base;
else
f = NULL;
@@ -290,10 +290,10 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
amdgpu_sync_free(&job->explicit_sync);
/* only put the hw fence if has embedded fence */
- if (!job->hw_fence.ops)
+ if (!job->hw_fence.base.ops)
kfree(job);
else
- dma_fence_put(&job->hw_fence);
+ dma_fence_put(&job->hw_fence.base);
}
void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
@@ -322,10 +322,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
if (job->gang_submit != &job->base.s_fence->scheduled)
dma_fence_put(job->gang_submit);
- if (!job->hw_fence.ops)
+ if (!job->hw_fence.base.ops)
kfree(job);
else
- dma_fence_put(&job->hw_fence);
+ dma_fence_put(&job->hw_fence.base);
}
struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index f2c049129661f..931fed8892cc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -48,7 +48,7 @@ struct amdgpu_job {
struct drm_sched_job base;
struct amdgpu_vm *vm;
struct amdgpu_sync explicit_sync;
- struct dma_fence hw_fence;
+ struct amdgpu_fence hw_fence;
struct dma_fence *gang_submit;
uint32_t preamble_status;
uint32_t preemption_status;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index b95b471107692..e1f25218943a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -127,6 +127,22 @@ struct amdgpu_fence_driver {
struct dma_fence **fences;
};
+/*
+ * Fences mark an event in the GPUs pipeline and are used
+ * for GPU/CPU synchronization. When the fence is written,
+ * it is expected that all buffers associated with that fence
+ * are no longer in use by the associated ring on the GPU and
+ * that the relevant GPU caches have been flushed.
+ */
+
+struct amdgpu_fence {
+ struct dma_fence base;
+
+ /* RB, DMA, etc. */
+ struct amdgpu_ring *ring;
+ ktime_t start_timestamp;
+};
+
extern const struct drm_sched_backend_ops amdgpu_sched_ops;
void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (4 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 11:41 ` Christian König
2025-06-06 6:43 ` [PATCH 07/29] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
` (22 subsequent siblings)
28 siblings, 1 reply; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Going forward, we'll need more than just the vmid. Everything
we need is currently in the amdgpu job structure, so just
pass that in.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
22 files changed, 53 insertions(+), 33 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index ddb9d3269357c..80d4dfebde24f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
if (is_guilty)
dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
- r = amdgpu_ring_reset(ring, job->vmid);
+ r = amdgpu_ring_reset(ring, job);
if (!r) {
if (amdgpu_ring_sched_ready(ring))
drm_sched_stop(&ring->sched, s_job);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index e1f25218943a4..ab5402d7ce9c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
- int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
+ int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
bool (*is_guilty)(struct amdgpu_ring *ring);
};
@@ -425,7 +425,7 @@ struct amdgpu_ring {
#define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
#define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
#define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
-#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
+#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 75ea071744eb5..c58e7040c732a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
amdgpu_ring_insert_nop(ring, num_nop - 1);
}
-static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
@@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
- tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+ tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
if (ring->pipe == 0)
tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
else
@@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
}
static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
- unsigned int vmid)
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index afd6d59164bfa..0ee7bdd509741 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
return 0;
}
-static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
int r;
@@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
if (amdgpu_sriov_vf(adev))
return -EINVAL;
- r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
+ r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
if (r) {
dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
@@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
return 0;
}
-static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
int r = 0;
@@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
if (amdgpu_sriov_vf(adev))
return -EINVAL;
- r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
+ r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
if (r) {
dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
r = gfx_v11_0_reset_compute_pipe(ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 1234c8d64e20d..a26417d53411b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
return 0;
}
-static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
int r;
@@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
if (amdgpu_sriov_vf(adev))
return -EINVAL;
- r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
+ r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
if (r) {
dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
r = gfx_v12_reset_gfx_pipe(ring);
@@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
return 0;
}
-static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
int r;
@@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
if (amdgpu_sriov_vf(adev))
return -EINVAL;
- r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
+ r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
if (r) {
dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
r = gfx_v12_0_reset_compute_pipe(ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index d50e125fd3e0d..5e650cc5fcb26 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
}
static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
- unsigned int vmid)
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index c233edf605694..a7dadff3dca31 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
}
static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
- unsigned int vmid)
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
index 4cde8a8bcc837..6cd3fbe00d6b9 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
@@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
return 0;
}
-static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
jpeg_v2_0_stop(ring->adev);
jpeg_v2_0_start(ring->adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
index 8b39e114f3be1..8ed41868f6c32 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
@@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
return 0;
}
-static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
jpeg_v2_5_stop_inst(ring->adev, ring->me);
jpeg_v2_5_start_inst(ring->adev, ring->me);
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
index 2f8510c2986b9..3512fbb543301 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
@@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
return 0;
}
-static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
jpeg_v3_0_stop(ring->adev);
jpeg_v3_0_start(ring->adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
index f17ec5414fd69..c8efeaf0a2a69 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
@@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
return 0;
}
-static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
if (amdgpu_sriov_vf(ring->adev))
return -EINVAL;
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
index 79e342d5ab28d..8b07c3651c579 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
@@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
}
-static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
if (amdgpu_sriov_vf(ring->adev))
return -EOPNOTSUPP;
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
index 3b6f65a256464..0a21a13e19360 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
@@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
}
-static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
if (amdgpu_sriov_vf(ring->adev))
return -EOPNOTSUPP;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 9c169112a5e7b..ffd67d51b335f 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
}
-static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
+static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
u32 id = GET_INST(SDMA0, ring->me);
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index 9505ae96fbecc..46affee1c2da0 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
return 0;
}
-static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
+static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
u32 inst_id = ring->me;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index a6e612b4a8928..581e75b7d01a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
return -ETIMEDOUT;
}
-static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
+static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
u32 inst_id = ring->me;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index 5a70ae17be04e..d9866009edbfc 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
return r;
}
-static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
+static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
int i, r;
@@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
return -EINVAL;
}
- r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
+ r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index ad47d0bdf7775..c546e73642296 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
return false;
}
-static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
+static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
int i, r;
@@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
return -EINVAL;
}
- r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
+ r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
index b5071f77f78d2..47a0deceff433 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
@@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
return 0;
}
-static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
index 5a33140f57235..d961a824d2098 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
@@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
}
}
-static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
int r = 0;
int vcn_inst;
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
index 16ade84facc78..10bd714592278 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
@@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
}
}
-static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
index f8e3f0b882da5..7e6a7ead9a086 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
}
}
-static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
+static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 07/29] drm/amdgpu: rework queue reset scheduler interaction
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (5 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 06/29] drm/amdgpu: update ring reset function signature Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 08/29] drm/amdgpu: move force completion into ring resets Alex Deucher
` (21 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Christian König, Alex Deucher
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Stopping the scheduler for queue reset is generally a good idea because
it prevents any worker from touching the ring buffer.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 35 ++++++++++++++-----------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 80d4dfebde24f..b398e7d097cc8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -91,8 +91,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
struct amdgpu_job *job = to_amdgpu_job(s_job);
struct amdgpu_task_info *ti;
struct amdgpu_device *adev = ring->adev;
- int idx;
- int r;
+ bool set_error = false;
+ int idx, r;
if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
dev_info(adev->dev, "%s - device unplugged skipping recovery on scheduler:%s",
@@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
} else if (amdgpu_gpu_recovery && ring->funcs->reset) {
bool is_guilty;
- dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
- /* stop the scheduler, but don't mess with the
- * bad job yet because if ring reset fails
- * we'll fall back to full GPU reset.
+ dev_err(adev->dev, "Starting %s ring reset\n",
+ s_job->sched->name);
+
+ /*
+ * Stop the scheduler to prevent anybody else from touching the
+ * ring buffer.
*/
drm_sched_wqueue_stop(&ring->sched);
@@ -152,26 +154,29 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
else
is_guilty = true;
- if (is_guilty)
+ if (is_guilty) {
dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+ set_error = true;
+ }
r = amdgpu_ring_reset(ring, job);
if (!r) {
- if (amdgpu_ring_sched_ready(ring))
- drm_sched_stop(&ring->sched, s_job);
if (is_guilty) {
atomic_inc(&ring->adev->gpu_reset_counter);
amdgpu_fence_driver_force_completion(ring);
}
- if (amdgpu_ring_sched_ready(ring))
- drm_sched_start(&ring->sched, 0);
- dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
- drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+ drm_sched_wqueue_start(&ring->sched);
+ dev_err(adev->dev, "Ring %s reset succeeded\n",
+ ring->sched.name);
+ drm_dev_wedged_event(adev_to_drm(adev),
+ DRM_WEDGE_RECOVERY_NONE);
goto exit;
}
- dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
+ dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
}
- dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+
+ if (!set_error)
+ dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
if (amdgpu_device_should_recover_gpu(ring->adev)) {
struct amdgpu_reset_context reset_context;
--
2.49.0
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 08/29] drm/amdgpu: move force completion into ring resets
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (6 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 07/29] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 09/29] drm/amdgpu: move guilty handling " Alex Deucher
` (20 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Move the force completion handling into each ring
reset function so that each engine can determine
whether or not it needs to force completion on the
jobs in the ring.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 4 +---
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 12 ++++++++++--
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 12 ++++++++++--
drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 12 ++++++++++--
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 8 +++++++-
drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 6 +++++-
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 6 +++++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 6 ++++--
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 7 ++++++-
drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 7 ++++++-
21 files changed, 136 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index b398e7d097cc8..461bd551546de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -161,10 +161,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
r = amdgpu_ring_reset(ring, job);
if (!r) {
- if (is_guilty) {
+ if (is_guilty)
atomic_inc(&ring->adev->gpu_reset_counter);
- amdgpu_fence_driver_force_completion(ring);
- }
drm_sched_wqueue_start(&ring->sched);
dev_err(adev->dev, "Ring %s reset succeeded\n",
ring->sched.name);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index c58e7040c732a..7a82c60d923ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9576,7 +9576,11 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
return r;
}
- return amdgpu_ring_test_ring(ring);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
@@ -9648,7 +9652,11 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
if (r)
return r;
- return amdgpu_ring_test_ring(ring);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static void gfx_v10_ip_print(struct amdgpu_ip_block *ip_block, struct drm_printer *p)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 0ee7bdd509741..9ad4f6971f8bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6836,7 +6836,11 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
return r;
}
- return amdgpu_ring_test_ring(ring);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
@@ -6997,7 +7001,11 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
return r;
}
- return amdgpu_ring_test_ring(ring);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static void gfx_v11_ip_print(struct amdgpu_ip_block *ip_block, struct drm_printer *p)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index a26417d53411b..3c628e3de5000 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5336,7 +5336,11 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
return r;
}
- return amdgpu_ring_test_ring(ring);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
@@ -5450,7 +5454,11 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
return r;
}
- return amdgpu_ring_test_ring(ring);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static void gfx_v12_0_ring_begin_use(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 5e650cc5fcb26..e64b02bb04e26 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7222,7 +7222,12 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
DRM_ERROR("fail to remap queue\n");
return r;
}
- return amdgpu_ring_test_ring(ring);
+
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static void gfx_v9_ip_print(struct amdgpu_ip_block *ip_block, struct drm_printer *p)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index a7dadff3dca31..0c2e80f73ba49 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3619,7 +3619,12 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
dev_err(adev->dev, "fail to remap queue\n");
return r;
}
- return amdgpu_ring_test_ring(ring);
+
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
enum amdgpu_gfx_cp_ras_mem_id {
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
index 6cd3fbe00d6b9..cd7c45a77120f 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
@@ -767,9 +767,15 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
+ int r;
+
jpeg_v2_0_stop(ring->adev);
jpeg_v2_0_start(ring->adev);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amd_ip_funcs jpeg_v2_0_ip_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
index 8ed41868f6c32..d936f0063039c 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
@@ -646,9 +646,15 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
+ int r;
+
jpeg_v2_5_stop_inst(ring->adev, ring->me);
jpeg_v2_5_start_inst(ring->adev, ring->me);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amd_ip_funcs jpeg_v2_5_ip_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
index 3512fbb543301..9e1ae935c6663 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
@@ -558,9 +558,15 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
+ int r;
+
jpeg_v3_0_stop(ring->adev);
jpeg_v3_0_start(ring->adev);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amd_ip_funcs jpeg_v3_0_ip_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
index c8efeaf0a2a69..da27eac1115ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
@@ -723,12 +723,18 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
+ int r;
+
if (amdgpu_sriov_vf(ring->adev))
return -EINVAL;
jpeg_v4_0_stop(ring->adev);
jpeg_v4_0_start(ring->adev);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amd_ip_funcs jpeg_v4_0_ip_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
index 8b07c3651c579..f1a6fe7f7b3af 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
@@ -1146,12 +1146,18 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
+ int r;
+
if (amdgpu_sriov_vf(ring->adev))
return -EOPNOTSUPP;
jpeg_v4_0_3_core_stall_reset(ring);
jpeg_v4_0_3_start_jrbc(ring);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amd_ip_funcs jpeg_v4_0_3_ip_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
index 0a21a13e19360..3d2b9d38c306a 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
@@ -837,12 +837,18 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
+ int r;
+
if (amdgpu_sriov_vf(ring->adev))
return -EOPNOTSUPP;
jpeg_v5_0_1_core_stall_reset(ring);
jpeg_v5_0_1_init_jrbc(ring);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amd_ip_funcs jpeg_v5_0_1_ip_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index ffd67d51b335f..73328e213c247 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -1671,6 +1671,7 @@ static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
+ bool is_guilty = ring->funcs->is_guilty(ring);
u32 id = GET_INST(SDMA0, ring->me);
int r;
@@ -1680,8 +1681,13 @@ static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
amdgpu_amdkfd_suspend(adev, false);
r = amdgpu_sdma_reset_engine(adev, id);
amdgpu_amdkfd_resume(adev, false);
+ if (r)
+ return r;
- return r;
+ if (is_guilty)
+ amdgpu_fence_driver_force_completion(ring);
+
+ return 0;
}
static int sdma_v4_4_2_stop_queue(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index 46affee1c2da0..8d1c43ed39994 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -1543,8 +1543,13 @@ static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
{
struct amdgpu_device *adev = ring->adev;
u32 inst_id = ring->me;
+ int r;
- return amdgpu_sdma_reset_engine(adev, inst_id);
+ r = amdgpu_sdma_reset_engine(adev, inst_id);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int sdma_v5_0_stop_queue(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 581e75b7d01a8..f700ac64fb616 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1456,8 +1456,13 @@ static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
{
struct amdgpu_device *adev = ring->adev;
u32 inst_id = ring->me;
+ int r;
- return amdgpu_sdma_reset_engine(adev, inst_id);
+ r = amdgpu_sdma_reset_engine(adev, inst_id);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int sdma_v5_2_stop_queue(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index d9866009edbfc..25c01acac2cd9 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -1560,7 +1560,11 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
if (r)
return r;
- return sdma_v6_0_gfx_resume_instance(adev, i, true);
+ r = sdma_v6_0_gfx_resume_instance(adev, i, true);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int sdma_v6_0_set_trap_irq_state(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index c546e73642296..97ea5392ab85d 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -825,7 +825,11 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
if (r)
return r;
- return sdma_v7_0_gfx_resume_instance(adev, i, true);
+ r = sdma_v7_0_gfx_resume_instance(adev, i, true);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
index 47a0deceff433..f3ff3c6c155fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
@@ -1972,6 +1972,7 @@ static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
+ int r;
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
@@ -1979,7 +1980,11 @@ static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
vcn_v4_0_stop(vinst);
vcn_v4_0_start(vinst);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static struct amdgpu_ring_funcs vcn_v4_0_unified_ring_vm_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
index d961a824d2098..e15057333a459 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
@@ -1622,8 +1622,10 @@ static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
vcn_v4_0_3_hw_init_inst(vinst);
vcn_v4_0_3_start_dpg_mode(vinst, adev->vcn.inst[ring->me].indirect_sram);
r = amdgpu_ring_test_helper(ring);
-
- return r;
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amdgpu_ring_funcs vcn_v4_0_3_unified_ring_vm_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
index 10bd714592278..9fd3127dc8828 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
@@ -1470,6 +1470,7 @@ static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
+ int r;
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
@@ -1477,7 +1478,11 @@ static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
vcn_v4_0_5_stop(vinst);
vcn_v4_0_5_start(vinst);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static struct amdgpu_ring_funcs vcn_v4_0_5_unified_ring_vm_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
index 7e6a7ead9a086..c5afe2a7f9f5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -1197,6 +1197,7 @@ static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
{
struct amdgpu_device *adev = ring->adev;
struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
+ int r;
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
@@ -1204,7 +1205,11 @@ static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
vcn_v5_0_0_stop(vinst);
vcn_v5_0_0_start(vinst);
- return amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_helper(ring);
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static const struct amdgpu_ring_funcs vcn_v5_0_0_unified_ring_vm_funcs = {
--
2.49.0
* [PATCH 09/29] drm/amdgpu: move guilty handling into ring resets
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (7 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 08/29] drm/amdgpu: move force completion into ring resets Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 10/29] drm/amdgpu: track ring state associated with a job Alex Deucher
` (19 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Move the guilty handling into the ring reset callbacks.
This allows each ring reset callback to handle fence
errors and force completions in line with the reset
behavior of its IP. It also allows us to remove the
ring is_guilty callback since that logic now lives in
the reset callbacks.
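As a rough sketch (not taken verbatim from any single IP, and with the
stop/start calls as placeholders), the shape each reset callback ends up
with after this change is:

    static int foo_ring_reset(struct amdgpu_ring *ring,
                              struct amdgpu_job *job)
    {
            int r;

            /* IP specific stop/restart of the queue (placeholder names) */
            foo_stop(ring->adev);
            foo_start(ring->adev);

            r = amdgpu_ring_test_helper(ring);
            if (r)
                    return r;

            /* guilty handling moved here from amdgpu_job_timedout() */
            dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
            amdgpu_fence_driver_force_completion(ring);
            atomic_inc(&ring->adev->gpu_reset_counter);
            return 0;
    }

Engine resets (e.g. SDMA 4.4.2/5.x below) additionally check whether the
queue was actually selected before marking the job guilty, so that only
the guilty queue gets the fence error and reset counter bump.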
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 22 +---------
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 -
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 4 ++
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 4 ++
drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 4 ++
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +
drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 2 +
drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 2 +
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 2 +
drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 56 ++++++++++++------------
drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 25 ++++++++++-
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 27 ++++++++++--
drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 2 +
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 2 +
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 2 +
drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 2 +
22 files changed, 116 insertions(+), 55 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 461bd551546de..308d3889e46ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -91,7 +91,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
struct amdgpu_job *job = to_amdgpu_job(s_job);
struct amdgpu_task_info *ti;
struct amdgpu_device *adev = ring->adev;
- bool set_error = false;
int idx, r;
if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
@@ -134,8 +133,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
if (unlikely(adev->debug_disable_gpu_ring_reset)) {
dev_err(adev->dev, "Ring reset disabled by debug mask\n");
} else if (amdgpu_gpu_recovery && ring->funcs->reset) {
- bool is_guilty;
-
dev_err(adev->dev, "Starting %s ring reset\n",
s_job->sched->name);
@@ -145,24 +142,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
*/
drm_sched_wqueue_stop(&ring->sched);
- /* for engine resets, we need to reset the engine,
- * but individual queues may be unaffected.
- * check here to make sure the accounting is correct.
- */
- if (ring->funcs->is_guilty)
- is_guilty = ring->funcs->is_guilty(ring);
- else
- is_guilty = true;
-
- if (is_guilty) {
- dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
- set_error = true;
- }
-
r = amdgpu_ring_reset(ring, job);
if (!r) {
- if (is_guilty)
- atomic_inc(&ring->adev->gpu_reset_counter);
drm_sched_wqueue_start(&ring->sched);
dev_err(adev->dev, "Ring %s reset succeeded\n",
ring->sched.name);
@@ -173,8 +154,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
}
- if (!set_error)
- dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+ dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
if (amdgpu_device_should_recover_gpu(ring->adev)) {
struct amdgpu_reset_context reset_context;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index ab5402d7ce9c8..2b3843f5218c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -270,7 +270,6 @@ struct amdgpu_ring_funcs {
void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
- bool (*is_guilty)(struct amdgpu_ring *ring);
};
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 7a82c60d923ed..b57a21c0874c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9579,7 +9579,9 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
@@ -9655,7 +9657,9 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 9ad4f6971f8bf..02022c7b4de78 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6839,7 +6839,9 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
@@ -7004,7 +7006,9 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 3c628e3de5000..a4e3ce81bc671 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5339,7 +5339,9 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
@@ -5457,7 +5459,9 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index e64b02bb04e26..f699c8b0f7488 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7226,7 +7226,9 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 0c2e80f73ba49..d9eea11f52fec 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3623,7 +3623,9 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
index cd7c45a77120f..f2058f263cc05 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
@@ -774,7 +774,9 @@ static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
index d936f0063039c..5eb86291ccdd6 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
@@ -653,7 +653,9 @@ static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
index 9e1ae935c6663..ff826611b600e 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
@@ -565,7 +565,9 @@ static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
index da27eac1115ee..179dd420edb15 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
@@ -733,7 +733,9 @@ static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
index f1a6fe7f7b3af..c956f424fbbf9 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
@@ -1156,7 +1156,9 @@ static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
index 3d2b9d38c306a..ef9289f78a46a 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
@@ -847,7 +847,9 @@ static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 73328e213c247..fce8cc3ef066c 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -1648,44 +1648,30 @@ static bool sdma_v4_4_2_is_queue_selected(struct amdgpu_device *adev, uint32_t i
return (context_status & SDMA_GFX_CONTEXT_STATUS__SELECTED_MASK) != 0;
}
-static bool sdma_v4_4_2_ring_is_guilty(struct amdgpu_ring *ring)
-{
- struct amdgpu_device *adev = ring->adev;
- uint32_t instance_id = ring->me;
-
- return sdma_v4_4_2_is_queue_selected(adev, instance_id, false);
-}
-
-static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
-{
- struct amdgpu_device *adev = ring->adev;
- uint32_t instance_id = ring->me;
-
- if (!adev->sdma.has_page_queue)
- return false;
-
- return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
-}
-
static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
- bool is_guilty = ring->funcs->is_guilty(ring);
u32 id = GET_INST(SDMA0, ring->me);
+ bool is_guilty;
int r;
if (!(adev->sdma.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
+ is_guilty = sdma_v4_4_2_is_queue_selected(adev, id,
+ &adev->sdma.instance[id].page == ring);
+
amdgpu_amdkfd_suspend(adev, false);
r = amdgpu_sdma_reset_engine(adev, id);
amdgpu_amdkfd_resume(adev, false);
if (r)
return r;
- if (is_guilty)
- amdgpu_fence_driver_force_completion(ring);
+ if (is_guilty) {
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
+ atomic_inc(&ring->adev->gpu_reset_counter);
+ }
return 0;
}
@@ -1731,8 +1717,8 @@ static int sdma_v4_4_2_stop_queue(struct amdgpu_ring *ring)
static int sdma_v4_4_2_restore_queue(struct amdgpu_ring *ring)
{
struct amdgpu_device *adev = ring->adev;
- u32 inst_mask;
- int i;
+ u32 inst_mask, tmp_mask;
+ int i, r;
inst_mask = 1 << ring->me;
udelay(50);
@@ -1749,7 +1735,25 @@ static int sdma_v4_4_2_restore_queue(struct amdgpu_ring *ring)
return -ETIMEDOUT;
}
- return sdma_v4_4_2_inst_start(adev, inst_mask, true);
+ r = sdma_v4_4_2_inst_start(adev, inst_mask, true);
+ if (r) {
+ return r;
+ }
+
+ tmp_mask = inst_mask;
+ for_each_inst(i, tmp_mask) {
+ ring = &adev->sdma.instance[i].ring;
+
+ amdgpu_fence_driver_force_completion(ring);
+
+ if (adev->sdma.has_page_queue) {
+ struct amdgpu_ring *page = &adev->sdma.instance[i].page;
+
+ amdgpu_fence_driver_force_completion(page);
+ }
+ }
+
+ return r;
}
static int sdma_v4_4_2_set_trap_irq_state(struct amdgpu_device *adev,
@@ -2146,7 +2150,6 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {
.emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
.emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
.reset = sdma_v4_4_2_reset_queue,
- .is_guilty = sdma_v4_4_2_ring_is_guilty,
};
static const struct amdgpu_ring_funcs sdma_v4_4_2_page_ring_funcs = {
@@ -2179,7 +2182,6 @@ static const struct amdgpu_ring_funcs sdma_v4_4_2_page_ring_funcs = {
.emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
.emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
.reset = sdma_v4_4_2_reset_queue,
- .is_guilty = sdma_v4_4_2_page_ring_is_guilty,
};
static void sdma_v4_4_2_set_ring_funcs(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
index 8d1c43ed39994..4582a11b411dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
@@ -1538,17 +1538,35 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
return 0;
}
+static bool sdma_v5_0_is_queue_selected(struct amdgpu_device *adev,
+ uint32_t instance_id)
+{
+ u32 context_status = RREG32(sdma_v5_0_get_reg_offset(adev, instance_id,
+ mmSDMA0_GFX_CONTEXT_STATUS));
+
+ /* Check if the SELECTED bit is set */
+ return (context_status & SDMA0_GFX_CONTEXT_STATUS__SELECTED_MASK) != 0;
+}
+
static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
u32 inst_id = ring->me;
+ bool is_guilty = sdma_v5_0_is_queue_selected(adev, inst_id);
int r;
+ amdgpu_amdkfd_suspend(adev, false);
r = amdgpu_sdma_reset_engine(adev, inst_id);
+ amdgpu_amdkfd_resume(adev, false);
if (r)
return r;
- amdgpu_fence_driver_force_completion(ring);
+
+ if (is_guilty) {
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
+ atomic_inc(&ring->adev->gpu_reset_counter);
+ }
+
return 0;
}
@@ -1616,7 +1634,10 @@ static int sdma_v5_0_restore_queue(struct amdgpu_ring *ring)
r = sdma_v5_0_gfx_resume_instance(adev, inst_id, true);
amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
- return r;
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int sdma_v5_0_ring_preempt_ib(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index f700ac64fb616..711064ea22d5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1451,17 +1451,35 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
return -ETIMEDOUT;
}
+static bool sdma_v5_2_is_queue_selected(struct amdgpu_device *adev,
+ uint32_t instance_id)
+{
+ u32 context_status = RREG32(sdma_v5_2_get_reg_offset(adev, instance_id,
+ mmSDMA0_GFX_CONTEXT_STATUS));
+
+ /* Check if the SELECTED bit is set */
+ return (context_status & SDMA0_GFX_CONTEXT_STATUS__SELECTED_MASK) != 0;
+}
+
static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
struct amdgpu_job *job)
{
struct amdgpu_device *adev = ring->adev;
u32 inst_id = ring->me;
+ bool is_guilty = sdma_v5_2_is_queue_selected(adev, inst_id);
int r;
+ amdgpu_amdkfd_suspend(adev, false);
r = amdgpu_sdma_reset_engine(adev, inst_id);
+ amdgpu_amdkfd_resume(adev, false);
if (r)
return r;
- amdgpu_fence_driver_force_completion(ring);
+
+ if (is_guilty) {
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
+ atomic_inc(&ring->adev->gpu_reset_counter);
+ }
+
return 0;
}
@@ -1528,11 +1546,12 @@ static int sdma_v5_2_restore_queue(struct amdgpu_ring *ring)
freeze = RREG32(sdma_v5_2_get_reg_offset(adev, inst_id, mmSDMA0_FREEZE));
freeze = REG_SET_FIELD(freeze, SDMA0_FREEZE, FREEZE, 0);
WREG32(sdma_v5_2_get_reg_offset(adev, inst_id, mmSDMA0_FREEZE), freeze);
-
r = sdma_v5_2_gfx_resume_instance(adev, inst_id, true);
-
amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
- return r;
+ if (r)
+ return r;
+ amdgpu_fence_driver_force_completion(ring);
+ return 0;
}
static int sdma_v5_2_ring_preempt_ib(struct amdgpu_ring *ring)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index 25c01acac2cd9..abb5ad697fbb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -1563,7 +1563,9 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
r = sdma_v6_0_gfx_resume_instance(adev, i, true);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index 97ea5392ab85d..76ae1a7849a56 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -828,7 +828,9 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
r = sdma_v7_0_gfx_resume_instance(adev, i, true);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
index f3ff3c6c155fd..d68bd82f8eab0 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
@@ -1983,7 +1983,9 @@ static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
index e15057333a459..a9d8ae4ab109a 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
@@ -1624,7 +1624,9 @@ static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
index 9fd3127dc8828..93bc55756dcd6 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
@@ -1481,7 +1481,9 @@ static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
index c5afe2a7f9f5d..d74c1862ac860 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -1208,7 +1208,9 @@ static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
r = amdgpu_ring_test_helper(ring);
if (r)
return r;
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
amdgpu_fence_driver_force_completion(ring);
+ atomic_inc(&ring->adev->gpu_reset_counter);
return 0;
}
--
2.49.0
* [PATCH 10/29] drm/amdgpu: track ring state associated with a job
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (8 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 09/29] drm/amdgpu: move guilty handling " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 11/29] drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset Alex Deucher
` (18 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
We need to know the wptr and sequence number associated
with a job so that we can re-emit the unprocessed state
after a ring reset. Pre-allocate storage space for the
ring buffer contents and add helpers to save off the
unprocessed state and re-emit it after the queue is
reset.
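As wired up in the subsequent patches of this series, a reset callback
uses these helpers roughly as follows (sketch only; the actual queue
reset step is IP specific and shown here as a placeholder comment):

    static int foo_ring_reset(struct amdgpu_ring *ring,
                              struct amdgpu_job *job)
    {
            int r;

            /* back up everything emitted after the bad job before the
             * queue contents are lost
             */
            amdgpu_ring_backup_unprocessed_commands(ring,
                                                    &job->hw_fence.base,
                                                    true);

            /* IP specific queue reset goes here */

            r = amdgpu_ring_test_helper(ring);
            if (r)
                    return r;

            /* signal only the bad job's fence, then replay the rest */
            amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
            r = amdgpu_ring_reemit_unprocessed_commands(ring);
            if (r)
                    /* if we fail to re-emit, force complete all fences */
                    amdgpu_fence_driver_force_completion(ring);
            return 0;
    }

The wptr used by the backup helper is saved per fence at submission time
via amdgpu_fence_save_wptr() at the end of amdgpu_ib_schedule().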
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 90 +++++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 27 +++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 14 ++++
5 files changed, 139 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 569e0e5373927..da87a5539a90b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -135,12 +135,20 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
am_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
if (am_fence == NULL)
return -ENOMEM;
+ am_fence->context = 0;
} else {
/* take use of job-embedded fence */
am_fence = &job->hw_fence;
+ if (job->base.s_fence) {
+ struct dma_fence *finished = &job->base.s_fence->finished;
+ am_fence->context = finished->context;
+ } else {
+ am_fence->context = 0;
+ }
}
fence = &am_fence->base;
am_fence->ring = ring;
+ am_fence->wptr = 0;
seq = ++ring->fence_drv.sync_seq;
if (job && job->job_run_counter) {
@@ -748,6 +756,88 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring)
amdgpu_fence_process(ring);
}
+/**
+ * amdgpu_fence_driver_guilty_force_completion - force signal of specified sequence
+ *
+ * @fence: fence of the ring to signal
+ *
+ */
+void amdgpu_fence_driver_guilty_force_completion(struct dma_fence *fence)
+{
+ struct amdgpu_fence *am_fence = container_of(fence, struct amdgpu_fence, base);
+
+ amdgpu_fence_write(am_fence->ring, fence->seqno);
+ amdgpu_fence_process(am_fence->ring);
+}
+
+void amdgpu_fence_save_wptr(struct dma_fence *fence)
+{
+ struct amdgpu_fence *am_fence = container_of(fence, struct amdgpu_fence, base);
+
+ am_fence->wptr = am_fence->ring->wptr;
+}
+
+static void amdgpu_ring_backup_unprocessed_command(struct amdgpu_ring *ring,
+ unsigned int idx,
+ u64 start_wptr, u32 end_wptr)
+{
+ unsigned int first_idx = start_wptr & ring->buf_mask;
+ unsigned int last_idx = end_wptr & ring->buf_mask;
+ unsigned int i, j, entries_to_copy;
+
+ if (last_idx < first_idx) {
+ entries_to_copy = ring->buf_mask + 1 - first_idx;
+ for (i = 0; i < entries_to_copy; i++)
+ ring->ring_backup[idx + i] = ring->ring[first_idx + i];
+ ring->ring_backup_entries_to_copy += entries_to_copy;
+ entries_to_copy = last_idx;
+ for (j = 0; j < entries_to_copy; j++)
+ ring->ring_backup[idx + i + j] = ring->ring[j];
+ ring->ring_backup_entries_to_copy += entries_to_copy;
+ } else {
+ entries_to_copy = last_idx - first_idx;
+ for (i = 0; i < entries_to_copy; i++)
+ ring->ring_backup[idx + i] = ring->ring[first_idx + i];
+ ring->ring_backup_entries_to_copy += entries_to_copy;
+ }
+}
+
+void amdgpu_ring_backup_unprocessed_commands(struct amdgpu_ring *ring,
+ struct dma_fence *f,
+ bool is_guilty)
+{
+ struct amdgpu_fence *bad_fence =
+ container_of(f, struct amdgpu_fence, base);
+ struct amdgpu_fence *fence;
+ struct dma_fence *unprocessed, **ptr;
+ u64 wptr, i;
+
+ wptr = bad_fence->wptr;
+ ring->ring_backup_entries_to_copy = 0;
+ for (i = bad_fence->base.seqno + 1; i <= ring->fence_drv.sync_seq; ++i) {
+ ptr = &ring->fence_drv.fences[i & ring->fence_drv.num_fences_mask];
+ rcu_read_lock();
+ unprocessed = rcu_dereference(*ptr);
+
+ if (unprocessed && !dma_fence_is_signaled(unprocessed)) {
+ fence = container_of(unprocessed, struct amdgpu_fence, base);
+
+ /* save everything if the ring is not guilty, otherwise
+ * just save the content from other contexts.
+ */
+ if (fence->wptr &&
+ (!is_guilty || (fence->context != bad_fence->context))) {
+ amdgpu_ring_backup_unprocessed_command(ring,
+ ring->ring_backup_entries_to_copy,
+ wptr,
+ fence->wptr);
+ wptr = fence->wptr;
+ }
+ }
+ rcu_read_unlock();
+ }
+}
+
/*
* Common fence implementation
*/
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index 802743efa3b39..789f9b2af8f99 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -138,7 +138,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
int vmid = AMDGPU_JOB_GET_VMID(job);
bool need_pipe_sync = false;
unsigned int cond_exec;
-
unsigned int i;
int r = 0;
@@ -306,6 +305,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
amdgpu_ring_ib_end(ring);
amdgpu_ring_commit(ring);
+
+ /* This must be last for resets to work properly
+ * as we need to save the wptr associated with this
+ * fence.
+ */
+ amdgpu_fence_save_wptr(*f);
+
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 308d3889e46ca..0ac51d7b4d78a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -89,8 +89,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
{
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
struct amdgpu_job *job = to_amdgpu_job(s_job);
- struct amdgpu_task_info *ti;
struct amdgpu_device *adev = ring->adev;
+ struct amdgpu_task_info *ti;
int idx, r;
if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 426834806fbf2..736ff5bafd520 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -333,6 +333,12 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
/* Initialize cached_rptr to 0 */
ring->cached_rptr = 0;
+ if (!ring->ring_backup) {
+ ring->ring_backup = kvzalloc(ring->ring_size, GFP_KERNEL);
+ if (!ring->ring_backup)
+ return -ENOMEM;
+ }
+
/* Allocate ring buffer */
if (ring->ring_obj == NULL) {
r = amdgpu_bo_create_kernel(adev, ring->ring_size + ring->funcs->extra_dw, PAGE_SIZE,
@@ -342,6 +348,7 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
(void **)&ring->ring);
if (r) {
dev_err(adev->dev, "(%d) ring create failed\n", r);
+ kvfree(ring->ring_backup);
return r;
}
amdgpu_ring_clear_ring(ring);
@@ -385,6 +392,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring)
amdgpu_bo_free_kernel(&ring->ring_obj,
&ring->gpu_addr,
(void **)&ring->ring);
+ kvfree(ring->ring_backup);
+ ring->ring_backup = NULL;
dma_fence_put(ring->vmid_wait);
ring->vmid_wait = NULL;
@@ -753,3 +762,21 @@ bool amdgpu_ring_sched_ready(struct amdgpu_ring *ring)
return true;
}
+
+int amdgpu_ring_reemit_unprocessed_commands(struct amdgpu_ring *ring)
+{
+ unsigned int i;
+ int r;
+
+ /* re-emit the unprocessed ring contents */
+ if (ring->ring_backup_entries_to_copy) {
+ r = amdgpu_ring_alloc(ring, ring->ring_backup_entries_to_copy);
+ if (r)
+ return r;
+ for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
+ amdgpu_ring_write(ring, ring->ring_backup[i]);
+ amdgpu_ring_commit(ring);
+ }
+
+ return 0;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 2b3843f5218c8..b73894254bb8c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -141,6 +141,11 @@ struct amdgpu_fence {
/* RB, DMA, etc. */
struct amdgpu_ring *ring;
ktime_t start_timestamp;
+
+ /* wptr for the fence for resets */
+ u64 wptr;
+ /* fence context for resets */
+ u64 context;
};
extern const struct drm_sched_backend_ops amdgpu_sched_ops;
@@ -148,6 +153,8 @@ extern const struct drm_sched_backend_ops amdgpu_sched_ops;
void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error);
void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
+void amdgpu_fence_driver_guilty_force_completion(struct dma_fence *fence);
+void amdgpu_fence_save_wptr(struct dma_fence *fence);
int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring);
int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
@@ -283,6 +290,9 @@ struct amdgpu_ring {
struct amdgpu_bo *ring_obj;
uint32_t *ring;
+ /* backups for resets */
+ uint32_t *ring_backup;
+ unsigned int ring_backup_entries_to_copy;
unsigned rptr_offs;
u64 rptr_gpu_addr;
volatile u32 *rptr_cpu_addr;
@@ -549,4 +559,8 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev);
void amdgpu_ib_pool_fini(struct amdgpu_device *adev);
int amdgpu_ib_ring_tests(struct amdgpu_device *adev);
bool amdgpu_ring_sched_ready(struct amdgpu_ring *ring);
+void amdgpu_ring_backup_unprocessed_commands(struct amdgpu_ring *ring,
+ struct dma_fence *f,
+ bool is_guilty);
+int amdgpu_ring_reemit_unprocessed_commands(struct amdgpu_ring *ring);
#endif
--
2.49.0
* [PATCH 11/29] drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (9 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 10/29] drm/amdgpu: track ring state associated with a job Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 12/29] drm/amdgpu/gfx9.4.3: " Alex Deucher
` (17 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index f699c8b0f7488..f56354a1a8a96 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7167,6 +7167,8 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
spin_lock_irqsave(&kiq->ring_lock, flags);
if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
@@ -7216,19 +7218,25 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
}
kiq->pmf->kiq_map_queues(kiq_ring, ring);
amdgpu_ring_commit(kiq_ring);
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
r = amdgpu_ring_test_ring(kiq_ring);
+ spin_unlock_irqrestore(&kiq->ring_lock, flags);
if (r) {
DRM_ERROR("fail to remap queue\n");
return r;
}
-
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 12/29] drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (10 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 11/29] drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 13/29] drm/amdgpu/gfx10: re-emit unprocessed state on ring reset Alex Deucher
` (16 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index d9eea11f52fec..637d261231898 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3566,6 +3566,8 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
spin_lock_irqsave(&kiq->ring_lock, flags);
if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
@@ -3612,9 +3614,8 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
}
kiq->pmf->kiq_map_queues(kiq_ring, ring);
amdgpu_ring_commit(kiq_ring);
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
r = amdgpu_ring_test_ring(kiq_ring);
+ spin_unlock_irqrestore(&kiq->ring_lock, flags);
if (r) {
dev_err(adev->dev, "fail to remap queue\n");
return r;
@@ -3623,9 +3624,16 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 13/29] drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (11 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 12/29] drm/amdgpu/gfx9.4.3: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 14/29] drm/amdgpu/gfx11: " Alex Deucher
` (15 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
This also drops the soft recovery callback from the gfx10
rings.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 49 ++++++++++++--------------
1 file changed, 23 insertions(+), 26 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index b57a21c0874c8..1faa8c6a90d9d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9046,21 +9046,6 @@ static void gfx_v10_0_ring_emit_reg_write_reg_wait(struct amdgpu_ring *ring,
ref, mask);
}
-static void gfx_v10_0_ring_soft_recovery(struct amdgpu_ring *ring,
- unsigned int vmid)
-{
- struct amdgpu_device *adev = ring->adev;
- uint32_t value = 0;
-
- value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
- value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
- value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
- value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
- amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
- WREG32_SOC15(GC, 0, mmSQ_CMD, value);
- amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
-}
-
static void
gfx_v10_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
uint32_t me, uint32_t pipe,
@@ -9539,6 +9524,8 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
spin_lock_irqsave(&kiq->ring_lock, flags);
if (amdgpu_ring_alloc(kiq_ring, 5 + 7 + 7 + kiq->pmf->map_queues_size)) {
@@ -9563,10 +9550,8 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
kiq->pmf->kiq_map_queues(kiq_ring, ring);
amdgpu_ring_commit(kiq_ring);
-
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
r = amdgpu_ring_test_ring(kiq_ring);
+ spin_unlock_irqrestore(&kiq->ring_lock, flags);
if (r)
return r;
@@ -9579,9 +9564,16 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
@@ -9600,6 +9592,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
spin_lock_irqsave(&kiq->ring_lock, flags);
if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
@@ -9610,9 +9604,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
0, 0);
amdgpu_ring_commit(kiq_ring);
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
r = amdgpu_ring_test_ring(kiq_ring);
+ spin_unlock_irqrestore(&kiq->ring_lock, flags);
if (r)
return r;
@@ -9648,18 +9641,24 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
}
kiq->pmf->kiq_map_queues(kiq_ring, ring);
amdgpu_ring_commit(kiq_ring);
- spin_unlock_irqrestore(&kiq->ring_lock, flags);
-
r = amdgpu_ring_test_ring(kiq_ring);
+ spin_unlock_irqrestore(&kiq->ring_lock, flags);
if (r)
return r;
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
@@ -9895,7 +9894,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
.emit_wreg = gfx_v10_0_ring_emit_wreg,
.emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
.emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
- .soft_recovery = gfx_v10_0_ring_soft_recovery,
.emit_mem_sync = gfx_v10_0_emit_mem_sync,
.reset = gfx_v10_0_reset_kgq,
.emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
@@ -9936,7 +9934,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_compute = {
.emit_wreg = gfx_v10_0_ring_emit_wreg,
.emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
.emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
- .soft_recovery = gfx_v10_0_ring_soft_recovery,
.emit_mem_sync = gfx_v10_0_emit_mem_sync,
.reset = gfx_v10_0_reset_kcq,
.emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
--
2.49.0
* [PATCH 14/29] drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (12 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 13/29] drm/amdgpu/gfx10: re-emit unprocessed state on ring reset Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 15/29] drm/amdgpu/gfx12: " Alex Deucher
` (14 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
This also drops the soft recovery callback from the gfx11
rings.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 39 +++++++++++++-------------
1 file changed, 20 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 02022c7b4de78..a68e1fe3a7d68 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6278,21 +6278,6 @@ static void gfx_v11_0_ring_emit_reg_write_reg_wait(struct amdgpu_ring *ring,
ref, mask, 0x20);
}
-static void gfx_v11_0_ring_soft_recovery(struct amdgpu_ring *ring,
- unsigned vmid)
-{
- struct amdgpu_device *adev = ring->adev;
- uint32_t value = 0;
-
- value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
- value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
- value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
- value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
- amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
- WREG32_SOC15(GC, 0, regSQ_CMD, value);
- amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
-}
-
static void
gfx_v11_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
uint32_t me, uint32_t pipe,
@@ -6815,6 +6800,8 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(adev))
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
if (r) {
@@ -6839,9 +6826,16 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
@@ -6984,6 +6978,8 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(adev))
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
if (r) {
dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
@@ -7006,9 +7002,16 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
@@ -7245,7 +7248,6 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_gfx = {
.emit_wreg = gfx_v11_0_ring_emit_wreg,
.emit_reg_wait = gfx_v11_0_ring_emit_reg_wait,
.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
- .soft_recovery = gfx_v11_0_ring_soft_recovery,
.emit_mem_sync = gfx_v11_0_emit_mem_sync,
.reset = gfx_v11_0_reset_kgq,
.emit_cleaner_shader = gfx_v11_0_ring_emit_cleaner_shader,
@@ -7287,7 +7289,6 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_compute = {
.emit_wreg = gfx_v11_0_ring_emit_wreg,
.emit_reg_wait = gfx_v11_0_ring_emit_reg_wait,
.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
- .soft_recovery = gfx_v11_0_ring_soft_recovery,
.emit_mem_sync = gfx_v11_0_emit_mem_sync,
.reset = gfx_v11_0_reset_kcq,
.emit_cleaner_shader = gfx_v11_0_ring_emit_cleaner_shader,
--
2.49.0
* [PATCH 15/29] drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (13 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 14/29] drm/amdgpu/gfx11: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 16/29] drm/amdgpu/sdma6: " Alex Deucher
` (13 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
This also drops the soft recovery callback from the gfx12
rings.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 39 +++++++++++++-------------
1 file changed, 20 insertions(+), 19 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index a4e3ce81bc671..10afde96491e6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -4690,21 +4690,6 @@ static void gfx_v12_0_ring_emit_reg_write_reg_wait(struct amdgpu_ring *ring,
ref, mask, 0x20);
}
-static void gfx_v12_0_ring_soft_recovery(struct amdgpu_ring *ring,
- unsigned vmid)
-{
- struct amdgpu_device *adev = ring->adev;
- uint32_t value = 0;
-
- value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
- value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
- value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
- value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
- amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
- WREG32_SOC15(GC, 0, regSQ_CMD, value);
- amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
-}
-
static void
gfx_v12_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
uint32_t me, uint32_t pipe,
@@ -5316,6 +5301,8 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(adev))
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
if (r) {
dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
@@ -5339,9 +5326,16 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
@@ -5437,6 +5431,8 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(adev))
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
if (r) {
dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
@@ -5459,9 +5455,16 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
@@ -5540,7 +5543,6 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_gfx = {
.emit_wreg = gfx_v12_0_ring_emit_wreg,
.emit_reg_wait = gfx_v12_0_ring_emit_reg_wait,
.emit_reg_write_reg_wait = gfx_v12_0_ring_emit_reg_write_reg_wait,
- .soft_recovery = gfx_v12_0_ring_soft_recovery,
.emit_mem_sync = gfx_v12_0_emit_mem_sync,
.reset = gfx_v12_0_reset_kgq,
.emit_cleaner_shader = gfx_v12_0_ring_emit_cleaner_shader,
@@ -5579,7 +5581,6 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_compute = {
.emit_wreg = gfx_v12_0_ring_emit_wreg,
.emit_reg_wait = gfx_v12_0_ring_emit_reg_wait,
.emit_reg_write_reg_wait = gfx_v12_0_ring_emit_reg_write_reg_wait,
- .soft_recovery = gfx_v12_0_ring_soft_recovery,
.emit_mem_sync = gfx_v12_0_emit_mem_sync,
.reset = gfx_v12_0_reset_kcq,
.emit_cleaner_shader = gfx_v12_0_ring_emit_cleaner_shader,
--
2.49.0
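The gfx12 hunks above and the SDMA, JPEG and VCN patches that follow all wrap their per-IP reset mechanism in the same sequence. As a reading aid, here is a minimal sketch of that shared flow, using the helper names visible in the diffs; queue_stop()/queue_start() stand in for whatever per-IP reset each engine uses (MES legacy queue reset, engine stop/start, core stall reset) and are not real functions:

/* Illustrative outline only, not literal driver code. */
static int example_ring_reset(struct amdgpu_ring *ring,
                              struct amdgpu_job *job)
{
        int r;

        /* save the unprocessed ring contents for later replay; the
         * guilty job's fence tells the helper which submission failed */
        amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);

        /* per-IP queue reset goes here */
        queue_stop(ring);
        queue_start(ring);

        /* make sure the queue actually came back before touching fences */
        r = amdgpu_ring_test_ring(ring);
        if (r)
                return r;

        /* fail the guilty job and signal only its hardware fence */
        dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
        amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
        atomic_inc(&ring->adev->gpu_reset_counter);

        /* replay the remaining submissions; if that fails, force-complete
         * everything so no fence is left unsignaled */
        r = amdgpu_ring_reemit_unprocessed_commands(ring);
        if (r)
                amdgpu_fence_driver_force_completion(ring);

        return 0;
}

Returning 0 even when the re-emit fails keeps the per-queue handler from failing outright; the forced completion ensures nothing waits forever on the ring.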
* [PATCH 16/29] drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (14 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 15/29] drm/amdgpu/gfx12: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 17/29] drm/amdgpu/sdma7: " Alex Deucher
` (12 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index abb5ad697fbb2..f3a49d6a2ae0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -1556,6 +1556,8 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
return -EINVAL;
}
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
if (r)
return r;
@@ -1563,9 +1565,16 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
r = sdma_v6_0_gfx_resume_instance(adev, i, true);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 17/29] drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (15 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 16/29] drm/amdgpu/sdma6: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 18/29] drm/amdgpu/jpeg2: " Alex Deucher
` (11 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index 76ae1a7849a56..318f446acce0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -821,6 +821,8 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
return -EINVAL;
}
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
if (r)
return r;
@@ -828,9 +830,16 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
r = sdma_v7_0_gfx_resume_instance(adev, i, true);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 18/29] drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (16 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 17/29] drm/amdgpu/sdma7: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 19/29] drm/amdgpu/jpeg2.5: " Alex Deucher
` (10 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
index f2058f263cc05..a4e10a54d7b5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
@@ -769,14 +769,22 @@ static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
{
int r;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
jpeg_v2_0_stop(ring->adev);
jpeg_v2_0_start(ring->adev);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 19/29] drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (17 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 18/29] drm/amdgpu/jpeg2: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 20/29] drm/amdgpu/jpeg3: " Alex Deucher
` (9 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
index 5eb86291ccdd6..8787958fb90e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
@@ -648,14 +648,22 @@ static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
{
int r;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
jpeg_v2_5_stop_inst(ring->adev, ring->me);
jpeg_v2_5_start_inst(ring->adev, ring->me);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 20/29] drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (18 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 19/29] drm/amdgpu/jpeg2.5: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 21/29] drm/amdgpu/jpeg4: " Alex Deucher
` (8 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
index ff826611b600e..ff50e46ec8c25 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
@@ -560,14 +560,22 @@ static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
{
int r;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
jpeg_v3_0_stop(ring->adev);
jpeg_v3_0_start(ring->adev);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 21/29] drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (19 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 20/29] drm/amdgpu/jpeg3: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 22/29] drm/amdgpu/jpeg4.0.3: " Alex Deucher
` (7 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
index 179dd420edb15..8d7371121d28c 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
@@ -728,14 +728,22 @@ static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(ring->adev))
return -EINVAL;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
jpeg_v4_0_stop(ring->adev);
jpeg_v4_0_start(ring->adev);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 22/29] drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (20 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 21/29] drm/amdgpu/jpeg4: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 23/29] drm/amdgpu/jpeg4.0.5: add queue reset Alex Deucher
` (6 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
index c956f424fbbf9..e177760f8508b 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
@@ -1151,14 +1151,22 @@ static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(ring->adev))
return -EOPNOTSUPP;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
jpeg_v4_0_3_core_stall_reset(ring);
jpeg_v4_0_3_start_jrbc(ring);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 23/29] drm/amdgpu/jpeg4.0.5: add queue reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (21 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 22/29] drm/amdgpu/jpeg4.0.3: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 24/29] drm/amdgpu/jpeg5: " Alex Deucher
` (5 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Add queue reset support for jpeg 4.0.5.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.
Untested.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c | 25 ++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c
index 974030a5c03c9..2bd2660b47d51 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c
@@ -767,6 +767,30 @@ static int jpeg_v4_0_5_process_interrupt(struct amdgpu_device *adev,
return 0;
}
+static int jpeg_v4_0_5_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
+{
+ int r;
+
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+ jpeg_v4_0_5_stop(ring->adev);
+ jpeg_v4_0_5_start(ring->adev);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
+ atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
+ return 0;
+}
+
static const struct amd_ip_funcs jpeg_v4_0_5_ip_funcs = {
.name = "jpeg_v4_0_5",
.early_init = jpeg_v4_0_5_early_init,
@@ -812,6 +836,7 @@ static const struct amdgpu_ring_funcs jpeg_v4_0_5_dec_ring_vm_funcs = {
.emit_wreg = jpeg_v2_0_dec_ring_emit_wreg,
.emit_reg_wait = jpeg_v2_0_dec_ring_emit_reg_wait,
.emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
+ .reset = jpeg_v4_0_5_ring_reset,
};
static void jpeg_v4_0_5_set_dec_ring_funcs(struct amdgpu_device *adev)
--
2.49.0
* [PATCH 24/29] drm/amdgpu/jpeg5: add queue reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (22 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 23/29] drm/amdgpu/jpeg4.0.5: add queue reset Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 25/29] drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset Alex Deucher
` (4 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Add queue reset support for jpeg 5.0.0.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.
Untested.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c | 28 ++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c
index 31d213ccbe0a8..975e2f58b8444 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c
@@ -644,6 +644,33 @@ static int jpeg_v5_0_0_process_interrupt(struct amdgpu_device *adev,
return 0;
}
+static int jpeg_v5_0_0_ring_reset(struct amdgpu_ring *ring,
+ struct amdgpu_job *job)
+{
+ int r;
+
+ if (amdgpu_sriov_vf(ring->adev))
+ return -EINVAL;
+
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+ jpeg_v5_0_0_stop(ring->adev);
+ jpeg_v5_0_0_start(ring->adev);
+ r = amdgpu_ring_test_ring(ring);
+ if (r)
+ return r;
+
+ dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
+ atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
+ return 0;
+}
+
static const struct amd_ip_funcs jpeg_v5_0_0_ip_funcs = {
.name = "jpeg_v5_0_0",
.early_init = jpeg_v5_0_0_early_init,
@@ -689,6 +716,7 @@ static const struct amdgpu_ring_funcs jpeg_v5_0_0_dec_ring_vm_funcs = {
.emit_wreg = jpeg_v4_0_3_dec_ring_emit_wreg,
.emit_reg_wait = jpeg_v4_0_3_dec_ring_emit_reg_wait,
.emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
+ .reset = jpeg_v5_0_0_ring_reset,
};
static void jpeg_v5_0_0_set_dec_ring_funcs(struct amdgpu_device *adev)
--
2.49.0
* [PATCH 25/29] drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (23 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 24/29] drm/amdgpu/jpeg5: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 26/29] drm/amdgpu/vcn4: " Alex Deucher
` (3 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
index ef9289f78a46a..ace6703c7677b 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
@@ -842,14 +842,22 @@ static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
if (amdgpu_sriov_vf(ring->adev))
return -EOPNOTSUPP;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
jpeg_v5_0_1_core_stall_reset(ring);
jpeg_v5_0_1_init_jrbc(ring);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 26/29] drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (24 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 25/29] drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 27/29] drm/amdgpu/vcn4.0.3: " Alex Deucher
` (2 subsequent siblings)
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
index d68bd82f8eab0..49545772fb630 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
@@ -1977,15 +1977,22 @@ static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
vcn_v4_0_stop(vinst);
vcn_v4_0_start(vinst);
-
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 27/29] drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (25 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 26/29] drm/amdgpu/vcn4: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 28/29] drm/amdgpu/vcn4.0.5: " Alex Deucher
2025-06-06 6:43 ` [PATCH 29/29] drm/amdgpu/vcn5: " Alex Deucher
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
index a9d8ae4ab109a..e3fd5291b5195 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
@@ -1608,6 +1608,8 @@ static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
vcn_inst = GET_INST(VCN, ring->me);
r = amdgpu_dpm_reset_vcn(adev, 1 << vcn_inst);
@@ -1621,12 +1623,19 @@ static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
adev->vcn.caps |= AMDGPU_VCN_CAPS(RRMT_ENABLED);
vcn_v4_0_3_hw_init_inst(vinst);
vcn_v4_0_3_start_dpg_mode(vinst, adev->vcn.inst[ring->me].indirect_sram);
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 28/29] drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (26 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 27/29] drm/amdgpu/vcn4.0.3: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 29/29] drm/amdgpu/vcn5: " Alex Deucher
28 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
index 93bc55756dcd6..2082336627d31 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
@@ -1475,15 +1475,23 @@ static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
vcn_v4_0_5_stop(vinst);
vcn_v4_0_5_start(vinst);
-
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* [PATCH 29/29] drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
` (27 preceding siblings ...)
2025-06-06 6:43 ` [PATCH 28/29] drm/amdgpu/vcn4.0.5: " Alex Deucher
@ 2025-06-06 6:43 ` Alex Deucher
2025-06-09 14:23 ` Sundararaju, Sathishkumar
28 siblings, 1 reply; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 6:43 UTC (permalink / raw)
To: amd-gfx, christian.koenig; +Cc: Alex Deucher
Re-emit the unprocessed state after resetting the queue.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
index d74c1862ac860..208b366c580da 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -1202,15 +1202,23 @@ static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
return -EOPNOTSUPP;
+ amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
+
vcn_v5_0_0_stop(vinst);
vcn_v5_0_0_start(vinst);
-
- r = amdgpu_ring_test_helper(ring);
+ r = amdgpu_ring_test_ring(ring);
if (r)
return r;
+
dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
- amdgpu_fence_driver_force_completion(ring);
+ /* signal the fence of the bad job */
+ amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
atomic_inc(&ring->adev->gpu_reset_counter);
+ r = amdgpu_ring_reemit_unprocessed_commands(ring);
+ if (r)
+ /* if we fail to reemit, force complete all fences */
+ amdgpu_fence_driver_force_completion(ring);
+
return 0;
}
--
2.49.0
* Re: [PATCH 02/29] drm/amdgpu/gfx7: drop reset_kgq
2025-06-06 6:43 ` [PATCH 02/29] drm/amdgpu/gfx7: drop reset_kgq Alex Deucher
@ 2025-06-06 11:33 ` Christian König
0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2025-06-06 11:33 UTC (permalink / raw)
To: Alex Deucher, amd-gfx
On 6/6/25 08:43, Alex Deucher wrote:
> It doesn't work reliably, and we have soft recovery and
> full adapter reset, so drop this.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com> for this one and the same for gfx8 and gfx9.
I think you can start pushing the first patches to amd-staging-drm-next; that leaves less to send out again.
Christian.
> ---
> drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 71 ---------------------------
> 1 file changed, 71 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
> index da0534ff1271a..2aa323dab34e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
> @@ -4884,76 +4884,6 @@ static void gfx_v7_0_emit_mem_sync_compute(struct amdgpu_ring *ring)
> amdgpu_ring_write(ring, 0x0000000A); /* poll interval */
> }
>
> -static void gfx_v7_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
> - int mem_space, int opt, uint32_t addr0,
> - uint32_t addr1, uint32_t ref, uint32_t mask,
> - uint32_t inv)
> -{
> - amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
> - amdgpu_ring_write(ring,
> - /* memory (1) or register (0) */
> - (WAIT_REG_MEM_MEM_SPACE(mem_space) |
> - WAIT_REG_MEM_OPERATION(opt) | /* wait */
> - WAIT_REG_MEM_FUNCTION(3) | /* equal */
> - WAIT_REG_MEM_ENGINE(eng_sel)));
> -
> - if (mem_space)
> - BUG_ON(addr0 & 0x3); /* Dword align */
> - amdgpu_ring_write(ring, addr0);
> - amdgpu_ring_write(ring, addr1);
> - amdgpu_ring_write(ring, ref);
> - amdgpu_ring_write(ring, mask);
> - amdgpu_ring_write(ring, inv); /* poll interval */
> -}
> -
> -static void gfx_v7_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
> - uint32_t val, uint32_t mask)
> -{
> - gfx_v7_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
> -}
> -
> -static int gfx_v7_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> -{
> - struct amdgpu_device *adev = ring->adev;
> - struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> - struct amdgpu_ring *kiq_ring = &kiq->ring;
> - unsigned long flags;
> - u32 tmp;
> - int r;
> -
> - if (amdgpu_sriov_vf(adev))
> - return -EINVAL;
> -
> - if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
> - return -EINVAL;
> -
> - spin_lock_irqsave(&kiq->ring_lock, flags);
> -
> - if (amdgpu_ring_alloc(kiq_ring, 5)) {
> - spin_unlock_irqrestore(&kiq->ring_lock, flags);
> - return -ENOMEM;
> - }
> -
> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
> - gfx_v7_0_ring_emit_wreg(kiq_ring, mmCP_VMID_RESET, tmp);
> - amdgpu_ring_commit(kiq_ring);
> -
> - spin_unlock_irqrestore(&kiq->ring_lock, flags);
> -
> - r = amdgpu_ring_test_ring(kiq_ring);
> - if (r)
> - return r;
> -
> - if (amdgpu_ring_alloc(ring, 7 + 12 + 5))
> - return -ENOMEM;
> - gfx_v7_0_ring_emit_fence_gfx(ring, ring->fence_drv.gpu_addr,
> - ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
> - gfx_v7_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
> - gfx_v7_0_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
> -
> - return amdgpu_ring_test_ring(ring);
> -}
> -
> static const struct amd_ip_funcs gfx_v7_0_ip_funcs = {
> .name = "gfx_v7_0",
> .early_init = gfx_v7_0_early_init,
> @@ -5003,7 +4933,6 @@ static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_gfx = {
> .emit_wreg = gfx_v7_0_ring_emit_wreg,
> .soft_recovery = gfx_v7_0_ring_soft_recovery,
> .emit_mem_sync = gfx_v7_0_emit_mem_sync,
> - .reset = gfx_v7_0_reset_kgq,
> };
>
> static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_compute = {
* Re: [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence
2025-06-06 6:43 ` [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence Alex Deucher
@ 2025-06-06 11:39 ` Christian König
2025-06-06 16:08 ` Alex Deucher
0 siblings, 1 reply; 43+ messages in thread
From: Christian König @ 2025-06-06 11:39 UTC (permalink / raw)
To: Alex Deucher, amd-gfx
On 6/6/25 08:43, Alex Deucher wrote:
> Use the amdgpu fence container so we can store additional
> data in the fence.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 30 +++++----------------
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 ++++-----
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 16 +++++++++++
> 6 files changed, 32 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 8e626f50b362e..f81608330a3d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1902,7 +1902,7 @@ static void amdgpu_ib_preempt_mark_partial_job(struct amdgpu_ring *ring)
> continue;
> }
> job = to_amdgpu_job(s_job);
> - if (preempted && (&job->hw_fence) == fence)
> + if (preempted && (&job->hw_fence.base) == fence)
> /* mark the job as preempted */
> job->preemption_status |= AMDGPU_IB_PREEMPTED;
> }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index ea565651f7459..8298e95e4543e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6375,7 +6375,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> *
> * job->base holds a reference to parent fence
> */
> - if (job && dma_fence_is_signaled(&job->hw_fence)) {
> + if (job && dma_fence_is_signaled(&job->hw_fence.base)) {
> job_signaled = true;
> dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
> goto skip_hw_reset;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 2f24a6aa13bf6..569e0e5373927 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -41,22 +41,6 @@
> #include "amdgpu_trace.h"
> #include "amdgpu_reset.h"
>
> -/*
> - * Fences mark an event in the GPUs pipeline and are used
> - * for GPU/CPU synchronization. When the fence is written,
> - * it is expected that all buffers associated with that fence
> - * are no longer in use by the associated ring on the GPU and
> - * that the relevant GPU caches have been flushed.
> - */
> -
> -struct amdgpu_fence {
> - struct dma_fence base;
> -
> - /* RB, DMA, etc. */
> - struct amdgpu_ring *ring;
> - ktime_t start_timestamp;
> -};
> -
Oh, that handling here is completely broken.
The MCBP muxer overwrites fields in the job because of this ^^.
I think that patch needs to be a bug fix that we even backport.
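To make the hazard concrete, a rough sketch of the up-cast problem, assuming some consumer (the muxer, for instance) treats every ring fence as an amdgpu_fence; the struct layouts are taken from the hunks below, the consumer code is illustrative only:

/* before this patch the job embeds a bare dma_fence */
struct amdgpu_job {
        struct drm_sched_job base;
        ...
        struct dma_fence hw_fence;      /* NOT wrapped in an amdgpu_fence */
        struct dma_fence *gang_submit;  /* sits right behind it */
        ...
};

/* a consumer that assumes the container layout */
struct amdgpu_fence *af = container_of(f, struct amdgpu_fence, base);
af->start_timestamp = ktime_get();      /* when f is the job-embedded fence,
                                         * this writes past hw_fence into the
                                         * job fields that follow it */

With hw_fence switched to struct amdgpu_fence, the ring and start_timestamp fields really exist inside the job and such writes land where intended.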
Regards,
Christian.
> static struct kmem_cache *amdgpu_fence_slab;
>
> int amdgpu_fence_slab_init(void)
> @@ -151,12 +135,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
> am_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
> if (am_fence == NULL)
> return -ENOMEM;
> - fence = &am_fence->base;
> - am_fence->ring = ring;
> } else {
> /* take use of job-embedded fence */
> - fence = &job->hw_fence;
> + am_fence = &job->hw_fence;
> }
> + fence = &am_fence->base;
> + am_fence->ring = ring;
>
> seq = ++ring->fence_drv.sync_seq;
> if (job && job->job_run_counter) {
> @@ -718,7 +702,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
> * it right here or we won't be able to track them in fence_drv
> * and they will remain unsignaled during sa_bo free.
> */
> - job = container_of(old, struct amdgpu_job, hw_fence);
> + job = container_of(old, struct amdgpu_job, hw_fence.base);
> if (!job->base.s_fence && !dma_fence_is_signaled(old))
> dma_fence_signal(old);
> RCU_INIT_POINTER(*ptr, NULL);
> @@ -780,7 +764,7 @@ static const char *amdgpu_fence_get_timeline_name(struct dma_fence *f)
>
> static const char *amdgpu_job_fence_get_timeline_name(struct dma_fence *f)
> {
> - struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
> + struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
>
> return (const char *)to_amdgpu_ring(job->base.sched)->name;
> }
> @@ -810,7 +794,7 @@ static bool amdgpu_fence_enable_signaling(struct dma_fence *f)
> */
> static bool amdgpu_job_fence_enable_signaling(struct dma_fence *f)
> {
> - struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
> + struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
>
> if (!timer_pending(&to_amdgpu_ring(job->base.sched)->fence_drv.fallback_timer))
> amdgpu_fence_schedule_fallback(to_amdgpu_ring(job->base.sched));
> @@ -845,7 +829,7 @@ static void amdgpu_job_fence_free(struct rcu_head *rcu)
> struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
>
> /* free job if fence has a parent job */
> - kfree(container_of(f, struct amdgpu_job, hw_fence));
> + kfree(container_of(f, struct amdgpu_job, hw_fence.base));
> }
>
> /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index acb21fc8b3ce5..ddb9d3269357c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -272,8 +272,8 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
> /* Check if any fences where initialized */
> if (job->base.s_fence && job->base.s_fence->finished.ops)
> f = &job->base.s_fence->finished;
> - else if (job->hw_fence.ops)
> - f = &job->hw_fence;
> + else if (job->hw_fence.base.ops)
> + f = &job->hw_fence.base;
> else
> f = NULL;
>
> @@ -290,10 +290,10 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
> amdgpu_sync_free(&job->explicit_sync);
>
> /* only put the hw fence if has embedded fence */
> - if (!job->hw_fence.ops)
> + if (!job->hw_fence.base.ops)
> kfree(job);
> else
> - dma_fence_put(&job->hw_fence);
> + dma_fence_put(&job->hw_fence.base);
> }
>
> void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
> @@ -322,10 +322,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
> if (job->gang_submit != &job->base.s_fence->scheduled)
> dma_fence_put(job->gang_submit);
>
> - if (!job->hw_fence.ops)
> + if (!job->hw_fence.base.ops)
> kfree(job);
> else
> - dma_fence_put(&job->hw_fence);
> + dma_fence_put(&job->hw_fence.base);
> }
>
> struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index f2c049129661f..931fed8892cc1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -48,7 +48,7 @@ struct amdgpu_job {
> struct drm_sched_job base;
> struct amdgpu_vm *vm;
> struct amdgpu_sync explicit_sync;
> - struct dma_fence hw_fence;
> + struct amdgpu_fence hw_fence;
> struct dma_fence *gang_submit;
> uint32_t preamble_status;
> uint32_t preemption_status;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index b95b471107692..e1f25218943a4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -127,6 +127,22 @@ struct amdgpu_fence_driver {
> struct dma_fence **fences;
> };
>
> +/*
> + * Fences mark an event in the GPUs pipeline and are used
> + * for GPU/CPU synchronization. When the fence is written,
> + * it is expected that all buffers associated with that fence
> + * are no longer in use by the associated ring on the GPU and
> + * that the relevant GPU caches have been flushed.
> + */
> +
> +struct amdgpu_fence {
> + struct dma_fence base;
> +
> + /* RB, DMA, etc. */
> + struct amdgpu_ring *ring;
> + ktime_t start_timestamp;
> +};
> +
> extern const struct drm_sched_backend_ops amdgpu_sched_ops;
>
> void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-06 6:43 ` [PATCH 06/29] drm/amdgpu: update ring reset function signature Alex Deucher
@ 2025-06-06 11:41 ` Christian König
2025-06-06 16:00 ` Alex Deucher
0 siblings, 1 reply; 43+ messages in thread
From: Christian König @ 2025-06-06 11:41 UTC (permalink / raw)
To: Alex Deucher, amd-gfx
On 6/6/25 08:43, Alex Deucher wrote:
> Going forward, we'll need more than just the vmid. Everything
> we need is currently in the amdgpu job structure, so just
> pass that in.
Please don't. The job is just a container for the submission; it should not be part of any reset handling.
What information is actually needed here?
Regards,
Christian.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
> drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
> 22 files changed, 53 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index ddb9d3269357c..80d4dfebde24f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> if (is_guilty)
> dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>
> - r = amdgpu_ring_reset(ring, job->vmid);
> + r = amdgpu_ring_reset(ring, job);
> if (!r) {
> if (amdgpu_ring_sched_ready(ring))
> drm_sched_stop(&ring->sched, s_job);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index e1f25218943a4..ab5402d7ce9c8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
> void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
> void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
> void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
> - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
> + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
> void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
> bool (*is_guilty)(struct amdgpu_ring *ring);
> };
> @@ -425,7 +425,7 @@ struct amdgpu_ring {
> #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
> #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
> #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
> -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
> +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
>
> unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
> int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 75ea071744eb5..c58e7040c732a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
> amdgpu_ring_insert_nop(ring, num_nop - 1);
> }
>
> -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>
> addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
> offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
> + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
> if (ring->pipe == 0)
> tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
> else
> @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> }
>
> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> - unsigned int vmid)
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index afd6d59164bfa..0ee7bdd509741 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
> return 0;
> }
>
> -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> int r;
> @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> if (amdgpu_sriov_vf(adev))
> return -EINVAL;
>
> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
> if (r) {
>
> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
> @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
> return 0;
> }
>
> -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> int r = 0;
> @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> if (amdgpu_sriov_vf(adev))
> return -EINVAL;
>
> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
> if (r) {
> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
> r = gfx_v11_0_reset_compute_pipe(ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> index 1234c8d64e20d..a26417d53411b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
> return 0;
> }
>
> -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> int r;
> @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> if (amdgpu_sriov_vf(adev))
> return -EINVAL;
>
> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
> if (r) {
> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
> r = gfx_v12_reset_gfx_pipe(ring);
> @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
> return 0;
> }
>
> -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> int r;
> @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> if (amdgpu_sriov_vf(adev))
> return -EINVAL;
>
> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
> if (r) {
> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
> r = gfx_v12_0_reset_compute_pipe(ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index d50e125fd3e0d..5e650cc5fcb26 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
> }
>
> static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
> - unsigned int vmid)
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index c233edf605694..a7dadff3dca31 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
> }
>
> static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
> - unsigned int vmid)
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> index 4cde8a8bcc837..6cd3fbe00d6b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
> return 0;
> }
>
> -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> jpeg_v2_0_stop(ring->adev);
> jpeg_v2_0_start(ring->adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> index 8b39e114f3be1..8ed41868f6c32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
> return 0;
> }
>
> -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> jpeg_v2_5_stop_inst(ring->adev, ring->me);
> jpeg_v2_5_start_inst(ring->adev, ring->me);
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> index 2f8510c2986b9..3512fbb543301 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
> return 0;
> }
>
> -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> jpeg_v3_0_stop(ring->adev);
> jpeg_v3_0_start(ring->adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> index f17ec5414fd69..c8efeaf0a2a69 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
> return 0;
> }
>
> -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> if (amdgpu_sriov_vf(ring->adev))
> return -EINVAL;
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> index 79e342d5ab28d..8b07c3651c579 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
> }
>
> -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> if (amdgpu_sriov_vf(ring->adev))
> return -EOPNOTSUPP;
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> index 3b6f65a256464..0a21a13e19360 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
> }
>
> -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> if (amdgpu_sriov_vf(ring->adev))
> return -EOPNOTSUPP;
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> index 9c169112a5e7b..ffd67d51b335f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
> return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
> }
>
> -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> u32 id = GET_INST(SDMA0, ring->me);
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> index 9505ae96fbecc..46affee1c2da0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
> return 0;
> }
>
> -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> u32 inst_id = ring->me;
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index a6e612b4a8928..581e75b7d01a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
> return -ETIMEDOUT;
> }
>
> -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> u32 inst_id = ring->me;
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> index 5a70ae17be04e..d9866009edbfc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
> return r;
> }
>
> -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> int i, r;
> @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> return -EINVAL;
> }
>
> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
> if (r)
> return r;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> index ad47d0bdf7775..c546e73642296 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
> return false;
> }
>
> -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> int i, r;
> @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> return -EINVAL;
> }
>
> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
> if (r)
> return r;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> index b5071f77f78d2..47a0deceff433 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
> return 0;
> }
>
> -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> index 5a33140f57235..d961a824d2098 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
> }
> }
>
> -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> int r = 0;
> int vcn_inst;
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> index 16ade84facc78..10bd714592278 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
> }
> }
>
> -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> index f8e3f0b882da5..7e6a7ead9a086 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
> }
> }
>
> -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
> + struct amdgpu_job *job)
> {
> struct amdgpu_device *adev = ring->adev;
> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-06 11:41 ` Christian König
@ 2025-06-06 16:00 ` Alex Deucher
2025-06-09 12:43 ` Sundararaju, Sathishkumar
2025-06-10 8:24 ` Christian König
0 siblings, 2 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 16:00 UTC (permalink / raw)
To: Christian König; +Cc: Alex Deucher, amd-gfx
On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 6/6/25 08:43, Alex Deucher wrote:
> > Going forward, we'll need more than just the vmid. Everything
> > we need is currently in the amdgpu job structure, so just
> > pass that in.
>
> Please don't. The job is just a container for the submission; it should not be part of any reset handling.
>
> What information is actually needed here?
We need job->vmid, job->base.s_fence->finished, job->hw_fence.
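Roughly, the per-IP callback could consume those like this (just a sketch of
which fields get used, not the code from this series; the example_*() helpers
are hypothetical placeholders for the hw reset and re-emit helpers added later
in the set):

static int example_ring_reset(struct amdgpu_ring *ring,
                              struct amdgpu_job *job)
{
        /* the scheduler fence carries the guilty (-ETIME) status */
        bool guilty = job->base.s_fence->finished.error == -ETIME;
        /* the hw fence seq is what gets force-completed and verified */
        u64 bad_seq = job->hw_fence.base.seqno;
        int r;

        /* hypothetical per-IP hw reset keyed off the job's vmid */
        r = example_hw_queue_reset(ring, job->vmid);
        if (r)
                return r;

        /* signal the bad seq, then re-emit the remaining non-guilty state */
        return example_reemit_unprocessed(ring, bad_seq, guilty);
}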
Alex
>
> Regards,
> Christian.
>
>
> >
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
> > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
> > drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
> > drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
> > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
> > drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
> > drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
> > 22 files changed, 53 insertions(+), 33 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index ddb9d3269357c..80d4dfebde24f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > if (is_guilty)
> > dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> >
> > - r = amdgpu_ring_reset(ring, job->vmid);
> > + r = amdgpu_ring_reset(ring, job);
> > if (!r) {
> > if (amdgpu_ring_sched_ready(ring))
> > drm_sched_stop(&ring->sched, s_job);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index e1f25218943a4..ab5402d7ce9c8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
> > void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
> > void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
> > void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
> > - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
> > + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
> > void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
> > bool (*is_guilty)(struct amdgpu_ring *ring);
> > };
> > @@ -425,7 +425,7 @@ struct amdgpu_ring {
> > #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
> > #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
> > #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
> > -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
> > +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
> >
> > unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
> > int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > index 75ea071744eb5..c58e7040c732a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
> > amdgpu_ring_insert_nop(ring, num_nop - 1);
> > }
> >
> > -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> > @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >
> > addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
> > offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
> > - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
> > + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
> > if (ring->pipe == 0)
> > tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
> > else
> > @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > }
> >
> > static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> > - unsigned int vmid)
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> > index afd6d59164bfa..0ee7bdd509741 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> > @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
> > return 0;
> > }
> >
> > -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > int r;
> > @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > if (amdgpu_sriov_vf(adev))
> > return -EINVAL;
> >
> > - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
> > + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
> > if (r) {
> >
> > dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
> > @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
> > return 0;
> > }
> >
> > -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > int r = 0;
> > @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> > if (amdgpu_sriov_vf(adev))
> > return -EINVAL;
> >
> > - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
> > + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
> > if (r) {
> > dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
> > r = gfx_v11_0_reset_compute_pipe(ring);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> > index 1234c8d64e20d..a26417d53411b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> > @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
> > return 0;
> > }
> >
> > -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > int r;
> > @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > if (amdgpu_sriov_vf(adev))
> > return -EINVAL;
> >
> > - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
> > + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
> > if (r) {
> > dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
> > r = gfx_v12_reset_gfx_pipe(ring);
> > @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
> > return 0;
> > }
> >
> > -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > int r;
> > @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> > if (amdgpu_sriov_vf(adev))
> > return -EINVAL;
> >
> > - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
> > + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
> > if (r) {
> > dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
> > r = gfx_v12_0_reset_compute_pipe(ring);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index d50e125fd3e0d..5e650cc5fcb26 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
> > }
> >
> > static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
> > - unsigned int vmid)
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > index c233edf605694..a7dadff3dca31 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
> > }
> >
> > static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
> > - unsigned int vmid)
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
> > diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> > index 4cde8a8bcc837..6cd3fbe00d6b9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> > @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
> > return 0;
> > }
> >
> > -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > jpeg_v2_0_stop(ring->adev);
> > jpeg_v2_0_start(ring->adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> > index 8b39e114f3be1..8ed41868f6c32 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> > @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
> > return 0;
> > }
> >
> > -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > jpeg_v2_5_stop_inst(ring->adev, ring->me);
> > jpeg_v2_5_start_inst(ring->adev, ring->me);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> > index 2f8510c2986b9..3512fbb543301 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> > @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
> > return 0;
> > }
> >
> > -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > jpeg_v3_0_stop(ring->adev);
> > jpeg_v3_0_start(ring->adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> > index f17ec5414fd69..c8efeaf0a2a69 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> > @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
> > return 0;
> > }
> >
> > -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > if (amdgpu_sriov_vf(ring->adev))
> > return -EINVAL;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> > index 79e342d5ab28d..8b07c3651c579 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> > @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
> > WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
> > }
> >
> > -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > if (amdgpu_sriov_vf(ring->adev))
> > return -EOPNOTSUPP;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> > index 3b6f65a256464..0a21a13e19360 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> > @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
> > WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
> > }
> >
> > -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > if (amdgpu_sriov_vf(ring->adev))
> > return -EOPNOTSUPP;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > index 9c169112a5e7b..ffd67d51b335f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> > @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
> > return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
> > }
> >
> > -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > u32 id = GET_INST(SDMA0, ring->me);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > index 9505ae96fbecc..46affee1c2da0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
> > return 0;
> > }
> >
> > -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > u32 inst_id = ring->me;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > index a6e612b4a8928..581e75b7d01a8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
> > return -ETIMEDOUT;
> > }
> >
> > -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > u32 inst_id = ring->me;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> > index 5a70ae17be04e..d9866009edbfc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> > @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
> > return r;
> > }
> >
> > -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > int i, r;
> > @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > return -EINVAL;
> > }
> >
> > - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
> > + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
> > if (r)
> > return r;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> > index ad47d0bdf7775..c546e73642296 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> > @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
> > return false;
> > }
> >
> > -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > int i, r;
> > @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> > return -EINVAL;
> > }
> >
> > - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
> > + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
> > if (r)
> > return r;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> > index b5071f77f78d2..47a0deceff433 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> > @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
> > return 0;
> > }
> >
> > -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> > index 5a33140f57235..d961a824d2098 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> > @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
> > }
> > }
> >
> > -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > int r = 0;
> > int vcn_inst;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> > index 16ade84facc78..10bd714592278 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> > @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
> > }
> > }
> >
> > -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> > index f8e3f0b882da5..7e6a7ead9a086 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> > @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
> > }
> > }
> >
> > -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> > +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
> > + struct amdgpu_job *job)
> > {
> > struct amdgpu_device *adev = ring->adev;
> > struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence
2025-06-06 11:39 ` Christian König
@ 2025-06-06 16:08 ` Alex Deucher
2025-06-10 8:23 ` Christian König
0 siblings, 1 reply; 43+ messages in thread
From: Alex Deucher @ 2025-06-06 16:08 UTC (permalink / raw)
To: Christian König; +Cc: Alex Deucher, amd-gfx
On Fri, Jun 6, 2025 at 7:39 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 6/6/25 08:43, Alex Deucher wrote:
> > Use the amdgpu fence container so we can store additional
> > data in the fence.
> >
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 30 +++++----------------
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 ++++-----
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 16 +++++++++++
> > 6 files changed, 32 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index 8e626f50b362e..f81608330a3d0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1902,7 +1902,7 @@ static void amdgpu_ib_preempt_mark_partial_job(struct amdgpu_ring *ring)
> > continue;
> > }
> > job = to_amdgpu_job(s_job);
> > - if (preempted && (&job->hw_fence) == fence)
> > + if (preempted && (&job->hw_fence.base) == fence)
> > /* mark the job as preempted */
> > job->preemption_status |= AMDGPU_IB_PREEMPTED;
> > }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index ea565651f7459..8298e95e4543e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -6375,7 +6375,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > *
> > * job->base holds a reference to parent fence
> > */
> > - if (job && dma_fence_is_signaled(&job->hw_fence)) {
> > + if (job && dma_fence_is_signaled(&job->hw_fence.base)) {
> > job_signaled = true;
> > dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
> > goto skip_hw_reset;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index 2f24a6aa13bf6..569e0e5373927 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -41,22 +41,6 @@
> > #include "amdgpu_trace.h"
> > #include "amdgpu_reset.h"
> >
> > -/*
> > - * Fences mark an event in the GPUs pipeline and are used
> > - * for GPU/CPU synchronization. When the fence is written,
> > - * it is expected that all buffers associated with that fence
> > - * are no longer in use by the associated ring on the GPU and
> > - * that the relevant GPU caches have been flushed.
> > - */
> > -
> > -struct amdgpu_fence {
> > - struct dma_fence base;
> > -
> > - /* RB, DMA, etc. */
> > - struct amdgpu_ring *ring;
> > - ktime_t start_timestamp;
> > -};
> > -
>
> Oh, that handling here is completely broken.
>
> The MCBP muxer overwrites fields in the job because of this ^^.
>
> I think that patch needs to be a bug fix that we even backport.
What is broken in the muxer code?
Alex
>
> Regards,
> Christian.
>
> > static struct kmem_cache *amdgpu_fence_slab;
> >
> > int amdgpu_fence_slab_init(void)
> > @@ -151,12 +135,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
> > am_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
> > if (am_fence == NULL)
> > return -ENOMEM;
> > - fence = &am_fence->base;
> > - am_fence->ring = ring;
> > } else {
> > /* take use of job-embedded fence */
> > - fence = &job->hw_fence;
> > + am_fence = &job->hw_fence;
> > }
> > + fence = &am_fence->base;
> > + am_fence->ring = ring;
> >
> > seq = ++ring->fence_drv.sync_seq;
> > if (job && job->job_run_counter) {
> > @@ -718,7 +702,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
> > * it right here or we won't be able to track them in fence_drv
> > * and they will remain unsignaled during sa_bo free.
> > */
> > - job = container_of(old, struct amdgpu_job, hw_fence);
> > + job = container_of(old, struct amdgpu_job, hw_fence.base);
> > if (!job->base.s_fence && !dma_fence_is_signaled(old))
> > dma_fence_signal(old);
> > RCU_INIT_POINTER(*ptr, NULL);
> > @@ -780,7 +764,7 @@ static const char *amdgpu_fence_get_timeline_name(struct dma_fence *f)
> >
> > static const char *amdgpu_job_fence_get_timeline_name(struct dma_fence *f)
> > {
> > - struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
> > + struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
> >
> > return (const char *)to_amdgpu_ring(job->base.sched)->name;
> > }
> > @@ -810,7 +794,7 @@ static bool amdgpu_fence_enable_signaling(struct dma_fence *f)
> > */
> > static bool amdgpu_job_fence_enable_signaling(struct dma_fence *f)
> > {
> > - struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
> > + struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
> >
> > if (!timer_pending(&to_amdgpu_ring(job->base.sched)->fence_drv.fallback_timer))
> > amdgpu_fence_schedule_fallback(to_amdgpu_ring(job->base.sched));
> > @@ -845,7 +829,7 @@ static void amdgpu_job_fence_free(struct rcu_head *rcu)
> > struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
> >
> > /* free job if fence has a parent job */
> > - kfree(container_of(f, struct amdgpu_job, hw_fence));
> > + kfree(container_of(f, struct amdgpu_job, hw_fence.base));
> > }
> >
> > /**
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index acb21fc8b3ce5..ddb9d3269357c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -272,8 +272,8 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
> > /* Check if any fences where initialized */
> > if (job->base.s_fence && job->base.s_fence->finished.ops)
> > f = &job->base.s_fence->finished;
> > - else if (job->hw_fence.ops)
> > - f = &job->hw_fence;
> > + else if (job->hw_fence.base.ops)
> > + f = &job->hw_fence.base;
> > else
> > f = NULL;
> >
> > @@ -290,10 +290,10 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
> > amdgpu_sync_free(&job->explicit_sync);
> >
> > /* only put the hw fence if has embedded fence */
> > - if (!job->hw_fence.ops)
> > + if (!job->hw_fence.base.ops)
> > kfree(job);
> > else
> > - dma_fence_put(&job->hw_fence);
> > + dma_fence_put(&job->hw_fence.base);
> > }
> >
> > void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
> > @@ -322,10 +322,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
> > if (job->gang_submit != &job->base.s_fence->scheduled)
> > dma_fence_put(job->gang_submit);
> >
> > - if (!job->hw_fence.ops)
> > + if (!job->hw_fence.base.ops)
> > kfree(job);
> > else
> > - dma_fence_put(&job->hw_fence);
> > + dma_fence_put(&job->hw_fence.base);
> > }
> >
> > struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> > index f2c049129661f..931fed8892cc1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> > @@ -48,7 +48,7 @@ struct amdgpu_job {
> > struct drm_sched_job base;
> > struct amdgpu_vm *vm;
> > struct amdgpu_sync explicit_sync;
> > - struct dma_fence hw_fence;
> > + struct amdgpu_fence hw_fence;
> > struct dma_fence *gang_submit;
> > uint32_t preamble_status;
> > uint32_t preemption_status;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index b95b471107692..e1f25218943a4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > @@ -127,6 +127,22 @@ struct amdgpu_fence_driver {
> > struct dma_fence **fences;
> > };
> >
> > +/*
> > + * Fences mark an event in the GPUs pipeline and are used
> > + * for GPU/CPU synchronization. When the fence is written,
> > + * it is expected that all buffers associated with that fence
> > + * are no longer in use by the associated ring on the GPU and
> > + * that the relevant GPU caches have been flushed.
> > + */
> > +
> > +struct amdgpu_fence {
> > + struct dma_fence base;
> > +
> > + /* RB, DMA, etc. */
> > + struct amdgpu_ring *ring;
> > + ktime_t start_timestamp;
> > +};
> > +
> > extern const struct drm_sched_backend_ops amdgpu_sched_ops;
> >
> > void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-06 16:00 ` Alex Deucher
@ 2025-06-09 12:43 ` Sundararaju, Sathishkumar
2025-06-09 13:33 ` Alex Deucher
2025-06-10 8:24 ` Christian König
1 sibling, 1 reply; 43+ messages in thread
From: Sundararaju, Sathishkumar @ 2025-06-09 12:43 UTC (permalink / raw)
To: Alex Deucher, Christian König; +Cc: Alex Deucher, amd-gfx
On 6/6/2025 9:30 PM, Alex Deucher wrote:
> On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
>> On 6/6/25 08:43, Alex Deucher wrote:
>>> Going forward, we'll need more than just the vmid. Everything
>>> we need is currently in the amdgpu job structure, so just
>>> pass that in.
>> Please don't. The job is just a container for the submission; it should not be part of any reset handling.
>>
>> What information is actually needed here?
> We need job->vmid, job->base.s_fence->finished, job->hw_fence.
There's fuller, IP-specific reset control possible with the job passed
to the reset callback and the fence/guilty handling moved there.
I'm wondering if I can also try to enable reset on VCN non-unified
queues; this patch series has relevant examples and makes it possible
to handle it all in the reset callback itself. I can try that on top
of these changes.
Regards,
Sathish
>
> Alex
>
>> Regards,
>> Christian.
>>
>>
>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
>>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
>>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
>>> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
>>> drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
>>> 22 files changed, 53 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index ddb9d3269357c..80d4dfebde24f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>> if (is_guilty)
>>> dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>
>>> - r = amdgpu_ring_reset(ring, job->vmid);
>>> + r = amdgpu_ring_reset(ring, job);
>>> if (!r) {
>>> if (amdgpu_ring_sched_ready(ring))
>>> drm_sched_stop(&ring->sched, s_job);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index e1f25218943a4..ab5402d7ce9c8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
>>> void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
>>> void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
>>> void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
>>> - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
>>> + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
>>> void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
>>> bool (*is_guilty)(struct amdgpu_ring *ring);
>>> };
>>> @@ -425,7 +425,7 @@ struct amdgpu_ring {
>>> #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
>>> #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
>>> #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
>>> -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
>>> +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
>>>
>>> unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
>>> int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>> index 75ea071744eb5..c58e7040c732a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>> @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>> amdgpu_ring_insert_nop(ring, num_nop - 1);
>>> }
>>>
>>> -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>> @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>
>>> addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
>>> offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
>>> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
>>> + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
>>> if (ring->pipe == 0)
>>> tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
>>> else
>>> @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> }
>>>
>>> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
>>> - unsigned int vmid)
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>> index afd6d59164bfa..0ee7bdd509741 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>> @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r;
>>> @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>> if (r) {
>>>
>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>> @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r = 0;
>>> @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>> if (r) {
>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>> r = gfx_v11_0_reset_compute_pipe(ring);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>> index 1234c8d64e20d..a26417d53411b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>> @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r;
>>> @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>> if (r) {
>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>> r = gfx_v12_reset_gfx_pipe(ring);
>>> @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r;
>>> @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>> if (r) {
>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>> r = gfx_v12_0_reset_compute_pipe(ring);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> index d50e125fd3e0d..5e650cc5fcb26 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>> }
>>>
>>> static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
>>> - unsigned int vmid)
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>> index c233edf605694..a7dadff3dca31 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>> @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
>>> }
>>>
>>> static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
>>> - unsigned int vmid)
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>> index 4cde8a8bcc837..6cd3fbe00d6b9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>> @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> jpeg_v2_0_stop(ring->adev);
>>> jpeg_v2_0_start(ring->adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>> index 8b39e114f3be1..8ed41868f6c32 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>> @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> jpeg_v2_5_stop_inst(ring->adev, ring->me);
>>> jpeg_v2_5_start_inst(ring->adev, ring->me);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>> index 2f8510c2986b9..3512fbb543301 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>> @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> jpeg_v3_0_stop(ring->adev);
>>> jpeg_v3_0_start(ring->adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>> index f17ec5414fd69..c8efeaf0a2a69 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>> @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> if (amdgpu_sriov_vf(ring->adev))
>>> return -EINVAL;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> index 79e342d5ab28d..8b07c3651c579 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>> }
>>>
>>> -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> if (amdgpu_sriov_vf(ring->adev))
>>> return -EOPNOTSUPP;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>> index 3b6f65a256464..0a21a13e19360 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>> @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>> }
>>>
>>> -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> if (amdgpu_sriov_vf(ring->adev))
>>> return -EOPNOTSUPP;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>> index 9c169112a5e7b..ffd67d51b335f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>> @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
>>> return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
>>> }
>>>
>>> -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> u32 id = GET_INST(SDMA0, ring->me);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>> index 9505ae96fbecc..46affee1c2da0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>> @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
>>> return 0;
>>> }
>>>
>>> -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> u32 inst_id = ring->me;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>> index a6e612b4a8928..581e75b7d01a8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>> @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
>>> return -ETIMEDOUT;
>>> }
>>>
>>> -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> u32 inst_id = ring->me;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>> index 5a70ae17be04e..d9866009edbfc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>> @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
>>> return r;
>>> }
>>>
>>> -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int i, r;
>>> @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> return -EINVAL;
>>> }
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>> if (r)
>>> return r;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>> index ad47d0bdf7775..c546e73642296 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>> @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
>>> return false;
>>> }
>>>
>>> -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int i, r;
>>> @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> return -EINVAL;
>>> }
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>> if (r)
>>> return r;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> index b5071f77f78d2..47a0deceff433 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
>>> return 0;
>>> }
>>>
>>> -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>> index 5a33140f57235..d961a824d2098 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>> @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>> }
>>> }
>>>
>>> -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> int r = 0;
>>> int vcn_inst;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>> index 16ade84facc78..10bd714592278 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>> @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>> }
>>> }
>>>
>>> -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>> index f8e3f0b882da5..7e6a7ead9a086 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>> @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>> }
>>> }
>>>
>>> -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-09 12:43 ` Sundararaju, Sathishkumar
@ 2025-06-09 13:33 ` Alex Deucher
0 siblings, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-09 13:33 UTC (permalink / raw)
To: Sundararaju, Sathishkumar; +Cc: Christian König, Alex Deucher, amd-gfx
On Mon, Jun 9, 2025 at 8:43 AM Sundararaju, Sathishkumar
<sasundar@amd.com> wrote:
>
>
>
> On 6/6/2025 9:30 PM, Alex Deucher wrote:
> > On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
> >> On 6/6/25 08:43, Alex Deucher wrote:
> >>> Going forward, we'll need more than just the vmid. Everything
> >>> we need is currently in the amdgpu job structure, so just
> >>> pass that in.
> >> Please don't. The job is just a container for the submission; it should not be part of any reset handling.
> >>
> >> What information is actually needed here?
> > We need job->vmid, job->base.s_fence->finished, job->hw_fence.
> There's fuller, IP-specific reset control possible with the job passed
> to the reset callback and the fence/guilty handling moved there.
> I'm wondering if I can also try to enable reset on VCN non-unified
> queues; this patch series has relevant examples and makes it possible
> to handle it all in the reset callback itself. I can try that on top
> of these changes.
Yes, older VCNs with multiple rings per engine could do something
similar to SDMA 4.x-5.x, where we reset the instance so all rings on
that instance get reset.
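As a rough illustration of what that could look like (not code from this
series; the example_vcn_*() stop/start helpers are hypothetical stand-ins
for the per-IP ones):

static int example_vcn_inst_reset(struct amdgpu_ring *ring,
                                  struct amdgpu_job *job)
{
        struct amdgpu_device *adev = ring->adev;
        int r;

        /* whole-instance stop/start, like the SDMA 4.x/5.x queue resets */
        r = example_vcn_stop_inst(adev, ring->me);
        if (r)
                return r;

        r = example_vcn_start_inst(adev, ring->me);
        if (r)
                return r;

        /*
         * Every ring on instance ring->me lost its state with the reset,
         * so each of them (dec and enc) would need its unprocessed,
         * non-guilty jobs re-emitted with the helpers from this series.
         */
        return 0;
}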
Alex
>
> Regards,
> Sathish
> >
> > Alex
> >
> >> Regards,
> >> Christian.
> >>
> >>
> >>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >>> ---
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
> >>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
> >>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
> >>> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
> >>> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
> >>> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
> >>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
> >>> drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
> >>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
> >>> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
> >>> 22 files changed, 53 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index ddb9d3269357c..80d4dfebde24f 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>> if (is_guilty)
> >>> dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> >>>
> >>> - r = amdgpu_ring_reset(ring, job->vmid);
> >>> + r = amdgpu_ring_reset(ring, job);
> >>> if (!r) {
> >>> if (amdgpu_ring_sched_ready(ring))
> >>> drm_sched_stop(&ring->sched, s_job);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> index e1f25218943a4..ab5402d7ce9c8 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
> >>> void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
> >>> void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
> >>> void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
> >>> - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
> >>> + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
> >>> void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
> >>> bool (*is_guilty)(struct amdgpu_ring *ring);
> >>> };
> >>> @@ -425,7 +425,7 @@ struct amdgpu_ring {
> >>> #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
> >>> #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
> >>> #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
> >>> -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
> >>> +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
> >>>
> >>> unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
> >>> int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> index 75ea071744eb5..c58e7040c732a 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
> >>> amdgpu_ring_insert_nop(ring, num_nop - 1);
> >>> }
> >>>
> >>> -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> >>> @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>>
> >>> addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
> >>> offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
> >>> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
> >>> + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
> >>> if (ring->pipe == 0)
> >>> tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
> >>> else
> >>> @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> }
> >>>
> >>> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> >>> - unsigned int vmid)
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> >>> index afd6d59164bfa..0ee7bdd509741 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> >>> @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
> >>> return 0;
> >>> }
> >>>
> >>> -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> int r;
> >>> @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> if (amdgpu_sriov_vf(adev))
> >>> return -EINVAL;
> >>>
> >>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
> >>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
> >>> if (r) {
> >>>
> >>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
> >>> @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
> >>> return 0;
> >>> }
> >>>
> >>> -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> int r = 0;
> >>> @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> if (amdgpu_sriov_vf(adev))
> >>> return -EINVAL;
> >>>
> >>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
> >>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
> >>> if (r) {
> >>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
> >>> r = gfx_v11_0_reset_compute_pipe(ring);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> >>> index 1234c8d64e20d..a26417d53411b 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> >>> @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
> >>> return 0;
> >>> }
> >>>
> >>> -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> int r;
> >>> @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> if (amdgpu_sriov_vf(adev))
> >>> return -EINVAL;
> >>>
> >>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
> >>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
> >>> if (r) {
> >>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
> >>> r = gfx_v12_reset_gfx_pipe(ring);
> >>> @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
> >>> return 0;
> >>> }
> >>>
> >>> -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> int r;
> >>> @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
> >>> if (amdgpu_sriov_vf(adev))
> >>> return -EINVAL;
> >>>
> >>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
> >>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
> >>> if (r) {
> >>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
> >>> r = gfx_v12_0_reset_compute_pipe(ring);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> >>> index d50e125fd3e0d..5e650cc5fcb26 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> >>> @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
> >>> }
> >>>
> >>> static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
> >>> - unsigned int vmid)
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> >>> index c233edf605694..a7dadff3dca31 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> >>> @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
> >>> }
> >>>
> >>> static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
> >>> - unsigned int vmid)
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> >>> index 4cde8a8bcc837..6cd3fbe00d6b9 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
> >>> @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
> >>> return 0;
> >>> }
> >>>
> >>> -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> jpeg_v2_0_stop(ring->adev);
> >>> jpeg_v2_0_start(ring->adev);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> >>> index 8b39e114f3be1..8ed41868f6c32 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
> >>> @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
> >>> return 0;
> >>> }
> >>>
> >>> -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> jpeg_v2_5_stop_inst(ring->adev, ring->me);
> >>> jpeg_v2_5_start_inst(ring->adev, ring->me);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> >>> index 2f8510c2986b9..3512fbb543301 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
> >>> @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
> >>> return 0;
> >>> }
> >>>
> >>> -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> jpeg_v3_0_stop(ring->adev);
> >>> jpeg_v3_0_start(ring->adev);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> >>> index f17ec5414fd69..c8efeaf0a2a69 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> >>> @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
> >>> return 0;
> >>> }
> >>>
> >>> -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> if (amdgpu_sriov_vf(ring->adev))
> >>> return -EINVAL;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> >>> index 79e342d5ab28d..8b07c3651c579 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
> >>> @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
> >>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
> >>> }
> >>>
> >>> -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> if (amdgpu_sriov_vf(ring->adev))
> >>> return -EOPNOTSUPP;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> >>> index 3b6f65a256464..0a21a13e19360 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
> >>> @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
> >>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
> >>> }
> >>>
> >>> -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> if (amdgpu_sriov_vf(ring->adev))
> >>> return -EOPNOTSUPP;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >>> index 9c169112a5e7b..ffd67d51b335f 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
> >>> @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
> >>> return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
> >>> }
> >>>
> >>> -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> u32 id = GET_INST(SDMA0, ring->me);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> >>> index 9505ae96fbecc..46affee1c2da0 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> >>> @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
> >>> return 0;
> >>> }
> >>>
> >>> -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> u32 inst_id = ring->me;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >>> index a6e612b4a8928..581e75b7d01a8 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >>> @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
> >>> return -ETIMEDOUT;
> >>> }
> >>>
> >>> -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> u32 inst_id = ring->me;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> >>> index 5a70ae17be04e..d9866009edbfc 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> >>> @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
> >>> return r;
> >>> }
> >>>
> >>> -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> int i, r;
> >>> @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> return -EINVAL;
> >>> }
> >>>
> >>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
> >>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
> >>> if (r)
> >>> return r;
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> >>> index ad47d0bdf7775..c546e73642296 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> >>> @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
> >>> return false;
> >>> }
> >>>
> >>> -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> int i, r;
> >>> @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
> >>> return -EINVAL;
> >>> }
> >>>
> >>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
> >>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
> >>> if (r)
> >>> return r;
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> >>> index b5071f77f78d2..47a0deceff433 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> >>> @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
> >>> return 0;
> >>> }
> >>>
> >>> -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>> index 5a33140f57235..d961a824d2098 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
> >>> @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
> >>> }
> >>> }
> >>>
> >>> -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> int r = 0;
> >>> int vcn_inst;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> >>> index 16ade84facc78..10bd714592278 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
> >>> @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
> >>> }
> >>> }
> >>>
> >>> -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> >>> index f8e3f0b882da5..7e6a7ead9a086 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> >>> @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
> >>> }
> >>> }
> >>>
> >>> -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
> >>> +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
> >>> + struct amdgpu_job *job)
> >>> {
> >>> struct amdgpu_device *adev = ring->adev;
> >>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 29/29] drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
2025-06-06 6:43 ` [PATCH 29/29] drm/amdgpu/vcn5: " Alex Deucher
@ 2025-06-09 14:23 ` Sundararaju, Sathishkumar
0 siblings, 0 replies; 43+ messages in thread
From: Sundararaju, Sathishkumar @ 2025-06-09 14:23 UTC (permalink / raw)
To: Alex Deucher, amd-gfx, christian.koenig
This patch series is :-
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
One nit-pick: the amdgpu_ring_backup_unprocessed_commands function could
take an amdgpu_fence instead of a dma_fence as its argument.
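For illustration, the suggested prototype would be roughly as below
(hypothetical sketch based on the call sites; the actual return type and
parameter names in the series may differ):

void amdgpu_ring_backup_unprocessed_commands(struct amdgpu_ring *ring,
					     struct amdgpu_fence *guilty_fence,
					     bool set_error);

The callers would then pass &job->hw_fence instead of &job->hw_fence.base.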
And JPEG/VCN changes in this series are also :-
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Note:
The JPEG5 and JPEG4_0_5 resets fail due to DPG mode; they work fine in
non-DPG mode, and the failure is not related to this series.
I couldn't test JPEG4_0_3 and VCN4_0_3, but the changes look good.
Regards,
Sathish
On 6/6/2025 12:13 PM, Alex Deucher wrote:
> Re-emit the unprocessed state after resetting the queue.
>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> index d74c1862ac860..208b366c580da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
> @@ -1202,15 +1202,23 @@ static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
> if (!(adev->vcn.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
> return -EOPNOTSUPP;
>
> + amdgpu_ring_backup_unprocessed_commands(ring, &job->hw_fence.base, true);
> +
> vcn_v5_0_0_stop(vinst);
> vcn_v5_0_0_start(vinst);
> -
> - r = amdgpu_ring_test_helper(ring);
> + r = amdgpu_ring_test_ring(ring);
> if (r)
> return r;
> +
> dma_fence_set_error(&job->base.s_fence->finished, -ETIME);
> - amdgpu_fence_driver_force_completion(ring);
> + /* signal the fence of the bad job */
> + amdgpu_fence_driver_guilty_force_completion(&job->hw_fence.base);
> atomic_inc(&ring->adev->gpu_reset_counter);
> + r = amdgpu_ring_reemit_unprocessed_commands(ring);
> + if (r)
> + /* if we fail to reemit, force complete all fences */
> + amdgpu_fence_driver_force_completion(ring);
> +
> return 0;
> }
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence
2025-06-06 16:08 ` Alex Deucher
@ 2025-06-10 8:23 ` Christian König
0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2025-06-10 8:23 UTC (permalink / raw)
To: Alex Deucher; +Cc: Alex Deucher, amd-gfx
On 6/6/25 18:08, Alex Deucher wrote:
> On Fri, Jun 6, 2025 at 7:39 AM Christian König <christian.koenig@amd.com> wrote:
>>
>> On 6/6/25 08:43, Alex Deucher wrote:
>>> Use the amdgpu fence container so we can store additional
>>> data in the fence.
>>>
>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 30 +++++----------------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 ++++-----
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 16 +++++++++++
>>> 6 files changed, 32 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> index 8e626f50b362e..f81608330a3d0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> @@ -1902,7 +1902,7 @@ static void amdgpu_ib_preempt_mark_partial_job(struct amdgpu_ring *ring)
>>> continue;
>>> }
>>> job = to_amdgpu_job(s_job);
>>> - if (preempted && (&job->hw_fence) == fence)
>>> + if (preempted && (&job->hw_fence.base) == fence)
>>> /* mark the job as preempted */
>>> job->preemption_status |= AMDGPU_IB_PREEMPTED;
>>> }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index ea565651f7459..8298e95e4543e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -6375,7 +6375,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> *
>>> * job->base holds a reference to parent fence
>>> */
>>> - if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>> + if (job && dma_fence_is_signaled(&job->hw_fence.base)) {
>>> job_signaled = true;
>>> dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>>> goto skip_hw_reset;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 2f24a6aa13bf6..569e0e5373927 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -41,22 +41,6 @@
>>> #include "amdgpu_trace.h"
>>> #include "amdgpu_reset.h"
>>>
>>> -/*
>>> - * Fences mark an event in the GPUs pipeline and are used
>>> - * for GPU/CPU synchronization. When the fence is written,
>>> - * it is expected that all buffers associated with that fence
>>> - * are no longer in use by the associated ring on the GPU and
>>> - * that the relevant GPU caches have been flushed.
>>> - */
>>> -
>>> -struct amdgpu_fence {
>>> - struct dma_fence base;
>>> -
>>> - /* RB, DMA, etc. */
>>> - struct amdgpu_ring *ring;
>>> - ktime_t start_timestamp;
>>> -};
>>> -
>>
>> Oh, that handling here is completely broken.
>>
>> The MCBP muxer overwrites fields in the job because of this ^^.
>>
>> I think that patch needs to be a bug fix we even backport.
>
> What is broken in the muxer code?
See amdgpu_fence_emit(): the code casts the fence to an amdgpu_fence and assigns start_timestamp.
But the fence pointer isn't an amdgpu_fence at all, it is just a dma_fence embedded into a job object!
We overwrite the gang submit pointer and the flags with that. The only thing preventing us from crashing is that those values are never used again after emitting the fence.
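Put differently, the problem is roughly this (simplified illustration,
fields trimmed, not the actual definitions):

struct amdgpu_fence {
	struct dma_fence base;
	struct amdgpu_ring *ring;
	ktime_t start_timestamp;
};

struct amdgpu_job {
	/* earlier fields omitted */
	struct dma_fence hw_fence;	/* only the base dma_fence is embedded */
	struct dma_fence *gang_submit;	/* sits right behind it in memory */
	uint32_t preamble_status;	/* followed by the status flags */
	uint32_t preemption_status;
};

static void illustrate_clobber(struct amdgpu_job *job, struct amdgpu_ring *ring)
{
	/* only valid if the fence really is embedded in an amdgpu_fence */
	struct amdgpu_fence *af =
		container_of(&job->hw_fence, struct amdgpu_fence, base);

	af->ring = ring;		   /* lands on job->gang_submit */
	af->start_timestamp = ktime_get(); /* lands on the status flags */
}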
Regards,
Christian.
>
> Alex
>
>>
>> Regards,
>> Christian.
>>
>>> static struct kmem_cache *amdgpu_fence_slab;
>>>
>>> int amdgpu_fence_slab_init(void)
>>> @@ -151,12 +135,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
>>> am_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>>> if (am_fence == NULL)
>>> return -ENOMEM;
>>> - fence = &am_fence->base;
>>> - am_fence->ring = ring;
>>> } else {
>>> /* take use of job-embedded fence */
>>> - fence = &job->hw_fence;
>>> + am_fence = &job->hw_fence;
>>> }
>>> + fence = &am_fence->base;
>>> + am_fence->ring = ring;
>>>
>>> seq = ++ring->fence_drv.sync_seq;
>>> if (job && job->job_run_counter) {
>>> @@ -718,7 +702,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>>> * it right here or we won't be able to track them in fence_drv
>>> * and they will remain unsignaled during sa_bo free.
>>> */
>>> - job = container_of(old, struct amdgpu_job, hw_fence);
>>> + job = container_of(old, struct amdgpu_job, hw_fence.base);
>>> if (!job->base.s_fence && !dma_fence_is_signaled(old))
>>> dma_fence_signal(old);
>>> RCU_INIT_POINTER(*ptr, NULL);
>>> @@ -780,7 +764,7 @@ static const char *amdgpu_fence_get_timeline_name(struct dma_fence *f)
>>>
>>> static const char *amdgpu_job_fence_get_timeline_name(struct dma_fence *f)
>>> {
>>> - struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
>>> + struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
>>>
>>> return (const char *)to_amdgpu_ring(job->base.sched)->name;
>>> }
>>> @@ -810,7 +794,7 @@ static bool amdgpu_fence_enable_signaling(struct dma_fence *f)
>>> */
>>> static bool amdgpu_job_fence_enable_signaling(struct dma_fence *f)
>>> {
>>> - struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence);
>>> + struct amdgpu_job *job = container_of(f, struct amdgpu_job, hw_fence.base);
>>>
>>> if (!timer_pending(&to_amdgpu_ring(job->base.sched)->fence_drv.fallback_timer))
>>> amdgpu_fence_schedule_fallback(to_amdgpu_ring(job->base.sched));
>>> @@ -845,7 +829,7 @@ static void amdgpu_job_fence_free(struct rcu_head *rcu)
>>> struct dma_fence *f = container_of(rcu, struct dma_fence, rcu);
>>>
>>> /* free job if fence has a parent job */
>>> - kfree(container_of(f, struct amdgpu_job, hw_fence));
>>> + kfree(container_of(f, struct amdgpu_job, hw_fence.base));
>>> }
>>>
>>> /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index acb21fc8b3ce5..ddb9d3269357c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -272,8 +272,8 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>> /* Check if any fences where initialized */
>>> if (job->base.s_fence && job->base.s_fence->finished.ops)
>>> f = &job->base.s_fence->finished;
>>> - else if (job->hw_fence.ops)
>>> - f = &job->hw_fence;
>>> + else if (job->hw_fence.base.ops)
>>> + f = &job->hw_fence.base;
>>> else
>>> f = NULL;
>>>
>>> @@ -290,10 +290,10 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>> amdgpu_sync_free(&job->explicit_sync);
>>>
>>> /* only put the hw fence if has embedded fence */
>>> - if (!job->hw_fence.ops)
>>> + if (!job->hw_fence.base.ops)
>>> kfree(job);
>>> else
>>> - dma_fence_put(&job->hw_fence);
>>> + dma_fence_put(&job->hw_fence.base);
>>> }
>>>
>>> void amdgpu_job_set_gang_leader(struct amdgpu_job *job,
>>> @@ -322,10 +322,10 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>> if (job->gang_submit != &job->base.s_fence->scheduled)
>>> dma_fence_put(job->gang_submit);
>>>
>>> - if (!job->hw_fence.ops)
>>> + if (!job->hw_fence.base.ops)
>>> kfree(job);
>>> else
>>> - dma_fence_put(&job->hw_fence);
>>> + dma_fence_put(&job->hw_fence.base);
>>> }
>>>
>>> struct dma_fence *amdgpu_job_submit(struct amdgpu_job *job)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> index f2c049129661f..931fed8892cc1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> @@ -48,7 +48,7 @@ struct amdgpu_job {
>>> struct drm_sched_job base;
>>> struct amdgpu_vm *vm;
>>> struct amdgpu_sync explicit_sync;
>>> - struct dma_fence hw_fence;
>>> + struct amdgpu_fence hw_fence;
>>> struct dma_fence *gang_submit;
>>> uint32_t preamble_status;
>>> uint32_t preemption_status;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index b95b471107692..e1f25218943a4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -127,6 +127,22 @@ struct amdgpu_fence_driver {
>>> struct dma_fence **fences;
>>> };
>>>
>>> +/*
>>> + * Fences mark an event in the GPUs pipeline and are used
>>> + * for GPU/CPU synchronization. When the fence is written,
>>> + * it is expected that all buffers associated with that fence
>>> + * are no longer in use by the associated ring on the GPU and
>>> + * that the relevant GPU caches have been flushed.
>>> + */
>>> +
>>> +struct amdgpu_fence {
>>> + struct dma_fence base;
>>> +
>>> + /* RB, DMA, etc. */
>>> + struct amdgpu_ring *ring;
>>> + ktime_t start_timestamp;
>>> +};
>>> +
>>> extern const struct drm_sched_backend_ops amdgpu_sched_ops;
>>>
>>> void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-06 16:00 ` Alex Deucher
2025-06-09 12:43 ` Sundararaju, Sathishkumar
@ 2025-06-10 8:24 ` Christian König
2025-06-10 16:37 ` Sundararaju, Sathishkumar
2025-06-11 6:54 ` Alex Deucher
1 sibling, 2 replies; 43+ messages in thread
From: Christian König @ 2025-06-10 8:24 UTC (permalink / raw)
To: Alex Deucher; +Cc: Alex Deucher, amd-gfx
On 6/6/25 18:00, Alex Deucher wrote:
> On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
>>
>> On 6/6/25 08:43, Alex Deucher wrote:
>>> Going forward, we'll need more than just the vmid. Everything
>>> we need is currently in the amdgpu job structure, so just
>>> pass that in.
>>
>> Please don't, the job is just a container for the submission; it should not be part of any reset handling.
>>
>> What information is actually needed here?
>
> We need job->vmid, job->base.s_fence->finished, job->hw_fence.
VMID and HW fence make sense, but why is the finished fence needed?
Christian.
>
> Alex
>
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
>>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
>>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
>>> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
>>> drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
>>> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
>>> 22 files changed, 53 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index ddb9d3269357c..80d4dfebde24f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>> if (is_guilty)
>>> dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>
>>> - r = amdgpu_ring_reset(ring, job->vmid);
>>> + r = amdgpu_ring_reset(ring, job);
>>> if (!r) {
>>> if (amdgpu_ring_sched_ready(ring))
>>> drm_sched_stop(&ring->sched, s_job);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index e1f25218943a4..ab5402d7ce9c8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
>>> void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
>>> void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
>>> void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
>>> - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
>>> + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
>>> void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
>>> bool (*is_guilty)(struct amdgpu_ring *ring);
>>> };
>>> @@ -425,7 +425,7 @@ struct amdgpu_ring {
>>> #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
>>> #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
>>> #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
>>> -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
>>> +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
>>>
>>> unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
>>> int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>> index 75ea071744eb5..c58e7040c732a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>> @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>> amdgpu_ring_insert_nop(ring, num_nop - 1);
>>> }
>>>
>>> -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>> @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>
>>> addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
>>> offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
>>> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
>>> + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
>>> if (ring->pipe == 0)
>>> tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
>>> else
>>> @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> }
>>>
>>> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
>>> - unsigned int vmid)
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>> index afd6d59164bfa..0ee7bdd509741 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>> @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r;
>>> @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>> if (r) {
>>>
>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>> @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r = 0;
>>> @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>> if (r) {
>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>> r = gfx_v11_0_reset_compute_pipe(ring);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>> index 1234c8d64e20d..a26417d53411b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>> @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r;
>>> @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>> if (r) {
>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>> r = gfx_v12_reset_gfx_pipe(ring);
>>> @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>> return 0;
>>> }
>>>
>>> -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int r;
>>> @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>> if (amdgpu_sriov_vf(adev))
>>> return -EINVAL;
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>> if (r) {
>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>> r = gfx_v12_0_reset_compute_pipe(ring);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> index d50e125fd3e0d..5e650cc5fcb26 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>> @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>> }
>>>
>>> static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
>>> - unsigned int vmid)
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>> index c233edf605694..a7dadff3dca31 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>> @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
>>> }
>>>
>>> static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
>>> - unsigned int vmid)
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>> index 4cde8a8bcc837..6cd3fbe00d6b9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>> @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> jpeg_v2_0_stop(ring->adev);
>>> jpeg_v2_0_start(ring->adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>> index 8b39e114f3be1..8ed41868f6c32 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>> @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> jpeg_v2_5_stop_inst(ring->adev, ring->me);
>>> jpeg_v2_5_start_inst(ring->adev, ring->me);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>> index 2f8510c2986b9..3512fbb543301 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>> @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> jpeg_v3_0_stop(ring->adev);
>>> jpeg_v3_0_start(ring->adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>> index f17ec5414fd69..c8efeaf0a2a69 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>> @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
>>> return 0;
>>> }
>>>
>>> -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> if (amdgpu_sriov_vf(ring->adev))
>>> return -EINVAL;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> index 79e342d5ab28d..8b07c3651c579 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>> @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>> }
>>>
>>> -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> if (amdgpu_sriov_vf(ring->adev))
>>> return -EOPNOTSUPP;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>> index 3b6f65a256464..0a21a13e19360 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>> @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>> }
>>>
>>> -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> if (amdgpu_sriov_vf(ring->adev))
>>> return -EOPNOTSUPP;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>> index 9c169112a5e7b..ffd67d51b335f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>> @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
>>> return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
>>> }
>>>
>>> -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> u32 id = GET_INST(SDMA0, ring->me);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>> index 9505ae96fbecc..46affee1c2da0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>> @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
>>> return 0;
>>> }
>>>
>>> -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> u32 inst_id = ring->me;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>> index a6e612b4a8928..581e75b7d01a8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>> @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
>>> return -ETIMEDOUT;
>>> }
>>>
>>> -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> u32 inst_id = ring->me;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>> index 5a70ae17be04e..d9866009edbfc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>> @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
>>> return r;
>>> }
>>>
>>> -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int i, r;
>>> @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> return -EINVAL;
>>> }
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>> if (r)
>>> return r;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>> index ad47d0bdf7775..c546e73642296 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>> @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
>>> return false;
>>> }
>>>
>>> -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> int i, r;
>>> @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>> return -EINVAL;
>>> }
>>>
>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>> if (r)
>>> return r;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> index b5071f77f78d2..47a0deceff433 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>> @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
>>> return 0;
>>> }
>>>
>>> -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>> index 5a33140f57235..d961a824d2098 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>> @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>> }
>>> }
>>>
>>> -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> int r = 0;
>>> int vcn_inst;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>> index 16ade84facc78..10bd714592278 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>> @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>> }
>>> }
>>>
>>> -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>> index f8e3f0b882da5..7e6a7ead9a086 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>> @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>> }
>>> }
>>>
>>> -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>> +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
>>> + struct amdgpu_job *job)
>>> {
>>> struct amdgpu_device *adev = ring->adev;
>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-10 8:24 ` Christian König
@ 2025-06-10 16:37 ` Sundararaju, Sathishkumar
2025-06-11 9:57 ` Christian König
2025-06-11 6:54 ` Alex Deucher
1 sibling, 1 reply; 43+ messages in thread
From: Sundararaju, Sathishkumar @ 2025-06-10 16:37 UTC (permalink / raw)
To: Christian König, Alex Deucher; +Cc: Alex Deucher, amd-gfx
On 6/10/2025 1:54 PM, Christian König wrote:
> On 6/6/25 18:00, Alex Deucher wrote:
>> On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
>>> On 6/6/25 08:43, Alex Deucher wrote:
>>>> Going forward, we'll need more than just the vmid. Everything
>>>> we need is currently in the amdgpu job structure, so just
>>>> pass that in.
>>> Please don't, the job is just a container for the submission; it should not be part of any reset handling.
>>>
>>> What information is actually needed here?
>> We need job->vmid, job->base.s_fence->finished, job->hw_fence.
> VMID and HW fence make sense, but why is the finished fence needed?
That's used because amdgpu_fence_driver_guilty_force_completion just
forces the completion of the guilty job's hw_fence without setting any
error on it.
So dma_fence_set_error(&job->base.s_fence->finished, -ETIME) is called
explicitly to set the error on the waiter's fence (finished) so the
waiters see the appropriate error.
Alternatively, the hw_fence could also be set with the error and force
completed in amdgpu_fence_driver_guilty_force_completion; that would be
propagated to the waiter's fence (finished). I just tested it, and it
has the same result.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -766,6 +766,7 @@ void amdgpu_fence_driver_guilty_force_completion(struct dma_fence *fence)
 {
 	struct amdgpu_fence *am_fence = container_of(fence, struct amdgpu_fence, base);
+	dma_fence_set_error(fence, -ETIME);
 	amdgpu_fence_write(am_fence->ring, fence->seqno);
 	amdgpu_fence_process(am_fence->ring);
 }
Regards,
Sathish
>
> Christian.
>
>
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
>>>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
>>>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
>>>> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
>>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
>>>> drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
>>>> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
>>>> 22 files changed, 53 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index ddb9d3269357c..80d4dfebde24f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>> if (is_guilty)
>>>> dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>>
>>>> - r = amdgpu_ring_reset(ring, job->vmid);
>>>> + r = amdgpu_ring_reset(ring, job);
>>>> if (!r) {
>>>> if (amdgpu_ring_sched_ready(ring))
>>>> drm_sched_stop(&ring->sched, s_job);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> index e1f25218943a4..ab5402d7ce9c8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>> @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
>>>> void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
>>>> void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
>>>> void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
>>>> - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
>>>> + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
>>>> void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
>>>> bool (*is_guilty)(struct amdgpu_ring *ring);
>>>> };
>>>> @@ -425,7 +425,7 @@ struct amdgpu_ring {
>>>> #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
>>>> #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
>>>> #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
>>>> -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
>>>> +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
>>>>
>>>> unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
>>>> int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>> index 75ea071744eb5..c58e7040c732a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>> @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>>> amdgpu_ring_insert_nop(ring, num_nop - 1);
>>>> }
>>>>
>>>> -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>>> @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>
>>>> addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
>>>> offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
>>>> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
>>>> + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
>>>> if (ring->pipe == 0)
>>>> tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
>>>> else
>>>> @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> }
>>>>
>>>> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
>>>> - unsigned int vmid)
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>>> index afd6d59164bfa..0ee7bdd509741 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>>> @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
>>>> return 0;
>>>> }
>>>>
>>>> -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> int r;
>>>> @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> if (amdgpu_sriov_vf(adev))
>>>> return -EINVAL;
>>>>
>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>>> if (r) {
>>>>
>>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>>> @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>>> return 0;
>>>> }
>>>>
>>>> -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> int r = 0;
>>>> @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> if (amdgpu_sriov_vf(adev))
>>>> return -EINVAL;
>>>>
>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>>> if (r) {
>>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>>> r = gfx_v11_0_reset_compute_pipe(ring);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>>> index 1234c8d64e20d..a26417d53411b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>>> @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
>>>> return 0;
>>>> }
>>>>
>>>> -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> int r;
>>>> @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> if (amdgpu_sriov_vf(adev))
>>>> return -EINVAL;
>>>>
>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>>> if (r) {
>>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>>> r = gfx_v12_reset_gfx_pipe(ring);
>>>> @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>>> return 0;
>>>> }
>>>>
>>>> -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> int r;
>>>> @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>> if (amdgpu_sriov_vf(adev))
>>>> return -EINVAL;
>>>>
>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>>> if (r) {
>>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>>> r = gfx_v12_0_reset_compute_pipe(ring);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>>> index d50e125fd3e0d..5e650cc5fcb26 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>>> @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>>> }
>>>>
>>>> static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
>>>> - unsigned int vmid)
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>>> index c233edf605694..a7dadff3dca31 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>>> @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
>>>> }
>>>>
>>>> static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
>>>> - unsigned int vmid)
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>>> index 4cde8a8bcc837..6cd3fbe00d6b9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>>> @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
>>>> return 0;
>>>> }
>>>>
>>>> -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> jpeg_v2_0_stop(ring->adev);
>>>> jpeg_v2_0_start(ring->adev);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>>> index 8b39e114f3be1..8ed41868f6c32 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>>> @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
>>>> return 0;
>>>> }
>>>>
>>>> -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> jpeg_v2_5_stop_inst(ring->adev, ring->me);
>>>> jpeg_v2_5_start_inst(ring->adev, ring->me);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>>> index 2f8510c2986b9..3512fbb543301 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>>> @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
>>>> return 0;
>>>> }
>>>>
>>>> -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> jpeg_v3_0_stop(ring->adev);
>>>> jpeg_v3_0_start(ring->adev);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>>> index f17ec5414fd69..c8efeaf0a2a69 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>>> @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
>>>> return 0;
>>>> }
>>>>
>>>> -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> if (amdgpu_sriov_vf(ring->adev))
>>>> return -EINVAL;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>>> index 79e342d5ab28d..8b07c3651c579 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>>> @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
>>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>>> }
>>>>
>>>> -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> if (amdgpu_sriov_vf(ring->adev))
>>>> return -EOPNOTSUPP;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>>> index 3b6f65a256464..0a21a13e19360 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>>> @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
>>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>>> }
>>>>
>>>> -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> if (amdgpu_sriov_vf(ring->adev))
>>>> return -EOPNOTSUPP;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>>> index 9c169112a5e7b..ffd67d51b335f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>>> @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
>>>> return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
>>>> }
>>>>
>>>> -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> u32 id = GET_INST(SDMA0, ring->me);
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>>> index 9505ae96fbecc..46affee1c2da0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>>> @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
>>>> return 0;
>>>> }
>>>>
>>>> -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> u32 inst_id = ring->me;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> index a6e612b4a8928..581e75b7d01a8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>> @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
>>>> return -ETIMEDOUT;
>>>> }
>>>>
>>>> -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> u32 inst_id = ring->me;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>>> index 5a70ae17be04e..d9866009edbfc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>>> @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
>>>> return r;
>>>> }
>>>>
>>>> -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> int i, r;
>>>> @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> return -EINVAL;
>>>> }
>>>>
>>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>>> if (r)
>>>> return r;
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>>> index ad47d0bdf7775..c546e73642296 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>>> @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
>>>> return false;
>>>> }
>>>>
>>>> -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> int i, r;
>>>> @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>> return -EINVAL;
>>>> }
>>>>
>>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>>> if (r)
>>>> return r;
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>> index b5071f77f78d2..47a0deceff433 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>> @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
>>>> return 0;
>>>> }
>>>>
>>>> -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> index 5a33140f57235..d961a824d2098 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>>> }
>>>> }
>>>>
>>>> -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> int r = 0;
>>>> int vcn_inst;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>>> index 16ade84facc78..10bd714592278 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>>> @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>>> }
>>>> }
>>>>
>>>> -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>> index f8e3f0b882da5..7e6a7ead9a086 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>> @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>>> }
>>>> }
>>>>
>>>> -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>> +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
>>>> + struct amdgpu_job *job)
>>>> {
>>>> struct amdgpu_device *adev = ring->adev;
>>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-10 8:24 ` Christian König
2025-06-10 16:37 ` Sundararaju, Sathishkumar
@ 2025-06-11 6:54 ` Alex Deucher
1 sibling, 0 replies; 43+ messages in thread
From: Alex Deucher @ 2025-06-11 6:54 UTC (permalink / raw)
To: Christian König; +Cc: Alex Deucher, amd-gfx
On Tue, Jun 10, 2025 at 4:24 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 6/6/25 18:00, Alex Deucher wrote:
> > On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
> >>
> >> On 6/6/25 08:43, Alex Deucher wrote:
> >>> Going forward, we'll need more than just the vmid. Everything
> >>> we need is currently in the amdgpu job structure, so just
> >>> pass that in.
> >>
> >> Please don't; the job is just a container for the submission, it should not be part of any reset handling.
> >>
> >> What information is actually needed here?
> >
> > We need job->vmid, job->base.s_fence->finished, job->hw_fence.
>
> VMID and HW fence make sense, but why is the finished fence needed?
I was trying to keep the SDMA guilty queue logic out of the common
ring reset code. I'd like to keep that internal to SDMA.
Alex
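
To make that concrete, here is a minimal sketch (hypothetical helper name,
not code from this series) of what a per-IP reset callback gains from
receiving the whole job instead of only the vmid:

/*
 * Sketch only: example_hw_queue_reset() is a placeholder for the
 * IP-specific queue reset, not a function from this series.
 */
static int example_ring_reset(struct amdgpu_ring *ring,
			      struct amdgpu_job *job)
{
	struct amdgpu_device *adev = ring->adev;
	int r;

	/* The VMID still identifies which queue/VM to reset. */
	r = example_hw_queue_reset(adev, ring, job->vmid);
	if (r)
		return r;

	/*
	 * The job also carries job->hw_fence and
	 * job->base.s_fence->finished, so the IP code can force-complete
	 * or flag the guilty submission itself instead of pushing that
	 * logic into the common timeout handler.
	 */
	return 0;
}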
>
> Christian.
>
>
> >
> > Alex
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>
> >>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/29] drm/amdgpu: update ring reset function signature
2025-06-10 16:37 ` Sundararaju, Sathishkumar
@ 2025-06-11 9:57 ` Christian König
0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2025-06-11 9:57 UTC (permalink / raw)
To: Sundararaju, Sathishkumar, Alex Deucher; +Cc: Alex Deucher, amd-gfx
On 6/10/25 18:37, Sundararaju, Sathishkumar wrote:
>
>
> On 6/10/2025 1:54 PM, Christian König wrote:
>> On 6/6/25 18:00, Alex Deucher wrote:
>>> On Fri, Jun 6, 2025 at 7:41 AM Christian König <christian.koenig@amd.com> wrote:
>>>> On 6/6/25 08:43, Alex Deucher wrote:
>>>>> Going forward, we'll need more than just the vmid. Everything
> >>>>> we need is currently in the amdgpu job structure, so just
>>>>> pass that in.
> >>>> Please don't; the job is just a container for the submission, it should not be part of any reset handling.
>>>>
>>>> What information is actually needed here?
>>> We need job->vmid, job->base.s_fence->finished, job->hw_fence.
>> VMID and HW fence make sense, but why is the finished fence needed?
>
> That's used because amdgpu_fence_driver_guilty_force_completion just forces the completion of the guilty job's hw_fence without setting any error on it.
Yeah I expected something like that.
> so dma_fence_set_error(&job->base.s_fence->finished, -ETIME) is called explicitly to set the error on the waiter's fence (finished) so that it returns an appropriate error.
Please never do anything like that. We had more than enough of that mess.
> Alternatively, the hw_fence could be set with the error and force-completed in amdgpu_fence_driver_guilty_force_completion;
> that would be propagated to the waiter's fence (finished). I just tested it and it has the same result.
Yeah please do that instead.
Regards,
Christian.
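
A minimal, generic dma-fence sketch of the behaviour agreed on here
(placeholder function names, not the actual amdgpu paths): an error set on
the hw fence before it is signaled is visible to every waiter afterwards
via dma_fence_get_status(), which is why setting it inside
amdgpu_fence_driver_guilty_force_completion, as in the snippet quoted
below, is sufficient.

#include <linux/dma-fence.h>

/* Sketch only: both functions below are placeholders, not amdgpu code. */
static void example_force_complete_guilty(struct dma_fence *hw_fence)
{
	/* dma_fence_set_error() must be called before dma_fence_signal(). */
	dma_fence_set_error(hw_fence, -ETIME);
	dma_fence_signal(hw_fence);
}

static int example_check_result(struct dma_fence *fence)
{
	/* <0: signaled with an error, 1: signaled OK, 0: still pending */
	int status = dma_fence_get_status(fence);

	return status < 0 ? status : 0;
}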
>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -766,6 +766,7 @@ void amdgpu_fence_driver_guilty_force_completion(struct dma_fence *fence)
> {
> struct amdgpu_fence *am_fence = container_of(fence, struct amdgpu_fence, base);
>
> + dma_fence_set_error(fence, -ETIME);
> amdgpu_fence_write(am_fence->ring, fence->seqno);
> amdgpu_fence_process(am_fence->ring);
> }
>
>
> Regards,
> Sathish
>
>>
>> Christian.
>>
>>
>>> Alex
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++--
>>>>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 ++++---
>>>>> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 10 ++++++----
>>>>> drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 10 ++++++----
>>>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 +-
>>>>> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 2 +-
>>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 5 +++--
>>>>> drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 5 +++--
>>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 3 ++-
>>>>> drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 3 ++-
>>>>> 22 files changed, 53 insertions(+), 33 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index ddb9d3269357c..80d4dfebde24f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -155,7 +155,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>> if (is_guilty)
>>>>> dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>>>
>>>>> - r = amdgpu_ring_reset(ring, job->vmid);
>>>>> + r = amdgpu_ring_reset(ring, job);
>>>>> if (!r) {
>>>>> if (amdgpu_ring_sched_ready(ring))
>>>>> drm_sched_stop(&ring->sched, s_job);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index e1f25218943a4..ab5402d7ce9c8 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -268,7 +268,7 @@ struct amdgpu_ring_funcs {
>>>>> void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
>>>>> void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
>>>>> void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
>>>>> - int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
>>>>> + int (*reset)(struct amdgpu_ring *ring, struct amdgpu_job *job);
>>>>> void (*emit_cleaner_shader)(struct amdgpu_ring *ring);
>>>>> bool (*is_guilty)(struct amdgpu_ring *ring);
>>>>> };
>>>>> @@ -425,7 +425,7 @@ struct amdgpu_ring {
>>>>> #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
>>>>> #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
>>>>> #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
>>>>> -#define amdgpu_ring_reset(r, v) (r)->funcs->reset((r), (v))
>>>>> +#define amdgpu_ring_reset(r, j) (r)->funcs->reset((r), (j))
>>>>>
>>>>> unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
>>>>> int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>> index 75ea071744eb5..c58e7040c732a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
>>>>> @@ -9522,7 +9522,8 @@ static void gfx_v10_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>>>> amdgpu_ring_insert_nop(ring, num_nop - 1);
>>>>> }
>>>>>
>>>>> -static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>>>> @@ -9547,7 +9548,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>>
>>>>> addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
>>>>> offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
>>>>> - tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
>>>>> + tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << job->vmid);
>>>>> if (ring->pipe == 0)
>>>>> tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
>>>>> else
>>>>> @@ -9579,7 +9580,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> }
>>>>>
>>>>> static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
>>>>> - unsigned int vmid)
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>>>> index afd6d59164bfa..0ee7bdd509741 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
>>>>> @@ -6806,7 +6806,8 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> int r;
>>>>> @@ -6814,7 +6815,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> if (amdgpu_sriov_vf(adev))
>>>>> return -EINVAL;
>>>>>
>>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>>>> if (r) {
>>>>>
>>>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>>>> @@ -6968,7 +6969,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> int r = 0;
>>>>> @@ -6976,7 +6978,7 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> if (amdgpu_sriov_vf(adev))
>>>>> return -EINVAL;
>>>>>
>>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>>>> if (r) {
>>>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>>>> r = gfx_v11_0_reset_compute_pipe(ring);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>>>> index 1234c8d64e20d..a26417d53411b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
>>>>> @@ -5307,7 +5307,8 @@ static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> int r;
>>>>> @@ -5315,7 +5316,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> if (amdgpu_sriov_vf(adev))
>>>>> return -EINVAL;
>>>>>
>>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
>>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, false);
>>>>> if (r) {
>>>>> dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
>>>>> r = gfx_v12_reset_gfx_pipe(ring);
>>>>> @@ -5421,7 +5422,8 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> int r;
>>>>> @@ -5429,7 +5431,7 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> if (amdgpu_sriov_vf(adev))
>>>>> return -EINVAL;
>>>>>
>>>>> - r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true);
>>>>> + r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, job->vmid, true);
>>>>> if (r) {
>>>>> dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
>>>>> r = gfx_v12_0_reset_compute_pipe(ring);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>>>> index d50e125fd3e0d..5e650cc5fcb26 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>>>>> @@ -7153,7 +7153,7 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>>>>> }
>>>>>
>>>>> static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
>>>>> - unsigned int vmid)
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>>>> index c233edf605694..a7dadff3dca31 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
>>>>> @@ -3552,7 +3552,7 @@ static int gfx_v9_4_3_reset_hw_pipe(struct amdgpu_ring *ring)
>>>>> }
>>>>>
>>>>> static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
>>>>> - unsigned int vmid)
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>>>> index 4cde8a8bcc837..6cd3fbe00d6b9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c
>>>>> @@ -764,7 +764,8 @@ static int jpeg_v2_0_process_interrupt(struct amdgpu_device *adev,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int jpeg_v2_0_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> jpeg_v2_0_stop(ring->adev);
>>>>> jpeg_v2_0_start(ring->adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>>>> index 8b39e114f3be1..8ed41868f6c32 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c
>>>>> @@ -643,7 +643,8 @@ static int jpeg_v2_5_process_interrupt(struct amdgpu_device *adev,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int jpeg_v2_5_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> jpeg_v2_5_stop_inst(ring->adev, ring->me);
>>>>> jpeg_v2_5_start_inst(ring->adev, ring->me);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>>>> index 2f8510c2986b9..3512fbb543301 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c
>>>>> @@ -555,7 +555,8 @@ static int jpeg_v3_0_process_interrupt(struct amdgpu_device *adev,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int jpeg_v3_0_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> jpeg_v3_0_stop(ring->adev);
>>>>> jpeg_v3_0_start(ring->adev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>>>> index f17ec5414fd69..c8efeaf0a2a69 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
>>>>> @@ -720,7 +720,8 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int jpeg_v4_0_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> if (amdgpu_sriov_vf(ring->adev))
>>>>> return -EINVAL;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>>>> index 79e342d5ab28d..8b07c3651c579 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c
>>>>> @@ -1143,7 +1143,8 @@ static void jpeg_v4_0_3_core_stall_reset(struct amdgpu_ring *ring)
>>>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>>>> }
>>>>>
>>>>> -static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int jpeg_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> if (amdgpu_sriov_vf(ring->adev))
>>>>> return -EOPNOTSUPP;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>>>> index 3b6f65a256464..0a21a13e19360 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
>>>>> @@ -834,7 +834,8 @@ static void jpeg_v5_0_1_core_stall_reset(struct amdgpu_ring *ring)
>>>>> WREG32_SOC15(JPEG, jpeg_inst, regJPEG_CORE_RST_CTRL, 0x00);
>>>>> }
>>>>>
>>>>> -static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int jpeg_v5_0_1_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> if (amdgpu_sriov_vf(ring->adev))
>>>>> return -EOPNOTSUPP;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>>>> index 9c169112a5e7b..ffd67d51b335f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>>>>> @@ -1667,7 +1667,8 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
>>>>> return sdma_v4_4_2_is_queue_selected(adev, instance_id, true);
>>>>> }
>>>>>
>>>>> -static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> u32 id = GET_INST(SDMA0, ring->me);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>>>> index 9505ae96fbecc..46affee1c2da0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
>>>>> @@ -1538,7 +1538,8 @@ static int sdma_v5_0_soft_reset(struct amdgpu_ip_block *ip_block)
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int sdma_v5_0_reset_queue(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> u32 inst_id = ring->me;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>>> index a6e612b4a8928..581e75b7d01a8 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
>>>>> @@ -1451,7 +1451,8 @@ static int sdma_v5_2_wait_for_idle(struct amdgpu_ip_block *ip_block)
>>>>> return -ETIMEDOUT;
>>>>> }
>>>>>
>>>>> -static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int sdma_v5_2_reset_queue(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> u32 inst_id = ring->me;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>>>> index 5a70ae17be04e..d9866009edbfc 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
>>>>> @@ -1537,7 +1537,8 @@ static int sdma_v6_0_ring_preempt_ib(struct amdgpu_ring *ring)
>>>>> return r;
>>>>> }
>>>>>
>>>>> -static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> int i, r;
>>>>> @@ -1555,7 +1556,7 @@ static int sdma_v6_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>>>> if (r)
>>>>> return r;
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>>>> index ad47d0bdf7775..c546e73642296 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
>>>>> @@ -802,7 +802,8 @@ static bool sdma_v7_0_check_soft_reset(struct amdgpu_ip_block *ip_block)
>>>>> return false;
>>>>> }
>>>>>
>>>>> -static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> int i, r;
>>>>> @@ -820,7 +821,7 @@ static int sdma_v7_0_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>> - r = amdgpu_mes_reset_legacy_queue(adev, ring, vmid, true);
>>>>> + r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
>>>>> if (r)
>>>>> return r;
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>>> index b5071f77f78d2..47a0deceff433 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
>>>>> @@ -1967,7 +1967,8 @@ static int vcn_v4_0_ring_patch_cs_in_place(struct amdgpu_cs_parser *p,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int vcn_v4_0_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>>> index 5a33140f57235..d961a824d2098 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>>> @@ -1594,7 +1594,8 @@ static void vcn_v4_0_3_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>>>> }
>>>>> }
>>>>>
>>>>> -static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int vcn_v4_0_3_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> int r = 0;
>>>>> int vcn_inst;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>>>> index 16ade84facc78..10bd714592278 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c
>>>>> @@ -1465,7 +1465,8 @@ static void vcn_v4_0_5_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>>>> }
>>>>> }
>>>>>
>>>>> -static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int vcn_v4_0_5_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>>> index f8e3f0b882da5..7e6a7ead9a086 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
>>>>> @@ -1192,7 +1192,8 @@ static void vcn_v5_0_0_unified_ring_set_wptr(struct amdgpu_ring *ring)
>>>>> }
>>>>> }
>>>>>
>>>>> -static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring, unsigned int vmid)
>>>>> +static int vcn_v5_0_0_ring_reset(struct amdgpu_ring *ring,
>>>>> + struct amdgpu_job *job)
>>>>> {
>>>>> struct amdgpu_device *adev = ring->adev;
>>>>> struct amdgpu_vcn_inst *vinst = &adev->vcn.inst[ring->me];
>
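The hunks quoted above all make the same interface change: each per-IP reset
callback now takes the hanging amdgpu_job instead of a bare vmid, and reads
the VMID from the job itself. As a minimal, illustrative sketch of the
resulting callback shape -- the function name sdma_vN_reset_queue is a
placeholder, the body is abbreviated, and only amdgpu_mes_reset_legacy_queue
and the fields visible in the diff are taken from the patch:

	/* Sketch only: reset callback after the signature rework. */
	static int sdma_vN_reset_queue(struct amdgpu_ring *ring,
				       struct amdgpu_job *job)
	{
		struct amdgpu_device *adev = ring->adev;
		int r;

		/* ask the MES to reset this legacy queue for the job's VMID */
		r = amdgpu_mes_reset_legacy_queue(adev, ring, job->vmid, true);
		if (r)
			return r;

		/* ... re-emit the unprocessed, non-guilty state here ... */
		return 0;
	}

Passing the job through gives the reset path direct access to the guilty
job's state (VMID, fence) instead of forcing each backend to recover it
from the ring.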
end of thread, other threads: [~2025-06-11 9:57 UTC | newest]
Thread overview: 43+ messages
2025-06-06 6:43 [PATCH V7 00/29] Reset improvements Alex Deucher
2025-06-06 6:43 ` [PATCH 01/29] drm/amdgpu: enable legacy enforce isolation by default Alex Deucher
2025-06-06 6:43 ` [PATCH 02/29] drm/amdgpu/gfx7: drop reset_kgq Alex Deucher
2025-06-06 11:33 ` Christian König
2025-06-06 6:43 ` [PATCH 03/29] drm/amdgpu/gfx8: " Alex Deucher
2025-06-06 6:43 ` [PATCH 04/29] drm/amdgpu/gfx9: " Alex Deucher
2025-06-06 6:43 ` [PATCH 05/29] drm/amdgpu: switch job hw_fence to amdgpu_fence Alex Deucher
2025-06-06 11:39 ` Christian König
2025-06-06 16:08 ` Alex Deucher
2025-06-10 8:23 ` Christian König
2025-06-06 6:43 ` [PATCH 06/29] drm/amdgpu: update ring reset function signature Alex Deucher
2025-06-06 11:41 ` Christian König
2025-06-06 16:00 ` Alex Deucher
2025-06-09 12:43 ` Sundararaju, Sathishkumar
2025-06-09 13:33 ` Alex Deucher
2025-06-10 8:24 ` Christian König
2025-06-10 16:37 ` Sundararaju, Sathishkumar
2025-06-11 9:57 ` Christian König
2025-06-11 6:54 ` Alex Deucher
2025-06-06 6:43 ` [PATCH 07/29] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
2025-06-06 6:43 ` [PATCH 08/29] drm/amdgpu: move force completion into ring resets Alex Deucher
2025-06-06 6:43 ` [PATCH 09/29] drm/amdgpu: move guilty handling " Alex Deucher
2025-06-06 6:43 ` [PATCH 10/29] drm/amdgpu: track ring state associated with a job Alex Deucher
2025-06-06 6:43 ` [PATCH 11/29] drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset Alex Deucher
2025-06-06 6:43 ` [PATCH 12/29] drm/amdgpu/gfx9.4.3: " Alex Deucher
2025-06-06 6:43 ` [PATCH 13/29] drm/amdgpu/gfx10: re-emit unprocessed state on ring reset Alex Deucher
2025-06-06 6:43 ` [PATCH 14/29] drm/amdgpu/gfx11: " Alex Deucher
2025-06-06 6:43 ` [PATCH 15/29] drm/amdgpu/gfx12: " Alex Deucher
2025-06-06 6:43 ` [PATCH 16/29] drm/amdgpu/sdma6: " Alex Deucher
2025-06-06 6:43 ` [PATCH 17/29] drm/amdgpu/sdma7: " Alex Deucher
2025-06-06 6:43 ` [PATCH 18/29] drm/amdgpu/jpeg2: " Alex Deucher
2025-06-06 6:43 ` [PATCH 19/29] drm/amdgpu/jpeg2.5: " Alex Deucher
2025-06-06 6:43 ` [PATCH 20/29] drm/amdgpu/jpeg3: " Alex Deucher
2025-06-06 6:43 ` [PATCH 21/29] drm/amdgpu/jpeg4: " Alex Deucher
2025-06-06 6:43 ` [PATCH 22/29] drm/amdgpu/jpeg4.0.3: " Alex Deucher
2025-06-06 6:43 ` [PATCH 23/29] drm/amdgpu/jpeg4.0.5: add queue reset Alex Deucher
2025-06-06 6:43 ` [PATCH 24/29] drm/amdgpu/jpeg5: " Alex Deucher
2025-06-06 6:43 ` [PATCH 25/29] drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset Alex Deucher
2025-06-06 6:43 ` [PATCH 26/29] drm/amdgpu/vcn4: " Alex Deucher
2025-06-06 6:43 ` [PATCH 27/29] drm/amdgpu/vcn4.0.3: " Alex Deucher
2025-06-06 6:43 ` [PATCH 28/29] drm/amdgpu/vcn4.0.5: " Alex Deucher
2025-06-06 6:43 ` [PATCH 29/29] drm/amdgpu/vcn5: " Alex Deucher
2025-06-09 14:23 ` Sundararaju, Sathishkumar