* [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
@ 2025-06-11 5:56 Jesse Zhang
2025-06-11 5:56 ` [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access Jesse Zhang
2025-06-11 6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
0 siblings, 2 replies; 4+ messages in thread
From: Jesse Zhang @ 2025-06-11 5:56 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, Christian Koenig, jonathan.kim, jiadong.zhu,
Jesse Zhang, Jesse Zhang
Adds logic to handle instance ID conversion during SDMA engine reset
when harvest_config is active. This ensures correct physical engine
addressing when some SDMA instances are harvested.
Changes include:
1. Added instance ID remapping using GET_INST macro when harvest_config
is non-zero
2. Conversion happens before engine reset procedure begins
3. Maintains existing reset flow for non-harvested configurations
This fixes hardware initialization issues on devices with harvested
SDMA instances where the logical instance IDs don't match physical
hardware mapping.
Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 3 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 1 +
3 files changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index a0e9bf9b2710..4282f60a0cef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -759,6 +759,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
~(1U << harvest_info->list[i].number_instance);
break;
case SDMA0_HWID:
+ adev->sdma.harvest_config |= (1U << harvest_info->list[i].number_instance);
adev->sdma.sdma_mask &=
~(1U << harvest_info->list[i].number_instance);
break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index 6716ac281c49..0bfd2c138d24 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -581,6 +581,9 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id)
bool gfx_sched_stopped = false, page_sched_stopped = false;
mutex_lock(&sdma_instance->engine_reset_mutex);
+
+ if (adev->sdma.harvest_config)
+ instance_id = GET_INST(SDMA0, instance_id);
/* Stop the scheduler's work queue for the GFX and page rings if they are running.
* This ensures that no new tasks are submitted to the queues while
* the reset is in progress.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index e5f8951bbb6f..fed00854a1a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -123,6 +123,7 @@ struct amdgpu_sdma {
int num_instances;
uint32_t sdma_mask;
+ uint32_t harvest_config;
int num_inst_per_aid;
uint32_t srbm_soft_reset;
bool has_page_queue;
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access
2025-06-11 5:56 [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Jesse Zhang
@ 2025-06-11 5:56 ` Jesse Zhang
2025-06-11 6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
1 sibling, 0 replies; 4+ messages in thread
From: Jesse Zhang @ 2025-06-11 5:56 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, Christian Koenig, jonathan.kim, jiadong.zhu,
Jesse Zhang, Jesse Zhang
The current SDMA v4.4.2 queue reset logic incorrectly uses GET_INST
macro for queue operations, leading to array index out-of-bounds
errors when harvesting is enabled. This manifests as UBSAN warnings
when stopping queues during reset operations.
[ 306.871518] UBSAN: array-index-out-of-bounds in drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c:118:38
[ 306.871538] index 4294967295 is out of range for type 'uint32_t *[44]'
[ 306.871929] amdgpu_sdma_reset_engine+0xe4/0x320 [amdgpu]
[ 306.872115] reset_queues_on_hws_hang+0x2dc/0x4d0 [amdgpu]
The fix ensures we use physical instance IDs consistently for queue
operations while maintaining harvest-aware mapping for register access.
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 9c169112a5e7..3de125062ee3 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -1670,7 +1670,7 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
{
struct amdgpu_device *adev = ring->adev;
- u32 id = GET_INST(SDMA0, ring->me);
+ u32 id = ring->me;
int r;
if (!(adev->sdma.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
@@ -1686,7 +1686,7 @@ static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
static int sdma_v4_4_2_stop_queue(struct amdgpu_ring *ring)
{
struct amdgpu_device *adev = ring->adev;
- u32 instance_id = GET_INST(SDMA0, ring->me);
+ u32 instance_id = ring->me;
u32 inst_mask;
uint64_t rptr;
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
2025-06-11 5:56 [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Jesse Zhang
2025-06-11 5:56 ` [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access Jesse Zhang
@ 2025-06-11 6:14 ` Lazar, Lijo
2025-06-11 7:11 ` Zhang, Jesse(Jie)
1 sibling, 1 reply; 4+ messages in thread
From: Lazar, Lijo @ 2025-06-11 6:14 UTC (permalink / raw)
To: Jesse Zhang, amd-gfx
Cc: Alexander.Deucher, Christian Koenig, jonathan.kim, jiadong.zhu
On 6/11/2025 11:26 AM, Jesse Zhang wrote:
> Adds logic to handle instance ID conversion during SDMA engine reset
> when harvest_config is active. This ensures correct physical engine
> addressing when some SDMA instances are harvested.
>
> Changes include:
> 1. Added instance ID remapping using GET_INST macro when harvest_config
> is non-zero
> 2. Conversion happens before engine reset procedure begins
> 3. Maintains existing reset flow for non-harvested configurations
>
> This fixes hardware initialization issues on devices with harvested
> SDMA instances where the logical instance IDs don't match physical
> hardware mapping.
>
This shouldn't be required. Without harvest-awareness, driver won't load
properly on MI308.
Thanks,
Lijo
> Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 3 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 1 +
> 3 files changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index a0e9bf9b2710..4282f60a0cef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -759,6 +759,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
> ~(1U << harvest_info->list[i].number_instance);
> break;
> case SDMA0_HWID:
> + adev->sdma.harvest_config |= (1U << harvest_info->list[i].number_instance);
> adev->sdma.sdma_mask &=
> ~(1U << harvest_info->list[i].number_instance);
> break;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> index 6716ac281c49..0bfd2c138d24 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> @@ -581,6 +581,9 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id)
> bool gfx_sched_stopped = false, page_sched_stopped = false;
>
> mutex_lock(&sdma_instance->engine_reset_mutex);
> +
> + if (adev->sdma.harvest_config)
> + instance_id = GET_INST(SDMA0, instance_id);
> /* Stop the scheduler's work queue for the GFX and page rings if they are running.
> * This ensures that no new tasks are submitted to the queues while
> * the reset is in progress.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> index e5f8951bbb6f..fed00854a1a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> @@ -123,6 +123,7 @@ struct amdgpu_sdma {
>
> int num_instances;
> uint32_t sdma_mask;
> + uint32_t harvest_config;
> int num_inst_per_aid;
> uint32_t srbm_soft_reset;
> bool has_page_queue;
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
2025-06-11 6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
@ 2025-06-11 7:11 ` Zhang, Jesse(Jie)
0 siblings, 0 replies; 4+ messages in thread
From: Zhang, Jesse(Jie) @ 2025-06-11 7:11 UTC (permalink / raw)
To: Lazar, Lijo, amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander, Koenig, Christian, Kim, Jonathan,
Zhu, Jiadong
[AMD Official Use Only - AMD Internal Distribution Only]
Thanks Lijo
As we discussed offline, we will remove the harvest_config check.
Regards
Jesse
-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@amd.com>
Sent: Wednesday, June 11, 2025 2:15 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>; Zhu, Jiadong <Jiadong.Zhu@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
On 6/11/2025 11:26 AM, Jesse Zhang wrote:
> Adds logic to handle instance ID conversion during SDMA engine reset
> when harvest_config is active. This ensures correct physical engine
> addressing when some SDMA instances are harvested.
>
> Changes include:
> 1. Added instance ID remapping using GET_INST macro when harvest_config
> is non-zero
> 2. Conversion happens before engine reset procedure begins 3.
> Maintains existing reset flow for non-harvested configurations
>
> This fixes hardware initialization issues on devices with harvested
> SDMA instances where the logical instance IDs don't match physical
> hardware mapping.
>
This shouldn't be required. Without harvest-awareness, driver won't load properly on MI308.
Thanks,
Lijo
> Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c | 3 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 1 +
> 3 files changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index a0e9bf9b2710..4282f60a0cef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -759,6 +759,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
> ~(1U << harvest_info->list[i].number_instance);
> break;
> case SDMA0_HWID:
> + adev->sdma.harvest_config |= (1U <<
> +harvest_info->list[i].number_instance);
> adev->sdma.sdma_mask &=
> ~(1U << harvest_info->list[i].number_instance);
> break;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> index 6716ac281c49..0bfd2c138d24 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> @@ -581,6 +581,9 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id)
> bool gfx_sched_stopped = false, page_sched_stopped = false;
>
> mutex_lock(&sdma_instance->engine_reset_mutex);
> +
> + if (adev->sdma.harvest_config)
> + instance_id = GET_INST(SDMA0, instance_id);
> /* Stop the scheduler's work queue for the GFX and page rings if they are running.
> * This ensures that no new tasks are submitted to the queues while
> * the reset is in progress.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> index e5f8951bbb6f..fed00854a1a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> @@ -123,6 +123,7 @@ struct amdgpu_sdma {
>
> int num_instances;
> uint32_t sdma_mask;
> + uint32_t harvest_config;
> int num_inst_per_aid;
> uint32_t srbm_soft_reset;
> bool has_page_queue;
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-06-11 7:11 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11 5:56 [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Jesse Zhang
2025-06-11 5:56 ` [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access Jesse Zhang
2025-06-11 6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
2025-06-11 7:11 ` Zhang, Jesse(Jie)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).