amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
@ 2025-06-11  5:56 Jesse Zhang
  2025-06-11  5:56 ` [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access Jesse Zhang
  2025-06-11  6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
  0 siblings, 2 replies; 4+ messages in thread
From: Jesse Zhang @ 2025-06-11  5:56 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, jonathan.kim, jiadong.zhu,
	Jesse Zhang, Jesse Zhang

Adds logic to handle instance ID conversion during SDMA engine reset
when harvest_config is active. This ensures correct physical engine
addressing when some SDMA instances are harvested.

Changes include:
1. Added instance ID remapping using GET_INST macro when harvest_config
   is non-zero
2. Conversion happens before engine reset procedure begins
3. Maintains existing reset flow for non-harvested configurations

This fixes hardware initialization issues on devices with harvested
SDMA instances where the logical instance IDs don't match physical
hardware mapping.

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      | 3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h      | 1 +
 3 files changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index a0e9bf9b2710..4282f60a0cef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -759,6 +759,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
 				~(1U << harvest_info->list[i].number_instance);
 			break;
 		case SDMA0_HWID:
+			adev->sdma.harvest_config |= (1U << harvest_info->list[i].number_instance);
 			adev->sdma.sdma_mask &=
 				~(1U << harvest_info->list[i].number_instance);
 			break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
index 6716ac281c49..0bfd2c138d24 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
@@ -581,6 +581,9 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id)
 	bool gfx_sched_stopped = false, page_sched_stopped = false;
 
 	mutex_lock(&sdma_instance->engine_reset_mutex);
+
+	if (adev->sdma.harvest_config)
+		instance_id = GET_INST(SDMA0, instance_id);
 	/* Stop the scheduler's work queue for the GFX and page rings if they are running.
 	* This ensures that no new tasks are submitted to the queues while
 	* the reset is in progress.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index e5f8951bbb6f..fed00854a1a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -123,6 +123,7 @@ struct amdgpu_sdma {
 
 	int			num_instances;
 	uint32_t 		sdma_mask;
+	uint32_t		harvest_config;
 	int			num_inst_per_aid;
 	uint32_t                    srbm_soft_reset;
 	bool			has_page_queue;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access
  2025-06-11  5:56 [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Jesse Zhang
@ 2025-06-11  5:56 ` Jesse Zhang
  2025-06-11  6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
  1 sibling, 0 replies; 4+ messages in thread
From: Jesse Zhang @ 2025-06-11  5:56 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, jonathan.kim, jiadong.zhu,
	Jesse Zhang, Jesse Zhang

The current SDMA v4.4.2 queue reset logic incorrectly uses GET_INST
macro for queue operations, leading to array index out-of-bounds
errors when harvesting is enabled. This manifests as UBSAN warnings
when stopping queues during reset operations.

[  306.871518] UBSAN: array-index-out-of-bounds in drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c:118:38
[  306.871538] index 4294967295 is out of range for type 'uint32_t *[44]'
[  306.871929]  amdgpu_sdma_reset_engine+0xe4/0x320 [amdgpu]
[  306.872115]  reset_queues_on_hws_hang+0x2dc/0x4d0 [amdgpu]

The fix ensures we use physical instance IDs consistently for queue
operations while maintaining harvest-aware mapping for register access.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 9c169112a5e7..3de125062ee3 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -1670,7 +1670,7 @@ static bool sdma_v4_4_2_page_ring_is_guilty(struct amdgpu_ring *ring)
 static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
 {
 	struct amdgpu_device *adev = ring->adev;
-	u32 id = GET_INST(SDMA0, ring->me);
+	u32 id = ring->me;
 	int r;
 
 	if (!(adev->sdma.supported_reset & AMDGPU_RESET_TYPE_PER_QUEUE))
@@ -1686,7 +1686,7 @@ static int sdma_v4_4_2_reset_queue(struct amdgpu_ring *ring, unsigned int vmid)
 static int sdma_v4_4_2_stop_queue(struct amdgpu_ring *ring)
 {
 	struct amdgpu_device *adev = ring->adev;
-	u32 instance_id = GET_INST(SDMA0, ring->me);
+	u32 instance_id = ring->me;
 	u32 inst_mask;
 	uint64_t rptr;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
  2025-06-11  5:56 [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Jesse Zhang
  2025-06-11  5:56 ` [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access Jesse Zhang
@ 2025-06-11  6:14 ` Lazar, Lijo
  2025-06-11  7:11   ` Zhang, Jesse(Jie)
  1 sibling, 1 reply; 4+ messages in thread
From: Lazar, Lijo @ 2025-06-11  6:14 UTC (permalink / raw)
  To: Jesse Zhang, amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, jonathan.kim, jiadong.zhu



On 6/11/2025 11:26 AM, Jesse Zhang wrote:
> Adds logic to handle instance ID conversion during SDMA engine reset
> when harvest_config is active. This ensures correct physical engine
> addressing when some SDMA instances are harvested.
> 
> Changes include:
> 1. Added instance ID remapping using GET_INST macro when harvest_config
>    is non-zero
> 2. Conversion happens before engine reset procedure begins
> 3. Maintains existing reset flow for non-harvested configurations
> 
> This fixes hardware initialization issues on devices with harvested
> SDMA instances where the logical instance IDs don't match physical
> hardware mapping.
> 

This shouldn't be required. Without harvest-awareness, driver won't load
properly on MI308.

Thanks,
Lijo

> Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      | 3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h      | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index a0e9bf9b2710..4282f60a0cef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -759,6 +759,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
>  				~(1U << harvest_info->list[i].number_instance);
>  			break;
>  		case SDMA0_HWID:
> +			adev->sdma.harvest_config |= (1U << harvest_info->list[i].number_instance);
>  			adev->sdma.sdma_mask &=
>  				~(1U << harvest_info->list[i].number_instance);
>  			break;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> index 6716ac281c49..0bfd2c138d24 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> @@ -581,6 +581,9 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id)
>  	bool gfx_sched_stopped = false, page_sched_stopped = false;
>  
>  	mutex_lock(&sdma_instance->engine_reset_mutex);
> +
> +	if (adev->sdma.harvest_config)
> +		instance_id = GET_INST(SDMA0, instance_id);
>  	/* Stop the scheduler's work queue for the GFX and page rings if they are running.
>  	* This ensures that no new tasks are submitted to the queues while
>  	* the reset is in progress.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> index e5f8951bbb6f..fed00854a1a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> @@ -123,6 +123,7 @@ struct amdgpu_sdma {
>  
>  	int			num_instances;
>  	uint32_t 		sdma_mask;
> +	uint32_t		harvest_config;
>  	int			num_inst_per_aid;
>  	uint32_t                    srbm_soft_reset;
>  	bool			has_page_queue;


^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines
  2025-06-11  6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
@ 2025-06-11  7:11   ` Zhang, Jesse(Jie)
  0 siblings, 0 replies; 4+ messages in thread
From: Zhang, Jesse(Jie) @ 2025-06-11  7:11 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Kim, Jonathan,
	Zhu, Jiadong

[AMD Official Use Only - AMD Internal Distribution Only]

Thanks Lijo
As we discussed offline, we will remove the harvest_config check.

Regards
Jesse

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@amd.com>
Sent: Wednesday, June 11, 2025 2:15 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Kim, Jonathan <Jonathan.Kim@amd.com>; Zhu, Jiadong <Jiadong.Zhu@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines



On 6/11/2025 11:26 AM, Jesse Zhang wrote:
> Adds logic to handle instance ID conversion during SDMA engine reset
> when harvest_config is active. This ensures correct physical engine
> addressing when some SDMA instances are harvested.
>
> Changes include:
> 1. Added instance ID remapping using GET_INST macro when harvest_config
>    is non-zero
> 2. Conversion happens before engine reset procedure begins 3.
> Maintains existing reset flow for non-harvested configurations
>
> This fixes hardware initialization issues on devices with harvested
> SDMA instances where the logical instance IDs don't match physical
> hardware mapping.
>

This shouldn't be required. Without harvest-awareness, driver won't load properly on MI308.

Thanks,
Lijo

> Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c      | 3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h      | 1 +
>  3 files changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index a0e9bf9b2710..4282f60a0cef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -759,6 +759,7 @@ static void amdgpu_discovery_read_from_harvest_table(struct amdgpu_device *adev,
>                               ~(1U << harvest_info->list[i].number_instance);
>                       break;
>               case SDMA0_HWID:
> +                     adev->sdma.harvest_config |= (1U <<
> +harvest_info->list[i].number_instance);
>                       adev->sdma.sdma_mask &=
>                               ~(1U << harvest_info->list[i].number_instance);
>                       break;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> index 6716ac281c49..0bfd2c138d24 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c
> @@ -581,6 +581,9 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, uint32_t instance_id)
>       bool gfx_sched_stopped = false, page_sched_stopped = false;
>
>       mutex_lock(&sdma_instance->engine_reset_mutex);
> +
> +     if (adev->sdma.harvest_config)
> +             instance_id = GET_INST(SDMA0, instance_id);
>       /* Stop the scheduler's work queue for the GFX and page rings if they are running.
>       * This ensures that no new tasks are submitted to the queues while
>       * the reset is in progress.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> index e5f8951bbb6f..fed00854a1a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
> @@ -123,6 +123,7 @@ struct amdgpu_sdma {
>
>       int                     num_instances;
>       uint32_t                sdma_mask;
> +     uint32_t                harvest_config;
>       int                     num_inst_per_aid;
>       uint32_t                    srbm_soft_reset;
>       bool                    has_page_queue;


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-06-11  7:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11  5:56 [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Jesse Zhang
2025-06-11  5:56 ` [PATCH 2/2] drm/amdgpu: Fix SDMA queue reset array out-of-bounds access Jesse Zhang
2025-06-11  6:14 ` [PATCH 1/2] drm/amdgpu: Implement instance ID remapping for harvested SDMA engines Lazar, Lijo
2025-06-11  7:11   ` Zhang, Jesse(Jie)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).