AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdkfd: FORWARD NULL
@ 2025-11-14 14:40 Andrew Martin
  2025-11-14 17:02 ` Russell, Kent
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Martin @ 2025-11-14 14:40 UTC (permalink / raw)
  To: amd-gfx; +Cc: Andrew Martin

This patch fixes issues when the code moves forward with a potential
NULL pointer, without checking.

Signed-off-by: Andrew Martin <andrew.martin@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
 drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
index 1ef758ac5076..71b7db5de69f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
@@ -104,6 +104,8 @@ static const char *amdkfd_fence_get_driver_name(struct dma_fence *f)
 static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
 {
 	struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
+	if (!fence)
+		return NULL;
 
 	return fence->timeline_name;
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index ba99e0f258ae..42fa137bdb2f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process *target, uint32_t *flags)
 
 	for (i = 0; i < target->n_pdds; i++) {
 		struct kfd_topology_device *topo_dev =
-				kfd_topology_device_by_id(target->pdds[i]->dev->id);
+			kfd_topology_device_by_id(target->pdds[i]->dev->id);
+		if (!topo_dev)
+			continue;
 		uint32_t caps = topo_dev->node_props.capability;
 
 		if (!(caps & HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED) &&
@@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct kfd_process *target,
 	for (i = 0; i < tmp_num_devices; i++) {
 		struct kfd_process_device *pdd = target->pdds[i];
 		struct kfd_topology_device *topo_dev = kfd_topology_device_by_id(pdd->dev->id);
+		if (!topo_dev)
+			continue;
 
 		device_info.gpu_id = pdd->dev->id;
 		device_info.exception_status = pdd->exception_status;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index f5d173f1ca3b..f40d93f82ede 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct kfd_process_device *pdd,
 	struct kfd_node *dev;
 	int ret;
 
+	if (!pdd)
+		return -EINVAL;
+
 	if (!drm_file)
 		return -EINVAL;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* RE: [PATCH] drm/amdkfd: FORWARD NULL
  2025-11-14 14:40 [PATCH] drm/amdkfd: FORWARD NULL Andrew Martin
@ 2025-11-14 17:02 ` Russell, Kent
  2025-11-14 17:37   ` Martin, Andrew
  2025-11-14 21:47   ` Felix Kuehling
  0 siblings, 2 replies; 6+ messages in thread
From: Russell, Kent @ 2025-11-14 17:02 UTC (permalink / raw)
  To: Martin, Andrew, amd-gfx@lists.freedesktop.org; +Cc: Martin, Andrew

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Andrew
> Martin
> Sent: Friday, November 14, 2025 9:41 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Martin, Andrew <Andrew.Martin@amd.com>
> Subject: [PATCH] drm/amdkfd: FORWARD NULL
>
> This patch fixes issues when the code moves forward with a potential
> NULL pointer, without checking.
>
> Signed-off-by: Andrew Martin <andrew.martin@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
>  3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> index 1ef758ac5076..71b7db5de69f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> @@ -104,6 +104,8 @@ static const char *amdkfd_fence_get_driver_name(struct
> dma_fence *f)
>  static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
>  {
>       struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
> +     if (!fence)
> +             return NULL;

Felix can hopefully confirm, but we may need to have something here, since the references here expect something. Like in
#define AMDGPU_JOB_GET_TIMELINE_NAME(job) \
         job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence->finished)

>
>       return fence->timeline_name;
>  }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> index ba99e0f258ae..42fa137bdb2f 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process *target,
> uint32_t *flags)
>
>       for (i = 0; i < target->n_pdds; i++) {
>               struct kfd_topology_device *topo_dev =
> -                             kfd_topology_device_by_id(target->pdds[i]->dev->id);
> +                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
> +             if (!topo_dev)
> +                     continue;
>               uint32_t caps = topo_dev->node_props.capability;
>
>               if (!(caps &
> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
> &&
> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct kfd_process
> *target,
>       for (i = 0; i < tmp_num_devices; i++) {
>               struct kfd_process_device *pdd = target->pdds[i];
>               struct kfd_topology_device *topo_dev =
> kfd_topology_device_by_id(pdd->dev->id);
> +             if (!topo_dev)
> +                     continue;
>
>               device_info.gpu_id = pdd->dev->id;
>               device_info.exception_status = pdd->exception_status;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index f5d173f1ca3b..f40d93f82ede 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
> kfd_process_device *pdd,
>       struct kfd_node *dev;
>       int ret;
>
> +     if (!pdd)
> +             return -EINVAL;
> +
>       if (!drm_file)
>               return -EINVAL;

Probably easier to just combine the !pdd and !drm_file into the same check.

 Kent
>
> --
> 2.43.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH] drm/amdkfd: FORWARD NULL
  2025-11-14 17:02 ` Russell, Kent
@ 2025-11-14 17:37   ` Martin, Andrew
  2025-11-14 21:47   ` Felix Kuehling
  1 sibling, 0 replies; 6+ messages in thread
From: Martin, Andrew @ 2025-11-14 17:37 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

Greetings @Russell, Kent

>       struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
> +     if (!fence)
> +             return NULL;

Thanks, looking forward to any lessons from this.  The main concern for Coverity is use of the pointer without checking first if it is valid.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amdkfd: FORWARD NULL
  2025-11-14 17:02 ` Russell, Kent
  2025-11-14 17:37   ` Martin, Andrew
@ 2025-11-14 21:47   ` Felix Kuehling
  2025-11-17 18:13     ` Martin, Andrew
  1 sibling, 1 reply; 6+ messages in thread
From: Felix Kuehling @ 2025-11-14 21:47 UTC (permalink / raw)
  To: Russell, Kent, Martin, Andrew, amd-gfx@lists.freedesktop.org

On 2025-11-14 12:02, Russell, Kent wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Andrew
>> Martin
>> Sent: Friday, November 14, 2025 9:41 AM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Martin, Andrew <Andrew.Martin@amd.com>
>> Subject: [PATCH] drm/amdkfd: FORWARD NULL
>>
>> This patch fixes issues when the code moves forward with a potential
>> NULL pointer, without checking.
>>
>> Signed-off-by: Andrew Martin <andrew.martin@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
>>   drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> index 1ef758ac5076..71b7db5de69f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> @@ -104,6 +104,8 @@ static const char *amdkfd_fence_get_driver_name(struct
>> dma_fence *f)
>>   static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
>>   {
>>        struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
>> +     if (!fence)
>> +             return NULL;
> Felix can hopefully confirm, but we may need to have something here, since the references here expect something. Like in
> #define AMDGPU_JOB_GET_TIMELINE_NAME(job) \
>           job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence->finished)

For amdgpu Job fences the above makes sense. But KFD fences are our 
evictions fences. There is no job associated with them.

I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence 
returns NULL if the f is NULL or the fence is not an 
amdgpu_amdkfd_fence, based on the fence_ops. But we're in a fence_ops 
callback here that should only be called for amdgpu_amdkfd_fences.

That said, if you need a check to satisfy a static checker, I suggest this:

     return fence ? fence->timeline_name : NULL;


>
>>        return fence->timeline_name;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>> index ba99e0f258ae..42fa137bdb2f 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process *target,
>> uint32_t *flags)
>>
>>        for (i = 0; i < target->n_pdds; i++) {
>>                struct kfd_topology_device *topo_dev =
>> -                             kfd_topology_device_by_id(target->pdds[i]->dev->id);
>> +                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
>> +             if (!topo_dev)
>> +                     continue;

This loop checks the capabilities of all the devices. If a device is not 
found, we should assume the worst and return an error, instead of just 
assuming that it'll be fine.


>>                uint32_t caps = topo_dev->node_props.capability;
>>
>>                if (!(caps &
>> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
>> &&
>> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct kfd_process
>> *target,
>>        for (i = 0; i < tmp_num_devices; i++) {
>>                struct kfd_process_device *pdd = target->pdds[i];
>>                struct kfd_topology_device *topo_dev =
>> kfd_topology_device_by_id(pdd->dev->id);
>> +             if (!topo_dev)
>> +                     continue;

I'd return an error here as well, because we obviously don't have 
accurate device info.


>>
>>                device_info.gpu_id = pdd->dev->id;
>>                device_info.exception_status = pdd->exception_status;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index f5d173f1ca3b..f40d93f82ede 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
>> kfd_process_device *pdd,
>>        struct kfd_node *dev;
>>        int ret;
>>
>> +     if (!pdd)
>> +             return -EINVAL;
>> +

We generally assume that functions are called with valid parameters. 
With that argument, we should probably remove the check for !drm_file 
below as well.

Regards,
   Felix


>>        if (!drm_file)
>>                return -EINVAL;
> Probably easier to just combine the !pdd and !drm_file into the same check.
>
>   Kent
>> --
>> 2.43.0

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH] drm/amdkfd: FORWARD NULL
  2025-11-14 21:47   ` Felix Kuehling
@ 2025-11-17 18:13     ` Martin, Andrew
  2025-11-18  6:00       ` Kuehling, Felix
  0 siblings, 1 reply; 6+ messages in thread
From: Martin, Andrew @ 2025-11-17 18:13 UTC (permalink / raw)
  To: Kuehling, Felix, Russell, Kent, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

Greetings @Kuehling, Felix,

Thanks for the review comments, please see some discovery comments in lined!  Patch 2 with the suggested fixes is on its way.

Note: I have added 1 change in the new patch (re-running the scan, found it)

file: kfd_crat.c:2360
...
peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
        if (!peer_dev)
                return -ENODEV;
...

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Friday, November 14, 2025 4:48 PM
> To: Russell, Kent <Kent.Russell@amd.com>; Martin, Andrew
> <Andrew.Martin@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
>
> On 2025-11-14 12:02, Russell, Kent wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> >> Andrew Martin
> >> Sent: Friday, November 14, 2025 9:41 AM
> >> To: amd-gfx@lists.freedesktop.org
> >> Cc: Martin, Andrew <Andrew.Martin@amd.com>
> >> Subject: [PATCH] drm/amdkfd: FORWARD NULL
> >>
> >> This patch fixes issues when the code moves forward with a potential
> >> NULL pointer, without checking.
> >>
> >> Signed-off-by: Andrew Martin <andrew.martin@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
> >>   drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
> >>   drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
> >>   3 files changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> index 1ef758ac5076..71b7db5de69f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> @@ -104,6 +104,8 @@ static const char
> >> *amdkfd_fence_get_driver_name(struct
> >> dma_fence *f)
> >>   static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
> >>   {
> >>        struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
> >> +     if (!fence)
> >> +             return NULL;
> > Felix can hopefully confirm, but we may need to have something here,
> > since the references here expect something. Like in #define
> AMDGPU_JOB_GET_TIMELINE_NAME(job) \
> >
> > job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence-
> > >finished)
>
> For amdgpu Job fences the above makes sense. But KFD fences are our evictions
> fences. There is no job associated with them.
>
> I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence returns
> NULL if the f is NULL or the fence is not an amdgpu_amdkfd_fence, based on
> the fence_ops. But we're in a fence_ops callback here that should only be called
> for amdgpu_amdkfd_fences.
>
> That said, if you need a check to satisfy a static checker, I suggest this:
>
>      return fence ? fence->timeline_name : NULL;

This worked perfectly.
>
>
> >
> >>        return fence->timeline_name;
> >>   }
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> index ba99e0f258ae..42fa137bdb2f 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process
> >> *target, uint32_t *flags)
> >>
> >>        for (i = 0; i < target->n_pdds; i++) {
> >>                struct kfd_topology_device *topo_dev =
> >> -                             kfd_topology_device_by_id(target->pdds[i]->dev->id);
> >> +                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
> >> +             if (!topo_dev)
> >> +                     continue;
>
> This loop checks the capabilities of all the devices. If a device is not found, we
> should assume the worst and return an error, instead of just assuming that it'll
> be fine.
>

Done

>
> >>                uint32_t caps = topo_dev->node_props.capability;
> >>
> >>                if (!(caps &
> >> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
> >> &&
> >> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct
> >> kfd_process *target,
> >>        for (i = 0; i < tmp_num_devices; i++) {
> >>                struct kfd_process_device *pdd = target->pdds[i];
> >>                struct kfd_topology_device *topo_dev =
> >> kfd_topology_device_by_id(pdd->dev->id);
> >> +             if (!topo_dev)
> >> +                     continue;
>
> I'd return an error here as well, because we obviously don't have accurate
> device info.
>

Done

>
> >>
> >>                device_info.gpu_id = pdd->dev->id;
> >>                device_info.exception_status = pdd->exception_status;
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> index f5d173f1ca3b..f40d93f82ede 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
> >> kfd_process_device *pdd,
> >>        struct kfd_node *dev;
> >>        int ret;
> >>
> >> +     if (!pdd)
> >> +             return -EINVAL;
> >> +
>
> We generally assume that functions are called with valid parameters.
> With that argument, we should probably remove the check for !drm_file
> below as well.

Done.  This is a good thing to note, but I want to have another lesson here.  Is this true in general or only for internal/private APIs? I asked b/c this comment forced me to really look carefully the kfd_process_device_init_vm(), which is prototyped in kfd_priv.h.  This mean this and any other function from here should be called with valid params.  But public APIs should check that the params are valid, b/c this how we prevent crashes in our library?
>
> Regards,
>    Felix
>
>
> >>        if (!drm_file)
> >>                return -EINVAL;
> > Probably easier to just combine the !pdd and !drm_file into the same check.
> >
> >   Kent
> >> --
> >> 2.43.0

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amdkfd: FORWARD NULL
  2025-11-17 18:13     ` Martin, Andrew
@ 2025-11-18  6:00       ` Kuehling, Felix
  0 siblings, 0 replies; 6+ messages in thread
From: Kuehling, Felix @ 2025-11-18  6:00 UTC (permalink / raw)
  To: Martin, Andrew, Russell, Kent, amd-gfx@lists.freedesktop.org


On 2025-11-17 13:13, Martin, Andrew wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Greetings @Kuehling, Felix,
>
> Thanks for the review comments, please see some discovery comments in lined!  Patch 2 with the suggested fixes is on its way.
>
> Note: I have added 1 change in the new patch (re-running the scan, found it)
>
> file: kfd_crat.c:2360
> ...
> peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
>          if (!peer_dev)
>                  return -ENODEV;

Hmm, not sure I agree with this one. This is in a loop that looks for 
XGMI-connected peer devices. Currently we create devices with increasing 
proximity domains and no gaps. So 
kfd_topology_device_by_proximity_domain_no_lock should never fail. Maybe 
in some future scenario where a device was hot-unplugged this assumption 
may be broken.

To be more robust here, I'd just continue if peer_dev is NULL. That way 
a topology with gaps doesn't cause the creation of new devices to fail.


> ...
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>> Sent: Friday, November 14, 2025 4:48 PM
>> To: Russell, Kent <Kent.Russell@amd.com>; Martin, Andrew
>> <Andrew.Martin@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
>>
>> On 2025-11-14 12:02, Russell, Kent wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>>> Andrew Martin
>>>> Sent: Friday, November 14, 2025 9:41 AM
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Cc: Martin, Andrew <Andrew.Martin@amd.com>
>>>> Subject: [PATCH] drm/amdkfd: FORWARD NULL
>>>>
>>>> This patch fixes issues when the code moves forward with a potential
>>>> NULL pointer, without checking.
>>>>
>>>> Signed-off-by: Andrew Martin <andrew.martin@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
>>>>    3 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> index 1ef758ac5076..71b7db5de69f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> @@ -104,6 +104,8 @@ static const char
>>>> *amdkfd_fence_get_driver_name(struct
>>>> dma_fence *f)
>>>>    static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
>>>>    {
>>>>         struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
>>>> +     if (!fence)
>>>> +             return NULL;
>>> Felix can hopefully confirm, but we may need to have something here,
>>> since the references here expect something. Like in #define
>> AMDGPU_JOB_GET_TIMELINE_NAME(job) \
>>> job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence-
>>>> finished)
>> For amdgpu Job fences the above makes sense. But KFD fences are our evictions
>> fences. There is no job associated with them.
>>
>> I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence returns
>> NULL if the f is NULL or the fence is not an amdgpu_amdkfd_fence, based on
>> the fence_ops. But we're in a fence_ops callback here that should only be called
>> for amdgpu_amdkfd_fences.
>>
>> That said, if you need a check to satisfy a static checker, I suggest this:
>>
>>       return fence ? fence->timeline_name : NULL;
> This worked perfectly.
>>
>>>>         return fence->timeline_name;
>>>>    }
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> index ba99e0f258ae..42fa137bdb2f 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process
>>>> *target, uint32_t *flags)
>>>>
>>>>         for (i = 0; i < target->n_pdds; i++) {
>>>>                 struct kfd_topology_device *topo_dev =
>>>> -                             kfd_topology_device_by_id(target->pdds[i]->dev->id);
>>>> +                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
>>>> +             if (!topo_dev)
>>>> +                     continue;
>> This loop checks the capabilities of all the devices. If a device is not found, we
>> should assume the worst and return an error, instead of just assuming that it'll
>> be fine.
>>
> Done
>
>>>>                 uint32_t caps = topo_dev->node_props.capability;
>>>>
>>>>                 if (!(caps &
>>>> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
>>>> &&
>>>> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct
>>>> kfd_process *target,
>>>>         for (i = 0; i < tmp_num_devices; i++) {
>>>>                 struct kfd_process_device *pdd = target->pdds[i];
>>>>                 struct kfd_topology_device *topo_dev =
>>>> kfd_topology_device_by_id(pdd->dev->id);
>>>> +             if (!topo_dev)
>>>> +                     continue;
>> I'd return an error here as well, because we obviously don't have accurate
>> device info.
>>
> Done
>
>>>>                 device_info.gpu_id = pdd->dev->id;
>>>>                 device_info.exception_status = pdd->exception_status;
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> index f5d173f1ca3b..f40d93f82ede 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
>>>> kfd_process_device *pdd,
>>>>         struct kfd_node *dev;
>>>>         int ret;
>>>>
>>>> +     if (!pdd)
>>>> +             return -EINVAL;
>>>> +
>> We generally assume that functions are called with valid parameters.
>> With that argument, we should probably remove the check for !drm_file
>> below as well.
> Done.  This is a good thing to note, but I want to have another lesson here.  Is this true in general or only for internal/private APIs? I asked b/c this comment forced me to really look carefully the kfd_process_device_init_vm(), which is prototyped in kfd_priv.h.  This mean this and any other function from here should be called with valid params.  But public APIs should check that the params are valid, b/c this how we prevent crashes in our library?

Right, I should have been more precise. Public APIs where you don't know 
and don't trust the callers need to check their parameters. Internal 
APIs where you know and trust the callers don't need to do redundant checks.

Regards,
   Felix


>> Regards,
>>     Felix
>>
>>
>>>>         if (!drm_file)
>>>>                 return -EINVAL;
>>> Probably easier to just combine the !pdd and !drm_file into the same check.
>>>
>>>    Kent
>>>> --
>>>> 2.43.0

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-11-18  6:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-14 14:40 [PATCH] drm/amdkfd: FORWARD NULL Andrew Martin
2025-11-14 17:02 ` Russell, Kent
2025-11-14 17:37   ` Martin, Andrew
2025-11-14 21:47   ` Felix Kuehling
2025-11-17 18:13     ` Martin, Andrew
2025-11-18  6:00       ` Kuehling, Felix

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox