AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Kuehling, Felix" <felix.kuehling@amd.com>
To: "Martin, Andrew" <Andrew.Martin@amd.com>,
	"Russell, Kent" <Kent.Russell@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
Date: Tue, 18 Nov 2025 01:00:07 -0500	[thread overview]
Message-ID: <fe886d93-36c9-4895-a9e6-c9f38f8ceebc@amd.com> (raw)
In-Reply-To: <SJ0PR12MB81387F24365922E3E1EFE646F5C9A@SJ0PR12MB8138.namprd12.prod.outlook.com>


On 2025-11-17 13:13, Martin, Andrew wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Greetings @Kuehling, Felix,
>
> Thanks for the review comments, please see some discovery comments in lined!  Patch 2 with the suggested fixes is on its way.
>
> Note: I have added 1 change in the new patch (re-running the scan, found it)
>
> file: kfd_crat.c:2360
> ...
> peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
>          if (!peer_dev)
>                  return -ENODEV;

Hmm, not sure I agree with this one. This is in a loop that looks for 
XGMI-connected peer devices. Currently we create devices with increasing 
proximity domains and no gaps. So 
kfd_topology_device_by_proximity_domain_no_lock should never fail. Maybe 
in some future scenario where a device was hot-unplugged this assumption 
may be broken.

To be more robust here, I'd just continue if peer_dev is NULL. That way 
a topology with gaps doesn't cause the creation of new devices to fail.


> ...
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>> Sent: Friday, November 14, 2025 4:48 PM
>> To: Russell, Kent <Kent.Russell@amd.com>; Martin, Andrew
>> <Andrew.Martin@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
>>
>> On 2025-11-14 12:02, Russell, Kent wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>>> Andrew Martin
>>>> Sent: Friday, November 14, 2025 9:41 AM
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Cc: Martin, Andrew <Andrew.Martin@amd.com>
>>>> Subject: [PATCH] drm/amdkfd: FORWARD NULL
>>>>
>>>> This patch fixes issues when the code moves forward with a potential
>>>> NULL pointer, without checking.
>>>>
>>>> Signed-off-by: Andrew Martin <andrew.martin@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
>>>>    3 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> index 1ef758ac5076..71b7db5de69f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> @@ -104,6 +104,8 @@ static const char
>>>> *amdkfd_fence_get_driver_name(struct
>>>> dma_fence *f)
>>>>    static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
>>>>    {
>>>>         struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
>>>> +     if (!fence)
>>>> +             return NULL;
>>> Felix can hopefully confirm, but we may need to have something here,
>>> since the references here expect something. Like in #define
>> AMDGPU_JOB_GET_TIMELINE_NAME(job) \
>>> job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence-
>>>> finished)
>> For amdgpu Job fences the above makes sense. But KFD fences are our evictions
>> fences. There is no job associated with them.
>>
>> I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence returns
>> NULL if the f is NULL or the fence is not an amdgpu_amdkfd_fence, based on
>> the fence_ops. But we're in a fence_ops callback here that should only be called
>> for amdgpu_amdkfd_fences.
>>
>> That said, if you need a check to satisfy a static checker, I suggest this:
>>
>>       return fence ? fence->timeline_name : NULL;
> This worked perfectly.
>>
>>>>         return fence->timeline_name;
>>>>    }
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> index ba99e0f258ae..42fa137bdb2f 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process
>>>> *target, uint32_t *flags)
>>>>
>>>>         for (i = 0; i < target->n_pdds; i++) {
>>>>                 struct kfd_topology_device *topo_dev =
>>>> -                             kfd_topology_device_by_id(target->pdds[i]->dev->id);
>>>> +                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
>>>> +             if (!topo_dev)
>>>> +                     continue;
>> This loop checks the capabilities of all the devices. If a device is not found, we
>> should assume the worst and return an error, instead of just assuming that it'll
>> be fine.
>>
> Done
>
>>>>                 uint32_t caps = topo_dev->node_props.capability;
>>>>
>>>>                 if (!(caps &
>>>> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
>>>> &&
>>>> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct
>>>> kfd_process *target,
>>>>         for (i = 0; i < tmp_num_devices; i++) {
>>>>                 struct kfd_process_device *pdd = target->pdds[i];
>>>>                 struct kfd_topology_device *topo_dev =
>>>> kfd_topology_device_by_id(pdd->dev->id);
>>>> +             if (!topo_dev)
>>>> +                     continue;
>> I'd return an error here as well, because we obviously don't have accurate
>> device info.
>>
> Done
>
>>>>                 device_info.gpu_id = pdd->dev->id;
>>>>                 device_info.exception_status = pdd->exception_status;
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> index f5d173f1ca3b..f40d93f82ede 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
>>>> kfd_process_device *pdd,
>>>>         struct kfd_node *dev;
>>>>         int ret;
>>>>
>>>> +     if (!pdd)
>>>> +             return -EINVAL;
>>>> +
>> We generally assume that functions are called with valid parameters.
>> With that argument, we should probably remove the check for !drm_file
>> below as well.
> Done.  This is a good thing to note, but I want to have another lesson here.  Is this true in general or only for internal/private APIs? I asked b/c this comment forced me to really look carefully the kfd_process_device_init_vm(), which is prototyped in kfd_priv.h.  This mean this and any other function from here should be called with valid params.  But public APIs should check that the params are valid, b/c this how we prevent crashes in our library?

Right, I should have been more precise. Public APIs where you don't know 
and don't trust the callers need to check their parameters. Internal 
APIs where you know and trust the callers don't need to do redundant checks.

Regards,
   Felix


>> Regards,
>>     Felix
>>
>>
>>>>         if (!drm_file)
>>>>                 return -EINVAL;
>>> Probably easier to just combine the !pdd and !drm_file into the same check.
>>>
>>>    Kent
>>>> --
>>>> 2.43.0

      reply	other threads:[~2025-11-18  6:00 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-14 14:40 [PATCH] drm/amdkfd: FORWARD NULL Andrew Martin
2025-11-14 17:02 ` Russell, Kent
2025-11-14 17:37   ` Martin, Andrew
2025-11-14 21:47   ` Felix Kuehling
2025-11-17 18:13     ` Martin, Andrew
2025-11-18  6:00       ` Kuehling, Felix [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fe886d93-36c9-4895-a9e6-c9f38f8ceebc@amd.com \
    --to=felix.kuehling@amd.com \
    --cc=Andrew.Martin@amd.com \
    --cc=Kent.Russell@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox