From: "Kuehling, Felix" <felix.kuehling@amd.com>
To: "Martin, Andrew" <Andrew.Martin@amd.com>,
"Russell, Kent" <Kent.Russell@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
Date: Tue, 18 Nov 2025 01:00:07 -0500 [thread overview]
Message-ID: <fe886d93-36c9-4895-a9e6-c9f38f8ceebc@amd.com> (raw)
In-Reply-To: <SJ0PR12MB81387F24365922E3E1EFE646F5C9A@SJ0PR12MB8138.namprd12.prod.outlook.com>
On 2025-11-17 13:13, Martin, Andrew wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Greetings @Kuehling, Felix,
>
> Thanks for the review comments, please see some discovery comments in lined! Patch 2 with the suggested fixes is on its way.
>
> Note: I have added 1 change in the new patch (re-running the scan, found it)
>
> file: kfd_crat.c:2360
> ...
> peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
> if (!peer_dev)
> return -ENODEV;
Hmm, not sure I agree with this one. This is in a loop that looks for
XGMI-connected peer devices. Currently we create devices with increasing
proximity domains and no gaps. So
kfd_topology_device_by_proximity_domain_no_lock should never fail. Maybe
in some future scenario where a device was hot-unplugged this assumption
may be broken.
To be more robust here, I'd just continue if peer_dev is NULL. That way
a topology with gaps doesn't cause the creation of new devices to fail.
> ...
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>> Sent: Friday, November 14, 2025 4:48 PM
>> To: Russell, Kent <Kent.Russell@amd.com>; Martin, Andrew
>> <Andrew.Martin@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
>>
>> On 2025-11-14 12:02, Russell, Kent wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>>>> Andrew Martin
>>>> Sent: Friday, November 14, 2025 9:41 AM
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Cc: Martin, Andrew <Andrew.Martin@amd.com>
>>>> Subject: [PATCH] drm/amdkfd: FORWARD NULL
>>>>
>>>> This patch fixes issues when the code moves forward with a potential
>>>> NULL pointer, without checking.
>>>>
>>>> Signed-off-by: Andrew Martin <andrew.martin@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
>>>> drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 6 +++++-
>>>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 3 +++
>>>> 3 files changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> index 1ef758ac5076..71b7db5de69f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>>>> @@ -104,6 +104,8 @@ static const char
>>>> *amdkfd_fence_get_driver_name(struct
>>>> dma_fence *f)
>>>> static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
>>>> {
>>>> struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
>>>> + if (!fence)
>>>> + return NULL;
>>> Felix can hopefully confirm, but we may need to have something here,
>>> since the references here expect something. Like in #define
>> AMDGPU_JOB_GET_TIMELINE_NAME(job) \
>>> job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence-
>>>> finished)
>> For amdgpu Job fences the above makes sense. But KFD fences are our evictions
>> fences. There is no job associated with them.
>>
>> I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence returns
>> NULL if the f is NULL or the fence is not an amdgpu_amdkfd_fence, based on
>> the fence_ops. But we're in a fence_ops callback here that should only be called
>> for amdgpu_amdkfd_fences.
>>
>> That said, if you need a check to satisfy a static checker, I suggest this:
>>
>> return fence ? fence->timeline_name : NULL;
> This worked perfectly.
>>
>>>> return fence->timeline_name;
>>>> }
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> index ba99e0f258ae..42fa137bdb2f 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
>>>> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process
>>>> *target, uint32_t *flags)
>>>>
>>>> for (i = 0; i < target->n_pdds; i++) {
>>>> struct kfd_topology_device *topo_dev =
>>>> - kfd_topology_device_by_id(target->pdds[i]->dev->id);
>>>> + kfd_topology_device_by_id(target->pdds[i]->dev->id);
>>>> + if (!topo_dev)
>>>> + continue;
>> This loop checks the capabilities of all the devices. If a device is not found, we
>> should assume the worst and return an error, instead of just assuming that it'll
>> be fine.
>>
> Done
>
>>>> uint32_t caps = topo_dev->node_props.capability;
>>>>
>>>> if (!(caps &
>>>> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
>>>> &&
>>>> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct
>>>> kfd_process *target,
>>>> for (i = 0; i < tmp_num_devices; i++) {
>>>> struct kfd_process_device *pdd = target->pdds[i];
>>>> struct kfd_topology_device *topo_dev =
>>>> kfd_topology_device_by_id(pdd->dev->id);
>>>> + if (!topo_dev)
>>>> + continue;
>> I'd return an error here as well, because we obviously don't have accurate
>> device info.
>>
> Done
>
>>>> device_info.gpu_id = pdd->dev->id;
>>>> device_info.exception_status = pdd->exception_status;
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> index f5d173f1ca3b..f40d93f82ede 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
>>>> kfd_process_device *pdd,
>>>> struct kfd_node *dev;
>>>> int ret;
>>>>
>>>> + if (!pdd)
>>>> + return -EINVAL;
>>>> +
>> We generally assume that functions are called with valid parameters.
>> With that argument, we should probably remove the check for !drm_file
>> below as well.
> Done. This is a good thing to note, but I want to have another lesson here. Is this true in general or only for internal/private APIs? I asked b/c this comment forced me to really look carefully the kfd_process_device_init_vm(), which is prototyped in kfd_priv.h. This mean this and any other function from here should be called with valid params. But public APIs should check that the params are valid, b/c this how we prevent crashes in our library?
Right, I should have been more precise. Public APIs where you don't know
and don't trust the callers need to check their parameters. Internal
APIs where you know and trust the callers don't need to do redundant checks.
Regards,
Felix
>> Regards,
>> Felix
>>
>>
>>>> if (!drm_file)
>>>> return -EINVAL;
>>> Probably easier to just combine the !pdd and !drm_file into the same check.
>>>
>>> Kent
>>>> --
>>>> 2.43.0
prev parent reply other threads:[~2025-11-18 6:00 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-14 14:40 [PATCH] drm/amdkfd: FORWARD NULL Andrew Martin
2025-11-14 17:02 ` Russell, Kent
2025-11-14 17:37 ` Martin, Andrew
2025-11-14 21:47 ` Felix Kuehling
2025-11-17 18:13 ` Martin, Andrew
2025-11-18 6:00 ` Kuehling, Felix [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fe886d93-36c9-4895-a9e6-c9f38f8ceebc@amd.com \
--to=felix.kuehling@amd.com \
--cc=Andrew.Martin@amd.com \
--cc=Kent.Russell@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox