From: "Lazar, Lijo" <lijo.lazar@amd.com>
To: "SHANMUGAM, SRINIVASAN" <SRINIVASAN.SHANMUGAM@amd.com>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Prosyak, Vitaly" <Vitaly.Prosyak@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
Matthew Brost <matthew.brost@intel.com>
Subject: Re: [PATCH V2] drm/amdgpu: fix a job->pasid access race in gpu recovery
Date: Thu, 11 Dec 2025 11:14:55 +0530
Message-ID: <14ea9e1c-0deb-4d2b-8bea-ef95300b753c@amd.com>
In-Reply-To: <IA0PR12MB820858B3F15710B5E2C6255B90A1A@IA0PR12MB8208.namprd12.prod.outlook.com>
On 12/11/2025 10:52 AM, SHANMUGAM, SRINIVASAN wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Thursday, December 11, 2025 10:34 AM
>> To: Deucher, Alexander <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: SHANMUGAM, SRINIVASAN <SRINIVASAN.SHANMUGAM@amd.com>;
>> Prosyak, Vitaly <Vitaly.Prosyak@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Matthew Brost <matthew.brost@intel.com>
>> Subject: Re: [PATCH V2] drm/amdgpu: fix a job->pasid access race in gpu
>> recovery
>>
>>
>>
>> On 12/11/2025 1:53 AM, Alex Deucher wrote:
>>> Avoid a possible UAF in GPU recovery due to a race between the sched
>>> timeout callback and the tdr work queue.
>>>
>>> The gpu recovery function calls drm_sched_stop() and later
>>> drm_sched_start(). drm_sched_start() restarts the tdr queue which
>>> will eventually free the job. If the tdr queue frees the job before
>>> time out callback completes, the job will be freed and we'll get a UAF
>>> when accessing the pasid. Cache it early to avoid the UAF.
>>>
>>> Fixes: a72002cb181f ("drm/amdgpu: Make use of drm_wedge_task_info")
>>> Cc: SRINIVASAN.SHANMUGAM@amd.com
>>> Cc: vitaly.prosyak@amd.com
>>> Cc: christian.koenig@amd.com
>>> Suggested-by: Matthew Brost <matthew.brost@intel.com>
>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>> ---
>>>
>>> v2: Check the pasid rather than job (Lijo)
>>> Add fixes tag (Christian)
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++--
>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 8a851d7548c00..c6b1dd95c401d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -6634,6 +6634,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> struct amdgpu_hive_info *hive = NULL;
>>> int r = 0;
>>> bool need_emergency_restart = false;
>>> + /* save the pasid here as the job may be freed before the end of the reset */
>>> + int pasid = job ? job->pasid : -EINVAL;
>>>
>>> /*
>>> * If it reaches here because of hang/timeout and a RAS error is
>>> @@ -6734,8 +6736,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> if (!r) {
>>> struct amdgpu_task_info *ti = NULL;
>>>
>>> - if (job)
>>> - ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>> + /*
>>> + * The job may already be freed at this point via the sched tdr workqueue so
>>> + * use the cached pasid.
>>> + */
>>
>> amdgpu_device_gpu_recover() is run in tdr workqueue.
>>
>> Now if this is the case, someone has to explain the logic -
>>
>> Timeout is triggered here -
>> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/scheduler/sched_main.c#L559
>>
>> This calls amdgpu_job_timedout() -> amdgpu_device_gpu_recover()
>>
>> After that, there is this access to the job -
>>
>> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/scheduler/sched_main.c#L566
>>
>> At least, in some condition, job is not expected to be freed. Then I'm not sure if this
>> is the right fix.
>
> What is that "someone", "some condition" you feel like? Its better to bring proper justification, and take up this as separate refactoring task
>
Basically, if the scheduler code itself does not expect the job to be freed
after the timedout callback, why should the callback handler have to assume
that it can be? If the callback handler does something else that in turn
frees the job, the fix needs to go there rather than in a fix like this one.
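
To make the ordering concern concrete, here is a toy userspace model. It is
only a sketch, not kernel code: toy_job, toy_timedout_cb and
toy_timeout_handler are made-up names, and the "freed" flag stands in for an
actual free so that the late access can be observed safely. It assumes the
caller of the callback touches the job again after the callback returns, as
in the second link above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a scheduler job; NOT the kernel's struct drm_sched_job. */
struct toy_job {
	int pasid;
	int freed;	/* set instead of calling free(), so misuse is observable */
};

/*
 * Hypothetical timeout callback. It caches the pasid up front, the way the
 * patch does, and then optionally "frees" the job, modelling a restarted
 * queue releasing it before the callback returns.
 */
static int toy_timedout_cb(struct toy_job *job, int restart_frees_job)
{
	int pasid = job ? job->pasid : -1;	/* cached early */

	if (job && restart_frees_job)
		job->freed = 1;			/* job is gone from here on */

	return pasid;				/* still valid: it was cached */
}

/*
 * Hypothetical caller of the callback. It looks at the job again after the
 * callback returns, so caching inside the callback does not protect it.
 * Returns 0 if the late access was safe, -1 if the job was already "freed".
 */
static int toy_timeout_handler(struct toy_job *job, int restart_frees_job,
			       int *seen_pasid)
{
	assert(job != NULL && seen_pasid != NULL);

	*seen_pasid = toy_timedout_cb(job, restart_frees_job);

	/* Access after the callback: a real free here would be a UAF. */
	return job->freed ? -1 : 0;
}
```

In this model the cached pasid is always usable inside the callback, but the
handler's own access after the callback is only safe when the callback did
not release the job. That is the case I would like explained.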
Thanks,
Lijo
> Best,
> Srini