AMD-GFX Archive on lore.kernel.org
* [PATCH 1/1] drm/amdgpu: Fix double release KFD pasid
@ 2022-12-13 15:49 Philip Yang
  2022-12-13 15:57 ` Christian König
  0 siblings, 1 reply; 4+ messages in thread
From: Philip Yang @ 2022-12-13 15:49 UTC
  To: amd-gfx; +Cc: Philip Yang, Felix Kuehling, kent.russell

If amdgpu_amdkfd_gpuvm_acquire_process_vm fails after the vm has been
converted to a KFD vm and vm->pasid has been set to the KFD pasid, KFD
does not take the pdd->drm_file reference. As a result, the drm close
file handler may be called to release the KFD pasid before KFD process
destroy releases the same pasid and sets vm->pasid to zero. This
generates the WARNING backtrace and NULL pointer access below.

For a compute process, KFD manages the pasid, so the drm close file
handler should not release the KFD pasid, to avoid a double release.

 amdgpu: Failed to create process VM object

 ida_free called for id=32770 which is not allocated.
 WARNING: CPU: 57 PID: 72542 at ../lib/idr.c:522 ida_free+0x96/0x140
 RIP: 0010:ida_free+0x96/0x140
 Call Trace:
  amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
  amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
  drm_file_free.part.13+0x216/0x270 [drm]
  drm_close_helper.isra.14+0x60/0x70 [drm]
  drm_release+0x6e/0xf0 [drm]
  __fput+0xcc/0x280
  ____fput+0xe/0x20
  task_work_run+0x96/0xc0
  do_exit+0x3d0/0xc10

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 RIP: 0010:ida_free+0x76/0x140
 Call Trace:
  amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
  amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
  drm_file_free.part.13+0x216/0x270 [drm]
  drm_close_helper.isra.14+0x60/0x70 [drm]
  drm_release+0x6e/0xf0 [drm]
  __fput+0xcc/0x280
  ____fput+0xe/0x20
  task_work_run+0x96/0xc0
  do_exit+0x3d0/0xc10

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index efc0a13e9aea..bf444c3f6656 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1244,8 +1244,14 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
 		amdgpu_bo_unreserve(adev->virt.csa_obj);
 	}
 
-	pasid = fpriv->vm.pasid;
+	if (fpriv->vm.is_compute_context)
+		/* pasid managed by KFD is released when process is destroyed */
+		pasid = 0;
+	else
+		pasid = fpriv->vm.pasid;
+
 	pd = amdgpu_bo_ref(fpriv->vm.root.bo);
+
 	if (!WARN_ON(amdgpu_bo_reserve(pd, true))) {
 		amdgpu_vm_bo_del(adev, fpriv->prt_va);
 		amdgpu_bo_unreserve(pd);
-- 
2.35.1



* Re: [PATCH 1/1] drm/amdgpu: Fix double release KFD pasid
  2022-12-13 15:49 [PATCH 1/1] drm/amdgpu: Fix double release KFD pasid Philip Yang
@ 2022-12-13 15:57 ` Christian König
  2022-12-13 17:58   ` Felix Kuehling
  0 siblings, 1 reply; 4+ messages in thread
From: Christian König @ 2022-12-13 15:57 UTC
  To: Philip Yang, amd-gfx; +Cc: Felix Kuehling, kent.russell

On 13.12.22 at 16:49, Philip Yang wrote:
> If amdgpu_amdkfd_gpuvm_acquire_process_vm fails after the vm has been
> converted to a KFD vm and vm->pasid has been set to the KFD pasid, KFD
> does not take the pdd->drm_file reference. As a result, the drm close
> file handler may be called to release the KFD pasid before KFD process
> destroy releases the same pasid and sets vm->pasid to zero. This
> generates the WARNING backtrace and NULL pointer access below.

Well, NAK. If you fail after making the VM a compute VM, the correct
approach would be to drop this again in the error handling.

Since we don't need to reallocate anything, that should also never fail.

Christian.
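
One way to read the suggestion above is as a rollback in the error path:
remember the VM state before the conversion and restore it if a later step
fails. The following is a minimal userspace sketch of that pattern only. The
struct, the function names and the injected failure are hypothetical
stand-ins, not the amdgpu code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-file VM state; not the amdgpu struct. */
struct rollback_vm {
	unsigned int pasid;
	bool is_compute;
};

/* Hypothetical later step of the acquire path; fails on request. */
static int rollback_later_step(bool inject_failure)
{
	return inject_failure ? -1 : 0;
}

/*
 * Convert the VM to a compute VM, then run a step that may fail.
 * On failure the saved state is copied back. Restoring plain fields
 * allocates nothing, so the rollback itself cannot fail.
 */
int rollback_acquire(struct rollback_vm *vm, unsigned int kfd_pasid,
		     bool inject_failure)
{
	struct rollback_vm saved;
	int ret;

	assert(vm != NULL);
	saved = *vm;

	vm->is_compute = true;
	vm->pasid = kfd_pasid;

	ret = rollback_later_step(inject_failure);
	if (ret) {
		*vm = saved;	/* undo the conversion */
		return ret;
	}

	return 0;
}
```

With this shape, a failed acquire leaves the VM holding the pasid it had
before the call, so a later close path would release only that pasid.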




* Re: [PATCH 1/1] drm/amdgpu: Fix double release KFD pasid
  2022-12-13 15:57 ` Christian König
@ 2022-12-13 17:58   ` Felix Kuehling
  2022-12-14 15:52     ` Philip Yang
  0 siblings, 1 reply; 4+ messages in thread
From: Felix Kuehling @ 2022-12-13 17:58 UTC
  To: Christian König, Philip Yang, amd-gfx; +Cc: kent.russell

On 2022-12-13 10:57, Christian König wrote:
> On 13.12.22 at 16:49, Philip Yang wrote:
>> If amdgpu_amdkfd_gpuvm_acquire_process_vm fails after the vm has been
>> converted to a KFD vm and vm->pasid has been set to the KFD pasid, KFD
>> does not take the pdd->drm_file reference. As a result, the drm close
>> file handler may be called to release the KFD pasid before KFD process
>> destroy releases the same pasid and sets vm->pasid to zero. This
>> generates the WARNING backtrace and NULL pointer access below.
>
> Well, NAK. If you fail after making the VM a compute VM, the correct
> approach would be to drop this again in the error handling.
>
> Since we don't need to reallocate anything, that should also never fail.

I don't understand this comment.

The fundamental issue, as I understand it, is that compute VMs don't own 
their PASID. Multiple compute VMs on different GPUs share the same 
PASID. Therefore, freeing the PASID when the compute VM is destroyed is 
wrong. The PASID is freed by KFD when its process structure is destroyed.

Regards,
   Felix
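
The ownership rule described above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: the bitmap
allocator stands in for the kernel's IDA, and the owns_pasid flag and all
toy_ and shared_ names are hypothetical, not fields or functions of amdgpu
or KFD.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IDS 64

/* Toy ID allocator standing in for the kernel IDA; index 0 is unused. */
static bool toy_allocated[TOY_MAX_IDS];

/* Return the lowest free id >= 1, or 0 when none is left. */
unsigned int toy_id_alloc(void)
{
	for (unsigned int id = 1; id < TOY_MAX_IDS; id++) {
		if (!toy_allocated[id]) {
			toy_allocated[id] = true;
			return id;
		}
	}
	return 0;
}

/* Free an id; return false if it was not allocated (a double free). */
bool toy_id_free(unsigned int id)
{
	assert(id < TOY_MAX_IDS);
	if (!toy_allocated[id])
		return false;
	toy_allocated[id] = false;
	return true;
}

/* Hypothetical VM: a compute VM borrows the process pasid and does not own it. */
struct shared_vm {
	unsigned int pasid;
	bool owns_pasid;
};

/* VM teardown releases the pasid only when the VM owns it. */
bool shared_vm_fini(struct shared_vm *vm)
{
	bool ok = true;

	if (vm->owns_pasid && vm->pasid)
		ok = toy_id_free(vm->pasid);
	vm->pasid = 0;
	return ok;
}
```

Two VMs that share one pasid can both be torn down without touching the
allocator, and the single owner frees the id exactly once. A second free of
the same id is reported as an error, which is the toy counterpart of the
"ida_free called for id=32770 which is not allocated" warning.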




* Re: [PATCH 1/1] drm/amdgpu: Fix double release KFD pasid
  2022-12-13 17:58   ` Felix Kuehling
@ 2022-12-14 15:52     ` Philip Yang
  0 siblings, 0 replies; 4+ messages in thread
From: Philip Yang @ 2022-12-14 15:52 UTC
  To: Felix Kuehling, Christian König, Philip Yang, amd-gfx; +Cc: kent.russell


On 2022-12-13 12:58, Felix Kuehling wrote:
> On 2022-12-13 10:57, Christian König wrote:
>> On 13.12.22 at 16:49, Philip Yang wrote:
>>> If amdgpu_amdkfd_gpuvm_acquire_process_vm fails after the vm has been
>>> converted to a KFD vm and vm->pasid has been set to the KFD pasid, KFD
>>> does not take the pdd->drm_file reference. As a result, the drm close
>>> file handler may be called to release the KFD pasid before KFD process
>>> destroy releases the same pasid and sets vm->pasid to zero. This
>>> generates the WARNING backtrace and NULL pointer access below.
>>
>> Well, NAK. If you fail after making the VM a compute VM, the correct
>> approach would be to drop this again in the error handling.
>>
>> Since we don't need to reallocate anything, that should also never fail.
>
> I don't understand this comment.
>
> The fundamental issue, as I understand it, is that compute VMs don't 
> own their PASID. Multiple compute VMs on different GPUs share the same 
> PASID. Therefore, freeing the PASID when the compute VM is destroyed 
> is wrong. The PASID is freed by KFD when its process structure is 
> destroyed.

I just sent out another patch that updates the vm pasid to the compute
pasid as the last step of acquiring and initializing the compute vm. With
that, the drm close handler path does not need to change: it frees the
original pasid, or the pasid is 0 if it is a compute vm.

Regards,

Philip
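
The ordering idea described above, doing everything that can fail first and
assigning the compute pasid only as the final step, can be sketched as
follows. This is an illustration of the general pattern and not the
follow-up patch itself; the struct, the function names and the injected
error code are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the VM; only the field that matters here. */
struct commit_vm {
	unsigned int pasid;
};

/* Hypothetical fallible setup; returns the injected error code. */
static int commit_init_steps(int inject_error)
{
	return inject_error;
}

/*
 * Run all fallible steps first. The pasid is switched only after they
 * have succeeded, so an early failure leaves vm->pasid untouched and
 * no rollback is needed.
 */
int commit_acquire(struct commit_vm *vm, unsigned int compute_pasid,
		   int inject_error)
{
	int ret;

	assert(vm != NULL);

	ret = commit_init_steps(inject_error);
	if (ret)
		return ret;

	vm->pasid = compute_pasid;	/* last step, cannot fail */
	return 0;
}
```

Compared with restoring state in an error path, this keeps the failure
handling trivial, at the cost of requiring that nothing before the final
assignment depends on the new pasid already being set.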

