[PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
@ 2024-07-23  3:05 ZhenGuo Yin
  2024-07-23  7:13 ` Christian König
  2024-07-23  9:51 ` Christian König
  0 siblings, 2 replies; 5+ messages in thread
From: ZhenGuo Yin @ 2024-07-23  3:05 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian.Koenig, ZhenGuo Yin

[Why]
Page table of compute VM in the VRAM will lost after gpu reset.
VRAM won't be restored since compute VM has no shadows.

[How]
Use higher 32-bit of vm->generation to record a vram_lost_counter.
Reset the VM state machine when vm->genertaion is not equal to
re-generation token.

v2: Check vm->generation instead of calling drm_sched_entity_error
in amdgpu_vm_validate.

Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3abfa66d72a2..9e2f84c166e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	if (!vm)
 		return result;
 
-	result += vm->generation;
+	result += (vm->generation & 0xFFFFFFFF);
 	/* Add one if the page tables will be re-generated on next CS */
 	if (drm_sched_entity_error(&vm->delayed))
 		++result;
@@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	struct amdgpu_bo *shadow;
 	struct amdgpu_bo *bo;
 	int r;
+	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
 
-	if (drm_sched_entity_error(&vm->delayed)) {
-		++vm->generation;
+	if (vm->generation != amdgpu_vm_generation(adev, vm)) {
+		if (drm_sched_entity_error(&vm->delayed))
+			++vm->generation;
+		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF);
 		amdgpu_vm_bo_reset_state_machine(vm);
 		amdgpu_vm_fini_entities(vm);
 		r = amdgpu_vm_init_entities(adev, vm);
@@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	vm->last_update = dma_fence_get_stub();
 	vm->last_unlocked = dma_fence_get_stub();
 	vm->last_tlb_flush = dma_fence_get_stub();
-	vm->generation = 0;
+	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
 
 	mutex_init(&vm->eviction_lock);
 	vm->evicting = false;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
  2024-07-23  3:05 [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) ZhenGuo Yin
@ 2024-07-23  7:13 ` Christian König
  2024-07-23  9:04   ` Yin, ZhenGuo (Chris)
  2024-07-23  9:51 ` Christian König
  1 sibling, 1 reply; 5+ messages in thread
From: Christian König @ 2024-07-23  7:13 UTC (permalink / raw)
  To: ZhenGuo Yin, amd-gfx; +Cc: 'Christian König'

Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
> [Why]
> Page table of compute VM in the VRAM will lost after gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when vm->genertaion is not equal to
> re-generation token.
>
> v2: Check vm->generation instead of calling drm_sched_entity_error
> in amdgpu_vm_validate.

I've just double checked the logic again and as far as I can see this 
patch here is still completely superfluous.

When VRAM is lost any submission using the VM entity will fail and so 
result in a new page table generation.

What isn't handled are CPU based page table updates, but for those we 
could potentially change the condition inside the CPU page tables code.

Regards,
Christian.

>
> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..9e2f84c166e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	if (!vm)
>   		return result;
>   
> -	result += vm->generation;
> +	result += (vm->generation & 0xFFFFFFFF);
>   	/* Add one if the page tables will be re-generated on next CS */
>   	if (drm_sched_entity_error(&vm->delayed))
>   		++result;
> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	struct amdgpu_bo *shadow;
>   	struct amdgpu_bo *bo;
>   	int r;
> +	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>   
> -	if (drm_sched_entity_error(&vm->delayed)) {
> -		++vm->generation;
> +	if (vm->generation != amdgpu_vm_generation(adev, vm)) {
> +		if (drm_sched_entity_error(&vm->delayed))
> +			++vm->generation;
> +		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF);
>   		amdgpu_vm_bo_reset_state_machine(vm);
>   		amdgpu_vm_fini_entities(vm);
>   		r = amdgpu_vm_init_entities(adev, vm);
> @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	vm->last_update = dma_fence_get_stub();
>   	vm->last_unlocked = dma_fence_get_stub();
>   	vm->last_tlb_flush = dma_fence_get_stub();
> -	vm->generation = 0;
> +	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>   
>   	mutex_init(&vm->eviction_lock);
>   	vm->evicting = false;


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
  2024-07-23  7:13 ` Christian König
@ 2024-07-23  9:04   ` Yin, ZhenGuo (Chris)
  2024-07-23  9:48     ` Christian König
  0 siblings, 1 reply; 5+ messages in thread
From: Yin, ZhenGuo (Chris) @ 2024-07-23  9:04 UTC (permalink / raw)
  To: Christian König, amd-gfx@lists.freedesktop.org; +Cc: Koenig, Christian

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Christian

I prepared this patch because we met a page fault after gpu reset in SRIOV triggered by quark.
After investigation, I found that the page table did not get restored after gpu reset.
I just tried to use vm_update_mode=0 to disable VM update by CPU, and this issue still exists.

-[Christian] When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.
I believe that you are referring this piece of code in function amdgpu_job_run():
        /* Skip job if VRAM is lost and never resubmit gangs */
        if (job->generation != amdgpu_vm_generation(adev, job->vm) ||
            (job->job_run_counter && job->gang_submit))
                dma_fence_set_error(finished, -ECANCELED);

I agree that the submission from the old ctx will be set as ECANCELED and be skipped.
But if the application then creates a new ctx and submit a new job, both new_ctx->generation and new_job->generation will be initiated as the updated one.
        ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
        (*job)->generation = amdgpu_vm_generation(adev, vm);
And the job will be executed, hence there will be a page fault due to VRAM lost.

Please correct me if I have some misunderstanding.
I still can not see why any submission using the VM entity will fail due to VRAM lost.

Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Tuesday, July 23, 2024 3:13 PM
To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
> [Why]
> Page table of compute VM in the VRAM will lost after gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when vm->genertaion is not equal to
> re-generation token.
>
> v2: Check vm->generation instead of calling drm_sched_entity_error in
> amdgpu_vm_validate.

I've just double checked the logic again and as far as I can see this patch here is still completely superfluous.

When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.

What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code.

Regards,
Christian.

>
> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..9e2f84c166e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>       if (!vm)
>               return result;
>
> -     result += vm->generation;
> +     result += (vm->generation & 0xFFFFFFFF);
>       /* Add one if the page tables will be re-generated on next CS */
>       if (drm_sched_entity_error(&vm->delayed))
>               ++result;
> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>       struct amdgpu_bo *shadow;
>       struct amdgpu_bo *bo;
>       int r;
> +     uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>
> -     if (drm_sched_entity_error(&vm->delayed)) {
> -             ++vm->generation;
> +     if (vm->generation != amdgpu_vm_generation(adev, vm)) {
> +             if (drm_sched_entity_error(&vm->delayed))
> +                     ++vm->generation;
> +             vm->generation = (u64)vram_lost_counter << 32 | (vm->generation &
> +0xFFFFFFFF);
>               amdgpu_vm_bo_reset_state_machine(vm);
>               amdgpu_vm_fini_entities(vm);
>               r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int
> amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>       vm->last_update = dma_fence_get_stub();
>       vm->last_unlocked = dma_fence_get_stub();
>       vm->last_tlb_flush = dma_fence_get_stub();
> -     vm->generation = 0;
> +     vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>
>       mutex_init(&vm->eviction_lock);
>       vm->evicting = false;


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
  2024-07-23  9:04   ` Yin, ZhenGuo (Chris)
@ 2024-07-23  9:48     ` Christian König
  0 siblings, 0 replies; 5+ messages in thread
From: Christian König @ 2024-07-23  9:48 UTC (permalink / raw)
  To: Yin, ZhenGuo (Chris), amd-gfx@lists.freedesktop.org; +Cc: Koenig, Christian

Am 23.07.24 um 11:04 schrieb Yin, ZhenGuo (Chris):
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Christian
>
> I prepared this patch because we met a page fault after gpu reset in SRIOV triggered by quark.
> After investigation, I found that the page table did not get restored after gpu reset.
> I just tried to use vm_update_mode=0 to disable VM update by CPU, and this issue still exists.

CPU based page table updates are not supported under SRIOV.

>
> -[Christian] When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.
> I believe that you are referring this piece of code in function amdgpu_job_run():
>          /* Skip job if VRAM is lost and never resubmit gangs */
>          if (job->generation != amdgpu_vm_generation(adev, job->vm) ||
>              (job->job_run_counter && job->gang_submit))
>                  dma_fence_set_error(finished, -ECANCELED);
>
> I agree that the submission from the old ctx will be set as ECANCELED and be skipped.
> But if the application then creates a new ctx and submit a new job, both new_ctx->generation and new_job->generation will be initiated as the updated one.
>          ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
>          (*job)->generation = amdgpu_vm_generation(adev, vm);
> And the job will be executed, hence there will be a page fault due to VRAM lost.
>
> Please correct me if I have some misunderstanding.
> I still can not see why any submission using the VM entity will fail due to VRAM lost.

Ah! My assumption was that you will also always have a VM page table job 
which would be canceled.

But that isn't the case, instead you have only a context which is 
re-created.

In that case the approach here is valid, but let me comment on the patch 
itself.

Regards,
Christian.

>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Tuesday, July 23, 2024 3:13 PM
> To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
>
> Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
>> [Why]
>> Page table of compute VM in the VRAM will lost after gpu reset.
>> VRAM won't be restored since compute VM has no shadows.
>>
>> [How]
>> Use higher 32-bit of vm->generation to record a vram_lost_counter.
>> Reset the VM state machine when vm->genertaion is not equal to
>> re-generation token.
>>
>> v2: Check vm->generation instead of calling drm_sched_entity_error in
>> amdgpu_vm_validate.
> I've just double checked the logic again and as far as I can see this patch here is still completely superfluous.
>
> When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.
>
> What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code.
>
> Regards,
> Christian.
>
>> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
>>    1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 3abfa66d72a2..9e2f84c166e7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>        if (!vm)
>>                return result;
>>
>> -     result += vm->generation;
>> +     result += (vm->generation & 0xFFFFFFFF);
>>        /* Add one if the page tables will be re-generated on next CS */
>>        if (drm_sched_entity_error(&vm->delayed))
>>                ++result;
>> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>        struct amdgpu_bo *shadow;
>>        struct amdgpu_bo *bo;
>>        int r;
>> +     uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>>
>> -     if (drm_sched_entity_error(&vm->delayed)) {
>> -             ++vm->generation;
>> +     if (vm->generation != amdgpu_vm_generation(adev, vm)) {
>> +             if (drm_sched_entity_error(&vm->delayed))
>> +                     ++vm->generation;
>> +             vm->generation = (u64)vram_lost_counter << 32 | (vm->generation &
>> +0xFFFFFFFF);
>>                amdgpu_vm_bo_reset_state_machine(vm);
>>                amdgpu_vm_fini_entities(vm);
>>                r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int
>> amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>        vm->last_update = dma_fence_get_stub();
>>        vm->last_unlocked = dma_fence_get_stub();
>>        vm->last_tlb_flush = dma_fence_get_stub();
>> -     vm->generation = 0;
>> +     vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>>
>>        mutex_init(&vm->eviction_lock);
>>        vm->evicting = false;


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
  2024-07-23  3:05 [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) ZhenGuo Yin
  2024-07-23  7:13 ` Christian König
@ 2024-07-23  9:51 ` Christian König
  1 sibling, 0 replies; 5+ messages in thread
From: Christian König @ 2024-07-23  9:51 UTC (permalink / raw)
  To: ZhenGuo Yin, amd-gfx; +Cc: Christian.Koenig

Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
> [Why]
> Page table of compute VM in the VRAM will lost after gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when vm->genertaion is not equal to
> re-generation token.
>
> v2: Check vm->generation instead of calling drm_sched_entity_error
> in amdgpu_vm_validate.
>
> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..9e2f84c166e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>   	if (!vm)
>   		return result;
>   
> -	result += vm->generation;
> +	result += (vm->generation & 0xFFFFFFFF);
>   	/* Add one if the page tables will be re-generated on next CS */
>   	if (drm_sched_entity_error(&vm->delayed))
>   		++result;
> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	struct amdgpu_bo *shadow;
>   	struct amdgpu_bo *bo;
>   	int r;
> +	uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);

In general please sort the longer lines first.

And use amdgpu_vm_generation(adev, vm) here to get the new generation value.

>   
> -	if (drm_sched_entity_error(&vm->delayed)) {
> -		++vm->generation;
> +	if (vm->generation != amdgpu_vm_generation(adev, vm)) {
> +		if (drm_sched_entity_error(&vm->delayed))
> +			++vm->generation;
> +		vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF);

And then test and assign that new generation value here.

>   		amdgpu_vm_bo_reset_state_machine(vm);
>   		amdgpu_vm_fini_entities(vm);
>   		r = amdgpu_vm_init_entities(adev, vm);
> @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	vm->last_update = dma_fence_get_stub();
>   	vm->last_unlocked = dma_fence_get_stub();
>   	vm->last_tlb_flush = dma_fence_get_stub();
> -	vm->generation = 0;
> +	vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;

Use amdgpu_vm_generation(adev, NULL) to initialize this.

Regards,
Christian.

>   
>   	mutex_init(&vm->eviction_lock);
>   	vm->evicting = false;


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-07-23  9:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-23  3:05 [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) ZhenGuo Yin
2024-07-23  7:13 ` Christian König
2024-07-23  9:04   ` Yin, ZhenGuo (Chris)
2024-07-23  9:48     ` Christian König
2024-07-23  9:51 ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.