* [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) @ 2024-07-23 3:05 ZhenGuo Yin 2024-07-23 7:13 ` Christian König 2024-07-23 9:51 ` Christian König 0 siblings, 2 replies; 5+ messages in thread From: ZhenGuo Yin @ 2024-07-23 3:05 UTC (permalink / raw) To: amd-gfx; +Cc: Christian.Koenig, ZhenGuo Yin [Why] Page table of compute VM in the VRAM will lost after gpu reset. VRAM won't be restored since compute VM has no shadows. [How] Use higher 32-bit of vm->generation to record a vram_lost_counter. Reset the VM state machine when vm->genertaion is not equal to re-generation token. v2: Check vm->generation instead of calling drm_sched_entity_error in amdgpu_vm_validate. Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 3abfa66d72a2..9e2f84c166e7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm) if (!vm) return result; - result += vm->generation; + result += (vm->generation & 0xFFFFFFFF); /* Add one if the page tables will be re-generated on next CS */ if (drm_sched_entity_error(&vm->delayed)) ++result; @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm, struct amdgpu_bo *shadow; struct amdgpu_bo *bo; int r; + uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter); - if (drm_sched_entity_error(&vm->delayed)) { - ++vm->generation; + if (vm->generation != amdgpu_vm_generation(adev, vm)) { + if (drm_sched_entity_error(&vm->delayed)) + ++vm->generation; + vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF); amdgpu_vm_bo_reset_state_machine(vm); amdgpu_vm_fini_entities(vm); r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, vm->last_update = dma_fence_get_stub(); vm->last_unlocked = dma_fence_get_stub(); vm->last_tlb_flush = dma_fence_get_stub(); - vm->generation = 0; + vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32; mutex_init(&vm->eviction_lock); vm->evicting = false; -- 2.35.1 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) 2024-07-23 3:05 [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) ZhenGuo Yin @ 2024-07-23 7:13 ` Christian König 2024-07-23 9:04 ` Yin, ZhenGuo (Chris) 2024-07-23 9:51 ` Christian König 1 sibling, 1 reply; 5+ messages in thread From: Christian König @ 2024-07-23 7:13 UTC (permalink / raw) To: ZhenGuo Yin, amd-gfx; +Cc: 'Christian König' Am 23.07.24 um 05:05 schrieb ZhenGuo Yin: > [Why] > Page table of compute VM in the VRAM will lost after gpu reset. > VRAM won't be restored since compute VM has no shadows. > > [How] > Use higher 32-bit of vm->generation to record a vram_lost_counter. > Reset the VM state machine when vm->genertaion is not equal to > re-generation token. > > v2: Check vm->generation instead of calling drm_sched_entity_error > in amdgpu_vm_validate. I've just double checked the logic again and as far as I can see this patch here is still completely superfluous. When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation. What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code. Regards, Christian. > > Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index 3abfa66d72a2..9e2f84c166e7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm) > if (!vm) > return result; > > - result += vm->generation; > + result += (vm->generation & 0xFFFFFFFF); > /* Add one if the page tables will be re-generated on next CS */ > if (drm_sched_entity_error(&vm->delayed)) > ++result; > @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm, > struct amdgpu_bo *shadow; > struct amdgpu_bo *bo; > int r; > + uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter); > > - if (drm_sched_entity_error(&vm->delayed)) { > - ++vm->generation; > + if (vm->generation != amdgpu_vm_generation(adev, vm)) { > + if (drm_sched_entity_error(&vm->delayed)) > + ++vm->generation; > + vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF); > amdgpu_vm_bo_reset_state_machine(vm); > amdgpu_vm_fini_entities(vm); > r = amdgpu_vm_init_entities(adev, vm); > @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, > vm->last_update = dma_fence_get_stub(); > vm->last_unlocked = dma_fence_get_stub(); > vm->last_tlb_flush = dma_fence_get_stub(); > - vm->generation = 0; > + vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32; > > mutex_init(&vm->eviction_lock); > vm->evicting = false; ^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) 2024-07-23 7:13 ` Christian König @ 2024-07-23 9:04 ` Yin, ZhenGuo (Chris) 2024-07-23 9:48 ` Christian König 0 siblings, 1 reply; 5+ messages in thread From: Yin, ZhenGuo (Chris) @ 2024-07-23 9:04 UTC (permalink / raw) To: Christian König, amd-gfx@lists.freedesktop.org; +Cc: Koenig, Christian [AMD Official Use Only - AMD Internal Distribution Only] Hi Christian I prepared this patch because we met a page fault after gpu reset in SRIOV triggered by quark. After investigation, I found that the page table did not get restored after gpu reset. I just tried to use vm_update_mode=0 to disable VM update by CPU, and this issue still exists. -[Christian] When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation. I believe that you are referring this piece of code in function amdgpu_job_run(): /* Skip job if VRAM is lost and never resubmit gangs */ if (job->generation != amdgpu_vm_generation(adev, job->vm) || (job->job_run_counter && job->gang_submit)) dma_fence_set_error(finished, -ECANCELED); I agree that the submission from the old ctx will be set as ECANCELED and be skipped. But if the application then creates a new ctx and submit a new job, both new_ctx->generation and new_job->generation will be initiated as the updated one. ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm); (*job)->generation = amdgpu_vm_generation(adev, vm); And the job will be executed, hence there will be a page fault due to VRAM lost. Please correct me if I have some misunderstanding. I still can not see why any submission using the VM entity will fail due to VRAM lost. Best, Zhenguo Cloud-GPU Core team, SRDC -----Original Message----- From: Christian König <ckoenig.leichtzumerken@gmail.com> Sent: Tuesday, July 23, 2024 3:13 PM To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; amd-gfx@lists.freedesktop.org Cc: Koenig, Christian <Christian.Koenig@amd.com> Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) Am 23.07.24 um 05:05 schrieb ZhenGuo Yin: > [Why] > Page table of compute VM in the VRAM will lost after gpu reset. > VRAM won't be restored since compute VM has no shadows. > > [How] > Use higher 32-bit of vm->generation to record a vram_lost_counter. > Reset the VM state machine when vm->genertaion is not equal to > re-generation token. > > v2: Check vm->generation instead of calling drm_sched_entity_error in > amdgpu_vm_validate. I've just double checked the logic again and as far as I can see this patch here is still completely superfluous. When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation. What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code. Regards, Christian. > > Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index 3abfa66d72a2..9e2f84c166e7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm) > if (!vm) > return result; > > - result += vm->generation; > + result += (vm->generation & 0xFFFFFFFF); > /* Add one if the page tables will be re-generated on next CS */ > if (drm_sched_entity_error(&vm->delayed)) > ++result; > @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm, > struct amdgpu_bo *shadow; > struct amdgpu_bo *bo; > int r; > + uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter); > > - if (drm_sched_entity_error(&vm->delayed)) { > - ++vm->generation; > + if (vm->generation != amdgpu_vm_generation(adev, vm)) { > + if (drm_sched_entity_error(&vm->delayed)) > + ++vm->generation; > + vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & > +0xFFFFFFFF); > amdgpu_vm_bo_reset_state_machine(vm); > amdgpu_vm_fini_entities(vm); > r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int > amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, > vm->last_update = dma_fence_get_stub(); > vm->last_unlocked = dma_fence_get_stub(); > vm->last_tlb_flush = dma_fence_get_stub(); > - vm->generation = 0; > + vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32; > > mutex_init(&vm->eviction_lock); > vm->evicting = false; ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) 2024-07-23 9:04 ` Yin, ZhenGuo (Chris) @ 2024-07-23 9:48 ` Christian König 0 siblings, 0 replies; 5+ messages in thread From: Christian König @ 2024-07-23 9:48 UTC (permalink / raw) To: Yin, ZhenGuo (Chris), amd-gfx@lists.freedesktop.org; +Cc: Koenig, Christian Am 23.07.24 um 11:04 schrieb Yin, ZhenGuo (Chris): > [AMD Official Use Only - AMD Internal Distribution Only] > > Hi Christian > > I prepared this patch because we met a page fault after gpu reset in SRIOV triggered by quark. > After investigation, I found that the page table did not get restored after gpu reset. > I just tried to use vm_update_mode=0 to disable VM update by CPU, and this issue still exists. CPU based page table updates are not supported under SRIOV. > > -[Christian] When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation. > I believe that you are referring this piece of code in function amdgpu_job_run(): > /* Skip job if VRAM is lost and never resubmit gangs */ > if (job->generation != amdgpu_vm_generation(adev, job->vm) || > (job->job_run_counter && job->gang_submit)) > dma_fence_set_error(finished, -ECANCELED); > > I agree that the submission from the old ctx will be set as ECANCELED and be skipped. > But if the application then creates a new ctx and submit a new job, both new_ctx->generation and new_job->generation will be initiated as the updated one. > ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm); > (*job)->generation = amdgpu_vm_generation(adev, vm); > And the job will be executed, hence there will be a page fault due to VRAM lost. > > Please correct me if I have some misunderstanding. > I still can not see why any submission using the VM entity will fail due to VRAM lost. Ah! My assumption was that you will also always have a VM page table job which would be canceled. But that isn't the case, instead you have only a context which is re-created. In that case the approach here is valid, but let me comment on the patch itself. Regards, Christian. > > Best, > Zhenguo > Cloud-GPU Core team, SRDC > > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@gmail.com> > Sent: Tuesday, July 23, 2024 3:13 PM > To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin@amd.com>; amd-gfx@lists.freedesktop.org > Cc: Koenig, Christian <Christian.Koenig@amd.com> > Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) > > Am 23.07.24 um 05:05 schrieb ZhenGuo Yin: >> [Why] >> Page table of compute VM in the VRAM will lost after gpu reset. >> VRAM won't be restored since compute VM has no shadows. >> >> [How] >> Use higher 32-bit of vm->generation to record a vram_lost_counter. >> Reset the VM state machine when vm->genertaion is not equal to >> re-generation token. >> >> v2: Check vm->generation instead of calling drm_sched_entity_error in >> amdgpu_vm_validate. > I've just double checked the logic again and as far as I can see this patch here is still completely superfluous. > > When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation. > > What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code. > > Regards, > Christian. > >> Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++---- >> 1 file changed, 7 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> index 3abfa66d72a2..9e2f84c166e7 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm) >> if (!vm) >> return result; >> >> - result += vm->generation; >> + result += (vm->generation & 0xFFFFFFFF); >> /* Add one if the page tables will be re-generated on next CS */ >> if (drm_sched_entity_error(&vm->delayed)) >> ++result; >> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm, >> struct amdgpu_bo *shadow; >> struct amdgpu_bo *bo; >> int r; >> + uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter); >> >> - if (drm_sched_entity_error(&vm->delayed)) { >> - ++vm->generation; >> + if (vm->generation != amdgpu_vm_generation(adev, vm)) { >> + if (drm_sched_entity_error(&vm->delayed)) >> + ++vm->generation; >> + vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & >> +0xFFFFFFFF); >> amdgpu_vm_bo_reset_state_machine(vm); >> amdgpu_vm_fini_entities(vm); >> r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int >> amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, >> vm->last_update = dma_fence_get_stub(); >> vm->last_unlocked = dma_fence_get_stub(); >> vm->last_tlb_flush = dma_fence_get_stub(); >> - vm->generation = 0; >> + vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32; >> >> mutex_init(&vm->eviction_lock); >> vm->evicting = false; ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) 2024-07-23 3:05 [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) ZhenGuo Yin 2024-07-23 7:13 ` Christian König @ 2024-07-23 9:51 ` Christian König 1 sibling, 0 replies; 5+ messages in thread From: Christian König @ 2024-07-23 9:51 UTC (permalink / raw) To: ZhenGuo Yin, amd-gfx; +Cc: Christian.Koenig Am 23.07.24 um 05:05 schrieb ZhenGuo Yin: > [Why] > Page table of compute VM in the VRAM will lost after gpu reset. > VRAM won't be restored since compute VM has no shadows. > > [How] > Use higher 32-bit of vm->generation to record a vram_lost_counter. > Reset the VM state machine when vm->genertaion is not equal to > re-generation token. > > v2: Check vm->generation instead of calling drm_sched_entity_error > in amdgpu_vm_validate. > > Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index 3abfa66d72a2..9e2f84c166e7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm) > if (!vm) > return result; > > - result += vm->generation; > + result += (vm->generation & 0xFFFFFFFF); > /* Add one if the page tables will be re-generated on next CS */ > if (drm_sched_entity_error(&vm->delayed)) > ++result; > @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm, > struct amdgpu_bo *shadow; > struct amdgpu_bo *bo; > int r; > + uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter); In general please sort the longer lines first. And use amdgpu_vm_generation(adev, vm) here to get the new generation value. > > - if (drm_sched_entity_error(&vm->delayed)) { > - ++vm->generation; > + if (vm->generation != amdgpu_vm_generation(adev, vm)) { > + if (drm_sched_entity_error(&vm->delayed)) > + ++vm->generation; > + vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF); And then test and assign that new generation value here. > amdgpu_vm_bo_reset_state_machine(vm); > amdgpu_vm_fini_entities(vm); > r = amdgpu_vm_init_entities(adev, vm); > @@ -2439,7 +2442,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm, > vm->last_update = dma_fence_get_stub(); > vm->last_unlocked = dma_fence_get_stub(); > vm->last_tlb_flush = dma_fence_get_stub(); > - vm->generation = 0; > + vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32; Use amdgpu_vm_generation(adev, NULL) to initialize this. Regards, Christian. > > mutex_init(&vm->eviction_lock); > vm->evicting = false; ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-07-23 9:51 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-07-23 3:05 [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost) ZhenGuo Yin 2024-07-23 7:13 ` Christian König 2024-07-23 9:04 ` Yin, ZhenGuo (Chris) 2024-07-23 9:48 ` Christian König 2024-07-23 9:51 ` Christian König
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.