* [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" @ 2023-10-23 13:06 Daniel Tang 2023-10-23 15:15 ` Christian König 0 siblings, 1 reply; 5+ messages in thread From: Daniel Tang @ 2023-10-23 13:06 UTC (permalink / raw) To: amd-gfx; +Cc: Xiaogang Chen, Alex Deucher, Felix Kuehling That commit causes the screen to freeze a few moments after running clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer including ssh also freezes. On v6.5-rc1, it only results in a NULL pointer deference message in dmesg and the process to become a zombie whose unkillableness prevents shutdown without REISUB. Although llama.cpp and hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by this revert, pytorch-rocm is now working with stability and without whole-computer freezes caused by any accidental running of clinfo. This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 Signed-off-by: Daniel Tang <danielzgtg.opensource@gmail.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 82f25996ff5e..602f311ab766 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm) if (r) return r; + /* Sanity checks */ + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { + r = -EINVAL; + goto unreserve_bo; + } + /* Check if PD needs to be reinitialized and do it before * changing any other state, in case it fails. */ if (pte_support_ats != vm->pte_support_ats) { - /* Sanity checks */ - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { - r = -EINVAL; - goto unreserve_bo; - } - vm->pte_support_ats = pte_support_ats; r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), false); -- 2.40.1 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" 2023-10-23 13:06 [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" Daniel Tang @ 2023-10-23 15:15 ` Christian König 2023-10-23 23:39 ` Felix Kuehling 2023-10-23 23:41 ` Felix Kuehling 0 siblings, 2 replies; 5+ messages in thread From: Christian König @ 2023-10-23 15:15 UTC (permalink / raw) To: Daniel Tang, amd-gfx; +Cc: Xiaogang Chen, Alex Deucher, Felix Kuehling Am 23.10.23 um 15:06 schrieb Daniel Tang: > That commit causes the screen to freeze a few moments after running > clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer > including ssh also freezes. On v6.5-rc1, it only results in a NULL pointer > deference message in dmesg and the process to become a zombie whose > unkillableness prevents shutdown without REISUB. Although llama.cpp and > hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by > this revert, pytorch-rocm is now working with stability and without > whole-computer freezes caused by any accidental running of clinfo. > > This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. That result doesn't make much sense. Felix please correct me, but AFAIK the ATS stuff was completely removed by now. Are you sure that this is pure v6.6-rc7 and not some other patches applied? If yes than we must have missed something. Regards, Christian. > > Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 > Signed-off-by: Daniel Tang <danielzgtg.opensource@gmail.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > index 82f25996ff5e..602f311ab766 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm) > if (r) > return r; > > + /* Sanity checks */ > + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { > + r = -EINVAL; > + goto unreserve_bo; > + } > + > /* Check if PD needs to be reinitialized and do it before > * changing any other state, in case it fails. > */ > if (pte_support_ats != vm->pte_support_ats) { > - /* Sanity checks */ > - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { > - r = -EINVAL; > - goto unreserve_bo; > - } > - > vm->pte_support_ats = pte_support_ats; > r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), > false); > -- > 2.40.1 > > > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" 2023-10-23 15:15 ` Christian König @ 2023-10-23 23:39 ` Felix Kuehling 2023-10-23 23:41 ` Felix Kuehling 1 sibling, 0 replies; 5+ messages in thread From: Felix Kuehling @ 2023-10-23 23:39 UTC (permalink / raw) To: Christian König, Daniel Tang, amd-gfx; +Cc: Xiaogang Chen, Alex Deucher On 2023-10-23 11:15, Christian König wrote: > Am 23.10.23 um 15:06 schrieb Daniel Tang: >> That commit causes the screen to freeze a few moments after running >> clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer >> including ssh also freezes. On v6.5-rc1, it only results in a NULL >> pointer >> deference message in dmesg and the process to become a zombie whose >> unkillableness prevents shutdown without REISUB. Although llama.cpp and >> hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by >> this revert, pytorch-rocm is now working with stability and without >> whole-computer freezes caused by any accidental running of clinfo. >> >> This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. > > That result doesn't make much sense. Felix please correct me, but > AFAIK the ATS stuff was completely removed by now. > > Are you sure that this is pure v6.6-rc7 and not some other patches > applied? If yes than we must have missed something. This revert doesn't really affect systems with ATS. It moves the sanity check back out of the ATS-specific code. The Null pointer dereference in the bug report comes from the CPU page table update code: This revert is just a roundabout way of disabling CPU page table updates for compute VMs. But I don't think it really addresses the root cause. > > Regards, > Christian. > >> >> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 >> Signed-off-by: Daniel Tang <danielzgtg.opensource@gmail.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------ >> 1 file changed, 6 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> index 82f25996ff5e..602f311ab766 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct >> amdgpu_device *adev, struct amdgpu_vm *vm) >> if (r) >> return r; >> + /* Sanity checks */ >> + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { >> + r = -EINVAL; >> + goto unreserve_bo; >> + } >> + >> /* Check if PD needs to be reinitialized and do it before >> * changing any other state, in case it fails. >> */ >> if (pte_support_ats != vm->pte_support_ats) { >> - /* Sanity checks */ >> - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { >> - r = -EINVAL; >> - goto unreserve_bo; >> - } >> - >> vm->pte_support_ats = pte_support_ats; >> r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), >> false); >> -- >> 2.40.1 >> >> >> > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" 2023-10-23 15:15 ` Christian König 2023-10-23 23:39 ` Felix Kuehling @ 2023-10-23 23:41 ` Felix Kuehling 2023-10-24 8:09 ` Christian König 1 sibling, 1 reply; 5+ messages in thread From: Felix Kuehling @ 2023-10-23 23:41 UTC (permalink / raw) To: Christian König, Daniel Tang, amd-gfx; +Cc: Xiaogang Chen, Alex Deucher [-- Attachment #1: Type: text/plain, Size: 4952 bytes --] [sorry, I hit send too early] On 2023-10-23 11:15, Christian König wrote: > Am 23.10.23 um 15:06 schrieb Daniel Tang: >> That commit causes the screen to freeze a few moments after running >> clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer >> including ssh also freezes. On v6.5-rc1, it only results in a NULL >> pointer >> deference message in dmesg and the process to become a zombie whose >> unkillableness prevents shutdown without REISUB. Although llama.cpp and >> hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by >> this revert, pytorch-rocm is now working with stability and without >> whole-computer freezes caused by any accidental running of clinfo. >> >> This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. > > That result doesn't make much sense. Felix please correct me, but > AFAIK the ATS stuff was completely removed by now. > > Are you sure that this is pure v6.6-rc7 and not some other patches > applied? If yes than we must have missed something. This revert doesn't really affect systems with ATS. It moves the sanity check back out of the ATS-specific code. The Null pointer dereference in the bug report comes from the CPU page table update code: [10089.267556] BUG: kernel NULL pointer dereference, address: 0000000000000000 [10089.267563] #PF: supervisor write access in kernel mode [10089.267566] #PF: error_code(0x0002) - not-present page [10089.267569] PGD 0 P4D 0 [10089.267574] Oops: 0002 [#1] PREEMPT SMP NOPTI [10089.267578] CPU: 23 PID: 18191 Comm: clinfo Tainted: G OE 6.5.0-9-generic #9-Ubuntu [10089.267582] Hardware name: Micro-Star International Co., Ltd. MS-7C37/X570-A PRO (MS-7C37), BIOS H.I0 08/10/2022 [10089.267585] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu] [10089.267820] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 55 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 48 89 e5 <48> 89 0e 31 c0 5d 31 d2 31 c9 31 f6 45 31 c0 e9 89 7e 27 fb 66 0f [10089.267823] RSP: 0018:ffffb49805eeb8b0 EFLAGS: 00010246 [10089.267827] RAX: 0000000000000000 RBX: 0000000000200000 RCX: 0040000000000480 [10089.267830] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9890d4380000 [10089.267832] RBP: ffffb49805eeb8b0 R08: 0040000000000480 R09: 0000000000200000 [10089.267835] R10: 0000000800100200 R11: 0000000800100200 R12: ffffb49805eeba98 [10089.267837] R13: 0000000000000001 R14: 0000000000200000 R15: 0000000000000001 [10089.267840] FS: 00007f8ca9f09740(0000) GS:ffff9897befc0000(0000) knlGS:0000000000000000 [10089.267843] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10089.267846] CR2: 0000000000000000 CR3: 00000002e0746000 CR4: 0000000000750ee0 [10089.267849] PKRU: 55555554 [10089.267851] Call Trace: [10089.267853] <TASK> [10089.267858] ? show_regs+0x6d/0x80 [10089.267865] ? __die+0x24/0x80 [10089.267870] ? page_fault_oops+0x99/0x1b0 [10089.267876] ? do_user_addr_fault+0x316/0x6b0 [10089.267879] ? srso_alias_return_thunk+0x5/0x7f [10089.267884] ? scsi_dispatch_cmd+0x91/0x240 [10089.267891] ? exc_page_fault+0x83/0x1b0 [10089.267896] ? asm_exc_page_fault+0x27/0x30 [10089.267904] ? amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu] [10089.268140] amdgpu_vm_cpu_update+0xa9/0x130 [amdgpu] ... This revert is just a roundabout way of disabling CPU page table updates for compute VMs. But I don't think it really addresses the root cause. Regards, Felix > > Regards, > Christian. > >> >> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 >> Signed-off-by: Daniel Tang <danielzgtg.opensource@gmail.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------ >> 1 file changed, 6 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> index 82f25996ff5e..602f311ab766 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct >> amdgpu_device *adev, struct amdgpu_vm *vm) >> if (r) >> return r; >> + /* Sanity checks */ >> + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { >> + r = -EINVAL; >> + goto unreserve_bo; >> + } >> + >> /* Check if PD needs to be reinitialized and do it before >> * changing any other state, in case it fails. >> */ >> if (pte_support_ats != vm->pte_support_ats) { >> - /* Sanity checks */ >> - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { >> - r = -EINVAL; >> - goto unreserve_bo; >> - } >> - >> vm->pte_support_ats = pte_support_ats; >> r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), >> false); >> -- 2.40.1 >> >> >> > [-- Attachment #2: Type: text/html, Size: 6798 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" 2023-10-23 23:41 ` Felix Kuehling @ 2023-10-24 8:09 ` Christian König 0 siblings, 0 replies; 5+ messages in thread From: Christian König @ 2023-10-24 8:09 UTC (permalink / raw) To: Felix Kuehling, Daniel Tang, amd-gfx; +Cc: Xiaogang Chen, Alex Deucher [-- Attachment #1: Type: text/plain, Size: 5371 bytes --] Am 24.10.23 um 01:41 schrieb Felix Kuehling: > > [sorry, I hit send too early] > > > On 2023-10-23 11:15, Christian König wrote: >> Am 23.10.23 um 15:06 schrieb Daniel Tang: >>> That commit causes the screen to freeze a few moments after running >>> clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer >>> including ssh also freezes. On v6.5-rc1, it only results in a NULL >>> pointer >>> deference message in dmesg and the process to become a zombie whose >>> unkillableness prevents shutdown without REISUB. Although llama.cpp and >>> hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by >>> this revert, pytorch-rocm is now working with stability and without >>> whole-computer freezes caused by any accidental running of clinfo. >>> >>> This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. >> >> That result doesn't make much sense. Felix please correct me, but >> AFAIK the ATS stuff was completely removed by now. >> >> Are you sure that this is pure v6.6-rc7 and not some other patches >> applied? If yes than we must have missed something. > > This revert doesn't really affect systems with ATS. It moves the > sanity check back out of the ATS-specific code. Ah! I've read the code wrong, that makes much more sense now. > > The Null pointer dereference in the bug report comes from the CPU page > table update code: > [10089.267556] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [10089.267563] #PF: supervisor write access in kernel mode > [10089.267566] #PF: error_code(0x0002) - not-present page > [10089.267569] PGD 0 P4D 0 > [10089.267574] Oops: 0002 [#1] PREEMPT SMP NOPTI > [10089.267578] CPU: 23 PID: 18191 Comm: clinfo Tainted: G OE 6.5.0-9-generic #9-Ubuntu > [10089.267582] Hardware name: Micro-Star International Co., Ltd. MS-7C37/X570-A PRO (MS-7C37), BIOS H.I0 08/10/2022 > [10089.267585] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu] > [10089.267820] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 55 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 48 89 e5 <48> 89 0e 31 c0 5d 31 d2 31 c9 31 f6 45 31 c0 e9 89 7e 27 fb 66 0f > [10089.267823] RSP: 0018:ffffb49805eeb8b0 EFLAGS: 00010246 > [10089.267827] RAX: 0000000000000000 RBX: 0000000000200000 RCX: 0040000000000480 > [10089.267830] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9890d4380000 > [10089.267832] RBP: ffffb49805eeb8b0 R08: 0040000000000480 R09: 0000000000200000 > [10089.267835] R10: 0000000800100200 R11: 0000000800100200 R12: ffffb49805eeba98 > [10089.267837] R13: 0000000000000001 R14: 0000000000200000 R15: 0000000000000001 > [10089.267840] FS: 00007f8ca9f09740(0000) GS:ffff9897befc0000(0000) knlGS:0000000000000000 > [10089.267843] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [10089.267846] CR2: 0000000000000000 CR3: 00000002e0746000 CR4: 0000000000750ee0 > [10089.267849] PKRU: 55555554 > [10089.267851] Call Trace: > [10089.267853] <TASK> > [10089.267858] ? show_regs+0x6d/0x80 > [10089.267865] ? __die+0x24/0x80 > [10089.267870] ? page_fault_oops+0x99/0x1b0 > [10089.267876] ? do_user_addr_fault+0x316/0x6b0 > [10089.267879] ? srso_alias_return_thunk+0x5/0x7f > [10089.267884] ? scsi_dispatch_cmd+0x91/0x240 > [10089.267891] ? exc_page_fault+0x83/0x1b0 > [10089.267896] ? asm_exc_page_fault+0x27/0x30 > [10089.267904] ? amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu] > [10089.268140] amdgpu_vm_cpu_update+0xa9/0x130 [amdgpu] > ... > This revert is just a roundabout way of disabling CPU page table > updates for compute VMs. But I don't think it really addresses the > root cause. Yeah, completely agree. Looks like some page tables isn't CPU accessible for some reason. Going to take a look when I have time. Regards, Christian. > > Regards, > Felix > > >> >> Regards, >> Christian. >> >>> >>> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 >>> Signed-off-by: Daniel Tang <danielzgtg.opensource@gmail.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------ >>> 1 file changed, 6 insertions(+), 6 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> index 82f25996ff5e..602f311ab766 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct >>> amdgpu_device *adev, struct amdgpu_vm *vm) >>> if (r) >>> return r; >>> + /* Sanity checks */ >>> + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { >>> + r = -EINVAL; >>> + goto unreserve_bo; >>> + } >>> + >>> /* Check if PD needs to be reinitialized and do it before >>> * changing any other state, in case it fails. >>> */ >>> if (pte_support_ats != vm->pte_support_ats) { >>> - /* Sanity checks */ >>> - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { >>> - r = -EINVAL; >>> - goto unreserve_bo; >>> - } >>> - >>> vm->pte_support_ats = pte_support_ats; >>> r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), >>> false); >>> -- 2.40.1 >>> >>> >>> >> [-- Attachment #2: Type: text/html, Size: 7256 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-10-24 8:09 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-10-23 13:06 [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" Daniel Tang 2023-10-23 15:15 ` Christian König 2023-10-23 23:39 ` Felix Kuehling 2023-10-23 23:41 ` Felix Kuehling 2023-10-24 8:09 ` Christian König
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.