[REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()

linux-arm-kernel.lists.infradead.org archive mirror

* [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
@ 2025-04-15 18:28 Alexey Klimov
       [not found] ` <SI2PR06MB5041FB15F8DBB44916FB6430F1BD2@SI2PR06MB5041.apcprd06.prod.outlook.com>
  2025-04-16 11:44 ` Christian König
  0 siblings, 2 replies; 8+ messages in thread
From: Alexey Klimov @ 2025-04-15 18:28 UTC (permalink / raw)
  To: alexander.deucher, frank.min, amd-gfx
  Cc: stable, david.belanger, christian.koenig, peter.chen,
	cix-kernel-upstream, linux-arm-kernel


#regzbot introduced: v6.12..v6.13

I use an RX 6600 on an arm64 Orion O6 board, and it seems that amdgpu is broken on recent kernels; it fails on boot:

[drm] amdgpu: 7886M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
SError Interrupt on CPU11, code 0x00000000be000011 -- SError
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S                  6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan  1 1980
pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
sp : ffffffc08321b490
x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S                  6.15.0-rc2+ #1 VOLUNTARY
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan  1 1980
Call trace:
 show_stack+0x2c/0x84 (C)
 dump_stack_lvl+0x60/0x80
 dump_stack+0x18/0x24
 panic+0x148/0x330
 add_taint+0x0/0xbc
 arm64_serror_panic+0x64/0x7c
 do_serror+0x28/0x68
 el1h_64_error_handler+0x30/0x48
 el1h_64_error+0x6c/0x70
 amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
 hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
 gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
 amdgpu_device_init+0x19f8/0x2480 [amdgpu]
 amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
 amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
 pci_device_probe+0xbc/0x1a8
 really_probe+0xc0/0x39c
 __driver_probe_device+0x7c/0x14c
 driver_probe_device+0x3c/0x120
 __driver_attach+0xc4/0x200
 bus_for_each_dev+0x68/0xb4
 driver_attach+0x24/0x30
 bus_add_driver+0x110/0x240
 driver_register+0x68/0x124
 __pci_register_driver+0x44/0x50
 amdgpu_init+0x84/0xf94 [amdgpu]
 do_one_initcall+0x60/0x1e0
 do_init_module+0x54/0x200
 load_module+0x18f8/0x1e68
 init_module_from_file+0x74/0xa0
 __arm64_sys_finit_module+0x1e0/0x3f0
 invoke_syscall+0x64/0xe4
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x34/0xd0
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x19c
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1000,000000e0,f169a650,9b7ff667
Memory Limit: none
---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

(The BIOS date appears to be 45 years old, but that is the state the board
was in when I received it.)

I also saw this crash with an RX 6700. Older cards such as the Radeon HD 5450
and the NVIDIA GT 1030 work fine on that board.

A little bit of testing showed that the regression was introduced between 6.12
and 6.13. It also seems that some distro kernels have already picked up the
change: several ISO images I tried failed to boot before I found one with
kernel 6.8 that worked just fine.

The only change related to hdp_v5_0_flush_hdp() was
cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP

Reverting that commit did help and resolved the problem. Before sending the
revert as-is, I wanted to know whether there is supposed to be a proper fix
for this, or whether someone is interested in debugging it or has any suggestions.

In theory I also need to confirm that exactly that change introduced the
regression.

Thanks,
Alexey




* Re: Reply: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
       [not found] ` <SI2PR06MB5041FB15F8DBB44916FB6430F1BD2@SI2PR06MB5041.apcprd06.prod.outlook.com>
@ 2025-04-16 11:25   ` Alexey Klimov
       [not found]     ` <CADnq5_NT0syV8wB=MZZRDONsTNSYwNXhGhNg9LOFmn=MJP7d9Q@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Alexey Klimov @ 2025-04-16 11:25 UTC (permalink / raw)
  To: Fugang Duan, alexander.deucher@amd.com, frank.min@amd.com,
	amd-gfx@lists.freedesktop.org
  Cc: stable@vger.kernel.org, david.belanger@amd.com,
	christian.koenig@amd.com, Peter Chen, cix-kernel-upstream,
	linux-arm-kernel@lists.infradead.org

On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
> From: Alexey Klimov <alexey.klimov@linaro.org> Sent: April 16, 2025 2:28
>>#regzbot introduced: v6.12..v6.13

[..]

>>The only change related to hdp_v5_0_flush_hdp() was
>>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>>
>>Reverting that commit ^^ did help and resolved that problem. Before sending
>>revert as-is I was interested to know if there supposed to be a proper fix for
>>this or maybe someone is interested to debug this or have any suggestions.
>>
> Can you revert the change and try again https://gitlab.com/linux-kernel/linux/-/commit/cf424020e040be35df05b682b546b255e74a420f

Please read my email in the first place.
Let me quote just in case:

>The only change related to hdp_v5_0_flush_hdp() was
>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP

>Reverting that commit ^^ did help and resolved that problem.

Thanks,
Alexey




* Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
  2025-04-15 18:28 [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp() Alexey Klimov
       [not found] ` <SI2PR06MB5041FB15F8DBB44916FB6430F1BD2@SI2PR06MB5041.apcprd06.prod.outlook.com>
@ 2025-04-16 11:44 ` Christian König
  2025-04-22  2:49   ` Alexey Klimov
  1 sibling, 1 reply; 8+ messages in thread
From: Christian König @ 2025-04-16 11:44 UTC (permalink / raw)
  To: Alexey Klimov, alexander.deucher, frank.min, amd-gfx
  Cc: stable, david.belanger, peter.chen, cix-kernel-upstream,
	linux-arm-kernel

On 15.04.25 at 20:28, Alexey Klimov wrote:
> #regzbot introduced: v6.12..v6.13
>
> I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:

Well, in general we have already had tons of problems with low-end ARM64 boards. So the first question is: is that board SBSA certified?

If not, then the chances of that board actually working correctly are unfortunately very low.

> [drm] amdgpu: 7886M of GTT memory ready.
> [drm] GART: num cpu pages 131072, num gpu pages 131072
> SError Interrupt on CPU11, code 0x00000000be000011 -- SError

Any idea what that error code means?

Thanks,
Christian.

> CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S                  6.15.0-rc2+ #1 VOLUNTARY
> Tainted: [S]=CPU_OUT_OF_SPEC
> Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan  1 1980
> pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
> lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
> sp : ffffffc08321b490
> x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
> x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
> x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
> x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
> x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
> x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
> x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
> x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
> x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
> x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
> Kernel panic - not syncing: Asynchronous SError Interrupt
> CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S                  6.15.0-rc2+ #1 VOLUNTARY
> Tainted: [S]=CPU_OUT_OF_SPEC
> Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan  1 1980
> Call trace:
>  show_stack+0x2c/0x84 (C)
>  dump_stack_lvl+0x60/0x80
>  dump_stack+0x18/0x24
>  panic+0x148/0x330
>  add_taint+0x0/0xbc
>  arm64_serror_panic+0x64/0x7c
>  do_serror+0x28/0x68
>  el1h_64_error_handler+0x30/0x48
>  el1h_64_error+0x6c/0x70
>  amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
>  hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
>  gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
>  amdgpu_device_init+0x19f8/0x2480 [amdgpu]
>  amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
>  amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
>  pci_device_probe+0xbc/0x1a8
>  really_probe+0xc0/0x39c
>  __driver_probe_device+0x7c/0x14c
>  driver_probe_device+0x3c/0x120
>  __driver_attach+0xc4/0x200
>  bus_for_each_dev+0x68/0xb4
>  driver_attach+0x24/0x30
>  bus_add_driver+0x110/0x240
>  driver_register+0x68/0x124
>  __pci_register_driver+0x44/0x50
>  amdgpu_init+0x84/0xf94 [amdgpu]
>  do_one_initcall+0x60/0x1e0
>  do_init_module+0x54/0x200
>  load_module+0x18f8/0x1e68
>  init_module_from_file+0x74/0xa0
>  __arm64_sys_finit_module+0x1e0/0x3f0
>  invoke_syscall+0x64/0xe4
>  el0_svc_common.constprop.0+0x40/0xe0
>  do_el0_svc+0x1c/0x28
>  el0_svc+0x34/0xd0
>  el0t_64_sync_handler+0x10c/0x138
>  el0t_64_sync+0x198/0x19c
> SMP: stopping secondary CPUs
> Kernel Offset: disabled
> CPU features: 0x1000,000000e0,f169a650,9b7ff667
> Memory Limit: none
> ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>
> (bios version seems to be 45 years old but that is the state of the board
> when I received it)
>
> Also saw this crash with RX6700. Old radeons like HD5450 and nvidia gt1030
> work fine on that board.
>
> A little bit of testing showed that it was introduced between 6.12 and 6.13.
> Also it seems that changes were taken by some distro kernels already and
> different iso images I tried failed to boot before I bumped into some iso
> with kernel 6.8 that worked just fine.
>
> The only change related to hdp_v5_0_flush_hdp() was
> cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>
> Reverting that commit ^^ did help and resolved that problem. Before sending
> revert as-is I was interested to know if there supposed to be a proper fix
> for this or maybe someone is interested to debug this or have any suggestions.
>
> In theory I also need to confirm that exactly that change introduced the
> regression.
>
> Thanks,
> Alexey
>




* Re: Reply: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
       [not found]         ` <CADnq5_M=YiMVvMpGaFhn2T3jRWGY2FrsUwCVPG6HupmTzZCYug@mail.gmail.com>
@ 2025-04-22  2:20           ` Alexey Klimov
       [not found]             ` <CADnq5_ML25QA7xD+bLqNprO3zzTxJYLkiVw-KmeP-N6TqNHRYA@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Alexey Klimov @ 2025-04-22  2:20 UTC (permalink / raw)
  To: Alex Deucher, Fugang Duan
  Cc: alexander.deucher@amd.com, frank.min@amd.com,
	amd-gfx@lists.freedesktop.org, stable@vger.kernel.org,
	david.belanger@amd.com, christian.koenig@amd.com, Peter Chen,
	cix-kernel-upstream, linux-arm-kernel@lists.infradead.org

On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:
> On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan <fugang.duan@cixtech.com> wrote:
>>
>> From: Alex Deucher <alexdeucher@gmail.com> Sent: April 16, 2025 22:49
>> >To: Alexey Klimov <alexey.klimov@linaro.org>
>> >On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>> >>
>> >> On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
>> >> > From: Alexey Klimov <alexey.klimov@linaro.org> Sent: April 16, 2025 2:28
>> >> >>#regzbot introduced: v6.12..v6.13
>> >>
>> >> [..]
>> >>
>> >> >>The only change related to hdp_v5_0_flush_hdp() was
>> >> >>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>> >> >>
>> >> >>Reverting that commit ^^ did help and resolved that problem. Before
>> >> >>sending revert as-is I was interested to know if there supposed to
>> >> >>be a proper fix for this or maybe someone is interested to debug this or
>> >> >>have any suggestions.
>> >> >>
>> >> > Can you revert the change and try again
>> >> > https://gitlab.com/linux-kernel/linux/-/commit/cf424020e040be35df05b682b546b255e74a420f
>> >>
>> >> Please read my email in the first place.
>> >> Let me quote just in case:
>> >>
>> >> >The only change related to hdp_v5_0_flush_hdp() was
>> >> >cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>> >>
>> >> >Reverting that commit ^^ did help and resolved that problem.
>> >
>> >We can't really revert the change as that will lead to coherency problems.  What
>> >is the page size on your system?  Does the attached patch fix it?
>> >
>> >Alex
>> >
>> 4K page size.  We can try the fix if we got the environment.
>
> OK.  that patch won't change anything then.  Can you try this patch instead?

The config I am using is basically defconfig with respect to memory parameters, so yes, I use 4K pages.

So I tested that patch, thank you, along with some other configurations, and
nothing helped. Exactly the same behaviour with the same backtrace.

So it seems that it is a firmware problem after all?

Thanks,
Alexey



* Re: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
  2025-04-16 11:44 ` Christian König
@ 2025-04-22  2:49   ` Alexey Klimov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexey Klimov @ 2025-04-22  2:49 UTC (permalink / raw)
  To: Christian König, alexander.deucher, frank.min, amd-gfx
  Cc: stable, david.belanger, peter.chen, cix-kernel-upstream,
	linux-arm-kernel

On Wed Apr 16, 2025 at 12:44 PM BST, Christian König wrote:
> On 15.04.25 at 20:28, Alexey Klimov wrote:
>> #regzbot introduced: v6.12..v6.13
>>
>> I use RX6600 on arm64 Orion o6 board and it seems that amdgpu is broken on recent kernels, fails on boot:
>
> Well in general we already had tons of problems with low end ARM64 boards. So first question of all is that board SBSA certified?

Yeah, I can imagine.
I can't find any info about SBSA certification for that board, so I'd say the
state is unknown and hence most likely "no". At least that's what I think.
It is a good question for Cix or the people with cixtech.com email addresses.

They do have some updated, potentially unstable UEFI firmware to test, though.

> If not then the chances of that board actually working correctly are very low unfortunately.
>
>> [drm] amdgpu: 7886M of GTT memory ready.
>> [drm] GART: num cpu pages 131072, num gpu pages 131072
>> SError Interrupt on CPU11, code 0x00000000be000011 -- SError
>
> Any idea what that error code means?

Well, my current understanding is that it means:
-- bits 31:26 indicate a system error (SError) interrupt;
-- bit 25 indicates that it was a 32-bit instruction;
-- 0x11 in the low bits is probably implementation-defined and can
be anything such as bus errors, parity errors, access violations, etc.

That is probably not very helpful here.

Best regards,
Alexey



* Re: Reply: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
       [not found]             ` <CADnq5_ML25QA7xD+bLqNprO3zzTxJYLkiVw-KmeP-N6TqNHRYA@mail.gmail.com>
@ 2025-04-22 15:59               ` Alexey Klimov
  2025-04-23 14:32                 ` Christian König
       [not found]                 ` <CADnq5_NE2M19JdrULtJH-OXwycDpu0hrFHy42YiJA3nMYoP=+w@mail.gmail.com>
  0 siblings, 2 replies; 8+ messages in thread
From: Alexey Klimov @ 2025-04-22 15:59 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Fugang Duan, alexander.deucher@amd.com, frank.min@amd.com,
	amd-gfx@lists.freedesktop.org, stable@vger.kernel.org,
	david.belanger@amd.com, christian.koenig@amd.com, Peter Chen,
	cix-kernel-upstream, linux-arm-kernel@lists.infradead.org

On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:
> On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>>
>> On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:
>> > On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan <fugang.duan@cixtech.com> wrote:
>> >>
>> >> From: Alex Deucher <alexdeucher@gmail.com> Sent: April 16, 2025 22:49
>> >> >To: Alexey Klimov <alexey.klimov@linaro.org>
>> >> >On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>> >> >>
>> >> >> On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
>> >> >> > From: Alexey Klimov <alexey.klimov@linaro.org> Sent: April 16, 2025 2:28
>> >> >> >>#regzbot introduced: v6.12..v6.13
>> >> >> >>The only change related to hdp_v5_0_flush_hdp() was
>> >> >> >>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>> >> >> >>
>> >> >> >>Reverting that commit ^^ did help and resolved that problem. Before

[..]

>> > OK.  that patch won't change anything then.  Can you try this patch instead?
>>
>> Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.
>>
>> So I tested that patch, thank you, and some other different configurations --
>> nothing helped. Exactly the same behaviour with the same backtrace.
>
> Did you test the first (4k check) or the second (don't remap on ARM) patch?

The second one. I think you mentioned that the first one won't help for 4K pages.


>> So it seems that it is firmware problem after all?
>
> There is no GPU firmware involved in this operation.  It's just a
> posted write.  E.g., we write to a register to flush the HDP write
> queue and then read the register back to make sure the write posted.
> If the second patch didn't help, then perhaps there is some issue with
> MMIO access on your platform?

I didn't mean GPU firmware at all. I only had UEFI/EL3 firmware in mind.

Completely out of the blue, based on nothing: do you think that
adding a delay or some memory barrier between the write and the read might help?
I also wonder: if the host data path code is executed during ordinary desktop
usage, why doesn't it break later? But yeah, I also
think this is a problem with this motherboard. Thank you.

Thanks,
Alexey




* Re: Reply: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
  2025-04-22 15:59               ` Alexey Klimov
@ 2025-04-23 14:32                 ` Christian König
       [not found]                 ` <CADnq5_NE2M19JdrULtJH-OXwycDpu0hrFHy42YiJA3nMYoP=+w@mail.gmail.com>
  1 sibling, 0 replies; 8+ messages in thread
From: Christian König @ 2025-04-23 14:32 UTC (permalink / raw)
  To: Alexey Klimov, Alex Deucher
  Cc: Fugang Duan, alexander.deucher@amd.com, frank.min@amd.com,
	amd-gfx@lists.freedesktop.org, stable@vger.kernel.org,
	david.belanger@amd.com, Peter Chen, cix-kernel-upstream,
	linux-arm-kernel@lists.infradead.org

On 4/22/25 17:59, Alexey Klimov wrote:
> On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:
>> On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>>>
>>> On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:
>>>> On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan <fugang.duan@cixtech.com> wrote:
>>>>>
>>>>> From: Alex Deucher <alexdeucher@gmail.com> Sent: April 16, 2025 22:49
>>>>>> To: Alexey Klimov <alexey.klimov@linaro.org>
>>>>>> On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>>>>>>>
>>>>>>> On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
>>>>>>>> From: Alexey Klimov <alexey.klimov@linaro.org> Sent: April 16, 2025 2:28
>>>>>>>>> #regzbot introduced: v6.12..v6.13
>>>>>>>>> The only change related to hdp_v5_0_flush_hdp() was
>>>>>>>>> cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>>>>>>>>>
>>>>>>>>> Reverting that commit ^^ did help and resolved that problem. Before
> 
> [..]
> 
>>>> OK.  that patch won't change anything then.  Can you try this patch instead?
>>>
>>> Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.
>>>
>>> So I tested that patch, thank you, and some other different configurations --
>>> nothing helped. Exactly the same behaviour with the same backtrace.
>>
>> Did you test the first (4k check) or the second (don't remap on ARM) patch?
> 
> The second one. I think you mentioned that first one won't help for 4k pages.
> 
> 
>>> So it seems that it is firmware problem after all?
>>
>> There is no GPU firmware involved in this operation.  It's just a
>> posted write.  E.g., we write to a register to flush the HDP write
>> queue and then read the register back to make sure the write posted.
>> If the second patch didn't help, then perhaps there is some issue with
>> MMIO access on your platform?
> 
> I didn't mean GPU firmware at all. I only had uefi/EL3 firmwares in mind.
> 
> Completely out of the blue, based on nothing, do you think that
> adding delay/some mem barrier between write and read might help?

That would still be quite a platform bug.

> I wonder if host data path code should be executed during common desktop
> usage as a common user then why it doesn't break later.

Maybe it's some kind of write/read re-ordering issue.

> But yeah, I also think this is this motherboard problem. Thank you.

You should probably ping some ARM guys to figure out what the fault code actually means.

Regards,
Christian.

> 
> Thanks,
> Alexey
> 




* Re: Reply: [REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
       [not found]                 ` <CADnq5_NE2M19JdrULtJH-OXwycDpu0hrFHy42YiJA3nMYoP=+w@mail.gmail.com>
@ 2025-04-27  1:01                   ` Alexey Klimov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexey Klimov @ 2025-04-27  1:01 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Fugang Duan, alexander.deucher@amd.com, frank.min@amd.com,
	amd-gfx@lists.freedesktop.org, stable@vger.kernel.org,
	david.belanger@amd.com, christian.koenig@amd.com, Peter Chen,
	cix-kernel-upstream, linux-arm-kernel@lists.infradead.org

On Thu Apr 24, 2025 at 4:44 PM BST, Alex Deucher wrote:
> On Tue, Apr 22, 2025 at 11:59 AM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>>
>> On Tue Apr 22, 2025 at 2:00 PM BST, Alex Deucher wrote:
>> > On Mon, Apr 21, 2025 at 10:21 PM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>> >>
>> >> On Thu Apr 17, 2025 at 2:08 PM BST, Alex Deucher wrote:
>> >> > On Wed, Apr 16, 2025 at 8:43 PM Fugang Duan <fugang.duan@cixtech.com> wrote:
>> >> >>
>> >> >> From: Alex Deucher <alexdeucher@gmail.com> Sent: April 16, 2025 22:49
>> >> >> >To: Alexey Klimov <alexey.klimov@linaro.org>
>> >> >> >On Wed, Apr 16, 2025 at 9:48 AM Alexey Klimov <alexey.klimov@linaro.org> wrote:
>> >> >> >>
>> >> >> >> On Wed Apr 16, 2025 at 4:12 AM BST, Fugang Duan wrote:
>> >> >> >> > From: Alexey Klimov <alexey.klimov@linaro.org> Sent: April 16, 2025 2:28
>> >> >> >> >>#regzbot introduced: v6.12..v6.13
>> >> >> >> >>The only change related to hdp_v5_0_flush_hdp() was
>> >> >> >> >>cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>> >> >> >> >>
>> >> >> >> >>Reverting that commit ^^ did help and resolved that problem. Before
>>
>> [..]
>>
>> >> > OK.  that patch won't change anything then.  Can you try this patch instead?
>> >>
>> >> Config I am using is basically defconfig wrt memory parameters, yeah, i use 4k.
>> >>
>> >> So I tested that patch, thank you, and some other different configurations --
>> >> nothing helped. Exactly the same behaviour with the same backtrace.
>> >
>> > Did you test the first (4k check) or the second (don't remap on ARM) patch?
>>
>> The second one. I think you mentioned that first one won't help for 4k pages.
>>
>>
>> >> So it seems that it is firmware problem after all?
>> >
>> > There is no GPU firmware involved in this operation.  It's just a
>> > posted write.  E.g., we write to a register to flush the HDP write
>> > queue and then read the register back to make sure the write posted.
>> > If the second patch didn't help, then perhaps there is some issue with
>> > MMIO access on your platform?
>>
>> I didn't mean GPU firmware at all. I only had uefi/EL3 firmwares in mind.
>>
>> Completely out of the blue, based on nothing, do you think that
>> adding delay/some mem barrier between write and read might help?
>> I wonder if host data path code should be executed during common desktop
>> usage as a common user then why it doesn't break later. But yeah, I also
>> think this is this motherboard problem. Thank you.
>
> I think I found the problem.  The previous patch wasn't doing what I
> expected.  Please try this patch instead.

This one works!

[    4.483750] [drm] amdgpu kernel modesetting enabled.
[    4.491985] amdgpu: IO link not available for non x86 platforms
[    4.497189] amdgpu: Virtual CRAT table created for CPU
[    4.497559] amdgpu: Topology: Add CPU node
[    4.509623] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 0 <nv_common>
[    4.512905] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 1 <gmc_v10_0>
[    4.513254] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 2 <navi10_ih>
[    4.513595] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 3 <psp>
[    4.513932] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 4 <smu>
[    4.514278] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 5 <dm>
[    4.514625] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 6 <gfx_v10_0>
[    4.514980] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 7 <sdma_v5_2>
[    4.515334] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 8 <vcn_v3_0>
[    4.515699] amdgpu 0000:c3:00.0: amdgpu: detected ip block number 9 <jpeg_v3_0>
[    4.516087] amdgpu 0000:c3:00.0: amdgpu: Fetched VBIOS from VFCT
[    4.516466] amdgpu: ATOM BIOS: 113-V502MECH-0OC
[    4.749748] amdgpu 0000:c3:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    4.777435] amdgpu 0000:c3:00.0: BAR 2 [mem 0x1810000000-0x18101fffff 64bit pref]: releasing
[    4.793256] amdgpu 0000:c3:00.0: BAR 0 [mem 0x1800000000-0x180fffffff 64bit pref]: releasing
[    4.844639] amdgpu 0000:c3:00.0: BAR 0 [mem 0x1800000000-0x19ffffffff 64bit pref]: assigned
[    4.849774] amdgpu 0000:c3:00.0: BAR 2 [mem 0x1a00000000-0x1a001fffff 64bit pref]: assigned
[    4.957411] amdgpu 0000:c3:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    4.967618] amdgpu 0000:c3:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    4.992963] [drm] amdgpu: 8176M of VRAM memory ready
[    5.004032] [drm] amdgpu: 7888M of GTT memory ready.
[    6.224159] amdgpu 0000:c3:00.0: amdgpu: STB initialized to 2048 entries
[    6.284328] amdgpu 0000:c3:00.0: amdgpu: Found VCN firmware Version ENC: 1.33 DEC: 4 VEP: 0 Revision: 3
[    6.361142] amdgpu 0000:c3:00.0: amdgpu: reserve 0xa00000 from 0x81fd000000 for PSP TMR
[    6.471231] amdgpu 0000:c3:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    6.492967] amdgpu 0000:c3:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    6.492993] amdgpu 0000:c3:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b3100 (59.49.0)
[    6.513659] amdgpu 0000:c3:00.0: amdgpu: SMU driver if version not matched
[    6.513699] amdgpu 0000:c3:00.0: amdgpu: use vbios provided pptable
[    6.588418] amdgpu 0000:c3:00.0: amdgpu: SMU is initialized successfully!
[    6.800975] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    6.806709] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    6.813516] amdgpu: Virtual CRAT table created for GPU
[    6.819229] amdgpu: Topology: Add dGPU node [0x73ff:0x1002]
[    6.824865] kfd kfd: amdgpu: added device 1002:73ff
[    6.829821] amdgpu 0000:c3:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[    6.838355] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    6.846007] amdgpu 0000:c3:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[    6.853658] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[    6.861398] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[    6.869137] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[    6.876877] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[    6.884615] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[    6.892356] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[    6.900094] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[    6.907921] amdgpu 0000:c3:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[    6.915748] amdgpu 0000:c3:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[    6.923663] amdgpu 0000:c3:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[    6.931050] amdgpu 0000:c3:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[    6.938439] amdgpu 0000:c3:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[    6.946089] amdgpu 0000:c3:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[    6.953916] amdgpu 0000:c3:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[    6.961742] amdgpu 0000:c3:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[    6.970485] amdgpu 0000:c3:00.0: amdgpu: Using BACO for runtime pm
[    6.977167] [drm] Initialized amdgpu 3.63.0 for 0000:c3:00.0 on minor 0
[    7.234638] amdgpu 0000:c3:00.0: [drm] fb0: amdgpudrmfb frame buffer device
root@orion:~ # uname -a
Linux orion 6.15.0-rc3test6+ #1 SMP Sun Apr 27 01:12:10 BST 2025 aarch64 GNU/Linux

Thank you for taking a look into this.

Best regards,
Alexey



