From: Precific <precification@posteo.de>
To: Peter Xu <peterx@redhat.com>,
Athul Krishna <athul.krishna.kr@protonmail.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
Linux PCI <linux-pci@vger.kernel.org>,
"regressions@lists.linux.dev" <regressions@lists.linux.dev>
Subject: Re: [bugzilla-daemon@kernel.org: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade]
Date: Mon, 30 Dec 2024 21:03:30 +0000
Message-ID: <16ea1922-c9ce-4d73-b9b6-8365ca3fcf32@posteo.de>
In-Reply-To: <Z2mW2k8GfP7S0c5M@x1n>
[-- Attachment #1: Type: text/plain, Size: 4345 bytes --]
On 23.12.24 17:59, Peter Xu wrote:
> On Mon, Dec 23, 2024 at 07:37:46AM +0000, Athul Krishna wrote:
>> Can confirm. Reverting f9e54c3a2f5b from v6.13-rc1 fixed the problem.
>>>
>>> Device: Asus Zephyrus GA402RJ
>>> CPU: Ryzen 7 6800HS
>>> GPU: RX 6700S
>>> Kernel: 6.13.0-rc3-g8faabc041a00
>>>
>>> Problem:
>>> Launching games or gpu bench-marking tools in qemu windows 11 vm will cause
>>> screen artifacts, ultimately qemu will pause with unrecoverable error.
>
> Is there more information on what setup can reproduce it?
>
> For example, does it only happen with Windows guests? Does the GPU
> vendor/model matter?
In my case, both Windows and Linux guests fail to initialize the GPU in
the first place since 6.12; QEMU does not crash. I also found commit
f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 by bisection.
CPU: AMD 7950X3D
GPU (guest): AMD RX 6700XT (12GB)
Mainboard: ASRock X670E Steel Legend
Kernel: 6.12.0-rc0 .. 6.13.0-rc2
Based on a handful of reports on the Arch forum and on r/vfio, my guess
is that affected users have Resizable BAR or similar settings enabled in
the firmware, which usually applies the maximum possible BAR size
advertised by the GPU on startup. Non-2^n-sized VRAM buffers may be
another commonality.
Some other reports I found that could fit this regression:
[1] https://bbs.archlinux.org/viewtopic.php?id=301352
- Several reports (besides mine), not clear which of those cases are
triggered by the vfio-pci commit. One case is clearly caused by a
different commit in KVM. Potential candidates for the vfio-pci commit
(speculation): (a) 6700XT GPU; (b) 5950X CPU, RTX 3090 GPU
[2] https://old.reddit.com/r/VFIO/comments/1hkvedq/
- Two users, 7900XT and 7900XTX, reported that downgrading from 6.12 or
disabling ReBAR resolves the Windows guest GPU initialization failure.
On my 6700XT (PCI address 03:00.0, 12GB of VRAM), I get the following
Resizable BAR default configuration with the host firmware/UEFI setting
enabled:
[root]# lspci -s 03:00.0 -vv
...
Capabilities: [200 v1] Physical Resizable BAR
BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
...
The 16GB configuration above fails with 6.12 (unless I revert commit
f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101).
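For anyone comparing setups, a minimal sketch for pulling the current BAR size out of `lspci -vv` output. It assumes the "BAR n: current size: ..." line format shown above; the function name `rebar_current_size` is made up for illustration.

```shell
#!/bin/sh
# rebar_current_size N: read `lspci -vv` text on stdin and print the
# "current size" value reported for BAR N in the Resizable BAR capability.
# Assumes the line format "BAR N: current size: <size>, supported: ...".
rebar_current_size() {
    sed -n "s/^[[:space:]]*BAR $1: current size: \([0-9]*[KMG]B\).*/\1/p"
}

# Demo on the BAR 0 line quoted above (no PCI device needed):
printf '%s\n' 'BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB' |
    rebar_current_size 0
```

On a real host this would be used as `lspci -s 03:00.0 -vv | rebar_current_size 0`, with the PCI address adjusted to the passed-through GPU.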
Now, if I change BAR 0 to 8GB (as below), which is below the GPU's VRAM
size of 12GB, the Linux guest manages to initialize the GPU.
[root]# echo "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
[root]# #13: 8GB, 14: 16GB
[root]# echo 13 > /sys/bus/pci/devices/0000:03:00.0/resource0_resize
[root]# echo "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
On my mainboard, 'Resizable BAR off' sets BAR 0 to 256MB, which also
works with 6.12.
Only the size of BAR 0 (VRAM) appears to be relevant here. BAR 2 sizes
of 2MB vs. 256MB have no effect on the outcome.
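The value written to resource0_resize can be derived rather than looked up. The sketch below assumes that the value is the base-2 logarithm of the BAR size in MB, which is consistent with the "13: 8GB, 14: 16GB" comment in the commands above. The helper names are hypothetical.

```shell
#!/bin/sh
# Assumption: writing n to resourceN_resize selects a BAR size of 2^n MB
# (13 -> 8192 MB = 8GB, 14 -> 16384 MB = 16GB).

# rebar_index_to_mb N: print the BAR size in MB selected by index N.
rebar_index_to_mb() {
    echo $(( 1 << $1 ))
}

# rebar_mb_to_index MB: print the index for a power-of-two size in MB.
# Returns 1 (and prints nothing) if MB is not a power of two.
rebar_mb_to_index() {
    _mb=$1
    _n=0
    [ "$_mb" -gt 0 ] || return 1
    # Strip trailing zero bits, counting them.
    while [ $(( _mb & 1 )) -eq 0 ]; do
        _mb=$(( _mb >> 1 ))
        _n=$(( _n + 1 ))
    done
    [ "$_mb" -eq 1 ] || return 1
    echo "$_n"
}

rebar_index_to_mb 13      # prints 8192
rebar_mb_to_index 16384   # prints 14
```

A 12GB VRAM size (12272 MB as reported by amdgpu) is not a power of two, so `rebar_mb_to_index 12272` fails. The BAR has to be either smaller (8GB) or larger (16GB) than VRAM.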
>
>>>
>>> Commit:
>>> f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
>>> commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
>>> Author: Alex Williamson <alex.williamson@redhat.com>
>>> Date: Mon Aug 26 16:43:53 2024 -0400
>>>
>>> vfio/pci: implement huge_fault support
>
> Personally I have no clue yet on how this could affect it. I was initially
> worrying on any implicit cache mode changes on the mappings, but I don't
> think any of such was involved in this specific change.
>
> This commit majorly does two things: (1) allow 2M/1G mappings for BARs
> instead of small 4Ks always, and (2) always lazy faults rather than
> "install everything in the 1st fault". Maybe one of the two could have
> some impact in some way.
In my case, commenting out (1) the huge_fault callback assignment from
f9e54c3a2f5b suffices for GPU initialization in the guest, even if (2)
the 'install everything' loop is still removed.
I have uploaded host kernel logs with vfio-pci-core debugging enabled
(one log with stock sources, one large log with vfio-pci-core's
huge_fault handler patched out):
https://bugzilla.kernel.org/show_bug.cgi?id=219619#c1
I'm not sure if the logs of handled faults say much about what
specifically goes wrong here, though.
The dmesg portion attached to my mail is of a Linux guest failing to
initialize the GPU (BAR 0 size 16GB with 12GB of VRAM).
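To check whether another guest matches the "BAR larger than VRAM" pattern, the amdgpu "Detected VRAM RAM=...M, BAR=...M" line from the guest dmesg can be compared mechanically. This is only a sketch for spotting the suspected pattern and does not show that the regression is actually triggered. The function name `check_vram_bar` is made up for illustration.

```shell
#!/bin/sh
# check_vram_bar: read guest dmesg on stdin. For each amdgpu line of the
# form "Detected VRAM RAM=<vram>M, BAR=<bar>M", report whether the BAR
# is larger than the detected VRAM.
check_vram_bar() {
    sed -n 's/.*Detected VRAM RAM=\([0-9]*\)M, BAR=\([0-9]*\)M.*/\1 \2/p' |
    while read -r vram bar; do
        if [ "$bar" -gt "$vram" ]; then
            echo "BAR ${bar}M > VRAM ${vram}M"
        else
            echo "BAR ${bar}M <= VRAM ${vram}M"
        fi
    done
}

# Demo on the line from the attached dmesg:
echo '[ 10.284375] [drm] Detected VRAM RAM=12272M, BAR=16384M' | check_vram_bar
```

Inside a guest this would be run as `dmesg | check_vram_bar`.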
Thanks,
Precific
[-- Attachment #2: 2024-12-21-vfiopcicore-regression-guest-amdgpu-dmesg.txt --]
[-- Type: text/plain, Size: 7864 bytes --]
- Dmesg of a linux guest failing amdgpu initialization. Host running kernel 6.12/6.13, with ReBAR enabled (16GB BAR 0)
[[note: some variations can occur, e.g., the error sometimes occurs at a later stage of initialization]]
[ 10.245100] [drm] amdgpu kernel modesetting enabled.
[ 10.245173] amdgpu: Virtual CRAT table created for CPU
[ 10.245182] amdgpu: Topology: Add CPU node
[ 10.245480] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[ 10.245492] [drm] register mmio base: 0x81A00000
[ 10.245493] [drm] register mmio size: 1048576
[ 10.248861] [drm] add ip block number 0 <nv_common>
[ 10.248862] [drm] add ip block number 1 <gmc_v10_0>
[ 10.248863] [drm] add ip block number 2 <navi10_ih>
[ 10.248864] [drm] add ip block number 3 <psp>
[ 10.248864] [drm] add ip block number 4 <smu>
[ 10.248865] [drm] add ip block number 5 <dm>
[ 10.248866] [drm] add ip block number 6 <gfx_v10_0>
[ 10.248867] [drm] add ip block number 7 <sdma_v5_2>
[ 10.248867] [drm] add ip block number 8 <vcn_v3_0>
[ 10.248868] [drm] add ip block number 9 <jpeg_v3_0>
[ 10.248877] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from VFCT
[ 10.248878] amdgpu: ATOM BIOS: 113-D5121100-101
[ 10.270097] [drm] VCN(0) decode is enabled in VM mode
[ 10.270099] [drm] VCN(0) encode is enabled in VM mode
[ 10.284318] [drm] JPEG decode is enabled in VM mode
[ 10.284320] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 10.284359] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 10.284365] amdgpu 0000:05:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[ 10.284367] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 10.284375] [drm] Detected VRAM RAM=12272M, BAR=16384M
[ 10.284376] [drm] RAM width 192bits GDDR6
[ 10.284495] [drm] amdgpu: 12272M of VRAM memory ready
[ 10.284496] [drm] amdgpu: 16042M of GTT memory ready.
[ 10.284505] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 10.284626] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 12.218276] amdgpu 0000:05:00.0: amdgpu: STB initialized to 2048 entries
[ 12.218333] [drm] Loading DMUB firmware via PSP: version=0x02020020
[ 12.218647] [drm] use_doorbell being set to: [true]
[ 12.218658] [drm] use_doorbell being set to: [true]
[ 12.218667] [drm] Found VCN firmware Version ENC: 1.30 DEC: 3 VEP: 0 Revision: 4
[ 12.218672] amdgpu 0000:05:00.0: amdgpu: Will use PSP to load VCN firmware
[ 14.390991] [drm] psp gfx command ID_LOAD_TOC(0x20) failed and response status is (0x0)
[ 14.390994] [drm:psp_hw_start [amdgpu]] *ERROR* Failed to load toc
[ 14.391223] [drm:psp_hw_start [amdgpu]] *ERROR* PSP tmr init failed!
[ 14.411423] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[ 14.411604] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 14.411784] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
[ 14.411785] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[ 14.411786] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 14.411928] ------------[ cut here ]------------
[ 14.411929] WARNING: CPU: 6 PID: 507 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 14.412114] Modules linked in: amdgpu(+) video wmi amdxcp i2c_algo_bit drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul crc32c_intel polyval_clmulni drm_exec polyval_generic ghash_clmulni_intel gpu_sched nvme sha512_ssse3 drm_suballoc_helper drm_buddy sha256_ssse3 drm_display_helper nvme_core sha1_ssse3 virtio_net cec nvme_auth virtio_console net_failover virtio_blk failover qemu_fw_cfg serio_raw ip6_tables ip_tables fuse
[ 14.412133] CPU: 6 PID: 507 Comm: (udev-worker) Not tainted 6.8.5-201.fc39.x86_64 #1
[ 14.412134] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 14.412135] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 14.412305] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 6a 30 bc e3 e9 5a fd ff ff <0f> 0b b8 ea ff ff ff e9 59 30 bc e3 b8 ea ff ff ff e9 4f 30 bc e3
[ 14.412306] RSP: 0018:ffffaae50112ba60 EFLAGS: 00010246
[ 14.412308] RAX: ffff8bbcca3ed100 RBX: ffff8bbcd19987a8 RCX: 0000000000000000
[ 14.412309] RDX: 0000000000000000 RSI: ffff8bbcd19a4db8 RDI: ffff8bbcd1980000
[ 14.412310] RBP: ffff8bbcd19901e8 R08: 0000000000000000 R09: ffffaae50112b878
[ 14.412311] R10: ffffaae50112b870 R11: 0000000000000003 R12: ffff8bbcd19905c8
[ 14.412311] R13: ffff8bbcd1980010 R14: ffff8bbcd1980000 R15: ffff8bbcd19a4db8
[ 14.412313] FS: 00007f5fde03e980(0000) GS:ffff8bc41fb80000(0000) knlGS:0000000000000000
[ 14.412315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.412316] CR2: 00005623742f1000 CR3: 000000010c1fa000 CR4: 0000000000750ef0
[ 14.412318] PKRU: 55555554
[ 14.412319] Call Trace:
[ 14.412320] <TASK>
[ 14.412321] ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 14.412493] ? __warn+0x81/0x130
[ 14.412497] ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 14.412677] ? report_bug+0x171/0x1a0
[ 14.412681] ? handle_bug+0x3c/0x80
[ 14.412683] ? exc_invalid_op+0x17/0x70
[ 14.412685] ? asm_exc_invalid_op+0x1a/0x20
[ 14.412688] ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[ 14.412857] amdgpu_fence_driver_hw_fini+0xfe/0x130 [amdgpu]
[ 14.413049] amdgpu_device_fini_hw+0xa6/0x400 [amdgpu]
[ 14.413233] ? blocking_notifier_chain_unregister+0x36/0x50
[ 14.413236] amdgpu_driver_load_kms+0xec/0x190 [amdgpu]
[ 14.413411] amdgpu_pci_probe+0x18b/0x510 [amdgpu]
[ 14.413586] local_pci_probe+0x42/0xa0
[ 14.413589] pci_device_probe+0xc7/0x240
[ 14.413592] really_probe+0x19b/0x3e0
[ 14.413595] ? __pfx___driver_attach+0x10/0x10
[ 14.413597] __driver_probe_device+0x78/0x160
[ 14.413599] driver_probe_device+0x1f/0x90
[ 14.413601] __driver_attach+0xd2/0x1c0
[ 14.413603] bus_for_each_dev+0x85/0xd0
[ 14.413605] bus_add_driver+0x116/0x220
[ 14.413607] driver_register+0x59/0x100
[ 14.413609] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[ 14.413768] do_one_initcall+0x58/0x320
[ 14.413772] do_init_module+0x60/0x240
[ 14.413775] __do_sys_init_module+0x17f/0x1b0
[ 14.413776] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413782] do_syscall_64+0x83/0x170
[ 14.413784] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413786] ? __count_memcg_events+0x4d/0xc0
[ 14.413788] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413790] ? count_memcg_events.constprop.0+0x1a/0x30
[ 14.413792] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413793] ? handle_mm_fault+0xa2/0x360
[ 14.413795] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413797] ? do_user_addr_fault+0x304/0x670
[ 14.413800] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413801] ? srso_alias_return_thunk+0x5/0xfbef5
[ 14.413803] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 14.413805] RIP: 0033:0x7f5fdea2cb9e
[ 14.413808] Code: 48 8b 0d 95 12 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 62 12 0c 00 f7 d8 64 89 01 48
[ 14.413809] RSP: 002b:00007ffc13be8998 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 14.413811] RAX: ffffffffffffffda RBX: 00005623741c55a0 RCX: 00007f5fdea2cb9e
[ 14.413812] RDX: 00005623741be530 RSI: 00000000019d58ce RDI: 00007f5fdb000010
[ 14.413813] RBP: 00007ffc13be8a50 R08: 0000562374199010 R09: 0000000000000007
[ 14.413814] R10: 0000000000000001 R11: 0000000000000246 R12: 00005623741be530
[ 14.413814] R13: 0000000000020000 R14: 00005623741c0030 R15: 00005623741c9120
[ 14.413817] </TASK>
[ 14.413818] ---[ end trace 0000000000000000 ]---