Linux PCI subsystem development
From: Precific <precification@posteo.de>
To: Peter Xu <peterx@redhat.com>,
	Athul Krishna <athul.krishna.kr@protonmail.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Linux PCI <linux-pci@vger.kernel.org>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>
Subject: Re: [bugzilla-daemon@kernel.org: [Bug 219619] New: vfio-pci: screen graphics artifacts after 6.12 kernel upgrade]
Date: Mon, 30 Dec 2024 21:03:30 +0000
Message-ID: <16ea1922-c9ce-4d73-b9b6-8365ca3fcf32@posteo.de>
In-Reply-To: <Z2mW2k8GfP7S0c5M@x1n>


On 23.12.24 17:59, Peter Xu wrote:
> On Mon, Dec 23, 2024 at 07:37:46AM +0000, Athul Krishna wrote:
>> Can confirm. Reverting f9e54c3a2f5b from v6.13-rc1 fixed the problem.
>>>   
>>>   Device: Asus Zephyrus GA402RJ
>>>   CPU: Ryzen 7 6800HS
>>>   GPU: RX 6700S
>>>   Kernel: 6.13.0-rc3-g8faabc041a00
>>>   
>>>   Problem:
>>>   Launching games or gpu bench-marking tools in qemu windows 11 vm will cause
>>>   screen artifacts, ultimately qemu will pause with unrecoverable error.
> 
> Is there more information on what setup can reproduce it?
> 
> For example, does it only happen with Windows guests?  Does the GPU
> vendor/model matter?

In my case, both Windows and Linux guests fail to initialize the GPU in 
the first place since 6.12; QEMU does not crash. I also found commit 
f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 by bisection.

CPU: AMD 7950X3D
GPU (guest): AMD RX 6700XT (12GB)
Mainboard: ASRock X670E Steel Legend
Kernel: 6.12.0-rc0 .. 6.13.0-rc2

Based on a handful of reports on the Arch forum and on r/vfio, my guess 
is that affected users have Resizable BAR or similar settings enabled in 
the firmware, which usually applies the maximum possible BAR size 
advertised by the GPU on startup. Non-2^n-sized VRAM buffers may be 
another commonality.
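
To illustrate the non-2^n point (my own sketch, not taken from any of the reports): BAR sizes are always powers of two, so a GPU with 12GB of VRAM needs a 16GB BAR to expose all of it, which makes the BAR 4GB larger than the VRAM itself. The helper name min_covering_bar_mb is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: smallest power-of-two BAR size (in MB) that
# covers a given VRAM size (in MB).
min_covering_bar_mb() {
    vram_mb=$1
    bar_mb=1
    while [ "$bar_mb" -lt "$vram_mb" ]; do
        bar_mb=$((bar_mb * 2))
    done
    echo "$bar_mb"
}

min_covering_bar_mb 12288   # 12GB of VRAM -> prints 16384 (16GB)
min_covering_bar_mb 8192    # 8GB of VRAM  -> prints 8192
```

Whether this size mismatch actually matters for the regression is still a guess on my side.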

Some other reports I found that could fit this regression:
[1] https://bbs.archlinux.org/viewtopic.php?id=301352
   - Several reports (besides mine); it is not clear which of those cases
     are triggered by the vfio-pci commit. One case is clearly caused by a
     different commit in KVM. Potential candidates for the vfio-pci commit
     (speculation): (a) 6700XT GPU; (b) 5950X CPU, RTX 3090 GPU
[2] https://old.reddit.com/r/VFIO/comments/1hkvedq/
   - Two users (7900XT and 7900XTX) reported that either going back from
     6.12 to an older kernel or disabling ReBAR resolves Windows guest GPU
     initialization.

On my 6700XT (PCI address 03:00.0, 12GB of VRAM), I get the following 
Resizable BAR default configuration with the host firmware/UEFI setting 
enabled:

[root]# lspci -s 03:00.0 -vv
...
Capabilities: [200 v1] Physical Resizable BAR
	BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
	BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
...

The 16GB configuration above fails with 6.12 (unless I revert commit 
f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101).
Now, if I change BAR 0 to 8GB (as below), which is below the GPU's VRAM 
size of 12GB, the Linux guest manages to initialize the GPU.

[root]# echo "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
[root]# # 13: 8GB, 14: 16GB (the value is log2 of the size in MB)
[root]# echo 13 > /sys/bus/pci/devices/0000:03:00.0/resource0_resize
[root]# echo "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/bind

On my mainboard, 'Resizable BAR off' sets BAR 0 to 256MB, which also 
works with 6.12.

Only the size of BAR 0 (VRAM) appears to be relevant here. BAR 2 sizes 
of 2MB vs. 256MB have no effect on the outcome.

> 
>>>   
>>>   Commit:
>>>   f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 is the first bad commit
>>>   commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101
>>>   Author: Alex Williamson <alex.williamson@redhat.com>
>>>   Date:   Mon Aug 26 16:43:53 2024 -0400
>>>   
>>>       vfio/pci: implement huge_fault support
> 
> Personally I have no clue yet on how this could affect it.  I was initially
> worrying on any implicit cache mode changes on the mappings, but I don't
> think any of such was involved in this specific change.
> 
> This commit majorly does two things: (1) allow 2M/1G mappings for BARs
> instead of small 4Ks always, and (2) always lazy faults rather than
> "install everything in the 1st fault".  Maybe one of the two could have
> some impact in some way.

In my case, commenting out (1) the huge_fault callback assignment from 
f9e54c3a2f5b suffices for GPU initialization in the guest, even if (2) 
the 'install everything' loop is still removed.
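
To make point (1) a bit more concrete, here is a rough sketch of how a fault handler could choose between 4K, 2M and 1G mappings. It only illustrates the general idea (address alignment plus remaining length) and is not the actual vfio-pci-core logic; the function name, the decimal byte arguments and the x86-64 page sizes are my assumptions.

```shell
#!/bin/sh
# Illustration only: pick the largest of 1G / 2M / 4K such that the
# address is aligned to that size and a block of that size still fits
# into what is left of the region.
pick_map_size() {
    addr=$1       # faulting address in bytes (decimal)
    remaining=$2  # bytes left in the region, counted from addr
    for size in 1073741824 2097152 4096; do
        if [ $((addr % size)) -eq 0 ] && [ "$remaining" -ge "$size" ]; then
            echo "$size"
            return 0
        fi
    done
    return 1
}

pick_map_size 0 17179869184     # 16GB region, aligned start -> prints 1073741824
pick_map_size 2097152 4194304   # 2M-aligned, 4M left        -> prints 2097152
pick_map_size 4096 8192         # only 4K-aligned            -> prints 4096
```

With the huge_fault callback commented out, every fault would fall back to the 4K case, which is the configuration that works for me.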

I have uploaded host kernel logs with vfio-pci-core debugging enabled 
(one log with stock sources, one large log with vfio-pci-core's 
huge_fault handler patched out):
https://bugzilla.kernel.org/show_bug.cgi?id=219619#c1
I'm not sure if the logs of handled faults say much about what 
specifically goes wrong here, though.

The dmesg portion attached to my mail is of a Linux guest failing to 
initialize the GPU (BAR 0 size 16GB with 12GB of VRAM).

Thanks,
Precific

[-- Attachment #2: 2024-12-21-vfiopcicore-regression-guest-amdgpu-dmesg.txt --]
[-- Type: text/plain, Size: 7864 bytes --]

- Dmesg of a Linux guest failing amdgpu initialization. Host running kernel 6.12/6.13, with ReBAR enabled (16GB BAR 0)
[[note: some variations can occur, e.g., the error sometimes occurs at a later stage of initialization]]

[   10.245100] [drm] amdgpu kernel modesetting enabled.
[   10.245173] amdgpu: Virtual CRAT table created for CPU
[   10.245182] amdgpu: Topology: Add CPU node
[   10.245480] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   10.245492] [drm] register mmio base: 0x81A00000
[   10.245493] [drm] register mmio size: 1048576
[   10.248861] [drm] add ip block number 0 <nv_common>
[   10.248862] [drm] add ip block number 1 <gmc_v10_0>
[   10.248863] [drm] add ip block number 2 <navi10_ih>
[   10.248864] [drm] add ip block number 3 <psp>
[   10.248864] [drm] add ip block number 4 <smu>
[   10.248865] [drm] add ip block number 5 <dm>
[   10.248866] [drm] add ip block number 6 <gfx_v10_0>
[   10.248867] [drm] add ip block number 7 <sdma_v5_2>
[   10.248867] [drm] add ip block number 8 <vcn_v3_0>
[   10.248868] [drm] add ip block number 9 <jpeg_v3_0>
[   10.248877] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from VFCT
[   10.248878] amdgpu: ATOM BIOS: 113-D5121100-101
[   10.270097] [drm] VCN(0) decode is enabled in VM mode
[   10.270099] [drm] VCN(0) encode is enabled in VM mode
[   10.284318] [drm] JPEG decode is enabled in VM mode
[   10.284320] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[   10.284359] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   10.284365] amdgpu 0000:05:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   10.284367] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   10.284375] [drm] Detected VRAM RAM=12272M, BAR=16384M
[   10.284376] [drm] RAM width 192bits GDDR6
[   10.284495] [drm] amdgpu: 12272M of VRAM memory ready
[   10.284496] [drm] amdgpu: 16042M of GTT memory ready.
[   10.284505] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   10.284626] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[   12.218276] amdgpu 0000:05:00.0: amdgpu: STB initialized to 2048 entries
[   12.218333] [drm] Loading DMUB firmware via PSP: version=0x02020020
[   12.218647] [drm] use_doorbell being set to: [true]
[   12.218658] [drm] use_doorbell being set to: [true]
[   12.218667] [drm] Found VCN firmware Version ENC: 1.30 DEC: 3 VEP: 0 Revision: 4
[   12.218672] amdgpu 0000:05:00.0: amdgpu: Will use PSP to load VCN firmware
[   14.390991] [drm] psp gfx command ID_LOAD_TOC(0x20) failed and response status is (0x0)
[   14.390994] [drm:psp_hw_start [amdgpu]] *ERROR* Failed to load toc
[   14.391223] [drm:psp_hw_start [amdgpu]] *ERROR* PSP tmr init failed!
[   14.411423] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[   14.411604] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[   14.411784] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
[   14.411785] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[   14.411786] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[   14.411928] ------------[ cut here ]------------
[   14.411929] WARNING: CPU: 6 PID: 507 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412114] Modules linked in: amdgpu(+) video wmi amdxcp i2c_algo_bit drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul crc32c_intel polyval_clmulni drm_exec polyval_generic ghash_clmulni_intel gpu_sched nvme sha512_ssse3 drm_suballoc_helper drm_buddy sha256_ssse3 drm_display_helper nvme_core sha1_ssse3 virtio_net cec nvme_auth virtio_console net_failover virtio_blk failover qemu_fw_cfg serio_raw ip6_tables ip_tables fuse
[   14.412133] CPU: 6 PID: 507 Comm: (udev-worker) Not tainted 6.8.5-201.fc39.x86_64 #1
[   14.412134] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   14.412135] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412305] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 6a 30 bc e3 e9 5a fd ff ff <0f> 0b b8 ea ff ff ff e9 59 30 bc e3 b8 ea ff ff ff e9 4f 30 bc e3
[   14.412306] RSP: 0018:ffffaae50112ba60 EFLAGS: 00010246
[   14.412308] RAX: ffff8bbcca3ed100 RBX: ffff8bbcd19987a8 RCX: 0000000000000000
[   14.412309] RDX: 0000000000000000 RSI: ffff8bbcd19a4db8 RDI: ffff8bbcd1980000
[   14.412310] RBP: ffff8bbcd19901e8 R08: 0000000000000000 R09: ffffaae50112b878
[   14.412311] R10: ffffaae50112b870 R11: 0000000000000003 R12: ffff8bbcd19905c8
[   14.412311] R13: ffff8bbcd1980010 R14: ffff8bbcd1980000 R15: ffff8bbcd19a4db8
[   14.412313] FS:  00007f5fde03e980(0000) GS:ffff8bc41fb80000(0000) knlGS:0000000000000000
[   14.412315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.412316] CR2: 00005623742f1000 CR3: 000000010c1fa000 CR4: 0000000000750ef0
[   14.412318] PKRU: 55555554
[   14.412319] Call Trace:
[   14.412320]  <TASK>
[   14.412321]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412493]  ? __warn+0x81/0x130
[   14.412497]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412677]  ? report_bug+0x171/0x1a0
[   14.412681]  ? handle_bug+0x3c/0x80
[   14.412683]  ? exc_invalid_op+0x17/0x70
[   14.412685]  ? asm_exc_invalid_op+0x1a/0x20
[   14.412688]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412857]  amdgpu_fence_driver_hw_fini+0xfe/0x130 [amdgpu]
[   14.413049]  amdgpu_device_fini_hw+0xa6/0x400 [amdgpu]
[   14.413233]  ? blocking_notifier_chain_unregister+0x36/0x50
[   14.413236]  amdgpu_driver_load_kms+0xec/0x190 [amdgpu]
[   14.413411]  amdgpu_pci_probe+0x18b/0x510 [amdgpu]
[   14.413586]  local_pci_probe+0x42/0xa0
[   14.413589]  pci_device_probe+0xc7/0x240
[   14.413592]  really_probe+0x19b/0x3e0
[   14.413595]  ? __pfx___driver_attach+0x10/0x10
[   14.413597]  __driver_probe_device+0x78/0x160
[   14.413599]  driver_probe_device+0x1f/0x90
[   14.413601]  __driver_attach+0xd2/0x1c0
[   14.413603]  bus_for_each_dev+0x85/0xd0
[   14.413605]  bus_add_driver+0x116/0x220
[   14.413607]  driver_register+0x59/0x100
[   14.413609]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[   14.413768]  do_one_initcall+0x58/0x320
[   14.413772]  do_init_module+0x60/0x240
[   14.413775]  __do_sys_init_module+0x17f/0x1b0
[   14.413776]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413782]  do_syscall_64+0x83/0x170
[   14.413784]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413786]  ? __count_memcg_events+0x4d/0xc0
[   14.413788]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413790]  ? count_memcg_events.constprop.0+0x1a/0x30
[   14.413792]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413793]  ? handle_mm_fault+0xa2/0x360
[   14.413795]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413797]  ? do_user_addr_fault+0x304/0x670
[   14.413800]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413801]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413803]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   14.413805] RIP: 0033:0x7f5fdea2cb9e
[   14.413808] Code: 48 8b 0d 95 12 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 62 12 0c 00 f7 d8 64 89 01 48
[   14.413809] RSP: 002b:00007ffc13be8998 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[   14.413811] RAX: ffffffffffffffda RBX: 00005623741c55a0 RCX: 00007f5fdea2cb9e
[   14.413812] RDX: 00005623741be530 RSI: 00000000019d58ce RDI: 00007f5fdb000010
[   14.413813] RBP: 00007ffc13be8a50 R08: 0000562374199010 R09: 0000000000000007
[   14.413814] R10: 0000000000000001 R11: 0000000000000246 R12: 00005623741be530
[   14.413814] R13: 0000000000020000 R14: 00005623741c0030 R15: 00005623741c9120
[   14.413817]  </TASK>
[   14.413818] ---[ end trace 0000000000000000 ]---

