Linux PCI subsystem development
* [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
@ 2022-12-30  8:18 Thorsten Leemhuis
  2023-01-03 10:30 ` Joerg Roedel
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Thorsten Leemhuis @ 2022-12-30  8:18 UTC
  To: Lu Baolu
  Cc: Joerg Roedel, Matt Fagnani, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

Hi, this is your Linux kernel regression tracker speaking.

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developers don't keep an eye on it, I decided to forward it by
mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=216865 :

>  Matt Fagnani 2022-12-29 18:39:56 UTC
> 
> I booted the Fedora Rawhide KDE Plasma live image
> Fedora-KDE-Live-x86_64-Rawhide-20221227.n.0.iso
> https://koji.fedoraproject.org/koji/buildinfo?buildID=2104562 from a USB
> flash drive written with Fedora Media Writer on an hp laptop with an
> integrated Radeon R5 GPU. The system froze with a black screen when
> amdgpu started during 6.2-rc1 kernel boot. When I booted with quiet rhgb
> removed from the kernel command line the last line shown before the
> black screen was
>
> kernel: [drm] amdgpu kernel modesetting enabled.
> 
> This problem happened on each of several boots when using the amdgpu
> driver (the default). This problem didn't happen when I booted the same
> image using Troubleshooting > Boot Fedora-KDE-Plasma-live in basic
> graphics mode which used the simpledrm driver and started Plasma on X
> normally. This problem also didn't happen when I booted the image in a
> QEMU/KVM VM in GNOME Boxes with 3 GB RAM using the virtio-gpu driver.
> 
> The data from the previous boots using live images aren't saved by
> default so I couldn't get the journal that way as far as I knew. I
> installed kernel-6.2.0-0.rc1.14.fc38 in my Fedora 37 KDE Plasma
> installation and reproduced the problem 3 times with quiet rhgb removed
> from the kernel command line and sysrq_always_enabled drm.debug=14 added
> to it. I used sysrq+alt+r,s,u,b which rebooted the system so the kernel
> wasn't completely frozen. The journals from the boots with the problem
> weren't shown in journalctl. I booted with amdgpu.dc=0 on the kernel
> command line and the screen froze with the last line
> kernel: [drm] amdgpu kernel modesetting enabled. and the black screen
> didn't happen. I booted with drm.debug=94 on the kernel command line and
> the screen's drm settings were shown repeatedly until I rebooted after
> 2-3 minutes.
> 
> This problem didn't happen with kernel-6.1.0-65.fc38 or earlier in the
> Fedora Rawhide live image
> Fedora-KDE-Live-x86_64-Rawhide-20221217.n.0.iso. The first Fedora
> Rawhide kernel with this problem was
> 6.2.0-0.rc0.20221215git041fae9c105a.5.fc38, while
> 6.2.0-0.rc0.20221214gite2ca6ba6ba01.3.fc38 was the last one without the
> problem. I bisected the mainline kernel between e2ca6ba6ba01 and
> 041fae9c105a. The first bad commit was the following involving PCI and
> IOMMUs.
> 
> 201007ef707a8bb5592cd07dd46fc9222c48e0b9 is the first bad commit
> commit 201007ef707a8bb5592cd07dd46fc9222c48e0b9
> Author: Lu Baolu <baolu.lu@linux.intel.com>
> Date:   Mon Oct 31 08:59:08 2022 +0800
> 
>     PCI: Enable PASID only when ACS RR & UF enabled on upstream path
>     
>     The Requester ID/Process Address Space ID (PASID) combination
>     identifies an address space distinct from the PCI bus address space,
>     e.g., an address space defined by an IOMMU.
>     
> [...]
> 
> My system has an AMD IOMMU enabled. When I booted 6.2-rc1 with
> amd_iommu=off on the kernel command line, the problem didn't happen and
> the boot completed. There were IOMMU-related errors when amdgpu started
> with amd_iommu=off. So the problem appears to involve amdgpu not
> starting properly when the IOMMU is enabled after that change. When I
> booted with quiet rhgb removed from the kernel command line, I noted
> that the AMD IOMMU started about 3 seconds before the problem happened
> when amdgpu started with a line like kernel: AMD-Vi: AMD IOMMUv2 loaded
> and initialized.
> 
> I reported this problem at
> https://gitlab.freedesktop.org/drm/amd/-/issues/2319 where Alex Deucher
> wrote "Please report this upstream to the IOMMU subsystem:
> https://bugzilla.kernel.org/" I reported it for Fedora at
> https://bugzilla.redhat.com/show_bug.cgi?id=2156691

See the ticket for more details.

BTW, let me use this mail to also add the report to the list of tracked
regressions to ensure it doesn't fall through the cracks:

#regzbot introduced: 201007ef707a8bb
https://bugzilla.kernel.org/show_bug.cgi?id=216865
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.


* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2022-12-30  8:18 [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled Thorsten Leemhuis
@ 2023-01-03 10:30 ` Joerg Roedel
  2023-01-03 19:06 ` Matt Fagnani
       [not found] ` <5aa0e698-f715-0481-36e5-46505024ebc1@bell.net>
  2 siblings, 0 replies; 42+ messages in thread
From: Joerg Roedel @ 2023-01-03 10:30 UTC
  To: Thorsten Leemhuis
  Cc: Lu Baolu, Matt Fagnani, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

Baolu,

On Fri, Dec 30, 2022 at 09:18:56AM +0100, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
> 
> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> kernel developers don't keep an eye on it, I decided to forward it by
> mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=216865 :

can you have a look at this please?

Thanks,

-- 
Jörg Rödel
jroedel@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman



* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2022-12-30  8:18 [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled Thorsten Leemhuis
  2023-01-03 10:30 ` Joerg Roedel
@ 2023-01-03 19:06 ` Matt Fagnani
       [not found] ` <5aa0e698-f715-0481-36e5-46505024ebc1@bell.net>
  2 siblings, 0 replies; 42+ messages in thread
From: Matt Fagnani @ 2023-01-03 19:06 UTC
  To: Thorsten Leemhuis, Lu Baolu
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

I reproduced the problem with 6.2-rc1 in a Fedora 37 installation with
early kdump enabled, as described at
https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
and
https://github.com/k-hagio/fedora-kexec-tools/blob/master/early-kdump-howto.txt

I panicked the kernel with sysrq+alt+c. The dmesg saved with kdump
showed warnings at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 and
at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 involving AMD IOMMU
and amdgpu functions in the trace. Since those warnings come from
if (WARN_ON(!pdev->pri_enabled)) and if (WARN_ON(!pdev->pasid_enabled)),
pci_disable_pri and pci_disable_pasid looked like they were called when
pdev->pri_enabled and pdev->pasid_enabled were both false. A null
pointer dereference occurred right after that, which made amdgpu crash.
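
The guard behind those two warnings can be modeled in user space. What
follows is a minimal Python sketch, not kernel code: PciDevModel,
enable_pasid and disable_pasid are hypothetical names that only mirror
the pdev->pasid_enabled flag and the WARN_ON condition quoted above, and
the "supported" parameter is an assumption standing in for whatever
makes enabling fail. It shows one way such a warning can arise (a
disable call on a device whose enable never succeeded); it is not a
diagnosis of this bug.

```python
import warnings
from dataclasses import dataclass


@dataclass
class PciDevModel:
    """Hypothetical stand-in for the two struct pci_dev flags named in the report."""
    pri_enabled: bool = False
    pasid_enabled: bool = False


def enable_pasid(dev, supported):
    """Set the flag only when enabling succeeds; return False otherwise."""
    if not supported:
        return False
    dev.pasid_enabled = True
    return True


def disable_pasid(dev):
    """Model of `if (WARN_ON(!pdev->pasid_enabled)) return;` followed by the disable."""
    if not dev.pasid_enabled:
        # The kernel prints a WARNING with a backtrace here; this model emits a Python warning.
        warnings.warn("disable_pasid called while pasid_enabled is false", RuntimeWarning)
        return False
    dev.pasid_enabled = False
    return True


if __name__ == "__main__":
    dev = PciDevModel()
    # Enabling fails, so the flag stays False ...
    enable_pasid(dev, supported=False)
    # ... and an unconditional disable afterwards trips the guard.
    disable_pasid(dev)
```

In this model the warning disappears if the caller only disables what it
successfully enabled, which is why the state of the two flags at the
time of the call is the interesting detail in the traces.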

[   13.132368] [drm] amdgpu kernel modesetting enabled.
[   13.133766] amdgpu: Topology: Add APU node [0x0:0x0]
[   13.137596] Console: switching to colour dummy device 80x25
[   13.143717] amdgpu 0000:00:01.0: vgaarb: deactivate vga console
[   13.143970] [drm] initializing kernel modesetting (CARRIZO 
0x1002:0x9874 0x103C:0x8332 0xCA).
[   13.144205] [drm] register mmio base: 0xF0400000
[   13.144209] [drm] register mmio size: 262144
[   13.144310] [drm] add ip block number 0 <vi_common>
[   13.144316] [drm] add ip block number 1 <gmc_v8_0>
[   13.144320] [drm] add ip block number 2 <cz_ih>
[   13.144324] [drm] add ip block number 3 <gfx_v8_0>
[   13.144328] [drm] add ip block number 4 <sdma_v3_0>
[   13.144332] [drm] add ip block number 5 <powerplay>
[   13.144336] [drm] add ip block number 6 <dm>
[   13.144340] [drm] add ip block number 7 <uvd_v6_0>
[   13.144343] [drm] add ip block number 8 <vce_v3_0>
[   13.144347] [drm] add ip block number 9 <acp_ip>
[   13.144388] amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT
[   13.144397] amdgpu: ATOM BIOS: 113-C75100-031
[   13.144425] [drm] UVD is enabled in physical mode
[   13.144431] [drm] VCE enabled in physical mode
[   13.144435] amdgpu 0000:00:01.0: amdgpu: Trusted Memory Zone (TMZ) 
feature not supported
[   13.144491] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, 
fragment size is 9-bit
[   13.144503] amdgpu 0000:00:01.0: amdgpu: VRAM: 512M 
0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[   13.144511] amdgpu 0000:00:01.0: amdgpu: GART: 1024M 
0x000000FF00000000 - 0x000000FF3FFFFFFF
[   13.144524] [drm] Detected VRAM RAM=512M, BAR=512M
[   13.144529] [drm] RAM width 64bits UNKNOWN
[   13.144623] [drm] amdgpu: 512M of VRAM memory ready
[   13.144630] [drm] amdgpu: 3572M of GTT memory ready.
[   13.144653] [drm] GART: num cpu pages 262144, num gpu pages 262144
[   13.144705] [drm] PCIE GART of 1024M enabled (table at 
0x000000F400600000).
[   13.158820] amdgpu: hwmgr_sw_init smu backed is smu8_smu
[   13.175036] [drm] Found UVD firmware Version: 1.91 Family ID: 11
[   13.175097] [drm] UVD ENC is disabled
[   13.186675] [drm] Found VCE firmware Version: 52.4 Binary ID: 3
[   13.187879] amdgpu: smu version 27.18.00
[   13.193760] [drm] DM_PPLIB: values for Engine clock
[   13.193773] [drm] DM_PPLIB:     300000
[   13.193776] [drm] DM_PPLIB:     480000
[   13.193779] [drm] DM_PPLIB:     533340
[   13.193781] [drm] DM_PPLIB:     576000
[   13.193784] [drm] DM_PPLIB:     626090
[   13.193786] [drm] DM_PPLIB:     685720
[   13.193788] [drm] DM_PPLIB:     720000
[   13.193791] [drm] DM_PPLIB:     757900
[   13.193793] [drm] DM_PPLIB: Validation clocks:
[   13.193796] [drm] DM_PPLIB:    engine_max_clock: 75790
[   13.193799] [drm] DM_PPLIB:    memory_max_clock: 93300
[   13.193802] [drm] DM_PPLIB:    level           : 8
[   13.193806] [drm] DM_PPLIB: values for Display clock
[   13.193809] [drm] DM_PPLIB:     300000
[   13.193811] [drm] DM_PPLIB:     400000
[   13.193814] [drm] DM_PPLIB:     496560
[   13.193816] [drm] DM_PPLIB:     626090
[   13.193819] [drm] DM_PPLIB:     685720
[   13.193821] [drm] DM_PPLIB:     757900
[   13.193823] [drm] DM_PPLIB:     800000
[   13.193825] [drm] DM_PPLIB:     847060
[   13.193828] [drm] DM_PPLIB: Validation clocks:
[   13.193830] [drm] DM_PPLIB:    engine_max_clock: 75790
[   13.193833] [drm] DM_PPLIB:    memory_max_clock: 93300
[   13.193836] [drm] DM_PPLIB:    level           : 8
[   13.193839] [drm] DM_PPLIB: values for Memory clock
[   13.193842] [drm] DM_PPLIB:     667000
[   13.193844] [drm] DM_PPLIB:     933000
[   13.193847] [drm] DM_PPLIB: Validation clocks:
[   13.193849] [drm] DM_PPLIB:    engine_max_clock: 75790
[   13.193852] [drm] DM_PPLIB:    memory_max_clock: 93300
[   13.193854] [drm] DM_PPLIB:    level           : 8
[   13.193973] [drm] Display Core initialized with v3.2.215!
[   13.309967] [drm] UVD initialized successfully.
[   13.511031] [drm] VCE initialized successfully.
[   13.515217] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   13.515442] amdgpu: sdma_bitmap: f
[   13.515549] ------------[ cut here ]------------
[   13.515555] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 
pci_disable_pri+0x75/0x80
[   13.515571] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 
hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched 
crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch 
sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua fuse dm_multipath
[   13.515620] CPU: 0 PID: 477 Comm: systemd-udevd Kdump: loaded Not 
tainted 6.2.0-0.rc1.14.fc38.x86_64 #1
[   13.515628] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 
12/03/2019
[   13.515634] RIP: 0010:pci_disable_pri+0x75/0x80
[   13.515642] Code: 54 24 06 89 ee 48 89 df 83 e2 fe 66 89 54 24 06 0f 
b7 d2 e8 1d e1 fc ff 80 a3 4b 08 00 00 fd 48 83 c4 08 5b 5d e9 2b 8b 69 
00 <0f> 0b eb b6 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
[   13.515651] RSP: 0018:ffffbaf4407ab8e8 EFLAGS: 00010046
[   13.515658] RAX: 0000000000000000 RBX: ffff90aa00ac4000 RCX: 
0000000000000009
[   13.515663] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 
ffff90aa00ac4000
[   13.515668] RBP: ffff90aa0e0c3810 R08: 0000000000000002 R09: 
0000000000000000
[   13.515673] R10: 0000000000000000 R11: ffffffffade4e430 R12: 
ffff90aa011a8800
[   13.515678] R13: ffff90aa0e0c3800 R14: ffff90aa011a8800 R15: 
ffff90aa0e0c3960
[   13.515683] FS:  00007fabd67feb40(0000) GS:ffff90aaf7400000(0000) 
knlGS:0000000000000000
[   13.515689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.515695] CR2: 00007f5689ff54c0 CR3: 0000000100f16000 CR4: 
00000000001506f0
[   13.515700] Call Trace:
[   13.515704]  <TASK>
[   13.515710]  amd_iommu_attach_device+0x2e0/0x300
[   13.515719]  __iommu_attach_device+0x1b/0x90
[   13.515727]  iommu_attach_group+0x65/0xa0
[   13.515735]  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
[   13.515747]  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
[   13.517094]  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
[   13.518419]  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
[   13.519699]  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
[   13.520877]  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
[   13.522118]  ? _raw_spin_lock_irqsave+0x23/0x50
[   13.522126]  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
[   13.523386]  amdgpu_pci_probe+0x161/0x370 [amdgpu]
[   13.524516]  local_pci_probe+0x41/0x80
[   13.524525]  pci_device_probe+0xb3/0x220
[   13.524533]  really_probe+0xde/0x380
[   13.524540]  ? pm_runtime_barrier+0x50/0x90
[   13.524546]  __driver_probe_device+0x78/0x170
[   13.524555]  driver_probe_device+0x1f/0x90
[   13.524560]  __driver_attach+0xce/0x1c0
[   13.524565]  ? __pfx___driver_attach+0x10/0x10
[   13.524570]  bus_for_each_dev+0x73/0xa0
[   13.524575]  bus_add_driver+0x1ae/0x200
[   13.524580]  driver_register+0x89/0xe0
[   13.524586]  ? __pfx_init_module+0x10/0x10 [amdgpu]
[   13.525819]  do_one_initcall+0x59/0x230
[   13.525828]  do_init_module+0x4a/0x200
[   13.525834]  __do_sys_init_module+0x157/0x180
[   13.525839]  do_syscall_64+0x5b/0x80
[   13.525845]  ? handle_mm_fault+0xff/0x2f0
[   13.525850]  ? do_user_addr_fault+0x1ef/0x690
[   13.525856]  ? exc_page_fault+0x70/0x170
[   13.525860]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   13.525867] RIP: 0033:0x7fabd66cde4e
[   13.525872] Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 
66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64 89 01 48
[   13.525878] RSP: 002b:00007ffdd89bc6a8 EFLAGS: 00000246 ORIG_RAX: 
00000000000000af
[   13.525884] RAX: ffffffffffffffda RBX: 0000563e4d23f0a0 RCX: 
00007fabd66cde4e
[   13.525887] RDX: 00007fabd6817453 RSI: 000000000174fb66 RDI: 
00007fabd3bd4010
[   13.525890] RBP: 00007fabd6817453 R08: 0000563e4d237c70 R09: 
00007fabd672f900
[   13.525893] R10: 0000000000000005 R11: 0000000000000246 R12: 
0000000000020000
[   13.525896] R13: 0000563e4d239060 R14: 0000000000000000 R15: 
0000563e4d23e450
[   13.525900]  </TASK>
[   13.525902] ---[ end trace 0000000000000000 ]---
[   13.525964] ------------[ cut here ]------------
[   13.525966] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 
pci_disable_pasid+0x45/0x50
[   13.525974] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 
hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched 
crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch 
sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua fuse dm_multipath
[   13.526006] CPU: 0 PID: 477 Comm: systemd-udevd Kdump: loaded 
Tainted: G        W         -------  ---  6.2.0-0.rc1.14.fc38.x86_64 #1
[   13.526012] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 
12/03/2019
[   13.526015] RIP: 0010:pci_disable_pasid+0x45/0x50
[   13.526020] Code: 53 48 89 fb 85 f6 75 06 5b e9 67 8c 69 00 83 c6 06 
31 d2 e8 3d e2 fc ff 80 a3 4b 08 00 00 fe 5b e9 50 8c 69 00 e9 4b 8c 69 
00 <0f> 0b e9 44 8c 69 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[   13.526025] RSP: 0018:ffffbaf4407ab900 EFLAGS: 00010046
[   13.526028] RAX: 0000000000000000 RBX: ffff90aa00ac4000 RCX: 
0000000000000009
[   13.526031] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 
ffff90aa00ac4000
[   13.526034] RBP: ffff90aa0e0c3810 R08: 0000000000000002 R09: 
0000000000000000
[   13.526037] R10: 0000000000000000 R11: ffffffffade4e430 R12: 
ffff90aa011a8800
[   13.526040] R13: ffff90aa0e0c3800 R14: ffff90aa011a8800 R15: 
ffff90aa0e0c3960
[   13.526043] FS:  00007fabd67feb40(0000) GS:ffff90aaf7400000(0000) 
knlGS:0000000000000000
[   13.526047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.526050] CR2: 00007f5689ff54c0 CR3: 0000000100f16000 CR4: 
00000000001506f0
[   13.526053] Call Trace:
[   13.526056]  <TASK>
[   13.526058]  amd_iommu_attach_device+0x2e8/0x300
[   13.526064]  __iommu_attach_device+0x1b/0x90
[   13.526070]  iommu_attach_group+0x65/0xa0
[   13.526075]  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
[   13.526083]  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
[   13.527397]  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
[   13.528709]  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
[   13.529877]  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
[   13.531039]  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
[   13.532322]  ? _raw_spin_lock_irqsave+0x23/0x50
[   13.532333]  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
[   13.533642]  amdgpu_pci_probe+0x161/0x370 [amdgpu]
[   13.534758]  local_pci_probe+0x41/0x80
[   13.534766]  pci_device_probe+0xb3/0x220
[   13.534771]  really_probe+0xde/0x380
[   13.534776]  ? pm_runtime_barrier+0x50/0x90
[   13.534781]  __driver_probe_device+0x78/0x170
[   13.534785]  driver_probe_device+0x1f/0x90
[   13.534789]  __driver_attach+0xce/0x1c0
[   13.534793]  ? __pfx___driver_attach+0x10/0x10
[   13.534797]  bus_for_each_dev+0x73/0xa0
[   13.534801]  bus_add_driver+0x1ae/0x200
[   13.534805]  driver_register+0x89/0xe0
[   13.534809]  ? __pfx_init_module+0x10/0x10 [amdgpu]
[   13.536000]  do_one_initcall+0x59/0x230
[   13.536010]  do_init_module+0x4a/0x200
[   13.536015]  __do_sys_init_module+0x157/0x180
[   13.536020]  do_syscall_64+0x5b/0x80
[   13.536025]  ? handle_mm_fault+0xff/0x2f0
[   13.536030]  ? do_user_addr_fault+0x1ef/0x690
[   13.536036]  ? exc_page_fault+0x70/0x170
[   13.536040]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   13.536047] RIP: 0033:0x7fabd66cde4e
[   13.536051] Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 
66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64 89 01 48
[   13.536057] RSP: 002b:00007ffdd89bc6a8 EFLAGS: 00000246 ORIG_RAX: 
00000000000000af
[   13.536063] RAX: ffffffffffffffda RBX: 0000563e4d23f0a0 RCX: 
00007fabd66cde4e
[   13.536066] RDX: 00007fabd6817453 RSI: 000000000174fb66 RDI: 
00007fabd3bd4010
[   13.536069] RBP: 00007fabd6817453 R08: 0000563e4d237c70 R09: 
00007fabd672f900
[   13.536072] R10: 0000000000000005 R11: 0000000000000246 R12: 
0000000000020000
[   13.536075] R13: 0000563e4d239060 R14: 0000000000000000 R15: 
0000563e4d23e450
[   13.536079]  </TASK>
[   13.536081] ---[ end trace 0000000000000000 ]---
[   13.536117] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
[   13.537198] kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
[   13.537218] amdgpu 0000:00:01.0: amdgpu: SE 1, SH per SE 1, CU per SH 
8, active_cu_number 6
[   13.537481] BUG: kernel NULL pointer dereference, address: 
0000000000000058
[   13.537499] #PF: supervisor read access in kernel mode
[   13.537504] #PF: error_code(0x0000) - not-present page
[   13.537509] PGD 0 P4D 0
[   13.537515] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   13.537522] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: 
G        W         -------  ---  6.2.0-0.rc1.14.fc38.x86_64 #1
[   13.537530] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 
12/03/2019
[   13.537534] RIP: 0010:report_iommu_fault+0x11/0x90
[   13.537548] Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 
53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 da 3f 57 00 41
[   13.537557] RSP: 0018:ffffbaf4403ebe08 EFLAGS: 00010246
[   13.537562] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[   13.537567] RDX: 000000010e9b0400 RSI: ffff90aa00ac40d0 RDI: 
0000000000000010
[   13.537572] RBP: 000000010e9b0400 R08: ffff90aa011a8800 R09: 
0000000000000050
[   13.537576] R10: ffff90aa00244000 R11: 0000000000000000 R12: 
0000000000000000
[   13.537581] R13: ffff90aa0005b000 R14: 0000000000000008 R15: 
0000000000000000
[   13.537585] FS:  0000000000000000(0000) GS:ffff90aaf7500000(0000) 
knlGS:0000000000000000
[   13.537591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.537596] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 
00000000001506e0
[   13.537601] Call Trace:
[   13.537607]  <TASK>
[   13.537612]  amd_iommu_int_thread+0x60c/0x760
[   13.537620]  ? __pfx_irq_thread_fn+0x10/0x10
[   13.537627]  irq_thread_fn+0x1f/0x60
[   13.537633]  irq_thread+0xea/0x1a0
[   13.537638]  ? preempt_count_add+0x6a/0xa0
[   13.537647]  ? __pfx_irq_thread_dtor+0x10/0x10
[   13.537652]  ? __pfx_irq_thread+0x10/0x10
[   13.537657]  kthread+0xe9/0x110
[   13.537662]  ? __pfx_kthread+0x10/0x10
[   13.537667]  ret_from_fork+0x2c/0x50
[   13.537676]  </TASK>
[   13.537678] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 
hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched 
crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch 
sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua fuse dm_multipath
[   13.537723] CR2: 0000000000000058
[   13.537727] ---[ end trace 0000000000000000 ]---
[   13.537731] RIP: 0010:report_iommu_fault+0x11/0x90
[   13.537737] Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 
53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 da 3f 57 00 41
[   13.537746] RSP: 0018:ffffbaf4403ebe08 EFLAGS: 00010246
[   13.537751] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[   13.537755] RDX: 000000010e9b0400 RSI: ffff90aa00ac40d0 RDI: 
0000000000000010
[   13.537759] RBP: 000000010e9b0400 R08: ffff90aa011a8800 R09: 
0000000000000050
[   13.537764] R10: ffff90aa00244000 R11: 0000000000000000 R12: 
0000000000000000
[   13.537768] R13: ffff90aa0005b000 R14: 0000000000000008 R15: 
0000000000000000
[   13.537773] FS:  0000000000000000(0000) GS:ffff90aaf7500000(0000) 
knlGS:0000000000000000
[   13.537779] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.537783] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 
00000000001506e0
[   13.537795] genirq: exiting task "irq/24-AMD-Vi" (56) is an active 
IRQ thread (irq 24)
[   13.537808] general protection fault, probably for non-canonical 
address 0x1ee201e8df8948: 0000 [#2] PREEMPT SMP NOPTI
[   13.537815] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: 
G      D W         -------  ---  6.2.0-0.rc1.14.fc38.x86_64 #1
[   13.537822] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 
12/03/2019
[   13.537825] RIP: 0010:__x86_return_thunk+0x0/0x40
[   13.537833] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 
f6 <c3> cc 0f ae e8 eb f9 cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e
[   13.537840] RSP: 0018:ffffbaf4403ebeb0 EFLAGS: 00010282
[   13.537844] RAX: 001ee201e8df8948 RBX: fff38839e8df8948 RCX: 
0000000000000000
[   13.537848] RDX: 0000000080000000 RSI: ffff90aa00400b68 RDI: 
ffffffffad106b7f
[   13.537852] RBP: ffff90aa00aa0000 R08: ffff90aa00400c50 R09: 
ffffffffaf143f00
[   13.537856] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffff90aa00aa0cac
[   13.537859] R13: ffff90aa00938001 R14: 0000000000000000 R15: 
0000000000000000
[   13.537863] FS:  0000000000000000(0000) GS:ffff90aaf7500000(0000) 
knlGS:0000000000000000
[   13.537868] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.537872] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 
00000000001506e0
[   13.537876] Call Trace:
[   13.537879]  <TASK>
[   13.537882]  ? task_work_run+0x59/0x90
[   13.537888]  ? do_exit+0x31f/0xaf0
[   13.537894]  ? __pfx_irq_thread_dtor+0x10/0x10
[   13.537900]  ? make_task_dead+0x7a/0x80
[   13.537905]  ? rewind_stack_and_make_dead+0x17/0x20
[   13.537912]  </TASK>
[   13.537914] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 
hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched 
crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch 
sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua fuse dm_multipath
[   13.537946] ---[ end trace 0000000000000000 ]---
[   13.537950] RIP: 0010:report_iommu_fault+0x11/0x90
[   13.537955] Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 
53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 da 3f 57 00 41
[   13.537962] RSP: 0018:ffffbaf4403ebe08 EFLAGS: 00010246
[   13.537967] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[   13.537971] RDX: 000000010e9b0400 RSI: ffff90aa00ac40d0 RDI: 
0000000000000010
[   13.537974] RBP: 000000010e9b0400 R08: ffff90aa011a8800 R09: 
0000000000000050
[   13.537978] R10: ffff90aa00244000 R11: 0000000000000000 R12: 
0000000000000000
[   13.537982] R13: ffff90aa0005b000 R14: 0000000000000008 R15: 
0000000000000000
[   13.537986] FS:  0000000000000000(0000) GS:ffff90aaf7500000(0000) 
knlGS:0000000000000000
[   13.537991] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.537995] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 
00000000001506e0
[   13.537999] Fixing recursive fault but reboot is needed!
[   13.538003] check_preemption_disabled: 6 callbacks suppressed
[   13.538005] BUG: using smp_processor_id() in preemptible [00000000] 
code: irq/24-AMD-Vi/56
[   13.538012] caller is __schedule+0x30/0x1390
[   13.538017] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: 
G      D W         -------  ---  6.2.0-0.rc1.14.fc38.x86_64 #1
[   13.538023] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 
12/03/2019
[   13.538027] Call Trace:
[   13.538030]  <TASK>
[   13.538032]  dump_stack_lvl+0x44/0x5c
[   13.538039]  check_preemption_disabled+0xe1/0xf0
[   13.538045]  __schedule+0x30/0x1390
[   13.538049]  ? __wake_up_klogd.part.0+0x56/0x80
[   13.538055]  ? vprintk_emit+0x11d/0x290
[   13.538061]  ? _printk+0x5a/0x60
[   13.538068]  do_task_dead+0x3f/0x50
[   13.538074]  make_task_dead.cold+0x51/0xba
[   13.538080]  rewind_stack_and_make_dead+0x17/0x20
[   13.538085] RIP: 0000:0x0
[   13.538092] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   13.538096] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 
0000000000000000
[   13.538101] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[   13.538105] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[   13.538108] RBP: 0000000000000000 R08: 0000000000000000 R09: 
0000000000000000
[   13.538112] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000000
[   13.538116] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000000
[   13.538121]  </TASK>
[   13.538124] BUG: scheduling while atomic: irq/24-AMD-Vi/56/0x00000000
[   13.538128] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 
hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched 
crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch 
sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua fuse dm_multipath
[   13.538159] Preemption disabled at:
[   13.538160] [<0000000000000000>] 0x0
[   13.538166] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: 
G      D W         -------  ---  6.2.0-0.rc1.14.fc38.x86_64 #1
[   13.538172] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 
12/03/2019
[   13.538175] Call Trace:
[   13.538178]  <TASK>
[   13.538180]  dump_stack_lvl+0x44/0x5c
[   13.538185]  __schedule_bug.cold+0x80/0x8d
[   13.538191]  __schedule+0xf5c/0x1390
[   13.538195]  ? __wake_up_klogd.part.0+0x56/0x80
[   13.538200]  ? vprintk_emit+0x11d/0x290
[   13.538206]  ? _printk+0x5a/0x60
[   13.538211]  do_task_dead+0x3f/0x50
[   13.538216]  make_task_dead.cold+0x51/0xba
[   13.538221]  rewind_stack_and_make_dead+0x17/0x20
[   13.538226] RIP: 0000:0x0
[   13.538231] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   13.538234] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 
0000000000000000
[   13.538240] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
0000000000000000
[   13.538243] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[   13.538247] RBP: 0000000000000000 R08: 0000000000000000 R09: 
0000000000000000
[   13.538251] R10: 0000000000000000 R11: 0000000000000000 R12: 
0000000000000000
[   13.538254] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000000
[   13.538260]  </TASK>
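
The WARNING entries in a log like the one above can be located
mechanically, even after a mail client has wrapped the lines. The
following is a minimal Python sketch under stated assumptions: the
helper names unwrap and find_warnings are hypothetical, SAMPLE is a
three-entry excerpt of the log above, and a continuation line is assumed
to be any line that does not start with a bracketed timestamp.

```python
import re

# A kernel log line starts with a bracketed timestamp such as "[   13.515555]".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]")

# Matches e.g. "WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80"
WARNING = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def unwrap(log_text):
    """Rejoin lines wrapped in transit; a continuation line lacks a timestamp."""
    joined = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not joined:
            joined.append(line)
        else:
            joined[-1] += " " + line
    return joined


def find_warnings(log_text):
    """Return (source file, line number, function) for every WARNING entry."""
    hits = []
    for line in unwrap(log_text):
        m = WARNING.search(line)
        if m:
            hits.append((m.group("file"), int(m.group("line")), m.group("func")))
    return hits


SAMPLE = """\
[   13.515549] ------------[ cut here ]------------
[   13.515555] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251
pci_disable_pri+0x75/0x80
[   13.525966] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419
pci_disable_pasid+0x45/0x50
"""

if __name__ == "__main__":
    for path, lineno, func in find_warnings(SAMPLE):
        print(f"{path}:{lineno} {func}")
```

Run on the excerpt, this lists drivers/pci/ats.c:251 pci_disable_pri and
drivers/pci/ats.c:419 pci_disable_pasid, the two sites named at the top
of this message.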

I tried to use the crash program on the core dump, but it stopped with
the error
crash: page excluded: kernel virtual address: ffff90aa0044db60 type:
"xa_node shift"
I attached the full dmesg file vmcore-dmesg.txt at
https://bugzilla.kernel.org/show_bug.cgi?id=216865#c2


* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
       [not found] ` <5aa0e698-f715-0481-36e5-46505024ebc1@bell.net>
@ 2023-01-04  6:54   ` Baolu Lu
  2023-01-04 15:50     ` Vasant Hegde
  0 siblings, 1 reply; 42+ messages in thread
From: Baolu Lu @ 2023-01-04  6:54 UTC
  To: Matt Fagnani, Thorsten Leemhuis
  Cc: baolu.lu, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

On 2023/1/4 2:55, Matt Fagnani wrote:
> I reproduced the problem with 6.2-rc1 in a Fedora 37 installation with
> early kdump enabled, as described at
> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
> and
> https://github.com/k-hagio/fedora-kexec-tools/blob/master/early-kdump-howto.txt
>
> I panicked the kernel with sysrq+alt+c. The dmesg saved with the kdump
> showed warnings at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 and
> at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 involving AMD IOMMU
> and amdgpu functions in the trace. Since those warnings come from
> if (WARN_ON(!pdev->pri_enabled)) and if (WARN_ON(!pdev->pasid_enabled)),
> pci_disable_pri and pci_disable_pasid looked like they were called when
> pdev->pri_enabled and pdev->pasid_enabled were both false.
> A null pointer dereference occurred right after that, which made amdgpu
> crash.
> 
> [   13.132368] [drm] amdgpu kernel modesetting enabled.
> [   13.133766] amdgpu: Topology: Add APU node [0x0:0x0]
> [   13.137596] Console: switching to colour dummy device 80x25
> [   13.143717] amdgpu 0000:00:01.0: vgaarb: deactivate vga console
> [   13.143970] [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x103C:0x8332 0xCA).
> [   13.144205] [drm] register mmio base: 0xF0400000
> [   13.144209] [drm] register mmio size: 262144
> [   13.144310] [drm] add ip block number 0 <vi_common>
> [   13.144316] [drm] add ip block number 1 <gmc_v8_0>
> [   13.144320] [drm] add ip block number 2 <cz_ih>
> [   13.144324] [drm] add ip block number 3 <gfx_v8_0>
> [   13.144328] [drm] add ip block number 4 <sdma_v3_0>
> [   13.144332] [drm] add ip block number 5 <powerplay>
> [   13.144336] [drm] add ip block number 6 <dm>
> [   13.144340] [drm] add ip block number 7 <uvd_v6_0>
> [   13.144343] [drm] add ip block number 8 <vce_v3_0>
> [   13.144347] [drm] add ip block number 9 <acp_ip>
> [   13.144388] amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT
> [   13.144397] amdgpu: ATOM BIOS: 113-C75100-031
> [   13.144425] [drm] UVD is enabled in physical mode
> [   13.144431] [drm] VCE enabled in physical mode
> [   13.144435] amdgpu 0000:00:01.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
> [   13.144491] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
> [   13.144503] amdgpu 0000:00:01.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
> [   13.144511] amdgpu 0000:00:01.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
> [   13.144524] [drm] Detected VRAM RAM=512M, BAR=512M
> [   13.144529] [drm] RAM width 64bits UNKNOWN
> [   13.144623] [drm] amdgpu: 512M of VRAM memory ready
> [   13.144630] [drm] amdgpu: 3572M of GTT memory ready.
> [   13.144653] [drm] GART: num cpu pages 262144, num gpu pages 262144
> [   13.144705] [drm] PCIE GART of 1024M enabled (table at 0x000000F400600000).
> [   13.158820] amdgpu: hwmgr_sw_init smu backed is smu8_smu
> [   13.175036] [drm] Found UVD firmware Version: 1.91 Family ID: 11
> [   13.175097] [drm] UVD ENC is disabled
> [   13.186675] [drm] Found VCE firmware Version: 52.4 Binary ID: 3
> [   13.187879] amdgpu: smu version 27.18.00
> [   13.193760] [drm] DM_PPLIB: values for Engine clock
> [   13.193773] [drm] DM_PPLIB:	 300000
> [   13.193776] [drm] DM_PPLIB:	 480000
> [   13.193779] [drm] DM_PPLIB:	 533340
> [   13.193781] [drm] DM_PPLIB:	 576000
> [   13.193784] [drm] DM_PPLIB:	 626090
> [   13.193786] [drm] DM_PPLIB:	 685720
> [   13.193788] [drm] DM_PPLIB:	 720000
> [   13.193791] [drm] DM_PPLIB:	 757900
> [   13.193793] [drm] DM_PPLIB: Validation clocks:
> [   13.193796] [drm] DM_PPLIB:    engine_max_clock: 75790
> [   13.193799] [drm] DM_PPLIB:    memory_max_clock: 93300
> [   13.193802] [drm] DM_PPLIB:    level           : 8
> [   13.193806] [drm] DM_PPLIB: values for Display clock
> [   13.193809] [drm] DM_PPLIB:	 300000
> [   13.193811] [drm] DM_PPLIB:	 400000
> [   13.193814] [drm] DM_PPLIB:	 496560
> [   13.193816] [drm] DM_PPLIB:	 626090
> [   13.193819] [drm] DM_PPLIB:	 685720
> [   13.193821] [drm] DM_PPLIB:	 757900
> [   13.193823] [drm] DM_PPLIB:	 800000
> [   13.193825] [drm] DM_PPLIB:	 847060
> [   13.193828] [drm] DM_PPLIB: Validation clocks:
> [   13.193830] [drm] DM_PPLIB:    engine_max_clock: 75790
> [   13.193833] [drm] DM_PPLIB:    memory_max_clock: 93300
> [   13.193836] [drm] DM_PPLIB:    level           : 8
> [   13.193839] [drm] DM_PPLIB: values for Memory clock
> [   13.193842] [drm] DM_PPLIB:	 667000
> [   13.193844] [drm] DM_PPLIB:	 933000
> [   13.193847] [drm] DM_PPLIB: Validation clocks:
> [   13.193849] [drm] DM_PPLIB:    engine_max_clock: 75790
> [   13.193852] [drm] DM_PPLIB:    memory_max_clock: 93300
> [   13.193854] [drm] DM_PPLIB:    level           : 8
> [   13.193973] [drm] Display Core initialized with v3.2.215!
> [   13.309967] [drm] UVD initialized successfully.
> [   13.511031] [drm] VCE initialized successfully.
> [   13.515217] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> [   13.515442] amdgpu: sdma_bitmap: f
> [   13.515549] ------------[ cut here ]------------
> [   13.515555] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
> [   13.515571] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
> [   13.515620] CPU: 0 PID: 477 Comm: systemd-udevd Kdump: loaded Not tainted 6.2.0-0.rc1.14.fc38.x86_64 #1
> [   13.515628] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
> [   13.515634] RIP: 0010:pci_disable_pri+0x75/0x80
> [   13.515642] Code: 54 24 06 89 ee 48 89 df 83 e2 fe 66 89 54 24 06 0f b7 d2 e8 1d e1 fc ff 80 a3 4b 08 00 00 fd 48 83 c4 08 5b 5d e9 2b 8b 69 00 <0f> 0b eb b6 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
> [   13.515651] RSP: 0018:ffffbaf4407ab8e8 EFLAGS: 00010046
> [   13.515658] RAX: 0000000000000000 RBX: ffff90aa00ac4000 RCX: 0000000000000009
> [   13.515663] RDX: 0000000000000000 RSI: 0000000000000014 RDI: ffff90aa00ac4000
> [   13.515668] RBP: ffff90aa0e0c3810 R08: 0000000000000002 R09: 0000000000000000
> [   13.515673] R10: 0000000000000000 R11: ffffffffade4e430 R12: ffff90aa011a8800
> [   13.515678] R13: ffff90aa0e0c3800 R14: ffff90aa011a8800 R15: ffff90aa0e0c3960
> [   13.515683] FS:  00007fabd67feb40(0000) GS:ffff90aaf7400000(0000) knlGS:0000000000000000
> [   13.515689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.515695] CR2: 00007f5689ff54c0 CR3: 0000000100f16000 CR4: 00000000001506f0
> [   13.515700] Call Trace:
> [   13.515704]  <TASK>
> [   13.515710]  amd_iommu_attach_device+0x2e0/0x300
> [   13.515719]  __iommu_attach_device+0x1b/0x90
> [   13.515727]  iommu_attach_group+0x65/0xa0
> [   13.515735]  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
> [   13.515747]  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
> [   13.517094]  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
> [   13.518419]  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
> [   13.519699]  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
> [   13.520877]  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
> [   13.522118]  ? _raw_spin_lock_irqsave+0x23/0x50
> [   13.522126]  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
> [   13.523386]  amdgpu_pci_probe+0x161/0x370 [amdgpu]
> [   13.524516]  local_pci_probe+0x41/0x80
> [   13.524525]  pci_device_probe+0xb3/0x220
> [   13.524533]  really_probe+0xde/0x380
> [   13.524540]  ? pm_runtime_barrier+0x50/0x90
> [   13.524546]  __driver_probe_device+0x78/0x170
> [   13.524555]  driver_probe_device+0x1f/0x90
> [   13.524560]  __driver_attach+0xce/0x1c0
> [   13.524565]  ? __pfx___driver_attach+0x10/0x10
> [   13.524570]  bus_for_each_dev+0x73/0xa0
> [   13.524575]  bus_add_driver+0x1ae/0x200
> [   13.524580]  driver_register+0x89/0xe0
> [   13.524586]  ? __pfx_init_module+0x10/0x10 [amdgpu]
> [   13.525819]  do_one_initcall+0x59/0x230
> [   13.525828]  do_init_module+0x4a/0x200
> [   13.525834]  __do_sys_init_module+0x157/0x180
> [   13.525839]  do_syscall_64+0x5b/0x80
> [   13.525845]  ? handle_mm_fault+0xff/0x2f0
> [   13.525850]  ? do_user_addr_fault+0x1ef/0x690
> [   13.525856]  ? exc_page_fault+0x70/0x170
> [   13.525860]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [   13.525867] RIP: 0033:0x7fabd66cde4e
> [   13.525872] Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64 89 01 48
> [   13.525878] RSP: 002b:00007ffdd89bc6a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
> [   13.525884] RAX: ffffffffffffffda RBX: 0000563e4d23f0a0 RCX: 00007fabd66cde4e
> [   13.525887] RDX: 00007fabd6817453 RSI: 000000000174fb66 RDI: 00007fabd3bd4010
> [   13.525890] RBP: 00007fabd6817453 R08: 0000563e4d237c70 R09: 00007fabd672f900
> [   13.525893] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000
> [   13.525896] R13: 0000563e4d239060 R14: 0000000000000000 R15: 0000563e4d23e450
> [   13.525900]  </TASK>
> [   13.525902] ---[ end trace 0000000000000000 ]---
> [   13.525964] ------------[ cut here ]------------

These kernel traces (including the ones that follow) are triggered by
the following code.

static int pdev_pri_ats_enable(struct pci_dev *pdev)
{
        int ret;

        /* Only allow access to user-accessible pages */
        ret = pci_enable_pasid(pdev, 0);
        if (ret)
                goto out_err;

[--cut for short--]

out_err:
        pci_disable_pri(pdev);
        pci_disable_pasid(pdev);

        return ret;
}

pci_disable_pri() and pci_disable_pasid() are called with PCI PASID and
PRI not enabled. There are WARN_ON()s in the pci code for such cases.
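
To make the failure mode concrete, here is a minimal userspace sketch (not
kernel code) of a shared error label versus staged unwinding. Everything in
it is a hypothetical stand-in: struct fake_dev, the fake_*() helpers and the
"warnings" counter only model the behaviour described above, and
enable_staged() is just one possible way to arrange the labels.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct pci_dev flags involved. */
struct fake_dev {
	bool pasid_enabled;
	bool pri_enabled;
	int warnings;		/* number of simulated WARN_ON() hits */
};

/* Model of a disable helper that warns if the feature was never enabled. */
static void fake_disable_pri(struct fake_dev *d)
{
	if (!d->pri_enabled) {
		d->warnings++;
		return;
	}
	d->pri_enabled = false;
}

static void fake_disable_pasid(struct fake_dev *d)
{
	if (!d->pasid_enabled) {
		d->warnings++;
		return;
	}
	d->pasid_enabled = false;
}

/* Simulated enable steps; "fail" forces an error return. */
static int fake_enable_pasid(struct fake_dev *d, bool fail)
{
	if (fail)
		return -EINVAL;
	d->pasid_enabled = true;
	return 0;
}

static int fake_enable_pri(struct fake_dev *d, bool fail)
{
	if (fail)
		return -EINVAL;
	d->pri_enabled = true;
	return 0;
}

/* One shared error label: every failure disables both features. */
static int enable_shared_label(struct fake_dev *d, bool pasid_fails,
			       bool pri_fails)
{
	int ret;

	ret = fake_enable_pasid(d, pasid_fails);
	if (ret)
		goto out_err;

	ret = fake_enable_pri(d, pri_fails);
	if (ret)
		goto out_err;

	return 0;

out_err:
	fake_disable_pri(d);
	fake_disable_pasid(d);
	return ret;
}

/* Staged unwinding: each failure undoes only the steps that succeeded. */
static int enable_staged(struct fake_dev *d, bool pasid_fails, bool pri_fails)
{
	int ret;

	ret = fake_enable_pasid(d, pasid_fails);
	if (ret)
		return ret;	/* nothing was enabled, nothing to undo */

	ret = fake_enable_pri(d, pri_fails);
	if (ret)
		goto out_pasid;

	return 0;

out_pasid:
	fake_disable_pasid(d);
	return ret;
}
```

Calling enable_shared_label() on a zeroed struct fake_dev with pasid_fails
set bumps the warning counter twice, once per disable helper, which matches
the two traces in the log. enable_staged() returns the same error without
touching the counter.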

This happens in the domain attach device path. I haven't figured out why
a failure to enable PASID or PRI causes the domain attach to fail, or
why pci_pasid_features() and pci_pri_supported() are not called before
pci_enable_pasid()/pci_enable_pri().

Commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on
upstream path") requires that ACS P2P Request Redirect and Upstream
Forwarding be enabled on the path leading to the device before PASID can
be enabled, because the PCIe fabric routes Memory Requests based on the
TLP address, ignoring any PASID. I guess this is why pci_enable_pasid()
now returns failure, which exposes the buggy code above.
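
As a rough illustration of that requirement, below is a simplified userspace
model (not the kernel implementation). struct model_dev and the model_*()
functions are hypothetical names invented for this sketch. The two bit
values follow the ACS control bits in include/uapi/linux/pci_regs.h. The
model requires the bits on every device on the path, including the device
itself, and ignores the special cases the real PCI core has for particular
device types.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ACS control bits (values as in include/uapi/linux/pci_regs.h). */
#define ACS_RR 0x0004	/* P2P Request Redirect */
#define ACS_UF 0x0010	/* Upstream Forwarding */

/* Hypothetical, simplified device node: only what the check needs. */
struct model_dev {
	const struct model_dev *upstream;	/* NULL at the top of the path */
	uint16_t acs_ctrl;			/* ACS control bits enabled */
};

/* True only if every device on the path has all bits in "flags" enabled. */
static bool model_acs_path_enabled(const struct model_dev *dev, uint16_t flags)
{
	for (; dev; dev = dev->upstream)
		if ((dev->acs_ctrl & flags) != flags)
			return false;
	return true;
}

/* Model of the gate: refuse PASID unless RR and UF are on along the path. */
static int model_enable_pasid(const struct model_dev *dev)
{
	if (!model_acs_path_enabled(dev, ACS_RR | ACS_UF))
		return -EINVAL;	/* error code chosen for the model */
	return 0;
}
```

In this model, a device whose path contains even one hop without both bits
gets an error from model_enable_pasid(), so a caller that previously always
succeeded now ends up in its error path.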

--
Best regards,
baolu


* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-04  6:54   ` Baolu Lu
@ 2023-01-04 15:50     ` Vasant Hegde
  2023-01-05  1:09       ` Matt Fagnani
  0 siblings, 1 reply; 42+ messages in thread
From: Vasant Hegde @ 2023-01-04 15:50 UTC
  To: Baolu Lu, Matt Fagnani, Thorsten Leemhuis
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

On 1/4/2023 12:24 PM, Baolu Lu wrote:
> On 2023/1/4 2:55, Matt Fagnani wrote:
>> I reproduced the problem with 6.2-rc1 in a Fedora 37 installation with early
>> kdump enabled, as described at
>> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
>> and
>> https://github.com/k-hagio/fedora-kexec-tools/blob/master/early-kdump-howto.txt
>> and panicked the kernel with sysrq+alt+c. The dmesg saved with the kdump
>> showed warnings at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 and at
>> drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 involving AMD IOMMU and
>> amdgpu functions in the trace. Since those warnings came from
>> if (WARN_ON(!pdev->pri_enabled)) and if (WARN_ON(!pdev->pasid_enabled)),
>> pci_disable_pri and pci_disable_pasid appear to have been called when
>> pdev->pri_enabled and pdev->pasid_enabled were both false.
>> A null pointer dereference occurred right after that, which made amdgpu crash.
>>
>> [...]
> 
> These kernel traces (including the ones that follow) are triggered by
> the following code.
> 
> static int pdev_pri_ats_enable(struct pci_dev *pdev)
> {
>         int ret;
>
>         /* Only allow access to user-accessible pages */
>         ret = pci_enable_pasid(pdev, 0);
>         if (ret)
>                 goto out_err;
>
> [--cut for short--]
>
> out_err:
>         pci_disable_pri(pdev);
>         pci_disable_pasid(pdev);
>
>         return ret;
> }
> 
> pci_disable_pri() and pci_disable_pasid() are called with PCI PASID and
> PRI not enabled. There are WARN_ON()s in the pci code for such cases.

Yeah. Error path needs to be fixed.

> 
> This happens in the domain attach device path. I haven't figured out why
> a failure to enable PASID or PRI causes the domain attach to fail, or
> why pci_pasid_features() and pci_pri_supported() are not called before
> pci_enable_pasid()/pci_enable_pri().

PASID/PRI support is verified in amd_iommu_device_info().
For AMD GPUs (PASID/PRI-capable devices):
  - We allocate a new domain, called a v2 domain, and then attach the device(s):
    amd_iommu_init_device() -> iommu_attach_group()
    In the attach device path:
      amd_iommu_attach_device() -> attach_device()
      If the domain is a v2 domain and the device is PASID/PRI capable, we try
      to enable PASID/PRI. This is where we are hitting the WARN_ON.

I think that if attaching the device fails, we should put the device/group
back into the default domain so that we don't hit these warnings.

Matt,

Can you please test the patch below? (It's not a fix for the original issue,
but it should avoid the kernel warnings/traces.)

> 
> Commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on
> upstream path") requires that ACS P2P Request Redirect and Upstream
> Forwarding be enabled on the path leading to the device before PASID can
> be enabled, because the PCIe fabric routes Memory Requests based on the
> TLP address, ignoring any PASID. I guess this is why pci_enable_pasid()
> now returns failure, which exposes the buggy code above.

Can we get the lspci -vvv output? It will show whether the required ACS
features are supported.


-Vasant

-----
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index cbeaab55c0db..f81ab787eee1 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1702,27 +1702,26 @@ static int pdev_pri_ats_enable(struct pci_dev *pdev)
 	/* Only allow access to user-accessible pages */
 	ret = pci_enable_pasid(pdev, 0);
 	if (ret)
-		goto out_err;
+		return ret;

 	/* First reset the PRI state of the device */
 	ret = pci_reset_pri(pdev);
 	if (ret)
-		goto out_err;
+		goto out_pasid;

 	/* Enable PRI */
 	/* FIXME: Hardcode number of outstanding requests for now */
 	ret = pci_enable_pri(pdev, 32);
 	if (ret)
-		goto out_err;
+		goto out_pasid;

 	ret = pci_enable_ats(pdev, PAGE_SHIFT);
-	if (ret)
-		goto out_err;
-
-	return 0;
+	if (!ret)
+		return 0;

-out_err:
 	pci_disable_pri(pdev);
+
+out_pasid:
 	pci_disable_pasid(pdev);

 	return ret;
diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
index 864e4ffb6aa9..4228e44b4950 100644
--- a/drivers/iommu/amd/iommu_v2.c
+++ b/drivers/iommu/amd/iommu_v2.c
@@ -815,6 +815,7 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
 	return 0;

 out_drop_group:
+	iommu_detach_group(dev_state->domain, group);
 	iommu_group_put(group);

 out_free_domain:



* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-04 15:50     ` Vasant Hegde
@ 2023-01-05  1:09       ` Matt Fagnani
  2023-01-05 10:27         ` Vasant Hegde
  0 siblings, 1 reply; 42+ messages in thread
From: Matt Fagnani @ 2023-01-05  1:09 UTC
  To: Vasant Hegde, Baolu Lu, Thorsten Leemhuis
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

[-- Attachment #1: Type: text/plain, Size: 412 bytes --]

I built 6.2-rc2 with the patch applied, and the same black screen problem
happened. I tried early kdump twice with the patched 6.2-rc2 by panicking
the kernel with sysrq+alt+c after the black screen appeared. The system
rebooted after about 10-20 seconds both times, but no kdump or dmesg files
were saved in /var/crash. I'm attaching the lspci -vvv output as requested.


[-- Attachment #2: lspci-vvv-2.txt --]
[-- Type: text/plain, Size: 33871 bytes --]

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Complex
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0

00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) I/O Memory Management Unit
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 24
	Capabilities: [40] Secure device <?>
	Capabilities: [64] MSI: Enable+ Count=1/4 Maskable- 64bit+
		Address: 00000000fee04004  Data: 0021
	Capabilities: [74] HyperTransport: MSI Mapping Enable+ Fixed+

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller])
	DeviceName: ATI EG BROADWAY
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 38
	IOMMU group: 0
	Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at f0800000 (64-bit, prefetchable) [size=8M]
	Region 4: I/O ports at 4000 [size=256]
	Region 5: Memory at f0400000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag+ RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable+, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable+ Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000020, Page Request Allocation: 00000020
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec- Priv-, Max PASID Width: 10
		PASIDCtl: Enable+ Exec- Priv-
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kabini HDMI/DP Audio
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 40
	IOMMU group: 0
	Region 0: Memory at f0460000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag+ RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 1

00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port (prog-if 00 [Normal decode])
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 26
	IOMMU group: 1
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 3000-3fff [size=4K] [16-bit]
	Memory behind bridge: f0300000-f03fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1
			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #0, PowerLimit 0W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCap: CRSVisible+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000  Data: 0000
	Capabilities: [c0] Subsystem: Hewlett-Packard Company Device 8332
	Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: LaneErr at lane: 0
	Kernel driver in use: pcieport

00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port (prog-if 00 [Normal decode])
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 27
	IOMMU group: 1
	Bus: primary=00, secondary=02, subordinate=04, sec-latency=0
	I/O behind bridge: 2000-2fff [size=4K] [16-bit]
	Memory behind bridge: f1000000-f10fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: f0000000-f00fffff [size=1M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #1, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1
			TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #0, PowerLimit 0W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCap: CRSVisible+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000  Data: 0000
	Capabilities: [c0] Subsystem: Hewlett-Packard Company Device 8332
	Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: LaneErr at lane: 0
	Kernel driver in use: pcieport

00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 2

00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port (prog-if 00 [Normal decode])
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 29
	IOMMU group: 2
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	I/O behind bridge: 1000-1fff [size=4K] [16-bit]
	Memory behind bridge: f0500000-f06fffff [size=2M] [32-bit]
	Prefetchable memory behind bridge: f1100000-f12fffff [size=2M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #247, Speed 2.5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x16 (overdriven)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise-
			Slot #0, PowerLimit 0W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg+
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
			Changed: MRL- PresDet- LinkState-
		RootCap: CRSVisible+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
			 AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
			 AtomicOpsCtl: ReqEn- EgressBlck-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000  Data: 0000
	Capabilities: [c0] Subsystem: Hewlett-Packard Company Device 8332
	Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Kernel driver in use: pcieport

00:08.0 Encryption controller: Advanced Micro Devices, Inc. [AMD] Carrizo Platform Security Processor
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 255
	IOMMU group: 3
	Region 0: Memory at f0440000 (64-bit, prefetchable) [size=128K]
	Region 2: Memory at f0200000 (32-bit, non-prefetchable) [size=1M]
	Region 3: Memory at f046f000 (32-bit, non-prefetchable) [size=4K]
	Region 5: Memory at f046a000 (32-bit, non-prefetchable) [size=8K]
	Capabilities: [50] MSI-X: Enable- Count=2 Masked-
		Vector table: BAR=5 offset=00000000
		PBA: BAR=5 offset=00001000
	Capabilities: [5c] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR-
		AFCtrl: FLR-
		AFStatus: TP-

00:09.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Carrizo Audio Dummy Host Bridge
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 4

00:09.2 Audio device: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Audio Controller
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 41
	IOMMU group: 4
	Region 0: Memory at f0464000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR-
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 20) (prog-if 30 [XHCI])
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	IOMMU group: 5
	Region 0: Memory at f0468000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
		Vector table: BAR=0 offset=00001000
		PBA: BAR=0 offset=00001080
	Capabilities: [a0] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
	Capabilities: [100 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Kernel driver in use: xhci_hcd

00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49) (prog-if 01 [AHCI 1.0])
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 19
	IOMMU group: 6
	Region 0: I/O ports at 4118 [size=8]
	Region 1: I/O ports at 4124 [size=4]
	Region 2: I/O ports at 4110 [size=8]
	Region 3: I/O ports at 4120 [size=4]
	Region 4: I/O ports at 4100 [size=16]
	Region 5: Memory at f046c000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [70] SATA HBA v1.0 InCfgSpace
	Kernel driver in use: ahci

00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 49) (prog-if 20 [EHCI])
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	IOMMU group: 7
	Region 0: Memory at f046d000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
		Bridge: PM- B3-
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci-pci

00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 4a)
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 8
	Kernel driver in use: piix4_smbus
	Kernel modules: i2c_piix4, sp5100_tco

00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	IOMMU group: 8

00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 0
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 9

00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 1
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 9

00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 2
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 9

00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 3
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 9
	Capabilities: [f0] Secure device <?>
	Kernel driver in use: k10temp
	Kernel modules: k10temp

00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 4
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 9
	Kernel driver in use: fam15h_power
	Kernel modules: fam15h_power

00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 5
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	IOMMU group: 9

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Subsystem: Hewlett-Packard Company Device 8332
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 35
	IOMMU group: 1
	Region 0: I/O ports at 3000 [size=256]
	Region 2: Memory at f0304000 (64-bit, non-prefetchable) [size=4K]
	Region 4: Memory at f0300000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Via message/WAKE#, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00000800
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [140 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
	Capabilities: [170 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [178 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=150us PortTPowerOnTime=150us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: r8169
	Kernel modules: r8169

02:00.0 Network controller: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] (rev 10)
	Subsystem: Intel Corporation Device 2110
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 43
	IOMMU group: 1
	Region 0: Memory at f1000000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00000  Data: 0000
	Capabilities: [40] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+ FLReset-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #1, Speed 2.5GT/s, Width x1, ASPM L1, Exit Latency L1 <32us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [140 v1] Device Serial Number 88-b1-11-ff-ff-5d-01-88
	Capabilities: [14c v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [154 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=30us PortTPowerOnTime=60us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi


* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05  1:09       ` Matt Fagnani
@ 2023-01-05 10:27         ` Vasant Hegde
  2023-01-05 10:37           ` Baolu Lu
                             ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Vasant Hegde @ 2023-01-05 10:27 UTC (permalink / raw)
  To: Matt Fagnani, Baolu Lu, Thorsten Leemhuis
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

Matt,

On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> I built 6.2-rc2 with the patch applied. The same black screen problem happened
> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
> patch twice by panicking the kernel with sysrq+alt+c after the black screen
> happened. The system rebooted after about 10-20 seconds both times, but no kdump
> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
> requested.
> 

Thanks for testing. As mentioned earlier I was not expecting this patch to fix
the black screen issue. It should fix kernel warnings and IOMMU page fault
related call traces. By any chance do you have the kernel boot logs?


@Baolu,
  Looking at the lspci output, it doesn't list the ACS feature for the graphics
card. So with your fix PASID was not enabled, and hence it failed to boot.

-Vasant
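
The ACS observation above can be checked mechanically. The following is a
minimal sketch, assuming the `lspci -vvv` text layout shown in this thread. The
function name `devices_with_capability` and the sample device `0a:00.0` are
hypothetical and not part of the report. It only tests whether a capability name
appears on a device's `Capabilities:` lines; it says nothing about whether ACS
or PASID is actually enabled.

```python
import re

# A device header in `lspci -vvv` output starts at column 0 with "bus:dev.fn",
# optionally preceded by a PCI domain such as "0000:".
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def devices_with_capability(lspci_text, capability):
    """Map each device address to True if `capability` is named on one of
    that device's 'Capabilities:' lines, else False."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            # New device block: assume the capability is absent until seen.
            current = match.group(1)
            result[current] = False
        elif current and "Capabilities:" in line and capability in line:
            result[current] = True
    return result


# Abbreviated sample. The first two devices are shortened from the lspci
# output in this thread; the third is a made-up entry (an assumption, not
# from the report) included only to show a positive match.
SAMPLE = "\n".join([
    "00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Processor Root Port",
    "\tCapabilities: [58] Express (v2) Root Port (Slot+), MSI 00",
    "\tCapabilities: [270 v1] Secondary PCI Express",
    "01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411",
    "\tCapabilities: [100 v2] Advanced Error Reporting",
    "0a:00.0 Example device: hypothetical entry",
    "\tCapabilities: [2a0 v1] Access Control Services",
])

report = devices_with_capability(SAMPLE, "Access Control Services")
for address, has_acs in report.items():
    print(address, "lists ACS" if has_acs else "does not list ACS")
```

To run it against a real machine, the text of `lspci -vvv` (captured as root so
that capabilities are visible) would be passed in place of `SAMPLE`.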


* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 10:27         ` Vasant Hegde
@ 2023-01-05 10:37           ` Baolu Lu
  2023-01-05 10:46             ` Vasant Hegde
  2023-01-05 19:51           ` Matt Fagnani
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Baolu Lu @ 2023-01-05 10:37 UTC (permalink / raw)
  To: Vasant Hegde, Matt Fagnani, Thorsten Leemhuis
  Cc: baolu.lu, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

On 2023/1/5 18:27, Vasant Hegde wrote:
> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>> I built 6.2-rc2 with the patch applied. The same black screen problem happened
>> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
>> patch twice by panicking the kernel with sysrq+alt+c after the black screen
>> happened. The system rebooted after about 10-20 seconds both times, but no kdump
>> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
>> requested.
>>
> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
> the black screen issue. It should fix kernel warnings and IOMMU page fault
> related call traces. By any chance do you have the kernel boot logs?
> 
> 
> @Baolu,
>    Looking at the lspci output, it doesn't list the ACS feature for the graphics
> card. So with your fix PASID was not enabled, and hence it failed to boot.

So do you mind telling me why PASID needs to be enabled for the graphics
device? Or in other words, what does the graphics driver use PASID for?

--
Best regards,
baolu

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 10:37           ` Baolu Lu
@ 2023-01-05 10:46             ` Vasant Hegde
  2023-01-05 14:46               ` Deucher, Alexander
  0 siblings, 1 reply; 42+ messages in thread
From: Vasant Hegde @ 2023-01-05 10:46 UTC (permalink / raw)
  To: Baolu Lu, Matt Fagnani, Thorsten Leemhuis, Alex Deucher,
	Joerg Roedel
  Cc: iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

Baolu,


On 1/5/2023 4:07 PM, Baolu Lu wrote:
> On 2023/1/5 18:27, Vasant Hegde wrote:
>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>> I built 6.2-rc2 with the patch applied. The same black screen problem happened
>>> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
>>> patch twice by panicking the kernel with sysrq+alt+c after the black screen
>>> happened. The system rebooted after about 10-20 seconds both times, but no kdump
>>> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
>>> requested.
>>>
>> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
>> the black screen issue. It should fix kernel warnings and IOMMU page fault
>> related call traces. By any chance do you have the kernel boot logs?
>>
>>
>> @Baolu,
>>    Looking at the lspci output, it doesn't list the ACS feature for the graphics
>> card. So with your fix PASID was not enabled, and hence it failed to boot.
> 
> So do you mind telling me why PASID needs to be enabled for the graphics
> device? Or in other words, what does the graphics driver use PASID for?

Honestly, I don't know the complete details of how PASID works with the
graphics card. Maybe Alex or Joerg can explain it better.

-Vasant


* RE: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 10:46             ` Vasant Hegde
@ 2023-01-05 14:46               ` Deucher, Alexander
  2023-01-05 15:27                 ` Felix Kuehling
  0 siblings, 1 reply; 42+ messages in thread
From: Deucher, Alexander @ 2023-01-05 14:46 UTC (permalink / raw)
  To: Hegde, Vasant, Baolu Lu, Matt Fagnani, Thorsten Leemhuis,
	Joerg Roedel, Kuehling, Felix
  Cc: iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

[AMD Official Use Only - General]

> -----Original Message-----
> From: Hegde, Vasant <Vasant.Hegde@amd.com>
> Sent: Thursday, January 5, 2023 5:46 AM
> To: Baolu Lu <baolu.lu@linux.intel.com>; Matt Fagnani
> <matt.fagnani@bell.net>; Thorsten Leemhuis <regressions@leemhuis.info>;
> Deucher, Alexander <Alexander.Deucher@amd.com>; Joerg Roedel
> <jroedel@suse.de>
> Cc: iommu@lists.linux.dev; LKML <linux-kernel@vger.kernel.org>;
> regressions@lists.linux.dev; Linux PCI <linux-pci@vger.kernel.org>; Bjorn
> Helgaas <bhelgaas@google.com>
> Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen
> when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
> 
> Baolu,
> 
> 
> On 1/5/2023 4:07 PM, Baolu Lu wrote:
> > On 2023/1/5 18:27, Vasant Hegde wrote:
> >> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> >>> I built 6.2-rc2 with the patch applied. The same black screen
> >>> problem happened with 6.2-rc2 with the patch. I tried to use early
> >>> kdump with 6.2-rc2 with the patch twice by panicking the kernel with
> >>> sysrq+alt+c after the black screen happened. The system rebooted
> >>> after about 10-20 seconds both times, but no kdump and dmesg files
> >>> were saved in /var/crash. I'm attaching the lspci -vvv output as
> requested.
> >>>
> >> Thanks for testing. As mentioned earlier I was not expecting this
> >> patch to fix the black screen issue. It should fix kernel warnings
> >> and IOMMU page fault related call traces. By any chance do you have the
> kernel boot logs?
> >>
> >>
> >> @Baolu,
> >>    Looking into lspci output, it doesn't list ACS feature for
> >> Graphics card. So with your fix it didn't enable PASID and hence it failed to
> boot.
> >
> > So do you mind telling why does the PASID need to be enabled for the
> > graphic device? Or in another word, what does the graphic driver use
> > the PASID for?
> 
> Honestly I don't know the complete details of how PASID works with graphics
> card. May be Alex or Joerg can explain it better.

+ Felix

The GPU driver uses the pasid for shared virtual memory between the CPU and GPU.  I.e., so that the user apps can use the same virtual address space on the GPU and the CPU.  It also uses pasid to take advantage of recoverable device page faults using PRS.

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 14:46               ` Deucher, Alexander
@ 2023-01-05 15:27                 ` Felix Kuehling
  2023-01-06  5:48                   ` Baolu Lu
  0 siblings, 1 reply; 42+ messages in thread
From: Felix Kuehling @ 2023-01-05 15:27 UTC (permalink / raw)
  To: Deucher, Alexander, Hegde, Vasant, Baolu Lu, Matt Fagnani,
	Thorsten Leemhuis, Joerg Roedel
  Cc: iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Hegde, Vasant <Vasant.Hegde@amd.com>
>> Sent: Thursday, January 5, 2023 5:46 AM
>> To: Baolu Lu <baolu.lu@linux.intel.com>; Matt Fagnani
>> <matt.fagnani@bell.net>; Thorsten Leemhuis <regressions@leemhuis.info>;
>> Deucher, Alexander <Alexander.Deucher@amd.com>; Joerg Roedel
>> <jroedel@suse.de>
>> Cc: iommu@lists.linux.dev; LKML <linux-kernel@vger.kernel.org>;
>> regressions@lists.linux.dev; Linux PCI <linux-pci@vger.kernel.org>; Bjorn
>> Helgaas <bhelgaas@google.com>
>> Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen
>> when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
>>
>> Baolu,
>>
>>
>> On 1/5/2023 4:07 PM, Baolu Lu wrote:
>>> On 2023/1/5 18:27, Vasant Hegde wrote:
>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>> I built 6.2-rc2 with the patch applied. The same black screen
>>>>> problem happened with 6.2-rc2 with the patch. I tried to use early
>>>>> kdump with 6.2-rc2 with the patch twice by panicking the kernel with
>>>>> sysrq+alt+c after the black screen happened. The system rebooted
>>>>> after about 10-20 seconds both times, but no kdump and dmesg files
>>>>> were saved in /var/crash. I'm attaching the lspci -vvv output as
>> requested.
>>>> Thanks for testing. As mentioned earlier I was not expecting this
>>>> patch to fix the black screen issue. It should fix kernel warnings
>>>> and IOMMU page fault related call traces. By any chance do you have the
>> kernel boot logs?
>>>> @Baolu,
>>>>     Looking into lspci output, it doesn't list ACS feature for
>>>> Graphics card. So with your fix it didn't enable PASID and hence it failed to
>> boot.
>>> So do you mind telling why does the PASID need to be enabled for the
>>> graphic device? Or in another word, what does the graphic driver use
>>> the PASID for?
>> Honestly I don't know the complete details of how PASID works with graphics
>> card. May be Alex or Joerg can explain it better.
> + Felix
>
> The GPU driver uses the pasid for shared virtual memory between the CPU and GPU.  I.e., so that the user apps can use the same virtual address space on the GPU and the CPU.  It also uses pasid to take advantage of recoverable device page faults using PRS.

Agreed. This applies to GPU computing on some older AMD APUs that take 
advantage of memory coherence and IOMMUv2 address translation to create 
a shared virtual address space between the CPU and GPU. In this case it 
seems to be a Carrizo APU. It is also true for Raven APUs.

Regards,
   Felix


>
> Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 10:27         ` Vasant Hegde
  2023-01-05 10:37           ` Baolu Lu
@ 2023-01-05 19:51           ` Matt Fagnani
  2023-01-06 14:14           ` Jason Gunthorpe
       [not found]           ` <ff26929d-9fb0-3c85-2594-dc2937c1ba9a@bell.net>
  3 siblings, 0 replies; 42+ messages in thread
From: Matt Fagnani @ 2023-01-05 19:51 UTC (permalink / raw)
  To: Vasant Hegde, Baolu Lu, Thorsten Leemhuis
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

I booted 6.2-rc2 + the patch four times with early kdump enabled and 
panicked the kernel. There weren't any kdump or dmesg files saved to 
/var/crash though. Nothing showed up in the journal from boots where the 
problem happened. The amdgpu crash happened before systemd-journald 
started from what I could tell. I tried to rebuild 
/boot/initramfs-6.2.0-rc2+kdump.img with amd_iommu=off added to the 
kernel command line with dracut, but an error that the kdumpbase module 
couldn't be found was shown. I read that a different dump capture kernel 
could be used with kdump, but I haven't figured out how to use that with 
early kdump yet. If anyone has ideas how to get the kdump and dmesg log, 
let me know. Thanks.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 15:27                 ` Felix Kuehling
@ 2023-01-06  5:48                   ` Baolu Lu
  2023-02-15 15:39                     ` Bjorn Helgaas
  0 siblings, 1 reply; 42+ messages in thread
From: Baolu Lu @ 2023-01-06  5:48 UTC (permalink / raw)
  To: Felix Kuehling, Deucher, Alexander, Hegde, Vasant, Matt Fagnani,
	Thorsten Leemhuis, Joerg Roedel, Jason Gunthorpe
  Cc: baolu.lu, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

+Jason

On 1/5/23 11:27 PM, Felix Kuehling wrote:
> Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
>> [AMD Official Use Only - General]
>>
>>> -----Original Message-----
>>> From: Hegde, Vasant <Vasant.Hegde@amd.com>
>>> Sent: Thursday, January 5, 2023 5:46 AM
>>> To: Baolu Lu <baolu.lu@linux.intel.com>; Matt Fagnani
>>> <matt.fagnani@bell.net>; Thorsten Leemhuis <regressions@leemhuis.info>;
>>> Deucher, Alexander <Alexander.Deucher@amd.com>; Joerg Roedel
>>> <jroedel@suse.de>
>>> Cc: iommu@lists.linux.dev; LKML <linux-kernel@vger.kernel.org>;
>>> regressions@lists.linux.dev; Linux PCI <linux-pci@vger.kernel.org>; 
>>> Bjorn
>>> Helgaas <bhelgaas@google.com>
>>> Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen
>>> when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
>>>
>>> Baolu,
>>>
>>>
>>> On 1/5/2023 4:07 PM, Baolu Lu wrote:
>>>> On 2023/1/5 18:27, Vasant Hegde wrote:
>>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>>> I built 6.2-rc2 with the patch applied. The same black screen
>>>>>> problem happened with 6.2-rc2 with the patch. I tried to use early
>>>>>> kdump with 6.2-rc2 with the patch twice by panicking the kernel with
>>>>>> sysrq+alt+c after the black screen happened. The system rebooted
>>>>>> after about 10-20 seconds both times, but no kdump and dmesg files
>>>>>> were saved in /var/crash. I'm attaching the lspci -vvv output as
>>> requested.
>>>>> Thanks for testing. As mentioned earlier I was not expecting this
>>>>> patch to fix the black screen issue. It should fix kernel warnings
>>>>> and IOMMU page fault related call traces. By any chance do you have 
>>>>> the
>>> kernel boot logs?
>>>>> @Baolu,
>>>>>     Looking into lspci output, it doesn't list ACS feature for
>>>>> Graphics card. So with your fix it didn't enable PASID and hence it 
>>>>> failed to
>>> boot.
>>>> So do you mind telling why does the PASID need to be enabled for the
>>>> graphic device? Or in another word, what does the graphic driver use
>>>> the PASID for?
>>> Honestly I don't know the complete details of how PASID works with 
>>> graphics
>>> card. May be Alex or Joerg can explain it better.
>> + Felix
>>
>> The GPU driver uses the pasid for shared virtual memory between the 
>> CPU and GPU.  I.e., so that the user apps can use the same virtual 
>> address space on the GPU and the CPU.  It also uses pasid to take 
>> advantage of recoverable device page faults using PRS.
> 
> Agreed. This applies to GPU computing on some older AMD APUs that take 
> advantage of memory coherence and IOMMUv2 address translation to create 
> a shared virtual address space between the CPU and GPU. In this case it 
> seems to be a Carrizo APU. It is also true for Raven APUs.

Thanks for the explanation.

This is actually the problem that commit 201007ef707a was trying to fix.
The PCIe fabric routes Memory Requests based on the TLP address,
ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with PASID that
should go upstream to the IOMMU may instead be routed as a P2P
Request if its address falls in a bridge window.

In the SVA case, the IOMMU shares the address space of a user application.
The user application has no knowledge of the PCI bridge windows.
It is entirely possible that the device is programmed with a P2P address,
which results in a disaster.
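
A minimal userspace sketch of the routing rule described above, not kernel
code. The names struct bridge_window, struct mem_tlp and route_mem_tlp() are
hypothetical and exist only for illustration; the point is that the routing
decision looks at the address and never at the PASID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one bridge memory window (limit is inclusive). */
struct bridge_window {
	uint64_t base;
	uint64_t limit;
};

/* Hypothetical model of a Memory Request TLP. */
struct mem_tlp {
	uint64_t addr;
	bool has_pasid;
	uint32_t pasid;
};

enum route { ROUTE_UPSTREAM, ROUTE_P2P };

/*
 * Route by address only: has_pasid and pasid are deliberately never read,
 * which is the behaviour the thread attributes to the PCIe fabric.
 */
static enum route route_mem_tlp(const struct mem_tlp *tlp,
				const struct bridge_window *win, int nr_win)
{
	for (int i = 0; i < nr_win; i++) {
		if (tlp->addr >= win[i].base && tlp->addr <= win[i].limit)
			return ROUTE_P2P;
	}
	return ROUTE_UPSTREAM;
}
```

With a window of 0xe0000000-0xe00fffff, a PASID-tagged request to 0xe0001000
is routed as P2P, while the same PASID with an address outside the window
goes upstream to the IOMMU.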

--
Best regards,
baolu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-05 10:27         ` Vasant Hegde
  2023-01-05 10:37           ` Baolu Lu
  2023-01-05 19:51           ` Matt Fagnani
@ 2023-01-06 14:14           ` Jason Gunthorpe
  2023-01-07  2:44             ` Baolu Lu
  2023-01-10  5:48             ` Baolu Lu
       [not found]           ` <ff26929d-9fb0-3c85-2594-dc2937c1ba9a@bell.net>
  3 siblings, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-06 14:14 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: Matt Fagnani, Baolu Lu, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
> Matt,
> 
> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> > I built 6.2-rc2 with the patch applied. The same black screen problem happened
> > with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
> > patch twice by panicking the kernel with sysrq+alt+c after the black screen
> > happened. The system rebooted after about 10-20 seconds both times, but no kdump
> > and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
> > requested.
> > 
> 
> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
> the black screen issue. It should fix kernel warnings and IOMMU page fault
> related call traces. By any chance do you have the kernel boot logs?
> 
> 
> @Baolu,
>   Looking into lspci output, it doesn't list ACS feature for Graphics card. So
> with your fix it didn't enable PASID and hence it failed to boot.

The ACS checks being done are a feature of the path, not of the endpoint or
root port.

If we are expecting ACS on the endpoint then it is just a bug in how
the test was written. The test should be a NOP because there are no
switches in this topology.

Looking at it, this seems to just be because pci_enable_pasid() is
calling pci_acs_path_enabled() incorrectly; the only other user is here:

	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
		if (!bus->self)
			continue;

		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
			break;

		pdev = bus->self;

		group = iommu_group_get(&pdev->dev);
		if (group)
			return group;
	}

And notice it is calling it on bus->self, the bridge above pdev, not on
pdev itself, which naturally excludes the endpoint from the ACS validation.

So try something like:

	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))

(and it probably needs a NULL check on pdev->bus->self?)
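
As a rough userspace sketch of the difference, not the kernel implementation:
struct dev, acs_path_enabled(), check_from_endpoint() and
check_from_upstream() are hypothetical names, and the walk is a
simplification of what pci_acs_path_enabled() really does. Treating a missing
upstream bridge as "nothing to check" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values, chosen to mirror PCI_ACS_RR and PCI_ACS_UF. */
#define ACS_RR 0x04u
#define ACS_UF 0x10u

/* Hypothetical device model: upstream is NULL when on the root bus. */
struct dev {
	struct dev *upstream;
	unsigned int acs_flags;
};

/* Every device from start up to the root must have all requested flags. */
static bool acs_path_enabled(const struct dev *start, unsigned int flags)
{
	for (const struct dev *d = start; d; d = d->upstream) {
		if ((d->acs_flags & flags) != flags)
			return false;
	}
	return true;
}

/* Includes the endpoint itself in the validation. */
static bool check_from_endpoint(const struct dev *ep, unsigned int flags)
{
	return acs_path_enabled(ep, flags);
}

/* Starts at the bridge above the endpoint, guarding against NULL. */
static bool check_from_upstream(const struct dev *ep, unsigned int flags)
{
	if (!ep->upstream)
		return true;	/* assumption: no bridge in the path, nothing to check */
	return acs_path_enabled(ep->upstream, flags);
}
```

For an endpoint without ACS sitting directly on the root bus, the first
variant fails and the second passes, which is the behaviour difference being
discussed.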

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-06 14:14           ` Jason Gunthorpe
@ 2023-01-07  2:44             ` Baolu Lu
  2023-01-09 13:43               ` Jason Gunthorpe
  2023-01-10  5:48             ` Baolu Lu
  1 sibling, 1 reply; 42+ messages in thread
From: Baolu Lu @ 2023-01-07  2:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Vasant Hegde
  Cc: Matt Fagnani, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

On 1/6/2023 10:14 PM, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
>> Matt,
>>
>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>> I built 6.2-rc2 with the patch applied. The same black screen problem happened
>>> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
>>> patch twice by panicking the kernel with sysrq+alt+c after the black screen
>>> happened. The system rebooted after about 10-20 seconds both times, but no kdump
>>> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
>>> requested.
>>>
>>
>> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
>> the black screen issue. It should fix kernel warnings and IOMMU page fault
>> related call traces. By any chance do you have the kernel boot logs?
>>
>>
>> @Baolu,
>>    Looking into lspci output, it doesn't list ACS feature for Graphics card. So
>> with your fix it didn't enable PASID and hence it failed to boot.
> 
> The ACS checks being done are feature of the path not the end point or
> root port.
> 
> If we are expecting ACS on the end port then it is just a bug in how
> the test was written.. The test should be a NOP because there are no
> switches in this topology.
> 
> Looking at it, this seems to just be because pci_enable_pasid is
> calling pci_acs_path_enabled wrong, the only other user is here:
> 
> 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
> 		if (!bus->self)
> 			continue;
> 
> 		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
> 			break;
> 
> 		pdev = bus->self;
> 
> 		group = iommu_group_get(&pdev->dev);
> 		if (group)
> 			return group;
> 	}
> 
> And notice it is calling it on pdev->bus not on pdev itself which
> naturally excludes the end point from the ACS validation.
> 
> So try something like:
> 
> 	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))
> 
> (and probably need to check for null ?)

Yeah! This really is a misuse of pci_acs_path_enabled().

But if @pdev is an endpoint function of a multifunction device, perhaps we
still need to check ACS on it?

--
Best regards,
baolu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-07  2:44             ` Baolu Lu
@ 2023-01-09 13:43               ` Jason Gunthorpe
  2023-01-10  5:28                 ` Baolu Lu
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-09 13:43 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Vasant Hegde, Matt Fagnani, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

On Sat, Jan 07, 2023 at 10:44:46AM +0800, Baolu Lu wrote:
> On 1/6/2023 10:14 PM, Jason Gunthorpe wrote:
> > On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
> > > Matt,
> > > 
> > > On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> > > > I built 6.2-rc2 with the patch applied. The same black screen problem happened
> > > > with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
> > > > patch twice by panicking the kernel with sysrq+alt+c after the black screen
> > > > happened. The system rebooted after about 10-20 seconds both times, but no kdump
> > > > and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
> > > > requested.
> > > > 
> > > 
> > > Thanks for testing. As mentioned earlier I was not expecting this patch to fix
> > > the black screen issue. It should fix kernel warnings and IOMMU page fault
> > > related call traces. By any chance do you have the kernel boot logs?
> > > 
> > > 
> > > @Baolu,
> > >    Looking into lspci output, it doesn't list ACS feature for Graphics card. So
> > > with your fix it didn't enable PASID and hence it failed to boot.
> > 
> > The ACS checks being done are feature of the path not the end point or
> > root port.
> > 
> > If we are expecting ACS on the end port then it is just a bug in how
> > the test was written.. The test should be a NOP because there are no
> > switches in this topology.
> > 
> > Looking at it, this seems to just be because pci_enable_pasid is
> > calling pci_acs_path_enabled wrong, the only other user is here:
> > 
> > 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
> > 		if (!bus->self)
> > 			continue;
> > 
> > 		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
> > 			break;
> > 
> > 		pdev = bus->self;
> > 
> > 		group = iommu_group_get(&pdev->dev);
> > 		if (group)
> > 			return group;
> > 	}
> > 
> > And notice it is calling it on pdev->bus not on pdev itself which
> > naturally excludes the end point from the ACS validation.
> > 
> > So try something like:
> > 
> > 	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))
> > 
> > (and probably need to check for null ?)
> 
> Yeah! This really is a misuse of pci_acs_path_enabled().
> 
> But if @pdev is an endpoint of a multiple function device, perhaps we
> still need to check acs on it?

Ah, I don't know anything about what this means from a spec
perspective.

Certainly if a function can internalize MMIO and loop it back to
another function then it surely is not OK for PASID either, nor should
those functions be in different iommu groups.

So, either this never happens for some spec reason, or the test in the
iommu code forming groups is incorrect.

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-09 13:43               ` Jason Gunthorpe
@ 2023-01-10  5:28                 ` Baolu Lu
  0 siblings, 0 replies; 42+ messages in thread
From: Baolu Lu @ 2023-01-10  5:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Vasant Hegde, Matt Fagnani, Thorsten Leemhuis,
	Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

On 2023/1/9 21:43, Jason Gunthorpe wrote:
> On Sat, Jan 07, 2023 at 10:44:46AM +0800, Baolu Lu wrote:
>> On 1/6/2023 10:14 PM, Jason Gunthorpe wrote:
>>> On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
>>>> Matt,
>>>>
>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>> I built 6.2-rc2 with the patch applied. The same black screen problem happened
>>>>> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
>>>>> patch twice by panicking the kernel with sysrq+alt+c after the black screen
>>>>> happened. The system rebooted after about 10-20 seconds both times, but no kdump
>>>>> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
>>>>> requested.
>>>>>
>>>>
>>>> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
>>>> the black screen issue. It should fix kernel warnings and IOMMU page fault
>>>> related call traces. By any chance do you have the kernel boot logs?
>>>>
>>>>
>>>> @Baolu,
>>>>     Looking into lspci output, it doesn't list ACS feature for Graphics card. So
>>>> with your fix it didn't enable PASID and hence it failed to boot.
>>>
>>> The ACS checks being done are feature of the path not the end point or
>>> root port.
>>>
>>> If we are expecting ACS on the end port then it is just a bug in how
>>> the test was written.. The test should be a NOP because there are no
>>> switches in this topology.
>>>
>>> Looking at it, this seems to just be because pci_enable_pasid is
>>> calling pci_acs_path_enabled wrong, the only other user is here:
>>>
>>> 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
>>> 		if (!bus->self)
>>> 			continue;
>>>
>>> 		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
>>> 			break;
>>>
>>> 		pdev = bus->self;
>>>
>>> 		group = iommu_group_get(&pdev->dev);
>>> 		if (group)
>>> 			return group;
>>> 	}
>>>
>>> And notice it is calling it on pdev->bus not on pdev itself which
>>> naturally excludes the end point from the ACS validation.
>>>
>>> So try something like:
>>>
>>> 	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))
>>>
>>> (and probably need to check for null ?)
>>
>> Yeah! This really is a misuse of pci_acs_path_enabled().
>>
>> But if @pdev is an endpoint of a multiple function device, perhaps we
>> still need to check acs on it?
> 
> Ah, I don't know anything about what this means from a spec
> perspective.
> 
> Certainly if a function can internalize MMIO and loop it back to
> another function then it surely is not OK for PASID either, nor should
> those functions be in different iommu groups.
> 
> So, either this never happens for some spec reason, or the test in the
> iommu code forming groups is incorrect.

The pci_device_group() path handles this like below:

/*
  * For multifunction devices which are not isolated from each other, find
  * all the other non-isolated functions and look for existing groups.  For
  * each function, we also need to look for aliases to or from other devices
  * that may already have a group.
  */
static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
                                                         unsigned long *devfns)
{
         struct pci_dev *tmp = NULL;
         struct iommu_group *group;

         if (!pdev->multifunction || pci_acs_enabled(pdev, REQ_ACS_FLAGS))
                 return NULL;

It seems that all functions of an MFD share a single iommu group if
ACS control is lacking.
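
The early-return condition above can be restated as a small userspace
predicate. This is only a sketch: struct fn and shares_group_with_siblings()
are hypothetical names, and the REQ_FLAGS value is an assumption chosen to
mirror the SV, RR, CR and UF bits.

```c
#include <stdbool.h>

/* Illustrative ACS control bits (assumed to mirror the PCI_ACS_* values). */
#define ACS_SV 0x01u
#define ACS_RR 0x04u
#define ACS_CR 0x08u
#define ACS_UF 0x10u
#define REQ_FLAGS (ACS_SV | ACS_RR | ACS_CR | ACS_UF)

/* Hypothetical model of one PCI function. */
struct fn {
	bool multifunction;
	unsigned int acs_flags;
};

/*
 * True when the function cannot be isolated from its siblings and so has to
 * share their group: it is part of a multifunction device and does not have
 * every required ACS control enabled.
 */
static bool shares_group_with_siblings(const struct fn *f)
{
	if (!f->multifunction)
		return false;
	return (f->acs_flags & REQ_FLAGS) != REQ_FLAGS;
}
```

A multifunction GPU with no ACS capability gives true here, while a
single-function device, or a multifunction one with all four bits set, gives
false.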

--
Best regards,
baolu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-06 14:14           ` Jason Gunthorpe
  2023-01-07  2:44             ` Baolu Lu
@ 2023-01-10  5:48             ` Baolu Lu
  2023-01-10  8:06               ` Matt Fagnani
  2023-01-10 13:25               ` Jason Gunthorpe
  1 sibling, 2 replies; 42+ messages in thread
From: Baolu Lu @ 2023-01-10  5:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Vasant Hegde
  Cc: baolu.lu, Matt Fagnani, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas

On 2023/1/6 22:14, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
>> Matt,
>>
>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>> I built 6.2-rc2 with the patch applied. The same black screen problem happened
>>> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
>>> patch twice by panicking the kernel with sysrq+alt+c after the black screen
>>> happened. The system rebooted after about 10-20 seconds both times, but no kdump
>>> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
>>> requested.
>>>
>> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
>> the black screen issue. It should fix kernel warnings and IOMMU page fault
>> related call traces. By any chance do you have the kernel boot logs?
>>
>>
>> @Baolu,
>>    Looking into lspci output, it doesn't list ACS feature for Graphics card. So
>> with your fix it didn't enable PASID and hence it failed to boot.
> The ACS checks being done are feature of the path not the end point or
> root port.
> 
> If we are expecting ACS on the end port then it is just a bug in how
> the test was written.. The test should be a NOP because there are no
> switches in this topology.
> 
> Looking at it, this seems to just be because pci_enable_pasid is
> calling pci_acs_path_enabled wrong, the only other user is here:
> 
> 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
> 		if (!bus->self)
> 			continue;
> 
> 		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
> 			break;
> 
> 		pdev = bus->self;
> 
> 		group = iommu_group_get(&pdev->dev);
> 		if (group)
> 			return group;
> 	}
> 
> And notice it is calling it on pdev->bus not on pdev itself which
> naturally excludes the end point from the ACS validation.
> 
> So try something like:
> 
> 	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))
> 
> (and probably need to check for null ?)

Hi Matt,

Would you mind testing the change below? No other change is needed.

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index f9cc2e10b676..48f34cc996e4 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -382,8 +382,15 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
         if (!pasid)
                 return -EINVAL;

-       if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
-               return -EINVAL;
+       if (pdev->multifunction) {
+               if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
+                       return -EINVAL;
+       } else {
+               if (!pdev->bus->self ||
+                   !pci_acs_path_enabled(pdev->bus->self, NULL,
+                                         PCI_ACS_RR | PCI_ACS_UF))
+                       return -EINVAL;
+       }

         pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);
         supported &= PCI_PASID_CAP_EXEC | PCI_PASID_CAP_PRIV;

--
Best regards,
baolu

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10  5:48             ` Baolu Lu
@ 2023-01-10  8:06               ` Matt Fagnani
  2023-01-10 13:25               ` Jason Gunthorpe
  1 sibling, 0 replies; 42+ messages in thread
From: Matt Fagnani @ 2023-01-10  8:06 UTC (permalink / raw)
  To: Baolu Lu, Jason Gunthorpe, Vasant Hegde
  Cc: Thorsten Leemhuis, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

Baolu,

I tried to apply your patch after checking out 6.2-rc3 and origin/master, 
but there were the following errors.

git apply amd-iommu-amdgpu-boot-crash-2.patch
error: patch failed: drivers/pci/ats.c:382
error: drivers/pci/ats.c: patch does not apply

I manually changed drivers/pci/ats.c as shown in the patch. I built 
6.2-rc3 + the patch. 6.2-rc3 with the patch had the same black screen 
problem when booting. I added rd.driver.blacklist=amdgpu on the kernel 
command line to prevent amdgpu from being started while the initramfs 
was in use, and the black screen happened later in the boot as I 
described in my previous email. The journal showed the same two warnings 
and null pointer dereference which made amdgpu crash as I reported.

Thanks,

Matt



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10  5:48             ` Baolu Lu
  2023-01-10  8:06               ` Matt Fagnani
@ 2023-01-10 13:25               ` Jason Gunthorpe
  2023-01-10 13:45                 ` Christian König
  2023-01-11  3:16                 ` Baolu Lu
  1 sibling, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 13:25 UTC (permalink / raw)
  To: Baolu Lu, Alex Deucher, Christian König, Pan, Xinhui
  Cc: Vasant Hegde, Matt Fagnani, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas, amd-gfx

On Tue, Jan 10, 2023 at 01:48:39PM +0800, Baolu Lu wrote:
> On 2023/1/6 22:14, Jason Gunthorpe wrote:
> > On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
> > > Matt,
> > > 
> > > On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> > > > I built 6.2-rc2 with the patch applied. The same black screen problem happened
> > > > with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
> > > > patch twice by panicking the kernel with sysrq+alt+c after the black screen
> > > > happened. The system rebooted after about 10-20 seconds both times, but no kdump
> > > > and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
> > > > requested.
> > > > 
> > > Thanks for testing. As mentioned earlier I was not expecting this patch to fix
> > > the black screen issue. It should fix kernel warnings and IOMMU page fault
> > > related call traces. By any chance do you have the kernel boot logs?
> > > 
> > > 
> > > @Baolu,
> > >    Looking into lspci output, it doesn't list ACS feature for Graphics card. So
> > > with your fix it didn't enable PASID and hence it failed to boot.
> > The ACS checks being done are feature of the path not the end point or
> > root port.
> > 
> > If we are expecting ACS on the end port then it is just a bug in how
> > the test was written.. The test should be a NOP because there are no
> > switches in this topology.
> > 
> > Looking at it, this seems to just be because pci_enable_pasid is
> > calling pci_acs_path_enabled wrong, the only other user is here:
> > 
> > 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
> > 		if (!bus->self)
> > 			continue;
> > 
> > 		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
> > 			break;
> > 
> > 		pdev = bus->self;
> > 
> > 		group = iommu_group_get(&pdev->dev);
> > 		if (group)
> > 			return group;
> > 	}
> > 
> > And notice it is calling it on pdev->bus not on pdev itself which
> > naturally excludes the end point from the ACS validation.
> > 
> > So try something like:
> > 
> > 	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))
> > 
> > (and probably need to check for null ?)
> 
> Hi Matt,
> 
> Do you mind helping to test below change? No other change needed.
> 
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index f9cc2e10b676..48f34cc996e4 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -382,8 +382,15 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
>         if (!pasid)
>                 return -EINVAL;
> 
> -       if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
> -               return -EINVAL;
> +       if (pdev->multifunction) {
> +               if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
> +                       return -EINVAL;

The AMD device is multi-function according to the lspci, and we
already know that 'pci_acs_path_enabled' will fail on it because that
is the problem..

Actually, I remember it is supposed to be like this:

 https://lore.kernel.org/linux-iommu/Ygpb6CxmTdUHiN50@8bytes.org/

The GPU and sound device are considered non-isolated by the group
code, presumably because of the missing ACS caps.

So, if I remember the issue, PCIe says that MemWr/Rd are routed
according to their address and ignore the PASID header.

A multifunction device is permitted to loop back DMAs that one function
issues when they match an MMIO BAR of another function. E.g. the GPU
could DMA to an MMIO address that overlaps the sound device, and the
device will deliver the MMIO to the sound device, not the host bridge,
even though it is PASID tagged.

This is what get_pci_function_alias_group() is looking for.

Multifunction devices that do not do that are supposed to set the ACS
RR|UF bits and get_pci_function_alias_group()/etc are supposed to
succeed.
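
The routing rule described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
kernel code: struct fn_model, route_dma() and the BAR addresses used
with it are hypothetical, and the model takes the worst case in which a
device without ACS RR|UF always loops a matching request back. Only the
two PCI_ACS_* bit values are taken from the kernel's pci_regs.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ACS bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ACS_RR 0x0004 /* P2P Request Redirect */
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */

/* Hypothetical model of one function of a multifunction device. */
struct fn_model {
	uint64_t bar_base; /* start of the function's MMIO BAR */
	uint64_t bar_size; /* size of that BAR in bytes */
	uint16_t acs_ctrl; /* enabled ACS controls, 0 if ACS is absent */
};

enum dma_dest { DEST_HOST_BRIDGE, DEST_SIBLING_MMIO };

static bool addr_hits_bar(const struct fn_model *fn, uint64_t addr)
{
	return addr >= fn->bar_base && addr - fn->bar_base < fn->bar_size;
}

/*
 * Decide where a memory request issued by @src ends up. The PASID is
 * accepted as a parameter only to show that it plays no part in routing.
 */
enum dma_dest route_dma(const struct fn_model *src,
			const struct fn_model *sibling,
			uint64_t addr, uint32_t pasid)
{
	const uint16_t need = PCI_ACS_RR | PCI_ACS_UF;

	(void)pasid; /* routing is by address only */

	/* With RR and UF enabled, peer requests are forced upstream. */
	if ((src->acs_ctrl & need) == need)
		return DEST_HOST_BRIDGE;

	/* Worst case without them: the request is looped back internally. */
	if (addr_hits_bar(sibling, addr))
		return DEST_SIBLING_MMIO;

	return DEST_HOST_BRIDGE;
}
```

In this model a PASID-tagged request whose address falls inside the
sibling's BAR never reaches the host bridge unless both RR and UF are
enabled on the issuing function, which is the hole in the PASID address
space discussed here.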

Thus - the PCI information is telling us that the AMD GPU device does
not support PASID because it may be looping back the MMIO to the other
functions on the device and thus creating an unacceptable hole in the
PASID address space.

So - we need AMD to comment on which of these describes their GPU device:

 1) Is the issue that the PCI Caps are incorrect on this device and
 there is no loopback? Thus we should fix it with a quirk to correct
 the caps which will naturally split the iommu group too.

 2) Is the device broken and loops back PASID DMAs and we are
 legitimate and correct in blocking PASID? So far AMD just got lucky
 that no user had a SVA that overlaps with MMIO? Seems unlikely

 3) Is the device odd in that it doesn't loop back PASID tagged DMAs,
 but does loop untagged? I would say this is non-compliant and PCI
 provides no way to describe this. But we should again quirk it to
 allow the PASID to be enabled but keep the group separated.

Alex/Christian/Pan - can you please find out? The HW is:

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller])
	DeviceName: ATI EG BROADWAY
	Subsystem: Hewlett-Packard Company Device 8332
00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kabini HDMI/DP Audio
	Subsystem: Hewlett-Packard Company Device 8332

https://lore.kernel.org/all/223ee6d6-70ea-1d53-8bc2-2d22201d8dde@bell.net/

> +       } else {
> +               if (!pdev->bus->self ||
> +                   !pci_acs_path_enabled(pdev->bus->self, NULL,
> +                                         PCI_ACS_RR | PCI_ACS_UF))
> +                       return -EINVAL;
> +       }

Why would these be exclusive? Both the path and the endpoint need to be
checked.
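
A minimal sketch of what "check both" could look like, written as a
user-space model rather than as a patch. struct dev_model and all
helper names are hypothetical, and the model deliberately ignores
details of the real pci_acs_enabled() logic, such as how ACS bits that a
device does not implement are treated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ACS_RR 0x0004 /* P2P Request Redirect */
#define PCI_ACS_UF 0x0010 /* Upstream Forwarding */

/* Hypothetical, heavily simplified stand-in for struct pci_dev. */
struct dev_model {
	struct dev_model *upstream; /* bridge above, NULL on the root bus */
	bool multifunction;
	uint16_t acs_ctrl;          /* enabled ACS controls, 0 if absent */
};

static bool acs_flags_set(const struct dev_model *dev, uint16_t flags)
{
	return (dev->acs_ctrl & flags) == flags;
}

/* Endpoint part: only a multifunction endpoint can loop back to a sibling. */
static bool endpoint_isolated(const struct dev_model *ep, uint16_t flags)
{
	return !ep->multifunction || acs_flags_set(ep, flags);
}

/* Path part: every bridge between the endpoint and the root must redirect. */
static bool path_isolated(const struct dev_model *ep, uint16_t flags)
{
	const struct dev_model *br;

	for (br = ep->upstream; br != NULL; br = br->upstream)
		if (!acs_flags_set(br, flags))
			return false;
	return true; /* no bridges at all: nothing to check */
}

/* Both conditions must hold; neither one replaces the other. */
bool pasid_isolation_ok(const struct dev_model *ep)
{
	const uint16_t flags = PCI_ACS_RR | PCI_ACS_UF;

	return endpoint_isolated(ep, flags) && path_isolated(ep, flags);
}
```

For a multifunction endpoint sitting directly on the root bus, as in
this report, the path part is a no-op and the result depends only on the
endpoint's own ACS bits.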

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 13:25               ` Jason Gunthorpe
@ 2023-01-10 13:45                 ` Christian König
  2023-01-10 13:51                   ` Jason Gunthorpe
  2023-01-10 15:05                   ` Felix Kuehling
  2023-01-11  3:16                 ` Baolu Lu
  1 sibling, 2 replies; 42+ messages in thread
From: Christian König @ 2023-01-10 13:45 UTC (permalink / raw)
  To: Jason Gunthorpe, Baolu Lu, Alex Deucher, Pan, Xinhui
  Cc: Vasant Hegde, Matt Fagnani, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas, amd-gfx

On 10.01.23 at 14:25, Jason Gunthorpe wrote:
> On Tue, Jan 10, 2023 at 01:48:39PM +0800, Baolu Lu wrote:
>> On 2023/1/6 22:14, Jason Gunthorpe wrote:
>>> On Thu, Jan 05, 2023 at 03:57:28PM +0530, Vasant Hegde wrote:
>>>> Matt,
>>>>
>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>> I built 6.2-rc2 with the patch applied. The same black screen problem happened
>>>>> with 6.2-rc2 with the patch. I tried to use early kdump with 6.2-rc2 with the
>>>>> patch twice by panicking the kernel with sysrq+alt+c after the black screen
>>>>> happened. The system rebooted after about 10-20 seconds both times, but no kdump
>>>>> and dmesg files were saved in /var/crash. I'm attaching the lspci -vvv output as
>>>>> requested.
>>>>>
>>>> Thanks for testing. As mentioned earlier I was not expecting this patch to fix
>>>> the black screen issue. It should fix kernel warnings and IOMMU page fault
>>>> related call traces. By any chance do you have the kernel boot logs?
>>>>
>>>>
>>>> @Baolu,
>>>>     Looking into lspci output, it doesn't list ACS feature for Graphics card. So
>>>> with your fix it didn't enable PASID and hence it failed to boot.
>>> The ACS checks being done are feature of the path not the end point or
>>> root port.
>>>
>>> If we are expecting ACS on the end port then it is just a bug in how
>>> the test was written.. The test should be a NOP because there are no
>>> switches in this topology.
>>>
>>> Looking at it, this seems to just be because pci_enable_pasid is
>>> calling pci_acs_path_enabled wrong, the only other user is here:
>>>
>>> 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
>>> 		if (!bus->self)
>>> 			continue;
>>>
>>> 		if (pci_acs_path_enabled(bus->self, NULL, REQ_ACS_FLAGS))
>>> 			break;
>>>
>>> 		pdev = bus->self;
>>>
>>> 		group = iommu_group_get(&pdev->dev);
>>> 		if (group)
>>> 			return group;
>>> 	}
>>>
>>> And notice it is calling it on pdev->bus not on pdev itself which
>>> naturally excludes the end point from the ACS validation.
>>>
>>> So try something like:
>>>
>>> 	if (!pci_acs_path_enabled(pdev->bus->self, NULL, PCI_ACS_RR | PCI_ACS_UF))
>>>
>>> (and probably need to check for null ?)
>> Hi Matt,
>>
>> Do you mind helping to test below change? No other change needed.
>>
>> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
>> index f9cc2e10b676..48f34cc996e4 100644
>> --- a/drivers/pci/ats.c
>> +++ b/drivers/pci/ats.c
>> @@ -382,8 +382,15 @@ int pci_enable_pasid(struct pci_dev *pdev, int features)
>>          if (!pasid)
>>                  return -EINVAL;
>>
>> -       if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
>> -               return -EINVAL;
>> +       if (pdev->multifunction) {
>> +               if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
>> +                       return -EINVAL;
> The AMD device is multi-function according to the lspci, and we
> already know that 'pci_acs_path_enabled' will fail on it because that
> is the problem..
>
> Actually, I remember it is supposed to be like this:
>
>   https://lore.kernel.org/linux-iommu/Ygpb6CxmTdUHiN50@8bytes.org/
>
> The GPU and sound device are considered non-isolated by the group
> code, presumably because of the missing ACS caps.
>
> So, if I remember the issue, PCIe says that MemWr/Rd are routed
> according to their address and ignore the PASID header.
>
> A multifunction device is permitted to loop back DMAs one function
> issues that match a MMIO BAR of another function. eg the GPU could DMA
> to an MMIO address that overlaps the sound device and the function
> will deliver the MMIO to the sound device not the host bridge even
> though it is PASID tagged.
>
> This is what get_pci_function_alias_group() is looking for.
>
> Multifunction devices that do not do that are supposed to set the ACS
> RR|UF bits and get_pci_function_alias_group()/etc are supposed to
> succeed.
>
> Thus - the PCI information is telling us that the AMD GPU device does
> not support PASID because it may be looping back the MMIO to the other
> functions on the device and thus creating an unacceptable hole in the
> PASID address space.
>
> So - we need AMD to comment on which of these describes their GPU device:
>
>   1) Is the issue that the PCI Caps are incorrect on this device and
>   there is no loopback? Thus we should fix it with a quirk to correct
>   the caps which will naturally split the iommu group too.
>
>   2) Is the device broken and loops back PASID DMAs and we are
>   legitimate and correct in blocking PASID? So far AMD just got lucky
>   that no user had a SVA that overlaps with MMIO? Seems unlikely
>
>   3) Is the device odd in that it doesn't loop back PASID tagged DMAs,
>   but does loop untagged? I would say this is non-compliant and PCI
>   provides no way to describe this. But we should again quirk it to
>   allow the PASID to be enabled but keep the group separated.

Mhm, I don't have a Kabini at hand but I have a Raven and there I see on 
the GPU:

     Capabilities: [2a0 v1] Access Control Services
         ACSCap:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
         ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

     Capabilities: [2b0 v1] Address Translation Service (ATS)
         ATSCap:    Invalidate Queue Depth: 00
         ATSCtl:    Enable+, Smallest Translation Unit: 00

On the bridge:

     Capabilities: [2a0 v1] Access Control Services
         ACSCap:    SrcValid+ TransBlk+ ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
         ACSCtl:    SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
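
For readers mapping these flags back to register bits, here is a small
sketch that formats an ACS register value the way the lines above are
printed. The helper name acs_format() is hypothetical; the flag order
follows the bit order of the PCI_ACS_* definitions in the kernel's
pci_regs.h (SV 0x01, TB 0x02, RR 0x04, CR 0x08, UF 0x10, EC 0x20,
DT 0x40).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flag names in bit order, bit 0 first. */
static const char *const acs_names[] = {
	"SrcValid", "TransBlk", "ReqRedir", "CmpltRedir",
	"UpstreamFwd", "EgressCtrl", "DirectTrans",
};

/* Format the low seven bits of an ACS register as "Name+ Name- ...". */
void acs_format(uint16_t reg, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(acs_names) / sizeof(acs_names[0]); i++) {
		int n = snprintf(buf + used, len - used, "%s%s%c",
				 i ? " " : "", acs_names[i],
				 (reg & (1u << i)) ? '+' : '-');
		if (n < 0 || (size_t)n >= len - used)
			return; /* output truncated, stop here */
		used += (size_t)n;
	}
}
```

With this helper, the bridge's ACSCap line corresponds to a value of
0x0003, and a function that had both ReqRedir and UpstreamFwd would show
0x0014 set.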

And I'm like 99% sure that Kabini/Wani should be identical to that.

Since this is a device integrated in the CPU it could be that the 
ACS/ATS functionalities are controlled by the BIOS and can be 
enabled/disabled there. But this should always enable/disable both.

Regards,
Christian.

>
> Alex/Christian/Pan - can you please find out? The HW is:
>
> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller])
> 	DeviceName: ATI EG BROADWAY
> 	Subsystem: Hewlett-Packard Company Device 8332
> 00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kabini HDMI/DP Audio
> 	Subsystem: Hewlett-Packard Company Device 8332
>
> https://lore.kernel.org/all/223ee6d6-70ea-1d53-8bc2-2d22201d8dde@bell.net/
>
>> +       } else {
>> +               if (!pdev->bus->self ||
>> +                   !pci_acs_path_enabled(pdev->bus->self, NULL,
>> +                                         PCI_ACS_RR | PCI_ACS_UF))
>> +                       return -EINVAL;
>> +       }
> Why would these be exclusive? Both the path and the endpoint need to be
> checked.
>
> Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 13:45                 ` Christian König
@ 2023-01-10 13:51                   ` Jason Gunthorpe
  2023-01-10 13:56                     ` Christian König
  2023-01-10 15:05                   ` Felix Kuehling
  1 sibling, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 13:51 UTC (permalink / raw)
  To: Christian König
  Cc: Baolu Lu, Alex Deucher, Pan, Xinhui, Vasant Hegde, Matt Fagnani,
	Thorsten Leemhuis, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas, amd-gfx

On Tue, Jan 10, 2023 at 02:45:30PM +0100, Christian König wrote:

> Since this is a device integrated in the CPU it could be that the ACS/ATS
> functionalities are controlled by the BIOS and can be enabled/disabled
> there. But this should always enable/disable both.

This sounds like a GPU driver bug then, it should tolerate PASID being
unavailable because of BIOS issues/whatever and not black screen on
boot?

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 13:51                   ` Jason Gunthorpe
@ 2023-01-10 13:56                     ` Christian König
  2023-01-10 20:51                       ` Matt Fagnani
  0 siblings, 1 reply; 42+ messages in thread
From: Christian König @ 2023-01-10 13:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Alex Deucher, Pan, Xinhui, Vasant Hegde, Matt Fagnani,
	Thorsten Leemhuis, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas, amd-gfx

On 10.01.23 at 14:51, Jason Gunthorpe wrote:
> On Tue, Jan 10, 2023 at 02:45:30PM +0100, Christian König wrote:
>
>> Since this is a device integrated in the CPU it could be that the ACS/ATS
>> functionalities are controlled by the BIOS and can be enabled/disabled
>> there. But this should always enable/disable both.
> This sounds like a GPU driver bug then, it should tolerate PASID being
> unavailable because of BIOS issues/whatever and not black screen on
> boot?

Yeah, potentially. Could I get a full "sudo lspci -vvvv -s $bus_id" + 
dmesg of that device?

Thanks,
Christian.

>
> Jason


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 13:45                 ` Christian König
  2023-01-10 13:51                   ` Jason Gunthorpe
@ 2023-01-10 15:05                   ` Felix Kuehling
  2023-01-10 15:19                     ` Jason Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Felix Kuehling @ 2023-01-10 15:05 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Baolu Lu, Alex Deucher,
	Pan, Xinhui
  Cc: Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Vasant Hegde, amd-gfx, LKML, iommu@lists.linux.dev,
	Matt Fagnani, Bjorn Helgaas

On 2023-01-10 at 08:45, Christian König wrote:
> And I'm like 99% sure that Kabini/Wani should be identical to that. 

Kabini is not supported by KFD. There should be no calls to 
amd_iommu_... functions on Kabini, at least not from kfd_iommu.c. And 
I'm not aware of any other callers in amdgpu.ko.

Regards,
   Felix



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 15:05                   ` Felix Kuehling
@ 2023-01-10 15:19                     ` Jason Gunthorpe
  2023-01-10 15:21                       ` Felix Kuehling
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-10 15:19 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christian König, Baolu Lu, Alex Deucher, Pan, Xinhui,
	Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Vasant Hegde, amd-gfx, LKML, iommu@lists.linux.dev,
	Matt Fagnani, Bjorn Helgaas

On Tue, Jan 10, 2023 at 10:05:44AM -0500, Felix Kuehling wrote:
> On 2023-01-10 at 08:45, Christian König wrote:
> > And I'm like 99% sure that Kabini/Wani should be identical to that.
> 
> Kabini is not supported by KFD. There should be no calls to amd_iommu_...
> functions on Kabini, at least not from kfd_iommu.c. And I'm not aware of any
> other callers in amdgpu.ko.

The backtrace from the system says otherwise..

>> [   13.515710]  amd_iommu_attach_device+0x2e0/0x300
>> [   13.515719]  __iommu_attach_device+0x1b/0x90
>> [   13.515727]  iommu_attach_group+0x65/0xa0
>> [   13.515735]  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
>> [   13.515747]  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
>> [   13.517094]  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
>> [   13.518419]  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
>> [   13.519699]  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
>> [   13.520877]  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
>> [   13.522118]  ? _raw_spin_lock_irqsave+0x23/0x50
>> [   13.522126]  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
>> [   13.523386]  amdgpu_pci_probe+0x161/0x370 [amdgpu]
>> [   13.524516]  local_pci_probe+0x41/0x80
>> [   13.524525]  pci_device_probe+0xb3/0x220
>> [   13.524533]  really_probe+0xde/0x380
>> [   13.524540]  ? pm_runtime_barrier+0x50/0x90
>> [   13.524546]  __driver_probe_device+0x78/0x170
>> [   13.524555]  driver_probe_device+0x1f/0x90
>> [   13.524560]  __driver_attach+0xce/0x1c0
>> [   13.524565]  ? __pfx___driver_attach+0x10/0x10
>> [   13.524570]  bus_for_each_dev+0x73/0xa0
>> [   13.524575]  bus_add_driver+0x1ae/0x200
>> [   13.524580]  driver_register+0x89/0xe0
>> [   13.524586]  ? __pfx_init_module+0x10/0x10 [amdgpu]
>> [   13.525819]  do_one_initcall+0x59/0x230
>> [   13.525828]  do_init_module+0x4a/0x200
>> [   13.525834]  __do_sys_init_module+0x157/0x180
>> [   13.525839]  do_syscall_64+0x5b/0x80
>> [   13.525845]  ? handle_mm_fault+0xff/0x2f0
>> [   13.525850]  ? do_user_addr_fault+0x1ef/0x690
>> [   13.525856]  ? exc_page_fault+0x70/0x170
>> [   13.525860]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [   13.525867] RIP: 0033:0x7fabd66cde4e

https://lore.kernel.org/all/157c4ca4-370a-5d7e-fe32-c64d934f6979@amd.com/

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 15:19                     ` Jason Gunthorpe
@ 2023-01-10 15:21                       ` Felix Kuehling
  0 siblings, 0 replies; 42+ messages in thread
From: Felix Kuehling @ 2023-01-10 15:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, Baolu Lu, Alex Deucher, Pan, Xinhui,
	Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Vasant Hegde, amd-gfx, LKML, iommu@lists.linux.dev,
	Matt Fagnani, Bjorn Helgaas

On 2023-01-10 at 10:19, Jason Gunthorpe wrote:
> On Tue, Jan 10, 2023 at 10:05:44AM -0500, Felix Kuehling wrote:
>> On 2023-01-10 at 08:45, Christian König wrote:
>>> And I'm like 99% sure that Kabini/Wani should be identical to that.
>> Kabini is not supported by KFD. There should be no calls to amd_iommu_...
>> functions on Kabini, at least not from kfd_iommu.c. And I'm not aware of any
>> other callers in amdgpu.ko.
> The backtrace from the system says otherwise..

That log is for Carrizo, not Kabini:

> [   13.143970] [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x103C:0x8332 0xCA).
Carrizo is supported by KFD, and it does support ATS/PRI.

Regards,
   Felix


>
>>> [   13.515710]  amd_iommu_attach_device+0x2e0/0x300
>>> [   13.515719]  __iommu_attach_device+0x1b/0x90
>>> [   13.515727]  iommu_attach_group+0x65/0xa0
>>> [   13.515735]  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
>>> [   13.515747]  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
>>> [   13.517094]  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
>>> [   13.518419]  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
>>> [   13.519699]  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
>>> [   13.520877]  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
>>> [   13.522118]  ? _raw_spin_lock_irqsave+0x23/0x50
>>> [   13.522126]  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
>>> [   13.523386]  amdgpu_pci_probe+0x161/0x370 [amdgpu]
>>> [   13.524516]  local_pci_probe+0x41/0x80
>>> [   13.524525]  pci_device_probe+0xb3/0x220
>>> [   13.524533]  really_probe+0xde/0x380
>>> [   13.524540]  ? pm_runtime_barrier+0x50/0x90
>>> [   13.524546]  __driver_probe_device+0x78/0x170
>>> [   13.524555]  driver_probe_device+0x1f/0x90
>>> [   13.524560]  __driver_attach+0xce/0x1c0
>>> [   13.524565]  ? __pfx___driver_attach+0x10/0x10
>>> [   13.524570]  bus_for_each_dev+0x73/0xa0
>>> [   13.524575]  bus_add_driver+0x1ae/0x200
>>> [   13.524580]  driver_register+0x89/0xe0
>>> [   13.524586]  ? __pfx_init_module+0x10/0x10 [amdgpu]
>>> [   13.525819]  do_one_initcall+0x59/0x230
>>> [   13.525828]  do_init_module+0x4a/0x200
>>> [   13.525834]  __do_sys_init_module+0x157/0x180
>>> [   13.525839]  do_syscall_64+0x5b/0x80
>>> [   13.525845]  ? handle_mm_fault+0xff/0x2f0
>>> [   13.525850]  ? do_user_addr_fault+0x1ef/0x690
>>> [   13.525856]  ? exc_page_fault+0x70/0x170
>>> [   13.525860]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>> [   13.525867] RIP: 0033:0x7fabd66cde4e
> https://lore.kernel.org/all/157c4ca4-370a-5d7e-fe32-c64d934f6979@amd.com/
>
> Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
       [not found]           ` <ff26929d-9fb0-3c85-2594-dc2937c1ba9a@bell.net>
@ 2023-01-10 16:08             ` Vasant Hegde
  2023-01-10 16:12               ` Vasant Hegde
  0 siblings, 1 reply; 42+ messages in thread
From: Vasant Hegde @ 2023-01-10 16:08 UTC (permalink / raw)
  To: Matt Fagnani, Baolu Lu, Thorsten Leemhuis
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas

Matt,


On 1/6/2023 12:58 PM, Matt Fagnani wrote:
> I booted 6.2-rc2 + patch with rd.driver.blacklist=amdgpu on the kernel command
> line to prevent amdgpu from being started while the initramfs was in use. The
> black screen problem happened later in the boot. I pressed sysrq+alt+s,u,b to do
> an emergency sync, remount read-only, and reboot. The journal for that boot was
> shown on the next boot. The two warnings which I previously reported weren't
> shown in the journal, but the same null pointer dereference which made amdgpu
> crash happened. I'm attaching the kernel log from the journal of that boot.
> 

Thanks for your effort to get the boot log. This is helpful.

Looking into the code further,
  iommu_detach_group() didn't attach the devices back to default_domain. So from
the IOMMU's point of view the device group was left in an inconsistent state.
This resulted in the IOMMU throwing page fault errors, and the AMD IOMMU event
handler code always assumes that the domain is set up properly. That resulted in
the NULL pointer dereference below.

  Jan 06 02:07:52 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000058
  Jan 06 02:07:52 kernel: #PF: supervisor read access in kernel mode
  Jan 06 02:07:53 kernel: #PF: error_code(0x0000) - not-present page
  Jan 06 02:07:53 kernel: PGD 0 P4D 0
  Jan 06 02:07:53 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
  Jan 06 02:07:53 kernel: CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Not tainted 6.2.0-rc2+ #89
  Jan 06 02:07:53 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
  Jan 06 02:07:53 kernel: RIP: 0010:report_iommu_fault+0x11/0x90

Ideally, if the domain attach fails (in this case because the PASID capability
check returned an error), we should put the devices back into the original
domain so that the device can continue without PASID capability.
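
The idea of falling back to the original domain can be sketched in a
small user-space model. This is not the patch mentioned below and not
the kernel's IOMMU API: struct domain_model, struct group_model,
do_attach() and attach_with_rollback() are all hypothetical names used
only for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an IOMMU domain. */
struct domain_model {
	int needs_pasid; /* non-zero if attaching requires PASID */
};

/* Hypothetical stand-in for an IOMMU group. */
struct group_model {
	struct domain_model *default_domain;
	struct domain_model *domain; /* currently attached domain */
	int pasid_available;         /* result of the PASID capability check */
};

/* Low-level attach: fails with -EINVAL when PASID is required but missing. */
static int do_attach(struct group_model *grp, struct domain_model *dom)
{
	if (dom->needs_pasid && !grp->pasid_available) {
		grp->domain = NULL; /* models the inconsistent state */
		return -EINVAL;
	}
	grp->domain = dom;
	return 0;
}

/*
 * Attach @dom; on failure, re-attach the default domain so that later
 * fault handling never sees a group without a domain.
 */
int attach_with_rollback(struct group_model *grp, struct domain_model *dom)
{
	int ret = do_attach(grp, dom);

	if (ret)
		(void)do_attach(grp, grp->default_domain);
	return ret;
}
```

The caller still sees the error and can run without PASID, but the
group's domain pointer is never left NULL for an event handler to trip
over.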

I have a patch to handle these error conditions (not the fix for the original
issue). I will try to post it soon.

-Vasant

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 16:08             ` Vasant Hegde
@ 2023-01-10 16:12               ` Vasant Hegde
  0 siblings, 0 replies; 42+ messages in thread
From: Vasant Hegde @ 2023-01-10 16:12 UTC (permalink / raw)
  To: Matt Fagnani, Baolu Lu, Thorsten Leemhuis
  Cc: Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas



On 1/10/2023 9:38 PM, Vasant Hegde wrote:
> Matt,
> 
> 
> On 1/6/2023 12:58 PM, Matt Fagnani wrote:
>> I booted 6.2-rc2 + patch with rd.driver.blacklist=amdgpu on the kernel command
>> line to prevent amdgpu from being started while the initramfs was in use. The
>> black screen problem happened later in the boot. I pressed sysrq+alt+s,u,b to do
>> an emergency sync, remount read-only, and reboot. The journal for that boot was
>> shown on the next boot. The two warnings which I previously reported weren't
>> shown in the journal, but the same null pointer dereference which made amdgpu
>> crash happened. I'm attaching the kernel log from the journal of that boot.
>>
> 
> Thanks for your effort to get boot log. This is helpful.
> 
> Looking into the code further,
>   iommu_detach_group() didn't attach devices back to default_domain.

... because iommu_detach_group() expects the new domain to be different from
group->domain.

-Vasant


> So IOMMU
> point of view device group was left in inconsistent state. This resulted in
> IOMMU throwing page fault errors and amd IOMMU event handler code always assumes
> that domain is setup properly. That resulted in below NULL pointer dereference
> issue.
> 
>   Jan 06 02:07:52 kernel: BUG: kernel NULL pointer dereference, address:
> 0000000000000058
>   Jan 06 02:07:52 kernel: #PF: supervisor read access in kernel mode
>   Jan 06 02:07:53 kernel: #PF: error_code(0x0000) - not-present page
>   Jan 06 02:07:53 kernel: PGD 0 P4D 0
>   Jan 06 02:07:53 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
>   Jan 06 02:07:53 kernel: CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Not tainted
> 6.2.0-rc2+ #89
>   Jan 06 02:07:53 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52
> 12/03/2019
>   Jan 06 02:07:53 kernel: RIP: 0010:report_iommu_fault+0x11/0x90
> 
> Ideally if domain attach fails (in this case its because pasid capability check
> returned error) we should put devices back to original domain.. so that it can
> continue without PASID capability.
> 
> I have a patch to handle these error conditions (not the fix for original
> issue). I will try to post it soon.
> 
> -Vasant

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 13:56                     ` Christian König
@ 2023-01-10 20:51                       ` Matt Fagnani
  2023-01-11  8:35                         ` Christian König
  0 siblings, 1 reply; 42+ messages in thread
From: Matt Fagnani @ 2023-01-10 20:51 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe
  Cc: Baolu Lu, Alex Deucher, Pan, Xinhui, Vasant Hegde,
	Thorsten Leemhuis, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 1611 bytes --]

Christian,

I'm attaching the output of sudo lspci -vvvv. I'm not sure what $bus_id
is in this case; I guess it might be 00 in 00:00.0. I attached the dmesg
from previous boots with 6.2-rc1 at
https://bugzilla.kernel.org/show_bug.cgi?id=216865#c2 as I mentioned at
https://lore.kernel.org/all/52583644-d875-a454-7288-8b00ea0566ae@bell.net/
and the one from 6.2-rc2 + Vasant's patch with rd.driver.blacklist=amdgpu
on the kernel command line at
https://lore.kernel.org/all/ff26929d-9fb0-3c85-2594-dc2937c1ba9a@bell.net/

I'm using the Radeon R5 integrated GPU, which is called Wani in lspci and
Carrizo in dmesg. The CPU is an AMD A10-9620P, which is Bristol Ridge or
Excavator+ according to
https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_units

I'm using the internal Elan touchscreen in the laptop. I'm not using the
HDMI port for an external monitor or audio, which I think is what lspci
calls Kabini HDMI/DP Audio.

Thanks,

Matt

On 1/10/23 08:56, Christian König wrote:
> On 10.01.23 at 14:51, Jason Gunthorpe wrote:
>> On Tue, Jan 10, 2023 at 02:45:30PM +0100, Christian König wrote:
>>
>>> Since this is a device integrated in the CPU it could be that the 
>>> ACS/ATS
>>> functionalities are controlled by the BIOS and can be enabled/disabled
>>> there. But this should always enable/disable both.
>> This sounds like a GPU driver bug then, it should tolerate PASID being
>> unavailable because of BIOS issues/whatever and not black screen on
>> boot?
>
> Yeah, potentially. Could I get a full "sudo lspci -vvvv -s $bus_id" + 
> dmesg of that device?
>
> Thanks,
> Christian.
>
>>
>> Jason
>

[-- Attachment #2: lspci-vvvv-1.txt --]
[-- Type: text/plain, Size: 40624 bytes --]

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Complex
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0

00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) I/O Memory Management Unit
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 24
        Capabilities: [40] Secure device <?>
        Capabilities: [64] MSI: Enable+ Count=1/4 Maskable- 64bit+
                Address: 00000000fee04004  Data: 0021
        Capabilities: [74] HyperTransport: MSI Mapping Enable+ Fixed+

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca) (prog-if 00 [VGA controller])
        DeviceName: ATI EG BROADWAY
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 38
        IOMMU group: 0
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at f0800000 (64-bit, prefetchable) [size=8M]
        Region 4: I/O ports at 4000 [size=256]
        Region 5: Memory at f0400000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag+ RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [2b0 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable+, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] Page Request Interface (PRI)
                PRICtl: Enable+ Reset-
                PRISta: RF- UPRGI- Stopped+
                Page Request Capacity: 00000020, Page Request Allocation: 00000020
        Capabilities: [2d0 v1] Process Address Space ID (PASID)
                PASIDCap: Exec- Priv-, Max PASID Width: 10
                PASIDCtl: Enable+ Exec- Priv-
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

00:01.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Kabini HDMI/DP Audio
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 42
        IOMMU group: 0
        Region 0: Memory at f0460000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag+ RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 1

00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port (prog-if 00 [Normal decode])
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 26
        IOMMU group: 1
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 3000-3fff [size=4K] [16-bit]
        Memory behind bridge: f0300000-f03fffff [size=1M] [32-bit]
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 0W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState+
                RootCap: CRSVisible+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [c0] Subsystem: Hewlett-Packard Company Device 8332
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: pcieport

00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port (prog-if 00 [Normal decode])
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 27
        IOMMU group: 1
        Bus: primary=00, secondary=02, subordinate=04, sec-latency=0
        I/O behind bridge: 2000-2fff [size=4K] [16-bit]
        Memory behind bridge: f1000000-f10fffff [size=1M] [32-bit]
        Prefetchable memory behind bridge: f0000000-f00fffff [size=1M] [32-bit]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 0W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState+
                RootCap: CRSVisible+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [c0] Subsystem: Hewlett-Packard Company Device 8332
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: pcieport

00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Host Bridge
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 2

00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Root Port (prog-if 00 [Normal decode])
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 29
        IOMMU group: 2
        Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
        I/O behind bridge: 1000-1fff [size=4K] [16-bit]
        Memory behind bridge: f0500000-f06fffff [size=2M] [32-bit]
        Prefetchable memory behind bridge: f1100000-f12fffff [size=2M] [32-bit]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #247, Speed 2.5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16 (overdriven)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise-
                        Slot #0, PowerLimit 0W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet+ CmdCplt- HPIrq+ LinkChg+
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCap: CRSVisible+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis+
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [c0] Subsystem: Hewlett-Packard Company Device 8332
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: pcieport

00:08.0 Encryption controller: Advanced Micro Devices, Inc. [AMD] Carrizo Platform Security Processor
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 255
        IOMMU group: 3
        Region 0: Memory at f0440000 (64-bit, prefetchable) [size=128K]
        Region 2: Memory at f0200000 (32-bit, non-prefetchable) [size=1M]
        Region 3: Memory at f046f000 (32-bit, non-prefetchable) [size=4K]
        Region 5: Memory at f046a000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: [50] MSI-X: Enable- Count=2 Masked-
                Vector table: BAR=5 offset=00000000
                PBA: BAR=5 offset=00001000
        Capabilities: [5c] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a4] PCI Advanced Features
                AFCap: TP+ FLR-
                AFCtrl: FLR-
                AFStatus: TP-

00:09.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Carrizo Audio Dummy Host Bridge
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 4

00:09.2 Audio device: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Audio Controller
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 43
        IOMMU group: 4
        Region 0: Memory at f0464000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [a4] PCI Advanced Features
                AFCap: TP+ FLR-
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

00:10.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller (rev 20) (prog-if 30 [XHCI])
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        IOMMU group: 5
        Region 0: Memory at f0468000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00001000
                PBA: BAR=0 offset=00001080
        Capabilities: [a0] Express (v2) Root Complex Integrated Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
        Capabilities: [100 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Kernel driver in use: xhci_hcd

00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49) (prog-if 01 [AHCI 1.0])
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 19
        IOMMU group: 6
        Region 0: I/O ports at 4118 [size=8]
        Region 1: I/O ports at 4124 [size=4]
        Region 2: I/O ports at 4110 [size=8]
        Region 3: I/O ports at 4120 [size=4]
        Region 4: I/O ports at 4100 [size=16]
        Region 5: Memory at f046c000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] SATA HBA v1.0 InCfgSpace
        Kernel driver in use: ahci

00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 49) (prog-if 20 [EHCI])
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        IOMMU group: 7
        Region 0: Memory at f046d000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
                Bridge: PM- B3-
        Capabilities: [e4] Debug port: BAR=1 offset=00e0
        Kernel driver in use: ehci-pci

00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 4a)
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 8
        Kernel driver in use: piix4_smbus
        Kernel modules: i2c_piix4, sp5100_tco

00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        IOMMU group: 8

00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 0
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 9

00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 1
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 9

00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 2
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 9

00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 3
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 9
        Capabilities: [f0] Secure device <?>
        Kernel driver in use: k10temp
        Kernel modules: k10temp

00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 4
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 9
        Kernel driver in use: fam15h_power
        Kernel modules: fam15h_power

00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 60h-6fh) Processor Function 5
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 9

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
        Subsystem: Hewlett-Packard Company Device 8332
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 35
        IOMMU group: 1
        Region 0: I/O ports at 3000 [size=256]
        Region 2: Memory at f0304000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at f0300000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 01
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via message/WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
        Capabilities: [170 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [178 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=150us PortTPowerOnTime=150us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: r8169
        Kernel modules: r8169

02:00.0 Network controller: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] (rev 10)
        Subsystem: Intel Corporation Device 2110
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 40
        IOMMU group: 1
        Region 0: Memory at f1000000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [c8] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [40] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L1, Exit Latency L1 <32us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 88-b1-11-ff-ff-5d-01-88
        Capabilities: [14c v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [154 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=30us PortTPowerOnTime=60us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: iwlwifi
        Kernel modules: iwlwifi


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 13:25               ` Jason Gunthorpe
  2023-01-10 13:45                 ` Christian König
@ 2023-01-11  3:16                 ` Baolu Lu
  2023-01-11 13:08                   ` Jason Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Baolu Lu @ 2023-01-11  3:16 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Deucher, Christian König, Pan, Xinhui
  Cc: baolu.lu, Vasant Hegde, Matt Fagnani, Thorsten Leemhuis,
	Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas, amd-gfx

On 2023/1/10 21:25, Jason Gunthorpe wrote:
>> +       } else {
>> +               if (!pdev->bus->self ||
>> +                   !pci_acs_path_enabled(pdev->bus->self, NULL,
>> +                                         PCI_ACS_RR | PCI_ACS_UF))
>> +                       return -EINVAL;
>> +       }
> Why would these be exclusive? Both the path and endpoint needs to be
> checked

If the device is not an MFD, do we still need to check the ACS on it?
Perhaps I didn't get your point correctly.

--
Best regards,
baolu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-10 20:51                       ` Matt Fagnani
@ 2023-01-11  8:35                         ` Christian König
  0 siblings, 0 replies; 42+ messages in thread
From: Christian König @ 2023-01-11  8:35 UTC (permalink / raw)
  To: Matt Fagnani, Jason Gunthorpe
  Cc: Baolu Lu, Alex Deucher, Pan, Xinhui, Vasant Hegde,
	Thorsten Leemhuis, Joerg Roedel, iommu@lists.linux.dev, LKML,
	regressions@lists.linux.dev, Linux PCI, Bjorn Helgaas, amd-gfx

Hi Matt,

after reading a bit into the topic I think I know what's going on here.

The assumption that you need ACS to enable PASID handling is simply 
incorrect.

I'm going to send a revert of the offending patch with an in-depth
description of the problem.

Thanks,
Christian.

Am 10.01.23 um 21:51 schrieb Matt Fagnani:
> Christian,
>
> I'm attaching the output of sudo lspci -vvvv. I'm not sure what 
> $bus_id is in this case. I guess it might be 00 in 00:00.0. I attached 
> the dmesg from previous boots with 6.2-rc1 at 
> https://bugzilla.kernel.org/show_bug.cgi?id=216865#c2 
> as I mentioned at 
> https://lore.kernel.org/all/52583644-d875-a454-7288-8b00ea0566ae@bell.net/ 
> and 6.2-rc2 + Vasant's patch with rd.driver.blacklist=amdgpu on the 
> kernel command line at 
> https://lore.kernel.org/all/ff26929d-9fb0-3c85-2594-dc2937c1ba9a@bell.net/ 
> I'm using the Radeon R5 integrated GPU which is called Wani in lspci 
> and Carrizo in dmesg. The CPU is AMD A10-9620P which is Bristol Ridge 
> or Excavator+ according to 
> https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_units 
> I'm using the internal Elan touchscreen in the laptop. I'm not using 
> the HDMI port for an external monitor or audio which I think is called 
> Kabini HDMI/DP Audio in lspci
>
> Thanks,
>
> Matt
>
> On 1/10/23 08:56, Christian König wrote:
>> Am 10.01.23 um 14:51 schrieb Jason Gunthorpe:
>>> On Tue, Jan 10, 2023 at 02:45:30PM +0100, Christian König wrote:
>>>
>>>> Since this is a device integrated in the CPU it could be that the 
>>>> ACS/ATS
>>>> functionalities are controlled by the BIOS and can be enabled/disabled
>>>> there. But this should always enable/disable both.
>>> This sounds like a GPU driver bug then, it should tolerate PASID being
>>> unavailable because of BIOS issues/whatever and not black screen on
>>> boot?
>>
>> Yeah, potentially. Could I get a full "sudo lspci -vvvv -s $bus_id" + 
>> dmesg of that device?
>>
>> Thanks,
>> Christian.
>>
>>>
>>> Jason
>>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-11  3:16                 ` Baolu Lu
@ 2023-01-11 13:08                   ` Jason Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2023-01-11 13:08 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Alex Deucher, Christian König, Pan, Xinhui, Vasant Hegde,
	Matt Fagnani, Thorsten Leemhuis, Joerg Roedel,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas, amd-gfx

On Wed, Jan 11, 2023 at 11:16:32AM +0800, Baolu Lu wrote:
> On 2023/1/10 21:25, Jason Gunthorpe wrote:
> > > +       } else {
> > > +               if (!pdev->bus->self ||
> > > +                   !pci_acs_path_enabled(pdev->bus->self, NULL,
> > > +                                         PCI_ACS_RR | PCI_ACS_UF))
> > > +                       return -EINVAL;
> > > +       }
> > Why would these be exclusive? Both the path and endpoint needs to be
> > checked
> 
> If the device is not an MFD, do we still need to check the ACS on it?
> Perhaps I didn't get your point correctly.

It always needs to check the path
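
As a rough userspace illustration of the difference (not kernel code: every
mock_* name and flag value below is made up for this sketch, and the real
helpers handle more cases), the endpoint check and the path check can be
sequential rather than sitting in if/else branches:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this sketch, not the kernel's definitions. */
#define MOCK_ACS_RR 0x04u
#define MOCK_ACS_UF 0x10u

struct mock_dev {
	struct mock_dev *upstream;	/* bridge above this device, NULL at the root */
	bool multifunction;
	unsigned int acs_enabled;	/* ACS control bits currently enabled */
};

static bool mock_acs_enabled(const struct mock_dev *dev, unsigned int flags)
{
	return (dev->acs_enabled & flags) == flags;
}

/* Walk from 'start' towards the root; every hop must have 'flags' enabled. */
static bool mock_acs_path_enabled(const struct mock_dev *start, unsigned int flags)
{
	for (const struct mock_dev *d = start; d; d = d->upstream)
		if (!mock_acs_enabled(d, flags))
			return false;
	return true;
}

/* Returns 0 if the checks pass, -1 otherwise. */
static int mock_pasid_acs_check(const struct mock_dev *dev)
{
	const unsigned int flags = MOCK_ACS_RR | MOCK_ACS_UF;

	/* Endpoint check: only applies to a multi-function device. */
	if (dev->multifunction && !mock_acs_enabled(dev, flags))
		return -1;

	/* Path check: done unconditionally, not in an else branch. */
	if (!dev->upstream || !mock_acs_path_enabled(dev->upstream, flags))
		return -1;

	return 0;
}
```

With this shape a multi-function device behind a bridge without ACS is
rejected even when its own ACS bits are set.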

Jason 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-01-06  5:48                   ` Baolu Lu
@ 2023-02-15 15:39                     ` Bjorn Helgaas
  2023-02-16  0:35                       ` Felix Kuehling
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2023-02-15 15:39 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Felix Kuehling, Deucher, Alexander, Hegde, Vasant, Matt Fagnani,
	Thorsten Leemhuis, Joerg Roedel, Jason Gunthorpe,
	iommu@lists.linux.dev, LKML, regressions@lists.linux.dev,
	Linux PCI, Bjorn Helgaas, Christian König, Pan, Xinhui,
	amd-gfx

[+cc Christian, Xinhui, amd-gfx]

On Fri, Jan 06, 2023 at 01:48:11PM +0800, Baolu Lu wrote:
> On 1/5/23 11:27 PM, Felix Kuehling wrote:
> > Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
> > > > -----Original Message-----
> > > > From: Hegde, Vasant <Vasant.Hegde@amd.com>
> > > > On 1/5/2023 4:07 PM, Baolu Lu wrote:
> > > > > On 2023/1/5 18:27, Vasant Hegde wrote:
> > > > > > On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> > > > > > > I built 6.2-rc2 with the patch applied. The same black
> > > > > > > screen problem happened with 6.2-rc2 with the patch. I
> > > > > > > tried to use early kdump with 6.2-rc2 with the patch
> > > > > > > twice by panicking the kernel with sysrq+alt+c after the
> > > > > > > black screen happened. The system rebooted after about
> > > > > > > 10-20 seconds both times, but no kdump and dmesg files
> > > > > > > were saved in /var/crash. I'm attaching the lspci -vvv
> > > > > > > output as requested. ...

> > > > > > Looking into lspci output, it doesn't list ACS feature
> > > > > > for Graphics card. So with your fix it didn't enable PASID
> > > > > > and hence it failed to boot. ...

> > > > > So do you mind telling why does the PASID need to be enabled
> > > > > for the graphic device? Or in another word, what does the
> > > > > graphic driver use the PASID for? ...

> > > The GPU driver uses the pasid for shared virtual memory between
> > > the CPU and GPU.  I.e., so that the user apps can use the same
> > > virtual address space on the GPU and the CPU.  It also uses
> > > pasid to take advantage of recoverable device page faults using
> > > PRS. ...

> > Agreed. This applies to GPU computing on some older AMD APUs that
> > take advantage of memory coherence and IOMMUv2 address translation
> > to create a shared virtual address space between the CPU and GPU.
> > In this case it seems to be a Carrizo APU. It is also true for
> > Raven APUs. ...

> Thanks for the explanation.
> 
> This is actually the problem that commit 201007ef707a was trying to
> fix.  The PCIe fabric routes Memory Requests based on the TLP
> address, ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with
> PASID that should go upstream to the IOMMU may instead be routed as
> a P2P Request if its address falls in a bridge window.
> 
> In SVA case, the IOMMU shares the address space of a user
> application.  The user application side has no knowledge about the
> PCI bridge window.  It is entirely possible that the device is
> programed with a P2P address and results in a disaster.
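
The routing behaviour described in the quoted text can be modelled in a few
lines of userspace C. This is only a sketch under simplifying assumptions:
the mock_* names are hypothetical, the windows are plain inclusive address
ranges, and ACS redirection is not modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range claimed by a bridge (hypothetical model). */
struct mock_window {
	uint64_t base;
	uint64_t limit;
};

enum mock_route { MOCK_ROUTE_UPSTREAM, MOCK_ROUTE_P2P };

static enum mock_route mock_route_request(uint64_t addr, bool has_pasid,
					   const struct mock_window *win, size_t n)
{
	/* Deliberately unused: routing looks at the address only. */
	(void)has_pasid;

	for (size_t i = 0; i < n; i++)
		if (addr >= win[i].base && addr <= win[i].limit)
			return MOCK_ROUTE_P2P;

	return MOCK_ROUTE_UPSTREAM;
}
```

In this model a request carrying a PASID is still classified as P2P whenever
its address happens to fall inside a window, which is the hazard the quoted
paragraph describes for SVA.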

Is this stalled?  We explored the idea of changing the PCI core so
that for devices that use ATS/PRI, we could enable PASID without
checking for ACS [1], but IIUC we ultimately concluded that it was
based on a misunderstanding of how ATS Translation Requests are routed
and that an AMD driver change would be required [2].

So it seems like we still have this regression, and we're running out
of time before v6.2.

[1] https://lore.kernel.org/all/20230114073420.759989-1-baolu.lu@linux.intel.com/
[2] https://lore.kernel.org/all/Y91X9MeCOsa67CC6@nvidia.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-15 15:39                     ` Bjorn Helgaas
@ 2023-02-16  0:35                       ` Felix Kuehling
  2023-02-16  0:44                         ` Jason Gunthorpe
  2023-02-16  5:25                         ` Vasant Hegde
  0 siblings, 2 replies; 42+ messages in thread
From: Felix Kuehling @ 2023-02-16  0:35 UTC (permalink / raw)
  To: Bjorn Helgaas, Baolu Lu, Huang, Shimmer, Liu, Aaron
  Cc: Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Pan, Xinhui, Hegde, Vasant, amd-gfx, LKML,
	Bjorn Helgaas, iommu@lists.linux.dev, Matt Fagnani,
	Jason Gunthorpe, Deucher, Alexander, Christian König

[+Shimmer, Aaron]

Am 2023-02-15 um 10:39 schrieb Bjorn Helgaas:
> [+cc Christian, Xinhui, amd-gfx]
>
> On Fri, Jan 06, 2023 at 01:48:11PM +0800, Baolu Lu wrote:
>> On 1/5/23 11:27 PM, Felix Kuehling wrote:
>>> Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
>>>>> -----Original Message-----
>>>>> From: Hegde, Vasant <Vasant.Hegde@amd.com>
>>>>> On 1/5/2023 4:07 PM, Baolu Lu wrote:
>>>>>> On 2023/1/5 18:27, Vasant Hegde wrote:
>>>>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>>>>> I built 6.2-rc2 with the patch applied. The same black
>>>>>>>> screen problem happened with 6.2-rc2 with the patch. I
>>>>>>>> tried to use early kdump with 6.2-rc2 with the patch
>>>>>>>> twice by panicking the kernel with sysrq+alt+c after the
>>>>>>>> black screen happened. The system rebooted after about
>>>>>>>> 10-20 seconds both times, but no kdump and dmesg files
>>>>>>>> were saved in /var/crash. I'm attaching the lspci -vvv
>>>>>>>> output as requested. ...
>>>>>>> Looking into lspci output, it doesn't list ACS feature
>>>>>>> for Graphics card. So with your fix it didn't enable PASID
>>>>>>> and hence it failed to boot. ...
>>>>>> So do you mind telling why does the PASID need to be enabled
>>>>>> for the graphic device? Or in another word, what does the
>>>>>> graphic driver use the PASID for? ...
>>>> The GPU driver uses the pasid for shared virtual memory between
>>>> the CPU and GPU.  I.e., so that the user apps can use the same
>>>> virtual address space on the GPU and the CPU.  It also uses
>>>> pasid to take advantage of recoverable device page faults using
>>>> PRS. ...
>>> Agreed. This applies to GPU computing on some older AMD APUs that
>>> take advantage of memory coherence and IOMMUv2 address translation
>>> to create a shared virtual address space between the CPU and GPU.
>>> In this case it seems to be a Carrizo APU. It is also true for
>>> Raven APUs. ...
>> Thanks for the explanation.
>>
>> This is actually the problem that commit 201007ef707a was trying to
>> fix.  The PCIe fabric routes Memory Requests based on the TLP
>> address, ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with
>> PASID that should go upstream to the IOMMU may instead be routed as
>> a P2P Request if its address falls in a bridge window.
>>
>> In SVA case, the IOMMU shares the address space of a user
>> application.  The user application side has no knowledge about the
>> PCI bridge window.  It is entirely possible that the device is
>> programed with a P2P address and results in a disaster.
> Is this stalled?  We explored the idea of changing the PCI core so
> that for devices that use ATS/PRI, we could enable PASID without
> checking for ACS [1], but IIUC we ultimately concluded that it was
> based on a misunderstanding of how ATS Translation Requests are routed
> and that an AMD driver change would be required [2].
>
> So it seems like we still have this regression, and we're running out
> of time before v6.2.
>
> [1] https://lore.kernel.org/all/20230114073420.759989-1-baolu.lu@linux.intel.com/
> [2] https://lore.kernel.org/all/Y91X9MeCOsa67CC6@nvidia.com/

If I understand this correctly, the HW or the BIOS is doing something 
wrong about reporting ACS. I don't know what the GPU driver can do other 
than add some quirk to stop using AMD IOMMUv2 on this HW/BIOS.

It looks like the problem is triggered when the driver calls 
amd_iommu_init_device. That's when the first WARNs happen, soon followed 
by a kernel oops in report_iommu_fault. The driver doesn't know anything 
is wrong because amd_iommu_init_device seems to return "success". And 
the oops is not in the GPU driver either.

I guess this could also be handled more gracefully in the IOMMU driver 
(i.e. fail gracefully in amd_iommu_init_device and let the caller know 
that something is wrong, don't oops in report_iommu_fault).
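
Roughly what I have in mind, as a userspace sketch only (the mock_* names and
the ats_enabled field are stand-ins invented for this example, not the real
driver interfaces): the init call refuses up front, and the caller treats
that as "feature unavailable" rather than as a fatal error.

```c
#include <stdbool.h>

#define MOCK_EINVAL 22

struct mock_gpu {
	bool ats_enabled;	/* stand-in for the IOMMU driver's per-device state */
	bool use_iommu_v2;	/* whether the GPU driver will use shared virtual memory */
};

/* Hypothetical init hook: report the problem instead of failing later. */
static int mock_iommu_init_device(const struct mock_gpu *gpu)
{
	if (!gpu->ats_enabled)
		return -MOCK_EINVAL;
	return 0;
}

/* Hypothetical caller: a failed init only switches the optional feature off. */
static int mock_gpu_probe(struct mock_gpu *gpu)
{
	int ret = mock_iommu_init_device(gpu);

	gpu->use_iommu_v2 = (ret == 0);
	return 0;	/* probing still succeeds without IOMMUv2 */
}
```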

Regards,
   Felix



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-16  0:35                       ` Felix Kuehling
@ 2023-02-16  0:44                         ` Jason Gunthorpe
  2023-02-16  5:37                           ` Vasant Hegde
  2023-02-16 14:53                           ` Felix Kuehling
  2023-02-16  5:25                         ` Vasant Hegde
  1 sibling, 2 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2023-02-16  0:44 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Bjorn Helgaas, Baolu Lu, Huang, Shimmer, Liu, Aaron, Joerg Roedel,
	regressions@lists.linux.dev, Thorsten Leemhuis, Linux PCI,
	Pan, Xinhui, Hegde, Vasant, amd-gfx, LKML, Bjorn Helgaas,
	iommu@lists.linux.dev, Matt Fagnani, Deucher, Alexander,
	Christian König

On Wed, Feb 15, 2023 at 07:35:45PM -0500, Felix Kuehling wrote:
> 
> If I understand this correctly, the HW or the BIOS is doing something wrong
> about reporting ACS. I don't know what the GPU driver can do other than add
> some quirk to stop using AMD IOMMUv2 on this HW/BIOS.

How about this:

diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
index 864e4ffb6aa94e..cc027ce9a6e86f 100644
--- a/drivers/iommu/amd/iommu_v2.c
+++ b/drivers/iommu/amd/iommu_v2.c
@@ -732,6 +732,7 @@ EXPORT_SYMBOL(amd_iommu_unbind_pasid);
 
 int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
 {
+	struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
 	struct device_state *dev_state;
 	struct iommu_group *group;
 	unsigned long flags;
@@ -740,6 +741,9 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
 
 	might_sleep();
 
+	if (!dev_data->ats.enabled)
+		return -EINVAL;
+
 	/*
 	 * When memory encryption is active the device is likely not in a
 	 * direct-mapped domain. Forbid using IOMMUv2 functionality for now.

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-16  0:35                       ` Felix Kuehling
  2023-02-16  0:44                         ` Jason Gunthorpe
@ 2023-02-16  5:25                         ` Vasant Hegde
       [not found]                           ` <40b2da4a-a205-3cf2-0c78-c94c28b2d3f4@bell.net>
  1 sibling, 1 reply; 42+ messages in thread
From: Vasant Hegde @ 2023-02-16  5:25 UTC (permalink / raw)
  To: Felix Kuehling, Bjorn Helgaas, Baolu Lu, Huang, Shimmer,
	Liu, Aaron, Matt Fagnani, Jason Gunthorpe
  Cc: Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Pan, Xinhui, amd-gfx, LKML, Bjorn Helgaas,
	iommu@lists.linux.dev, Deucher, Alexander, Christian König

Felix, Jason, Matt,


On 2/16/2023 6:05 AM, Felix Kuehling wrote:
> [+Shimmer, Aaron]
> 
> Am 2023-02-15 um 10:39 schrieb Bjorn Helgaas:
>> [+cc Christian, Xinhui, amd-gfx]
>>
>> On Fri, Jan 06, 2023 at 01:48:11PM +0800, Baolu Lu wrote:
>>> On 1/5/23 11:27 PM, Felix Kuehling wrote:
>>>> Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
>>>>>> -----Original Message-----
>>>>>> From: Hegde, Vasant <Vasant.Hegde@amd.com>
>>>>>> On 1/5/2023 4:07 PM, Baolu Lu wrote:
>>>>>>> On 2023/1/5 18:27, Vasant Hegde wrote:
>>>>>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>>>>>> I built 6.2-rc2 with the patch applied. The same black
>>>>>>>>> screen problem happened with 6.2-rc2 with the patch. I
>>>>>>>>> tried to use early kdump with 6.2-rc2 with the patch
>>>>>>>>> twice by panicking the kernel with sysrq+alt+c after the
>>>>>>>>> black screen happened. The system rebooted after about
>>>>>>>>> 10-20 seconds both times, but no kdump and dmesg files
>>>>>>>>> were saved in /var/crash. I'm attaching the lspci -vvv
>>>>>>>>> output as requested. ...
>>>>>>>> Looking into lspci output, it doesn't list ACS feature
>>>>>>>> for Graphics card. So with your fix it didn't enable PASID
>>>>>>>> and hence it failed to boot. ...
>>>>>>> So do you mind telling why does the PASID need to be enabled
>>>>>>> for the graphic device? Or in another word, what does the
>>>>>>> graphic driver use the PASID for? ...
>>>>> The GPU driver uses the pasid for shared virtual memory between
>>>>> the CPU and GPU.  I.e., so that the user apps can use the same
>>>>> virtual address space on the GPU and the CPU.  It also uses
>>>>> pasid to take advantage of recoverable device page faults using
>>>>> PRS. ...
>>>> Agreed. This applies to GPU computing on some older AMD APUs that
>>>> take advantage of memory coherence and IOMMUv2 address translation
>>>> to create a shared virtual address space between the CPU and GPU.
>>>> In this case it seems to be a Carrizo APU. It is also true for
>>>> Raven APUs. ...
>>> Thanks for the explanation.
>>>
>>> This is actually the problem that commit 201007ef707a was trying to
>>> fix.  The PCIe fabric routes Memory Requests based on the TLP
>>> address, ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with
>>> PASID that should go upstream to the IOMMU may instead be routed as
>>> a P2P Request if its address falls in a bridge window.
>>>
>>> In SVA case, the IOMMU shares the address space of a user
>>> application.  The user application side has no knowledge about the
>>> PCI bridge window.  It is entirely possible that the device is
>>> programed with a P2P address and results in a disaster.
>> Is this stalled?  We explored the idea of changing the PCI core so
>> that for devices that use ATS/PRI, we could enable PASID without
>> checking for ACS [1], but IIUC we ultimately concluded that it was
>> based on a misunderstanding of how ATS Translation Requests are routed
>> and that an AMD driver change would be required [2].
>>
>> So it seems like we still have this regression, and we're running out
>> of time before v6.2.
>>
>> [1] https://lore.kernel.org/all/20230114073420.759989-1-baolu.lu@linux.intel.com/
>> [2] https://lore.kernel.org/all/Y91X9MeCOsa67CC6@nvidia.com/
> 
> If I understand this correctly, the HW or the BIOS is doing something wrong
> about reporting ACS. I don't know what the GPU driver can do other than add some
> quirk to stop using AMD IOMMUv2 on this HW/BIOS.
> 
> It looks like the problem is triggered when the driver calls
> amd_iommu_init_device. That's when the first WARNs happen, soon followed by a
> kernel oops in report_iommu_fault. The driver doesn't know anything is wrong
> because amd_iommu_init_device seems to return "success". And the oops is not in
> the GPU driver either.

The WARN is fixed and the fix is in Joerg's tree.
https://lore.kernel.org/all/20230111121503.5931-1-vasant.hegde@amd.com/

The report_iommu_fault() oops happened because the amd_iommu_init_device() path
failed to attach devices to the new domain and returned an error, but it didn't
put the devices back into the old domain properly. That left them in an
inconsistent state and resulted in an IO page fault. I have proposed a series to
handle device-to-domain attachment failure and to handle the subsequent
report_iommu_fault() better.
https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/
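
The general idea, as a simplified userspace sketch (this is not the code from
the series; the mock_* types and functions are hypothetical and only model
the rollback pattern):

```c
#include <stddef.h>

struct mock_domain {
	int accepts_device;	/* 0 makes the attach fail, for demonstration */
};

struct mock_device {
	struct mock_domain *domain;	/* domain the device is attached to */
};

static int mock_attach(struct mock_device *dev, struct mock_domain *dom)
{
	if (!dom->accepts_device)
		return -1;
	dev->domain = dom;
	return 0;
}

static int mock_switch_domain(struct mock_device *dev, struct mock_domain *new_dom)
{
	struct mock_domain *old = dev->domain;
	int ret;

	dev->domain = NULL;	/* detach from the current domain */

	ret = mock_attach(dev, new_dom);
	if (ret) {
		/* Roll back so the device is never left without a domain. */
		dev->domain = old;
		return ret;
	}

	return 0;
}
```

The caller still gets the error, but the device stays attached to the domain
it had before the attempt.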


@Matt,
  Can you please help verify the above patches on the system where you
originally hit the issue?
  (Grab the above two series, apply them on top of the latest kernel, and test.)

-Vasant


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-16  0:44                         ` Jason Gunthorpe
@ 2023-02-16  5:37                           ` Vasant Hegde
  2023-02-16 14:55                             ` Felix Kuehling
  2023-02-16 14:53                           ` Felix Kuehling
  1 sibling, 1 reply; 42+ messages in thread
From: Vasant Hegde @ 2023-02-16  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Felix Kuehling
  Cc: Bjorn Helgaas, Baolu Lu, Huang, Shimmer, Liu, Aaron, Joerg Roedel,
	regressions@lists.linux.dev, Thorsten Leemhuis, Linux PCI,
	Pan, Xinhui, amd-gfx, LKML, Bjorn Helgaas, iommu@lists.linux.dev,
	Matt Fagnani, Deucher, Alexander, Christian König

Hi Jason,


On 2/16/2023 6:14 AM, Jason Gunthorpe wrote:
> On Wed, Feb 15, 2023 at 07:35:45PM -0500, Felix Kuehling wrote:
>>
>> If I understand this correctly, the HW or the BIOS is doing something wrong
>> about reporting ACS. I don't know what the GPU driver can do other than add
>> some quirk to stop using AMD IOMMUv2 on this HW/BIOS.
> 
> How about this:
> 
> diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
> index 864e4ffb6aa94e..cc027ce9a6e86f 100644
> --- a/drivers/iommu/amd/iommu_v2.c
> +++ b/drivers/iommu/amd/iommu_v2.c
> @@ -732,6 +732,7 @@ EXPORT_SYMBOL(amd_iommu_unbind_pasid);
>  
>  int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
>  {
> +	struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
>  	struct device_state *dev_state;
>  	struct iommu_group *group;
>  	unsigned long flags;
> @@ -740,6 +741,9 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
>  
>  	might_sleep();
>  
> +	if (!dev_data->ats.enabled)
> +		return -EINVAL;
> +

Thanks for the proposed fix. But actually this will not solve the issue, because
the current flow is:
  - This function tries to allocate a new domain.
  - It calls iommu_attach_group(), which calls attach_device. In that path
    it tries to enable ATS/PASID and hits the error.

As I mentioned in my other reply, I think even the current code returns an error
from amd_iommu_init_device() to the GPU driver. The issue is that the
__iommu_attach_group() path detached the device from its current domain, failed
to attach it to the new domain, and returned an error. We didn't put the device
back into the old domain, and that is what causes the issue. The series below
should fix it.

https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/

-Vasant


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-16  0:44                         ` Jason Gunthorpe
  2023-02-16  5:37                           ` Vasant Hegde
@ 2023-02-16 14:53                           ` Felix Kuehling
  1 sibling, 0 replies; 42+ messages in thread
From: Felix Kuehling @ 2023-02-16 14:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Suthikulpanit, Suravee
  Cc: Bjorn Helgaas, Baolu Lu, Huang, Shimmer, Liu, Aaron, Joerg Roedel,
	regressions@lists.linux.dev, Thorsten Leemhuis, Linux PCI,
	Pan, Xinhui, Hegde, Vasant, amd-gfx, LKML, Bjorn Helgaas,
	iommu@lists.linux.dev, Matt Fagnani, Deucher, Alexander,
	Christian König

[+Suravee]

Am 2023-02-15 um 19:44 schrieb Jason Gunthorpe:
> On Wed, Feb 15, 2023 at 07:35:45PM -0500, Felix Kuehling wrote:
>> If I understand this correctly, the HW or the BIOS is doing something wrong
>> about reporting ACS. I don't know what the GPU driver can do other than add
>> some quirk to stop using AMD IOMMUv2 on this HW/BIOS.
> How about this:
>
> diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
> index 864e4ffb6aa94e..cc027ce9a6e86f 100644
> --- a/drivers/iommu/amd/iommu_v2.c
> +++ b/drivers/iommu/amd/iommu_v2.c
> @@ -732,6 +732,7 @@ EXPORT_SYMBOL(amd_iommu_unbind_pasid);
>   
>   int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
>   {
> +	struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
>   	struct device_state *dev_state;
>   	struct iommu_group *group;
>   	unsigned long flags;
> @@ -740,6 +741,9 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
>   
>   	might_sleep();
>   
> +	if (!dev_data->ats.enabled)
> +		return -EINVAL;
> +
>   	/*
>   	 * When memory encryption is active the device is likely not in a
>   	 * direct-mapped domain. Forbid using IOMMUv2 functionality for now.

Hi Suravee,

What do you think about this proposed change?

Regards,
   Felix


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-16  5:37                           ` Vasant Hegde
@ 2023-02-16 14:55                             ` Felix Kuehling
  0 siblings, 0 replies; 42+ messages in thread
From: Felix Kuehling @ 2023-02-16 14:55 UTC (permalink / raw)
  To: Vasant Hegde, Jason Gunthorpe, Suthikulpanit, Suravee
  Cc: Bjorn Helgaas, Baolu Lu, Huang, Shimmer, Liu, Aaron, Joerg Roedel,
	regressions@lists.linux.dev, Thorsten Leemhuis, Linux PCI,
	Pan, Xinhui, amd-gfx, LKML, Bjorn Helgaas, iommu@lists.linux.dev,
	Matt Fagnani, Deucher, Alexander, Christian König

[+Suravee]

Am 2023-02-16 um 00:37 schrieb Vasant Hegde:
> Hi Jason,
>
>
> On 2/16/2023 6:14 AM, Jason Gunthorpe wrote:
>> On Wed, Feb 15, 2023 at 07:35:45PM -0500, Felix Kuehling wrote:
>>> If I understand this correctly, the HW or the BIOS is doing something wrong
>>> about reporting ACS. I don't know what the GPU driver can do other than add
>>> some quirk to stop using AMD IOMMUv2 on this HW/BIOS.
>> How about this:
>>
>> diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
>> index 864e4ffb6aa94e..cc027ce9a6e86f 100644
>> --- a/drivers/iommu/amd/iommu_v2.c
>> +++ b/drivers/iommu/amd/iommu_v2.c
>> @@ -732,6 +732,7 @@ EXPORT_SYMBOL(amd_iommu_unbind_pasid);
>>   
>>   int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
>>   {
>> +	struct iommu_dev_data *dev_data = dev_iommu_priv_get(&pdev->dev);
>>   	struct device_state *dev_state;
>>   	struct iommu_group *group;
>>   	unsigned long flags;
>> @@ -740,6 +741,9 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
>>   
>>   	might_sleep();
>>   
>> +	if (!dev_data->ats.enabled)
>> +		return -EINVAL;
>> +
> Thanks for the proposed fix. But actually this will not solve the issue, because
> the current flow is:
>    - in this function it tries to allocate a new domain
>    - it calls iommu_attach_group(), which calls attach_device. In that path
>      it tries to enable ATS/PASID and hits an error.
>
> As I mentioned in another reply, I think even the current code returns an error from
> amd_iommu_init_device() to the GPU. But the issue is that in the __iommu_attach_group()
> path it detached the device from the current domain, failed to attach it to the new
> domain, and returned an error. We didn't put the device back in the old domain, and
> that's causing the issue. The series below should fix this issue.
>
> https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/
>
> -Vasant
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
       [not found]                           ` <40b2da4a-a205-3cf2-0c78-c94c28b2d3f4@bell.net>
@ 2023-02-16 19:59                             ` Felix Kuehling
  2023-02-17  5:36                               ` Vasant Hegde
  2023-02-17  5:23                             ` Vasant Hegde
  1 sibling, 1 reply; 42+ messages in thread
From: Felix Kuehling @ 2023-02-16 19:59 UTC (permalink / raw)
  To: Matt Fagnani, Vasant Hegde, Bjorn Helgaas, Baolu Lu,
	Huang, Shimmer, Liu, Aaron, Jason Gunthorpe
  Cc: Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Pan, Xinhui, amd-gfx, LKML, Bjorn Helgaas,
	iommu@lists.linux.dev, Deucher, Alexander, Christian König

> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for 
> device 1002:9874
> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: device 1002:9874 NOT added 
> due to errors 
This looks like IOMMU device initialization still fails (but more 
gracefully now). Vasant, is that expected?

This would lead to KFD not being available on Carrizo with this kernel, 
which is probably not a big limitation in practice. It would only affect 
compute applications using the ROCm user mode stack. I don't think 
anyone does that these days on these old APUs.

The SMU errors seem unrelated to this unless there is some subtle 
interaction I'm missing.

Regards,
   Felix


Am 2023-02-16 um 13:59 schrieb Matt Fagnani:
> Vasant,
>
> I applied your four patches to 6.2-rc8 and built that. The black 
> screen, null pointer dereference, and warnings didn't happen when 
> booting 6.2-rc8 with your patches. There were errors that the IOMMU 
> wasn't restarted when amdgpu and amdkfd was starting though at kernel: 
> kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874. I don't 
> know if those IOMMU errors were expected or not, but I did see those 
> types of messages when I used amd_iommu=off to work around the black 
> screen before. I didn't use amd_iommu=off when testing 6.2-rc8 with 
> your patches. There were also a different amdgpu warning at 
> drivers/gpu/drm/amd/amdgpu/../pm/powerplay/smumgr/smu8_smumgr.c:98 
> smu8_send_msg_to_smc_with_parameter+0x103/0x140 and errors about 
> amdgpu timeouts on 1/3 boots. Plasma took over a minute to start and 
> shut down on that boot which was unusually long. I've seen those sorts 
> of amdgpu warnings and errors infrequently before so they might be 
> unrelated to the IOMMU problem. The part of the journal where those 
> errors started was the following.
>
> Feb 16 13:22:31 kernel: [drm] amdgpu kernel modesetting enabled.
> Feb 16 13:22:31 kernel: amdgpu: Topology: Add APU node [0x0:0x0]
> Feb 16 13:22:31 kernel: [drm] initializing kernel modesetting (CARRIZO 
> 0x1002:0x9874 0x103C:0x8332 0xCA).
> Feb 16 13:22:31 kernel: [drm] register mmio base: 0xF0400000
> Feb 16 13:22:31 kernel: [drm] register mmio size: 262144
> Feb 16 13:22:31 kernel: [drm] add ip block number 0 <vi_common>
> Feb 16 13:22:31 kernel: [drm] add ip block number 1 <gmc_v8_0>
> Feb 16 13:22:31 kernel: [drm] add ip block number 2 <cz_ih>
> Feb 16 13:22:31 kernel: [drm] add ip block number 3 <gfx_v8_0>
> Feb 16 13:22:31 kernel: [drm] add ip block number 4 <sdma_v3_0>
> Feb 16 13:22:31 kernel: [drm] add ip block number 5 <powerplay>
> Feb 16 13:22:31 kernel: [drm] add ip block number 6 <dm>
> Feb 16 13:22:31 kernel: [drm] add ip block number 7 <uvd_v6_0>
> Feb 16 13:22:31 kernel: [drm] add ip block number 8 <vce_v3_0>
> Feb 16 13:22:31 kernel: [drm] add ip block number 9 <acp_ip>
> Feb 16 13:22:31 kernel: amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS 
> from VFCT
> Feb 16 13:22:31 kernel: amdgpu: ATOM BIOS: 113-C75100-031
> Feb 16 13:22:31 kernel: [drm] UVD is enabled in physical mode
> Feb 16 13:22:31 kernel: [drm] VCE enabled in physical mode
> Feb 16 13:22:31 kernel: Console: switching to colour dummy device 80x25
> Feb 16 13:22:31 kernel: amdgpu 0000:00:01.0: vgaarb: deactivate vga 
> console
> Feb 16 13:22:31 kernel: amdgpu 0000:00:01.0: amdgpu: Trusted Memory 
> Zone (TMZ) feature not supported
> Feb 16 13:22:31 kernel: [drm] vm size is 64 GB, 2 levels, block size 
> is 10-bit, fragment size is 9-bit
> Feb 16 13:22:31 kernel: amdgpu 0000:00:01.0: amdgpu: VRAM: 512M 
> 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
> Feb 16 13:22:31 kernel: amdgpu 0000:00:01.0: amdgpu: GART: 1024M 
> 0x000000FF00000000 - 0x000000FF3FFFFFFF
> Feb 16 13:22:31 kernel: [drm] Detected VRAM RAM=512M, BAR=512M
> Feb 16 13:22:31 kernel: [drm] RAM width 64bits UNKNOWN
> Feb 16 13:22:31 kernel: [drm] amdgpu: 512M of VRAM memory ready
> Feb 16 13:22:31 kernel: [drm] amdgpu: 3704M of GTT memory ready.
> Feb 16 13:22:31 kernel: [drm] GART: num cpu pages 262144, num gpu 
> pages 262144
> Feb 16 13:22:31 kernel: [drm] PCIE GART of 1024M enabled (table at 
> 0x000000F400600000).
> Feb 16 13:22:31 kernel: amdgpu: hwmgr_sw_init smu backed is smu8_smu
> Feb 16 13:22:31 kernel: [drm] Found UVD firmware Version: 1.91 Family 
> ID: 11
> Feb 16 13:22:31 kernel: [drm] UVD ENC is disabled
> Feb 16 13:22:31 kernel: [drm] Found VCE firmware Version: 52.4 Binary 
> ID: 3
> Feb 16 13:22:31 kernel: amdgpu: smu version 27.18.00
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB: values for Engine clock
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         300000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         480000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         533340
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         576000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         626090
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         685720
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         720000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         757900
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB: Validation clocks:
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    level           : 8
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB: values for Display clock
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         300000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         400000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         496560
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         626090
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         685720
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         757900
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         800000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         847060
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB: Validation clocks:
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    level           : 8
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB: values for Memory clock
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         667000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:         933000
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB: Validation clocks:
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
> Feb 16 13:22:31 kernel: [drm] DM_PPLIB:    level           : 8
> Feb 16 13:22:31 kernel: [drm] Display Core initialized with v3.2.215!
> Feb 16 13:22:32 kernel: [drm] UVD initialized successfully.
> Feb 16 13:22:32 kernel: [drm] VCE initialized successfully.
> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> Feb 16 13:22:32 kernel: amdgpu: sdma_bitmap: f
> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for 
> device 1002:9874
> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: device 1002:9874 NOT added 
> due to errors
> Feb 16 13:22:32 kernel: amdgpu 0000:00:01.0: amdgpu: SE 1, SH per SE 
> 1, CU per SH 8, active_cu_number 6
> Feb 16 13:22:32 kernel: [drm] Initialized amdgpu 3.49.0 20150101 for 
> 0000:00:01.0 on minor 0
> Feb 16 13:22:32 kernel: fbcon: amdgpudrmfb (fb0) is primary device
> Feb 16 13:22:33 kernel: Console: switching to colour frame buffer 
> device 170x48
> Feb 16 13:22:33 kernel: amdgpu 0000:00:01.0: [drm] fb0: amdgpudrmfb 
> frame buffer device
> Feb 16 13:22:33 kernel: audit: type=1334 audit(1676571753.397:17): 
> prog-id=21 op=LOAD
> Feb 16 13:22:33 kernel: audit: type=1130 audit(1676571753.419:18): 
> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel 
> msg='unit=dbus-broker comm="systemd" exe="/usr/lib/systemd/systemd" 
> hostname=? addr=? terminal=? res=success'
> Feb 16 13:22:33 kernel: audit: type=1130 audit(1676571753.456:19): 
> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel 
> msg='unit=dracut-initqueue comm="systemd" 
> exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> Feb 16 13:22:33 kernel: audit: type=1130 audit(1676571753.492:20): 
> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel 
> msg='unit=systemd-fsck-root comm="systemd" 
> exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> Feb 16 13:22:33 kernel: EXT4-fs (dm-0): mounted filesystem 
> 00107de9-54ef-4784-a03f-61802ed0b350 with ordered data mode. Quota 
> mode: none.
> Feb 16 13:22:36 kernel: ------------[ cut here ]------------
> Feb 16 13:22:36 kernel: smu8_send_msg_to_smc_with_parameter(0x0009, 
> 0x0) timed out after 2814625 us
> Feb 16 13:22:36 kernel: WARNING: CPU: 1 PID: 112 at 
> drivers/gpu/drm/amd/amdgpu/../pm/powerplay/smumgr/smu8_smumgr.c:98 
> smu8_send_msg_to_smc_with_parameter+0x103/0x140 [amdgpu]
> Feb 16 13:22:36 kernel: Modules linked in: amdgpu i2c_algo_bit 
> drm_ttm_helper ttm iommu_v2 mfd_core drm_buddy gpu_sched 
> drm_display_helper drm_kms_helper hid_logitech_hidpp drm 
> crct10dif_pclmul crc32_pclmul crc32c_intel r8169 sd_mod 
> ghash_clmulni_intel t10_pi sha512_ssse3 crc64_rocksoft_generic 
> crc64_rocksoft wdat_wdt sp5100_tco hid_logitech_dj crc64 cec video wmi 
> fuse dm_multipath
> Feb 16 13:22:36 kernel: CPU: 1 PID: 112 Comm: kworker/1:3 Not tainted 
> 6.2.0-rc8+ #94
> Feb 16 13:22:36 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, 
> BIOS F.52 12/03/2019
> Feb 16 13:22:36 kernel: Workqueue: events amdgpu_vce_idle_work_handler 
> [amdgpu]
> Feb 16 13:22:36 kernel: RIP: 
> 0010:smu8_send_msg_to_smc_with_parameter+0x103/0x140 [amdgpu]
> Feb 16 13:22:36 kernel: Code: 20 48 c7 c7 28 1c c1 c0 48 89 c1 48 f7 
> ea 48 89 c8 44 89 e9 48 c1 f8 3f 48 c1 fa 07 48 29 c2 49 89 d0 44 89 
> e2 e8 c5 28 48 e0 <0f> 0b eb b0 bd ea ff ff ff eb a9 48 8b 7b 40 be c0 
> 01 00 00 48 8b
> Feb 16 13:22:36 kernel: RSP: 0018:ffffb997004c7db8 EFLAGS: 00010282
> Feb 16 13:22:36 kernel: RAX: 000000000000004b RBX: ffff8b4e4f596800 
> RCX: 0000000000000000
> Feb 16 13:22:36 kernel: RDX: 0000000000000001 RSI: ffffffffa14cf075 
> RDI: 00000000ffffffff
> Feb 16 13:22:36 kernel: RBP: 00000000ffffffc2 R08: 0000000000000000 
> R09: ffffb997004c7c68
> Feb 16 13:22:36 kernel: R10: 0000000000000003 R11: ffffffffa1d42e48 
> R12: 0000000000000009
> Feb 16 13:22:36 kernel: R13: 0000000000000000 R14: 00000003ded365a4 
> R15: 0000000000000002
> Feb 16 13:22:36 kernel: FS:  0000000000000000(0000) 
> GS:ffff8b4f37480000(0000) knlGS:0000000000000000
> Feb 16 13:22:36 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 16 13:22:36 kernel: CR2: 00007f950a698364 CR3: 0000000033c10000 
> CR4: 00000000001506e0
> Feb 16 13:22:36 kernel: Call Trace:
> Feb 16 13:22:36 kernel:  <TASK>
> Feb 16 13:22:36 kernel:  smum_send_msg_to_smc+0xba/0xf0 [amdgpu]
> Feb 16 13:22:36 kernel:  smu8_dpm_powergate_vce+0x15a/0x180 [amdgpu]
> Feb 16 13:22:36 kernel:  pp_set_powergating_by_smu+0xed/0x1f0 [amdgpu]
> Feb 16 13:22:36 kernel: amdgpu_dpm_set_powergating_by_smu+0x84/0xf0 
> [amdgpu]
> Feb 16 13:22:36 kernel:  amdgpu_dpm_enable_vce+0x29/0xa0 [amdgpu]
> Feb 16 13:22:36 kernel:  process_one_work+0x1c8/0x380
> Feb 16 13:22:36 kernel:  worker_thread+0x4d/0x380
> Feb 16 13:22:36 kernel:  ? _raw_spin_lock_irqsave+0x23/0x50
> Feb 16 13:22:36 kernel:  ? __pfx_worker_thread+0x10/0x10
> Feb 16 13:22:36 kernel:  kthread+0xe9/0x110
> Feb 16 13:22:36 kernel:  ? __pfx_kthread+0x10/0x10
> Feb 16 13:22:36 kernel:  ret_from_fork+0x2c/0x50
> Feb 16 13:22:36 kernel:  </TASK>
> Feb 16 13:22:36 kernel: ---[ end trace 0000000000000000 ]---
> Feb 16 13:22:39 kernel: amdgpu: 
> smu8_send_msg_to_smc_with_parameter(0x0004) aborted; SMU still 
> servicing msg (0x0009)
> Feb 16 13:22:41 kernel: amdgpu: 
> smu8_send_msg_to_smc_with_parameter(0x0007) aborted; SMU still 
> servicing msg (0x0009)
>
> I'm attaching the kernel log for the boot of 6.2-rc8 + patches with 
> the IOMMU errors and amdgpu warnings and timeouts.
>
> Thanks,
>
> Matt
>
> On 2/16/23 00:25, Vasant Hegde wrote:
>> Felix, Jason, Matt,
>>
>>
>> On 2/16/2023 6:05 AM, Felix Kuehling wrote:
>>> [+Shimmer, Aaron]
>>>
>>> Am 2023-02-15 um 10:39 schrieb Bjorn Helgaas:
>>>> [+cc Christian, Xinhui, amd-gfx]
>>>>
>>>> On Fri, Jan 06, 2023 at 01:48:11PM +0800, Baolu Lu wrote:
>>>>> On 1/5/23 11:27 PM, Felix Kuehling wrote:
>>>>>> Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Hegde, Vasant <Vasant.Hegde@amd.com>
>>>>>>>> On 1/5/2023 4:07 PM, Baolu Lu wrote:
>>>>>>>>> On 2023/1/5 18:27, Vasant Hegde wrote:
>>>>>>>>>> On 1/5/2023 6:39 AM, Matt Fagnani wrote:
>>>>>>>>>>> I built 6.2-rc2 with the patch applied. The same black
>>>>>>>>>>> screen problem happened with 6.2-rc2 with the patch. I
>>>>>>>>>>> tried to use early kdump with 6.2-rc2 with the patch
>>>>>>>>>>> twice by panicking the kernel with sysrq+alt+c after the
>>>>>>>>>>> black screen happened. The system rebooted after about
>>>>>>>>>>> 10-20 seconds both times, but no kdump and dmesg files
>>>>>>>>>>> were saved in /var/crash. I'm attaching the lspci -vvv
>>>>>>>>>>> output as requested. ...
>>>>>>>>>> Looking into lspci output, it doesn't list ACS feature
>>>>>>>>>> for Graphics card. So with your fix it didn't enable PASID
>>>>>>>>>> and hence it failed to boot. ...
>>>>>>>>> So do you mind telling why does the PASID need to be enabled
>>>>>>>>> for the graphic device? Or in another word, what does the
>>>>>>>>> graphic driver use the PASID for? ...
>>>>>>> The GPU driver uses the pasid for shared virtual memory between
>>>>>>> the CPU and GPU.  I.e., so that the user apps can use the same
>>>>>>> virtual address space on the GPU and the CPU.  It also uses
>>>>>>> pasid to take advantage of recoverable device page faults using
>>>>>>> PRS. ...
>>>>>> Agreed. This applies to GPU computing on some older AMD APUs that
>>>>>> take advantage of memory coherence and IOMMUv2 address translation
>>>>>> to create a shared virtual address space between the CPU and GPU.
>>>>>> In this case it seems to be a Carrizo APU. It is also true for
>>>>>> Raven APUs. ...
>>>>> Thanks for the explanation.
>>>>>
>>>>> This is actually the problem that commit 201007ef707a was trying to
>>>>> fix.  The PCIe fabric routes Memory Requests based on the TLP
>>>>> address, ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with
>>>>> PASID that should go upstream to the IOMMU may instead be routed as
>>>>> a P2P Request if its address falls in a bridge window.
>>>>>
>>>>> In SVA case, the IOMMU shares the address space of a user
>>>>> application.  The user application side has no knowledge about the
>>>>> PCI bridge window.  It is entirely possible that the device is
>>>>> programed with a P2P address and results in a disaster.
>>>> Is this stalled?  We explored the idea of changing the PCI core so
>>>> that for devices that use ATS/PRI, we could enable PASID without
>>>> checking for ACS [1], but IIUC we ultimately concluded that it was
>>>> based on a misunderstanding of how ATS Translation Requests are routed
>>>> and that an AMD driver change would be required [2].
>>>>
>>>> So it seems like we still have this regression, and we're running out
>>>> of time before v6.2.
>>>>
>>>> [1] 
>>>> https://lore.kernel.org/all/20230114073420.759989-1-baolu.lu@linux.intel.com/
>>>> [2] https://lore.kernel.org/all/Y91X9MeCOsa67CC6@nvidia.com/
>>> If I understand this correctly, the HW or the BIOS is doing 
>>> something wrong
>>> about reporting ACS. I don't know what the GPU driver can do other 
>>> than add some
>>> quirk to stop using AMD IOMMUv2 on this HW/BIOS.
>>>
>>> It looks like the problem is triggered when the driver calls
>>> amd_iommu_init_device. That's when the first WARNs happen, soon 
>>> followed by a
>>> kernel oops in report_iommu_fault. The driver doesn't know anything 
>>> is wrong
>>> because amd_iommu_init_device seems to return "success". And the 
>>> oops is not in
>>> the GPU driver either.
>> The WARN is fixed and it's in Joerg's tree.
>> https://lore.kernel.org/all/20230111121503.5931-1-vasant.hegde@amd.com/
>>
>> report_iommu_fault() happened because in the amd_iommu_init_device() path
>> it failed to attach devices to the new domain and returned an error. But it
>> didn't put the devices back in the old domain properly. It left them in an
>> inconsistent state, which resulted in an IO page fault. I have proposed a
>> series to handle device-to-domain attachment failure and to better handle
>> the subsequent report_iommu_fault().
>> https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/ 
>>
>>
>>
>> @Matt,
>>    Can you please help to verify above patches on your system where 
>> you hit the
>> issue originally?
>>    (Grab above two series, apply it on top of latest kernel and test it)
>>
>> -Vasant
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
       [not found]                           ` <40b2da4a-a205-3cf2-0c78-c94c28b2d3f4@bell.net>
  2023-02-16 19:59                             ` Felix Kuehling
@ 2023-02-17  5:23                             ` Vasant Hegde
  1 sibling, 0 replies; 42+ messages in thread
From: Vasant Hegde @ 2023-02-17  5:23 UTC (permalink / raw)
  To: Matt Fagnani, Felix Kuehling, Bjorn Helgaas, Baolu Lu,
	Huang, Shimmer, Liu, Aaron, Jason Gunthorpe
  Cc: Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Pan, Xinhui, amd-gfx, LKML, Bjorn Helgaas,
	iommu@lists.linux.dev, Deucher, Alexander, Christian König

Matt,

Thanks a lot for testing and the dmesg log.

On 2/17/2023 12:29 AM, Matt Fagnani wrote:
> Vasant,
> 
> I applied your four patches to 6.2-rc8 and built that. The black screen, null
> pointer dereference, and warnings didn't happen when booting 6.2-rc8 with your
> patches. There were errors that the IOMMU wasn't restarted when amdgpu and
> amdkfd was starting though at kernel: kfd kfd: amdgpu: Failed to resume IOMMU
> for device 1002:9874. I don't know if those IOMMU errors were expected or not,

This patch is not for fixing the PASID enablement issue. It's more about gracefully
handling the error path.

This means the patch worked as expected, i.e. it failed to enable PASID because
of the original patch (commit 201007ef70), it didn't attach the devices to the new
domain, and it attached the devices back to the default domain.
It returned an error to the GPU saying we couldn't enable PASID/PRI. Hence we saw the
above error message.

-Vasant
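
[Editorial sketch, not part of the original message] The behavior reported in
the log (the display comes up, but the compute device is "NOT added due to
errors") is the usual pattern of treating an optional feature's init failure as
non-fatal. A toy userspace model, assuming hypothetical toy_* names that are
not the amdgpu/amdkfd or AMD IOMMU APIs:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical driver state; not a real amdgpu structure. */
struct toy_gpu {
	int display_ready;
	int compute_ready;
};

/* Hypothetical stand-in for an IOMMU call that may refuse PASID/PRI setup. */
static int toy_iommu_init_device(int pasid_available)
{
	return pasid_available ? 0 : -EINVAL;
}

/* Probe succeeds even if the optional compute path cannot be set up. */
static int toy_gpu_probe(struct toy_gpu *gpu, int pasid_available)
{
	int ret;

	gpu->display_ready = 1;	/* display does not depend on PASID in this model */

	ret = toy_iommu_init_device(pasid_available);
	if (ret) {
		fprintf(stderr, "toy: compute NOT added due to errors (%d)\n", ret);
		gpu->compute_ready = 0;
		return 0;	/* degrade gracefully instead of failing the probe */
	}

	gpu->compute_ready = 1;
	return 0;
}
```

The point of the model is only that the error is reported and contained, so the
rest of the device keeps working.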

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
  2023-02-16 19:59                             ` Felix Kuehling
@ 2023-02-17  5:36                               ` Vasant Hegde
  0 siblings, 0 replies; 42+ messages in thread
From: Vasant Hegde @ 2023-02-17  5:36 UTC (permalink / raw)
  To: Felix Kuehling, Matt Fagnani, Bjorn Helgaas, Baolu Lu,
	Huang, Shimmer, Liu, Aaron, Jason Gunthorpe
  Cc: Joerg Roedel, regressions@lists.linux.dev, Thorsten Leemhuis,
	Linux PCI, Pan, Xinhui, amd-gfx, LKML, Bjorn Helgaas,
	iommu@lists.linux.dev, Deucher, Alexander, Christian König

Hi Felix,


On 2/17/2023 1:29 AM, Felix Kuehling wrote:
>> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device
>> 1002:9874
>> Feb 16 13:22:32 kernel: kfd kfd: amdgpu: device 1002:9874 NOT added due to errors 
> This look like IOMMU device initialization still fails (but more gracefully
> now). Vasant, is that expected?

My fix is to gracefully handle failure paths in the IOMMU. So the above logs are
expected. Basically it means the IOMMU couldn't attach the devices to the new domain
(because it couldn't enable PASID on the AMD GPU as the ACS RR/UF flags are missing,
see commit 201007ef707) and we fell back to the old domain properly.

It also means that the GPU will not be able to use PASID/PRI. If you need these
features then you have to look into commit 201007ef707 and see how we can
enable PASID for the GPU (without the ACS UF/RR flags?).


> 
> This would lead to KFD not being available on Carrizo with this kernel, which is
> probably not a big limitation in practice. It would only affect compute
> applications using the ROCm user mode stack. I don't think anyone does that
> these days on these old APUs.
> 
> The SMU errors seem unrelated to this unless there is some subtle interaction
> I'm missing.

I have no idea about the GPU warning. All I can say is that the IOMMU side looks good
but PASID/PRI is not enabled for the GPU.

-Vasant
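
[Editorial sketch, not part of the original message] The ACS RR/UF requirement
mentioned above can be pictured as a walk from the device up to the root,
refusing PASID unless every hop has both Request Redirect and Upstream
Forwarding enabled. This is a simplified userspace model: the toy_* names and
bit values are illustrative assumptions, and the kernel's real check handles
additional cases that are not modeled here.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative bit values for the two ACS controls discussed in the thread. */
#define TOY_ACS_RR 0x04u	/* Request Redirect */
#define TOY_ACS_UF 0x10u	/* Upstream Forwarding */

/* Hypothetical device node; not the kernel's struct pci_dev. */
struct toy_pci_dev {
	unsigned int acs_ctrl;		/* ACS control bits that are enabled */
	struct toy_pci_dev *upstream;	/* next bridge toward the root, or NULL */
};

/* Return 1 only if every device on the path has all bits in flags enabled. */
static int toy_acs_path_enabled(const struct toy_pci_dev *dev,
				unsigned int flags)
{
	for (; dev; dev = dev->upstream)
		if ((dev->acs_ctrl & flags) != flags)
			return 0;
	return 1;
}

/* Refuse PASID when a TLP could be routed peer-to-peer instead of upstream. */
static int toy_enable_pasid(const struct toy_pci_dev *dev)
{
	if (!toy_acs_path_enabled(dev, TOY_ACS_RR | TOY_ACS_UF))
		return -EINVAL;
	return 0;
}
```

A device that reports no ACS controls at all, as the lspci output discussed
earlier in the thread suggests for this GPU, fails the check in this model and
PASID stays disabled.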



^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2023-02-17  5:37 UTC | newest]

Thread overview: 42+ messages
2022-12-30  8:18 [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled Thorsten Leemhuis
2023-01-03 10:30 ` Joerg Roedel
2023-01-03 19:06 ` Matt Fagnani
     [not found] ` <5aa0e698-f715-0481-36e5-46505024ebc1@bell.net>
2023-01-04  6:54   ` Baolu Lu
2023-01-04 15:50     ` Vasant Hegde
2023-01-05  1:09       ` Matt Fagnani
2023-01-05 10:27         ` Vasant Hegde
2023-01-05 10:37           ` Baolu Lu
2023-01-05 10:46             ` Vasant Hegde
2023-01-05 14:46               ` Deucher, Alexander
2023-01-05 15:27                 ` Felix Kuehling
2023-01-06  5:48                   ` Baolu Lu
2023-02-15 15:39                     ` Bjorn Helgaas
2023-02-16  0:35                       ` Felix Kuehling
2023-02-16  0:44                         ` Jason Gunthorpe
2023-02-16  5:37                           ` Vasant Hegde
2023-02-16 14:55                             ` Felix Kuehling
2023-02-16 14:53                           ` Felix Kuehling
2023-02-16  5:25                         ` Vasant Hegde
     [not found]                           ` <40b2da4a-a205-3cf2-0c78-c94c28b2d3f4@bell.net>
2023-02-16 19:59                             ` Felix Kuehling
2023-02-17  5:36                               ` Vasant Hegde
2023-02-17  5:23                             ` Vasant Hegde
2023-01-05 19:51           ` Matt Fagnani
2023-01-06 14:14           ` Jason Gunthorpe
2023-01-07  2:44             ` Baolu Lu
2023-01-09 13:43               ` Jason Gunthorpe
2023-01-10  5:28                 ` Baolu Lu
2023-01-10  5:48             ` Baolu Lu
2023-01-10  8:06               ` Matt Fagnani
2023-01-10 13:25               ` Jason Gunthorpe
2023-01-10 13:45                 ` Christian König
2023-01-10 13:51                   ` Jason Gunthorpe
2023-01-10 13:56                     ` Christian König
2023-01-10 20:51                       ` Matt Fagnani
2023-01-11  8:35                         ` Christian König
2023-01-10 15:05                   ` Felix Kuehling
2023-01-10 15:19                     ` Jason Gunthorpe
2023-01-10 15:21                       ` Felix Kuehling
2023-01-11  3:16                 ` Baolu Lu
2023-01-11 13:08                   ` Jason Gunthorpe
     [not found]           ` <ff26929d-9fb0-3c85-2594-dc2937c1ba9a@bell.net>
2023-01-10 16:08             ` Vasant Hegde
2023-01-10 16:12               ` Vasant Hegde
