* amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
@ 2023-02-23 23:40 Mikhail Gavrilov
  2023-02-24  7:12 ` Keyword Review - " Christian König
  2023-02-24  7:13 ` Christian König
  0 siblings, 2 replies; 13+ messages in thread
From: Mikhail Gavrilov @ 2023-02-23 23:40 UTC (permalink / raw)
To: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander, Christian König

[-- Attachment #1: Type: text/plain, Size: 2647 bytes --]

Hi,
I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. It is impossible to use it without AC power because the system loses the nvme drive when I disconnect the power adapter.

Messages from the kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

I tried the recommended parameters (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve this issue, but without success.

On the linux-nvme mailing list the latest advice was to try the "pci=nocrs" parameter. But with this parameter the amdgpu driver refuses to work and makes the system unbootable. I can work around the boot problem by blacklisting the driver, but that is not a good solution because I don't want to lose the GPU.

Why does amdgpu not work with "pci=nocrs", and is it possible to solve this incompatibility? This is important because when I boot without the amdgpu driver and with "pci=nocrs", the nvme drive is not lost when I disconnect the power adapter. So "pci=nocrs" really helps.
Below is what I see in the kernel log when the "pci=nocrs" parameter is added:

amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v10_0> failed -12
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.

--
Best Regards,
Mike Gavrilov.

[-- Attachment #2: system-log-Fatal-error-during-GPU-init.tar.xz --]
[-- Type: application/x-xz, Size: 40988 bytes --]

^ permalink raw reply [flat|nested] 13+ messages in thread
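[Editor's note] The nvme driver's hint above asks the user to add kernel parameters. A minimal sketch of how such parameters are typically appended on a GRUB-based distribution; the config contents are invented for illustration, a temp file stands in for /etc/default/grub so the logic is reproducible, and on a real system you would follow with `grub2-mkconfig` (or `update-grub`) and a reboot:

```shell
# Work on a stand-in for /etc/default/grub so this sketch is safe to run.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="rhgb quiet"
EOF

# Append the parameters suggested by the nvme driver inside the existing
# double-quoted GRUB_CMDLINE_LINUX value (portable sed, no -i).
sed 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"/' \
    "$conf" > "$conf.new" && mv "$conf.new" "$conf"

grep '^GRUB_CMDLINE_LINUX' "$conf"
# On a real system: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && reboot
```

The same approach works for testing "pci=nocrs"; editing the line once in the bootloader config beats retyping it at the boot prompt on every boot.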
* Keyword Review - Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
  2023-02-23 23:40 amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" Mikhail Gavrilov
@ 2023-02-24  7:12 ` Christian König
  2023-02-24  7:13 ` Christian König
  1 sibling, 0 replies; 13+ messages in thread
From: Christian König @ 2023-02-24 7:12 UTC (permalink / raw)
To: Mikhail Gavrilov, amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander

Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows the BIOS reports through ACPI. This then most likely leads to problems with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help understand the problem, but I strongly suggest trying a BIOS update first.

Regards,
Christian.

Am 24.02.23 um 00:40 schrieb Mikhail Gavrilov:
> Hi,
> I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
> it is impossible to use without AC power because the system losts nvme
> when I disconnect the power adapter.
>
> Messages from kernel log when it happens:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> nvme nvme0: Does your device have a faulty power saving mode enabled?
> nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
> and report a bug
>
> I tried to use recommended parameters
> (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
> this issue, but without successed.
>
> In the linux-nvme mail list the last advice was to try the "pci=nocrs"
> parameter.
>
> But with this parameter the amdgpu driver refuses to work and makes
> the system unbootable. I can solve the problem with the booting system
> by blacklisting the driver but it is not a good solution, because I
> don't wanna lose the GPU.
> > Why amdgpu not work with "pci=nocrs" ? > And is it possible to solve this incompatibility? > It is very important because when I boot the system without amdgpu > driver with "pci=nocrs" nvme is not losts when I disconnect the power > adapter. So "pci=nocrs" really helps. > > Below that I see in kernel log when adds "pci=nocrs" parameter: > > amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM > amdgpu: ATOM BIOS: SWBRT77321.001 > [drm] VCN(0) decode is enabled in VM mode > [drm] VCN(0) encode is enabled in VM mode > [drm] JPEG decode is enabled in VM mode > Console: switching to colour dummy device 80x25 > amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature > disabled as experimental (default) > [drm] GPU posting now... > [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment > size is 9-bit > amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - > 0x00000082FEFFFFFF (12272M used) > amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF > amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - > 0x0000FFFFFFFFFFFF > [drm] Detected VRAM RAM=12272M, BAR=16384M > [drm] RAM width 192bits GDDR6 > [drm] amdgpu: 12272M of VRAM memory ready > [drm] amdgpu: 31774M of GTT memory ready. > amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo > [drm] Debug VRAM access will use slowpath MM access > amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page > [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block > <gmc_v10_0> failed -12 > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed > amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init > amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device. > > Of course a full system log is also attached. > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
  2023-02-23 23:40 amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" Mikhail Gavrilov
  2023-02-24  7:12 ` Keyword Review - " Christian König
@ 2023-02-24  7:13 ` Christian König
  2023-02-24  8:38   ` Mikhail Gavrilov
  1 sibling, 1 reply; 13+ messages in thread
From: Christian König @ 2023-02-24 7:13 UTC (permalink / raw)
To: Mikhail Gavrilov, amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander

Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows the BIOS reports through ACPI. This then most likely leads to problems with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help understand the problem, but I strongly suggest trying a BIOS update first.

Regards,
Christian.

Am 24.02.23 um 00:40 schrieb Mikhail Gavrilov:
> Hi,
> I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
> it is impossible to use without AC power because the system losts nvme
> when I disconnect the power adapter.
>
> Messages from kernel log when it happens:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> nvme nvme0: Does your device have a faulty power saving mode enabled?
> nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
> and report a bug
>
> I tried to use recommended parameters
> (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
> this issue, but without successed.
>
> In the linux-nvme mail list the last advice was to try the "pci=nocrs"
> parameter.
>
> But with this parameter the amdgpu driver refuses to work and makes
> the system unbootable.
>
I can solve the problem with the booting system > by blacklisting the driver but it is not a good solution, because I > don't wanna lose the GPU. > > Why amdgpu not work with "pci=nocrs" ? > And is it possible to solve this incompatibility? > It is very important because when I boot the system without amdgpu > driver with "pci=nocrs" nvme is not losts when I disconnect the power > adapter. So "pci=nocrs" really helps. > > Below that I see in kernel log when adds "pci=nocrs" parameter: > > amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM > amdgpu: ATOM BIOS: SWBRT77321.001 > [drm] VCN(0) decode is enabled in VM mode > [drm] VCN(0) encode is enabled in VM mode > [drm] JPEG decode is enabled in VM mode > Console: switching to colour dummy device 80x25 > amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature > disabled as experimental (default) > [drm] GPU posting now... > [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment > size is 9-bit > amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - > 0x00000082FEFFFFFF (12272M used) > amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF > amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - > 0x0000FFFFFFFFFFFF > [drm] Detected VRAM RAM=12272M, BAR=16384M > [drm] RAM width 192bits GDDR6 > [drm] amdgpu: 12272M of VRAM memory ready > [drm] amdgpu: 31774M of GTT memory ready. > amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo > [drm] Debug VRAM access will use slowpath MM access > amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page > [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block > <gmc_v10_0> failed -12 > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed > amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init > amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device. > > Of course a full system log is also attached. > ^ permalink raw reply [flat|nested] 13+ messages in thread
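[Editor's note] Christian's `sudo lspci -vvvv -s $BUSID_OF_AMDGPU` needs the GPU's PCI bus ID first. A sketch of one way to find it; a captured `lspci -D`-style fragment (modeled on this thread's machine) stands in for live output so the filter itself is reproducible:

```shell
# Sample lspci -D output; on a live system replace the variable with
# the real command: lspci -D
lspci_out='0000:03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT / 6800M] (rev c3)
0000:08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series] (rev c4)'

# The bus ID is the first whitespace-separated field of each AMD/ATI line.
busids=$(printf '%s\n' "$lspci_out" | awk '/\[AMD\/ATI\]/ {print $1}')
printf '%s\n' "$busids"

# With real hardware you would then run:
#   for id in $busids; do sudo lspci -vvvv -s "$id"; done
```

On a hybrid-graphics laptop like this one there are two AMD devices (iGPU and dGPU), so collecting all matching IDs rather than the first is the safer habit.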
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 7:13 ` Christian König @ 2023-02-24 8:38 ` Mikhail Gavrilov 2023-02-24 12:29 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-02-24 8:38 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander [-- Attachment #1: Type: text/plain, Size: 2647 bytes --] On Fri, Feb 24, 2023 at 12:13 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > Hi Mikhail, > > this is pretty clearly a problem with the system and/or it's BIOS and > not the GPU hw or the driver. > > The option pci=nocrs makes the kernel ignore additional resource windows > the BIOS reports through ACPI. This then most likely leads to problems > with amdgpu because it can't bring up its PCIe resources any more. > > The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help > understand the problem I attach both lspci for pci=nocrs and without pci=nocrs. 
The differences for Cezanne Radeon Vega Series:

with pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 4: I/O ports at e000 [disabled] [size=256]
Capabilities: [c0] MSI-X: Enable- Count=4 Masked-

Without pci=nocrs:
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Interrupt: pin A routed to IRQ 44
Region 4: I/O ports at e000 [size=256]
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-

The differences for Navi 22 Radeon 6800M:

with pci=nocrs:
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]
AtomicOpsCtl: ReqEn-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000

Without pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 103
Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
AtomicOpsCtl: ReqEn+
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 0000

> but I strongly suggest to try a BIOS update first.
This was the first thing I did. And I am afraid there will be no more BIOS updates:
https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/

I also have experience in dealing with manufacturers' tech support. Usually it ends with "we do not provide drivers for Linux".

--
Best Regards,
Mike Gavrilov.
[-- Attachment #2: lspci-with-pci=nocrs.txt --] [-- Type: text/plain, Size: 8178 bytes --] ❯ sudo lspci -vvvv -s 08:00.0 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. Radeon Vega 8 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 255 IOMMU group: 7 Region 0: Memory at fc20000000 (64-bit, prefetchable) [size=256M] Region 2: Memory at fc30000000 (64-bit, prefetchable) [size=2M] Region 4: I/O ports at e000 [disabled] [size=256] Region 5: Memory at fc900000 (32-bit, non-prefetchable) [size=512K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not 
Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [c0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=5 offset=00042000 PBA: BAR=5 offset=00043000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2b0 v1] Address Translation Service (ATS) ATSCap: Invalidate Queue Depth: 00 ATSCtl: Enable+, Smallest Translation Unit: 00 Capabilities: [2c0 v1] Page Request Interface (PRI) PRICtl: Enable- Reset- PRISta: RF- UPRGI- Stopped+ Page Request Capacity: 00000100, Page Request Allocation: 00000000 Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [400 v1] Data Link Feature <?> Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel modules: amdgpu ❯ ❯ sudo lspci -vvvv -s 03:00.0 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c3) Subsystem: ASUSTeK Computer Inc. 
Radeon RX 6800M Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 255 IOMMU group: 12 Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G] Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M] Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M] Expansion ROM at fcb00000 [disabled] [size=128K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 16GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit+ 64bit+ 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- 
SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [200 v1] Physical Resizable BAR BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB Capabilities: [240 v1] Power Budgeting <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2a0 v1] Access Control Services ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [320 v1] Latency Tolerance Reporting Max snoop latency: 1048576ns Max no snoop latency: 1048576ns Capabilities: [410 
v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel modules: amdgpu [-- Attachment #3: lspci.txt --] [-- Type: text/plain, Size: 8231 bytes --] ❯ sudo lspci -vvvv -s 08:00.0 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. Radeon Vega 8 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 44 IOMMU group: 7 Region 0: Memory at fc20000000 (64-bit, prefetchable) [size=256M] Region 2: Memory at fc30000000 (64-bit, prefetchable) [size=2M] Region 4: I/O ports at e000 [size=256] Region 5: Memory at fc900000 (32-bit, non-prefetchable) [size=512K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq- 
OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [c0] MSI-X: Enable+ Count=4 Masked- Vector table: BAR=5 offset=00042000 PBA: BAR=5 offset=00043000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2b0 v1] Address Translation Service (ATS) ATSCap: Invalidate Queue Depth: 00 ATSCtl: Enable+, Smallest Translation Unit: 00 Capabilities: [2c0 v1] Page Request Interface (PRI) PRICtl: Enable- Reset- PRISta: RF- UPRGI- Stopped+ Page Request Capacity: 00000100, Page Request Allocation: 00000000 Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [400 v1] Data Link Feature <?> Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel driver in use: amdgpu Kernel modules: amdgpu ❯ ❯ sudo lspci -vvvv -s 03:00.0 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c3) Subsystem: ASUSTeK Computer Inc. 
Radeon RX 6800M Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 103 IOMMU group: 12 Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G] Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M] Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M] Expansion ROM at fcb00000 [disabled] [size=128K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 16GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit+ 64bit+ 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn+ LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 16GT/s, 
EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00000 Data: 0000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [200 v1] Physical Resizable BAR BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB Capabilities: [240 v1] Power Budgeting <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2a0 v1] Access Control Services ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [320 v1] Latency Tolerance Reporting Max snoop latency: 1048576ns Max no snoop latency: 1048576ns 
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel driver in use: amdgpu Kernel modules: amdgpu ^ permalink raw reply [flat|nested] 13+ messages in thread
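[Editor's note] The key signal in the two attachments above is the `Control:` line: under pci=nocrs the Navi 22 shows `Mem-` and `BusMaster-`, i.e. memory decoding and bus mastering are disabled, which is why the driver can't touch its BARs. A sketch of reading such a flag mechanically from lspci output; the helper name `check` is made up for illustration, and the sample line is the Navi 22 entry from the pci=nocrs attachment:

```shell
# Sample "Control:" line for the Navi 22 dGPU under pci=nocrs.
ctl='Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-'

# lspci prints each command-register bit as NAME+ (set) or NAME- (clear).
check() {  # usage: check <Control line> <flag>  -> prints on/off/unknown
  case " $1 " in
    *" $2+ "*) echo on ;;
    *" $2- "*) echo off ;;
    *) echo unknown ;;
  esac
}

echo "Mem: $(check "$ctl" Mem), BusMaster: $(check "$ctl" BusMaster)"
# prints: Mem: off, BusMaster: off
```

With a live device the same check can be run as `check "$(sudo lspci -vvvv -s 03:00.0 | grep '^\s*Control:')" Mem`, which makes before/after comparisons like the one in this mail scriptable.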
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 8:38 ` Mikhail Gavrilov @ 2023-02-24 12:29 ` Christian König 2023-02-24 15:31 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-24 12:29 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 24.02.23 um 09:38 schrieb Mikhail Gavrilov: > On Fri, Feb 24, 2023 at 12:13 PM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Hi Mikhail, >> >> this is pretty clearly a problem with the system and/or it's BIOS and >> not the GPU hw or the driver. >> >> The option pci=nocrs makes the kernel ignore additional resource windows >> the BIOS reports through ACPI. This then most likely leads to problems >> with amdgpu because it can't bring up its PCIe resources any more. >> >> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help >> understand the problem > I attach both lspci for pci=nocrs and without pci=nocrs. 
>
> The differences for Cezanne Radeon Vega Series:
> with pci=nocrs:
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Interrupt: pin A routed to IRQ 255
> Region 4: I/O ports at e000 [disabled] [size=256]
> Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
>
> Without pci=nocrs:
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Interrupt: pin A routed to IRQ 44
> Region 4: I/O ports at e000 [size=256]
> Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
>
>
> The differences for Navi 22 Radeon 6800M:
> with pci=nocrs:
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Interrupt: pin A routed to IRQ 255
> Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
> Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]

Well that explains it. When the PCI subsystem has to disable the BARs of the GPU we can't access it any more.

The only thing we could do is to make sure that the driver at least fails gracefully.

Do you still have network access to the box when amdgpu fails to load and could grab whatever is in dmesg?

Thanks,
Christian.
> AtomicOpsCtl: ReqEn- > Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ > Address: 0000000000000000 Data: 0000 > > Without pci=nocrs: > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Latency: 0, Cache Line Size: 64 bytes > Interrupt: pin A routed to IRQ 103 > Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G] > Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M] > Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M] > AtomicOpsCtl: ReqEn+ > Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > Address: 00000000fee00000 Data: 0000 > >> but I strongly suggest to try a BIOS update first. > This is the first thing that was done. And I am afraid no more BIOS updates. > https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/ > > I also have experience in dealing with manufacturers' tech support. > Usually it ends with "we do not provide drivers for Linux". > ^ permalink raw reply [flat|nested] 13+ messages in thread
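The Control-register bits compared in the lspci diffs above can be decoded mechanically. A minimal sketch (the `check_decode` helper is illustrative, not an existing tool; the sample Control lines are the ones quoted above, and on a live system you would feed in `sudo lspci -vvv -s <BUSID>` output instead):

```shell
#!/bin/sh
# Decode an lspci "Control:" line to see whether memory decode and bus
# mastering are enabled. With pci=nocrs the Navi 22 GPU shows Mem- and
# BusMaster-, which is why amdgpu cannot reach the hardware.
check_decode() {
    line="$1"
    case "$line" in *"Mem+"*) mem=enabled ;; *) mem=disabled ;; esac
    case "$line" in *"BusMaster+"*) bm=enabled ;; *) bm=disabled ;; esac
    printf 'memory decode: %s, bus mastering: %s\n' "$mem" "$bm"
}

# With pci=nocrs (as quoted above):
check_decode "Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-"
# Without pci=nocrs:
check_decode "Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-"
```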
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 12:29 ` Christian König @ 2023-02-24 15:31 ` Christian König 2023-02-24 16:21 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-24 15:31 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 24.02.23 um 13:29 schrieb Christian König: > Am 24.02.23 um 09:38 schrieb Mikhail Gavrilov: >> On Fri, Feb 24, 2023 at 12:13 PM Christian König >> <ckoenig.leichtzumerken@gmail.com> wrote: >>> Hi Mikhail, >>> >>> this is pretty clearly a problem with the system and/or it's BIOS and >>> not the GPU hw or the driver. >>> >>> The option pci=nocrs makes the kernel ignore additional resource >>> windows >>> the BIOS reports through ACPI. This then most likely leads to problems >>> with amdgpu because it can't bring up its PCIe resources any more. >>> >>> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help >>> understand the problem >> I attach both lspci for pci=nocrs and without pci=nocrs. 
>> >> The differences for Cezanne Radeon Vega Series: >> with pci=nocrs: >> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx- >> Interrupt: pin A routed to IRQ 255 >> Region 4: I/O ports at e000 [disabled] [size=256] >> Capabilities: [c0] MSI-X: Enable- Count=4 Masked- >> >> Without pci=nocrs: >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx+ >> Interrupt: pin A routed to IRQ 44 >> Region 4: I/O ports at e000 [size=256] >> Capabilities: [c0] MSI-X: Enable+ Count=4 Masked- >> >> >> The differences for Navi 22 Radeon 6800M: >> with pci=nocrs: >> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx- >> Interrupt: pin A routed to IRQ 255 >> Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] >> [size=16G] >> Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] >> [size=256M] >> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] >> [size=1M] > > Well that explains it. When the PCI subsystem has to disable the BARs > of the GPU we can't access it any more. > > The only thing we could do is to make sure that the driver at least > fails gracefully. > > Do you still have network access to the box when amdgpu fails to load > and could grab whatevery is in dmesg? Sorry I totally missed that you attached the full dmesg to your original mail. Yeah, the driver did fail gracefully. But then X doesn't come up and then gdm just dies. Sorry there is really nothing we can do here, maybe ping somebody with more ACPI background for help. Regards, Christian. > > Thanks, > Christian. 
> >> AtomicOpsCtl: ReqEn- >> Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ >> Address: 0000000000000000 Data: 0000 >> >> Without pci=nocrs: >> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx+ >> Latency: 0, Cache Line Size: 64 bytes >> Interrupt: pin A routed to IRQ 103 >> Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G] >> Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M] >> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M] >> AtomicOpsCtl: ReqEn+ >> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ >> Address: 00000000fee00000 Data: 0000 >> >>> but I strongly suggest to try a BIOS update first. >> This is the first thing that was done. And I am afraid no more BIOS >> updates. >> https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/ >> >> >> I also have experience in dealing with manufacturers' tech support. >> Usually it ends with "we do not provide drivers for Linux". >> > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 15:31 ` Christian König @ 2023-02-24 16:21 ` Mikhail Gavrilov 2023-02-27 10:22 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-02-24 16:21 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander On Fri, Feb 24, 2023 at 8:31 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > Sorry I totally missed that you attached the full dmesg to your original > mail. > > Yeah, the driver did fail gracefully. But then X doesn't come up and > then gdm just dies. Are you sure that these messages should be present when the driver fails gracefully? turning off the locking correctness validator. CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 Call Trace: <TASK> dump_stack_lvl+0x57/0x90 register_lock_class+0x47d/0x490 __lock_acquire+0x74/0x21f0 ? lock_release+0x155/0x450 lock_acquire+0xd2/0x320 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] ? lock_is_held_type+0xce/0x120 _raw_spin_lock_irqsave+0x4d/0xa0 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu] amdgpu_driver_load_kms+0xe8/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 </TASK> amdgpu: probe of 0000:03:00.0 failed with error -12 amdgpu 0000:08:00.0: enabling device (0006 -> 0007) [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4). list_add corruption. prev->next should be next (ffffffffc0940328), but was 0000000000000000. (prev=ffff8c9b734062b0). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:30! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 RIP: 0010:__list_add_valid+0x74/0x90 Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 PKRU: 55555554 Call Trace: <TASK> ttm_device_init+0x184/0x1c0 [ttm] amdgpu_ttm_init+0xb8/0x610 [amdgpu] ? _printk+0x60/0x80 gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu] amdgpu_device_init+0x14e5/0x2520 [amdgpu] amdgpu_driver_load_kms+0x15/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 </TASK> Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile hid_multitouch polyval_generic drm_display_helper nvme rfkill ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi video i2c_hid wmi ip6_tables ip_tables fuse ---[ end trace 0000000000000000 ]--- RIP: 0010:__list_add_valid+0x74/0x90 Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 PKRU: 55555554 
(udev-worker) (470) used greatest stack depth: 12416 bytes left I thought that failing gracefully meant switching to svga mode and showing the desktop with software rendering (exactly as happens when I blacklist the amdgpu driver). Currently the boot process gets stuck and the local console is unavailable. -- Best Regards, Mike Gavrilov. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 16:21 ` Mikhail Gavrilov @ 2023-02-27 10:22 ` Christian König 2023-02-28 9:52 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-27 10:22 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 24.02.23 um 17:21 schrieb Mikhail Gavrilov: > On Fri, Feb 24, 2023 at 8:31 PM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Sorry I totally missed that you attached the full dmesg to your original >> mail. >> >> Yeah, the driver did fail gracefully. But then X doesn't come up and >> then gdm just dies. > Are you sure that these messages should be present when the driver > fails gracefully? Unfortunately yes. We could clean that up a bit more so that you don't run into a BUG() assertion, but what essentially happens here is that we completely fail to talk to the hardware. In this situation we can't even re-enable vesa or text console any more. Regards, Christian. > > turning off the locking correctness validator. > CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L > ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug > #1 > Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, > BIOS G513QY.320 09/07/2022 > Call Trace: > <TASK> > dump_stack_lvl+0x57/0x90 > register_lock_class+0x47d/0x490 > __lock_acquire+0x74/0x21f0 > ? lock_release+0x155/0x450 > lock_acquire+0xd2/0x320 > ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] > ? lock_is_held_type+0xce/0x120 > _raw_spin_lock_irqsave+0x4d/0xa0 > ? 
amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] > amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] > amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu] > amdgpu_driver_load_kms+0xe8/0x190 [amdgpu] > amdgpu_pci_probe+0x140/0x420 [amdgpu] > local_pci_probe+0x41/0x90 > pci_device_probe+0xc3/0x230 > really_probe+0x1b6/0x410 > __driver_probe_device+0x78/0x170 > driver_probe_device+0x1f/0x90 > __driver_attach+0xd2/0x1c0 > ? __pfx___driver_attach+0x10/0x10 > bus_for_each_dev+0x8a/0xd0 > bus_add_driver+0x141/0x230 > driver_register+0x77/0x120 > ? __pfx_init_module+0x10/0x10 [amdgpu] > do_one_initcall+0x6e/0x350 > do_init_module+0x4a/0x220 > __do_sys_init_module+0x192/0x1c0 > do_syscall_64+0x5b/0x80 > ? asm_exc_page_fault+0x22/0x30 > ? lockdep_hardirqs_on+0x7d/0x100 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7fd58cfcb1be > Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f > 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d > 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 > RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af > RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be > RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 > RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 > R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 > R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 > </TASK> > amdgpu: probe of 0000:03:00.0 failed with error -12 > amdgpu 0000:08:00.0: enabling device (0006 -> 0007) > [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4). > > > list_add corruption. prev->next should be next (ffffffffc0940328), but > was 0000000000000000. (prev=ffff8c9b734062b0). > ------------[ cut here ]------------ > kernel BUG at lib/list_debug.c:30! 
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L > ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug > #1 > Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, > BIOS G513QY.320 09/07/2022 > RIP: 0010:__list_add_valid+0x74/0x90 > Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b > 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b > 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d > RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 > RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff > RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 > R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 > R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 > FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 > PKRU: 55555554 > Call Trace: > <TASK> > ttm_device_init+0x184/0x1c0 [ttm] > amdgpu_ttm_init+0xb8/0x610 [amdgpu] > ? _printk+0x60/0x80 > gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu] > amdgpu_device_init+0x14e5/0x2520 [amdgpu] > amdgpu_driver_load_kms+0x15/0x190 [amdgpu] > amdgpu_pci_probe+0x140/0x420 [amdgpu] > local_pci_probe+0x41/0x90 > pci_device_probe+0xc3/0x230 > really_probe+0x1b6/0x410 > __driver_probe_device+0x78/0x170 > driver_probe_device+0x1f/0x90 > __driver_attach+0xd2/0x1c0 > ? __pfx___driver_attach+0x10/0x10 > bus_for_each_dev+0x8a/0xd0 > bus_add_driver+0x141/0x230 > driver_register+0x77/0x120 > ? __pfx_init_module+0x10/0x10 [amdgpu] > do_one_initcall+0x6e/0x350 > do_init_module+0x4a/0x220 > __do_sys_init_module+0x192/0x1c0 > do_syscall_64+0x5b/0x80 > ? asm_exc_page_fault+0x22/0x30 > ? 
lockdep_hardirqs_on+0x7d/0x100 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7fd58cfcb1be > Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f > 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d > 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48 > RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af > RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be > RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 > RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 > R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 > R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 > </TASK> > Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi > iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul > sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile > hid_multitouch polyval_generic drm_display_helper nvme rfkill > ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw > sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi > video i2c_hid wmi ip6_tables ip_tables fuse > ---[ end trace 0000000000000000 ]--- > RIP: 0010:__list_add_valid+0x74/0x90 > Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b > 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b > 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d > RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 > RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff > RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 > R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 > R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 > FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 
000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 > PKRU: 55555554 > (udev-worker) (470) used greatest stack depth: 12416 bytes left > > I thought that gracefully means switching to svga mode and showing the > desktop with software rendering (exactly as it happens when I > blacklist amdgpu driver). Currently the boot process stucking and the > local console is unavailable. > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-27 10:22 ` Christian König @ 2023-02-28 9:52 ` Mikhail Gavrilov 2023-02-28 12:43 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-02-28 9:52 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander [-- Attachment #1: Type: text/plain, Size: 1056 bytes --] On Mon, Feb 27, 2023 at 3:22 PM Christian König > > Unfortunately yes. We could clean that up a bit more so that you don't > run into a BUG() assertion, but what essentially happens here is that we > completely fail to talk to the hardware. > > In this situation we can't even re-enable vesa or text console any more. > Then I don't understand why, when amdgpu is blacklisted via modprobe.blacklist=amdgpu, I still see graphics and can log into GNOME. Yes, without hardware acceleration, but that is better than non-working graphics. It means there is some other driver (I assume this is "video") which can successfully talk to the AMD hardware under conditions where amdgpu cannot. My suggestion is that if amdgpu fails to talk to the hardware, another suitable driver should be allowed to do it. I attached a system log with "pci=nocrs" and "modprobe.blacklist=amdgpu" applied, showing that graphics work correctly in this case. To do this, does the Linux module loading mechanism need to be refined? -- Best Regards, Mike Gavrilov. [-- Attachment #2: system-without-amdgpu.tar.xz --] [-- Type: application/x-xz, Size: 41716 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-28 9:52 ` Mikhail Gavrilov @ 2023-02-28 12:43 ` Christian König 2023-12-15 11:45 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-28 12:43 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 28.02.23 um 10:52 schrieb Mikhail Gavrilov: > On Mon, Feb 27, 2023 at 3:22 PM Christian König >> Unfortunately yes. We could clean that up a bit more so that you don't >> run into a BUG() assertion, but what essentially happens here is that we >> completely fail to talk to the hardware. >> >> In this situation we can't even re-enable vesa or text console any more. >> > Then I don't understand why when amdgpu is blacklisted via > modprobe.blacklist=amdgpu then I see graphics and could login into > GNOME. Yes without hardware acceleration, but it is better than non > working graphics. It means there is some other driver (I assume this > is "video") which can successfully talk to the AMD hardware in > conditions where amdgpu cannot do this. The point is it doesn't need to talk to the amdgpu hardware. What it does is that it talks to the good old VGA/VESA emulation and that just happens to be still enabled by the BIOS/GRUB. And that VGA/VESA emulation doesn't need any BAR or whatever to keep the hw running in the state where it was initialized before the kernel started. The kernel just grabs the addresses where it needs to write the display data and keeps going with that. But when a hw specific driver wants to load this is the first thing which gets disabled because we need to load new firmware. And with the BARs disabled this can't be re-enabled without rebooting the system. > My suggestion is that if > amdgpu fails to talk to the hardware, then let another suitable driver > do it. 
I attached a system log when I apply "pci=nocrs" with > "modprobe.blacklist=amdgpu" for showing that graphics work right in > this case. > To do this, does the Linux module loading mechanism need to be refined? That's actually working as expected. The real problem is that the BIOS on that system is so broken that we can't access the hw correctly. What we could do is to check the BARs very early on and refuse to load when they are disabled. The problem with this approach is that there are systems where it is normal for the BARs to be disabled until the driver loads and get enabled during the hardware initialization process. What you might want to look into is to find a quirk for the BIOS to properly enable the nvme controller. Regards, Christian. ^ permalink raw reply [flat|nested] 13+ messages in thread
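The early BAR check described above can be approximated from userspace by reading the PCI command register out of sysfs config space. A rough sketch, assuming a POSIX shell; the `pci_mem_decode_enabled` helper is invented for illustration, and the BDF 0000:03:00.0 is the Navi 22 GPU from this thread:

```shell
#!/bin/sh
# Read the PCI command register (config-space bytes 0x04-0x05) and test
# bit 1, "Memory Space Enable". With pci=nocrs this bit ends up cleared
# on the GPU, matching the [disabled] BARs in the lspci output above.
pci_mem_decode_enabled() {
    cfg="$1"                                    # path to a config-space file
    cmd_lo=$(od -An -tu1 -j4 -N1 "$cfg" | tr -d ' ')   # low byte of command reg
    [ $(( cmd_lo & 2 )) -ne 0 ]
}

dev=/sys/bus/pci/devices/0000:03:00.0/config    # substitute your own BDF
if [ -r "$dev" ]; then
    if pci_mem_decode_enabled "$dev"; then
        echo "memory decode on: BARs usable, driver load should be safe"
    else
        echo "memory decode off: a hw-specific driver cannot reach the device"
    fi
fi
```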
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-28 12:43 ` Christian König @ 2023-12-15 11:45 ` Mikhail Gavrilov 2023-12-15 12:37 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-12-15 11:45 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander [-- Attachment #1: Type: text/plain, Size: 1941 bytes --] On Tue, Feb 28, 2023 at 5:43 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > The point is it doesn't need to talk to the amdgpu hardware. What it > does is that it talks to the good old VGA/VESA emulation and that just > happens to be still enabled by the BIOS/GRUB. > > And that VGA/VESA emulation doesn't need any BAR or whatever to keep the > hw running in the state where it was initialized before the kernel > started. The kernel just grabs the addresses where it needs to write the > display data and keeps going with that. > > But when a hw specific driver wants to load this is the first thing > which gets disabled because we need to load new firmware. And with the > BARs disabled this can't be re-enabled without rebooting the system. > > > My suggestion is that if > > amdgpu fails to talk to the hardware, then let another suitable driver > > do it. I attached a system log when I apply "pci=nocrs" with > > "modprobe.blacklist=amdgpu" for showing that graphics work right in > > this case. > > To do this, does the Linux module loading mechanism need to be refined? > > That's actually working as expected. The real problem is that the BIOS > on that system is so broken that we can't access the hw correctly. > > What we could to do is to check the BARs very early on and refuse to > load when they are disable. The problem with this approach is that there > are systems where it is normal that the BARs are disable until the > driver loads and get enabled during the hardware initialization process. 
> > What you might want to look into is to find a quirk for the BIOS to > properly enable the nvme controller. > That's interesting. I noticed that amdgpu now works even with the [pci=nocrs] parameter on 6.7.0-0.rc4 and higher kernels. Does that mean the BARs became available? I attached the kernel log and lspci output here. What changed? -- Best Regards, Mike Gavrilov. [-- Attachment #2: dmesg-nvme-down-2.zip --] [-- Type: application/zip, Size: 46571 bytes --] [-- Attachment #3: lspci.zip --] [-- Type: application/zip, Size: 2710 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-12-15 11:45 ` Mikhail Gavrilov @ 2023-12-15 12:37 ` Christian König 2023-12-19 9:45 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-12-15 12:37 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 15.12.23 um 12:45 schrieb Mikhail Gavrilov: > On Tue, Feb 28, 2023 at 5:43 PM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> The point is it doesn't need to talk to the amdgpu hardware. What it >> does is that it talks to the good old VGA/VESA emulation and that just >> happens to be still enabled by the BIOS/GRUB. >> >> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the >> hw running in the state where it was initialized before the kernel >> started. The kernel just grabs the addresses where it needs to write the >> display data and keeps going with that. >> >> But when a hw specific driver wants to load this is the first thing >> which gets disabled because we need to load new firmware. And with the >> BARs disabled this can't be re-enabled without rebooting the system. >> >>> My suggestion is that if >>> amdgpu fails to talk to the hardware, then let another suitable driver >>> do it. I attached a system log when I apply "pci=nocrs" with >>> "modprobe.blacklist=amdgpu" for showing that graphics work right in >>> this case. >>> To do this, does the Linux module loading mechanism need to be refined? >> That's actually working as expected. The real problem is that the BIOS >> on that system is so broken that we can't access the hw correctly. >> >> What we could to do is to check the BARs very early on and refuse to >> load when they are disable. 
The problem with this approach is that there >> are systems where it is normal that the BARs are disable until the >> driver loads and get enabled during the hardware initialization process. >> >> What you might want to look into is to find a quirk for the BIOS to >> properly enable the nvme controller. >> > That's interesting. I noticed that now amdgpu could work even with > parameter [pci=nocrs] on 6.7.0-0.rc4 and higher kernels. > It means BARs became available? > I attached here the kerner log and lspci. What's changed? I have no idea :) From the logs I can see that the AMDGPU now has the proper BARs assigned: [ 5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000 [ 5.722051] pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref] [ 5.722081] pci 0000:03:00.0: reg 0x18: [mem 0xfc00000000-0xfc0fffffff 64bit pref] [ 5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff] [ 5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref] [ 5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold [ 5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link) And with that the driver can work perfectly fine. Have you updated the BIOS or added/removed some other hardware? Maybe somebody added a quirk for your BIOS into the PCIe code or something like that. Regards, Christian. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-12-15 12:37 ` Christian König @ 2023-12-19 9:45 ` Mikhail Gavrilov 0 siblings, 0 replies; 13+ messages in thread From: Mikhail Gavrilov @ 2023-12-19 9:45 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander On Fri, Dec 15, 2023 at 5:37 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > I have no idea :) > > From the logs I can see that the AMDGPU now has the proper BARs assigned: > > [ 5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000 > [ 5.722051] pci 0000:03:00.0: reg 0x10: [mem > 0xf800000000-0xfbffffffff 64bit pref] > [ 5.722081] pci 0000:03:00.0: reg 0x18: [mem > 0xfc00000000-0xfc0fffffff 64bit pref] > [ 5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff] > [ 5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref] > [ 5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold > [ 5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, > limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 > Gb/s with 16.0 GT/s PCIe x16 link) > > And with that the driver can work perfectly fine. > > Have you updated the BIOS or added/removed some other hardware? Maybe > somebody added a quirk for your BIOS into the PCIe code or something > like that. No, nothing changed in hardware. But I found the commit which fixes it. > git bisect unfixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 Author: Vasant Hegde <vasant.hegde@amd.com> Date: Thu Sep 21 09:21:45 2023 +0000 iommu/amd: Introduce iommu_dev_data.flags to track device capabilities Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device ATS, PRI, and PASID capabilities. But these capabilities can be enabled independently (except PRI requires ATS support). 
    Hence, replace the iommu_v2 variable with a flags variable, which keeps
    track of the device capabilities.

    From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability
    with all VFs"), device PRI/PASID is shared between PF and any associated
    VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of
    pci_find_ext_capability() to check device PRI/PASID support.

    Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
    Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.hegde@amd.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

 drivers/iommu/amd/amd_iommu_types.h |  3 ++-
 drivers/iommu/amd/iommu.c           | 46 ++++++++++++++++++++++---------------
 2 files changed, 30 insertions(+), 19 deletions(-)

> git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag 'media/v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag 'exfat-for-6.7-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd: Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed: [739eb25514c90aa8ea053ed4d2b971f531e63ded] iommu/amd: Introduce iommu_dev_data.ppr
git bisect unfixed 739eb25514c90aa8ea053ed4d2b971f531e63ded
# first fixed commit: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

--
Best Regards,
Mike Gavrilov.
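[Editor's note: the bisect log above is a "reverse" bisect — git hunts for the commit that fixed the bug rather than the one that introduced it, using the alternate terms set by `git bisect start --term-new=fixed --term-old=unfixed`. The mechanics can be demonstrated in a throwaway repository; everything below (the temp repo, state.txt, the "commit N" messages, the pretend fix at commit 4) is invented for the demo, only the term-swapping and `git bisect run` exit-code convention match what Mikhail did.]

```shell
set -e
# Build a tiny history of 5 commits in a temporary repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.invalid
git config user.name demo
for i in 1 2 3 4 5; do
    echo "$i" > state.txt
    git add state.txt
    git commit -qm "commit $i"
done

# Swap the bisect terms: "fixed" marks commits where the bug is gone,
# "unfixed" marks commits where it is still present.
git bisect start --term-new=fixed --term-old=unfixed
git bisect fixed HEAD        # newest commit works
git bisect unfixed HEAD~4    # oldest commit is broken

# Drive the bisect automatically. For `git bisect run`, exit 0 means
# "old" (here: unfixed) and 1-127 (except 125) means "new" (here: fixed).
# Pretend the bug was fixed once state.txt reached 4.
git bisect run sh -c 'test "$(cat state.txt)" -lt 4'

git bisect log | tee /tmp/bisect-demo.log | grep 'first fixed commit'
```

The last line reports commit 4 as the first fixed commit, the same shape of answer the bisect log above gives for 92e2bd56a5f9.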
end of thread, other threads: [~2023-12-19 9:45 UTC | newest]

Thread overview: 13+ messages
2023-02-23 23:40 amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" Mikhail Gavrilov
2023-02-24  7:12 ` Keyword Review - " Christian König
2023-02-24  7:13 ` Christian König
2023-02-24  8:38 ` Mikhail Gavrilov
2023-02-24 12:29 ` Christian König
2023-02-24 15:31 ` Christian König
2023-02-24 16:21 ` Mikhail Gavrilov
2023-02-27 10:22 ` Christian König
2023-02-28  9:52 ` Mikhail Gavrilov
2023-02-28 12:43 ` Christian König
2023-12-15 11:45 ` Mikhail Gavrilov
2023-12-15 12:37 ` Christian König
2023-12-19  9:45 ` Mikhail Gavrilov