* amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
@ 2023-02-23 23:40 Mikhail Gavrilov
  2023-02-24  7:12 ` Keyword Review - " Christian König
  2023-02-24  7:13 ` Christian König
  0 siblings, 2 replies; 13+ messages in thread
From: Mikhail Gavrilov @ 2023-02-23 23:40 UTC (permalink / raw)
To: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander, Christian König

[-- Attachment #1: Type: text/plain, Size: 2647 bytes --]

Hi,
I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. It is impossible to use it without AC power because the system loses the nvme drive when I disconnect the power adapter.

Messages from the kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

I tried the recommended parameters (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve this issue, but without success.

On the linux-nvme mailing list the latest advice was to try the "pci=nocrs" parameter. But with this parameter the amdgpu driver refuses to work and makes the system unbootable. I can work around the boot problem by blacklisting the driver, but that is not a good solution because I don't want to lose the GPU.

Why does amdgpu not work with "pci=nocrs", and is it possible to solve this incompatibility? This is important because when I boot without the amdgpu driver and with "pci=nocrs", the nvme drive is not lost when I disconnect the power adapter. So "pci=nocrs" really helps.
Below is what I see in the kernel log when the "pci=nocrs" parameter is added:

amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v10_0> failed -12
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.

--
Best Regards,
Mike Gavrilov.

[-- Attachment #2: system-log-Fatal-error-during-GPU-init.tar.xz --]
[-- Type: application/x-xz, Size: 40988 bytes --]

^ permalink raw reply [flat|nested] 13+ messages in thread
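[Editor's note] The nvme driver's hint above asks the user to add kernel parameters. A minimal sketch of how such parameters are typically appended on a GRUB-based distribution; the config contents are invented for illustration, a temp file stands in for /etc/default/grub so the logic is reproducible, and on a real system you would follow with `grub2-mkconfig` (or `update-grub`) and a reboot:

```shell
# Work on a stand-in for /etc/default/grub so this sketch is safe to run.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="rhgb quiet"
EOF

# Append the parameters suggested by the nvme driver inside the existing
# double-quoted GRUB_CMDLINE_LINUX value (portable sed, no -i).
sed 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"/' \
    "$conf" > "$conf.new" && mv "$conf.new" "$conf"

grep '^GRUB_CMDLINE_LINUX' "$conf"
# On a real system: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && reboot
```

The same approach works for testing "pci=nocrs"; editing the line once in the bootloader config beats retyping it at the boot prompt on every boot.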
* Keyword Review - Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
  2023-02-23 23:40 amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" Mikhail Gavrilov
@ 2023-02-24  7:12 ` Christian König
  2023-02-24  7:13 ` Christian König
  1 sibling, 0 replies; 13+ messages in thread
From: Christian König @ 2023-02-24 7:12 UTC (permalink / raw)
To: Mikhail Gavrilov, amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander

Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows the BIOS reports through ACPI. This then most likely leads to problems with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help understand the problem, but I strongly suggest trying a BIOS update first.

Regards,
Christian.

Am 24.02.23 um 00:40 schrieb Mikhail Gavrilov:
> Hi,
> I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
> it is impossible to use without AC power because the system losts nvme
> when I disconnect the power adapter.
>
> Messages from kernel log when it happens:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> nvme nvme0: Does your device have a faulty power saving mode enabled?
> nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
> and report a bug
>
> I tried to use recommended parameters
> (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
> this issue, but without successed.
>
> In the linux-nvme mail list the last advice was to try the "pci=nocrs"
> parameter.
>
> But with this parameter the amdgpu driver refuses to work and makes
> the system unbootable. I can solve the problem with the booting system
> by blacklisting the driver but it is not a good solution, because I
> don't wanna lose the GPU.
> > Why amdgpu not work with "pci=nocrs" ? > And is it possible to solve this incompatibility? > It is very important because when I boot the system without amdgpu > driver with "pci=nocrs" nvme is not losts when I disconnect the power > adapter. So "pci=nocrs" really helps. > > Below that I see in kernel log when adds "pci=nocrs" parameter: > > amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM > amdgpu: ATOM BIOS: SWBRT77321.001 > [drm] VCN(0) decode is enabled in VM mode > [drm] VCN(0) encode is enabled in VM mode > [drm] JPEG decode is enabled in VM mode > Console: switching to colour dummy device 80x25 > amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature > disabled as experimental (default) > [drm] GPU posting now... > [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment > size is 9-bit > amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - > 0x00000082FEFFFFFF (12272M used) > amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF > amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - > 0x0000FFFFFFFFFFFF > [drm] Detected VRAM RAM=12272M, BAR=16384M > [drm] RAM width 192bits GDDR6 > [drm] amdgpu: 12272M of VRAM memory ready > [drm] amdgpu: 31774M of GTT memory ready. > amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo > [drm] Debug VRAM access will use slowpath MM access > amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page > [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block > <gmc_v10_0> failed -12 > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed > amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init > amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device. > > Of course a full system log is also attached. > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
  2023-02-23 23:40 amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" Mikhail Gavrilov
  2023-02-24  7:12 ` Keyword Review - " Christian König
@ 2023-02-24  7:13 ` Christian König
  2023-02-24  8:38   ` Mikhail Gavrilov
  1 sibling, 1 reply; 13+ messages in thread
From: Christian König @ 2023-02-24 7:13 UTC (permalink / raw)
To: Mikhail Gavrilov, amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander

Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows the BIOS reports through ACPI. This then most likely leads to problems with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help understand the problem, but I strongly suggest trying a BIOS update first.

Regards,
Christian.

Am 24.02.23 um 00:40 schrieb Mikhail Gavrilov:
> Hi,
> I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
> it is impossible to use without AC power because the system losts nvme
> when I disconnect the power adapter.
>
> Messages from kernel log when it happens:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> nvme nvme0: Does your device have a faulty power saving mode enabled?
> nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
> and report a bug
>
> I tried to use recommended parameters
> (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
> this issue, but without successed.
>
> In the linux-nvme mail list the last advice was to try the "pci=nocrs"
> parameter.
>
> But with this parameter the amdgpu driver refuses to work and makes
> the system unbootable.
>
I can solve the problem with the booting system > by blacklisting the driver but it is not a good solution, because I > don't wanna lose the GPU. > > Why amdgpu not work with "pci=nocrs" ? > And is it possible to solve this incompatibility? > It is very important because when I boot the system without amdgpu > driver with "pci=nocrs" nvme is not losts when I disconnect the power > adapter. So "pci=nocrs" really helps. > > Below that I see in kernel log when adds "pci=nocrs" parameter: > > amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM > amdgpu: ATOM BIOS: SWBRT77321.001 > [drm] VCN(0) decode is enabled in VM mode > [drm] VCN(0) encode is enabled in VM mode > [drm] JPEG decode is enabled in VM mode > Console: switching to colour dummy device 80x25 > amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature > disabled as experimental (default) > [drm] GPU posting now... > [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment > size is 9-bit > amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - > 0x00000082FEFFFFFF (12272M used) > amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF > amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - > 0x0000FFFFFFFFFFFF > [drm] Detected VRAM RAM=12272M, BAR=16384M > [drm] RAM width 192bits GDDR6 > [drm] amdgpu: 12272M of VRAM memory ready > [drm] amdgpu: 31774M of GTT memory ready. > amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo > [drm] Debug VRAM access will use slowpath MM access > amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page > [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block > <gmc_v10_0> failed -12 > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed > amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init > amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device. > > Of course a full system log is also attached. > ^ permalink raw reply [flat|nested] 13+ messages in thread
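[Editor's note] Christian's `sudo lspci -vvvv -s $BUSID_OF_AMDGPU` needs the GPU's PCI bus ID first. A sketch of one way to find it; a captured `lspci -D`-style fragment (modeled on this thread's machine) stands in for live output so the filter itself is reproducible:

```shell
# Sample lspci -D output; on a live system replace the variable with
# the real command: lspci -D
lspci_out='0000:03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT / 6800M] (rev c3)
0000:08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series] (rev c4)'

# The bus ID is the first whitespace-separated field of each AMD/ATI line.
busids=$(printf '%s\n' "$lspci_out" | awk '/\[AMD\/ATI\]/ {print $1}')
printf '%s\n' "$busids"

# With real hardware you would then run:
#   for id in $busids; do sudo lspci -vvvv -s "$id"; done
```

On a hybrid-graphics laptop like this one there are two AMD devices (iGPU and dGPU), so collecting all matching IDs rather than the first is the safer habit.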
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 7:13 ` Christian König @ 2023-02-24 8:38 ` Mikhail Gavrilov 2023-02-24 12:29 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-02-24 8:38 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander [-- Attachment #1: Type: text/plain, Size: 2647 bytes --] On Fri, Feb 24, 2023 at 12:13 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > Hi Mikhail, > > this is pretty clearly a problem with the system and/or it's BIOS and > not the GPU hw or the driver. > > The option pci=nocrs makes the kernel ignore additional resource windows > the BIOS reports through ACPI. This then most likely leads to problems > with amdgpu because it can't bring up its PCIe resources any more. > > The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help > understand the problem I attach both lspci for pci=nocrs and without pci=nocrs. 
The differences for Cezanne Radeon Vega Series:

with pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 4: I/O ports at e000 [disabled] [size=256]
Capabilities: [c0] MSI-X: Enable- Count=4 Masked-

Without pci=nocrs:
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Interrupt: pin A routed to IRQ 44
Region 4: I/O ports at e000 [size=256]
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-

The differences for Navi 22 Radeon 6800M:

with pci=nocrs:
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]
AtomicOpsCtl: ReqEn-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000

Without pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 103
Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
AtomicOpsCtl: ReqEn+
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 0000

> but I strongly suggest to try a BIOS update first.
This was the first thing I did. And I am afraid there will be no more BIOS updates:
https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/

I also have experience in dealing with manufacturers' tech support. Usually it ends with "we do not provide drivers for Linux".

--
Best Regards,
Mike Gavrilov.
[-- Attachment #2: lspci-with-pci=nocrs.txt --] [-- Type: text/plain, Size: 8178 bytes --] ❯ sudo lspci -vvvv -s 08:00.0 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. Radeon Vega 8 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 255 IOMMU group: 7 Region 0: Memory at fc20000000 (64-bit, prefetchable) [size=256M] Region 2: Memory at fc30000000 (64-bit, prefetchable) [size=2M] Region 4: I/O ports at e000 [disabled] [size=256] Region 5: Memory at fc900000 (32-bit, non-prefetchable) [size=512K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not 
Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [c0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=5 offset=00042000 PBA: BAR=5 offset=00043000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2b0 v1] Address Translation Service (ATS) ATSCap: Invalidate Queue Depth: 00 ATSCtl: Enable+, Smallest Translation Unit: 00 Capabilities: [2c0 v1] Page Request Interface (PRI) PRICtl: Enable- Reset- PRISta: RF- UPRGI- Stopped+ Page Request Capacity: 00000100, Page Request Allocation: 00000000 Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [400 v1] Data Link Feature <?> Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel modules: amdgpu ❯ ❯ sudo lspci -vvvv -s 03:00.0 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c3) Subsystem: ASUSTeK Computer Inc. 
Radeon RX 6800M Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 255 IOMMU group: 12 Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G] Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M] Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M] Expansion ROM at fcb00000 [disabled] [size=128K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 16GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit+ 64bit+ 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- 
SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [200 v1] Physical Resizable BAR BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB Capabilities: [240 v1] Power Budgeting <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2a0 v1] Access Control Services ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [320 v1] Latency Tolerance Reporting Max snoop latency: 1048576ns Max no snoop latency: 1048576ns Capabilities: [410 
v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel modules: amdgpu [-- Attachment #3: lspci.txt --] [-- Type: text/plain, Size: 8231 bytes --] ❯ sudo lspci -vvvv -s 08:00.0 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. Radeon Vega 8 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 44 IOMMU group: 7 Region 0: Memory at fc20000000 (64-bit, prefetchable) [size=256M] Region 2: Memory at fc30000000 (64-bit, prefetchable) [size=2M] Region 4: I/O ports at e000 [size=256] Region 5: Memory at fc900000 (32-bit, non-prefetchable) [size=512K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq- 
OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [c0] MSI-X: Enable+ Count=4 Masked- Vector table: BAR=5 offset=00042000 PBA: BAR=5 offset=00043000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2b0 v1] Address Translation Service (ATS) ATSCap: Invalidate Queue Depth: 00 ATSCtl: Enable+, Smallest Translation Unit: 00 Capabilities: [2c0 v1] Page Request Interface (PRI) PRICtl: Enable- Reset- PRISta: RF- UPRGI- Stopped+ Page Request Capacity: 00000100, Page Request Allocation: 00000000 Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [400 v1] Data Link Feature <?> Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel driver in use: amdgpu Kernel modules: amdgpu ❯ ❯ sudo lspci -vvvv -s 03:00.0 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c3) Subsystem: ASUSTeK Computer Inc. 
Radeon RX 6800M Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 103 IOMMU group: 12 Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G] Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M] Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M] Expansion ROM at fcb00000 [disabled] [size=128K] Capabilities: [48] Vendor Specific Information: Len=08 <?> Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 16GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit+ 64bit+ 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn+ LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 16GT/s, 
EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+ EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- Retimer- 2Retimers- CrosslinkRes: unsupported Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00000 Data: 0000 Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Capabilities: [150 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [200 v1] Physical Resizable BAR BAR 0: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB BAR 2: current size: 256MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB Capabilities: [240 v1] Power Budgeting <?> Capabilities: [270 v1] Secondary PCI Express LnkCtl3: LnkEquIntrruptEn- PerformEqu- LaneErrStat: 0 Capabilities: [2a0 v1] Access Control Services ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- Capabilities: [2d0 v1] Process Address Space ID (PASID) PASIDCap: Exec+ Priv+, Max PASID Width: 10 PASIDCtl: Enable- Exec- Priv- Capabilities: [320 v1] Latency Tolerance Reporting Max snoop latency: 1048576ns Max no snoop latency: 1048576ns 
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> Capabilities: [440 v1] Lane Margining at the Receiver <?> Kernel driver in use: amdgpu Kernel modules: amdgpu ^ permalink raw reply [flat|nested] 13+ messages in thread
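[Editor's note] The key signal in the two attachments above is the `Control:` line: under pci=nocrs the Navi 22 shows `Mem-` and `BusMaster-`, i.e. memory decoding and bus mastering are disabled, which is why the driver can't touch its BARs. A sketch of reading such a flag mechanically from lspci output; the helper name `check` is made up for illustration, and the sample line is the Navi 22 entry from the pci=nocrs attachment:

```shell
# Sample "Control:" line for the Navi 22 dGPU under pci=nocrs.
ctl='Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-'

# lspci prints each command-register bit as NAME+ (set) or NAME- (clear).
check() {  # usage: check <Control line> <flag>  -> prints on/off/unknown
  case " $1 " in
    *" $2+ "*) echo on ;;
    *" $2- "*) echo off ;;
    *) echo unknown ;;
  esac
}

echo "Mem: $(check "$ctl" Mem), BusMaster: $(check "$ctl" BusMaster)"
# prints: Mem: off, BusMaster: off
```

With a live device the same check can be run as `check "$(sudo lspci -vvvv -s 03:00.0 | grep '^\s*Control:')" Mem`, which makes before/after comparisons like the one in this mail scriptable.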
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 8:38 ` Mikhail Gavrilov @ 2023-02-24 12:29 ` Christian König 2023-02-24 15:31 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-24 12:29 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 24.02.23 um 09:38 schrieb Mikhail Gavrilov: > On Fri, Feb 24, 2023 at 12:13 PM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Hi Mikhail, >> >> this is pretty clearly a problem with the system and/or it's BIOS and >> not the GPU hw or the driver. >> >> The option pci=nocrs makes the kernel ignore additional resource windows >> the BIOS reports through ACPI. This then most likely leads to problems >> with amdgpu because it can't bring up its PCIe resources any more. >> >> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help >> understand the problem > I attach both lspci for pci=nocrs and without pci=nocrs. 
>
> The differences for Cezanne Radeon Vega Series:
> with pci=nocrs:
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Interrupt: pin A routed to IRQ 255
> Region 4: I/O ports at e000 [disabled] [size=256]
> Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
>
> Without pci=nocrs:
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Interrupt: pin A routed to IRQ 44
> Region 4: I/O ports at e000 [size=256]
> Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
>
>
> The differences for Navi 22 Radeon 6800M:
> with pci=nocrs:
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Interrupt: pin A routed to IRQ 255
> Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
> Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]

Well that explains it. When the PCI subsystem has to disable the BARs of the GPU we can't access it any more.

The only thing we could do is to make sure that the driver at least fails gracefully.

Do you still have network access to the box when amdgpu fails to load and could grab whatever is in dmesg?

Thanks,
Christian.
> AtomicOpsCtl: ReqEn- > Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ > Address: 0000000000000000 Data: 0000 > > Without pci=nocrs: > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Latency: 0, Cache Line Size: 64 bytes > Interrupt: pin A routed to IRQ 103 > Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G] > Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M] > Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M] > AtomicOpsCtl: ReqEn+ > Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > Address: 00000000fee00000 Data: 0000 > >> but I strongly suggest to try a BIOS update first. > This is the first thing that was done. And I am afraid no more BIOS updates. > https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/ > > I also have experience in dealing with manufacturers' tech support. > Usually it ends with "we do not provide drivers for Linux". > ^ permalink raw reply [flat|nested] 13+ messages in thread
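The Control-register bits compared in the lspci diffs above can be decoded mechanically. A minimal sketch (the `check_decode` helper is illustrative, not an existing tool; the sample Control lines are the ones quoted above, and on a live system you would feed in `sudo lspci -vvv -s <BUSID>` output instead):

```shell
#!/bin/sh
# Decode an lspci "Control:" line to see whether memory decode and bus
# mastering are enabled. With pci=nocrs the Navi 22 GPU shows Mem- and
# BusMaster-, which is why amdgpu cannot reach the hardware.
check_decode() {
    line="$1"
    case "$line" in *"Mem+"*) mem=enabled ;; *) mem=disabled ;; esac
    case "$line" in *"BusMaster+"*) bm=enabled ;; *) bm=disabled ;; esac
    printf 'memory decode: %s, bus mastering: %s\n' "$mem" "$bm"
}

# With pci=nocrs (as quoted above):
check_decode "Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-"
# Without pci=nocrs:
check_decode "Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-"
```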
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 12:29 ` Christian König @ 2023-02-24 15:31 ` Christian König 2023-02-24 16:21 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-24 15:31 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 24.02.23 um 13:29 schrieb Christian König: > Am 24.02.23 um 09:38 schrieb Mikhail Gavrilov: >> On Fri, Feb 24, 2023 at 12:13 PM Christian König >> <ckoenig.leichtzumerken@gmail.com> wrote: >>> Hi Mikhail, >>> >>> this is pretty clearly a problem with the system and/or it's BIOS and >>> not the GPU hw or the driver. >>> >>> The option pci=nocrs makes the kernel ignore additional resource >>> windows >>> the BIOS reports through ACPI. This then most likely leads to problems >>> with amdgpu because it can't bring up its PCIe resources any more. >>> >>> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help >>> understand the problem >> I attach both lspci for pci=nocrs and without pci=nocrs. 
>> >> The differences for Cezanne Radeon Vega Series: >> with pci=nocrs: >> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx- >> Interrupt: pin A routed to IRQ 255 >> Region 4: I/O ports at e000 [disabled] [size=256] >> Capabilities: [c0] MSI-X: Enable- Count=4 Masked- >> >> Without pci=nocrs: >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx+ >> Interrupt: pin A routed to IRQ 44 >> Region 4: I/O ports at e000 [size=256] >> Capabilities: [c0] MSI-X: Enable+ Count=4 Masked- >> >> >> The differences for Navi 22 Radeon 6800M: >> with pci=nocrs: >> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx- >> Interrupt: pin A routed to IRQ 255 >> Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] >> [size=16G] >> Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] >> [size=256M] >> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] >> [size=1M] > > Well that explains it. When the PCI subsystem has to disable the BARs > of the GPU we can't access it any more. > > The only thing we could do is to make sure that the driver at least > fails gracefully. > > Do you still have network access to the box when amdgpu fails to load > and could grab whatevery is in dmesg? Sorry I totally missed that you attached the full dmesg to your original mail. Yeah, the driver did fail gracefully. But then X doesn't come up and then gdm just dies. Sorry there is really nothing we can do here, maybe ping somebody with more ACPI background for help. Regards, Christian. > > Thanks, > Christian. 
> >> AtomicOpsCtl: ReqEn- >> Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ >> Address: 0000000000000000 Data: 0000 >> >> Without pci=nocrs: >> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> Stepping- SERR- FastB2B- DisINTx+ >> Latency: 0, Cache Line Size: 64 bytes >> Interrupt: pin A routed to IRQ 103 >> Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G] >> Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M] >> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M] >> AtomicOpsCtl: ReqEn+ >> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ >> Address: 00000000fee00000 Data: 0000 >> >>> but I strongly suggest to try a BIOS update first. >> This is the first thing that was done. And I am afraid no more BIOS >> updates. >> https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/ >> >> >> I also have experience in dealing with manufacturers' tech support. >> Usually it ends with "we do not provide drivers for Linux". >> > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 15:31 ` Christian König @ 2023-02-24 16:21 ` Mikhail Gavrilov 2023-02-27 10:22 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-02-24 16:21 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander On Fri, Feb 24, 2023 at 8:31 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > Sorry I totally missed that you attached the full dmesg to your original > mail. > > Yeah, the driver did fail gracefully. But then X doesn't come up and > then gdm just dies. Are you sure that these messages should be present when the driver fails gracefully? turning off the locking correctness validator. CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 Call Trace: <TASK> dump_stack_lvl+0x57/0x90 register_lock_class+0x47d/0x490 __lock_acquire+0x74/0x21f0 ? lock_release+0x155/0x450 lock_acquire+0xd2/0x320 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] ? lock_is_held_type+0xce/0x120 _raw_spin_lock_irqsave+0x4d/0xa0 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu] amdgpu_driver_load_kms+0xe8/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 </TASK> amdgpu: probe of 0000:03:00.0 failed with error -12 amdgpu 0000:08:00.0: enabling device (0006 -> 0007) [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4). list_add corruption. prev->next should be next (ffffffffc0940328), but was 0000000000000000. (prev=ffff8c9b734062b0). ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:30! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 RIP: 0010:__list_add_valid+0x74/0x90 Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 PKRU: 55555554 Call Trace: <TASK> ttm_device_init+0x184/0x1c0 [ttm] amdgpu_ttm_init+0xb8/0x610 [amdgpu] ? _printk+0x60/0x80 gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu] amdgpu_device_init+0x14e5/0x2520 [amdgpu] amdgpu_driver_load_kms+0x15/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 </TASK> Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile hid_multitouch polyval_generic drm_display_helper nvme rfkill ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi video i2c_hid wmi ip6_tables ip_tables fuse ---[ end trace 0000000000000000 ]--- RIP: 0010:__list_add_valid+0x74/0x90 Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 PKRU: 55555554 
(udev-worker) (470) used greatest stack depth: 12416 bytes left I thought that failing gracefully meant switching to svga mode and showing the desktop with software rendering (exactly as happens when I blacklist the amdgpu driver). Currently the boot process gets stuck and the local console is unavailable. -- Best Regards, Mike Gavrilov. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-24 16:21 ` Mikhail Gavrilov @ 2023-02-27 10:22 ` Christian König 2023-02-28 9:52 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-27 10:22 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 24.02.23 um 17:21 schrieb Mikhail Gavrilov: > On Fri, Feb 24, 2023 at 8:31 PM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> Sorry I totally missed that you attached the full dmesg to your original >> mail. >> >> Yeah, the driver did fail gracefully. But then X doesn't come up and >> then gdm just dies. > Are you sure that these messages should be present when the driver > fails gracefully? Unfortunately yes. We could clean that up a bit more so that you don't run into a BUG() assertion, but what essentially happens here is that we completely fail to talk to the hardware. In this situation we can't even re-enable vesa or text console any more. Regards, Christian. > > turning off the locking correctness validator. > CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L > ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug > #1 > Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, > BIOS G513QY.320 09/07/2022 > Call Trace: > <TASK> > dump_stack_lvl+0x57/0x90 > register_lock_class+0x47d/0x490 > __lock_acquire+0x74/0x21f0 > ? lock_release+0x155/0x450 > lock_acquire+0xd2/0x320 > ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] > ? lock_is_held_type+0xce/0x120 > _raw_spin_lock_irqsave+0x4d/0xa0 > ? 
amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] > amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] > amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu] > amdgpu_driver_load_kms+0xe8/0x190 [amdgpu] > amdgpu_pci_probe+0x140/0x420 [amdgpu] > local_pci_probe+0x41/0x90 > pci_device_probe+0xc3/0x230 > really_probe+0x1b6/0x410 > __driver_probe_device+0x78/0x170 > driver_probe_device+0x1f/0x90 > __driver_attach+0xd2/0x1c0 > ? __pfx___driver_attach+0x10/0x10 > bus_for_each_dev+0x8a/0xd0 > bus_add_driver+0x141/0x230 > driver_register+0x77/0x120 > ? __pfx_init_module+0x10/0x10 [amdgpu] > do_one_initcall+0x6e/0x350 > do_init_module+0x4a/0x220 > __do_sys_init_module+0x192/0x1c0 > do_syscall_64+0x5b/0x80 > ? asm_exc_page_fault+0x22/0x30 > ? lockdep_hardirqs_on+0x7d/0x100 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7fd58cfcb1be > Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f > 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d > 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 > RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af > RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be > RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 > RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 > R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 > R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 > </TASK> > amdgpu: probe of 0000:03:00.0 failed with error -12 > amdgpu 0000:08:00.0: enabling device (0006 -> 0007) > [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4). > > > list_add corruption. prev->next should be next (ffffffffc0940328), but > was 0000000000000000. (prev=ffff8c9b734062b0). > ------------[ cut here ]------------ > kernel BUG at lib/list_debug.c:30! 
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L > ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug > #1 > Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, > BIOS G513QY.320 09/07/2022 > RIP: 0010:__list_add_valid+0x74/0x90 > Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b > 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b > 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d > RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 > RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff > RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 > R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 > R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 > FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 > PKRU: 55555554 > Call Trace: > <TASK> > ttm_device_init+0x184/0x1c0 [ttm] > amdgpu_ttm_init+0xb8/0x610 [amdgpu] > ? _printk+0x60/0x80 > gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu] > amdgpu_device_init+0x14e5/0x2520 [amdgpu] > amdgpu_driver_load_kms+0x15/0x190 [amdgpu] > amdgpu_pci_probe+0x140/0x420 [amdgpu] > local_pci_probe+0x41/0x90 > pci_device_probe+0xc3/0x230 > really_probe+0x1b6/0x410 > __driver_probe_device+0x78/0x170 > driver_probe_device+0x1f/0x90 > __driver_attach+0xd2/0x1c0 > ? __pfx___driver_attach+0x10/0x10 > bus_for_each_dev+0x8a/0xd0 > bus_add_driver+0x141/0x230 > driver_register+0x77/0x120 > ? __pfx_init_module+0x10/0x10 [amdgpu] > do_one_initcall+0x6e/0x350 > do_init_module+0x4a/0x220 > __do_sys_init_module+0x192/0x1c0 > do_syscall_64+0x5b/0x80 > ? asm_exc_page_fault+0x22/0x30 > ? 
lockdep_hardirqs_on+0x7d/0x100 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7fd58cfcb1be > Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f > 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d > 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48 > RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af > RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be > RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010 > RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0 > R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670 > R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0 > </TASK> > Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi > iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul > sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile > hid_multitouch polyval_generic drm_display_helper nvme rfkill > ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw > sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi > video i2c_hid wmi ip6_tables ip_tables fuse > ---[ end trace 0000000000000000 ]--- > RIP: 0010:__list_add_valid+0x74/0x90 > Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b > 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b > 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d > RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246 > RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff > RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0 > R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48 > R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000 > FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 
000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0 > PKRU: 55555554 > (udev-worker) (470) used greatest stack depth: 12416 bytes left > > I thought that gracefully means switching to svga mode and showing the > desktop with software rendering (exactly as it happens when I > blacklist amdgpu driver). Currently the boot process stucking and the > local console is unavailable. > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-27 10:22 ` Christian König @ 2023-02-28 9:52 ` Mikhail Gavrilov 2023-02-28 12:43 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-02-28 9:52 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander [-- Attachment #1: Type: text/plain, Size: 1056 bytes --] On Mon, Feb 27, 2023 at 3:22 PM Christian König > > Unfortunately yes. We could clean that up a bit more so that you don't > run into a BUG() assertion, but what essentially happens here is that we > completely fail to talk to the hardware. > > In this situation we can't even re-enable vesa or text console any more. > Then I don't understand why, when amdgpu is blacklisted via modprobe.blacklist=amdgpu, I still see graphics and can log into GNOME. Yes, without hardware acceleration, but that is better than non-working graphics. It means there is some other driver (I assume this is "video") which can successfully talk to the AMD hardware under conditions where amdgpu cannot. My suggestion is that if amdgpu fails to talk to the hardware, another suitable driver should be allowed to do it. I attached a system log with "pci=nocrs" and "modprobe.blacklist=amdgpu" applied, showing that graphics work correctly in this case. To do this, does the Linux module loading mechanism need to be refined? -- Best Regards, Mike Gavrilov. [-- Attachment #2: system-without-amdgpu.tar.xz --] [-- Type: application/x-xz, Size: 41716 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-28 9:52 ` Mikhail Gavrilov @ 2023-02-28 12:43 ` Christian König 2023-12-15 11:45 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-02-28 12:43 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 28.02.23 um 10:52 schrieb Mikhail Gavrilov: > On Mon, Feb 27, 2023 at 3:22 PM Christian König >> Unfortunately yes. We could clean that up a bit more so that you don't >> run into a BUG() assertion, but what essentially happens here is that we >> completely fail to talk to the hardware. >> >> In this situation we can't even re-enable vesa or text console any more. >> > Then I don't understand why when amdgpu is blacklisted via > modprobe.blacklist=amdgpu then I see graphics and could login into > GNOME. Yes without hardware acceleration, but it is better than non > working graphics. It means there is some other driver (I assume this > is "video") which can successfully talk to the AMD hardware in > conditions where amdgpu cannot do this. The point is it doesn't need to talk to the amdgpu hardware. What it does is that it talks to the good old VGA/VESA emulation and that just happens to be still enabled by the BIOS/GRUB. And that VGA/VESA emulation doesn't need any BAR or whatever to keep the hw running in the state where it was initialized before the kernel started. The kernel just grabs the addresses where it needs to write the display data and keeps going with that. But when a hw specific driver wants to load this is the first thing which gets disabled because we need to load new firmware. And with the BARs disabled this can't be re-enabled without rebooting the system. > My suggestion is that if > amdgpu fails to talk to the hardware, then let another suitable driver > do it. 
I attached a system log when I apply "pci=nocrs" with > "modprobe.blacklist=amdgpu" for showing that graphics work right in > this case. > To do this, does the Linux module loading mechanism need to be refined? That's actually working as expected. The real problem is that the BIOS on that system is so broken that we can't access the hw correctly. What we could do is to check the BARs very early on and refuse to load when they are disabled. The problem with this approach is that there are systems where it is normal for the BARs to be disabled until the driver loads and get enabled during the hardware initialization process. What you might want to look into is to find a quirk for the BIOS to properly enable the nvme controller. Regards, Christian. ^ permalink raw reply [flat|nested] 13+ messages in thread
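The early BAR check described above can be approximated from userspace by reading the PCI command register out of sysfs config space. A rough sketch, assuming a POSIX shell; the `pci_mem_decode_enabled` helper is invented for illustration, and the BDF 0000:03:00.0 is the Navi 22 GPU from this thread:

```shell
#!/bin/sh
# Read the PCI command register (config-space bytes 0x04-0x05) and test
# bit 1, "Memory Space Enable". With pci=nocrs this bit ends up cleared
# on the GPU, matching the [disabled] BARs in the lspci output above.
pci_mem_decode_enabled() {
    cfg="$1"                                    # path to a config-space file
    cmd_lo=$(od -An -tu1 -j4 -N1 "$cfg" | tr -d ' ')   # low byte of command reg
    [ $(( cmd_lo & 2 )) -ne 0 ]
}

dev=/sys/bus/pci/devices/0000:03:00.0/config    # substitute your own BDF
if [ -r "$dev" ]; then
    if pci_mem_decode_enabled "$dev"; then
        echo "memory decode on: BARs usable, driver load should be safe"
    else
        echo "memory decode off: a hw-specific driver cannot reach the device"
    fi
fi
```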
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-02-28 12:43 ` Christian König @ 2023-12-15 11:45 ` Mikhail Gavrilov 2023-12-15 12:37 ` Christian König 0 siblings, 1 reply; 13+ messages in thread From: Mikhail Gavrilov @ 2023-12-15 11:45 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander [-- Attachment #1: Type: text/plain, Size: 1941 bytes --] On Tue, Feb 28, 2023 at 5:43 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > The point is it doesn't need to talk to the amdgpu hardware. What it > does is that it talks to the good old VGA/VESA emulation and that just > happens to be still enabled by the BIOS/GRUB. > > And that VGA/VESA emulation doesn't need any BAR or whatever to keep the > hw running in the state where it was initialized before the kernel > started. The kernel just grabs the addresses where it needs to write the > display data and keeps going with that. > > But when a hw specific driver wants to load this is the first thing > which gets disabled because we need to load new firmware. And with the > BARs disabled this can't be re-enabled without rebooting the system. > > > My suggestion is that if > > amdgpu fails to talk to the hardware, then let another suitable driver > > do it. I attached a system log when I apply "pci=nocrs" with > > "modprobe.blacklist=amdgpu" for showing that graphics work right in > > this case. > > To do this, does the Linux module loading mechanism need to be refined? > > That's actually working as expected. The real problem is that the BIOS > on that system is so broken that we can't access the hw correctly. > > What we could to do is to check the BARs very early on and refuse to > load when they are disable. The problem with this approach is that there > are systems where it is normal that the BARs are disable until the > driver loads and get enabled during the hardware initialization process. 
> > What you might want to look into is to find a quirk for the BIOS to > properly enable the nvme controller. > That's interesting. I noticed that amdgpu now works even with the [pci=nocrs] parameter on 6.7.0-0.rc4 and higher kernels. Does that mean the BARs became available? I attached the kernel log and lspci output here. What changed? -- Best Regards, Mike Gavrilov. [-- Attachment #2: dmesg-nvme-down-2.zip --] [-- Type: application/zip, Size: 46571 bytes --] [-- Attachment #3: lspci.zip --] [-- Type: application/zip, Size: 2710 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-12-15 11:45 ` Mikhail Gavrilov @ 2023-12-15 12:37 ` Christian König 2023-12-19 9:45 ` Mikhail Gavrilov 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2023-12-15 12:37 UTC (permalink / raw) To: Mikhail Gavrilov Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander Am 15.12.23 um 12:45 schrieb Mikhail Gavrilov: > On Tue, Feb 28, 2023 at 5:43 PM Christian König > <ckoenig.leichtzumerken@gmail.com> wrote: >> The point is it doesn't need to talk to the amdgpu hardware. What it >> does is that it talks to the good old VGA/VESA emulation and that just >> happens to be still enabled by the BIOS/GRUB. >> >> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the >> hw running in the state where it was initialized before the kernel >> started. The kernel just grabs the addresses where it needs to write the >> display data and keeps going with that. >> >> But when a hw specific driver wants to load this is the first thing >> which gets disabled because we need to load new firmware. And with the >> BARs disabled this can't be re-enabled without rebooting the system. >> >>> My suggestion is that if >>> amdgpu fails to talk to the hardware, then let another suitable driver >>> do it. I attached a system log when I apply "pci=nocrs" with >>> "modprobe.blacklist=amdgpu" for showing that graphics work right in >>> this case. >>> To do this, does the Linux module loading mechanism need to be refined? >> That's actually working as expected. The real problem is that the BIOS >> on that system is so broken that we can't access the hw correctly. >> >> What we could to do is to check the BARs very early on and refuse to >> load when they are disable. 
The problem with this approach is that there >> are systems where it is normal that the BARs are disable until the >> driver loads and get enabled during the hardware initialization process. >> >> What you might want to look into is to find a quirk for the BIOS to >> properly enable the nvme controller. >> > That's interesting. I noticed that now amdgpu could work even with > parameter [pci=nocrs] on 6.7.0-0.rc4 and higher kernels. > It means BARs became available? > I attached here the kerner log and lspci. What's changed? I have no idea :) From the logs I can see that the AMDGPU now has the proper BARs assigned: [ 5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000 [ 5.722051] pci 0000:03:00.0: reg 0x10: [mem 0xf800000000-0xfbffffffff 64bit pref] [ 5.722081] pci 0000:03:00.0: reg 0x18: [mem 0xfc00000000-0xfc0fffffff 64bit pref] [ 5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff] [ 5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref] [ 5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold [ 5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link) And with that the driver can work perfectly fine. Have you updated the BIOS or added/removed some other hardware? Maybe somebody added a quirk for your BIOS into the PCIe code or something like that. Regards, Christian. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" 2023-12-15 12:37 ` Christian König @ 2023-12-19 9:45 ` Mikhail Gavrilov 0 siblings, 0 replies; 13+ messages in thread From: Mikhail Gavrilov @ 2023-12-19 9:45 UTC (permalink / raw) To: Christian König Cc: amd-gfx list, dri-devel, Linux List Kernel Mailing, Deucher, Alexander On Fri, Dec 15, 2023 at 5:37 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote: > > I have no idea :) > > From the logs I can see that the AMDGPU now has the proper BARs assigned: > > [ 5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000 > [ 5.722051] pci 0000:03:00.0: reg 0x10: [mem > 0xf800000000-0xfbffffffff 64bit pref] > [ 5.722081] pci 0000:03:00.0: reg 0x18: [mem > 0xfc00000000-0xfc0fffffff 64bit pref] > [ 5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff] > [ 5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref] > [ 5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold > [ 5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, > limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 > Gb/s with 16.0 GT/s PCIe x16 link) > > And with that the driver can work perfectly fine. > > Have you updated the BIOS or added/removed some other hardware? Maybe > somebody added a quirk for your BIOS into the PCIe code or something > like that. No, nothing changed in hardware. But I found the commit which fixes it. > git bisect unfixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 Author: Vasant Hegde <vasant.hegde@amd.com> Date: Thu Sep 21 09:21:45 2023 +0000 iommu/amd: Introduce iommu_dev_data.flags to track device capabilities Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device ATS, PRI, and PASID capabilities. But these capabilities can be enabled independently (except PRI requires ATS support). 
    Hence, replace the iommu_v2 variable with a flags variable, which keeps
    track of the device capabilities.

    From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability
    with all VFs"), device PRI/PASID is shared between PF and any associated
    VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of
    pci_find_ext_capability() to check device PRI/PASID support.

    Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
    Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.hegde@amd.com
    Signed-off-by: Joerg Roedel <jroedel@suse.de>

 drivers/iommu/amd/amd_iommu_types.h |  3 ++-
 drivers/iommu/amd/iommu.c           | 46 ++++++++++++++++++++++---------------
 2 files changed, 30 insertions(+), 19 deletions(-)

> git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag 'media/v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag 'exfat-for-6.7-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd: Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed: [739eb25514c90aa8ea053ed4d2b971f531e63ded] iommu/amd: Introduce iommu_dev_data.ppr
git bisect unfixed 739eb25514c90aa8ea053ed4d2b971f531e63ded
# first fixed commit: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

--
Best Regards,
Mike Gavrilov.
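[Editor's note: the bisect log above is a "reverse" bisect — git hunts for the commit that fixed the bug rather than the one that introduced it, using the alternate terms set by `git bisect start --term-new=fixed --term-old=unfixed`. The mechanics can be demonstrated in a throwaway repository; everything below (the temp repo, state.txt, the "commit N" messages, the pretend fix at commit 4) is invented for the demo, only the term-swapping and `git bisect run` exit-code convention match what Mikhail did.]

```shell
set -e
# Build a tiny history of 5 commits in a temporary repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.invalid
git config user.name demo
for i in 1 2 3 4 5; do
    echo "$i" > state.txt
    git add state.txt
    git commit -qm "commit $i"
done

# Swap the bisect terms: "fixed" marks commits where the bug is gone,
# "unfixed" marks commits where it is still present.
git bisect start --term-new=fixed --term-old=unfixed
git bisect fixed HEAD        # newest commit works
git bisect unfixed HEAD~4    # oldest commit is broken

# Drive the bisect automatically. For `git bisect run`, exit 0 means
# "old" (here: unfixed) and 1-127 (except 125) means "new" (here: fixed).
# Pretend the bug was fixed once state.txt reached 4.
git bisect run sh -c 'test "$(cat state.txt)" -lt 4'

git bisect log | tee /tmp/bisect-demo.log | grep 'first fixed commit'
```

The last line reports commit 4 as the first fixed commit, the same shape of answer the bisect log above gives for 92e2bd56a5f9.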
end of thread, other threads: [~2023-12-19 9:45 UTC | newest]

Thread overview: 13+ messages
2023-02-23 23:40 amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init" Mikhail Gavrilov
2023-02-24  7:12 ` Keyword Review - " Christian König
2023-02-24  7:13 ` Christian König
2023-02-24  8:38 ` Mikhail Gavrilov
2023-02-24 12:29 ` Christian König
2023-02-24 15:31 ` Christian König
2023-02-24 16:21 ` Mikhail Gavrilov
2023-02-27 10:22 ` Christian König
2023-02-28  9:52 ` Mikhail Gavrilov
2023-02-28 12:43 ` Christian König
2023-12-15 11:45 ` Mikhail Gavrilov
2023-12-15 12:37 ` Christian König
2023-12-19  9:45 ` Mikhail Gavrilov