Dear Kernel Developers,
I found a regression and did my best to identify the issue and find a workaround, but I am not a developer. I also reported it in the NVIDIA Linux Developer Forum, but it appears to be a kernel regression rather than a driver issue: the latest drivers work on older kernels (e.g., 5.15) but fail on every 6.x or newer kernel I tried. I did not perform a git bisect because this i5-8265U is slow enough that I can only compile one kernel overnight.
Description:
The NVIDIA MX250 GPU on the Acer Swift 3 (SF314-56G / BIOS 1.14) is inaccessible on any Linux distribution using Kernel 6.x or 7.x. The device fails to initialize during the driver probe routine.
Symptoms:
nvidia-smi fails: "Unable to determine the device handle".
lspci shows the device in a broken state: rev ff.
dmesg reports: Unable to change power state from D3cold to D0, device inaccessible.
Crucial Error Log: NVRM: This is a 64-bit BAR mapped above 4GB by the system BIOS or the Linux kernel, but the PCI bridge immediately upstream of this GPU does not define a matching prefetchable memory window.
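The symptoms above can be cross-checked directly from sysfs, without the NVIDIA driver loaded. The following is a minimal sketch, not a definitive diagnostic: it assumes the GPU sits at 0000:02:00.0 (the address that appears in the kernel log on this machine) and that the sysfs "resource" file holds one "start end flags" line in hex per region. The flag constants are taken from the kernel's include/linux/ioport.h; the function names are made up for illustration.

```python
from pathlib import Path

# Flag bits as defined in the kernel's include/linux/ioport.h
IORESOURCE_MEM = 0x00000200
IORESOURCE_PREFETCH = 0x00002000
IORESOURCE_MEM_64 = 0x00100000
FOUR_GB = 1 << 32


def parse_resources(text):
    """Parse a sysfs 'resource' file into (index, start, end, flags) tuples.

    Lines that are all zero (unused regions) are skipped.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0 and flags == 0:
            continue
        regions.append((index, start, end, flags))
    return regions


def memory_regions_above_4g(regions):
    """Return the memory regions whose end address lies at or above 4 GB."""
    return [r for r in regions if r[3] & IORESOURCE_MEM and r[2] >= FOUR_GB]


def looks_inaccessible(config_bytes):
    """True if the vendor/device ID reads back as all ones, which is what
    lspci reports as 'rev ff' for a device that does not respond."""
    return len(config_bytes) >= 4 and config_bytes[:4] == b"\xff\xff\xff\xff"


def main():
    # Assumption: the GPU address on this laptop; adjust for other machines.
    dev = Path("/sys/bus/pci/devices/0000:02:00.0")
    if not dev.exists():
        print(f"{dev} not found")
        return
    regions = parse_resources((dev / "resource").read_text())
    for index, start, end, flags in memory_regions_above_4g(regions):
        kind = "prefetchable" if flags & IORESOURCE_PREFETCH else "non-prefetchable"
        width = "64-bit" if flags & IORESOURCE_MEM_64 else "32-bit"
        print(f"region {index}: {start:#x}-{end:#x} ({width}, {kind}) is above 4 GB")
    if looks_inaccessible((dev / "config").read_bytes()):
        print("config space reads as all ones: device is not responding")


if __name__ == "__main__":
    main()
```

Running this on both a working 5.15 boot and a failing 6.x boot should show whether the BAR placement actually differs between the two.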
Proven Regression Status:
Works: Ubuntu 20.04 (Kernel 5.15) and Ubuntu 22.04 (manually downgraded to Kernel 5.15). In this state, the GPU is mapped below the 4GB boundary, and the driver (v570/v580) works perfectly (glmark2 score ~1300).
Fails: Ubuntu 22.04/24.04 (Kernel 6.5/6.8) and CachyOS (Kernel 6.13/7.0). Every kernel newer than 5.15 that I tested attempts to map the GPU BARs into the 64-bit address space (> 4GB), which the Acer PCI bridge fails to handle.
Attempted Workarounds (all failed on Kernel 6.x+):
pci=realloc (Causes initramfs/SSD failure on this hardware).
pci=nocrs, pci=big_root_window, pci=no64, pci=realloc=off.
acpi_osi="Windows 2015".
NVreg_EnableGpuFirmware=0, NVreg_DynamicPowerManagement=0x02.
Switching between Wayland and X11 (No effect).
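To rule out a boot parameter that was silently not applied, the active kernel command line can be inspected on the running system. This is a minimal sketch, assuming parameters are read from /proc/cmdline and that several pci= options may be given either as separate entries or comma-separated in one entry; the helper names are hypothetical.

```python
import shlex
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into (key, value) pairs.

    Bare flags such as 'quiet' get a value of None. shlex handles quoted
    values such as acpi_osi="Windows 2015".
    """
    params = []
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params.append((key, value if sep else None))
    return params


def pci_options(params):
    """Collect every option passed through one or more pci= entries."""
    options = []
    for key, value in params:
        if key == "pci" and value:
            options.extend(value.split(","))
    return options


def main():
    path = Path("/proc/cmdline")
    if not path.exists():
        print("/proc/cmdline not available")
        return
    params = parse_cmdline(path.read_text())
    print("pci options:", pci_options(params))
    print("acpi_osi:", [value for key, value in params if key == "acpi_osi"])


if __name__ == "__main__":
    main()
```

An empty list for an option that was supposed to be set would mean the bootloader configuration never reached the kernel, so that workaround was not really tested.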
Conclusion:
This appears to be a PCI resource allocation regression or a missing quirk for the Acer SF314-56G bridge. Newer kernels ignore BIOS limits or fail to correctly negotiate the bridge window for 64-bit BARs on this specific motherboard. Currently, the only way to use the GPU is to pin the system to the legacy 5.15 Kernel branch.
Additional Findings/New Test Results:
I have further isolated the issue by testing various LTS kernels on the same OS (Ubuntu 22.04) with the same driver (NVIDIA 580.142):
SUCCESS: XanMod 5.15.75 LTS. GPU initializes perfectly (rev a1), no BAR errors.
FAIL: XanMod 6.1.77 LTS. GPU fails with: pci 0000:02:00.0: not ready 1023ms after resume; giving up.
This narrows the regression to the changes between the 5.15 and 6.1 series, and every newer kernel I tested fails as well. The 1023ms timeout suggests a race condition or a change in how the PCI bridge handles the wake-up sequence from D3cold on this specific Acer hardware.
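Because a bisect halves the candidate range with every build, the cost of bisecting between a good and a bad kernel at one build per night can be estimated in advance. This is a rough sketch only: the commit count has to come from the actual tree (for example from "git rev-list --count good..bad"), and git bisect may need a step more or less than the estimate. The function name is made up for illustration.

```python
def bisect_steps(commit_count):
    """Approximate number of builds a bisect needs: ceil(log2(commit_count))."""
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1
    return (commit_count - 1).bit_length()


if __name__ == "__main__":
    # Placeholder counts, not measured values from any kernel tree.
    for count in (1_000, 10_000, 100_000):
        print(f"{count} commits: about {bisect_steps(count)} overnight builds")
```

Even a range of a hundred thousand commits comes out at under three weeks of overnight builds, so a bisect is slow on this machine but not out of reach.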
Kind regards