public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
@ 2026-04-02 23:43 Dexuan Cui
  2026-04-05 23:15 ` Michael Kelley
  0 siblings, 1 reply; 2+ messages in thread
From: Dexuan Cui @ 2026-04-02 23:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
	mani, robh, bhelgaas, jakeo, linux-hyperv, linux-pci,
	linux-kernel, mhklinux, matthew.ruffell, kjlx
  Cc: Krister Johansen, stable

There has been a longstanding MMIO conflict between the pci_hyperv
driver's config_window (see hv_allocate_config_window()) and the
hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
both get MMIO from the low MMIO range below 4GB. This is not an issue
in the normal kernel, since the VMBus driver reserves the framebuffer
MMIO range in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram()
can always get the reserved framebuffer MMIO. However, a Gen2 VM's
kdump kernel can fail to reserve the framebuffer MMIO in
vmbus_reserve_fb(), because screen_info.lfb_base is zero in the
kdump kernel for several possible reasons (see the Link below for
more details):

1) on ARM64, the two syscalls (KEXEC_LOAD, KEXEC_FILE_LOAD) don't
initialize the screen_info.lfb_base for the kdump kernel;

2) on x86-64, the KEXEC_FILE_LOAD syscall initializes the kdump kernel's
screen_info.lfb_base, but the KEXEC_LOAD syscall doesn't do that
when the hyperv_drm driver loads, because the user-space kexec-tools
(i.e. the program 'kexec') doesn't recognize the hyperv_drm driver
(leaving aside the behavior of very old versions of kexec-tools).

When vmbus_reserve_fb() fails to reserve the framebuffer MMIO in the
kdump kernel, if pci_hyperv in the kdump kernel loads before hyperv_drm
loads, pci_hyperv's vmbus_allocate_mmio() gets the framebuffer MMIO
and tries to use it, but since the host thinks that the MMIO range is
still in use by hyperv_drm, the host refuses to accept the MMIO range
as the config window, and pci_hyperv's hv_pci_enter_d0() errors out,
e.g. an error can be "PCI Pass-through VSP failed D0 Entry with status
c0370048".

In the past, this pci_hyperv error in the kdump kernel was typically
not fatal, because the kdump kernel usually doesn't rely on pci_hyperv:
the root file system is on a VMBus SCSI device.

Now, a VM on Azure can boot from NVMe, i.e. the root file system can be
on an NVMe device, which depends on pci_hyperv. When the error occurs,
the kdump kernel fails to boot up since no root file system is detected.

Fix the MMIO conflict by allocating MMIO above 4GB for the config_window,
so it won't conflict with hyperv_drm's MMIO, which should be below the
4GB boundary. The size of config_window is small: it's only 8KB per PCI
device, so there should be sufficient MMIO space available above 4GB.
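For illustration, here is a small userspace model of the constraint this
patch adds. This is not the real vmbus_allocate_mmio(), which walks the
hyperv_mmio resource tree; the helper name and hole boundaries below are
made up, and only the min/align semantics of the arguments are modeled:

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4G                  0x100000000ULL
#define PCI_CONFIG_MMIO_LENGTH 0x2000ULL  /* 8 KiB: two 4 KiB pages per device */

/*
 * Hypothetical model: take the first 'align'-aligned address at or
 * above 'min' (SZ_4G after this patch) that fits in a free MMIO hole.
 */
static uint64_t pick_config_base(uint64_t hole_start, uint64_t hole_end,
				 uint64_t min, uint64_t align)
{
	uint64_t base = hole_start > min ? hole_start : min;

	base = (base + align - 1) & ~(align - 1);	/* round up to 'align' */
	if (base + PCI_CONFIG_MMIO_LENGTH > hole_end)
		return 0;				/* no fit in this hole */
	return base;
}
```

With min == 0 (the old code), a hole starting at e.g. 0xF8000000 -- where
the framebuffer may live -- would be picked; with min == SZ_4G, only space
above the 4GB boundary qualifies.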

Note: we still need to figure out how to address the possible MMIO
conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
MMIO BARs, but that's of low priority because all PCI devices available
to a Linux VM on Azure or on a modern host should use 64-bit BARs and
should not use 32-bit BARs -- I checked Mellanox VFs, MANA VFs, NVMe
devices, and GPUs in Linux VMs on Azure, and found no 32-bit BARs.
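For reference, the 32-bit vs 64-bit distinction is visible in bits [2:1]
of each memory BAR register, per the PCI spec. A minimal check (the macro
names mirror the kernel's PCI_BASE_ADDRESS_* definitions from pci_regs.h
but are redefined locally so the sketch is self-contained):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BAR_SPACE_IO      0x01	/* bit 0: 1 = I/O BAR, 0 = memory BAR */
#define BAR_MEM_TYPE_MASK 0x06	/* bits [2:1]: memory decode type */
#define BAR_MEM_TYPE_64   0x04	/* 10b = 64-bit, consumes two BAR slots */

/* Return true if 'bar' is a memory BAR that decodes a 64-bit address. */
static bool bar_is_64bit_mem(uint32_t bar)
{
	if (bar & BAR_SPACE_IO)
		return false;
	return (bar & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}
```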

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Link: https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Tested-by: Krister Johansen <johansen@templeofstupid.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org
---

Changes since v1:
    Updated the commit message and the comment to better explain
    why screen_info.lfb_base can be 0 in the kdump kernel.

    No code change since v1.


 drivers/pci/controller/pci-hyperv.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 2c7a406b4ba8..1a79334ea9f4 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3403,9 +3403,26 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
 
 	/*
 	 * Set up a region of MMIO space to use for accessing configuration
-	 * space.
+	 * space. Use the high MMIO range to not conflict with the hyperv_drm
+	 * driver (which normally gets MMIO from the low MMIO range) in the
+	 * kdump kernel of a Gen2 VM, which may fail to reserve the framebuffer
+	 * MMIO range in vmbus_reserve_fb() due to screen_info.lfb_base being
+	 * zero in the kdump kernel:
+	 *
+	 * on ARM64, the two syscalls (KEXEC_LOAD, KEXEC_FILE_LOAD) don't
+	 * initialize the screen_info.lfb_base for the kdump kernel;
+	 *
+	 * on x86-64, the KEXEC_FILE_LOAD syscall initializes kdump kernel's
+	 * screen_info.lfb_base (see bzImage64_load() -> setup_boot_parameters())
+	 * but the KEXEC_LOAD syscall doesn't really do that when the hyperv_drm
+	 * driver loads, because the user-space program 'kexec' doesn't
+	 * recognize hyperv_drm: see the function setup_linux_vesafb() in the
+	 * kexec-tools.git repo. Note: old versions of kexec-tools, e.g.
+	 * v2.0.18, initialize screen_info.lfb_base if the hyperv_fb driver
+	 * loads, but hyperv_fb is deprecated and has been removed from the
+	 * mainline kernel.
 	 */
-	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, 0, -1,
+	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, SZ_4G, -1,
 				  PCI_CONFIG_MMIO_LENGTH, 0x1000, false);
 	if (ret)
 		return ret;
-- 
2.43.0



* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
  2026-04-02 23:43 [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window Dexuan Cui
@ 2026-04-05 23:15 ` Michael Kelley
  0 siblings, 0 replies; 2+ messages in thread
From: Michael Kelley @ 2026-04-05 23:15 UTC (permalink / raw)
  To: Dexuan Cui, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, longli@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, jakeo@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Michael Kelley,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org

From: Dexuan Cui <decui@microsoft.com> Sent: Thursday, April 2, 2026 4:43 PM
> 
> There has been a longstanding MMIO conflict between the pci_hyperv
> driver's config_window (see hv_allocate_config_window()) and the
> hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
> both get MMIO from the low MMIO range below 4GB. This is not an issue
> in the normal kernel, since the VMBus driver reserves the framebuffer
> MMIO range in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram()
> can always get the reserved framebuffer MMIO. However, a Gen2 VM's
> kdump kernel can fail to reserve the framebuffer MMIO in
> vmbus_reserve_fb(), because screen_info.lfb_base is zero in the
> kdump kernel for several possible reasons (see the Link below for
> more details):
> 
> 1) on ARM64, the two syscalls (KEXEC_LOAD, KEXEC_FILE_LOAD) don't
> initialize the screen_info.lfb_base for the kdump kernel;
> 
> 2) on x86-64, the KEXEC_FILE_LOAD syscall initializes the kdump kernel's
> screen_info.lfb_base, but the KEXEC_LOAD syscall doesn't do that
> when the hyperv_drm driver loads, because the user-space kexec-tools
> (i.e. the program 'kexec') doesn't recognize the hyperv_drm driver
> (leaving aside the behavior of very old versions of kexec-tools).
> 
> When vmbus_reserve_fb() fails to reserve the framebuffer MMIO in the
> kdump kernel, if pci_hyperv in the kdump kernel loads before hyperv_drm
> loads, pci_hyperv's vmbus_allocate_mmio() gets the framebuffer MMIO
> and tries to use it, but since the host thinks that the MMIO range is
> still in use by hyperv_drm, the host refuses to accept the MMIO range
> as the config window, and pci_hyperv's hv_pci_enter_d0() errors out,
> e.g. an error can be "PCI Pass-through VSP failed D0 Entry with status
> c0370048".
> 
> In the past, this pci_hyperv error in the kdump kernel was typically
> not fatal, because the kdump kernel usually doesn't rely on pci_hyperv:
> the root file system is on a VMBus SCSI device.
> 
> Now, a VM on Azure can boot from NVMe, i.e. the root file system can be
> on an NVMe device, which depends on pci_hyperv. When the error occurs,
> the kdump kernel fails to boot up since no root file system is detected.
> 
> Fix the MMIO conflict by allocating MMIO above 4GB for the config_window,
> so it won't conflict with hyperv_drm's MMIO, which should be below the
> 4GB boundary. The size of config_window is small: it's only 8KB per PCI
> device, so there should be sufficient MMIO space available above 4GB.
> 
> Note: we still need to figure out how to address the possible MMIO
> conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
> MMIO BARs, but that's of low priority because all PCI devices available
> to a Linux VM on Azure or on a modern host should use 64-bit BARs and
> should not use 32-bit BARs -- I checked Mellanox VFs, MANA VFs, NVMe
> devices, and GPUs in Linux VMs on Azure, and found no 32-bit BARs.

Just to clarify, since this patch is predicated on all BARs being 64-bit,
hv_pci_alloc_bridge_windows() never encounters a non-zero
hbus->low_mmio_space, and hence also never allocates from low
MMIO space. So hv_pci_alloc_bridge_windows() does not need to be
patched. Is that correct?

Taking a broader view, fundamentally the current MMIO location of
the frame buffer may be unknown to the Linux guest. At the same time,
Linux must ensure that PCI devices don't get assigned to the MMIO space
where the frame buffer is located. While the current MMIO location of
the frame buffer may be unknown, we can assume it was placed in low
MMIO space by the host -- either Windows Hyper-V or Linux/VMM
in the root partition, and perhaps as mediated by a paravisor. Probably
need to confirm with the Linux-in-the-root partition team (and maybe
the OpenHCL team) that this assumption is true. Presumably the
hyperv_drm driver doesn't need to move the frame buffer, but if it
does, it must stay in the low MMIO space.

This patch depends on this assumption, and effectively reserves
the entire low MMIO space for the frame buffer. The low MMIO space
size defaults to 128 MiB on a local Hyper-V, and is set to 3 GiB in most
Azure VMs (or to 1 GiB in an Azure CVM), so that all gets reserved.

A slightly different approach to the whole problem is to change
vmbus_reserve_fb(). If it is unable to get a non-zero "start" value, then
it should use the same assumption as above, and reserve a frame buffer
area starting at the lowest address in low MMIO space. The reserved size
could be the max possible frame buffer size, which I think is 64 MiB (?).
This still leaves low MMIO space for subsequent PCI devices, and allows
32-bit BARs to continue to work. This approach requires one further
assumption, which is that the host, plus any movement by hyperv_drm,
has kept the frame buffer at the low end of the low MMIO space. From
what I've seen, that assumption is reality -- the frame buffer always
starts at the beginning of low MMIO space.

This approach could be taken one step further, where vmbus_reserve_fb()
*always* reserves 64 MiB starting at the low end of low MMIO space,
regardless of the value of "start". The messy code for getting "start"
could be dropped entirely, and the dependency on CONFIG_SYSFB goes
away. Or maybe still get the value of "start" and "size", and if non-zero
just do a sanity check that they are within the fixed 64 MiB reserved area.
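A userspace sketch of the first variant above (trust screen_info when it
is populated, otherwise fall back to the bottom of low MMIO space). The
helper and constants are hypothetical; the real vmbus_reserve_fb() works
with struct resource, and the 64 MiB cap is the assumption from this
discussion:

```c
#include <assert.h>
#include <stdint.h>

#define FB_MAX_SIZE (64ULL << 20)	/* assumed max framebuffer: 64 MiB */

struct fb_resv {
	uint64_t start;
	uint64_t size;
};

/*
 * Hypothetical fallback policy: reserve the reported framebuffer range
 * when screen_info gives one, otherwise reserve FB_MAX_SIZE at the
 * bottom of the low MMIO window rather than reserving nothing.
 */
static struct fb_resv fb_reserve(uint64_t lfb_base, uint64_t lfb_size,
				 uint64_t low_mmio_start)
{
	struct fb_resv r;

	if (lfb_base) {			/* screen_info is populated */
		r.start = lfb_base;
		r.size = lfb_size;
	} else {			/* kdump case: lfb_base == 0 */
		r.start = low_mmio_start;
		r.size = FB_MAX_SIZE;
	}
	return r;
}
```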

Thoughts? To me tweaking vmbus_reserve_fb() is a more
straightforward and explicit way to do the reserving, vs. modifying
the requested range in the Hyper-V PCI driver. And FWIW, it avoids
introducing the 32-bit BAR limitation.

Michael


