* Re: [PATCH v2 0/2] Fixes for movable pages
From: Wei Liu @ 2026-01-15 7:31 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260105122837.1083896-1-anirudh@anirudhrb.com>
On Mon, Jan 05, 2026 at 12:28:35PM +0000, Anirudh Rayabharam wrote:
> From: "Anirudh Rayabharam (Microsoft)" <anirudh@anirudhrb.com>
>
> Fix movable pages for arm64 guests by implementing a GPA intercept
> handler.
>
> v2:
> - Added "Fixes:" tag
> - Got rid of the utility function to get intercept GPA and instead
> integrated the rather small logic into the GPA intercept handling
> function.
> - Dropped patch 3 since it was applied to the fixes tree.
>
> Anirudh Rayabharam (Microsoft) (2):
> hyperv: add definitions for arm64 gpa intercepts
> mshv: handle gpa intercepts for arm64
>
The code looks fine. I have queued both patches.
I massaged the first subject line a bit.
Wei
> drivers/hv/mshv_root_main.c | 15 ++++++------
> include/hyperv/hvhdk.h | 47 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 55 insertions(+), 7 deletions(-)
>
> --
> 2.34.1
>
^ permalink raw reply
* [PATCH 0/2] kbuild, uapi: Mark inner unions in packed structs as packed
From: Thomas Weißschuh @ 2026-01-15 7:35 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, Greg Kroah-Hartman
Cc: linux-hyperv, linux-kernel, llvm, Thomas Weißschuh,
kernel test robot
The unpacked unions within a packed struct generate alignment warnings
on clang for 32-bit ARM.
With the recent changes to compile-test the UAPI headers in more cases,
these warnings in combination with CONFIG_WERROR break the build.
Fix the warnings.
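
To illustrate what the warning is about, here is a minimal user-space
sketch. The type names (plain_value, packed_value, msg_plain, msg_packed)
are hypothetical and are not the UAPI definitions, and the checks assume
GCC or clang. It suggests that the packed outer struct already places the
union at an unaligned offset, while the union type itself keeps its
natural alignment unless it is marked packed as well:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the UAPI definitions. */
union plain_value {
	uint32_t v32;
	uint64_t v64;
};

union packed_value {
	uint32_t v32;
	uint64_t v64;
} __attribute__((packed));

struct msg_plain {
	uint8_t tag;
	union plain_value u;	/* field sits at offset 1, type wants more */
} __attribute__((packed));

struct msg_packed {
	uint8_t tag;
	union packed_value u;	/* field and type both have alignment 1 */
} __attribute__((packed));

/* The outer packed attribute alone already fixes the layout ... */
static_assert(offsetof(struct msg_plain, u) == 1, "packed outer struct");
static_assert(offsetof(struct msg_packed, u) == 1, "packed outer struct");
static_assert(sizeof(struct msg_plain) == sizeof(struct msg_packed),
	      "same size either way");

/* ... but only the packed union type has alignment 1. */
static_assert(_Alignof(union plain_value) > 1, "natural alignment kept");
static_assert(_Alignof(union packed_value) == 1, "alignment reduced to 1");
```

In this sketch the mismatch between the field's placement and the union
type's alignment is what the inner packed attribute removes; the layout
itself stays the same.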
Intended for the kbuild tree.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Thomas Weißschuh (2):
hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
virt: vbox: uapi: Mark inner unions in packed structs as packed
include/uapi/linux/hyperv.h | 2 +-
include/uapi/linux/vbox_vmmdev_types.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
---
base-commit: e3970d77ec504e54c3f91a48b2125775c16ba4c0
change-id: 20260115-kbuild-alignment-vbox-d0409134d335
Best regards,
--
Thomas Weißschuh <thomas.weissschuh@linutronix.de>
^ permalink raw reply
* [PATCH 1/2] hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
From: Thomas Weißschuh @ 2026-01-15 7:35 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, Greg Kroah-Hartman
Cc: linux-hyperv, linux-kernel, llvm, Thomas Weißschuh,
kernel test robot
In-Reply-To: <20260115-kbuild-alignment-vbox-v1-0-076aed1623ff@linutronix.de>
The unpacked union within a packed struct generates alignment warnings
on clang for 32-bit ARM:
./usr/include/linux/hyperv.h:361:2: error: field within 'struct hv_kvp_exchg_msg_value'
is less aligned than 'union hv_kvp_exchg_msg_value::(anonymous at ./usr/include/linux/hyperv.h:361:2)'
and is usually due to 'struct hv_kvp_exchg_msg_value' being packed,
which can lead to unaligned accesses [-Werror,-Wunaligned-access]
361 | union {
| ^
With the recent changes to compile-test the UAPI headers in more cases,
this warning in combination with CONFIG_WERROR breaks the build.
Fix the warning.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512140314.DzDxpIVn-lkp@intel.com/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/linux-kbuild/20260110-uapi-test-disable-headers-arm-clang-unaligned-access-v1-1-b7b0fa541daa@kernel.org/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-kbuild/29b2e736-d462-45b7-a0a9-85f8d8a3de56@app.fastmail.com/
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
include/uapi/linux/hyperv.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index aaa502a7bff4..1749b35ab2c2 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -362,7 +362,7 @@ struct hv_kvp_exchg_msg_value {
__u8 value[HV_KVP_EXCHANGE_MAX_VALUE_SIZE];
__u32 value_u32;
__u64 value_u64;
- };
+ } __attribute__((packed));
} __attribute__((packed));
struct hv_kvp_msg_enumerate {
--
2.52.0
^ permalink raw reply related
* [PATCH 2/2] virt: vbox: uapi: Mark inner unions in packed structs as packed
From: Thomas Weißschuh @ 2026-01-15 7:35 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, Greg Kroah-Hartman
Cc: linux-hyperv, linux-kernel, llvm, Thomas Weißschuh,
kernel test robot
In-Reply-To: <20260115-kbuild-alignment-vbox-v1-0-076aed1623ff@linutronix.de>
The unpacked unions within a packed struct generate alignment warnings
on clang for 32-bit ARM:
./usr/include/linux/vbox_vmmdev_types.h:239:4: error: field u within 'struct vmmdev_hgcm_function_parameter32'
is less aligned than 'union (unnamed union at ./usr/include/linux/vbox_vmmdev_types.h:223:2)'
and is usually due to 'struct vmmdev_hgcm_function_parameter32' being packed,
which can lead to unaligned accesses [-Werror,-Wunaligned-access]
239 | } u;
| ^
./usr/include/linux/vbox_vmmdev_types.h:254:6: error: field u within
'struct vmmdev_hgcm_function_parameter64::(anonymous union)::(unnamed at ./usr/include/linux/vbox_vmmdev_types.h:249:3)'
is less aligned than 'union (unnamed union at ./usr/include/linux/vbox_vmmdev_types.h:251:4)' and is usually due to
'struct vmmdev_hgcm_function_parameter64::(anonymous union)::(unnamed at ./usr/include/linux/vbox_vmmdev_types.h:249:3)'
being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access]
With the recent changes to compile-test the UAPI headers in more cases,
these warnings in combination with CONFIG_WERROR break the build.
Fix the warnings.
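
A user-space check in the spirit of the VMMDEV_ASSERT_SIZE() line in the
diff can be used to confirm that packing the inner union does not change
the layout. The structs below are a simplified, hypothetical mirror of a
32-bit HGCM parameter, not the real UAPI definition, and PACKED is a local
stand-in for the kernel's __packed macro:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's __packed macro (assumes GCC or clang). */
#define PACKED __attribute__((packed))

/* Simplified, hypothetical mirror of the struct before the patch. */
struct param32_before {
	uint32_t type;
	union {
		uint64_t value64;
		struct {
			uint32_t size;
			uint32_t offset;
		} page_list;
	} u;			/* inner union not packed */
} PACKED;

/* The same struct with the inner union marked packed. */
struct param32_after {
	uint32_t type;
	union {
		uint64_t value64;
		struct {
			uint32_t size;
			uint32_t offset;
		} page_list;
	} PACKED u;		/* inner union packed, as in the patch */
} PACKED;

/* Size and member offset are identical in both variants. */
static_assert(sizeof(struct param32_before) == 4 + 8, "size before");
static_assert(sizeof(struct param32_after) == 4 + 8, "size after");
static_assert(offsetof(struct param32_before, u) ==
	      offsetof(struct param32_after, u), "same member offset");
```

Under these assumptions the attribute only changes the alignment that the
union type advertises, which is consistent with the existing size
assertion in the header continuing to hold.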
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512140314.DzDxpIVn-lkp@intel.com/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/linux-kbuild/20260110-uapi-test-disable-headers-arm-clang-unaligned-access-v1-1-b7b0fa541daa@kernel.org/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/linux-kbuild/29b2e736-d462-45b7-a0a9-85f8d8a3de56@app.fastmail.com/
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
include/uapi/linux/vbox_vmmdev_types.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/vbox_vmmdev_types.h b/include/uapi/linux/vbox_vmmdev_types.h
index 6073858d52a2..11f3627c3729 100644
--- a/include/uapi/linux/vbox_vmmdev_types.h
+++ b/include/uapi/linux/vbox_vmmdev_types.h
@@ -236,7 +236,7 @@ struct vmmdev_hgcm_function_parameter32 {
/** Relative to the request header. */
__u32 offset;
} page_list;
- } u;
+ } __packed u;
} __packed;
VMMDEV_ASSERT_SIZE(vmmdev_hgcm_function_parameter32, 4 + 8);
@@ -251,7 +251,7 @@ struct vmmdev_hgcm_function_parameter64 {
union {
__u64 phys_addr;
__u64 linear_addr;
- } u;
+ } __packed u;
} __packed pointer;
struct {
/** Size of the buffer described by the page list. */
--
2.52.0
^ permalink raw reply related
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Thomas Zimmermann @ 2026-01-15 11:02 UTC (permalink / raw)
To: Zack Rusin
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Christian König, Danilo Krummrich, Dave Airlie,
Deepak Rawat, Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh,
Hans de Goede, Hawking Zhang, Helge Deller, intel-gfx, intel-xe,
Jani Nikula, Javier Martinez Canillas, Jocelyn Falempe,
Joonas Lahtinen, Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv,
linux-kernel, Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <CABQX2QNQU4XZ1rJFqnJeMkz8WP=t9atj0BqXHbDQab7ZnAyJxg@mail.gmail.com>
Hi,
apologies for the delay. I wanted to reply and then forgot about it.
Am 10.01.26 um 05:52 schrieb Zack Rusin:
> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Hi
>>
>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>> Almost a rite of passage for every DRM developer and most Linux users
>>> is upgrading your DRM driver/updating boot flags/changing some config
>>> and having DRM driver fail at probe resulting in a blank screen.
>>>
>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>> DRM drivers explicitly throw out the existing sysfb to get exclusive
>>> access to PCI resources so if the probe fails the system is left without
>>> a functioning display driver.
>>>
>>> Add code to sysfb to recover the system framebuffer when a DRM driver's probe
>>> fails. This means that a DRM driver that fails to load reloads the system
>>> framebuffer driver.
>>>
>>> This works best with simpledrm. Without it Xorg won't recover because
>>> it still tries to load the vendor specific driver which ends up usually
>>> not working at all. With simpledrm the system recovers really nicely
>>> ending up with a working console and not a blank screen.
>>>
>>> There's a caveat in that some hardware might require some special magic
>>> register write to recover EFI display. I'd appreciate it a lot if
>>> maintainers could introduce a temporary failure in their drivers
>>> probe to validate that the sysfb recovers and they get a working console.
>>> The easiest way to double check it is by adding:
>>> /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>> dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>> ret = -EINVAL;
>>> goto out_error;
>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>> Recovering the display like that is guess work and will at best work
>> with simple discrete devices where the framebuffer is always located in
>> a confined graphics aperture.
>>
>> But the problem you're trying to solve is a real one.
>>
>> What we'd want to do instead is to take the initial hardware state into
>> account when we do the initial mode-setting operation.
>>
>> The first step is to move each driver's remove_conflicting_devices call
>> to the latest possible location in the probe function. We usually do it
>> first, because that's easy. But on most hardware, it could happen much
>> later.
> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
> they request pci regions which is going to fail otherwise. Because
> grabbing the pci resources is in general the very first thing that
> those drivers need to do to setup anything, we
> remove_conflicting_devices first or at least very early.
To my knowledge, requesting resources is more about correctness than a
hard requirement to use an I/O or memory range. Has this changed?
>
> I also don't think it's possible or even desirable by some drivers to
> reuse the initial state, good example here is vmwgfx where by default
> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
> loads we allow scanning out from system memory, so you can set your vm
> up with 8mb of vram but still use 4k resolutions when the driver
> loads, this way the suspend size of the vm is very predictable (tiny
> vram plus whatever ram was setup) while still allowing a lot of
> flexibility.
If there's no initial state to switch from, the first modeset can fail
while leaving the display unusable. There's no way around that. Going
back to the old state is not an option unless the driver has been
written to support this.
The case of vmwgfx is special, but does not affect the overall problem.
For vmwgfx, it would be best to import that initial state and support a
transparent modeset from vram to system memory (and back) at least
during this initial state.
>
> In general I think however this is planned it's two or three separate series:
> 1) infrastructure to reload the sysfb driver (what this series is)
> 2) making sure that drivers that do want to recover cleanly actually
> clean out all the state on exit properly,
> 3) abstracting at least some of that cleanup in some driver independent way
That's really not going to work. For example, in the current series, you
invoke devm_aperture_remove_conflicting_pci_devices_done() after
drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of
these calls can modify hardware state. In the case of _register() and
_setup(), the DRM clients can perform a modeset, which destroys the
initial hardware state.

Patch 1 of this series removes the sysfb device/driver entirely. That
should be a no-go as it significantly complicates recovery. For example,
if the native drivers failed from an allocation failure, the sysfb
device/driver is not likely to come back either.

As the very first thing, the series should state which failures it is
going to resolve,

  - failed hardware init,
  - invalid initial modesetting,
  - runtime errors (such as ENOMEM, failed firmware loading),
  - others?

And then specify how a recovery to sysfb could look in each supported
scenario.

In terms of implementation, make any transition between drivers
gradual. The native driver needs to acquire the hardware resource
(framebuffer and I/O apertures) without unloading the sysfb driver.
Luckily there's struct drm_device.unplug, which does that. [1] Flipping
this field disables hardware access for DRM drivers. All sysfb drivers
support this.

To get the sysfb drivers ready, I suggest dedicated helpers for each
driver's aperture. The aperture helpers can use these callbacks to flip
the DRM driver off and on again. For example, efidrm could do this as a
minimum:

    int efidrm_aperture_suspend()
    {
        dev->unplug = true;
        remove_resource(/*framebuffer aperture*/);

        return 0;
    }

    int efidrm_aperture_resume()
    {
        insert_resource(/*framebuffer aperture*/);
        dev->unplug = false;

        return 0;
    }

    struct aperture_funcs efidrm_aperture_funcs = {
        .suspend = efidrm_aperture_suspend,
        .resume = efidrm_aperture_resume,
    };

Pass this struct when efidrm acquires the framebuffer aperture, so that
the aperture helpers can control the behavior of efidrm. With this, a
multi-step takeover from sysfb to native driver can be tried.

It's still a massive effort that requires an audit of each driver's
probing logic. There's no copy-paste pattern AFAICT. I suggest to pick
one simple driver first and make a prototype.

Let me also say that I DO like the general idea you're proposing. But if
it was easy, we would likely have done it already.

Best regards
Thomas
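
The suspend/resume callback idea from this reply can be modelled in user
space to make the intended control flow concrete. This is only a sketch
under stated assumptions: every name below (mock_sysfb,
mock_aperture_funcs, mock_takeover and the probe stubs) is hypothetical,
none of it is kernel API, and the boolean fields merely stand in for
drm_device.unplug and for ownership of the framebuffer aperture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sysfb device; not a kernel structure. */
struct mock_sysfb {
	bool unplug;		/* models the idea of drm_device.unplug */
	bool owns_aperture;	/* true while the framebuffer range is claimed */
};

/* Hypothetical callback table, modelled on the aperture_funcs idea. */
struct mock_aperture_funcs {
	int (*suspend)(struct mock_sysfb *fb);
	int (*resume)(struct mock_sysfb *fb);
};

static int mock_sysfb_suspend(struct mock_sysfb *fb)
{
	fb->unplug = true;		/* stop touching the hardware */
	fb->owns_aperture = false;	/* release the framebuffer range */
	return 0;
}

static int mock_sysfb_resume(struct mock_sysfb *fb)
{
	fb->owns_aperture = true;	/* reclaim the framebuffer range */
	fb->unplug = false;		/* allow hardware access again */
	return 0;
}

static const struct mock_aperture_funcs mock_sysfb_funcs = {
	.suspend = mock_sysfb_suspend,
	.resume = mock_sysfb_resume,
};

/*
 * Try to hand the display over to a native driver. If its probe fails,
 * the sysfb driver is resumed instead of being left unloaded.
 */
static int mock_takeover(struct mock_sysfb *fb,
			 const struct mock_aperture_funcs *funcs,
			 int (*native_probe)(void))
{
	int ret = funcs->suspend(fb);

	if (ret)
		return ret;

	ret = native_probe();
	if (ret)
		funcs->resume(fb);	/* fall back to the firmware framebuffer */

	return ret;
}

/* Probe stubs: one succeeds, one fails with a value standing in for -EINVAL. */
static int probe_ok(void) { return 0; }
static int probe_fails(void) { return -22; }
```

The point of the model is that the sysfb side is never torn down, only
switched off, so a failed native probe can switch it back on. Whether real
hardware is still in a usable state at that point is exactly the per-driver
audit question raised above and is not captured by this sketch.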
>
> z
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Christian König @ 2026-01-15 14:39 UTC (permalink / raw)
To: Thomas Zimmermann, Zack Rusin
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Danilo Krummrich, Dave Airlie, Deepak Rawat,
Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh, Hans de Goede,
Hawking Zhang, Helge Deller, intel-gfx, intel-xe, Jani Nikula,
Javier Martinez Canillas, Jocelyn Falempe, Joonas Lahtinen,
Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv, linux-kernel,
Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <97993761-5884-4ada-b345-9fb64819e02a@suse.de>
Sorry for being late, but I only now realized what you are doing here.
On 1/15/26 12:02, Thomas Zimmermann wrote:
> Hi,
>
> apologies for the delay. I wanted to reply and then forgot about it.
>
> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>> Hi
>>>
>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>
>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>> DRM drivers explicitly throw out the existing sysfb to get exclusive
>>>> access to PCI resources so if the probe fails the system is left without
>>>> a functioning display driver.
>>>>
>>>> Add code to sysfb to recover the system framebuffer when a DRM driver's probe
>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>> framebuffer driver.
>>>>
>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>> it still tries to load the vendor specific driver which ends up usually
>>>> not working at all. With simpledrm the system recovers really nicely
>>>> ending up with a working console and not a blank screen.
>>>>
>>>> There's a caveat in that some hardware might require some special magic
>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>> maintainers could introduce a temporary failure in their drivers
>>>> probe to validate that the sysfb recovers and they get a working console.
>>>> The easiest way to double check it is by adding:
>>>> /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>> dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>> ret = -EINVAL;
>>>> goto out_error;
>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>> Recovering the display like that is guess work and will at best work
>>> with simple discrete devices where the framebuffer is always located in
>>> a confined graphics aperture.
>>>
>>> But the problem you're trying to solve is a real one.
>>>
>>> What we'd want to do instead is to take the initial hardware state into
>>> account when we do the initial mode-setting operation.
>>>
>>> The first step is to move each driver's remove_conflicting_devices call
>>> to the latest possible location in the probe function. We usually do it
>>> first, because that's easy. But on most hardware, it could happen much
>>> later.
>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>> they request pci regions which is going to fail otherwise. Because
>> grabbing the pci resources is in general the very first thing that
>> those drivers need to do to setup anything, we
>> remove_conflicting_devices first or at least very early.
>
> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
Nope that is not correct.
At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.
For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
So I absolutely clearly have to reject the amdgpu patch in this series, that will break tons of use cases.
Regards,
Christian.
>> I also don't think it's possible or even desirable by some drivers to
>> reuse the initial state, good example here is vmwgfx where by default
>> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
>> loads we allow scanning out from system memory, so you can set your vm
>> up with 8mb of vram but still use 4k resolutions when the driver
>> loads, this way the suspend size of the vm is very predictable (tiny
>> vram plus whatever ram was setup) while still allowing a lot of
>> flexibility.
>
> If there's no initial state to switch from, the first modeset can fail while leaving the display unusable. There's no way around that. Going back to the old state is not an option unless the driver has been written to support this.
>
> The case of vmwgfx is special, but does not affect the overall problem. For vmwgfx, it would be best to import that initial state and support a transparent modeset from vram to system memory (and back) at least during this initial state.
>
>
>>
>> In general I think however this is planned it's two or three separate series:
>> 1) infrastructure to reload the sysfb driver (what this series is)
>> 2) making sure that drivers that do want to recover cleanly actually
>> clean out all the state on exit properly,
>> 3) abstracting at least some of that cleanup in some driver independent way
>
> That's really not going to work. For example, in the current series, you
> invoke devm_aperture_remove_conflicting_pci_devices_done() after
> drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of
> these calls can modify hardware state. In the case of _register() and
> _setup(), the DRM clients can perform a modeset, which destroys the
> initial hardware state.
>
> Patch 1 of this series removes the sysfb device/driver entirely. That
> should be a no-go as it significantly complicates recovery. For example,
> if the native drivers failed from an allocation failure, the sysfb
> device/driver is not likely to come back either.
>
> As the very first thing, the series should state which failures it is
> going to resolve,
>
>   - failed hardware init,
>   - invalid initial modesetting,
>   - runtime errors (such as ENOMEM, failed firmware loading),
>   - others?
>
> And then specify how a recovery to sysfb could look in each supported
> scenario.
>
> In terms of implementation, make any transition between drivers
> gradual. The native driver needs to acquire the hardware resource
> (framebuffer and I/O apertures) without unloading the sysfb driver.
> Luckily there's struct drm_device.unplug, which does that. [1] Flipping
> this field disables hardware access for DRM drivers. All sysfb drivers
> support this.
>
> To get the sysfb drivers ready, I suggest dedicated helpers for each
> driver's aperture. The aperture helpers can use these callbacks to flip
> the DRM driver off and on again. For example, efidrm could do this as a
> minimum:
>
>     int efidrm_aperture_suspend()
>     {
>         dev->unplug = true;
>         remove_resource(/*framebuffer aperture*/);
>
>         return 0;
>     }
>
>     int efidrm_aperture_resume()
>     {
>         insert_resource(/*framebuffer aperture*/);
>         dev->unplug = false;
>
>         return 0;
>     }
>
>     struct aperture_funcs efidrm_aperture_funcs = {
>         .suspend = efidrm_aperture_suspend,
>         .resume = efidrm_aperture_resume,
>     };
>
> Pass this struct when efidrm acquires the framebuffer aperture, so that
> the aperture helpers can control the behavior of efidrm. With this, a
> multi-step takeover from sysfb to native driver can be tried.
>
> It's still a massive effort that requires an audit of each driver's
> probing logic. There's no copy-paste pattern AFAICT. I suggest to pick
> one simple driver first and make a prototype.
>
> Let me also say that I DO like the general idea you're proposing. But if
> it was easy, we would likely have done it already.
>
> Best regards
> Thomas
>>
>> z
>
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Thomas Zimmermann @ 2026-01-15 14:54 UTC (permalink / raw)
To: Christian König, Zack Rusin
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Danilo Krummrich, Dave Airlie, Deepak Rawat,
Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh, Hans de Goede,
Hawking Zhang, Helge Deller, intel-gfx, intel-xe, Jani Nikula,
Javier Martinez Canillas, Jocelyn Falempe, Joonas Lahtinen,
Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv, linux-kernel,
Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <9058636d-cc18-4c8f-92cf-782fd8f771af@amd.com>
Hi
Am 15.01.26 um 15:39 schrieb Christian König:
> Sorry for being late, but I only now realized what you are doing here.
>
> On 1/15/26 12:02, Thomas Zimmermann wrote:
>> Hi,
>>
>> apologies for the delay. I wanted to reply and then forgot about it.
>>
>> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> Hi
>>>>
>>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>>
>>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>>> DRM drivers explicitly throw out the existing sysfb to get exclusive
>>>>> access to PCI resources so if the probe fails the system is left without
>>>>> a functioning display driver.
>>>>>
>>>>> Add code to sysfb to recover the system framebuffer when a DRM driver's probe
>>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>>> framebuffer driver.
>>>>>
>>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>>> it still tries to load the vendor specific driver which ends up usually
>>>>> not working at all. With simpledrm the system recovers really nicely
>>>>> ending up with a working console and not a blank screen.
>>>>>
>>>>> There's a caveat in that some hardware might require some special magic
>>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>>> maintainers could introduce a temporary failure in their drivers
>>>>> probe to validate that the sysfb recovers and they get a working console.
>>>>> The easiest way to double check it is by adding:
>>>>> /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>> dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>> ret = -EINVAL;
>>>>> goto out_error;
>>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>>> Recovering the display like that is guess work and will at best work
>>>> with simple discrete devices where the framebuffer is always located in
>>>> a confined graphics aperture.
>>>>
>>>> But the problem you're trying to solve is a real one.
>>>>
>>>> What we'd want to do instead is to take the initial hardware state into
>>>> account when we do the initial mode-setting operation.
>>>>
>>>> The first step is to move each driver's remove_conflicting_devices call
>>>> to the latest possible location in the probe function. We usually do it
>>>> first, because that's easy. But on most hardware, it could happen much
>>>> later.
>>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>>> they request pci regions which is going to fail otherwise. Because
>>> grabbing the pci resources is in general the very first thing that
>>> those drivers need to do to setup anything, we
>>> remove_conflicting_devices first or at least very early.
>> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
> Nope that is not correct.
>
> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.
Here I was only talking about avoiding calls to request_resource() and
similar interfaces.
>
> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
Yeah, that's what I expected.
>
> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
Assuming the driver (or driver author) is careful, is it possible to
only read state from AMD hardware at such an early time?
We usually do remove_conflicting_devices() as the first thing in most
drivers' probe functions. As a first step, it would be helpful to
postpone it to a later point.
>
> So I absolutely clearly have to reject the amdgpu patch in this series, that will break tons of use cases.
Don't worry, we're still in the early ideation phase.
Best regards
Thomas
>
> Regards,
> Christian.
>
>>> I also don't think it's possible or even desirable by some drivers to
>>> reuse the initial state, good example here is vmwgfx where by default
>>> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
>>> loads we allow scanning out from system memory, so you can set your vm
>>> up with 8mb of vram but still use 4k resolutions when the driver
>>> loads, this way the suspend size of the vm is very predictable (tiny
>>> vram plus whatever ram was setup) while still allowing a lot of
>>> flexibility.
>> If there's no initial state to switch from, the first modeset can fail while leaving the display unusable. There's no way around that. Going back to the old state is not an option unless the driver has been written to support this.
>>
>> The case of vmwgfx is special, but does not affect the overall problem. For vmwgfx, it would be best to import that initial state and support a transparent modeset from vram to system memory (and back) at least during this initial state.
>>
>>
>>> In general I think however this is planned it's two or three separate series:
>>> 1) infrastructure to reload the sysfb driver (what this series is)
>>> 2) making sure that drivers that do want to recover cleanly actually
>>> clean out all the state on exit properly,
>>> 3) abstracting at least some of that cleanup in some driver independent way
>> That's really not going to work. For example, in the current series, you
>> invoke devm_aperture_remove_conflicting_pci_devices_done() after
>> drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of
>> these calls can modify hardware state. In the case of _register() and
>> _setup(), the DRM clients can perform a modeset, which destroys the
>> initial hardware state.
>>
>> Patch 1 of this series removes the sysfb device/driver entirely. That
>> should be a no-go as it significantly complicates recovery. For example,
>> if the native drivers failed from an allocation failure, the sysfb
>> device/driver is not likely to come back either.
>>
>> As the very first thing, the series should state which failures it is
>> going to resolve,
>>
>>   - failed hardware init,
>>   - invalid initial modesetting,
>>   - runtime errors (such as ENOMEM, failed firmware loading),
>>   - others?
>>
>> And then specify how a recovery to sysfb could look in each supported
>> scenario.
>>
>> In terms of implementation, make any transition between drivers
>> gradual. The native driver needs to acquire the hardware resource
>> (framebuffer and I/O apertures) without unloading the sysfb driver.
>> Luckily there's struct drm_device.unplug, which does that. [1] Flipping
>> this field disables hardware access for DRM drivers. All sysfb drivers
>> support this.
>>
>> To get the sysfb drivers ready, I suggest dedicated helpers for each
>> driver's aperture. The aperture helpers can use these callbacks to flip
>> the DRM driver off and on again. For example, efidrm could do this as a
>> minimum:
>>
>>     int efidrm_aperture_suspend()
>>     {
>>         dev->unplug = true;
>>         remove_resource(/*framebuffer aperture*/);
>>
>>         return 0;
>>     }
>>
>>     int efidrm_aperture_resume()
>>     {
>>         insert_resource(/*framebuffer aperture*/);
>>         dev->unplug = false;
>>
>>         return 0;
>>     }
>>
>>     struct aperture_funcs efidrm_aperture_funcs = {
>>         .suspend = efidrm_aperture_suspend,
>>         .resume = efidrm_aperture_resume,
>>     };
>>
>> Pass this struct when efidrm acquires the framebuffer aperture, so that
>> the aperture helpers can control the behavior of efidrm. With this, a
>> multi-step takeover from sysfb to native driver can be tried.
>>
>> It's still a massive effort that requires an audit of each driver's
>> probing logic. There's no copy-paste pattern AFAICT. I suggest to pick
>> one simple driver first and make a prototype.
>>
>> Let me also say that I DO like the general idea you're proposing. But if
>> it was easy, we would likely have done it already.
>>
>> Best regards
>> Thomas
>>> z
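To make the suspend/resume hand-off quoted above a bit more concrete, here is a minimal userspace model of it. This is only a sketch under stated assumptions: struct sysfb_dev, its aperture_claimed flag, native_takeover() and the two example probes are hypothetical names invented for illustration, and struct aperture_funcs exists only as a proposal in this thread, not as a kernel interface. The flags merely stand in for drm_device.unplug and for insert_resource()/remove_resource().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sysfb DRM device; only what the sketch needs. */
struct sysfb_dev {
	bool unplug;            /* plays the role of drm_device.unplug */
	bool aperture_claimed;  /* models insert_resource()/remove_resource() */
};

/* Callback table as proposed in the thread; not an existing kernel API. */
struct aperture_funcs {
	int (*suspend)(struct sysfb_dev *dev);
	int (*resume)(struct sysfb_dev *dev);
};

static int efidrm_aperture_suspend(struct sysfb_dev *dev)
{
	dev->unplug = true;            /* stop hardware access first */
	dev->aperture_claimed = false; /* then give up the framebuffer aperture */
	return 0;
}

static int efidrm_aperture_resume(struct sysfb_dev *dev)
{
	dev->aperture_claimed = true;  /* take the aperture back */
	dev->unplug = false;           /* then allow hardware access again */
	return 0;
}

const struct aperture_funcs efidrm_aperture_funcs = {
	.suspend = efidrm_aperture_suspend,
	.resume  = efidrm_aperture_resume,
};

/*
 * Hypothetical multi-step takeover: suspend sysfb, run the native probe,
 * and resume sysfb if the probe fails. Returns the probe's result.
 */
int native_takeover(struct sysfb_dev *dev, const struct aperture_funcs *funcs,
		    int (*native_probe)(void))
{
	int ret = funcs->suspend(dev);

	if (ret)
		return ret;

	ret = native_probe();
	if (ret)
		funcs->resume(dev); /* hand the display back to sysfb */

	return ret;
}

/* Example probes for exercising the model. */
int probe_ok(void)   { return 0; }
int probe_fail(void) { return -EINVAL; }
```

The point of the model is the ordering: sysfb is parked rather than unloaded, so a failed native probe can give the display back. It says nothing about whether real hardware tolerates this, which is the open question in the rest of the thread.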
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Ville Syrjälä @ 2026-01-15 15:10 UTC (permalink / raw)
To: Christian König
Cc: Thomas Zimmermann, Zack Rusin, dri-devel, Alex Deucher, amd-gfx,
Ard Biesheuvel, Ce Sun, Chia-I Wu, Danilo Krummrich, Dave Airlie,
Deepak Rawat, Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh,
Hans de Goede, Hawking Zhang, Helge Deller, intel-gfx, intel-xe,
Jani Nikula, Javier Martinez Canillas, Jocelyn Falempe,
Joonas Lahtinen, Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv,
linux-kernel, Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <9058636d-cc18-4c8f-92cf-782fd8f771af@amd.com>
On Thu, Jan 15, 2026 at 03:39:00PM +0100, Christian König wrote:
> Sorry for being late, but I only now realized what you are doing here.
>
> On 1/15/26 12:02, Thomas Zimmermann wrote:
> > Hi,
> >
> > apologies for the delay. I wanted to reply and then forgot about it.
> >
> > Am 10.01.26 um 05:52 schrieb Zack Rusin:
> >> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >>> Hi
> >>>
> >>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
> >>>> Almost a rite of passage for every DRM developer and most Linux users
> >>>> is upgrading your DRM driver/updating boot flags/changing some config
> >>>> and having DRM driver fail at probe resulting in a blank screen.
> >>>>
> >>>> Currently there's no way to recover from DRM driver probe failure. PCI
> >>>> DRM drivers explicitly throw out the existing sysfb to get exclusive
> >>>> access to PCI resources so if the probe fails the system is left without
> >>>> a functioning display driver.
> >>>>
> >>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
> >>>> fails. This means that a DRM driver that fails to load reloads the system
> >>>> framebuffer driver.
> >>>>
> >>>> This works best with simpledrm. Without it Xorg won't recover because
> >>>> it still tries to load the vendor specific driver which ends up usually
> >>>> not working at all. With simpledrm the system recovers really nicely
> >>>> ending up with a working console and not a blank screen.
> >>>>
> >>>> There's a caveat in that some hardware might require some special magic
> >>>> register write to recover EFI display. I'd appreciate it a lot if
> >>>> maintainers could introduce a temporary failure in their drivers
> >>>> probe to validate that the sysfb recovers and they get a working console.
> >>>> The easiest way to double check it is by adding:
> >>>> /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
> >>>> dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
> >>>> ret = -EINVAL;
> >>>> goto out_error;
> >>>> or such right after the devm_aperture_remove_conflicting_pci_devices() call.
> >>> Recovering the display like that is guesswork and will at best work
> >>> with simple discrete devices where the framebuffer is always located in
> >>> a confined graphics aperture.
> >>>
> >>> But the problem you're trying to solve is a real one.
> >>>
> >>> What we'd want to do instead is to take the initial hardware state into
> >>> account when we do the initial mode-setting operation.
> >>>
> >>> The first step is to move each driver's remove_conflicting_devices call
> >>> to the latest possible location in the probe function. We usually do it
> >>> first, because that's easy. But on most hardware, it could happen much
> >>> later.
> >> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
> >> they request pci regions which is going to fail otherwise. Because
> >> grabbing the pci resources is in general the very first thing that
> >> those drivers need to do to set up anything, we
> >> remove_conflicting_devices first or at least very early.
> >
> > To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
>
> Nope that is not correct.
>
> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.
>
> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
>
> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
It's similar for Intel. For us VGA emulation won't be used for
EFI boot, but we still can't have the previous driver poking
around in memory while the real driver is initializing. The
entire memory layout may get completely shuffled so there's
no telling where such memory accesses would land.
And I suppose reBAR is a concern for us as well.
--
Ville Syrjälä
Intel
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Christian König @ 2026-01-15 15:58 UTC (permalink / raw)
To: Thomas Zimmermann, Zack Rusin
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Danilo Krummrich, Dave Airlie, Deepak Rawat,
Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh, Hans de Goede,
Hawking Zhang, Helge Deller, intel-gfx, intel-xe, Jani Nikula,
Javier Martinez Canillas, Jocelyn Falempe, Joonas Lahtinen,
Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv, linux-kernel,
Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <4ee824d5-8ea0-4ae1-8bcb-5f8cbae37fc8@suse.de>
On 1/15/26 15:54, Thomas Zimmermann wrote:
> Hi
>
> Am 15.01.26 um 15:39 schrieb Christian König:
>> Sorry for being late, but I only now realized what you are doing here.
>>
>> On 1/15/26 12:02, Thomas Zimmermann wrote:
>>> Hi,
>>>
>>> apologies for the delay. I wanted to reply and then forgot about it.
>>>
>>> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>>>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>> Hi
>>>>>
>>>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>>>
>>>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>>>> DRM drivers explicitly throw out the existing sysfb to get exclusive
>>>>>> access to PCI resources so if the probe fails the system is left without
>>>>>> a functioning display driver.
>>>>>>
>>>>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>>>> framebuffer driver.
>>>>>>
>>>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>>>> it still tries to load the vendor specific driver which ends up usually
>>>>>> not working at all. With simpledrm the system recovers really nicely
>>>>>> ending up with a working console and not a blank screen.
>>>>>>
>>>>>> There's a caveat in that some hardware might require some special magic
>>>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>>>> maintainers could introduce a temporary failure in their drivers
>>>>>> probe to validate that the sysfb recovers and they get a working console.
>>>>>> The easiest way to double check it is by adding:
>>>>>> /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>>> dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>>> ret = -EINVAL;
>>>>>> goto out_error;
>>>>>> or such right after the devm_aperture_remove_conflicting_pci_devices() call.
>>>>> Recovering the display like that is guesswork and will at best work
>>>>> with simple discrete devices where the framebuffer is always located in
>>>>> a confined graphics aperture.
>>>>>
>>>>> But the problem you're trying to solve is a real one.
>>>>>
>>>>> What we'd want to do instead is to take the initial hardware state into
>>>>> account when we do the initial mode-setting operation.
>>>>>
>>>>> The first step is to move each driver's remove_conflicting_devices call
>>>>> to the latest possible location in the probe function. We usually do it
>>>>> first, because that's easy. But on most hardware, it could happen much
>>>>> later.
>>>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>>>> they request pci regions which is going to fail otherwise. Because
>>>> grabbing the pci resources is in general the very first thing that
>>>> those drivers need to do to set up anything, we
>>>> remove_conflicting_devices first or at least very early.
>>> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
>> Nope that is not correct.
>>
>> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.
>
> Here I was only talking about avoiding calls to request_resource() and similar interfaces.
>
>>
>> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
>
> Yeah, that's what I expected.
>
>>
>> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
>
> Assuming the driver (or driver author) is careful, is it possible to only read state from AMD hardware at such an early time?
I'm not an expert on that particular area, but I strongly doubt it.
Basically the VGA emulation is firmware which "owns" the CRTC registers and might modify them at any time unless it's turned off first.
So you can't even use data/index register pairs etc.
> We usually do remove_conflicting_devices() as the first thing in most driver's probe function. As a first step, it would be helpful to postpone it to a later point.
Well, from what I know, that won't work in a lot of cases.
I mean what we could do on non-AMD HW is to remove the conflicting driver, play with the HW and, if we find that this didn't work, reset the HW using a PCI function level reset and try to load the EFI or whatever driver again. But that has a rather low chance of working reliably, I would say.
The problem with AMD GPUs is that the PCI function level reset is broken to begin with (which already caused us tons of headaches in the case of pass through).
Regards,
Christian.
>
>>
>> So I absolutely clearly have to reject the amdgpu patch in this series, that will break tons of use cases.
>
> Don't worry, we're still in the early ideation phase.
>
> Best regards
> Thomas
>
>>
>> Regards,
>> Christian.
>>
>>>> I also don't think it's possible or even desirable by some drivers to
>>>> reuse the initial state, good example here is vmwgfx where by default
>>>> some people will setup their vm's with e.g. 8mb ram, when the vmwgfx
>>>> loads we allow scanning out from system memory, so you can set your vm
>>>> up with 8mb of vram but still use 4k resolutions when the driver
>>>> loads, this way the suspend size of the vm is very predictable (tiny
>>>> vram plus whatever ram was setup) while still allowing a lot of
>>>> flexibility.
>>> If there's no initial state to switch from, the first modeset can fail while leaving the display unusable. There's no way around that. Going back to the old state is not an option unless the driver has been written to support this.
>>>
>>> The case of vmwgfx is special, but does not affect the overall problem. For vmwgfx, it would be best to import that initial state and support a transparent modeset from vram to system memory (and back) at least during this initial state.
>>>
>>>
>>>> In general I think however this is planned it's two or three separate series:
>>>> 1) infrastructure to reload the sysfb driver (what this series is)
>>>> 2) making sure that drivers that do want to recover cleanly actually
>>>> clean out all the state on exit properly,
>>>> 3) abstracting at least some of that cleanup in some driver independent way
>>> That's really not going to work. For example, in the current series, you
>>> invoke devm_aperture_remove_conflicting_pci_devices_done() after
>>> drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of these
>>> calls can modify hardware state. In the case of _register() and _setup(),
>>> the DRM clients can perform a modeset, which destroys the initial hardware
>>> state.
>>>
>>> Patch 1 of this series removes the sysfb device/driver entirely. That
>>> should be a no-go as it significantly complicates recovery. For example,
>>> if the native drivers failed from an allocation failure, the sysfb
>>> device/driver is not likely to come back either.
>>>
>>> As the very first thing, the series should state which failures it is
>>> going to resolve:
>>>
>>> - failed hardware init,
>>> - invalid initial modesetting,
>>> - runtime errors (such as ENOMEM, failed firmware loading),
>>> - others?
>>>
>>> And then specify how a recovery to sysfb could look in each supported
>>> scenario.
>>>
>>> In terms of implementation, make any transition between drivers gradual.
>>> The native driver needs to acquire the hardware resource (framebuffer and
>>> I/O apertures) without unloading the sysfb driver. Luckily there's struct
>>> drm_device.unplug, which does that. [1] Flipping this field disables
>>> hardware access for DRM drivers. All sysfb drivers support this.
>>>
>>> To get the sysfb drivers ready, I suggest dedicated helpers for each
>>> driver's aperture. The aperture helpers can use these callbacks to flip
>>> the DRM driver off and on again. For example, efidrm could do this as a
>>> minimum:
>>>
>>>   int efidrm_aperture_suspend()
>>>   {
>>>       dev->unplug = true;
>>>       remove_resource(/*framebuffer aperture*/);
>>>       return 0;
>>>   }
>>>
>>>   int efidrm_aperture_resume()
>>>   {
>>>       insert_resource(/*framebuffer aperture*/);
>>>       dev->unplug = false;
>>>       return 0;
>>>   }
>>>
>>>   struct aperture_funcs efidrm_aperture_funcs = {
>>>       .suspend = efidrm_aperture_suspend,
>>>       .resume = efidrm_aperture_resume,
>>>   };
>>>
>>> Pass this struct when efidrm acquires the framebuffer aperture, so that
>>> the aperture helpers can control the behavior of efidrm. With this, a
>>> multi-step takeover from sysfb to native driver can be tried.
>>>
>>> It's still a massive effort that requires an audit of each driver's
>>> probing logic. There's no copy-paste pattern AFAICT. I suggest to pick one
>>> simple driver first and make a prototype.
>>>
>>> Let me also say that I DO like the general idea you're proposing. But if
>>> it was easy, we would likely have done it already.
>>>
>>> Best regards
>>> Thomas
>>>> z
>
^ permalink raw reply
* Re: [PATCH v3 5/6] mshv: Add definitions for stats pages
From: Stanislav Kinsburskii @ 2026-01-15 16:19 UTC (permalink / raw)
To: Nuno Das Neves
Cc: linux-hyperv, linux-kernel, mhklinux, kys, haiyangz, wei.liu,
decui, longli, prapal, mrathor, paekkaladevi
In-Reply-To: <20260114213803.143486-6-nunodasneves@linux.microsoft.com>
On Wed, Jan 14, 2026 at 01:38:02PM -0800, Nuno Das Neves wrote:
> Add the definitions for hypervisor, logical processor, and partition
> stats pages.
>
The definitions for partition and virtual processor are outdated.
Now is a good time to sync the new values in.
Thanks,
Stanislav
> Move the definition for the VP stats page to its rightful place in
> hvhdk.h, and add the missing members.
>
> While at it, correct the ARM64 value of VpRootDispatchThreadBlocked,
> (which is not yet used, so there is no impact).
>
> These enum members retain their CamelCase style, since they are imported
> directly from the hypervisor code. They will be stringified when
> printing the stats out, and retain more readability in this form.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 17 --
> include/hyperv/hvhdk.h | 437 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 437 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index fbfc9e7d9fa4..724bbaa0b08c 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -39,23 +39,6 @@ MODULE_AUTHOR("Microsoft");
> MODULE_LICENSE("GPL");
> MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
>
> -/* TODO move this to another file when debugfs code is added */
> -enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
> -#if defined(CONFIG_X86)
> - VpRootDispatchThreadBlocked = 202,
> -#elif defined(CONFIG_ARM64)
> - VpRootDispatchThreadBlocked = 94,
> -#endif
> - VpStatsMaxCounter
> -};
> -
> -struct hv_stats_page {
> - union {
> - u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
> - u8 data[HV_HYP_PAGE_SIZE];
> - };
> -} __packed;
> -
> struct mshv_root mshv_root;
>
> enum hv_scheduler_type hv_scheduler_type;
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 469186df7826..8bddd11feeba 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -10,6 +10,443 @@
> #include "hvhdk_mini.h"
> #include "hvgdk.h"
>
> +enum hv_stats_hypervisor_counters { /* HV_HYPERVISOR_COUNTER */
> + HvLogicalProcessors = 1,
> + HvPartitions = 2,
> + HvTotalPages = 3,
> + HvVirtualProcessors = 4,
> + HvMonitoredNotifications = 5,
> + HvModernStandbyEntries = 6,
> + HvPlatformIdleTransitions = 7,
> + HvHypervisorStartupCost = 8,
> + HvIOSpacePages = 10,
> + HvNonEssentialPagesForDump = 11,
> + HvSubsumedPages = 12,
> + HvStatsMaxCounter
> +};
> +
> +enum hv_stats_partition_counters { /* HV_PROCESS_COUNTER */
> + PartitionVirtualProcessors = 1,
> + PartitionTlbSize = 3,
> + PartitionAddressSpaces = 4,
> + PartitionDepositedPages = 5,
> + PartitionGpaPages = 6,
> + PartitionGpaSpaceModifications = 7,
> + PartitionVirtualTlbFlushEntires = 8,
> + PartitionRecommendedTlbSize = 9,
> + PartitionGpaPages4K = 10,
> + PartitionGpaPages2M = 11,
> + PartitionGpaPages1G = 12,
> + PartitionGpaPages512G = 13,
> + PartitionDevicePages4K = 14,
> + PartitionDevicePages2M = 15,
> + PartitionDevicePages1G = 16,
> + PartitionDevicePages512G = 17,
> + PartitionAttachedDevices = 18,
> + PartitionDeviceInterruptMappings = 19,
> + PartitionIoTlbFlushes = 20,
> + PartitionIoTlbFlushCost = 21,
> + PartitionDeviceInterruptErrors = 22,
> + PartitionDeviceDmaErrors = 23,
> + PartitionDeviceInterruptThrottleEvents = 24,
> + PartitionSkippedTimerTicks = 25,
> + PartitionPartitionId = 26,
> +#if IS_ENABLED(CONFIG_X86_64)
> + PartitionNestedTlbSize = 27,
> + PartitionRecommendedNestedTlbSize = 28,
> + PartitionNestedTlbFreeListSize = 29,
> + PartitionNestedTlbTrimmedPages = 30,
> + PartitionPagesShattered = 31,
> + PartitionPagesRecombined = 32,
> + PartitionHwpRequestValue = 33,
> +#elif IS_ENABLED(CONFIG_ARM64)
> + PartitionHwpRequestValue = 27,
> +#endif
> + PartitionStatsMaxCounter
> +};
> +
> +enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
> + VpTotalRunTime = 1,
> + VpHypervisorRunTime = 2,
> + VpRemoteNodeRunTime = 3,
> + VpNormalizedRunTime = 4,
> + VpIdealCpu = 5,
> + VpHypercallsCount = 7,
> + VpHypercallsTime = 8,
> +#if IS_ENABLED(CONFIG_X86_64)
> + VpPageInvalidationsCount = 9,
> + VpPageInvalidationsTime = 10,
> + VpControlRegisterAccessesCount = 11,
> + VpControlRegisterAccessesTime = 12,
> + VpIoInstructionsCount = 13,
> + VpIoInstructionsTime = 14,
> + VpHltInstructionsCount = 15,
> + VpHltInstructionsTime = 16,
> + VpMwaitInstructionsCount = 17,
> + VpMwaitInstructionsTime = 18,
> + VpCpuidInstructionsCount = 19,
> + VpCpuidInstructionsTime = 20,
> + VpMsrAccessesCount = 21,
> + VpMsrAccessesTime = 22,
> + VpOtherInterceptsCount = 23,
> + VpOtherInterceptsTime = 24,
> + VpExternalInterruptsCount = 25,
> + VpExternalInterruptsTime = 26,
> + VpPendingInterruptsCount = 27,
> + VpPendingInterruptsTime = 28,
> + VpEmulatedInstructionsCount = 29,
> + VpEmulatedInstructionsTime = 30,
> + VpDebugRegisterAccessesCount = 31,
> + VpDebugRegisterAccessesTime = 32,
> + VpPageFaultInterceptsCount = 33,
> + VpPageFaultInterceptsTime = 34,
> + VpGuestPageTableMaps = 35,
> + VpLargePageTlbFills = 36,
> + VpSmallPageTlbFills = 37,
> + VpReflectedGuestPageFaults = 38,
> + VpApicMmioAccesses = 39,
> + VpIoInterceptMessages = 40,
> + VpMemoryInterceptMessages = 41,
> + VpApicEoiAccesses = 42,
> + VpOtherMessages = 43,
> + VpPageTableAllocations = 44,
> + VpLogicalProcessorMigrations = 45,
> + VpAddressSpaceEvictions = 46,
> + VpAddressSpaceSwitches = 47,
> + VpAddressDomainFlushes = 48,
> + VpAddressSpaceFlushes = 49,
> + VpGlobalGvaRangeFlushes = 50,
> + VpLocalGvaRangeFlushes = 51,
> + VpPageTableEvictions = 52,
> + VpPageTableReclamations = 53,
> + VpPageTableResets = 54,
> + VpPageTableValidations = 55,
> + VpApicTprAccesses = 56,
> + VpPageTableWriteIntercepts = 57,
> + VpSyntheticInterrupts = 58,
> + VpVirtualInterrupts = 59,
> + VpApicIpisSent = 60,
> + VpApicSelfIpisSent = 61,
> + VpGpaSpaceHypercalls = 62,
> + VpLogicalProcessorHypercalls = 63,
> + VpLongSpinWaitHypercalls = 64,
> + VpOtherHypercalls = 65,
> + VpSyntheticInterruptHypercalls = 66,
> + VpVirtualInterruptHypercalls = 67,
> + VpVirtualMmuHypercalls = 68,
> + VpVirtualProcessorHypercalls = 69,
> + VpHardwareInterrupts = 70,
> + VpNestedPageFaultInterceptsCount = 71,
> + VpNestedPageFaultInterceptsTime = 72,
> + VpPageScans = 73,
> + VpLogicalProcessorDispatches = 74,
> + VpWaitingForCpuTime = 75,
> + VpExtendedHypercalls = 76,
> + VpExtendedHypercallInterceptMessages = 77,
> + VpMbecNestedPageTableSwitches = 78,
> + VpOtherReflectedGuestExceptions = 79,
> + VpGlobalIoTlbFlushes = 80,
> + VpGlobalIoTlbFlushCost = 81,
> + VpLocalIoTlbFlushes = 82,
> + VpLocalIoTlbFlushCost = 83,
> + VpHypercallsForwardedCount = 84,
> + VpHypercallsForwardingTime = 85,
> + VpPageInvalidationsForwardedCount = 86,
> + VpPageInvalidationsForwardingTime = 87,
> + VpControlRegisterAccessesForwardedCount = 88,
> + VpControlRegisterAccessesForwardingTime = 89,
> + VpIoInstructionsForwardedCount = 90,
> + VpIoInstructionsForwardingTime = 91,
> + VpHltInstructionsForwardedCount = 92,
> + VpHltInstructionsForwardingTime = 93,
> + VpMwaitInstructionsForwardedCount = 94,
> + VpMwaitInstructionsForwardingTime = 95,
> + VpCpuidInstructionsForwardedCount = 96,
> + VpCpuidInstructionsForwardingTime = 97,
> + VpMsrAccessesForwardedCount = 98,
> + VpMsrAccessesForwardingTime = 99,
> + VpOtherInterceptsForwardedCount = 100,
> + VpOtherInterceptsForwardingTime = 101,
> + VpExternalInterruptsForwardedCount = 102,
> + VpExternalInterruptsForwardingTime = 103,
> + VpPendingInterruptsForwardedCount = 104,
> + VpPendingInterruptsForwardingTime = 105,
> + VpEmulatedInstructionsForwardedCount = 106,
> + VpEmulatedInstructionsForwardingTime = 107,
> + VpDebugRegisterAccessesForwardedCount = 108,
> + VpDebugRegisterAccessesForwardingTime = 109,
> + VpPageFaultInterceptsForwardedCount = 110,
> + VpPageFaultInterceptsForwardingTime = 111,
> + VpVmclearEmulationCount = 112,
> + VpVmclearEmulationTime = 113,
> + VpVmptrldEmulationCount = 114,
> + VpVmptrldEmulationTime = 115,
> + VpVmptrstEmulationCount = 116,
> + VpVmptrstEmulationTime = 117,
> + VpVmreadEmulationCount = 118,
> + VpVmreadEmulationTime = 119,
> + VpVmwriteEmulationCount = 120,
> + VpVmwriteEmulationTime = 121,
> + VpVmxoffEmulationCount = 122,
> + VpVmxoffEmulationTime = 123,
> + VpVmxonEmulationCount = 124,
> + VpVmxonEmulationTime = 125,
> + VpNestedVMEntriesCount = 126,
> + VpNestedVMEntriesTime = 127,
> + VpNestedSLATSoftPageFaultsCount = 128,
> + VpNestedSLATSoftPageFaultsTime = 129,
> + VpNestedSLATHardPageFaultsCount = 130,
> + VpNestedSLATHardPageFaultsTime = 131,
> + VpInvEptAllContextEmulationCount = 132,
> + VpInvEptAllContextEmulationTime = 133,
> + VpInvEptSingleContextEmulationCount = 134,
> + VpInvEptSingleContextEmulationTime = 135,
> + VpInvVpidAllContextEmulationCount = 136,
> + VpInvVpidAllContextEmulationTime = 137,
> + VpInvVpidSingleContextEmulationCount = 138,
> + VpInvVpidSingleContextEmulationTime = 139,
> + VpInvVpidSingleAddressEmulationCount = 140,
> + VpInvVpidSingleAddressEmulationTime = 141,
> + VpNestedTlbPageTableReclamations = 142,
> + VpNestedTlbPageTableEvictions = 143,
> + VpFlushGuestPhysicalAddressSpaceHypercalls = 144,
> + VpFlushGuestPhysicalAddressListHypercalls = 145,
> + VpPostedInterruptNotifications = 146,
> + VpPostedInterruptScans = 147,
> + VpTotalCoreRunTime = 148,
> + VpMaximumRunTime = 149,
> + VpHwpRequestContextSwitches = 150,
> + VpWaitingForCpuTimeBucket0 = 151,
> + VpWaitingForCpuTimeBucket1 = 152,
> + VpWaitingForCpuTimeBucket2 = 153,
> + VpWaitingForCpuTimeBucket3 = 154,
> + VpWaitingForCpuTimeBucket4 = 155,
> + VpWaitingForCpuTimeBucket5 = 156,
> + VpWaitingForCpuTimeBucket6 = 157,
> + VpVmloadEmulationCount = 158,
> + VpVmloadEmulationTime = 159,
> + VpVmsaveEmulationCount = 160,
> + VpVmsaveEmulationTime = 161,
> + VpGifInstructionEmulationCount = 162,
> + VpGifInstructionEmulationTime = 163,
> + VpEmulatedErrataSvmInstructions = 164,
> + VpPlaceholder1 = 165,
> + VpPlaceholder2 = 166,
> + VpPlaceholder3 = 167,
> + VpPlaceholder4 = 168,
> + VpPlaceholder5 = 169,
> + VpPlaceholder6 = 170,
> + VpPlaceholder7 = 171,
> + VpPlaceholder8 = 172,
> + VpPlaceholder9 = 173,
> + VpPlaceholder10 = 174,
> + VpSchedulingPriority = 175,
> + VpRdpmcInstructionsCount = 176,
> + VpRdpmcInstructionsTime = 177,
> + VpPerfmonPmuMsrAccessesCount = 178,
> + VpPerfmonLbrMsrAccessesCount = 179,
> + VpPerfmonIptMsrAccessesCount = 180,
> + VpPerfmonInterruptCount = 181,
> + VpVtl1DispatchCount = 182,
> + VpVtl2DispatchCount = 183,
> + VpVtl2DispatchBucket0 = 184,
> + VpVtl2DispatchBucket1 = 185,
> + VpVtl2DispatchBucket2 = 186,
> + VpVtl2DispatchBucket3 = 187,
> + VpVtl2DispatchBucket4 = 188,
> + VpVtl2DispatchBucket5 = 189,
> + VpVtl2DispatchBucket6 = 190,
> + VpVtl1RunTime = 191,
> + VpVtl2RunTime = 192,
> + VpIommuHypercalls = 193,
> + VpCpuGroupHypercalls = 194,
> + VpVsmHypercalls = 195,
> + VpEventLogHypercalls = 196,
> + VpDeviceDomainHypercalls = 197,
> + VpDepositHypercalls = 198,
> + VpSvmHypercalls = 199,
> + VpBusLockAcquisitionCount = 200,
> + VpLoadAvg = 201,
> + VpRootDispatchThreadBlocked = 202,
> +#elif IS_ENABLED(CONFIG_ARM64)
> + VpSysRegAccessesCount = 9,
> + VpSysRegAccessesTime = 10,
> + VpSmcInstructionsCount = 11,
> + VpSmcInstructionsTime = 12,
> + VpOtherInterceptsCount = 13,
> + VpOtherInterceptsTime = 14,
> + VpExternalInterruptsCount = 15,
> + VpExternalInterruptsTime = 16,
> + VpPendingInterruptsCount = 17,
> + VpPendingInterruptsTime = 18,
> + VpGuestPageTableMaps = 19,
> + VpLargePageTlbFills = 20,
> + VpSmallPageTlbFills = 21,
> + VpReflectedGuestPageFaults = 22,
> + VpMemoryInterceptMessages = 23,
> + VpOtherMessages = 24,
> + VpLogicalProcessorMigrations = 25,
> + VpAddressDomainFlushes = 26,
> + VpAddressSpaceFlushes = 27,
> + VpSyntheticInterrupts = 28,
> + VpVirtualInterrupts = 29,
> + VpApicSelfIpisSent = 30,
> + VpGpaSpaceHypercalls = 31,
> + VpLogicalProcessorHypercalls = 32,
> + VpLongSpinWaitHypercalls = 33,
> + VpOtherHypercalls = 34,
> + VpSyntheticInterruptHypercalls = 35,
> + VpVirtualInterruptHypercalls = 36,
> + VpVirtualMmuHypercalls = 37,
> + VpVirtualProcessorHypercalls = 38,
> + VpHardwareInterrupts = 39,
> + VpNestedPageFaultInterceptsCount = 40,
> + VpNestedPageFaultInterceptsTime = 41,
> + VpLogicalProcessorDispatches = 42,
> + VpWaitingForCpuTime = 43,
> + VpExtendedHypercalls = 44,
> + VpExtendedHypercallInterceptMessages = 45,
> + VpMbecNestedPageTableSwitches = 46,
> + VpOtherReflectedGuestExceptions = 47,
> + VpGlobalIoTlbFlushes = 48,
> + VpGlobalIoTlbFlushCost = 49,
> + VpLocalIoTlbFlushes = 50,
> + VpLocalIoTlbFlushCost = 51,
> + VpFlushGuestPhysicalAddressSpaceHypercalls = 52,
> + VpFlushGuestPhysicalAddressListHypercalls = 53,
> + VpPostedInterruptNotifications = 54,
> + VpPostedInterruptScans = 55,
> + VpTotalCoreRunTime = 56,
> + VpMaximumRunTime = 57,
> + VpWaitingForCpuTimeBucket0 = 58,
> + VpWaitingForCpuTimeBucket1 = 59,
> + VpWaitingForCpuTimeBucket2 = 60,
> + VpWaitingForCpuTimeBucket3 = 61,
> + VpWaitingForCpuTimeBucket4 = 62,
> + VpWaitingForCpuTimeBucket5 = 63,
> + VpWaitingForCpuTimeBucket6 = 64,
> + VpHwpRequestContextSwitches = 65,
> + VpPlaceholder2 = 66,
> + VpPlaceholder3 = 67,
> + VpPlaceholder4 = 68,
> + VpPlaceholder5 = 69,
> + VpPlaceholder6 = 70,
> + VpPlaceholder7 = 71,
> + VpPlaceholder8 = 72,
> + VpContentionTime = 73,
> + VpWakeUpTime = 74,
> + VpSchedulingPriority = 75,
> + VpVtl1DispatchCount = 76,
> + VpVtl2DispatchCount = 77,
> + VpVtl2DispatchBucket0 = 78,
> + VpVtl2DispatchBucket1 = 79,
> + VpVtl2DispatchBucket2 = 80,
> + VpVtl2DispatchBucket3 = 81,
> + VpVtl2DispatchBucket4 = 82,
> + VpVtl2DispatchBucket5 = 83,
> + VpVtl2DispatchBucket6 = 84,
> + VpVtl1RunTime = 85,
> + VpVtl2RunTime = 86,
> + VpIommuHypercalls = 87,
> + VpCpuGroupHypercalls = 88,
> + VpVsmHypercalls = 89,
> + VpEventLogHypercalls = 90,
> + VpDeviceDomainHypercalls = 91,
> + VpDepositHypercalls = 92,
> + VpSvmHypercalls = 93,
> + VpLoadAvg = 94,
> + VpRootDispatchThreadBlocked = 95,
> +#endif
> + VpStatsMaxCounter
> +};
> +
> +enum hv_stats_lp_counters { /* HV_CPU_COUNTER */
> + LpGlobalTime = 1,
> + LpTotalRunTime = 2,
> + LpHypervisorRunTime = 3,
> + LpHardwareInterrupts = 4,
> + LpContextSwitches = 5,
> + LpInterProcessorInterrupts = 6,
> + LpSchedulerInterrupts = 7,
> + LpTimerInterrupts = 8,
> + LpInterProcessorInterruptsSent = 9,
> + LpProcessorHalts = 10,
> + LpMonitorTransitionCost = 11,
> + LpContextSwitchTime = 12,
> + LpC1TransitionsCount = 13,
> + LpC1RunTime = 14,
> + LpC2TransitionsCount = 15,
> + LpC2RunTime = 16,
> + LpC3TransitionsCount = 17,
> + LpC3RunTime = 18,
> + LpRootVpIndex = 19,
> + LpIdleSequenceNumber = 20,
> + LpGlobalTscCount = 21,
> + LpActiveTscCount = 22,
> + LpIdleAccumulation = 23,
> + LpReferenceCycleCount0 = 24,
> + LpActualCycleCount0 = 25,
> + LpReferenceCycleCount1 = 26,
> + LpActualCycleCount1 = 27,
> + LpProximityDomainId = 28,
> + LpPostedInterruptNotifications = 29,
> + LpBranchPredictorFlushes = 30,
> +#if IS_ENABLED(CONFIG_X86_64)
> + LpL1DataCacheFlushes = 31,
> + LpImmediateL1DataCacheFlushes = 32,
> + LpMbFlushes = 33,
> + LpCounterRefreshSequenceNumber = 34,
> + LpCounterRefreshReferenceTime = 35,
> + LpIdleAccumulationSnapshot = 36,
> + LpActiveTscCountSnapshot = 37,
> + LpHwpRequestContextSwitches = 38,
> + LpPlaceholder1 = 39,
> + LpPlaceholder2 = 40,
> + LpPlaceholder3 = 41,
> + LpPlaceholder4 = 42,
> + LpPlaceholder5 = 43,
> + LpPlaceholder6 = 44,
> + LpPlaceholder7 = 45,
> + LpPlaceholder8 = 46,
> + LpPlaceholder9 = 47,
> + LpPlaceholder10 = 48,
> + LpReserveGroupId = 49,
> + LpRunningPriority = 50,
> + LpPerfmonInterruptCount = 51,
> +#elif IS_ENABLED(CONFIG_ARM64)
> + LpCounterRefreshSequenceNumber = 31,
> + LpCounterRefreshReferenceTime = 32,
> + LpIdleAccumulationSnapshot = 33,
> + LpActiveTscCountSnapshot = 34,
> + LpHwpRequestContextSwitches = 35,
> + LpPlaceholder2 = 36,
> + LpPlaceholder3 = 37,
> + LpPlaceholder4 = 38,
> + LpPlaceholder5 = 39,
> + LpPlaceholder6 = 40,
> + LpPlaceholder7 = 41,
> + LpPlaceholder8 = 42,
> + LpPlaceholder9 = 43,
> + LpSchLocalRunListSize = 44,
> + LpReserveGroupId = 45,
> + LpRunningPriority = 46,
> +#endif
> + LpStatsMaxCounter
> +};
> +
> +/*
> + * Hypervisor statistics page format
> + */
> +struct hv_stats_page {
> + union {
> + u64 hv_cntrs[HvStatsMaxCounter]; /* Hypervisor counters */
> + u64 pt_cntrs[PartitionStatsMaxCounter]; /* Partition counters */
> + u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
> + u64 lp_cntrs[LpStatsMaxCounter]; /* LP counters */
> + u8 data[HV_HYP_PAGE_SIZE];
> + };
> +} __packed;
> +
> /* Bits for dirty mask of hv_vp_register_page */
> #define HV_X64_REGISTER_CLASS_GENERAL 0
> #define HV_X64_REGISTER_CLASS_IP 1
> --
> 2.34.1
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Gerd Hoffmann @ 2026-01-15 16:36 UTC (permalink / raw)
To: Ville Syrjälä
Cc: Christian König, Thomas Zimmermann, Zack Rusin, dri-devel,
Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun, Chia-I Wu,
Danilo Krummrich, Dave Airlie, Deepak Rawat, Dmitry Osipenko,
Gurchetan Singh, Hans de Goede, Hawking Zhang, Helge Deller,
intel-gfx, intel-xe, Jani Nikula, Javier Martinez Canillas,
Jocelyn Falempe, Joonas Lahtinen, Lijo Lazar, linux-efi,
linux-fbdev, linux-hyperv, linux-kernel, Lucas De Marchi,
Lyude Paul, Maarten Lankhorst, Mario Limonciello (AMD),
Mario Limonciello, Maxime Ripard, nouveau, Rodrigo Vivi,
Simona Vetter, spice-devel, Thomas Hellström,
Timur Kristóf, Tvrtko Ursulin, virtualization,
Vitaly Prosyak
In-Reply-To: <aWkDYO1o9T1BhvXj@intel.com>
Hi,
> > At least for AMD GPUs remove_conflicting_devices() really early is
> > necessary because otherwise some operations just result in a
> > spontaneous system reboot.
> It's similar for Intel. For us VGA emulation won't be used for EFI
> boot, but we still can't have the previous driver poking around in
> memory while the real driver is initializing. The entire memory layout
> may get completely shuffled so there's no telling where such memory
> accesses would land.
Can you do stuff like checking which firmware is needed and whether
that can be loaded from the filesystem before calling
remove_conflicting_devices()?
take care,
Gerd
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Mario Limonciello @ 2026-01-15 16:39 UTC (permalink / raw)
To: Gerd Hoffmann, Ville Syrjälä
Cc: Christian König, Thomas Zimmermann, Zack Rusin, dri-devel,
Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun, Chia-I Wu,
Danilo Krummrich, Dave Airlie, Deepak Rawat, Dmitry Osipenko,
Gurchetan Singh, Hans de Goede, Hawking Zhang, Helge Deller,
intel-gfx, intel-xe, Jani Nikula, Javier Martinez Canillas,
Jocelyn Falempe, Joonas Lahtinen, Lijo Lazar, linux-efi,
linux-fbdev, linux-hyperv, linux-kernel, Lucas De Marchi,
Lyude Paul, Maarten Lankhorst, Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <aWkWSnJ7Xn6ukW-b@sirius.home.kraxel.org>
On 1/15/26 10:36 AM, Gerd Hoffmann wrote:
> Hi,
>
>>> At least for AMD GPUs remove_conflicting_devices() really early is
>>> necessary because otherwise some operations just result in a
>>> spontaneous system reboot.
>
>> It's similar for Intel. For us VGA emulation won't be used for EFI
>> boot, but we still can't have the previous driver poking around in
>> memory while the real driver is initializing. The entire memory layout
>> may get completely shuffled so there's no telling where such memory
>> accesses would land.
>
> Can you do stuff like checking which firmware is needed and whether
> that can be loaded from the filesystem before calling
> remove_conflicting_devices()?
>
That's something that I did in amdgpu a few years back.
I pushed the identification and ability to load firmware into early init
stages. It means that if you have a brand new GPU and run a modern
kernel with an older linux-firmware snapshot amdgpu will fail probe and
your framebuffer from EFI keeps working.
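The ordering described above can be sketched roughly as follows. This is a hypothetical illustration only, not the actual amdgpu code: every demo_* name and the firmware-lookup stub are invented for the example. The point is simply that the firmware check runs before the firmware framebuffer is torn down.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state; none of these names exist in amdgpu. */
struct demo_gpu {
	bool firmware_on_disk;	/* stand-in for a firmware lookup result */
	bool sysfb_active;	/* firmware framebuffer (e.g. from EFI) still bound */
	bool hw_initialized;
};

/* Stub: pretend to check that the required firmware can be loaded. */
static int demo_check_firmware(const struct demo_gpu *gpu)
{
	return gpu->firmware_on_disk ? 0 : -ENOENT;
}

/* Stub: stands in for removing the conflicting firmware framebuffer. */
static void demo_remove_conflicting(struct demo_gpu *gpu)
{
	gpu->sysfb_active = false;
}

static int demo_probe(struct demo_gpu *gpu)
{
	int ret;

	/* Fail early, while the firmware framebuffer is still intact. */
	ret = demo_check_firmware(gpu);
	if (ret)
		return ret;

	/* Point of no return: only reached once the firmware check passed. */
	demo_remove_conflicting(gpu);
	gpu->hw_initialized = true;
	return 0;
}
```

With this ordering, a probe that fails on missing firmware returns before demo_remove_conflicting() runs, so the existing framebuffer is left untouched.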
^ permalink raw reply
* Re: [PATCH v2 0/8] KVM: SVM: Fix exit_code bugs
From: Sean Christopherson @ 2026-01-15 18:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: kvm, linux-hyperv, linux-kernel, Jim Mattson, Yosry Ahmed
In-Reply-To: <20251230211347.4099600-1-seanjc@google.com>
On Tue, 30 Dec 2025 13:13:39 -0800, Sean Christopherson wrote:
> Fix (mostly benign) bugs in SVM where KVM treats exit codes as 32-bit values
> instead of 64-bit values.
>
> The most dangerous aspect of the mess is that simply fixing KVM would likely
> break KVM-on-KVM setups if only L1 is patched. To try and avoid such
> breakage while also fixing KVM, I opted to have KVM retain its checks on
> only bits 31:0 if KVM is running as a VM (as detected by
> X86_FEATURE_HYPERVISOR).
>
> [...]
Applied to kvm-x86 svm, thanks!
[1/8] KVM: SVM: Add a helper to detect VMRUN failures
https://github.com/kvm-x86/linux/commit/217463aa329e
[2/8] KVM: SVM: Open code handling of unexpected exits in svm_invoke_exit_handler()
https://github.com/kvm-x86/linux/commit/2450c9774510
[3/8] KVM: SVM: Check for an unexpected VM-Exit after RETPOLINE "fast" handling
https://github.com/kvm-x86/linux/commit/194c17bf5eba
[4/8] KVM: SVM: Filter out 64-bit exit codes when invoking exit handlers on bare metal
https://github.com/kvm-x86/linux/commit/405fce694bd1
[5/8] KVM: SVM: Treat exit_code as an unsigned 64-bit value through all of KVM
https://github.com/kvm-x86/linux/commit/d7507a94a072
[6/8] KVM: SVM: Limit incorrect check on SVM_EXIT_ERR to running as a VM
https://github.com/kvm-x86/linux/commit/a08ca6691fd3
[7/8] KVM: SVM: Harden exit_code against being used in Spectre-like attacks
https://github.com/kvm-x86/linux/commit/1e3dddafecee
[8/8] KVM: SVM: Assert that Hyper-V's HV_SVM_EXITCODE_ENL == SVM_EXIT_SW
https://github.com/kvm-x86/linux/commit/d6c20d19f7d3
--
https://github.com/kvm-x86/linux/tree/next
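
To illustrate the class of bug the series addresses, here is a minimal userspace sketch, not KVM code. It assumes, purely for the demo, that the error exit code is all ones across 64 bits; DEMO_EXIT_ERR and both helpers are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for the demo: the error exit code is all ones in 64 bits. */
#define DEMO_EXIT_ERR UINT64_MAX

/* Buggy variant: compares only bits 31:0 of the exit code. */
static bool demo_is_err_32(uint64_t exit_code)
{
	return (uint32_t)exit_code == (uint32_t)DEMO_EXIT_ERR;
}

/* Fixed variant: compares the full 64-bit value. */
static bool demo_is_err_64(uint64_t exit_code)
{
	return exit_code == DEMO_EXIT_ERR;
}
```

A value such as 0x00000000ffffffff is reported as an error by the 32-bit check but not by the 64-bit one, which is the kind of mismatch that matters when only one level of a nested setup is patched.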
^ permalink raw reply
* Re: [PATCH v1] x86/hyperv: Reserve 3 interrupt vectors used exclusively by mshv
From: Mukesh R @ 2026-01-15 18:16 UTC (permalink / raw)
To: Wei Liu
Cc: linux-hyperv, linux-kernel, kys, haiyangz, decui, longli, tglx,
mingo, bp, dave.hansen, x86, hpa
In-Reply-To: <20260115072509.GF3557088@liuwe-devbox-debian-v2.local>
On 1/14/26 23:25, Wei Liu wrote:
> On Fri, Jan 02, 2026 at 02:02:08PM -0800, Mukesh Rathor wrote:
>> MSVC compiler, used to compile the Microsoft Hyper-V hypervisor currently,
>> has an assert intrinsic that uses interrupt vector 0x29 to create an
>> exception. This will cause hypervisor to then crash and collect core. As
>> such, if this interrupt number is assigned to a device by linux and the
>> device generates it, hypervisor will crash. There are two other such
>> vectors hard coded in the hypervisor, 0x2C and 0x2D for debug purposes.
>> Fortunately, the three vectors are part of the kernel driver space and
>> that makes it feasible to reserve them early so they are not assigned
>> later.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>
>> v1: Add ifndef CONFIG_X86_FRED (thanks hpa)
>>
>> arch/x86/kernel/cpu/mshyperv.c | 26 ++++++++++++++++++++++++++
>> 1 file changed, 26 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 579fb2c64cfd..8ef4ca6733ac 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -478,6 +478,27 @@ int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
>> }
>> EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
>>
>> +#ifndef CONFIG_X86_FRED
>
> I briefly looked up FRED and checked the code. I understand that once it
> is enabled, Linux kernel doesn't set up the IDT anymore (code in
> arch/x86/kernel/traps.c).
>
> My question is, do we need to do anything when FRED is enabled?
Yeah, at first glance my thought was that it probably has greater
implications (in terms of double checking exceptions), so when time
allows I'll do a deeper investigation and perhaps run it by the
hypervisor team to see if there is any other work we need to do.
Thanks,
-Mukesh
> Wei
>
>> +/*
>> + * Reserve vectors hard coded in the hypervisor. If used outside, the hypervisor
>> + * will crash or hang or break into debugger.
>> + */
>> +static void hv_reserve_irq_vectors(void)
>> +{
>> + #define HYPERV_DBG_FASTFAIL_VECTOR 0x29
>> + #define HYPERV_DBG_ASSERT_VECTOR 0x2C
>> + #define HYPERV_DBG_SERVICE_VECTOR 0x2D
>> +
>> + if (test_and_set_bit(HYPERV_DBG_ASSERT_VECTOR, system_vectors) ||
>> + test_and_set_bit(HYPERV_DBG_SERVICE_VECTOR, system_vectors) ||
>> + test_and_set_bit(HYPERV_DBG_FASTFAIL_VECTOR, system_vectors))
>> + BUG();
>> +
>> + pr_info("Hyper-V:reserve vectors: %d %d %d\n", HYPERV_DBG_ASSERT_VECTOR,
>> + HYPERV_DBG_SERVICE_VECTOR, HYPERV_DBG_FASTFAIL_VECTOR);
>> +}
>> +#endif /* CONFIG_X86_FRED */
>> +
>> static void __init ms_hyperv_init_platform(void)
>> {
>> int hv_max_functions_eax, eax;
>> @@ -510,6 +531,11 @@ static void __init ms_hyperv_init_platform(void)
>>
>> hv_identify_partition_type();
>>
>> +#ifndef CONFIG_X86_FRED
>> + if (hv_root_partition())
>> + hv_reserve_irq_vectors();
>> +#endif /* CONFIG_X86_FRED */
>> +
>> if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
>> ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
>>
>> --
>> 2.51.2.vfs.0.1
>>
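
The reservation pattern in the quoted patch can be tried out in userspace with a sketch like the one below. It is an assumption-laden stand-in: demo_vectors replaces system_vectors, demo_test_and_set() is a non-atomic imitation of test_and_set_bit(), and a clash is reported through the return value instead of BUG().

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_VECTORS		256
#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the kernel's system_vectors bitmap. */
static unsigned long demo_vectors[(DEMO_NR_VECTORS + DEMO_BITS_PER_LONG - 1) /
				  DEMO_BITS_PER_LONG];

/* Non-atomic imitation of test_and_set_bit(): returns the old bit value. */
static bool demo_test_and_set(unsigned int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % DEMO_BITS_PER_LONG);
	unsigned long *word = &map[nr / DEMO_BITS_PER_LONG];
	bool old = (*word & mask) != 0;

	*word |= mask;
	return old;
}

/* Returns true if all three vectors were free and are now reserved. */
static bool demo_reserve_vectors(void)
{
	static const unsigned int vecs[] = { 0x29, 0x2C, 0x2D };
	bool clash = false;
	size_t i;

	for (i = 0; i < sizeof(vecs) / sizeof(vecs[0]); i++)
		clash |= demo_test_and_set(vecs[i], demo_vectors);

	return !clash;
}
```

The first call succeeds because the bitmap starts out empty; a second call reports a clash, which is the situation the patch treats as fatal.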
^ permalink raw reply
* Re: [PATCH 1/2] hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
From: Wei Liu @ 2026-01-15 18:22 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, Greg Kroah-Hartman, linux-hyperv,
linux-kernel, llvm, kernel test robot
In-Reply-To: <20260115-kbuild-alignment-vbox-v1-1-076aed1623ff@linutronix.de>
On Thu, Jan 15, 2026 at 08:35:44AM +0100, Thomas Weißschuh wrote:
> The unpacked union within a packed struct generates alignment warnings
> on clang for 32-bit ARM:
>
> ./usr/include/linux/hyperv.h:361:2: error: field within 'struct hv_kvp_exchg_msg_value'
> is less aligned than 'union hv_kvp_exchg_msg_value::(anonymous at ./usr/include/linux/hyperv.h:361:2)'
> and is usually due to 'struct hv_kvp_exchg_msg_value' being packed,
> which can lead to unaligned accesses [-Werror,-Wunaligned-access]
> 361 | union {
> | ^
>
> With the recent changes to compile-test the UAPI headers in more cases,
> this warning in combination with CONFIG_WERROR breaks the build.
>
> Fix the warning.
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202512140314.DzDxpIVn-lkp@intel.com/
> Reported-by: Nathan Chancellor <nathan@kernel.org>
> Closes: https://lore.kernel.org/linux-kbuild/20260110-uapi-test-disable-headers-arm-clang-unaligned-access-v1-1-b7b0fa541daa@kernel.org/
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Link: https://lore.kernel.org/linux-kbuild/29b2e736-d462-45b7-a0a9-85f8d8a3de56@app.fastmail.com/
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Wei Liu (Microsoft) <wei.liu@kernel.org>
> ---
> include/uapi/linux/hyperv.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
> index aaa502a7bff4..1749b35ab2c2 100644
> --- a/include/uapi/linux/hyperv.h
> +++ b/include/uapi/linux/hyperv.h
> @@ -362,7 +362,7 @@ struct hv_kvp_exchg_msg_value {
> __u8 value[HV_KVP_EXCHANGE_MAX_VALUE_SIZE];
> __u32 value_u32;
> __u64 value_u64;
> - };
> + } __attribute__((packed));
> } __attribute__((packed));
>
> struct hv_kvp_msg_enumerate {
>
> --
> 2.52.0
>
>
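
The effect of packing the inner union can be checked with a small standalone sketch. The layout below is hypothetical (it is not the real struct hv_kvp_exchg_msg_value; the tag field and DEMO_MAX_VALUE_SIZE exist only to force a misaligned offset), and it assumes a GCC- or clang-compatible compiler for __attribute__((packed)).

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_VALUE_SIZE 16	/* hypothetical size, for the demo only */

/* Naturally aligned union: alignment follows its widest member. */
union demo_value_natural {
	uint8_t  value[DEMO_MAX_VALUE_SIZE];
	uint32_t value_u32;
	uint64_t value_u64;
};

/* Same members, but packed: alignment requirement drops to 1. */
union demo_value_packed {
	uint8_t  value[DEMO_MAX_VALUE_SIZE];
	uint32_t value_u32;
	uint64_t value_u64;
} __attribute__((packed));

/* Packed outer struct; the union lands at an odd offset (5). */
struct demo_msg {
	uint32_t value_type;
	uint8_t  tag;
	union demo_value_packed v;
} __attribute__((packed));

static size_t demo_natural_align(void)
{
	return _Alignof(union demo_value_natural);
}

static size_t demo_packed_align(void)
{
	return _Alignof(union demo_value_packed);
}

static size_t demo_value_offset(void)
{
	return offsetof(struct demo_msg, v);
}
```

Because the packed union only requires byte alignment, placing it at offset 5 inside the packed struct no longer contradicts its declared alignment, which is what the warning complains about for the unpacked variant.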
^ permalink raw reply
* Re: [PATCH v1] mshv: make certain field names descriptive in a header struct
From: Anirudh Rayabharam @ 2026-01-15 18:51 UTC (permalink / raw)
To: Mukesh Rathor; +Cc: linux-hyperv, wei.liu, nunodasneves
In-Reply-To: <20260112194943.1701785-1-mrathor@linux.microsoft.com>
On Mon, Jan 12, 2026 at 11:49:43AM -0800, Mukesh Rathor wrote:
> When header struct fields use very common names like "pages" or "type",
> it makes it difficult to find uses of these fields with tools like grep
> and cscope. Add the prefix mreg_ to some fields in struct
> mshv_mem_region to make it easier to find them.
>
> There is no functional change.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 44 ++++++++++++++++++-------------------
> drivers/hv/mshv_root.h | 6 ++---
> drivers/hv/mshv_root_main.c | 10 ++++-----
> 3 files changed, 30 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 202b9d551e39..af81405f859b 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> struct page *page;
> int ret;
>
> - page = region->pages[page_offset];
> + page = region->mreg_pages[page_offset];
> if (!page)
> return -EINVAL;
>
> @@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>
> /* Start at stride since the first page is validated */
> for (count = stride; count < page_count; count += stride) {
> - page = region->pages[page_offset + count];
> + page = region->mreg_pages[page_offset + count];
>
> /* Break if current page is not present */
> if (!page)
> @@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
>
> while (page_count) {
> /* Skip non-present pages */
> - if (!region->pages[page_offset]) {
> + if (!region->mreg_pages[page_offset]) {
> page_offset++;
> page_count--;
> continue;
> @@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> - region->pages + page_offset,
> + region->mreg_pages + page_offset,
> page_count,
> HV_MAP_GPA_READABLE |
> HV_MAP_GPA_WRITABLE,
> @@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> - region->pages + page_offset,
> + region->mreg_pages + page_offset,
> page_count, 0,
> flags, false);
> }
> @@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_MAP_GPA_LARGE_PAGE;
> @@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> return hv_call_map_gpa_pages(region->partition->pt_id,
> region->start_gfn + page_offset,
> page_count, flags,
> - region->pages + page_offset);
> + region->mreg_pages + page_offset);
> }
>
> static int mshv_region_remap_pages(struct mshv_mem_region *region,
> @@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
> static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> u64 page_offset, u64 page_count)
> {
> - if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
> - unpin_user_pages(region->pages + page_offset, page_count);
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> + unpin_user_pages(region->mreg_pages + page_offset, page_count);
>
> - memset(region->pages + page_offset, 0,
> + memset(region->mreg_pages + page_offset, 0,
> page_count * sizeof(struct page *));
> }
>
> @@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
> int ret;
>
> for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> - pages = region->pages + done_count;
> + pages = region->mreg_pages + done_count;
> userspace_addr = region->start_uaddr +
> done_count * HV_HYP_PAGE_SIZE;
> nr_pages = min(region->nr_pages - done_count,
> @@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_UNMAP_GPA_LARGE_PAGE;
> @@ -321,7 +321,7 @@ static void mshv_region_destroy(struct kref *ref)
> struct mshv_partition *partition = region->partition;
> int ret;
>
> - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> mshv_region_movable_fini(region);
>
> if (mshv_partition_encrypted(partition)) {
> @@ -374,9 +374,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> int ret;
>
> range->notifier_seq = mmu_interval_read_begin(range->notifier);
> - mmap_read_lock(region->mni.mm);
> + mmap_read_lock(region->mreg_mni.mm);
> ret = hmm_range_fault(range);
> - mmap_read_unlock(region->mni.mm);
> + mmap_read_unlock(region->mreg_mni.mm);
> if (ret)
> return ret;
>
> @@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> u64 page_offset, u64 page_count)
> {
> struct hmm_range range = {
> - .notifier = &region->mni,
> + .notifier = &region->mreg_mni,
> .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> };
> unsigned long *pfns;
> @@ -430,7 +430,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> goto out;
>
> for (i = 0; i < page_count; i++)
> - region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> + region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
>
> ret = mshv_region_remap_pages(region, region->hv_map_flags,
> page_offset, page_count);
> @@ -489,7 +489,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> {
> struct mshv_mem_region *region = container_of(mni,
> struct mshv_mem_region,
> - mni);
> + mreg_mni);
> u64 page_offset, page_count;
> unsigned long mstart, mend;
> int ret = -EPERM;
> @@ -535,14 +535,14 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
>
> void mshv_region_movable_fini(struct mshv_mem_region *region)
> {
> - mmu_interval_notifier_remove(&region->mni);
> + mmu_interval_notifier_remove(&region->mreg_mni);
> }
>
> bool mshv_region_movable_init(struct mshv_mem_region *region)
> {
> int ret;
>
> - ret = mmu_interval_notifier_insert(&region->mni, current->mm,
> + ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
> region->start_uaddr,
> region->nr_pages << HV_HYP_PAGE_SHIFT,
> &mshv_region_mni_ops);
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 3c1d88b36741..f5b6d3979e5a 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -85,10 +85,10 @@ struct mshv_mem_region {
> u64 start_uaddr;
> u32 hv_map_flags;
> struct mshv_partition *partition;
> - enum mshv_region_type type;
> - struct mmu_interval_notifier mni;
> + enum mshv_region_type mreg_type;
> + struct mmu_interval_notifier mreg_mni;
> struct mutex mutex; /* protects region pages remapping */
> - struct page *pages[];
> + struct page *mreg_pages[];
> };
>
> struct mshv_irq_ack_notifier {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 1134a82c7881..eff1b21461dc 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> return false;
>
> /* Only movable memory ranges are supported for GPA intercepts */
> - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> ret = mshv_region_handle_gfn_fault(region, gfn);
> else
> ret = false;
> @@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> return PTR_ERR(rg);
>
> if (is_mmio)
> - rg->type = MSHV_REGION_TYPE_MMIO;
> + rg->mreg_type = MSHV_REGION_TYPE_MMIO;
> else if (mshv_partition_encrypted(partition) ||
> !mshv_region_movable_init(rg))
> - rg->type = MSHV_REGION_TYPE_MEM_PINNED;
> + rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
> else
> - rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
> + rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
>
> rg->partition = partition;
>
> @@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
> if (ret)
> return ret;
>
> - switch (region->type) {
> + switch (region->mreg_type) {
> case MSHV_REGION_TYPE_MEM_PINNED:
> ret = mshv_prepare_pinned_region(region);
> break;
> --
> 2.51.2.vfs.0.1
>
TBH, all these new names look ugly to me. Moreover, they are redundant.
For example, region->type makes it clear that we're talking about the
type *of a region*. Calling it mreg_type adds no additional semantic
information; it's just visual noise.
Coming to the part about finding it via grep/cscope. You could have
easily found these references by searching for "region->type",
"region->mni" etc. Perhaps we can change the variable naming convention
i.e. call a struct mshv_mem_region "mreg" everywhere and then one could
grep for "mreg->mni" and so on. Also, using more powerful tools such as
LSPs (clangd) can help find references more easily without tripping up
on common terms like "type", "pages" etc.
Anirudh.
^ permalink raw reply
* Re: [PATCH v1] mshv: make certain field names descriptive in a header struct
From: Mukesh R @ 2026-01-15 19:03 UTC (permalink / raw)
To: Anirudh Rayabharam; +Cc: linux-hyperv, wei.liu, nunodasneves
In-Reply-To: <d4iddlkzjapad2xck7oualffcncyyue2hcqa6u7cf7w62llejk@cgjt2fjvbaz6>
On 1/15/26 10:51, Anirudh Rayabharam wrote:
> On Mon, Jan 12, 2026 at 11:49:43AM -0800, Mukesh Rathor wrote:
>> When header struct fields use very common names like "pages" or "type",
>> it makes it difficult to find uses of these fields with tools like grep
>> and cscope. Add the prefix mreg_ to some fields in struct
>> mshv_mem_region to make it easier to find them.
>>
>> There is no functional change.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_regions.c | 44 ++++++++++++++++++-------------------
>> drivers/hv/mshv_root.h | 6 ++---
>> drivers/hv/mshv_root_main.c | 10 ++++-----
>> 3 files changed, 30 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
>> index 202b9d551e39..af81405f859b 100644
>> --- a/drivers/hv/mshv_regions.c
>> +++ b/drivers/hv/mshv_regions.c
>> @@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>> struct page *page;
>> int ret;
>>
>> - page = region->pages[page_offset];
>> + page = region->mreg_pages[page_offset];
>> if (!page)
>> return -EINVAL;
>>
>> @@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>>
>> /* Start at stride since the first page is validated */
>> for (count = stride; count < page_count; count += stride) {
>> - page = region->pages[page_offset + count];
>> + page = region->mreg_pages[page_offset + count];
>>
>> /* Break if current page is not present */
>> if (!page)
>> @@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
>>
>> while (page_count) {
>> /* Skip non-present pages */
>> - if (!region->pages[page_offset]) {
>> + if (!region->mreg_pages[page_offset]) {
>> page_offset++;
>> page_count--;
>> continue;
>> @@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
>> u32 flags,
>> u64 page_offset, u64 page_count)
>> {
>> - struct page *page = region->pages[page_offset];
>> + struct page *page = region->mreg_pages[page_offset];
>>
>> if (PageHuge(page) || PageTransCompound(page))
>> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>>
>> return hv_call_modify_spa_host_access(region->partition->pt_id,
>> - region->pages + page_offset,
>> + region->mreg_pages + page_offset,
>> page_count,
>> HV_MAP_GPA_READABLE |
>> HV_MAP_GPA_WRITABLE,
>> @@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
>> u32 flags,
>> u64 page_offset, u64 page_count)
>> {
>> - struct page *page = region->pages[page_offset];
>> + struct page *page = region->mreg_pages[page_offset];
>>
>> if (PageHuge(page) || PageTransCompound(page))
>> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>>
>> return hv_call_modify_spa_host_access(region->partition->pt_id,
>> - region->pages + page_offset,
>> + region->mreg_pages + page_offset,
>> page_count, 0,
>> flags, false);
>> }
>> @@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
>> u32 flags,
>> u64 page_offset, u64 page_count)
>> {
>> - struct page *page = region->pages[page_offset];
>> + struct page *page = region->mreg_pages[page_offset];
>>
>> if (PageHuge(page) || PageTransCompound(page))
>> flags |= HV_MAP_GPA_LARGE_PAGE;
>> @@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
>> return hv_call_map_gpa_pages(region->partition->pt_id,
>> region->start_gfn + page_offset,
>> page_count, flags,
>> - region->pages + page_offset);
>> + region->mreg_pages + page_offset);
>> }
>>
>> static int mshv_region_remap_pages(struct mshv_mem_region *region,
>> @@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
>> static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
>> u64 page_offset, u64 page_count)
>> {
>> - if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
>> - unpin_user_pages(region->pages + page_offset, page_count);
>> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
>> + unpin_user_pages(region->mreg_pages + page_offset, page_count);
>>
>> - memset(region->pages + page_offset, 0,
>> + memset(region->mreg_pages + page_offset, 0,
>> page_count * sizeof(struct page *));
>> }
>>
>> @@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
>> int ret;
>>
>> for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
>> - pages = region->pages + done_count;
>> + pages = region->mreg_pages + done_count;
>> userspace_addr = region->start_uaddr +
>> done_count * HV_HYP_PAGE_SIZE;
>> nr_pages = min(region->nr_pages - done_count,
>> @@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
>> u32 flags,
>> u64 page_offset, u64 page_count)
>> {
>> - struct page *page = region->pages[page_offset];
>> + struct page *page = region->mreg_pages[page_offset];
>>
>> if (PageHuge(page) || PageTransCompound(page))
>> flags |= HV_UNMAP_GPA_LARGE_PAGE;
>> @@ -321,7 +321,7 @@ static void mshv_region_destroy(struct kref *ref)
>> struct mshv_partition *partition = region->partition;
>> int ret;
>>
>> - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
>> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
>> mshv_region_movable_fini(region);
>>
>> if (mshv_partition_encrypted(partition)) {
>> @@ -374,9 +374,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
>> int ret;
>>
>> range->notifier_seq = mmu_interval_read_begin(range->notifier);
>> - mmap_read_lock(region->mni.mm);
>> + mmap_read_lock(region->mreg_mni.mm);
>> ret = hmm_range_fault(range);
>> - mmap_read_unlock(region->mni.mm);
>> + mmap_read_unlock(region->mreg_mni.mm);
>> if (ret)
>> return ret;
>>
>> @@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>> u64 page_offset, u64 page_count)
>> {
>> struct hmm_range range = {
>> - .notifier = &region->mni,
>> + .notifier = &region->mreg_mni,
>> .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
>> };
>> unsigned long *pfns;
>> @@ -430,7 +430,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>> goto out;
>>
>> for (i = 0; i < page_count; i++)
>> - region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
>> + region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
>>
>> ret = mshv_region_remap_pages(region, region->hv_map_flags,
>> page_offset, page_count);
>> @@ -489,7 +489,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
>> {
>> struct mshv_mem_region *region = container_of(mni,
>> struct mshv_mem_region,
>> - mni);
>> + mreg_mni);
>> u64 page_offset, page_count;
>> unsigned long mstart, mend;
>> int ret = -EPERM;
>> @@ -535,14 +535,14 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
>>
>> void mshv_region_movable_fini(struct mshv_mem_region *region)
>> {
>> - mmu_interval_notifier_remove(&region->mni);
>> + mmu_interval_notifier_remove(&region->mreg_mni);
>> }
>>
>> bool mshv_region_movable_init(struct mshv_mem_region *region)
>> {
>> int ret;
>>
>> - ret = mmu_interval_notifier_insert(&region->mni, current->mm,
>> + ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
>> region->start_uaddr,
>> region->nr_pages << HV_HYP_PAGE_SHIFT,
>> &mshv_region_mni_ops);
>> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
>> index 3c1d88b36741..f5b6d3979e5a 100644
>> --- a/drivers/hv/mshv_root.h
>> +++ b/drivers/hv/mshv_root.h
>> @@ -85,10 +85,10 @@ struct mshv_mem_region {
>> u64 start_uaddr;
>> u32 hv_map_flags;
>> struct mshv_partition *partition;
>> - enum mshv_region_type type;
>> - struct mmu_interval_notifier mni;
>> + enum mshv_region_type mreg_type;
>> + struct mmu_interval_notifier mreg_mni;
>> struct mutex mutex; /* protects region pages remapping */
>> - struct page *pages[];
>> + struct page *mreg_pages[];
>> };
>>
>> struct mshv_irq_ack_notifier {
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index 1134a82c7881..eff1b21461dc 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>> return false;
>>
>> /* Only movable memory ranges are supported for GPA intercepts */
>> - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
>> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
>> ret = mshv_region_handle_gfn_fault(region, gfn);
>> else
>> ret = false;
>> @@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>> return PTR_ERR(rg);
>>
>> if (is_mmio)
>> - rg->type = MSHV_REGION_TYPE_MMIO;
>> + rg->mreg_type = MSHV_REGION_TYPE_MMIO;
>> else if (mshv_partition_encrypted(partition) ||
>> !mshv_region_movable_init(rg))
>> - rg->type = MSHV_REGION_TYPE_MEM_PINNED;
>> + rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
>> else
>> - rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
>> + rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
>>
>> rg->partition = partition;
>>
>> @@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
>> if (ret)
>> return ret;
>>
>> - switch (region->type) {
>> + switch (region->mreg_type) {
>> case MSHV_REGION_TYPE_MEM_PINNED:
>> ret = mshv_prepare_pinned_region(region);
>> break;
>> --
>> 2.51.2.vfs.0.1
>>
>
> TBH, all these new names look ugly to me. Moreover, they are redundant.
> For example, region->type makes it clear that we're talking about the
> type *of a region*. Calling it mreg_type adds no additional semantic
> information; it's just visual noise.
>
> Coming to the part about finding it via grep/cscope. You could have
> easily found these references by searching for "region->type",
> "region->mni" etc. Perhaps we can change the variable naming convention
> i.e. call a struct mshv_mem_region "mreg" everywhere and then one could
> grep for "mreg->mni" and so on. Also, using more powerful tools such as
> LSPs (clangd) can help find references more easily without tripping up
> on common terms like "type", "pages" etc.
Huh! There is no way to enforce that one use ptrs with only certain names,
and that is an unreasonable requirement. What if the field is accessed by
struct.field reference? Are you suggesting that struct naming be enforced?
Ability to read code is far far more important to make sure bug free code
is written, it is a very small price for a large benefit. One gets used to
it so easily. Why do we prefix function names with mshv_ or hv_, should
we get rid of that also? And it's not just cscope or grep, sometimes
you're dealing with corrupt binary or coredump and you use "strings"
to get some meaning out of it. So I totally disagree with you. If you
don't like mreg_, please suggest alternates that are easy to find.
^ permalink raw reply
* Re: [PATCH v3 5/6] mshv: Add definitions for stats pages
From: Nuno Das Neves @ 2026-01-15 19:34 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-hyperv, linux-kernel, mhklinux, kys, haiyangz, wei.liu,
decui, longli, prapal, mrathor, paekkaladevi
In-Reply-To: <aWkTd2zkbVQqePVa@skinsburskii.localdomain>
On 1/15/2026 8:19 AM, Stanislav Kinsburskii wrote:
> On Wed, Jan 14, 2026 at 01:38:02PM -0800, Nuno Das Neves wrote:
>> Add the definitions for hypervisor, logical processor, and partition
>> stats pages.
>>
>
> The definitions for partition and virtual processor are outdated.
> Now is a good time to sync the new values in.
>
> Thanks,
> Stanislav
>
Good point, thanks, I will update it for v4.
I'm finally noticing that these counters are not really from hvhdk.h in
the Windows code, but from their own file. Since I'm still iterating on this,
what do you think about creating a file just for the counters?
e.g. drivers/hv/hvcounters.h, which combines hvcountersarm64 and amd64.
That would have a couple of advantages:
1. Not putting things in hvhdk.h which aren't actually there in the
Windows source
2. Less visibility of CamelCase naming outside our driver
3. I could define the enums using "X macro"s to generate the show() code
more cleanly in mshv_debugfs.c, which is something Michael suggested
here:
https://lore.kernel.org/linux-hyperv/SN6PR02MB4157938404BC0D12978ACD9BD4A2A@SN6PR02MB4157.namprd02.prod.outlook.com/
It would look something like this:
In hvcounters.h:
#if IS_ENABLED(CONFIG_X86_64)
#define HV_COUNTER_VP_LIST(X) \
	X(VpTotalRunTime, 1), \
	X(VpHypervisorRunTime, 2), \
	X(VpRemoteNodeRunTime, 3), \
	/* <snip> */
#elif IS_ENABLED(CONFIG_ARM64)
/* <snip> */
#endif
Just like now, it's a copy/paste from Windows + simple pattern
replacement. Note with this approach we need separate lists for arm64
and x86, but that matches how the enums are defined in Windows.
Then, in mshv_debugfs.c:
/*
 * We need the strings paired with their enum values.
 * This structure can be used for all the different stat types.
 */
struct hv_counter_entry {
	const char *name;
	int idx;
};
/*
 * Define an array entry (again, reusable).
 * No trailing comma here: the entries in HV_COUNTER_VP_LIST supply it.
 */
#define HV_COUNTER_LIST(name, idx) \
	{ __stringify(name), idx }
/* Create our static array */
static const struct hv_counter_entry hv_counter_vp_array[] = {
	HV_COUNTER_VP_LIST(HV_COUNTER_LIST)
};
static int vp_stats_show(struct seq_file *m, void *v)
{
	const struct hv_stats_page **pstats = m->private;
	int i;
	for (i = 0; i < ARRAY_SIZE(hv_counter_vp_array); ++i) {
		struct hv_counter_entry entry = hv_counter_vp_array[i];
		u64 parent_val = pstats[HV_STATS_AREA_PARENT]->vp_cntrs[entry.idx];
		u64 self_val = pstats[HV_STATS_AREA_SELF]->vp_cntrs[entry.idx];
		/* Prioritize the PARENT area value */
		seq_printf(m, "%-30s: %llu\n", entry.name,
			   parent_val ? parent_val : self_val);
	}
	return 0;
}
Any thoughts? I was originally going to just go with the pattern we had,
but since these definitions aren't from the hv*dk.h files, we can maybe
get more creative and make the resulting code look a bit better.
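For readers unfamiliar with the pattern, the X-macro idea proposed above can be sketched as a standalone C fragment. This is only an illustration under stated assumptions: every sketch_/SKETCH_ name is hypothetical, only the three counter names and values shown in the list above are used, plain C stringification (#name) stands in for the kernel's __stringify(), and fprintf() stands in for seq_printf().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical three-entry list; a real list would carry every counter. */
#define SKETCH_COUNTER_VP_LIST(X) \
	X(VpTotalRunTime, 1) \
	X(VpHypervisorRunTime, 2) \
	X(VpRemoteNodeRunTime, 3)

/* First expansion: generate the enum members from the list. */
#define SKETCH_ENUM_ENTRY(name, idx) name = (idx),
enum sketch_vp_counters {
	SKETCH_COUNTER_VP_LIST(SKETCH_ENUM_ENTRY)
	SketchVpStatsMaxCounter
};

/* Second expansion: pair each counter's name string with its index. */
struct sketch_counter_entry {
	const char *name;
	int idx;
};

#define SKETCH_ARRAY_ENTRY(name, idx) { #name, (idx) },
static const struct sketch_counter_entry sketch_vp_array[] = {
	SKETCH_COUNTER_VP_LIST(SKETCH_ARRAY_ENTRY)
};

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Prefer the PARENT area value; fall back to SELF when PARENT is zero. */
static uint64_t sketch_pick(uint64_t parent_val, uint64_t self_val)
{
	return parent_val ? parent_val : self_val;
}

/* Print every counter; both arrays must hold SketchVpStatsMaxCounter slots. */
static int sketch_vp_stats_show(FILE *out, const uint64_t *parent,
				const uint64_t *self)
{
	size_t i;

	for (i = 0; i < SKETCH_ARRAY_SIZE(sketch_vp_array); ++i) {
		const struct sketch_counter_entry *e = &sketch_vp_array[i];

		fprintf(out, "%-30s: %llu\n", e->name,
			(unsigned long long)sketch_pick(parent[e->idx],
							self[e->idx]));
	}
	return 0;
}
```

In this sketch the separators live in the per-use macros rather than in the list, so the same list can expand into both the enum and the name table. Calling sketch_vp_stats_show(stdout, parent, self) with two zero-initialised arrays of SketchVpStatsMaxCounter elements prints one line per counter.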
Thanks
Nuno
>> Move the definition for the VP stats page to its rightful place in
>> hvhdk.h, and add the missing members.
>>
>> While at it, correct the ARM64 value of VpRootDispatchThreadBlocked
>> (which is not yet used, so there is no impact).
>>
>> These enum members retain their CamelCase style, since they are imported
>> directly from the hypervisor code. They will be stringified when
>> printing the stats out, and retain more readability in this form.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_root_main.c | 17 --
>> include/hyperv/hvhdk.h | 437 ++++++++++++++++++++++++++++++++++++
>> 2 files changed, 437 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index fbfc9e7d9fa4..724bbaa0b08c 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -39,23 +39,6 @@ MODULE_AUTHOR("Microsoft");
>> MODULE_LICENSE("GPL");
>> MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
>>
>> -/* TODO move this to another file when debugfs code is added */
>> -enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
>> -#if defined(CONFIG_X86)
>> - VpRootDispatchThreadBlocked = 202,
>> -#elif defined(CONFIG_ARM64)
>> - VpRootDispatchThreadBlocked = 94,
>> -#endif
>> - VpStatsMaxCounter
>> -};
>> -
>> -struct hv_stats_page {
>> - union {
>> - u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
>> - u8 data[HV_HYP_PAGE_SIZE];
>> - };
>> -} __packed;
>> -
>> struct mshv_root mshv_root;
>>
>> enum hv_scheduler_type hv_scheduler_type;
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 469186df7826..8bddd11feeba 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -10,6 +10,443 @@
>> #include "hvhdk_mini.h"
>> #include "hvgdk.h"
>>
>> +enum hv_stats_hypervisor_counters { /* HV_HYPERVISOR_COUNTER */
>> + HvLogicalProcessors = 1,
>> + HvPartitions = 2,
>> + HvTotalPages = 3,
>> + HvVirtualProcessors = 4,
>> + HvMonitoredNotifications = 5,
>> + HvModernStandbyEntries = 6,
>> + HvPlatformIdleTransitions = 7,
>> + HvHypervisorStartupCost = 8,
>> + HvIOSpacePages = 10,
>> + HvNonEssentialPagesForDump = 11,
>> + HvSubsumedPages = 12,
>> + HvStatsMaxCounter
>> +};
>> +
>> +enum hv_stats_partition_counters { /* HV_PROCESS_COUNTER */
>> + PartitionVirtualProcessors = 1,
>> + PartitionTlbSize = 3,
>> + PartitionAddressSpaces = 4,
>> + PartitionDepositedPages = 5,
>> + PartitionGpaPages = 6,
>> + PartitionGpaSpaceModifications = 7,
>> + PartitionVirtualTlbFlushEntires = 8,
>> + PartitionRecommendedTlbSize = 9,
>> + PartitionGpaPages4K = 10,
>> + PartitionGpaPages2M = 11,
>> + PartitionGpaPages1G = 12,
>> + PartitionGpaPages512G = 13,
>> + PartitionDevicePages4K = 14,
>> + PartitionDevicePages2M = 15,
>> + PartitionDevicePages1G = 16,
>> + PartitionDevicePages512G = 17,
>> + PartitionAttachedDevices = 18,
>> + PartitionDeviceInterruptMappings = 19,
>> + PartitionIoTlbFlushes = 20,
>> + PartitionIoTlbFlushCost = 21,
>> + PartitionDeviceInterruptErrors = 22,
>> + PartitionDeviceDmaErrors = 23,
>> + PartitionDeviceInterruptThrottleEvents = 24,
>> + PartitionSkippedTimerTicks = 25,
>> + PartitionPartitionId = 26,
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + PartitionNestedTlbSize = 27,
>> + PartitionRecommendedNestedTlbSize = 28,
>> + PartitionNestedTlbFreeListSize = 29,
>> + PartitionNestedTlbTrimmedPages = 30,
>> + PartitionPagesShattered = 31,
>> + PartitionPagesRecombined = 32,
>> + PartitionHwpRequestValue = 33,
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + PartitionHwpRequestValue = 27,
>> +#endif
>> + PartitionStatsMaxCounter
>> +};
>> +
>> +enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
>> + VpTotalRunTime = 1,
>> + VpHypervisorRunTime = 2,
>> + VpRemoteNodeRunTime = 3,
>> + VpNormalizedRunTime = 4,
>> + VpIdealCpu = 5,
>> + VpHypercallsCount = 7,
>> + VpHypercallsTime = 8,
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + VpPageInvalidationsCount = 9,
>> + VpPageInvalidationsTime = 10,
>> + VpControlRegisterAccessesCount = 11,
>> + VpControlRegisterAccessesTime = 12,
>> + VpIoInstructionsCount = 13,
>> + VpIoInstructionsTime = 14,
>> + VpHltInstructionsCount = 15,
>> + VpHltInstructionsTime = 16,
>> + VpMwaitInstructionsCount = 17,
>> + VpMwaitInstructionsTime = 18,
>> + VpCpuidInstructionsCount = 19,
>> + VpCpuidInstructionsTime = 20,
>> + VpMsrAccessesCount = 21,
>> + VpMsrAccessesTime = 22,
>> + VpOtherInterceptsCount = 23,
>> + VpOtherInterceptsTime = 24,
>> + VpExternalInterruptsCount = 25,
>> + VpExternalInterruptsTime = 26,
>> + VpPendingInterruptsCount = 27,
>> + VpPendingInterruptsTime = 28,
>> + VpEmulatedInstructionsCount = 29,
>> + VpEmulatedInstructionsTime = 30,
>> + VpDebugRegisterAccessesCount = 31,
>> + VpDebugRegisterAccessesTime = 32,
>> + VpPageFaultInterceptsCount = 33,
>> + VpPageFaultInterceptsTime = 34,
>> + VpGuestPageTableMaps = 35,
>> + VpLargePageTlbFills = 36,
>> + VpSmallPageTlbFills = 37,
>> + VpReflectedGuestPageFaults = 38,
>> + VpApicMmioAccesses = 39,
>> + VpIoInterceptMessages = 40,
>> + VpMemoryInterceptMessages = 41,
>> + VpApicEoiAccesses = 42,
>> + VpOtherMessages = 43,
>> + VpPageTableAllocations = 44,
>> + VpLogicalProcessorMigrations = 45,
>> + VpAddressSpaceEvictions = 46,
>> + VpAddressSpaceSwitches = 47,
>> + VpAddressDomainFlushes = 48,
>> + VpAddressSpaceFlushes = 49,
>> + VpGlobalGvaRangeFlushes = 50,
>> + VpLocalGvaRangeFlushes = 51,
>> + VpPageTableEvictions = 52,
>> + VpPageTableReclamations = 53,
>> + VpPageTableResets = 54,
>> + VpPageTableValidations = 55,
>> + VpApicTprAccesses = 56,
>> + VpPageTableWriteIntercepts = 57,
>> + VpSyntheticInterrupts = 58,
>> + VpVirtualInterrupts = 59,
>> + VpApicIpisSent = 60,
>> + VpApicSelfIpisSent = 61,
>> + VpGpaSpaceHypercalls = 62,
>> + VpLogicalProcessorHypercalls = 63,
>> + VpLongSpinWaitHypercalls = 64,
>> + VpOtherHypercalls = 65,
>> + VpSyntheticInterruptHypercalls = 66,
>> + VpVirtualInterruptHypercalls = 67,
>> + VpVirtualMmuHypercalls = 68,
>> + VpVirtualProcessorHypercalls = 69,
>> + VpHardwareInterrupts = 70,
>> + VpNestedPageFaultInterceptsCount = 71,
>> + VpNestedPageFaultInterceptsTime = 72,
>> + VpPageScans = 73,
>> + VpLogicalProcessorDispatches = 74,
>> + VpWaitingForCpuTime = 75,
>> + VpExtendedHypercalls = 76,
>> + VpExtendedHypercallInterceptMessages = 77,
>> + VpMbecNestedPageTableSwitches = 78,
>> + VpOtherReflectedGuestExceptions = 79,
>> + VpGlobalIoTlbFlushes = 80,
>> + VpGlobalIoTlbFlushCost = 81,
>> + VpLocalIoTlbFlushes = 82,
>> + VpLocalIoTlbFlushCost = 83,
>> + VpHypercallsForwardedCount = 84,
>> + VpHypercallsForwardingTime = 85,
>> + VpPageInvalidationsForwardedCount = 86,
>> + VpPageInvalidationsForwardingTime = 87,
>> + VpControlRegisterAccessesForwardedCount = 88,
>> + VpControlRegisterAccessesForwardingTime = 89,
>> + VpIoInstructionsForwardedCount = 90,
>> + VpIoInstructionsForwardingTime = 91,
>> + VpHltInstructionsForwardedCount = 92,
>> + VpHltInstructionsForwardingTime = 93,
>> + VpMwaitInstructionsForwardedCount = 94,
>> + VpMwaitInstructionsForwardingTime = 95,
>> + VpCpuidInstructionsForwardedCount = 96,
>> + VpCpuidInstructionsForwardingTime = 97,
>> + VpMsrAccessesForwardedCount = 98,
>> + VpMsrAccessesForwardingTime = 99,
>> + VpOtherInterceptsForwardedCount = 100,
>> + VpOtherInterceptsForwardingTime = 101,
>> + VpExternalInterruptsForwardedCount = 102,
>> + VpExternalInterruptsForwardingTime = 103,
>> + VpPendingInterruptsForwardedCount = 104,
>> + VpPendingInterruptsForwardingTime = 105,
>> + VpEmulatedInstructionsForwardedCount = 106,
>> + VpEmulatedInstructionsForwardingTime = 107,
>> + VpDebugRegisterAccessesForwardedCount = 108,
>> + VpDebugRegisterAccessesForwardingTime = 109,
>> + VpPageFaultInterceptsForwardedCount = 110,
>> + VpPageFaultInterceptsForwardingTime = 111,
>> + VpVmclearEmulationCount = 112,
>> + VpVmclearEmulationTime = 113,
>> + VpVmptrldEmulationCount = 114,
>> + VpVmptrldEmulationTime = 115,
>> + VpVmptrstEmulationCount = 116,
>> + VpVmptrstEmulationTime = 117,
>> + VpVmreadEmulationCount = 118,
>> + VpVmreadEmulationTime = 119,
>> + VpVmwriteEmulationCount = 120,
>> + VpVmwriteEmulationTime = 121,
>> + VpVmxoffEmulationCount = 122,
>> + VpVmxoffEmulationTime = 123,
>> + VpVmxonEmulationCount = 124,
>> + VpVmxonEmulationTime = 125,
>> + VpNestedVMEntriesCount = 126,
>> + VpNestedVMEntriesTime = 127,
>> + VpNestedSLATSoftPageFaultsCount = 128,
>> + VpNestedSLATSoftPageFaultsTime = 129,
>> + VpNestedSLATHardPageFaultsCount = 130,
>> + VpNestedSLATHardPageFaultsTime = 131,
>> + VpInvEptAllContextEmulationCount = 132,
>> + VpInvEptAllContextEmulationTime = 133,
>> + VpInvEptSingleContextEmulationCount = 134,
>> + VpInvEptSingleContextEmulationTime = 135,
>> + VpInvVpidAllContextEmulationCount = 136,
>> + VpInvVpidAllContextEmulationTime = 137,
>> + VpInvVpidSingleContextEmulationCount = 138,
>> + VpInvVpidSingleContextEmulationTime = 139,
>> + VpInvVpidSingleAddressEmulationCount = 140,
>> + VpInvVpidSingleAddressEmulationTime = 141,
>> + VpNestedTlbPageTableReclamations = 142,
>> + VpNestedTlbPageTableEvictions = 143,
>> + VpFlushGuestPhysicalAddressSpaceHypercalls = 144,
>> + VpFlushGuestPhysicalAddressListHypercalls = 145,
>> + VpPostedInterruptNotifications = 146,
>> + VpPostedInterruptScans = 147,
>> + VpTotalCoreRunTime = 148,
>> + VpMaximumRunTime = 149,
>> + VpHwpRequestContextSwitches = 150,
>> + VpWaitingForCpuTimeBucket0 = 151,
>> + VpWaitingForCpuTimeBucket1 = 152,
>> + VpWaitingForCpuTimeBucket2 = 153,
>> + VpWaitingForCpuTimeBucket3 = 154,
>> + VpWaitingForCpuTimeBucket4 = 155,
>> + VpWaitingForCpuTimeBucket5 = 156,
>> + VpWaitingForCpuTimeBucket6 = 157,
>> + VpVmloadEmulationCount = 158,
>> + VpVmloadEmulationTime = 159,
>> + VpVmsaveEmulationCount = 160,
>> + VpVmsaveEmulationTime = 161,
>> + VpGifInstructionEmulationCount = 162,
>> + VpGifInstructionEmulationTime = 163,
>> + VpEmulatedErrataSvmInstructions = 164,
>> + VpPlaceholder1 = 165,
>> + VpPlaceholder2 = 166,
>> + VpPlaceholder3 = 167,
>> + VpPlaceholder4 = 168,
>> + VpPlaceholder5 = 169,
>> + VpPlaceholder6 = 170,
>> + VpPlaceholder7 = 171,
>> + VpPlaceholder8 = 172,
>> + VpPlaceholder9 = 173,
>> + VpPlaceholder10 = 174,
>> + VpSchedulingPriority = 175,
>> + VpRdpmcInstructionsCount = 176,
>> + VpRdpmcInstructionsTime = 177,
>> + VpPerfmonPmuMsrAccessesCount = 178,
>> + VpPerfmonLbrMsrAccessesCount = 179,
>> + VpPerfmonIptMsrAccessesCount = 180,
>> + VpPerfmonInterruptCount = 181,
>> + VpVtl1DispatchCount = 182,
>> + VpVtl2DispatchCount = 183,
>> + VpVtl2DispatchBucket0 = 184,
>> + VpVtl2DispatchBucket1 = 185,
>> + VpVtl2DispatchBucket2 = 186,
>> + VpVtl2DispatchBucket3 = 187,
>> + VpVtl2DispatchBucket4 = 188,
>> + VpVtl2DispatchBucket5 = 189,
>> + VpVtl2DispatchBucket6 = 190,
>> + VpVtl1RunTime = 191,
>> + VpVtl2RunTime = 192,
>> + VpIommuHypercalls = 193,
>> + VpCpuGroupHypercalls = 194,
>> + VpVsmHypercalls = 195,
>> + VpEventLogHypercalls = 196,
>> + VpDeviceDomainHypercalls = 197,
>> + VpDepositHypercalls = 198,
>> + VpSvmHypercalls = 199,
>> + VpBusLockAcquisitionCount = 200,
>> + VpLoadAvg = 201,
>> + VpRootDispatchThreadBlocked = 202,
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + VpSysRegAccessesCount = 9,
>> + VpSysRegAccessesTime = 10,
>> + VpSmcInstructionsCount = 11,
>> + VpSmcInstructionsTime = 12,
>> + VpOtherInterceptsCount = 13,
>> + VpOtherInterceptsTime = 14,
>> + VpExternalInterruptsCount = 15,
>> + VpExternalInterruptsTime = 16,
>> + VpPendingInterruptsCount = 17,
>> + VpPendingInterruptsTime = 18,
>> + VpGuestPageTableMaps = 19,
>> + VpLargePageTlbFills = 20,
>> + VpSmallPageTlbFills = 21,
>> + VpReflectedGuestPageFaults = 22,
>> + VpMemoryInterceptMessages = 23,
>> + VpOtherMessages = 24,
>> + VpLogicalProcessorMigrations = 25,
>> + VpAddressDomainFlushes = 26,
>> + VpAddressSpaceFlushes = 27,
>> + VpSyntheticInterrupts = 28,
>> + VpVirtualInterrupts = 29,
>> + VpApicSelfIpisSent = 30,
>> + VpGpaSpaceHypercalls = 31,
>> + VpLogicalProcessorHypercalls = 32,
>> + VpLongSpinWaitHypercalls = 33,
>> + VpOtherHypercalls = 34,
>> + VpSyntheticInterruptHypercalls = 35,
>> + VpVirtualInterruptHypercalls = 36,
>> + VpVirtualMmuHypercalls = 37,
>> + VpVirtualProcessorHypercalls = 38,
>> + VpHardwareInterrupts = 39,
>> + VpNestedPageFaultInterceptsCount = 40,
>> + VpNestedPageFaultInterceptsTime = 41,
>> + VpLogicalProcessorDispatches = 42,
>> + VpWaitingForCpuTime = 43,
>> + VpExtendedHypercalls = 44,
>> + VpExtendedHypercallInterceptMessages = 45,
>> + VpMbecNestedPageTableSwitches = 46,
>> + VpOtherReflectedGuestExceptions = 47,
>> + VpGlobalIoTlbFlushes = 48,
>> + VpGlobalIoTlbFlushCost = 49,
>> + VpLocalIoTlbFlushes = 50,
>> + VpLocalIoTlbFlushCost = 51,
>> + VpFlushGuestPhysicalAddressSpaceHypercalls = 52,
>> + VpFlushGuestPhysicalAddressListHypercalls = 53,
>> + VpPostedInterruptNotifications = 54,
>> + VpPostedInterruptScans = 55,
>> + VpTotalCoreRunTime = 56,
>> + VpMaximumRunTime = 57,
>> + VpWaitingForCpuTimeBucket0 = 58,
>> + VpWaitingForCpuTimeBucket1 = 59,
>> + VpWaitingForCpuTimeBucket2 = 60,
>> + VpWaitingForCpuTimeBucket3 = 61,
>> + VpWaitingForCpuTimeBucket4 = 62,
>> + VpWaitingForCpuTimeBucket5 = 63,
>> + VpWaitingForCpuTimeBucket6 = 64,
>> + VpHwpRequestContextSwitches = 65,
>> + VpPlaceholder2 = 66,
>> + VpPlaceholder3 = 67,
>> + VpPlaceholder4 = 68,
>> + VpPlaceholder5 = 69,
>> + VpPlaceholder6 = 70,
>> + VpPlaceholder7 = 71,
>> + VpPlaceholder8 = 72,
>> + VpContentionTime = 73,
>> + VpWakeUpTime = 74,
>> + VpSchedulingPriority = 75,
>> + VpVtl1DispatchCount = 76,
>> + VpVtl2DispatchCount = 77,
>> + VpVtl2DispatchBucket0 = 78,
>> + VpVtl2DispatchBucket1 = 79,
>> + VpVtl2DispatchBucket2 = 80,
>> + VpVtl2DispatchBucket3 = 81,
>> + VpVtl2DispatchBucket4 = 82,
>> + VpVtl2DispatchBucket5 = 83,
>> + VpVtl2DispatchBucket6 = 84,
>> + VpVtl1RunTime = 85,
>> + VpVtl2RunTime = 86,
>> + VpIommuHypercalls = 87,
>> + VpCpuGroupHypercalls = 88,
>> + VpVsmHypercalls = 89,
>> + VpEventLogHypercalls = 90,
>> + VpDeviceDomainHypercalls = 91,
>> + VpDepositHypercalls = 92,
>> + VpSvmHypercalls = 93,
>> + VpLoadAvg = 94,
>> + VpRootDispatchThreadBlocked = 95,
>> +#endif
>> + VpStatsMaxCounter
>> +};
>> +
>> +enum hv_stats_lp_counters { /* HV_CPU_COUNTER */
>> + LpGlobalTime = 1,
>> + LpTotalRunTime = 2,
>> + LpHypervisorRunTime = 3,
>> + LpHardwareInterrupts = 4,
>> + LpContextSwitches = 5,
>> + LpInterProcessorInterrupts = 6,
>> + LpSchedulerInterrupts = 7,
>> + LpTimerInterrupts = 8,
>> + LpInterProcessorInterruptsSent = 9,
>> + LpProcessorHalts = 10,
>> + LpMonitorTransitionCost = 11,
>> + LpContextSwitchTime = 12,
>> + LpC1TransitionsCount = 13,
>> + LpC1RunTime = 14,
>> + LpC2TransitionsCount = 15,
>> + LpC2RunTime = 16,
>> + LpC3TransitionsCount = 17,
>> + LpC3RunTime = 18,
>> + LpRootVpIndex = 19,
>> + LpIdleSequenceNumber = 20,
>> + LpGlobalTscCount = 21,
>> + LpActiveTscCount = 22,
>> + LpIdleAccumulation = 23,
>> + LpReferenceCycleCount0 = 24,
>> + LpActualCycleCount0 = 25,
>> + LpReferenceCycleCount1 = 26,
>> + LpActualCycleCount1 = 27,
>> + LpProximityDomainId = 28,
>> + LpPostedInterruptNotifications = 29,
>> + LpBranchPredictorFlushes = 30,
>> +#if IS_ENABLED(CONFIG_X86_64)
>> + LpL1DataCacheFlushes = 31,
>> + LpImmediateL1DataCacheFlushes = 32,
>> + LpMbFlushes = 33,
>> + LpCounterRefreshSequenceNumber = 34,
>> + LpCounterRefreshReferenceTime = 35,
>> + LpIdleAccumulationSnapshot = 36,
>> + LpActiveTscCountSnapshot = 37,
>> + LpHwpRequestContextSwitches = 38,
>> + LpPlaceholder1 = 39,
>> + LpPlaceholder2 = 40,
>> + LpPlaceholder3 = 41,
>> + LpPlaceholder4 = 42,
>> + LpPlaceholder5 = 43,
>> + LpPlaceholder6 = 44,
>> + LpPlaceholder7 = 45,
>> + LpPlaceholder8 = 46,
>> + LpPlaceholder9 = 47,
>> + LpPlaceholder10 = 48,
>> + LpReserveGroupId = 49,
>> + LpRunningPriority = 50,
>> + LpPerfmonInterruptCount = 51,
>> +#elif IS_ENABLED(CONFIG_ARM64)
>> + LpCounterRefreshSequenceNumber = 31,
>> + LpCounterRefreshReferenceTime = 32,
>> + LpIdleAccumulationSnapshot = 33,
>> + LpActiveTscCountSnapshot = 34,
>> + LpHwpRequestContextSwitches = 35,
>> + LpPlaceholder2 = 36,
>> + LpPlaceholder3 = 37,
>> + LpPlaceholder4 = 38,
>> + LpPlaceholder5 = 39,
>> + LpPlaceholder6 = 40,
>> + LpPlaceholder7 = 41,
>> + LpPlaceholder8 = 42,
>> + LpPlaceholder9 = 43,
>> + LpSchLocalRunListSize = 44,
>> + LpReserveGroupId = 45,
>> + LpRunningPriority = 46,
>> +#endif
>> + LpStatsMaxCounter
>> +};
>> +
>> +/*
>> + * Hypervisor statistics page format
>> + */
>> +struct hv_stats_page {
>> + union {
>> + u64 hv_cntrs[HvStatsMaxCounter]; /* Hypervisor counters */
>> + u64 pt_cntrs[PartitionStatsMaxCounter]; /* Partition counters */
>> + u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
>> + u64 lp_cntrs[LpStatsMaxCounter]; /* LP counters */
>> + u8 data[HV_HYP_PAGE_SIZE];
>> + };
>> +} __packed;
>> +
>> /* Bits for dirty mask of hv_vp_register_page */
>> #define HV_X64_REGISTER_CLASS_GENERAL 0
>> #define HV_X64_REGISTER_CLASS_IP 1
>> --
>> 2.34.1
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-15 19:57 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
Shiraz Saleem, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260114185450.58db5a6d@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, January 14, 2026 9:55 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Konstantin
> Taranov <kotaranov@microsoft.com>; Simon Horman <horms@kernel.org>; Erni
> Sri Satya Vennela <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add
> support for coalesced RX packets on CQE
>
> On Wed, 14 Jan 2026 18:27:50 +0000 Haiyang Zhang wrote:
> > > > And, the coalescing can add up to 2 microseconds into one-way
> > > > latency.
> > >
> > > I am asking you how the _device_ (hypervisor?) decides when to coalesce
> > > and when to send a partial CQE (<4 packets in 4 pkt CQE). You are using
> > > the coalescing uAPI, so I'm trying to make sure this is the correct API.
> > > CQE configuration can also be done via ringparam.
> >
> > When coalescing is enabled, the device waits for packets which can
> > have the CQE coalesced with previous packet(s). That coalescing process
> > is finished (and a CQE written to the appropriate CQ) when the CQE is
> > filled with 4 pkts, or time expired, or other device specific logic is
> > satisfied.
>
> See, what I'm afraid is happening here is that you are enabling
> completion coalescing (how long the device keeps the CQE pending).
> Which is _not_ what rx_max_coalesced_frames controls for most NICs.
> For most NICs rx_max_coalesced_frames controls IRQ generation logic.
>
> The NIC first buffers up CQEs for typically single digit usecs, and
> then once CQE timer expired and writeback happened it starts an IRQ
> coalescing timer. Once the IRQ coalescing timer expires IRQ is
> triggered, which schedules NAPI. (broad strokes, obviously many
> differences and optimizations exist)
>
> Is my guess correct? Are you controlling CQE coalescing?
>
> Can you control the timeout instead of the frame count?
Our NIC's timeout value cannot be controlled by the driver. Also, the
timeout may change in future NIC HW.
So I use ethtool's rx-frames, which is either 1 or 4 on our
NIC, to switch the CQE coalescing feature on/off.
Thanks,
- Haiyang
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Jakub Kicinski @ 2026-01-16 2:14 UTC (permalink / raw)
To: Haiyang Zhang
Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
Shiraz Saleem, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <SA3PR21MB38673CA4DDE618A5D9C4FA99CA8CA@SA3PR21MB3867.namprd21.prod.outlook.com>
On Thu, 15 Jan 2026 19:57:44 +0000 Haiyang Zhang wrote:
> > > When coalescing is enabled, the device waits for packets which can
> > > have the CQE coalesced with previous packet(s). That coalescing process
> > > is finished (and a CQE written to the appropriate CQ) when the CQE is
> > > filled with 4 pkts, or time expired, or other device specific logic is
> > > satisfied.
> >
> > See, what I'm afraid is happening here is that you are enabling
> > completion coalescing (how long the device keeps the CQE pending).
> > Which is _not_ what rx_max_coalesced_frames controls for most NICs.
> > For most NICs rx_max_coalesced_frames controls IRQ generation logic.
> >
> > The NIC first buffers up CQEs for typically single digit usecs, and
> > then once CQE timer expired and writeback happened it starts an IRQ
> > coalescing timer. Once the IRQ coalescing timer expires IRQ is
> > triggered, which schedules NAPI. (broad strokes, obviously many
> > differences and optimizations exist)
> >
> > Is my guess correct? Are you controlling CQE coalescing?
> >
> > Can you control the timeout instead of the frame count?
>
> Our NIC's timeout value cannot be controlled by driver. Also, the
> timeout may be changed in future NIC HW.
>
> So, I use the ethtool/rx-frames, which is either 1 or 4 on our
> NIC, to switch the CQE coalescing feature on/off.
I feel like this is not the first time I'm having a conversation with
you where you are not answering my direct questions, not just one
sliver. IDK why you're doing this, but being able to participate
in an email exchange is a bare minimum for participating upstream.
Please consider this a warning.
If I interpret your reply correctly you are indeed coalescing writeback.
You need to add a new param to the uAPI. Please add both size and
timeout. Expose the timeout as read only if your device doesn't support
controlling it per queue.
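One way to picture the request above (a separate parameter pair for CQE coalescing, with the timeout read-only when the device cannot change it) is the following userspace sketch. It is not the ethtool uAPI and not the mana driver: the struct, the functions and their names are hypothetical, the choice of error codes is an assumption, and the 1-or-4 frame limit is taken from the description earlier in this thread.

```c
#include <errno.h>

/* Hypothetical CQE coalescing settings, kept apart from IRQ coalescing. */
struct sketch_cqe_coalesce {
	unsigned int frames;	/* packets per CQE: 1 (off) or 4 (on) */
	unsigned int usecs;	/* device-owned timeout, read-only here */
};

/*
 * Apply a requested configuration to the current one.
 * Returns 0 on success, -EOPNOTSUPP if the caller tries to change the
 * read-only timeout, -EINVAL for an unsupported frame count.
 */
static int sketch_set_cqe_coalesce(struct sketch_cqe_coalesce *cur,
				   const struct sketch_cqe_coalesce *req)
{
	if (req->usecs != cur->usecs)
		return -EOPNOTSUPP;
	if (req->frames != 1 && req->frames != 4)
		return -EINVAL;

	cur->frames = req->frames;
	return 0;
}

/* CQE coalescing counts as enabled when more than one packet fits a CQE. */
static int sketch_cqe_coalescing_enabled(const struct sketch_cqe_coalesce *cur)
{
	return cur->frames > 1;
}
```

A caller would read the current settings, change only the frame count, and pass both structures to sketch_set_cqe_coalesce(); a request that also alters the timeout is rejected without touching the current state.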
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Zack Rusin @ 2026-01-16 3:59 UTC (permalink / raw)
To: Thomas Zimmermann
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Christian König, Danilo Krummrich, Dave Airlie,
Deepak Rawat, Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh,
Hans de Goede, Hawking Zhang, Helge Deller, intel-gfx, intel-xe,
Jani Nikula, Javier Martinez Canillas, Jocelyn Falempe,
Joonas Lahtinen, Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv,
linux-kernel, Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <97993761-5884-4ada-b345-9fb64819e02a@suse.de>
On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> That's really not going to work. For example, in the current series, you
> invoke devm_aperture_remove_conflicting_pci_devices_done() after
> drm_mode_reset(), drm_dev_register() and drm_client_setup().
That's perfectly fine:
devm_aperture_remove_conflicting_pci_devices_done only removes the
reload behavior; it doesn't do anything else.
This series, essentially, just adds a "defer" statement to
aperture_remove_conflicting_pci_devices that says
"reload sysfb if this driver unloads".
devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.
You could ask why have
devm_aperture_remove_conflicting_pci_devices_done at all then and it's
because I didn't want to change the default behavior of anything.
There are three cases:
1) Driver fails to load before
aperture_remove_conflicting_pci_devices, in which case sysfb is still
active and there's no problem.
2) Driver fails to load after aperture_remove_conflicting_pci_devices,
in which case sysfb is gone and the screen is blank.
3) Driver is unloaded after the probe succeeded. igt tests this too.
Without devm_aperture_remove_conflicting_pci_devices_done we'd try to
reload sysfb in #3, which, in general makes sense to me and I'd
probably remove it in my drivers, but there might be people or tests
(again, igt does it and we don't need to flip-flop between sysfb and
the driver there) that depend on specifically that behavior of not
having anything driving fb so I didn't want to change it.
So with this series the worst case scenario is that the driver that
failed after aperture_remove_conflicting_pci_devices changed the
hardware state so much that sysfb can't recover and the fb is blank.
So it was blank before and this series can't fix it because the driver
in its cleanup routine will need to do more unwinding for sysfb to
reload (i.e. we'd need an extra patch to unwind the driver state).
There also might be the case of some crazy behavior, e.g. a PCI BAR
resize in the driver makes the VGA hardware crash or something, in
which case, yes, we should definitely skip this patch, at least until
those drivers properly clean up on exit.
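The "defer" semantics described above, and the three cases, can be modelled with a small userspace sketch. It only mirrors the behavior as explained in this message, not the code in the series: the struct and every sketch_ function are hypothetical stand-ins, and the hardware is assumed to be in a state that sysfb can actually recover.

```c
#include <stdbool.h>

/* Hypothetical per-device state for the deferred sysfb reload. */
struct sketch_dev {
	bool sysfb_active;	/* firmware framebuffer currently drives output */
	bool reload_armed;	/* "reload sysfb if this driver goes away" */
};

/* Models removing the conflicting sysfb device and arming the reload. */
static void sketch_remove_conflicting(struct sketch_dev *dev)
{
	dev->sysfb_active = false;
	dev->reload_armed = true;
}

/* Models the *_done() call: cancel the deferred reload, nothing else. */
static void sketch_remove_conflicting_done(struct sketch_dev *dev)
{
	dev->reload_armed = false;
}

/* Models driver teardown (probe failure or unload): run the deferred action. */
static void sketch_driver_detach(struct sketch_dev *dev)
{
	if (dev->reload_armed) {
		dev->sysfb_active = true;
		dev->reload_armed = false;
	}
}
```

In this model, case 1 never calls sketch_remove_conflicting(), so sysfb stays active; case 2 detaches with the reload still armed, so sysfb comes back; case 3 calls sketch_remove_conflicting_done() first, so unloading leaves nothing driving the framebuffer, which matches the existing default.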
z
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Thomas Zimmermann @ 2026-01-16 7:39 UTC (permalink / raw)
To: Ville Syrjälä, Christian König
Cc: Zack Rusin, dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel,
Ce Sun, Chia-I Wu, Danilo Krummrich, Dave Airlie, Deepak Rawat,
Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh, Hans de Goede,
Hawking Zhang, Helge Deller, intel-gfx, intel-xe, Jani Nikula,
Javier Martinez Canillas, Jocelyn Falempe, Joonas Lahtinen,
Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv, linux-kernel,
Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <aWkDYO1o9T1BhvXj@intel.com>
Hi
Am 15.01.26 um 16:10 schrieb Ville Syrjälä:
> On Thu, Jan 15, 2026 at 03:39:00PM +0100, Christian König wrote:
>> Sorry for being late, but I only now realized what you are doing here.
>>
>> On 1/15/26 12:02, Thomas Zimmermann wrote:
>>> Hi,
>>>
>>> apologies for the delay. I wanted to reply and then forgot about it.
>>>
>>> Am 10.01.26 um 05:52 schrieb Zack Rusin:
>>>> On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>> Hi
>>>>>
>>>>> Am 29.12.25 um 22:58 schrieb Zack Rusin:
>>>>>> Almost a rite of passage for every DRM developer and most Linux users
>>>>>> is upgrading your DRM driver/updating boot flags/changing some config
>>>>>> and having DRM driver fail at probe resulting in a blank screen.
>>>>>>
>>>>>> Currently there's no way to recover from DRM driver probe failure. PCI
>>>>>> DRM drivers explicitly throw out the existing sysfb to get exclusive
>>>>>> access to PCI resources so if the probe fails the system is left without
>>>>>> a functioning display driver.
>>>>>>
>>>>>> Add code to sysfb to recover system framebuffer when DRM driver's probe
>>>>>> fails. This means that a DRM driver that fails to load reloads the system
>>>>>> framebuffer driver.
>>>>>>
>>>>>> This works best with simpledrm. Without it Xorg won't recover because
>>>>>> it still tries to load the vendor specific driver which ends up usually
>>>>>> not working at all. With simpledrm the system recovers really nicely
>>>>>> ending up with a working console and not a blank screen.
>>>>>>
>>>>>> There's a caveat in that some hardware might require some special magic
>>>>>> register write to recover EFI display. I'd appreciate it a lot if
>>>>>> maintainers could introduce a temporary failure in their drivers
>>>>>> probe to validate that the sysfb recovers and they get a working console.
>>>>>> The easiest way to double check it is by adding:
>>>>>> /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>>>>>> dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>>>>>> ret = -EINVAL;
>>>>>> goto out_error;
>>>>>> or such right after the devm_aperture_remove_conflicting_pci_devices .
>>>>> Recovering the display like that is guesswork and will at best work
>>>>> with simple discrete devices where the framebuffer is always located in
>>>>> a confined graphics aperture.
>>>>>
>>>>> But the problem you're trying to solve is a real one.
>>>>>
>>>>> What we'd want to do instead is to take the initial hardware state into
>>>>> account when we do the initial mode-setting operation.
>>>>>
>>>>> The first step is to move each driver's remove_conflicting_devices call
>>>>> to the latest possible location in the probe function. We usually do it
>>>>> first, because that's easy. But on most hardware, it could happen much
>>>>> later.
>>>> Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
>>>> they request pci regions which is going to fail otherwise. Because
>>>> grabbing the pci resources is in general the very first thing that
>>>> those drivers need to do to set up anything, we call
>>>> remove_conflicting_devices first or at least very early.
>>> To my knowledge, requesting resources is more about correctness than a hard requirement to use an I/O or memory range. Has this changed?
>> Nope that is not correct.
>>
>> At least for AMD GPUs remove_conflicting_devices() really early is necessary because otherwise some operations just result in a spontaneous system reboot.
>>
>> For example resizing the PCIe BAR giving access to VRAM or disabling VGA emulation (which AFAIK is used for EFI as well) is only possible when the VGA or EFI framebuffer driver is kicked out first.
>>
>> And disabling VGA emulation is among the absolutely first steps you do to take over the scanout config.
> It's similar for Intel. For us VGA emulation won't be used for
> EFI boot, but we still can't have the previous driver poking
> around in memory while the real driver is initializing. The
> entire memory layout may get completely shuffled so there's
> no telling where such memory accesses would land.
Isn't there code in display/intel_fbdev.c that reads back the old state
from hardware before initializing fbdev? [1] How does that work then?
Wouldn't the HW state be invalid already?
Best regards
Thomas
[1]
https://elixir.bootlin.com/linux/v6.18.5/source/drivers/gpu/drm/i915/display/intel_fbdev.c#L356
>
> And I suppose reBAR is a concern for us as well.
>
--
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
^ permalink raw reply
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Thomas Zimmermann @ 2026-01-16 7:58 UTC (permalink / raw)
To: Zack Rusin
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Christian König, Danilo Krummrich, Dave Airlie,
Deepak Rawat, Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh,
Hans de Goede, Hawking Zhang, Helge Deller, intel-gfx, intel-xe,
Jani Nikula, Javier Martinez Canillas, Jocelyn Falempe,
Joonas Lahtinen, Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv,
linux-kernel, Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <CABQX2QMn_dTh2h44LRwB7+RxGqK3Jn+QCx38xWrzpNJG5SZ9-Q@mail.gmail.com>
Hi
Am 16.01.26 um 04:59 schrieb Zack Rusin:
> On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> That's really not going to work. For example, in the current series, you
>> invoke devm_aperture_remove_conflicting_pci_devices_done() after
>> drm_mode_reset(), drm_dev_register() and drm_client_setup().
> That's perfectly fine,
> devm_aperture_remove_conflicting_pci_devices_done is removing the
> reload behavior, not doing anything.
>
> This series, essentially, just adds a "defer" statement to
> aperture_remove_conflicting_pci_devices that says
>
> "reload sysfb if this driver unloads".
>
> devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.
Exactly. And if that reload happens after the hardware state has been
changed, the result is undefined.
>
> You could ask why have
> devm_aperture_remove_conflicting_pci_devices_done at all then and it's
> because I didn't want to change the default behavior of anything.
>
> There are three cases:
> 1) Driver fails to load before
> aperture_remove_conflicting_pci_devices, in which case sysfb is still
> active and there's no problem,
> 2) Driver fails to load after aperture_remove_conflicting_pci_devices,
> in which case sysfb is gone and the screen is blank
> 3) Driver is unloaded after the probe succeeded. igt tests this too.
>
> Without devm_aperture_remove_conflicting_pci_devices_done we'd try to
> reload sysfb in #3, which, in general makes sense to me and I'd
> probably remove it in my drivers, but there might be people or tests
> (again, igt does it and we don't need to flip-flop between sysfb and
> the driver there) that depend on specifically that behavior of not
> having anything driving fb so I didn't want to change it.
>
> So with this series the worst case scenario is that the driver that
> failed after aperture_remove_conflicting_pci_devices changed the
> hardware state so much that sysfb can't recover and the fb is blank.
> So it was blank before and this series can't fix it because the driver
> in its cleanup routine will need to do more unwinding for sysfb to
> reload (i.e. we'd need an extra patch to unwind the driver state).
The current recovery/reload is not reliable in any case. A number of
high-profile devs have also said that it doesn't work with their driver.
The same is true for ast. So the current approach is not going to happen.
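
The "defer" semantics described in the quoted text can be modeled with a small sketch. Everything here is a hypothetical stand-in (the struct and the model_* functions are not kernel APIs), and the model deliberately ignores hardware state, which is exactly the part the objection above is about: a reload that fires after the hardware was reprogrammed has no defined result.

```c
#include <stdbool.h>

/* Hypothetical model of the series' behavior; none of these are kernel APIs. */
struct display_model {
	bool sysfb_active;	/* firmware framebuffer driver is bound */
	bool reload_armed;	/* "reload sysfb if this driver unloads" */
};

/* Models aperture_remove_conflicting_pci_devices as changed by the series:
 * sysfb is kicked out and a deferred reload is armed. */
void model_remove_conflicting(struct display_model *d)
{
	d->sysfb_active = false;
	d->reload_armed = true;
}

/* Models devm_aperture_remove_conflicting_pci_devices_done:
 * it only cancels the deferred reload. */
void model_remove_conflicting_done(struct display_model *d)
{
	d->reload_armed = false;
}

/* Models the driver going away, either through a failed probe or a
 * later unload. The reload fires only if it is still armed. */
void model_driver_unbind(struct display_model *d)
{
	if (d->reload_armed) {
		d->sysfb_active = true;
		d->reload_armed = false;
	}
}
```

In this model, case 2 (failure after the removal call) brings sysfb back, while case 3 (unload after a successful probe that called the _done function) leaves the display without a driver, matching the current default.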
> There also might be the case of some crazy behavior, e.g. pci bar
> resize in the driver makes the vga hardware crash or something, in
> which case, yea, we should definitely skip this patch, at least until
> those drivers properly cleanup on exit.
There's nothing crazy here. It's standard probing code.
If you want to move forward, my suggestion is to look at the proposal
with the aperture_funcs callbacks that control sysfb device access. And
from there, build a full prototype with one or two drivers.
Best regards
Thomas
>
> z
--
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
^ permalink raw reply
* Re: [PATCH 0/2] kbuild, uapi: Mark inner unions in packed structs as packed
From: Greg Kroah-Hartman @ 2026-01-16 15:52 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, linux-hyperv, linux-kernel, llvm,
kernel test robot
In-Reply-To: <20260115-kbuild-alignment-vbox-v1-0-076aed1623ff@linutronix.de>
On Thu, Jan 15, 2026 at 08:35:43AM +0100, Thomas Weißschuh wrote:
> The unpacked unions within a packed struct generate alignment warnings
> on clang for 32-bit ARM.
>
> With the recent changes to compile-test the UAPI headers in more cases,
> these warnings in combination with CONFIG_WERROR break the build.
>
> Fix the warnings.
>
> Intended for the kbuild tree.
>
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> ---
> Thomas Weißschuh (2):
> hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
> virt: vbox: uapi: Mark inner unions in packed structs as packed
>
> include/uapi/linux/hyperv.h | 2 +-
> include/uapi/linux/vbox_vmmdev_types.h | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
> ---
> base-commit: e3970d77ec504e54c3f91a48b2125775c16ba4c0
> change-id: 20260115-kbuild-alignment-vbox-d0409134d335
>
> Best regards,
> --
> Thomas Weißschuh <thomas.weissschuh@linutronix.de>
>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
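
For reference, the pattern named in the patch subjects can be sketched as follows. The struct and member names are hypothetical and are not the definitions from include/uapi/linux/hyperv.h or vbox_vmmdev_types.h; the sketch assumes the GCC/clang `__attribute__((packed))` syntax. Marking the inner union packed lowers its alignment requirement to 1, so it no longer demands more alignment than the enclosing packed struct provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not the actual UAPI definition. */
struct demo_msg_value {
	uint8_t value_type;
	union demo_value {
		uint32_t value_u32;
		uint64_t value_u64;
	} __attribute__((packed)) v;	/* inner union marked packed as well */
} __attribute__((packed));
```

The union still starts at offset 1 and the struct is still 9 bytes, so the sketch does not change the layout; only the declared alignment of the union type changes.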
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-16 16:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
Shiraz Saleem, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260115181434.4494fe9f@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, January 15, 2026 9:15 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Konstantin
> Taranov <kotaranov@microsoft.com>; Simon Horman <horms@kernel.org>; Erni
> Sri Satya Vennela <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add
> support for coalesced RX packets on CQE
>
> On Thu, 15 Jan 2026 19:57:44 +0000 Haiyang Zhang wrote:
> > > > When coalescing is enabled, the device waits for packets which can
> > > > have the CQE coalesced with previous packet(s). That coalescing
> > > > process is finished (and a CQE written to the appropriate CQ) when
> > > > the CQE is filled with 4 pkts, or time expired, or other device
> > > > specific logic is satisfied.
> > >
> > > See, what I'm afraid is happening here is that you are enabling
> > > completion coalescing (how long the device keeps the CQE pending).
> > > Which is _not_ what rx_max_coalesced_frames controls for most NICs.
> > > For most NICs rx_max_coalesced_frames controls IRQ generation logic.
> > >
> > > The NIC first buffers up CQEs for typically single digit usecs, and
> > > then once CQE timer expired and writeback happened it starts an IRQ
> > > coalescing timer. Once the IRQ coalescing timer expires IRQ is
> > > triggered, which schedules NAPI. (broad strokes, obviously many
> > > differences and optimizations exist)
> > >
> > > Is my guess correct? Are you controlling CQE coalescing?
> > >
> > > Can you control the timeout instead of the frame count?
> >
> > Our NIC's timeout value cannot be controlled by driver. Also, the
> > timeout may be changed in future NIC HW.
> >
> > So, I use the ethtool/rx-frames, which is either 1 or 4 on our
> > NIC, to switch the CQE coalescing feature on/off.
>
> I feel like this is not the first time I'm having a conversation with
> you where you are not answering my direct questions, not just one
> sliver. IDK why you're doing this, but being able to participate
> in an email exchange is a bare minimum for participating upstream.
> Please consider this a warning.
Sure, let me try to reply again -- does this (see below) answer all
your questions? And, feel free to ask any further questions, we are
willing to collaborate with you and other upstream people at any time :)
> The NIC first buffers up CQEs for typically single digit usecs, and
> then once CQE timer expired and writeback happened it starts an IRQ
> coalescing timer. Once the IRQ coalescing timer expires IRQ is
> triggered, which schedules NAPI. (broad strokes, obviously many
> differences and optimizations exist)
> Is my guess correct? Are you controlling CQE coalescing?
Yes, it's correct. And we are controlling "CQE coalescing".
>
> If I interpret your reply correctly you are indeed coalescing writeback.
Yes, we are coalescing CQE writeback.
> You need to add a new param to the uAPI.
Since this feature is not common to other NICs, can we use an
ethtool private flag instead?
When the flag is set, the CQE coalescing will be enabled and put
up to 4 pkts in a CQE.
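
A rough model of the behavior described above, for discussion only: a CQE is written back once it holds the maximum number of packets (1 with coalescing off, 4 with it on) or once a timer expires. All names are hypothetical, the timeout is an assumed parameter (the real value is not visible to the driver), and the "other device specific logic" is not modeled.

```c
#include <stdint.h>

/* Hypothetical model of CQE coalescing; not the MANA driver or HW logic. */
struct cqe_coalescer {
	unsigned int max_pkts;		/* 1 = coalescing off, 4 = on */
	unsigned int pending;		/* packets held in the open CQE */
	uint64_t first_pkt_time_us;	/* arrival time of the first held packet */
	uint64_t timeout_us;		/* assumed value, for illustration only */
};

/* Close the open CQE and return how many packets it carried. */
static unsigned int model_write_cqe(struct cqe_coalescer *c)
{
	unsigned int n = c->pending;

	c->pending = 0;
	return n;
}

/* A packet arrives. Returns the packet count of the CQE written back,
 * or 0 if the CQE stays open waiting for more packets. */
unsigned int model_rx_packet(struct cqe_coalescer *c, uint64_t now_us)
{
	if (c->pending == 0)
		c->first_pkt_time_us = now_us;
	c->pending++;
	if (c->pending >= c->max_pkts)
		return model_write_cqe(c);
	return 0;
}

/* Timer check. Writes back a partially filled CQE once the timeout
 * has elapsed since its first packet arrived. */
unsigned int model_timer_tick(struct cqe_coalescer *c, uint64_t now_us)
{
	if (c->pending > 0 && now_us - c->first_pkt_time_us >= c->timeout_us)
		return model_write_cqe(c);
	return 0;
}
```

With max_pkts set to 1 every packet produces its own CQE immediately, which corresponds to the feature being switched off.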
> Please add both size and
> timeout. Expose the timeout as read only if your device doesn't support
> controlling it per queue.
Does the "size" mean the max pkts per CQE (1 or 4)?
The timeout value is not even exposed to the driver, and is subject to
change in the future. Also the HW mechanism is proprietary... So, can we
not "expose" the timeout value in "ethtool -c" outputs, because it's not
available at the driver level?
Thanks,
- Haiyang
^ permalink raw reply